Discussion:
Unicode-handling library
Randall Randall
2004-08-25 01:06:23 UTC
I've started on a small library that simplifies unicode handling.
It's currently intended to be fully portable Common Lisp, and
the functions it defines should conform to the CLHS's definitions.

You can find it at
http://www.randallsquared.com/download/unicode-0.99rc1.lisp .

In order to try it out, you'll need to get
http://www.randallsquared.com/download/tables.lisp
and data from the Unicode consortium, at
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt .

This basically has all the things I had in mind for 1.0, to wit:
  import and export of UTF-8, UTF-16*, UTF-32*, us-ascii, ISO 8859-[1-16];
  most string and character functions implemented;
  The most basic 15100 characters (if UnicodeData.txt supplied).

Things it doesn't have yet, but planned for after 1.0, are:
  Conversions: SCSU, &-escaped ASCII, CMUCL characters,
    OpenMCL characters, etc
  Include other unicode characters
  rework at least some errors to be cerrors
  convenience helpers for reading and writing files and other streams
  handle one-to-many mappings of case for *ansi-compliant* ==> NIL
  maybe more printer methods, though I'm not very familiar with those.

Trivial (because this file has no extended UTF-8 sequence) example:
* (with-open-file (f "/Users/randall/unicode-test.txt"
                     :element-type '(unsigned-byte 8))
    (let ((utf8 (make-array 13)))
      (read-sequence utf8 f)
      (utf-8->internal utf8)))
; =>
#(#\U+0054 #\U+0068 #\U+0069 #\U+0073 #\U+0020 #\U+0069 #\U+0073
  #\U+0020 #\U+0061 #\U+0020 #\U+0075 #\U+006E #\U+0069)
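
For readers wondering what an "extended" (multi-byte) UTF-8 sequence involves, here is a minimal sketch in portable Common Lisp. It is not taken from the library: the name decode-utf-8-octets is hypothetical, it returns plain integer code points rather than the library's character objects, and it makes no attempt to reject overlong forms or surrogate code points.

```lisp
;; Hypothetical helper, not part of the library discussed in this thread.
(defun decode-utf-8-octets (octets)
  "Return a vector of code points (integers) decoded from the vector OCTETS.
A sketch only: overlong forms and surrogates are not rejected."
  (let ((result (make-array 0 :adjustable t :fill-pointer 0))
        (i 0)
        (n (length octets)))
    (flet ((continuation (index)
             ;; A continuation byte looks like 10xxxxxx; return its low six bits.
             (let ((octet (aref octets index)))
               (unless (= (ldb (byte 2 6) octet) #b10)
                 (error "Bad continuation byte at index ~D" index))
               (ldb (byte 6 0) octet))))
      (loop while (< i n)
            do (let* ((lead (aref octets i))
                      ;; Number of continuation bytes implied by the lead byte.
                      (extra (cond ((< lead #x80) 0)
                                   ((= (ldb (byte 3 5) lead) #b110) 1)
                                   ((= (ldb (byte 4 4) lead) #b1110) 2)
                                   ((= (ldb (byte 5 3) lead) #b11110) 3)
                                   (t (error "Bad lead byte at index ~D" i))))
                      ;; Payload of the lead byte: 7, 5, 4, or 3 bits.
                      (code (if (zerop extra)
                                lead
                                (ldb (byte (- 6 extra) 0) lead))))
                 (when (>= (+ i extra) n)
                   (error "Truncated sequence at index ~D" i))
                 ;; Each continuation byte contributes six more bits.
                 (loop for k from 1 to extra
                       do (setf code (logior (ash code 6)
                                             (continuation (+ i k)))))
                 (vector-push-extend code result)
                 (incf i (1+ extra)))))
    result))
```

For example, (decode-utf-8-octets #(#x54 #xC3 #xA9 #xE2 #x82 #xAC)) yields the code points 84, 233 and 8364, that is U+0054, U+00E9 and U+20AC.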

If this is useful for anyone, I'd appreciate bug reports and feature
requests!

--
Randall Randall <***@randallsquared.com>
Property law should use #'EQ , not #'EQUAL .
Arthur Lemmens
2004-08-25 07:52:10 UTC
Hi Randall,
Post by Randall Randall
I've started on a small library that simplifies unicode handling.
Looks nice.
Post by Randall Randall
Takes a long time to start up, since it needs to read in all the unicode
data. This seems unlikely to improve with further code point databases,
I think you could solve that by reading in the unicode data at compile
time. To give you an idea of how this could be done, here are a few
snippets of my own library for dealing with character encodings:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun load-unicode-table (filename)
    ;; Returns a mapping vector corresponding to the information in the
    ;; given mapping file from the Unicode consortium.
    ...))

(define-simple-character-encoding
  :iso-8859-2
  :vector #.(load-unicode-table "8859-2"))
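
A self-contained sketch of the same pattern, under stated assumptions: the names parse-mapping-lines and *sample-8859-2-table* are hypothetical and come from neither library, and the table entries are given inline as strings instead of being read from a mapping file. The point is only that the function is made available at compile time by eval-when, so that #. can call it at read time and the finished vector ends up as a literal in the compiled file.

```lisp
;; Hypothetical names throughout; a sketch of the eval-when + #. pattern.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun parse-mapping-lines (lines)
    "Build a 256-element vector mapping byte values to code points.
Each line is assumed to look like \"0xA1 0x0104\"; unlisted bytes map to NIL."
    (let ((table (make-array 256 :initial-element nil)))
      (dolist (line lines table)
        (let* ((space (position #\Space line))
               ;; Skip the leading \"0x\" of each field before parsing as hex.
               (octet (parse-integer line :start 2 :end space :radix 16))
               (code (parse-integer line :start (+ space 3) :radix 16)))
          (setf (aref table octet) code))))))

;; The #. reader macro runs the parser at read time, so the compiled
;; file contains the resulting vector and nothing is parsed at load time.
;; The two entries are the ISO 8859-2 mappings for bytes A1 and A3.
(defparameter *sample-8859-2-table*
  #.(parse-mapping-lines '("0xA1 0x0104"
                           "0xA3 0x0141")))
```

With this in place, (aref *sample-8859-2-table* #xA1) returns 260, i.e. U+0104. Note that #. relies on *read-eval* being true, which is the default.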


Arthur
Randall Randall
2004-08-25 22:15:44 UTC
Post by Arthur Lemmens
Hi Randall,
Looks nice.
Thanks!
Post by Arthur Lemmens
Post by Randall Randall
Takes a long time to start up, since it needs to read in all the unicode
data. This seems unlikely to improve with further code point databases,
I think you could solve that by reading in the unicode data at compile
time. To give you an idea of how this could be done, here are a few
(eval-when (:compile-toplevel :load-toplevel :execute)
[snip]

Adding (eval-when ...) does indeed mostly solve that problem,
reducing the load time on my 1 GHz G4 from ~50 seconds to ~2.

After separating the load, compilation, and package stuff into
package.lisp, there's a new version (also with some minor bugfixes):
http://www.randallsquared.com/download/unicode-0.99rc2.tar.gz

Thanks for the advice!

--
Randall Randall <***@randallsquared.com>
Property law should use #'EQ , not #'EQUAL .
Klaus Harbo
2004-09-02 12:38:31 UTC
I'm wondering if anyone could recommend a good, comprehensive book about Unicode?

-K.
Kalle Olavi Niemitalo
2004-09-02 18:40:49 UTC
Post by Klaus Harbo
I'm wondering if anyone could recommend a good, comprehensive
book about Unicode?
I bought The Unicode Standard Version 3.0 on paper, and it was a
mistake. Most of the book is filled with code charts, which make
it unwieldy and are easier to use electronically. The obscure
symbols have provided some laughs though.
Adam Warner
2004-09-03 10:45:06 UTC
Hi Klaus Harbo,
Post by Klaus Harbo
I'm wondering if anyone could recommend a good, comprehensive book about Unicode?
The whole book is now available online:
<http://www.unicode.org/versions/Unicode4.0.1/>

Regards,
Adam
