Unicode-handling library

Discussion:

Randall Randall

2004-08-25 01:06:23 UTC

I've started on a small library that simplifies unicode handling.
It's currently intended to be fully portable Common Lisp, and
the functions it defines should conform to the CLHS's definitions.

You can find it at
http://www.randallsquared.com/download/unicode-0.99rc1.lisp .

In order to try it out, you'll need to get
http://www.randallsquared.com/download/tables.lisp
and data from the Unicode consortium, at
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt .

This basically has all the things I had in mind for 1.0, to wit:
  import and export of UTF-8, UTF-16*, UTF-32*, us-ascii, ISO 8859-[1-16];
  most string and character functions implemented;
  The most basic 15100 characters (if UnicodeData.txt supplied).

Things it doesn't have yet, but planned for after 1.0, are:
  Conversions: SCSU, &-escaped ASCII, CMUCL characters,
    OpenMCL characters, etc
  Include other unicode characters
  rework at least some errors to be cerrors
  convenience helpers for reading and writing files and other streams
  handle one-to-many mappings of case for *ansi-compliant* ==> NIL
  maybe more printer methods, though I'm not very familiar with those.
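
[Editor's sketch] As an illustration of the "errors to be cerrors" item above, here is roughly what a continuable error could look like in portable Common Lisp. The function name and the choice of U+FFFD as the substitute are assumptions made for this example; nothing here is taken from the library itself.

```lisp
;; Hypothetical helper, not part of the library discussed in this thread.
;; Signals a continuable error for integers outside the Unicode code space;
;; if the CONTINUE restart is invoked, U+FFFD is returned instead.
(defun checked-code-point-sketch (code)
  (cond ((<= 0 code #x10FFFF) code)
        (t (cerror "Use U+FFFD (REPLACEMENT CHARACTER) instead."
                   "~X is outside the Unicode code space." code)
           #xFFFD)))

;; A caller can choose to continue automatically:
;; (handler-bind ((error #'continue))
;;   (checked-code-point-sketch #x110000))
```

With plain ERROR the caller's only options are to abort or handle the condition non-locally; CERROR additionally offers the CONTINUE restart, interactively in the debugger or programmatically as in the comment.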

Trivial (because this file has no extended UTF-8 sequence) example:
* (with-open-file (f "/Users/randall/unicode-test.txt"
                     :element-type '(unsigned-byte 8))
    (let ((utf8 (make-array 13)))
      (read-sequence utf8 f)
      (utf-8->internal utf8)))
; =>
#(#\U+0054 #\U+0068 #\U+0069 #\U+0073 #\U+0020 #\U+0069 #\U+0073
  #\U+0020 #\U+0061 #\U+0020 #\U+0075 #\U+006E #\U+0069)
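
[Editor's sketch] For readers wondering what a UTF-8 import step involves, the following is a minimal decoder in portable Common Lisp. It is not the library's utf-8->internal: the name decode-utf-8-sketch is hypothetical, it returns plain integer code points rather than the library's character objects, and it does not reject overlong encodings or surrogate code points.

```lisp
;; Hypothetical helper, not part of the library discussed in this thread.
;; Decodes a vector of octets into a vector of integer code points.
;; Sketch only: overlong forms and surrogates are not rejected.
(defun decode-utf-8-sketch (octets)
  (let ((result (make-array (length octets) :fill-pointer 0))
        (n (length octets))
        (i 0))
    (loop while (< i n)
          do (let* ((lead (aref octets i))
                    ;; Number of continuation octets implied by the lead octet.
                    (extra (cond ((< lead #x80) 0)
                                 ((= (logand lead #xE0) #xC0) 1)
                                 ((= (logand lead #xF0) #xE0) 2)
                                 ((= (logand lead #xF8) #xF0) 3)
                                 (t (error "Invalid lead octet ~X at index ~D"
                                           lead i))))
                    ;; Payload bits carried by the lead octet itself.
                    (code (case extra
                            (0 lead)
                            (1 (logand lead #x1F))
                            (2 (logand lead #x0F))
                            (t (logand lead #x07)))))
               (when (>= (+ i extra) n)
                 (error "Truncated sequence starting at index ~D" i))
               ;; Each continuation octet contributes its low six bits.
               (loop for j from 1 to extra
                     do (let ((octet (aref octets (+ i j))))
                          (unless (= (logand octet #xC0) #x80)
                            (error "Invalid continuation octet ~X at index ~D"
                                   octet (+ i j)))
                          (setf code (logior (ash code 6)
                                             (logand octet #x3F)))))
               (vector-push code result)
               (incf i (1+ extra))))
    result))

;; (decode-utf-8-sketch #(#x54 #x68 #x69 #x73))  ; => #(84 104 105 115)
;; (decode-utf-8-sketch #(#xC3 #xA9))            ; => #(233), i.e. U+00E9
;; (decode-utf-8-sketch #(#xE2 #x82 #xAC))       ; => #(8364), i.e. U+20AC
```

For pure ASCII input such as the test file above, every octet is its own code point, which is why that example is called trivial.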

If this is useful for anyone, I'd appreciate bug reports and feature
requests!

--
Randall Randall <***@randallsquared.com>
Property law should use #'EQ , not #'EQUAL .

Arthur Lemmens

2004-08-25 07:52:10 UTC

Hi Randall,

Post by Randall Randall
I've started on a small library that simplifies unicode handling.

Looks nice.

Post by Randall Randall
Takes a long time to start up, since it needs to read in all the unicode
data. This seems unlikely to improve with further code point databases,

I think you could solve that by reading in the unicode data at compile
time. To give you an idea of how this could be done, here are a few
snippets of my own library for dealing with character encodings:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun load-unicode-table (filename)
    ;; Returns a mapping vector corresponding to the information in the
    ;; given mapping file from the Unicode consortium.
    ...))

(define-simple-character-encoding
  :iso-8859-2
  :vector #.(load-unicode-table "8859-2"))
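
[Editor's sketch] A self-contained version of the same idea, for readers who want to try it without Arthur's library. The names parse-mapping-sketch and *sketch-table* are hypothetical, and the parser reads a few hard-coded lines (the first three non-ASCII entries of ISO 8859-2) instead of opening a mapping file. The point is only the mechanism: EVAL-WHEN makes the function available while the file is being compiled, and #. calls it at read time so the finished vector is stored in the compiled file as a literal.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  ;; Hypothetical stand-in for a real table loader. Parses lines of the
  ;; form "0xA1 0x0104" (octet, then code point) into a 256-entry vector;
  ;; unmapped octets stay NIL.
  (defun parse-mapping-sketch (lines)
    (let ((table (make-array 256 :initial-element nil)))
      (dolist (line lines table)
        (let* ((space (position #\Space line))
               ;; Skip the "0x" prefix of each field.
               (octet (parse-integer line :start 2 :end space :radix 16))
               (code (parse-integer line :start (+ space 3) :radix 16)))
          (setf (aref table octet) code))))))

;; #. runs the parser when this form is read, so the work happens once,
;; at compile time, rather than on every load of the compiled file.
;; The resulting vector is a literal and should be treated as read-only.
(defparameter *sketch-table*
  #.(parse-mapping-sketch '("0xA1 0x0104" "0xA2 0x02D8" "0xA3 0x0141")))

;; (aref *sketch-table* #xA1)  ; => 260, i.e. U+0104
```

Note that #. requires the function to be defined before the reader reaches the form that uses it, which is why the DEFUN is wrapped in EVAL-WHEN and placed first.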

Arthur

Randall Randall

2004-08-25 22:15:44 UTC

Post by Arthur Lemmens
Hi Randall,
Looks nice.

Thanks!

Post by Arthur Lemmens

Post by Randall Randall
Takes a long time to start up, since it needs to read in all the unicode
data. This seems unlikely to improve with further code point databases,

I think you could solve that by reading in the unicode data at compile
time. To give you an idea of how this could be done, here are a few
(eval-when (:compile-toplevel :load-toplevel :execute)

[snip]

Adding (eval-when ...) does indeed mostly solve that problem,
reducing the load time on my 1 GHz G4 from ~50 seconds to ~2.

After separating the load, compilation, and package stuff into
package.lisp, there's a new version (also with some minor bugfixes):
http://www.randallsquared.com/download/unicode-0.99rc2.tar.gz

Thanks for the advice!

--
Randall Randall <***@randallsquared.com>
Property law should use #'EQ , not #'EQUAL .

Klaus Harbo

2004-09-02 12:38:31 UTC

I'm wondering if anyone could recommend a good, comprehensive book about Unicode?

-K.

Kalle Olavi Niemitalo

2004-09-02 18:40:49 UTC

Post by Klaus Harbo
I'm wondering if anyone could recommend a good, comprehensive
book about Unicode?

I bought The Unicode Standard Version 3.0 on paper, and it was a
mistake. Most of the book is filled with code charts, which make
it unwieldy and are easier to use electronically. The obscure
symbols have provided some laughs though.

Adam Warner

2004-09-03 10:45:06 UTC

Hi Klaus Harbo,

Post by Klaus Harbo
I'm wondering if anyone could recommend a good, comprehensive book about Unicode?

The whole book is now available online:
<http://www.unicode.org/versions/Unicode4.0.1/>

Regards,
Adam