From ab9520d18845248ef79ee98e8d671f8eecfec288 Mon Sep 17 00:00:00 2001
From: "Steven G. Johnson"
Date: Tue, 15 Jul 2014 15:29:52 -0400
Subject: import of utf8proc-v1.1.6

---
 README | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)
 create mode 100644 README

diff --git a/README b/README
new file mode 100644
index 0000000..e72ffff
--- /dev/null
+++ b/README
@@ -0,0 +1,116 @@
+
+Please read the LICENSE file, which ships with this software.
+
+
+*** QUICK START ***
+
+For compilation of the C library call "make c-library", for compilation of
+the ruby library call "make ruby-library" and for compilation of the
+PostgreSQL extension call "make pgsql-library".
+
+For ruby you can also create a gem-file by calling "make ruby-gem".
+
+"make all" can be used to build everything, but both ruby and PostgreSQL
+installations are required in this case.
+
+
+*** GENERAL INFORMATION ***
+
+The C library is found in this directory after successful compilation and
+is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
+the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
+subdirectory "ruby/". If you choose to create a gem-file it is placed in the
+"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
+and resides in the "pgsql/" directory.
+
+Both the ruby library and the PostgreSQL extension are built as stand-alone
+libraries and are therefore not dependent on the dynamic version of the
+C library files, but this behaviour might change in future releases.
+
+The Unicode version being supported is 5.0.0.
+Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
+      version 5.0.0 had not been available at the time of implementation.
+
+For Unicode normalizations, the following options have to be used:
+Normalization Form C: STABLE, COMPOSE
+Normalization Form D: STABLE, DECOMPOSE
+Normalization Form KC: STABLE, COMPOSE, COMPAT
+Normalization Form KD: STABLE, DECOMPOSE, COMPAT
+
+
+*** C LIBRARY ***
+
+The documentation for the C library is found in the utf8proc.h header file.
+"utf8proc_map" is most likely the function you will be using for mapping
+UTF-8 strings, unless you want to allocate memory yourself.
+
+
+*** RUBY API ***
+
+The ruby library adds the methods "utf8map" and "utf8map!" to the String
+class, and the method "utf8" to the Integer class.
+
+The String#utf8map method does the same as the "utf8proc_map" C function.
+Options for the mapping procedure are passed as symbols, e.g.:
+"Hello".utf8map(:casefold) => "hello"
+
+The descriptions of all options are found in the C header file
+"utf8proc.h". Please note that the corresponding symbols in ruby are all
+lowercase.
+
+String#utf8map! is the destructive variant, in the sense that the string
+is replaced by the result.
+
+There are shortcuts for the 4 normalization forms specified by Unicode:
+String#utf8nfd, String#utf8nfd!,
+String#utf8nfc, String#utf8nfc!,
+String#utf8nfkd, String#utf8nfkd!,
+String#utf8nfkc, String#utf8nfkc!
+
+The method Integer#utf8 returns a UTF-8 string containing the Unicode
+character given by the code point.
+0x000A.utf8 => "\n"
+0x2028.utf8 => "\342\200\250"
+
+
+*** POSTGRESQL API ***
+
+For PostgreSQL there are two SQL functions supplied named "unifold" and
+"unistrip". These functions can be used to prepare index fields in
+order to be folded in a way where string-comparisons make more sense, e.g.
+where "bathtub" == "bathtub"
+or "Hello World" == "hello world".
+
+CREATE TABLE people (
+  id serial8 primary key,
+  name text,
+  CHECK (unifold(name) NOTNULL)
+);
+CREATE INDEX name_idx ON people (unifold(name));
+SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
+
+The function "unistrip" removes character marks like accents or diaeresis,
+while "unifold" keeps them.
+
+NOTICE: The outputs of the function can change between releases, as
+        utf8proc does not follow a versioning stability policy. You have to
+        rebuild your database indices if you upgrade to a newer version
+        of utf8proc.
+
+
+*** TODO ***
+
+- detect stable code points and process segments independently in order to
+  save memory
+- do a quick check before normalizing strings to optimize speed
+- support stream processing
+
+
+*** CONTACT ***
+
+If you find any bugs or experience difficulties in compiling this software,
+please contact us:
+
+Project page: http://www.public-software-group.org/utf8proc
+
+
-- 
cgit v1.2.3
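
The README in this patch gives two Integer#utf8 examples (0x000A and 0x2028) without showing how a code point becomes UTF-8 bytes. The following is a minimal sketch of that encoding in C. It is not taken from utf8proc: the function name encode_utf8 and its signature are hypothetical and chosen only for illustration, and the library's own routine may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of utf8proc: encode one Unicode code point
 * as UTF-8 into dst, which must have room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a UTF-16 surrogate, or a value above U+10FFFF). */
size_t encode_utf8(uint32_t cp, uint8_t *dst)
{
    if (cp < 0x80) {
        /* 1 byte: 0xxxxxxx */
        dst[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xC0 | (cp >> 6));
        dst[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* Surrogates are not valid Unicode scalar values. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xE0 | (cp >> 12));
        dst[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xF0 | (cp >> 18));
        dst[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, encode_utf8(0x2028, buf) writes the three bytes 0xE2 0x80 0xA8, which are the octal escapes \342 \200 \250 shown in the README, and encode_utf8(0x000A, buf) writes the single byte '\n'.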