author Steven G. Johnson <stevenj@mit.edu> 2014-07-15 15:29:52 -0400
committer Steven G. Johnson <stevenj@mit.edu> 2014-07-15 15:29:52 -0400
commit ab9520d18845248ef79ee98e8d671f8eecfec288
tree 92c5d4df269de5321b6eeb27206fded2316afb22 /README
import of utf8proc-v1.1.6 v1.1.6
Diffstat (limited to 'README')
-rw-r--r-- README 116
1 files changed, 116 insertions, 0 deletions
diff --git a/README b/README
new file mode 100644
index 0000000..e72ffff
--- /dev/null
+++ b/README
@@ -0,0 +1,116 @@
+
+Please read the LICENSE file, which ships with this software.
+
+
+*** QUICK START ***
+
+For compilation of the C library call "make c-library", for compilation of
+the ruby library call "make ruby-library" and for compilation of the
+PostgreSQL extension call "make pgsql-library".
+
+For ruby you can also create a gem-file by calling "make ruby-gem".
+
+"make all" can be used to build everything, but both ruby and PostgreSQL
+installations are required in this case.
+
+
+*** GENERAL INFORMATION ***
+
+The C library is found in this directory after successful compilation and
+is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
+the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
+subdirectory "ruby/". If you chose to create a gem-file it is placed in the
+"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
+and resides in the "pgsql/" directory.
+
+Both the ruby library and the PostgreSQL extension are built as stand-alone
+libraries and are therefore not dependent on the dynamic version of the
+C library files, but this behaviour might change in future releases.
+
+The supported Unicode version is 5.0.0.
+Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
+ version 5.0.0 was not yet available at the time of implementation.
+
+For Unicode normalizations, the following options have to be used:
+Normalization Form C: STABLE, COMPOSE
+Normalization Form D: STABLE, DECOMPOSE
+Normalization Form KC: STABLE, COMPOSE, COMPAT
+Normalization Form KD: STABLE, DECOMPOSE, COMPAT
+
+
+*** C LIBRARY ***
+
+The documentation for the C library is found in the utf8proc.h header file.
+"utf8proc_map" is most likely the function you will be using for mapping
+UTF-8 strings, unless you want to allocate memory yourself.
+
+
+*** RUBY API ***
+
+The ruby library adds the methods "utf8map" and "utf8map!" to the String
+class, and the method "utf8" to the Integer class.
+
+The String#utf8map method does the same as the "utf8proc_map" C function.
+Options for the mapping procedure are passed as symbols, e.g.:
+"Hello".utf8map(:casefold) => "hello"
+
+The descriptions of all options are found in the C header file
+"utf8proc.h". Note that the corresponding symbols in ruby are all
+lowercase.
+
+String#utf8map! is the destructive variant: the string itself is replaced
+by the result.
+
+There are shortcuts for the 4 normalization forms specified by Unicode:
+String#utf8nfd, String#utf8nfd!,
+String#utf8nfc, String#utf8nfc!,
+String#utf8nfkd, String#utf8nfkd!,
+String#utf8nfkc, String#utf8nfkc!
+
+The method Integer#utf8 returns a UTF-8 string containing the Unicode
+character given by the code point.
+0x000A.utf8 => "\n"
+0x2028.utf8 => "\342\200\250"
+
+
+*** POSTGRESQL API ***
+
+For PostgreSQL, two SQL functions are supplied, named "unifold" and
+"unistrip". These functions can be used to fold index fields so that
+string comparisons make more sense, e.g.
+where "bathtub" == "bath<soft hyphen>tub"
+or "Hello World" == "hello world".
+
+CREATE TABLE people (
+ id serial8 primary key,
+ name text,
+ CHECK (unifold(name) NOTNULL)
+);
+CREATE INDEX name_idx ON people (unifold(name));
+SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
+
+The function "unistrip" removes character marks like accents or diaereses,
+while "unifold" keeps them.
+
+NOTICE: The outputs of these functions can change between releases, as
+ utf8proc does not follow a versioning stability policy. You have to
+ rebuild your database indices if you upgrade to a newer version
+ of utf8proc.
+
+
+*** TODO ***
+
+- detect stable code points and process segments independently in order to
+ save memory
+- do a quick check before normalizing strings to optimize speed
+- support stream processing
+
+
+*** CONTACT ***
+
+If you find any bugs or experience difficulties in compiling this software,
+please contact us:
+
+Project page: http://www.public-software-group.org/utf8proc
+
+
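Note on the normalization options listed in the README above: the four option sets (STABLE with COMPOSE or DECOMPOSE, optionally plus COMPAT) correspond to the standard Unicode normalization forms C, D, KC and KD. The sketch below only illustrates what those forms do to a string. It rests on assumptions: it uses Ruby's built-in String#unicode_normalize (Ruby 2.2 or later) as a stand-in, not the utf8map or utf8nfc methods that this library adds, and the sample strings were chosen here purely for demonstration.

```ruby
# Illustration only: Ruby's built-in normalizer stands in for utf8proc here.
# None of the methods below come from the utf8proc ruby library.

composed   = "\u00E9"   # "e with acute" as one code point (U+00E9)
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT (U+0301)
ligature   = "\uFB01"   # LATIN SMALL LIGATURE FI, a compatibility character

# Form C composes: base letter + combining mark become one code point.
nfc = decomposed.unicode_normalize(:nfc)

# Form D decomposes: the single code point is split into base + mark.
nfd = composed.unicode_normalize(:nfd)

# Forms KC and KD additionally apply compatibility mappings,
# so the ligature is replaced by the two letters "f" and "i".
nfkc = ligature.unicode_normalize(:nfkc)
nfkd = ligature.unicode_normalize(:nfkd)

# The canonical forms (C and D) leave compatibility characters untouched.
nfc_ligature = ligature.unicode_normalize(:nfc)

# Helper (hypothetical name) to print code points in U+XXXX notation.
show = ->(s) { s.codepoints.map { |c| format("U+%04X", c) }.join(" ") }

puts "NFC:  #{show.call(nfc)}"
puts "NFD:  #{show.call(nfd)}"
puts "NFKC: #{show.call(nfkc)}"
puts "NFKD: #{show.call(nfkd)}"
```

Whether utf8proc at the Unicode version stated in the README produces identical results for every input is not confirmed here; newer Unicode data in Ruby may differ for characters added after that version.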