summaryrefslogtreecommitdiff
path: root/README
blob: e72ffff3bdb195e7d6591e302c00ad646411568f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
Please read the LICENSE file, which is shipping with this software.


*** QUICK START ***

For compilation of the C library call "make c-library", for compilation of
the ruby library call "make ruby-library" and for compilation of the
PostgreSQL extension call "make pgsql-library".

For ruby you can also create a gem-file by calling "make ruby-gem".

"make all" can be used to build everything, but both ruby and PostgreSQL
installations are required in this case.


*** GENERAL INFORMATION ***

The C library is found in this directory after successful compilation and
is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you chose to create a gem-file it is placed in the
"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and are therefore not dependent the dynamic version of the
C library files, but this behaviour might change in future releases.

The Unicode version being supported is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 had not been available at the time of implementation.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely function you will be using for mapping UTF-8
strings, unless you want to allocate memory yourself.


*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, i.e:
"Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please notice that the according symbols in ruby are all
lowercase.

String#utf8map! is the destructive function in the meaning that the string
is replaced by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string, which is containing the
unicode char given by the code point.
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"


*** POSTGRESQL API ***

For PostgreSQL there are two SQL functions supplied named "unifold" and
"unistrip". These functions function can be used to prepare index fields in
order to be folded in a way where string-comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks like accents or diaeresis,
while "unifold" keeps then.

NOTICE: The outputs of the function can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indicies, if you upgrade to a newer version
        of utf8proc.


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc