== libutf8proc ==

The [libutf8proc package](https://github.com/JuliaLang/libutf8proc) is
a lightly updated fork of the [utf8proc
library](http://www.public-software-group.org/utf8proc) from Jan
Behrens and the rest of the [Public Software
Group](http://www.public-software-group.org/), who deserve *nearly all
of the credit* for this package: a small, clean C library that
provides Unicode normalization, case-folding, and other operations for
data in the [UTF-8 encoding](http://en.wikipedia.org/wiki/UTF-8).

The reason for this fork is that utf8proc is used for basic Unicode
support in the [Julia language](http://julialang.org/) and the Julia
developers wanted Unicode 7 support and other features, but the
Public Software Group currently does not seem to have the resources
necessary to update utf8proc.  We hope that the fork can be merged
back into the mainline utf8proc package before too long.

(The original utf8proc package also includes Ruby and PostgreSQL plug-ins.
We removed those from libutf8proc in order to focus exclusively on the C
library for the time being.  We will strive to keep API changes to a minimum,
so libutf8proc should still be usable with the old plug-in code.)

Like utf8proc, the libutf8proc package is licensed under the
free/open-source [MIT "expat"
license](http://opensource.org/licenses/MIT) (plus certain Unicode
data governed by the similarly permissive [Unicode data
license](http://www.unicode.org/copyright.html#Exhibit1)); please see
the included `LICENSE.md` file for more detailed information.

=== Quick Start ===

To compile the C library, run `make`.

=== General Information ===

The C library is found in this directory after successful compilation
and is named `libutf8proc.a` (for the static library) and
`libutf8proc.so` (for the dynamic library).

The supported Unicode version is 5.0.0.
*Note:* Version 4.1.0 of Unicode Standard Annex #29 was used, because
version 5.0.0 was not yet available at the time of implementation.

For Unicode normalizations, the following options are used:

* Normalization Form C:  `STABLE`, `COMPOSE`
* Normalization Form D:  `STABLE`, `DECOMPOSE`
* Normalization Form KC: `STABLE`, `COMPOSE`, `COMPAT`
* Normalization Form KD: `STABLE`, `DECOMPOSE`, `COMPAT`
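
The option combinations listed above can be sketched as bitmasks. This is a minimal illustration, not the library's own code: the `OPT_*` names are stand-ins whose bit values are assumed to mirror the `UTF8PROC_*` flags in `utf8proc.h`, and `options_for_form` is a hypothetical helper. Real code should include `utf8proc.h` and use its definitions.

```c
#include <string.h>

/* Stand-in flags. Names and bit values are assumptions that mirror the
   UTF8PROC_* options in utf8proc.h; check the header before relying on them. */
enum {
    OPT_STABLE    = 1 << 1,
    OPT_COMPAT    = 1 << 2,
    OPT_COMPOSE   = 1 << 3,
    OPT_DECOMPOSE = 1 << 4
};

/* Hypothetical helper: return the option bitmask for a normalization
   form name ("NFC", "NFD", "NFKC", "NFKD"), or -1 for an unknown name. */
int options_for_form(const char *form)
{
    if (strcmp(form, "NFC") == 0)
        return OPT_STABLE | OPT_COMPOSE;
    if (strcmp(form, "NFD") == 0)
        return OPT_STABLE | OPT_DECOMPOSE;
    if (strcmp(form, "NFKC") == 0)
        return OPT_STABLE | OPT_COMPOSE | OPT_COMPAT;
    if (strcmp(form, "NFKD") == 0)
        return OPT_STABLE | OPT_DECOMPOSE | OPT_COMPAT;
    return -1;
}
```

The compatibility forms (KC and KD) differ from C and D only by the extra `COMPAT` bit, so a caller can derive one from the other by OR-ing that flag in.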

=== C Library ===

The documentation for the C library is found in the `utf8proc.h` header file.
`utf8proc_map` is the function you will most likely use to map UTF-8
strings, unless you want to allocate memory yourself.

=== To Do ===

* detect stable code points and process segments independently in order to save memory
* do a quick check before normalizing strings to optimize speed
* support stream processing

=== Contact ===

Bug reports, feature requests, and other queries can be filed at
the [libutf8proc page on GitHub](https://github.com/JuliaLang/libutf8proc).