path: root/data/data_generator.rb
Commit message    Author    Age    Files    Lines
* Merge branch 'master' of https://github.com/JuliaLang/utf8proc    Steven G. Johnson    2018-07-24    1    -0/+13
|\
| * update data and algorithms for Unicode 11 (#140)    Steven G. Johnson    2018-07-24    1    -0/+13
| |
* | update copyright statement for data_generator    Steven G. Johnson    2018-07-24    1    -0/+2
|/
* charwidth=1 for soft hyphen and unassigned codepoints (#135)    Steven G. Johnson    2018-07-24    1    -1/+1
|     * use width=1 for soft hyphen and for unassigned/PUA codepoints
|     * don't count unassigned codepoints when comparing with system wcwidth
|     * more tests
|     * indentation fixes
|     * NEWS for 135
|     * remove special-casing for arabic control characters affecting a span of numbers, which are sometimes zero-width and sometimes not
|     * regenerate
* uppercase mapping ß (U+00df) to ẞ (U+1E9E) (#134)    Steven G. Johnson    2018-05-02    1    -13/+13
|     * uppercase(0x00df) = 0x1e9e
|     * tests for titlecase and u+00df uppercase
|     * NEWS, another test
* Case folding fixes (#133)    Steven G. Johnson    2018-05-02    1    -1/+1
|     * Fixes allowing for “Full” folding and NFKC_CaseFold compliance.
|     * Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
|     * Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.
|     * Document the changes to UTF8PROC_IGNORE in header.
|     * Add NFKC_CF helper function with documentation.
|     * restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test
|     * success message
|     * test that IGNORE does not strip NA
|     * data update
|     * NFKC_Casefold shouldn't strip NA
* Ensure generated const data tables are hidden via "static" (#100)    Paul Smith    2017-02-19    1    -5/+5
|
* Smaller tables (#68)    Benito van der Zander    2016-07-12    1    -40/+123
|     * convert sequences to utf-16 (saves 25kb)
|     * store sequence length in properties instead using -1 termination (saves 10kb)
|     * cache index for slightly faster data creation
|     * store lower/upper/title mapping in sequence array (saves 25kb). Add utf8proc_totitle, as title_mapping cannot be used to get the title codepoint anymore. Rename xxx_mapping to xxx_seqindex, so programs assuming a value with the old meaning fail at compile time
|     * change combination array data type to uint16 (saves 40kb)
|     * merge 1st and 2nd comb index (saves 50kb)
|     * kill empty prefix/suffix in combination array (saves 50kb)
|     * there was no need to have a separate combination start array, it can be merged in a single array
|     * some fixes
|     * mark the table as const again
|     * and regen
* Unicode 9 updates (#70)    Keno Fischer    2016-06-28    1    -3/+3
|     * Updates for Unicode 9.0.0 TR29 Changes
|       - New rules GB10/(12/13) are used to combine emoji-zwj sequences/ (force grapheme breaks every two RI codepoints). Unfortunately this breaks statelessness of grapheme-boundary determination. Deal with this by ignoring the problem in utf8proc_grapheme_break, and by hacking in a special case in decompose
|       - ZWJ moved to its own boundclass, update what is now GB9 accordingly.
|       - Add comments to indicate which rule a given case implements
|       - The Number of bound classes Now exceeds 4 bits, expand to 8 and reorganize fields
|     * Import Unicode 9 data
|     * Update Grapheme break API to expose state override
|     * Bump MAJOR version
* Reduce the size of the binary.    Michaël Meyer    2015-12-09    1    -3/+4
|     Use integers instead of pointers in Unicode tables. Saves 226 kb / 716 kb in the compiled library.
* sort keys to try to eliminate data dependence on Ruby version    Steven G. Johnson    2015-06-25    1    -2/+2
|
* Prefix other C99 typedefs with utf8proc_    Tony Kelman    2015-04-06    1    -4/+4
|
* fix #2: add charwidth function    Steven G. Johnson    2015-03-12    1    -4/+14
|
* directory cleanup: move tests and data into subdirectories    Steven G. Johnson    2015-03-06    1    -0/+317