path: root/src/tokeniser/tokeniser.c
Commit message | Author | Age | Files | Lines
* hubbub_alloc -> hubbub_allocator_fn | John Mark Bell | 2009-04-04 | 1 | -2/+3
    svn path=/trunk/hubbub/; revision=7043
* Sync tokeniser tests with html5lib. | John Mark Bell | 2009-03-10 | 1 | -12/+19
    Sync tokeniser implementation with the spec.
    Fix handling of \0 in the tag open state.
    The unicodeCharacters test is disabled, as json-c doesn't like it.
    svn path=/trunk/hubbub/; revision=6755
* Make doxygen produce API documentation. I guess it helps if you enable the right options. | John Mark Bell | 2009-01-08 | 1 | -1/+1
    Fix a couple more doxygen warnings.
    svn path=/trunk/hubbub/; revision=5996
* Use doxygen to create API documentation. | John Mark Bell | 2009-01-08 | 1 | -12/+13
    Add a bunch of extra commentary to stop doxygen warning.
    svn path=/trunk/hubbub/; revision=5994
* Fix potential read beyond available input data when processing \r in some states. | John Mark Bell | 2009-01-06 | 1 | -5/+5
    What happened was that, given \rabc, we would advance past the \r, then read at current_offset + len (len == 1), i.e. read 'b' instead of 'a'.
    If the data in the inputstream's internal buffer happened to end immediately after the \r, then we'd read past the end of the buffer thanks to a bug in lpu_inputstream_peek which was fixed in r5965.
    In any case, we'd still be looking at the wrong character when looking for CRLF pairs.
    All regression tests now pass again.
    svn path=/trunk/hubbub/; revision=5967
* Port to changed lpu API. | John Mark Bell | 2009-01-06 | 1 | -455/+635
    Drop HUBBUB_OOD and just use HUBBUB_NEEDDATA instead.
    Currently aborts in bogus comment handling if it encounters a \r at the end of the inputstream's UTF-8 buffer.
    svn path=/trunk/hubbub/; revision=5966
* Fix build breakage | John Mark Bell | 2008-11-30 | 1 | -1/+3
    svn path=/trunk/hubbub/; revision=5851
* lotsa C89, please check. | François Revel | 2008-11-30 | 1 | -48/+91
    svn path=/trunk/hubbub/; revision=5846
* Return errors from tokeniser constructor/destructor | John Mark Bell | 2008-11-09 | 1 | -13/+22
    svn path=/trunk/hubbub/; revision=5664
* Port hubbub to new lpu API | John Mark Bell | 2008-11-08 | 1 | -2/+3
    svn path=/trunk/hubbub/; revision=5656
* Squash memory leak | John Mark Bell | 2008-09-08 | 1 | -0/+2
    svn path=/trunk/hubbub/; revision=5285
* Fixes for handling of CR followed immediately by multibyte sequences. | John Mark Bell | 2008-09-06 | 1 | -59/+94
    Pedantic whitespace changes.
    More paranoia surrounding entity handling.
    svn path=/trunk/hubbub/; revision=5266
* Fix segfault caused by trampling the length of the current character when testing whether the 4 most recently read characters in the data state are <!--. | John Mark Bell | 2008-08-18 | 1 | -2/+8
    Add a couple of assertions for paranoia.
    svn path=/trunk/hubbub/; revision=5146
* Do for public IDs what r5107 did for system IDs. | Andrew Sidwell | 2008-08-13 | 1 | -14/+4
    svn path=/trunk/hubbub/; revision=5108
* Another COLLECT() -> COLLECT_MS() fix. | Andrew Sidwell | 2008-08-13 | 1 | -14/+4
    svn path=/trunk/hubbub/; revision=5107
* Add page which crashed, and fix the bug that caused it to do so. | Andrew Sidwell | 2008-08-13 | 1 | -4/+2
    svn path=/trunk/hubbub/; revision=5106
* Remove the CHAR() macro, which lets make test run again. | Andrew Sidwell | 2008-08-13 | 1 | -80/+74
    svn path=/trunk/hubbub/; revision=5104
* Optimise COLLECT_MS() macro. | Andrew Sidwell | 2008-08-13 | 1 | -5/+3
    svn path=/trunk/hubbub/; revision=5099
* Fix segfault in elimination of duplicate attributes. | John Mark Bell | 2008-08-13 | 1 | -7/+8
    svn path=/trunk/hubbub/; revision=5098
* Optimise comment states slightly, taking advantage of the fact that buffers store their own length and when emitting the comment, the buffer contains the whole comment and nothing else. | Andrew Sidwell | 2008-08-13 | 1 | -20/+1
    svn path=/trunk/hubbub/; revision=5095
* Fix tokeniser so make test passes, with possible perf hit. | Andrew Sidwell | 2008-08-13 | 1 | -18/+43
    svn path=/trunk/hubbub/; revision=5093
* Use COLLECT_MS() macro rather than COLLECT() in attribute values. | Andrew Sidwell | 2008-08-13 | 1 | -4/+4
    svn path=/trunk/hubbub/; revision=5086
* Sanity checking for string data | John Mark Bell | 2008-08-13 | 1 | -0/+39
    svn path=/trunk/hubbub/; revision=5080
* Remember to clear the self-closing flag when emitting a tag token. | Andrew Sidwell | 2008-08-11 | 1 | -0/+3
    svn path=/trunk/hubbub/; revision=5030
* Andrew Sidwell | 2008-08-11 | 1 | -46/+1
    - Remove an unused function from utils/string.c
    - Remove the no-op FINISH() macro from the tokeniser
    - Fix a typo in the charset detector
    svn path=/trunk/hubbub/; revision=5007
* Move one step closer to getting encoding changes working. | Andrew Sidwell | 2008-08-11 | 1 | -1/+1
    svn path=/trunk/hubbub/; revision=5000
* Propagate more return codes up the chain from the token emitter. | Andrew Sidwell | 2008-08-09 | 1 | -55/+38
    svn path=/trunk/hubbub/; revision=4980
* Propagate the use of hubbub_error up into at least a bit of the treebuilder. | Andrew Sidwell | 2008-08-09 | 1 | -2/+4
    svn path=/trunk/hubbub/; revision=4979
* Move tokeniser.c across to using hubbub_error for return codes, not bools, so that "encoding change" requests can be sent back down the chain from the treebuilder at some point. | Andrew Sidwell | 2008-08-09 | 1 | -227/+236
    svn path=/trunk/hubbub/; revision=4978
* Really fix handling of entities in attributes | John Mark Bell | 2008-08-04 | 1 | -1/+1
    svn path=/trunk/hubbub/; revision=4894
* Fix previous commit. | Andrew Sidwell | 2008-08-04 | 1 | -6/+14
    svn path=/trunk/hubbub/; revision=4893
* Fix bug in hubbub & html5lib tests relating to parsing entities ending without semicolons in attribute values. | Andrew Sidwell | 2008-08-04 | 1 | -1/+1
    svn path=/trunk/hubbub/; revision=4892
* Micro-optimisation | Andrew Sidwell | 2008-08-04 | 1 | -2/+1
    svn path=/trunk/hubbub/; revision=4890
* Rearrange emitting functions so they're all clumped together at the bottom of the file. | Andrew Sidwell | 2008-08-04 | 1 | -162/+172
    svn path=/trunk/hubbub/; revision=4889
* Refactor tokeniser token-emitting bits to remove unnecessary conditionals. | Andrew Sidwell | 2008-08-04 | 1 | -63/+62
    svn path=/trunk/hubbub/; revision=4888
* Change tokeniser->context.chars from a hubbub_string whose ptr part is never used to simply tokeniser->context.pending. | Andrew Sidwell | 2008-08-03 | 1 | -158/+155
    svn path=/trunk/hubbub/; revision=4882
* Remove some excessive indentation. | Andrew Sidwell | 2008-08-03 | 1 | -27/+23
    svn path=/trunk/hubbub/; revision=4881
* Remove the now-unnecessary COLLECT_*NOBUF() macros, replace them with the single statements they expanded to. | Andrew Sidwell | 2008-08-03 | 1 | -79/+68
    svn path=/trunk/hubbub/; revision=4880
* Remove tokeniser->to_buf, SWITCH(), and COLLECT_CHAR(), none of which are now necessary. | Andrew Sidwell | 2008-08-03 | 1 | -54/+26
    This should provide a small speedup.
    svn path=/trunk/hubbub/; revision=4873
* Andrew Sidwell | 2008-08-03 | 1 | -44/+22
    - Replace NDEBUG #ifdefs with #if 0s, to avoid slowing down Hubbub when profiling
    - Fix a few instances of where the wrong COLLECT*() macros were used
    - Always use emit_current_chars(tokeniser) rather than emit_character_token(tokeniser, tokeniser->context.chars), to make sure that the pointer is always set correctly
    svn path=/trunk/hubbub/; revision=4872
* Fix copy-and-paste error in previous commit. | Andrew Sidwell | 2008-07-31 | 1 | -1/+1
    svn path=/trunk/hubbub/; revision=4845
* Handle CRs correctly everywhere. | Andrew Sidwell | 2008-07-31 | 1 | -3/+123
    svn path=/trunk/hubbub/; revision=4844
* Handle NUL properly everywhere it should be. | Andrew Sidwell | 2008-07-31 | 1 | -5/+12
    svn path=/trunk/hubbub/; revision=4843
* Merged revisions 4631-4838 via svnmerge from svn://source.netsurf-browser.org/branches/takkaria/hubbub-parserutils | John Mark Bell | 2008-07-31 | 1 | -1919/+1646
    ........
    r4631 | takkaria | 2008-07-13 12:54:30 +0100 (Sun, 13 Jul 2008) | 2 lines
    Initial hatchet job moving to libparserutils (search and replace and a bit of cleaning up). This doesn't compile.
    ........
    r4632 | takkaria | 2008-07-13 15:28:52 +0100 (Sun, 13 Jul 2008) | 2 lines
    libparserutilize everything up to the "before attribute name" state. (Not compiling)
    ........
    r4633 | takkaria | 2008-07-13 15:32:14 +0100 (Sun, 13 Jul 2008) | 2 lines
    Replace all uses of "current_{comment|chars}" with just "chars".
    ........
    r4634 | takkaria | 2008-07-13 16:12:06 +0100 (Sun, 13 Jul 2008) | 2 lines
    Fix lots of compile errors, lpuise "before attribute name" state.
    ........
    r4636 | takkaria | 2008-07-13 17:23:17 +0100 (Sun, 13 Jul 2008) | 2 lines
    Finish lpuising the tag states, apart from character references.
    ........
    r4637 | takkaria | 2008-07-13 19:58:52 +0100 (Sun, 13 Jul 2008) | 2 lines
    lpuise the comment states.
    ........
    r4638 | takkaria | 2008-07-13 20:04:31 +0100 (Sun, 13 Jul 2008) | 2 lines
    Switch to setting hubbub_string::len to 0 instead of hubbub_string::ptr to NULL to indicate an empty buffer, as it was previously.
    ........
    r4639 | takkaria | 2008-07-13 21:02:11 +0100 (Sun, 13 Jul 2008) | 2 lines
    "lpu up" about half of the DOCTYPE handling stages.
    ........
    r4640 | takkaria | 2008-07-13 21:23:00 +0100 (Sun, 13 Jul 2008) | 2 lines
    Finish off LPUing the doctype modes.
    ........
    r4641 | takkaria | 2008-07-13 21:37:33 +0100 (Sun, 13 Jul 2008) | 2 lines
    The tokeniser uses lpu apart from the entity matcher, now.
    ........
    r4643 | takkaria | 2008-07-14 01:20:36 +0100 (Mon, 14 Jul 2008) | 2 lines
    Fix up the character reference matching stuff--still not properly dealt with, but compiles further.
    ........
    r4644 | takkaria | 2008-07-14 01:24:49 +0100 (Mon, 14 Jul 2008) | 2 lines
    Get the tokeniser compiling in its LPU'd form.
    ........
    r4645 | takkaria | 2008-07-14 01:26:34 +0100 (Mon, 14 Jul 2008) | 2 lines
    Remember to advance the stream position after emitting tokens.
    ........
    r4646 | takkaria | 2008-07-14 01:34:36 +0100 (Mon, 14 Jul 2008) | 2 lines
    Nuke the src/input directory and start work on the treebuilder.
    ........
    r4647 | takkaria | 2008-07-14 01:56:27 +0100 (Mon, 14 Jul 2008) | 2 lines
    Get hubbub building in its LPU'd form.
    ........
    r4648 | takkaria | 2008-07-14 02:41:03 +0100 (Mon, 14 Jul 2008) | 2 lines
    Get the tokeniser2 testrunner working.
    ........
    r4649 | takkaria | 2008-07-14 02:48:55 +0100 (Mon, 14 Jul 2008) | 2 lines
    Fix test LDFLAGS so things link properly.
    ........
    r4650 | takkaria | 2008-07-14 16:25:51 +0100 (Mon, 14 Jul 2008) | 2 lines
    Get testcases compiling, remove ones now covered by libparserutils.
    ........
    r4651 | takkaria | 2008-07-14 16:37:09 +0100 (Mon, 14 Jul 2008) | 2 lines
    Remove more tests covered by libpu.
    ........
    r4652 | takkaria | 2008-07-14 17:53:18 +0100 (Mon, 14 Jul 2008) | 2 lines
    Fix up the tokeniser a bit.
    ........
    r4653 | takkaria | 2008-07-14 19:02:15 +0100 (Mon, 14 Jul 2008) | 3 lines
    - Remove the buffer_handler stuff from hubbub
    - Add the basics of a buffer for attribute values and text.
    ........
    r4654 | takkaria | 2008-07-14 20:00:45 +0100 (Mon, 14 Jul 2008) | 2 lines
    Get character references working in attribute values, start trying to make them work in character tokens.
    ........
    r4656 | takkaria | 2008-07-14 23:28:52 +0100 (Mon, 14 Jul 2008) | 2 lines
    Get entities working a bit better.
    ........
    r4657 | takkaria | 2008-07-14 23:37:16 +0100 (Mon, 14 Jul 2008) | 2 lines
    Get entities working properly. (!)
    ........
    r4658 | takkaria | 2008-07-14 23:56:10 +0100 (Mon, 14 Jul 2008) | 2 lines
    Make doctypes work a bit better.
    ........
    r4659 | takkaria | 2008-07-15 00:18:49 +0100 (Tue, 15 Jul 2008) | 2 lines
    Get DOCTYPEs working.
    ........
    r4660 | takkaria | 2008-07-15 00:26:36 +0100 (Tue, 15 Jul 2008) | 2 lines
    Fix CDATA sections.
    ........
    r4661 | takkaria | 2008-07-15 01:01:16 +0100 (Tue, 15 Jul 2008) | 2 lines
    Get comments working again.
    ........
    r4662 | takkaria | 2008-07-15 01:14:19 +0100 (Tue, 15 Jul 2008) | 2 lines
    Fix EOF in "after attribute name" state.
    ........
    r4664 | takkaria | 2008-07-15 01:30:27 +0100 (Tue, 15 Jul 2008) | 2 lines
    Put the tests in better order, remove one now superseded with libpu.
    ........
    r4665 | takkaria | 2008-07-15 01:46:29 +0100 (Tue, 15 Jul 2008) | 2 lines
    Remove a lot of now-redundant clearings of the current stream offset.
    ........
    r4667 | jmb | 2008-07-15 11:56:54 +0100 (Tue, 15 Jul 2008) | 2 lines
    Completely purge charset stuff from hubbub. Parserutils handles this now.
    ........
    r4677 | takkaria | 2008-07-15 21:03:42 +0100 (Tue, 15 Jul 2008) | 2 lines
    Get more tests passing, handle NUL bytes in data state.
    ........
    r4694 | takkaria | 2008-07-18 17:55:44 +0100 (Fri, 18 Jul 2008) | 3 lines
    - Handle CRs correctly in some token states.
    - Handle NULs correctly in the CDATA state.
    ........
    r4706 | takkaria | 2008-07-19 14:58:48 +0100 (Sat, 19 Jul 2008) | 2 lines
    Improve the tokeniser2 output a bit.
    ........
    r4721 | takkaria | 2008-07-21 20:57:29 +0100 (Mon, 21 Jul 2008) | 2 lines
    Get a better framework in place to allow switching to using a buffer mid-collect. This fails a couple of testcases and doesn't implement proper CR or NUL support yet.
    ........
    r4725 | takkaria | 2008-07-23 17:20:07 +0100 (Wed, 23 Jul 2008) | 2 lines
    Make comment tokens in tokeniser2 display both expected and actual output.
    ........
    r4726 | takkaria | 2008-07-23 19:10:23 +0100 (Wed, 23 Jul 2008) | 4 lines
    - Add FINISH() macro which stops using buffered character collection.
    - Make the encoding U+FFFD in UTF-8 a global variable, for sanity
    - Make the bogus comment state deal with NULs correctly.
    ........
    r4730 | takkaria | 2008-07-24 00:35:16 +0100 (Thu, 24 Jul 2008) | 2 lines
    Try to get NUL bytes handled as the spec says.
    ........
    r4731 | takkaria | 2008-07-24 00:40:59 +0100 (Thu, 24 Jul 2008) | 2 lines
    Get CRs working in the data state.
    ........
    r4732 | takkaria | 2008-07-24 00:47:45 +0100 (Thu, 24 Jul 2008) | 2 lines
    Set force-quirks correctly when failing to match PUBLIC or SYSTEM in DOCTYPEs.
    ........
    r4773 | takkaria | 2008-07-28 15:34:41 +0100 (Mon, 28 Jul 2008) | 2 lines
    Fix up the tokeniser, finally.
    ........
    r4801 | takkaria | 2008-07-29 15:59:31 +0100 (Tue, 29 Jul 2008) | 2 lines
    Refactor macros a bit.
    ........
    r4802 | takkaria | 2008-07-29 16:04:17 +0100 (Tue, 29 Jul 2008) | 2 lines
    Do s/HUBBUB_TOKENISER_STATE_/STATE_/, for shorter line lengths.
    ........
    r4805 | takkaria | 2008-07-29 16:58:37 +0100 (Tue, 29 Jul 2008) | 4 lines
    Start cleaning up the hubbub tokeniser;
    - refactor to use new inline emit_character_token() and emit_current_tag() functions; makes code clearer
    - check EOF before using the CHAR() macro, so eventually it can be removed.
    ........
    r4806 | takkaria | 2008-07-29 17:45:36 +0100 (Tue, 29 Jul 2008) | 2 lines
    More cleanup like the previous commit.
    ........
    r4807 | takkaria | 2008-07-29 19:48:44 +0100 (Tue, 29 Jul 2008) | 2 lines
    Rewrite comment-handling code to be just the one function, whilst updating it to handle CRs and NULs properly. (All comments now always use the buffer.)
    ........
    r4820 | takkaria | 2008-07-30 14:14:49 +0100 (Wed, 30 Jul 2008) | 2 lines
    Finish off the first sweep of cleaning up and refactoring the tokeniser.
    ........
    r4821 | takkaria | 2008-07-30 15:12:22 +0100 (Wed, 30 Jul 2008) | 2 lines
    Add copyright statement.
    ........
    r4822 | takkaria | 2008-07-30 17:23:01 +0100 (Wed, 30 Jul 2008) | 2 lines
    Apply changes made to tokeniser2 to tokeniser3.
    ........
    r4829 | takkaria | 2008-07-31 01:59:07 +0100 (Thu, 31 Jul 2008) | 4 lines
    - Make the tokeniser save everything into the buffer, at least for now.
    - Fix logic errors introduced in refactoring
    - Avoid emitting more tokens than we have to (e.g. instead of emitting "<>" and switching back to the data state, just switch back to the data state and let it take care of it)
    ........
    r4830 | takkaria | 2008-07-31 02:03:08 +0100 (Thu, 31 Jul 2008) | 2 lines
    Small treebuilder <isindex> fix.
    ........
    r4831 | takkaria | 2008-07-31 02:32:29 +0100 (Thu, 31 Jul 2008) | 2 lines
    Stop holding on to pointers to character data across treebuilder calls.
    ........
    r4832 | takkaria | 2008-07-31 02:45:09 +0100 (Thu, 31 Jul 2008) | 18 lines
    Merge revisions 4620-4831 from trunk hubbub to libinputstream hubbub, modulo one change to test/Makefile which makes the linker choke when linking tests.
    ------------------------------------------------------------------------
    r4666 | jmb | 2008-07-15 11:52:13 +0100 (Tue, 15 Jul 2008) | 3 lines
    Make tree2 perform reference counting.
    Fix bits of the treebuilder to perform reference counting correctly in the face of *result not pointing to the same object as the node passed in to the treebuilder client callbacks.
    ------------------------------------------------------------------------
    r4668 | jmb | 2008-07-15 12:37:30 +0100 (Tue, 15 Jul 2008) | 2 lines
    Fully document treebuilder callbacks.
    ------------------------------------------------------------------------
    r4675 | takkaria | 2008-07-15 21:01:03 +0100 (Tue, 15 Jul 2008) | 2 lines
    Fix memory leak in tokeniser2.
    ------------------------------------------------------------------------
    ........
    r4834 | jmb | 2008-07-31 09:57:51 +0100 (Thu, 31 Jul 2008) | 2 lines
    Fix infinite loop in charset detector
    ........
    r4835 | jmb | 2008-07-31 13:01:24 +0100 (Thu, 31 Jul 2008) | 2 lines
    Actually store namespaces on formatting list. Otherwise we read uninitialised memory. Add some semblance of filling allocations with junk to myrealloc().
    ........
    r4836 | jmb | 2008-07-31 13:06:07 +0100 (Thu, 31 Jul 2008) | 2 lines
    Lose debug again
    ........
    r4837 | jmb | 2008-07-31 15:09:19 +0100 (Thu, 31 Jul 2008) | 2 lines
    Lose obsolete testdata (this is now part of lpu)
    ........
    svn path=/trunk/hubbub/; revision=4839
* Add an explicit null namespace to hubbub_ns. | Andrew Sidwell | 2008-07-09 | 1 | -0/+4
    svn path=/trunk/hubbub/; revision=4550
* Add the basics of namespace support. | Andrew Sidwell | 2008-06-26 | 1 | -0/+1
    svn path=/trunk/hubbub/; revision=4452
* Add CDATA tests and the infrastructure to support them. | Andrew Sidwell | 2008-06-19 | 1 | -0/+3
    svn path=/trunk/hubbub/; revision=4410
* Fix assert()s, and only compile the preceding line when debugging to avoid warnings. | Andrew Sidwell | 2008-06-19 | 1 | -4/+8
    svn path=/trunk/hubbub/; revision=4408
* Use assert() instead of abort() or returning NULL in code that should not be reached. | Andrew Sidwell | 2008-06-19 | 1 | -24/+10
    svn path=/trunk/hubbub/; revision=4406
* Fix remaining issues with byte-by-byte tokenisation. | Andrew Sidwell | 2008-06-19 | 1 | -10/+17
    svn path=/trunk/hubbub/; revision=4405