From 7b30a5520cfb56e651f0eb4da85a3e07747da7dc Mon Sep 17 00:00:00 2001
From: John Mark Bell
Date: Sat, 23 Jun 2007 22:40:25 +0000
Subject: Import hubbub -- an HTML parsing library.

Plenty of work still to do (like tree generation ;)

svn path=/trunk/hubbub/; revision=3359
---
 docs/Architecture | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 docs/Todo         | 12 ++++++++
 2 files changed, 95 insertions(+)
 create mode 100644 docs/Architecture
 create mode 100644 docs/Todo

(limited to 'docs')

diff --git a/docs/Architecture b/docs/Architecture
new file mode 100644
index 0000000..73966eb
--- /dev/null
+++ b/docs/Architecture
@@ -0,0 +1,83 @@
+Hubbub parser architecture
+==========================
+
+Introduction
+------------
+
+  Hubbub is a flexible HTML parser. It offers two interfaces:
+
+    * a SAX-style event interface
+    * a DOM-style tree-based interface
+
+Overview
+--------
+
+  Hubbub comprises four parts:
+
+    * a charset handler
+    * an input stream veneer
+    * a tokeniser
+    * a tree builder
+
+  Charset handler
+  ---------------
+
+  The charset handler converts the raw data input into a requested encoding.
+
+  Input stream veneer
+  -------------------
+
+  The input stream veneer provides an abstract stream-like interface over
+  the document buffer. This is used by the tokeniser. The document buffer
+  will be encoded in either UTF-8 or UTF-16 (this is client-selectable).
+
+  Tokeniser
+  ---------
+
+  The tokeniser divides the data held in the document buffer into chunks.
+  It sends SAX-style events for each chunk. The tokeniser is agnostic to
+  the charset the document buffer is stored in.
+
+  Tree builder
+  ------------
+
+  The tree builder constructs a DOM tree from the SAX events emitted by the
+  tokeniser. The tree builder is tied to the document buffer charset.
+
+Memory usage and ownership
+--------------------------
+
+  Memory usage within the library is well defined, as is ownership of allocated
+  memory.
+
+  Raw input data provided by the library client is owned by the client.
+
+  The document buffer is allocated on the fly by the library.
+
+  The document buffer is created and resized by the charset handler. Its
+  location is passed to the tree builder through a dedicated event. While
+  parsing is occurring, the ownership of the document buffer lies with the
+  charset handler. Upon parse completion, the tree builder may request
+  ownership of the buffer. If it does not, the buffer will be freed on parser
+  destruction.
+
+  SAX events which refer to document segments contain direct references into
+  the document buffer (i.e. no copying of data held in the document buffer
+  occurs).
+
+  The tree builder will allocate memory for use as DOM nodes. References to
+  strings in the document buffer will be direct and will operate a
+  copy-on-write strategy. All strings (excepting those which comprise part of
+  the document buffer) and nodes within the DOM are reference counted. Upon a
+  reference count reaching 0, the item is freed.
+
+  The above strategy permits data copying to be kept to a minimum, hence
+  minimising memory usage.
+
+Parse errors
+------------
+
+  Notification of parse errors is made through a dedicated event similar to
+  that used for notification of movement of the document buffer. This event
+  contains the line/column offset of the error location, along with a message
+  detailing the error.
diff --git a/docs/Todo b/docs/Todo
new file mode 100644
index 0000000..2abce2b
--- /dev/null
+++ b/docs/Todo
@@ -0,0 +1,12 @@
+TODO list
+=========
+
+ + Update tokeniser to comply with latest spec draft (currently complies
+   with 2007-06-12 draft)
+ + Implement one or more tree builders
+ + More charset convertors (or make the iconv codec significantly faster)
+ + Parse error reporting from the tokeniser
+ + Implement extraneous chunk insertion/tokenisation
+ + Statistical charset autodetection
+ + Shared library, for those platforms that support such things
+ + Optimise it
--
cgit v1.2.3
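
The zero-copy and copy-on-write rules described in docs/Architecture can be
illustrated with a minimal sketch. It is written in Python for brevity and is
not hubbub's C API: `BufferString`, `shares_buffer`, `value` and `append` are
hypothetical names, and a `bytearray` stands in for the document buffer.

```python
class BufferString:
    """Hypothetical copy-on-write string (not part of hubbub).

    It is a direct reference into the shared document buffer until it is
    first modified, at which point it takes a private copy.
    """

    def __init__(self, buffer, offset, length):
        # Slicing a memoryview references the buffer's bytes; nothing is copied.
        self._view = memoryview(buffer)[offset:offset + length]
        self._own = None  # private copy, created on first write

    @property
    def shares_buffer(self):
        return self._own is None

    def value(self):
        data = self._view if self._own is None else self._own
        return bytes(data).decode("utf-8")

    def append(self, text):
        if self._own is None:
            # First write: copy the referenced bytes, then drop the reference
            # so the document buffer itself is never modified.
            self._own = bytearray(self._view)
            self._view.release()
        self._own += text.encode("utf-8")


# Stand-in for the document buffer produced by the charset handler.
document = bytearray("<p>Hello</p>", "utf-8")

name = BufferString(document, 1, 1)  # refers to "p"
text = BufferString(document, 3, 5)  # refers to "Hello"

text.append(", world")  # triggers the copy; `document` is left untouched
```

Here `name` still points into the document buffer, while `text` owns its own
storage after the write. The real library additionally reference counts such
strings and DOM nodes, which this sketch leaves to the Python runtime.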