Diffstat (limited to 'docs')
-rw-r--r-- docs/Macros 60
1 files changed, 60 insertions, 0 deletions
diff --git a/docs/Macros b/docs/Macros
new file mode 100644
index 0000000..f301a98
--- /dev/null
+++ b/docs/Macros
@@ -0,0 +1,60 @@
+The data which Hubbub is fed (the input stream) gets buffered into a UTF-8
+buffer. This buffer only holds a subset of the input stream at any given time.
+To avoid unnecessary copying (which costs both time and memory), Hubbub
+tries to make all emitted strings point into this buffer, which is then
+advanced after tokens have been emitted. This is not always possible, however,
+because HTML5 specifies behaviour which requires changing various characters to
+various other characters, and these sets of characters may not have the same
+length. These cases are:
+
+ - CR handling -- CRLFs and CRs are converted to LFs
+ - tag and attribute names are lowercased
+ - entities are allowed in attribute values
+ - NUL bytes must be turned into U+FFFD REPLACEMENT CHARACTER
+
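To see why these cases rule out pointing straight into the input buffer, here is a minimal sketch of two of the transformations. The helper name `normalise` and its interface are assumptions made for illustration and are not part of Hubbub. CR handling can shrink the text, while NUL replacement grows it from one byte to three (U+FFFD is EF BF BD in UTF-8).

```c
#include <stddef.h>

/* Hypothetical helper, not Hubbub's API: copy "in" to "out", turning CRLF
 * and CR into LF, and NUL into U+FFFD (EF BF BD in UTF-8).
 * "out" must have room for 3 * n bytes. Returns the bytes written. */
static size_t normalise(const char *in, size_t n, char *out)
{
	size_t i, o = 0;

	for (i = 0; i < n; i++) {
		if (in[i] == '\r') {
			if (i + 1 < n && in[i + 1] == '\n')
				i++;	/* CRLF: skip the LF, one LF is emitted below */
			out[o++] = '\n';
		} else if (in[i] == '\0') {
			out[o++] = (char) 0xEF;
			out[o++] = (char) 0xBF;
			out[o++] = (char) 0xBD;
		} else {
			out[o++] = in[i];
		}
	}

	return o;
}
```

Because the output length can differ from the input length in either direction, the result cannot be described as a run of bytes in the original input.
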
+When collecting the strings it will emit, Hubbub starts by assuming that no
+transformations on the input stream will be required. However, if it hits one
+of the above cases, then it copies all of the collected characters into a buffer
+and switches to using that instead. This means that every time a character is
+collected, and that character might need to go into the buffer, the code must
+check whether the string is currently buffered. To allow this check, and
+others, to happen when necessary and never otherwise, Hubbub uses a set of
+macros to collect characters, detailed below.
+
+Hubbub strings are (beginning,length) pairs. This means that once the
+beginning is set to a position in the input stream, the string can collect
+further character runs in the stream simply by adding to the length part. This
+makes extending strings very efficient.
+
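The (beginning,length) model can be sketched as follows. The type `span` and both function names are hypothetical stand-ins chosen for this example; the real hubbub_string may be laid out differently.

```c
#include <stddef.h>

/* Hypothetical stand-in for a Hubbub string: start pointer plus length. */
typedef struct {
	const char *ptr;
	size_t len;
} span;

/* Begin a string at "at", covering "len" bytes of the input. */
static span span_start(const char *at, size_t len)
{
	span s = { at, len };
	return s;
}

/* Collect the next "len" bytes: nothing is copied, only the length grows. */
static void span_extend(span *s, size_t len)
{
	s->len += len;
}
```

For the input "<title>", starting a span at the "t" with length 1 and then extending it by 4 yields a span covering "title" without touching the bytes themselves.
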
+ | COLLECT(hubbub_string str, uintptr_t cptr, size_t length)
+
+ This collects the character pointed to by "cptr" (of size "length") into "str",
+ whether str is a buffered or unbuffered string, but only if "str" already
+ points to collected characters.
+
+ | COLLECT_NOBUF(hubbub_string str, size_t length)
+
+ This collects "length" bytes into "str", but only if "str" already points to
+ collected characters. (There is no need to pass the character, since this
+ just increases the length of the string.)
+
+ | COLLECT_MS(hubbub_string str, uintptr_t cptr, size_t length)
+
+ If "str" is currently zero-length, this acts like START(str, cptr, length).
+ Otherwise, it just acts like COLLECT(str, cptr, length).
+
+ | START(hubbub_string str, uintptr_t cptr, size_t length)
+
+ This sets the string "str"'s start to "cptr" and its length to "length".
+
+ | START_BUF(hubbub_string str, uintptr_t cptr, size_t length)
+
+ This buffers the character of length "length" pointed to by "cptr" and then
+ sets "str" to point to it.
+
+ | SWITCH(hubbub_string str)
+
+ This switches the string "str" from unbuffered to buffered; it copies all
+ characters currently collected in "str" to the buffer and then updates "str"
+ to point there.
+
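The behaviour described for the macros above can be approximated with plain functions. This is a sketch under stated assumptions, not Hubbub's implementation: the `collector` type, the fixed-size side buffer, and every function name are invented for illustration, and bounds checks are omitted. It collects a tag name, staying unbuffered until the first uppercase ASCII letter forces a switch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIDE_CAP 64	/* fixed capacity, purely for illustration */

/* Hypothetical collector; all names here are illustrative, not Hubbub's. */
typedef struct {
	const char *ptr;	/* start of the collected bytes */
	size_t len;		/* number of bytes collected */
	bool buffered;		/* true once ptr refers to side[] */
	char side[SIDE_CAP];	/* side buffer for transformed text */
} collector;

/* Like START: point the string straight at the input. */
static void col_start(collector *c, const char *cptr, size_t length)
{
	c->ptr = cptr;
	c->len = length;
	c->buffered = false;
}

/* Like START_BUF: buffer the character, then point the string at it. */
static void col_start_buf(collector *c, const char *cptr, size_t length)
{
	memcpy(c->side, cptr, length);
	c->ptr = c->side;
	c->len = length;
	c->buffered = true;
}

/* Like SWITCH: copy what has been collected and point at the copy. */
static void col_switch(collector *c)
{
	if (c->buffered)
		return;
	memcpy(c->side, c->ptr, c->len);
	c->ptr = c->side;
	c->buffered = true;
}

/* Like COLLECT: extend in place, or append to the buffer when buffered. */
static void col_collect(collector *c, const char *cptr, size_t length)
{
	if (c->len == 0)
		return;		/* only extends a string already started */
	if (c->buffered)
		memcpy(c->side + c->len, cptr, length);
	c->len += length;
}

/* Like COLLECT_MS: start the string if it is empty, else collect. */
static void col_collect_ms(collector *c, const char *cptr, size_t length)
{
	if (c->len == 0)
		col_start(c, cptr, length);
	else
		col_collect(c, cptr, length);
}

/* Collect an "n"-byte tag name from "in", lowercasing ASCII letters. */
static void collect_name(collector *c, const char *in, size_t n)
{
	size_t i;

	c->ptr = NULL;
	c->len = 0;
	c->buffered = false;

	for (i = 0; i < n; i++) {
		if (in[i] >= 'A' && in[i] <= 'Z') {
			char lower = (char) (in[i] - 'A' + 'a');

			if (c->len == 0) {
				col_start_buf(c, &lower, 1);
			} else {
				col_switch(c);
				col_collect(c, &lower, 1);
			}
		} else {
			col_collect_ms(c, in + i, 1);
		}
	}
}
```

With the input "body" the collector never copies and its pointer stays inside the input; with "BoDy" it ends up buffered and holds "body".
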