1 files changed, 60 insertions, 0 deletions
diff --git a/docs/Macros b/docs/Macros
new file mode 100644
index 0000000..f301a98
--- /dev/null
+++ b/docs/Macros
@@ -0,0 +1,60 @@
+The data which Hubbub is fed (the input stream) gets buffered into a UTF-8
+buffer.  This buffer only holds a subset of the input stream at any given time.
+To avoid unnecessary copying (which costs both time and memory), Hubbub
+tries to make all emitted strings point into this buffer, which is then
+advanced after tokens have been emitted.  This is not always possible, however,
+because HTML5 requires some input characters to be replaced with others, and
+the replacement may not be the same length in bytes as the original.  These
+cases are:
+
+ - CR handling -- CRLFs and CRs are converted to LFs
+ - tag and attribute names are lowercased
+ - entities are allowed in attribute names
+ - NUL bytes must be turned into U+FFFD REPLACEMENT CHARACTER
+
+When collecting the strings it will emit, Hubbub starts by assuming that no
+transformations on the input stream will be required.  However, if it hits one
+of the above cases, then it copies all of the collected characters into a buffer
+and switches to using that instead.  This means that whenever a character is
+collected into a string that might already have been switched to a buffer,
+the code must check whether to append it to that buffer.  So that this check,
+and others, happen only where they are needed, Hubbub uses a set of macros to
+collect characters, detailed below.
+
+Hubbub strings are (beginning,length) pairs.  This means that once the
+beginning is set to a position in the input stream, the string can collect
+further character runs in the stream simply by adding to the length part.  This
+makes extending strings very efficient.
+
+  | COLLECT(hubbub_string str, uintptr_t cptr, size_t length)
+
+  This collects the character pointed to by "cptr" (of size "length") into
+  "str", whether "str" is a buffered or unbuffered string, but only if "str"
+  already points to collected characters.
+  
+  | COLLECT_NOBUF(hubbub_string str, size_t length)
+  
+  This collects "length" bytes into "str", but only if "str" already points to
+  collected characters.  (There is no need to pass the character, since this
+  just increases the length of the string.)
+  
+  | COLLECT_MS(hubbub_string str, uintptr_t cptr, size_t length)
+  
+  If "str" is currently zero-length, this acts like START(str, cptr, length).
+  Otherwise, it just acts like COLLECT(str, cptr, length).
+  
+  | START(hubbub_string str, uintptr_t cptr, size_t length)
+
+  This sets the string "str"'s start to "cptr" and its length to "length".
+  
+  | START_BUF(hubbub_string str, uintptr_t cptr, size_t length)
+  
+  This buffers the character of length "length" pointed to by "cptr" and then
+  sets "str" to point to it.
+  
+  | SWITCH(hubbub_string str)
+  
+  This switches the string "str" from unbuffered to buffered; it copies all
+  characters currently collected in "str" to the buffer and then updates it
+  to point there.
+
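The collection scheme described in docs/Macros above can be illustrated with a small C sketch. It is not Hubbub's code: every name in it (`sketch_string`, `sketch_buffer`, and the `sketch_*` functions) is hypothetical, the macros are modelled as plain functions, the side buffer is a fixed-size array, and the string being built is assumed to be the most recent thing written to that buffer. It shows a string starting out as a (beginning, length) pair over the input and switching to a buffered copy only when a character has to be changed, here by lowercasing ASCII letters in a name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a Hubbub string: a (beginning, length) pair. */
struct sketch_string {
	const char *start; /* first collected byte (input or side buffer) */
	size_t len;        /* number of bytes collected so far */
	int buffered;      /* non-zero once the bytes live in the side buffer */
};

/* Hypothetical side buffer; fixed size keeps the sketch short. */
struct sketch_buffer {
	char data[64];
	size_t used;
};

/* Append bytes to the side buffer and return where they were stored. */
static char *sketch_append(struct sketch_buffer *b, const char *p, size_t n)
{
	char *dst = b->data + b->used;
	assert(b->used + n <= sizeof b->data);
	memcpy(dst, p, n);
	b->used += n;
	return dst;
}

/* Modelled on START: point the string at a run in the input stream. */
static void sketch_start(struct sketch_string *s, const char *cptr, size_t n)
{
	s->start = cptr;
	s->len = n;
	s->buffered = 0;
}

/* Modelled on START_BUF: buffer the character, then point the string at it. */
static void sketch_start_buf(struct sketch_string *s, struct sketch_buffer *b,
		const char *cptr, size_t n)
{
	s->start = sketch_append(b, cptr, n);
	s->len = n;
	s->buffered = 1;
}

/* Modelled on SWITCH: copy what has been collected and repoint the string. */
static void sketch_switch(struct sketch_string *s, struct sketch_buffer *b)
{
	if (s->buffered)
		return;
	s->start = sketch_append(b, s->start, s->len);
	s->buffered = 1;
}

/* Modelled on COLLECT: extend the string, but only if it is non-empty. */
static void sketch_collect(struct sketch_string *s, struct sketch_buffer *b,
		const char *cptr, size_t n)
{
	if (s->len == 0)
		return;
	if (s->buffered)
		sketch_append(b, cptr, n); /* lands right after the string */
	s->len += n;                       /* unbuffered: just grow the length */
}

/* Collect a name from the input, lowercasing ASCII letters as it goes. */
struct sketch_string sketch_collect_name(const char *in, size_t n,
		struct sketch_buffer *b)
{
	struct sketch_string s = { NULL, 0, 0 };
	for (size_t i = 0; i < n; i++) {
		if (in[i] >= 'A' && in[i] <= 'Z') {
			/* The character must change, so it cannot stay in place. */
			char lower = (char)(in[i] + ('a' - 'A'));
			if (s.len == 0) {
				sketch_start_buf(&s, b, &lower, 1);
			} else {
				sketch_switch(&s, b);
				sketch_collect(&s, b, &lower, 1);
			}
		} else if (s.len == 0) {
			sketch_start(&s, &in[i], 1);
		} else {
			sketch_collect(&s, b, &in[i], 1);
		}
	}
	return s;
}
```

With a zero-initialised `struct sketch_buffer`, calling `sketch_collect_name("div", 3, &buf)` leaves the result pointing straight into the input with nothing copied, while an input such as "diV" copies the two bytes already collected into the side buffer at the moment the uppercase letter is seen and continues from there.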