1 files changed, 204 insertions, 0 deletions
diff --git a/docs/source-object-backing-store.md b/docs/source-object-backing-store.md
new file mode 100644
index 000000000..0fb3614d4
--- /dev/null
+++ b/docs/source-object-backing-store.md
@@ -0,0 +1,204 @@
+Source Object (low level) cache backing store
+=============================================
+
+Introduction
+------------
+
+The source object cache provides a system to extend the life of source
+objects (HTML files, images etc.) after they are no longer immediately
+being used.
+
+Only fetch types where we have well defined rules on caching are
+considered, in practice this limits us to HTTP(S). The section in
+RFC2616 [1] on caching specifies these rules.
+
+To further extend the objects lifetime they can be pushed into a
+backing store where the objects are available for reuse less quickly
+than from RAM but faster than retrieving from the network again.
+
+The backing store implementation provides a key:value infrastructure
+with a simple store, retrieve and invalidate interface.
+
+Generic filesystem backing store
+--------------------------------
+
+Although the backing store interface is fully pluggable a generic
+implementation based on storing objects on the filesystem in a
+hierarchy of directories.
+
+The option to alter the backing store format exists and is controlled
+by a version field. It is implementation defined what happens if a
+version mis-match occurs.
+
+As the backing store only holds cache data one should not expect a
+great deal of effort to be expended converting formats (i.e. the cache
+may simply be discarded).
+
+Layout version 1.1
+------------------
+
+An object has an identifier value generated from the URL (NetSurf
+backing stores uses the URL as the unique key). The value used is
+obtained using nsurl_hash() which is currently a 32 bit FNV so is
+directly usable.
+
+This identifier is adequate to ensure the collision rate for the
+hashed URL values (a collision for every 2^16 URLs added) is
+sufficiently low the overhead of returning the wrong object (which
+backing stores are permitted to do) is not significant.
+
+An entry list is maintained which contains all the metadata about a
+given identifier. This list is limited in length to constrain the
+resources necessary to maintain it. It is made persistent to avoid the
+overhead of reconstructing it at initialisation and to keep the data
+used to improve the eviction decisions.
+
+Each object is stored and retrieved directly into the filesystem using
+a filename generated from a RFC4648 base32 encoding of an address
+value. The objects address is derived from the identifier by cropping
+it to a shorter length.
+
+A mapping between the object address and its entry is maintained which
+uses storage directly proportional to the size of the address length.
+
+The cropping length is stored in the control file with the default
+values set at compile time. This allows existing backing stores to
+continue operating with existing data independently of new default
+setting. This setting gives some ability to tune the default cache
+index size to values suitable for a specific host operating system.
+
+E.g. Linux based systems can easily cope with several megabytes of
+mmaped index but RISC OS might want to limit this to a few megabytes
+of heap at most.
+
+The files are stored on disc using their base32 address value.
+By creating a directory for each character of the encoded filename
+(except the last which is of course the leafname) we create a
+directory structure where no directory has more than 32 entries.
+
+E.g. A 19bit address of 0x1 would be base32 encoded into AAAB
+resulting in the data being stored in a file path of
+"/store/prefix/d/B/A/A/BAAAAA".
+
+An address of 0x00040001 encodes to BAAB and a file path of
+"/store/prefix/m/B/A/A/BAABAAA"
+
+Version 1.0
+-----------
+
+The version 1 layout was identical to the 1.1 except base64url
+encoding was used, this proved problematic as some systems filesystems
+were case insensitive so upper and lower case letters collided.
+
+There is no upgrade provision from the previous version simply delete
+the cache directory.
+
+Control files
+~~~~~~~~~~~~~
+
+control
++++++++
+A control file is used to hold a list of values describing how the
+other files in the backing store should be used.
+
+entries
++++++++
+
+this file contains a table of entries describing the files held on the
+filesystem.
+
+Each control file table entry is 28 bytes and consists of
+
+ - signed 64 bit value for last use time
+
+ - 32bit full url hash allowing for index reconstruction and
+   additional collision detection. Also the possibility of increasing
+   the ADDRESS_LENGTH although this would require renaming all the
+   existing files in the cache and is not currently implemented.
+
+ - unsigned 32bit length for data
+
+ - unsigned 32bit length for metadata
+
+ - unsigned 16bit value for number of times used.
+
+ - unsigned 16bit value for flags
+
+ - unsigned 16bit value for data block index (unused)
+
+ - unsigned 16bit value for metatdata block index (unused)
+
+Address to entry index
+~~~~~~~~~~~~~~~~~~~~~~
+
+An entry index is held in RAM that allows looking up the address to
+map to an entry in the control file.
+
+The index is the only data structure whose size is directly dependant
+on the length of the hash specifically:
+
+(2 ^ (ADDRESS_BITS - 3)) * ENTRY_BITS) in bytes
+
+where ADDRESS_BITS is how long the address is in bits and ENTRY_BITS
+is how many entries the control file (and hence the while
+cache) may hold.
+
+RISCOS values
++++++++++++++
+
+By limiting the ENTRY_BITS size to 14 (16,384 entries) the entries
+list is limited to 448kilobytes.
+
+The typical values for RISC OS would set ADDRESS_BITS to 18. This
+spreads the entries over 262144 hash values which uses 512 kilobytes
+for the index. Limiting the hash space like this reduces the
+effectiveness of the cache.
+
+A small ADDRESS_LENGTH causes a collision (two URLs with the same
+address) to happen roughly for every 2 ^ (ADDRESS_BITS / 2) = 2 ^ 9 =
+512 objects stored. This roughly translates to a cache miss due to
+collision every ten pages navigated to.
+
+Larger systems
+++++++++++++++
+
+In general ENTRY_BITS set to 16 as this limits the store to 65536
+objects which given the average size of an object at 8 kilobytes
+yields half a gigabyte of disc used which is judged to be sufficient.
+
+For larger systems e.g. those using GTK frontend we would most likely
+select ADDRESS_BITS as 22 resulting in a collision every 2048 objects
+but the index using some 8 Megabytes
+
+Typical values
+--------------
+
+Example 1
+~~~~~~~~~
+
+For a store with 1034 objects generated from a random navigation of
+pages linked from the about:welcome page.
+
+Metadata total size is 593608 bytes an average of 574 bytes. The
+majority of the storage is used to hold the URLs and headers.
+
+Data total size is 9180475 bytes a mean of 8879 bytes 1648726 in the
+largest 10 entries which if excluded gives 7355 bytes average size
+
+Example 2
+~~~~~~~~~
+
+355 pages navigated in 80 minutes from about:welcome page and a
+handful of additional sites (google image search and reddit)
+
+2018 objects in cache at quit. 400 objects from news.bbc.co.uk alone
+
+Metadata total 987,439 bytes mean of 489 bytes
+
+data total 33,127,831 bytes mean of 16,416 bytes
+
+with one single 5,000,811 byte gif
+
+data totals without gif is 28,127,020 mean 13,945
+
+[1] http://tools.ietf.org/html/rfc2616#section-13
+\ No newline at end of file