diff options
Diffstat (limited to 'docs/source-object-backing-store.md')
-rw-r--r-- | docs/source-object-backing-store.md | 204 |
1 files changed, 204 insertions, 0 deletions
diff --git a/docs/source-object-backing-store.md b/docs/source-object-backing-store.md new file mode 100644 index 000000000..0fb3614d4 --- /dev/null +++ b/docs/source-object-backing-store.md @@ -0,0 +1,204 @@ +Source Object (low level) cache backing store +============================================= + +Introduction +------------ + +The source object cache provides a system to extend the life of source +objects (HTML files, images etc.) after they are no longer immediately +being used. + +Only fetch types where we have well defined rules on caching are +considered, in practice this limits us to HTTP(S). The section in +RFC2616 [1] on caching specifies these rules. + +To further extend the objects lifetime they can be pushed into a +backing store where the objects are available for reuse less quickly +than from RAM but faster than retrieving from the network again. + +The backing store implementation provides a key:value infrastructure +with a simple store, retrieve and invalidate interface. + +Generic filesystem backing store +-------------------------------- + +Although the backing store interface is fully pluggable a generic +implementation based on storing objects on the filesystem in a +hierarchy of directories. + +The option to alter the backing store format exists and is controlled +by a version field. It is implementation defined what happens if a +version mis-match occurs. + +As the backing store only holds cache data one should not expect a +great deal of effort to be expended converting formats (i.e. the cache +may simply be discarded). + +Layout version 1.1 +------------------ + +An object has an identifier value generated from the URL (NetSurf +backing stores uses the URL as the unique key). The value used is +obtained using nsurl_hash() which is currently a 32 bit FNV so is +directly usable. + +This identifier is adequate to ensure the collision rate for the +hashed URL values (a collision for every 2^16 URLs added) is +sufficiently low the overhead of returning the wrong object (which +backing stores are permitted to do) is not significant. + +An entry list is maintained which contains all the metadata about a +given identifier. This list is limited in length to constrain the +resources necessary to maintain it. It is made persistent to avoid the +overhead of reconstructing it at initialisation and to keep the data +used to improve the eviction decisions. + +Each object is stored and retrieved directly into the filesystem using +a filename generated from a RFC4648 base32 encoding of an address +value. The objects address is derived from the identifier by cropping +it to a shorter length. + +A mapping between the object address and its entry is maintained which +uses storage directly proportional to the size of the address length. + +The cropping length is stored in the control file with the default +values set at compile time. This allows existing backing stores to +continue operating with existing data independently of new default +setting. This setting gives some ability to tune the default cache +index size to values suitable for a specific host operating system. + +E.g. Linux based systems can easily cope with several megabytes of +mmaped index but RISC OS might want to limit this to a few megabytes +of heap at most. + +The files are stored on disc using their base32 address value. +By creating a directory for each character of the encoded filename +(except the last which is of course the leafname) we create a +directory structure where no directory has more than 32 entries. + +E.g. A 19bit address of 0x1 would be base32 encoded into AAAB +resulting in the data being stored in a file path of +"/store/prefix/d/B/A/A/BAAAAA". + +An address of 0x00040001 encodes to BAAB and a file path of +"/store/prefix/m/B/A/A/BAABAAA" + +Version 1.0 +----------- + +The version 1 layout was identical to the 1.1 except base64url +encoding was used, this proved problematic as some systems filesystems +were case insensitive so upper and lower case letters collided. + +There is no upgrade provision from the previous version simply delete +the cache directory. + +Control files +~~~~~~~~~~~~~ + +control ++++++++ +A control file is used to hold a list of values describing how the +other files in the backing store should be used. + +entries ++++++++ + +this file contains a table of entries describing the files held on the +filesystem. + +Each control file table entry is 28 bytes and consists of + + - signed 64 bit value for last use time + + - 32bit full url hash allowing for index reconstruction and + additional collision detection. Also the possibility of increasing + the ADDRESS_LENGTH although this would require renaming all the + existing files in the cache and is not currently implemented. + + - unsigned 32bit length for data + + - unsigned 32bit length for metadata + + - unsigned 16bit value for number of times used. + + - unsigned 16bit value for flags + + - unsigned 16bit value for data block index (unused) + + - unsigned 16bit value for metatdata block index (unused) + +Address to entry index +~~~~~~~~~~~~~~~~~~~~~~ + +An entry index is held in RAM that allows looking up the address to +map to an entry in the control file. + +The index is the only data structure whose size is directly dependant +on the length of the hash specifically: + +(2 ^ (ADDRESS_BITS - 3)) * ENTRY_BITS) in bytes + +where ADDRESS_BITS is how long the address is in bits and ENTRY_BITS +is how many entries the control file (and hence the while +cache) may hold. + +RISCOS values ++++++++++++++ + +By limiting the ENTRY_BITS size to 14 (16,384 entries) the entries +list is limited to 448kilobytes. + +The typical values for RISC OS would set ADDRESS_BITS to 18. This +spreads the entries over 262144 hash values which uses 512 kilobytes +for the index. Limiting the hash space like this reduces the +effectiveness of the cache. + +A small ADDRESS_LENGTH causes a collision (two URLs with the same +address) to happen roughly for every 2 ^ (ADDRESS_BITS / 2) = 2 ^ 9 = +512 objects stored. This roughly translates to a cache miss due to +collision every ten pages navigated to. + +Larger systems +++++++++++++++ + +In general ENTRY_BITS set to 16 as this limits the store to 65536 +objects which given the average size of an object at 8 kilobytes +yields half a gigabyte of disc used which is judged to be sufficient. + +For larger systems e.g. those using GTK frontend we would most likely +select ADDRESS_BITS as 22 resulting in a collision every 2048 objects +but the index using some 8 Megabytes + +Typical values +-------------- + +Example 1 +~~~~~~~~~ + +For a store with 1034 objects generated from a random navigation of +pages linked from the about:welcome page. + +Metadata total size is 593608 bytes an average of 574 bytes. The +majority of the storage is used to hold the URLs and headers. + +Data total size is 9180475 bytes a mean of 8879 bytes 1648726 in the +largest 10 entries which if excluded gives 7355 bytes average size + +Example 2 +~~~~~~~~~ + +355 pages navigated in 80 minutes from about:welcome page and a +handful of additional sites (google image search and reddit) + +2018 objects in cache at quit. 400 objects from news.bbc.co.uk alone + +Metadata total 987,439 bytes mean of 489 bytes + +data total 33,127,831 bytes mean of 16,416 bytes + +with one single 5,000,811 byte gif + +data totals without gif is 28,127,020 mean 13,945 + +[1] http://tools.ietf.org/html/rfc2616#section-13
\ No newline at end of file |