Documentary model of the web

Benjamin Ooghe-Tabanou edited this page Dec 21, 2012 · 1 revision

Our documentary model relies on the following assumptions. These are not "truths"; they are how we define the web as a system, given our technical choices. They define our epistemology and our technical reasoning.

  1. Fundamental elements of the web
    1. The web is a space where contents are associated with URLs.
    2. A content is determined by a URL, a date (instant) and a context. Given these three elements, only one content is possible.
    3. We ignore the context, which is for example the POST data associated with a query, the IP address and so on. We act as if a URL and a date always determine a single content.
    4. The content associated with a URL is a document. Contents can be more software than documents, but we still treat them as documents.
    5. Pages are the contents associated with URLs, ignoring the date.
    6. The same page has different states at different dates. This aspect is governed by the concept of Harvests.
  2. Reordering the URLs
    1. The parts of a URL are not written in hierarchical order. Example: the parts of "google.com/images" should be sorted from the most generic to the most specific: com > google > images
    2. The components of URLs are Stems. These are the bricks that build URLs.
    3. It is possible to sort the stems from the most generic to the most specific. This is not perfect, because some elements do not fit any hierarchy (typically the GET parameters), but we ignore this issue.
    4. Tokenization is the process that sorts the stems. It is a non-bijective projection of the space of URLs onto another, similar space: the space of Reverse URLs.
    5. To us, the web is a space where Pages are associated with Reverse URLs (but not the other way around, as we will see below).
  3. Documentary structure of the web
    1. The space of Reverse URLs has the natural structure of a tree.
    2. Pages are placed in a hierarchy of LRUs (Reverse URLs). This hierarchy is given by the space of Reverse URLs, and the hierarchical relation is prefixing. Example: "www.skyrock.com/blog/" is the URL of a page, of LRU "l:http|h:com|h:skyrock|h:www|p:blog|". This LRU is "contained by" (that is, prefixed by) a more generic LRU: "l:http|h:com|h:skyrock|h:www" (www.skyrock.com). This one is prefixed by "l:http|h:com|h:skyrock", then by "l:http|h:com" (all the domains that end in ".com"), then "l:http", then "", the empty LRU, that is "the web".
    3. Some LRUs are related to a page, while others are not. For example, the "l:http|h:com" LRU is useful to get all the ".com" websites, but has no related page. When we use an LRU as a prefix to get the LRUs it "contains" (in the hierarchical tree of Reverse URLs), we may call it an LRU prefix.
  4. Semiotics of the web structures
    1. The domain level is not a special level to us.
    2. The "website" is a notion, but not a technically relevant concept.
    3. Some LRUs may have a particular meaning for a user. This may fit the notion of website, but not necessarily. We give users the possibility to declare their own relevant entities.
    4. Web entities are LRU prefixes declared relevant by the user. They are not given by the web, and can be edited or redefined by the user.
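
The tokenization, the prefix hierarchy and the web entities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the function names are hypothetical, a URL without a scheme is assumed to be http, the "q:" stem for GET parameters is our own invention (the examples above only show "l:", "h:" and "p:"), the trailing "|" of the first example is omitted, and resolving a page to the most specific declared entity is an assumed rule.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Tokenize a URL into stems, ordered from most generic to most specific."""
    # Assumption: a URL written without a scheme is treated as http.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    stems = ["l:" + parts.scheme]
    # Host labels are reversed: www.skyrock.com -> com, skyrock, www.
    host = parts.hostname or ""
    stems += ["h:" + label for label in reversed(host.split(".")) if label]
    # Path segments already run from generic to specific.
    stems += ["p:" + segment for segment in parts.path.split("/") if segment]
    # GET parameters fit no hierarchy; this sketch keeps them as one final
    # stem with a hypothetical "q:" marker.
    if parts.query:
        stems.append("q:" + parts.query)
    return stems


def to_lru(stems):
    """Join stems into an LRU string."""
    return "|".join(stems)


def lru_prefixes(stems):
    """List every LRU that prefixes the page, down to "" (the web)."""
    return [to_lru(stems[:n]) for n in range(len(stems), -1, -1)]


def find_web_entity(stems, entities):
    """Return the name of the most specific declared entity, or None.

    `entities` maps LRU prefixes declared by the user to entity names.
    """
    for prefix in lru_prefixes(stems):
        if prefix in entities:
            return entities[prefix]
    return None


if __name__ == "__main__":
    # Hypothetical user declarations, for illustration only.
    entities = {
        "l:http|h:com|h:skyrock": "Skyrock",
        "l:http|h:com|h:skyrock|h:www|p:blog": "Skyrock blogs",
    }
    stems = url_to_stems("www.skyrock.com/blog/")
    print(to_lru(stems))
    print(lru_prefixes(stems))
    print(find_web_entity(stems, entities))
```

Prefixes are compared stem by stem rather than character by character, so that "h:sky" is never mistaken for a prefix of "h:skyrock". Note also that "www.skyrock.com/blog" and "www.skyrock.com/blog/" yield the same stems here, which is one way in which the projection is non-bijective.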