
Precision limit


To allow the storage of links for relatively big web corpora (~1000 web entities) while keeping good performance when working with the index, it was decided to set a Precision Limit on each corpus.

The Precision Limit sets the depth in the Reverse_URLs (LRUs) chain beyond which links will not be stored in the memory structure.
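As an illustration, here is a minimal Python sketch of such a depth check on pipe-separated LRU stems; the names (`PRECISION_LIMIT`, `lru_depth`, `is_stored`) are hypothetical and do not come from the actual memory structure API:

```python
PRECISION_LIMIT = 3  # hypothetical value, matching the example below

def lru_depth(lru):
    """Number of stems in a pipe-separated LRU such as 'fr | sciences-po | medialab'."""
    return len([stem for stem in lru.split("|") if stem.strip()])

def is_stored(source_lru, target_lru, limit=PRECISION_LIMIT):
    """A link is kept in the links index only if both of its endpoints
    stay within the precision limit."""
    return lru_depth(source_lru) <= limit and lru_depth(target_lru) <= limit
```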

Note that this limitation drops information only from the links index, not from the LRUs index. The crawler will still follow those links (see below). Moreover, the information is not truly lost, since the full content of each page is stored at the raw data level.

For example, a Precision Limit of 3 implies that a link between:

fr | sciences-po | medialab | contact.html -> fr | sciences-po | recherche | news.html

will not be stored.

However, the two pages fr | sciences-po | medialab | contact.html and fr | sciences-po | recherche | news.html will still exist in the page index.
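Using the hypothetical helpers sketched above, this example plays out as follows:

```python
src = "fr | sciences-po | medialab | contact.html"
tgt = "fr | sciences-po | recherche | news.html"

# Both endpoints are at depth 4, beyond the limit of 3:
# the link is dropped from the links index...
assert not is_stored(src, tgt, limit=3)

# ...but both pages are still recorded in the page index,
# and their full content remains available in the raw data storage.
```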


FULL PRECISION exception

However, the user might locally want to define a web entity at a level beyond the precision limit. To do so, the user can set a FULL PRECISION flag on an LRU, asking the system to make an exception to the Precision Limit. This allows the best level of precision in the specific cases where it is needed, while keeping the optimisation rule everywhere else.

In other words, this exception allows the user to create what has been called, at some point in the project, a page-level web entity.
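A minimal sketch of how such an exception could be honoured when deciding whether to store a link; `FULL_PRECISION_LRUS` and the prefix check are assumptions for illustration, building on the helpers above:

```python
# Hypothetical set of LRU prefixes flagged FULL PRECISION by the user.
FULL_PRECISION_LRUS = {"fr | sciences-po | medialab"}

def has_full_precision(lru):
    """True if the LRU falls under a prefix flagged FULL PRECISION."""
    return any(lru.startswith(prefix) for prefix in FULL_PRECISION_LRUS)

def is_stored_with_exceptions(source_lru, target_lru, limit=PRECISION_LIMIT):
    """Keep the link if each endpoint is within the precision limit
    or covered by a FULL PRECISION exception."""
    return all(
        lru_depth(lru) <= limit or has_full_precision(lru)
        for lru in (source_lru, target_lru)
    )
```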

Consequences on web entities

Thus it is only possible to attach a web entity to a LRU_prefix longer than the PRECISION_LIMIT by setting a FULL_PRECISION exception.
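For example, a creation-time check could look like this sketch (assumed names, not the project's actual validation code):

```python
def can_define_web_entity(lru_prefix, limit=PRECISION_LIMIT):
    """A web entity prefix longer than the precision limit is only
    valid if a FULL_PRECISION exception covers it."""
    return lru_depth(lru_prefix) <= limit or has_full_precision(lru_prefix)
```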

Reconstruct links

If, at some point, we want to retrieve link information that has been forgotten by the memory structure, it will be possible to scan the raw data level storage to reconstruct the linkage information.
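A rough sketch of such a reconstruction pass, assuming a hypothetical raw-data store that yields each archived page together with the links extracted from its stored content:

```python
def reconstruct_links(raw_pages, limit=PRECISION_LIMIT):
    """Re-scan raw page content to recover page-level links that the
    memory structure dropped because they lie beyond the precision limit.

    `raw_pages` is assumed to yield (page_lru, [target_lru, ...]) pairs
    extracted from the stored page content.
    """
    recovered = []
    for page_lru, targets in raw_pages:
        for target_lru in targets:
            if not is_stored(page_lru, target_lru, limit):
                recovered.append((page_lru, target_lru))
    return recovered
```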

Precision limit versus crawl depth

Although the precision limit prevents some links from being stored in the memory structure, this threshold is unrelated to the crawl depth limit. Thus pages that cannot be linked inside the memory structure because they lie beyond the Precision Limit will still be crawled, as long as they are within the crawl depth.
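To make the distinction concrete, here is a sketch of the two independent checks; the function names, `CRAWL_DEPTH` value, and depth conventions are assumptions:

```python
CRAWL_DEPTH = 2  # hypothetical: number of click-levels followed from a start page

def should_crawl(distance_from_start, depth=CRAWL_DEPTH):
    """The crawler follows a link based on click distance only."""
    return distance_from_start <= depth

def should_index_link(source_lru, target_lru, limit=PRECISION_LIMIT):
    """The links index keeps a link based on LRU depth only."""
    return is_stored(source_lru, target_lru, limit)

# A page at LRU depth 5 (beyond the precision limit) reached after one
# click (within the crawl depth) is still crawled, but page-level links
# pointing to it are not stored in the memory structure.
```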
