Web corpus

jrault edited this page Dec 21, 2012 · 1 revision

A web corpus is a set of web pages aggregated at a chosen level of depth in the URL structure. This level is called the granularity of the corpus.

The content of the web pages and the links between them are harvested and recorded.
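
As a rough illustration of what harvesting and recording a page could look like, the sketch below stores a page's raw content together with its outgoing links, resolved to absolute URLs. It uses only the Python standard library. The names `LinkCollector` and `record_page` and the shape of the returned record are assumptions made for this example, not part of any tool described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links so every recorded link is absolute.
                self.links.append(urljoin(self.base_url, value))


def record_page(url, html):
    """Return a hypothetical record holding a page's content and its links."""
    collector = LinkCollector(url)
    collector.feed(html)
    return {"url": url, "content": html, "links": collector.links}


page = record_page(
    "http://example.org/a/",
    '<p><a href="b">B</a> <a href="http://other.example.net/">O</a></p>',
)
print(page["links"])
```

A real harvester would also fetch the pages, handle encodings and follow the links it finds; this sketch covers only the recording step.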

Those web pages are grouped into Web entities.
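
One possible reading of granularity is sketched below: each page URL is truncated to its host plus the first `granularity` path segments, and pages sharing the same truncated prefix fall into the same web entity. This truncation rule and the helper names `entity_key` and `group_pages` are assumptions for illustration only; the actual grouping rule may differ.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def entity_key(url, granularity):
    """Truncate a URL to its host plus the first `granularity` path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    prefix = "/".join(segments[:granularity])
    host = parts.netloc.lower()
    return f"{host}/{prefix}" if prefix else host


def group_pages(urls, granularity):
    """Group page URLs into entities keyed by their truncated prefix."""
    entities = defaultdict(list)
    for url in urls:
        entities[entity_key(url, granularity)].append(url)
    return dict(entities)


pages = [
    "http://example.org/blog/2012/post-1",
    "http://example.org/blog/2012/post-2",
    "http://example.org/about",
    "http://other.example.net/",
]

for key, members in group_pages(pages, 1).items():
    print(key, len(members))
```

With a granularity of 0 in this sketch, every page of a host collapses into a single entity; higher values split a site into finer-grained entities.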

The time dimension of a web corpus is also preserved through archiving methods.