First prototype basic user scenario

Benjamin Ooghe-Tabanou edited this page Dec 21, 2012 · 1 revision

Known limitations:

only one corpus per server
no security or authentication

Scenario

configuration of the corpus

the admin sets the precision limit in core.settings.py
the admin sets the default web entity creation rule in core.settings.py; this rule is then inserted into the memory structure
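
As an illustration only, the two settings could look like the fragment below. PRECISION_LIMIT is the name used in the test data at the end of this page; DEFAULT_WEBENTITY_CREATION_RULE and the way the rule is encoded are assumptions, not the actual content of core.settings.py.

```python
# Hypothetical sketch of the two values the admin sets in core.settings.py.

# Name and value taken from the "Test data" section of this page.
PRECISION_LIMIT = 4

# Assumed name and encoding for the default rule "at first subdomain";
# the real setting may be expressed differently (for example as a regexp).
DEFAULT_WEBENTITY_CREATION_RULE = "first_subdomain"
```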

creating a corpus

the user adds pages to the corpus (the system applies the default creation rule)
the user can modify the web entities created from the inserted pages (including their aliases)

crawling and filtering

the user asks to crawl some of the web entities created:
1. the core retrieves the pages of that web entity (WE) from the memory structure (MS) and passes them as starting points to the crawler
2. the crawler stores the crawled pages in a queue
3. the core consumes the queue and asks the memory structure to store the pages (through the cache system), triggers web entity creation, filters links depending on precision exceptions...
after a while, the user asks for the content of a web entity to see the new pages discovered by the crawl
after a while, the user refreshes the list of web entities
the user creates new web entities based on the pages found by the crawler
the user launches new crawl tasks
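
The three numbered steps above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the memory structure is a plain dict, the queue is a deque, and the function names, the fetch_links callback and the page URLs are hypothetical. The sketch also skips the cache system, web entity creation and precision filtering, and simply attaches every discovered link to the crawled web entity.

```python
from collections import deque


def start_crawl(memory_structure, entity, page_queue, fetch_links):
    """Steps 1 and 2: read the entity's pages from the MS, crawl them, queue the results."""
    for url in sorted(memory_structure[entity]):
        page_queue.append((url, fetch_links(url)))


def consume_queue(memory_structure, entity, page_queue):
    """Step 3: drain the queue and store each page and its links in the MS.

    Returns the set of pages that were not known before the crawl.
    """
    discovered = set()
    while page_queue:
        url, links = page_queue.popleft()
        for page in [url, *links]:
            if page not in memory_structure[entity]:
                memory_structure[entity].add(page)
                discovered.add(page)
    return discovered


# Hypothetical data: one known page and two links found on it.
memory_structure = {"MEDIALAB": {"medialab.sciences-po.fr"}}
fake_web = {
    "medialab.sciences-po.fr": [
        "medialab.sciences-po.fr/page-a",
        "medialab.sciences-po.fr/page-b",
    ]
}

page_queue = deque()
start_crawl(memory_structure, "MEDIALAB", page_queue, lambda url: fake_web.get(url, []))
new_pages = consume_queue(memory_structure, "MEDIALAB", page_queue)
```

After the queue is drained, new_pages holds what the user would see when asking for the content of the web entity again.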

advanced use

the user creates web entity creation rules
the user sets precision limits on some specific pages

export corpus

the user requests a GEXF file of the network of web entities
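
A minimal sketch of such an export, assuming the network is available as a dict of web entities and a list of directed links between them. The build_gexf function, the node ids and the example link are hypothetical; only the GEXF 1.2 element and attribute names come from the format itself.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(entities, links):
    """Return a GEXF 1.2 document as a string.

    entities: dict mapping a node id (str) to a label (str)
    links: iterable of (source_id, target_id) pairs
    """
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static", "defaultedgetype": "directed"})
    nodes = ET.SubElement(graph, "nodes")
    for node_id, label in entities.items():
        ET.SubElement(nodes, "node", {"id": node_id, "label": label})
    edges = ET.SubElement(graph, "edges")
    for index, (source, target) in enumerate(links):
        ET.SubElement(edges, "edge", {"id": str(index), "source": source, "target": target})
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical network: the two test web entities and one assumed link.
document = build_gexf(
    {"we1": "SCIENCES PO", "we2": "MEDIALAB"},
    [("we2", "we1")],
)
```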

Test data

corpus configuration
- PRECISION_LIMIT: 4
- web entity default rule: at first subdomain
starting points are:
- www.sciences-po.fr
- www.sciencespo.fr
- medialab.sciences-po.fr
web entities created at step 1:
- SCIENCES PO: alias of fr|sciences-po|www & fr|sciencespo|www
- MEDIALAB: fr|sciences-po|medialab
crawl tasks at step 1:

both web entities

display new web entities
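
The web entity prefixes in the test data are the host names of the starting points with their labels reversed and joined by "|". The sketch below reproduces them; host_to_prefix and build_web_entities are hypothetical helper names, and the grouping of the two Sciences Po hosts under one entity is entered by hand here, as the user would do when declaring the alias.

```python
def host_to_prefix(host):
    """Reverse the labels of a host name and join them with '|'."""
    return "|".join(reversed(host.lower().split(".")))


def build_web_entities(groups):
    """Map each web entity name to the prefixes of the hosts grouped under it."""
    return {name: [host_to_prefix(host) for host in hosts] for name, hosts in groups.items()}


# Grouping taken from the test data: two hosts are aliases of SCIENCES PO.
web_entities = build_web_entities(
    {
        "SCIENCES PO": ["www.sciences-po.fr", "www.sciencespo.fr"],
        "MEDIALAB": ["medialab.sciences-po.fr"],
    }
)
```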