First prototype basic user scenario

Benjamin Ooghe-Tabanou edited this page Dec 21, 2012 · 1 revision

Known limitations:

  1. only one corpus per server
  2. no security or authentication

Scenario

  • configuration of the corpus
  1. the admin sets the precision limit in core.settings.py
  2. the admin sets the default web entity creation rule in core.settings.py, which will be inserted into the memory structure
  • creating a corpus
  1. the user adds pages to the corpus (the system applies the default creation rule)
  2. the user can edit the web entities created from the inserted pages (including their aliases)
  • crawling and filtering
  1. the user asks to crawl some of the web entities created:
    1. the core then retrieves the pages of that web entity (WE) from the memory structure (MS) and passes them to the crawler as starting points
    2. the crawler stores the crawled pages in a queue
    3. the core consumes the queue and asks the memory structure to store the pages (through the cache system), fires web entity creations, filters links depending on precision exceptions...
  2. after a while, the user requests the content of a web entity to see the new pages discovered by the crawl
  3. after a while, the user refreshes the list of web entities
  4. the user creates new web entities based on the pages found by the crawler
  5. the user launches new crawl tasks
  • advanced use
  1. the user creates web entity creation rules
  2. the user sets precision limits on some specific pages
  • export corpus
  1. the user requests a GEXF file of the network of web entities
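
The crawling and filtering steps of the scenario can be sketched roughly as follows. This is only an illustration under stated assumptions, not the prototype's code: `MemoryStructure`, `entity_prefix`, `fake_crawl` and `consume` are hypothetical names, the crawler is replaced by a hard-coded link table, `example.org` is a placeholder host, and neither the cache system nor link filtering by precision is modeled.

```python
from collections import defaultdict
from queue import Queue, Empty
from urllib.parse import urlsplit


def entity_prefix(url, depth=3):
    """Hypothetical default rule: reversed host labels, cut at `depth` labels."""
    host = urlsplit(url).hostname or ""
    return "|".join(host.split(".")[::-1][:depth])


class MemoryStructure:
    """Hypothetical in-memory stand-in for the memory structure (MS)."""

    def __init__(self):
        self.pages_by_entity = defaultdict(set)

    def store_page(self, url):
        entity = entity_prefix(url)
        # A page under a prefix not seen before "fires" a web entity creation.
        created = entity not in self.pages_by_entity
        self.pages_by_entity[entity].add(url)
        return entity, created


def fake_crawl(start_urls, link_table, page_queue):
    """Stand-in crawler: queue each start page and its known out-links."""
    for url in start_urls:
        page_queue.put(url)
        for target in link_table.get(url, []):
            page_queue.put(target)


def consume(page_queue, ms):
    """Core side: drain the queue, store pages, collect new web entities."""
    created = []
    while True:
        try:
            url = page_queue.get_nowait()
        except Empty:
            break
        entity, is_new = ms.store_page(url)
        if is_new:
            created.append(entity)
    return created


# The user adds a page, then asks to crawl the resulting web entity.
ms = MemoryStructure()
ms.store_page("http://medialab.sciences-po.fr/")
link_table = {  # hypothetical crawl results
    "http://medialab.sciences-po.fr/": [
        "http://example.org/a",
        "http://medialab.sciences-po.fr/about",
    ]
}
page_queue = Queue()
start_pages = sorted(ms.pages_by_entity["fr|sciences-po|medialab"])
fake_crawl(start_pages, link_table, page_queue)
new_entities = consume(page_queue, ms)
```

Keeping the crawler and the core on opposite sides of a queue, as the scenario describes, means the core can store pages and create web entities at its own pace while the crawl is still running.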

Test data

  • corpus configuration
    • PRECISION_LIMIT: 4
    • web entity default rule: at first subdomain
  • starting points are:
  • web entities created at step 1:
    • SCIENCES PO: alias of fr|sciences-po|www & fr|sciencespo|www
    • MEDIALAB: fr|sciences-po|medialab
  • crawl tasks at step 1:
    • both web entities
  • display new web entities
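
To make the prefix notation in the test data concrete, here is a minimal sketch of how a "first subdomain" default rule could map a URL to a prefix such as fr|sciences-po|www. The function names and the `ENTITY_BY_PREFIX` table are hypothetical, the hostnames are reconstructed from the prefixes listed above (they are not the page's starting points), the TLD is assumed to be a single label, and the precision limit is not modeled.

```python
from urllib.parse import urlsplit


def host_stems(url):
    """Return the host labels in reversed order, e.g. ['fr', 'sciences-po', 'www']."""
    host = urlsplit(url).hostname or ""
    return host.split(".")[::-1]


def first_subdomain_prefix(url):
    """Hypothetical default rule: keep the TLD, the domain and the first subdomain."""
    return "|".join(host_stems(url)[:3])


# Web entities from the test data, keyed by prefix (both SCIENCES PO prefixes
# point to the same entity).
ENTITY_BY_PREFIX = {
    "fr|sciences-po|www": "SCIENCES PO",
    "fr|sciencespo|www": "SCIENCES PO",
    "fr|sciences-po|medialab": "MEDIALAB",
}


def web_entity_for(url):
    """Return the web entity name for a URL, or None if no prefix matches."""
    return ENTITY_BY_PREFIX.get(first_subdomain_prefix(url))
```

Under these assumptions, `web_entity_for("http://www.sciencespo.fr/")` and `web_entity_for("http://www.sciences-po.fr/")` both resolve to SCIENCES PO, while a page on the medialab subdomain resolves to MEDIALAB.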