Functions and Interfaces group
Benjamin Ooghe-Tabanou edited this page Dec 21, 2012
·
1 revision
Tommaso, Andrews, Fabien, Alexis, Paolo, Daniele, Donato, Julien,
- Navicrawler next version interfaces
- cartographic exploration : content + topology
- Social sciences researchers
who want to use the web as a source of data for their research
- Librarians
who want to constitute a collection of websites, make it available to the library's users (students, teachers, researchers), make it harvestable by national or international repositories such as CERIMES (http://www.signets-universites.fr/) with the OAI-PMH protocol, and make it searchable through the library portal with the SRU protocol.
- Issue experts (or aspiring to be)
who want to confirm or discover the topology of the web discussion about their issues
- Set up of the corpus
- Manage corpora/projects
- Give the corpus a name
- Create a new corpus or choose to feed an existing corpus (if allowed)
- Choose who has the right to feed a corpus
- Define keywords and stopwords
- Define the entry points
- Scrape hyperlinks from any copied-pasted text (url harvester from DMI tools)
- Import existing (and previously exported) PEACE corpora
- (Connect to other existing archives) (not in the prototype)
- Define the granularity of the corpus [domains,]
- Define heuristics (BUT be aware of the methodological problems and define their granularity)
- Choose the default classification action (undefined/excluded/included)
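The "scrape hyperlinks from any copied-pasted text" step above could be sketched as follows. This is a minimal illustration, not the DMI tool's actual implementation: the function name `harvest_urls` and the regular expression are assumptions, and only plain `http`/`https` URLs are recognised.

```python
import re

# Assumption: a "hyperlink" is any http(s) URL appearing in the pasted text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def harvest_urls(text):
    """Return the distinct URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    pasted = "See http://www.signets-universites.fr/ and (https://example.org/a?b=1)."
    for url in harvest_urls(pasted):
        print(url)
```

The harvested URLs would then serve as candidate entry points for the corpus.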
- Management of the codebook
- Attribute a title to the codebook
- Import/export the codebook with notice
- Creating 'qualifications terms'
- Grouping qualification terms, deciding whether they are
- tag group (not exclusive inside the group)
- partitions (exclusive inside the group)
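The difference between tag groups and partitions can be illustrated with a small sketch. The class `TermGroup` and the function `validate_assignment` are hypothetical names introduced here for illustration; the only rule taken from the list above is that a partition allows at most one term per web entity, while a tag group allows several.

```python
from dataclasses import dataclass


@dataclass
class TermGroup:
    """A group of qualification terms in the codebook (hypothetical model)."""
    name: str
    terms: set
    exclusive: bool  # True = partition, False = tag group


def validate_assignment(group, assigned_terms):
    """Check the terms assigned to one web entity against one group."""
    assigned = set(assigned_terms)
    unknown = assigned - group.terms
    if unknown:
        raise ValueError(f"unknown terms for {group.name}: {sorted(unknown)}")
    if group.exclusive and len(assigned) > 1:
        raise ValueError(f"{group.name} is a partition: only one term allowed")
    return assigned


if __name__ == "__main__":
    topics = TermGroup("topics", {"energy", "health", "policy"}, exclusive=False)
    actor = TermGroup("actor type", {"ngo", "company", "state"}, exclusive=True)
    print(validate_assignment(topics, ["energy", "policy"]))  # several tags allowed
    print(validate_assignment(actor, ["ngo"]))                # exactly one value
```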
- Corpus
- Compact view
- url (ordered by number of incoming links from the included entities)
- filter the entities by
# status
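The ordering and filtering of the compact view might be computed as in the sketch below. The data layout (a dictionary of statuses and a list of `(source, target)` link pairs) and the function name `compact_view` are assumptions made for the example; the page does not specify how the server stores this information.

```python
def compact_view(statuses, links, status_filter=None):
    """Order web entities by incoming links from included entities.

    statuses: dict mapping entity url -> "included" / "excluded" / "undefined"
    links: iterable of (source, target) pairs between entities
    status_filter: if given, keep only entities with this status
    """
    incoming = {url: 0 for url in statuses}
    for source, target in set(links):  # count each distinct link once
        if source == target or target not in incoming:
            continue
        if statuses.get(source) == "included":
            incoming[target] += 1
    rows = [
        (url, count)
        for url, count in incoming.items()
        if status_filter is None or statuses[url] == status_filter
    ]
    # Most linked first; alphabetical order breaks ties.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


if __name__ == "__main__":
    statuses = {"a.org": "included", "b.org": "included", "c.org": "undefined"}
    links = [("a.org", "c.org"), ("b.org", "c.org"), ("a.org", "b.org")]
    for url, count in compact_view(statuses, links):
        print(count, url)
```

Sorting undefined entities this way puts the ones most cited by the current corpus at the top, which is where a user deciding what to include next would look first.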
- Full view (spread-sheet for viewing and editing the corpus)
- url (depending on the level of granularity)
- title (by default granularity)
- status (included/excluded/undefined)
- groups of tags or partitions
- incoming links
- outgoing links
- who added the web entity to the corpus
- all the graph indicators calculated by the server
- Assisted navigation
- Define the web entity status (included/excluded/undefined)
- Assign qualification terms to the web entity
- Assign a name to the web entity (the default is based on the level of granularity)
- Visual
- graph visualization and navigation
- Questions
- search / time
- difference in set-up / explore?
- Granularity, stems and Web entities
- E.g. google.com/images/pageA.html
- Blocks that constitute URLs are called stems
- E.g. {com, com.google, com.google.images}
- Web entities are typically nodes in graph of your corpus
- Graph of stemmed URLs (Reverse URLs, arranged from least to most specific part):
- d:com.d:google.p:images.p:pageA.html
- Advantages: 1) easy to redefine Web entity without having to recalculate everything 2) forces social scientists to think about websites
- Granularity: how far to go in URL?
- Web entities are relevant only for visualization and possibly selection (e.g. co-link)
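The stem notation above can be reproduced with a short sketch. The function names `url_to_stems`, `stem_prefixes` and `web_entity` are hypothetical, and the example handles only the domain (`d:`) and path (`p:`) stems shown in the example; other URL parts such as query strings are ignored.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Split a URL into stems, from least to most specific part."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    domain_stems = ["d:" + label for label in reversed(host.split(".")) if label]
    path_stems = ["p:" + segment for segment in parts.path.split("/") if segment]
    return domain_stems + path_stems


def stem_prefixes(stems):
    """All candidate web entities for a URL, one per level of granularity."""
    return [".".join(stems[:i]) for i in range(1, len(stems) + 1)]


def web_entity(url, depth):
    """The web entity a URL belongs to at a given granularity (number of stems)."""
    return ".".join(url_to_stems(url)[:depth])


if __name__ == "__main__":
    stems = url_to_stems("google.com/images/pageA.html")
    print(".".join(stems))
    print(stem_prefixes(stems))
    print(web_entity("google.com/images/pageA.html", 2))
```

Because the reversed URL is stored whole, changing the granularity only changes the `depth` at which the prefix is cut; nothing about the crawled pages needs to be recomputed.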
- User - UI - CVS for corpora (local or online?) - ... - ARC - WWW
- Assisted crawling
- starting point(s): URLs
- One iteration:
- Snowball Crawl
- Define Web entities (choose granularity and specificity) based on the reverse URL pattern.
- Limit the corpus (define boundaries and throw away rest of discovered URLs)
# manually (accept/reject)
# reverse URL scheme should also be used to define blacklist
# co-link
# etc
- GOTO One iteration
- Qualify
- Analyse
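The iterative part of the assisted crawl (snowball, limit the corpus, repeat) can be sketched as a loop. This is an illustration under stated assumptions, not the crawler's design: `fetch_links` stands in for the real page fetching, `accept` stands in for the boundary decision (manual choice, blacklist, co-link rule), and the toy link graph replaces the web so that the example runs offline.

```python
def assisted_crawl(entry_points, fetch_links, accept, max_iterations=3):
    """Grow a corpus from entry points, one snowball iteration at a time.

    fetch_links(url) -> iterable of URLs linked from url (assumed helper)
    accept(url) -> True to include the URL, False to throw it away
    """
    corpus = set(entry_points)
    rejected = set()
    frontier = set(entry_points)
    for _ in range(max_iterations):
        # Snowball crawl: follow links from the most recently added URLs.
        discovered = set()
        for url in frontier:
            discovered.update(fetch_links(url))
        candidates = discovered - corpus - rejected
        if not candidates:
            break  # nothing new was discovered
        # Limit the corpus: keep what passes the boundary, discard the rest.
        accepted = {url for url in candidates if accept(url)}
        rejected |= candidates - accepted
        corpus |= accepted
        frontier = accepted  # GOTO one iteration
    return corpus, rejected


if __name__ == "__main__":
    toy_web = {"a": ["b", "x"], "b": ["c"], "c": ["a"], "x": ["y"]}
    blacklist = {"x"}
    corpus, rejected = assisted_crawl(
        ["a"],
        fetch_links=lambda url: toy_web.get(url, []),
        accept=lambda url: url not in blacklist,
    )
    print(sorted(corpus), sorted(rejected))
```

Rejected URLs are never crawled further, so pages reachable only through them (here `y`) are never discovered.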
- Sharing corpora
- Integrating different corpora in a single repository
- Presentation / mapping
- Harvesting by OAI-PMH protocol
- Search by SRU protocol
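For reference, harvesting and searching requests under these two protocols are plain HTTP GET requests with standard parameters. The sketch below only builds the request URLs; the base URLs (`https://example.org/oai`, `https://example.org/sru`) and the function names are placeholders, since the page does not say where a corpus repository would be exposed.

```python
from urllib.parse import urlencode


def oai_pmh_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return base_url + "?" + urlencode(params)


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.2 searchRetrieve request URL with a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoints, for illustration only.
    print(oai_pmh_list_records_url("https://example.org/oai"))
    print(sru_search_url("https://example.org/sru", 'dc.title = "climate"'))
```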