Discussions around the project

The generalisation of the notion of link

  • Precisely as we generalize the notion of 'website' by using the notion of 'web entity', we should think about how we could generalize the notion of 'link' beyond the notion of 'hyperlink'. In particular, the new platforms of the social web provide forms of connection other than the HTML hyperlink. In Twitter, for example, just following hyperlinks (or even re-tweets and follower/followed relations) is not very informative: you have to consider the 'citation' (sometimes even the 'implicit citation').
  • Generalizing the notion of 'link' will also open the possibility of building corpora of linked documents other than the web, for example scientific literature.
  • Generalizing the notion of 'link' also forces us to blur the distinction between links and contents. Links are contents (in the sense that they provide information both on the linker and on the linked, beyond the fact that the two are linked). A possible data model is sketched after this list.
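
As a purely illustrative sketch (the names below are hypothetical and not part of the project), a generalized link could be modelled as a typed connection between web entities that carries its own content:

```python
# Minimal sketch of a generalized "link" that is no longer tied to HTML
# hyperlinks: every connection has a type and a payload, so the link
# itself is also a piece of content. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class WebEntity:
    identifier: str          # e.g. a set of URL prefixes, a Twitter account, a DOI
    kind: str = "website"    # "website", "twitter_account", "article", ...

@dataclass
class GenericLink:
    source: WebEntity
    target: WebEntity
    link_type: str           # "hyperlink", "retweet", "citation", "implicit_citation", ...
    content: dict = field(default_factory=dict)  # the link as content: anchor text, tweet text, context...

# A bibliographic citation fits the same model as a hyperlink or a retweet:
citation = GenericLink(
    WebEntity("10.1000/xyz123", kind="article"),
    WebEntity("10.1000/abc456", kind="article"),
    link_type="citation",
    content={"context": "sentence in which the reference appears"},
)
```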

Generic tool vs 'built-in methods' tool

  • It is very interesting to give users the possibility of defining what types of entities and what types of links they are interested in. Many tools already exist (especially on the commercial market) to do simple analyses of web corpora. This project can have a real added value if it is able to offer something more: the possibility of being adapted to the special (and specialized) needs of researchers.
  • However, this creates huge problems because:
    • Users who are not technically competent (or simply users who do not have much time to dedicate to the tool) may have trouble defining, formally enough, what interests them.
    • Not knowing the 'objects' you have to work with makes it much more difficult to develop the algorithms and tools to treat them.
    • A way to have the cake and eat it too is to provide a generic framework on top of which users are allowed to add their own plug-ins (see the sketch below). A simple 'built-in method' (as clear as possible) may be provided to users who want to start using the tool.
    • A possible way to build such methods into the tool is to base them on the conceptual and methodological tools of web studies. The problem with this approach is that web studies is still relatively young and pretty far from reaching consensus on notions and methods.
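
One way to picture the 'generic framework + plug-ins' idea (a sketch only; the class and method names below are hypothetical, not an actual API of the project):

```python
# A minimal sketch: plug-ins as user-defined classes that tell the generic
# framework what counts as an entity and what counts as a link.
from abc import ABC, abstractmethod
from typing import Optional


class CorpusPlugin(ABC):
    """A user-defined method: it decides which entities and links matter."""

    @abstractmethod
    def define_entity(self, page: dict) -> str:
        """Map a crawled page to the identifier of the web entity it belongs to."""

    @abstractmethod
    def define_link(self, source_page: dict, target_page: dict) -> Optional[str]:
        """Return a link type for this pair of pages, or None to ignore the link."""


class PlaygroundPlugin(CorpusPlugin):
    """The simple built-in method offered to users who just want to get started."""

    def define_entity(self, page: dict) -> str:
        return page["domain"]      # one web entity per domain

    def define_link(self, source_page: dict, target_page: dict) -> Optional[str]:
        return "hyperlink"         # keep every hyperlink as-is
```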

Automatic vs manual tool

  • The general problem is whether we should automate as much as possible of the painful and time-consuming operations, or encourage users to do them by hand. Automating can of course facilitate the use of the tool and make it more appealing to its users. However, the more we automate, the more we blackbox the tool and the less we encourage users to manually explore their corpora.
  • What is interesting about the Navicrawler is that it limits automatic crawling, obliging researchers to reflect on their corpora and navigate them. But the Navicrawler also assists your navigation, by giving you contextual information on the 'place' of the web where you are.
  • Adding too many automatic functions may also have the disadvantage of encouraging users to create overly large corpora (which may in turn create problems in the analysis and archival phases).
  • It is important to help users understand what a web corpus is. If they do not understand that, they may do a worse job than an automatic crawler.

Possible functions of the tool

  • The baseline function of the tool is constituting corpora of linked entities. All the other possible functions of the tool (such as analysing and archiving) sit on top of this one.
  • The two main existing sets of techniques for analysing the web are network analysis and textual analysis. These two sets of techniques are relatively mature; what would be interesting to try out is mixing the two. Mixing topology and content analysis is one of the main objectives of this project.
  • Two of the main functions that the tool is meant to provide - the constitution of corpora and the interpretation of corpora - are very difficult to combine. They are difficult to combine not just for technical reasons but because they require very different skills from the users. Often in research projects (academic or not) there are actually two different persons doing the two jobs: web-miners and web-analysts - a specialist of the web and a specialist of the topic.
  • Can we get rid of the web specialist and implement his/her competences in the tool itself? Maybe we can do that for the simple 'built-in playground plug-in', certainly not for the tool as a general framework.
  • It is very important to provide both functions and to allow passing easily from one to the other. If crawling can be partially automated, qualification is probably something that should remain hand-made.
  • It makes no sense to strive to implement all possible analyses in our tool. There already exist many great tools for doing all sorts of analyses. Our strategy should be to make sure that the tool is able to export data to as many of these tools as possible (see the sketch after this list).
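
For instance (a sketch only, assuming the corpus is held as an attributed graph; the data below is invented), exporting to a standard graph format such as GEXF keeps both the topology and the content available to external tools:

```python
# Sketch of the export strategy: web entities as nodes, their extracted
# text as node attributes, links as typed edges. networkx and its GEXF
# export are real; the corpus data here is made up.
import networkx as nx

corpus = nx.DiGraph()
corpus.add_node("site-a.org", text="full text extracted from the entity's pages")
corpus.add_node("site-b.org", text="another entity's aggregated text")
corpus.add_edge("site-a.org", "site-b.org", link_type="hyperlink", weight=3)

# GEXF keeps both topology and node/edge attributes, so the same file can
# feed a network tool (e.g. Gephi) or be re-read for textual analysis.
nx.write_gexf(corpus, "corpus.gexf")
```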

Archiving the web

  • Archiving is also an important plus of this project. It is very important to consider the question of archiving not just from the point of view of what is technically possible and how, but also from the point of view of the users.
  • Though being a real plus, archiving raises many major difficulties:
    • Legal issues connected to the privacy and copyright of the websites
    • The size of the data sets starts exploding as soon as you start archiving them over time (see the back-of-the-envelope sketch after this list).
  • The tool we are building is meant to handle relatively small corpora (1,000 web entities? fewer?). We prefer giving up the breadth of the analysis to ensure its depth.
  • The problem with focusing on small corpora is that you lose the context of the web at the time you gathered your corpus. There are at least a couple of possible solutions to this problem:
    • It would be interesting to explore the possibility of taking advantage of some of the analyses that other online tools (or online initiatives) provide. What is the Google rank of a website? What is its Alexa rank? ... Being able to retrieve a few of these indicators may help to store some information on the context of our corpus as well as on its texts.
    • Another possibility is to explore 'connecting' the tool to the large web archival enterprises (especially the ones run by national libraries).
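
As a rough illustration of the size problem (every figure below is an arbitrary assumption, not a project estimate):

```python
# Back-of-the-envelope sketch of why archive size explodes over time.
entities = 1_000          # corpus size discussed above
pages_per_entity = 100    # assumed average
avg_page_size_kb = 50     # assumed average
snapshots_per_year = 52   # assumed weekly archiving

total_gb = entities * pages_per_entity * avg_page_size_kb * snapshots_per_year / 1_000_000
print(f"~{total_gb:.0f} GB per year for a single versioned corpus")  # ~260 GB
```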

Evaluating the tool

  • Two possible criteria of evaluation of the tool can be:
    • if you cannot play with it, it is not a good tool.
    • if you cannot do serious stuff with it, it is not a good tool.
  • The first criterion is meant to evaluate the 'playground' built-in plug-in. The second criterion is meant to evaluate the tool as a general framework for web research.