
Google Search Results (Tutorial)


Goal: We study how the results of Google queries related to climate change are hyperlinked.

Date: 2018-06-06

Setup and prerequisites

We create a new corpus and will shortly launch a crawl. If you do not know how to do this, these tasks are covered in reference tutorial 1, “Website Structure”.

Querying Google

We browse to Google.com and type the query “climate change”.


We copy-paste the URL of the results page and start a new crawl with it.

For reference, the URL used for this tutorial is the following:

https://www.google.com/search?source=hp&ei=ynUNW5HQKoTWU4C5j4gJ&q=climate+change&oq=climate+change&gs_l=psy-ab.3..0i131k1j0l9.944.2676.0.2869.14.14.0.0.0.0.56.539.13.13.0..2..0...1.1.64.psy-ab..1.13.537....0.TsLD8A1PwvQ

Note 1: it is important to define the web entity at the level of the results page. The rationale for picking the results page as a separate entity is that it means something to us: every time a page or a set of pages plays a special role in the protocol, it should be isolated as a distinct web entity.
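To make this concrete, here is a minimal Python sketch (not Hyphe’s actual API; all names are made up) of what it means for a page to belong to a web entity defined at the results-page level rather than at the domain level:

```python
# Minimal sketch: a web entity modeled as a URL prefix, plus a membership test.

RESULTS_PAGE_PREFIX = "https://www.google.com/search"  # our dedicated results-page entity
DOMAIN_PREFIX = "https://www.google.com/"              # the generic Google entity

def belongs_to(url: str, prefix: str) -> bool:
    """A page belongs to a web entity when its URL starts with the entity's prefix."""
    return url.startswith(prefix)

page = "https://www.google.com/search?q=climate+change"
print(belongs_to(page, RESULTS_PAGE_PREFIX))  # True: inside the results-page entity
print(belongs_to(page, DOMAIN_PREFIX))        # True as well: both prefixes match
```

When several prefixes match, as here, Hyphe resolves the overlap in favor of the most specific matching prefix, which is what allows a dedicated results-page entity to live inside the larger Google domain.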


Note 2: do not crawl Google as a whole. Some websites do not like to be crawled, and doing so may lead to issues such as having your access shut down; you surely do not want Google to block you. If for some reason you still want to crawl Google or another sensitive website, always use a depth of 0 (or at most 1) so that the number of crawled pages stays strictly limited. In any case, crawling Google as a whole seems a bad idea both technically and methodologically.

Monitoring the outcome of a crawl

List of web entities

We click on WEB ENTITIES in the left menu. This view displays the web entities of the corpus as a list but, as we can see, there is only one right now (the Google results page). This is due to the status of the web entities: those linked from the results page are known but not crawled, so their status is discovered. This status is hidden by default, and we need to enable it in order to display these web entities.


We just check the DISCOVERED box and click APPLY CHANGES to make them appear.

Note: we will see how the statuses of web entities work very shortly.


There are 15 web entities in the list. One of them is the crawled Google page, and 10 are the results. The 4 others are Google-related domains: Google.com, Google.fr, Googleusercontent.com and Youtube.com.

Crawling spawns new discovered web entities. During the crawl, Hyphe found hyperlinks pointing at various other pages. These pages do not necessarily belong to a preexisting web entity; in our situation, only one web entity was already defined (the Google results page for “climate change”). For those new pages, new web entities are declared: in Hyphe, a web page cannot be indexed without being associated with one, and only one, web entity. Of course, we did not take a decision about the boundaries of these web entities; Hyphe just applied a default rule. These rules are configurable, so the result may vary depending on the settings of the corpus. The newly created web entities have the discovered status, which means that they have been found by Hyphe. We can redefine the boundaries of these entities if they do not match our expectations.
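For the curious, here is a rough sketch of the kind of default rule involved. Hyphe internally rewrites URLs as LRUs (hierarchical lists of scheme, host and path stems); the simplified code below illustrates such a rewriting and a hypothetical default boundary cut at the hostname, but the real, configurable rules are more subtle.

```python
# Simplified illustration of URL -> LRU rewriting and a default entity boundary.
from urllib.parse import urlparse

def url_to_lru(url: str) -> str:
    """Rewrite a URL as a hierarchy: scheme, then host parts reversed, then path."""
    parts = urlparse(url)
    stems = [f"s:{parts.scheme}"]
    stems += [f"h:{h}" for h in parts.hostname.split(".")[::-1]]  # www.nasa.gov -> gov, nasa, www
    stems += [f"p:{p}" for p in parts.path.split("/") if p]
    return "|".join(stems)

def default_entity_prefix(url: str) -> str:
    """Hypothetical default rule: cut the LRU right after the hostname."""
    parts = urlparse(url)
    return "|".join([f"s:{parts.scheme}"] + [f"h:{h}" for h in parts.hostname.split(".")[::-1]])

print(url_to_lru("https://www.nasa.gov/climate/evidence"))
# s:https|h:gov|h:nasa|h:www|p:climate|p:evidence
print(default_entity_prefix("https://www.nasa.gov/climate/evidence"))
# s:https|h:gov|h:nasa|h:www
```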

Network of web entities

The network view lets us see the same thing, with the addition of links, which makes the crawl process clearer. We click on NETWORK in the left menu and, as previously, we need to enable the discovered web entities to see them.


Note: we tuned the visualization with the side bar on the right to make the nodes bigger.

The dots (or “nodes”) represent the web entities, and the lines (or “edges” or “links”) represent one or more hyperlinks between pages contained in these web entities. The links are directed, but their direction is not displayed because the view would become too cluttered; the exported data does include the orientation. The color and size of the nodes are specified in the right side bar: by default, the color indicates the status and the size indicates the number of inbound links, also called “indegree”. In other terms, the indegree of an entity is the number of other entities in the corpus that cite it.
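If you export the network, this indegree reading can be reproduced outside Hyphe. A small sketch with the networkx library (the edge list below is invented for illustration):

```python
# Computing indegree on a toy directed network of web entities.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("google-results", "wikipedia.org"),
    ("google-results", "nasa.gov"),
    ("wikipedia.org", "nasa.gov"),
])

# indegree = number of entities in the corpus citing a given entity
for entity, indeg in g.in_degree():
    print(entity, indeg)
# nasa.gov has indegree 2: cited by the results page and by Wikipedia
```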

But wait, what? This is not what we should have! The links should come from the only crawled entity, but instead they come from a discovered entity that is not even crawled! This situation is caused by a redirection. When Hyphe tried to crawl the Google results page, it was redirected to a different URL that was outside the bounds of the web entity. The page that was crawled as a result ended up belonging to the generic Google entity, and not to the specific results page that we originally targeted.


This situation comes up pretty often with Hyphe. For various reasons, Hyphe does not see exactly what we see in our everyday browsing. Just as the conditions of browsing are not exactly the same for two users (cookies, IP location, user agent…), Hyphe’s conditions of harvesting may differ from ours. The redirection was caused by this phenomenon. The web is a mess, and crawling it is not always as clean as we would like.

As we will see, this anomaly has no practical consequences (skip this paragraph if you are not interested in the details). In this situation, Hyphe followed the redirection and, as a result, the downloaded page did not end up belonging to the web entity that was crawled. A web entity being marked as “crawled” or “not crawled” reflects whether we ran a crawl job on it, not whether the pages inside were actually harvested. Conversely, a page marked as “crawled” was indeed downloaded, whatever the status of the web entity containing it. The hyperlinks reflect information as it was effectively downloaded, not as it was intended to be, whereas the attributes of web entities (crawled, status…) reflect our intentions, insofar as they record what we commanded Hyphe to do. Also note that redirections do not appear as links in Hyphe (as of this writing, June 2018), though this possibility deserves to be considered.
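If you want to check for yourself how a crawler can be redirected away from the URL it requested, here is a quick sketch using the Python requests library (results will vary with your network conditions, and Google may behave differently today):

```python
# Inspecting the redirection chain a crawler would follow for a given URL.
import requests

resp = requests.get("https://www.google.com/search?q=climate+change",
                    allow_redirects=True, timeout=10)
for hop in resp.history:          # every intermediate 3xx response
    print(hop.status_code, hop.url)
print("final:", resp.url)         # the final URL may fall outside the entity's prefix
```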

About the status of web entities

We can see the state of the corpus by clicking on OVERVIEW in the left menu. The first row is dedicated to Hyphe’s current activity, while the second row is dedicated to the state of the corpus. It features 4 blocks named IN, UNDECIDED, OUT and DISCOVERED: these are the four statuses used in Hyphe.


  1. IN: web entities that we accepted into our corpus
  2. UNDECIDED: web entities that we hesitate to accept into the corpus
  3. OUT: web entities that we refused
  4. DISCOVERED: web entities found by Hyphe that we have not looked at yet

The IN entities are usually crawled. This is not technically mandatory, but it is important for methodological reasons. The OUT entities do not need to be crawled, because we are not interested in knowing whom they cite; but they are not deleted, because we are interested in knowing how cited they are. UNDECIDED is just a wildcard that we use in situations where we cannot decide whether we want the entity in the corpus or not. The DISCOVERED entities are never crawled, because crawling a web entity sets it to IN. You can have an uncrawled IN entity, but a crawled entity will always be IN, UNDECIDED or OUT. Hyphe allows you to set a web entity’s status, but not to DISCOVERED, which is reserved for the entities found during the crawl.
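The rules above can be summarized in a few lines of Python (a hedged sketch of the behavior described in this tutorial, not Hyphe’s source code):

```python
# Sketch of the status rules: crawling promotes to IN; DISCOVERED is machine-only.
VALID_STATUSES = {"IN", "UNDECIDED", "OUT", "DISCOVERED"}

def crawl(entity: dict) -> None:
    """Crawling an entity automatically promotes it to IN."""
    entity["crawled"] = True
    entity["status"] = "IN"

def set_status(entity: dict, status: str) -> None:
    """Users may set any status except DISCOVERED, which only Hyphe assigns."""
    if status == "DISCOVERED":
        raise ValueError("DISCOVERED is reserved for entities found during crawls")
    entity["status"] = status

e = {"name": "nasa.gov", "status": "DISCOVERED", "crawled": False}
crawl(e)              # e["status"] is now "IN"
set_status(e, "OUT")  # allowed: a crawled entity can later be moved to OUT
```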

These statuses (and their limitations) have a straightforward application to iterative corpus curation, but this is not the protocol we are following right now.

Crawling all the discovered entities

We click on WEB ENTITIES in the left menu and we set the list so that it displays only the discovered entities. We check all the web entities by using the check box in the table header.


Different operations can be applied to the 14 selected web entities, as displayed in the right part of the screen:

  • Change their status
  • Crawl them
  • Merge them

We click on CRAWL so that we can schedule new crawl jobs. Crawling an entity will automatically set it to IN. We will use a depth of 0 clicks: it would be time consuming and counterproductive to actually crawl such huge websites as Google, Wikipedia, Blogger, NASA or the Guardian.


Note: we did not specify start pages for these web entities. In such a situation, Hyphe tries to find some by using the prefixes and by looking at pages already known because links to them were found earlier (it usually takes the 5 most linked ones). This is indicated as “auto start pages” and does not require us to check anything, but note that the crawl may fail if the guessed start pages are invalid. If this happens, we will be informed and will have the possibility to fix it and recrawl.
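As a rough illustration of this heuristic (our reading of it, not Hyphe’s actual implementation), picking the most linked known pages of an entity could look like this:

```python
# Sketch of "auto start pages": among pages of an entity already known from
# earlier links, keep the 5 most linked ones.
from collections import Counter

known_links = [  # (source page, target page) pairs found during earlier crawls
    ("google-results", "https://www.nasa.gov/climate"),
    ("google-results", "https://www.nasa.gov/climate"),
    ("wikipedia.org/x", "https://www.nasa.gov/climate"),
    ("google-results", "https://www.nasa.gov/"),
]

def auto_start_pages(entity_prefix: str, links, n=5):
    counts = Counter(tgt for _, tgt in links if tgt.startswith(entity_prefix))
    return [page for page, _ in counts.most_common(n)]

print(auto_start_pages("https://www.nasa.gov/", known_links))
# ['https://www.nasa.gov/climate', 'https://www.nasa.gov/']
```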

Once the crawls are finished, we look again at the network: there are now more web entities IN, and we have some links between them. We can check that Google, the web entity containing the results page that we effectively downloaded, is connected to all the other web entities, as expected from our protocol.


Changing the status of web entities

We are not much interested in having our starting web entity in our corpus, and we will set its status to OUT. We click on WEB ENTITIES in the left menu, we check our starting web entity, we set its new status in the right panel and click on SET STATUS.


We also want to get rid of a few web entities that were not part of the results: Google.fr, Google.com, Google User Content, Blogger and YouTube (we identify them by browsing the results page). We check them and set their status to OUT. As we can see, we now have 10 IN entities, 6 OUT and 209 DISCOVERED. The 10 entities IN the corpus are the first 10 Google results.


In the resulting network, interestingly, not every entity is linked to every other. The five web entities that are linked together are Wikipedia, the Guardian, NASA, the IPCC and the BBC.


Investigating Google search results

As we have just seen, mapping the first 10 results is a bit short for obtaining a significant network. We decided to tweak the experiment in the following ways:

  1. Changing Google’s settings to display 100 results per page
  2. Comparing 3 queries (we created two more Hyphe projects):
    1. climate change
    2. “climate change” “loss and damage”
    3. “climate change” scam
  3. Setting OUT of the corpus only the Google-related entities (those using a “google.something” domain), YouTube, and Twitter. These were so connected to everything that they hid the interesting patterns (clusters…)

On all other points, we just replicated the same protocol.

The URLs used as starting points are the following:

Note: since we are comparing network layouts, it is worth remembering that in this context, the orientation of a network has no particular meaning, contrary to a statistical projection using axes. The fact that Wikipedia appears at the top or the bottom is not relevant. Conversely, appearing in a central or peripheral position, which is independent of the orientation, is meaningful. More generally, we must pay attention to the relative size of the nodes (here representing their degree, i.e. number of neighbors), the number of nodes and edges, which nodes are the biggest (most connected), where they are situated in terms of center/periphery, and the possible clusters. You can search the web for “visual network analysis” to find more information on this methodology.

Network of the first 100 results of climate change:


Network of the first 100 results of “climate change” “loss and damage”:


Network of the first 100 results of “climate change” scam:


For the sake of concision, we will now refer to the query climate change as CC, “climate change” “loss and damage” as L&D, and “climate change” scam as SCAM.

The most obvious difference between these networks is their density: CC is the sparsest, SCAM the densest, and L&D lies in between. It is important to note that in all three cases, some of the web entities are disconnected (they appear as “orbiting” around the central component like the rings of Saturn). The denser networks have fewer disconnected web entities and, more generally, a higher number of links inside the central component.
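This density comparison is easy to quantify once the networks are exported from Hyphe, for instance as GEXF files (the file names below are hypothetical):

```python
# Comparing the three exported networks on density and disconnection.
import networkx as nx

for name in ("cc.gexf", "ld.gexf", "scam.gexf"):   # hypothetical exports
    g = nx.read_gexf(name)
    isolated = sum(1 for node in g if g.degree(node) == 0)
    print(name,
          "density:", round(nx.density(g), 4),
          "components:", nx.number_connected_components(g.to_undirected()),
          "isolated nodes:", isolated)
```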

CC’s central component is mostly composed of mainstream media (BBC, New York Times…), general or climate-focused international institutions (UN, IPCC, UNFCCC, FAO…), and American institutions (NASA…). They are not strongly connected, and we find Wikipedia in a central position. Most of these websites belong to the “surface” of the web: notorious, very visible, and strongly cited (in general, though not in this network). The query captured a very generic set of websites, but failed to capture the discussion about climate change.

L&D has an interesting central component with two subclusters, joined by the UN Framework Convention on Climate Change (UNFCCC). The separation between these two “wings” cannot be straightforwardly explained by a difference in the types of websites, for both clusters contain a similar mix of political institutions, information websites, public universities and civil society organizations. The division is more probably determined by a thematic difference.

  1. The left wing seems more centered on scientific investigation, and thus probably on the question of the scientific determination of the impacts that climate change will have on vulnerable countries - hence the prevalence of several scientific institutions and academic publishers.
  2. The right wing, instead, seems more focused on North-South cooperation, and thus probably on the question of climate justice and sustainable development - hence the greater presence of NGOs and groups from the global South.

Interestingly, but not surprisingly, the UN Framework Convention on Climate Change, where both themes are negotiated, is the most cited website of the network and the one that bridges its two main clusters. Notably, this query captures an ongoing discussion about loss and damage (in the context of climate change). The resulting network has more links than CC because the actors are less generic and are engaged together in a common subject of interest.


The Google results for the query “climate change” “loss and damage” display two subclusters.

SCAM’s central component is composed of many well-linked entities of remarkable variety: mainstream media (Washington Post, BBC, Forbes…), generic platforms (LinkedIn, Wikipedia, WordPress…), activists debunking misinformation (Snopes) and/or engaged against climate denial (Desmog, Skeptical Science), as well as climate deniers (Heartland.org) and conspiracy theorists (Principia Scientific, WakeUpKiwi).

First of all, we must note that the query did not naïvely capture actors who claim that “climate change is a scam”: most actors actually debunk climate denial. This subspace of Google results has been occupied by debunkers to such an extent that the “scam” keyword cannot be efficiently used as a marker of climate change denial. We hypothesize that this could be a deliberate strategy, though investigating it would require another kind of inquiry.

We also note that despite the presence of oppositions, usually manifested by structural holes (competitors/adversaries tend not to cite each other), the network is dense and unclustered. No lack of links segregates climate change realists from deniers. The direction of the links provides the key to understanding the relations between actors. Hyphe does not display the direction of the links, so we have to either export the network to Gephi or compare the indegree (how many entities cite you) to the outdegree (how many entities you cite), which can be done in Hyphe.

We can then remark that climate change deniers like WakeUpKiwi or Principia Scientific have many outbound links but are very poorly cited (inside this corpus of the top 100 Google results). On the contrary, mainstream media like the Guardian and Forbes, as well as platforms like LinkedIn and WordPress, are much more cited than they cite (again, inside this corpus). Climate deniers cite many authorities, while they are not cited by them. This topology is strongly hierarchised, and we hypothesize that it reflects the legitimacy of actors as sources of information. It is possible that the links from the deniers to the realists are critiques, but it is nonetheless true that the realists do not link to the deniers. By analogy with other cases (see Visual Network Exploration for Data Journalists), we think that legitimacy is a resource that less legitimate actors try to harvest from more legitimate actors by citing them. In such a situation, the traditional absence of hyperlinks that denotes opposition is only present in one direction.


The Google results for the query “climate change” scam, comparing indegree to outdegree

The actors who cite a lot are not the same as the actors who are well cited.
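A minimal sketch of that indegree/outdegree comparison, again with networkx on a hypothetical GEXF export of the SCAM corpus:

```python
# Flagging entities that cite more than they are cited.
import networkx as nx

g = nx.read_gexf("scam.gexf")               # assumed to be a directed export
for entity in g:
    indeg, outdeg = g.in_degree(entity), g.out_degree(entity)
    if outdeg > indeg:                      # cites more than it is cited
        print(f"{entity}: {outdeg} outbound vs {indeg} inbound links")
```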

By comparing the networks of three different Google queries, we realize that not all queries are equal. Contrary to L&D and SCAM, CC failed to capture a coherent discussion space. The query “climate change” is very generic, and Google responds with equally generic resources. More specific queries lead to more defined spaces, even if generic resources still tend to occupy a good share of the results. The way Google selects the top 100 results has no reason to delineate a coherent aggregate, and if we wanted to systematically track a given topic, we should use another method. Nevertheless, it reflects the information provided to internet users. In that sense, the networks depict different spaces with different rules: sometimes the results have inherent hierarchical relations, sometimes they just share a common space (that we hypothesize to be a topical aggregate), and sometimes they have no specific relations.

