
Commit: week 7

ddekadt committed Nov 6, 2023
1 parent c34bc12, commit 85f28c2
Showing 5 changed files with 549 additions and 0 deletions.
week07/01-newspaper-rss.Rmd (144 additions, 0 deletions)
---
title: "MY472 - Week 7: Scraping newspaper RSS feeds"
date: "11 November 2023"
output: html_document
---

Loading packages:

```{r}
library("rvest")
library("tidyverse")
library("xml2")
```

In this example, we will scrape an RSS feed and some articles from [The Guardian](https://www.theguardian.com). Combining techniques we have covered in the course so far, the goal is to produce a tabular dataset in which each row represents one article.

You can read through the [Guardian's RSS documentation](https://www.theguardian.com/help/feeds). As you can see, an RSS feed is provided for each news category, and the URL always has the form https://www.theguardian.com/####/rss. There are even feeds for sub-categories of a specific news category; for instance, the following works: [https://www.theguardian.com/world/japan/rss](https://www.theguardian.com/world/japan/rss). As an example, let us look at the Brexit RSS feed:

```{r}
url <- "https://www.theguardian.com/politics/eu-referendum/rss"
```
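
As a quick illustration of this pattern, feed URLs for several categories can be constructed in one go (a small sketch; the category names below are only examples):

```{r}
# Constructing a few category feed URLs following the pattern described above
# (the category names are illustrative examples)
example_categories <- c("politics", "world/japan", "business")
example_feed_urls <- paste0("https://www.theguardian.com/", example_categories, "/rss")
example_feed_urls
```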

Before scraping, we can check the structure of the XML document in the browser. We need to know what the element for each article is called; here, each article is contained in an "item" tag. Next, we read in the document and extract these elements.

Note: `xml_nodes()` was deprecated in rvest 1.0.0; it is advised to use `html_elements()` instead, also for XML files.

```{r}
# Reading the xml document
rss_xml <- read_xml(url)
# Extracting the items in the RSS feed
elements <- html_elements(rss_xml, css = "item")
length(elements)
```
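
Besides looking at the feed in the browser, we can also inspect the tag structure of a single item directly in R, e.g. with `xml2::xml_structure()` (an optional sketch):

```{r}
# Optional: inspect the tags contained in the first <item> node
xml_structure(elements[[1]])
```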

From these elements, we extract titles, descriptions, dates, and URLs, and combine everything in a data frame:

```{r}
# Extract titles
title <- elements %>%
  html_element("title") %>%
  xml_text()
# Extract descriptions
description <- elements %>%
  html_element("description") %>%
  xml_text() %>%
  str_replace_all("<.+?>", " ") # removes residual tags
# Extract dates and process as datetimes
datetime <- elements %>%
  html_element("dc\\:date") %>%
  xml_text() %>%
  parse_datetime()
# Extract URLs
article_url <- elements %>%
  html_element("guid") %>%
  xml_text()
# Combine everything in a data frame/tibble
data_articles <- tibble(title,
                        description,
                        datetime,
                        article_url,
                        full_text = "")
```

This is the data frame:

```{r}
data_articles
```

Next, let us prototype how to scrape the text in the body of each of those articles. We pick the first URL and write some code to obtain an object that contains the text of the article; we simply select all paragraphs in the file.

```{r}
test_url <- article_url[1]
# Note: As the RSS feed changes very frequently, copy the url contained in
# `article_url[1]` into your browser to have a look at the first article in
# your specific list
test_text <- test_url %>%
  read_html() %>%
  html_elements(css = "p") %>%
  html_text() %>%
  paste(collapse = "\n")
test_text
```

Now that the code works, let us put it into a function that generalises to all URLs in the dataset. This function can later be used with a loop or an apply function. Recall that the data frame contains an empty column called `full_text` which we can now fill: each iteration of, e.g., a loop fills the ith element of that column with the text of the corresponding article.

```{r}
text_extractor <- function(current_url, sec = 2){
  current_text <- current_url %>%
    read_html() %>%
    html_elements(css = "p") %>%
    html_text() %>%
    paste(collapse = "\n")
  Sys.sleep(sec)
  return(current_text)
}
```
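
In practice, an individual request can occasionally fail. A small defensive variant (just a sketch; it is not used below) wraps the extractor in `tryCatch()` so that one broken URL does not stop an entire loop:

```{r}
# Sketch: return NA instead of throwing an error if a single article fails to load
safe_text_extractor <- function(current_url, sec = 2) {
  tryCatch(text_extractor(current_url, sec = sec),
           error = function(e) NA_character_)
}
```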

Before we begin the loop, let us reduce the dataset to only the first 10 articles to save some time in this illustration:

```{r}
# Storing only the first ten rows of the dataframe
data_articles_small <- data_articles[1:10,]
```

For loop:

```{r}
for (i in 1:nrow(data_articles_small)) {
  # Currently scraping article ...
  print(i)
  # Obtain URL of current article
  current_url <- as.character(data_articles_small[i, "article_url"])
  # Obtain full text of article
  data_articles_small[i, "full_text"] <- text_extractor(current_url)
}
```

In R, however, we often use apply functions, which offer a more compact notation. The function `sapply()` applies another function element-wise to a vector and returns a vector. This vector is assigned to the data frame as a new column (this takes a little time to run):

```{r}
data_articles_small$full_text_2 <- sapply(data_articles_small$article_url, text_extractor)
```
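
Since `purrr` is loaded as part of the tidyverse, an equivalent alternative would be `map_chr()`, which always returns a character vector. A sketch (not evaluated here to avoid scraping the same articles a third time; the column name `full_text_3` is only illustrative):

```{r, eval=FALSE}
# Equivalent purrr approach (not evaluated here)
data_articles_small$full_text_3 <- map_chr(data_articles_small$article_url, text_extractor)
```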

Both the for loop and the sapply approach yielded the same outcome:

```{r}
data_articles_small
```
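
We can also check this programmatically; the texts should match unless an article was updated between the two scraping runs:

```{r}
# Compare the column filled by the loop with the column filled by sapply
all(data_articles_small$full_text == data_articles_small$full_text_2)
```
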
week07/02-introduction-to-selenium.Rmd (76 additions, 0 deletions)
---
title: "MY472 - Week 7: A first script using RSelenium"
author: "Friedrich Geiecke, Daniel de Kadt"
date: "7 November 2023"
output: html_document
---

Note: Using `RSelenium` usually requires the Java JDK. First, check whether it is already installed on your computer: simply install `RSelenium` with `install.packages("RSelenium")` and try to run the code in this document. If that does not work properly, install the Java JDK next. You can download the current version from here: https://www.oracle.com/java/technologies/downloads/. After the installation, restart RStudio.

Loading the Selenium package:

```{r}
#install.packages("RSelenium") -- run once to install the package on your computer
library("RSelenium")
#install.packages("netstat") -- this is optional, to allow the use of free_port()
library("netstat")
```

Launching the driver and browser (if the port is already in use, choose a different four-digit number, e.g. `rsDriver(browser=c("firefox"), port = 1234L)`). Alternatively, as we do here, choose a random free port using `netstat::free_port()`. This code will now open what we call a 'marionette' browser (you can figure out why). Do not close this browser window!

```{r}
# note that we use 'chromever = NULL' as chrome drivers seem to be buggy as of 11/6/23
rD <- rsDriver(browser=c("firefox"), port = free_port(random = TRUE), chromever = NULL)
driver <- rD$client
```
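
As an optional sanity check, we can ask the Selenium server whether the session is responding:

```{r}
# Optional: query the server status to confirm the session is alive
driver$getStatus()
```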

Navigate to the Google website:

```{r}
url <- "https://www.google.com/"
driver$navigate(url)
```
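
To confirm that the navigation worked, we can query the current page title (optional):

```{r}
# Optional: the title should now be "Google"
driver$getTitle()
```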

Recently, some regions have started showing new privacy terms that have to be accepted once.

Side note: Some such cases with pop-up windows on websites might require frame switching (although the 2023 version of Google does not seem to require this at the moment). Should the code below not run on your computer after you have entered the correct XPath, try to uncomment the `switchToFrame` calls. As the window is in the foreground, we may have to switch the frame; without switching the frame, we might not be able to click on the right element. Whether switching the frame is necessary depends on the design of the underlying website, which can change.

As an exercise, the XPaths of the relevant elements have to be obtained with the Inspect function of the browser and then pasted into this code, replacing the "tba" placeholders.

```{r}
#driver$switchToFrame(0) # can be un-commented and tried if code does not run
# Note the use of single quotes around 'tba': the XPath will sometimes contain
# double quotes, and if a string contains double quotes, you need single quotes
# to delimit the string.
agree_button <- driver$findElement(using = "xpath", value = 'tba')
agree_button$clickElement()
#driver$switchToFrame(1)
```

Next, we will search for the LSE:

```{r}
search_field <- driver$findElement(using = "xpath", value = 'tba')
search_field$sendKeysToElement(list("london school of economics"))
Sys.sleep(1)
search_field$sendKeysToElement(list(key = "enter"))
```

And navigate to the LSE website by clicking on the first link of the search results:

```{r}
first_link <- driver$findElement(using = "xpath", value = 'tba')
first_link$clickElement()
```

Lastly, let us close the driver and browser:

```{r}
# close the Rselenium processes:
driver$close()
rD$server$stop()
# close the associated Java processes (the taskkill command below is Windows-specific; on Mac or Linux it may not be necessary)
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
```
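
On macOS or Linux, a rough equivalent of the Windows `taskkill` call above (an assumption, adjust to your system; note that it terminates all running Java processes) would be:

```{r, eval=FALSE}
# macOS/Linux sketch (assumption): terminate all running Java processes
system("pkill java")
```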
