Simple Web Scraper to Google Sheets

A web scraper and R?

I recently assisted my colleague Danny in teaching an R course at Measure Camp London, where I made a simple web scraper to show what R can do in terms of web scraping. This is the result:

# install.packages(c("xml2", "rvest"))
library(xml2)
library(rvest)

# Timestamp for this scrape
now <- format(Sys.time(), "%a %b %d %X %Y")

# Read the front page
webpage <- read_html("http://dr.dk/")

# Extract the text of every <h2> element
newsHeadLines <- webpage %>%
  html_nodes("h2") %>%
  html_text()

# Append the timestamp and the headlines to a simple .txt archive
write(paste("TimeStamp:", now), file = "news_headlines.txt",
      append = TRUE)
write(newsHeadLines, file = "news_headlines.txt",
      sep = "\n", append = TRUE)

We need two libraries to make this work: xml2 and rvest. Install and load both of them.

Because this was made as a news scraper, I wanted to keep track of time so that I could see which news headlines were scraped at what time.

This can be done by defining a variable (with a name of your choice) and assigning it format(Sys.time(), "%a %b %d %X %Y"). What this means is that we take the built-in function Sys.time(), which returns the current date and time, and format it as a readable string.
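For reference, here is a small sketch of what each conversion specifier in that format string stands for (the exact output depends on your locale and on when you run it):

# %a - abbreviated weekday name
# %b - abbreviated month name
# %d - day of the month
# %X - the time, typically HH:MM:SS
# %Y - four-digit year
format(Sys.time(), "%a %b %d %X %Y")
# e.g. "Sat Mar 17 14:05:31 2018" (illustrative output)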

Then we define what URL we want to scrape, and then what we want to scrape from it. This is a very simple example; to make proper use of this, you will need to go into the HTML code and find precisely which piece of information you want to pull out. Just getting every h2 is very simplified, but it works for this example.

To get the information we use:

html_nodes("THE HTML ELEMENT"). All of this is then stored in a variable.
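To be a bit more precise than grabbing every h2, you could target a specific CSS class instead. A minimal sketch (the class name below is made up; replace it with whatever selector you find when inspecting the page's HTML):

# Hypothetical example: only <h2> elements with a specific (made-up) class
specificHeadLines <- webpage %>%
  html_nodes("h2.headline-title") %>%  # replace with the real selector from the page
  html_text()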

Now we only need to save it to the machine so that we can use it later. In this example, I just write it to a simple .txt file.

We do this by using write(). We define the file name, the separator, and whether we want to append the data or have R overwrite the file each time. Because I wanted this to work as a sort of archive, I set append = TRUE. So first we write the timestamp and afterwards the headlines.
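If you later want to check what has accumulated in the archive, you can read the file straight back into R (assuming the file name used in the script above):

# Read the appended archive back in, one line per element
archive <- readLines("news_headlines.txt")
head(archive)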

Writing it to Google Sheets

# install.packages("googlesheets")
library(googlesheets)

gs_auth(new_user = TRUE) # Authenticate with your Google account

gs_ls() # List all Google Sheets in that account

# Create a new Google Sheet and write the scraped headlines into it
createNewSheet <- gs_new("NewsData", ws_title = "Headlines", input = newsHeadLines,
                         trim = TRUE, verbose = FALSE)

If we want to take this even further, we can write the data from the scraper to a new Google Sheet.

We install and load the googlesheets library.

Then we authenticate the user with gs_auth(); this determines which Google account (and Drive) the new sheet will be placed in.

If you want to get an overview of the files already in Google Sheets, you can run gs_ls().

Then we create a new sheet (choose a name that makes sense). We use gs_new() for this. The first argument is the title of the new Google Sheet, ws_title names the worksheet (tab) inside it, and input is the data we want written.
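If you come back to the sheet on a later run, you do not have to create it again. A small sketch, assuming the sheet title used above, that registers the existing sheet and reads the headlines back:

# Register the existing Google Sheet by its title and read the worksheet back
newsSheet <- gs_title("NewsData")
gs_read(newsSheet, ws = "Headlines")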

Documentation

Packages used:

xml2
rvest
googlesheets
