1 Intro

Social media research is becoming increasingly popular and there are many freely available (and open source) tools for collecting and analysing social media data. This document will outline some popular methods in the R programming language and give example code for each section. The intention is that this document will become a basic ‘how to’ guide for the most popular tools.

For a more general introduction to the R language see https://www.datacamp.com/ for an interactive tutorial, or start an interactive course from within R using the swirl package. Enter the following into the R console:
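
install.packages("swirl")  #install the swirl package (one-off)
library(swirl)             #load it
swirl()                    #start the interactive course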

1.1 A note on package installation

In this document I will use the pacman package manager, as it automatically fetches packages that aren’t installed and loads them with one command: p_load(packagename).

If you don’t want to use pacman you can install packages in the usual way: install.packages("packagename") then require(packagename).
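
For example (the package names here are just illustrations):

#One-off installation of pacman itself
install.packages("pacman")

#p_load() installs any missing packages and then loads them
pacman::p_load(rvest, dplyr)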

2 Data Acquisition

There are many different sources of social media data, to name just a few:

  • Traditional social networks (Twitter, Facebook)
  • User generated comments (such as comments to articles)
  • Online conversation forums
  • Search trends data

This section will explore how to pull data from online platforms for later processing in R.

2.1 Web-scraping

Web scraping generally refers to the collection of information from websites. Popular websites (such as Facebook or Twitter) often provide APIs (application programming interfaces) to make interacting with the platform more straightforward. However, many websites do not provide an API and you will need to write a script to extract the information directly from the website. There are two packages often used in R: rvest and RSelenium.

2.1.1 rvest

The rvest method is the simplest and doesn’t require any external applications (meaning you can run this on laptops with restricted privileges).

The main limitation is that the rvest method doesn’t simulate a full browser; it simply downloads the page’s HTML and provides tools to extract information from that document.

In this example I’ll scrape reviews from the Amazon.co.uk website: https://www.amazon.co.uk/product-reviews/1782118691/ref=cm_cr_dp_see_all_btm?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent
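
Below is a minimal sketch of that setup. The user agent string and the number and length of the pauses are arbitrary choices, and with rvest version 1.0 or later you would use session() rather than html_session():

pacman::p_load(rvest, httr)

#URL of the first page of reviews
amazon.url <- "https://www.amazon.co.uk/product-reviews/1782118691/ref=cm_cr_dp_see_all_btm?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent"

#Establish a session, pretending to be an ordinary browser via the user agent string
my.user.agent <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0 Safari/537.36"
amazon.session <- html_session(amazon.url, user_agent(my.user.agent))

#Random pauses (in seconds) to wait between page requests
scrape.delays <- runif(50, min = 2, max = 6)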

The above section of code establishes a web session. At this point it’s worth explaining what the user_agent does and why we’re generating some random times to use between each scrape: both make the rvest session look like an ordinary user browsing the site, so the website behaves normally and we avoid firing off requests faster than a human would.

In the next section we will set up a loop that goes through each page and grabs the text based on CSS ‘nodes’. The easiest way to find these nodes is with a CSS selector tool such as SelectorGadget: http://selectorgadget.com/
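
A sketch of such a loop is below. The CSS selectors, the pageNumber URL parameter and the hard-coded 50 pages are assumptions for illustration - Amazon’s markup changes regularly, so check the current selectors with SelectorGadget (and with rvest 1.0+ use session_jump_to() in place of jump_to()):

#Data frame to hold the results
amazon.data <- data.frame(rating = character(),
                          review = character(),
                          date   = character(),
                          stringsAsFactors = FALSE)

for(page in 1:50){
  #Build the URL for the current page of reviews (the pageNumber parameter is an assumption)
  page.url <- paste0(amazon.url, "&pageNumber=", page)
  
  #Fetch the page within the session and parse the HTML
  page.html <- jump_to(amazon.session, page.url) %>% read_html()
  
  #Extract each field using CSS selectors found with SelectorGadget (illustrative only)
  rating <- page.html %>% html_nodes(".review-rating") %>% html_text()
  review <- page.html %>% html_nodes(".review-text")   %>% html_text()
  date   <- page.html %>% html_nodes(".review-date")   %>% html_text()
  
  #Add this page's reviews to the running data frame
  amazon.data <- rbind(amazon.data,
                       data.frame(rating, review, date, stringsAsFactors = FALSE))
  
  #Wait a random amount of time before the next request
  Sys.sleep(scrape.delays[page])
}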

Let’s look at a sample of the data:

    rating  review                                                                   date
160      3  its okay                                                                 on 18 July 2015
433      5  Brilliant book. I love the characters and had a great time reading it.  on 22 August 2014
50       4  Good story                                                               on 15 April 2016

2.1.2 RSelenium

RSelenium uses the popular Selenium server to simulate an entire browser. This is more powerful, but more complicated than rvest. You only really need to use it when interacting with websites that have dynamic content.

If the website in question has any interactive elements that download HTML on the fly, then the rvest method will not be able to extract this information. This technique is common on interactive websites - for example, when more content is loaded as you reach the bottom of the current page.

How might this impact web scraping? A website that contains reviews may only display the first few lines of each review, with a button to ‘read more’. When a user clicks the read more button, a small piece of JavaScript fetches the content and amends the HTML on the fly. Thus, without the interactivity, the rest of the review is simply not present in the HTML. At the time of writing rvest cannot handle this - if you need to scrape interactive websites, RSelenium (described in this section) is the tool to use. In most cases, though, all the text is present in the HTML code and so rvest is more than enough.

2.1.2.1 Interacting with the Selenium server from R

The easiest way to get a Selenium server up and running is with a Docker image. Set up Docker (docker.com), then run docker run -d -p 4445:4444 selenium/standalone-firefox from the terminal.

A debug image is also available whose session can be viewed over VNC: docker run -d -p 4444:4444 -p 5901:5900 selenium/standalone-firefox-debug. See: https://github.com/SeleniumHQ/docker-selenium

You can launch this from within R, at least on OSX.
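
For example, something like the following should work (this assumes the docker binary is on your PATH):

#Start the Selenium docker container from within R
system("docker run -d -p 4445:4444 selenium/standalone-firefox")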

You can then connect to the server using RSelenium. To do this, you just specify the external port (defined in the docker command as external:internal).
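
For example, with the container above mapping external port 4445 to the server’s internal port 4444:

pacman::p_load(RSelenium)

#Connect to the Selenium server running in the docker container
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "firefox")
remDr$open()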

2.1.2.2 Scraping

The rest of the process is broadly similar to that of rvest with the added possibility of clicking elements on the page and interacting with the website more fully.

For a guide see: https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html

In this section I will scrape reviews of the book ‘Nudge’ from Goodreads. Goodreads is an example of a website that updates content using JavaScript - when you click ‘next’ on the page of reviews it doesn’t load a new page, but pulls the new reviews into the current one. rvest cannot handle this kind of interactivity, but RSelenium can.

The process in RSelenium is slightly more convoluted than the rvest equivalent. To extract information you must first find the elements, then extract the text or HTML from those elements. Note the difference between findElement (which returns the first occurrence) and findElements (which returns all occurrences as a list). As you will most often be dealing with more than one element, information will be returned as a list. To get the text or HTML from all elements within a list, use lapply then unlist to convert the result into usable data.

#Load the packages used below (rvest parses the extracted HTML, pander prints tables)
pacman::p_load(stringr, dplyr, rvest, pander)

#Navigate to the page
remDr$navigate(url = "https://www.goodreads.com/book/show/2527900.Nudge")

#Set the next-page tracker to TRUE for the while loop & start a counter
rs.nextpage.exists <- TRUE
counter <- 0

#Setup the dataframe
rs.data<-
  data.frame(rating = character(),
           date = character(),
           review = character(),
           stringsAsFactors = FALSE
)

#Run through each page until there aren't any more reviews...
while(rs.nextpage.exists==TRUE){
  #Find the review 'elements'.
  rs.reviews.element<-
    remDr$findElements(using = 'css', 
                       value = '#reviews .reviewText.stacked')
  
  #Extract the relevant HTML from those elements
  rs.reviews.html<-
    lapply(rs.reviews.element, function(x){
      x$getElementAttribute("outerHTML")[[1]]})
  
  #Use rvest to parse the HTML into text. 
  rs.reviews.text<-
    unlist(
      lapply(rs.reviews.html,
             function(x){
               read_html(x) %>% 
                 html_text(.)  
             }
      ))
  
  
  #Grab the dates
  rs.date.elements <-
    remDr$findElements(using = 'css',
                       value = '#reviews .reviewDate')
  
  rs.date.text<-
    lapply(rs.date.elements, 
           function(x){
             x$getElementText()
           })
  
  rs.date.text<-
    unlist(rs.date.text)
  
  #Grab the star rating
  rs.rating.elements <-
    remDr$findElements(using = 'css',
                       value = '#reviews .reviewHeader')
  
  rs.rating.text<-unlist(
    lapply(rs.rating.elements,
           function(x){
             x$getElementText()
           })
  )
  
  #Let's define a quick function that looks for rating keywords in the string and returns a numeric score.
  goodreads.detect<-
    function(x){
      ifelse(str_detect(x,'amazing'), 5,
             ifelse(str_detect(x, 'really liked'),4,
             ifelse(str_detect(x,'liked it'), 3,
             ifelse(str_detect(x, 'ok'), 2, 
             ifelse(str_detect(x,'did not like'),1,
             NA)))))}
  
  rs.rating.text<-
    goodreads.detect(rs.rating.text)
  
  
  
  
  #Not all ratings have reviews, but ratings with reviews always come first, so make sure that we fill in missing text variables with NA.
  rs.review.count<-length(rs.reviews.text)
  rs.rating.count<-length(rs.rating.text)
  
  if(rs.rating.count-rs.review.count>0){
    rs.reviews.text<-
      c(rs.reviews.text, 
        rep(NA, rs.rating.count-rs.review.count))
  }
  
  #Place the current page's data into a data.frame
  rs.session.combined<-data.frame(rating = rs.rating.text,
                                  date = rs.date.text,
                                  review = rs.reviews.text,
                                  stringsAsFactors = FALSE
  )
  
  #Print it to the console
  pander::pander(rs.session.combined)
  
  #If the last review grabbed is the same as the last one from the previous page, we must have reached the end... so stop.
  if(counter>0){
    if(last(rs.data$review)==
       last(rs.session.combined$review)){
      print('Last review is the same as the previous page... quitting')
      rs.nextpage.exists=FALSE
    }
  }
  
  #Add the new page of data to the already acquired data
  rs.data<-
    rbind(rs.data, 
          rs.session.combined)
  
  #How many collected so far?
  print(nrow(rs.data))
  
  #Save on each iteration, just in case it crashes.
  saveRDS(rs.data, file="data/nudge_Goodreads.RDS")

  #Find the next page button and click it
  rs.nextpage<-
    remDr$findElement(using='css',
                      value='.next_page')
  
  
  counter <- counter + 1
  
  #Is there a valid next link?
  if(is.null(unlist(
    rs.nextpage$getElementAttribute(attrName = 'href')))
  ){
    #if not, exit the while loop
    print('No next page button, quitting...')
    rs.nextpage.exists=FALSE
  }else{
    #If there is, click the link and wait
    rs.nextpage$clickElement()
    Sys.sleep(20)
  }
}

#Check for any duplicates
rs.data<-
  rs.data %>%
    distinct()
rs.data %>% glimpse()

saveRDS(rs.data, file="data/nudge_Goodreads.RDS")

You may notice that for some of the above scraping I’ve extracted the attributes (with $getElementAttribute) and then parsed with rvest, whereas for other elements I’ve extracted the text directly (with $getElementText). There is a subtle difference between these methods. $getElementText grabs the text, but with one important caveat: it only grabs the text that is drawn on the screen. So, if there is a ‘read more’ drop-down (or something similar), $getElementText will only return the displayed text. This isn’t great when you want to scrape the entire review. Luckily, $getElementAttribute allows you to grab information from the HTML document directly. So grabbing the outerHTML attribute of the review element returns all of the HTML for that review, which can then be parsed using the rvest methods.

2.2 Social Media APIs

2.2.1 Twitter

The Twitter API has two main methods for collecting tweets: search (using the twitteR package) or stream (using the streamR package).

2.2.1.2 Stream

According to Twitter: “The streams offer samples of the public data flowing through Twitter.” This should capture all public information on a current topic in real time. Before we can begin capturing data, we must first log in to Twitter. To do this you need an API key and secret. You can get these by creating an app at https://apps.twitter.com/

The stream API lets you capture tweets as they occur. This can be very useful for monitoring a topic over a long period of time. As I don’t want to spend a long time waiting for tweets to come in for this example, I’ll collect 10 minutes’ worth of tweets about Christmas (it’s the 19th of December, so there should be plenty of tweets).
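
A minimal sketch of that workflow, following the standard streamR/ROAuth pattern, is below. The api_key and api_secret objects, the output file name and the tracking term are placeholders - substitute your own credentials (ideally loaded from an encrypted store, as described in section 5.1):

pacman::p_load(ROAuth, streamR)

#Create an OAuth token using the key and secret from apps.twitter.com
#(api_key and api_secret are placeholders for your own credentials)
my_oauth <- OAuthFactory$new(consumerKey    = api_key,
                             consumerSecret = api_secret,
                             requestURL = "https://api.twitter.com/oauth/request_token",
                             accessURL  = "https://api.twitter.com/oauth/access_token",
                             authURL    = "https://api.twitter.com/oauth/authorize")
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

#Capture 10 minutes (600 seconds) of tweets mentioning 'christmas'
filterStream(file.name = "christmas_tweets.json",
             track = "christmas",
             timeout = 600,
             oauth = my_oauth)

#Parse the captured JSON into a data frame
christmas.tweets <- parseTweets("christmas_tweets.json")
names(christmas.tweets)
nrow(christmas.tweets)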

Quick look at the data:

text, retweet_count, favorited, truncated, id_str, in_reply_to_screen_name, source, retweeted, created_at, in_reply_to_status_id_str, in_reply_to_user_id_str, lang, listed_count, verified, location, user_id_str, description, geo_enabled, user_created_at, statuses_count, followers_count, favourites_count, protected, user_url, name, time_zone, user_lang, utc_offset, friends_count, screen_name, country_code, country, place_type, full_name, place_name, place_id, place_lat, place_lon, lat, lon, expanded_url and url

## [1] 16509
screen_name created_at text
itslaur3nbitch Mon Dec 19 13:23:23 +0000 2016 RT @Renee_patten: Ok but has anyone stolen the little baby Jesus from Quincy Center yet because it’s not Christmas time until that happens
Aurum_Consult Mon Dec 19 13:17:12 +0000 2016 @CapgeminiConsul and friends at Holborn doing a great job gathering Christmas gifts to give to the kids at Evelyn’s… https://t.co/6co8vjQqNT
ForLoveofaDog Mon Dec 19 13:22:17 +0000 2016 RT @CartersCollecta: Not long to go until Christmas and still stuck for ideas? There’s still time to browse and buy at…
Mbrawler Mon Dec 19 13:24:42 +0000 2016 RT @Sco2hot: The best Christmas decoration I’ve seen haha. #DieHard https://t.co/rO7e4u1nav
acenturyofmadi Mon Dec 19 13:25:21 +0000 2016 RT @reaIDonaldTrunp: Thank you Alex, because of this I will bring back Vine so you can be famous again https://t.co/yj8sIg1FPF

3 Data wrangling

Cleaning and manipulating data with the dplyr and tidyr packages is summarized here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

3.1 Goodreads dataset

First load the data - rather than putting it in a plain data.frame (printing a large data.frame dumps the whole object to the console) we’ll put it in a tbl_df (‘tibble’), which only prints the first few rows and so is much quicker to inspect.
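
Assuming the data was saved to data/nudge_Goodreads.RDS as in the scraping section, something like this will do (newer versions of dplyr prefer as_tibble() over tbl_df()):

pacman::p_load(dplyr)

#Load the saved Goodreads reviews and convert to a tibble
rs.data <- readRDS("data/nudge_Goodreads.RDS") %>%
  tbl_df()

glimpse(rs.data)
head(rs.data)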

## Observations: 480
## Variables: 3
## $ rating <dbl> 4, 2, 1, 5, 4, 3, 2, 1, 4, 2, 4, 5, 2, 4, 2, 5, 4, 5, 5...
## $ date   <chr> "Mar 22, 2009", "Oct 18, 2009", "Jan 09, 2012", "Oct 22...
## $ review <chr> "\n              \n            \nThis one took me longe...
## # A tibble: 6 x 3
##   rating date         review                                              
##    <dbl> <chr>        <chr>                                               
## 1   4.00 Mar 22, 2009 "\n              \n            \nThis one took me l…
## 2   2.00 Oct 18, 2009 "\n              \n            \nI don't understand…
## 3   1.00 Jan 09, 2012 "\n              \n            \nThis comes with a …
## 4   5.00 Oct 22, 2008 "\n              \n            \nThis is a terrific…
## 5   4.00 Aug 18, 2008 "\n              \n            \nI second-guessed m…
## 6   3.00 Jun 30, 2008 "\n              \n            \nI really like a lo…
## [1] "\n              \n            \nThis one took me longer to read that is reasonable for a book of its length or the clear style it is written in. I mean, such a simply written text of 250 pages ought to have finished in no time. The problem was that I don’t live in the US and so many of the examples mad"
## [1] "n superannuation, but superannuation would be a tax and would be run by the Australian Tax Office. But don’t get me started…This would be an even more interesting book if you live in America, given the nature of the examples, but either way, this is still worth a look.\n  ...more\n\n          \n        "

A few things can be done to clean up this dataset. First, get rid of any duplicates. Second, there are ‘\n’ newline characters in the review text and each review ends with ‘…more’, both of which should be stripped out. Finally, some reviews might not be in English, so we will also detect the language of each review using the textcat package.
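
A sketch of those steps is below; the exact string-cleaning patterns are assumptions about what the scraped text looks like:

pacman::p_load(dplyr, stringr, textcat)

rs.data <- rs.data %>%
  #Remove duplicate rows
  distinct() %>%
  #Strip newlines and the trailing '...more', then tidy whitespace
  mutate(review = str_replace_all(review, "\\n", " "),
         review = str_replace_all(review, fixed("...more"), ""),
         review = str_trim(review),
         #Detect the language of each review
         language = textcat(review))

table(rs.data$language)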

##    catalan    english     german      malay portuguese      scots 
##          1        293          1          1          1          2 
##      welsh 
##          1

4 Outputs

4.1 Frequency analysis

One of the most straightforward analyses of social media data is to look at frequency - this could be the frequency of particular words within a corpus or the frequency of social media posts over time.
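
For example, to plot the number of Amazon reviews over time we can parse the ‘on 7 December 2016’ date strings and draw a histogram. This is a sketch: amazon.data is assumed to be the data frame produced by the earlier rvest scraping section (or loaded back from disk):

pacman::p_load(ggplot2, lubridate, stringr, dplyr)

amazon.data %>%
  #Convert 'on 7 December 2016' into a proper date
  mutate(date = dmy(str_replace(as.character(date), "on ", ""))) %>%
  #Plot the number of reviews over time
  ggplot(aes(x = date)) +
  geom_histogram() +
  labs(x = "Review date", y = "Number of reviews")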

## Loading required package: pacman
## Observations: 870
## Variables: 3
## $ rating <fctr> 2, 3, 5, 5, 4, 2, 5, 2, 3, 5, 5, 5, 5, 1, 1, 5, 5, 3, ...
## $ review <fctr> NOT MY CUP OF TEA, Difficult to follow at times, thank...
## $ date   <fctr> on 7 December 2016, on 5 December 2016, on 30 November...
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4.2 Maps
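
One way to map the collected tweets with ggplot2 is sketched below, assuming christmas.tweets is the data frame parsed from the stream above. Only the small proportion of tweets with coordinates will be drawn, hence the warning about removed rows:

pacman::p_load(ggplot2, maps)

#Background world map
world <- map_data("world")

ggplot() +
  #Draw a simple world map as the background
  geom_polygon(data = world,
               aes(x = long, y = lat, group = group),
               fill = "grey90", colour = "grey70") +
  #Add a point for each geotagged tweet
  geom_point(data = christmas.tweets,
             aes(x = as.numeric(lon), y = as.numeric(lat)),
             colour = "red", alpha = 0.5, size = 1) +
  coord_quickmap()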

## Warning: Removed 16438 rows containing missing values (geom_point).

5 Misc

5.1 Encrypted Credentials

Throughout this document I’ve used login credentials that are not included in the script - they’re encrypted using public/private key encryption.

To encrypt you need to generate a private and public key. The public key can be shared and is used to encrypt the data. The private key should never be shared and is used to decrypt the data.

Here are some example credentials to encrypt.
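
For example (these are just dummy values):

#Some dummy credentials stored in a list
credentials <- list(guser = "username",
                    gpass = "password")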

To encrypt or decrypt do the following.
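
The original code isn’t shown, and the document doesn’t say which package was used; the sketch below uses the openssl package, which provides RSA key generation, encryption and decryption:

pacman::p_load(openssl)

#Generate a private key and derive the matching public key
key    <- rsa_keygen()
pubkey <- key$pubkey

#Encrypt: serialise the credentials list to raw bytes, then encrypt with the PUBLIC key
secret <- rsa_encrypt(serialize(credentials, NULL), pubkey)

#Decrypt: decrypt with the PRIVATE key and unserialise back into the list
unserialize(rsa_decrypt(secret, key))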

## $guser
## [1] "username"
## 
## $gpass
## [1] "password"

You can save the encrypted blob as an RDS or in a script using dput (as it’s useless without the key).
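
For example:

#Print the encrypted blob as R code that can be pasted into a script
dput(secret)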

## as.raw(c(0xcf, 0x1d, 0x95, 0x39, 0x37, 0xe6, 0x0f, 0xc0, 0x4c, 
## 0x88, 0x3b, 0xd8, 0x6f, 0x72, 0x05, 0x81, 0x70, 0x92, 0x1a, 0x88, 
## 0x45, 0x14, 0x61, 0x69, 0x66, 0x15, 0x28, 0x3a, 0x1a, 0x04, 0xf7, 
## 0x56, 0xdd, 0xb3, 0xa2, 0x79, 0xdb, 0xf8, 0x24, 0xc1, 0xe3, 0x4a, 
## 0x22, 0x4f, 0xe4, 0x1f, 0x81, 0xe5, 0xbf, 0x80, 0xf9, 0x7b, 0xde, 
## 0xca, 0xda, 0x96, 0x9e, 0x1a, 0xea, 0x76, 0xc8, 0x8e, 0xa3, 0xf4, 
## 0xc0, 0xa2, 0xa1, 0xcc, 0xcc, 0x91, 0x66, 0x27, 0x07, 0x37, 0xf7, 
## 0x64, 0x61, 0x0a, 0xe0, 0xa9, 0xfb, 0x1c, 0x7c, 0xc5, 0xb2, 0x52, 
## 0x96, 0x61, 0xc6, 0x12, 0xbe, 0x7e, 0x44, 0x9a, 0xc7, 0x3c, 0xda, 
## 0x2a, 0x0e, 0x4e, 0xcb, 0x20, 0xbb, 0xee, 0xff, 0x0d, 0xad, 0x97, 
## 0xca, 0x53, 0xaa, 0x87, 0x20, 0x4a, 0x23, 0xf0, 0x86, 0x04, 0x06, 
## 0xa2, 0x12, 0xd6, 0x63, 0x36, 0x05, 0x3b, 0x9e, 0xc9, 0xa5, 0xa6, 
## 0x6c, 0xa8, 0x08, 0x70, 0xb1, 0x20, 0xa9, 0xe0, 0x4c, 0x0b, 0xb0, 
## 0xa7, 0x73, 0x0c, 0xa0, 0x37, 0xe5, 0x28, 0xb7, 0x5a, 0x25, 0xb5, 
## 0xae, 0xb4, 0x59, 0xc8, 0xdd, 0x04, 0x1a, 0x3c, 0x98, 0x64, 0x4e, 
## 0xfa, 0xe8, 0xb7, 0x14, 0x47, 0x70, 0xf4, 0xb4, 0x43, 0x96, 0xfe, 
## 0x68, 0x96, 0x52))