Social media research is becoming increasingly popular and there are many freely available (and open source) tools for it. This document outlines some popular methods in the R programming language and gives example code for each section. The intention is that this document will become a basic 'how to' guide for the most popular tools.
For a more general introduction to the R language see https://www.datacamp.com/ for an interactive tutorial, or start an interactive course from within R using the swirl package. Enter the following into the R console:
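#Install swirl if it isn't already available, then start it
install.packages("swirl")
library(swirl)
swirl() #launches the interactive course menu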
In this document I will use the pacman package manager, as it automatically fetches packages that aren't installed and loads them with one command: p_load(packagename). If you don't want to use pacman you can install packages in the usual way with install.packages("packagename") and then require(packagename).
There are many different sources of social media data; this document works through a few of them, including review sites (Amazon, Goodreads), Google Trends and Twitter.
This section will explore how to pull data from online platforms for later processing in R.
Web scraping generally refers to the collection of information from websites. Popular websites (such as Facebook or Twitter) often provide APIs (application programming interfaces) to make interacting with online platforms more straightforward. However, many websites do not provide APIs and you will need to write a script to extract the information directly from the website. Two packages are often used for this in R: rvest and RSelenium.
The rvest method is the simplest and doesn't require any external applications (meaning you can run it on laptops with restricted privileges). The main limitation is that rvest doesn't simulate a full browser; rather, it just downloads the HTML and CSS and provides tools to extract information from those documents.
In this example I’ll scrape reviews from the Amazon.co.uk website: https://www.amazon.co.uk/product-reviews/1782118691/ref=cm_cr_dp_see_all_btm?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent
#Load the packages we're going to use
pacman::p_load(rvest, #for scraping and processing HTML
stringr, #for fast string processing
httr, #to set user_agent string
pander) #nice table formatting
#First, let's generate some page reading times (random waits with a mean of 40 seconds and sd of 5)
reading.times<-rnorm(10000, 40, 5)
range(reading.times)
#Set up some variables
tmp <- data.frame()
amazon.data<-data.frame()
counter <- 0
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
#Open an HTML session
html.session <-
html_session("https://www.amazon.co.uk/product-reviews/1782118691/ref=cm_cr_dp_see_all_btm?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent",
user_agent(uastring)
)
The above section of code establishes a web session. At this point it's worth explaining what the user_agent string does and why we generate random times to wait between each scrape: both make the rvest session appear like an actual user, which helps to make sure the website behaves normally.
In the next section we will set up a loop that will go through each page and grab the text based on the CSS 'nodes'. The easiest way to find these nodes is with a CSS selector tool, such as SelectorGadget: http://selectorgadget.com/
# This while loop will continue until there isn't
# a link with the css class li.a-last. In other words
# until there isn't a 'next page' link.
while(!is.na(read_html(html.session) %>%
               html_node("li.a-last") %>%
               html_node("a") %>%
               html_attr("href"))){
#For each page, grab the review text, rating and date.
tmp.review<-
read_html(html.session) %>%
html_nodes(".review-text")%>%
html_text()
tmp.rating<-
read_html(html.session) %>%
html_nodes(".review-rating") %>%
html_text() %>%
str_extract("\\d") %>% #Just grab the digit
.[3:12] #Amazon features two reviews at the top of each page, skip these
tmp.rdate<-
read_html(html.session) %>%
html_nodes(".review-date") %>%
html_text() %>%
.[3:12]
#combine the data into a data frame
tmp<-data.frame(tmp.rating, tmp.review, tmp.rdate)
amazon.data <- rbind(amazon.data, tmp)
rm(tmp)
counter <- counter + 1
#Print some info to the console.
print(counter)
pander(amazon.data[length(amazon.data$tmp.rating),])
print('waiting for...')
print(reading.times[counter])
print('seconds')
#Wait for a while before continuing to next page
Sys.sleep(reading.times[counter])
#Navigate to the next page
html.session <-
html.session %>%
follow_link("Next")
}
#Clean up the data.frame
amazon.data <-
data.frame(rating = amazon.data$tmp.rating,
review = amazon.data$tmp.review,
date = amazon.data$tmp.rdate)
#Save the data as RDS
saveRDS(amazon.data, 'data/amazon_review_data.RDS')
Let's look at a sample of the data:
pacman::p_load(dplyr, tidyr) #Load the data wrangling packages
amazon.data <- readRDS('data/amazon_review_data.RDS')
amazon.data %>% sample_n(3) %>% pander()
 | rating | review | date
---|---|---|---
160 | 3 | its okay | on 18 July 2015
433 | 5 | Brilliant book. I love the characters and had a great time reading it. | on 22 August 2014
50 | 4 | Good story | on 15 April 2016
RSelenium uses the popular Selenium server to simulate an entire browser. This is more powerful, but more complicated, than rvest. You only really need to use it when interacting with websites that have dynamic content.
If the website in question has any interactive elements that download HTML on the fly, then the rvest method will not be able to extract that information. This is common on interactive websites - for example, when more content is loaded as you reach the bottom of the current page.
How might this impact web scraping? A website that contains reviews may only display the first few lines of each review, with a button to 'read more'. When a user clicks the read more button, a small piece of javascript fetches the content and amends the HTML on the fly. Thus, without the interactivity, the rest of the review is simply not present in the HTML. At the time of writing rvest cannot handle this - if you need to scrape interactive websites, see the RSelenium section below. In most cases, however, all text is present in the HTML and rvest is more than enough.
The easiest way to get a Selenium server up and running is with a Docker image. Set up Docker (docker.com), then run docker run -d -p 4445:4444 selenium/standalone-firefox from the terminal.
A debugging image is also available that can be viewed over VNC: docker run -d -p 4444:4444 -p 5901:5900 selenium/standalone-firefox-debug (see: https://github.com/SeleniumHQ/docker-selenium).
You can also launch these commands from within R, at least on OSX.
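For example (a minimal sketch, assuming the docker binary is on your PATH):
system("docker run -d -p 4445:4444 selenium/standalone-firefox")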
You can then connect to the server using RSelenium. To do this, you just specify the external port (defined in the docker command as external:internal).
pacman::p_load(RSelenium, rvest)
#Establish the connection to the Selenium server running in docker
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "firefox") #match the browser in the docker image
# Interact with the server by calling nested functions;
# remDr$ will autocomplete with the various options in RStudio
remDr$open()
#It's a good idea to make sure the window is full size, so sites render normally.
remDr$maxWindowSize()
The rest of the process is broadly similar to that of rvest, with the added possibility of clicking elements on the page and interacting with the website more fully.
For a guide see: https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
In this section I will scrape reviews of the book 'Nudge' from Goodreads. Goodreads is an example of a website that updates content using javascript - when you click 'next' on the page of reviews it doesn't refresh the page, but pulls the new reviews into the current page. rvest cannot handle this kind of interactivity, but RSelenium can.
The process in RSelenium is slightly more convoluted than the rvest equivalent. To extract information you must first find the elements, then extract the text or HTML from those elements. Note the difference between findElement (which returns the first occurrence) and findElements (which returns all occurrences as a list). As you will most often be dealing with more than one element, information will be returned as a list. To get the text or HTML from all elements in a list, use lapply, then unlist to convert to usable data.
pacman::p_load(stringr, dplyr)
#Navigate to the page
remDr$navigate(url = "https://www.goodreads.com/book/show/2527900.Nudge")
#Set nextpage tracker as TRUE for the while loop & start a counter
rs.nextpage.exists=TRUE
counter = 0
#Setup the dataframe
rs.data<-
data.frame(rating = character(),
date = character(),
review = character(),
stringsAsFactors = FALSE
)
#Run through each page until there aren't any more reviews...
while(rs.nextpage.exists==TRUE){
#Find the review 'elements'.
rs.reviews.element<-
remDr$findElements(using = 'css',
value = '#reviews .reviewText.stacked')
#Extract the relevant HTML from those elements
rs.reviews.html<-
lapply(rs.reviews.element, function(x){
x$getElementAttribute("outerHTML")[[1]]})
#Use rvest to parse the HTML into text.
rs.reveiws.text<-
unlist(
lapply(rs.reviews.html,
function(x){
read_html(x) %>%
html_text(.)
}
))
#Grab the dates
rs.date.elements <-
remDr$findElements(using = 'css',
value = '#reviews .reviewDate')
rs.date.text<-
lapply(rs.date.elements,
function(x){
x$getElementText()
})
rs.date.text<-
unlist(rs.date.text)
#Grab the star rating
rs.rating.elements <-
remDr$findElements(using = 'css',
value = '#reviews .reviewHeader')
rs.rating.text<-unlist(
lapply(rs.rating.elements,
function(x){
x$getElementText()
})
)
#Let's define a quick function that looks for keywords in the string and returns a score.
goodreads.detect<-
function(x){
ifelse(str_detect(x,'amazing'), 5,
ifelse(str_detect(x, 'really liked'),4,
ifelse(str_detect(x,'liked it'), 3,
ifelse(str_detect(x, 'ok'), 2,
ifelse(str_detect(x,'did not like'),1,
NA)))))}
rs.rating.text<-
goodreads.detect(rs.rating.text)
#Not all ratings have reviews, but ratings with reviews always come first, so make sure that we fill in missing text variables with NA.
rs.review.count<-length(rs.reveiws.text)
rs.rating.count<-length(rs.rating.text)
if(rs.rating.count-rs.review.count>0){
rs.reveiws.text<-
c(rs.reveiws.text,
rep(NA, rs.rating.count-rs.review.count))
}
#Place the current pages data into a data.frame
rs.session.combined<-data.frame(rating = rs.rating.text,
date = rs.date.text,
review = rs.reveiws.text,
stringsAsFactors = FALSE
)
#Print it to the console
pander::pander(rs.session.combined)
#If the last review on this page is the same as the last one already collected, the page hasn't changed - we must have reached the end, so stop.
if(counter>0){
if(last(rs.data$review)==
last(rs.session.combined$review)){
print('Last review is the same as the previous one... quitting')
rs.nextpage.exists=FALSE
}
}
#Add the new page of data to the already acquired data
rs.data<-
rbind(rs.data,
rs.session.combined)
#How many collected so far?
print(length(rs.data$rating))
#Save on each iteration, just in case it crashes.
saveRDS(rs.data, file="data/nudge_Goodreads.RDS")
#Find the next page button and click it
rs.nextpage<-
remDr$findElement(using='css',
value='.next_page')
counter <- counter + 1
#Is there a valid next link?
if(is.null(unlist(
rs.nextpage$getElementAttribute(attrName = 'href')))
){
#if not, exit the while loop
print('No next page button, quitting...')
rs.nextpage.exists=FALSE
}else{
#If there is, click the link and wait
rs.nextpage$clickElement()
Sys.sleep(20)
}
}
#Check for any duplicates
rs.data<-
rs.data %>%
distinct()
rs.data %>% glimpse()
saveRDS(rs.data, file="data/nudge_Goodreads.RDS")
You may notice that for some of the above scraping I've extracted the attributes (with $getElementAttribute) and then parsed them with rvest, whereas for other elements I've extracted the text directly (with $getElementText). There is a subtle difference between these methods. $getElementText grabs the text, but with one important caveat: it only grabs the text that is drawn on the screen. So, if there is a 'read more' drop-down (or something similar), $getElementText will only return the displayed text. This isn't great when you want to scrape the entire review. Luckily, $getElementAttribute allows you to grab information from the HTML document directly. Grabbing the outerHTML attribute of the review element returns all the HTML of that review, which can then be parsed using the rvest methods.
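For example (a minimal sketch, where elem stands for a single element returned by remDr$findElement()):
elem$getElementText()[[1]] #only the text currently rendered on screen
elem$getElementAttribute("outerHTML")[[1]] %>% #the element's full HTML...
  read_html() %>%
  html_text() #...parsed to text with rvest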
Google Trends can be used as a simple measure of search popularity. The main issue is that the values are relative, so they can only be compared within a single query. The scale runs from 0 to 100, with 100 being the point of peak popularity.
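As an illustration, here is a minimal sketch using the gtrendsR package (one of several ways to query Google Trends from R):
pacman::p_load(gtrendsR)
#Compare relative search interest for two terms; hits are scaled 0-100 within this query
trends <- gtrends(keyword = c("nudge", "behavioural economics"))
head(trends$interest_over_time)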
Cleaning and manipulating data with the dplyr and tidyr packages is summarized here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
First load the data - rather than putting it in a data.frame (which can result in slow printing to the console) we'll put it in a tbl_df ('tibble'), which is quicker.
pacman::p_load(tidyr, dplyr, ggplot2)
#Load the dataset
nudge.goodreads<-
tbl_df(readRDS('data/nudge_Goodreads.RDS'))
#Take a quick look at it
glimpse(nudge.goodreads)
## Observations: 480
## Variables: 3
## $ rating <dbl> 4, 2, 1, 5, 4, 3, 2, 1, 4, 2, 4, 5, 2, 4, 2, 5, 4, 5, 5...
## $ date <chr> "Mar 22, 2009", "Oct 18, 2009", "Jan 09, 2012", "Oct 22...
## $ review <chr> "\n \n \nThis one took me longe...
## # A tibble: 6 x 3
## rating date review
## <dbl> <chr> <chr>
## 1 4.00 Mar 22, 2009 "\n \n \nThis one took me l…
## 2 2.00 Oct 18, 2009 "\n \n \nI don't understand…
## 3 1.00 Jan 09, 2012 "\n \n \nThis comes with a …
## 4 5.00 Oct 22, 2008 "\n \n \nThis is a terrific…
## 5 4.00 Aug 18, 2008 "\n \n \nI second-guessed m…
## 6 3.00 Jun 30, 2008 "\n \n \nI really like a lo…
## [1] "\n \n \nThis one took me longer to read that is reasonable for a book of its length or the clear style it is written in. I mean, such a simply written text of 250 pages ought to have finished in no time. The problem was that I don’t live in the US and so many of the examples mad"
## [1] "n superannuation, but superannuation would be a tax and would be run by the Australian Tax Office. But don’t get me started…This would be an even more interesting book if you live in America, given the nature of the examples, but either way, this is still worth a look.\n ...more\n\n \n "
A few things can be done to clean up this dataset. First, remove any duplicates. Second, there are newline ('\n') characters in the review text and each review ends with '...more'. Finally, some reviews might not be in English, so we will also detect the language of each review using textcat.
pacman::p_load(stringr, textcat)
nudge.goodreads<-
nudge.goodreads %>%
distinct() %>% #remove any duplicates
mutate(review = str_replace_all(review, '\n|\\.\\.\\.more', ''), #remove newlines and the literal '...more'
review = str_trim(review), #Remove whitespace
lang = factor(textcat(review)))
summary(nudge.goodreads$lang)
## catalan english german malay portuguese scots
## 1 293 1 1 1 2
## welsh
## 1
One of the most straightforward analyses of social media data is to look at frequency - this could be the frequency of particular words within a corpus, or the frequency of social media output over time.
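First, reload the Amazon review data saved earlier (a minimal sketch of this step; the glimpse output follows):
pacman::p_load(dplyr, ggplot2)
amazon.data <- readRDS('data/amazon_review_data.RDS')
glimpse(amazon.data)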
## Observations: 870
## Variables: 3
## $ rating <fctr> 2, 3, 5, 5, 4, 2, 5, 2, 3, 5, 5, 5, 5, 1, 1, 5, 5, 3, ...
## $ review <fctr> NOT MY CUP OF TEA, Difficult to follow at times, thank...
## $ date <fctr> on 7 December 2016, on 5 December 2016, on 30 November...
#Need to correct data formats
p_load(lubridate) #Date conversion from strings
amazon.data<-
amazon.data%>%
mutate(rating = as.integer(rating),
review = as.character(review),
date = dmy(date))
ggplot(amazon.data, aes(x=date))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
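Word frequencies within the review corpus can be counted in a similar spirit. Here is a minimal sketch using the tidytext package (an assumption - any tokeniser would do):
pacman::p_load(tidytext, dplyr)
amazon.data %>%
  unnest_tokens(word, review) %>% #one row per word
  anti_join(stop_words, by = "word") %>% #drop common stop words
  count(word, sort = TRUE) %>% #word frequencies, most common first
  head(10)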
#devtools::install_github("dkahle/ggmap")
pacman::p_load(ggmap)
#Plot the christmas tweets
tweets.df <- readRDS('data/twitter_stream_dataset.RDS')
qmplot(lon, lat,
data = tweets.df,
maptype='toner-background',
darken = .7,
color=I('white'),
alpha=I(.33))
## Warning: Removed 16438 rows containing missing values (geom_point).
Throughout this document I’ve used login credentials that are not included in the script - they’re encrypted using public/private key encryption.
To encrypt you need to generate a private and public key. The public key can be shared and is used to encrypt the data. The private key should never be shared and is used to decrypt the data.
pacman::p_load(sodium) #sodium encryption library
#Generate the keys
sodium.privatekey <- keygen()
sodium.publickey <- pubkey(sodium.privatekey)
#Save the private key, no need to worry about the public one, unless there is more you want to encrypt at a later date using the same pub/private key combo.
saveRDS(sodium.privatekey, file="key.RDS")
Here are some example credentials to encrypt.
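A minimal sketch of such a credentials object (a named list with placeholder values):
cred <- list(guser = "username",
             gpass = "password")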
To encrypt or decrypt do the following.
#To encrypt
encrypted.content<-
simple_encrypt(serialize(
cred, connection = NULL), sodium.publickey)
#How to decrypt
unserialize(simple_decrypt(encrypted.content, sodium.privatekey))
## $guser
## [1] "username"
##
## $gpass
## [1] "password"
You can save the encrypted content as an RDS file or embed it in a script using dput (as it's useless without the key).
## as.raw(c(0xcf, 0x1d, 0x95, 0x39, 0x37, 0xe6, 0x0f, 0xc0, 0x4c,
## 0x88, 0x3b, 0xd8, 0x6f, 0x72, 0x05, 0x81, 0x70, 0x92, 0x1a, 0x88,
## 0x45, 0x14, 0x61, 0x69, 0x66, 0x15, 0x28, 0x3a, 0x1a, 0x04, 0xf7,
## 0x56, 0xdd, 0xb3, 0xa2, 0x79, 0xdb, 0xf8, 0x24, 0xc1, 0xe3, 0x4a,
## 0x22, 0x4f, 0xe4, 0x1f, 0x81, 0xe5, 0xbf, 0x80, 0xf9, 0x7b, 0xde,
## 0xca, 0xda, 0x96, 0x9e, 0x1a, 0xea, 0x76, 0xc8, 0x8e, 0xa3, 0xf4,
## 0xc0, 0xa2, 0xa1, 0xcc, 0xcc, 0x91, 0x66, 0x27, 0x07, 0x37, 0xf7,
## 0x64, 0x61, 0x0a, 0xe0, 0xa9, 0xfb, 0x1c, 0x7c, 0xc5, 0xb2, 0x52,
## 0x96, 0x61, 0xc6, 0x12, 0xbe, 0x7e, 0x44, 0x9a, 0xc7, 0x3c, 0xda,
## 0x2a, 0x0e, 0x4e, 0xcb, 0x20, 0xbb, 0xee, 0xff, 0x0d, 0xad, 0x97,
## 0xca, 0x53, 0xaa, 0x87, 0x20, 0x4a, 0x23, 0xf0, 0x86, 0x04, 0x06,
## 0xa2, 0x12, 0xd6, 0x63, 0x36, 0x05, 0x3b, 0x9e, 0xc9, 0xa5, 0xa6,
## 0x6c, 0xa8, 0x08, 0x70, 0xb1, 0x20, 0xa9, 0xe0, 0x4c, 0x0b, 0xb0,
## 0xa7, 0x73, 0x0c, 0xa0, 0x37, 0xe5, 0x28, 0xb7, 0x5a, 0x25, 0xb5,
## 0xae, 0xb4, 0x59, 0xc8, 0xdd, 0x04, 0x1a, 0x3c, 0x98, 0x64, 0x4e,
## 0xfa, 0xe8, 0xb7, 0x14, 0x47, 0x70, 0xf4, 0xb4, 0x43, 0x96, 0xfe,
## 0x68, 0x96, 0x52))
2.2 Social Media APIs
2.2.1 Twitter
The Twitter API has two main methods for collecting tweets: search (using the twitteR package) or stream (using the streamR package).
2.2.1.1 Search
Before you can log in you must first authorise with Twitter. You will need an API key and an access token, each of which has a key and a secret. Both can be found by creating an app at https://apps.twitter.com/
You can then initiate a search. Keeping with the theme, I will collect the last 1,000 tweets containing the phrase “life of pi” written in English.
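A minimal sketch using the twitteR package (the keys and secrets below are placeholders for your own credentials, which are best kept encrypted as described in the encryption section above):
pacman::p_load(twitteR)
#Authorise with the API key/secret and access token/secret
setup_twitter_oauth(consumer_key = "API_KEY",
                    consumer_secret = "API_SECRET",
                    access_token = "ACCESS_TOKEN",
                    access_secret = "ACCESS_SECRET")
#Collect the last 1,000 English-language tweets containing the phrase "life of pi"
pi.tweets <- searchTwitter('"life of pi"', n = 1000, lang = "en")
pi.tweets.df <- twListToDF(pi.tweets) #convert the list of status objects to a data frame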
Quick look at the data:
2.2.1.2 Stream
According to Twitter: "The streams offer samples of the public data flowing through Twitter." This should capture all public information on a current topic in real time. Before we can begin capturing data, we must first log into Twitter. To do this you need an API key and secret. You can get these by creating an app at https://apps.twitter.com/
The stream API lets you capture tweets as they occur. This can be very useful for monitoring a topic over a long period of time. As I don't want to spend a long time waiting for tweets to come in for this example, I'll collect 10 minutes' worth of tweets about Christmas (it's the 19th of December, so there should be plenty of tweets).
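A minimal sketch using the streamR package (my_oauth is assumed to be a previously created OAuth credential, e.g. made with the ROAuth package):
pacman::p_load(streamR)
#Capture 10 minutes (600 seconds) of tweets mentioning 'christmas'
filterStream(file.name = "christmas_tweets.json",
             track = "christmas",
             timeout = 600,
             oauth = my_oauth) #my_oauth: pre-made OAuth object (assumption)
#Parse the captured JSON into a data frame and save it for later use
tweets.df <- parseTweets("christmas_tweets.json")
saveRDS(tweets.df, 'data/twitter_stream_dataset.RDS')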
Quick look at the data - the parsed tweets contain the following columns:
text, retweet_count, favorited, truncated, id_str, in_reply_to_screen_name, source, retweeted, created_at, in_reply_to_status_id_str, in_reply_to_user_id_str, lang, listed_count, verified, location, user_id_str, description, geo_enabled, user_created_at, statuses_count, followers_count, favourites_count, protected, user_url, name, time_zone, user_lang, utc_offset, friends_count, screen_name, country_code, country, place_type, full_name, place_name, place_id, place_lat, place_lon, lat, lon, expanded_url and url