Social media research is becoming increasingly popular and there are many freely available (and open source) tools for it. This document outlines some popular methods in the R programming language and gives example code for each section. The intention is that this document will become a basic 'how to' guide for the most popular tools.
For a more general introduction to the R language see https://www.datacamp.com/ for an interactive tutorial, or start an interactive course from within R using the swirl package. Enter the following into the R console:
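#Install swirl if it isn't already available, then start it
install.packages("swirl")
library(swirl)
swirl() #launches the interactive course menu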
In this document I will use the pacman package manager, as it automatically fetches packages that aren't installed and loads them with one command: p_load(packagename). If you don't want to use pacman you can install packages in the usual way with install.packages("packagename") and then require(packagename).
There are many different sources of social media data; this document works through a few of them, including review sites (Amazon, Goodreads), Google Trends and Twitter.
This section will explore how to pull data from online platforms for later processing in R.
Web scraping generally refers to the collection of information from websites. Popular websites (such as Facebook or Twitter) often provide APIs (application programming interfaces) to make interacting with online platforms more straightforward. However, many websites do not provide APIs and you will need to write a script to extract the information directly from the website. Two packages are often used for this in R: rvest and RSelenium.
The rvest method is the simplest and doesn't require any external applications (meaning you can run it on laptops with restricted privileges). The main limitation is that rvest doesn't simulate a full browser; rather, it just downloads the HTML and CSS and provides tools to extract information from those documents.
In this example I’ll scrape reviews from the Amazon.co.uk website: https://www.amazon.co.uk/product-reviews/1782118691/ref=cm_cr_dp_see_all_btm?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent
#Load the packages we're going to use
pacman::p_load(rvest, #for scraping and processing HTML
stringr, #for fast string processing
httr, #to set user_agent string
pander) #nice table formatting
#First, let's generate some page reading times (random waits with a mean of 40 seconds and sd of 5)
reading.times<-rnorm(10000, 40, 5)
range(reading.times)
#Set up some variables
tmp <- data.frame()
amazon.data<-data.frame()
counter <- 0
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
#Open an HTML session
html.session <-
html_session("https://www.amazon.co.uk/product-reviews/1782118691/ref=cm_cr_dp_see_all_btm?ie=UTF8&reviewerType=all_reviews&showViewpoints=1&sortBy=recent",
user_agent(uastring)
)
The above section of code establishes a web session. At this point it's worth explaining what the user_agent string does and why we generate random times to wait between each scrape: both make the rvest session appear like an actual user, which helps to make sure the website behaves normally.
In the next section we will set up a loop that will go through each page and grab the text based on the CSS 'nodes'. The easiest way to find these nodes is with a CSS selector tool, such as SelectorGadget: http://selectorgadget.com/
# This while loop will continue until there isn't
# a link with the css class li.a-last. In other words
# until there isn't a 'next page' link.
while(!is.na(read_html(html.session) %>%
               html_node("li.a-last") %>%
               html_node("a") %>%
               html_attr("href"))){
#For each page, grab the review text, rating and date.
tmp.review<-
read_html(html.session) %>%
html_nodes(".review-text")%>%
html_text()
tmp.rating<-
read_html(html.session) %>%
html_nodes(".review-rating") %>%
html_text() %>%
str_extract("\\d") %>% #Just grab the digit
.[3:12] #Amazon features two reviews at the top of each page, skip these
tmp.rdate<-
read_html(html.session) %>%
html_nodes(".review-date") %>%
html_text() %>%
.[3:12]
#combine the data into a data frame
tmp<-data.frame(tmp.rating, tmp.review, tmp.rdate)
amazon.data <- rbind(amazon.data, tmp)
rm(tmp)
counter <- counter + 1
#Print some info to the console.
print(counter)
pander(amazon.data[length(amazon.data$tmp.rating),])
print('waiting for...')
print(reading.times[counter])
print('seconds')
#Wait for a while before continuing to next page
Sys.sleep(reading.times[counter])
#Navigate to the next page
html.session <-
html.session %>%
follow_link("Next")
}
#Clean up the data.frame
amazon.data <-
data.frame(rating = amazon.data$tmp.rating,
review = amazon.data$tmp.review,
date = amazon.data$tmp.rdate)
#Save the data as RDS
saveRDS(amazon.data, 'data/amazon_review_data.RDS')
Let's look at a sample of the data:
pacman::p_load(dplyr, tidyr) #Load the data wrangling packages
amazon.data <- readRDS('data/amazon_review_data.RDS')
amazon.data %>% sample_n(3) %>% pander()
 | rating | review | date
---|---|---|---
160 | 3 | its okay | on 18 July 2015
433 | 5 | Brilliant book. I love the characters and had a great time reading it. | on 22 August 2014
50 | 4 | Good story | on 15 April 2016
RSelenium uses the popular Selenium server to simulate an entire browser. This is more powerful, but more complicated, than rvest. You only really need to use it when interacting with websites that have dynamic content.
If the website in question has any interactive elements that download HTML on the fly, then the rvest method will not be able to extract that information. This is common on interactive websites - for example, when more content is loaded as you reach the bottom of the current page.
How might this impact web scraping? A website that contains reviews may only display the first few lines of each review, with a button to 'read more'. When a user clicks the read more button, a small piece of javascript fetches the content and amends the HTML on the fly. Thus, without the interactivity, the rest of the review is simply not present in the HTML. At the time of writing rvest cannot handle this - if you need to scrape interactive websites, see the RSelenium section below. In most cases, however, all text is present in the HTML and rvest is more than enough.
The easiest way to get a Selenium server up and running is with a Docker image. Set up Docker (docker.com), then run docker run -d -p 4445:4444 selenium/standalone-firefox from the terminal.
A debugging image is also available that can be viewed over VNC: docker run -d -p 4444:4444 -p 5901:5900 selenium/standalone-firefox-debug (see: https://github.com/SeleniumHQ/docker-selenium).
You can also launch these commands from within R, at least on OSX.
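For example (a minimal sketch, assuming the docker binary is on your PATH):
system("docker run -d -p 4445:4444 selenium/standalone-firefox")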
You can then connect to the server using RSelenium. To do this, you just specify the external port (defined in the docker command as external:internal).
pacman::p_load(RSelenium, rvest)
#Establish the connection to the Selenium server running in docker
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "firefox") #match the browser in the docker image
# Interact with the server by calling nested functions;
# remDr$ will autocomplete with the various options in RStudio
remDr$open()
#It's a good idea to make sure the window is full size, so sites render normally.
remDr$maxWindowSize()
The rest of the process is broadly similar to that of rvest, with the added possibility of clicking elements on the page and interacting with the website more fully.
For a guide see: https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
In this section I will scrape reviews of the book 'Nudge' from Goodreads. Goodreads is an example of a website that updates content using javascript - when you click 'next' on the page of reviews it doesn't refresh the page, but pulls the new reviews into the current page. rvest cannot handle this kind of interactivity, but RSelenium can.
The process in RSelenium is slightly more convoluted than the rvest equivalent. To extract information you must first find the elements, then extract the text or HTML from those elements. Note the difference between findElement (which returns the first occurrence) and findElements (which returns all occurrences as a list). As you will most often be dealing with more than one element, information will be returned as a list. To get the text or HTML from all elements in a list, use lapply, then unlist to convert to usable data.
pacman::p_load(stringr, dplyr)
#Navigate to the page
remDr$navigate(url = "https://www.goodreads.com/book/show/2527900.Nudge")
#Set nextpage tracker as TRUE for the while loop & start a counter
rs.nextpage.exists=TRUE
counter = 0
#Setup the dataframe
rs.data<-
data.frame(rating = character(),
date = character(),
review = character(),
stringsAsFactors = FALSE
)
#Run through each page until there aren't any more reviews...
while(rs.nextpage.exists==TRUE){
#Find the review 'elements'.
rs.reviews.element<-
remDr$findElements(using = 'css',
value = '#reviews .reviewText.stacked')
#Extract the relevant HTML from those elements
rs.reviews.html<-
lapply(rs.reviews.element, function(x){
x$getElementAttribute("outerHTML")[[1]]})
#Use rvest to parse the HTML into text.
rs.reveiws.text<-
unlist(
lapply(rs.reviews.html,
function(x){
read_html(x) %>%
html_text(.)
}
))
#Grab the dates
rs.date.elements <-
remDr$findElements(using = 'css',
value = '#reviews .reviewDate')
rs.date.text<-
lapply(rs.date.elements,
function(x){
x$getElementText()
})
rs.date.text<-
unlist(rs.date.text)
#Grab the star rating
rs.rating.elements <-
remDr$findElements(using = 'css',
value = '#reviews .reviewHeader')
rs.rating.text<-unlist(
lapply(rs.rating.elements,
function(x){
x$getElementText()
})
)
#Let's define a quick function that looks for keywords in the string and returns a score.
goodreads.detect<-
function(x){
ifelse(str_detect(x,'amazing'), 5,
ifelse(str_detect(x, 'really liked'),4,
ifelse(str_detect(x,'liked it'), 3,
ifelse(str_detect(x, 'ok'), 2,
ifelse(str_detect(x,'did not like'),1,
NA)))))}
rs.rating.text<-
goodreads.detect(rs.rating.text)
#Not all ratings have reviews, but ratings with reviews always come first, so make sure that we fill in missing text variables with NA.
rs.review.count<-length(rs.reveiws.text)
rs.rating.count<-length(rs.rating.text)
if(rs.rating.count-rs.review.count>0){
rs.reveiws.text<-
c(rs.reveiws.text,
rep(NA, rs.rating.count-rs.review.count))
}
#Place the current pages data into a data.frame
rs.session.combined<-data.frame(rating = rs.rating.text,
date = rs.date.text,
review = rs.reveiws.text,
stringsAsFactors = FALSE
)
#Print it to the console
pander::pander(rs.session.combined)
#If the last review on this page is the same as the last one already collected, the page hasn't changed - we must have reached the end, so stop.
if(counter>0){
if(last(rs.data$review)==
last(rs.session.combined$review)){
print('Last review is the same as the previous one... quitting')
rs.nextpage.exists=FALSE
}
}
#Add the new page of data to the already acquired data
rs.data<-
rbind(rs.data,
rs.session.combined)
#How many collected so far?
print(length(rs.data$rating))
#Save on each iteration, just in case it crashes.
saveRDS(rs.data, file="data/nudge_Goodreads.RDS")
#Find the next page button and click it
rs.nextpage<-
remDr$findElement(using='css',
value='.next_page')
counter <- counter + 1
#Is there a valid next link?
if(is.null(unlist(
rs.nextpage$getElementAttribute(attrName = 'href')))
){
#if not, exit the while loop
print('No next page button, quitting...')
rs.nextpage.exists=FALSE
}else{
#If there is, click the link and wait
rs.nextpage$clickElement()
Sys.sleep(20)
}
}
#Check for any duplicates
rs.data<-
rs.data %>%
distinct()
rs.data %>% glimpse()
saveRDS(rs.data, file="data/nudge_Goodreads.RDS")
You may notice that for some of the above scraping I've extracted the attributes (with $getElementAttribute) and then parsed them with rvest, whereas for other elements I've extracted the text directly (with $getElementText). There is a subtle difference between these methods. $getElementText grabs the text, but with one important caveat: it only grabs the text that is drawn on the screen. So, if there is a 'read more' drop-down (or something similar), $getElementText will only return the displayed text. This isn't great when you want to scrape the entire review. Luckily, $getElementAttribute allows you to grab information from the HTML document directly. Grabbing the outerHTML attribute of the review element returns all the HTML of that review, which can then be parsed using the rvest methods.
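For example (a minimal sketch, where elem stands for a single element returned by remDr$findElement()):
elem$getElementText()[[1]] #only the text currently rendered on screen
elem$getElementAttribute("outerHTML")[[1]] %>% #the element's full HTML...
  read_html() %>%
  html_text() #...parsed to text with rvest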
Google Trends can be used as a simple measure of search popularity. The main issue is that the values are relative, so they can only be compared within a single query. The scale runs from 0 to 100, with 100 being the point of peak popularity.
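As an illustration, here is a minimal sketch using the gtrendsR package (one of several ways to query Google Trends from R):
pacman::p_load(gtrendsR)
#Compare relative search interest for two terms; hits are scaled 0-100 within this query
trends <- gtrends(keyword = c("nudge", "behavioural economics"))
head(trends$interest_over_time)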
Cleaning and manipulating data with the dplyr and tidyr packages is summarized here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
First load the data - rather than putting it in a data.frame (which can result in slow printing to the console) we'll put it in a tbl_df ('tibble'), which is quicker.
pacman::p_load(tidyr, dplyr, ggplot2)
#Load the dataset
nudge.goodreads<-
tbl_df(readRDS('data/nudge_Goodreads.RDS'))
#Take a quick look at it
glimpse(nudge.goodreads)
## Observations: 480
## Variables: 3
## $ rating <dbl> 4, 2, 1, 5, 4, 3, 2, 1, 4, 2, 4, 5, 2, 4, 2, 5, 4, 5, 5...
## $ date <chr> "Mar 22, 2009", "Oct 18, 2009", "Jan 09, 2012", "Oct 22...
## $ review <chr> "\n \n \nThis one took me longe...
## # A tibble: 6 x 3
## rating date review
## <dbl> <chr> <chr>
## 1 4.00 Mar 22, 2009 "\n \n \nThis one took me l…
## 2 2.00 Oct 18, 2009 "\n \n \nI don't understand…
## 3 1.00 Jan 09, 2012 "\n \n \nThis comes with a …
## 4 5.00 Oct 22, 2008 "\n \n \nThis is a terrific…
## 5 4.00 Aug 18, 2008 "\n \n \nI second-guessed m…
## 6 3.00 Jun 30, 2008 "\n \n \nI really like a lo…
## [1] "\n \n \nThis one took me longer to read that is reasonable for a book of its length or the clear style it is written in. I mean, such a simply written text of 250 pages ought to have finished in no time. The problem was that I don’t live in the US and so many of the examples mad"
## [1] "n superannuation, but superannuation would be a tax and would be run by the Australian Tax Office. But don’t get me started…This would be an even more interesting book if you live in America, given the nature of the examples, but either way, this is still worth a look.\n ...more\n\n \n "
A few things can be done to clean up this dataset. First, remove any duplicates. Second, there are newline ('\n') characters in the review text and each review ends with '...more'. Finally, some reviews might not be in English, so we will also detect the language of each review using textcat.
pacman::p_load(stringr, textcat)
nudge.goodreads<-
nudge.goodreads %>%
distinct() %>% #remove any duplicates
mutate(review = str_replace_all(review, '\n|\\.\\.\\.more', ''), #remove newlines and the literal '...more'
review = str_trim(review), #Remove whitespace
lang = factor(textcat(review)))
summary(nudge.goodreads$lang)
## catalan english german malay portuguese scots
## 1 293 1 1 1 2
## welsh
## 1
One of the most straightforward analyses of social media data is to look at frequency - this could be the frequency of particular words within a corpus, or the frequency of social media output over time.
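First, reload the Amazon review data saved earlier (a minimal sketch of this step; the glimpse output follows):
pacman::p_load(dplyr, ggplot2)
amazon.data <- readRDS('data/amazon_review_data.RDS')
glimpse(amazon.data)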
## Observations: 870
## Variables: 3
## $ rating <fctr> 2, 3, 5, 5, 4, 2, 5, 2, 3, 5, 5, 5, 5, 1, 1, 5, 5, 3, ...
## $ review <fctr> NOT MY CUP OF TEA, Difficult to follow at times, thank...
## $ date <fctr> on 7 December 2016, on 5 December 2016, on 30 November...
#Need to correct data formats
p_load(lubridate) #Date conversion from strings
amazon.data<-
amazon.data%>%
mutate(rating = as.integer(rating),
review = as.character(review),
date = dmy(date))
ggplot(amazon.data, aes(x=date))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
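Word frequencies within the review corpus can be counted in a similar spirit. Here is a minimal sketch using the tidytext package (an assumption - any tokeniser would do):
pacman::p_load(tidytext, dplyr)
amazon.data %>%
  unnest_tokens(word, review) %>% #one row per word
  anti_join(stop_words, by = "word") %>% #drop common stop words
  count(word, sort = TRUE) %>% #word frequencies, most common first
  head(10)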
#devtools::install_github("dkahle/ggmap")
pacman::p_load(ggmap)
#Plot the christmas tweets
tweets.df <- readRDS('data/twitter_stream_dataset.RDS')
qmplot(lon, lat,
data = tweets.df,
maptype='toner-background',
darken = .7,
color=I('white'),
alpha=I(.33))
## Warning: Removed 16438 rows containing missing values (geom_point).
Throughout this document I’ve used login credentials that are not included in the script - they’re encrypted using public/private key encryption.
To encrypt you need to generate a private and public key. The public key can be shared and is used to encrypt the data. The private key should never be shared and is used to decrypt the data.
pacman::p_load(sodium) #sodium encryption library
#Generate the keys
sodium.privatekey <- keygen()
sodium.publickey <- pubkey(sodium.privatekey)
#Save the private key, no need to worry about the public one, unless there is more you want to encrypt at a later date using the same pub/private key combo.
saveRDS(sodium.privatekey, file="key.RDS")
Here are some example credentials to encrypt.
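A minimal sketch of such a credentials object (a named list with placeholder values):
cred <- list(guser = "username",
             gpass = "password")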
To encrypt or decrypt do the following.
#To encrypt
encrypted.content<-
simple_encrypt(serialize(
cred, connection = NULL), sodium.publickey)
#How to decrypt
unserialize(simple_decrypt(encrypted.content, sodium.privatekey))
## $guser
## [1] "username"
##
## $gpass
## [1] "password"
You can save the encrypted content as an RDS file or embed it in a script using dput (as it's useless without the key).
## as.raw(c(0xcf, 0x1d, 0x95, 0x39, 0x37, 0xe6, 0x0f, 0xc0, 0x4c,
## 0x88, 0x3b, 0xd8, 0x6f, 0x72, 0x05, 0x81, 0x70, 0x92, 0x1a, 0x88,
## 0x45, 0x14, 0x61, 0x69, 0x66, 0x15, 0x28, 0x3a, 0x1a, 0x04, 0xf7,
## 0x56, 0xdd, 0xb3, 0xa2, 0x79, 0xdb, 0xf8, 0x24, 0xc1, 0xe3, 0x4a,
## 0x22, 0x4f, 0xe4, 0x1f, 0x81, 0xe5, 0xbf, 0x80, 0xf9, 0x7b, 0xde,
## 0xca, 0xda, 0x96, 0x9e, 0x1a, 0xea, 0x76, 0xc8, 0x8e, 0xa3, 0xf4,
## 0xc0, 0xa2, 0xa1, 0xcc, 0xcc, 0x91, 0x66, 0x27, 0x07, 0x37, 0xf7,
## 0x64, 0x61, 0x0a, 0xe0, 0xa9, 0xfb, 0x1c, 0x7c, 0xc5, 0xb2, 0x52,
## 0x96, 0x61, 0xc6, 0x12, 0xbe, 0x7e, 0x44, 0x9a, 0xc7, 0x3c, 0xda,
## 0x2a, 0x0e, 0x4e, 0xcb, 0x20, 0xbb, 0xee, 0xff, 0x0d, 0xad, 0x97,
## 0xca, 0x53, 0xaa, 0x87, 0x20, 0x4a, 0x23, 0xf0, 0x86, 0x04, 0x06,
## 0xa2, 0x12, 0xd6, 0x63, 0x36, 0x05, 0x3b, 0x9e, 0xc9, 0xa5, 0xa6,
## 0x6c, 0xa8, 0x08, 0x70, 0xb1, 0x20, 0xa9, 0xe0, 0x4c, 0x0b, 0xb0,
## 0xa7, 0x73, 0x0c, 0xa0, 0x37, 0xe5, 0x28, 0xb7, 0x5a, 0x25, 0xb5,
## 0xae, 0xb4, 0x59, 0xc8, 0xdd, 0x04, 0x1a, 0x3c, 0x98, 0x64, 0x4e,
## 0xfa, 0xe8, 0xb7, 0x14, 0x47, 0x70, 0xf4, 0xb4, 0x43, 0x96, 0xfe,
## 0x68, 0x96, 0x52))
2.2 Social Media APIs
2.2.1 Twitter
The Twitter API has two main methods for collecting tweets: search (using the twitteR package) or stream (using the streamR package).
2.2.1.1 Search
Before you can log in you must first authorise with Twitter. You will need an API key and an access token, each of which has a key and a secret. Both can be found by creating an app at https://apps.twitter.com/
You can then initiate a search. Keeping with the theme, I will collect the last 1,000 tweets containing the phrase “life of pi” written in English.
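A minimal sketch using the twitteR package (the keys and secrets below are placeholders for your own credentials, which are best kept encrypted as described in the encryption section above):
pacman::p_load(twitteR)
#Authorise with the API key/secret and access token/secret
setup_twitter_oauth(consumer_key = "API_KEY",
                    consumer_secret = "API_SECRET",
                    access_token = "ACCESS_TOKEN",
                    access_secret = "ACCESS_SECRET")
#Collect the last 1,000 English-language tweets containing the phrase "life of pi"
pi.tweets <- searchTwitter('"life of pi"', n = 1000, lang = "en")
pi.tweets.df <- twListToDF(pi.tweets) #convert the list of status objects to a data frame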
Quick look at the data:
2.2.1.2 Stream
According to Twitter: "The streams offer samples of the public data flowing through Twitter." This should capture all public information on a current topic in real time. Before we can begin capturing data, we must first log into Twitter. To do this you need an API key and secret. You can get these by creating an app at https://apps.twitter.com/
The stream API lets you capture tweets as they occur. This can be very useful for monitoring a topic over a long period of time. As I don't want to spend a long time waiting for tweets to come in for this example, I'll collect 10 minutes' worth of tweets about Christmas (it's the 19th of December, so there should be plenty of tweets).
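A minimal sketch using the streamR package (my_oauth is assumed to be a previously created OAuth credential, e.g. made with the ROAuth package):
pacman::p_load(streamR)
#Capture 10 minutes (600 seconds) of tweets mentioning 'christmas'
filterStream(file.name = "christmas_tweets.json",
             track = "christmas",
             timeout = 600,
             oauth = my_oauth) #my_oauth: pre-made OAuth object (assumption)
#Parse the captured JSON into a data frame and save it for later use
tweets.df <- parseTweets("christmas_tweets.json")
saveRDS(tweets.df, 'data/twitter_stream_dataset.RDS')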
Quick look at the data - the parsed tweets contain the following columns:
text, retweet_count, favorited, truncated, id_str, in_reply_to_screen_name, source, retweeted, created_at, in_reply_to_status_id_str, in_reply_to_user_id_str, lang, listed_count, verified, location, user_id_str, description, geo_enabled, user_created_at, statuses_count, followers_count, favourites_count, protected, user_url, name, time_zone, user_lang, utc_offset, friends_count, screen_name, country_code, country, place_type, full_name, place_name, place_id, place_lat, place_lon, lat, lon, expanded_url and url