Need a basic guide for dummies/journalists, to learn how to scrape Twitter using R? Look no further.
What’s a pirate’s favourite programming language and statistical computing environment? Why, it’s R, of course. Jokes aside, R is the language of choice for data miners, scrapers and visualisers – as well as journalists looking to manipulate datasets that Excel just can’t handle.
Twitter data has the potential to inspire important stories. Blogger David Robinson analysed the tweets of President Donald Trump and found that the more measured posts sent from an iPhone were composed by his staff, while the angrier messages sent from an Android were typed by Trump himself.
Journalists need a tool to filter tweets and find trends among them; R helps by grabbing that data and making it usable.
In this guide we’ll get set up with RStudio on Windows, an open-source program for working with R, and learn the basics of Twitter scraping. This is a basic how-to with little assumed knowledge, so it should hopefully translate for OS X users too, with a few tweaks. Let’s get started:
Note: If you want this guide distilled into 24 words, head to the TLDR at the bottom of the page and just follow the links to download what you need. If you have the patience, read ahead for more detailed instruction.
Step 1: Prep, downloads and installing R
You’ll first need to gather your tools. Head here and download the latest R package, currently R-3.3.2, and install it on your computer. You’ll also need to download RStudio, the software we’ll be working in.
Your final download: save this script to your computer. R can use scripts (basically text files) to store commands, saving you from typing them out every time.
Once you’ve followed the installation wizard, open up RStudio to be greeted by this nice blank canvas.
Step 2: Open R and load your script
You want your screen to be divided into four panels. In the screenshot above you can see the console panel on the left, which shows the code you run, like a timeline of what you’ve done so far. You also have the Environment panel on the right, listing your objects, data frames and variables (currently empty), and, in the bottom right, a simple file manager.
Now, press the folder button on the toolbar, just under ‘Edit’ in the main menu. Alternatively, press Ctrl+O.
Navigate to where you saved the script we downloaded earlier, and open it up in Rstudio. The program will now show a panel for scripts in the top left.
Your script is loaded, and everything you need to migrate tweets from the internet to a spreadsheet is now on your screen.
Before we start scraping, let’s make sure Twitter lets us in when we knock.
Step 3: Getting your Twitter access
To do this, we’ll need access codes from Twitter. Head to apps.twitter.com and create your own application. (A Twitter application in this sense is just a way of connecting to the API.) Hit the ‘create new app’ button and fill in the form. This is purely for personal access to your own Twitter credentials, so fill in the fields with your info.
After that’s completed, head to the ‘Keys and Access Tokens’ tab in the menu of your new app, and copy the four codes you find into R.
These are the Consumer Key and the Secret Consumer Key, and the Access Token and Secret Access Token.
Once these four strings of text and numbers have been copied into the R script you downloaded, you’re good to go and can follow each stage of the script until you have the data you need.
Step 4: Running and merging data
There are three stages to the actual process of grabbing data from Twitter. These are:
- Loading the packages you need.
- Running the code to access Twitter.
- Searching tweets and saving them to file.
The first time you attempt this process, however, you’ll need to install the packages you plan to use. On the script you downloaded this is flagged as step 0; highlight it and press Ctrl+Enter (RStudio’s Run shortcut) to install everything needed for Twitter scraping.
install.packages("stringr")
install.packages("twitteR")
install.packages("purrr")
install.packages("tidytext")
install.packages("dplyr")
install.packages("tidyr")
install.packages("lubridate")
install.packages("scales")
install.packages("broom")
install.packages("ggplot2")
You only need to do this the first time you attempt a Twitter scrape; on all subsequent attempts you can jump straight to step 1.
Mini-Step 1: Load your packages.
Those shiny new packages you added to your R setup now need to be loaded so that you can use commands associated with them. Run this:
library(stringr)
library(twitteR)
library(purrr)
library(tidytext)
library(dplyr)
library(tidyr)
library(lubridate)
library(scales)
library(broom)
library(ggplot2)
Mini-Step 2: Access Twitter.
With your own personal codes copied into the script, you can run the following few lines. (Don’t forget to save your script so you won’t need to repeat the copy/paste process next time around; your codes will remain the same unless you generate new ones.)
consumerKey = "INSERT KEY HERE"
consumerSecret = "INSERT SECRET KEY HERE"
accessToken = "INSERT TOKEN HERE"
accessSecret = "INSERT SECRET TOKEN HERE"

options(httr_oauth_cache = TRUE)

setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret, access_token = accessToken, access_secret = accessSecret)
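If you want to confirm the handshake worked before moving on, here is a quick, optional sanity check using twitteR’s getUser() function. This is just a sketch: "rstudio" is an example handle, so substitute any real account you like.

```r
# Look up an account; if authentication failed, this call will error out.
me <- getUser("rstudio")

# The user object has accessor methods for its fields.
me$getScreenName()      # prints the handle back if the connection works
me$getFollowersCount()  # a quick look at one piece of profile data
```

If you see an OAuth or authorization error here, double-check that all four codes were pasted in correctly.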
Mini-Step 3: Searching and scraping (the fun part)
The script we’re using gives you the option to search for three different things (parts 3.1, 3.2 and 3.3). You can fetch the last 3,200 tweets of any individual account. You can search for recent tweets (up to 3,200) that use a hashtag of your choosing. Finally, you can search for recent tweets (up to 3,200) directed to a certain user, aka tweets ‘@ed’ to someone else. Note that 3,200 is a cap imposed by the Twitter API, and searches only reach back about a week, so you may get fewer results than you ask for.
In each instance, add your chosen search term to the lines indicated, and follow the script through, updating the variable names as you go.
To best demonstrate this, here are some examples:
To create a list of Barack Obama’s tweets sent while he held the POTUS handle, use this:
obamatweets <- userTimeline("potus44", n = 3200)
obamatweets_df <- tbl_df(map_df(obamatweets, as.data.frame))
write.csv(obamatweets_df, "obamatweets.csv")
The userTimeline() function fetches the tweets of a user of your choice. The handle, in this instance potus44, is written between the quotation marks, and the first word on the line names the object where the tweets will be stored. The next line converts that list of tweets into a data frame, and the final line writes the data frame to file.
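Before writing to file, it’s worth peeking at what came back. A quick inspection, assuming the obamatweets_df object from the example above, might look like this:

```r
# Inspect the scraped data frame before saving it
dim(obamatweets_df)           # number of tweets fetched, and number of columns
names(obamatweets_df)         # column names, including text, created and screenName
head(obamatweets_df$text, 3)  # the first three tweets themselves
```

The text column is the one you’ll usually care about; the others hold metadata such as timestamps and retweet counts.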
To create a list of tweets containing a certain hashtag, use this:
yeswecan <- searchTwitter("#yeswecan exclude:retweets", n = 3200)
yeswecan_df <- tbl_df(map_df(yeswecan, as.data.frame))
write.csv(yeswecan_df, "yeswecan.csv")
To create a list of tweets sent to a user, use this:
tweetstoobama <- searchTwitter("@potus44 exclude:retweets", n = 3200)
tweetstoobama_df <- tbl_df(map_df(tweetstoobama, as.data.frame))
write.csv(tweetstoobama_df, "tweetstoobama.csv")
Step 5: Your finished sheet and where to go next
Head to the folder where you saved your CSV file and open it up. Congratulations, you have successfully scraped your desired tweets. What you do next is up to you.
One idea is to count the instances where a word appears. Download this sheet containing a template for totalling some key words, and check out David Robinson’s guide to running a sentiment analysis on your newly collated data.
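If you’d rather do the counting in R itself, here is a minimal sketch using the dplyr and tidytext packages you installed earlier. It assumes you saved a CSV as in step 4 (the filename here is just an example), with its text column holding the tweets.

```r
library(dplyr)
library(tidytext)

# Read the tweets you wrote to file earlier (filename is an example)
tweets <- read.csv("obamatweets.csv", stringsAsFactors = FALSE)

word_counts <- tweets %>%
  unnest_tokens(word, text) %>%            # split tweets into one word per row
  anti_join(stop_words, by = "word") %>%   # drop common words like "the" and "and"
  count(word, sort = TRUE)                 # tally words, most frequent first

head(word_counts, 10)                      # the ten most-used words
```

This gives you a ready-made frequency table, which is also the usual starting point for the sentiment-analysis approach in Robinson’s guide.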
TLDR
- Download R, RStudio and this script.
- Get your Twitter access by creating an app here.
- Open RStudio and load the script.
- Follow the instructions.