Define the limit of numbers to be searched
In [73]:
LIMIT = 20000
Declare all the search strings in a vector; the following are the hashtags for the Super Bowl.
In [74]:
superBowlStrings = c("#superbowl","#Patriots","#Falcons","#SB51")
Declare your custom search strings in a vector. Leave the vector empty if there is no custom string.
E.g., an empty vector looks like customUserSearchStrings = c("")
NOTE: Make sure you leave customUserSearchStrings empty if you want to see the results from superBowlStrings.
In [109]:
customUserSearchStrings = c("") #c("#NASA")
Set a boolean flag recording whether a custom search string is present. A custom search takes precedence over the Super Bowl search, and the search limit changes with it (LIMIT = 1000 for custom searches).
In [76]:
isCustomSearch = FALSE
if(customUserSearchStrings[1] != ""){
    isCustomSearch = TRUE
    LIMIT = 1000
    userSearchStrings = customUserSearchStrings
}else{
    isCustomSearch = FALSE
    LIMIT = 20000
    userSearchStrings = superBowlStrings
}
Load all the libraries needed for the operations below.
In [77]:
library("twitteR")
library("DBI")
library("RSQLite")
library("gtools")
library("bitops")
library("ggplot2")
library("RCurl")
library("RJSONIO")
library("ggmap")
library("sp")
library("mapdata")
library("maptools")
library("scales")
library("maps")
Sys.setlocale(category = "LC_ALL", locale = "C")
Set up the Twitter app credentials for authentication. setup_twitter_oauth takes the consumer key/secret and access token/secret; replace the placeholders with your own.
In [78]:
setup_twitter_oauth('YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET', 'YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_SECRET')
The following function extracts the state and country from the address_components field of the JSON response. A sample of that field:
"address_components" : [ { "long_name" : "Buffalo", "short_name" : "Buffalo", "types" : [ "locality", "political" ] }, { "long_name" : "Erie County", "short_name" : "Erie County", "types" : [ "administrative_area_level_2", "political" ] }, { "long_name" : "New York", "short_name" : "NY", "types" : [ "administrative_area_level_1", "political" ] }, { "long_name" : "United States", "short_name" : "US", "types" : [ "country", "political" ] } ],
From this, the function yields New York as the state and United States as the country.
In [79]:
parseAddressComponents <- function(addr_comp){
    result = tryCatch(
        {
            addrSize = length(addr_comp)
            stateName = ""
            countryName = ""
            # Scan every component; keep the state and country entries
            for(i in 1:addrSize){
                if(addr_comp[[i]]$types[[1]] == "administrative_area_level_1"){
                    stateName = addr_comp[[i]]$long_name
                }
                if(addr_comp[[i]]$types[[1]] == "country"){
                    countryName = addr_comp[[i]]$long_name
                }
            }
            if(stateName == ""){
                stateName = NA
            }
            if(countryName == ""){
                countryName = NA
            }
            return (c(stateName, countryName))
        },
        error = function(cond) {
            # On any parsing error, report both fields as missing
            return (c(NA, NA))
        },
        warning = function(cond) {
            # Warnings fall through and yield NULL
        },
        finally = {
        }
    )
    return (result)
}
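As a quick sanity check, the function can be exercised on a hand-built list mirroring the sample response above (this test cell is illustrative and not part of the original run):
In [ ]:
# Hand-built address_components mirroring the sample JSON (illustrative data)
sampleAddrComp = list(
    list(long_name = "Buffalo", short_name = "Buffalo", types = list("locality", "political")),
    list(long_name = "New York", short_name = "NY", types = list("administrative_area_level_1", "political")),
    list(long_name = "United States", short_name = "US", types = list("country", "political"))
)
parseAddressComponents(sampleAddrComp)  # expected: "New York" "United States"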
This function builds the URL used to query the Google Geocoding API; pass it the address to look up.
In [80]:
getGeoCodeUrl <- function(address, return.call = "json", sensor = "false") {
    root <- "https://maps.google.com/maps/api/geocode/"
    url <- paste(root, return.call, "?address=", address, "&key=YOUR_KEY_GOOGLE", "&sensor=", sensor, sep = "")
    return(URLencode(url))
}
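For example, with the placeholder key left in place, a call should produce a URL of roughly this shape (illustrative, not output from the original run):
In [ ]:
getGeoCodeUrl("Buffalo, NY")
# "https://maps.google.com/maps/api/geocode/json?address=Buffalo,%20NY&key=YOUR_KEY_GOOGLE&sensor=false"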
This function builds the same query URL from a latitude/longitude pair instead of an address.
In [81]:
getLatLngURL <- function(lat, lng, return.call = "json", sensor = "false") {
    root <- "https://maps.google.com/maps/api/geocode/"
    latlng = paste(lat, lng, sep = ",")
    url <- paste(root, return.call, "?address=", latlng, "&key=YOUR_KEY_GOOGLE", "&sensor=", sensor, sep = "")
    return(URLencode(url))
}
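And its reverse-geocoding counterpart (again illustrative):
In [ ]:
getLatLngURL(42.8864468, -78.8783689)
# "https://maps.google.com/maps/api/geocode/json?address=42.8864468,-78.8783689&key=YOUR_KEY_GOOGLE&sensor=false"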
The code below decomposes the JSON response from the Google Maps API. For the location Buffalo, the response looks like this:
{ "results" : [ { "address_components" : [ { "long_name" : "Buffalo", "short_name" : "Buffalo", "types" : [ "locality", "political" ] }, { "long_name" : "Erie County", "short_name" : "Erie County", "types" : [ "administrative_area_level_2", "political" ] }, { "long_name" : "New York", "short_name" : "NY", "types" : [ "administrative_area_level_1", "political" ] }, { "long_name" : "United States", "short_name" : "US", "types" : [ "country", "political" ] } ], "formatted_address" : "Buffalo, NY, USA", "geometry" : { "bounds" : { "northeast" : { "lat" : 42.9664549, "lng" : -78.795157 }, "southwest" : { "lat" : 42.826023, "lng" : -78.9337276 } }, "location" : { "lat" : 42.88644679999999, "lng" : -78.8783689 }, "location_type" : "APPROXIMATE", "viewport" : { "northeast" : { "lat" : 42.9664549, "lng" : -78.795157 }, "southwest" : { "lat" : 42.826023, "lng" : -78.9142665 } } }, "place_id" : "ChIJoeXfUmES04kRcYEfGKUEI5g", "types" : [ "locality", "political" ] } ], "status" : "OK" }
The code below extracts the data from this JSON response into a data frame.
Locations outside the United States are dropped. For locations inside the United States with no state information, a reverse lookup by latitude/longitude is attempted to recover the state.
In [82]:
getGeoCode <- function(address, userScreenName, verbose = FALSE) {
    if(verbose) cat(address, "\n")
    url <- getGeoCodeUrl(address)
    doc <- getURL(url)
    docData <- fromJSON(doc, simplify = FALSE)
    if(docData$status == "OK") {
        lat <- docData$results[[1]]$geometry$location$lat
        lng <- docData$results[[1]]$geometry$location$lng
        address_component <- parseAddressComponents(docData$results[[1]]$address_components)
        state = address_component[1]
        country = address_component[2]
        if(!invalid(country) & country == "United States"){
            if(invalid(state)){
                # No state in the forward lookup: retry with a reverse lookup by lat/lng
                latlongUrl = getLatLngURL(lat, lng)
                latlngDoc <- getURL(latlongUrl)
                latlngDocData <- fromJSON(latlngDoc, simplify = FALSE)
                if(latlngDocData$status == "OK") {
                    address_component_latlng <- parseAddressComponents(latlngDocData$results[[1]]$address_components)
                    latlngState = address_component_latlng[1]
                    latlngCountry = address_component_latlng[2]
                    return(c(userScreenName, lat, lng, latlngState, latlngCountry))
                }
            }else{
                return(c(userScreenName, lat, lng, state, country))
            }
        }
        # Brief pause to stay under the geocoding rate limit
        Sys.sleep(0.5)
    }
}
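With valid keys in place, a single lookup should behave roughly as follows; the result is a character vector because c() coerces the mixed types. This is a live call, so exact coordinates may vary:
In [ ]:
getGeoCode("Buffalo, NY", "someUser")
# expected shape: "someUser" "42.8864468" "-78.8783689" "New York" "United States"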
Choose the SQLite database file (a separate file is used for custom searches), connect to it (creating it if needed), and register it as the twitteR backend.
store_tweets_db will later write tweets into a table named "tweets", which twitteR creates automatically.
In [83]:
if(!isCustomSearch){
    mydbName = "DIClab1.db"
}else{
    mydbName = "custom.db"
}
conn = dbConnect(SQLite(), dbname = mydbName)
register_sqlite_backend(mydbName)
Searches Twitter for each string in the input vector, splitting the quota into equal chunks per string so that the total number of tweets adds up to LIMIT (20,000 here).
In [84]:
size = length(userSearchStrings)
for(searchStr in userSearchStrings){
    tweets <- searchTwitter(searchStr, LIMIT/size)
    store_tweets_db(tweets)
}
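Twitter's Search API is rate limited, so a pull of this size can stall partway. twitteR's searchTwitter accepts a retryOnRateLimit argument that waits out the limit window instead of failing; a variant of the loop above using it (a sketch, not the original code):
In [ ]:
# Same loop, but let twitteR retry up to 120 times when rate limited
for(searchStr in userSearchStrings){
    tweets <- searchTwitter(searchStr, n = LIMIT/size, retryOnRateLimit = 120)
    store_tweets_db(tweets)
}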
Makes a connection with the database
In [85]:
con = dbConnect(SQLite(), dbname = mydbName)
Query the database for all the screen names of the users.
In [86]:
tweetsUserNames = dbGetQuery(con,"SELECT screenName from tweets")
Flatten the one-column result into a character vector by transposing it.
In [87]:
userNamesVec = c(t(tweetsUserNames))
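The transpose trick is just a compact way to flatten a one-column data frame; a toy demonstration (not from the original run):
In [ ]:
toyDf = data.frame(screenName = c("alice", "bob"), stringsAsFactors = FALSE)
c(t(toyDf))  # "alice" "bob"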
Print a few of the users' screen names.
In [88]:
head(userNamesVec)
Queries Twitter with all the screen names to get each user's information.
In [89]:
lookupUserNames <- lookupUsers(userNamesVec)
Converts the data collected in the previous step to a data frame
In [90]:
userNamesDf = twListToDF(lookupUserNames)
Save this data frame so that the heavy work above does not need to be repeated every time.
In [91]:
if(!isCustomSearch){
    saveRDS(userNamesDf, file = "userNamesDf.Rda")
}else{
    saveRDS(userNamesDf, file = "customUserNamesDF.Rda")
}
Load userNamesDf from the saved file.
In [92]:
if(!isCustomSearch){
    userNamesDf = readRDS(file = "userNamesDf.Rda")
}else{
    userNamesDf = readRDS(file = "customUserNamesDF.Rda")
}
Print a few items of the data frame
In [93]:
head(userNamesDf)
This is the heart of the whole pipeline. It builds a data frame containing the screen name, latitude, longitude, state, and country of every user. The data frame is saved after each new row so that, if anything fails, the run can resume from the failure point.
NOTE: In this experiment we resolved some 5,000+ locations.
In [94]:
userLocationDataDf = data.frame(screenName = as.character(), Lat = double(), Long = double(),
                                State = as.character(), Country = as.character())
sizeOfDf = length(userNamesDf$screenName)
for(i in 1:sizeOfDf){
    tempUserLocation = userNamesDf$location[i]
    tempUserScreenName = userNamesDf$screenName[i]
    if(tempUserLocation != ""){
        tempVec = getGeoCode(tempUserLocation, tempUserScreenName)
        if(!invalid(tempVec)){
            tempDf = data.frame(screenName = as.character(tempVec[1]), Lat = as.numeric(tempVec[2]),
                                Long = as.numeric(tempVec[3]), State = as.character(tempVec[4]),
                                Country = as.character(tempVec[5]))
            userLocationDataDf = rbind(userLocationDataDf, tempDf)
            # Checkpoint after every row so an interrupted run can resume
            if(!isCustomSearch){
                saveRDS(userLocationDataDf, file = "userLocationDataDf.Rda")
            }else{
                saveRDS(userLocationDataDf, file = "customUserLocationDataDf.Rda")
            }
        }
    }
}
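Growing a data frame with rbind inside a loop is quadratic in the number of rows. An equivalent sketch that collects the rows in a list and binds them once at the end is usually much faster, at the cost of the per-row checkpointing above:
In [ ]:
# Sketch: accumulate rows in a list, bind once at the end (faster, no checkpoints)
rows = vector("list", sizeOfDf)
for(i in 1:sizeOfDf){
    if(userNamesDf$location[i] != ""){
        tempVec = getGeoCode(userNamesDf$location[i], userNamesDf$screenName[i])
        if(!invalid(tempVec)){
            rows[[i]] = data.frame(screenName = tempVec[1], Lat = as.numeric(tempVec[2]),
                                   Long = as.numeric(tempVec[3]), State = tempVec[4],
                                   Country = tempVec[5])
        }
    }
}
userLocationDataDf = do.call(rbind, rows)  # NULL entries are dropped by rbind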
Save the data frame generated above, since the process is time-consuming.
In [95]:
if(!isCustomSearch){
    saveRDS(userLocationDataDf, file = "userLocationDataDf.Rda")
}else{
    saveRDS(userLocationDataDf, file = "customUserLocationDataDf.Rda")
}
Load the data frame for further queries
In [96]:
if(!isCustomSearch){
    userLocationDataDf = readRDS(file = "userLocationDataDf.Rda")
}else{
    userLocationDataDf = readRDS(file = "customUserLocationDataDf.Rda")
}
Print the data frame to check that it is in the right format.
In [97]:
head(userLocationDataDf)
Print the number of rows in the data frame.
In [98]:
print(length(userLocationDataDf$screenName))
This code aggregates the data by state, clubbing rows for the same state together. It uses the mean to compute a representative latitude and longitude for each state, on the assumption that the mean of points inside a state generally still falls within the state.
In [99]:
aggDataDf = aggregate(x = list(Latitude = userLocationDataDf$Lat, Longitude = userLocationDataDf$Long), by = list(State=userLocationDataDf$State), FUN = mean)
This builds a per-state frequency table from the userLocationDataDf data frame.
In [100]:
tweetLocationFreqDf = aggregate(x = list(Tweet_Count = userLocationDataDf$State), list(userLocationDataDf$State), length)
Combine the two data frames generated above column-wise.
In [101]:
aggUserLocationDataDf = data.frame(aggDataDf,tweetLocationFreqDf)
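data.frame() here pastes the two frames together column-wise and relies on both being sorted by state in the same order (aggregate sorts both by the grouping key, so the rows do line up). A keyed merge makes that assumption explicit; an equivalent sketch:
In [ ]:
# Join on the state key instead of relying on row order; Group.1 is
# aggregate's default name for the unnamed grouping column
aggUserLocationDataDf = merge(aggDataDf, tweetLocationFreqDf, by.x = "State", by.y = "Group.1")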
Save this data frame before moving further
In [102]:
if(!isCustomSearch){
    saveRDS(aggUserLocationDataDf, file = "aggUserLocationDataDf.Rda")
}else{
    saveRDS(aggUserLocationDataDf, file = "aggCustomUserLocationDataDf.Rda")
}
Reload the data into a data frame for further processing.
In [103]:
if(!isCustomSearch){
    aggUserLocationDataDf = readRDS(file = "aggUserLocationDataDf.Rda")
}else{
    aggUserLocationDataDf = readRDS(file = "aggCustomUserLocationDataDf.Rda")
}
Print the data frame to check that it is in the right format.
In [104]:
head(aggUserLocationDataDf)
get_map fetches a map of the specified location.
In [105]:
USAMap = get_map(location = 'United States', zoom = 4)
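Note that recent ggmap releases (2.7+) require a registered Google Maps key before get_map can fetch tiles from Google; if the call above fails with an authorization error, registering the key first usually fixes it (YOUR_KEY_GOOGLE is a placeholder):
In [ ]:
# Needed on ggmap >= 2.7 when Google is the map source (placeholder key)
register_google(key = "YOUR_KEY_GOOGLE")
USAMap = get_map(location = 'United States', zoom = 4)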
This code overlays points on the map generated in the previous step. We pass ggmap the map, then add the points to plot, sized by tweet count.
In [106]:
mapPoints <- ggmap(USAMap) +
    geom_point(aes(x = Longitude, y = Latitude, size = Tweet_Count),
               data = aggUserLocationDataDf, color = "blue")
Draws the map with the marked points.
In [107]:
mapPoints