DIC LAB 1 Problem 2 : Study response to an event

Define the limit of numbers to be searched


In [73]:
LIMIT = 20000

Declare all the search, strings which will be searched in a vector, following are the hastags for SuperBowl


In [74]:
superBowlStrings = c("#superbowl","#Patriots","#Falcons","#SB51")

Declare your custom search strings, which will be searched, in a vector. Leave it empty vector if no custom string.

Eg : Empty Vector will look like customUserSearchStrings = c("")

NOTE : Make sure you empty leave customUserSearchStrings vector empty if you want to see the results from superBowlStrings


In [109]:
customUserSearchStrings = c("") #c("#NASA")

Create a boolean variable to check if the custom search string is present or not. Custom search will be given preference over SuperBowl search. Number of search limit will change too i.e. LIMIT = 1000


In [76]:
isCustomSearch = FALSE
if(customUserSearchStrings[1]!=""){
    isCustomSearch = TRUE
    LIMIT = 1000
    userSearchStrings = customUserSearchStrings
}else{
    isCustomSearch = FALSE
    LIMIT = 20000
    userSearchStrings = superBowlStrings
}

Define all the libraries which needs to be set up for operations below


In [77]:
library("twitteR")
library("DBI")
library("RSQLite")
library("gtools")
library("bitops")
library("ggplot2")
library("RCurl")
library("RJSONIO")
library("ggmap")
library("sp")
library("mapdata")
library("maptools")
library("scales")
library("maps")
Sys.setlocale(category = "LC_ALL", locale = "C")


'LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C'

Setup the twitter app key for authentication


In [78]:
setup_twitter_oauth('YOUR KEY')


[1] "Using direct authentication"

Following function fetches the state and country from the address_components field in json response .Following is the response

"address_components" : [ { "long_name" : "Buffalo", "short_name" : "Buffalo", "types" : [ "locality", "political" ] }, { "long_name" : "Erie County", "short_name" : "Erie County", "types" : [ "administrative_area_level_2", "political" ] }, { "long_name" : "New York", "short_name" : "NY", "types" : [ "administrative_area_level_1", "political" ] }, { "long_name" : "United States", "short_name" : "US", "types" : [ "country", "political" ] } ],

It will generate the state as New York and country as United States.


In [79]:
parseAddressComponents <- function(addr_comp){
    result = tryCatch(
        {
            addrSize = length(addr_comp)
            stateName = ""
            countryName = ""
            for(i in 1:addrSize){
                if(addr_comp[[i]]$types[[1]] == "administrative_area_level_1"){
                    stateName = addr_comp[[i]]$long_name
                }
                if(addr_comp[[i]]$types[[1]] == "country"){
                    countryName = addr_comp[[i]]$long_name
                }
            }
            if(stateName == ""){
                stateName = NA
            }
            if(countryName == ""){
                countryName = NA
            }
            return (c(stateName,countryName))
        },
        error=function(cond) {
            return (c(NA,NA))
        },
        warning=function(cond) {
        },
        finally={
        }
    )
    return (result);
}

This function generates the url which will be used to query the Google server. You have to pass the address for this function


In [80]:
getGeoCodeUrl <- function(address, return.call = "json", sensor = "false") {
    root <- "https://maps.google.com/maps/api/geocode/"
    url <- paste(root, return.call, "?address=", address,"&key=YOUR_KEY_GOOGLE","&sensor=", sensor, sep = "")
    return(URLencode(url))
}

This function generates the url which will be used to query the Google server. You have to pass the latitude and longitude for this function


In [81]:
getLatLngURL <- function(lat, lng, return.call = "json", sensor = "false") {
    root <- "https://maps.google.com/maps/api/geocode/"
    latlng = paste(lat, lng, sep = ",")
    url <- paste(root, return.call, "?address=", latlng,"&key=AIzaSyCskJdqaLuNJBJ58IHDqfsjJAm9jkZbcKM","&sensor=", sensor, sep = "")
    return(URLencode(url))
}

The below code decomposes the json response from the Google Map API code

The output response is as follows for location Buffalo

{ "results" : [ { "address_components" : [ { "long_name" : "Buffalo", "short_name" : "Buffalo", "types" : [ "locality", "political" ] }, { "long_name" : "Erie County", "short_name" : "Erie County", "types" : [ "administrative_area_level_2", "political" ] }, { "long_name" : "New York", "short_name" : "NY", "types" : [ "administrative_area_level_1", "political" ] }, { "long_name" : "United States", "short_name" : "US", "types" : [ "country", "political" ] } ], "formatted_address" : "Buffalo, NY, USA", "geometry" : { "bounds" : { "northeast" : { "lat" : 42.9664549, "lng" : -78.795157 }, "southwest" : { "lat" : 42.826023, "lng" : -78.9337276 } }, "location" : { "lat" : 42.88644679999999, "lng" : -78.8783689 }, "location_type" : "APPROXIMATE", "viewport" : { "northeast" : { "lat" : 42.9664549, "lng" : -78.795157 }, "southwest" : { "lat" : 42.826023, "lng" : -78.9142665 } } }, "place_id" : "ChIJoeXfUmES04kRcYEfGKUEI5g", "types" : [ "locality", "political" ] } ], "status" : "OK" }

From the above json repsonse the data is extracted in a data frame shown in following codes

We remove all those locations which are not in United State or which are in United State but no information about the state.


In [82]:
getGeoCode <- function(address,userScreenName,verbose=FALSE) {
    if(verbose) cat(address,"\n")
    url <- getGeoCodeUrl(address)
    doc <- getURL(url)
    docData <- fromJSON(doc,simplify = FALSE)
    
    if(docData$status=="OK") {
        
        lat <- docData$results[[1]]$geometry$location$lat
        lng <- docData$results[[1]]$geometry$location$lng        
        
        address_component <- parseAddressComponents(docData$results[[1]]$address_components)
        state = address_component[1]
        country = address_component[2]
        
        if(!invalid(country) & country == "United States"){
            
            if(invalid(state)){
                latlongUrl = getLatLngURL(lat,lng)
                latlngDoc <- getURL(latlongUrl)
                latlngDocData <- fromJSON(latlngDoc,simplify = FALSE)
    
                if(latlngDocData$status=="OK") {
                    address_component_latlng <- parseAddressComponents(latlngDocData$results[[1]]$address_components)
                    latlngState = address_component_latlng[1]
                    latlngCountry = address_component_latlng[2]
                    return(c(userScreenName, lat, lng, latlngState, latlngCountry))
                }
            }else{
                return(c(userScreenName, lat, lng, state, country))
            }
        } 
        Sys.sleep(0.5)
    }
}

1st line creates a sqlite db file

2nd line store and load tweets database which is backend registered

3rd line store the tweets in a table named "tweets", which is automatically provided by twitterR


In [83]:
if(!isCustomSearch){
    mydbName = toString("DIClab1.db")
}else{
    mydbName = toString("custom.db")
}
conn = dbConnect(SQLite(), dbname = mydbName)
register_sqlite_backend(mydbName)

Searches the tweets from the user input search string vector and equally divides them in equal chunks per input string in vector. So that the total sum of all the tweets is LIMIT, i.e 20000 here.


In [84]:
size = length(userSearchStrings)
for(searchStr in userSearchStrings){
    tweets <- searchTwitter(searchStr,LIMIT/size)
    store_tweets_db(tweets)
}

Makes a connection with the database


In [85]:
con = dbConnect(SQLite(), dbname = mydbName)

Query the database for all the screen names of the users.


In [86]:
tweetsUserNames = dbGetQuery(con,"SELECT screenName from tweets")

Generate a vector by taking transpose of the above result.


In [87]:
userNamesVec = c(t(tweetsUserNames))

Prints a few user's screen names


In [88]:
head(userNamesVec)


  1. 'ConstitutionWY'
  2. 'WI_Constitution'
  3. 'WAConstitution'
  4. 'ConstitutionUT'
  5. 'ConstParty_SC'
  6. 'KBistami'

Queries the twitter with all the screen names to get the user information


In [89]:
lookupUserNames <- lookupUsers(userNamesVec)

Converts the data collected in the previous step to a data frame


In [90]:
userNamesDf = twListToDF(lookupUserNames)

Save this data frame so that the previous heavy work needs not to be repeated everytime


In [91]:
if(!isCustomSearch){
    saveRDS(userNamesDf,file = "userNamesDf.Rda")
}else{
    saveRDS(userNamesDf,file = "customUserNamesDF.Rda")
}

Load the userNameDf from the saved file


In [92]:
if(!isCustomSearch){
    userNamesDf = readRDS(file = "userNamesDf.Rda")
}else{
    userNamesDf = readRDS(file = "customUserNamesDF.Rda")
}

Print a few items of the data frame


In [93]:
head(userNamesDf)


descriptionstatusesCountfollowersCountfavoritesCountfriendsCounturlnamecreatedprotectedverifiedscreenNamelocationlangidlistedCountfollowRequestSentprofileImageUrl
ConstitutionWY#NASA have been key in the #Illuminati's plan to hide our #FlatEarth & destroy humanity for #Satan 12467 164 0 19 NA NASA are ILLUMINATI 2015-10-01 04:11:59 FALSE FALSE ConstitutionWY en 3744055813 79 FALSE http://pbs.twimg.com/profile_images/658816339235225600/HEB9pDvR_normal.png
WI_Constitution#Lockheed is the main contractor helping #NASA hide our #FlatEarth & thereby hide #GOD. #Lockheed serve #SATAN. 8244 83 0 13 NA Lockheed serve SATAN 2015-10-01 06:34:14 FALSE FALSE WI_Constitution en 3745020312 103 FALSE http://pbs.twimg.com/profile_images/659579485663027200/TW05NYu3_normal.png
WAConstitutionSee the light See the way #CultofSiduri 4839 64 0 1 http://t.co/NgXssa1W1f Goth Siduri 2015-09-30 20:47:28 FALSE FALSE WAConstitution Seattle, WA en 3740593934 25 FALSE http://pbs.twimg.com/profile_images/652339690595614720/iflUNzPJ_normal.jpg
ConstitutionUTThe United States #Constitution now hangs like a thread and can only be saved by the #WhiteHorse 24369 218 1 11 http://t.co/NgXssa1W1f WhiteHorse SIDURI 2015-09-30 19:01:02 FALSE FALSE ConstitutionUT Salt Lake City, UT en 3739697178 360 FALSE http://pbs.twimg.com/profile_images/652185485545205760/xToaeRah_normal.jpg
ConstParty_SCI seek vengeance for the murder of Priestess Hypatia of the Cult of #Gnosis/#Sophia/#Siduri 7424 90 0 5 http://t.co/NgXssa1W1f Hypatia Siduri 2015-09-30 02:59:31 FALSE FALSE ConstParty_SC Too far away en 3732391574 39 FALSE http://pbs.twimg.com/profile_images/659135861745950720/FaS0KUFj_normal.jpg
KBistami 8 0 2 9 NA Khalil BISTAMI 2017-02-18 00:49:20 FALSE FALSE KBistami en 832753832564101121 0 FALSE http://pbs.twimg.com/profile_images/832757758231339009/KTm8kJeP_normal.jpg

This is the heart of the whole code. It generates a data frame which will contain the screen name, latitude, longitude, state and country of the all the users. We keep storing the data frames as they arrive so that in case of any problem we can resume from the failure point.

NOTE: Here in this experiment we generated some 5000+ locations.


In [94]:
userLocationDataDf = data.frame(screenName = as.character(),Lat = double(),Long = double(),State = as.character(), Country = as.character())

sizeOfDf = length(userNamesDf$screenName)

for(i in 1:sizeOfDf){
    tempUserLocation = userNamesDf$location[i]
    tempUserScreenName = userNamesDf$screenName[i]
    if(tempUserLocation != ""){
        tempVec = getGeoCode(tempUserLocation,tempUserScreenName)
        if(!invalid(tempVec)){
            tempDf = data.frame(screenName = as.character(tempVec[1]),Lat = as.numeric(tempVec[2]),Long = as.numeric(tempVec[3]),State = as.character(tempVec[4]),Country = as.character(tempVec[5]))
            userLocationDataDf = rbind(userLocationDataDf,tempDf)
            if(!isCustomSearch){
                saveRDS(userLocationDataDf,file = "userLocationDataDf.Rda")
            }else{
                saveRDS(userLocationDataDf,file = "customUserLocationDataDf.Rda")
            }
        }
    }
}

Save the above generated data frame since the above process is time consuming.


In [95]:
if(!isCustomSearch){
    saveRDS(userLocationDataDf,file = "userLocationDataDf.Rda")
}else{
    saveRDS(userLocationDataDf,file = "customUserLocationDataDf.Rda")
}

Load the data frame for further queries


In [96]:
if(!isCustomSearch){
    userLocationDataDf = readRDS(file = "userLocationDataDf.Rda")
}else{
    userLocationDataDf = readRDS(file = "customUserLocationDataDf.Rda")
}

Print and check if the data frame generated is in right format or not


In [97]:
head(userLocationDataDf)


screenNameLatLongStateCountry
WAConstitution47.60621 -122.33207 Washington United States
ConstitutionUT40.76078 -111.89105 Utah United States
ConstParty_SC 42.72605 -88.54242 Wisconsin United States
LovinTrump 37.09024 -95.71289 Kansas United States
MEConstitution37.09024 -95.71289 Kansas United States
Step_Holt 42.99359 -89.56843 Wisconsin United States

Prints the number of items in the data frame


In [98]:
print(length(userLocationDataDf$screenName))


[1] 216

This code basically aggregates the data as per the different state and club them together. This code also uses mean function to generate the latitude and logitude of the state the mean value of them will still be in the state.


In [99]:
aggDataDf = aggregate(x = list(Latitude = userLocationDataDf$Lat, Longitude = userLocationDataDf$Long), by = list(State=userLocationDataDf$State), FUN = mean)

This part of code generates the frequency table from the userLocationDataDf data frame per state.


In [100]:
tweetLocationFreqDf = aggregate(x = list(Tweet_Count = userLocationDataDf$State), list(userLocationDataDf$State), length)

Merges the two data frames generated above


In [101]:
aggUserLocationDataDf = data.frame(aggDataDf,tweetLocationFreqDf)

Save this data frame before moving further


In [102]:
if(!isCustomSearch){
    saveRDS(aggUserLocationDataDf,file = "aggUserLocationDataDf.Rda")
}else{
    saveRDS(aggUserLocationDataDf,file = "aggCustomUserLocationDataDf.Rda")
}

Reload the data in a data frame again for futher processing


In [103]:
if(!isCustomSearch){
    aggUserLocationDataDf = readRDS(file = "aggUserLocationDataDf.Rda")
}else{
    aggUserLocationDataDf = readRDS(file = "aggCustomUserLocationDataDf.Rda")
}

Print and check if the data frame is in right format or not.


In [104]:
head(aggUserLocationDataDf)


StateLatitudeLongitudeGroup.1Tweet_Count
Washington 47.65450 -121.80143 Washington 3
Utah 40.76078 -111.89105 Utah 1
Wisconsin 43.42920 -88.97276 Wisconsin 6
Kansas 37.09024 -95.71289 Kansas 13
South Carolina33.45218 -80.65315 South Carolina 5
Georgia 33.11962 -83.78326 Georgia 5

get_map give the map of the specified location.


In [105]:
USAMap = get_map(location = 'United States', zoom = 4)


Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=United+States&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=United%20States&sensor=false

This code generates the points as per the data provided. We need to pass the map, generated in the previous step and all the points which needs to be plotted


In [106]:
mapPoints <- ggmap(USAMap) +
    geom_point(aes(x = aggUserLocationDataDf$Longitude, y = aggUserLocationDataDf$Latitude, size = Tweet_Count), data = aggUserLocationDataDf, color = "blue")

Draws the map with the marked points.


In [107]:
mapPoints


Warning message:
"Removed 2 rows containing missing values (geom_point)."

References


In [ ]: