Greg Coleman
Kim Lamke
Molly Lloyd
Benjamin Sugar
Data Processing Considerations and Architecture
Parallel Computing Notes
Relevant Twitter Data Definitions
Exploratory Data Analysis - General
Exploratory Data Analysis - Hashtags
Exploratory Data Analysis - Darren Wilson Supporter
Engagement Retention
Text Classification
Engagement Classification
Word Cloud
"My simple advice to all the young people -- and those of us not so young -- is the place ourselves for the long haul. Be like a pilot light and not like a firecracker. A firecracker just pops off and it's gone. A pilot light will continue to burn." --John Lewis [0]
Civic Media researcher Ethan Zuckerman [1] once conceived of civic engagement along two axes. The first axis runs from thick to thin. Zuckerman defines "thin" engagement as "actions that require little thought," and "thick" as the opposite. The second axis runs from "symbolic" to "impactful," where "symbolic" actions tend to have to do with "voice," showing that you stand for something: bumper stickers, or growing a beard while your team is in the playoffs. Impactful actions, on the other hand, produce clearer, demonstrable change.
It's easy to assume that the ideal modes of civic participation are thick and impactful. However, engagement can be thin and impactful; take voting, for example. While voter ID laws have been making this form of participation thicker of late, in general it is a simple matter of showing up and pressing a few buttons, and by later that evening you will have helped create a change with significant impacts.
Social media has proven its ability to produce high impact as a very thin form of engagement. Researcher Clay Shirky has pointed out that the Internet extends the advantages of information and coordination, formerly available only to large organizations, to individual public actors. [2] This was exemplified recently when Martin Literary Management rescinded its book deal with Juror B37 from the Trayvon Martin case [3], only hours after Twitter user @MoreandAgain posted a tweet containing the firm's contact information and a call to her 1,600 followers to request that @sharlenemartin drop the project.
What most success stories such as these have in common is that 1) they were made up of short but intense spikes in participation, and 2) the impact sought was well defined and did not require systemic change. Only a very small group of people (a literary agent) was required to enact the change as a result of the pressure, and the change they were pressured to make was binary (publish the book or do not).
Systemic change, by contrast, relies upon sustained participation, and sustained participation online requires certain types of actors just as it does offline. Some will be leaders: highly engaged and motivated to stay and participate of their own accord. But movements are not made up solely of leaders; they also include followers, or, perhaps less pejoratively, those who are inspired and interested enough to stand together in a movement.
The conversation on Twitter regarding the shooting of Michael Brown in Ferguson presented an interesting opportunity to inquire into user engagement. With the Trayvon Martin case still very present in the cultural consciousness, citizens were prepared for the order of events that was soon to come: a grand jury decision on indictment and, potentially, a criminal trial. This meant that people were openly waiting and ready for the next milestone. For data science it meant that, much in the same way reporters were standing by waiting for reaction to the grand jury decision to come down, so too could data journalists. The opportunity therefore presented itself to use data captured from the initial event and compare it with data from the next milestone to see how user engagement changed.
[2] http://www.networksolutions.com/blog/2009/08/clay-shirky-says-the-revolution-is-now-here/
[3] http://newsone.com/2634913/moreandagain-twitter-zimmerman-juror-b37-book-deal/
[a] Thanks to Lily Bui for bringing the quote to our attention, and for discussions that clarified our thinking on this.
Anything that inspired you, such as a paper, a web site, or something we discussed in class.
We were not influenced by anything in particular beyond the work being done in the civic media space in general. The MIT Media Lab's Center for Civic Media has been one of the players whose approach we have employed: looking at media itself as a form of civic participation.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Our main questions concern predictors of sustained engagement:
- What are the patterns of Twitter users engaged in the topic, and how do they change over time?
- What role does the sentiment of language play in engagement?
- What role does a user's position in a network play in their long-term engagement?
Our initial data came from R-Shief's 9 GB collection of tweets with the hashtag Ferguson over the course of []. Unfortunately, this dataset had been cleaned such that it mainly contained information about the tweets (text, hashtags, time) and not about the users who tweeted them (followers, friends). Even then, much of the information about each tweet, such as whether it was a reply or retweet, had been removed as well.
Fortunately, one of our teammates stumbled upon Ed Summers, a data librarian with the Library of Congress, who had collected tweets with any mention of the word Ferguson from August 10th (Michael Brown was shot the previous day) until the 22nd. Due to Twitter's terms of service, Ed was not able to give us the tweets directly; however, he was able to provide the tweet IDs, which could then be "rehydrated" using a program he created called twarc (short for Twitter archive).
With the tweet IDs from August, we had to 'rehydrate' the underlying data from the Twitter API:
twarc.py --hydrate tweet_ids.json > tweet_data.json
We split up the tweet IDs into four groups and rehydrated each using our own Twitter API credentials, as sketched below.
Through this process we got data from ~11 million tweets.
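The split itself is simple; here is a minimal Python sketch (the output file names are illustrative, not our exact script):

with open('tweet_ids.json') as f:
    ids = f.read().splitlines()

# Write four roughly equal chunks, one per set of API credentials.
n = 4
size = len(ids) // n + 1
for i in range(n):
    with open('tweet_ids_%d.txt' % i, 'w') as out:
        out.write('\n'.join(ids[i * size:(i + 1) * size]))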
We then set up a twarc scraper to start collecting the data for the Michael Brown grand jury decision that would later set off another round of protests and Twitter conversation.
twarc.py --query ferguson --stream
We started collecting tweets for the grand jury decision as soon as we heard that the decision was imminent. All told, we collected [NUM] tweets.
We then divided all of our tweets into two 7-day periods: one from right after the shooting, and the other from right after the grand jury decision. We also narrowed our data to only those users whose language setting was English, since we expected to do some NLP, and further to users who had contributed more than 10 tweets during the initial August period. Our reasoning was that a user should meet some initial threshold of interest.
We encountered processing issues throughout the length of our project due to the high volume of data we collected. At its peak, our database had 133 GB of tweets loaded into MongoDB. We made a collective decision to limit our samples to the 7 days immediately following the two events (Michael Brown's death, Aug. 9, and the grand jury non-indictment, Nov. 24) in order to help prevent memory overload and kernel crashes.
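In pandas terms, the windowing and filtering looked roughly like the following sketch (the file name is illustrative, the dotted column names follow the mongoexport fields described below, and the exact notebook code differs):

import pandas as pd

df = pd.read_csv('aug_data_set.csv', parse_dates=['_iso_created_at'])

# Keep the 7 days immediately following the shooting, and only
# users whose language setting is English.
window = (df['_iso_created_at'] >= '2014-08-10') & (df['_iso_created_at'] < '2014-08-17')
df = df[window & (df['user.lang'] == 'en')]

# Keep only users who contributed more than 10 tweets in this period.
counts = df.groupby('user.id').size()
df = df[df['user.id'].isin(counts[counts > 10].index)]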
We also had issues with IPython, which is not built for big data. The pymongo module, which is built to interface between MongoDB and Python, was struggling to meet our needs, as were the IPython kernel and drivers; server and kernel crashes caused significant delays in our processing time.
For additional information, please see the accompanying notebooks.
As with all projects without an unlimited supply of funds or processing capability, the cost of processing services influenced our development process. We started small and added servers as we needed them. We used MongoDB, which scales well horizontally (using lots of smaller servers to provide the capacity of a larger database server); due to budget constraints, however, we scaled vertically. We used a group IPython notebook server for development and collaboration.
We initially tried to use the Python MongoDB driver (pymongo) to load data into our dataframes. Loading data with pymongo involves initializing the driver and making a query; the driver then returns the data after parsing the BSON (Binary JSON) documents into Python dictionaries. We discovered that pymongo was hanging on our queries and we needed to bypass it. We used the MongoDB mongoexport utility to export CSV files for analysis in IPython, and incorporated MongoDB as the data aggregator.
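For reference, the abandoned pymongo path looked roughly like this (a sketch with assumed fields, not our exact query):

from pymongo import MongoClient
import pandas as pd

db = MongoClient('10.208.160.157').ferguson

# Project out a handful of fields and build a dataframe from the
# parsed documents; on a 133 GB collection these cursors would hang.
cursor = db.tweets.find({}, {'id': 1, 'text': 1, 'user.lang': 1, '_iso_created_at': 1})
df = pd.DataFrame(list(cursor))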
[twarc] --> [MongoDB] --> [csv file] --> [IPython]
Ultimately we had to choose the relevant JSON fields we wanted to extract from MongoDB and load them into CSVs in order to get IPython to function. We downloaded two sets: the August set, which comprised data from August 10th through 17th, and a November set, which spanned November 24 through December 1. Below is the November query:
mongoexport -d ferguson -c tweets -h 10.208.160.157 \
  --fields "id,lang,_iso_created_at,text,user.id,user.screen_name,user.geo_enabled,user.statuses_count,user.friends_count,user.lang,user.name,user.following,user.followers_count,retweeted,in_reply_to_screen_name,retweet_count,entities.user_mentions,retweeted_status.user.id,retweeted_status.favorite_count,retweeted_status.favourities_count,retweeted_status.user.followers_count,retweeted_status.user.friends_count,in_reply_to_status_id,retweeted_status.in_reply_to_status_id,retweeted_status.in_reply_to_user_id" \
  --query '{ "_iso_created_at": { $lte: Date(1417305600000), $gte: Date(1416787200000) } }' \
  --csv > nov_data_set.csv
After we loaded the MongoDB database, we created a unique index on the tweet ID. This prevents duplicate tweets from being imported or from overwriting cleaned records in the database. See below for notes on MongoDB importing and indexing.
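From Python, the index can be created through pymongo (a sketch; the mongo shell has an equivalent index helper):

from pymongo import MongoClient

db = MongoClient('10.208.160.157').ferguson

# Unique index on the tweet id: duplicate imports are rejected rather
# than overwriting records that have already been cleaned.
db.tweets.create_index('id', unique=True)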
The dataset as downloaded using twarc was JSON, which was a problem due to the size of the uncompressed JSON. We would download the data, compress the JSON, and work with it in compressed form as much as possible. In one test, a small BSON document set with a data size of 27 MB (without indexes) uncompressed to a JSON dataset of 295 MB, a ratio of roughly 11 to 1. We were not able to produce a JSON representation of the entire database due to lack of storage space.
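Python can stream tweets straight out of the compressed archives, so the roughly 11x larger JSON never needs to exist on disk. A sketch using one of the archives named below:

import bz2
import json

# twarc writes one tweet per line, so a compressed archive can be
# processed line by line without decompressing it to disk first.
with bz2.open('ferguson-20141117075102.json.bz2', 'rt') as f:
    for line in f:
        tweet = json.loads(line)
        # ... process the tweet ...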
The final dataset from which we drew our data was 133 GB, including indexes:
ferguson.tweets
| dataSize    | 131,299,088,424 bytes (~131 GB) |
| storageSize | 133,537,791,136 bytes (~134 GB) |
| indexSize   | 8,524,542,880 bytes (~8.5 GB)   |
| fileSize    | 145,891,524,608 bytes (~146 GB) |
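These figures come from MongoDB's dbStats command; from pymongo they can be pulled as follows (a sketch):

from pymongo import MongoClient

# dbStats reports the same byte counts quoted above.
stats = MongoClient('10.208.160.157').ferguson.command('dbstats')
for key in ('dataSize', 'storageSize', 'indexSize', 'fileSize'):
    print(key, stats[key])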
We used an escalating strategy for server deployment: start with a smaller server, then add processing power and capability as the need arises. We were able to spin up more processing from our cloud provider.
The initial server was a small instance used to run the twarc stream and collect tweets as they happened.
| twarc Collector |
| Host   | Digital Ocean |
| Memory | 0.5 GB |
| Disk   | 20 GB |
| Cores  | 1 |
When we started the initial research, we used a combined server that ran both MongoDB and IPython. Although underpowered, it was within budget. This was used for day-to-day work and contained the group's shared notebooks.
Research Server
| IPython & MongoDB |
| Memory | 8 GB |
| Disk   | 160 GB |
| Cores  | 8 |
Near the end of the project, the computational and data needs called for separating the IPython server from the MongoDB server, so a dedicated MongoDB server was added.
Added Dedicated MongoDB Server
| Server  | Memory | Disk   | Cores |
| IPython | 8 GB   | 160 GB | 8     |
| MongoDB | 8 GB   | 160 GB | 8     |
During the last few days, we added a large computational server with 60 GB of RAM.
Added Dedicated IPython Server
| Server                 | Memory | Disk   | Cores |
| IPython                | 8 GB   | 160 GB | 8     |
| IPython (Large Memory) | 60 GB  | 50 GB  | 8     |
| MongoDB                | 8 GB   | 160 GB | 8     |
Finally, we did have a very large server for scaling vertically that we were ultimately not able to use. It would have been able to hold the entire dataset in RAM; however, there were problems getting MongoDB to run well on its NUMA architecture.
| IPython & MongoDB |
| Memory | 244 GB |
| Disk   | 640 GB |
| Cores  | 32 |
You can import an uncompressed JSON file:
mongoimport --db ferguson --collection tweets --type json --file ferguson-20141121215529.json
Or pipe in a compressed JSON file:
bzip2 -c -d ferguson-20141117075102.json.bz2 | mongoimport --db ferguson --collection tweets --type json
After importing JSON, we run a script from the mongo shell to convert the created_at time strings into ISO date objects. First select the database:
use ferguson
Run this JavaScript from the mongo shell. It processes 1,000,000 records at a time, so it may need to be run repeatedly:
db.tweets.find({ _iso_created_at: { $exists: false } }).limit(1000000).forEach(function(doc) {
    // Skip documents that already hold a proper Date object.
    if (doc._iso_created_at instanceof Date !== true) {
        // Parse the created_at string and save it back as a Date.
        doc._iso_created_at = new Date(doc.created_at);
        db.tweets.save(doc);
    }
});
To find the number of tweets whose timestamp has not yet been converted:
db.tweets.find( {_iso_created_at: { $exists: false }}, {_iso_created_at:1} ).count()
And the total number of tweets:
db.tweets.count()
Exploratory Analysis: What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?
Please see notebooks:
Exploratory Data Analysis - General
Exploratory Data Analysis - Hashtags
Exploratory Data Analysis - Darren Wilson Supporter
Final Analysis: What did you learn about the data? How did you answer the questions? How can you justify your answers?
Please see notebooks:
Engagement Retention
Text Classification
Engagement Classification
Word Cloud Example