jq is a command line JSON processor that's helpful for working with JSON data from Twitter. You'll want to download and install jq on your system to use this notebook with your data. You could also use jqplay to try out these jq statements.
This notebook works with tweets collected from the Twitter filter stream API using an earlier version of Social Feed Manager, but there are lots of tools to get data from the Twitter APIs. To use this notebook with your own data, set the path to your data file as DATA.
As background, Twitter streaming API data is line-oriented JSON, meaning one tweet in JSON format per line. Output from tools such as twarc is also often line-oriented JSON.
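To illustrate (using two invented minimal "tweets", not real Twitter data), line-oriented JSON looks like this — one complete JSON object per line:

```shell
# Two minimal, made-up JSON records, one per line (stand-ins for real tweets).
printf '%s\n' \
  '{"id_str":"1","text":"first tweet"}' \
  '{"id_str":"2","text":"second tweet"}' > sample-tweets.json

# Because each line is a complete JSON object, line-based tools map to
# per-tweet operations: counting lines counts tweets.
wc -l < sample-tweets.json
```

Each line parses independently, which is why head, wc, and similar line-oriented tools work naturally on this format.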
This notebook is intended to help people getting started with working with Twitter data using jq. There are many additional software libraries available to do further analysis, including within a notebook. As an example, see Cody Buntain's notebook analyzing #Ferguson tweets as part of the Researching Ferguson Teach-In at MITH in 2015.
We use jq a lot in working with students and faculty at GW Libraries. Do you have useful jq statements we could share here? We welcome suggestions and improvements to this notebook via GitHub, Twitter (@liblaura, @dankerchner, @justin_littman), or email (lwrubel at gwu dot edu).
In [6]:
DATA="data/tweets"
View the JSON data, both keys and values, in a prettified format. I'm using the head command to show just the first tweet in the file. Alternatively, you can use cat to look at the whole file.
In [2]:
!head -1 $DATA | jq '.'
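If you don't have a data file handy, you can try the same thing on a single made-up tweet (the field values here are invented, not real Twitter data):

```shell
# A minimal invented tweet on one line; jq '.' pretty-prints keys and values.
echo '{"id_str":"1","text":"hello world","user":{"screen_name":"example"}}' | jq '.'
```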
View just the values of each field, without the labels:
In [3]:
!head -1 $DATA | jq '.[]'
Filter your data down to specific fields:
In [4]:
!head -3 $DATA | jq '[.created_at, .text]'
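Here is the same filter applied to one invented tweet, so you can see the shape of the input and output (the -c flag prints compact, one-line output):

```shell
# Select two top-level fields into an array; -c keeps the result on one line.
echo '{"created_at":"Mon Jan 01 00:00:00 +0000 2018","text":"hello"}' \
  | jq -c '[.created_at, .text]'
# → ["Mon Jan 01 00:00:00 +0000 2018","hello"]
```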
The Twitter API documentation describes the responses from the streaming (e.g. filter, sample) and REST (user timeline, search) APIs.
JSON is hierarchical, and the created_at and text fields are at the top level of the tweet. Some fields in a tweet have additional fields within them. For example, the user field contains fields with information about the user who tweeted, including a count of their followers, location, and a unique id (id_str):
In [5]:
!head -1 $DATA | jq '[.user]'
To filter for a subset of the user fields, use dot notation:
In [6]:
!head -2 $DATA | jq '[.user.screen_name, .user.name, .user.followers_count, .user.id_str]'
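Dot notation on a single invented tweet makes the nesting visible (all values here are made up):

```shell
# .user.screen_name etc. reach one level down into the nested user object.
echo '{"user":{"screen_name":"example","followers_count":42,"id_str":"99"}}' \
  | jq -c '[.user.screen_name, .user.followers_count, .user.id_str]'
# → ["example",42,"99"]
```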
Some fields occur multiple times, such as hashtags and mentions. Pull out the hashtag text fields and put them together into one field, separated by commas:
In [7]:
!cat $DATA | jq '[([.entities.hashtags[].text] | join(","))]'
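Here is a slightly simplified version of that hashtag extraction on one invented tweet; with -r the joined string prints without JSON quotes:

```shell
# Collect every hashtag's text value into an array, then join with commas.
echo '{"entities":{"hashtags":[{"text":"jq"},{"text":"json"}]}}' \
  | jq -r '[.entities.hashtags[].text] | join(",")'
# → jq,json
```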
A common use of jq is to turn your JSON data into a csv file to load into other analysis software. The -r option (--raw-output) formats the field as a string suitable for csv, as opposed to a JSON-formatted string with quotes.
In [8]:
!head -8 $DATA | jq -r '[.id_str, .created_at, .text] | @csv'
You probably want to write that data to a file, however:
In [9]:
!cat $DATA | jq -r '[.id_str, .created_at, .text] | @csv' > tweets.csv
In [10]:
!head tweets.csv
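To see exactly what @csv produces, here it is on one invented tweet — each string value is wrapped in double quotes, ready for a csv file:

```shell
# @csv turns the array into one properly quoted csv row.
echo '{"id_str":"1","created_at":"Mon Jan 01 00:00:00 +0000 2018","text":"hello"}' \
  | jq -r '[.id_str, .created_at, .text] | @csv'
# → "1","Mon Jan 01 00:00:00 +0000 2018","hello"
```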
Some fields, particularly the text of a tweet, have newline characters. This can be a problem with your csv, breaking a tweet across lines. Substitute all occurrences of the newline character (\n) with a space:
In [11]:
!cat $DATA | jq -r '[.id_str, .created_at, (.text | gsub("\n";" "))] | @csv' > tweets-oneline.csv
In [12]:
!head tweets-oneline.csv
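A self-contained example of the gsub step, using an invented tweet whose text contains a newline (printf is used instead of echo so the \n escape reaches jq intact):

```shell
# The text field contains a JSON-escaped newline (\n); gsub replaces it
# with a space so the csv row stays on one line.
printf '%s\n' '{"id_str":"1","text":"line one\nline two"}' \
  | jq -r '[.id_str, (.text | gsub("\n";" "))] | @csv'
# → "1","line one line two"
```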
If you'd prefer JSON as your output, you can build new JSON objects and name their keys yourself:
In [7]:
!cat $DATA | jq -c '{id: .id_str, user_id: .user.id_str, screen_name: .user.screen_name, created_at: .created_at, text: .text, user_mentions: [.entities.user_mentions[]?.screen_name], hashtags: [.entities.hashtags[]?.text], urls: [.entities.urls[]?.expanded_url]}' > newtweets.json
In [8]:
!head newtweets.json
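A smaller version of that object construction on one invented tweet. The keys on the left of each colon are names we choose ourselves; the ? after [] keeps jq from erroring when a tweet lacks that entity:

```shell
# Build a new, smaller JSON object from selected fields of the input.
echo '{"id_str":"1","user":{"id_str":"9","screen_name":"example"},"entities":{"hashtags":[{"text":"jq"}]}}' \
  | jq -c '{id: .id_str, user_id: .user.id_str, hashtags: [.entities.hashtags[]?.text]}'
# → {"id":"1","user_id":"9","hashtags":["jq"]}
```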