This notebook is a companion to Getting Started Working with Twitter Data Using jq. It focuses on recipes that the Social Feed Manager team has used when preparing datasets of tweets for researchers.
We will continue to add recipes to this notebook. If you have any suggestions, please contact us.
This notebook requires at least jq 1.5. Note that some package managers may only provide earlier versions; manual installation may be necessary.
These recipes can be used with any data source that outputs tweets as line-oriented JSON. Within the context of SFM, this is usually the output of twitter_rest_warc_iter.py or twitter_stream_warc_iter.py within a processing container. Alternatively, Twarc is a command-line tool for retrieving data from the Twitter API that outputs tweets as line-oriented JSON.
For the purposes of this notebook, we will use a line-oriented JSON file that was created using Twarc. It contains the user timeline of @SocialFeedMgr. The command used to produce this file was twarc.py --timeline socialfeedmgr > tweets.json.
For an explanation of the fields in a tweet see the Tweet Field Guide. For other helpful tweet processing utilities, see twarc utils.
For the sake of brevity, some of the examples may output only a subset of the tweet fields and/or a subset of the tweets contained in tweets.json. The following example outputs the tweet id and text of the first 5 tweets.
In [1]:
!head -n5 tweets.json | jq -c '[.id_str, .text]'
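If you prefer Python over jq, line-oriented JSON can be read one tweet per line with the standard json module. A minimal sketch (the sample lines below are invented stand-ins for tweets.json):

```python
import io
import json

# Invented stand-in for open("tweets.json"): one tweet object per line.
sample = io.StringIO(
    '{"id_str": "1", "text": "first tweet"}\n'
    '{"id_str": "2", "text": "second tweet"}\n'
)

for line in sample:
    tweet = json.loads(line)
    # One [id, text] pair per line, like the jq output above.
    print([tweet["id_str"], tweet["text"]])
```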
The created_at field can be parsed with strptime and converted to an ISO 8601 date with todate.
In [2]:
!head -n5 tweets.json | jq -c '[.created_at, .created_at | strptime("%A %B %d %T %z %Y") | todate]'
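The same parsing can be sketched in Python with datetime.strptime, using the directives that match Twitter's created_at format (the sample timestamp is invented):

```python
from datetime import datetime

# Twitter's created_at uses abbreviated day/month names, e.g.
# "Mon Nov 07 20:19:24 +0000 2016" (this sample value is invented).
created_at = "Mon Nov 07 20:19:24 +0000 2016"
parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
print(parsed.isoformat())  # 2016-11-07T20:19:24+00:00
```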
To filter tweets by the contents of the text field, use select with contains. Note that contains is case-sensitive, so the following two examples return different results.
In [3]:
!cat tweets.json | jq -c 'select(.text | contains("blog")) | [.id_str, .text]'
In [4]:
!cat tweets.json | jq -c 'select(.text | contains("BLOG")) | [.id_str, .text]'
To ignore case, use a regular expression filter with the case-insensitive flag.
In [5]:
!cat tweets.json | jq -c 'select(.text | test("BLog"; "i")) | [.id_str, .text]'
To match any of several terms, use alternation (|) in the regular expression.
In [6]:
!cat tweets.json | jq -c 'select(.text | test("BLog|twarc"; "i")) | [.id_str, .text]'
To require that all of the terms appear, combine multiple tests with and.
In [7]:
!cat tweets.json | jq -c 'select((.text | test("BLog"; "i")) and (.text | test("twitter"; "i"))) | [.id_str, .text]'
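The case-insensitive OR and AND filters above can be sketched in Python with the re module (the sample tweets are invented):

```python
import re

# Invented sample tweets for illustration.
tweets = [
    {"id_str": "1", "text": "New Blog post about twarc"},
    {"id_str": "2", "text": "Twitter data and a blog update"},
    {"id_str": "3", "text": "Nothing relevant here"},
]

# OR: keep tweets containing "blog" or "twarc", ignoring case.
either = [t for t in tweets if re.search(r"blog|twarc", t["text"], re.I)]

# AND: require both "blog" and "twitter", ignoring case.
both = [t for t in tweets
        if re.search(r"blog", t["text"], re.I)
        and re.search(r"twitter", t["text"], re.I)]

print([t["id_str"] for t in either])  # ['1', '2']
print([t["id_str"] for t in both])    # ['2']
```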
To filter by date, parse created_at with strptime, convert it to a timestamp with mktime, and compare it against an ISO 8601 date converted with fromdateiso8601.
In [8]:
!cat tweets.json | jq -c 'select((.created_at | strptime("%A %B %d %T %z %Y") | mktime) > ("2016-11-05T00:00:00Z" | fromdateiso8601)) | [.id_str, .created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]'
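A sketch of the same date filter in Python (the sample created_at values are invented):

```python
from datetime import datetime, timezone

FMT = "%a %b %d %H:%M:%S %z %Y"

# Invented sample tweets.
tweets = [
    {"id_str": "1", "created_at": "Fri Nov 04 09:00:00 +0000 2016"},
    {"id_str": "2", "created_at": "Mon Nov 07 12:30:00 +0000 2016"},
]

# Keep tweets created after 2016-11-05T00:00:00Z.
cutoff = datetime(2016, 11, 5, tzinfo=timezone.utc)
after = [t["id_str"] for t in tweets
         if datetime.strptime(t["created_at"], FMT) > cutoff]
print(after)  # ['2']
```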
A retweet can be identified by the presence of the retweeted_status field.
In [9]:
!cat tweets.json | jq -c 'select(has("retweeted_status")) | [.id_str, .retweeted_status.id]'
Similarly, a quote can be identified by the presence of the quoted_status field.
In [10]:
!cat tweets.json | jq -c 'select(has("quoted_status")) | [.id_str, .quoted_status.id]'
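The same membership checks can be sketched in Python (minimal invented tweet objects):

```python
# Invented minimal tweet objects.
tweets = [
    {"id_str": "1", "text": "an original tweet"},
    {"id_str": "2", "retweeted_status": {"id_str": "1"}},
    {"id_str": "3", "quoted_status": {"id_str": "1"}},
]

# jq's has("retweeted_status") corresponds to a key-membership test.
retweets = [t["id_str"] for t in tweets if "retweeted_status" in t]
quotes = [t["id_str"] for t in tweets if "quoted_status" in t]
print(retweets, quotes)  # ['2'] ['3']
```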
Following is a CSV output that has fields similar to the CSV output produced by SFM's export functionality. Note that it uses the -r flag for jq instead of the -c flag. Also note that it is necessary to remove line breaks from the tweet text to prevent them from breaking the CSV. This is done with (.text | gsub("\n";" ")).
In [11]:
!head -n5 tweets.json | jq -r '[(.created_at | strptime("%A %B %d %T %z %Y") | todate), .id_str, .user.screen_name, .user.followers_count, .user.friends_count, .retweet_count, .favorite_count, .in_reply_to_screen_name, "http://twitter.com/" + .user.screen_name + "/status/" + .id_str, (.text | gsub("\n";" ")), has("retweeted_status"), has("quoted_status")] | @csv'
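As an alternative sketch in Python, the standard csv module quotes fields that contain line breaks, so the gsub replacement is not needed there (the sample tweet is invented):

```python
import csv
import io

# Invented minimal tweet with a line break in its text.
tweet = {"id_str": "1",
         "user": {"screen_name": "example"},
         "text": "line one\nline two"}

buf = io.StringIO()
writer = csv.writer(buf)
# The text field is quoted because it contains a newline, so the
# embedded line break cannot break the row structure.
writer.writerow([tweet["id_str"], tweet["user"]["screen_name"], tweet["text"]])
print(buf.getvalue())
```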
The CSV header row can be produced separately:
In [12]:
!echo "[]" | jq -r '["created_at","twitter_id","screen_name","followers_count","friends_count","retweet_count","favorite_count","in_reply_to_screen_name","twitter_url","text","is_retweet","is_quote"] | @csv'
Excel can load CSV files with over a million rows. However, for practical purposes a much smaller number is recommended.
The following uses the split command to split the CSV output into multiple files. Note that the flags split accepts may differ in your environment.
cat tweets.json | jq -r '[.id_str, (.text | gsub("\n";" "))] | @csv' | split --lines=5 -d --additional-suffix=.csv - tweets
ls *.csv
tweets00.csv tweets01.csv tweets02.csv tweets03.csv tweets04.csv
tweets05.csv tweets06.csv tweets07.csv tweets08.csv tweets09.csv
--lines=5 sets the number of lines to include in each file. --additional-suffix=.csv sets the file extension. tweets is the base name for each file.
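If your environment's split lacks these flags, the same chunking can be sketched in Python (the rows are invented stand-ins for the jq CSV output; filenames follow the example above):

```python
import csv

def chunk(rows, size=5):
    """Yield successive size-row chunks of rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Invented rows standing in for the jq CSV output.
rows = [[str(n), f"tweet {n}"] for n in range(12)]

for idx, part in enumerate(chunk(rows)):
    # tweets00.csv, tweets01.csv, ... matching split's naming above.
    with open(f"tweets{idx:02d}.csv", "w", newline="") as f:
        csv.writer(f).writerows(part)
```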
When outputting tweet ids, .id_str should be used instead of .id. See Ed Summers's blog post for an explanation.
In [13]:
!head -n5 tweets.json | jq -r '.id_str'
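The underlying problem is that tweet ids exceed the 53-bit integer precision of an IEEE 754 double, which JavaScript and some JSON parsers use for all numbers. A quick Python illustration with a made-up id:

```python
# A made-up id in the range of current tweet ids (~9e17, above 2**53).
tweet_id = 902200087638417409

# Doubles near 9e17 can only represent multiples of 128, so the
# round trip through float changes the low digits of the id.
print(float(tweet_id))
print(int(float(tweet_id)) == tweet_id)  # False
```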