Recipes for processing Twitter data with jq

This notebook is a companion to Getting Started Working with Twitter Data Using jq. It focuses on recipes that the Social Feed Manager team has used when preparing datasets of tweets for researchers.

We will continue to add recipes to this notebook. If you have any suggestions, please contact us.

This notebook requires at least jq 1.5. Note that your package manager may only provide an earlier version; if so, manual installation is necessary.
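
To check which version of jq is installed:

jq --version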

These recipes can be used with any data source that outputs tweets as line-oriented JSON. In SFM, this is usually the output of twitter_rest_warc_iter.py or twitter_stream_warc_iter.py run inside a processing container. Alternatively, Twarc is a command-line tool for retrieving data from the Twitter API that outputs tweets as line-oriented JSON.

For the purposes of this notebook, we will use a line-oriented JSON file that was created using Twarc. It contains the user timeline of @SocialFeedMgr. The command used to produce this file was twarc.py --timeline socialfeedmgr > tweets.json.

For an explanation of the fields in a tweet see the Tweet Field Guide. For other helpful tweet processing utilities, see twarc utils.

For the sake of brevity, some of the examples output only a subset of the tweet fields and/or a subset of the tweets contained in tweets.json. The following example outputs the tweet id and text of the first 5 tweets.


In [1]:
!head -n5 tweets.json | jq -c '[.id_str, .text]'


["798895564335280129","Social Feed Manager 1.3 is out, with collection portability, monitoring page, one-time harvest option, more https://t.co/L956zwfrGQ"]
["797074713612877824","@SMLabTO How might we update the toolkit's listing for Social Feed Manager? Info and docs available via https://t.co/j3zQ7kGwNn"]
["797064988116611073","RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…"]
["794234002496487424","RT @ianmilligan1: Used @SocialFeedMgr to collect almost 80,000 #CdnPoli tweets this morning. Great, intuitive UI. https://t.co/BuS3S7f6kf #…"]
["793896478037114880","Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ"]

Dates

For both filtering and output, it is often necessary to parse and/or normalize the created_at date. The following shows the original created_at date alongside the date as an ISO 8601 date. Note that the parentheses around the second array element are required; without them, the strptime and todate filters would be applied to both elements.


In [2]:
!head -n3 tweets.json | jq -c '[.created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]'


["Wed Nov 16 14:28:39 +0000 2016","2016-11-16T09:28:39Z"]
["Fri Nov 11 13:53:15 +0000 2016","2016-11-11T08:53:15Z"]
["Fri Nov 11 13:14:36 +0000 2016","2016-11-11T08:14:36Z"]

Filtering

Filtering text

Case sensitive


In [3]:
!cat tweets.json | jq -c 'select(.text | contains("blog")) | [.id_str, .text]'


["797064988116611073","RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…"]
["793896478037114880","Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ"]
["786179196577992707","A detailed look at recent technical work to improve our social media harvesters. New blog post by @justinlittman: https://t.co/FFHqJxfxl6"]
["773553804558073857","When is a Collection not an Archive? New blog post by @save4use https://t.co/JtxyksXLdV"]
["742420371434033153","RT @justin_littman: My blog post on collecting the tweets of #PulseNightclub with @SocialFeedMgr: https://t.co/qRQRNPRiOO"]
["728300814129827840","Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW"]
["720616794168369152","More on Social Feed Manager and blog posts about the WARC approach at https://t.co/iUdSktNyRp https://t.co/XKIiqaKDDp"]

In [4]:
!cat tweets.json | jq -c 'select(.text | contains("BLOG")) | [.id_str, .text]'

This produces no output because contains() is case sensitive, and no tweet contains "BLOG".

Case insensitive

To ignore case, use a regular expression filter with the case-insensitive flag.


In [5]:
!cat tweets.json | jq -c 'select(.text | test("BLog"; "i")) | [.id_str, .text]'


["797064988116611073","RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…"]
["793896478037114880","Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ"]
["786179196577992707","A detailed look at recent technical work to improve our social media harvesters. New blog post by @justinlittman: https://t.co/FFHqJxfxl6"]
["773553804558073857","When is a Collection not an Archive? New blog post by @save4use https://t.co/JtxyksXLdV"]
["742420371434033153","RT @justin_littman: My blog post on collecting the tweets of #PulseNightclub with @SocialFeedMgr: https://t.co/qRQRNPRiOO"]
["728300814129827840","Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW"]
["720616794168369152","More on Social Feed Manager and blog posts about the WARC approach at https://t.co/iUdSktNyRp https://t.co/XKIiqaKDDp"]

Filtering on multiple terms (OR)


In [6]:
!cat tweets.json | jq -c 'select(.text | test("BLog|twarc"; "i")) | [.id_str, .text]'


["797064988116611073","RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes & favorites: A guide for researchers. https://t.co/SjfIuLu…"]
["793896478037114880","Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ"]
["786179196577992707","A detailed look at recent technical work to improve our social media harvesters. New blog post by @justinlittman: https://t.co/FFHqJxfxl6"]
["773553804558073857","When is a Collection not an Archive? New blog post by @save4use https://t.co/JtxyksXLdV"]
["742420371434033153","RT @justin_littman: My blog post on collecting the tweets of #PulseNightclub with @SocialFeedMgr: https://t.co/qRQRNPRiOO"]
["741072574239649792","Glad to contribute to twarc, for the benefit of everyone needing access to twitter data. https://t.co/gwKXit2cho"]
["728300814129827840","Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW"]
["720616794168369152","More on Social Feed Manager and blog posts about the WARC approach at https://t.co/iUdSktNyRp https://t.co/XKIiqaKDDp"]

Filtering on multiple terms (AND)


In [7]:
!cat tweets.json | jq -c 'select((.text | test("BLog"; "i")) and (.text | test("twitter"; "i"))) | [.id_str, .text]'


["728300814129827840","Another Try at Harvesting the Twitter Streaming API to WARC files, blog post by @justin_littman https://t.co/RJL9OqaGDW"]

Filter dates

The following shows tweets created after November 5, 2016.


In [8]:
!cat tweets.json | jq -c 'select((.created_at | strptime("%A %B %d %T %z %Y") | mktime) > ("2016-11-05T00:00:00Z" | fromdateiso8601)) | [.id_str, .created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]'


["798895564335280129","Wed Nov 16 14:28:39 +0000 2016","2016-11-16T09:28:39Z"]
["797074713612877824","Fri Nov 11 13:53:15 +0000 2016","2016-11-11T08:53:15Z"]
["797064988116611073","Fri Nov 11 13:14:36 +0000 2016","2016-11-11T08:14:36Z"]

Is retweet


In [9]:
!cat tweets.json | jq -c 'select(has("retweeted_status")) | [.id_str, .retweeted_status.id]'


["797064988116611073",796843045341790200]
["794234002496487424",794224602310463500]
["793786215220776960",793748406602723300]
["791966708390957056",791929123723632600]
["791633052053176321",791632539341447200]
["789442614571466752",789416330009120800]
["785833961847001088",785802401512947700]
["780736483682549760",780732775963983900]
["773231772398194688",773229286589341700]
["766639464454193152",765636639133691900]
["758383727672225792",758316697560416300]
["752856372644024320",752584251648774100]
["750704374876364801",750391045301559300]
["743462280852086784",743459400967413800]
["743458848434958336",743458460205989900]
["743410385923997696",743121113035771900]
["743167286006079488",743166124133650400]
["743128282720284672",743127197066612700]
["742420371434033153",742418514686926800]
["742053812895027200",742048151176151000]
["720621348565970944",720621197550071800]
["720223544014213120",720222105435009000]
["715610839890403329",715607896793477100]

Is quote


In [10]:
!cat tweets.json | jq -c 'select(has("quoted_status")) | [.id_str, .quoted_status.id]'


["789100742430887936",789098583362502700]
["745995798354210816",745988440794210300]
["741072574239649792",741018168773226500]
["720616794168369152",720615458412605400]

Output

To write output to a file, use > <filename>. For example:

cat tweets.json | jq -r '.id_str' > tweet_ids.txt

CSV

The following is CSV output with fields similar to those produced by SFM's export functionality.

Note that it uses jq's -r (raw output) flag instead of the -c flag.

Also note that it is necessary to remove line breaks from the tweet text to prevent them from breaking the CSV. This is done with (.text | gsub("\n";" ")).


In [11]:
!head -n5 tweets.json | jq -r '[(.created_at | strptime("%A %B %d %T %z %Y") | todate), .id_str, .user.screen_name, .user.followers_count, .user.friends_count, .retweet_count, .favorite_count, .in_reply_to_screen_name, "http://twitter.com/" + .user.screen_name + "/status/" + .id_str, (.text | gsub("\n";" ")), has("retweeted_status"), has("quoted_status")] | @csv'


"2016-11-16T09:28:39Z","798895564335280129","SocialFeedMgr",124,27,1,0,,"http://twitter.com/SocialFeedMgr/status/798895564335280129","Social Feed Manager 1.3 is out, with collection portability, monitoring page, one-time harvest option, more https://t.co/L956zwfrGQ",false,false
"2016-11-11T08:53:15Z","797074713612877824","SocialFeedMgr",124,27,0,0,"SMLabTO","http://twitter.com/SocialFeedMgr/status/797074713612877824","@SMLabTO How might we update the toolkit's listing for Social Feed Manager? Info and docs available via https://t.co/j3zQ7kGwNn",false,false
"2016-11-11T08:14:36Z","797064988116611073","SocialFeedMgr",124,27,4,0,,"http://twitter.com/SocialFeedMgr/status/797064988116611073","RT @justin_littman: New on the @SocialFeedMgr blog: On retweets, replies, quotes &amp; favorites: A guide for researchers. https://t.co/SjfIuLu…",true,false
"2016-11-03T13:45:17Z","794234002496487424","SocialFeedMgr",124,27,6,0,,"http://twitter.com/SocialFeedMgr/status/794234002496487424","RT @ianmilligan1: Used @SocialFeedMgr to collect almost 80,000 #CdnPoli tweets this morning. Great, intuitive UI. https://t.co/BuS3S7f6kf #…",true,false
"2016-11-02T15:24:04Z","793896478037114880","SocialFeedMgr",124,27,8,4,,"http://twitter.com/SocialFeedMgr/status/793896478037114880","Software doesn't live forever. How to get collections OUT of Social Feed Manager, a new blog post by @justin_littman https://t.co/CagQvSF7pJ",false,false

Header row

The header row should be written to the output file with > before appending the CSV rows with >>; a combined example follows the cell below.


In [12]:
!echo "[]" | jq -r '["created_at","twitter_id","screen_name","followers_count","friends_count","retweet_count","favorite_count","in_reply_to_screen_name","twitter_url","text","is_retweet","is_quote"] | @csv'


"created_at","twitter_id","screen_name","followers_count","friends_count","retweet_count","favorite_count","in_reply_to_screen_name","twitter_url","text","is_retweet","is_quote"

Splitting files

Excel can load CSV files with up to 1,048,576 rows. However, for practical purposes a much smaller number of rows per file is recommended.

The following uses the split command to split the CSV output into multiple files. Note that the flags shown are for GNU split; the flags accepted may differ in your environment.

cat tweets.json | jq -r '[.id_str, (.text | gsub("\n";" "))] | @csv' | split --lines=5 -d --additional-suffix=.csv - tweets
ls *.csv
tweets00.csv  tweets01.csv  tweets02.csv  tweets03.csv  tweets04.csv
tweets05.csv  tweets06.csv  tweets07.csv  tweets08.csv  tweets09.csv

--lines=5 sets the number of lines to include in each file.

--additional-suffix=.csv sets the file extension.

tweets is the base name for each file.

Tweet ids

When outputting tweet ids, .id_str should be used instead of .id. See Ed Summers's blog post for an explanation.


In [13]:
!head -n5 tweets.json | jq -r '.id_str'


798895564335280129
797074713612877824
797064988116611073
794234002496487424
793896478037114880
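
If the dataset might contain duplicate tweets (for example, from overlapping collections), the ids can be de-duplicated with sort -u. A sketch, writing to tweet_ids.txt as in the earlier example:

cat tweets.json | jq -r '.id_str' | sort -u > tweet_ids.txt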
