Sure, you could write a script, but you can often get the job done from the command line.
This is an assortment of command-line tools that I use for wrangling Twitter data. Of course, most of these tools and techniques can be used to wrangle other types of data as well. If you have others, please let me know.
To illustrate the tools, I retrieved the tweets posted by @gelmanlibrary and @gwtweets using DocNow's Twarc, a command-line tool for retrieving data from Twitter's API.
twarc timeline gwtweets > gwtweets.jsonl
twarc timeline gelmanlibrary > gelmanlibrary.jsonl
gwtweets.jsonl and gelmanlibrary.jsonl are line-oriented JSON files, i.e., each line contains a single tweet encoded as JSON.
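To see what one of these lines looks like, you can pull a couple of standard v1.1 tweet fields (id_str and text) out of the first tweet with jq, which is introduced below:
!head -1 gwtweets.jsonl | jq '{id_str, text}'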
Mac tip: To get the GNU/Linux versions of these commands, brew install coreutils.
In [1]:
!wc -l *.jsonl
wc gotcha: When counting lines in a very large number of files (e.g., when the file list is expanded through xargs), wc may be invoked more than once, outputting a partial total and then restarting the count.
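A sketch of the workaround, assuming GNU find and xargs: stream the file contents through a single wc instead of passing it the filenames.
!find . -name "*.jsonl" -print0 | xargs -0 cat | wc -l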
In [2]:
!cat gwtweets.jsonl | gzip > gwtweets.jsonl.gz
!cat gelmanlibrary.jsonl | gzip > gelmanlibrary.jsonl.gz
!ls *.jsonl.gz
In [3]:
!gunzip -c *.jsonl.gz | wc -l
In [4]:
!gunzip -c *.jsonl.gz | awk 'NR % 5 == 0' > sample.jsonl
!wc -l sample.jsonl
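The awk 'NR % 5 == 0' filter keeps every fifth line, i.e., a systematic 20% sample. For a random sample of a fixed size instead, GNU shuf works (a sketch; 1,000 is an arbitrary sample size):
!gunzip -c *.jsonl.gz | shuf -n 1000 > random_sample.jsonl
!wc -l random_sample.jsonl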
In [5]:
!gunzip -c *.jsonl.gz | split --lines=1000 -d --additional-suffix=.jsonl - tweets-
!wc -l tweets-*.jsonl
jq excels at transforming JSON data. Because jq is such a useful tool for Twitter data, we've already dedicated several blog posts to it. However, here is an example of one of its uses.
In [6]:
!gunzip -c *.jsonl.gz | jq -r '.entities.user_mentions[].screen_name' > screen_names.txt
!head -10 screen_names.txt
In [7]:
!gunzip -c *.jsonl.gz | jq -r '.entities.user_mentions[].screen_name' | sort > sorted_screen_names.txt
!head -10 sorted_screen_names.txt
In [8]:
!gunzip -c *.jsonl.gz | jq -r '.entities.user_mentions[].screen_name' | sort | uniq -c | sort -nr > unique_screen_names.txt
!head -10 unique_screen_names.txt
In [9]:
%%bash
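# tee copies the stream: one copy goes to the process substitution, which
# extracts user ids; the other continues down the pipe to extract tweet ids.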
gunzip -c gwtweets.jsonl.gz | tee >(jq -r '.user.id_str' > gwtweets-user_ids.txt) | jq -r '.id_str' > gwtweets-tweet_ids.txt
head -5 gwtweets-tweet_ids.txt
head -5 gwtweets-user_ids.txt
parallel can really speed up processes that involve multiple files. It is also useful for repeating a task multiple times, substituting in values listed in a file. The -j option controls the number of parallel processes; choose a number appropriate for the number of free CPUs available.
In [10]:
%%bash
ls -1 tweets-*.jsonl > src.lst
cat src.lst | sed 's/\.jsonl$/.jsonl.gz/' > dest.lst
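# --xapply pairs line N of src.lst ({1}) with line N of dest.lst ({2}),
# instead of generating every combination of the two lists.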
parallel -a src.lst -a dest.lst -j 2 --xapply "cat {1} | gzip > {2}"
ls tweets-*.jsonl.gz
In [11]:
!git clone https://github.com/DocNow/twarc.git
!pip install -e twarc
In [12]:
!gunzip -c gwtweets.jsonl.gz | python twarc/utils/json2csv.py -x - > gwtweets.csv
!gunzip -c gelmanlibrary.jsonl.gz | python twarc/utils/json2csv.py -x - > gelmanlibrary.csv
!head -2 gwtweets.csv
Tip when loading a tweet CSV into Excel: If you open a tweet CSV directly in Excel, it will mangle tweet and user ids, since they exceed the 15 significant digits that Excel stores for numbers. For example, 976161920133816322 will become 976161920133816000.
To correctly import a tweet CSV into Excel, select Data > Get External Data > Import File. When given the option of selecting the data type for fields, select text for all id fields.
csvkit supports a wide variety of operations for filtering and transforming CSV files. Here are a few highlights.
In [13]:
!pip install csvkit
In [14]:
!csvcut -c id,created_at,text gelmanlibrary.csv > gelmanlibrary_cut.csv
!head -3 gelmanlibrary_cut.csv
In [15]:
!csvstack gelmanlibrary.csv gwtweets.csv > merged.csv
!wc -l *.csv
In [16]:
!csvgrep -c tweet_type -m reply gelmanlibrary.csv | csvcut -c id,in_reply_to_status_id > replies_to_in_reply_to.csv
!head -5 replies_to_in_reply_to.csv
Here I'm joining some @gelmanlibrary replies to the tweets that they are replies to.
The @gelmanlibrary reply will be on the left; the tweet being replied to will be on the right (with its field names suffixed with "2").
Be careful not to use csvjoin on large CSVs: it loads the entire files into memory.
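The gelmanlibrary_replies_to.csv file below contains the replied-to tweets themselves; it isn't created in the steps above. One hypothetical way to produce it (reply_ids.txt is a name I'm introducing here) is to hydrate the in_reply_to_status_id values with Twarc and convert them to CSV:
!csvcut -c in_reply_to_status_id replies_to_in_reply_to.csv | tail -n +2 > reply_ids.txt
!twarc hydrate reply_ids.txt > gelmanlibrary_replies_to.jsonl
!python twarc/utils/json2csv.py -x gelmanlibrary_replies_to.jsonl > gelmanlibrary_replies_to.csv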
In [17]:
!csvjoin -c in_reply_to_status_id,id gelmanlibrary.csv gelmanlibrary_replies_to.csv > gelmanlibrary_with_replies_to.csv
!head -2 gelmanlibrary_with_replies_to.csv
In addition to json2csv.py, Twarc includes a number of other useful tweet utilities (docs and complete list of scripts).
Here are some of my favorites.
Supports tweet compliance by retrieving the most current versions of tweets or removing unavailable (deleted or protected) tweets.
Also useful for splitting out deleted tweets.
Not surprisingly, deduplicates tweets. For a retweet, --extract-retweets will return both the retweet and the source tweet (i.e., the tweet that was retweeted). This is useful for extracting all of the tweets in a dataset.
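A sketch of usage, assuming the deduplication script is twarc/utils/deduplicate.py and that it reads tweets from stdin (check the utils directory for the exact name and invocation):
!gunzip -c *.jsonl.gz | python twarc/utils/deduplicate.py --extract-retweets > deduped.jsonl
!wc -l deduped.jsonl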
Attempts to determine why a tweet was deleted, e.g., tweet deleted, user protected, retweet deleted.
Unshortens URLs contained in tweets and adds them to the tweet.