So you have tweets in a JSON file, and you'd like to get a list of the hashtags, from the most frequently occurring hashtags on down.
There are many, many different ways to accomplish this. Since we're working with the tweets in JSON format, this solution will use jq, as well as a few bash shell / command line tools: cat, sort, uniq, and wc. If you haven't used jq yet, our Working with Twitter Using jq notebook is a good place to start.
When we look at a tweet, we see that it has a key called entities, and that the value of entities contains a key called hashtags. The value of hashtags is a list (note the square brackets); each item in the list contains the text of a single hashtag, and the indices of the characters in the tweet text where the hashtag begins and ends.
{
created_at: "Tue Oct 30 09:15:45 +0000 2018",
id: 1057199367411679200,
id_str: "1057199367411679234",
text: "Lesson from Indra's elephant https://t.co/h5K3y5g4Ju #India #Hinduism #Buddhism #History #Culture https://t.co/qFyipqzPnE",
...
entities: {
hashtags: [
{
text: "India",
indices: [
54,
60
]
},
{
text: "Hinduism",
indices: [
61,
70
]
},
{
text: "Buddhism",
indices: [
71,
80
]
},
{
text: "History",
indices: [
81,
89
]
},
{
text: "Culture",
indices: [
90,
98
]
}
],
...
When we use jq, we'll need to construct a filter that pulls out the hashtag text values.
In [2]:
!cat 50tweets.json | jq -cr '[.entities.hashtags][0][].text'
In [3]:
!cat tweets4hashtags.json | jq -cr '[.entities.hashtags][0][].text' > allhashtags.txt
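To make the filter concrete, here is a minimal sketch that runs it on three made-up tweets. The file names sample_tweets.json and sample_hashtags.txt and the tweet contents are assumptions for illustration, not part of the dataset used in this notebook. The sketch uses the shorter filter .entities.hashtags[].text, which should produce the same output as the form used in the cells above.

```shell
# Three hypothetical tweets, one JSON object per line (illustrative data only).
cat > sample_tweets.json <<'EOF'
{"id_str":"1","entities":{"hashtags":[{"text":"India","indices":[0,6]},{"text":"History","indices":[7,15]}]}}
{"id_str":"2","entities":{"hashtags":[]}}
{"id_str":"3","entities":{"hashtags":[{"text":"India","indices":[0,6]}]}}
EOF

# .entities.hashtags[] iterates over each hashtag object in a tweet;
# .text keeps only the hashtag text. -r prints raw strings without JSON quotes.
jq -r '.entities.hashtags[].text' sample_tweets.json > sample_hashtags.txt

# Prints India, History, India (one per line); the tweet with an empty
# hashtags list contributes nothing.
cat sample_hashtags.txt
```

Each hashtag ends up on its own line, which is the shape the later sort and uniq steps expect.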
Let's see how many hashtags we extracted:
In [4]:
!wc -l allhashtags.txt
What we'd like to do now is to count up how many of each hashtag we have. We'll use a combination of the sort and uniq commands for that. uniq only collapses identical lines that are next to each other, so we sort the list first. We'll also use the -c option for uniq, which prefixes each line with the number of identical lines it collapsed together. A second sort with the -nr options then orders the result numerically by that leading count (-n), largest first (-r).
In [5]:
!cat allhashtags.txt | sort | uniq -c | sort -nr > rankedhashtags.txt
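The same pipeline can be tried on a tiny input to see what each stage contributes. This is only a sketch: the file names demo_hashtags.txt and demo_ranked.txt and the six hashtags are made up for illustration.

```shell
# Hypothetical hashtag list, one per line, deliberately unsorted.
printf '%s\n' History India Culture India History India > demo_hashtags.txt

# sort groups identical lines together, because uniq only merges adjacent lines.
# uniq -c prefixes each distinct line with its count.
# sort -nr orders numerically by that count, largest first.
sort demo_hashtags.txt | uniq -c | sort -nr > demo_ranked.txt

# Shows India (3), History (2), Culture (1), each preceded by its count.
cat demo_ranked.txt

# Each distinct hashtag occupies one line, so the line count
# is the number of unique hashtags (3 here).
wc -l < demo_ranked.txt
```

Skipping the first sort would leave repeated hashtags scattered through the list, and uniq -c would then report several separate counts for the same hashtag.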
Let's take a look at what we have now.
In [6]:
!head -n 50 rankedhashtags.txt
Personally, I have no idea what most of these hashtags are about, but this is apparently what people were tweeting about on October 31, 2018.
And as for how many unique hashtags are in this set:
In [7]:
!wc -l rankedhashtags.txt
Again, there are many different ways to approach this! Let us know your thoughts and ideas.