Computing the top hashtags (JSON)

So you have tweets in a JSON file, and you'd like a list of the hashtags they contain, ranked from most to least frequent.

There are many, many different ways to accomplish this. Since we're working with the tweets in JSON format, this solution will use jq, as well as a few standard command line tools: cat, sort, uniq, wc, and head. If you haven't used jq yet, our Working with Twitter Using jq notebook is a good place to start.

Where are the hashtags in tweet JSON?

When we look at a tweet, we see that it has a key called entities, and that the value of entities contains a key called hashtags. The value of hashtags is a list (note the square brackets); each item in the list contains the text of a single hashtag, and the indices of the characters in the tweet text where the hashtag begins and ends.

{
  created_at: "Tue Oct 30 09:15:45 +0000 2018",
  id: 1057199367411679200,
  id_str: "1057199367411679234",
  text: "Lesson from Indra's elephant https://t.co/h5K3y5g4Ju #India #Hinduism #Buddhism #History #Culture https://t.co/qFyipqzPnE",

  ...

  entities: {
    hashtags: [
      {
        text: "India",
        indices: [
          54,
          60
        ]
      },
      {
        text: "Hinduism",
        indices: [
          61,
          70
        ]
      },
      {
        text: "Buddhism",
        indices: [
          71,
          80
        ]
      },
      {
        text: "History",
        indices: [
          81,
          89
        ]
      },
      {
        text: "Culture",
        indices: [
          90,
          98
        ]
      }
    ],
 ...
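
To make that structure concrete, here is a minimal Python sketch of the same lookup: step into entities, then hashtags, and collect each item's text. The helper name hashtag_texts and the trimmed-down sample tweet are illustrative assumptions, not part of the original notebook. Note that real JSON quotes its keys, so the sample below does too.

```python
import json

def hashtag_texts(tweet):
    """Return the text of every hashtag in one tweet dict (hypothetical helper)."""
    # Fall back to empty containers so a tweet without entities yields [].
    entities = tweet.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags") or []]

# A trimmed-down tweet with the same shape as the excerpt above.
sample = json.loads("""
{
  "id_str": "1057199367411679234",
  "entities": {
    "hashtags": [
      {"text": "India", "indices": [54, 60]},
      {"text": "Hinduism", "indices": [61, 70]}
    ]
  }
}
""")

print(hashtag_texts(sample))
```

The fallback to empty containers is a design choice for this sketch: a tweet with no entities key simply contributes no hashtags instead of raising an error.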

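When we use jq, we'll need to construct a filter that steps into entities, then hashtags, iterates over the list, and outputs each item's text value. The -r option makes jq print raw strings rather than JSON-quoted ones.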


In [2]:
!cat 50tweets.json | jq -r '.entities.hashtags[].text'


김유겸
유겸
Yugyeom
GOT7
갓세븐
PresentYou
LifeSite
あなたの名前から想像される色
صباح_الخير
music
network
ShootOut1stWin
acabateloparaustedes

That works on a small file. Now let's run the same filter on a much larger file, tweets4hashtags.json, and save the output to a text file, one hashtag per line:

In [3]:
!cat tweets4hashtags.json | jq -r '.entities.hashtags[].text' > allhashtags.txt
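
If you'd rather stay in Python for this step, a rough equivalent might look like the sketch below. It assumes the file holds one JSON tweet per line, which may not match how your file is laid out. The function name iter_hashtags and the demo data are made up for illustration; the demo writes a tiny stand-in file rather than touching the real dataset.

```python
import json
import os
import tempfile

def iter_hashtags(path):
    """Yield hashtag texts from a file with one JSON tweet per line (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            entities = tweet.get("entities") or {}
            for tag in entities.get("hashtags") or []:
                yield tag["text"]

# Demo on a tiny stand-in file (hypothetical data, not the real dataset).
demo_tweets = [
    {"entities": {"hashtags": [{"text": "GOT7"}, {"text": "Yugyeom"}]}},
    {"entities": {"hashtags": []}},
    {"entities": {"hashtags": [{"text": "music"}]}},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for tweet in demo_tweets:
            f.write(json.dumps(tweet) + "\n")
    demo_hashtags = list(iter_hashtags(path))

print(demo_hashtags)
```

Because iter_hashtags is a generator, it reads the file one line at a time, which matters when the file holds hundreds of thousands of tweets.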

Let's see how many hashtags we extracted:


In [4]:
!wc -l allhashtags.txt


  878806 allhashtags.txt

What we'd like to do now is to count up how many of each hashtag we have. We'll use a combination of the sort and uniq commands for that. uniq only collapses identical lines that are adjacent to each other, so we sort first to bring identical hashtags together. The -c option for uniq prefixes each line with the number of lines it collapsed. A second sort with the -nr options then orders the result numerically by that leading count, largest first.


In [5]:
!cat allhashtags.txt | sort | uniq -c | sort -nr > rankedhashtags.txt
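
To see why the first sort matters, here is a small Python sketch that imitates the pipeline. itertools.groupby, like uniq, only groups adjacent equal items, so it makes a reasonable stand-in. The function name uniq_c and the sample list are made up for illustration.

```python
from itertools import groupby

def uniq_c(lines):
    """Mimic `uniq -c`: collapse runs of adjacent identical lines into (count, line)."""
    return [(len(list(run)), value) for value, run in groupby(lines)]

tags = ["EXO", "BTS", "EXO", "GOT7", "EXO", "BTS"]  # made-up sample

# Without sorting, repeated tags are not adjacent, so nothing collapses.
print(uniq_c(tags))

# Sorting first brings identical tags together, as in `sort | uniq -c`.
counts = uniq_c(sorted(tags))
print(counts)

# Like `sort -nr`: order by count, largest first.
ranked = sorted(counts, key=lambda pair: pair[0], reverse=True)
print(ranked)
```

In everyday Python you would more likely reach for collections.Counter and its most_common() method, which counts without needing a sort at all; the version above is only meant to mirror what the shell commands are doing.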

Let's take a look at what we have now.


In [6]:
!head -n 50 rankedhashtags.txt


8170 EXO
4123 BTS
4061 TEMPO_SEHUN
3365 GOT7
3145 ローソン
2924 SEHUN
2773 EXO_DontMessUpMyTempo
2743 weareoneEXO
2705 Halloween
2661 갓세븐
2647 EXO_TEMPO
2645 워너블다음은없어
2403 몬스타엑스
2355 MONSTA_X
2339 엑소
2279 ごちろう
2267 지니인기상_달려라상탄
2165 방탄소년단
2161 塩にぎり無料プレゼント
2042 ShootOut1stWin
1951 ハロウィン
1933 아이즈원
1874 เป๊กผลิตโชค
1873 IZONE
1753 フードファンタジー
1675 フーファン
1668 食霊のティアラ
1635 Ask_EXO
1551 어디에도_없을_완벽한_EXO
1494 AppleEvent
1465 도경수
1444 ShootOut
1403 WasteItOnMe
1370 TWICE
1367 NCT
1271 SomosLaAudiencia30
1267 NewProfilePic
1253 백일의낭군님
1223 ﷺ
1183 BAEKHYUN
1174 더쇼
1163 재민
1123 MONSTAX
1120 트와이스
1068 ジェジュン
1063 ALDUBStillReigns
1054 JIMIN
1026 RMonoBB200
1014 RT
1002 EXO_QuintupleMillionSeller

Personally, I have no idea what most of these hashtags are about, but this is apparently what people were tweeting about on October 31, 2018.

And as for how many unique hashtags are in this set:


In [7]:
!wc -l rankedhashtags.txt


  211986 rankedhashtags.txt
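
One caveat about that number: uniq compares lines exactly, so Halloween and halloween count as two different hashtags, even though Twitter, as far as we know, treats hashtags as case-insensitive. If you'd like to merge such variants, one possible approach is to fold case before counting. The sketch below uses made-up sample data and a hypothetical helper name, count_case_insensitive.

```python
from collections import Counter

def count_case_insensitive(hashtags):
    """Count hashtags after folding case, so 'Halloween' and 'halloween' merge."""
    return Counter(tag.casefold() for tag in hashtags)

tags = ["Halloween", "halloween", "HALLOWEEN", "EXO", "exo", "BTS"]  # made-up sample

exact = Counter(tags)                  # exact matching, like sort | uniq -c
folded = count_case_insensitive(tags)  # case-insensitive matching

print(len(exact), len(folded))
print(folded.most_common(2))
```

Folding case loses the original capitalization, so if you want to display EXO rather than exo you would need to keep track of one representative spelling per hashtag.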

Again, there are many different ways to approach this! Let us know your thoughts and ideas.