Twitter data cleaning and reshaping

Demo for EMSE 6992, "Machine Learning for Analytics" on 1/30/19.

This notebook shows examples of using command-line tools for working with Twitter data.

For this exercise, we're using a dataset with 2,000 tweets collected 01/01/2019 from the Twitter sample stream API. Filename: sample.json

How many tweets are in this file? Look at the first tweet.


In [1]:
!wc -l sample.json


    2000 sample.json

head with a number flag (e.g. -5) determines the number of lines to show. Here's the first line, which is also the first line in line-oriented JSON.


In [2]:
!head -1 sample.json


{"retweeted": false, "is_quote_status": false, "retweet_count": 0, "text": "Happy birthday to a real one\ud83d\udda4 https://t.co/1o4FxnHgXg", "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "favorited": false, "extended_entities": {"media": [{"id": 1080174377138823175, "sizes": {"thumb": {"h": 150, "resize": "crop", "w": 150}, "large": {"h": 2047, "resize": "fit", "w": 1845}, "small": {"h": 680, "resize": "fit", "w": 613}, "medium": {"h": 1200, "resize": "fit", "w": 1082}}, "display_url": "pic.twitter.com/1o4FxnHgXg", "indices": [30, 53], "media_url_https": "https://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg", "id_str": "1080174377138823175", "media_url": "http://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg", "type": "photo", "url": "https://t.co/1o4FxnHgXg", "expanded_url": "https://twitter.com/emxdube/status/1080174384776732673/photo/1"}]}, "reply_count": 0, "in_reply_to_status_id_str": null, "in_reply_to_screen_name": null, "id_str": "1080174384776732673", "display_text_range": [0, 29], "in_reply_to_user_id_str": null, "in_reply_to_user_id": null, "favorite_count": 0, "contributors": null, "filter_level": "low", "id": 1080174384776732673, "truncated": false, "created_at": "Tue Jan 01 18:50:16 +0000 2019", "possibly_sensitive": false, "geo": null, "quote_count": 0, "in_reply_to_status_id": null, "place": null, "entities": {"media": [{"id": 1080174377138823175, "sizes": {"thumb": {"h": 150, "resize": "crop", "w": 150}, "large": {"h": 2047, "resize": "fit", "w": 1845}, "small": {"h": 680, "resize": "fit", "w": 613}, "medium": {"h": 1200, "resize": "fit", "w": 1082}}, "display_url": "pic.twitter.com/1o4FxnHgXg", "indices": [30, 53], "media_url_https": "https://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg", "id_str": "1080174377138823175", "media_url": "http://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg", "type": "photo", "url": "https://t.co/1o4FxnHgXg", "expanded_url": "https://twitter.com/emxdube/status/1080174384776732673/photo/1"}], "hashtags": [], "urls": [], "symbols": [], "user_mentions": []}, "lang": "en", "timestamp_ms": "1546368616657", "coordinates": null, "user": {"follow_request_sent": null, "description": "it\u2019s not that deep", "following": null, "location": "southampton/reading", "profile_background_tile": false, "default_profile_image": false, "url": "http://instagram.com/emdube", "profile_use_background_image": false, "geo_enabled": false, "profile_background_image_url": "http://abs.twimg.com/images/themes/theme10/bg.gif", "profile_link_color": "F58EA8", "statuses_count": 8748, "profile_sidebar_border_color": "FFFFFF", "contributors_enabled": false, "profile_background_color": "BAB3BA", "id_str": "637192198", "utc_offset": null, "followers_count": 1591, "listed_count": 3, "notifications": null, "translator_type": "none", "screen_name": "emxdube", "id": 637192198, "profile_text_color": "399399", "created_at": "Mon Jul 16 18:49:43 +0000 2012", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme10/bg.gif", "lang": "en", "profile_banner_url": "https://pbs.twimg.com/profile_banners/637192198/1545690059", "name": "emily", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1078833355720921089/WaD35kFh_normal.jpg", "time_zone": null, "profile_image_url": "http://pbs.twimg.com/profile_images/1078833355720921089/WaD35kFh_normal.jpg", "favourites_count": 17589, "protected": false, "default_profile": false, "friends_count": 399, "verified": false, "profile_sidebar_fill_color": "855682", "is_translator": false}}

View the first tweet in the file using jq.


In [3]:
!head -1 sample.json | jq '.'


{
  "retweeted": false,
  "is_quote_status": false,
  "retweet_count": 0,
  "text": "Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg",
  "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "favorited": false,
  "extended_entities": {
    "media": [
      {
        "id": 1080174377138823200,
        "sizes": {
          "thumb": {
            "h": 150,
            "resize": "crop",
            "w": 150
          },
          "large": {
            "h": 2047,
            "resize": "fit",
            "w": 1845
          },
          "small": {
            "h": 680,
            "resize": "fit",
            "w": 613
          },
          "medium": {
            "h": 1200,
            "resize": "fit",
            "w": 1082
          }
        },
        "display_url": "pic.twitter.com/1o4FxnHgXg",
        "indices": [
          30,
          53
        ],
        "media_url_https": "https://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg",
        "id_str": "1080174377138823175",
        "media_url": "http://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg",
        "type": "photo",
        "url": "https://t.co/1o4FxnHgXg",
        "expanded_url": "https://twitter.com/emxdube/status/1080174384776732673/photo/1"
      }
    ]
  },
  "reply_count": 0,
  "in_reply_to_status_id_str": null,
  "in_reply_to_screen_name": null,
  "id_str": "1080174384776732673",
  "display_text_range": [
    0,
    29
  ],
  "in_reply_to_user_id_str": null,
  "in_reply_to_user_id": null,
  "favorite_count": 0,
  "contributors": null,
  "filter_level": "low",
  "id": 1080174384776732700,
  "truncated": false,
  "created_at": "Tue Jan 01 18:50:16 +0000 2019",
  "possibly_sensitive": false,
  "geo": null,
  "quote_count": 0,
  "in_reply_to_status_id": null,
  "place": null,
  "entities": {
    "media": [
      {
        "id": 1080174377138823200,
        "sizes": {
          "thumb": {
            "h": 150,
            "resize": "crop",
            "w": 150
          },
          "large": {
            "h": 2047,
            "resize": "fit",
            "w": 1845
          },
          "small": {
            "h": 680,
            "resize": "fit",
            "w": 613
          },
          "medium": {
            "h": 1200,
            "resize": "fit",
            "w": 1082
          }
        },
        "display_url": "pic.twitter.com/1o4FxnHgXg",
        "indices": [
          30,
          53
        ],
        "media_url_https": "https://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg",
        "id_str": "1080174377138823175",
        "media_url": "http://pbs.twimg.com/media/Dv2M33zWoAc83BZ.jpg",
        "type": "photo",
        "url": "https://t.co/1o4FxnHgXg",
        "expanded_url": "https://twitter.com/emxdube/status/1080174384776732673/photo/1"
      }
    ],
    "hashtags": [],
    "urls": [],
    "symbols": [],
    "user_mentions": []
  },
  "lang": "en",
  "timestamp_ms": "1546368616657",
  "coordinates": null,
  "user": {
    "follow_request_sent": null,
    "description": "it’s not that deep",
    "following": null,
    "location": "southampton/reading",
    "profile_background_tile": false,
    "default_profile_image": false,
    "url": "http://instagram.com/emdube",
    "profile_use_background_image": false,
    "geo_enabled": false,
    "profile_background_image_url": "http://abs.twimg.com/images/themes/theme10/bg.gif",
    "profile_link_color": "F58EA8",
    "statuses_count": 8748,
    "profile_sidebar_border_color": "FFFFFF",
    "contributors_enabled": false,
    "profile_background_color": "BAB3BA",
    "id_str": "637192198",
    "utc_offset": null,
    "followers_count": 1591,
    "listed_count": 3,
    "notifications": null,
    "translator_type": "none",
    "screen_name": "emxdube",
    "id": 637192198,
    "profile_text_color": "399399",
    "created_at": "Mon Jul 16 18:49:43 +0000 2012",
    "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme10/bg.gif",
    "lang": "en",
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/637192198/1545690059",
    "name": "emily",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1078833355720921089/WaD35kFh_normal.jpg",
    "time_zone": null,
    "profile_image_url": "http://pbs.twimg.com/profile_images/1078833355720921089/WaD35kFh_normal.jpg",
    "favourites_count": 17589,
    "protected": false,
    "default_profile": false,
    "friends_count": 399,
    "verified": false,
    "profile_sidebar_fill_color": "855682",
    "is_translator": false
  }
}

Choose multiple fields to extract from all tweets, using jq and the -c flag for compact output.

The following examples uses head to only show a few lines and truncate the output here, but you can also use cat to process the whole file.


In [1]:
!head sample.json | jq -c '[.id_str, .text, .user.screen_name, .created_at]'


["1080174384776732673","Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg","emxdube","Tue Jan 01 18:50:16 +0000 2019"]
["1080174388975226881","RT @aixagroetzner: Todos queremos que nos quieran en voz alta","areanax","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388975157250","RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1","fokito","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388987809792","RT @QueridoJeito: “Ou você arregaça as mangas e luta pelo que tem, ou você decide que está cansado e você desiste.” \n— This Is Us","ig_sg15","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388975153152","RT @kofi: maybe my wife will reveal herself to me this year. maybe i’ll become the man i’ve always wanted to be.","floridbeautyy","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388987809793","@SaaD_M_H7 ماعندي جائزه للي يبتسم عادي تقدر تعدي المقطع","AlF__Y","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388971036672","Here is what we came up with after 240 days of work - looking for suggestions/opinions. #indiegames https://t.co/MAwP5vO8MV","SpiderMMO","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388987731968","RT @Sofihassan8: بڑا شور سنتے تھے پہلو میں دل کا                                          جو چیرا تو اک قطرہ خوں نہ نکلا…","LibertyToExpres","Tue Jan 01 18:50:17 +0000 2019"]
["1080174388979343360","RT @anythingELF: ประกาศตามหาเด็กค่ะ น้องหายออกจากบ้านตั้งแต่เมื่อวานตอนเที่ยงคืน\nชื่อ คิมแจฮวาน แต่น้องจะเรียกตัวเองว่า แจนี่\nอายุ 3 ขวบ (ใ…","oooootitleottsc","Tue Jan 01 18:50:17 +0000 2019"]
["1080174389000232960","RT @PopularPups: Short leg problems 😂 https://t.co/OddDt1EWvd","ubetchiwa","Tue Jan 01 18:50:17 +0000 2019"]

For filtering and output, it is helpful to transform the date into a more uesful format. Let's look at the original created_at date and transform it into an ISO 8601 date.


In [2]:
!head sample.json | jq -c '[.created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]'


["Tue Jan 01 18:50:16 +0000 2019","2019-01-01T13:50:16Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]
["Tue Jan 01 18:50:17 +0000 2019","2019-01-01T13:50:17Z"]

Look at all of the tweets in particular language, in this case Spanish ("es").


In [3]:
!head sample.json | jq -c 'select(.lang | contains("es")) | [.text]'


["RT @aixagroetzner: Todos queremos que nos quieran en voz alta"]
["RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1"]

Create a CSV with a subset of the fields.


In [7]:
!cat sample.json | jq -r '[.id_str, .created_at, .text] | @csv' > tweets.csv

In [8]:
!head -20 tweets.csv


"1080174384776732673","Tue Jan 01 18:50:16 +0000 2019","Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg"
"1080174388975226881","Tue Jan 01 18:50:17 +0000 2019","RT @aixagroetzner: Todos queremos que nos quieran en voz alta"
"1080174388975157250","Tue Jan 01 18:50:17 +0000 2019","RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1"
"1080174388987809792","Tue Jan 01 18:50:17 +0000 2019","RT @QueridoJeito: “Ou você arregaça as mangas e luta pelo que tem, ou você decide que está cansado e você desiste.” 
— This Is Us"
"1080174388975153152","Tue Jan 01 18:50:17 +0000 2019","RT @kofi: maybe my wife will reveal herself to me this year. maybe i’ll become the man i’ve always wanted to be."
"1080174388987809793","Tue Jan 01 18:50:17 +0000 2019","@SaaD_M_H7 ماعندي جائزه للي يبتسم عادي تقدر تعدي المقطع"
"1080174388971036672","Tue Jan 01 18:50:17 +0000 2019","Here is what we came up with after 240 days of work - looking for suggestions/opinions. #indiegames https://t.co/MAwP5vO8MV"
"1080174388987731968","Tue Jan 01 18:50:17 +0000 2019","RT @Sofihassan8: بڑا شور سنتے تھے پہلو میں دل کا                                          جو چیرا تو اک قطرہ خوں نہ نکلا…"
"1080174388979343360","Tue Jan 01 18:50:17 +0000 2019","RT @anythingELF: ประกาศตามหาเด็กค่ะ น้องหายออกจากบ้านตั้งแต่เมื่อวานตอนเที่ยงคืน
ชื่อ คิมแจฮวาน แต่น้องจะเรียกตัวเองว่า แจนี่
อายุ 3 ขวบ (ใ…"
"1080174389000232960","Tue Jan 01 18:50:17 +0000 2019","RT @PopularPups: Short leg problems 😂 https://t.co/OddDt1EWvd"
"1080174388975190016","Tue Jan 01 18:50:17 +0000 2019","RT @BoxOffice: SECOND ACT
$7.36M Weekend (New Est.)
2,607 Screens / $2,823 Avg.
Weekend 2 / +14% Change
$21.91M Total (North America)
#Seco…"
"1080174388979294208","Tue Jan 01 18:50:17 +0000 2019","RT @eryndil_staff: しまったぁぁぁ!!!ツイートこけてたぁぁぁ😂😂😂うっそやろぉぉぉ😭

Note that that newlines in the tweet text break the CSV format, which should only have line breaks at the end of the row. Let's fix problems with the text field in the tweets by replacing the \n newline character with a space.


In [4]:
!cat sample.json | jq -r '[.id_str, (.text | gsub("\n";" "))] | @csv' > tweets.csv

In [5]:
!head -20 tweets.csv


"1080174384776732673","Happy birthday to a real one🖤 https://t.co/1o4FxnHgXg"
"1080174388975226881","RT @aixagroetzner: Todos queremos que nos quieran en voz alta"
"1080174388975157250","RT @PCaterianoB: Cateriano afirma que el fujiaprismo se enfoca en “salvar el pellejo de ‘AG’ y ‘señora K’” | Canal N https://t.co/wEL1dpvsy1"
"1080174388987809792","RT @QueridoJeito: “Ou você arregaça as mangas e luta pelo que tem, ou você decide que está cansado e você desiste.”  — This Is Us"
"1080174388975153152","RT @kofi: maybe my wife will reveal herself to me this year. maybe i’ll become the man i’ve always wanted to be."
"1080174388987809793","@SaaD_M_H7 ماعندي جائزه للي يبتسم عادي تقدر تعدي المقطع"
"1080174388971036672","Here is what we came up with after 240 days of work - looking for suggestions/opinions. #indiegames https://t.co/MAwP5vO8MV"
"1080174388987731968","RT @Sofihassan8: بڑا شور سنتے تھے پہلو میں دل کا                                          جو چیرا تو اک قطرہ خوں نہ نکلا…"
"1080174388979343360","RT @anythingELF: ประกาศตามหาเด็กค่ะ น้องหายออกจากบ้านตั้งแต่เมื่อวานตอนเที่ยงคืน ชื่อ คิมแจฮวาน แต่น้องจะเรียกตัวเองว่า แจนี่ อายุ 3 ขวบ (ใ…"
"1080174389000232960","RT @PopularPups: Short leg problems 😂 https://t.co/OddDt1EWvd"
"1080174388975190016","RT @BoxOffice: SECOND ACT $7.36M Weekend (New Est.) 2,607 Screens / $2,823 Avg. Weekend 2 / +14% Change $21.91M Total (North America) #Seco…"
"1080174388979294208","RT @eryndil_staff: しまったぁぁぁ!!!ツイートこけてたぁぁぁ😂😂😂うっそやろぉぉぉ😭 元旦からこんな調子ですが、本年もよろしくお願いします🙇🙏 https://t.co/AOESjNrXo8"
"1080174389000327168","RT @abuaardvark: Inside the stricken court of crown prince Mohammed Bin Salman - profile by Richard Spencer https://t.co/xx39yGb6X8"
"1080174388983595011","@MichalKubal @jakubhamleti Michale, když tě tu vidím... prosím tě, ten rozhovor s AB už byl? Bude? Sorry, jsem zmatený"
"1080174389008781312","RT @Peppermint_kis: Follow everyone who replies to this 💋💋💋💋💋💋Peppermint Gains 💋💋💋..."
"1080174389000265728","🍻🍻"
"1080174388983427072","Last puyat na to HAHAHA pero pag may nandemonyo sa monday na ko pasok"
"1080174388991864832","RT @6D4GyW2BncWrf2c: ေက်ာင္​းသူ😃 https://t.co/GOqc5OtfFG"
"1080174388991922177","RT @Kintasz: O melhor bolo do mundo é bolo de bolacha, quem disser o contrário leva porrada"
"1080174388987785216","@___Andrzej____ @pisorgpl 👍👍👍👍👍👍👏👏👏👏👏👏👏"

Hashtags are in nested JSON, so let's flattening the hashtags into a semi-colon delimited list.


In [9]:
!head -200 sample.json | jq -c 'select(.entities.hashtags | length >= 1) | [([.entities.hashtags[].text] | join(";"))]'


["indiegames"]
["potus;TDS;haters"]
["PossePresidencial"]
["trust;trusttheprocess;trustissues;life;livelifetofullest;motivating;motivationalquotes;motivationalquote"]
["HappNewYear2019"]
["GetUp"]
["MerryChristmas"]
["ARCplay"]
["NH106;BackToTheFuture"]
["IfChristmasWereAPerson"]
["izmireſcort;bucaeſcort"]
["انتصارات_السعوديه_العظمي_2018"]
["BamFlinstone;OnTheBeat"]
["UniqueNewYearCelebrations"]
["TwitterTuesday;Lima;Peru"]
["AudiogameJam3;charity;gamejam"]
["FiestaBowl"]
["RunBTSReturns"]
["UniqueNewYearCelebrations"]