Getting the first and last tweet dates for each Twitter user

The purpose of this notebook is to extract, for each user in a tweet collection (JSON), the unique user ID, the screen name, the account creation date, the date of the first tweet in the dataset, and the date of the last tweet in the dataset. The result is the table shown in Step 3 below.

It was originally written for a data request from the Program on Extremism, but it can be used for any collection by replacing the input file with your own tweet collection file.

1) Setting the input file (JSON) and output file (CSV)


In [6]:
# For users: Change the filenames as you like.

INPUTFILE = "POE_json2.json"
OUTPUTFILE = "results.csv"

2) Extracting the tweet date, user ID, screen name, and account creation date from the input data


In [2]:
# Write the CSV header row.
!echo "[]" | jq -r '["tweet_created_at","userID", "screen_name", "user_created_at"] | @csv' > "csvdata.csv"
# Append one row per tweet, converting both Twitter timestamps to ISO 8601 (UTC).
!cat $INPUTFILE | jq -r '[(.created_at | strptime("%A %B %d %T %z %Y") | todate), .user.id_str, .user.screen_name, (.user.created_at | strptime("%A %B %d %T %z %Y") | todate)] | @csv' >> "csvdata.csv"
# Preview the first five lines (the header plus four tweets).
!head -5 "csvdata.csv"


"tweet_created_at","userID","screen_name","user_created_at"
"2017-04-10T11:04:58Z","3238710423","NageenNk","2015-05-06T11:21:39Z"
"2017-04-10T11:16:05Z","287745263","dappodan1","2011-04-25T16:06:38Z"
"2017-04-10T11:14:03Z","287745263","dappodan1","2011-04-25T16:06:38Z"
"2017-04-10T11:11:14Z","287745263","dappodan1","2011-04-25T16:06:38Z"
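
For readers without jq, the same per-tweet extraction can be sketched in Python. This is only a minimal sketch, assuming the input holds one tweet object per line and that `created_at` uses Twitter's classic format (for example `Mon Apr 10 11:04:58 +0000 2017`). The helper names `to_iso` and `extract_row` and the sample record are illustrative and not part of the original notebook.

```python
import json
from datetime import datetime, timezone

# Twitter's classic timestamp layout, e.g. "Mon Apr 10 11:04:58 +0000 2017".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_iso(twitter_date):
    """Convert a Twitter timestamp to an ISO 8601 UTC string like jq's todate."""
    parsed = datetime.strptime(twitter_date, TWITTER_FORMAT)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def extract_row(line):
    """Return [tweet_created_at, userID, screen_name, user_created_at] for one JSON line."""
    tweet = json.loads(line)
    user = tweet["user"]
    return [
        to_iso(tweet["created_at"]),
        user["id_str"],
        user["screen_name"],
        to_iso(user["created_at"]),
    ]


# Illustrative record, reconstructed to match the first data row shown above.
sample_line = json.dumps({
    "created_at": "Mon Apr 10 11:04:58 +0000 2017",
    "user": {
        "id_str": "3238710423",
        "screen_name": "NageenNk",
        "created_at": "Wed May 06 11:21:39 +0000 2015",
    },
})

print(extract_row(sample_line))
```

To process a whole file, one could call `extract_row` on each line and write the rows with the standard `csv` module.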

3) Getting first_tweet_date and last_tweet_date for each user


In [3]:
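import pandas as pd

# Load the per-tweet rows written in Step 2.
data = pd.read_csv("csvdata.csv", encoding='ISO-8859-1')
# For each user, take the earliest and latest tweet timestamps.
data2 = data.groupby(['userID', 'screen_name', 'user_created_at']).tweet_created_at.agg(['min', 'max'])
# Turn the group keys back into ordinary columns.
data3 = data2.reset_index()
data3.rename(columns={'min': 'first_tweet_date', 'max': 'last_tweet_date'}, inplace=True)
data3.head(5)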


Out[3]:
   userID    screen_name    user_created_at       first_tweet_date      last_tweet_date
0  3143581   UnitedStates   2007-04-01T18:21:58Z  2016-10-15T21:58:46Z  2017-05-13T00:26:53Z
1  18671937  V_FreaKy       2009-01-06T12:21:55Z  2009-01-06T12:24:33Z  2015-12-07T07:49:36Z
2  37378504  almanialkelli  2009-05-03T06:26:47Z  2009-05-03T06:27:25Z  2017-04-25T18:19:06Z
3  48733347  Antizionism    2009-06-19T15:23:04Z  2009-06-19T16:27:58Z  2010-10-29T22:25:47Z
4  57914577  ShamiWitness   2009-07-18T11:43:46Z  2014-11-20T18:50:52Z  2014-12-11T17:00:39Z
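
The min/max aggregation in the cell above compares the timestamps as strings. That works here because every value shares the same fixed-width ISO 8601 UTC format, so string order matches time order. If the format were ever mixed, parsing the column first would be safer. The following is a minimal sketch under that assumption, using a small in-memory frame built from the four rows previewed in Step 2 rather than the full dataset; the variable names `sample` and `summary` are illustrative.

```python
import pandas as pd

# Four rows copied from the Step 2 preview; not the full dataset.
sample = pd.DataFrame({
    "tweet_created_at": [
        "2017-04-10T11:04:58Z",
        "2017-04-10T11:16:05Z",
        "2017-04-10T11:14:03Z",
        "2017-04-10T11:11:14Z",
    ],
    "userID": ["3238710423", "287745263", "287745263", "287745263"],
    "screen_name": ["NageenNk", "dappodan1", "dappodan1", "dappodan1"],
    "user_created_at": [
        "2015-05-06T11:21:39Z",
        "2011-04-25T16:06:38Z",
        "2011-04-25T16:06:38Z",
        "2011-04-25T16:06:38Z",
    ],
})

# Parse the tweet timestamps so min/max compare real datetimes, not strings.
sample["tweet_created_at"] = pd.to_datetime(sample["tweet_created_at"], utc=True)

# Named aggregation gives the output columns their final names directly.
summary = (
    sample.groupby(["userID", "screen_name", "user_created_at"])["tweet_created_at"]
    .agg(first_tweet_date="min", last_tweet_date="max")
    .reset_index()
)

print(summary)
```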

In [4]:
# Number of unique users (one row per userID, screen_name, user_created_at combination)
len(data3)


Out[4]:
911
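
Because the grouping key includes screen_name, an account that changed its screen name within the collection would occupy more than one row, and `len(data3)` would then exceed the number of distinct user IDs. Whether that happens in this dataset is not known. The sketch below uses made-up rows (the IDs and names are hypothetical) to show how the two counts can be compared.

```python
import pandas as pd

# Made-up result rows: user "111" appears under two screen names.
result = pd.DataFrame({
    "userID": ["111", "111", "222"],
    "screen_name": ["old_name", "new_name", "other_user"],
    "user_created_at": [
        "2010-01-01T00:00:00Z",
        "2010-01-01T00:00:00Z",
        "2012-06-15T12:00:00Z",
    ],
})

row_count = len(result)                  # rows, one per key combination
user_count = result["userID"].nunique()  # distinct user IDs

print(row_count, user_count)
```

If the two numbers match on the real `data3`, each row corresponds to exactly one user.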

4) Exporting the results to a CSV file


In [5]:
# Write the results to the CSV file named by OUTPUTFILE, set in Step 1 of this notebook.
data3.to_csv(OUTPUTFILE, index=False)