Looking at the data from Ed's archive, here are the elements that I think are useful for analysis. At the end I have included the tweet this example data came from, as well as a link to the Twitter API description of the tweet object.
In [ ]:
ferguson_aug_username_unique.txt
ferguson_data_set_august.csv
ferguson_data_set_small.csv
ferguson_hashtags.csv
ferguson_nov_username_all.txt
ferguson_data_follower_count_gte0.csv
ferguson_data_set.csv
ferguson_data_set_small_user_count_gte100.csv
ferguson_hashtags_small.csv
In [ ]:
# USERID, USERNAME
t["user"]["id"], t["user"]["screen_name"]
# (352778346, u'LealtadEternaLV')
# LANGUAGE, TIME OFFSET (ENGLISH ONLY, AND US TIMEZONES TO NARROW DOWN TWEETS?)
t["user"]["lang"], t["user"]["utc_offset"]
# (u'es', -16200)
# FOLLOWERS (ACCOUNTS THAT FOLLOW THIS USER), FRIENDS (ACCOUNTS THIS USER FOLLOWS)
t["user"]["followers_count"], t["user"]["friends_count"]
# (3109, 3377)
In [ ]:
# UID FOR THE TWEET
t["id"]
# 502138116384100352
# TIME TWEET WAS CREATED (SAME AS SENT?)
t["created_at"]
# Wed Aug 20 17:00:30 +0000 2014
# TIMEZONE OFFSET (NECESSARY FOR "NORMALIZING" TIME?)
t["user"]["utc_offset"]
# -16200
# ACTUAL TWEET (I WISH THERE WAS A WAY TO PARSE THIS SO WE ONLY HAD THE TEXT, BUT OFTEN HASHTAGS ARE USED
# AS WORDS IN THE TEXT. I THINK WE SHOULD AT LEAST GET RID OF "RT" AND LINKS)
t["text"]
# u'RT @RomelBolivar: #VenezuelaPuebloHumanitario Asi Repelen los disturbios en
# #Ferguson #EEUU Opositor esto tu no lo llamas Represi\xf3n ? http:\u2026'
# ALL HASHTAGS IN THE TEXT
t["entities"]["hashtags"]
# [{u'indices': [18, 45], u'text': u'VenezuelaPuebloHumanitario'},
# {u'indices': [76, 85], u'text': u'Ferguson'},
# {u'indices': [86, 91], u'text': u'EEUU'}]
# USERS MENTIONED IN THE TWEET, ID and SCREENNAME
t["entities"]["user_mentions"][0]["id"], t["entities"]["user_mentions"][0]["screen_name"]
# (108731188, u'RomelBolivar')
# NUMBER OF PEOPLE WHO FAVORITED THE CURRENT TWEET, IF RETWEET THIS WILL BE DIFFERENT THAN THE ORIGINAL
t["favorite_count"]
# 34
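The wish above to drop "RT" and links from the tweet text could be sketched as follows. This is only a rough sketch: `clean_text` and the two regular expressions are my own illustrative names and patterns, not part of the Twitter API, and they assume retweets start with `RT @user:`. Hashtags are left in place because they are often used as words.

```python
import re

# Illustrative patterns (assumptions), not an exhaustive tweet grammar.
RT_PREFIX = re.compile(r"^RT\s+@\w+:\s*")   # leading "RT @user:" marker
URL = re.compile(r"https?:\S*")             # full or truncated links


def clean_text(text):
    """Drop a leading 'RT @user:' marker and any links; keep hashtags."""
    text = RT_PREFIX.sub("", text)
    text = URL.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


sample = (u"RT @RomelBolivar: #VenezuelaPuebloHumanitario Asi Repelen los "
          u"disturbios en #Ferguson #EEUU http:\u2026")
cleaned = clean_text(sample)
```

The `entities` indices could be used instead of regular expressions, but the truncated link in this example ("http:…") does not appear in `entities["urls"]`, which is why the sketch works on the raw text.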
In [ ]:
# RETWEETED? (THERE IS A "retweeted" field but it shows False despite the count being greater than zero.
# That field only reports whether the authenticating user retweeted this tweet, not whether anyone did.
# Hence the check on retweet_count.)
bool(t["retweet_count"])
# True
# NUMBER OF RETWEETS OF THE CURRENT TWEET
t["retweet_count"]
# 109
# ID OF THE ORIGINAL TWEET BEING RETWEETED (FOR NETWORK ANALYSIS?)
t["retweeted_status"]["id"]
# 502098277094129664
# DATE TIME ORIGINAL TWEET WAS CREATED
t["retweeted_status"]["created_at"]
# u'Wed Aug 20 14:22:12 +0000 2014'
# NUMBER OF FAVORITES OF THE ORIGINAL TWEET
t["retweeted_status"]["favorite_count"]
# 10
# NUMBER OF TWEETS THE ORIGINAL TWEET'S AUTHOR HAS FAVORITED (A USER-LEVEL COUNT, NOT FAVORITES OF THE ORIGINAL TWEET)
t["retweeted_status"]["user"]["favourites_count"]
# 3479
In [ ]:
# SCREEN NAME OF ORIGINAL USER OF RETWEETED TWEET
t["retweeted_status"]["user"]["screen_name"]
# u'RomelBolivar'
# USER ID OF ORIGINAL USER OF RETWEETED TWEET
t["retweeted_status"]["user"]["id"]
# 108731188
# TOTAL FOLLOWERS OF ORIGINAL TWEET USER
t["retweeted_status"]["user"]["followers_count"]
# 59545
# TOTAL FRIENDS OF ORIGINAL TWEET USER
t["retweeted_status"]["user"]["friends_count"]
# 59222
# UTC OFFSET (TIMEZONE) OF ORIGINAL USER OF RETWEETED TWEET
t["retweeted_status"]["user"]["utc_offset"]
# -16200
In [ ]:
# SCREEN NAME OF THE USER THE RETWEET WAS ORIGINALLY REPLYING TO
t["retweeted_status"]["in_reply_to_screen_name"]
# u'ReplyToUsername'
# USER ID OF THE USER THE RETWEET WAS ORIGINALLY REPLYING TO
t["retweeted_status"]["in_reply_to_user_id"]
# 837103894
# TWEET ID OF THE TWEET THE RETWEET WAS ORIGINALLY REPLYING TO
t["retweeted_status"]["in_reply_to_status_id"]
# 515477293597193041
If the current tweet is a reply to another tweet:
In [ ]:
# SCREEN NAME OF THE PERSON THE TWEET IS REPLYING TO
t["in_reply_to_screen_name"]
# u'HassanTheeOne'
# USER ID OF THE PERSON THE TWEET IS REPLYING TO
t["in_reply_to_user_id"]
# 23497553
# TWEET ID OF THE ORIGINAL TWEET THE CURRENT TWEET IS REPLYING TO
t["in_reply_to_status_id"]
# 502137293596262401
Note: in some cases this example tweet has no data for items that show data above. That is because the tweet itself had nothing to illustrate them (say, it wasn't in reply to anything). Feel free to pull data from above and fill it in so that this example is more complete.
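Because fields such as `in_reply_to_screen_name` are `None` or missing on many tweets, chained lookups like `t["retweeted_status"]["user"]["id"]` will fail on tweets that are not retweets. A small helper along these lines might make the lookups safe. `dig` is a hypothetical name of my own, and it only walks nested dicts, not lists.

```python
def dig(obj, *keys):
    """Follow keys through nested dicts; return None if any step is absent or None."""
    for key in keys:
        if not isinstance(obj, dict) or obj.get(key) is None:
            return None
        obj = obj[key]
    return obj


# A cut-down stand-in for a tweet that is neither a retweet nor a reply.
t = {"id": 502138116384100352, "in_reply_to_screen_name": None,
     "user": {"id": 352778346}}

author_id = dig(t, "user", "id")                            # present
orig_author = dig(t, "retweeted_status", "user", "id")      # key missing
reply_to = dig(t, "in_reply_to_screen_name")                # explicitly None
```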
In [ ]:
{u'contributors': None,
u'coordinates': None,
u'created_at': u'Wed Aug 20 17:00:30 +0000 2014',
u'entities': {u'hashtags': [{u'indices': [18, 45],
u'text': u'VenezuelaPuebloHumanitario'},
{u'indices': [76, 85], u'text': u'Ferguson'},
{u'indices': [86, 91], u'text': u'EEUU'}],
u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
u'id': 502098274686623744,
u'id_str': u'502098274686623744',
u'indices': [139, 140],
u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
u'source_status_id': 502098277094129664,
u'source_status_id_str': u'502098277094129664',
u'type': u'photo',
u'url': u'http://t.co/8Hc2Upitgk'}],
u'symbols': [],
u'urls': [],
u'user_mentions': [{u'id': 108731188,
u'id_str': u'108731188',
u'indices': [3, 16],
u'name': u'Romel Bol\xedvar ',
u'screen_name': u'RomelBolivar'}]},
u'extended_entities': {u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
u'id': 502098274686623744,
u'id_str': u'502098274686623744',
u'indices': [139, 140],
u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
u'source_status_id': 502098277094129664,
u'source_status_id_str': u'502098277094129664',
u'type': u'photo',
u'url': u'http://t.co/8Hc2Upitgk'}]},
u'favorite_count': 0,
u'favorited': False,
u'geo': None,
u'id': 502138116384100352,
u'id_str': u'502138116384100352',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'es',
u'place': None,
u'possibly_sensitive': False,
u'retweet_count': 109,
u'retweeted': False,
u'retweeted_status': {u'contributors': None,
u'coordinates': None,
u'created_at': u'Wed Aug 20 14:22:12 +0000 2014',
u'entities': {u'hashtags': [{u'indices': [0, 27],
u'text': u'VenezuelaPuebloHumanitario'},
{u'indices': [58, 67], u'text': u'Ferguson'},
{u'indices': [68, 73], u'text': u'EEUU'}],
u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
u'id': 502098274686623744,
u'id_str': u'502098274686623744',
u'indices': [116, 138],
u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
u'type': u'photo',
u'url': u'http://t.co/8Hc2Upitgk'}],
u'symbols': [],
u'urls': [],
u'user_mentions': []},
u'extended_entities': {u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
u'id': 502098274686623744,
u'id_str': u'502098274686623744',
u'indices': [116, 138],
u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
u'type': u'photo',
u'url': u'http://t.co/8Hc2Upitgk'}]},
u'favorite_count': 10,
u'favorited': False,
u'geo': None,
u'id': 502098277094129664,
u'id_str': u'502098277094129664',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'es',
u'place': None,
u'possibly_sensitive': False,
u'retweet_count': 109,
u'retweeted': False,
u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
u'text': u'#VenezuelaPuebloHumanitario Asi Repelen los disturbios en #Ferguson #EEUU Opositor esto tu no lo llamas Represi\xf3n ? http://t.co/8Hc2Upitgk',
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Tue Jan 26 22:04:03 +0000 2010',
u'default_profile': False,
u'default_profile_image': False,
u'description': u'Revolucionario, Humanista, Antiimperialista ... UNIDAD,LUCHA,BATALLA Y VICTORIA .......ALERTA SIEMPRE',
u'entities': {u'description': {u'urls': []},
u'url': {u'urls': [{u'display_url': u'contactoconlarealidad.com',
u'expanded_url': u'http://www.contactoconlarealidad.com/',
u'indices': [0, 22],
u'url': u'http://t.co/pvIVI3EBEH'}]}},
u'favourites_count': 3479,
u'follow_request_sent': False,
u'followers_count': 59545,
u'following': False,
u'friends_count': 59222,
u'geo_enabled': True,
u'id': 108731188,
u'id_str': u'108731188',
u'location': u'Caracas venezuela',
u'name': u'Lealtad a Ch\xe1vez!',
u'notifications': False,
u'profile_background_color': u'177FED',
u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/466810585368125440/g_3AJDX7.png',
u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/466810585368125440/g_3AJDX7.png',
u'profile_background_tile': True,
u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/352778346/1412524332',
u'profile_image_url': u'http://pbs.twimg.com/profile_images/518790606118600704/uvucm1rX_normal.jpeg',
u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/518790606118600704/uvucm1rX_normal.jpeg',
u'profile_link_color': u'ABB8C2',
u'profile_location': None,
u'profile_sidebar_border_color': u'FFFFFF',
u'profile_sidebar_fill_color': u'EFEFEF',
u'profile_text_color': u'333333',
u'profile_use_background_image': True,
u'protected': False,
u'screen_name': u'LealtadEternaLV',
u'statuses_count': 8564,
u'time_zone': u'Caracas',
u'url': None,
u'utc_offset': -16200,
u'verified': False}}
Since our predictions will be on a user basis, we'll need to aggregate the data across the tweets per user.
BASIC USER INFORMATION:
uid: id of the author of the group of tweets
screen_name: username of the author of the group of tweets
user_time: the length of time the user has been a member of twitter
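As a rough sketch of how user_time could be derived, the `created_at` strings can be parsed with `datetime.strptime`. The function names here are my own, the format string assumes every timestamp carries a literal `+0000` offset as in the example tweet, and `%a`/`%b` assume an English locale. `local_time` shows one possible way to "normalize" a tweet time with `utc_offset` (in seconds).

```python
from datetime import datetime, timedelta

# Assumption: all created_at values look like 'Wed Aug 20 17:00:30 +0000 2014'.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def parse_created_at(value):
    """Parse a created_at string into a naive datetime in UTC."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


def user_time(user_created_at, as_of):
    """Length of membership: time from account creation to a reference tweet."""
    return parse_created_at(as_of) - parse_created_at(user_created_at)


def local_time(created_at, utc_offset):
    """Shift a UTC tweet time by the user's utc_offset, given in seconds."""
    return parse_created_at(created_at) + timedelta(seconds=utc_offset)


membership = user_time("Tue Jan 26 22:04:03 +0000 2010",
                       "Wed Aug 20 14:22:12 +0000 2014")
sent_local = local_time("Wed Aug 20 17:00:30 +0000 2014", -16200)
```

Whether user_time should be measured up to each tweet, or up to a fixed date such as the end of the event, is still an open choice.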
USER TWEET ACTIVITY:
total_tweets: sum of any tweets, retweets, replies sent by a given user, not including retweets of their tweets.
total_orig: sum of tweets sent by this user that are originally produced by them, or a reply.
total_retweets: sum of tweets sent by this user that were retweets
total_replies: sum of all tweets sent by this user that were replies to someone else
Rates: you could do rates for different tweets over time (total_tweets/some_period). Maybe having sustained activity is indicative of something as opposed to a large volume that took place in a single day. Do we take these rates as totals over the entire time period? Do we take the rates over the times they were active?
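The activity totals above could be computed in one pass over the sample, roughly like this. It is a sketch under two assumptions: a tweet counts as a retweet when it carries a `retweeted_status` key (instead of relying on the `retweeted` flag), and as a reply when either `in_reply_to_status_id` or `in_reply_to_user_id` is set. `tweet_kind` and `activity_counts` are names I made up.

```python
from collections import defaultdict


def tweet_kind(t):
    """Classify a tweet dict as 'retweet', 'reply', or 'original'."""
    if "retweeted_status" in t:
        return "retweet"
    if (t.get("in_reply_to_status_id") is not None
            or t.get("in_reply_to_user_id") is not None):
        return "reply"
    return "original"


def activity_counts(tweets):
    """Per-user totals keyed by user id."""
    counts = defaultdict(lambda: {"total_tweets": 0, "total_orig": 0,
                                  "total_retweets": 0, "total_replies": 0})
    for t in tweets:
        c = counts[t["user"]["id"]]
        kind = tweet_kind(t)
        c["total_tweets"] += 1
        if kind == "retweet":
            c["total_retweets"] += 1
        else:
            # Per the definition above, replies also count toward total_orig.
            c["total_orig"] += 1
            if kind == "reply":
                c["total_replies"] += 1
    return dict(counts)
```

A rate would then be one of these totals divided by whichever period we settle on.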
USER TWEET QUALITY:
total_retweeted: sum of all of the users' tweets that were retweeted. This would involve going through all of the tweets in the entire sample, and looking at retweeted_status: user: screen_name or id, and keeping a count for each user. Or, group tweets by retweets by id of tweet being retweeted, then taking the max retweet_count.
total_mentioned: will need to look at entities>>user mentions of all tweets. For each user mention, update a count of mentions from a list of users.
total_replied: take a look at all tweets that have in_reply_to_screen_name or in_reply_to_id. For each reply, update a count of replies from a list of users.
total_favorites: take a look at all tweets that have in_reply_to_screen_name or in_reply_to_id. For each reply, update a count of favorites from a list of users. Or, take all tweets that have the aforementioned qualities and group by tweet by date, then take the count from the last time the tweet appeared.
percent_positive: percent of user's August tweets that are classified as positive.
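Here is a rough sketch of the retweeted, mentioned, and replied counts. For total_retweeted it uses the second option described above: group retweets by the id of the original tweet, keep the largest `retweet_count` seen, then sum per original author. `quality_counts` is a hypothetical name, and note that a retweet usually also mentions the original author ("RT @user"), so the mention counts include retweets unless those are filtered out first.

```python
from collections import Counter


def quality_counts(tweets):
    """Return (total_retweeted, total_mentioned, total_replied) keyed by user id."""
    best = {}              # original tweet id -> (author id, max retweet_count)
    mentioned = Counter()
    replied = Counter()
    for t in tweets:
        rs = t.get("retweeted_status")
        if rs is not None:
            author, seen = best.get(rs["id"], (rs["user"]["id"], 0))
            best[rs["id"]] = (author, max(seen, rs["retweet_count"]))
        for m in t.get("entities", {}).get("user_mentions", []):
            mentioned[m["id"]] += 1
        if t.get("in_reply_to_user_id") is not None:
            replied[t["in_reply_to_user_id"]] += 1
    retweeted = Counter()
    for author, count in best.values():
        retweeted[author] += count
    return retweeted, mentioned, replied
```

The counters are keyed by user id rather than screen name, on the assumption that ids are the more stable key.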
USER NETWORK
ferg_followers: list of id's of those that follow the current user
ferg_friends: list of id's of those that are friends of the current user
ferg_followers_count: number of ids in ferg_followers
ferg_friends_count: number of ids in ferg_friends
ferg_friend_follower_ratio: ferg_followers_count/ferg_friends_count (what type of ties does one prefer to keep)
Other possibilities:
followers/friends_growth: this would be the rate of growth of friends/followers from the start to the end of the event, a measure of the "gravity well" a user has for attracting followers. This would require looking at the tweets of each user by date, grabbing the follower count at the first tweet and the last tweet, and then doing some simple arithmetic.
Another possibility is to have all of those things for this user beyond just this event: total friends and followers on Twitter in general. For this prototypical analysis and the limited resources, I think it's more useful to have them for just the Ferguson topic.
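The "simple arithmetic" for follower growth might look like the sketch below. It assumes each tweet's embedded `user` object records `followers_count` as of that tweet, and that `created_at` always uses the `+0000` format seen in the example. `follower_growth` is my own name, and it returns the absolute change; dividing by the elapsed time or by the starting count would turn it into a rate.

```python
from datetime import datetime


def _when(t):
    # Assumes the '+0000' created_at format of the example tweet.
    return datetime.strptime(t["created_at"], "%a %b %d %H:%M:%S +0000 %Y")


def follower_growth(tweets):
    """Change in followers_count between each user's first and last tweet."""
    first_last = {}
    for t in sorted(tweets, key=_when):
        uid = t["user"]["id"]
        count = t["user"]["followers_count"]
        first, _ = first_last.get(uid, (count, count))
        first_last[uid] = (first, count)   # keep the first, overwrite the last
    return dict((uid, last - first) for uid, (first, last) in first_last.items())
```

The same loop works for `friends_count` by swapping the field name.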
In [1]:
user document
{user: {
uid: id of the author of the group of tweets,
screen_name: username of the author of the group of tweets,
joined: date of account creation, user.created_at from tweet,
user_time: the length of time the user has been a member of Twitter
},
tweets: {
activity: {
total_tweets: sum of any tweets, retweets, replies sent by a given user, not including retweets of their tweets.,
total_orig: sum of tweets sent by this user that are originally produced by them, or a reply.,
total_retweets: sum of tweets sent by this user that were retweets,
total_replies: sum of all tweets sent by this user that were replies to someone else,
Rates: you could do rates for different tweets over time (total_tweets/some_period).
Maybe having sustained activity is indicative of something as opposed to a large volume
that took place in a single day. Do we take these rates as totals over entire time period?
Do we take the rates of the times they were active?
tweets: all tweets, or tweet ID's? would be nice to have, unless it could be an overlapping index
},
quality: {
total_retweeted: sum of all of the users' tweets that were retweeted. This would involve going through
all of the tweets in the entire sample, and looking at retweeted_status: user: screen_name or id, and
keeping a count for each user. Or, group tweets by retweets by id of tweet being retweeted, then taking the
max retweet_count.,
total_mentioned: will need to look at entities>>user mentions of all tweets. For each user mention,
update a count of mentions from a list of users.,
total_replied: take a look at all tweets that have in_reply_to_screen_name or in_reply_to_id.
For each reply, update a count of replies from a list of users.,
total_favorites: take a look at all tweets that have in_reply_to_screen_name or in_reply_to_id.
For each reply, update a count of favorites from a list of users. Or, take all tweets that have the aforementioned
qualities and group by tweet by date, then take the count from the last time the tweet appeared.,
percent_positive: percent of user's August tweets that are classified as positive.
}
},
network: {
ferg_followers: list of id's of those that follow the current user,
ferg_friends: list of id's of those that are friends of the current user,
ferg_followers_count: number of ids in ferg_followers,
ferg_friends_count: number of ids in ferg_friends,
ferg_friend_follower_ratio: ferg_followers_count/ferg_friends_count (what type of ties does one prefer to keep)
}
}