Michael Defferrard and Effrosyni Simou
In this first lab session we will look into how to collect data from the Internet. Specifically, we will work with the API (Application Programming Interface) of Twitter.
We will also talk about the data cleaning process. While cleaning data is often the most time-consuming and least enjoyable task in Data Science, it should be performed nonetheless.
For this exercise you will need to be registered with Twitter and to generate access tokens. If you do not have an account on this social network, you can ask a friend to create tokens for you, or you can create a temporary account just for the needs of this class.
You will need to create a Twitter app and copy the four tokens and secrets into the credentials.ini file:
[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET
There exist a number of Python clients for Twitter. Tweepy is a popular choice.
Tasks:
In [ ]:
import os
import configparser
import tweepy # you will need to conda or pip install tweepy first
import numpy as np
import pandas as pd
In [ ]:
# Read the confidential token.
credentials = configparser.ConfigParser()
credentials.read(os.path.join('..', 'credentials.ini'))
auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'),
                           credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'),
                      credentials.get('twitter', 'access_secret'))
api = tweepy.API(auth)
user = 'EPFL_en'
Keep in mind that the API is rate-limited on a per-access-token basis. You can find out more about rate limits here. To avoid getting a rate limit error when you need to make many requests to gather your data, you can construct your API instance as:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
This will also notify you of how long the sleep period will be.
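If you want to check how much of your quota remains, Tweepy also wraps the rate-limit endpoint as api.rate_limit_status(). Below is a minimal sketch, assuming Tweepy 3.x and the standard layout of the v1.1 rate_limit_status response:
In [ ]:
# Inspect the remaining quota for the user timeline endpoint.
# (Sketch: assumes Tweepy 3.x and the v1.1 rate_limit_status layout.)
limits = api.rate_limit_status()
timeline = limits['resources']['statuses']['/statuses/user_timeline']
print('{} requests remaining, window resets at {}'.format(
    timeline['remaining'], timeline['reset']))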
It is good practice to limit the amount of requests while developing, and then to increase to collect all the necessary data.
In [ ]:
# Number of posts / tweets to retrieve.
# Small value for development, then increase to collect final data.
n = 20 # 4000
In [ ]:
my_user = api.get_user(user)
In [ ]:
type(my_user)
In [ ]:
dir(my_user)
In [ ]:
followers = api.get_user(user).followers_count
print('{} has {} followers'.format(user, followers))
Tweepy handles much of the dirty work, like pagination. Have a look at this tutorial to see how pagination is handled with Tweepy's Cursor objects.
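As a quick illustration, a Cursor can also iterate page by page instead of item by item. A minimal sketch (the count parameter, i.e. the number of tweets per page, is our assumption here, not part of the tasks):
In [ ]:
# Iterate over pages of results instead of individual tweets.
for page in tweepy.Cursor(api.user_timeline, screen_name=user, count=20).pages(2):
    print('got a page of {} tweets'.format(len(page)))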
In [ ]:
tw = pd.DataFrame(columns=['id', 'text', 'time', 'likes', 'shares'])
for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
    serie = dict(id=tweet.id, text=tweet.text, time=tweet.created_at)
    serie.update(dict(likes=tweet.favorite_count, shares=tweet.retweet_count))
    tw = tw.append(serie, ignore_index=True)
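Note that appending to a DataFrame row by row is costly, and DataFrame.append is deprecated in recent pandas versions. An equivalent sketch that collects plain dictionaries first and builds the frame once:
In [ ]:
# Equivalent construction: gather the rows, then build the DataFrame once.
rows = []
for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
    rows.append(dict(id=tweet.id, text=tweet.text, time=tweet.created_at,
                     likes=tweet.favorite_count, shares=tweet.retweet_count))
tw = pd.DataFrame(rows, columns=['id', 'text', 'time', 'likes', 'shares'])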
In [ ]:
tw.dtypes
In [ ]:
tw.id = tw.id.astype(np.int64)
tw.likes = tw.likes.astype(np.int64)
tw.shares = tw.shares.astype(np.int64)
In [ ]:
tw.dtypes
In [ ]:
tw.head()
Remember the data cleaning process: some tweets, for instance, are just retweets without any additional information. Should they be collected?
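If you decide to drop them, one simple heuristic is to look at the text: with the v1.1 API, the text of a plain retweet starts with "RT @". A minimal sketch:
In [ ]:
# Flag plain retweets by their text prefix and filter them out.
# (Sketch: a more robust check would use the tweet's retweeted_status
# attribute, which we did not collect above.)
is_retweet = tw.text.str.startswith('RT @')
print('{} of {} collected tweets are retweets'.format(is_retweet.sum(), len(tw)))
tw = tw[~is_retweet]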
Now, it is time for you to start collecting data from Twitter! Have fun!
In [ ]:
# Your code here.