NTDS'17 demo 2: Twitter data acquisition

Michael Defferrard and Effrosyni Simou

Objective

In this lab session we will look at how to collect data from the Internet. Specifically, we will work with the API (Application Programming Interface) of Twitter.

We will also talk about the data cleaning process. While cleaning data is often the most time-consuming and least enjoyable part of Data Science, it needs to be performed nonetheless.

For this exercise you will need to be registered with Twitter and to generate access tokens. If you do not have an account on this social network, you can ask a friend to create a token for you, or you can create a temporary account just for the needs of this class.

You will need to create a Twitter app and copy the four tokens and secrets into the credentials.ini file:

[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET

Resources

Here are some links you may find useful to complete this exercise.

Web APIs:

Tutorials:

Web scraping

There exist a number of Python-based clients for Twitter; Tweepy is a popular choice.

Tasks:

  1. Download the relevant information from Twitter. Try to keep the quantity of collected data to the minimum required to answer the questions.
  2. Organize the collected data in a pandas DataFrame. Each row is a tweet, and the columns are at least: the tweet id, the text, the creation time, the number of likes (formerly called favorites), and the number of retweets.

In [ ]:
import os
import configparser

import tweepy  # you will need to install tweepy first, e.g. with conda or pip
import numpy as np
import pandas as pd

In [ ]:
# Read the confidential token.
credentials = configparser.ConfigParser()
credentials.read(os.path.join('..', 'credentials.ini'))

auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'),
                           credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'),
                      credentials.get('twitter', 'access_secret'))

api = tweepy.API(auth) 

user = 'EPFL_en'

Keep in mind that the API is rate limited on a per-user-access-token basis. You can find out more about rate limits here. To avoid getting a rate limit error when you need to make many requests to gather your data, you can construct your API instance as:

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

This will also notify you about how long the period of sleep will be.

It is good practice to limit the number of requests while developing, and to increase it only when collecting the final data.


In [ ]:
# Number of posts / tweets to retrieve.
# Small value for development, then increase to collect final data.
n = 20  # 4000

In [ ]:
my_user = api.get_user(user)

In [ ]:
type(my_user)

In [ ]:
dir(my_user)

In [ ]:
followers = api.get_user(user).followers_count
print('{} has {} followers'.format(user, followers))

Tweepy handles much of the dirty work, like pagination. Have a look at how to handle pagination with Tweepy's Cursor objects in this tutorial.


In [ ]:
# Collect the last n tweets of the user's timeline, one row per tweet.
tw = pd.DataFrame(columns=['id', 'text', 'time', 'likes', 'shares'])
for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
    serie = dict(id=tweet.id, text=tweet.text, time=tweet.created_at)
    serie.update(dict(likes=tweet.favorite_count, shares=tweet.retweet_count))
    tw = tw.append(serie, ignore_index=True)

In [ ]:
tw.dtypes

In [ ]:
# The counts were collected as generic objects; cast them to integers.
tw.id = tw.id.astype(np.int64)
tw.likes = tw.likes.astype(np.int64)
tw.shares = tw.shares.astype(np.int64)

In [ ]:
tw.dtypes

In [ ]:
tw.head()

Data Cleaning

Problems come in two flavours:

  1. Missing data, i.e. unknown values.
  2. Errors in data, i.e. wrong values.

The actions to be taken in each case are highly data- and problem-specific.
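
As a minimal sketch (assuming the tw DataFrame built above), you could start by checking the collected data for missing or obviously wrong values with pandas:

In [ ]:
# Count missing values per column (all zeros means no value is missing).
print(tw.isnull().sum())

# A couple of sanity checks for obviously wrong values.
print((tw.likes < 0).sum(), 'tweets with a negative number of likes')
print((tw.text.str.strip() == '').sum(), 'tweets with empty text')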

For instance, some tweets are just retweets without any additional information. Should they be collected?
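
One possible way to spot them (a sketch, not the only approach) is to look at the text: Twitter prefixes the text of retweets with "RT @". Tweepy status objects also carry a retweeted_status attribute when the tweet is a retweet.

In [ ]:
# Flag retweets by their 'RT @' text prefix.
is_retweet = tw.text.str.startswith('RT @')
print('{} of {} collected tweets are retweets'.format(is_retweet.sum(), len(tw)))

# One option is to simply drop them.
tw_clean = tw[~is_retweet].reset_index(drop=True)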

Now, it is time for you to start collecting data from Twitter! Have fun!


In [ ]:
# Your code here.