A Python Tour of Data Science: Data Acquisition & Exploration

Michaël Defferrard, PhD student, EPFL LTS2

1 Exercise: problem definition

Theme of the exercise: understand the impact of your communication on social networks. A real life situation: the marketing team needs help in identifying which were the most engaging posts they made on social platforms to prepare their next AdWords campaign.

As you probably don't have a company (yet?), you can either use your own social network profile as if it were the company's or choose an established entity, e.g. EPFL. You will need to be registered on Facebook or Twitter to generate access tokens. If you're not, either ask a classmate to create a token for you or create a temporary account for yourself (there is no need to follow other people, as we can fetch public data).

At the end of the exercise, you should have two datasets (Facebook & Twitter) and have used them to answer the following questions, for both Facebook and Twitter.

  1. How many followers / friends / likes does your chosen profile have?
  2. How many posts / tweets were published in the last year?
  3. What were the 5 most liked posts / tweets?
  4. Plot histograms of the number of likes and comments / retweets.
  5. Plot basic statistics and a histogram of text length.
  6. Is there any correlation between the length of the text and the number of likes?
  7. Be curious and explore your data. Did you find something interesting or surprising?
    1. Create at least one interactive plot (with bokeh) to explore an intuition (e.g. does the posting time play a role?).

2 Resources

Here are some links you may find useful to complete this exercise.

Web APIs: these are the references.

Tutorials:

3 Web scraping

Tasks:

  1. Download the relevant information from Facebook and Twitter. Try to collect only the data required to answer the questions.
  2. Build two SQLite databases, one for Facebook and the other for Twitter, using pandas and SQLAlchemy.
    1. For FB, each row is a post, and the columns are at least (you can include more if you want): the post id, the message (i.e. the text), the time when it was posted, the number of likes and the number of comments.
    2. For Twitter, each row is a tweet, and the columns are at least: the tweet id, the text, the creation time, the number of likes (previously called favorites) and the number of retweets.

Note that some data cleaning is already necessary at this stage. For example, some FB posts have no message, i.e. no text, and some tweets are plain retweets that add no information of their own. Should they be collected?
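As a starting point for the database task above, here is a minimal sketch of a round trip between a pandas DataFrame and SQLite. The rows are made-up placeholders, not real posts, and the table name `facebook` and the column names are assumptions. To keep the sketch dependency-light it uses a standard-library `sqlite3` connection, which pandas also accepts; an SQLAlchemy engine such as `sqlalchemy.create_engine('sqlite:///facebook.sqlite')` could be passed as `con` instead.

```python
import sqlite3

import pandas as pd

# Hypothetical rows standing in for fetched posts (not real data).
posts = pd.DataFrame({
    'id': ['1_1', '1_2', '1_3'],
    'text': ['Hello world', None, 'Open day on campus'],
    'time': ['2016-01-05T10:00:00', '2016-02-11T15:30:00', '2016-03-20T09:15:00'],
    'likes': [12, 3, 40],
    'comments': [1, 0, 5],
})

# One possible cleaning choice: drop the posts that have no text.
posts = posts.dropna(subset=['text'])

# In-memory database for the demo; connect to a file path to persist it.
con = sqlite3.connect(':memory:')
posts.to_sql('facebook', con, index=False, if_exists='replace')

# Read the table back and parse the time column into datetimes.
stored = pd.read_sql('SELECT * FROM facebook', con, parse_dates=['time'])
con.close()

print(stored)
```

Whether to drop text-less posts before or after storing them is a design choice; dropping them early keeps the database minimal, while keeping them lets you count them later.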


In [ ]:
# Number of posts / tweets to retrieve.
# Small value for development, then increase to collect final data.
n = 20  # 4000

3.1 Facebook

There are two ways to scrape data from Facebook; you can choose one or combine them.

  1. The low-level approach, sending HTTP requests to their Graph API and receiving JSON responses. That can be achieved with the json and requests packages (although you could use urllib from the standard library, requests has a nicer API). The knowledge you'll acquire with this method will be useful to query web APIs other than Facebook's. This method is also more flexible.
  2. The high-level approach, using a Python SDK. The code you'll have to write for this method will be shorter, but specific to the FB Graph API.
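The low-level approach can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `build_posts_url` and `parse_posts` are hypothetical, the `fields` string reflects one way of asking the Graph API for like and comment counts, and the sample payload is hand-written rather than a real response. Access rules for the Graph API change over time, so check the reference before relying on it.

```python
import json
from urllib.parse import urlencode


def build_posts_url(page, token, limit=20):
    """Build a Graph API URL for the posts of a page (hypothetical helper)."""
    params = {
        # Ask only for what the exercise needs, with summary counts.
        'fields': 'id,message,created_time,'
                  'likes.limit(0).summary(true),'
                  'comments.limit(0).summary(true)',
        'limit': limit,
        'access_token': token,
    }
    return 'https://graph.facebook.com/{}/posts?{}'.format(page, urlencode(params))


def parse_posts(payload):
    """Turn one page of JSON results into rows, plus the URL of the next page."""
    rows = []
    for post in payload.get('data', []):
        rows.append({
            'id': post['id'],
            'text': post.get('message'),  # Some posts have no message.
            'time': post['created_time'],
            'likes': post['likes']['summary']['total_count'],
            'comments': post['comments']['summary']['total_count'],
        })
    next_url = payload.get('paging', {}).get('next')
    return rows, next_url


url = build_posts_url('EPFL.ch', 'TOKEN')

# Hand-written sample shaped like a response (not real data).
# With a network connection: payload = requests.get(url).json()
payload = json.loads('''{
  "data": [{"id": "1_1", "created_time": "2016-01-05T10:00:00+0000",
            "likes": {"data": [], "summary": {"total_count": 12}},
            "comments": {"data": [], "summary": {"total_count": 1}}}],
  "paging": {"next": "https://graph.facebook.com/next-page"}
}''')
rows, next_url = parse_posts(payload)
```

To collect `n` posts, one would request `next_url` repeatedly and accumulate the rows until enough are gathered or no next page is returned.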

You will need an access token, which can be created with the help of the Graph Explorer. That tool may prove useful to test queries. Once you have your token, you may create a credentials.ini file with the following content:

[facebook]
token = YOUR-FB-ACCESS-TOKEN

In [ ]:
import configparser

credentials = configparser.ConfigParser()
credentials.read('credentials.ini')
token = credentials.get('facebook', 'token')

# Or token = 'YOUR-FB-ACCESS-TOKEN'

In [ ]:
import requests  # pip install requests
import facebook  # pip install facebook-sdk

In [ ]:
page = 'EPFL.ch'

In [ ]:
# Your code here.

3.2 Twitter

There are many Python-based clients for Twitter. Tweepy is a popular choice.

You will need to create a Twitter app and copy the four tokens and secrets into the credentials.ini file:

[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET

In [ ]:
import tweepy  # pip install tweepy

# Read the confidential tokens and authenticate.
auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))
api = tweepy.API(auth)

user = 'EPFL_en'

In [ ]:
# Your code here.
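One possible way to start: convert each status object returned by Tweepy into a plain dictionary holding only the required columns. The sketch below is an assumption-laden illustration; the helpers `tweet_to_row` and `collect_rows` are hypothetical names, and the statuses are faked with `SimpleNamespace` objects so that the code runs without network access. It relies on the status attributes `id`, `text`, `created_at`, `favorite_count` and `retweet_count`, and on retweets carrying a `retweeted_status` attribute.

```python
from types import SimpleNamespace


def tweet_to_row(status):
    """Keep only the columns required by the exercise."""
    return {
        'id': status.id,
        'text': status.text,
        'time': status.created_at,
        'likes': status.favorite_count,
        'retweets': status.retweet_count,
    }


def collect_rows(statuses, skip_retweets=True):
    """Convert statuses to rows, optionally skipping plain retweets."""
    rows = []
    for status in statuses:
        # Retweets carry the original tweet in `retweeted_status`.
        if skip_retweets and hasattr(status, 'retweeted_status'):
            continue
        rows.append(tweet_to_row(status))
    return rows


# With a live connection, the statuses could come from:
#   tweepy.Cursor(api.user_timeline, screen_name=user).items(n)
# Here, two fake statuses stand in for them (not real tweets).
fake_statuses = [
    SimpleNamespace(id=1, text='Original tweet', created_at='2016-01-05',
                    favorite_count=7, retweet_count=2),
    SimpleNamespace(id=2, text='RT @someone: hi', created_at='2016-01-06',
                    favorite_count=0, retweet_count=9,
                    retweeted_status=SimpleNamespace(id=0)),
]
rows = collect_rows(fake_statuses)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)`.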

4 Data analysis

Answer the questions using pandas, statsmodels, scipy.stats, bokeh.
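As a hint for questions 3, 5 and 6, the following sketch shows the pandas calls that may help. The DataFrame is toy data invented for the illustration, and the column names `text` and `likes` are assumptions that should match whatever you stored in your database.

```python
import pandas as pd

# Toy data standing in for the collected posts (not real data).
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'text': ['Hi', 'Open day on campus', 'New paper out!',
             'Welcome to all our new students', 'Exams', 'Snow on campus today'],
    'likes': [3, 25, 14, 40, 1, 18],
})

# Question 5: text length and its basic statistics.
df['length'] = df['text'].str.len()
stats = df['length'].describe()

# Question 3: the 5 most liked posts.
top5 = df.nlargest(5, 'likes')

# Question 6: Pearson correlation between text length and likes.
r = df['length'].corr(df['likes'])

print(stats)
print(top5[['id', 'likes']])
print('Pearson r = {:.2f}'.format(r))
```

A correlation coefficient alone says little on a small sample; `scipy.stats.pearsonr` additionally returns a p-value, and a scatter plot is a useful sanity check.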


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

In [ ]:
# Your code here.