Michaël Defferrard, PhD student, EPFL LTS2
Theme of the exercise: understand the impact of your communication on social networks. A real-life situation: the marketing team needs help identifying the most engaging posts they made on social platforms, so as to prepare their next AdWords campaign.
As you probably don't have a company (yet?), you can either use your own social network profile as if it were the company's, or choose an established entity, e.g. EPFL. You will need to be registered on Facebook or Twitter to generate access tokens. If you're not, either ask a classmate to create a token for you or create a fake or temporary account for yourself (no need to follow other people; we can fetch public data).
At the end of the exercise, you should have two datasets (Facebook & Twitter) and have used them to answer the questions below for both platforms.
Tasks:
Note that some data cleaning is already necessary. E.g. some Facebook posts have no message, i.e. no text, and some tweets are mere retweets that carry no additional information. Should they be collected? A filtering sketch follows.
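One option, sketched below with Tweepy (which is only set up later in this notebook, so treat this as a forward reference), is to skip retweets at collection time: Tweepy exposes them through the retweeted_status attribute. Whether to drop them is a design choice, not a requirement.
In [ ]:
import tweepy  # pip install tweepy

# A minimal sketch: decide at collection time whether to keep a tweet.
# Assumes `api` is an authenticated tweepy.API object (constructed below).
def keep(tweet):
    if hasattr(tweet, 'retweeted_status'):
        return False  # A retweet: no original content from the page.
    if not tweet.text.strip():
        return False  # No text at all.
    return True

tweets = [t for t in tweepy.Cursor(api.user_timeline, screen_name='EPFL_en').items(20) if keep(t)]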
In [14]:
# Number of posts / tweets to retrieve.
# Use a small value (e.g. 20) for development, then increase it to collect the final data.
n = 4000
There are two ways to scrape data from Facebook: craft HTTP requests to the Graph API yourself (e.g. with the requests package) or use a wrapper such as facebook-sdk. You can choose one or combine them. Both are demonstrated below.
You will need an access token, which can be created with the help of the Graph Explorer. That tool may also prove useful to test queries. Once you have your token, you may create a credentials.ini file with the following content:
[facebook]
token = YOUR-FB-ACCESS-TOKEN
In [15]:
import configparser
# Read the confidential token.
credentials = configparser.ConfigParser()
credentials.read('credentials.ini')
token = credentials.get('facebook', 'token')
# Or token = 'YOUR-FB-ACCESS-TOKEN'
In [16]:
import requests # pip install requests
import facebook # pip install facebook-sdk
import pandas as pd
In [17]:
page = 'EPFL.ch'
The first way, with requests, is a three-step process: form the URL, get the data, extract the information.
In [18]:
# 1. Form URL.
url = 'https://graph.facebook.com/{}?fields=likes&access_token={}'.format(page, token)
#print(url)

# 2. Get data.
data = requests.get(url).json()
print('data:', data)

# Optionally, check for errors. Most probably the session has expired.
if 'error' in data:
    raise Exception(data)

# 3. Extract data.
print('{} has {} likes'.format(page, data['likes']))
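Since those three steps recur for every query, you may want to factor them into a small helper. A minimal sketch; the function name graph_get is ours, not part of any API:
In [ ]:
def graph_get(path, **params):
    """Query the Graph API and raise on error, e.g. an expired token."""
    params['access_token'] = token
    data = requests.get('https://graph.facebook.com/' + path, params=params).json()
    if 'error' in data:
        raise Exception(data['error'])
    return data

# The same query as above, through the helper.
print(graph_get(page, fields='likes'))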
In [19]:
# 1. Form URL. You can click that URL and see the returned JSON in your browser.
fields = 'id,created_time,message,likes.limit(0).summary(1),comments.limit(0).summary(1)'
url = 'https://graph.facebook.com/{}/posts?fields={}&access_token={}'.format(page, fields, token)
#print(url)

# Create the pandas DataFrame, a table whose columns are the post id, message, creation time, #likes and #comments.
fb = pd.DataFrame(columns=['id', 'text', 'time', 'likes', 'comments'])

# The outer loop queries FB multiple times, as FB sends at most 100 posts at a time.
while len(fb) < n:

    # 2. Get the data from FB. At most 100 posts.
    posts = requests.get(url).json()

    # 3. Extract the information from each received post.
    for post in posts['data']:

        # Store the information in a dictionary.
        serie = dict(id=post['id'], time=post['created_time'])
        try:
            serie['text'] = post['message']
        except KeyError:
            # Let's say we are not interested in posts without text.
            continue
        serie['likes'] = post['likes']['summary']['total_count']
        serie['comments'] = post['comments']['summary']['total_count']

        # Add the dictionary as a new row of our pandas DataFrame.
        fb = fb.append(serie, ignore_index=True)

    try:
        # That URL is returned by FB to access the next 'page', i.e. the next 100 posts.
        url = posts['paging']['next']
    except KeyError:
        # No more posts.
        break
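When collecting thousands of posts, the Graph API may occasionally throttle or reject a request. A blunt back-off sketch, sufficient for a one-off collection; the retry count and delay are arbitrary choices, and the exact rate-limit error codes vary:
In [ ]:
from time import sleep

def get_json(url, retries=3):
    """Fetch a Graph API URL, backing off and retrying on errors."""
    for _ in range(retries):
        data = requests.get(url).json()
        if 'error' not in data:
            return data
        sleep(60)  # Wait a minute before retrying.
    raise Exception(data['error'])

# Drop-in replacement for requests.get(url).json() in the loop above.
#posts = get_json(url)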
In [20]:
fb[:5]
Out[20]:
The second way is to use the facebook-sdk wrapper, which hides the URL construction and JSON decoding:
In [21]:
g = facebook.GraphAPI(token, version='2.7')

# We limit to 10 posts because it's slow: two extra requests per post, one for the likes and one for the comments.
posts = g.get_connections(page, 'posts', limit=10)
if 'error' in posts:
    # Most probably the session has expired.
    raise Exception(posts)

for post in posts['data']:
    pid = post['id']
    try:
        text = post['message']
    except KeyError:
        continue
    time = post['created_time']
    likes = g.get_connections(pid, 'likes', summary=True, limit=0)
    nlikes = likes['summary']['total_count']
    comments = g.get_connections(pid, 'comments', summary=True, limit=0)
    ncomments = comments['summary']['total_count']
    print('{:6d} {:6d} {} {}'.format(nlikes, ncomments, time, text[:50]))
There exist several Python clients for Twitter; Tweepy is a popular choice.
You will need to create a Twitter app and copy the four tokens and secrets into the credentials.ini file:
[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET
In [22]:
import tweepy  # pip install tweepy

auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'),
                           credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'),
                      credentials.get('twitter', 'access_secret'))
api = tweepy.API(auth)

user = 'EPFL_en'
In [ ]:
followers = api.get_user(user).followers_count
print('{} has {} followers'.format(user, followers))
The code is much simpler for Twitter than for Facebook because Tweepy handles much of the dirty work, such as pagination.
In [ ]:
tw = pd.DataFrame(columns=['id', 'text', 'time', 'likes', 'shares'])
for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
    serie = dict(id=tweet.id, text=tweet.text, time=tweet.created_at)
    serie.update(dict(likes=tweet.favorite_count, shares=tweet.retweet_count))
    tw = tw.append(serie, ignore_index=True)
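For a large n, Twitter will rate-limit you too. Tweepy can wait out the limits itself; a sketch, replacing the api construction above (valid for Tweepy 3.x; the notify flag was removed in later versions):
In [ ]:
# Sleep automatically when a rate limit is hit, instead of raising an error.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)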
In [ ]:
# Facebook ids are strings of the form 'pageid_postid', hence they are not converted to integers.
#fb.id = fb.id.astype(int)
fb.likes = fb.likes.astype(int)
fb.comments = fb.comments.astype(int)
tw.id = tw.id.astype(int)
tw.likes = tw.likes.astype(int)
tw.shares = tw.shares.astype(int)
In [ ]:
from datetime import datetime

def convert_time(row):
    """Parse Facebook's timestamps, e.g. '2016-09-04T09:30:00+0000'."""
    return datetime.strptime(row['time'], '%Y-%m-%dT%H:%M:%S+0000')

fb['time'] = fb.apply(convert_time, axis=1)
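A vectorized alternative is pandas' own parser. Note that, unlike the strptime call above, it keeps the UTC offset and thus returns timezone-aware timestamps, which cannot be compared with naive datetime objects later on:
In [ ]:
# Equivalent, without a row-wise apply; yields timezone-aware timestamps.
#fb['time'] = pd.to_datetime(fb['time'])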
In [ ]:
from IPython.display import display
display(fb[:5])
display(tw[:5])
Now that we have collected everything, let's save it in two SQLite databases.
In [ ]:
import os

folder = os.path.join('..', 'data', 'social_media')
os.makedirs(folder, exist_ok=True)

filename = os.path.join(folder, 'facebook.sqlite')
fb.to_sql('facebook', 'sqlite:///' + filename, if_exists='replace')

filename = os.path.join(folder, 'twitter.sqlite')
tw.to_sql('twitter', 'sqlite:///' + filename, if_exists='replace')
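To verify the save, or to resume the analysis later without scraping again, the tables can be read back. A sketch; to_sql stored the DataFrame index under the column name 'index', and SQLite stores datetimes as text, hence the two extra arguments:
In [ ]:
# Read the tables back from the SQLite databases.
fb = pd.read_sql('facebook', 'sqlite:///' + os.path.join(folder, 'facebook.sqlite'),
                 index_col='index', parse_dates=['time'])
tw = pd.read_sql('twitter', 'sqlite:///' + os.path.join(folder, 'twitter.sqlite'),
                 index_col='index', parse_dates=['time'])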
Answer the questions using pandas, statsmodels, scipy.stats, bokeh.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
In [ ]:
date = datetime(2016, 9, 4)
datestr = date.strftime('%Y-%m-%d')
print('Number of posts after {}: {}'.format(datestr, sum(fb.time > date)))
print('Number of tweets after {}: {}'.format(datestr, sum(tw.time > date)))
In [ ]:
display(fb.sort_values(by='likes', ascending=False)[:5])
display(tw.sort_values(by='likes', ascending=False)[:5])
In [ ]:
pd.concat([fb.describe(), tw.loc[:,'likes':'shares'].describe()], axis=1)
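Beyond the descriptive statistics above, scipy.stats (mentioned earlier) can quantify relations between engagement measures. A sketch testing whether posts that attract more likes also attract more comments; Spearman's rank correlation is one reasonable choice given the heavy-tailed distributions shown below:
In [ ]:
from scipy import stats

# Rank correlation between likes and comments on Facebook posts.
rho, p = stats.spearmanr(fb.likes, fb.comments)
print('Spearman rho = {:.2f} (p = {:.1e})'.format(rho, p))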
In [ ]:
fig, axs = plt.subplots(1, 4, figsize=(15, 5))
fb.likes.plot(kind='box', ax=axs[0]);
fb.comments.plot(kind='box', ax=axs[1]);
tw.likes.plot(kind='box', ax=axs[2]);
tw.shares.plot(kind='box', ax=axs[3]);
In [ ]:
fb.hist(bins=20, log=True, figsize=(15, 5));
In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
tw.loc[:,'likes'].hist(bins=20, log=True, ax=axs[0]);
tw.loc[tw.shares < 200, 'shares'].hist(bins=20, log=True, ax=axs[1]);
In [ ]:
def text_length(texts):
    lengths = np.empty(len(texts), dtype=int)
    for i, text in enumerate(texts):
        lengths[i] = len(text)
    plt.figure(figsize=(15, 5))
    prop = lengths.min(), '{:.2f}'.format(lengths.mean()), lengths.max()
    plt.title('min = {}, mean = {}, max = {}'.format(*prop))
    plt.hist(lengths, bins=20)

text_length(tw.text)
text_length(fb.text)
In [ ]:
fb.id.groupby(fb.time.dt.hour).count().plot(kind='bar', alpha=0.4, color='y', figsize=(15,5));
tw.id.groupby(tw.time.dt.hour).count().plot(kind='bar', alpha=0.4, color='g', figsize=(15,5));
Let's look at whether the time of posting influences the number of likes. Do you see a peak at 5am? Do you really think we should post at 5am? What's going on here?
In [ ]:
fb.likes.groupby(fb.time.dt.hour).mean().plot(kind='bar', figsize=(15,5));
plt.figure()
tw.likes.groupby(tw.time.dt.hour).mean().plot(kind='bar', figsize=(15,5));
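Two hints for the question above. First, both APIs report timestamps in UTC (note the +0000 offset in Facebook's created_time), not Swiss local time. Second, hours with very few posts yield noisy means. A sketch of the first point, assuming the audience mostly lives in the Europe/Zurich timezone:
In [ ]:
# Convert the naive UTC timestamps to local time before grouping by hour.
local = fb.time.dt.tz_localize('UTC').dt.tz_convert('Europe/Zurich')
fb.likes.groupby(local.dt.hour).mean().plot(kind='bar', figsize=(15, 5));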