Michaël Defferrard, PhD student, EPFL LTS2
Theme of the exercise: understand the impact of your communication on social networks. A real-life situation: the marketing team needs help identifying the most engaging posts it made on social platforms, to prepare its next AdWords campaign.
As you probably don't have a company (yet?), you can either use your own social network profile as if it were the company's, or choose an established entity, e.g. EPFL. You will need to be registered on Facebook or Twitter to generate access tokens. If you're not, either ask a classmate to create a token for you or create a temporary account for yourself (no need to follow other people, we can fetch public data).
At the end of the exercise, you should have two datasets (one from Facebook, one from Twitter) and have used them to answer the questions below for both platforms.
Tasks:
Note that some data cleaning is already necessary: some Facebook posts have no message, i.e. no text, and some tweets are bare retweets that add no further information. Should they be collected? (A cleaning sketch follows the collection cell below.)
In [14]:
# Number of posts / tweets to retrieve.
# Small value for development, then increase to collect final data.
n = 20 # 4000
There are two ways to scrape data from Facebook; you can choose one or combine them.
You will need an access token, which can be created with the help of the Graph Explorer; that tool may also prove useful to test queries. Once you have your token, create a credentials.ini file with the following content:
[facebook]
token = YOUR-FB-ACCESS-TOKEN
In [ ]:
import configparser
credentials = configparser.ConfigParser()
credentials.read('credentials.ini')
token = credentials.get('facebook', 'token')
# Or token = 'YOUR-FB-ACCESS-TOKEN'
In [ ]:
import facebook  # pip install facebook-sdk
import requests  # pip install requests
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import sys
import time
import dateutil.parser
from datetime import datetime

%matplotlib inline
import matplotlib.pyplot as plt
In [6]:
# Use the token read from credentials.ini above; never hard-code
# (or commit) access tokens in a notebook.
token = credentials.get('facebook', 'token')
page = 'mytechnis'
In [42]:
graph = facebook.GraphAPI(token)

# Collect posts published after this date. Note that dateutil parses
# "1/10/2015" as January 10, 2015 by default (pass dayfirst=True for October 1).
dateFrom = str(int(dateutil.parser.parse("1/10/2015").timestamp()))

profileJsonPage = graph.get_object(page, fields='likes')
profileJson = graph.get_object(page + '/posts', fields='id,comments.limit(1).summary(true),likes.limit(1).summary(true),created_time,type,message', since=dateFrom)

likes = []
comments = []
created_time = []
post_id = []
post_message = []
post_type = []

# Paginate to acquire all the posts.
while True:
    try:
        for data in profileJson['data']:
            likes.append(data['likes']['summary']['total_count'])
            comments.append(data['comments']['summary']['total_count'])
            created_time.append(data['created_time'])
            post_id.append(data['id'])
            post_type.append(data['type'])
            # Some posts have no message: default to an empty string
            # instead of raising a KeyError.
            post_message.append(data.get('message', ''))
        # Request the next page of data, if it exists.
        profileJson = requests.get(profileJson['paging']['next']).json()
    except KeyError:
        # No more pages (no ['paging']['next']): stop.
        break

postsDataframe = pd.DataFrame({'id_post': post_id, 'time': created_time, 'likes': likes,
                               'comments': comments, 'type': post_type, 'message': post_message})
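As noted at the top, some cleaning is needed before analysis. A minimal sketch, assuming the 'message' and 'id_post' columns built above (the name cleanDataframe is ours):

In [ ]:
# Drop posts without text, then any duplicated post ids; the
# empty-string default comes from the collection loop above.
cleanDataframe = postsDataframe[postsDataframe['message'].str.len() > 0]
cleanDataframe = cleanDataframe.drop_duplicates(subset='id_post')
print('{} posts kept out of {}'.format(len(cleanDataframe), len(postsDataframe)))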
In [43]:
postsDataframe
In [44]:
postsDataframe.sort_values('likes', ascending=False).head(5)
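Likes alone may not capture engagement. One possible combined score is sketched below; the 3x weight on comments is an arbitrary assumption (a comment takes more effort than a like), not part of the exercise:

In [ ]:
# Hypothetical engagement score: likes plus triple-weighted comments.
postsDataframe['engagement'] = postsDataframe['likes'] + 3 * postsDataframe['comments']
postsDataframe.sort_values('engagement', ascending=False).head(5)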
In [20]:
profileJsonPage
In [21]:
postsDataframe.type.value_counts().plot(kind='bar')
plt.ylabel('number of posts')
plt.title('Number of posts per post type')
plt.grid(True)
plt.show()
In [22]:
fig, ax = plt.subplots()
postsDataframe['likes'].hist(ax=ax, bins=100)
plt.xlabel('likes')
plt.ylabel('number of posts')
plt.title('Histogram of the number of likes per post')
plt.grid(True)
plt.show()
postsDataframe['likes'].mean()
In [23]:
group = postsDataframe.groupby('type').comments.sum()
group.plot(kind='bar').set_ylabel('Number of comments')
plt.title('Number of comments per post type')
In [53]:
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
output_notebook()

x, y1, y2 = 'comments', 'time', 'likes'
n = 100  # Plot fewer points: less intensive for the browser.

options = dict(
    tools='pan,box_zoom,wheel_zoom,box_select,lasso_select,crosshair,reset,save',
    x_axis_type='log', y_axis_type='log',
)
plot1 = figure(
    x_axis_label=x, y_axis_label=y1,
    **options
)
plot2 = figure(
    x_range=plot1.x_range, y_range=plot1.y_range,
    x_axis_label=x, y_axis_label=y2,
    **options
)

# A shared ColumnDataSource links brushing: a selection made on
# one plot updates the selection on the other.
source = ColumnDataSource(data=dict(x=postsDataframe[x][:n], y1=postsDataframe[y1][:n], y2=postsDataframe[y2][:n]))
plot1.scatter('x', 'y1', source=source)
plot2.scatter('x', 'y2', source=source, alpha=0.6)

plot = gridplot([[plot1, plot2]], toolbar_location='right', plot_width=400, plot_height=400)
show(plot)
In [65]:
# ISO 8601 timestamp as returned by the Graph API.
timing = datetime.strptime('2015-04-25T11:14:55+0000', '%Y-%m-%dT%H:%M:%S%z')
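The same conversion applies to the whole column at once; pandas parses these ISO 8601 strings directly:

In [ ]:
# Parse every timestamp in one call; the column then supports
# time-based indexing and resampling.
postsDataframe['time'] = pd.to_datetime(postsDataframe['time'])
postsDataframe['time'].head()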
In [7]:
engine = create_engine('sqlite:///data/fb.db')
In [15]:
# Persist the posts in the SQLite database through the engine
# created above; 'facebook' is the table name.
postsDataframe.to_sql('facebook', engine, if_exists='replace')
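A quick sanity check, reading the table back through the same engine:

In [ ]:
# Verify the write by reading the 'facebook' table back.
pd.read_sql('facebook', engine).head()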
Several Python clients exist for Twitter; Tweepy is a popular choice.
You will need to create a Twitter app and copy the four tokens and secrets into the credentials.ini file:
[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET
In [ ]:
import tweepy  # pip install tweepy

# Read the four tokens and secrets from credentials.ini rather
# than hard-coding them in the notebook.
auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))
api = tweepy.API(auth)

user = 'EPFL_en'
In [ ]:
# Your code here.
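A minimal collection sketch, with tweepy.Cursor handling the pagination; the attributes used (created_at, favorite_count, retweet_count, retweeted_status) are those of the Twitter REST API v1.1 Status object, and the column names mirror the Facebook dataframe:

In [ ]:
# Fetch up to n tweets from the user's timeline.
tweets = []
for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
    tweets.append({
        'id': tweet.id,
        'time': tweet.created_at,
        'likes': tweet.favorite_count,
        'shares': tweet.retweet_count,
        'text': tweet.text,
        # Bare retweets (see the cleaning note at the top) carry
        # a 'retweeted_status' attribute.
        'is_retweet': hasattr(tweet, 'retweeted_status'),
    })
tweetsDataframe = pd.DataFrame(tweets)
tweetsDataframe.head()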
Answer the questions using pandas, statsmodels, scipy.stats, bokeh.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
In [ ]:
# Your code here.
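As one starting point, a sketch testing whether likes and comments move together on the Facebook data, using scipy.stats as suggested above:

In [ ]:
import scipy.stats

# Spearman rank correlation is robust to the heavy-tailed like
# counts visible in the histograms above.
rho, p = scipy.stats.spearmanr(postsDataframe['likes'], postsDataframe['comments'])
print('Spearman rho = {:.2f} (p = {:.3f})'.format(rho, p))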