A Python Tour of Data Science: Data Acquisition & Exploration

Michaël Defferrard, PhD student, EPFL LTS2

Exercise: problem definition

Theme of the exercise: understand the impact of your communication on social networks. A real life situation: the marketing team needs help in identifying which were the most engaging posts they made on social platforms to prepare their next AdWords campaign.

This notebook is the second part of the exercise. Given the data we collected from Facebook an Twitter in the last exercise, we will construct an ML model and evaluate how good it is to predict the number of likes of a post / tweet given the content.

1 Data importation

  1. Use pandas to import the facebook.sqlite and twitter.sqlite databases.
  2. Print the 5 first rows of both tables.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
import os.path

folder = os.path.join('..', 'data', 'social_media')

# Your code here.
fb = pd.read_sql('facebook', 'sqlite:///' + os.path.join(folder, 'facebook.sqlite'))
tw = pd.read_sql('twitter', 'sqlite:///' + os.path.join(folder, 'twitter.sqlite'))

n, d = fb.shape
print('The data is a {} with {} samples of dimensionality {}.'.format(type(fb), n, d))


The data is a <class 'pandas.core.frame.DataFrame'> with 52 samples of dimensionality 6.

2 Vectorization

First step: transform the data into a format understandable by the machine. What to do with text ? A common choice is the so-called bag-of-word model, where we represent each word a an integer and simply count the number of appearances of a word into a document.

Example

Let's say we have a vocabulary represented by the following correspondance table.

Integer Word
0 unknown
1 dog
2 school
3 cat
4 house
5 work
6 animal

Then we can represent the following document

I have a cat. Cats are my preferred animals.

by the vector $x = [6, 0, 0, 2, 0, 0, 1]^T$.

Tasks

  1. Construct a vocabulary of the 100 most occuring words in your dataset.
  2. Build a vector $x \in \mathbb{R}^{100}$ for each document (post or tweet).

Tip: the natural language modeling libraries nltk and gensim are useful for advanced operations. You don't need them here.

Arise a first data cleaning question. We may have some text in french and other in english. What do we do ?


In [13]:
#Data cleaning
for i in range(len(fb)):
    if fb['text'][[i] == 'http' :#or fb[i] == 'the':
        fb.icol(i)
    else:
        continue


  File "<ipython-input-13-7396ddbc48a2>", line 5
    else:
       ^
SyntaxError: invalid syntax

In [ ]:


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import re

nwords = 100

# Your code here.
vectorizer = CountVectorizer(max_features = nwords)
#----------------------------------------------------------------------------------------------

fb_text_vec = vectorizer.fit_transform(fb['text'])
fb_text_vectorized = fb_text_vec.toarray()
fb_words = vectorizer.get_feature_names()

#data cleaning
fb_words.remove('http')

freqs = [(word, fb_text_vec.getcol(idx).sum()) for word, idx in vectorizer.vocabulary_.items()]
fb_Most_used = sorted(freqs, key = lambda x: -x[1])

#----------------------------------------------------------------------------------------------

tw_text_vec = vectorizer.fit_transform(tw['text'])
tw_text_vectorized = tw_text_vec.toarray()
tw_words = vectorizer.get_feature_names()

#data cleaning
tw_words.remove('rt')

freqs = [(word, tw_text_vec.getcol(idx).sum()) for word, idx in vectorizer.vocabulary_.items()]
tw_Most_used = sorted(freqs, key = lambda x: -x[1])

Exploration question: what are the 5 most used words ? Exploring your data while playing with it is a useful sanity check.


In [10]:
b = vectorizer.vocabulary_.get('2016')
print(fb_Most_used[:5])
print(tw_Most_used[:5])


[('the', 39), ('technis', 33), ('to', 24), ('http', 21), ('and', 20)]
[('co', 72), ('https', 66), ('mytechnis', 43), ('rt', 42), ('the', 26)]
Out[10]:
100

In [ ]:

3 Pre-processing

  1. The independant variables $X$ are the bags of words.
  2. The target $y$ is the number of likes.
  3. Split in half for training and testing sets.

In [11]:
# Your code here.
X = tw_text_vectorized
X = X.astype(np.float)
#X -= X.mean(axis=0)
#X /= X.std(axis=0)
y = tw['likes'] 
y = y.astype(np.float)

In [12]:
# Training and testing sets.
test_size = round(len(X)/2)
print('Split: {} testing and {} training samples'.format(test_size, y.size - test_size))
perm = np.random.permutation(y.size)
X_test  = X[:test_size]
X_train = X[test_size:]
y_test  = y[perm[:test_size]]
y_train = y[perm[test_size:]]


Split: 38 testing and 39 training samples

4 Linear regression

Using numpy, fit and evaluate the linear model $$\hat{w}, \hat{b} = \operatorname*{arg min}_{w,b} \| Xw + b - y \|_2^2.$$

Please define a class LinearRegression with two methods:

  1. fit learn the parameters $w$ and $b$ of the model given the training examples.
  2. predict gives the estimated number of likes of a post / tweet. That will be used to evaluate the model on the testing set.

To evaluate the classifier, create an accuracy(y_pred, y_true) function which computes the mean squared error $\frac1n \| \hat{y} - y \|_2^2$.

Hint: you may want to use the function scipy.sparse.linalg.spsolve().


In [13]:
import scipy.sparse
class RidgeRegression(object):
    """Our ML model."""
    
    def __init__(self, alpha=0):
        "The class' constructor. Initialize the hyper-parameters."
        self.a = alpha
    
    def predict(self, X):
        """Return the predicted class given the features."""
        return np.sign(X.dot(self.w) + self.b)
    
    def fit(self, X, y):
        """Learn the model's parameters given the training data, the closed-form way."""
        n, d = X.shape
        self.b = np.mean(y)
        Ainv = np.linalg.inv(X.T.dot(X) + self.a * np.identity(d))
        self.w = Ainv.dot(X.T).dot(y - self.b)

    def loss(self, X, y, w=None, b=None):
        """Return the current loss.
        This method is not strictly necessary, but it provides
        information on the convergence of the learning process."""
        w = self.w if w is None else w  # The ternary conditional operator
        b = self.b if b is None else b  # makes those tests concise.
        import autograd.numpy as np  # See below for autograd.
        return np.linalg.norm(np.dot(X, w) + b - y)**2 + self.a * np.linalg.norm(w, 2)**2

Interpretation: what are the most important words a post / tweet should include ?


In [15]:
# Your code here.
import sklearn.metrics

neigh = RidgeRegression()
neigh.fit(X_train, y_train) 
y_predTest = neigh.predict(X_train)
train_accuracy = sklearn.metrics.accuracy_score(y_test, y_predTest)
y_pred = neigh.predict(X_test)
test_accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
print('train_accuracy',train_accuracy)
print('test_accuracy',test_accuracy)


---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-15-faffdefb4c8a> in <module>()
      3 
      4 neigh = RidgeRegression()
----> 5 neigh.fit(X_train, y_train)
      6 y_predTest = neigh.predict(X_train)
      7 train_accuracy = sklearn.metrics.accuracy_score(y_test, y_predTest)

<ipython-input-13-c3010422d64f> in fit(self, X, y)
     15         n, d = X.shape
     16         self.b = np.mean(y)
---> 17         Ainv = np.linalg.inv(X.T.dot(X) + self.a * np.identity(d))
     18         self.w = Ainv.dot(X.T).dot(y - self.b)
     19 

/Users/malogrisard/anaconda/lib/python3.5/site-packages/numpy/linalg/linalg.py in inv(a)
    524     signature = 'D->D' if isComplexType(t) else 'd->d'
    525     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 526     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    527     return wrap(ainv.astype(result_t, copy=False))
    528 

/Users/malogrisard/anaconda/lib/python3.5/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     88 
     89 def _raise_linalgerror_singular(err, flag):
---> 90     raise LinAlgError("Singular matrix")
     91 
     92 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

5 Interactivity

  1. Create a slider for the number of words, i.e. the dimensionality of the samples $x$.
  2. Print the accuracy for each change on the slider.

In [ ]:
import ipywidgets
from IPython.display import clear_output

# Your code here.

6 Scikit learn

  1. Fit and evaluate the linear regression model using sklearn.
  2. Evaluate the model with the mean squared error metric provided by sklearn.
  3. Compare with your implementation.

In [16]:
from sklearn import linear_model, metrics

# Your code here.
neigh = sklearn.linear_model.LogisticRegression()
neigh.fit(X_train, y_train) 
y_predTest = neigh.predict(X_train)
train_accuracy = sklearn.metrics.accuracy_score(y_train, y_predTest)
y_pred = neigh.predict(X_test)
test_accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
print('train_accuracy',train_accuracy)
print('test_accuracy',test_accuracy)


train_accuracy 0.948717948718
test_accuracy 0.684210526316

7 Deep Learning

Try a simple deep learning model !

Another modeling choice would be to use a Recurrent Neural Network (RNN) and feed it the sentence words after words.


In [ ]:
import os
os.environ['KERAS_BACKEND'] = 'theano'  # tensorflow
import keras

# Your code here.

8 Evaluation

Use matplotlib to plot a performance visualization. E.g. the true number of likes and the real number of likes for all posts / tweets.

What do you observe ? What are your suggestions to improve the performance ?


In [ ]:
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

# Your code here.