Introduction to Python for Data Sciences

Franck Iutzeler
Fall 2018

a) Creating new features/models

Go to top

We saw above how to transform categorical features. It is also possible to modify features in a number of ways in order to create different models.

For instance, from 1D point/value couples $(x,y)$, the linear regression fits a line. However, if we transform $x$ into $(x, x^2, x^3)$, the same linear regression will fit a degree-3 polynomial.
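As a quick illustration (a minimal NumPy sketch, not part of the original notebook; Scikit Learn's PolynomialFeatures, used below, automates this), the transform simply stacks powers of $x$ column-wise:

In [ ]:
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])   # a single 1D feature, as a column
x_poly = np.hstack([x, x**2, x**3])   # the transformed features (x, x^2, x^3)
print(x_poly)
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]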


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#import seaborn as sns
#sns.set() 

N = 100  # number of points to generate

X = np.sort(10*np.random.rand(N, 1)**0.8 , axis=0)  # abscissas, sorted for plotting

y = 4 + 0.4*np.random.rand(N) - 1. / (X.ravel() + 0.5)**2  - 1. / (10.5 - X.ravel() ) # some complicated function of X, plus noise

plt.scatter(X,y)


Out[1]:
<matplotlib.collections.PathCollection at 0x7fca052cf048>

Linear regression will obviously be a bad fit.


In [2]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(X, y)
plt.plot(X, yfit,label="linear regression")
plt.legend()


Out[2]:
<matplotlib.legend.Legend at 0x7fc9fa4accc0>

Let us transform $x$ into $(x, x^2, x^3)$ and perform the same linear regression to obtain a degree-3 polynomial fit.


In [3]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, include_bias=False) # degree 3, without the degree-0 (constant) column
XPoly = poly.fit_transform(X)
print(XPoly[:5,])


[[ 0.17714296  0.03137963  0.00555868]
 [ 0.71270793  0.50795259  0.36202184]
 [ 0.85017419  0.72279615  0.61450263]
 [ 0.85765091  0.73556509  0.63085807]
 [ 0.8626183   0.74411034  0.64188319]]

In [4]:
modelPoly = LinearRegression().fit(XPoly, y)
yfitPoly = modelPoly.predict(XPoly)
plt.scatter(X, y)
plt.plot(X, yfit,label="linear regression")
plt.plot(X, yfitPoly,label="Polynomial regression (deg 3)")
plt.legend(loc = 'lower right')


Out[4]:
<matplotlib.legend.Legend at 0x7fc9fa4bb710>

Pipeline

This 2-step fitting (polynomial transform + linear regression) calls for a replicated dataset, which can be costly. That is why Scikit Learn implements a Pipeline object that allows one to perform multiple transform/fit steps sequentially.

This pipeline can then be used as a model.
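As a side note, the same pipeline can be built with Scikit Learn's make_pipeline helper, which generates the step names automatically (a minimal sketch, equivalent to the cell below except for the step names):

In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# the steps are named after the lowercased class names:
# 'polynomialfeatures' and 'linearregression'
polyReg_auto = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                             LinearRegression())

We keep the explicitly named version below, as the step names ('polyFeat', 'linReg') are used for parameter tuning later on.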


In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

polyFeat = PolynomialFeatures(degree=3, include_bias=False)
linReg = LinearRegression()

polyReg = Pipeline([ ('polyFeat',polyFeat) , ('linReg',linReg) ])

polyReg.fit(X, y) # X original not XPoly
yfitPolyNew = polyReg.predict(X)

plt.scatter(X, y)
plt.plot(X, yfit,label="linear regression")
plt.plot(X, yfitPolyNew,label="Polynomial regression (deg 3)")
plt.legend(loc = 'lower right')


Out[5]:
<matplotlib.legend.Legend at 0x7fc9fa3bdcc0>

b) Validation and hyperparameter tuning

Go to top

We saw above (see the lasso example of the regression part) some basic examples of how to:

  • validate our model by splitting the dataset into training and testing sets (using train_test_split from sklearn.model_selection)
  • tune hyperparameters by looking at the error for different values

Scikit Learn actually provides dedicated methods for these tasks as well.

Validation

Scikit Learn offers a cross validation method that:

  • splits the dataset into several groups (folds)
  • for each of these groups, fits the model on all the other groups and computes the error on this one

This way, all the data passes through both the learning and the validation sets, hence the name cross validation. This is illustrated by the following figure from the Python Data Science Handbook by Jake VanderPlas.

The score is computed either as the standard score of the estimator or as specified with the scoring option (see the available metrics).

Warning: All scorer objects follow the convention that higher return values are better than lower return values.
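To make the mechanism concrete, here is a minimal sketch of what the cross_val_score function used below does internally, written by hand with KFold (simplified; the actual implementation also clones the estimator for each fold):

In [ ]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    # fit on all folds but one...
    polyReg.fit(X[train_idx], y[train_idx])
    # ...and evaluate on the held-out fold
    # (negated so that, as for all scorers, higher is better)
    scores.append(-mean_absolute_error(y[val_idx], polyReg.predict(X[val_idx])))
print(scores)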

Let us compute the cross validation for our polynomial fit problem.


In [6]:
from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(polyReg, X, y, cv=5, scoring="neg_mean_absolute_error") # 5-fold cross validation

print(cv_score)
print("Mean score:" , np.mean(cv_score))


[-0.83184925 -0.12044359 -0.22175632 -0.22954883 -0.92647202]
Mean score: -0.466014001082

Now that scoring and cross validation are in place, we can focus on finding the best parameters of our polynomial model:

  • degree
  • presence or not of an intercept

Let us see what the parameters of our model are (as this is a pipeline, the get_params method is convenient here).


In [7]:
polyReg.get_params()


Out[7]:
{'linReg': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 'linReg__copy_X': True,
 'linReg__fit_intercept': True,
 'linReg__n_jobs': 1,
 'linReg__normalize': False,
 'polyFeat': PolynomialFeatures(degree=3, include_bias=False, interaction_only=False),
 'polyFeat__degree': 3,
 'polyFeat__include_bias': False,
 'polyFeat__interaction_only': False,
 'steps': [('polyFeat',
   PolynomialFeatures(degree=3, include_bias=False, interaction_only=False)),
  ('linReg',
   LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))]}

This lets us identify the parameter names corresponding to the quantities to tune:

  • degree: polyFeat__degree
  • presence or not of an intercept: linReg__fit_intercept and polyFeat__include_bias

We can now construct a dictionary of values to test.


In [8]:
param_grid = [
  {'polyFeat__degree':  np.arange(1,12),
   'linReg__fit_intercept': [True,False],
   'polyFeat__include_bias': [True,False]}]

In [9]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(polyReg, param_grid, cv=5) 
grid.fit(X, y)


Out[9]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('polyFeat', PolynomialFeatures(degree=3, include_bias=False, interaction_only=False)), ('linReg', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'polyFeat__include_bias': [True, False], 'linReg__fit_intercept': [True, False], 'polyFeat__degree': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])}],
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

We can then get the best parameters and the corresponding model.


In [10]:
grid.best_params_


Out[10]:
{'linReg__fit_intercept': True,
 'polyFeat__degree': 4,
 'polyFeat__include_bias': True}

In [11]:
best_model = grid.best_estimator_.fit(X,y)
overfit = polyReg.set_params(polyFeat__degree=15).fit(X,y)

Xplot = np.linspace(-1,10.5,100).reshape(-1, 1)
yBest = best_model.predict(Xplot)
yOver = overfit.predict(Xplot)

plt.scatter(X, y)
plt.plot(Xplot, yBest , 'r' ,label="Best polynomial")
plt.plot(Xplot, yOver , 'k' , label="overfitted (deg 15)")
plt.legend(loc = 'lower right')
plt.ylim([0,5])
plt.title("Best and overfitted models")


Out[11]:
<matplotlib.text.Text at 0x7fc9fa3018d0>

We notice that the grid search based on cross validation helped discard overfitted models (as they performed poorly on the validation sets).
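This can be visualized directly; here is a minimal sketch using Scikit Learn's validation_curve, which refits the pipeline for a range of degrees and reports training and validation scores (the exact curves depend on the random data generated above):

In [ ]:
from sklearn.model_selection import validation_curve

degrees = np.arange(1, 16)
train_scores, val_scores = validation_curve(polyReg, X, y,
                                            param_name='polyFeat__degree',
                                            param_range=degrees, cv=5)
plt.plot(degrees, train_scores.mean(axis=1), label="training score")
plt.plot(degrees, val_scores.mean(axis=1), label="validation score")
plt.xlabel("polynomial degree")
plt.legend(loc='lower right')

The training score keeps increasing with the degree while the validation score drops for high degrees: this is exactly the overfitting that cross validation detects.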

c) Text and Image Features

Go to top

We already saw an example of feature extraction from categorical data. However, for some particular kinds of data, such as text and images, dedicated tools exist.

Text feature extraction

In learning applications, words are usually more important than letters, so a basic way to extract features is to construct one feature per distinct word and count the occurrences of this word. This is known as word count. An approach to mitigate very frequent words (like "the", "a", etc.) is term frequency-inverse document frequency (tf-idf), which weights the occurrence count of a word by the inverse of its frequency across documents.
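In its simplest form (Scikit Learn adds normalization and smoothing terms on top of this), the tf-idf weight of a term $t$ in a document $d$, for a corpus of $N$ documents, is

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log \frac{N}{\mathrm{df}(t)},$$

where $\mathrm{tf}(t,d)$ is the number of occurrences of $t$ in $d$ and $\mathrm{df}(t)$ is the number of documents in which $t$ appears; ubiquitous terms thus get a weight close to zero.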


In [12]:
with open('./data/poems/poe-raven.txt', 'r') as f:  # the context manager closes the file automatically
    poe = f.read().replace('\n',' ').replace('.','').replace(',','').replace('-','')
poe


Out[12]:
'                                      1845                                    THE RAVEN                                by Edgar Allan Poe      Once upon a midnight dreary while I pondered weak and weary   Over many a quaint and curious volume of forgotten lore     While I nodded nearly napping suddenly there came a tapping    As of some one gently rapping rapping at my chamber door   "\'Tis some visitor" I muttered "tapping at my chamber door                 Only this and nothing more"      Ah distinctly I remember it was in the bleak December   And each separate dying ember wrought its ghost upon the floor     Eagerly I wished the morrow; vainly I had sought to borrow     From my books surcease of sorrow sorrow for the lost Lenore   For the rare and radiant maiden whom the angels name Lenore                 Nameless here for evermore      And the silken sad uncertain rustling of each purple curtain   Thrilled me filled me with fantastic terrors never felt before;     So that now to still the beating of my heart I stood repeating     "\'Tis some visitor entreating entrance at my chamber door   Some late visitor entreating entrance at my chamber door;                 This it is and nothing more"      Presently my soul grew stronger; hesitating then no longer   "Sir" said I "or Madam truly your forgiveness I implore;     But the fact is I was napping and so gently you came rapping     And so faintly you came tapping tapping at my chamber door   That I scarce was sure I heard you" here I opened wide the door;                 Darkness there and nothing more      Deep into that darkness peering long I stood there wondering         fearing   Doubting dreaming dreams no mortals ever dared to dream before;     But the silence was unbroken and the stillness gave no token     And the only word there spoken was the whispered word "Lenore!"   This I whispered and an echo murmured back the word "Lenore!"                 Merely this and nothing more      Back into the chamber turning all my soul within me burning    Soon again I heard a tapping somewhat louder than before     "Surely" said I "surely that is something at my window lattice:     Let me see then what thereat is and this mystery explore   Let my heart be still a moment and this mystery explore;                 \'Tis the wind and nothing more"      Open here I flung the shutter when with many a flirt and         flutter   In there stepped a stately raven of the saintly days of yore;     Not the least obeisance made he; not a minute stopped or stayed         he;     But with mien of lord or lady perched above my chamber door   Perched upon a bust of Pallas just above my chamber door                 Perched and sat and nothing more     Then this ebony bird beguiling my sad fancy into smiling   By the grave and stern decorum of the countenance it wore    "Though thy crest be shorn and shaven thou" I said "art sure no         craven    Ghastly grim and ancient raven wandering from the Nightly shore   Tell me what thy lordly name is on the Night\'s Plutonian shore!"                 
Quoth the Raven "Nevermore"      Much I marvelled this ungainly fowl to hear discourse so plainly   Though its answer little meaning little relevancy bore;     For we cannot help agreeing that no living human being     Ever yet was blest with seeing bird above his chamber door   Bird or beast upon the sculptured bust above his chamber door                 With such name as "Nevermore"      But the raven sitting lonely on the placid bust spoke only   That one word as if his soul in that one word he did outpour     Nothing further then he uttered not a feather then he fluttered     Till I scarcely more than muttered "other friends have flown         before   On the morrow he will leave me as my hopes have flown before"                 Then the bird said "Nevermore"       Startled at the stillness broken by reply so aptly spoken   "Doubtless" said I "what it utters is its only stock and store      Caught from some unhappy master whom unmerciful Disaster      Followed fast and followed faster till his songs one burden bore   Till the dirges of his Hope that melancholy burden bore                 Of \'Never nevermore\'"      But the Raven still beguiling all my fancy into smiling   Straight I wheeled a cushioned seat in front of bird and bust and         door;     Then upon the velvet sinking I betook myself to linking     Fancy unto fancy thinking what this ominous bird of yore   What this grim ungainly ghastly gaunt and ominous bird of yore                 Meant in croaking "Nevermore"      This I sat engaged in guessing but no syllable expressing   To the fowl whose fiery eyes now burned into my bosom\'s core;     This and more I sat divining with my head at ease reclining     On the cushion\'s velvet lining that the lamplight gloated o\'er   But whose velvet violet lining with the lamplight gloating o\'er                 She shall press ah nevermore!      Then methought the air grew denser perfumed from an unseen censer   Swung by Seraphim whose footfalls tinkled on the tufted floor     "Wretch" I cried "thy God hath lent thee by these angels he         hath sent thee     Respite respite and nepenthe from thy memories of Lenore!   Quaff oh quaff this kind nepenthe and forget this lost Lenore!"                 Quoth the Raven "Nevermore"      "Prophet!" said I "thing of evil! prophet still if bird or         devil!   Whether Tempter sent or whether tempest tossed thee here ashore     Desolate yet all undaunted on this desert land enchanted     On this home by horror haunted tell me truly I implore   Is there is there balm in Gilead? tell me tell me I implore!"                 Quoth the Raven "Nevermore"      "Prophet!" said I "thing of evil prophet still if bird or         devil!   By that Heaven that bends above us by that God we both adore     Tell this soul with sorrow laden if within the distant Aidenn     It shall clasp a sainted maiden whom the angels name Lenore   Clasp a rare and radiant maiden whom the angels name Lenore"                 Quoth the Raven "Nevermore"      "Be that word our sign in parting bird or fiend" I shrieked         upstarting   "Get thee back into the tempest and the Night\'s Plutonian shore!     Leave no black plume as a token of that lie thy soul hath spoken!     Leave my loneliness unbroken! quit the bust above my door!   Take thy beak from out my heart and take thy form from off my         door!"                
Quoth the Raven "Nevermore"      And the Raven never flitting still is sitting still is sitting   On the pallid bust of Pallas just above my chamber door;     And his eyes have all the seeming of a demon\'s that is dreaming     And the lamplight o\'er him streaming throws his shadow on the         floor;   And my soul from out that shadow that lies floating on the floor                 Shall be lifted nevermore!                               THE END  '

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform([poe])
X


Out[13]:
<1x438 sparse matrix of type '<class 'numpy.int64'>'
	with 438 stored elements in Compressed Sparse Row format>

The vectorizer has registered the feature names and output a sparse matrix that can be converted to a DataFrame.


In [14]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())


Out[14]:
   1845  above  adore  again  agreeing  ah  aidenn  air  all  allan  ...  within  wondering  word  wore  wretch  wrought  yet  yore  you  your
0     1      7      1      1         1   2       1    1    4      1  ...       2          1     6     1       1        1    2     3    3     1

1 rows × 438 columns

The tf-idf vectorizer works the same way.


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform([poe])
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())


Out[15]:
      1845     above    adore    again  agreeing       ah   aidenn      air       all    allan  ...   within  wondering      word     wore   wretch  wrought      yet     yore      you     your
0  0.01003  0.070211  0.01003  0.01003   0.01003  0.02006  0.01003  0.01003  0.040121  0.01003  ...  0.02006    0.01003  0.060181  0.01003  0.01003  0.01003  0.02006  0.03009  0.03009  0.01003

1 rows × 438 columns

For more details, see the text feature extraction doc and the image feature extraction doc, as well as scikit-image.
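While this notebook only demonstrates text, Scikit Learn also ships basic image feature extraction utilities; as a minimal sketch (on a dummy image, not part of the original notebook), extract_patches_2d cuts an image into small patches that can then serve as features:

In [ ]:
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

img = np.random.rand(32, 32)   # a dummy grayscale image
patches = extract_patches_2d(img, (8, 8), max_patches=10, random_state=0)
print(patches.shape)           # (10, 8, 8): 10 patches of 8x8 pixels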

Exercise 4.4.1: Differentiating authors according to their word usage
In the folder data/poems are three poems by Edgar Allan Poe and three plays by William Shakespeare. Is it possible to differentiate these authors using only the words they use?

In [ ]:


Package Check and Styling

Go to top


In [ ]:
import lib.notebook_setting as nbs

packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)

nbs.cssStyling()