Introduction to Python for Data Sciences
Franck Iutzeler, Fall 2018
We saw above how to transform categorical features. Features can be modified in a number of ways in order to create different models.
For instance, from 1D point/value couples $(x,y)$, linear regression fits a line. However, if we transform $x$ into $(x^1,x^2,x^3)$, the same linear regression will fit a degree-3 polynomial.
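For illustration, here is a minimal hand-built version of this feature map with NumPy (the array x below is just a toy example); PolynomialFeatures, used later in this notebook, builds the same kind of matrix automatically.
import numpy as np
x = np.array([1.0, 2.0, 3.0])
X_cubic = np.column_stack([x, x**2, x**3]) # one row per point, one column per power of x
print(X_cubic)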
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#import seaborn as sns
#sns.set()
N = 100 #points to generate
X = np.sort(10*np.random.rand(N, 1)**0.8 , axis=0) # abscissas (x values), sorted for plotting
y = 4 + 0.4*np.random.rand(N) - 1. / (X.ravel() + 0.5)**2 - 1. / (10.5 - X.ravel() ) # some complicated function
plt.scatter(X,y)
Out[1]:
Linear regression will obviously be a bad fit.
In [2]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(X, y)
plt.plot(X, yfit,label="linear regression")
plt.legend()
Out[2]:
Let us transform $x$ into degree-3 polynomial features and perform the same linear regression.
In [3]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False) # degree 3, without the degree-0 (constant) term
XPoly = poly.fit_transform(X)
print(XPoly[:5,])
In [4]:
modelPoly = LinearRegression().fit(XPoly, y)
yfitPoly = modelPoly.predict(XPoly)
plt.scatter(X, y)
plt.plot(X, yfit,label="linear regression")
plt.plot(X, yfitPoly,label="Polynomial regression (deg 3)")
plt.legend(loc = 'lower right')
Out[4]:
In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
polyFeat = PolynomialFeatures(degree=3, include_bias=False)
linReg = LinearRegression()
polyReg = Pipeline([ ('polyFeat',polyFeat) , ('linReg',linReg) ])
polyReg.fit(X, y) # fit on the original X, not XPoly: the pipeline applies the polynomial transform itself
yfitPolyNew = polyReg.predict(X)
plt.scatter(X, y)
plt.plot(X, yfit,label="linear regression")
plt.plot(X, yfitPolyNew,label="Polynomial regression (deg 3)")
plt.legend(loc = 'lower right')
Out[5]:
We saw above (see the lasso example of the regression part) some basic examples of how to split the data into training and testing sets in order to validate a model.
Scikit Learn actually provides dedicated methods for that as well.
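For instance, a minimal sketch with train_test_split (using the X and y generated above; the 80/20 split ratio and the fixed random_state are arbitrary choices for illustration):
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0) # keep 20% of the data aside for testing
print(Xtrain.shape, Xtest.shape)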
Scikit Learn also offers a cross validation method that splits the data into $k$ groups (folds), trains on $k-1$ of them, and validates on the remaining one, repeating the operation for each group.
This way, all the data goes through both the learning and the validation sets, hence the name cross validation. This is illustrated by the following figure from the Python Data Science Handbook by Jake VanderPlas.
The score is computed either as the standard score of the estimator, or it can be specified with the scoring option (see the available metrics).
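As a minimal sketch of this splitting mechanism (the 10-sample toy array below is made up for illustration), KFold from sklearn.model_selection exposes the successive train/validation index sets:
import numpy as np
from sklearn.model_selection import KFold
toy = np.arange(10).reshape(-1, 1) # 10 toy samples
for train_idx, val_idx in KFold(n_splits=5).split(toy):
    print("train:", train_idx, "validate:", val_idx)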
Let us compute the cross validation scores for our polynomial fit problem.
In [6]:
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(polyReg, X, y, cv=5, scoring="neg_mean_absolute_error") # 5-fold cross validation
print(cv_score)
print("Mean score:" , np.mean(cv_score))
Now that scoring and cross validation are done, we can focus on finding the best parameters of our polynomial model.
Let us see what the parameters of our model are (as this is a pipeline, it is convenient to use the get_params function).
In [7]:
polyReg.get_params()
Out[7]:
This lets us see the names of the parameters corresponding to the quantities to fit, such as polyFeat__degree, linReg__fit_intercept, and polyFeat__include_bias. We can now construct a dictionary of values to test.
In [8]:
param_grid = [
    {'polyFeat__degree': np.arange(1,12),
     'linReg__fit_intercept': [True,False],
     'polyFeat__include_bias': [True,False]}]
In [9]:
from sklearn.model_selection import GridSearchCV # GridSearchCV now lives in model_selection (sklearn.grid_search is deprecated)
grid = GridSearchCV(polyReg, param_grid, cv=5)
grid.fit(X, y)
Out[9]:
We can then get the best parameters and the corresponding model.
In [10]:
grid.best_params_
Out[10]:
In [11]:
best_model = grid.best_estimator_.fit(X,y)
overfit = polyReg.set_params(polyFeat__degree=15).fit(X,y)
Xplot = np.linspace(-1,10.5,100).reshape(-1, 1)
yBest = best_model.predict(Xplot)
yOver = overfit.predict(Xplot)
plt.scatter(X, y)
plt.plot(Xplot, yBest , 'r' ,label="Best polynomial")
plt.plot(Xplot, yOver , 'k' , label="overfitted (deg 15)")
plt.legend(loc = 'lower right')
plt.ylim([0,5])
plt.title("Best and overfitted models")
Out[11]:
We notice that the grid search based on cross validation helped discard overfitted models (as they performed badly on the validation sets).
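To see this concretely, the validation scores collected during the grid search can be inspected through its cv_results_ attribute (a sketch; the column names are those documented for GridSearchCV.cv_results_):
import pandas as pd
res = pd.DataFrame(grid.cv_results_)
# mean validation score for each candidate parameter combination
print(res[['param_polyFeat__degree', 'mean_test_score']].sort_values('mean_test_score', ascending=False).head())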
We already saw an example of feature extraction from categorical data. However, for some particular kinds of data, such as text and images, dedicated tools exist.
In learning applications, words are usually more important than letters, so a basic way to extract features is to construct one feature per word present and count the occurrences of this word. This is known as word count. An approach to mitigate very frequent words (like "the", "a", etc.) is term frequency inverse document frequency (tf-idf), which down-weights the occurrence count of words that appear in many documents.
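As a toy illustration (the two example sentences below are made up), a word like "the" that appears in every document receives a smaller idf factor than document-specific words:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat", "the dog barked"]
print(CountVectorizer().fit_transform(docs).toarray()) # raw word counts
print(TfidfVectorizer().fit_transform(docs).toarray().round(2)) # counts reweighted by inverse document frequency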
In [12]:
with open('./data/poems/poe-raven.txt', 'r') as f: # context manager ensures the file is closed
    poe = f.read().replace('\n',' ').replace('.','').replace(',','').replace('-','')
poe
Out[12]:
In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform([poe])
X
Out[13]:
The vectorizer has registered the feature names and output a sparse matrix that can be converted to a DataFrame.
In [14]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
Out[14]:
The tf-idf vectorizer works the same way.
In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform([poe])
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
Out[15]:
For more details, see the text feature extraction doc and the image feature extraction doc, as well as scikit-image.
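As a minimal sketch of the image side (the 6x6 array below is a made-up stand-in for a real image), scikit-learn can for instance extract small patches to use as features:
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
toy_img = np.arange(36, dtype=float).reshape(6, 6) # toy 6x6 "image"
patches = extract_patches_2d(toy_img, (2, 2), max_patches=4, random_state=0)
print(patches.shape) # (4, 2, 2): four random 2x2 patches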
In [ ]:
import lib.notebook_setting as nbs
packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)
nbs.cssStyling()