Inspired by Sebastian Raschka's (@rasbt) excellent book Python Machine Learning and its notebooks, as well as by the scikit-learn tutorials on Kaggle.



In [1]:

    
import pandas as pd
import numpy as np
import seaborn as sns
#sns.set_style('whitegrid')
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.simplefilter('ignore', DeprecationWarning)









    




Numerical features:

Log transformation: log(x) / log(1 + x)

Plots showing the number of retirees and the number of FN votes per city in the second round of the regional elections (GitHub link).

Left: without log
Right: with log

The data are better distributed once they have been log-transformed.
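
As a rough illustration of the effect, here is a minimal sketch on made-up, heavily right-skewed counts (the numbers are hypothetical, not the election data). `np.log1p` computes log(1 + x), which is handy when a feature contains zeros:

```python
import numpy as np

# Hypothetical right-skewed counts (e.g. one count per city); not real data
counts = np.array([0, 3, 12, 45, 150, 900, 12000], dtype=float)

# log(1 + x) is defined at x = 0, unlike log(x)
log_counts = np.log1p(counts)

# Ratio between the largest value and the median, before and after
ratio_raw = counts.max() / np.median(counts)
ratio_log = log_counts.max() / np.median(log_counts)

print("max / median before log: %.1f" % ratio_raw)
print("max / median after log:  %.1f" % ratio_log)

# The transformation can be undone with expm1
restored = np.expm1(log_counts)
```

The largest value no longer dwarfs the rest, which is what the right-hand plots show.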

Normalization (standardization)



In [2]:

    
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df.columns = ["label", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids",
                "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines",
                "Proline"]
df.head()









    Out[2]:

   label  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  Color intensity   Hue  OD280/OD315 of diluted wines  Proline
0      1    14.23        1.71  2.43               15.6        127           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92     1065
1      1    13.20        1.78  2.14               11.2        100           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40     1050
2      1    13.16        2.36  2.67               18.6        101           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17     1185
3      1    14.37        1.95  2.50               16.8        113           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45     1480
4      1    13.24        2.59  2.87               21.0        118           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93      735



In [3]:

    
df[['Alcohol', 'Malic acid']].describe()

The "Alcohol" (% vol) and "Malic acid" (g/l) features are not on the same scale, so we would be comparing apples and oranges...



In [4]:

    
sns.lmplot(x="Alcohol", y="Malic acid", hue="label", data=df, fit_reg=False, markers=['^', 's', 'o'])









    









    Out[4]:





<seaborn.axisgrid.FacetGrid at 0x10f23ee50>



In [5]:

    
# Normalize (standardize) the features
from sklearn.preprocessing import StandardScaler

df_scale = df.copy()

# Every column except the first one ("label") is a feature
for col in df_scale.columns[1:]:
    scaler = StandardScaler()
    # fit_transform expects a 2-D input, hence the double brackets;
    # ravel() flattens the (n, 1) result back into a single column
    df_scale[col] = scaler.fit_transform(df_scale[[col]]).ravel()



In [6]:

    
df_scale[['Alcohol', 'Malic acid']].describe()









    Out[6]:

            Alcohol    Malic acid
count  1.780000e+02  1.780000e+02
mean  -8.619821e-16 -8.357859e-17
std    1.002821e+00  1.002821e+00
min   -2.434235e+00 -1.432983e+00
25%   -7.882448e-01 -6.587486e-01
50%    6.099988e-02 -4.231120e-01
75%    8.361286e-01  6.697929e-01
max    2.259772e+00  3.109192e+00

"Alcohol" and "Malic acid" now have a mean of (numerically) 0 and a standard deviation of about 1.



In [ ]:

    
sns.lmplot(x="Alcohol", y="Malic acid", hue="label", data=df_scale, fit_reg=False, markers=['^', 's', 'o'])
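
For reference, the transformation applied by `StandardScaler` is z = (x - mean) / std, computed column by column. The NumPy sketch below (the input values are made up for illustration) also suggests why `describe()` reports a std of about 1.0028 rather than exactly 1: the scaler divides by the population standard deviation (ddof=0), while pandas reports the sample standard deviation (ddof=1).

```python
import numpy as np

# Hypothetical feature values, for illustration only
x = np.array([11.0, 12.5, 13.0, 13.5, 15.0])

# z-score: subtract the mean, divide by the population std (ddof=0)
z = (x - x.mean()) / x.std()

n = len(x)
print("mean of z:          %.6f" % z.mean())
print("std of z (ddof=0):  %.6f" % z.std())
print("std of z (ddof=1):  %.6f" % z.std(ddof=1))
print("sqrt(n / (n - 1)):  %.6f" % np.sqrt(n / (n - 1.0)))
```

With the 178 rows of the wine data, sqrt(178 / 177) is about 1.0028, which matches the std shown in Out[6].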

Normalization is crucial before a PCA (Principal Component Analysis). Let's look at the difference with a plot:



In [ ]:

    
# Using PCA
from sklearn.decomposition import PCA

# X = features, y = label
X, y = df.iloc[:, 1:].values, df.iloc[:,0].values # No scale
X_scale, y = df_scale.iloc[:, 1:].values, df_scale.iloc[:,0].values #scale

# non scale
pca = PCA(n_components=2).fit(X)
data_pca = pca.transform(X)
pca_df = pd.DataFrame(data_pca, columns=['pca_1', 'pca_2'])
pca_df['label'] = df['label']

# scale
pca_scale = PCA(n_components=2).fit(X_scale)
data_pca_scale = pca_scale.transform(X_scale)
pca_scale_df = pd.DataFrame(data_pca_scale, columns=['pca_1', 'pca_2'])
pca_scale_df['label'] = df['label']



In [ ]:

PCA reduces the number of dimensions while keeping as much information as possible. In a sense, we extract the essence of the information. More information here and in this video.



In [ ]:

    
pca_df.head()

PCA without normalization:



In [ ]:

    
g=sns.lmplot(x="pca_1", y="pca_2", hue="label", data=pca_df, fit_reg=False, markers=['^', 's', 'o'])

PCA with normalization:



In [ ]:

    
sns.lmplot(x="pca_1", y="pca_2", hue="label", data=pca_scale_df, fit_reg=False, markers=['^', 's', 'o'])

In the "PCA with normalization" version, our 3 clusters are fairly well separated, whereas in the "PCA without normalization" version the separation is much more chaotic.
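
One way to see why scaling matters: PCA looks for the directions of largest variance, so a feature measured in large units dominates the first component. The sketch below uses two synthetic, independent features (hypothetical data, not the wine set) to illustrate this; the variable names are introduced here for the example only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Two independent synthetic features on very different scales
small = rng.normal(loc=2.0, scale=1.0, size=200)       # values around 2
large = rng.normal(loc=750.0, scale=300.0, size=200)   # values around 750
X_demo = np.column_stack([small, large])

# Without scaling: the first component is almost entirely the large feature
pca_raw = PCA(n_components=2).fit(X_demo)
ratio_raw = pca_raw.explained_variance_ratio_

# With scaling: both features contribute comparably
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_scaled = PCA(n_components=2).fit(X_demo_scaled)
ratio_scaled = pca_scaled.explained_variance_ratio_

print("explained variance ratio, raw:    %s" % (ratio_raw,))
print("explained variance ratio, scaled: %s" % (ratio_scaled,))
print("first component, raw:             %s" % (pca_raw.components_[0],))
```

In the wine data, a column such as "Proline" (values in the hundreds or thousands) plays the role of the large feature.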

Categorical features

Features that contain text must be transformed before they can be fed to the algorithm.



In [7]:

    
import pandas as pd
df = pd.DataFrame([
            ['green', 'M', 10.1, 'class1'], 
            ['red', 'L', 13.5, 'class2'], 
            ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'prize', 'class label']
df









    Out[7]:

   color size  prize class label
0  green    M   10.1      class1
1    red    L   13.5      class2
2   blue   XL   15.3      class1



In [8]:

    
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
df['class label'] = LE.fit_transform(df['class label'])



In [9]:

    
df









    Out[9]:

   color size  prize  class label
0  green    M   10.1            0
1    red    L   13.5            1
2   blue   XL   15.3            0

Next comes the question of how to handle "color" and "size":

- "size": the values have a natural order, so the transformation must preserve that order.
- "color": depending on the type of algorithm used, we can either encode it like "class label" or "binarize" it, meaning that each possible value becomes its own column (containing 0 or 1).

Ordinal features



In [10]:

    
size_mapping = {
           'XL': 3,
           'L': 2,
           'M': 1}

df['size'] = df['size'].map(size_mapping)
df









    Out[10]:

   color  size  prize  class label
0  green     1   10.1            0
1    red     2   13.5            1
2   blue     3   15.3            0
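
If the original labels are needed again later (for instance to display results), the ordinal mapping can be reversed. A small self-contained sketch that rebuilds the same toy column; `inv_size_mapping` is a name introduced here for illustration:

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
sizes = pd.Series(['M', 'L', 'XL'])

# Forward mapping: labels -> ordered integers
encoded = sizes.map(size_mapping)

# Reverse dictionary: integers -> labels
inv_size_mapping = {v: k for k, v in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

print(encoded.tolist())   # [1, 2, 3]
print(decoded.tolist())   # ['M', 'L', 'XL']
```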

Nominal features



In [11]:

    
pd.get_dummies(df)









    Out[11]:

   size  prize  class label  color_blue  color_green  color_red
0     1   10.1            0           0            1          0
1     2   13.5            1           0            0          1
2     3   15.3            0           1            0          0

Understanding the metrics

For the Rossmann challenge, the metric is the Root Mean Square Percentage Error (RMSPE). Let's understand it through an example:



In [12]:

    
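# To compute RMSPE
def ToWeight(y):
    # Weight each row by 1 / y**2; rows where y == 0 keep a weight of 0
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w

def RMSPE(y, yhat):
    # y: reference values (used as the denominator), yhat: values compared to them
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean( w * (y - yhat)**2 ))
    return rmspe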



In [13]:

    
real_value = np.array([3500, 6400, 5000])   # The values I am trying to predict
prediction_1 = np.array([3000, 5900, 5500]) # fairly good
prediction_2 = np.array([2510, 4000, 8000]) # not great
prediction_3 = np.array([3400, 6490, 5070]) # excellent



In [14]:

    
# Note: RMSPE is defined as RMSPE(y, yhat). Here the prediction is passed first,
# so each error is measured relative to the prediction rather than the real value.
print("1 - Fairly good prediction: %s" % (RMSPE(prediction_1, real_value)))
print("2 - Bad prediction: %s" % (RMSPE(prediction_2, real_value)))
print("3 - Good prediction: %s" % (RMSPE(prediction_3, real_value)))


1 - Fairly good prediction: 0.120033446568
2 - Bad prediction: 0.467687202883
3 - Good prediction: 0.0203959495373

The lower the value, the more accurate the prediction.
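
Written out, the metric is RMSPE = sqrt(mean(((y - yhat) / y)^2)), where y is the true value and yhat the prediction; in the function above, rows with y = 0 get a weight of zero. Here is a tiny worked example with made-up numbers in which every prediction is off by exactly 10% of the true value (`y_true` and `y_pred` are names chosen for this sketch):

```python
import numpy as np

# Hypothetical values: each prediction is off by 10% of the true value
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 440.0])

# Relative error of each prediction, measured against the true value
pct_error = (y_true - y_pred) / y_true      # [-0.1, 0.1, -0.1]

# Square, average, take the square root
rmspe = np.sqrt(np.mean(pct_error ** 2))

print("percentage errors: %s" % (pct_error,))
print("RMSPE: %.4f" % rmspe)                # 0.1000
```

Because the errors are divided by the true value, a miss of 10 on a value of 100 counts as much as a miss of 40 on a value of 400.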

For the Homesite challenge, the metric is the Area Under the Curve (AUC), which measures the performance of a binary classifier. It ranges from 0 to 1, where 1 is a perfect classifier and 0.5 is no better than random guessing.



In [15]:

    
from sklearn.metrics import roc_auc_score, roc_curve, auc



In [16]:

    
real_value = np.array([0, 0, 1, 1])   # The values I am trying to predict
prediction_1 = np.array([0.1, 0.4, 0.3, 0.8]) # fairly good
prediction_2 = np.array([0.55, 0.4, 0.48, 0.35]) # not great
prediction_3 = np.array([0.1, 0.2, 0.8, 0.9]) # excellent



In [17]:

    
print("1 - Fairly good prediction: %s" % (roc_auc_score(real_value, prediction_1)))
print("2 - Bad prediction: %s" % (roc_auc_score(real_value, prediction_2)))
print("3 - Good prediction: %s" % (roc_auc_score(real_value, prediction_3)))


1 - Fairly good prediction: 0.75
2 - Bad prediction: 0.25
3 - Good prediction: 1.0
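
One way to read these scores: the AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count for one half). The sketch below recomputes the three values with plain NumPy; `pairwise_auc` is a helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / float(len(pos) * len(neg))

real_value = np.array([0, 0, 1, 1])
prediction_1 = np.array([0.1, 0.4, 0.3, 0.8])
prediction_2 = np.array([0.55, 0.4, 0.48, 0.35])
prediction_3 = np.array([0.1, 0.2, 0.8, 0.9])

for name, pred in [("1", prediction_1), ("2", prediction_2), ("3", prediction_3)]:
    print("prediction_%s: %s" % (name, pairwise_auc(real_value, pred)))
```

For prediction_1, three of the four (positive, negative) pairs are in the right order, hence 0.75.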

To explore this graphically, feel free to change the values of "real_value" and of the predictions. Note that "real_value" must contain the same number of values as each prediction.



In [18]:

    
def plot_roc(true_positive_rate, false_positive_rate, roc_auc, title=''):
    """
    To plot ROC curve
    """
    title = 'Receiver Operating Characteristic' + title
    plt.title(title)
    plt.plot(false_positive_rate, true_positive_rate, 'b',
             label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()



In [19]:

    
# List of all my predictions
my_predictions = [prediction_1, prediction_2, prediction_3]



In [20]:

    
# Plot one ROC curve per prediction
for i, prediction in enumerate(my_predictions, start=1):
    false_positive_rate, true_positive_rate, thresholds = roc_curve(real_value, prediction)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    plot_roc(true_positive_rate, false_positive_rate, roc_auc, title='_' + str(i))
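
To make the curve less of a black box, here is a rough sketch of how its points can be obtained: each distinct score is used as a threshold, and the true and false positive rates are computed for that threshold. `roc_points` is a hypothetical helper written for illustration; `roc_curve` from scikit-learn does the real work in the cells above.

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return (fpr, tpr) arrays, one point per threshold, highest threshold first."""
    thresholds = np.sort(np.unique(y_score))[::-1]
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]   # start at the origin: nothing is predicted positive
    for t in thresholds:
        predicted_pos = y_score >= t
        tpr.append(np.sum(predicted_pos & (y_true == 1)) / float(n_pos))
        fpr.append(np.sum(predicted_pos & (y_true == 0)) / float(n_neg))
    return np.array(fpr), np.array(tpr)

real_value = np.array([0, 0, 1, 1])
prediction_1 = np.array([0.1, 0.4, 0.3, 0.8])

fpr, tpr = roc_points(real_value, prediction_1)
# Area under the piecewise-linear curve (trapezoidal rule)
area = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)

print("FPR: %s" % (fpr,))
print("TPR: %s" % (tpr,))
print("AUC: %s" % area)
```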

Measuring your model's performance:



In [56]:

    
from sklearn.datasets import load_iris
# Get data
iris = load_iris()

# create X (features) and y (target)
X = iris.data
y = iris.target

print("X has %d rows" % (len(X)))
print("y has %d rows" % (len(y)))
X[0:5]


X has 150 rows
y has 150 rows






    Out[56]:





array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])



In [73]:

    
from sklearn.model_selection import train_test_split  # requires scikit-learn >= 0.18
# Split the dataset into train (learning phase) and test (validation phase)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=4)   # Feel free to change the random split

print("X_train has %d rows" % (len(X_train)))
print("y_train has %d rows" % (len(y_train)))
print("X_test has %d rows" % (len(X_test)))
print("y_test has %d rows" % (len(y_test)))


X_train has 112 rows
y_train has 112 rows
X_test has 38 rows
y_test has 38 rows



In [74]:

    
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)              # Train the model (on train)
y_pred = knn.predict(X_test)           # Predict on data the model has never seen (test)
print(accuracy_score(y_test, y_pred))  # Compare the predictions with the true labels









    



0.973684210526

The model's performance varies with the split (random_state). We depend far too much on which samples end up in the training set (train) and in the validation set (test).
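
A quick way to check this is to repeat the split with a few different seeds and compare the scores. This is only a sketch: it assumes scikit-learn 0.18 or later (for `sklearn.model_selection`), the five seeds are arbitrary, and the names ending in `_demo` are introduced here to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

scores_by_seed = {}
for seed in range(5):   # five arbitrary seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)
    knn_demo = KNeighborsClassifier(n_neighbors=3)
    knn_demo.fit(X_tr, y_tr)
    scores_by_seed[seed] = accuracy_score(y_te, knn_demo.predict(X_te))

for seed in sorted(scores_by_seed):
    print("random_state=%d -> accuracy %.3f" % (seed, scores_by_seed[seed]))
```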

With cross-validation, we make sure that every row is used both for training (train) and for validation (test):



In [75]:

    
from sklearn.model_selection import KFold  # requires scikit-learn >= 0.18

kf = KFold(n_splits=5, shuffle=False)
# print the contents of each training and testing set (25 dummy samples)
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf.split(np.arange(25)), start=1):
    # str() is needed because arrays do not accept a format spec such as ^25
    print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))









    



Iteration                   Training set observations                   Testing set observations
    1     [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [0 1 2 3 4]       
    2     [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [5 6 7 8 9]       
    3     [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]     [10 11 12 13 14]     
    4     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]     [15 16 17 18 19]     
    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]     [20 21 22 23 24]

Each element appears exactly once in the test phase.
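
The same partition can be rebuilt by hand, which makes the claim easy to check. This is only a sketch of unshuffled K-fold with NumPy, not scikit-learn's implementation:

```python
import numpy as np

n_samples, n_folds = 25, 5
indices = np.arange(n_samples)

# Split the indices into 5 consecutive blocks; each block is the test set once
test_folds = np.array_split(indices, n_folds)

for i, test_idx in enumerate(test_folds, start=1):
    # Training set = every index that is not in the current test block
    train_idx = np.setdiff1d(indices, test_idx)
    print("fold %d: %d train rows, test rows %s" % (i, len(train_idx), test_idx))

# Every index shows up in exactly one test fold
all_test = np.concatenate(test_folds)
```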

Using cross-validation



In [76]:

    
from sklearn.model_selection import cross_val_score  # requires scikit-learn >= 0.18

knn = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)









    



[ 1.          0.93333333  1.          0.93333333  0.86666667  1.
  0.93333333  1.          1.          1.        ]

We therefore get 10 different performance measurements.



In [78]:

    
print("The mean accuracy is %s with a standard deviation of %s" % (np.mean(scores), np.std(scores)))


The mean accuracy is 0.966666666667 with a standard deviation of 0.04472135955

Changing the algorithm's parameters:

In the vast majority of cases, an algorithm has several parameters. Changing these parameters affects the result:



In [27]:

    
import matplotlib.pyplot as plt
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions  # lived in mlxtend.evaluate in older mlxtend releases
from sklearn.ensemble import RandomForestClassifier



In [28]:

    
# Get Data
X, y = iris_data()
X = X[:,[0, 2]]



In [33]:

    
# To plot graph
for criterion in ['gini', 'entropy']:                # Criterion 
    for n_estimators in [3, 15]:                     # You can change this value if you want to play
        RFC = RandomForestClassifier(criterion=criterion, n_estimators=n_estimators, random_state=12)
        RFC.fit(X,y)
        title = criterion + " + " + str(n_estimators)
        fig = plot_decision_regions(X=X, y=y, clf=RFC, legend=2)
        plt.title(title)
        plt.show()
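
The plots give a visual impression; to put a number on the effect of each setting, one option is to cross-validate every combination. A minimal sketch, assuming scikit-learn 0.18 or later (for `sklearn.model_selection`) and using scikit-learn's own copy of the iris data, restricted to the same two column indices (assumed to match the mlxtend copy); the `_demo` names are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris_demo = load_iris()
X_demo = iris_demo.data[:, [0, 2]]   # columns 0 and 2, as in the cell above
y_demo = iris_demo.target

results = {}
for criterion in ['gini', 'entropy']:
    for n_estimators in [3, 15]:
        rfc = RandomForestClassifier(criterion=criterion,
                                     n_estimators=n_estimators,
                                     random_state=12)
        cv_scores = cross_val_score(rfc, X_demo, y_demo, cv=5, scoring='accuracy')
        results[(criterion, n_estimators)] = cv_scores.mean()

for params in sorted(results):
    print("%s + %d trees: mean accuracy %.3f" % (params[0], params[1], results[params]))
```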



Output of In [3], df[['Alcohol', 'Malic acid']].describe() (before scaling):

          Alcohol  Malic acid
count  178.000000  178.000000
mean    13.000618    2.336348
std      0.811827    1.117146
min     11.030000    0.740000
25%     12.362500    1.602500
50%     13.050000    1.865000
75%     13.677500    3.082500
max     14.830000    5.800000
