Data analysis and machine learning in Python

Notebook author: Jakub Nowacki.

Linear regression

Linear regression is one of the most basic, yet still widely used, types of regression. We will practice it on a sample dataset related to diabetes.


In [15]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
%matplotlib inline

plt.rcParams['figure.figsize'] = (10, 8)

# The dataset
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)


Diabetes dataset
================

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attributes:
    :Age:
    :Sex:
    :Body mass index:
    :Average blood pressure:
    :S1:
    :S2:
    :S3:
    :S4:
    :S5:
    :S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)


In [16]:
diabetes.keys()


Out[16]:
dict_keys(['data', 'target', 'DESCR', 'feature_names'])

In [17]:
diabetes.data


Out[17]:
array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])

In [18]:
diabetes.data.shape


Out[18]:
(442, 10)

In [19]:
diabetes.feature_names


Out[19]:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [20]:
diabetes.target


Out[20]:
array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 288.,  88., 292.,  71.,
       197., 186.,  25.,  84.,  96., 195.,  53., 217., 172., 131., 214.,
        59.,  70., 220., 268., 152.,  47.,  74., 295., 101., 151., 127.,
       237., 225.,  81., 151., 107.,  64., 138., 185., 265., 101., 137.,
       143., 141.,  79., 292., 178.,  91., 116.,  86., 122.,  72., 129.,
       142.,  90., 158.,  39., 196., 222., 277.,  99., 196., 202., 155.,
        77., 191.,  70.,  73.,  49.,  65., 263., 248., 296., 214., 185.,
        78.,  93., 252., 150.,  77., 208.,  77., 108., 160.,  53., 220.,
       154., 259.,  90., 246., 124.,  67.,  72., 257., 262., 275., 177.,
        71.,  47., 187., 125.,  78.,  51., 258., 215., 303., 243.,  91.,
       150., 310., 153., 346.,  63.,  89.,  50.,  39., 103., 308., 116.,
       145.,  74.,  45., 115., 264.,  87., 202., 127., 182., 241.,  66.,
        94., 283.,  64., 102., 200., 265.,  94., 230., 181., 156., 233.,
        60., 219.,  80.,  68., 332., 248.,  84., 200.,  55.,  85.,  89.,
        31., 129.,  83., 275.,  65., 198., 236., 253., 124.,  44., 172.,
       114., 142., 109., 180., 144., 163., 147.,  97., 220., 190., 109.,
       191., 122., 230., 242., 248., 249., 192., 131., 237.,  78., 135.,
       244., 199., 270., 164.,  72.,  96., 306.,  91., 214.,  95., 216.,
       263., 178., 113., 200., 139., 139.,  88., 148.,  88., 243.,  71.,
        77., 109., 272.,  60.,  54., 221.,  90., 311., 281., 182., 321.,
        58., 262., 206., 233., 242., 123., 167.,  63., 197.,  71., 168.,
       140., 217., 121., 235., 245.,  40.,  52., 104., 132.,  88.,  69.,
       219.,  72., 201., 110.,  51., 277.,  63., 118.,  69., 273., 258.,
        43., 198., 242., 232., 175.,  93., 168., 275., 293., 281.,  72.,
       140., 189., 181., 209., 136., 261., 113., 131., 174., 257.,  55.,
        84.,  42., 146., 212., 233.,  91., 111., 152., 120.,  67., 310.,
        94., 183.,  66., 173.,  72.,  49.,  64.,  48., 178., 104., 132.,
       220.,  57.])

To keep the example clear, let us use a single attribute for the regression.


In [21]:
diabetes_X = diabetes.data[:, np.newaxis, 2]  # extract as a column vector (not needed when there is more than one column)

# Split the data into training and test sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

In [22]:
diabetes_X_train[:5], diabetes_y_train[:5]


Out[22]:
(array([[ 0.06169621],
        [-0.05147406],
        [ 0.04445121],
        [-0.01159501],
        [-0.03638469]]), array([151.,  75., 141., 206., 135.]))

In [23]:
# Create the model object and train it
regr = linear_model.LinearRegression()

regr.fit(diabetes_X_train, diabetes_y_train)


Out[23]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now we can check whether the model learns well and how it predicts on the test data.


In [24]:
diabetes_y_pred = regr.predict(diabetes_X_test)
diabetes_y_pred


Out[24]:
array([225.9732401 , 115.74763374, 163.27610621, 114.73638965,
       120.80385422, 158.21988574, 236.08568105, 121.81509832,
        99.56772822, 123.83758651, 204.73711411,  96.53399594,
       154.17490936, 130.91629517,  83.3878227 , 171.36605897,
       137.99500384, 137.99500384, 189.56845268,  84.3990668 ])

We can now evaluate the quality of the model.


In [25]:
print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print('R2 (variance) score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))


Coefficients: 
 [938.23786125]
Mean squared error: 2548.07
R2 (variance) score: 0.47

Let us also plot our model's predictions.


In [26]:
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.scatter(diabetes_X_train, diabetes_y_train,  color='red')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.show()



In [27]:
diabetes_X = diabetes.data  # this time we take all the features
diabetes_X


Out[27]:
array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])

Exercise

  1. Use more variables to train the model; compare the regression quality metrics.
  2. Draw the regression line against other variables.
  3. ★ Which features influence the result the most? How can this be checked?
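One way to approach point 3 (a sketch, not the only method): since the features in this dataset are standardized (see the DESCR note above), the absolute magnitudes of the coefficients of a model fit on all columns give a rough ranking of feature influence.

```python
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()

# Fit a model on all 10 standardized features
reg = linear_model.LinearRegression()
reg.fit(diabetes.data, diabetes.target)

# Rank features by absolute coefficient size; on standardized inputs
# this is a rough proxy for each feature's influence on the prediction
ranking = sorted(zip(diabetes.feature_names, reg.coef_),
                 key=lambda p: abs(p[1]), reverse=True)
for name, coef in ranking:
    print('%4s % .1f' % (name, coef))
```

Note this is only a heuristic; correlated features can split or swap their coefficient weight.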

In [29]:
# Select the sex, bmi and bp columns; for a single column we would use
# diabetes_X = diabetes.data[:, np.newaxis, 2] to keep it a column vector
diabetes_X = diabetes.data[:, [1, 2, 3]]


# Split the data into training and test sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create the model object and train it
regr = linear_model.LinearRegression()

regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)


print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print('R2 (variance) score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))


plt.scatter(diabetes_X_test[:, 2], diabetes_y_test, color='black')
plt.scatter(diabetes_X_train[:, 2], diabetes_y_train, color='red')
plt.plot(diabetes_X_test[:, 2], diabetes_y_pred, color='blue', linewidth=3)
plt.show()


Coefficients: 
 [-96.87616507 780.09757364 432.26095788]
Mean squared error: 2510.21
R2 (variance) score: 0.48

Pandas

Let us return to Pandas and build the same model.


In [30]:
import pandas as pd

dia_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)\
    .assign(target=diabetes.target)
    
dia_df.head()


Out[30]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641 135.0

In [31]:
dia_train = dia_df.iloc[:-20, :]
dia_train.head(20)


Out[31]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641 135.0
5 -0.092695 -0.044642 -0.040696 -0.019442 -0.068991 -0.079288 0.041277 -0.076395 -0.041180 -0.096346 97.0
6 -0.045472 0.050680 -0.047163 -0.015999 -0.040096 -0.024800 0.000779 -0.039493 -0.062913 -0.038357 138.0
7 0.063504 0.050680 -0.001895 0.066630 0.090620 0.108914 0.022869 0.017703 -0.035817 0.003064 63.0
8 0.041708 0.050680 0.061696 -0.040099 -0.013953 0.006202 -0.028674 -0.002592 -0.014956 0.011349 110.0
9 -0.070900 -0.044642 0.039062 -0.033214 -0.012577 -0.034508 -0.024993 -0.002592 0.067736 -0.013504 310.0
10 -0.096328 -0.044642 -0.083808 0.008101 -0.103389 -0.090561 -0.013948 -0.076395 -0.062913 -0.034215 101.0
11 0.027178 0.050680 0.017506 -0.033214 -0.007073 0.045972 -0.065491 0.071210 -0.096433 -0.059067 69.0
12 0.016281 -0.044642 -0.028840 -0.009113 -0.004321 -0.009769 0.044958 -0.039493 -0.030751 -0.042499 179.0
13 0.005383 0.050680 -0.001895 0.008101 -0.004321 -0.015719 -0.002903 -0.002592 0.038393 -0.013504 185.0
14 0.045341 -0.044642 -0.025607 -0.012556 0.017694 -0.000061 0.081775 -0.039493 -0.031991 -0.075636 118.0
15 -0.052738 0.050680 -0.018062 0.080401 0.089244 0.107662 -0.039719 0.108111 0.036056 -0.042499 171.0
16 -0.005515 -0.044642 0.042296 0.049415 0.024574 -0.023861 0.074412 -0.039493 0.052280 0.027917 166.0
17 0.070769 0.050680 0.012117 0.056301 0.034206 0.049416 -0.039719 0.034309 0.027368 -0.001078 144.0
18 -0.038207 -0.044642 -0.010517 -0.036656 -0.037344 -0.019476 -0.028674 -0.002592 -0.018118 -0.017646 97.0
19 -0.027310 -0.044642 -0.018062 -0.040099 -0.002945 -0.011335 0.037595 -0.039493 -0.008944 -0.054925 168.0

In [32]:
dia_test = dia_df.iloc[-20:, :]
dia_test


Out[32]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target
422 -0.078165 0.050680 0.077863 0.052858 0.078236 0.064447 0.026550 -0.002592 0.040672 -0.009362 233.0
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854 0.108111 0.015567 -0.046641 91.0
424 0.001751 0.050680 0.011039 -0.019442 -0.016704 -0.003819 -0.047082 0.034309 0.024053 0.023775 111.0
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795 0.022869 -0.076395 -0.020289 -0.050783 152.0
426 0.030811 0.050680 -0.034229 0.043677 0.057597 0.068831 -0.032356 0.057557 0.035462 0.085907 120.0
427 -0.034575 0.050680 0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421 0.032059 67.0
428 0.048974 0.050680 0.088642 0.087287 0.035582 0.021546 -0.024993 0.034309 0.066048 0.131470 310.0
429 -0.041840 -0.044642 -0.033151 -0.022885 0.046589 0.041587 0.056003 -0.024733 -0.025952 -0.038357 94.0
430 -0.009147 -0.044642 -0.056863 -0.050428 0.021822 0.045345 -0.028674 0.034309 -0.009919 -0.017646 183.0
431 0.070769 0.050680 -0.030996 0.021872 -0.037344 -0.047034 0.033914 -0.039493 -0.014956 -0.001078 66.0
432 0.009016 -0.044642 0.055229 -0.005671 0.057597 0.044719 -0.002903 0.023239 0.055684 0.106617 173.0
433 -0.027310 -0.044642 -0.060097 -0.029771 0.046589 0.019980 0.122273 -0.039493 -0.051401 -0.009362 72.0
434 0.016281 -0.044642 0.001339 0.008101 0.005311 0.010899 0.030232 -0.039493 -0.045421 0.032059 49.0
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704 0.004636 -0.017629 -0.002592 -0.038459 -0.038357 64.0
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034 0.092820 -0.076395 -0.061177 -0.046641 48.0
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018118 0.044485 104.0
439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879 0.015491 132.0
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044528 -0.025930 220.0
441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816 -0.039493 -0.004220 0.003064 57.0

In [33]:
lr = linear_model.LinearRegression()
lr.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])


Out[33]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [34]:
dia_test = dia_test.assign(predict=lambda x: lr.predict(x[['age', 'sex', 'bmi']]))
dia_test


Out[34]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target predict
422 -0.078165 0.050680 0.077863 0.052858 0.078236 0.064447 0.026550 -0.002592 0.040672 -0.009362 233.0 211.071181
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854 0.108111 0.015567 -0.046641 91.0 116.261552
424 0.001751 0.050680 0.011039 -0.019442 -0.016704 -0.003819 -0.047082 0.034309 0.024053 0.023775 111.0 161.517691
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795 0.022869 -0.076395 -0.020289 -0.050783 152.0 105.886702
426 0.030811 0.050680 -0.034229 0.043677 0.057597 0.068831 -0.032356 0.057557 0.035462 0.085907 120.0 124.331705
427 -0.034575 0.050680 0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421 0.032059 67.0 151.351420
428 0.048974 0.050680 0.088642 0.087287 0.035582 0.021546 -0.024993 0.034309 0.066048 0.131470 310.0 239.264161
429 -0.041840 -0.044642 -0.033151 -0.022885 0.046589 0.041587 0.056003 -0.024733 -0.025952 -0.038357 94.0 118.023364
430 -0.009147 -0.044642 -0.056863 -0.050428 0.021822 0.045345 -0.028674 0.034309 -0.009919 -0.017646 183.0 101.065322
431 0.070769 0.050680 -0.030996 0.021872 -0.037344 -0.047034 0.033914 -0.039493 -0.014956 -0.001078 66.0 133.051614
432 0.009016 -0.044642 0.055229 -0.005671 0.057597 0.044719 -0.002903 0.023239 0.055684 0.106617 173.0 206.145820
433 -0.027310 -0.044642 -0.060097 -0.029771 0.046589 0.019980 0.122273 -0.039493 -0.051401 -0.009362 72.0 95.489589
434 0.016281 -0.044642 0.001339 0.008101 0.005311 0.010899 0.030232 -0.039493 -0.045421 0.032059 49.0 157.934094
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704 0.004636 -0.017629 -0.002592 -0.038459 -0.038357 64.0 131.082359
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034 0.092820 -0.076395 -0.061177 -0.046641 48.0 78.489811
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0 175.163578
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018118 0.044485 104.0 135.839740
439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879 0.015491 132.0 142.652120
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044528 -0.025930 220.0 183.507446
441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816 -0.039493 -0.004220 0.003064 57.0 81.047094

In [35]:
print('Coefficients: \n', lr.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(dia_test['target'], lr.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 (variance) score: %.2f' % r2_score(dia_test['target'], dia_test['predict']))


Coefficients: 
 [144.25978848 -33.43463042 914.07000914]
Mean squared error: 2585.66
R2 (variance) score: 0.46

In [37]:
import pandas as pd


def model(dataframe, features, target, test_size=20):
    """Train a linear regression on the given features and report test-set metrics."""

    # Split into training and test sets (the last `test_size` rows form the test set)
    train = dataframe.iloc[:-test_size, :]
    test = dataframe.iloc[-test_size:, :]

    lr = linear_model.LinearRegression()
    lr.fit(train[features], train[target])

    test = test.assign(predict=lambda x: lr.predict(x[features]))

    print('Coefficients: \n', lr.coef_)
    print("Mean squared error: %.2f" % mean_squared_error(test[target], test['predict']))
    print('R2 (variance) score: %.2f' % r2_score(test[target], test['predict']))

Exercise

  1. As above, experiment with the features.
  2. Automate the above experiment.
  3. ★ Are there any other parameters that could be tuned?
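Exercise 2 can be approached by wrapping the pipeline in a helper and looping over feature subsets; one possible sketch (the chosen subsets below are arbitrary):

```python
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names) \
    .assign(target=diabetes.target)

def evaluate(df, features, target='target', test_size=20):
    """Fit a linear regression on all but the last `test_size` rows
    and return the test-set MSE and R2."""
    train, test = df.iloc[:-test_size], df.iloc[-test_size:]
    lr = linear_model.LinearRegression()
    lr.fit(train[features], train[target])
    pred = lr.predict(test[features])
    return (mean_squared_error(test[target], pred),
            r2_score(test[target], pred))

for features in [['bmi'], ['age', 'sex', 'bmi'], ['bmi', 'bp', 's5']]:
    mse, r2 = evaluate(df, features)
    print('%-20s MSE=%.2f R2=%.2f' % (','.join(features), mse, r2))
```

With `['bmi']` alone this reproduces the single-feature result from earlier in the notebook (MSE 2548.07, R2 0.47), so the helper can be trusted as a baseline for further experiments.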

Linear regression with regularization

To select a model that generalizes well, regularization techniques are used. The two best-known techniques are Lasso, i.e. L1 regularization, and Ridge, i.e. L2 regularization. Below are examples of using these algorithms.
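The practical difference between the two penalties shows up directly in the coefficients: L1 (Lasso) tends to set some of them to exactly zero, while L2 (Ridge) only shrinks them towards zero. A minimal sketch on the full dataset, using the default alpha=1.0 for both:

```python
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Same default regularization strength (alpha=1.0) for both penalties
lasso = linear_model.Lasso().fit(X, y)
ridge = linear_model.Ridge().fit(X, y)

# L1 typically produces a sparse coefficient vector, L2 does not
print('Lasso zero coefficients:', sum(c == 0 for c in lasso.coef_))
print('Ridge zero coefficients:', sum(c == 0 for c in ridge.coef_))
```

This sparsity is why Lasso is also used as a feature-selection tool.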


In [38]:
ridge = linear_model.Ridge()
ridge.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])


Out[38]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [39]:
dia_test = dia_test.assign(predict=lambda x: ridge.predict(x[['age', 'sex', 'bmi']]))
dia_test


Out[39]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target predict
422 -0.078165 0.050680 0.077863 0.052858 0.078236 0.064447 0.026550 -0.002592 0.040672 -0.009362 233.0 179.626754
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854 0.108111 0.015567 -0.046641 91.0 136.525047
424 0.001751 0.050680 0.011039 -0.019442 -0.016704 -0.003819 -0.047082 0.034309 0.024053 0.023775 111.0 158.441800
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795 0.022869 -0.076395 -0.020289 -0.050783 152.0 126.104300
426 0.030811 0.050680 -0.034229 0.043677 0.057597 0.068831 -0.032356 0.057557 0.035462 0.085907 120.0 141.335891
427 -0.034575 0.050680 0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421 0.032059 67.0 152.034710
428 0.048974 0.050680 0.088642 0.087287 0.035582 0.021546 -0.024993 0.034309 0.066048 0.131470 310.0 198.426854
429 -0.041840 -0.044642 -0.033151 -0.022885 0.046589 0.041587 0.056003 -0.024733 -0.025952 -0.038357 94.0 133.477980
430 -0.009147 -0.044642 -0.056863 -0.050428 0.021822 0.045345 -0.028674 0.034309 -0.009919 -0.017646 183.0 126.437037
431 0.070769 0.050680 -0.030996 0.021872 -0.037344 -0.047034 0.033914 -0.039493 -0.014956 -0.001078 66.0 147.175451
432 0.009016 -0.044642 0.055229 -0.005671 0.057597 0.044719 -0.002903 0.023239 0.055684 0.106617 173.0 178.695047
433 -0.027310 -0.044642 -0.060097 -0.029771 0.046589 0.019980 0.122273 -0.039493 -0.051401 -0.009362 72.0 122.991844
434 0.016281 -0.044642 0.001339 0.008101 0.005311 0.010899 0.030232 -0.039493 -0.045421 0.032059 49.0 155.328408
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704 0.004636 -0.017629 -0.002592 -0.038459 -0.038357 64.0 141.020127
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034 0.092820 -0.076395 -0.061177 -0.046641 48.0 113.516516
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0 166.697836
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018118 0.044485 104.0 145.561296
439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879 0.015491 132.0 150.749094
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044528 -0.025930 220.0 165.459699
441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816 -0.039493 -0.004220 0.003064 57.0 115.196995

In [40]:
print('Coefficients: \n', ridge.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(dia_test['target'], ridge.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 (variance) score: %.2f' % r2_score(dia_test['target'], dia_test['predict']))


Coefficients: 
 [109.85742979   3.77646864 448.40398428]
Mean squared error: 3602.78
R2 (variance) score: 0.25

In [41]:
lasso = linear_model.Lasso()
lasso.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])


Out[41]:
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [42]:
dia_test = dia_test.assign(predict=lambda x: lasso.predict(x[['age', 'sex', 'bmi']]))
dia_test


Out[42]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target predict
422 -0.078165 0.050680 0.077863 0.052858 0.078236 0.064447 0.026550 -0.002592 0.040672 -0.009362 233.0 191.800029
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854 0.108111 0.015567 -0.046641 91.0 133.450577
424 0.001751 0.050680 0.011039 -0.019442 -0.016704 -0.003819 -0.047082 0.034309 0.024053 0.023775 111.0 158.610433
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795 0.022869 -0.076395 -0.020289 -0.050783 152.0 132.915261
426 0.030811 0.050680 -0.034229 0.043677 0.057597 0.068831 -0.032356 0.057557 0.035462 0.085907 120.0 136.127158
427 -0.034575 0.050680 0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421 0.032059 67.0 155.933852
428 0.048974 0.050680 0.088642 0.087287 0.035582 0.021546 -0.024993 0.034309 0.066048 0.131470 310.0 197.153190
429 -0.041840 -0.044642 -0.033151 -0.022885 0.046589 0.041587 0.056003 -0.024733 -0.025952 -0.038357 94.0 136.662474
430 -0.009147 -0.044642 -0.056863 -0.050428 0.021822 0.045345 -0.028674 0.034309 -0.009919 -0.017646 183.0 124.885520
431 0.070769 0.050680 -0.030996 0.021872 -0.037344 -0.047034 0.033914 -0.039493 -0.014956 -0.001078 66.0 137.733106
432 0.009016 -0.044642 0.055229 -0.005671 0.057597 0.044719 -0.002903 0.023239 0.055684 0.106617 173.0 180.558392
433 -0.027310 -0.044642 -0.060097 -0.029771 0.046589 0.019980 0.122273 -0.039493 -0.051401 -0.009362 72.0 123.279572
434 0.016281 -0.044642 0.001339 0.008101 0.005311 0.010899 0.030232 -0.039493 -0.045421 0.032059 49.0 153.792588
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704 0.004636 -0.017629 -0.002592 -0.038459 -0.038357 64.0 141.480318
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034 0.092820 -0.076395 -0.061177 -0.046641 48.0 116.320463
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0 162.892961
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018118 0.044485 104.0 145.227531
439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879 0.015491 132.0 145.227531
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044528 -0.025930 220.0 172.528651
441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816 -0.039493 -0.004220 0.003064 57.0 116.855779

In [ ]:
print('Coefficients: \n', lasso.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(dia_test['target'], lasso.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 (variance) score: %.2f' % r2_score(dia_test['target'], dia_test['predict']))

As can be seen, the metrics came out worse than for plain linear regression. This is because regularization has hyperparameters that need to be tuned to the problem at hand. For this purpose, variants with built-in cross-validation were created, which also select the hyperparameters.
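What LassoCV does internally can be roughly approximated by hand: cross-validate a Lasso model for each candidate alpha and keep the one with the lowest error. A minimal sketch (the alpha grid below is arbitrary, for illustration only; LassoCV derives its own grid):

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

best_alpha, best_mse = None, np.inf
for alpha in [0.001, 0.01, 0.1, 1.0]:   # hypothetical grid
    lasso = linear_model.Lasso(alpha=alpha)
    # 5-fold cross-validated MSE (sklearn reports it negated)
    mse = -cross_val_score(lasso, X, y, cv=5,
                           scoring='neg_mean_squared_error').mean()
    print('alpha=%.3f  CV MSE=%.2f' % (alpha, mse))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

print('best alpha:', best_alpha)
```

LassoCV does essentially this, but shares computation across the regularization path, so it is the preferred tool in practice.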


In [55]:
lasso = linear_model.LassoCV(cv=5)
lasso.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])


Out[55]:
LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)

In [52]:
dia_test = dia_test.assign(predict=lambda x: lasso.predict(x[['age', 'sex', 'bmi']]))
dia_test


Out[52]:
age sex bmi bp s1 s2 s3 s4 s5 s6 target predict
422 -0.078165 0.050680 0.077863 0.052858 0.078236 0.064447 0.026550 -0.002592 0.040672 -0.009362 233.0 213.025719
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854 0.108111 0.015567 -0.046641 91.0 118.673749
424 0.001751 0.050680 0.011039 -0.019442 -0.016704 -0.003819 -0.047082 0.034309 0.024053 0.023775 111.0 162.903582
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795 0.022869 -0.076395 -0.020289 -0.050783 152.0 107.559784
426 0.030811 0.050680 -0.034229 0.043677 0.057597 0.068831 -0.032356 0.057557 0.035462 0.085907 120.0 126.017832
427 -0.034575 0.050680 0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421 0.032059 67.0 153.860557
428 0.048974 0.050680 0.088642 0.087287 0.035582 0.021546 -0.024993 0.034309 0.066048 0.131470 310.0 237.482801
429 -0.041840 -0.044642 -0.033151 -0.022885 0.046589 0.041587 0.056003 -0.024733 -0.025952 -0.038357 94.0 118.521077
430 -0.009147 -0.044642 -0.056863 -0.050428 0.021822 0.045345 -0.028674 0.034309 -0.009919 -0.017646 183.0 101.242745
431 0.070769 0.050680 -0.030996 0.021872 -0.037344 -0.047034 0.033914 -0.039493 -0.014956 -0.001078 66.0 133.567324
432 0.009016 -0.044642 0.055229 -0.005671 0.057597 0.044719 -0.002903 0.023239 0.055684 0.106617 173.0 203.116372
433 -0.027310 -0.044642 -0.060097 -0.029771 0.046589 0.019980 0.122273 -0.039493 -0.051401 -0.009362 72.0 96.241665
434 0.016281 -0.044642 0.001339 0.008101 0.005311 0.010899 0.030232 -0.039493 -0.045421 0.032059 49.0 156.009136
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704 0.004636 -0.017629 -0.002592 -0.038459 -0.038357 64.0 130.551168
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034 0.092820 -0.076395 -0.061177 -0.046641 48.0 80.375038
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207 178.0 175.248745
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018118 0.044485 104.0 138.075758
439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879 0.015491 132.0 143.597318
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044528 -0.025930 220.0 182.358329
441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816 -0.039493 -0.004220 0.003064 57.0 82.608379

In [59]:
columns = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


lasso = linear_model.LassoCV(cv=5)
lasso.fit(dia_train[columns], dia_train['target'])
dia_test = dia_test.assign(predict=lambda x: lasso.predict(x[columns]))


print('Coefficients: \n', lasso.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(dia_test['target'], lasso.predict(dia_test[columns])))
print('R2 (variance) score: %.2f' % r2_score(dia_test['target'], dia_test['predict']))


Coefficients: 
 [   0.         -227.7909445   515.87541365  322.65588548 -410.39307487
  172.22093534  -69.15134823  134.40868857  594.35952058   75.48272645]
Mean squared error: 1990.67
R2 (variance) score: 0.59

In [53]:
print('Coefficients: \n', lasso.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(dia_test['target'], lasso.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 (variance) score: %.2f' % r2_score(dia_test['target'], dia_test['predict']))


Coefficients: 
 [ 1.16925238e+02 -4.07249379e-01  8.89889954e+02]
Mean squared error: 2616.15
R2 (variance) score: 0.46

Let us see what happens during the cross-validation process. For each data split, the algorithm computes an MSE curve as a function of the alpha parameter, as shown below.


In [54]:
plt.plot(-np.log10(lasso.alphas_), lasso.mse_path_, linestyle='--');
plt.plot(-np.log10(lasso.alphas_), lasso.mse_path_.mean(axis=1), 'k', linewidth=3);

plt.xlabel('$-\\log_{10}(\\alpha)$');
plt.ylabel('Mean Squared Error (MSE)');



In [63]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
# %matplotlib inline


diabetes = datasets.load_diabetes()
dataframe = pd.DataFrame(diabetes.data, columns=diabetes.feature_names).assign(target=diabetes.target)

dane_treningowe = dataframe.iloc[:-20, :]
dane_testowe = dataframe.iloc[-20:, :]

columns = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

model = linear_model.LassoCV(cv=5)
model.fit(dane_treningowe[columns], dane_treningowe['target'])
dane_testowe = dane_testowe.assign(predict=lambda x: model.predict(x[columns]))


print('Coefficients: \n', model.coef_)
print("Mean squared error: %.2f" % mean_squared_error(dane_testowe['target'], model.predict(dane_testowe[columns])))
print('R2 (variance) score: %.2f' % r2_score(dane_testowe['target'], dane_testowe['predict']))

Exercise

  1. Try other columns with LassoCV.
  2. Try different model parameters.
  3. Try using RidgeCV.
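For point 3, RidgeCV works analogously to LassoCV, except that it selects alpha from an explicitly supplied grid. A minimal sketch (the alpha grid here is arbitrary):

```python
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names) \
    .assign(target=diabetes.target)
train, test = df.iloc[:-20], df.iloc[-20:]
columns = diabetes.feature_names

# RidgeCV picks the best alpha from the given grid by cross-validation
ridge = linear_model.RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5)
ridge.fit(train[columns], train['target'])
pred = ridge.predict(test[columns])

print('chosen alpha:', ridge.alpha_)
print('Mean squared error: %.2f' % mean_squared_error(test['target'], pred))
print('R2 (variance) score: %.2f' % r2_score(test['target'], pred))
```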