# Data Analysis and Machine Learning in Python

Notebook author: Jakub Nowacki.

## Linear regression

Linear regression is one of the most basic, yet still widely used, types of regression. We will practice it on a sample dataset related to diabetes.
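Before running the cells below, the dataset has to be loaded from scikit-learn. A minimal sketch (the manual `[:-20]` / `[-20:]` split used later in this notebook can also be done with `train_test_split`; the variable names here are illustrative):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the diabetes dataset (10 standardized features, 442 patients)
diabetes = datasets.load_diabetes()

# An alternative to the manual [:-20] / [-20:] split used below:
# hold out 20 rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=20, random_state=0)

print(X_train.shape, X_test.shape)  # (422, 10) (20, 10)
```

A held-out test set is what lets us estimate how the model behaves on data it has not seen during fitting.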

``````

In [15]:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
%matplotlib inline

plt.rcParams['figure.figsize'] = (10, 8)

# Load the dataset
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)

``````
``````

Diabetes dataset
================

Notes
-----

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attributes:
:Age:
:Sex:
:Body mass index:
:Average blood pressure:
:S1:
:S2:
:S3:
:S4:
:S5:
:S6:

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

``````
``````

In [16]:

diabetes.keys()

``````
``````

Out[16]:

dict_keys(['data', 'target', 'DESCR', 'feature_names'])

``````
``````

In [17]:

diabetes.data

``````
``````

Out[17]:

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
0.01990842, -0.01764613],
[-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
-0.06832974, -0.09220405],
[ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
0.00286377, -0.02593034],
...,
[ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
-0.04687948,  0.01549073],
[-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
0.04452837, -0.02593034],
[-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
-0.00421986,  0.00306441]])

``````
``````

In [18]:

diabetes.data.shape

``````
``````

Out[18]:

(442, 10)

``````
``````

In [19]:

diabetes.feature_names

``````
``````

Out[19]:

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

``````
``````

In [20]:

diabetes.target

``````
``````

Out[20]:

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
60., 174., 259., 178., 128.,  96., 126., 288.,  88., 292.,  71.,
197., 186.,  25.,  84.,  96., 195.,  53., 217., 172., 131., 214.,
59.,  70., 220., 268., 152.,  47.,  74., 295., 101., 151., 127.,
237., 225.,  81., 151., 107.,  64., 138., 185., 265., 101., 137.,
143., 141.,  79., 292., 178.,  91., 116.,  86., 122.,  72., 129.,
142.,  90., 158.,  39., 196., 222., 277.,  99., 196., 202., 155.,
77., 191.,  70.,  73.,  49.,  65., 263., 248., 296., 214., 185.,
78.,  93., 252., 150.,  77., 208.,  77., 108., 160.,  53., 220.,
154., 259.,  90., 246., 124.,  67.,  72., 257., 262., 275., 177.,
71.,  47., 187., 125.,  78.,  51., 258., 215., 303., 243.,  91.,
150., 310., 153., 346.,  63.,  89.,  50.,  39., 103., 308., 116.,
145.,  74.,  45., 115., 264.,  87., 202., 127., 182., 241.,  66.,
94., 283.,  64., 102., 200., 265.,  94., 230., 181., 156., 233.,
60., 219.,  80.,  68., 332., 248.,  84., 200.,  55.,  85.,  89.,
31., 129.,  83., 275.,  65., 198., 236., 253., 124.,  44., 172.,
114., 142., 109., 180., 144., 163., 147.,  97., 220., 190., 109.,
191., 122., 230., 242., 248., 249., 192., 131., 237.,  78., 135.,
244., 199., 270., 164.,  72.,  96., 306.,  91., 214.,  95., 216.,
263., 178., 113., 200., 139., 139.,  88., 148.,  88., 243.,  71.,
77., 109., 272.,  60.,  54., 221.,  90., 311., 281., 182., 321.,
58., 262., 206., 233., 242., 123., 167.,  63., 197.,  71., 168.,
140., 217., 121., 235., 245.,  40.,  52., 104., 132.,  88.,  69.,
219.,  72., 201., 110.,  51., 277.,  63., 118.,  69., 273., 258.,
43., 198., 242., 232., 175.,  93., 168., 275., 293., 281.,  72.,
140., 189., 181., 209., 136., 261., 113., 131., 174., 257.,  55.,
84.,  42., 146., 212., 233.,  91., 111., 152., 120.,  67., 310.,
94., 183.,  66., 173.,  72.,  49.,  64.,  48., 178., 104., 132.,
220.,  57.])

``````

``````

In [21]:

diabetes_X = diabetes.data[:, np.newaxis, 2]  # extract column 2 as a column vector (not needed when using more than one column)

# Split the data into training and test sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

``````
``````

In [22]:

diabetes_X_train[:5], diabetes_y_train[:5]

``````
``````

Out[22]:

(array([[ 0.06169621],
[-0.05147406],
[ 0.04445121],
[-0.01159501],
[-0.03638469]]), array([151.,  75., 141., 206., 135.]))

``````
``````

In [23]:

# Create the model object and fit it
regr = linear_model.LinearRegression()

regr.fit(diabetes_X_train, diabetes_y_train)

``````
``````

Out[23]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

``````

Now we can check whether the model fits well and how it predicts on the test data.

``````

In [24]:

diabetes_y_pred = regr.predict(diabetes_X_test)
diabetes_y_pred

``````
``````

Out[24]:

array([225.9732401 , 115.74763374, 163.27610621, 114.73638965,
120.80385422, 158.21988574, 236.08568105, 121.81509832,
99.56772822, 123.83758651, 204.73711411,  96.53399594,
154.17490936, 130.91629517,  83.3878227 , 171.36605897,
137.99500384, 137.99500384, 189.56845268,  84.3990668 ])

``````

We can now evaluate the quality of the model.

``````

In [25]:

print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print('R2 score (explained variance): %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

``````
``````

Coefficients:
 [938.23786125]
Mean squared error: 2548.07
R2 score (explained variance): 0.47

``````

Let us also plot our model's predictions.

``````

In [26]:

plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.scatter(diabetes_X_train, diabetes_y_train,  color='red')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.show()

``````
``````

In [27]:

diabetes_X = diabetes.data  # this time use all of the features
diabetes_X

``````
``````

Out[27]:

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
0.01990842, -0.01764613],
[-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
-0.06832974, -0.09220405],
[ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
0.00286377, -0.02593034],
...,
[ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
-0.04687948,  0.01549073],
[-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
0.04452837, -0.02593034],
[-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
-0.00421986,  0.00306441]])

``````

1. Use more variables to train the model; compare the regression quality metrics.
2. Plot the regression line against other variables.
3. ★ Which features influence the result the most? How can you check?
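For exercise 3, one hedged first check: because this dataset's features are already mean-centered and scaled to the same norm, the absolute values of the fitted coefficients are directly comparable. A sketch (assuming the dataset is loaded as above):

```python
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
regr = linear_model.LinearRegression()
regr.fit(diabetes.data, diabetes.target)

# Rank features by absolute coefficient; this is a valid comparison here
# only because all features share the same scale after normalization
ranking = sorted(zip(diabetes.feature_names, regr.coef_),
                 key=lambda fc: abs(fc[1]), reverse=True)
for name, coef in ranking:
    print(f'{name}: {coef:.1f}')
```

On unscaled data the coefficients would mix units, and a technique such as permutation importance would be a safer choice.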
``````

In [29]:

# diabetes_X = diabetes.data[:, np.newaxis, 2]  # np.newaxis - extract as a column vector (not needed when using more than one column)
diabetes_X = diabetes.data[:, [1, 2, 3]]
# diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training and test sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create the model object and fit it
regr = linear_model.LinearRegression()

regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)

print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print('R2 score (explained variance): %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

plt.scatter(diabetes_X_test[:,2], diabetes_y_test,  color='black')
plt.scatter(diabetes_X_train[:,2], diabetes_y_train,  color='red')
plt.plot(diabetes_X_test[:,2], diabetes_y_pred, color='blue', linewidth=3)
plt.show()

``````
``````

Coefficients:
 [-96.87616507 780.09757364 432.26095788]
R2 score (explained variance): 0.48

``````

## Pandas

Let us go back to Pandas and build the same model.

``````

In [30]:

import pandas as pd

dia_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)\
.assign(target=diabetes.target)

dia_df.head()

``````
``````

Out[30]:

        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646   151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204    75.0
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930   141.0
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362   206.0
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641   135.0

``````
``````

In [31]:

dia_train = dia_df.iloc[:-20, :]
dia_train.head(20)

``````
``````

Out[31]:

         age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target
0   0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646   151.0
1  -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204    75.0
2   0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930   141.0
3  -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362   206.0
4   0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641   135.0
5  -0.092695 -0.044642 -0.040696 -0.019442 -0.068991 -0.079288  0.041277 -0.076395 -0.041180 -0.096346    97.0
6  -0.045472  0.050680 -0.047163 -0.015999 -0.040096 -0.024800  0.000779 -0.039493 -0.062913 -0.038357   138.0
7   0.063504  0.050680 -0.001895  0.066630  0.090620  0.108914  0.022869  0.017703 -0.035817  0.003064    63.0
8   0.041708  0.050680  0.061696 -0.040099 -0.013953  0.006202 -0.028674 -0.002592 -0.014956  0.011349   110.0
9  -0.070900 -0.044642  0.039062 -0.033214 -0.012577 -0.034508 -0.024993 -0.002592  0.067736 -0.013504   310.0
10 -0.096328 -0.044642 -0.083808  0.008101 -0.103389 -0.090561 -0.013948 -0.076395 -0.062913 -0.034215   101.0
11  0.027178  0.050680  0.017506 -0.033214 -0.007073  0.045972 -0.065491  0.071210 -0.096433 -0.059067    69.0
12  0.016281 -0.044642 -0.028840 -0.009113 -0.004321 -0.009769  0.044958 -0.039493 -0.030751 -0.042499   179.0
13  0.005383  0.050680 -0.001895  0.008101 -0.004321 -0.015719 -0.002903 -0.002592  0.038393 -0.013504   185.0
14  0.045341 -0.044642 -0.025607 -0.012556  0.017694 -0.000061  0.081775 -0.039493 -0.031991 -0.075636   118.0
15 -0.052738  0.050680 -0.018062  0.080401  0.089244  0.107662 -0.039719  0.108111  0.036056 -0.042499   171.0
16 -0.005515 -0.044642  0.042296  0.049415  0.024574 -0.023861  0.074412 -0.039493  0.052280  0.027917   166.0
17  0.070769  0.050680  0.012117  0.056301  0.034206  0.049416 -0.039719  0.034309  0.027368 -0.001078   144.0
18 -0.038207 -0.044642 -0.010517 -0.036656 -0.037344 -0.019476 -0.028674 -0.002592 -0.018118 -0.017646    97.0
19 -0.027310 -0.044642 -0.018062 -0.040099 -0.002945 -0.011335  0.037595 -0.039493 -0.008944 -0.054925   168.0

``````
``````

In [32]:

dia_test = dia_df.iloc[-20:, :]
dia_test

``````
``````

Out[32]:

          age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target
422 -0.078165  0.050680  0.077863  0.052858  0.078236  0.064447  0.026550 -0.002592  0.040672 -0.009362   233.0
423  0.009016  0.050680 -0.039618  0.028758  0.038334  0.073529 -0.072854  0.108111  0.015567 -0.046641    91.0
424  0.001751  0.050680  0.011039 -0.019442 -0.016704 -0.003819 -0.047082  0.034309  0.024053  0.023775   111.0
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795  0.022869 -0.076395 -0.020289 -0.050783   152.0
426  0.030811  0.050680 -0.034229  0.043677  0.057597  0.068831 -0.032356  0.057557  0.035462  0.085907   120.0
427 -0.034575  0.050680  0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421  0.032059    67.0
428  0.048974  0.050680  0.088642  0.087287  0.035582  0.021546 -0.024993  0.034309  0.066048  0.131470   310.0
429 -0.041840 -0.044642 -0.033151 -0.022885  0.046589  0.041587  0.056003 -0.024733 -0.025952 -0.038357    94.0
430 -0.009147 -0.044642 -0.056863 -0.050428  0.021822  0.045345 -0.028674  0.034309 -0.009919 -0.017646   183.0
431  0.070769  0.050680 -0.030996  0.021872 -0.037344 -0.047034  0.033914 -0.039493 -0.014956 -0.001078    66.0
432  0.009016 -0.044642  0.055229 -0.005671  0.057597  0.044719 -0.002903  0.023239  0.055684  0.106617   173.0
433 -0.027310 -0.044642 -0.060097 -0.029771  0.046589  0.019980  0.122273 -0.039493 -0.051401 -0.009362    72.0
434  0.016281 -0.044642  0.001339  0.008101  0.005311  0.010899  0.030232 -0.039493 -0.045421  0.032059    49.0
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704  0.004636 -0.017629 -0.002592 -0.038459 -0.038357    64.0
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034  0.092820 -0.076395 -0.061177 -0.046641    48.0
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674 -0.002592  0.031193  0.007207   178.0
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674  0.034309 -0.018118  0.044485   104.0
439  0.041708  0.050680 -0.015906  0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879  0.015491   132.0
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674  0.026560  0.044528 -0.025930   220.0
441 -0.045472 -0.044642 -0.073030 -0.081414  0.083740  0.027809  0.173816 -0.039493 -0.004220  0.003064    57.0

``````
``````

In [33]:

lr = linear_model.LinearRegression()
lr.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])

``````
``````

Out[33]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

``````
``````

In [34]:

dia_test = dia_test.assign(predict=lambda x: lr.predict(x[['age', 'sex', 'bmi']]))
dia_test

``````
``````

Out[34]:

          age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target     predict
422 -0.078165  0.050680  0.077863  0.052858  0.078236  0.064447  0.026550 -0.002592  0.040672 -0.009362   233.0  211.071181
423  0.009016  0.050680 -0.039618  0.028758  0.038334  0.073529 -0.072854  0.108111  0.015567 -0.046641    91.0  116.261552
424  0.001751  0.050680  0.011039 -0.019442 -0.016704 -0.003819 -0.047082  0.034309  0.024053  0.023775   111.0  161.517691
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795  0.022869 -0.076395 -0.020289 -0.050783   152.0  105.886702
426  0.030811  0.050680 -0.034229  0.043677  0.057597  0.068831 -0.032356  0.057557  0.035462  0.085907   120.0  124.331705
427 -0.034575  0.050680  0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421  0.032059    67.0  151.351420
428  0.048974  0.050680  0.088642  0.087287  0.035582  0.021546 -0.024993  0.034309  0.066048  0.131470   310.0  239.264161
429 -0.041840 -0.044642 -0.033151 -0.022885  0.046589  0.041587  0.056003 -0.024733 -0.025952 -0.038357    94.0  118.023364
430 -0.009147 -0.044642 -0.056863 -0.050428  0.021822  0.045345 -0.028674  0.034309 -0.009919 -0.017646   183.0  101.065322
431  0.070769  0.050680 -0.030996  0.021872 -0.037344 -0.047034  0.033914 -0.039493 -0.014956 -0.001078    66.0  133.051614
432  0.009016 -0.044642  0.055229 -0.005671  0.057597  0.044719 -0.002903  0.023239  0.055684  0.106617   173.0  206.145820
433 -0.027310 -0.044642 -0.060097 -0.029771  0.046589  0.019980  0.122273 -0.039493 -0.051401 -0.009362    72.0   95.489589
434  0.016281 -0.044642  0.001339  0.008101  0.005311  0.010899  0.030232 -0.039493 -0.045421  0.032059    49.0  157.934094
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704  0.004636 -0.017629 -0.002592 -0.038459 -0.038357    64.0  131.082359
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034  0.092820 -0.076395 -0.061177 -0.046641    48.0   78.489811
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674 -0.002592  0.031193  0.007207   178.0  175.163578
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674  0.034309 -0.018118  0.044485   104.0  135.839740
439  0.041708  0.050680 -0.015906  0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879  0.015491   132.0  142.652120
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674  0.026560  0.044528 -0.025930   220.0  183.507446
441 -0.045472 -0.044642 -0.073030 -0.081414  0.083740  0.027809  0.173816 -0.039493 -0.004220  0.003064    57.0   81.047094

``````
``````

In [35]:

print('Coefficients: \n', lr.coef_)
print('Mean squared error: %.2f' % mean_squared_error(dia_test['target'], lr.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 score (explained variance): %.2f' % r2_score(dia_test['target'], dia_test['predict']))

``````
``````

Coefficients:
 [144.25978848 -33.43463042 914.07000914]
R2 score (explained variance): 0.46

``````
``````

In [37]:

import pandas as pd

def model(dataframe, features, target, n_test=20):

    # Split the data into training and test sets (last n_test rows for testing)
    dia_train = dataframe.iloc[:-n_test, :]
    dia_test = dataframe.iloc[-n_test:, :]

    lr = linear_model.LinearRegression()
    lr.fit(dia_train[features], dia_train[target])

    dia_test = dia_test.assign(predict=lambda x: lr.predict(x[features]))

    print('Coefficients: \n', lr.coef_)
    print('Mean squared error: %.2f' % mean_squared_error(dia_test[target], dia_test['predict']))
    print('R2 score (explained variance): %.2f' % r2_score(dia_test[target], dia_test['predict']))

``````

1. As above, experiment with the features.
2. Automate the experiment above.
3. ★ Are there any other parameters that can be tuned?
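Exercise 2 can be automated by looping over feature subsets and scoring each one; a minimal sketch (the helper `evaluate` is illustrative, not part of the notebook):

```python
from itertools import combinations

import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score

diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names).assign(target=diabetes.target)
train, test = df.iloc[:-20], df.iloc[-20:]

def evaluate(features):
    # Fit on the training split and score R2 on the held-out rows
    lr = linear_model.LinearRegression()
    lr.fit(train[features], train['target'])
    return r2_score(test['target'], lr.predict(test[features]))

# Compare every pair of features and report the best-scoring one
pairs = combinations(diabetes.feature_names, 2)
best = max(pairs, key=lambda f: evaluate(list(f)))
print(best, round(evaluate(list(best)), 2))
```

The same loop extends to triples or larger subsets, at the cost of combinatorially more fits.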

## Linear regression with regularization

To select a model that generalizes well, regularization techniques are used. The two best-known ones are Lasso, i.e. L1 regularization, and Ridge, i.e. L2 regularization. Examples of using these algorithms follow below.
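Both penalties are controlled by an `alpha` parameter; the larger it is, the stronger the shrinkage of the coefficients. A small sketch of the effect for Lasso (which can zero coefficients out entirely), assuming scikit-learn's diabetes data:

```python
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Stronger L1 penalty -> more coefficients driven exactly to zero
for alpha in (0.001, 0.1, 1.0):
    lasso = linear_model.Lasso(alpha=alpha)
    lasso.fit(X, y)
    n_zero = (lasso.coef_ == 0).sum()
    print(f'alpha={alpha}: {n_zero} coefficients zeroed out')
```

Ridge shrinks coefficients toward zero too, but smoothly, without setting them exactly to zero.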

``````

In [38]:

ridge = linear_model.Ridge()
ridge.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])

``````
``````

Out[38]:

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)

``````
``````

In [39]:

dia_test = dia_test.assign(predict=lambda x: ridge.predict(x[['age', 'sex', 'bmi']]))
dia_test

``````
``````

Out[39]:

          age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target     predict
422 -0.078165  0.050680  0.077863  0.052858  0.078236  0.064447  0.026550 -0.002592  0.040672 -0.009362   233.0  179.626754
423  0.009016  0.050680 -0.039618  0.028758  0.038334  0.073529 -0.072854  0.108111  0.015567 -0.046641    91.0  136.525047
424  0.001751  0.050680  0.011039 -0.019442 -0.016704 -0.003819 -0.047082  0.034309  0.024053  0.023775   111.0  158.441800
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795  0.022869 -0.076395 -0.020289 -0.050783   152.0  126.104300
426  0.030811  0.050680 -0.034229  0.043677  0.057597  0.068831 -0.032356  0.057557  0.035462  0.085907   120.0  141.335891
427 -0.034575  0.050680  0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421  0.032059    67.0  152.034710
428  0.048974  0.050680  0.088642  0.087287  0.035582  0.021546 -0.024993  0.034309  0.066048  0.131470   310.0  198.426854
429 -0.041840 -0.044642 -0.033151 -0.022885  0.046589  0.041587  0.056003 -0.024733 -0.025952 -0.038357    94.0  133.477980
430 -0.009147 -0.044642 -0.056863 -0.050428  0.021822  0.045345 -0.028674  0.034309 -0.009919 -0.017646   183.0  126.437037
431  0.070769  0.050680 -0.030996  0.021872 -0.037344 -0.047034  0.033914 -0.039493 -0.014956 -0.001078    66.0  147.175451
432  0.009016 -0.044642  0.055229 -0.005671  0.057597  0.044719 -0.002903  0.023239  0.055684  0.106617   173.0  178.695047
433 -0.027310 -0.044642 -0.060097 -0.029771  0.046589  0.019980  0.122273 -0.039493 -0.051401 -0.009362    72.0  122.991844
434  0.016281 -0.044642  0.001339  0.008101  0.005311  0.010899  0.030232 -0.039493 -0.045421  0.032059    49.0  155.328408
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704  0.004636 -0.017629 -0.002592 -0.038459 -0.038357    64.0  141.020127
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034  0.092820 -0.076395 -0.061177 -0.046641    48.0  113.516516
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674 -0.002592  0.031193  0.007207   178.0  166.697836
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674  0.034309 -0.018118  0.044485   104.0  145.561296
439  0.041708  0.050680 -0.015906  0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879  0.015491   132.0  150.749094
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674  0.026560  0.044528 -0.025930   220.0  165.459699
441 -0.045472 -0.044642 -0.073030 -0.081414  0.083740  0.027809  0.173816 -0.039493 -0.004220  0.003064    57.0  115.196995

``````
``````

In [40]:

print('Coefficients: \n', ridge.coef_)
print('Mean squared error: %.2f' % mean_squared_error(dia_test['target'], ridge.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 score (explained variance): %.2f' % r2_score(dia_test['target'], dia_test['predict']))

``````
``````

Coefficients:
 [109.85742979   3.77646864 448.40398428]
R2 score (explained variance): 0.25

``````
``````

In [41]:

lasso = linear_model.Lasso()
lasso.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])

``````
``````

Out[41]:

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)

``````
``````

In [42]:

dia_test = dia_test.assign(predict=lambda x: lasso.predict(x[['age', 'sex', 'bmi']]))
dia_test

``````
``````

Out[42]:

          age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target     predict
422 -0.078165  0.050680  0.077863  0.052858  0.078236  0.064447  0.026550 -0.002592  0.040672 -0.009362   233.0  191.800029
423  0.009016  0.050680 -0.039618  0.028758  0.038334  0.073529 -0.072854  0.108111  0.015567 -0.046641    91.0  133.450577
424  0.001751  0.050680  0.011039 -0.019442 -0.016704 -0.003819 -0.047082  0.034309  0.024053  0.023775   111.0  158.610433
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795  0.022869 -0.076395 -0.020289 -0.050783   152.0  132.915261
426  0.030811  0.050680 -0.034229  0.043677  0.057597  0.068831 -0.032356  0.057557  0.035462  0.085907   120.0  136.127158
427 -0.034575  0.050680  0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421  0.032059    67.0  155.933852
428  0.048974  0.050680  0.088642  0.087287  0.035582  0.021546 -0.024993  0.034309  0.066048  0.131470   310.0  197.153190
429 -0.041840 -0.044642 -0.033151 -0.022885  0.046589  0.041587  0.056003 -0.024733 -0.025952 -0.038357    94.0  136.662474
430 -0.009147 -0.044642 -0.056863 -0.050428  0.021822  0.045345 -0.028674  0.034309 -0.009919 -0.017646   183.0  124.885520
431  0.070769  0.050680 -0.030996  0.021872 -0.037344 -0.047034  0.033914 -0.039493 -0.014956 -0.001078    66.0  137.733106
432  0.009016 -0.044642  0.055229 -0.005671  0.057597  0.044719 -0.002903  0.023239  0.055684  0.106617   173.0  180.558392
433 -0.027310 -0.044642 -0.060097 -0.029771  0.046589  0.019980  0.122273 -0.039493 -0.051401 -0.009362    72.0  123.279572
434  0.016281 -0.044642  0.001339  0.008101  0.005311  0.010899  0.030232 -0.039493 -0.045421  0.032059    49.0  153.792588
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704  0.004636 -0.017629 -0.002592 -0.038459 -0.038357    64.0  141.480318
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034  0.092820 -0.076395 -0.061177 -0.046641    48.0  116.320463
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674 -0.002592  0.031193  0.007207   178.0  162.892961
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674  0.034309 -0.018118  0.044485   104.0  145.227531
439  0.041708  0.050680 -0.015906  0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879  0.015491   132.0  145.227531
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674  0.026560  0.044528 -0.025930   220.0  172.528651
441 -0.045472 -0.044642 -0.073030 -0.081414  0.083740  0.027809  0.173816 -0.039493 -0.004220  0.003064    57.0  116.855779

``````
``````

In [ ]:

print('Coefficients: \n', lasso.coef_)
print('Mean squared error: %.2f' % mean_squared_error(dia_test['target'], lasso.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 score (explained variance): %.2f' % r2_score(dia_test['target'], dia_test['predict']))

``````

As you can see, the metrics came out worse than for plain linear regression. This is because the regularizers have hyperparameters that need to be tuned to the problem. For this purpose, variants with built-in cross-validation were created, which also select the hyperparameters.
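Ridge has an analogous cross-validated variant, `RidgeCV`; a minimal sketch over a small alpha grid (the grid values are illustrative):

```python
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]

# RidgeCV scores every candidate alpha with cross-validation
# and keeps the best one in the alpha_ attribute
ridge = linear_model.RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)
ridge.fit(X_train, y_train)
print('chosen alpha:', ridge.alpha_)
```

The same pattern, with `LassoCV`, is used in the cells below.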

``````

In [55]:

lasso = linear_model.LassoCV(cv=5)
lasso.fit(dia_train[['age', 'sex', 'bmi']], dia_train['target'])

``````
``````

Out[55]:

LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)

``````
``````

In [52]:

dia_test = dia_test.assign(predict=lambda x: lasso.predict(x[['age', 'sex', 'bmi']]))
dia_test

``````
``````

Out[52]:

          age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target     predict
422 -0.078165  0.050680  0.077863  0.052858  0.078236  0.064447  0.026550 -0.002592  0.040672 -0.009362   233.0  213.025719
423  0.009016  0.050680 -0.039618  0.028758  0.038334  0.073529 -0.072854  0.108111  0.015567 -0.046641    91.0  118.673749
424  0.001751  0.050680  0.011039 -0.019442 -0.016704 -0.003819 -0.047082  0.034309  0.024053  0.023775   111.0  162.903582
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795  0.022869 -0.076395 -0.020289 -0.050783   152.0  107.559784
426  0.030811  0.050680 -0.034229  0.043677  0.057597  0.068831 -0.032356  0.057557  0.035462  0.085907   120.0  126.017832
427 -0.034575  0.050680  0.005650 -0.005671 -0.073119 -0.062691 -0.006584 -0.039493 -0.045421  0.032059    67.0  153.860557
428  0.048974  0.050680  0.088642  0.087287  0.035582  0.021546 -0.024993  0.034309  0.066048  0.131470   310.0  237.482801
429 -0.041840 -0.044642 -0.033151 -0.022885  0.046589  0.041587  0.056003 -0.024733 -0.025952 -0.038357    94.0  118.521077
430 -0.009147 -0.044642 -0.056863 -0.050428  0.021822  0.045345 -0.028674  0.034309 -0.009919 -0.017646   183.0  101.242745
431  0.070769  0.050680 -0.030996  0.021872 -0.037344 -0.047034  0.033914 -0.039493 -0.014956 -0.001078    66.0  133.567324
432  0.009016 -0.044642  0.055229 -0.005671  0.057597  0.044719 -0.002903  0.023239  0.055684  0.106617   173.0  203.116372
433 -0.027310 -0.044642 -0.060097 -0.029771  0.046589  0.019980  0.122273 -0.039493 -0.051401 -0.009362    72.0   96.241665
434  0.016281 -0.044642  0.001339  0.008101  0.005311  0.010899  0.030232 -0.039493 -0.045421  0.032059    49.0  156.009136
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704  0.004636 -0.017629 -0.002592 -0.038459 -0.038357    64.0  130.551168
436 -0.056370 -0.044642 -0.074108 -0.050428 -0.024960 -0.047034  0.092820 -0.076395 -0.061177 -0.046641    48.0   80.375038
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674 -0.002592  0.031193  0.007207   178.0  175.248745
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674  0.034309 -0.018118  0.044485   104.0  138.075758
439  0.041708  0.050680 -0.015906  0.017282 -0.037344 -0.013840 -0.024993 -0.011080 -0.046879  0.015491   132.0  143.597318
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674  0.026560  0.044528 -0.025930   220.0  182.358329
441 -0.045472 -0.044642 -0.073030 -0.081414  0.083740  0.027809  0.173816 -0.039493 -0.004220  0.003064    57.0   82.608379

``````
``````

In [59]:

columns = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

lasso = linear_model.LassoCV(cv=5)
lasso.fit(dia_train[columns], dia_train['target'])
dia_test = dia_test.assign(predict=lambda x: lasso.predict(x[columns]))

print('Coefficients: \n', lasso.coef_)
print('Mean squared error: %.2f' % mean_squared_error(dia_test['target'], lasso.predict(dia_test[columns])))
print('R2 score (explained variance): %.2f' % r2_score(dia_test['target'], dia_test['predict']))

``````
``````

Coefficients:
 [   0.         -227.7909445   515.87541365  322.65588548 -410.39307487
  172.22093534  -69.15134823  134.40868857  594.35952058   75.48272645]
R2 score (explained variance): 0.59

``````
``````

In [53]:

print('Coefficients: \n', lasso.coef_)
print('Mean squared error: %.2f' % mean_squared_error(dia_test['target'], lasso.predict(dia_test[['age', 'sex', 'bmi']])))
print('R2 score (explained variance): %.2f' % r2_score(dia_test['target'], dia_test['predict']))

``````
``````

Coefficients:
 [ 1.16925238e+02 -4.07249379e-01  8.89889954e+02]
R2 score (explained variance): 0.46

``````

Let's see what happens during the cross-validation process. For each split of the data, the algorithm computes an MSE curve as a function of the alpha parameter, as shown below.
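These quantities are exposed as attributes of the fitted `LassoCV` object; a minimal sketch (refitting on the same training split):

```python
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
lasso = linear_model.LassoCV(cv=5)
lasso.fit(diabetes.data[:-20], diabetes.target[:-20])

# 100 candidate alphas by default, one MSE column per CV fold
print(lasso.alphas_.shape, lasso.mse_path_.shape)  # (100,) (100, 5)
print('selected alpha:', lasso.alpha_)
```

`mse_path_` holds one MSE curve per fold; its mean over folds is the thick black line in the plot below.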

``````

In [54]:

plt.plot(-np.log10(lasso.alphas_), lasso.mse_path_, linestyle='--');
plt.plot(-np.log10(lasso.alphas_), lasso.mse_path_.mean(axis=1), 'k', linewidth=3);

plt.xlabel('$-log_{10}(alpha)$');
plt.ylabel('Mean Square Error (MSE)');

``````
``````

In [63]:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
# %matplotlib inline

dataframe = pd.DataFrame(diabetes.data, columns=diabetes.feature_names).assign(target=diabetes.target)

dane_treningowe = dataframe.iloc[:-20, :]
dane_testowe = dataframe.iloc[-20:, :]

columns = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

model = linear_model.LassoCV(cv=5)
model.fit(dane_treningowe[columns], dane_treningowe['target'])
# Predict on the test rows, not the training rows
dane_testowe = dane_testowe.assign(predict=lambda x: model.predict(x[columns]))

print('Coefficients: \n', model.coef_)
print('Mean squared error: %.2f' % mean_squared_error(dane_testowe['target'], dane_testowe['predict']))
print('R2 score (explained variance): %.2f' % r2_score(dane_testowe['target'], dane_testowe['predict']))

``````