Regression

Example - The Boston house-prices problem


In [1]:
from sklearn.datasets import load_boston
boston = load_boston()
print boston.DESCR


Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


In [2]:
boston.data.shape


Out[2]:
(506, 13)

In [3]:
num_samples = boston.data.shape[0]
num_samples


Out[3]:
506

In [4]:
boston.data[0]


Out[4]:
array([  6.32000000e-03,   1.80000000e+01,   2.31000000e+00,
         0.00000000e+00,   5.38000000e-01,   6.57500000e+00,
         6.52000000e+01,   4.09000000e+00,   1.00000000e+00,
         2.96000000e+02,   1.53000000e+01,   3.96900000e+02,
         4.98000000e+00])

In [5]:
boston.target.shape


Out[5]:
(506,)

Average number of rooms (RM) vs. price
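
The plotting and numeric helpers used below (figsize, scatter, plot, xlabel, mean, ...) assume the notebook was started in pylab mode, which pulls NumPy and matplotlib into the global namespace. A minimal setup sketch for an IPython notebook of this era:

%pylab inline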


In [6]:
figsize(12,8)

In [7]:
scatter(boston.data[:,5], boston.target)
xlabel('RM (average number of rooms)')
ylabel('Median value (in US$ 1,000)')


Out[7]:
<matplotlib.text.Text at 0x7f8a59970a90>

Producing training and test sets


In [8]:
from sklearn.cross_validation import ShuffleSplit

In [9]:
ssplit = ShuffleSplit(num_samples, n_iter=1, test_size=0.25)

In [10]:
for train_idx, test_idx in ssplit:
    pass
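
As an aside, scikit-learn also offers a one-call shortcut for this split. A sketch, assuming this older scikit-learn where the utility lives in sklearn.cross_validation (newer releases moved it to sklearn.model_selection):

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.25)

Here we keep the explicit index arrays instead, since we will reuse them to slice out individual columns.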

In [11]:
train_idx


Out[11]:
array([ 15, 362, 303, 100, 371, 389, 327,  44, 471, 274, 287,  52, 305,
       410, 318, 230, 504, 169, 302, 189, 282, 211, 124,  39, 475, 481,
       236,  30, 441, 454, 496, 155, 498, 505, 372,  49,  38, 117, 323,
       386,  51, 262, 192, 113, 103, 123, 110, 417,  59, 349,   2, 220,
       324, 258, 499,  25, 494,  23, 253, 404, 308,  19,  21, 359, 167,
       298, 387, 411, 191,  63,  62, 447, 421, 442, 229, 135, 108, 235,
        46,  56, 202, 128, 415, 289, 458, 140, 444, 428, 422, 381, 182,
        76, 373, 172, 457, 166, 150, 132, 459, 337, 463, 204, 119,   3,
       319, 143,  67, 352, 116, 495,  88, 228, 351, 254,  48, 249, 147,
       503,  10, 353, 478, 345, 174, 398, 357, 339,  28, 433, 227, 111,
        92, 440, 238,  85,  12, 199, 179,  41, 165, 134, 129, 177,  58,
       270, 267, 322, 115,   0, 354, 244, 142, 106, 154, 466,  16, 137,
       181, 401, 163,  45, 431, 291, 383, 489, 316,  33, 178, 286, 183,
       131, 206, 271, 184, 394, 449, 355, 365, 311, 414,  36, 120, 473,
       393, 399, 257,  40, 451,   9, 295, 151, 233, 278, 133, 336, 213,
       460, 424, 239, 375, 112, 173, 201, 209,  66, 138,  14,  73, 197,
       332, 397,   1, 141, 130, 148,  70, 377, 118, 122, 176, 250, 476,
       380, 500, 180, 306, 296, 265,  11, 162, 310, 114,  75, 472, 315,
       403,  20, 405, 136, 156, 486, 335, 292, 255,  80, 369,  96, 146,
       231, 175, 437, 284, 341,  50, 224, 426, 453, 370, 275,   5, 264,
       126, 186, 443,  81, 317,  84, 125, 309, 205,  90, 497, 203, 465,
        89, 320,  22, 217, 107, 139, 358,  97, 241, 427, 487, 400, 492,
       218, 348, 490, 269, 346,   7, 280, 430, 501, 225,   8, 196, 429,
       456, 188, 366, 251, 402, 468,  61, 462, 419, 300, 304, 242,  74,
       455, 388,  87, 450,  69,  94, 313,  77,  99, 364,  98,  79, 240,
        95, 185, 145, 261, 331, 330, 368,  78, 461, 237, 432, 164, 266,
       158, 207,  17,  27, 226, 293, 435, 436,  72, 288, 325, 273, 215,
       413, 260,  57, 425, 491, 104, 438, 342, 452, 340, 200,  65, 159,
       214, 420, 219, 276, 268, 326, 198, 395,  26, 152, 149, 232, 439,
       170,  54])

In [12]:
n_train = train_idx.shape[0]
n_train


Out[12]:
379

In [13]:
test_idx


Out[13]:
array([361, 263,  53, 407, 409, 171, 445, 391,  37, 378, 297, 418, 356,
       367, 343, 484, 408, 406, 127, 334, 467, 256,  68, 392, 483,  31,
         6, 344, 502, 379, 281,  24, 474, 222,  82, 246, 329, 301, 285,
       194, 234, 109, 412, 385, 277, 221, 376, 363,  64, 396,  83, 469,
       374, 212, 482, 312, 223, 279, 479, 382,   4, 338, 252, 290, 195,
       477, 485, 390, 384, 283, 248, 144, 187, 157, 216, 208,  18, 210,
       161, 347,  93, 328, 307, 423,  91, 247, 160, 153, 259, 121, 416,
       480, 101, 493,  29, 464,  86, 321, 245, 299, 190, 102, 488,  13,
        71,  34, 434,  35,  42, 294, 243, 470, 350,  47, 193, 314, 446,
        32,  43, 105, 333, 448, 360,  60, 168,  55, 272])

In [14]:
n_test = test_idx.shape[0]
n_test


Out[14]:
127

In [15]:
n_train + n_test


Out[15]:
506

Example - regression using only RM

Let's predict house prices using only the average number of rooms (RM)


In [16]:
X = boston.data[train_idx, 5].reshape(n_train, 1)  # scikit-learn expects a 2-D (n_samples, n_features) array
X.shape


Out[16]:
(379, 1)

In [17]:
y = boston.target[train_idx]

Training

Here, we will use linear regression


In [18]:
from sklearn.linear_model import LinearRegression

In [19]:
regr = LinearRegression()

regr.fit(X, y)


Out[19]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
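
The fitted line can be read directly off the estimator; a quick sketch using the regr just trained:

print regr.intercept_, regr.coef_   # MEDV ~ intercept + coef[0] * RM

Each additional room thus raises the predicted median value by coef[0] (in US$ 1,000).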

In [20]:
scatter(X, y)
plot(X, regr.predict(X))
xlabel('RM (average number of rooms)')
xlim((3,9))
ylabel('Median value (in US$ 1,000)')
ylim((0,55))


Out[20]:
(0, 55)

Mean squared error


In [21]:
mean((regr.predict(X) - y)**2)


Out[21]:
46.796244399594485
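
The same quantity is available as a ready-made metric; a sketch, assuming the sklearn.metrics module of this scikit-learn version:

from sklearn.metrics import mean_squared_error
mean_squared_error(y, regr.predict(X))   # same value as the expression above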

Score

The coefficient of determination $R^2$ is defined as $(1 - \frac{u}{v})$, where

$u = \sum_{i} (y_i - \hat{y}_i)^2$ is the residual sum of squares and

$v = \sum_{i} (y_i - \mu_y)^2$ is the total sum of squares around the mean $\mu_y$ of the observed values.

The best possible score is 1.0; a model that always predicts $\mu_y$ scores 0.0, and the score can be negative for models that do even worse.
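
To make the definition concrete, we can compute the score by hand; a sketch using the X, y and regr from the cells above:

u = ((y - regr.predict(X)) ** 2).sum()   # residual sum of squares
v = ((y - y.mean()) ** 2).sum()          # total sum of squares around the mean
print 1 - u / v                          # should match regr.score(X, y) below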


In [22]:
regr.score(X, y)


Out[22]:
0.48450411944260396

Test


In [23]:
X_t = boston.data[test_idx,5].reshape(n_test, 1)
y_t = boston.target[test_idx]

Mean squared error


In [24]:
mean((regr.predict(X_t) - y_t)**2)


Out[24]:
34.312008286451317

Score


In [25]:
regr.score(X_t, y_t)


Out[25]:
0.47519490527319708

In [26]:
scatter(X_t, y_t)
plot(X_t, regr.predict(X_t))
xlabel('RM (average number of rooms)')
xlim((3,9))
ylabel('Median value (in US$ 1,000)')
ylim((0,55))


Out[26]:
(0, 55)

In [27]:
scatter(X, y, c='b', marker='o')
scatter(X_t, y_t, c='r', marker='s')
plot(X, regr.predict(X))
plot(X_t, regr.predict(X_t), 'r-')
xlabel('RM (average number of rooms)')
xlim((3,9))
ylabel('Median value (in US$ 1,000)')
ylim((0,55))


Out[27]:
(0, 55)

Example - regression using all the variables

Training


In [28]:
regr = LinearRegression()
X = boston.data[train_idx]
y = boston.target[train_idx]
regr.fit(X, y)


Out[28]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

Mean squared error


In [29]:
mean((regr.predict(X) - y)**2)


Out[29]:
22.91041859376821

Score


In [30]:
regr.score(X, y)


Out[30]:
0.74762448229637368

Test

Mean squared error


In [31]:
X_t = boston.data[test_idx]
y_t = boston.target[test_idx]
mean((regr.predict(X_t) - y_t)**2)


Out[31]:
19.870238886388343

Score


In [32]:
regr.score(X_t, y_t)


Out[32]:
0.69608300062305262

Exercise

The diabetes dataset

Ten variables: age, sex, body mass index, average blood pressure, and six blood-serum measurements, for $n = 442$ diabetes patients.

Quantitative measure of interest: an index of disease progression one year after baseline


In [33]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()

Exercise

Run a regression to predict the disease-progression index from the 10 variables

Solution


In [34]:
num_samples = diabetes.data.shape[0]
num_samples


Out[34]:
442

In [35]:
ssplit = ShuffleSplit(num_samples, n_iter=1, test_size=0.25)

In [36]:
for train_idx, test_idx in ssplit:
    pass

In [37]:
train_idx


Out[37]:
array([ 88, 325, 273, 279, 104, 146, 403,  48, 331, 401, 329, 381, 191,
         4,  80, 101, 202,  47,  93, 422, 278, 321, 298,  38, 197, 174,
       286, 225, 149, 404, 129, 272,  91,  39, 190, 166,  36, 239, 398,
        17, 209,  82,  54, 409,  97, 318, 378, 419, 154, 208,  96, 275,
       441, 355, 372, 284, 171,  22,  19, 315,  55,   6, 192, 258,  42,
        52, 289,  30, 291, 268,   1, 428, 185, 218, 200, 241, 326, 130,
       345, 319, 238, 250, 227, 211, 353, 263, 232, 430, 314, 281, 138,
        77, 344,  67, 396, 357, 100, 170,  63,   2, 413, 342,  49,   0,
       231, 254, 386, 150, 245, 244, 237, 135, 358, 307, 145, 434, 407,
       248, 352, 420,  79, 399,  94, 242, 178, 277, 303, 123,  85, 113,
       297,  87,  16, 346, 433, 229, 310, 293, 195, 205, 379, 371, 167,
       259, 194,  44,  46, 189, 156, 251, 230, 287, 429, 269,  31,  78,
        21,   9,  81,  98,  26, 305, 240, 290,  59, 157, 369, 196, 158,
       410, 349, 347,  72, 252, 122, 366, 356, 341, 221, 306, 423, 116,
       115, 343, 440,  23, 338, 246, 391, 380, 118, 300,   3,  15, 400,
       132, 181, 322,  99, 389, 106, 323, 282, 226, 177, 110,  27, 262,
       186, 164,  64, 188,  56, 224,  90, 324, 382, 361, 212, 376, 128,
        50, 213, 183, 274, 161,  20,  83, 141, 421, 437, 333, 288, 395,
       152, 412, 162,   5, 313, 351, 133, 105, 438,  74, 121, 108, 163,
       142, 387,  45, 139, 173, 111, 117,  62, 109,  75, 266,  34, 214,
       223, 260, 439, 264, 426, 408, 112, 144, 165, 172,  29, 406, 204,
        71, 370, 126, 417, 255, 362, 270, 257, 402, 393, 125, 120,  10,
       160, 431,  89, 360, 193, 187, 219, 153, 309,  61,  40,  43, 102,
        35, 168, 228, 182, 367,  92, 243, 363, 179, 103, 432, 340, 312,
       137, 124, 397, 236, 151, 184, 140, 424, 203, 261,  60,  12, 114,
       247,  65,   8, 416, 234, 392])

In [38]:
test_idx


Out[38]:
array([267, 155, 332, 143, 334,  37, 276, 285, 180, 339,  25, 253, 235,
        41, 256, 350,  11, 427, 368, 336, 365, 216, 308, 377, 220, 296,
       107, 316, 418, 384, 411, 292, 385, 280,  68, 394, 294, 374, 233,
       302, 176, 359, 364,  13,  58, 169, 175, 425,  28,  53, 301, 199,
       119,  57,  84,  14,  70, 383, 136, 330,   7,  24,  76,  95, 436,
       348, 271, 354, 222, 311, 388, 415, 304, 249, 335,  18, 206, 295,
       217, 435, 127, 373, 131, 147, 265, 283, 210, 320,  73, 327, 215,
       207, 299, 414,  66, 134, 148,  32, 198,  69,  51, 337, 375, 328,
        33, 390, 405, 317,  86, 159, 201])

In [39]:
regr = LinearRegression()
X = diabetes.data[train_idx]
y = diabetes.target[train_idx]
regr.fit(X, y)


Out[39]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

Mean squared error


In [40]:
mean((regr.predict(X) - y)**2)


Out[40]:
2854.5161687412092

In [41]:
X_t = diabetes.data[test_idx]
y_t = diabetes.target[test_idx]
mean((regr.predict(X_t) - y_t)**2)


Out[41]:
2957.3328332580109

In [42]:
regr.score(X_t, y_t)


Out[42]:
0.49383612489744777
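
Before moving on, it is instructive to inspect the fitted weights; a quick sketch using the regr just trained on the diabetes data:

print regr.intercept_
print regr.coef_   # one weight per variable; ordinary least squares zeroes none of them

Every variable gets a nonzero coefficient, which is exactly what Lasso, below, is designed to change.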

Lasso

  • Lasso can produce sparse models
    • Models that use only a subset of the variables
  • The idea is to shrink the coefficients of some variables all the way to zero (the objective minimized is sketched right after this list)
  • Sparse Lasso models are therefore easier to interpret
  • Lasso can be understood as an application of Occam's razor
    • Prefer simpler models
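
Concretely, the objective scikit-learn's Lasso minimizes is the least-squares loss plus an $\ell_1$ penalty on the weights:

$\frac{1}{2n} \|y - Xw\|_2^2 + \alpha \|w\|_1$

The larger the regularization strength $\alpha$, the more coefficients are driven exactly to zero.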

In [132]:
# num_samples still refers to the diabetes data (442); reset it for Boston,
# otherwise the split would only cover the first 442 of its 506 samples
num_samples = boston.data.shape[0]
ssplit = ShuffleSplit(num_samples, n_iter=1, test_size=0.25)

for train_idx, test_idx in ssplit:
    pass

X = boston.data[train_idx]
y = boston.target[train_idx]

X_t = boston.data[test_idx]
y_t = boston.target[test_idx]

In [133]:
from sklearn.linear_model import Lasso

In [134]:
lasso_regr = Lasso()
lasso_regr.set_params(alpha=0.001)
lasso_regr.fit(X, y)


Out[134]:
Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute='auto', tol=0.0001,
   warm_start=False)

In [135]:
lasso_regr.score(X_t, y_t)


Out[135]:
0.7053705432249564

In [136]:
boston_vars = ['CRIM (per capita crime rate by town)',
            'ZN (proportion of residential land zoned for lots over 25,000 sq.ft.)',
            'INDUS (proportion of non-retail business acres per town)',
            'CHAS (Charles River dummy variable, = 1 if tract bounds river; 0 otherwise)',
            'NOX (nitric oxides concentration, parts per 10 million)',
            'RM (average number of rooms per dwelling)',
            'AGE (proportion of owner-occupied units built prior to 1940)',
            'DIS (weighted distances to five Boston employment centres)',
            'RAD (index of accessibility to radial highways)',
            'TAX (full-value property-tax rate per $10,000)',
            'PTRATIO (pupil-teacher ratio by town)',
            'B (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town)',
            'LSTAT (% lower status of the population)']

In [137]:
for descr, coefficient in zip(boston_vars, lasso_regr.coef_):
    print "%.3f\t%s" % (coefficient, descr)


-0.107	CRIM (per capita crime rate by town)
0.061	ZN (proportion of residential land zoned for lots over 25,000 sq.ft.)
0.051	INDUS (proportion of non-retail business acres per town)
2.754	CHAS (Charles River dummy variable, = 1 if tract bounds river; 0 otherwise)
-17.696	NOX (nitric oxides concentration, parts per 10 million)
3.505	RM (average number of rooms per dwelling)
0.032	AGE (proportion of owner-occupied units built prior to 1940)
-1.401	DIS (weighted distances to five Boston employment centres)
0.432	RAD (index of accessibility to radial highways)
-0.016	TAX (full-value property-tax rate per $10,000)
-0.883	PTRATIO (pupil-teacher ratio by town)
0.010	B (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town)
-0.668	LSTAT (% lower status of the population)
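
With alpha as small as 0.001 no coefficient is actually driven to zero, so this model is not yet sparse. A sketch of how larger alphas trade test score for sparsity, reusing X, y, X_t and y_t from the cells above:

for alpha in (0.001, 0.1, 1.0, 10.0):
    sparse_regr = Lasso(alpha=alpha)
    sparse_regr.fit(X, y)
    n_zeroed = (sparse_regr.coef_ == 0).sum()   # variables dropped from the model
    print "alpha=%-6g zeroed %2d of 13 variables, test score=%.3f" % (
        alpha, n_zeroed, sparse_regr.score(X_t, y_t))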