In [1]:

    
%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn



In [2]:

    
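# Note: load_boston was removed in scikit-learn 1.2,
# so this cell needs an earlier scikit-learn version.
from sklearn import datasets
boston = datasets.load_boston()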



In [3]:

    
boston.keys()









    Out[3]:





['data', 'feature_names', 'DESCR', 'target']



In [5]:

    
boston.data.shape









    Out[5]:





(506, 13)



In [6]:

    
boston.feature_names









    Out[6]:





array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='|S7')



In [8]:

    
print(boston.DESCR)









    



Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)



In [9]:

    
df = pd.DataFrame(boston.data)



In [11]:

    
df.head()



In [12]:

    
df.columns = boston.feature_names



In [13]:

    
df.head()



In [14]:

    
boston.target[:5]









    Out[14]:





array([ 24. ,  21.6,  34.7,  33.4,  36.2])



In [18]:

    
from sklearn import linear_model
lm = linear_model.LinearRegression()



In [20]:

    
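X = df

Important methods: fit(), predict() and score()

model.fit(): Train the model on the features and the target
model.predict(): Predict an outcome for the given features
model.score(): Measure how good the predictions are (returns the R^2 coefficient of determination)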
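
To make the three methods concrete, here is a minimal NumPy-only sketch of what they compute for ordinary least squares. It is not scikit-learn's implementation: the helper names `fit`, `predict` and `score`, and the small noise-free synthetic dataset, are assumptions made for illustration.

```python
import numpy as np

def fit(X, y):
    """Least-squares fit; returns (intercept, coef)."""
    # Prepend a column of ones so the first fitted weight is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

def predict(X, intercept, coef):
    """Linear prediction: intercept + X . coef."""
    return intercept + X @ coef

def score(X, y, intercept, coef):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    residual = y - predict(X, intercept, coef)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical synthetic data: y is an exact linear function of two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

intercept, coef = fit(X_demo, y_demo)
r2 = score(X_demo, y_demo, intercept, coef)
print(intercept, coef, r2)
```

Because the synthetic target has no noise, the fit recovers the generating intercept and weights and R^2 comes out at essentially 1. On real data such as the housing prices, R^2 will be lower.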



In [22]:

    
lm.fit(X, boston.target)









    Out[22]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [23]:

    
lm.intercept_









    Out[23]:





36.491103280361635



In [25]:

    
lm.coef_









    Out[25]:





array([ -1.07170557e-01,   4.63952195e-02,   2.08602395e-02,
         2.68856140e+00,  -1.77957587e+01,   3.80475246e+00,
         7.51061703e-04,  -1.47575880e+00,   3.05655038e-01,
        -1.23293463e-02,  -9.53463555e-01,   9.39251272e-03,
        -5.25466633e-01])



In [27]:

    
# list() is needed on Python 3, where zip() returns an iterator
pd.DataFrame(list(zip(X.columns, lm.coef_)), columns=['features', 'coefficients'])



    Out[27]:



	features	coefficients
0	CRIM	-0.107171
1	ZN	0.046395
2	INDUS	0.020860
3	CHAS	2.688561
4	NOX	-17.795759
5	RM	3.804752
6	AGE	0.000751
7	DIS	-1.475759
8	RAD	0.305655
9	TAX	-0.012329
10	PTRATIO	-0.953464
11	B	0.009393
12	LSTAT	-0.525467



In [31]:

    
plt.scatter(df.RM, boston.target)
plt.xlabel("Avg num of rooms")
plt.ylabel("Housing price")









    Out[31]:





<matplotlib.text.Text at 0x7f79397b2b50>



In [30]:

    
lm.predict(X)[:5]









    Out[30]:





array([ 30.00821269,  25.0298606 ,  30.5702317 ,  28.60814055,  27.94288232])
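
A linear model's prediction is the intercept plus the dot product of the coefficients with the feature values. As a rough check, the sketch below recomputes the first prediction by hand from the `lm.intercept_` and `lm.coef_` values printed earlier and the first row shown by `df.head()`. One assumption: the `CHAS` column is missing from the printed table, so a value of 0 is assumed for it here.

```python
# Intercept and coefficients copied from the notebook output (Out[23], Out[25]).
intercept = 36.491103280361635
coef = [-1.07170557e-01, 4.63952195e-02, 2.08602395e-02, 2.68856140e+00,
        -1.77957587e+01, 3.80475246e+00, 7.51061703e-04, -1.47575880e+00,
        3.05655038e-01, -1.23293463e-02, -9.53463555e-01, 9.39251272e-03,
        -5.25466633e-01]

# First row of the feature table, in feature_names order.
# ASSUMPTION: CHAS (index 3) is not shown in the printed table; 0.0 is assumed.
row0 = [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2,
        4.0900, 1.0, 296.0, 15.3, 396.90, 4.98]

# Prediction = intercept + sum of coefficient * feature value.
prediction = intercept + sum(c * x for c, x in zip(coef, row0))
print(round(prediction, 3))  # approximately 30.008
```

The result agrees with the first entry of `lm.predict(X)[:5]`, which is consistent with the assumed `CHAS` value of 0 for that row.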



In [33]:

    
plt.scatter(boston.target, lm.predict(X))
plt.xlabel("Price")
plt.ylabel("Predicted Price")









    Out[33]:





<matplotlib.text.Text at 0x7f793756e090>
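
The scatter plot compares actual and predicted prices visually. A numeric complement, sketched below, is the mean squared error of the residuals. The notebook does not compute this itself, and the sketch uses only the five targets and predictions printed earlier, so the figure says nothing about the full dataset.

```python
# First five targets (Out[14]) and predictions (Out[30]) from the notebook.
actual = [24.0, 21.6, 34.7, 33.4, 36.2]
predicted = [30.00821269, 25.0298606, 30.5702317, 28.60814055, 27.94288232]

# Residual = actual price minus predicted price, in units of $1000.
residuals = [a - p for a, p in zip(actual, predicted)]

# Mean squared error over these five rows only.
mse = sum(r ** 2 for r in residuals) / len(residuals)
print(residuals)
print(mse)
```

A negative residual means the model overpredicted the price for that row, and a positive one means it underpredicted.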

Output of df.head() in In [11] (default integer column labels):

	0	1	2	4	5	6	7	8	9	10	11	12
0	0.00632	18	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98
1	0.02731	0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14
2	0.02729	0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03
3	0.03237	0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94
4	0.06905	0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33

Output of df.head() in In [13] (columns renamed to the feature names):

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98
1	0.02731	0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14
2	0.02729	0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03
3	0.03237	0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94
4	0.06905	0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33
