


In [1]:

    
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model

%matplotlib inline

# Load the dataset.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
boston = datasets.load_boston()

# Print the dataset description
print(boston.DESCR)









    



Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)







    



/Users/charilaostsarouchas/anaconda/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module argparse was already imported from /Users/charilaostsarouchas/anaconda/lib/python2.7/argparse.pyc, but /Users/charilaostsarouchas/anaconda/lib/python2.7/site-packages is being added to sys.path
  from pkg_resources import resource_stream



In [12]:

    
# Let's explore the Boston data a bit
print("boston keys: ", list(boston.keys()))
print("boston feature names", boston.feature_names)
# DESCR says "Median Value (attribute 14) is usually the target",
# so MEDV is what boston.target holds

# Build a DataFrame from the feature matrix
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Add the target as an extra column
boston_df["MEDV"] = boston.target
boston_df.head()









    



boston keys:  ['data', 'feature_names', 'DESCR', 'target']
boston feature names ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']






    Out[12]:






  
    
      
      CRIM  ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
1  0.02731   0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
2  0.02729   0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
3  0.03237   0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
4  0.06905   0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2



In [ ]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2
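
The cells above merge the predictors and the MEDV target into one DataFrame. A minimal sketch of the reverse step, splitting that frame back into predictors and target, is given below. It assumes Python 3 and a reasonably recent pandas, and it retypes the five rows shown by `boston_df.head()` by hand so that it runs without `load_boston`. The names `head_df`, `X` and `y` are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Column names as listed in the dataset description, with MEDV last.
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# The five rows shown in Out[12], copied by hand (not loaded from scikit-learn).
rows = [
    [0.00632, 18, 2.31, 0, 0.538, 6.575, 65.2, 4.0900, 1, 296, 15.3, 396.90, 4.98, 24.0],
    [0.02731, 0, 7.07, 0, 0.469, 6.421, 78.9, 4.9671, 2, 242, 17.8, 396.90, 9.14, 21.6],
    [0.02729, 0, 7.07, 0, 0.469, 7.185, 61.1, 4.9671, 2, 242, 17.8, 392.83, 4.03, 34.7],
    [0.03237, 0, 2.18, 0, 0.458, 6.998, 45.8, 6.0622, 3, 222, 18.7, 394.63, 2.94, 33.4],
    [0.06905, 0, 2.18, 0, 0.458, 7.147, 54.2, 6.0622, 3, 222, 18.7, 396.90, 5.33, 36.2],
]

# head_df is a hypothetical stand-in for boston_df.head().
head_df = pd.DataFrame(rows, columns=columns)

# Separate the 13 predictors from the target column.
X = head_df.drop(columns=["MEDV"])
y = head_df["MEDV"]

print(X.shape)
print(y.tolist())
```

With the full `boston_df` the same two lines would apply unchanged. Only the row count would differ.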