In [1]:
# Load the Boston house-price dataset bundled with scikit-learn
from sklearn.datasets import load_boston
boston = load_boston()

In [2]:
print boston.DESCR


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


In [3]:
# Split the data into training and test sets
# (sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
#  train_test_split now lives in model_selection)
from sklearn.model_selection import train_test_split

In [4]:
import numpy as np
X = boston.data
Y = boston.target

In [25]:
X_train, Xtest, Y_Train, Y_test = train_test_split(X,
                                                   Y,
                                                   random_state=33,
                                                   test_size=0.25)

In [15]:
Y_Train.shape


Out[15]:
(379,)

In [16]:
Y_test.shape


Out[16]:
(127,)

In [26]:
print boston.target


[ 24.   21.6  34.7  33.4  36.2  28.7  22.9  27.1  16.5  18.9  15.   18.9
  21.7  20.4  18.2  19.9  23.1  17.5  20.2  18.2  13.6  19.6  15.2  14.5
  15.6  13.9  16.6  14.8  18.4  21.   12.7  14.5  13.2  13.1  13.5  18.9
  20.   21.   24.7  30.8  34.9  26.6  25.3  24.7  21.2  19.3  20.   16.6
  14.4  19.4  19.7  20.5  25.   23.4  18.9  35.4  24.7  31.6  23.3  19.6
  18.7  16.   22.2  25.   33.   23.5  19.4  22.   17.4  20.9  24.2  21.7
  22.8  23.4  24.1  21.4  20.   20.8  21.2  20.3  28.   23.9  24.8  22.9
  23.9  26.6  22.5  22.2  23.6  28.7  22.6  22.   22.9  25.   20.6  28.4
  21.4  38.7  43.8  33.2  27.5  26.5  18.6  19.3  20.1  19.5  19.5  20.4
  19.8  19.4  21.7  22.8  18.8  18.7  18.5  18.3  21.2  19.2  20.4  19.3
  22.   20.3  20.5  17.3  18.8  21.4  15.7  16.2  18.   14.3  19.2  19.6
  23.   18.4  15.6  18.1  17.4  17.1  13.3  17.8  14.   14.4  13.4  15.6
  11.8  13.8  15.6  14.6  17.8  15.4  21.5  19.6  15.3  19.4  17.   15.6
  13.1  41.3  24.3  23.3  27.   50.   50.   50.   22.7  25.   50.   23.8
  23.8  22.3  17.4  19.1  23.1  23.6  22.6  29.4  23.2  24.6  29.9  37.2
  39.8  36.2  37.9  32.5  26.4  29.6  50.   32.   29.8  34.9  37.   30.5
  36.4  31.1  29.1  50.   33.3  30.3  34.6  34.9  32.9  24.1  42.3  48.5
  50.   22.6  24.4  22.5  24.4  20.   21.7  19.3  22.4  28.1  23.7  25.
  23.3  28.7  21.5  23.   26.7  21.7  27.5  30.1  44.8  50.   37.6  31.6
  46.7  31.5  24.3  31.7  41.7  48.3  29.   24.   25.1  31.5  23.7  23.3
  22.   20.1  22.2  23.7  17.6  18.5  24.3  20.5  24.5  26.2  24.4  24.8
  29.6  42.8  21.9  20.9  44.   50.   36.   30.1  33.8  43.1  48.8  31.
  36.5  22.8  30.7  50.   43.5  20.7  21.1  25.2  24.4  35.2  32.4  32.
  33.2  33.1  29.1  35.1  45.4  35.4  46.   50.   32.2  22.   20.1  23.2
  22.3  24.8  28.5  37.3  27.9  23.9  21.7  28.6  27.1  20.3  22.5  29.
  24.8  22.   26.4  33.1  36.1  28.4  33.4  28.2  22.8  20.3  16.1  22.1
  19.4  21.6  23.8  16.2  17.8  19.8  23.1  21.   23.8  23.1  20.4  18.5
  25.   24.6  23.   22.2  19.3  22.6  19.8  17.1  19.4  22.2  20.7  21.1
  19.5  18.5  20.6  19.   18.7  32.7  16.5  23.9  31.2  17.5  17.2  23.1
  24.5  26.6  22.9  24.1  18.6  30.1  18.2  20.6  17.8  21.7  22.7  22.6
  25.   19.9  20.8  16.8  21.9  27.5  21.9  23.1  50.   50.   50.   50.
  50.   13.8  13.8  15.   13.9  13.3  13.1  10.2  10.4  10.9  11.3  12.3
   8.8   7.2  10.5   7.4  10.2  11.5  15.1  23.2   9.7  13.8  12.7  13.1
  12.5   8.5   5.    6.3   5.6   7.2  12.1   8.3   8.5   5.   11.9  27.9
  17.2  27.5  15.   17.2  17.9  16.3   7.    7.2   7.5  10.4   8.8   8.4
  16.7  14.2  20.8  13.4  11.7   8.3  10.2  10.9  11.    9.5  14.5  14.1
  16.1  14.3  11.7  13.4   9.6   8.7   8.4  12.8  10.5  17.1  18.4  15.4
  10.8  11.8  14.9  12.6  14.1  13.   13.4  15.2  16.1  17.8  14.9  14.1
  12.7  13.5  14.9  20.   16.4  17.7  19.5  20.2  21.4  19.9  19.   19.1
  19.1  20.1  19.9  19.6  23.2  29.8  13.8  13.3  16.7  12.   14.6  21.4
  23.   23.7  25.   21.8  20.6  21.2  19.1  20.6  15.2   7.    8.1  13.6
  20.1  21.8  24.5  23.1  19.7  18.3  21.2  17.5  16.8  22.4  20.6  23.9
  22.   11.9]

In [27]:
# Inspect the spread of the regression target values
print "The max target value is", np.max(boston.target)
print "The min target value is", np.min(boston.target)
print "The average target value is", np.average(boston.target)


The max target value is 50.0
The min target value is 5.0
The average target value is 22.5328063241

In [28]:
# The target prices above vary widely (from 5.0 to 50.0), so the features
# and, ideally, the target should be standardized before fitting
# Import the standardization module
from sklearn.preprocessing import StandardScaler

In [29]:
# Initialize separate scalers for the features and the target
# (note: ss_Y is created but never applied below, so the targets and the
#  error metrics stay in the original $1000 units)
ss_X = StandardScaler()
ss_Y = StandardScaler()

In [30]:
X_train = ss_X.fit_transform(X_train)

In [31]:
Xtest = ss_X.transform(Xtest)
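The cells above scale only the features. A minimal sketch, on synthetic target values rather than the notebook's `Y_Train`, of how the as-yet-unused `ss_Y` scaler could be applied, assuming predictions should later be mapped back to $1000s with `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic target values standing in for Y_Train (illustration only)
y_demo = np.array([24.0, 21.6, 34.7, 33.4, 36.2]).reshape(-1, 1)  # scalers expect 2-D input

target_scaler = StandardScaler()
y_scaled = target_scaler.fit_transform(y_demo)      # zero mean, unit variance

# Predictions made in the scaled space must be mapped back before
# reporting errors in the original $1000 units:
y_restored = target_scaler.inverse_transform(y_scaled)
print(np.allclose(y_restored, y_demo))
```

The round trip recovers the original values, which is why scaling the target does not change R-squared but does change the raw MAE/MSE numbers until predictions are inverse-transformed.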

In [33]:
print Y_test


[ 20.5   5.6  13.4  12.6  21.2  19.7  32.4  14.8  33.   21.4  30.1  36.
   8.4  21.6  16.3  23.   14.9  14.1  31.1  11.9  12.7  27.9  20.8  19.6
  32.   21.9  23.2  23.8  10.8  34.9  19.1  26.5  10.5  17.5  24.   36.1
  25.3  13.8  27.5  24.6  12.7   9.5  32.7  13.8  23.5  17.7  15.6  22.5
  26.2  20.6  14.1  33.3  15.2  14.9  21.6  17.2  23.1  11.7  20.6  22.2
  23.1  18.4  43.8  21.1  14.9  28.7  23.3  13.8  19.7  30.5  19.   19.1
  19.   26.6  17.5  21.9  13.8   8.8  19.4  28.1  21.   11.8   7.2  24.1
  20.   18.9  50.   13.3  50.   41.3  28.7  19.9  16.5  10.9  13.4  32.9
  20.6  25.   19.5  19.9  15.4  21.7  31.5  27.1   8.3  13.6   8.8  22.5
   7.5  28.6  50.   11.5  13.5  24.4  36.2  21.4  18.5  22.6  24.8  19.3
  29.8  16.4   8.4  24.7  20.1  13.1  35.2]

In [36]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, Y_Train)


Out[36]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
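Once fitted, the linear model exposes its learned parameters. A small sketch on a toy, exactly linear dataset (the `*_toy` names are illustrative, not from this notebook) showing that `coef_` holds one weight per feature and `intercept_` the bias:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy design matrix with an exactly linear target: y = 3*x0 - 2*x1 + 1
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.0

lr_toy = LinearRegression()
lr_toy.fit(X_toy, y_toy)

# Because the toy target is noise-free, least squares recovers the
# generating weights exactly
print(np.allclose(lr_toy.coef_, [3.0, -2.0]))
print(np.isclose(lr_toy.intercept_, 1.0))
```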

In [37]:
lr_y_pred = lr.predict(Xtest)

In [39]:
from sklearn.linear_model import SGDRegressor
sgdr = SGDRegressor()
sgdr.fit(X_train, Y_Train)


Out[39]:
SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', max_iter=5, n_iter=None, penalty='l2',
       power_t=0.25, random_state=None, shuffle=True, tol=None, verbose=0,
       warm_start=False)

In [40]:
sgdr_y_pred = sgdr.predict(Xtest)
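The `Out[39]` repr above shows `max_iter=5`, the transitional default in that scikit-learn release, which gives stochastic gradient descent very few passes over the data. A hedged sketch on synthetic data (generated with `make_regression`, not the Boston matrix) of requesting more epochs explicitly, as newer releases do by default (`max_iter=1000, tol=1e-3`):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with 13 features, mirroring the Boston shape
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=5.0, random_state=33)
X_demo = StandardScaler().fit_transform(X_demo)  # SGD needs scaled features

# More epochs plus a convergence tolerance stabilises the solution
sgdr_demo = SGDRegressor(max_iter=1000, tol=1e-3, random_state=33)
sgdr_demo.fit(X_demo, y_demo)
print(sgdr_demo.score(X_demo, y_demo) > 0.9)
```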

In [41]:
#Evaluate regression performance with three metrics (R-squared, MAE, MSE),
#computing R-squared in two ways: the model's score method and r2_score
print 'The default R-squared score of LinearRegression is',lr.score(Xtest, Y_test)


The default R-squared score of LinearRegression is 0.6763403831

In [42]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

In [47]:
print 'The Value of R-squared of LinearRegression is',r2_score(Y_test, lr_y_pred)
print 'The Value of mean_absolute_error of LinearRegression is',mean_absolute_error(Y_test, lr_y_pred)
print 'The Value of mean_squared_error of LinearRegression is',mean_squared_error(Y_test, lr_y_pred)


The Value of R-squared of LinearRegression is 0.6763403831
The Value of mean_absolute_error of LinearRegression is 3.5261239964
The Value of mean_squared_error of LinearRegression is 25.0969856921

In [48]:
print 'The default R-squared score of SGDRegressor is',sgdr.score(Xtest, Y_test)
print 'The Value of R-squared of SGDRegressor is',r2_score(Y_test, sgdr_y_pred)
print 'The Value of mean_absolute_error of SGDRegressor is',mean_absolute_error(Y_test, sgdr_y_pred)
print 'The Value of mean_squared_error of SGDRegressor is',mean_squared_error(Y_test, sgdr_y_pred)


The default R-squared score of SGDRegressor is 0.64929708222
The Value of R-squared of SGDRegressor is 0.64929708222
The Value of mean_absolute_error of SGDRegressor is 3.49963437151
The Value of mean_squared_error of SGDRegressor is 27.1939582516
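The three metrics above are simple enough to compute by hand. A sketch with toy `y_true`/`y_pred` values (not this notebook's predictions) showing what `r2_score`, `mean_absolute_error` and `mean_squared_error` actually calculate:

```python
import numpy as np

# Toy ground truth and predictions (illustration only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))          # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)           # mean squared error
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination

print(mae, mse, r2)
```

R-squared measures the fraction of target variance explained by the model, which is why `model.score` and `r2_score` print identical values above.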

In [ ]: