The datasets Module


In [1]:
from sklearn import datasets
import numpy as np

In [2]:
datasets.*?

In [3]:
boston = datasets.load_boston()
print(boston.DESCR)


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


In [4]:
X, y = boston.data, boston.target

Creating Sample Data


In [5]:
datasets.make_*?

In [6]:
X, y = datasets.make_regression(n_samples=1000, n_features=1,
                                n_informative=1, noise=15,
                                bias=1000, random_state=0)

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(X, y);



In [8]:
X, y = datasets.make_blobs(n_samples=300, centers=4,
                           cluster_std=0.6, random_state=0)

In [9]:
plt.scatter(X[:, 0], X[:, 1], s=50);


Scaling Data


In [10]:
from sklearn import preprocessing

X, y = boston.data, boston.target
X[:, :3].mean(axis=0)


Out[10]:
array([  3.59376071,  11.36363636,  11.13677866])

In [11]:
X[:, :3].std(axis=0)


Out[11]:
array([  8.58828355,  23.29939569,   6.85357058])

In [12]:
plt.plot(X[:, :3]);


preprocessing.scale

scale centers each column and scales it to unit variance, i.e. it applies (X - X.mean(axis=0)) / X.std(axis=0):
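
A quick sanity check (a sketch, not part of the original recipe): applying the formula by hand should agree with preprocessing.scale up to floating-point error.

# Hand-rolled standardization of the same three Boston columns
X_manual = (X[:, :3] - X[:, :3].mean(axis=0)) / X[:, :3].std(axis=0)
np.allclose(X_manual, preprocessing.scale(X[:, :3]))  # expected: True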


In [13]:
X_2 = preprocessing.scale(X[:, :3])

In [14]:
X_2.mean(axis=0)


Out[14]:
array([  6.34099712e-17,  -6.34319123e-16,  -2.68291099e-15])

In [15]:
X_2.std(axis=0)


Out[15]:
array([ 1.,  1.,  1.])

In [16]:
plt.plot(X_2);


StandardScaler

Performs the same centering and scaling as preprocessing.scale, but stores the fitted mean and standard deviation so they can be reused later (for example, on new or test data).


In [17]:
scaler = preprocessing.StandardScaler()
scaler.fit(X[:, :3])
X_3 = scaler.transform(X[:, :3])
X_3.mean(axis=0)


Out[17]:
array([  6.34099712e-17,  -6.34319123e-16,  -2.68291099e-15])

In [18]:
X_3.std(axis=0)


Out[18]:
array([ 1.,  1.,  1.])

In [19]:
plt.plot(X_3);
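
Because the fitted statistics are stored on the object, the same scaler can be applied to data it has not seen. A minimal sketch (the 400-row split and the scaler_train name are illustrative, not part of the original notebook):

# Fit on the first 400 rows, then reuse the same mean/std on the remaining rows
scaler_train = preprocessing.StandardScaler().fit(X[:400, :3])
X_rest = scaler_train.transform(X[400:, :3])
X_rest.mean(axis=0)  # not exactly zero: the statistics come from the first 400 rows only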


MinMaxScaler

Scales data within a specified range.


In [20]:
scaler = preprocessing.MinMaxScaler()
scaler.fit(X[:, :3])
X_4 = scaler.transform(X[:, :3])
X_4.max(axis=0)


Out[20]:
array([ 1.,  1.,  1.])

In [21]:
X_4.std(axis=0)


Out[21]:
array([ 0.09653024,  0.23299396,  0.25123059])

In [22]:
plt.plot(X_4);



In [23]:
scaler = preprocessing.MinMaxScaler(feature_range=(-4, 4))
scaler.fit(X[:, :3])
X_5 = scaler.transform(X[:, :3])

In [24]:
plt.plot(X_5);


Binarizing Data

preprocessing.binarize


In [25]:
new_target = preprocessing.binarize(boston.target, threshold=boston.target.mean())
new_target[:, :5]


C:\tools\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
Out[25]:
array([[ 1.,  0.,  1.,  1.,  1.]])

In [26]:
(boston.target[:5] > boston.target.mean()).astype(int)


Out[26]:
array([1, 0, 1, 1, 1])
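
The DeprecationWarning above is about passing a 1-D array; a sketch of the reshape workaround the warning suggests:

# Reshape the 1-D target into a single-feature 2-D array before binarizing
new_target_2d = preprocessing.binarize(boston.target.reshape(-1, 1),
                                       threshold=boston.target.mean())
new_target_2d[:5].ravel()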

Binarizer


In [27]:
binarizer = preprocessing.Binarizer(boston.target.mean())
new_target = binarizer.fit_transform(boston.target)
new_target[:, :5]


C:\tools\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
C:\tools\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
Out[27]:
array([[ 1.,  0.,  1.,  1.,  1.]])

Working with Categorical Variables

OneHotEncoder


In [28]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [29]:
d = np.column_stack((X, y))

In [30]:
encoder = preprocessing.OneHotEncoder()
encoder.fit_transform(d[:, -1:]).toarray()[:5]


Out[30]:
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.]])
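
Once fitted, the encoder can be reused on new label values; a small usage sketch (the values below are just examples):

# Encode two new class labels with the already-fitted encoder
encoder.transform([[0], [2]]).toarray()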

DictVectorizer


In [31]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
dict = [{'species': iris.target_names[i]} for i in y]
dv.fit_transform(dict).toarray()[:5]


Out[31]:
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.]])
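
The vectorizer records which column corresponds to which key/value pair; for instance:

# Column names generated by the vectorizer (e.g. 'species=setosa', ...)
dv.get_feature_names()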

Patsy


In [32]:
import patsy
patsy.dmatrix('0 + C(species)', {'species': iris.target})


Out[32]:
DesignMatrix with shape (150, 3)
  C(species)[0]  C(species)[1]  C(species)[2]
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
  [120 rows omitted]
  Terms:
    'C(species)' (columns 0:3)
  (to view full data, use np.asarray(this_obj))

Binarizing Label Features

LabelBinarizer


In [33]:
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()
new_target = binarizer.fit_transform(y)
y.shape, new_target.shape


Out[33]:
((150,), (150, 3))

In [34]:
new_target[:5]


Out[34]:
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])

In [35]:
new_target[-5:]


Out[35]:
array([[0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]])

In [36]:
binarizer.classes_


Out[36]:
array([0, 1, 2])
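
The one-hot rows can be mapped back to the original classes; a quick check:

# Recover the original integer labels from the binarized rows
binarizer.inverse_transform(new_target[:5])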

LabelBinarizer with Custom Label Values


In [37]:
binarizer = LabelBinarizer(neg_label=-1000, pos_label=1000)
binarizer.fit_transform(y)[:5]


Out[37]:
array([[ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000]])

Imputing Missing Values through Various Strategies


In [38]:
iris = datasets.load_iris()
iris_X = iris.data
masking_array = np.random.binomial(1, .25, iris_X.shape).astype(bool)
iris_X[masking_array] = np.nan

In [39]:
masking_array[:5]


Out[39]:
array([[False, False,  True, False],
       [False,  True, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [False,  True,  True, False]], dtype=bool)

In [40]:
iris_X[:5]


Out[40]:
array([[ 5.1,  3.5,  nan,  0.2],
       [ 4.9,  nan,  1.4,  nan],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  nan,  nan,  0.2]])

By default, Imputer fills in missing values with the mean.


In [41]:
impute = preprocessing.Imputer()
iris_X_prime = impute.fit_transform(iris_X)
iris_X_prime[:5]


Out[41]:
array([[ 5.1       ,  3.5       ,  3.84513274,  0.2       ],
       [ 4.9       ,  3.04901961,  1.4       ,  1.16465517],
       [ 4.7       ,  3.2       ,  1.3       ,  0.2       ],
       [ 4.6       ,  3.1       ,  1.5       ,  0.2       ],
       [ 5.        ,  3.04901961,  3.84513274,  0.2       ]])

In [42]:
impute = preprocessing.Imputer(strategy='median')
iris_X_prime = impute.fit_transform(iris_X)
iris_X_prime[:5]


Out[42]:
array([[ 5.1,  3.5,  4.4,  0.2],
       [ 4.9,  3. ,  1.4,  1.3],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3. ,  4.4,  0.2]])
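
A third built-in strategy is 'most_frequent', which imputes each column's mode; a sketch on the same masked data (the exact values depend on the random mask):

# Impute missing entries with each column's most frequent value
impute_mode = preprocessing.Imputer(strategy='most_frequent')
impute_mode.fit_transform(iris_X)[:5]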

In [43]:
iris_X[np.isnan(iris_X)] = -1
iris_X[:5]


Out[43]:
array([[ 5.1,  3.5, -1. ,  0.2],
       [ 4.9, -1. ,  1.4, -1. ],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. , -1. , -1. ,  0.2]])

In [44]:
impute = preprocessing.Imputer(missing_values=-1)
iris_X_prime = impute.fit_transform(iris_X)
iris_X_prime[:5]


Out[44]:
array([[ 5.1       ,  3.5       ,  3.84513274,  0.2       ],
       [ 4.9       ,  3.04901961,  1.4       ,  1.16465517],
       [ 4.7       ,  3.2       ,  1.3       ,  0.2       ],
       [ 4.6       ,  3.1       ,  1.5       ,  0.2       ],
       [ 5.        ,  3.04901961,  3.84513274,  0.2       ]])

Using Pipelines for Multiple Preprocessing Steps


In [45]:
mat = datasets.make_spd_matrix(10)
masking_array = np.random.binomial(1, .1, mat.shape).astype(bool)
mat[masking_array] = np.nan
mat[:4, :4]


Out[45]:
array([[ 0.37587746,  0.08047308,  0.01226786, -0.08448386],
       [ 0.08047308,  0.50536722,  0.59754188, -0.23133298],
       [ 0.01226786,  0.59754188,  4.02138964, -1.70156591],
       [-0.08448386, -0.23133298, -1.70156591,  1.47584364]])

How to create a pipeline:


In [46]:
from sklearn import pipeline

pipe = pipeline.Pipeline([('impute', impute), ('scaler', scaler)])
pipe


Out[46]:
Pipeline(steps=[('impute', Imputer(axis=0, copy=True, missing_values=-1, strategy='mean', verbose=0)), ('scaler', MinMaxScaler(copy=True, feature_range=(-4, 4)))])

In [47]:
new_mat = pipe.fit_transform(mat)
new_mat[:4, :4]


Out[47]:
array([[ 4.        , -0.94028648, -1.60426758, -0.83089169],
       [-0.28474978,  3.11932569, -0.78612526, -0.0977672 ],
       [-1.27404554,  4.        ,  4.        ,  3.77694164],
       [-2.6773996 , -3.91940897, -4.        , -4.        ]])

To be used as an intermediate step in a Pipeline, an object must provide fit and transform (and therefore fit_transform); only the final step may be a plain estimator without transform.
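
For illustration, a minimal custom transformer that satisfies this contract (a hypothetical example, not from the original notebook):

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Keep only the first n_cols columns -- just enough API to plug into a Pipeline."""
    def __init__(self, n_cols=3):
        self.n_cols = n_cols

    def fit(self, X, y=None):
        # nothing to learn here
        return self

    def transform(self, X):
        return X[:, :self.n_cols]

# e.g. pipeline.Pipeline([('select', ColumnSelector(3)), ('scaler', scaler)])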

Reducing Dimensionality with PCA (Principal Component Analysis)


In [48]:
iris = datasets.load_iris()
iris_X = iris.data

In [49]:
from sklearn import decomposition

pca = decomposition.PCA()
pca


Out[49]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [50]:
iris_pca = pca.fit_transform(iris_X)
iris_pca[:5]


Out[50]:
array([[ -2.68420713e+00,   3.26607315e-01,  -2.15118370e-02,
          1.00615724e-03],
       [ -2.71539062e+00,  -1.69556848e-01,  -2.03521425e-01,
          9.96024240e-02],
       [ -2.88981954e+00,  -1.37345610e-01,   2.47092410e-02,
          1.93045428e-02],
       [ -2.74643720e+00,  -3.11124316e-01,   3.76719753e-02,
         -7.59552741e-02],
       [ -2.72859298e+00,   3.33924564e-01,   9.62296998e-02,
         -6.31287327e-02]])

PCA projects the data onto orthogonal components (directions derived from the covariance matrix), ordered by the fraction of the total variance each one explains:


In [51]:
pca.explained_variance_ratio_


Out[51]:
array([ 0.92461621,  0.05301557,  0.01718514,  0.00518309])
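
The cumulative sum shows how much variance a given number of components retains:

# Running total of explained variance; the first two components already cover ~97.8%
pca.explained_variance_ratio_.cumsum()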

High dimensionality is problematic in data analysis; when models overfit on high-dimensional datasets, consider projecting the data into fewer dimensions.


In [52]:
pca = decomposition.PCA(n_components=2)
iris_X_prime = pca.fit_transform(iris_X)
iris_X.shape, iris_X_prime.shape


Out[52]:
((150, 4), (150, 2))

In [53]:
plt.scatter(iris_X_prime[:50, 0], iris_X_prime[:50, 1]);
plt.scatter(iris_X_prime[50:100, 0], iris_X_prime[50:100, 1]);
plt.scatter(iris_X_prime[100:150, 0], iris_X_prime[100:150, 1]);



In [54]:
pca.explained_variance_ratio_.sum()


Out[54]:
0.97763177502480336

You can pass a float between 0 and 1 as n_components to keep enough components to explain at least that fraction of the variance:


In [55]:
pca = decomposition.PCA(n_components=.98)
iris_X_prime = pca.fit_transform(iris_X)
pca.explained_variance_ratio_.sum()


Out[55]:
0.99481691454981014
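
The fitted object also records how many components were needed to reach that threshold (the n_components_ attribute is available in recent scikit-learn versions):

# Number of components PCA kept to explain at least 98% of the variance
pca.n_components_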

Using Factor Analysis for Decomposition

Factor analysis differs from PCA in that it assumes the observed (explicit) features are generated by a smaller set of latent (implicit) factors plus per-feature noise, rather than simply finding directions of maximum variance.


In [56]:
from sklearn.decomposition import FactorAnalysis

In [57]:
fa = FactorAnalysis(n_components=2)
iris_two_dim = fa.fit_transform(iris.data)
iris_two_dim[:5]


Out[57]:
array([[-1.33125848, -0.55846779],
       [-1.33914102,  0.00509715],
       [-1.40258715,  0.307983  ],
       [-1.29839497,  0.71854288],
       [-1.33587575, -0.36533259]])
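
The fitted model exposes the estimated loadings and the per-feature noise variances, which is where the extra modelling assumption shows up:

# Factor loadings (n_components x n_features) and per-feature noise estimates
fa.components_.shape, fa.noise_variance_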

Kernel PCA for Nonlinear Dimensionality Reduction

When data is not linearly separable, kernel PCA can help: the data is implicitly projected through a kernel function, and PCA is then performed on that projection.


In [58]:
A1_mean = [1, 1]
A1_cov = [[2, .99], [.99, 1]]     # symmetric, positive-semidefinite covariance
A1 = np.random.multivariate_normal(A1_mean, A1_cov, 50)

A2_mean = [5, 5]
A2_cov = [[2, .99], [.99, 1]]
A2 = np.random.multivariate_normal(A2_mean, A2_cov, 50)

A = np.vstack((A1, A2))

B_mean = [5, 0]
B_cov = [[.5, -.45], [-.45, .5]]  # anti-correlated, still a valid covariance matrix
B = np.random.multivariate_normal(B_mean, B_cov, 100)

In [59]:
plt.scatter(A[:, 0], A[:, 1]);
plt.scatter(B[:, 0], B[:, 1]);



In [60]:
kpca = decomposition.KernelPCA(kernel='cosine', n_components=1)
AB = np.vstack((A, B))
AB_transformed = kpca.fit_transform(AB)

In [61]:
plt.scatter(AB_transformed[:50], np.zeros(AB_transformed[:50].shape), alpha=0.5);
plt.scatter(AB_transformed[50:], np.zeros(AB_transformed[50:].shape)+0.001, alpha=0.5);



In [62]:
pca = decomposition.PCA(n_components=2)
AB_prime = pca.fit_transform(AB)
plt.scatter(AB_prime[:, 0], np.zeros(AB_prime[:, 0].shape), alpha=0.5);
plt.scatter(AB_prime[:, 1], np.zeros(AB_prime[:, 1].shape)+0.001, alpha=0.5);


Using Truncated SVD to Reduce Dimensionality

Singular Value Decomposition (SVD) factors a matrix M into three matrices: U, Σ, and V. Whereas PCA factors the covariance matrix, SVD factors the data matrix itself.

For an m x n data matrix, full SVD yields n singular vectors; truncated SVD keeps only the number of components you specify, giving a reduced-width dataset.


In [63]:
iris = datasets.load_iris()
iris_data = iris.data
iris_target = iris.target

In [64]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(2)
iris_transformed = svd.fit_transform(iris_data)
iris_data[:5]


Out[64]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])

In [65]:
iris_transformed[:5]


Out[65]:
array([[ 5.91220352,  2.30344211],
       [ 5.57207573,  1.97383104],
       [ 5.4464847 ,  2.09653267],
       [ 5.43601924,  1.87168085],
       [ 5.87506555,  2.32934799]])

In [66]:
plt.scatter(iris_data[:50, 0], iris_data[:50, 2]);
plt.scatter(iris_data[50:100, 0], iris_data[50:100, 2]);
plt.scatter(iris_data[100:150, 0], iris_data[100:150, 2]);



In [67]:
plt.scatter(iris_transformed[:50, 0], -iris_transformed[:50, 1]);
plt.scatter(iris_transformed[50:100, 0], -iris_transformed[50:100, 1]);
plt.scatter(iris_transformed[100:150, 0], -iris_transformed[100:150, 1]);


How It Works


In [68]:
from scipy.linalg import svd

D = np.array([[1, 2], [1, 3], [1, 4]])
D


Out[68]:
array([[1, 2],
       [1, 3],
       [1, 4]])

In [69]:
U, S, V = svd(D, full_matrices=False)
U.shape, S.shape, V.shape


Out[69]:
((3, 2), (2,), (2, 2))

In [70]:
np.dot(U.dot(np.diag(S)), V)


Out[70]:
array([[ 1.,  2.],
       [ 1.,  3.],
       [ 1.,  4.]])

In [71]:
new_S = S[0]
new_U = U[:, 0]
new_U.dot(new_S)


Out[71]:
array([-2.20719466, -3.16170819, -4.11622173])
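
As a rough cross-check (a sketch; singular-vector signs are arbitrary, so compare absolute values), TruncatedSVD with one component should give the same rank-1 projection:

# Signs may be flipped relative to the scipy result above
tsvd = TruncatedSVD(1)
np.abs(tsvd.fit_transform(D).ravel()), np.abs(new_U.dot(new_S))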

Decomposition to Classify with DictionaryLearning

DictionaryLearning learns a dictionary of components (atoms) and represents each sample as a sparse combination of them; here we request three atoms, one per iris species.


In [72]:
from sklearn.decomposition import DictionaryLearning

dl = DictionaryLearning(3) # 3 species of iris
transformed = dl.fit_transform(iris_data[::2])
transformed[:5]


Out[72]:
array([[ 0.        ,  6.34476574,  0.        ],
       [ 0.        ,  5.83576461,  0.        ],
       [ 0.        ,  6.32038375,  0.        ],
       [ 0.        ,  5.89318572,  0.        ],
       [ 0.        ,  5.45222715,  0.        ]])
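
The learned dictionary itself lives in components_; each row is one atom expressed in the original four iris features:

# Three atoms (one per requested component), each with four features
dl.components_.shape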

In [73]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(transformed[0:25, 0], transformed[0:25, 1], transformed[0:25, 2]);
ax.scatter(transformed[25:50, 0], transformed[25:50, 1], transformed[25:50, 2]);
ax.scatter(transformed[50:75, 0], transformed[50:75, 1], transformed[50:75, 2]);



In [74]:
transformed = dl.transform(iris_data[1::2])

In [75]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(transformed[0:25, 0], transformed[0:25, 1], transformed[0:25, 2]);
ax.scatter(transformed[25:50, 0], transformed[25:50, 1], transformed[25:50, 2]);
ax.scatter(transformed[50:75, 0], transformed[50:75, 1], transformed[50:75, 2]);


Putting it All Together with Pipelines


In [76]:
iris = datasets.load_iris()
iris_data = iris.data

mask = np.random.binomial(1, .25, iris_data.shape).astype(bool)
iris_data[mask] = np.nan
iris_data[:5]


Out[76]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  nan,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  nan],
       [ nan,  3.1,  1.5,  nan],
       [ 5. ,  3.6,  1.4,  0.2]])

In [77]:
pca = decomposition.PCA()
imputer = preprocessing.Imputer()

pipe = pipeline.Pipeline([('imputer', imputer), ('pca', pca)])
iris_data_transformed = pipe.fit_transform(iris_data)
iris_data_transformed[:5]


Out[77]:
array([[-2.6926286 ,  0.08076282,  0.262652  ,  0.12609496],
       [-2.72935899, -0.1720479 ,  0.02528327, -0.19048606],
       [-2.56356197,  0.11508148, -0.86245153,  0.27052267],
       [-1.97689431,  0.95776527, -0.4142071 , -0.18932791],
       [-2.73379192,  0.02286896,  0.26297356,  0.24838061]])

In [78]:
pipe2 = pipeline.make_pipeline(imputer, pca)
pipe2.steps


Out[78]:
[('imputer',
  Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)),
 ('pca',
  PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False))]

In [79]:
iris_data_transformed2 = pipe2.fit_transform(iris_data)
iris_data_transformed2[:5]


Out[79]:
array([[-2.6926286 ,  0.08076282,  0.262652  ,  0.12609496],
       [-2.72935899, -0.1720479 ,  0.02528327, -0.19048606],
       [-2.56356197,  0.11508148, -0.86245153,  0.27052267],
       [-1.97689431,  0.95776527, -0.4142071 , -0.18932791],
       [-2.73379192,  0.02286896,  0.26297356,  0.24838061]])
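
Pipeline parameters can be adjusted after construction using the step__parameter syntax; a small sketch:

# Ask the PCA step for only two components, then refit the whole pipeline
pipe2.set_params(pca__n_components=2)
pipe2.fit_transform(iris_data).shape  # expected: (150, 2)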

Using Gaussian Processes for Regression


In [108]:
boston = datasets.load_boston()
boston_X = boston.data
boston_y = boston.target

train_set = np.random.choice([True, False], len(boston_y), p=[.75, .25])

In [109]:
from sklearn.gaussian_process import GaussianProcess

gp = GaussianProcess()
gp.fit(boston_X[train_set], boston_y[train_set])


C:\tools\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:52: DeprecationWarning: Class GaussianProcess is deprecated; GaussianProcess was deprecated in version 0.18 and will be removed in 0.20. Use the GaussianProcessRegressor instead.
  warnings.warn(msg, category=DeprecationWarning)
C:\tools\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:70: DeprecationWarning: Function l1_cross_distances is deprecated; l1_cross_distances was deprecated in version 0.18 and will be removed in 0.20.
  warnings.warn(msg, category=DeprecationWarning)
Out[109]:
GaussianProcess(beta0=None,
        corr=<function squared_exponential at 0x000000000C5CF730>,
        normalize=True, nugget=array(2.220446049250313e-15),
        optimizer='fmin_cobyla', random_start=1,
        random_state=<mtrand.RandomState object at 0x00000000060D01F8>,
        regr=<function constant at 0x000000000C5CF400>,
        storage_mode='full', theta0=array([[ 0.1]]), thetaL=None,
        thetaU=None, verbose=False)
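
The deprecation warnings point to GaussianProcessRegressor as the replacement; a minimal sketch of the newer API (default kernel, purely illustrative):

from sklearn.gaussian_process import GaussianProcessRegressor

# Same train/test split; return_std gives the predictive standard deviation directly
gpr = GaussianProcessRegressor(normalize_y=True)
gpr.fit(boston_X[train_set], boston_y[train_set])
preds, std = gpr.predict(boston_X[~train_set], return_std=True)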

In [110]:
test_preds = gp.predict(boston_X[~train_set])

In [111]:
f, ax = plt.subplots(figsize=(10, 7), nrows=3)
f.tight_layout()

ax[0].plot(range(len(test_preds)), test_preds, label='Predicted Values');
ax[0].plot(range(len(test_preds)), boston_y[~train_set], label='Actual Values');
ax[0].set_title('Predicted vs Actual');
ax[0].legend(loc='best');

ax[1].plot(range(len(test_preds)), test_preds - boston_y[~train_set]);
ax[1].set_title('Plotted Residuals');

ax[2].hist(test_preds - boston_y[~train_set]);
ax[2].set_title('Histogram of Residuals');


You can tune regr and theta0 to get different predictions:


In [112]:
gp = GaussianProcess(regr='linear', theta0=5e-1)
gp.fit(boston_X[train_set], boston_y[train_set]);
linear_preds = gp.predict(boston_X[~train_set])
f, ax = plt.subplots(figsize=(7, 5))

f.tight_layout()

ax.hist(test_preds - boston_y[~train_set], label='Residuals Original', color='b', alpha=.5);
ax.hist(linear_preds - boston_y[~train_set], label='Residuals Linear', color='r', alpha=.5);
ax.set_title('Residuals');
ax.legend(loc='best');


C:\tools\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:52: DeprecationWarning: Class GaussianProcess is deprecated; GaussianProcess was deprecated in version 0.18 and will be removed in 0.20. Use the GaussianProcessRegressor instead.
  warnings.warn(msg, category=DeprecationWarning)
C:\tools\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:70: DeprecationWarning: Function l1_cross_distances is deprecated; l1_cross_distances was deprecated in version 0.18 and will be removed in 0.20.
  warnings.warn(msg, category=DeprecationWarning)

In [114]:
f, ax = plt.subplots(figsize=(10, 7), nrows=3)
f.tight_layout()

ax[0].plot(range(len(linear_preds)), linear_preds, label='Predicted Linear Values');
ax[0].plot(range(len(linear_preds)), boston_y[~train_set], label='Actual Values');
ax[0].set_title('Predicted Linear vs Actual');
ax[0].legend(loc='best');

ax[1].plot(range(len(linear_preds)), linear_preds - boston_y[~train_set]);
ax[1].set_title('Plotted Residuals');

ax[2].hist(linear_preds - boston_y[~train_set]);
ax[2].set_title('Histogram of Residuals');



In [113]:
np.power(test_preds - boston_y[~train_set], 2).mean(), np.power(linear_preds - boston_y[~train_set], 2).mean()


Out[113]:
(39.057941598160035, 12.854653855280493)

Measuring Uncertainty


In [115]:
test_preds, MSE = gp.predict(boston_X[~train_set], eval_MSE=True)
MSE[:5]


Out[115]:
array([ 14.57757905,  49.34143487,   6.63401504,   7.24418898,  26.4753024 ])

In [119]:
f, ax = plt.subplots(figsize=(7, 5))
n = 20
rng = range(n)
ax.scatter(rng, test_preds[:n]);
# eval_MSE returns the predictive variance, so take its square root for a ~95% interval
ax.errorbar(rng, test_preds[:n], yerr=1.96*np.sqrt(MSE[:n]));
ax.set_title('Predictions with Error Bars');
ax.set_xlim((-1, 21));


Defining the Gaussian Process Object Directly


In [125]:
from sklearn.gaussian_process import regression_models

X, y = datasets.make_regression(1000, 1, 1)

In [126]:
regression_models.constant(X)[:5]


Out[126]:
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])

In [128]:
regression_models.linear(X)[:5]


Out[128]:
array([[ 1.        ,  0.85496518],
       [ 1.        ,  1.56124722],
       [ 1.        ,  0.44346332],
       [ 1.        ,  0.54509764],
       [ 1.        ,  0.50400804]])

In [129]:
regression_models.quadratic(X)[:5]


Out[129]:
array([[ 1.        ,  0.85496518,  0.73096547],
       [ 1.        ,  1.56124722,  2.43749289],
       [ 1.        ,  0.44346332,  0.19665971],
       [ 1.        ,  0.54509764,  0.29713143],
       [ 1.        ,  0.50400804,  0.25402411]])

Using Stochastic Gradient Descent for Regression


In [130]:
X, y = datasets.make_regression(int(1e6))

Size of the feature matrix X in megabytes:


In [132]:
X.nbytes / 1e6


Out[132]:
800.0

In [133]:
from sklearn import linear_model

sgd = linear_model.SGDRegressor()
train = np.random.choice([True, False], size=len(y), p=[.75, .25])
sgd.fit(X[train], y[train])


Out[133]:
SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)
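
For data that does not fit in memory, SGDRegressor also supports incremental learning through partial_fit; a sketch of a mini-batch loop (the batch size is arbitrary):

# Train on the same data in chunks instead of all at once
sgd_incremental = linear_model.SGDRegressor()
batch_size = 100000
for start in range(0, len(y), batch_size):
    end = start + batch_size
    sgd_incremental.partial_fit(X[start:end], y[start:end])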