Worked machine learning examples using SDSS data

[AstroHackWeek 2014, 2016- J. S. Bloom @profjsb]


Here we'll see some worked ML examples using scikit-learn on Sloan Digital Sky Survey Data (SDSS). This should work in both Python 2 and Python 3.

It's easiest to grab data from the SDSS skyserver SQL server.

For example to do a basic query to get two types of photometry (aperature and petrosian), corrected for extinction, for 1000 QSO sources with redshifts:

SELECT *,dered_u - mag_u AS diff_u, dered_g - mag_g AS diff_g, dered_r - mag_r AS diff_g, dered_i - mag_i AS diff_i, dered_z - mag_z AS diff_z from
(SELECT top 1000
objid, ra, dec, dered_u,dered_g,dered_r,dered_i,dered_z,psfmag_u-extinction_u AS mag_u,
psfmag_g-extinction_g AS mag_g, psfmag_r-extinction_r AS mag_r, psfmag_i-extinction_i AS mag_i,psfmag_z-extinction_z AS mag_z,z AS spec_z,dered_u - dered_g AS u_g_color, 
dered_g - dered_r AS g_r_color,dered_r - dered_i AS r_i_color,dered_i - dered_z AS i_z_color,class
FROM SpecPhoto 
WHERE 
 (class = 'QSO')
 ) as sp
 
Saving this and others like it as a csv we can then start to make our data set for classification/regression.


In [1]:
## get the data locally ... I put this on a gist
!curl -k -O https://gist.githubusercontent.com/anonymous/53781fe86383c435ff10/raw/4cc80a638e8e083775caec3005ae2feaf92b8d5b/qso10000.csv
!curl -k -O https://gist.githubusercontent.com/anonymous/2984cf01a2485afd2c3e/raw/964d4f52c989428628d42eb6faad5e212e79b665/star1000.csv
!curl -k -O https://gist.githubusercontent.com/anonymous/2984cf01a2485afd2c3e/raw/335cd1953e72f6c7cafa9ebb81b43c47cb757a9d/galaxy1000.csv


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2378k  100 2378k    0     0  1313k      0  0:00:01  0:00:01 --:--:-- 1713k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  239k  100  239k    0     0   267k      0 --:--:-- --:--:-- --:--:--  268k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  229k  100  229k    0     0   410k      0 --:--:-- --:--:-- --:--:--  413k

In [4]:
## Python 2 backward compatibility
from __future__ import absolute_import, division, print_function, unicode_literals

In [11]:
# For pretty plotting, pandas, sklearn
!conda install pandas seaborn  matplotlib scikit-learn==0.17.1 -y


Fetching package metadata .......
Solving package specifications: ..........

Package plan for installation in environment /Users/jbloom/anaconda/envs/seminar:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    seaborn-0.7.1              |           py35_0         282 KB

The following NEW packages will be INSTALLED:

    seaborn: 0.7.1-py35_0

Fetching packages ...
seaborn-0.7.1- 100% |################################| Time: 0:00:00 307.65 kB/s
Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%

In [6]:
import copy

import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['axes.labelsize'] = 20

import pandas as pd
pd.set_option('display.max_columns', None)

import seaborn as sns
sns.set()

In [7]:
pd.read_csv("qso10000.csv",index_col=0).head()


Out[7]:
ra dec dered_u dered_g dered_r dered_i dered_z mag_u mag_g mag_r mag_i mag_z spec_z u_g_color g_r_color r_i_color i_z_color class diff_u diff_g diff_g1 diff_i diff_z
objid
1237648720142532813 146.90229 -0.984913 19.64289 19.31131 19.25328 19.15353 19.13345 19.71604 19.37595 19.32818 19.24847 19.21259 0.652417 0.331583 0.058027 0.099751 0.020077 QSO -0.073151 -0.064648 -0.074903 -0.094942 -0.079136
1237658425156829371 142.45853 6.646406 19.39569 19.34811 19.16626 18.93152 19.06013 19.40327 19.36566 19.18335 18.94222 19.08077 1.537123 0.047575 0.181847 0.234743 -0.128612 QSO -0.007589 -0.017550 -0.017090 -0.010700 -0.020636
1237660413189095710 143.15770 8.175363 19.10362 18.88904 18.70672 18.58508 18.61328 19.11102 18.88857 18.70458 18.57886 18.62583 1.467101 0.214582 0.182318 0.121645 -0.028202 QSO -0.007397 0.000473 0.002148 0.006218 -0.012548
1237660412651962520 142.49264 7.800945 19.88820 19.75146 19.52941 19.65000 19.52470 19.88709 19.75292 19.53512 19.67052 19.50256 1.014217 0.136745 0.222052 -0.120590 0.125301 QSO 0.001118 -0.001457 -0.005716 -0.020527 0.022139
1237658493336944662 142.64367 7.917698 18.45897 18.40651 18.15901 17.77130 17.75986 18.55725 18.55002 18.40316 18.01008 18.03100 0.215603 0.052462 0.247498 0.387709 0.011444 QSO -0.098282 -0.143515 -0.244150 -0.238779 -0.271137

Notice that there are several things about this dataset. First, RA and DEC are probably not something we want to use in making predictions: it's the location of the object on the sky. Second, the magnitudes are highly covariant with the colors. So dumping all but one of the magnitudes might be a good idea to avoid overfitting.


In [14]:
usecols = [str(x) for x in ["objid","dered_r","spec_z","u_g_color","g_r_color","r_i_color",
           "i_z_color","diff_u",\
           "diff_g1","diff_i","diff_z"]]

qsos = pd.read_csv("qso10000.csv",index_col=0,
                                  usecols=usecols)

qso_features = copy.copy(qsos)
qso_redshifts = qsos["spec_z"]
del qso_features["spec_z"]
qso_features.head()


Out[14]:
dered_r u_g_color g_r_color r_i_color i_z_color diff_u diff_g1 diff_i diff_z
objid
1237648720142532813 19.25328 0.331583 0.058027 0.099751 0.020077 -0.073151 -0.074903 -0.094942 -0.079136
1237658425156829371 19.16626 0.047575 0.181847 0.234743 -0.128612 -0.007589 -0.017090 -0.010700 -0.020636
1237660413189095710 18.70672 0.214582 0.182318 0.121645 -0.028202 -0.007397 0.002148 0.006218 -0.012548
1237660412651962520 19.52941 0.136745 0.222052 -0.120590 0.125301 0.001118 -0.005716 -0.020527 0.022139
1237658493336944662 18.15901 0.052462 0.247498 0.387709 0.011444 -0.098282 -0.244150 -0.238779 -0.271137

In [15]:
f, ax = plt.subplots()
bins =  ax.hist(qso_redshifts.values)
ax.set_xlabel("redshift", fontsize=18)
ax.set_ylabel("N",fontsize=18)


Out[15]:
<matplotlib.text.Text at 0x110176090>

Pretty clearly a big cut at around $z=2$.


In [16]:
import matplotlib as mpl
import matplotlib.cm as cm
## truncate the color at z=2.5 just to keep some contrast.
norm = mpl.colors.Normalize(vmin=min(qso_redshifts.values), vmax=2.5)
cmap = cm.jet
m = cm.ScalarMappable(norm=norm, cmap=cmap)
rez = pd.scatter_matrix(qso_features[0:2000], 
                        alpha=0.2,figsize=[15,15],color=m.to_rgba(qso_redshifts.values))


Egad. Some pretty crazy values for dered_r and g_r_color. Let's figure out why.


In [17]:
min(qso_features["dered_r"].values)


Out[17]:
-9999.0

Looks like there are some missing values in the catalog which are set at -9999. Let's zoink those from the dataset for now.


In [18]:
qsos = pd.read_csv("qso10000.csv",index_col=0,
                                  usecols=usecols)

qsos = qsos[(qsos["dered_r"] > -9999) & (qsos["g_r_color"] > -10) & (qsos["g_r_color"] < 10)]
qso_features = copy.copy(qsos)
qso_redshifts = qsos["spec_z"]
del qso_features["spec_z"]

In [19]:
rez = pd.scatter_matrix(qso_features[0:2000], alpha=0.2,figsize=[15,15],\
                        color=m.to_rgba(qso_redshifts.values))


Ok. This looks pretty clean. Let's save this for future use.


In [20]:
qsos.to_csv("qsos.clean.csv")

Data Munging done. Let's do some ML!

Basic Model Fitting

We need to create a training set and a testing set.


In [21]:
X = qso_features.values  # 9-d feature space
Y = qso_redshifts.values # redshifts

In [22]:
print("feature vector shape=", X.shape)
print("class shape=", Y.shape)


feature vector shape= (9988, 9)
class shape= (9988,)

In [23]:
# half of data
import math
half = math.floor(len(Y)/2)
train_X = X[:half]
train_Y = Y[:half]
test_X = X[half:]
test_Y = Y[half:]


/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:4: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:5: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:6: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:7: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

In [24]:
from sklearn import linear_model
clf = linear_model.LinearRegression()

In [ ]:
clf.

In [25]:
# fit the model
clf.fit(train_X, train_Y)


Out[25]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [26]:
# now do the prediction
Y_lr_pred = clf.predict(test_X)

# how well did we do?
from sklearn.metrics import mean_squared_error
mse = np.sqrt(mean_squared_error(test_Y,Y_lr_pred)) ; print("MSE",mse)


MSE 0.659053404786

In [27]:
plt.plot(test_Y,Y_lr_pred - test_Y,'o',alpha=0.1)
plt.title("Linear Regression Residuals - MSE = %.1f" % mse)
plt.xlabel("Spectroscopic Redshift")
plt.ylabel("Residual")
plt.hlines(0,min(test_Y),max(test_Y),color="red")


Out[27]:
<matplotlib.collections.LineCollection at 0x120e12f90>

In [28]:
# here's the MSE guessing the AVERAGE value
print("naive mse", ((1./len(train_Y))*(train_Y - train_Y.mean())**2).sum())


naive mse 0.643120844956

In [29]:
mean_squared_error?

k-Nearest Neighbor (KNN) Regression


In [30]:
from sklearn import neighbors
from sklearn import preprocessing
X_scaled = preprocessing.scale(X) # many methods work better on scaled X
clf1 = neighbors.KNeighborsRegressor(10)
train_X = X_scaled[:half]
test_X = X_scaled[half:]
clf1.fit(train_X,train_Y)


/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:5: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:6: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
Out[30]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='uniform')

In [31]:
Y_knn_pred = clf1.predict(test_X)
mse = mean_squared_error(test_Y,Y_knn_pred) ; print("MSE (KNN)", mse)
plt.plot(test_Y, Y_knn_pred - test_Y,'o',alpha=0.2)
plt.title("k-NN Residuals - MSE = %.1f" % mse)
plt.xlabel("Spectroscopic Redshift")
plt.ylabel("Residual")
plt.hlines(0,min(test_Y),max(test_Y),color="red")


MSE (KNN) 0.234493621244
Out[31]:
<matplotlib.collections.LineCollection at 0x11c10ae90>

In [32]:
from sklearn import neighbors
from sklearn import preprocessing

X_scaled = preprocessing.scale(X) # many methods work better on scaled X
train_X = X_scaled[:half]
train_Y = Y[:half]
test_X = X_scaled[half:]
test_Y = Y[half:]
clf1 = neighbors.KNeighborsRegressor(5)
clf1.fit(train_X,train_Y)


/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:5: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:6: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:7: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/jbloom/anaconda/envs/ipykernel_py2/lib/python2.7/site-packages/ipykernel/__main__.py:8: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
Out[32]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

In [33]:
Y_knn_pred = clf1.predict(test_X)
mse = mean_squared_error(test_Y,Y_knn_pred) ; print("MSE=",mse)
plt.scatter(test_Y, Y_knn_pred - test_Y,alpha=0.2)
plt.title("k-NN Residuals - MSE = %.1f" % mse)
plt.xlabel("Spectroscopic Redshift")
plt.ylabel("Residual")
plt.hlines(0,min(test_Y),max(test_Y),color="red")


MSE= 0.239666881833
Out[33]:
<matplotlib.collections.LineCollection at 0x119cd0dd0>

In [34]:
from sklearn.ensemble import RandomForestRegressor
clf2 = RandomForestRegressor(n_estimators=100, 
                            criterion='mse', max_depth=None, 
                            min_samples_split=2, min_samples_leaf=1, 
                            max_features='auto', max_leaf_nodes=None, 
                            bootstrap=True, oob_score=False, n_jobs=1, 
                            random_state=None, verbose=0, warm_start=False)
clf2.fit(train_X,train_Y)


Out[34]:
RandomForestRegressor(bootstrap=True, criterion=u'mse', max_depth=None,
           max_features=u'auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [35]:
Y_rf_pred = clf2.predict(test_X)
mse = mean_squared_error(test_Y,Y_rf_pred) ; print("MSE",mse)
plt.scatter(test_Y, Y_rf_pred - test_Y,alpha=0.2)
plt.title("RF Residuals - MSE = %.1f" % mse)
plt.xlabel("Spectroscopic Redshift")
plt.ylabel("Residual")
plt.hlines(0,min(test_Y),max(test_Y),color="red")


MSE 0.202454102592
Out[35]:
<matplotlib.collections.LineCollection at 0x115e30750>

model selection: cross-validation


In [36]:
from sklearn import cross_validation

In [37]:
from sklearn import linear_model
clf = linear_model.LinearRegression()
from sklearn.cross_validation import cross_val_score

def print_cv_score_summary(model, xx, yy, cv):
    scores = cross_val_score(model, xx, yy, cv=cv, n_jobs=1)
    print("mean: {:3f}, stdev: {:3f}".format(
        np.mean(scores), np.std(scores)))

In [38]:
print_cv_score_summary(clf,X,Y,cv=cross_validation.KFold(len(Y), 5))


mean: 0.237593, stdev: 0.026459

In [39]:
print_cv_score_summary(clf,X,Y,
    cv=cross_validation.KFold(len(Y),10,shuffle=True,random_state=1))


mean: 0.246604, stdev: 0.041721

In [40]:
print_cv_score_summary(clf2,X,Y,
    cv=cross_validation.KFold(len(Y),3,shuffle=True,random_state=1))


mean: 0.601922, stdev: 0.023209

Classification

Let's do a 3-class classification problem: star, galaxy, or QSO


In [43]:
usecols = [str(x) for x in ["objid","dered_r","u_g_color","g_r_color","r_i_color","i_z_color","diff_u",\
           "diff_g1","diff_i","diff_z","class"]]

all_sources = pd.read_csv("qso10000.csv",index_col=0,usecols=usecols)[:1000]

all_sources = all_sources.append(pd.read_csv("star1000.csv",index_col=0,usecols=usecols))

all_sources = all_sources.append(pd.read_csv("galaxy1000.csv",index_col=0,usecols=usecols))

all_sources = all_sources[(all_sources["dered_r"] > -9999) & (all_sources["g_r_color"] > -10) & (all_sources["g_r_color"] < 10)]
all_features = copy.copy(all_sources)
all_label = all_sources["class"]
del all_features["class"]
X = copy.copy(all_features.values)
Y = copy.copy(all_label.values)

In [44]:
all_sources.tail()


Out[44]:
dered_r u_g_color g_r_color r_i_color i_z_color diff_u diff_g1 diff_i diff_z
objid
1237657775542632759 15.42325 1.999353 0.970126 0.435975 0.373470 -1.944487 -1.971534 -2.052320 -1.971382
1237657775542698090 17.51366 2.212025 0.965242 0.410664 0.371384 -0.778788 -0.944075 -0.895832 -0.830559
1237657775542698177 17.15747 1.190033 0.332136 0.252352 0.070980 -2.391565 -2.977261 -2.889906 -2.671612
1237657630586634463 17.19312 1.179663 0.678915 0.394419 0.272171 -1.563450 -1.913368 -1.791895 -1.615683
1237657630049698007 17.20485 1.925320 1.126934 0.477961 0.334377 -1.211906 -1.377165 -1.402037 -1.218332

In [45]:
print("feature vector shape=", X.shape)
print("class shape=", Y.shape)


feature vector shape= (3000, 9)
class shape= (3000,)

In [46]:
Y[Y=="QSO"] = 0
Y[Y=="STAR"] = 1
Y[Y=="GALAXY"] = 2
Y = list(Y)

Let's look at random forest


In [47]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=200,oob_score=True)
clf.fit(X,Y)


Out[47]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

what are the important features in the data?


In [48]:
sorted(zip(all_sources.columns.values,clf.feature_importances_),key=lambda q: q[1],reverse=True)


Out[48]:
[('u_g_color', 0.26258480958434571),
 ('diff_i', 0.16541030117105393),
 ('diff_g1', 0.15483352159605107),
 ('diff_z', 0.12061964529166984),
 ('g_r_color', 0.085033386405546099),
 ('diff_u', 0.068554012091952346),
 ('r_i_color', 0.06077107102833585),
 ('dered_r', 0.046825658066310588),
 ('i_z_color', 0.03536759476473475)]

In [49]:
clf.oob_score_ ## "Out of Bag" Error


Out[49]:
0.95299999999999996

In [52]:
import numpy as np
from sklearn import svm, datasets
cmap = cm.jet_r

# import some data to play with

plt.figure(figsize=(10,10))

X = all_features.values[:, 1:3] # use only two features for training and plotting purposes

h = 0.02  # step size in the mesh

# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel=str('linear'), C=C).fit(X, Y)
rbf_svc = svm.SVC(kernel=str('rbf'), gamma=0.7, C=C).fit(X, Y)
poly_svc = svm.SVC(kernel=str('poly'), degree=3, C=C).fit(X, Y)
lin_svc = svm.LinearSVC(C=C).fit(X, Y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel',
          'LinearSVC (linear kernel)']


norm = mpl.colors.Normalize(vmin=min(Y), vmax=max(Y))
m = cm.ScalarMappable(norm=norm, cmap=cmap)

for i, clf in enumerate((svc, rbf_svc, poly_svc, lin_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, m_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z,cmap=cm.Paired)
    plt.axis('off')

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=m.to_rgba(Y),cmap=cm.Paired)

    plt.title(titles[i])


model improvement with GridSearchCV

Hyperparameter optimization. Parallel: makes use of joblib


In [53]:
# fit a support vector machine classifier
from sklearn import grid_search
from sklearn import svm
from sklearn import metrics
import logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# instantiate the SVM object
sdss_svm = svm.SVC()

X = all_features.values
Y = all_label.values

# parameter values over which we will search
parameters = {'kernel':(str('linear'), str('rbf')), \
              'gamma':[0.5, 0.3, 0.1,  0.01],
              'C':[0.1, 2, 4, 5, 10, 20,30]}
#parameters = {'kernel':('linear', 'rbf')}

# do a grid search to find the highest 3-fold CV zero-one score
svm_tune = grid_search.GridSearchCV(sdss_svm, parameters,\
                                    n_jobs = -1, cv = 3,verbose=1)
svm_opt = svm_tune.fit(X, Y)

# print the best score and estimator
print(svm_opt.best_score_)
print(svm_opt.best_estimator_)


Fitting 3 folds for each of 56 candidates, totalling 168 fits
[Parallel(n_jobs=-1)]: Done  84 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 168 out of 168 | elapsed:   10.6s finished
0.955
SVC(C=20, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [54]:
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)

classifier = svm.SVC(**svm_opt.best_estimator_.get_params())
y_pred = classifier.fit(X_train, y_train).predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

print(cm)

# Show confusion matrix in a separate window
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


[[231   1   1]
 [  7 254  19]
 [  3   9 225]]

In [55]:
# instantiate the RF learning object
sdss_rf = RandomForestClassifier()

X = all_features.values
Y = all_label.values

# parameter values over which we will search
parameters = {'n_estimators':(10,50,200),"max_features": ["auto",3,5],
              'criterion':[str("gini"),str("entropy")],"min_samples_leaf": [1,2]}
#parameters = {'kernel':('linear', 'rbf')}

# do a grid search to find the highest 3-fold CV zero-one score
rf_tune = grid_search.GridSearchCV(sdss_rf, parameters,\
                                    n_jobs = -1, cv = 3,verbose=1)
rf_opt = rf_tune.fit(X, Y)

# print the best score and estimator
print(rf_opt.best_score_)
print(rf_opt.best_estimator_)


Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Done  54 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   43.4s finished
0.949
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features=u'auto', max_leaf_nodes=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [56]:
clf.get_params()


Out[56]:
{'C': 1.0,
 'class_weight': None,
 'dual': True,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'loss': 'squared_hinge',
 'max_iter': 1000,
 'multi_class': 'ovr',
 'penalty': 'l2',
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}

In [57]:
svm_opt.best_estimator_.get_params()


Out[57]:
{'C': 20,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': None,
 'degree': 3,
 'gamma': 0.1,
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [96]:
grid_search.GridSearchCV?

Parallelism & Hyperparameter Fitting

GridSearchCV is not compute/RAM optimized. It's also not obviously optimal.


In [58]:
import time
start = time.time()
## this takes about 30 seconds

# instantiate the RF learning object
sdss_rf = RandomForestClassifier()

X = all_features.values
Y = all_label.values

# parameter values over which we will search
parameters = {'n_estimators':(10,50,200),"max_features": ["auto",3,5],
              'criterion':["gini","entropy"],"min_samples_leaf": [1,2]}
#parameters = {'kernel':('linear', 'rbf')}

# do a grid search to find the highest 3-fold CV zero-one score
rf_tune = grid_search.GridSearchCV(sdss_rf, parameters,\
                                    n_jobs = -1, cv = 3,verbose=1)
rf_opt = rf_tune.fit(X, Y)

# print the best score and estimator
print(rf_opt.best_score_)
print(rf_opt.best_estimator_)
print("total time in seconds",time.time()- start)


Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:   16.6s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   39.8s finished
0.949
RandomForestClassifier(bootstrap=True, class_weight=None, criterion=u'gini',
            max_depth=None, max_features=u'auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
total time in seconds 42.2648029327

Let's do this without a full search...


In [59]:
import time
start = time.time()

# instantiate the RF learning object
sdss_rf = RandomForestClassifier()

X = all_features.values
Y = all_label.values

# parameter values over which we will search
parameters = {'n_estimators':(10,50,200),"max_features": ["auto",3,5],
              'criterion':["gini","entropy"],"min_samples_leaf": [1,2]}
#parameters = {'kernel':('linear', 'rbf')}

# do a grid search to find the highest 3-fold CV zero-one score
rf_tune = grid_search.RandomizedSearchCV(sdss_rf, parameters,\
                                    n_jobs = -1, cv = 3,verbose=1)
rf_opt = rf_tune.fit(X, Y)

# print the best score and estimator
print(rf_opt.best_score_)
print(rf_opt.best_estimator_)
print("total time in seconds",time.time()- start)


Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    7.0s finished
0.948666666667
RandomForestClassifier(bootstrap=True, class_weight=None,
            criterion=u'entropy', max_depth=None, max_features=5,
            max_leaf_nodes=None, min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
total time in seconds 8.31015610695

In [14]:
!conda install dask distributed -y


Fetching package metadata .......
Solving package specifications: ..........

Package plan for installation in environment /Users/jbloom/anaconda/envs/seminar:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    click-6.6                  |           py35_0         100 KB
    docutils-0.12              |           py35_2         668 KB
    jmespath-0.9.0             |           py35_0          28 KB
    lz4-0.8.2                  |           py35_0          28 KB
    msgpack-python-0.4.7       |           py35_0          69 KB
    psutil-4.3.0               |           py35_0         218 KB
    tblib-1.3.0                |           py35_0          12 KB
    botocore-1.4.37            |           py35_0         1.9 MB
    boto3-1.3.1                |           py35_0          97 KB
    s3fs-0.0.7                 |           py35_0          24 KB
    distributed-1.12.0         |           py35_0         302 KB
    ------------------------------------------------------------
                                           Total:         3.4 MB

The following NEW packages will be INSTALLED:

    boto3:          1.3.1-py35_0 
    botocore:       1.4.37-py35_0
    click:          6.6-py35_0   
    distributed:    1.12.0-py35_0
    docutils:       0.12-py35_2  
    jmespath:       0.9.0-py35_0 
    lz4:            0.8.2-py35_0 
    msgpack-python: 0.4.7-py35_0 
    psutil:         4.3.0-py35_0 
    s3fs:           0.0.7-py35_0 
    tblib:          1.3.0-py35_0 

Fetching packages ...
click-6.6-py35 100% |################################| Time: 0:00:00 122.58 kB/s
docutils-0.12- 100% |################################| Time: 0:00:01 548.08 kB/s
jmespath-0.9.0 100% |################################| Time: 0:00:00 172.88 kB/s
lz4-0.8.2-py35 100% |################################| Time: 0:00:00 171.59 kB/s
msgpack-python 100% |################################| Time: 0:00:00 263.76 kB/s
psutil-4.3.0-p 100% |################################| Time: 0:00:00 320.51 kB/s
tblib-1.3.0-py 100% |################################| Time: 0:00:00   4.60 MB/s
botocore-1.4.3 100% |################################| Time: 0:00:02 833.33 kB/s
boto3-1.3.1-py 100% |################################| Time: 0:00:00 146.29 kB/s
s3fs-0.0.7-py3 100% |################################| Time: 0:00:00 146.49 kB/s
distributed-1. 100% |################################| Time: 0:00:01 306.15 kB/s
Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%

In [60]:
import os
myhome = os.getcwd()

In [61]:
os.environ["PYTHONPATH"] = myhome + "/dask-learn"

In [62]:
myhome = !pwd
!git clone https://github.com/dask/dask-learn.git
%cd dask-learn
!git pull
!python setup.py install


fatal: destination path 'dask-learn' already exists and is not an empty directory.
/Users/jbloom/Classes/day4/dask-learn
Already up-to-date.
running install
running bdist_egg
running egg_info
writing requirements to dask_learn.egg-info/requires.txt
writing top-level names to dask_learn.egg-info/top_level.txt
writing dask_learn.egg-info/PKG-INFO
writing dependency_links to dask_learn.egg-info/dependency_links.txt
reading manifest file 'dask_learn.egg-info/SOURCES.txt'
writing manifest file 'dask_learn.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.6-x86_64/egg
running install_lib
running build_py
creating build/bdist.macosx-10.6-x86_64/egg
creating build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/__init__.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/averaged.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/chained.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/core.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/cross_validation.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/feature_extraction.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/grid_search.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/matrix.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/pipeline.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/utils.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
copying build/lib/dklearn/wrapped.py -> build/bdist.macosx-10.6-x86_64/egg/dklearn
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/__init__.py to __init__.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/averaged.py to averaged.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/chained.py to chained.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/core.py to core.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/cross_validation.py to cross_validation.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/feature_extraction.py to feature_extraction.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/grid_search.py to grid_search.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/matrix.py to matrix.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/pipeline.py to pipeline.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/utils.py to utils.cpython-35.pyc
byte-compiling build/bdist.macosx-10.6-x86_64/egg/dklearn/wrapped.py to wrapped.cpython-35.pyc
creating build/bdist.macosx-10.6-x86_64/egg/EGG-INFO
copying dask_learn.egg-info/PKG-INFO -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO
copying dask_learn.egg-info/SOURCES.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO
copying dask_learn.egg-info/dependency_links.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO
copying dask_learn.egg-info/requires.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO
copying dask_learn.egg-info/top_level.txt -> build/bdist.macosx-10.6-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist/dask_learn-0.1.0-py3.5.egg' and adding 'build/bdist.macosx-10.6-x86_64/egg' to it
removing 'build/bdist.macosx-10.6-x86_64/egg' (and everything under it)
Processing dask_learn-0.1.0-py3.5.egg
Removing /Users/jbloom/anaconda/envs/seminar/lib/python3.5/site-packages/dask_learn-0.1.0-py3.5.egg
Copying dask_learn-0.1.0-py3.5.egg to /Users/jbloom/anaconda/envs/seminar/lib/python3.5/site-packages
dask-learn 0.1.0 is already the active version in easy-install.pth

Installed /Users/jbloom/anaconda/envs/seminar/lib/python3.5/site-packages/dask_learn-0.1.0-py3.5.egg
Processing dependencies for dask-learn==0.1.0
Searching for numpy==1.11.1
Best match: numpy 1.11.1
Adding numpy 1.11.1 to easy-install.pth file

Using /Users/jbloom/anaconda/envs/seminar/lib/python3.5/site-packages
Searching for scikit-learn==0.17.1
Best match: scikit-learn 0.17.1
Adding scikit-learn 0.17.1 to easy-install.pth file

Using /Users/jbloom/anaconda/envs/seminar/lib/python3.5/site-packages
Searching for dask==0.11.0
Best match: dask 0.11.0
Adding dask 0.11.0 to easy-install.pth file

Using /Users/jbloom/anaconda/envs/seminar/lib/python3.5/site-packages
Finished processing dependencies for dask-learn==0.1.0

In [63]:
from dklearn.grid_search import GridSearchCV as DaskGridSearchCV

In [64]:
import time
start = time.time()
# instantiate the RF learning object
sdss_rf = RandomForestClassifier()

X = all_features.values
Y = all_label.values

# parameter values over which we will search
parameters = {'n_estimators':(10,50,200),"max_features": ["auto",3,5],
              'criterion':["gini","entropy"],"min_samples_leaf": [1,2]}
#parameters = {'kernel':('linear', 'rbf')}

# do a grid search to find the highest 3-fold CV zero-one score
rf_tune = DaskGridSearchCV(sdss_rf, parameters,\
                                    cv = 3)
rf_opt = rf_tune.fit(X, Y)

# print the best score and estimator
print(rf_opt.best_score_)
print(rf_opt.best_estimator_)
print("total time in seconds",time.time()- start)


0.95
RandomForestClassifier(bootstrap=True, class_weight=None, criterion=u'gini',
            max_depth=None, max_features=5, max_leaf_nodes=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
total time in seconds 66.4075751305

In [17]:
#To do distributed:
#from distributed import Executor
#executor = Executor()
#executor


Out[17]:
<Executor: scheduler="127.0.0.1:52744" processes=3 cores=3>

Clustering, Unsupervised Learning & Anomoly Detection

It's often of interest to find patterns in the data that you didn't know where there, as an end to itself or as a starting point of exploration.

One approach is to look at individual sources that are mis-classified.


In [176]:
usecols = [str(x) for x in ["objid","dered_r","u_g_color","g_r_color","r_i_color","i_z_color","diff_u",\
           "diff_g1","diff_i","diff_z","class"]]

all_sources = pd.read_csv("qso10000.csv",index_col=0,usecols=usecols)[:1000]

all_sources = all_sources.append(pd.read_csv("star1000.csv",index_col=0,usecols=usecols))

all_sources = all_sources.append(pd.read_csv("galaxy1000.csv",index_col=0,usecols=usecols))

all_sources = all_sources[(all_sources["dered_r"] > -9999) & (all_sources["g_r_color"] > -10) & (all_sources["g_r_color"] < 10)]
all_features = copy.copy(all_sources)
all_label = all_sources["class"]
del all_features["class"]
X = copy.copy(all_features.values)
Y = copy.copy(all_label.values)

# instantiate the RF learning object
sdss_rf = RandomForestClassifier()

X = all_features.values
Y = all_label.values

# parameter values over which we will search
parameters = {'n_estimators':(100,),"max_features": ["auto",3,4],
              'criterion':["entropy"],"min_samples_leaf": [1,2]}

# do a grid search to find the highest 5-fold CV zero-one score
rf_tune = grid_search.GridSearchCV(sdss_rf, parameters,\
                                    n_jobs = -1, cv = 5,verbose=1)
rf_opt = rf_tune.fit(X, Y)

# print the best score and estimator
print(rf_opt.best_score_)
print(rf_opt.best_estimator_)


Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   16.0s finished
0.953666666667
RandomForestClassifier(bootstrap=True, class_weight=None,
            criterion=u'entropy', max_depth=None, max_features=3,
            max_leaf_nodes=None, min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [177]:
probs = rf_opt.best_estimator_.predict_proba(X)
print(rf_opt.best_estimator_.classes_)
for i in range(probs.shape[0]):
    if rf_opt.best_estimator_.classes_[np.argmax(probs[i,:])] != Y[i]:
        print("Label={0:6s}".format(Y[i]), end=" ")
        print("Pgal={0:0.3f} Pqso={1:0.3f} Pstar={2:0.3f}".format(probs[i,0],probs[i,1],probs[i,2]),end=" ")
        print("http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=" + str(all_sources.index[i]))


['GALAXY' 'QSO' 'STAR']
Label=QSO    Pgal=0.517 Pqso=0.483 Pstar=0.000 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237674288611393658
Label=QSO    Pgal=0.522 Pqso=0.473 Pstar=0.005 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657774468956437
Label=QSO    Pgal=0.565 Pqso=0.435 Pstar=0.000 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657628440461503
Label=QSO    Pgal=0.013 Pqso=0.488 Pstar=0.499 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657630590698230
Label=QSO    Pgal=0.619 Pqso=0.381 Pstar=0.000 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657630054220514
Label=QSO    Pgal=0.007 Pqso=0.329 Pstar=0.665 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660562431017838
Label=QSO    Pgal=0.520 Pqso=0.480 Pstar=0.000 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660562431476162
Label=QSO    Pgal=0.461 Pqso=0.396 Pstar=0.143 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660343925997780
Label=QSO    Pgal=0.667 Pqso=0.322 Pstar=0.010 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660343927177632
Label=QSO    Pgal=0.645 Pqso=0.355 Pstar=0.000 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660635446378588
Label=QSO    Pgal=0.000 Pqso=0.473 Pstar=0.527 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660959183732748
Label=QSO    Pgal=0.613 Pqso=0.385 Pstar=0.003 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237658203418525960
Label=QSO    Pgal=0.003 Pqso=0.496 Pstar=0.500 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237658190539128842
Label=QSO    Pgal=0.003 Pqso=0.421 Pstar=0.576 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237664870283870968
Label=QSO    Pgal=0.649 Pqso=0.322 Pstar=0.029 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660343934976185
Label=STAR   Pgal=0.000 Pqso=0.525 Pstar=0.475 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237674290222071847
Label=STAR   Pgal=0.000 Pqso=0.670 Pstar=0.330 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657630051008867
Label=STAR   Pgal=0.436 Pqso=0.172 Pstar=0.392 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657630051795320
Label=STAR   Pgal=0.009 Pqso=0.582 Pstar=0.410 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237660562431803853
Label=GALAXY Pgal=0.337 Pqso=0.092 Pstar=0.572 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237661063877033994
Label=GALAXY Pgal=0.361 Pqso=0.410 Pstar=0.229 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657776076358016
Label=GALAXY Pgal=0.354 Pqso=0.287 Pstar=0.358 http://skyserver.sdss.org/dr12/en/tools/quicklook/summary.aspx?id=1237657774468759924

We can also do manifold learning to be able to project structure in lower dimensions.


In [178]:
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)

In [179]:
rp = random_projection.SparseRandomProjection(n_components=2, density=0.3, random_state=1)
X_projected = rp.fit_transform(X)

In [180]:
Y[Y=="QSO"] = 0
Y[Y=="STAR"] = 1
Y[Y=="GALAXY"] = 2

In [181]:
Yi = Y.astype(np.int64)

In [182]:
plt.title("Manifold Sparse Random Projection")
plt.scatter(X_projected[:, 0], X_projected[:, 1],c=plt.cm.Set1(Yi / 3.),alpha=0.2,
           edgecolor='none',s=5*(X[:,0] - np.min(X[:,0])))


Out[182]:
<matplotlib.collections.PathCollection at 0x1109ce0d0>

In [183]:
clf = manifold.MDS(n_components=2, n_init=1, max_iter=100)
X_mds = clf.fit_transform(X)

In [184]:
plt.title("MDS Projection")
plt.scatter(X_mds[:, 0], X_mds[:, 1],c=plt.cm.Set1(Yi / 3.),alpha=0.3,
           s=5*(X[:,0] - np.min(X[:,0])))


Out[184]:
<matplotlib.collections.PathCollection at 0x113f31950>

In [ ]: