In [1]:
%pylab inline
In [31]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from scipy import stats, optimize
from sklearn.preprocessing import Imputer
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.base import clone
from itertools import combinations
from sklearn.metrics import explained_variance_score, r2_score, median_absolute_error
print('The scikit-learn version is {}.'.format(sklearn.__version__))
print('The pandas version is {}.'.format(pd.__version__))
print('The numpy version is {}.'.format(np.__version__))
We use pandas read_csv(path/to/csv)
method to read the csv file. Next, replace the missing values with np.NaN
i.e. Not a Number. This way we can count the number of missing values per column.
The dataset as described on UC Irvine repo. (125 predictive, 4 non-predictive, 18 potential goal)
We remove the features which are final goals and some other irrelevant features. For example the following attribute is to be predicted. murdPerPop: number of murders per 100K population (numeric - decimal) potential GOAL attribute (to be predicted)
In [24]:
df = pd.read_csv('../datasets/UCIrvineCrimeData.csv');
df = df.replace('?',np.NAN)
features = [x for x in df.columns if x not in ['fold', 'state', 'community', 'communityname', 'county'
,'ViolentCrimesPerPop']]
In [25]:
df.isnull().sum()
Out[25]:
In [5]:
df.dropna()
Out[5]:
Similarly, we can drop columns that have atleast one NaN
in any row by setting the axis argument to 1:
In [6]:
df.dropna(axis=1);
The dropna()
method supports additional parameters that can come in handy.
In [7]:
#only drop rows where all columns are null
df.dropna(how='all');
In [8]:
# drop rows that have not at least 4 non-NaN values
df.dropna(thresh=4);
In [9]:
# only drop rows where NaN appear in specific columns (here :'community')
df.dropna(subset=['community']);
Often, the removal of samples or dropping of entire feature columns is simply not feasible, because we might lost too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the othere training samples in our dataset. One of the most common interpolation technique is mean interpolation, where we simply replace the missing value by the mean value of the entire feature column. A convenient way to achieve this is using the Imputer
class from the scikit-learn
as shown in the following code.
In [10]:
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df[features])
imputed_data = imr.transform(df[features]);
A convenient way to randomly partition the dataset into a separate test & training dataset is to use the train_test_split
function from scikit-learn's
cross_validation
submodule. As of now, the target variable is ''
In [11]:
#df = df.drop(["communityname", "state", "county", "community"], axis=1)
X, y = imputed_data, df['ViolentCrimesPerPop']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0);
First, we assigned the NumPy array representation of features columns to the variable X, and we assigned the predicted variable to the variable y
. Then we used the train_test_split
function to randomly split X
and y
into separate training & test datasets. By setting test_size=0.3
we assigned 30 percent of samples to X_test and the remaining 70 percent to X_train.
Sequential feature selection algorithms are a family of greedy search algorithms that can reduce an initial d-dimensional feature space into a k-dimensional feature subspace where k < d. The idea is to select the most relevant subset of features to improve computational efficieny and reduce generalization error
In [12]:
class SBS():
def __init__(self, estimator, features,
scoring=r2_score, test_size=0.25,
random_state=1):
self.scoring = scoring
self.estimator = estimator
self.features = features
self.test_size = test_size
self.random_state = random_state
def fit(self, X, y):
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size = self.test_size,
random_state = self.random_state)
dim = X_train.shape[1]
self.indices_ = tuple(range(dim))
self.subsets_ = [self.indices_]
score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
self.scores_ = [score]
while dim > self.features:
scores = []
subsets = []
for p in combinations(self.indices_, r=dim-1):
score = self._calc_score(X_train, y_train, X_test, y_test, p)
scores.append(score)
subsets.append(p)
best = np.argmax(score)
self.indices_ = subsets[best]
self.subsets_.append(self.indices_)
dim -= 1
self.scores_.append(scores[best])
self.k_score_ = self.scores_[-1]
return self
def transform(self, X):
return X[:, self.indices_]
def _calc_score(self, X_train, y_train, X_test, y_test, indices):
self.estimator.fit(X_train[:, indices], y_train)
y_pred = self.estimator.predict(X_test[:, indices])
score = self.scoring(y_test, y_pred)
return score
In the preceding implementation, we define k_features
to be the final number of features when the algorithm should stop.
Inside the while
loop of the fit
method, the feature subsets created by the itertools.combination
function are evaluated and reduced until the feature subset has the desired dimensionality. In each iteration, the accuracy score of the best subset is collected in a list self.scores_
based on the internally created test dataset X_test
. We will use thoese scores to evaluate the results.
Let's see our implementation using the LinearRegression
In [13]:
clf = LinearRegression() #This can also be replaced by Lasso() or Ridge()
sbs = SBS(clf, features=1)
sbs.fit(X_train, y_train)
Out[13]:
In [28]:
k_feat = [len(k) for k in sbs.subsets_]
plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([-2, 3])
plt.ylabel('Accuracy')
plt.xlabel('Number of Features')
plt.grid()
plt.show()
I have used r2_score
as a measurement of accuracy. R^2 (coefficient of determination) regression score function.
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Also, please note that the accuracy scores are reported on the validation set.
Let us see which 100 features give us the best score of 0.75 on the validation set
In [29]:
k100 = sbs.subsets_[23]
In [30]:
features100 = [features[x] for x in k100]
print features100
In [45]:
plt.figure(figsize=(20, 6))
classifiers = [LinearRegression(), Lasso(), Ridge()]
names = ["Linear Regression", "Lasso", "Ridge"]
for i, classifier, name in zip(range(len(classifiers)), classifiers, names):
ax = plt.subplot(1, len(classifiers), i+1)
ax.set_ylim([-1, 1])
plt.setp(ax.get_yticklabels(), visible=True)
plt.setp(ax.get_xticklabels(), visible=True)
sbs = SBS(classifier, features=1)
sbs.fit(X_train, y_train)
k_feat = [len(k) for k in sbs.subsets_]
plt.plot(k_feat, sbs.scores_, marker='o', label="Accuracy")
plt.ylabel('Accuracy')
plt.xlabel('Number of Features')
plt.grid()
plt.legend(loc="best")
plt.title("For {} min-score = {:.2e} & max-score={:.2e}".format(name, min(sbs.scores_), max(sbs.scores_)))
plt.show()
In [41]:
Out[41]:
In [ ]: