In [7]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import csv

from scipy import stats, optimize
from sklearn.preprocessing import Imputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

from sklearn.base import clone
from itertools import combinations
from sklearn.metrics import explained_variance_score, r2_score, median_absolute_error
from sklearn.model_selection import cross_val_score
import subprocess

print('The scikit-learn version is {}.'.format(sklearn.__version__))
print('The pandas version is {}.'.format(pd.__version__))
print('The numpy version is {}.'.format(np.__version__))


The scikit-learn version is 0.18.1.
The pandas version is 0.19.2.
The numpy version is 1.12.0.

Read the CSV

We use pandas' read_csv('path/to/csv') method to read the CSV file. Next, we replace the missing values (encoded as '?') with np.NaN, i.e. Not a Number. This way we can count the number of missing values per column.

As described in the UC Irvine repository, the dataset has 125 predictive attributes, 4 non-predictive attributes, and 18 potential goal attributes.

We remove the features that are final goals, along with some other irrelevant features. For example, the following attribute is a potential GOAL attribute (to be predicted): murdPerPop, the number of murders per 100K population (numeric - decimal).


In [66]:
df = pd.read_csv('../../datasets/UCIrvineCrimeData.csv')
df = df.replace('?', np.NaN)
features = [x for x in df.columns if x not in ['fold', 'state', 'community', 'communityname',
                                               'county', 'ViolentCrimesPerPop']]
# write the feature names to a CSV file as input for the mRMR preprocessing step
with open('../../datasets/test.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow(features)

Find the number of missing values in every column
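This count is not shown in a cell above; a minimal sketch of it (an assumption on my part, using pandas' isnull() on the df loaded above) is:

In [ ]:
# count the np.NaN entries in each column, largest counts first
missing_per_column = df.isnull().sum().sort_values(ascending=False)
print(missing_per_column.head(10))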

Imputing missing values

Often, the removal of samples or the dropping of entire feature columns is simply not feasible, because we might lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common interpolation techniques is mean interpolation, where we simply replace each missing value with the mean value of the entire feature column. A convenient way to achieve this is to use the Imputer class from scikit-learn, as shown in the following code.


In [17]:
# replace each missing value with the mean of its feature column
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df[features])
imputed_data = imr.transform(df[features])
# append the imputed samples to the CSV file used by the mRMR step
with open('../../datasets/test.csv', 'a') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in imputed_data:
        writer.writerow(line)

Sklearn fundamentals

A convenient way to randomly partition the dataset into separate training and test datasets is to use the train_test_split function from scikit-learn's model_selection submodule. For now, the target variable is 'ViolentCrimesPerPop'.
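For reference, a minimal sketch of such a split is shown below (the 30% test fraction and the random_state are assumptions; the analysis that follows uses cross-validation rather than a single hold-out split):

In [ ]:
from sklearn.model_selection import train_test_split

# hold out 30% of the samples as a test set (test fraction is an assumption)
X_tr, X_te, y_tr, y_te = train_test_split(
    imputed_data, df['ViolentCrimesPerPop'], test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)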


In [51]:
#df = df.drop(["communityname", "state", "county", "community"], axis=1)
X, y = imputed_data, df['ViolentCrimesPerPop']
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
#y_train = y_train*100
#y_train.astype(np.int64)
# express the target as an integer percentage (0-100)
y = y*100
y = y.astype(np.int64)
# discretize the target into three levels: 0 = low, 1 = medium, 2 = high
y_train = []
for i in y:
    if i < 33:
        y_train.append(0)
    elif i < 66:
        y_train.append(1)
    else:
        y_train.append(2)

We have 100 class labels, from 1% to 100%. I group them into three classes for the following reasons. First, too many class labels would reduce the accuracy of our classifier. Moreover, it is unreasonable to keep that many class labels, since there is no meaningful difference between, say, 33% and 34%. So we group them into three levels representing low, medium, and high.
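The same three-level target can also be built without an explicit loop. A sketch using numpy's digitize (the bin edges 33 and 66 match the thresholds in the loop above; y_binned is a hypothetical name):

In [ ]:
# 0 for y < 33 (low), 1 for 33 <= y < 66 (medium), 2 for y >= 66 (high)
y_binned = np.digitize(y, bins=[33, 66])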


In [74]:
clf = GaussianNB()
X = np.array(X)
scoring = []
selected_features = []
for i in range(2, 100):
    # use mRMR (external script) to select the top i+1 features
    answer = subprocess.check_output(['./mrmr.sh', str(i+1)])
    answer = answer[:-1]  # strip the trailing newline
    answer = answer.split(',')
    answer = [int(idx) for idx in answer]
    # keep only the selected feature columns
    x_train = X[:, answer]
    # 10-fold cross-validation
    scores = cross_val_score(clf, x_train, y_train, cv=10)
    scoring.append(scores.mean())
    selected_features.append(answer)

In [75]:
k_feat = range(2,100)
plt.plot(k_feat, scoring, marker='o')
plt.ylim([0.7, 0.9])
plt.ylabel('Accuracy')
plt.xlabel('Number of Features')
plt.grid()
plt.show()



In [76]:
best = np.argmax(scoring)

In [78]:
features_indices = selected_features[best]
selected = [features[x] for x in features_indices]
print(selected)


['numbUrban', 'NumInShelters']