In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Let's start by loading the data for this round using a few helpers from the package in this repository.
In [2]:
from numerai import Round
r46 = Round(46)
train = r46.training_set()
test = r46.test_set()
Numer.ai data points have many features, but it can be instructive to plot a two-dimensional projection of all the points in the training and test sets. One way to do this is a Principal Component Analysis (PCA).
The numerai package in this repository provides a plot_PCA() method which computes the PCA and produces a pair plot of the training and tournament datasets:
In [3]:
from numerai import plot_PCA
plot_PCA(train.drop('target',axis=1), test.drop('t_id',axis=1), 'is_test', 'fig', show=True)
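For reference, here is a minimal sketch of what such a PCA pair plot could look like when built directly with scikit-learn and seaborn; the two-component projection and the plotting details are assumptions, not necessarily what plot_PCA() does internally:
from sklearn.decomposition import PCA

def pca_pair_plot(train_df, test_df, hue='is_test'):
    # Stack both feature frames; pd, np and sns are imported in the first cell
    both = pd.concat([train_df, test_df], ignore_index=True)
    # Project all points onto the first two principal components
    proj = PCA(n_components=2).fit_transform(both)
    dfproj = pd.DataFrame(proj, columns=['PC1', 'PC2'])
    # Flag which rows came from the tournament set, then pair-plot by that flag
    dfproj[hue] = np.repeat([0, 1], [len(train_df), len(test_df)])
    sns.pairplot(dfproj, hue=hue, plot_kws={'s': 5, 'alpha': 0.3})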
We can sort the points of the training set by the probability that they belong to the training set, as computed by a classifier trained to tell training points apart from test points. This approach is detailed by Zygmunt Z. in two articles about adversarial validation on his blog FastML: http://fastml.com/adversarial-validation-part-one/ and http://fastml.com/adversarial-validation-part-two/
There's one question I don't answer here: what is the impact of the choice of classifier and its parameters on everything that follows?
In [4]:
if not r46.has_sorted_training_set():
    r46.sort_training_set()  # the classifier is by default a standard sklearn RandomForestClassifier
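For the curious, the sorting step can be sketched directly with scikit-learn; this is a guess at what sort_training_set() might do, and the out-of-fold prediction scheme and the p column name are assumptions:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def sort_by_adversarial_probability(train_df, test_df):
    # Label each row by origin: 0 = training set, 1 = tournament set
    X = pd.concat([train_df.drop('target', axis=1),
                   test_df.drop('t_id', axis=1)], ignore_index=True)
    y = np.repeat([0, 1], [len(train_df), len(test_df)])
    # Out-of-fold probability of belonging to the tournament set
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    p = cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
    out = train_df.copy()
    out['p'] = p[:len(train_df)]
    # Least tournament-like points first, as in the plot below
    return out.sort_values('p').reset_index(drop=True)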
In [5]:
train_sorted = r46.sorted_training_set()
In [6]:
fig = plt.figure(figsize=(8,3))
ax = fig.add_subplot(111)
train_sorted.p.plot(ax=ax)
ax.axhline(y=0.5,color='r',lw=0.5)
ax.set_ylim(bottom=0, top=1)
ax.set_xlabel('Index in sorted training set')
ax.set_ylabel('Probability of being in tournament set')
Out[6]:
According to Wikipedia, the Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution functions of two samples.
One could also say that the Kolmogorov–Smirnov test can detect much more general kinds of differences in distribution than the t-test can.
The kolmogorov_smirnov() method of this repository returns a pandas DataFrame containing, for each feature, the Kolmogorov–Smirnov distance between training and test set along with the associated p-value.
In [7]:
from numerai import kolmogorov_smirnov
dfks = kolmogorov_smirnov(train.drop('target', axis=1), test)
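The helper presumably wraps scipy.stats.ks_2samp; a minimal sketch, assuming both frames expose the same feature columns and using the KS and KS_p column names seen in the plots below:
from scipy import stats

def kolmogorov_smirnov_sketch(train_df, test_df):
    rows = []
    for c in train_df.columns:
        # Two-sample K-S test between the train and test marginals of feature c
        res = stats.ks_2samp(train_df[c], test_df[c])
        rows.append({'KS': res.statistic, 'KS_p': res.pvalue})
    return pd.DataFrame(rows, index=train_df.columns)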
In [8]:
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(211)
dfks.sort_values('KS', ascending=False)['KS'].plot(kind='bar', ax=ax1)
ax2 = fig.add_subplot(212)
dfks.sort_values('KS', ascending=False)['KS_p'].plot(kind='bar', ax=ax2)
Out[8]:
As noted in the scipy documentation for ks_2samp, if the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.
Looking at the training set, we may be tempted to drop the points that are least similar to the tournament set, in order to train our models on a more tournament-like training set.
One way to do this would be to set a probability threshold on the sorted training set and cut there. Another is to try many cut sizes on the sorted training set and compute the mean K-S statistic for each resulting training set.
In [9]:
from scipy import stats
dropped_sizes = range(5000, len(train_sorted), 5000)
mean_ks = []
for sz in dropped_sizes:
    # Drop the sz least tournament-like points and recompute the per-feature K-S
    drop_train = train_sorted.iloc[sz:]
    ks = []
    for c in drop_train.drop(['target', 'p'], axis=1).columns:
        res = stats.ks_2samp(drop_train[c], test[c])
        ks.append(res.statistic)
    # print('{}: m = {}'.format(sz, np.mean(ks)))
    mean_ks.append(np.mean(ks))
dfmks = pd.DataFrame(index=dropped_sizes)
dfmks['m'] = mean_ks
In [10]:
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
dfmks.plot(style='.',ax=ax)
ax.set_xlabel('Number of dropped items from least similar')
ax.set_ylabel('Mean Kolmogorov-Smirnov statistic')
Out[10]:
In [11]:
dfmks.idxmin()
Out[11]:
We get a cut size on the training set which minimizes the average K-S distance between training and tournament set. But this considers the whole tournament set: we don't know which subset is actually used for the private leaderboard, nor whether that subset has the same distribution as the whole tournament set.
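If we wanted to act on this anyway, applying the cut is short; a sketch reusing the names defined above (whether this actually improves a model is not tested here):
# Keep only the points after the K-S-minimizing cut, drop the helper column
best_cut = int(dfmks.m.idxmin())
train_cut = train_sorted.iloc[best_cut:].drop('p', axis=1)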
The next paragraphs are inspired by the articles on feature selection written by Ando Saabas on his blog Diving into data: http://blog.datadive.net/category/feature-selection/. We look in turn at univariate correlation with the target, L2-regularized logistic regression coefficients, random forest impurity importance, permutation importance, and stability selection.
In [12]:
from numerai import pearson
dfp = pearson(train.drop('target', axis=1), train.target)
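Presumably pearson() wraps scipy.stats.pearsonr feature by feature; a minimal sketch using the pearson and pearson_p column names seen in the plots below:
from scipy import stats

def pearson_sketch(features_df, target):
    rows = []
    for c in features_df.columns:
        # Linear correlation of each feature with the binary target
        r, p = stats.pearsonr(features_df[c], target)
        rows.append({'pearson': r, 'pearson_p': p})
    return pd.DataFrame(rows, index=features_df.columns)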
In [13]:
fig = plt.figure(figsize=(15,10))
ax1 = fig.add_subplot(211)
dfp.sort_values('pearson')['pearson'].plot(kind='bar', ax=ax1)
ax2 = fig.add_subplot(212)
dfp.sort_values('pearson')['pearson_p'].plot(kind='bar', ax=ax2)
#ax2.set_yscale("log")
ax2.set_ylim(top=1)
Out[13]:
In [14]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train.drop('target', axis=1),
                                                    train.target,
                                                    test_size=0.2,
                                                    random_state=0)
In [15]:
from sklearn.linear_model import LogisticRegression as LR
m_lr_l2 = LR(C=0.5,
             fit_intercept=False,
             max_iter=100,
             penalty='l2',
             tol=1e-4,
             solver='sag',
             n_jobs=6)
m_lr_l2.fit(x_train, y_train)
Out[15]:
In [16]:
df = pd.DataFrame(m_lr_l2.coef_[0],index=range(1,len(train.columns)))
df.columns = ['LR_L2_coefs']
In [17]:
fig = plt.figure(figsize=(10,4))
ax=fig.add_subplot(111)
df.sort_values('LR_L2_coefs').plot(kind='bar', ax=ax)
Out[17]:
In [18]:
x_train, x_test, y_train, y_test = train_test_split(train.drop('target', axis=1),
                                                    train.target,
                                                    test_size=0.2,
                                                    random_state=0)
In [19]:
from sklearn.ensemble import RandomForestClassifier as RF
m_rf = RF(n_jobs=6)
In [20]:
m_rf.fit(x_train, y_train)
Out[20]:
In [21]:
df_rf_mdi = pd.DataFrame(m_rf.feature_importances_,index=range(1,len(train.columns)))
df_rf_mdi.columns = ['RF_mdi']
In [22]:
fig = plt.figure(figsize=(10,4))
ax=fig.add_subplot(111)
df_rf_mdi.sort_values('RF_mdi').plot(kind='bar', ax=ax)
ax.set_ylim([0.018,0.022])
Out[22]:
In [23]:
from sklearn.metrics import r2_score
from collections import defaultdict
from sklearn.model_selection import ShuffleSplit
scores = defaultdict(list)
X = train.drop('target', axis=1).values
Y = train.target.values
features = train.drop('target', axis=1).columns
m_rf = RF(n_jobs=6)
# Cross-validate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=10, test_size=0.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    m_rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, m_rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        # Shuffle a single feature column and measure the relative drop in score
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, m_rf.predict(X_t))
        scores[features[i]].append((acc - shuff_acc) / acc)
print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat)
              for feat, score in scores.items()], reverse=True))
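As an aside, newer scikit-learn releases (0.22+) ship this idea as sklearn.inspection.permutation_importance, which handles the repeated shuffling for us; a minimal sketch reusing the last split and fitted forest from the loop above:
from sklearn.inspection import permutation_importance

# n_repeats controls how many times each feature column is reshuffled
result = permutation_importance(m_rf, X_test, Y_test,
                                scoring='r2', n_repeats=5, random_state=0)
print(sorted(zip(result.importances_mean.round(4), features), reverse=True))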
In [24]:
x_train, x_test, y_train, y_test = train_test_split(train.drop('target', axis=1),
                                                    train.target,
                                                    test_size=0.2,
                                                    random_state=0)
In [28]:
from sklearn.linear_model import RandomizedLogisticRegression as RLR
# Stability selection: RandomizedLogisticRegression was deprecated in
# scikit-learn 0.19 and removed in 0.21, so this cell needs an older release
warnings.filterwarnings('ignore')
m_rlr = RLR(C=0.5, n_resampling=50)
m_rlr.fit(x_train, y_train)
Out[28]:
In [29]:
df_rlr = pd.DataFrame(m_rlr.scores_,index=range(1,len(train.columns)))
df_rlr.columns = ['RLR']
In [30]:
fig = plt.figure(figsize=(10,4))
ax=fig.add_subplot(111)
df_rlr.sort_values('RLR').plot(kind='bar', ax=ax)
Out[30]:
In [ ]: