Cancer remains a terrible disease. Surprisingly, cancer incidence and mortality rates vary substantially across regions worldwide. This raises several questions about yet-unknown preventive and risk factors.
Within the Epidemium initiative, the Cancer Baseline project aims at collecting open-data aggregate cancer mortality risks (y) worldwide, together with potential explanatory factors (X), in order to model $y = f(X)$ and thereby shed light on new cancer-related factors.
For this first RAMP, you will be the first to analyze the data collected by more than 30 volunteers over three months, and you will compete for the best cancer mortality prediction model $y = f(X)$. May it lead to a new generation of solutions against cancer!
If you want to join the project after the RAMP, see http://wiki.epidemium.cc/wiki/Baseline#How_to_start
The simple way: install the Anaconda Python distribution: https://www.continuum.io/downloads
The fine-grained way: install each of the required libraries individually, as sketched below.
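If you go the fine-grained way, here is a minimal sketch (assuming you use pip; the package list is simply inferred from the imports used later in this notebook):
In [ ]:
# run in a terminal without the leading "!", or directly from a notebook cell
!pip install numpy pandas matplotlib seaborn scikit-learn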
The vast majority of the variables are extremely sparse, and part of the challenge will be to deal with these missing values.
In [79]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
In [3]:
pd.set_option('display.max_columns', None)
Let's start with a meta-data analysis, i.e. number of missing values, number of unique categories, etc.
In [80]:
filename = 'data/public/train.csv'
In [81]:
df = pd.read_csv(filename)
In [82]:
df.shape
Out[82]:
In [83]:
df.head(3)
Out[83]:
In [55]:
df.describe()
Out[55]:
The following utility function will help visualize the sparsity of the dataset.
In [56]:
def meta_dataframe(df, uniq_examples=7):
    """Summarize each column: dtype, number of unique values, non-null rate and a few example values."""
    from collections import defaultdict
    res = defaultdict(list)
    for i in range(df.shape[1]):
        res['col_name'].append(df.columns[i])
        uniques = df.iloc[:, i].unique()
        notnull_rate = df.iloc[:, i].dropna().size / df.iloc[:, i].size
        res['n_uniques'].append(uniques.size)
        res['n_notnull'].append(notnull_rate)
        res['dtype'].append(df.iloc[:, i].dtype)
        # keep up to `uniq_examples` example values per column
        for j in range(1, uniq_examples + 1):
            v = uniques[j - 1] if j <= uniques.size else ''
            res['value_' + str(j)].append(v)
    return pd.DataFrame(res, columns=sorted(res.keys())).set_index('col_name')
In [57]:
meta_df = meta_dataframe(df.dropna(how='all', axis=1).dropna(how='all', axis=0))
In [58]:
meta_df.sort_values('n_notnull', ascending=False)
Out[58]:
How sparse is the data?
In [59]:
meta_df.n_notnull.sort_values(ascending=True, inplace=False).plot.barh(figsize=(10, 6), fontsize=1, alpha=.7);
In [60]:
meta_df.n_notnull.plot.hist(figsize=(10, 6), bins=30, alpha=.7);
We are going to follow the scikit-learn API specs. Basically, your model classes should inherit from BaseEstimator and implement an __init__() function, a fit() function and a predict() function. More information in the official documentation.
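As a purely illustrative sketch (the class name and behavior are made up here, not part of the starting kit), an estimator satisfying this interface could look like:
In [ ]:
import numpy as np
from sklearn.base import BaseEstimator

class ConstantRegressor(BaseEstimator):
    """Toy example: always predicts the mean of the training targets."""
    def __init__(self):
        pass

    def fit(self, X, y):
        # learn a single statistic from the training data
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        # predict that statistic for every sample
        return np.full(len(X), self.mean_)
The Regressor class defined later in this notebook follows the same pattern, wrapping a scikit-learn pipeline.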
In [61]:
from sklearn.base import BaseEstimator
In [62]:
from sklearn.model_selection import train_test_split
In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, ShuffleSplit
In [64]:
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
In [65]:
from sklearn.pipeline import make_pipeline
In [66]:
# select the rows whose region is France
df_tmp = df[df['Part of'] == 'France']
In [70]:
# number of columns that are entirely missing
df.isnull().all(axis=0).sum()
Out[70]:
In [74]:
# drop the French records selected above
df = df.drop(df_tmp.index)
In [21]:
class FeatureExtractor(object):
    # The columns you want to include without pre-processing
    core_cols = ['Year']
    # These columns must be discarded. They are only useful in case you would like
    # to do joins with external data
    region_cols = ['RegionType', 'Part of', 'Region']
    # Categorical columns. They must be processed (use pd.get_dummies for the simplest way)
    categ_cols = ['Gender', 'Age', 'MainOrigin']
    # The different factors to include in the model
    additional_cols = ['HIV_15_49']

    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        ret = X_df[self.core_cols].copy()
        # dummify the categorical variables
        for col in self.categ_cols:
            ret = ret.join(pd.get_dummies(X_df[col], prefix=col[:3]))
        # add extra information
        for col in self.additional_cols:
            ret[col] = X_df[col]
        return ret.values
In [22]:
class Regressor(BaseEstimator):
    def __init__(self):
        # impute missing values with the column median, then fit a random forest
        self.clf = make_pipeline(
            SimpleImputer(strategy='median'),
            RandomForestRegressor(n_estimators=20, max_depth=None))

    def fit(self, X, y):
        return self.clf.fit(X, y)

    def predict(self, X):
        return self.clf.predict(X)
In [28]:
df_features = df.drop('target', axis=1)
y = df.target.values
df_train, df_test, y_train, y_test = train_test_split(df_features, y, test_size=0.5, random_state=42)
Instantiating our model
In [29]:
feature_extractor = FeatureExtractor()
model = Regressor()
Feature processing and training
In [30]:
X_train = feature_extractor.transform(df_train)
model.fit(X_train, y_train)
Out[30]:
Testing our model
In [32]:
X_test = feature_extractor.transform(df_test)
y_pred = model.predict(X_test)
print('RMSE = ', np.sqrt(mean_squared_error(y_test, y_pred)))
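The imports above also include cross_val_score and ShuffleSplit, which are not used in this walk-through. As a hedged sketch (the number of splits and test size are arbitrary choices), the same model could be evaluated with shuffled cross-validation on the training set:
In [ ]:
from sklearn.model_selection import cross_val_score, ShuffleSplit

# 5 random 80/20 splits of the training features; sklearn returns negated MSE for this scorer
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores = cross_val_score(Regressor(), X_train, y_train,
                         scoring='neg_mean_squared_error', cv=cv)
print('CV RMSE = %.3f +/- %.3f' % (np.sqrt(-scores).mean(), np.sqrt(-scores).std()))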