In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')
In [2]:
# set random seed, for reproducibility
np.random.seed(12345)
Download the Hep C replication archive from http://ghdx.healthdata.org/record/hepatitis-c-prevalence-1990-and-2005-all-gbd-regions, and extract input_data.csv.
Or, since the H: drive is preventing me from loading that into Sage Cloud, let's start with the good old weather data from Week 1 of class:
In [3]:
df = pd.read_csv('weather-numeric.csv')
df
Out[3]:
A.k.a. one-hot encoding (http://en.wikipedia.org/wiki/One-hot):
In [4]:
df.outlook.value_counts()
Out[4]:
In [5]:
X = np.array(df.filter(['outlook', 'temperature']))
y = np.array(df.play)
In [6]:
import sklearn.svm
In [7]:
clf = sklearn.svm.SVC()
clf.fit(X, y)
What's the problem?
In [8]:
# SVC is not smart enough to handle strings: the outlook column holds
# values like 'sunny', and the classifier needs numeric features
The solution: one-hot encoding.
In [9]:
# can do this manually: one True/False column per distinct outlook value
for val in df.outlook.unique():
    print('adding column', val)
    df[val] = (df.outlook == val)
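The manual loop has a one-line counterpart in pandas. Here is a minimal sketch using pd.get_dummies on a made-up stand-in for the weather table (the toy values and the names toy, dummies, and encoded are assumptions for illustration, not the class data):

```python
import pandas as pd

# Toy stand-in for the weather data; these values are made up for illustration.
toy = pd.DataFrame({
    'outlook': ['sunny', 'overcast', 'rainy', 'sunny'],
    'temperature': [85, 83, 70, 72],
})

# One indicator column per distinct outlook value, named after the value itself.
dummies = pd.get_dummies(toy['outlook'], dtype=float)

# Swap the string column for its indicator columns.
encoded = pd.concat([toy.drop(columns='outlook'), dummies], axis=1)
print(encoded)
```

Each row has exactly one indicator set to 1, which is where the name "one-hot" comes from.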
In [10]:
X = np.array(df.filter(['sunny', 'overcast', 'rainy', 'temperature']))
y = np.array(df.play)
clf = sklearn.svm.SVC()
clf.fit(X, y)
Out[10]:
In [11]:
y_pred = clf.predict(X)
np.mean(y_pred == y)
Out[11]:
Impressed?
In [12]:
# NO: this is accuracy on the same rows we trained on, not out-of-sample (oos)
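To see why in-sample accuracy is not impressive, compare it with a cross-validated score. This is only a sketch on synthetic data (X_toy and y_toy are made up, and the labels are unrelated to the features by construction), using scikit-learn's cross_val_score:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(12345)

# Synthetic stand-in: 40 rows, 4 noise features, labels that just alternate,
# so there is nothing real for the classifier to learn.
X_toy = rng.normal(size=(40, 4))
y_toy = np.array([0, 1] * 20)

# Accuracy on the same rows the model was fit to.
in_sample = SVC().fit(X_toy, y_toy).score(X_toy, y_toy)

# 5-fold cross-validation: each fold is scored on rows held out from fitting.
oos = cross_val_score(SVC(), X_toy, y_toy, cv=5).mean()

print('in-sample accuracy:    ', in_sample)
print('out-of-sample accuracy:', oos)
```

With only 14 rows in the weather data, the cross-validated number will be noisy, but it is at least an honest estimate.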
In [13]:
df.temperature
Out[13]:
In [14]:
df.temperature.mean()
Out[14]:
In [15]:
df.temperature.std()
Out[15]:
In [16]:
df['normalized_temp'] = (df.temperature - df.temperature.mean()) / df.temperature.std()
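One detail worth knowing when normalizing this way: pandas divides by n-1 in .std() by default (ddof=1), while numpy's np.std and scikit-learn's StandardScaler use the population version (ddof=0). A small sketch with made-up temperatures (temps is not the class data):

```python
import pandas as pd

temps = pd.Series([85., 80., 83., 70., 68.])  # made-up temperatures

# pandas default: sample standard deviation (ddof=1)
z_sample = (temps - temps.mean()) / temps.std()

# population standard deviation (ddof=0), the convention np.std uses by default
z_population = (temps - temps.mean()) / temps.std(ddof=0)

# Each version has unit spread under its own convention, and mean zero.
print(z_sample.mean(), z_sample.std())
print(z_population.mean(), z_population.std(ddof=0))
```

For a dataset this small the two conventions differ noticeably; for large n the difference is negligible.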
In [17]:
sns.distplot(df.normalized_temp)
Out[17]:
In [18]:
df['hot_and_humid'] = df.temperature * df.humidity
In [19]:
sns.distplot(df.hot_and_humid)
Out[19]:
Should we have normalized that somehow? Could do before or after multiplying...
In [20]:
df.hot_and_humid = (df.hot_and_humid - df.hot_and_humid.mean()) / df.hot_and_humid.std()
sns.distplot(df.hot_and_humid)
Out[20]:
OR
In [21]:
df['normalized_humidity'] = (df.humidity - df.humidity.mean()) / df.humidity.std()
In [22]:
df.hot_and_humid = df.normalized_temp * df.normalized_humidity
sns.distplot(df.hot_and_humid)
Out[22]:
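The two orderings are not equivalent. A sketch on synthetic data (temp and humidity here are random draws, and zscore is a helper defined just for this example): normalizing after multiplying gives mean 0 and standard deviation 1 by construction, while multiplying two normalized columns gives a feature whose mean tracks their correlation.

```python
import numpy as np

rng = np.random.RandomState(12345)
temp = rng.normal(75, 6, size=1000)       # synthetic stand-in for temperature
humidity = rng.normal(80, 10, size=1000)  # synthetic stand-in for humidity

def zscore(x):
    """Subtract the mean and divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

after = zscore(temp * humidity)            # multiply, then normalize
before = zscore(temp) * zscore(humidity)   # normalize, then multiply

print('multiply, then normalize: mean %.3f, sd %.3f' % (after.mean(), after.std(ddof=1)))
print('normalize, then multiply: mean %.3f, sd %.3f' % (before.mean(), before.std(ddof=1)))
```

Which one is better depends on the model; the point is only that they are different features.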
There are fancier transformations you can consider, too, and we may return to them in the Data Transformations week. If you need one for your project, start with the Box-Cox transform: http://en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation
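For a feel of what Box-Cox does, here is a sketch using scipy.stats.boxcox on synthetic right-skewed data (the lognormal draws are an assumption for illustration; Box-Cox needs strictly positive inputs):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(12345)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # strictly positive, right-skewed

# With no lambda supplied, boxcox fits the power by maximum likelihood
# and returns the transformed values together with the fitted lambda.
transformed, lam = stats.boxcox(skewed)

print('fitted lambda:', lam)
print('skewness before:', stats.skew(skewed))
print('skewness after: ', stats.skew(transformed))
```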
In [23]:
df = pd.read_csv('input_data.csv')
df.head()
Out[23]:
Other fun features you could consider adding, based on the special case you are dealing with:
In [24]:
# age interval contains specific ages, i.e. age_start <= a <= age_end:
for a in np.arange(0, 81, 5):
    df['includes_age_'+str(a)] = 1. * ((df.age_start <= a) & (df.age_end >= a))
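A quick check of the interval-indicator idea on a few made-up rows (the toy frame below is hypothetical, and it treats age_start and age_end as inclusive bounds, which is an assumption about how the data is coded). An indicator should be 1 exactly when the age falls inside the row's interval:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; age_start and age_end are treated as inclusive bounds.
toy = pd.DataFrame({'age_start': [0, 15, 40], 'age_end': [4, 49, 99]})

for a in np.arange(0, 81, 20):
    # 1.0 when age_start <= a <= age_end, else 0.0
    toy['includes_age_' + str(a)] = 1. * ((toy.age_start <= a) & (toy.age_end >= a))

print(toy)
```

The 15 to 49 row, for example, should light up for ages 20 and 40 and nothing else.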
In [25]:
df.filter(like='age').head()
Out[25]:
In [26]:
# geographic hierarchy, dummy coded:
import json, networkx as nx
In [27]:
with open('hierarchy.json') as f:
    hierarchy = json.load(f)
type(hierarchy)
Out[27]:
In [28]:
G = nx.DiGraph()
for n, n_props in hierarchy['nodes']:
    G.add_node(n)
In [29]:
for u, v, edge_props in hierarchy['edges']:
    G.add_edge(u, v)
In [30]:
def region_containing(country):
    # predecessors() may return an iterator, so make it a list before len()
    parents = list(G.predecessors(country))
    assert len(parents) == 1
    return parents[0]
df['region'] = df.area.map(region_containing)
In [31]:
df['super-region'] = df.region.map(region_containing)
In [32]:
df['super-region']
Out[32]:
In [35]:
for r in df.region.unique():
    print('adding column', r)
    df[r] = (df.region == r)
In [37]:
for r in df['super-region'].unique():
    print('adding column', r)
    df[r] = (df['super-region'] == r)
This "hierarchical-hot" might be good for ICD codes, too. Someone should check...
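The same idea can be written for a hierarchy of any depth by giving each row an indicator for every ancestor of its node. A minimal sketch on a made-up three-level hierarchy (all node names are hypothetical; hierarchy.json is not used here), relying on networkx's nx.ancestors:

```python
import networkx as nx
import pandas as pd

# Made-up hierarchy with edges pointing from parent to child:
# super-region -> region -> country.
G = nx.DiGraph()
G.add_edges_from([
    ('super_A', 'region_1'), ('super_A', 'region_2'),
    ('region_1', 'country_x'), ('region_1', 'country_y'),
    ('region_2', 'country_z'),
])

toy = pd.DataFrame({'area': ['country_x', 'country_y', 'country_z']})

# nx.ancestors gives every node with a directed path down to the given node.
ancestors = {c: nx.ancestors(G, c) for c in toy.area.unique()}

# One indicator column per ancestor that appears anywhere in the data.
for node in sorted(set.union(*ancestors.values())):
    toy[node] = toy.area.map(lambda c: 1. * (node in ancestors[c]))

print(toy)
```

For ICD codes the graph would come from the code prefixes rather than a geography file, but the column-building loop would stay the same.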