In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')
In [2]:
# set random seed, for reproducibility
np.random.seed(12345)
Download the Hep C replication archive from http://ghdx.healthdata.org/record/hepatitis-c-prevalence-1990-and-2005-all-gbd-regions and extract input_data.csv.
Or, since the H: drive is preventing me from loading that into Sage Cloud, let's look at the good old weather data from Week 1 of class:
In [4]:
df = pd.read_csv('weather-numeric.csv')
df
Out[4]:
A.k.a. one-hot encoding (http://en.wikipedia.org/wiki/One-hot):
In [5]:
df.outlook.value_counts()
Out[5]:
In [6]:
X = np.array(df.filter(['outlook', 'temperature']))
y = np.array(df.play)
In [7]:
import sklearn.svm
In [8]:
clf = sklearn.svm.SVC()
clf.fit(X, y)
What's the problem?
In [ ]:
# SVC cannot handle string-valued features such as outlook
The solution: one-hot encoding.
In [9]:
# can do this manually: one boolean indicator column per outlook value
for val in df.outlook.unique():
    print('adding column %s' % val)
    df[val] = (df.outlook == val)
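The manual loop above can also be written with a pandas helper. Here is a minimal sketch using `pd.get_dummies`; the `toy` frame and its values are made up for illustration and are not the class weather data.

```python
import pandas as pd

# Toy stand-in for the weather data; these values are made up for illustration
toy = pd.DataFrame({
    'outlook': ['sunny', 'overcast', 'rainy', 'sunny'],
    'temperature': [85, 83, 70, 72],
})

# One indicator column per distinct outlook value
dummies = pd.get_dummies(toy.outlook)

# Attach the indicator columns next to the original ones
toy_encoded = pd.concat([toy, dummies], axis=1)
print(toy_encoded)
```

Each row has exactly one indicator switched on, which is where the name "one-hot" comes from.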
In [11]:
X = np.array(df.filter(['sunny', 'overcast', 'rainy', 'temperature']))
y = np.array(df.play)
clf = sklearn.svm.SVC()
clf.fit(X, y)
Out[11]:
In [13]:
y_pred = clf.predict(X)
np.mean(y_pred == y)
Out[13]:
Impressed?
In [ ]:
# No: this is in-sample accuracy, not out-of-sample (oos)
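To see the gap between in-sample and out-of-sample accuracy, one option is cross-validation. The sketch below assumes a scikit-learn version that provides `sklearn.model_selection.cross_val_score`, and it uses synthetic data (`X_toy`, `y_toy` are hypothetical names) rather than the weather data, which is too small to split comfortably.

```python
import numpy as np
import sklearn.svm
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): 60 rows, 4 features,
# with a noisy label that depends only on the first feature
rng = np.random.RandomState(12345)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

# In-sample accuracy: fit and predict on the same rows
clf_toy = sklearn.svm.SVC()
in_sample = np.mean(clf_toy.fit(X_toy, y_toy).predict(X_toy) == y_toy)

# Out-of-sample accuracy: 5-fold cross-validation, each fold scored
# on rows the model did not see during fitting
oos_scores = cross_val_score(sklearn.svm.SVC(), X_toy, y_toy, cv=5)

print('in-sample accuracy: %.2f' % in_sample)
print('mean out-of-sample accuracy: %.2f' % oos_scores.mean())
```

The cross-validated number is the one to report when judging how well the classifier might do on new data.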
In [ ]:
df.temperature
In [ ]:
df.temperature.mean()
In [ ]:
df.temperature.std()
In [ ]:
df['normalized_temp'] = (df.temperature - df.temperature.mean()) / df.temperature.std()
In [ ]:
sns.distplot(df.normalized_temp)
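A quick sanity check on this kind of normalization: after subtracting the mean and dividing by the standard deviation, the result should have mean 0 and standard deviation 1. One detail worth knowing is that pandas `.std()` uses the sample formula (`ddof=1`) while `np.std` defaults to the population formula (`ddof=0`). The `temps` values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up temperatures, standing in for df.temperature
temps = pd.Series([85., 80., 83., 70., 68., 65., 64.])

# Same recipe as in the cell above: pandas .std() uses ddof=1
z = (temps - temps.mean()) / temps.std()

# Variant using numpy's default population standard deviation (ddof=0)
z_pop = (temps - temps.mean()) / np.std(temps.values)

print('mean of z:', z.mean())
print('sample std of z:', z.std())
print('population std of z_pop:', np.std(z_pop.values))
```

The two variants differ only by a constant factor, so either is fine as long as it is used consistently.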
In [14]:
df['hot_and_humid'] = df.temperature * df.humidity
In [15]:
sns.distplot(df.hot_and_humid)
Out[15]:
Should we have normalized that somehow? Could do before or after multiplying...
In [16]:
df.hot_and_humid = (df.hot_and_humid - df.hot_and_humid.mean()) / df.hot_and_humid.std()
sns.distplot(df.hot_and_humid)
Out[16]:
OR
In [17]:
df['normalized_humidity'] = (df.humidity - df.humidity.mean()) / df.humidity.std()
In [18]:
df.hot_and_humid = df.normalized_temp * df.normalized_humidity
sns.distplot(df.hot_and_humid)
There are fancier things you can consider, too, and we will perhaps return to them in the Data Transformations week. If you need one for your project, start with the Box-Cox transform: http://en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation
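As a starting point for the Box-Cox transform, here is a minimal sketch using `scipy.stats.boxcox`, which estimates the power parameter by maximum likelihood when none is given. It requires strictly positive input. The lognormal `skewed` data is synthetic and only there to show the effect on a right-skewed variable.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (an assumption for illustration)
rng = np.random.RandomState(12345)
skewed = rng.lognormal(mean=0., sigma=1., size=500)

# With no lambda supplied, boxcox returns the transformed data
# together with the estimated lambda
transformed, lam = stats.boxcox(skewed)

print('estimated lambda:', lam)
print('skewness before:', stats.skew(skewed))
print('skewness after:', stats.skew(transformed))
```

An estimated lambda near 0 means the transform is close to a plain log.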
In [19]:
df = pd.read_csv('input_data.csv')
df.head()
Out[19]:
Other fun features you could consider adding, based on the special case you are dealing with:
In [20]:
# age interval contains specific ages:
# the interval [age_start, age_end] includes age a when age_start <= a <= age_end
for a in np.arange(0, 81, 5):
    df['includes_age_' + str(a)] = 1. * ((df.age_start <= a) & (df.age_end >= a))
In [ ]:
df.filter(like='age').head()
In [ ]:
# geographic hierarchy, dummy coded:
import json, networkx as nx
In [ ]:
with open('hierarchy.json') as f:
    hierarchy = json.load(f)
type(hierarchy)
In [ ]:
G = nx.DiGraph()
for n, n_props in hierarchy['nodes']:
    G.add_node(n)
In [ ]:
for u, v, edge_props in hierarchy['edges']:
    G.add_edge(u, v)
In [ ]:
def region_containing(country):
    # list() keeps this working when predecessors() returns an iterator
    parents = list(G.predecessors(country))
    assert len(parents) == 1
    return parents[0]

df['region'] = df.area.map(region_containing)
In [ ]:
df['super-region'] = df.region.map(region_containing)
In [ ]:
df['super-region']
In [ ]:
# challenge: do a one-hot encoding of region and super-region
This "hierarchical-hot" might be good for ICD codes, too. Someone should check...
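One possible reading of "hierarchical-hot" is to one-hot encode every level of the hierarchy and place the indicator blocks side by side, so each row switches on one column per level. The sketch below uses a hypothetical toy hierarchy; the area, region, and super-region names are made up and do not come from hierarchy.json.

```python
import pandas as pd

# Hypothetical toy hierarchy; all names here are made up for illustration
toy = pd.DataFrame({
    'area': ['A1', 'A2', 'B1', 'C1'],
    'region': ['region_A', 'region_A', 'region_B', 'region_C'],
    'super-region': ['super_1', 'super_1', 'super_1', 'super_2'],
})

# One block of indicator columns per level of the hierarchy
levels = [pd.get_dummies(toy[col]) for col in ['area', 'region', 'super-region']]

# Side by side: each row has one "hot" column in each of the three blocks
hier_hot = pd.concat(levels, axis=1).astype(float)
print(hier_hot)
```

Rows that share a region or super-region share the corresponding indicator, which lets a model borrow strength across related areas.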