In [ ]:
!date
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')
In [ ]:
# set random seed, for reproducibility
np.random.seed(12345)
# funny little var for making notebook executable
__________________ = None
Load, clean, and prepare DHS asset ownership data:
In [ ]:
df = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS.CSV', index_col=0)
# have a look at what is in this data frame
__________________
In [ ]:
cb = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS_codebook.CSV', index_col=0)
# cb stands for codebook. have a look at what the funny column names mean
__________________
Wouldn't it be nice if the column names were descriptive, instead of codes?
In [ ]:
# find a dictionary mapping codes to descriptions
# is it simply cb.to_dict?
cb.to_dict()
In [ ]:
# no, not quite. but it is in there:
cb.to_dict().get(__________________)
In [ ]:
# you can use pd.Series.map to change all the names,
# but you cannot do it to a list of columns directly
# too bad...
pd.Series(df.columns).map(cb.to_dict()['full name'])
In [ ]:
df.columns = pd.Series(df.columns).map(cb.to_dict()['full name'])
# did we get that right?
__________________
In [ ]:
# have a look at the survey results:
(100*df.mean()).sort_values().plot(kind='barh')
plt.xlabel('Percent endorsed')
Now make an array of feature vectors and a corresponding array of labels:
In [ ]:
X = np.array(df.drop('has mobile telephone', axis=1))
y = np.array(df['has mobile telephone'])
And split the data into training and test sets (we'll talk more about this next week!)
In [ ]:
train = np.random.choice(range(len(df.index)), size=int(len(df.index)*.8), replace=False)
test = [i for i in (set(range(len(df.index))) - set(train))]
In [ ]:
X_test = X[test]
y_test = y[test]
X_train = X[train]
y_train = y[train]
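As an aside, scikit-learn has a helper that does a random train/test split in one call. Here is a minimal sketch, assuming a scikit-learn version that provides sklearn.model_selection (the X_tr/X_te names are just for this illustration and are not used below):
In [ ]:
# minimal sketch: the same 80/20 split via scikit-learn's helper
# (assumes sklearn.model_selection is available; older versions had this in sklearn.cross_validation)
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=.2, random_state=12345)
len(X_tr), len(X_te)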
Does it look reasonable?
In [ ]:
len(X_test), len(X_train)
In [ ]:
y_test.mean(), y_train.mean()
In [ ]:
X_test.mean(axis=0).round(2)
In [ ]:
X_train.mean(axis=0).round(2)
In [ ]:
import sklearn.naive_bayes
clf = sklearn.naive_bayes.BernoulliNB()
clf.fit(__________________, __________________)
In [ ]:
y_pred = clf.predict(__________________)
In [ ]:
np.mean(y_pred == y_test)
Is that good?
In [ ]:
y_random = np.random.choice([0,1], size=len(y_test))
np.mean(y_random == y_test)
Better than random, worse than perfect...
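Another quick sanity check (a sketch, not part of the original exercise) is the accuracy of always predicting the majority class from the training labels:
In [ ]:
# sketch: accuracy of always guessing the most common class in the training set
majority_class = int(y_train.mean() > .5)
np.mean(y_test == majority_class)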
In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.LinearRegression()
clf.fit(X_train, y_train)
In [ ]:
y_pred = clf.predict(__________________)
In [ ]:
y_pred = __________________
In [ ]:
np.mean(y_pred == y_test)
Actually, just about as good as naive Bayes.
In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)
In [ ]:
y_pred = clf.predict(X_test)
In [ ]:
np.mean(y_pred == y_test)
Again, about the same...
In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.Perceptron()
clf.fit(X_train, y_train)
In [ ]:
y_pred = clf.predict(X_test)
In [ ]:
np.mean(y_pred == y_test)
So it is possible to do worse. Bonus challenge: I think you can change the parameters to get this up to 75% concordance. Can you?
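For the bonus challenge, here is one hedged starting point, not a known solution: the Perceptron can be sensitive to regularization and the number of passes over the data, so parameters like penalty and max_iter are worth experimenting with (max_iter assumes a recent scikit-learn; older versions call it n_iter):
In [ ]:
# sketch for the bonus challenge: these parameter values are a guess, not a verified answer
clf = sklearn.linear_model.Perceptron(penalty='l1', alpha=1e-4, max_iter=50, random_state=12345)
clf.fit(X_train, y_train)
np.mean(clf.predict(X_test) == y_test)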
In [ ]:
import sklearn.tree
clf = sklearn.tree.DecisionTreeClassifier()
clf.fit(__________________, __________________)
In [ ]:
y_pred = clf.predict(__________________)
In [ ]:
np.mean(y_pred == __________________)
A tiny improvement!
Now refactor this process into a function that, for any sklearn classifier, returns its out-of-sample predictive accuracy on cell phone ownership:
In [ ]:
def oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
__________________
Figure out a way to test it:
In [ ]:
__________________
Bonus challenge: Figure out a way to make it work for the linear regression predictor
(Hard because we had to round the numeric predictions)
In [ ]:
oos_accuracy(sklearn.linear_model.LinearRegression()) # got .763 before
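One sketch of the rounding idea (this assumes rounding the regressor's continuous predictions to the nearest label is acceptable):
In [ ]:
# sketch: round the continuous predictions before comparing to the 0/1 labels
reg = sklearn.linear_model.LinearRegression()
reg.fit(X_train, y_train)
np.mean(np.round(reg.predict(X_test)) == y_test)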
In [ ]:
def fixed_oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
""" Calculate out-of-sample predictive accuracy of cell phone ownership
prediction
Parameters
----------
clf : sklearn classifier or regressor
X_train, y_train, X_test, y_test : training and test data and labels
Returns
-------
out-of-sample accuracy (float); the trained classifier is stored in clf
"""
__________________
In [ ]:
fixed_oos_accuracy(sklearn.linear_model.LinearRegression()) # got .763 before
In [ ]: