In [ ]:
!date
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

In [ ]:
# set random seed, for reproducibility
np.random.seed(12345)

# funny little var that lets cells with unfilled blanks run without a NameError
__________________ = None

Load, clean, and prepare DHS asset ownership data:


In [ ]:
df = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS.CSV', index_col=0)

# have a look at what is in this data frame
__________________
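
If you're not sure what to put in the blank, one possibility (a sketch; `df.head()` and `df.describe()` are both standard pandas methods):


In [ ]:
df.head()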

In [ ]:
cb = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS_codebook.CSV', index_col=0)

# cb stands for codebook.  have a look at what the funny column names mean
__________________
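
The same trick works for the codebook (a sketch):


In [ ]:
cb.head()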

Wouldn't it be nice if the column names were descriptive, instead of codes?


In [ ]:
# find a dictionary mapping codes to descriptions

# is it simply cb.to_dict()?
cb.to_dict()

In [ ]:
# no, not quite.  but it is in there:

cb.to_dict().get(__________________)

In [ ]:
# you can use pd.Series.map to translate all the names at once,
# but a plain list of column names has no .map method, so wrap df.columns in a Series

# too bad...

pd.Series(df.columns).map(cb.to_dict()['full name'])

In [ ]:
df.columns = pd.Series(df.columns).map(cb.to_dict()['full name'])

# did we get that right?
__________________
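
One possible check (a sketch): look at the new column labels directly.


In [ ]:
df.columns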

In [ ]:
# have a look at the survey results:

(100*df.mean()).sort_values().plot(kind='barh')
plt.xlabel('Percent endorsed')

Now make an array of feature vectors and a corresponding array of labels:


In [ ]:
X = np.array(df.drop('has mobile telephone', axis=1))
y = np.array(df['has mobile telephone'])

And split the data into training and test sets (we'll talk more about this next week!)


In [ ]:
train = np.random.choice(range(len(df.index)), size=int(len(df.index)*.8), replace=False)
test = sorted(set(range(len(df.index))) - set(train))
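
For reference, scikit-learn ships a helper that does this kind of split in one call; a sketch, assuming a recent version where it lives in `sklearn.model_selection` (very old versions kept it in `sklearn.cross_validation`):


In [ ]:
from sklearn.model_selection import train_test_split

# throwaway names, to avoid clobbering the split made by hand above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=.2, random_state=12345)
len(X_tr), len(X_te)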

In [ ]:
X_test = X[test]
y_test = y[test]

X_train = X[train]
y_train = y[train]

Does it look reasonable?


In [ ]:
len(X_test), len(X_train)

In [ ]:
y_test.mean(), y_train.mean()

In [ ]:
X_test.mean(axis=0).round(2)

In [ ]:
X_train.mean(axis=0).round(2)

Now, let us try a range of prediction methods:

Naïve Bayes


In [ ]:
import sklearn.naive_bayes
clf = sklearn.naive_bayes.BernoulliNB()
clf.fit(__________________, __________________)

In [ ]:
y_pred = clf.predict(__________________)

In [ ]:
np.mean(y_pred == y_test)
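
In case the blanks above gave you trouble, here is one complete version (a sketch, following the same fit/predict pattern used in the sections below):


In [ ]:
clf = sklearn.naive_bayes.BernoulliNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
np.mean(y_pred == y_test)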

Is that good?


In [ ]:
y_random = np.random.choice([0,1], size=len(y_test))

np.mean(y_random == y_test)

Better than random, worse than perfect...
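
Another quick baseline (a sketch): always guess the most common class in the training data.


In [ ]:
# accuracy of always predicting the majority class
majority = int(y_train.mean() > .5)
np.mean(majority == y_test)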

Linear regression based prediction


In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.LinearRegression()
clf.fit(X_train, y_train)

In [ ]:
y_pred = clf.predict(__________________)

In [ ]:
y_pred = __________________
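
If the second blank is giving you trouble: the regression returns continuous values, so one option (a sketch) is to round them to the nearest label.


In [ ]:
y_pred = clf.predict(X_test).round()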

In [ ]:
np.mean(y_pred == y_test)

Actually, just about as good as Naïve Bayes.

Logistic regression


In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)

In [ ]:
y_pred = clf.predict(X_test)

In [ ]:
np.mean(y_pred == y_test)

Again, about the same...

Did you read about Perceptron?


In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.Perceptron()
clf.fit(X_train, y_train)

In [ ]:
y_pred = clf.predict(X_test)

In [ ]:
np.mean(y_pred == y_test)

So it is possible to do worse. Bonus challenge: I think you can change the parameters to get this up to 75% concordance. Can you?
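
If you want a starting point for that bonus challenge, here is a sketch that tries a few learning rates (parameter names assume a recent sklearn, where `eta0` and `max_iter` are standard `Perceptron` arguments):


In [ ]:
# try a few learning rates and see how accuracy moves
for eta0 in [.01, .1, 1., 10.]:
    clf = sklearn.linear_model.Perceptron(eta0=eta0, max_iter=1000)
    clf.fit(X_train, y_train)
    print(eta0, np.mean(clf.predict(X_test) == y_test))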

Decision Trees


In [ ]:
import sklearn.tree
clf = sklearn.tree.DecisionTreeClassifier()
clf.fit(__________________, __________________)

In [ ]:
y_pred = clf.predict(__________________)

In [ ]:
np.mean(y_pred == __________________)

A tiny improvement!

Now refactor this process, so that for any sklearn classifier you can call a function to find its out-of-sample predictive accuracy on cell phone ownership:


In [ ]:
def oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    __________________
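
If you want to check your work, one possible implementation (a sketch):


In [ ]:
def oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    clf.fit(X_train, y_train)
    return np.mean(clf.predict(X_test) == y_test)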

Figure out a way to test it:


In [ ]:
__________________
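
One way to test it (a sketch): run the classifiers from above through the new function and compare against the accuracies computed by hand.


In [ ]:
for clf in [sklearn.naive_bayes.BernoulliNB(),
            sklearn.linear_model.LogisticRegression(),
            sklearn.tree.DecisionTreeClassifier()]:
    print(clf.__class__.__name__, oos_accuracy(clf))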

Bonus challenge: Figure out a way to make it work for the linear regression predictor.

(Hard, because we had to round the numeric predictions.)


In [ ]:
oos_accuracy(sklearn.linear_model.LinearRegression())  # got .763 before

In [ ]:
def fixed_oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    """ Calculate out-of-sample predictive accuracy of cell phone ownership
    prediction
    
    Parameters
    ----------
    clf : sklearn classifier or regressor
    X_train, y_train, X_test, y_test : training and test data and labels
    
    Returns
    -------
    accuracy : float
        out-of-sample accuracy; as a side effect, the trained classifier remains in clf
    """
    
    __________________
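
One possible body (a sketch): round the predictions, which is a no-op for classifiers and fixes the regressor.


In [ ]:
def fixed_oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    clf.fit(X_train, y_train)
    y_pred = np.round(clf.predict(X_test))  # rounding is harmless for 0/1 predictions
    return np.mean(y_pred == y_test)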

In [ ]:
fixed_oos_accuracy(sklearn.linear_model.LinearRegression())  # got .763 before
