In [ ]:
!date
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

In [ ]:
# set random seed, for reproducibility
np.random.seed(12345)

# funny little var that lets cells with unfilled blanks run without a NameError
__________________ = None

Load, clean, and prepare DHS asset ownership data:


In [ ]:
df = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS.CSV', index_col=0)

# have a look at what is in this data frame
__________________
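
If you're not sure what to put in the blank, one possibility (a sketch; `df.head()` and `df.describe()` are both standard pandas methods):


In [ ]:
df.head()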

In [ ]:
cb = pd.read_csv('RWA_DHS6_2010_2011_HH_ASSETS_codebook.CSV', index_col=0)

# cb stands for codebook.  have a look at what the funny column names mean
__________________
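
The same trick works for the codebook (a sketch):


In [ ]:
cb.head()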

Wouldn't it be nice if the column names were descriptive, instead of codes?


In [ ]:
# find a dictionary mapping codes to descriptions

# is it simply cb.to_dict()?
cb.to_dict()

In [ ]:
# no, not quite.  but it is in there:

cb.to_dict().get(__________________)

In [ ]:
# you can use pd.Series.map to translate all the names at once,
# but a plain list of column names has no .map method, so wrap df.columns in a Series

# too bad...

pd.Series(df.columns).map(cb.to_dict()['full name'])

In [ ]:
df.columns = pd.Series(df.columns).map(cb.to_dict()['full name'])

# did we get that right?
__________________
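
One possible check (a sketch): look at the new column labels directly.


In [ ]:
df.columns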

In [ ]:
# have a look at the survey results:

(100*df.mean()).sort_values().plot(kind='barh')
plt.xlabel('Percent endorsed')

Now make an array of feature vectors and a corresponding array of labels:


In [ ]:
X = np.array(df.drop('has mobile telephone', axis=1))
y = np.array(df['has mobile telephone'])

And split the data into training and test sets (we'll talk more about this next week!)


In [ ]:
train = np.random.choice(range(len(df.index)), size=int(len(df.index)*.8), replace=False)
test = sorted(set(range(len(df.index))) - set(train))
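
For reference, scikit-learn ships a helper that does this kind of split in one call; a sketch, assuming a recent version where it lives in `sklearn.model_selection` (very old versions kept it in `sklearn.cross_validation`):


In [ ]:
from sklearn.model_selection import train_test_split

# throwaway names, to avoid clobbering the split made by hand above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=.2, random_state=12345)
len(X_tr), len(X_te)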

In [ ]:
X_test = X[test]
y_test = y[test]

X_train = X[train]
y_train = y[train]

Does it look reasonable?


In [ ]:
len(X_test), len(X_train)

In [ ]:
y_test.mean(), y_train.mean()

In [ ]:
X_test.mean(axis=0).round(2)

In [ ]:
X_train.mean(axis=0).round(2)

Now, let us try a range of prediction methods:

Naïve Bayes


In [ ]:
import sklearn.naive_bayes
clf = sklearn.naive_bayes.BernoulliNB()
clf.fit(__________________, __________________)

In [ ]:
y_pred = clf.predict(__________________)

In [ ]:
np.mean(y_pred == y_test)
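
In case the blanks above gave you trouble, here is one complete version (a sketch, following the same fit/predict pattern used in the sections below):


In [ ]:
clf = sklearn.naive_bayes.BernoulliNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
np.mean(y_pred == y_test)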

Is that good?


In [ ]:
y_random = np.random.choice([0,1], size=len(y_test))

np.mean(y_random == y_test)

Better than random, worse than perfect...
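
Another quick baseline (a sketch): always guess the most common class in the training data.


In [ ]:
# accuracy of always predicting the majority class
majority = int(y_train.mean() > .5)
np.mean(majority == y_test)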

Linear regression based prediction


In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.LinearRegression()
clf.fit(X_train, y_train)

In [ ]:
y_pred = clf.predict(__________________)

In [ ]:
y_pred = __________________
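
If the second blank is giving you trouble: the regression returns continuous values, so one option (a sketch) is to round them to the nearest label.


In [ ]:
y_pred = clf.predict(X_test).round()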

In [ ]:
np.mean(y_pred == y_test)

Actually, just about as good as Naïve Bayes.

Logistic regression


In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)

In [ ]:
y_pred = clf.predict(X_test)

In [ ]:
np.mean(y_pred == y_test)

Again, about the same...

Did you read about Perceptron?


In [ ]:
import sklearn.linear_model
clf = sklearn.linear_model.Perceptron()
clf.fit(X_train, y_train)

In [ ]:
y_pred = clf.predict(X_test)

In [ ]:
np.mean(y_pred == y_test)

So it is possible to do worse. Bonus challenge: I think you can change the parameters to get this up to 75% concordance. Can you?
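
If you want a starting point for that bonus challenge, here is a sketch that tries a few learning rates (parameter names assume a recent sklearn, where `eta0` and `max_iter` are standard `Perceptron` arguments):


In [ ]:
# try a few learning rates and see how accuracy moves
for eta0 in [.01, .1, 1., 10.]:
    clf = sklearn.linear_model.Perceptron(eta0=eta0, max_iter=1000)
    clf.fit(X_train, y_train)
    print(eta0, np.mean(clf.predict(X_test) == y_test))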

Decision Trees


In [ ]:
import sklearn.tree
clf = sklearn.tree.DecisionTreeClassifier()
clf.fit(__________________, __________________)

In [ ]:
y_pred = clf.predict(__________________)

In [ ]:
np.mean(y_pred == __________________)

A tiny improvement!

Now refactor this process, so that for any sklearn classifier you can call a function to find its out-of-sample predictive accuracy on cell phone ownership:


In [ ]:
def oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    __________________
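
If you want to check your work, one possible implementation (a sketch):


In [ ]:
def oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    clf.fit(X_train, y_train)
    return np.mean(clf.predict(X_test) == y_test)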

Figure out a way to test it:


In [ ]:
__________________
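
One way to test it (a sketch): run the classifiers from above through the new function and compare against the accuracies computed by hand.


In [ ]:
for clf in [sklearn.naive_bayes.BernoulliNB(),
            sklearn.linear_model.LogisticRegression(),
            sklearn.tree.DecisionTreeClassifier()]:
    print(clf.__class__.__name__, oos_accuracy(clf))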

Bonus challenge: Figure out a way to make it work for the linear regression predictor.

(Hard, because we had to round the numeric predictions.)


In [ ]:
oos_accuracy(sklearn.linear_model.LinearRegression())  # got .763 before

In [ ]:
def fixed_oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    """ Calculate out-of-sample predictive accuracy of cell phone ownership
    prediction
    
    Parameters
    ----------
    clf : sklearn classifier or regressor
    X_train, y_train, X_test, y_test : training and test data and labels
    
    Returns
    -------
    accuracy : float
        out-of-sample accuracy; as a side effect, the trained classifier remains in clf
    """
    
    __________________
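
One possible body (a sketch): round the predictions, which is a no-op for classifiers and fixes the regressor.


In [ ]:
def fixed_oos_accuracy(clf, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    clf.fit(X_train, y_train)
    y_pred = np.round(clf.predict(X_test))  # rounding is harmless for 0/1 predictions
    return np.mean(y_pred == y_test)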

In [ ]:
fixed_oos_accuracy(sklearn.linear_model.LinearRegression())  # got .763 before
