Gold Proximity in India

The dataset for this logistic regression exercise is described at:

http://www.stat.ufl.edu/~winner/data/gold_target1.txt

Columns (with their character positions in the file) are:

As (arsenic) level       1-8
Sb (antimony) level      10-16
Lineament proximity      24     /* 1=present, 0=absent  (0.5km)  */
Gold deposit proximity   32     /* 1=present, 0=absent  (0.5km)  */

In [1]:
import numpy as np
import pandas as pd
%pylab inline
pylab.style.use('ggplot')


Populating the interactive namespace from numpy and matplotlib

In [56]:
url = 'http://www.stat.ufl.edu/~winner/data/gold_target1.dat'
# The file is whitespace-delimited with no header row; a raw string avoids
# the invalid '\s' escape warning.
data_df = pd.read_csv(url, sep=r'\s+', header=None)

In [57]:
data_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 4 columns):
0    64 non-null float64
1    64 non-null float64
2    64 non-null int64
3    64 non-null int64
dtypes: float64(2), int64(2)
memory usage: 2.1 KB

In [58]:
data_df.columns = ['as_level', 'sb_level', 'l_proximity', 'gold_proximity']

In [59]:
data_df.head()


Out[59]:
   as_level  sb_level  l_proximity  gold_proximity
0      6.77      3.08            1               1
1     15.03      6.15            1               1
2      6.43      2.35            1               1
3      0.10      0.30            0               0
4      0.10      0.30            0               0

In [11]:
data_df.plot(kind='scatter', x='as_level', y='gold_proximity')


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f428ae37f0>

In [12]:
data_df.plot(kind='scatter', x='sb_level', y='gold_proximity')


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f428b415f8>

In [17]:
data_df['gold_proximity'].value_counts().plot(kind='bar')


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f42a0b0550>

In [21]:
counts = data_df.groupby(['gold_proximity', 'l_proximity']).size()
counts.unstack().plot(kind='bar')


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f428d18080>

In [24]:
import statsmodels.formula.api as sm
model = sm.logit(formula='gold_proximity ~ C(l_proximity)', data=data_df)
result = model.fit()
result.summary()


Optimization terminated successfully.
         Current function value: 0.360516
         Iterations 7
Out[24]:
                  Logit Regression Results
Dep. Variable:    gold_proximity      No. Observations:   63
Model:            Logit               Df Residuals:       61
Method:           MLE                 Df Model:           1
Date:             Mon, 17 Apr 2017    Pseudo R-squ.:      0.4721
Time:             20:00:51            Log-Likelihood:     -22.712
converged:        True                LL-Null:            -43.023
                                      LLR p-value:        1.848e-10

                         coef   std err        z    P>|z|   [0.025   0.975]
Intercept             -2.7081     0.730   -3.708    0.000   -4.139   -1.277
C(l_proximity)[T.1]    4.1352     0.860    4.807    0.000    2.449    5.821
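
The coefficients in the summary are on the log-odds scale, so they can be hard to read directly. The following is a minimal sketch of how they translate into probabilities and an odds ratio. The two numbers are copied by hand from the table above; the helper `sigmoid` and the variable names are illustrative and not part of the original notebook.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients copied from the Logit summary (assumed to be transcribed correctly).
intercept = -2.7081
coef_lineament = 4.1352

# Fitted probability of a nearby gold deposit without / with a nearby lineament.
p_absent = sigmoid(intercept)
p_present = sigmoid(intercept + coef_lineament)

# Exponentiating the coefficient gives the odds ratio for lineament presence.
odds_ratio = math.exp(coef_lineament)

print(f"P(gold | no lineament) = {p_absent:.3f}")
print(f"P(gold | lineament)    = {p_present:.3f}")
print(f"odds ratio             = {odds_ratio:.1f}")
```

Under these numbers the fitted probability is roughly 0.06 when no lineament is nearby and roughly 0.81 when one is, so a 0.5 threshold predicts a gold deposit exactly when `l_proximity` is 1.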

In [45]:
from sklearn.model_selection import KFold
fold = KFold(n_splits=5, shuffle=True)

def cross_validate(train_index, test_index):
    train_data, test_data = data_df.iloc[train_index], data_df.iloc[test_index]
    
    t_model = sm.logit(formula='gold_proximity ~ C(l_proximity)', data=train_data)
    t_result = t_model.fit()
    t_predict = t_result.predict(test_data)
    t_predict = t_predict.apply(lambda v: 1.0 if v >= 0.5 else 0.0)
    
    return pd.concat({'predicted': t_predict, 'actual': test_data['gold_proximity']}, axis=1)

cv_results = [cross_validate(train_index, test_index) for train_index, test_index in fold.split(data_df.index)]


Optimization terminated successfully.
         Current function value: 0.414925
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.277641
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.337705
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.408391
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.335184
         Iterations 7
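
Each of the five fits above uses a different train/test partition. As a rough illustration of what a shuffled 5-fold split does, here is a pure-Python sketch. It is not scikit-learn's implementation; the function `kfold_indices` and the seed are hypothetical, and the row count of 64 is taken from the `data_df.info()` output.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    """Return (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable

    # The first `extra` folds get one additional sample when n_samples
    # is not divisible by n_splits.
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(n_samples=64, n_splits=5)
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Every row lands in exactly one test fold, so each observation is predicted once by a model that did not see it during fitting. With test folds of only 12 or 13 rows, per-fold metrics can vary noticeably from split to split.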

In [46]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

for idx, cv_result in enumerate(cv_results):
    # confusion_matrix expects (y_true, y_pred): rows are actual, columns are predicted
    c = confusion_matrix(cv_result['actual'], cv_result['predicted'])
    pylab.figure()
    ax = sns.heatmap(c, annot=True)
    ax.set_title('Confusion Matrix for fold %s' % (idx+1))



In [48]:
from sklearn.metrics import f1_score

# f1_score expects (y_true, y_pred)
scores = [f1_score(r['actual'], r['predicted']) for r in cv_results]
sc = pd.Series(data=scores, name='F1_scores')
sc.plot(kind='bar', title='5-fold cross validation results')


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f42c5c2908>
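
For reference, the F1 score plotted per fold is the harmonic mean of precision and recall, and it can be computed directly from confusion-matrix counts. The sketch below assumes the usual binary definitions; the function `f1_from_counts` and the counts passed to it are hypothetical and are not taken from the folds above.

```python
def f1_from_counts(tp, fp, fn):
    """Binary F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a single test fold.
example_f1 = f1_from_counts(tp=5, fp=1, fn=2)

# Swapping the roles of actual and predicted labels exchanges fp and fn,
# which leaves the binary F1 score unchanged.
swapped_f1 = f1_from_counts(tp=5, fp=2, fn=1)

print(round(example_f1, 4), round(swapped_f1, 4))
```

True negatives do not enter the formula, so F1 says nothing about how well the absent class is predicted; the confusion matrices are the place to check that.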