xlines of Python: machine learning

This notebook goes with the blog post of the same name.

We're going to go over a very simple machine learning exercise. We're using the data from the 2016 SEG machine learning contest.


In [1]:
import numpy as np
import matplotlib.pyplot as mpl
% matplotlib inline

import sklearn
sklearn.__version__

# Should be 0.18


Out[1]:
'0.17.1'

Read the data

numpy has a convenient function, loadtxt that can load a CSV file. It needs a file... and ours is on the web. That's OK, we don't need to download it, we can just read it by sending its text content to a StringIO object, which acts exactly like a file handle.


In [2]:
import requests
import io

r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')  # 1

We can't just load it, because we only want NumPy to have to handle an array of floats and there's metadata in this file (we can't tell that, I just happen to know it... and it's normal for CSV files).

Pandas is really convenient for this sort of data.


In [3]:
import pandas as pd

df = pd.read_csv(io.StringIO(r.text))  # 2

df.head()


Out[3]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 4.6 1 1.000
1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 4.1 1 0.979
2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 3.6 1 0.957
3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 3.5 1 0.936
4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 3.4 1 0.915

I later learned that you can just do this...


In [ ]:
df = pd.read_csv('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')

df.head()

**A word about the data.** This dataset is not, strictly speaking, open data. It has been shared by the Kansas Geological Survey for the purposes of the contest. That's why I'm not copying the data into this repository, but instead reading it from the web. We are working on making an open access version of this dataset. In the meantime, I'd appreciarte it if you didn't replicate the data anywhere. Thanks!

Get the feature vectors, X


In [4]:
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']  # 3

Now we'll load the data we want. First the feature vectors, X. We'll just get the logs, which are in columns 4 to 8:


In [5]:
X = df[features].values  # 4

In [6]:
X.shape


Out[6]:
(3232, 5)

Get the label vector, y


In [7]:
y = df.Facies.values  # 5

In [8]:
y.shape


Out[8]:
(3232,)

In [9]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16, 1))
plt.imshow(np.array([y]), cmap='viridis', aspect=100)
plt.show()


We have data! Almost ready to train, we just have to get our test / train subsets sorted.

Extracting some training data


In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y)  # 6

In [12]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape


Out[12]:
((2424, 5), (2424,), (808, 5), (808,))

Optional exercise: Use the docs for train_test_split to set the size of the test set, and also to set a random seed for the splitting.

Now the fun can really begin.

Training and evaluating a model


In [13]:
from sklearn.ensemble import ExtraTreesClassifier

In [14]:
clf = ExtraTreesClassifier()  # 7

In [15]:
clf.fit(X_train, y_train)  # 8


Out[15]:
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [16]:
clf.score(X_test, y_test)


Out[16]:
0.6410891089108911

Optional exercise: Try changing some hyperparameters, eg verbose, n_estimators, n_jobs, and random_state


In [ ]:
clf = ExtraTreesClassifier(... HYPERPARAMETERS GO HERE ...)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

All models have the same API (but not the same hyperparameters), so it's very easy to try lots of models.

Predict!


In [18]:
y_pred = clf.predict(X_test)  # 9

A quick score:


In [19]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)


Out[19]:
0.6410891089108911

The confusion matrix, showing exactly what kinds of mistakes (type 1 and type 2 errors) we're making:


In [20]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)


Out[20]:
array([[ 50,  11,   0,   1,   0,   1,   0,   0,   0],
       [ 10, 145,  30,   2,   0,   1,   0,   2,   0],
       [  3,  35,  93,   2,   3,   2,   1,   1,   0],
       [  0,  15,   1,  22,   5,   7,   0,   3,   0],
       [  0,   3,   8,   3,  22,  16,   0,   7,   0],
       [  0,   9,   1,   3,   9,  71,   2,  18,   2],
       [  2,   5,   2,   0,   0,   1,  12,   2,   0],
       [  1,   9,   6,   2,   2,  31,   1,  69,   5],
       [  0,   0,   0,   1,   1,   0,   1,   1,  34]])

Finally, the classification report shows the type 1 and type 2 error rates (well, 1 - the error) for each facies, along with the combined, F1, score:


In [21]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))  # 10


             precision    recall  f1-score   support

          1       0.76      0.79      0.78        63
          2       0.62      0.76      0.69       190
          3       0.66      0.66      0.66       140
          4       0.61      0.42      0.49        53
          5       0.52      0.37      0.44        59
          6       0.55      0.62      0.58       115
          7       0.71      0.50      0.59        24
          8       0.67      0.55      0.60       126
          9       0.83      0.89      0.86        38

avg / total       0.64      0.64      0.64       808

More in-depth evaluation: k-fold cross-validation

We should drop out entire wells, not just a bunch of random samples. Otherwise we're training on data that's just one sample away from data we're validating against.

To do this, we need a vector that contains an integer (or something) representing each unique well.


In [22]:
wells = df['Well Name']

In [23]:
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
clf = ExtraTreesClassifier(random_state=0)

for train, test in logo.split(X, y, groups=wells):
    # train and test are the indices of the data to use.
    well_name = wells[test[0]]
    clf.fit(X[train], y[train])
    score = clf.score(X[test], y[test])
    print("{:>20s}  {:.3f}".format(well_name, score))


     CHURCHMAN BIBLE  0.460
      CROSS H CATTLE  0.315
            LUKE G U  0.388
               NEWBY  0.376
               NOLAN  0.352
          Recruit F9  0.868
             SHANKLE  0.419
           SHRIMPLIN  0.524

© Agile Geoscience 2016