This notebook goes with the blog post of the same name.
We're going to go over a very simple machine learning exercise. We're using the data from the 2016 SEG machine learning contest.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
sklearn.__version__
# Should be 0.18
Out[1]:
In [2]:
import requests
import io
r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv') # 1
We can't just load it straight into NumPy, because we only want NumPy to handle an array of floats, and this file also contains metadata (you can't tell that from here; I just happen to know it... and it's normal for CSV files).
Pandas is really convenient for this sort of data.
In [3]:
import pandas as pd
df = pd.read_csv(io.StringIO(r.text)) # 2
df.head()
Out[3]:
I later learned that you can just do this...
In [ ]:
df = pd.read_csv('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')
df.head()
**A word about the data.** This dataset is not, strictly speaking, open data. It has been shared by the Kansas Geological Survey for the purposes of the contest. That's why I'm not copying the data into this repository, but instead reading it from the web. We are working on making an open access version of this dataset. In the meantime, I'd appreciate it if you didn't replicate the data anywhere. Thanks!
In [4]:
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE'] # 3
Now we'll load the data we want. First the feature vectors, X. We'll just get the logs, which are in columns 4 to 8:
In [5]:
X = df[features].values # 4
In [6]:
X.shape
Out[6]:
In [7]:
y = df.Facies.values # 5
In [8]:
y.shape
Out[8]:
In [9]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(16, 1))
plt.imshow(np.array([y]), cmap='viridis', aspect=100)
plt.show()
In [10]:
from sklearn.model_selection import train_test_split
In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y) # 6
In [12]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape
Out[12]:
Optional exercise: Use the docs for train_test_split to set the size of the test set, and also to set a random seed for the splitting.
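For example, a split like the following holds out 20% of the samples for testing and fixes the random seed so the split is reproducible (the exact fraction and seed here are just illustrative):
In [ ]:
# Illustrative values: hold out 20% for testing, fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape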
In [13]:
from sklearn.ensemble import ExtraTreesClassifier
In [14]:
clf = ExtraTreesClassifier() # 7
In [15]:
clf.fit(X_train, y_train) # 8
Out[15]:
In [16]:
clf.score(X_test, y_test)
Out[16]:
Optional exercise: Try changing some hyperparameters, e.g. verbose, n_estimators, n_jobs, and random_state.
In [ ]:
clf = ExtraTreesClassifier(... HYPERPARAMETERS GO HERE ...)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
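For instance, one possible (not necessarily optimal) set of values might look like this:
In [ ]:
# Example values only; tune them for your own data.
clf = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=0, verbose=0)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)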
All models have the same API (but not the same hyperparameters), so it's very easy to try lots of models.
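For example, swapping in a different model (here a RandomForestClassifier, chosen purely as an illustration) only changes the constructor call:
In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Same fit/score interface, different model.
clf_rf = RandomForestClassifier(random_state=0)
clf_rf.fit(X_train, y_train)
clf_rf.score(X_test, y_test)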
In [18]:
y_pred = clf.predict(X_test) # 9
A quick score:
In [19]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
Out[19]:
The confusion matrix, showing exactly what kinds of mistakes (type 1 and type 2 errors) we're making:
In [20]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
Out[20]:
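The matrix can be easier to read as an image; here's a minimal sketch (the colourmap choice is arbitrary):
In [ ]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 5))
plt.imshow(cm, cmap='viridis', interpolation='none')
plt.xlabel('Predicted facies')
plt.ylabel('True facies')
plt.colorbar()
plt.show()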
Finally, the classification report shows precision and recall for each facies (roughly speaking, 1 minus the type 1 and type 2 error rates, respectively), along with the combined F1 score:
In [21]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred)) # 10
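If you just want a single combined number, you can also ask for the weighted-average F1 score directly (a small extra, not part of the original walkthrough):
In [ ]:
from sklearn.metrics import f1_score

# Average F1 across classes, weighted by the number of true samples in each facies.
f1_score(y_test, y_pred, average='weighted')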
We should hold out entire wells, not just a random subset of samples. Otherwise we're training on data that's only one sample away from the data we're validating against.
To do this, we need a vector that tells us, for each sample, which well it belongs to.
In [22]:
wells = df['Well Name']
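It's worth a quick look at the groups we'll be leaving out one at a time; for example (just an inspection step, not required):
In [ ]:
# The unique well names, and how many there are.
wells.unique(), wells.nunique()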
In [23]:
from sklearn.model_selection import LeaveOneGroupOut
logo = LeaveOneGroupOut()
clf = ExtraTreesClassifier(random_state=0)
for train, test in logo.split(X, y, groups=wells):
    # train and test are the indices of the data to use.
    well_name = wells[test[0]]
    clf.fit(X[train], y[train])
    score = clf.score(X[test], y[test])
    print("{:>20s} {:.3f}".format(well_name, score))
© Agile Geoscience 2016