In [2]:
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline
In [3]:
import requests
import io
r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')
f = io.StringIO(r.text)
We can't just load the whole file, because we want NumPy to handle only an array of floats, and this file also contains metadata. (We can't tell that from here; I just happen to know it, and it's normal for CSV files.)
Let's look at the first few rows:
In [4]:
r.text.split('\n')[:5]
Out[4]:
For convenience later, we'll make a list of the features we're going to use.
In [5]:
# Column names from the header row, minus the first three columns,
# which are not used as features.
features = r.text.split('\n')[0].split(',')[3:]
features
Out[5]:
Now we'll load the data we want. First the feature vectors, X...
In [6]:
X = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[3,4,5,6,7,8,9,10])
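To see why `skiprows` and `usecols` matter here, this is a minimal sketch on a tiny made-up CSV (the column names and values are invented for illustration, not taken from the real file). It assumes only that the file mixes text and numeric columns, as described above.

```python
import io
import numpy as np

# Made-up stand-in for the real file: a header row, one label column,
# two text columns, and two numeric feature columns.
demo_csv = (
    "label,name,group,a,b\n"
    "1,foo,north,0.5,10.0\n"
    "2,bar,south,1.5,20.0\n"
)

# Loading every column fails, because 'foo' cannot be parsed as a float.
try:
    np.loadtxt(io.StringIO(demo_csv), skiprows=1, delimiter=',')
    loaded_everything = True
except ValueError:
    loaded_everything = False

# Skipping the header and selecting only the numeric columns works.
demo_X = np.loadtxt(io.StringIO(demo_csv), skiprows=1, delimiter=',', usecols=[3, 4])
print(loaded_everything, demo_X.shape)
```

Restricting `usecols` to the numeric columns is what lets NumPy return a plain float array.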
And the label vector, y:
In [7]:
_ = f.seek(0) # Reset the file reader.
y = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[0])
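As an alternative to reading the file twice, the labels and features could be loaded in a single call and then sliced apart. The sketch below uses a made-up CSV (hypothetical column names and values) and also shows why `seek(0)` is needed before a second read of the same file object.

```python
import io
import numpy as np

# Made-up stand-in data: label in column 0, text in columns 1-2,
# numeric features in columns 3-4.
demo_csv = (
    "label,name,group,a,b\n"
    "1,foo,north,0.5,10.0\n"
    "2,bar,south,1.5,20.0\n"
    "1,baz,north,2.5,30.0\n"
)
demo_f = io.StringIO(demo_csv)

# One pass: column 0 (labels) plus columns 3 and 4 (features).
both = np.loadtxt(demo_f, skiprows=1, delimiter=',', usecols=[0, 3, 4])
demo_y = both[:, 0]
demo_X = both[:, 1:]

# After the read, the file position sits at the end, so nothing is left.
leftover = demo_f.read()

# seek(0) rewinds to the start, making the header readable again.
demo_f.seek(0)
first_line = demo_f.readline()
```

The two-read approach in the notebook is just as valid; the single read only saves a pass over the file.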
In [8]:
X.shape, y.shape
Out[8]:
We have data! We're almost ready to train; we just have to sort out our train/test subsets.
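One common way to make such subsets is a random hold-out. The sketch below is only an illustration on made-up arrays (`demo_X` and `demo_y` are hypothetical stand-ins for `X` and `y`), and the 20% test fraction and fixed seed are assumptions; the split used for the real data may well differ.

```python
import numpy as np

rng = np.random.default_rng(0)  # Fixed seed so the split is repeatable.

# Made-up stand-ins for X and y: 10 samples, 3 features.
demo_X = np.arange(30, dtype=float).reshape(10, 3)
demo_y = np.arange(10, dtype=float)

# Shuffle the row indices, then hold out the last 20% as the test subset.
idx = rng.permutation(len(demo_y))
n_test = int(0.2 * len(demo_y))
train_idx, test_idx = idx[:-n_test], idx[-n_test:]

# Index X and y with the same indices so rows and labels stay aligned.
X_train, y_train = demo_X[train_idx], demo_y[train_idx]
X_test, y_test = demo_X[test_idx], demo_y[test_idx]
```

Indexing both arrays with the same shuffled indices keeps each feature vector paired with its label.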
© Agile Geoscience 2016