Read ASCII or CSV from web



In [2]:

    
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline

Read the data

NumPy has a convenient function, loadtxt, that can load a CSV file. It needs a file, and ours is on the web. That's OK: we don't need to save it to disk first. We can pass its text content to a StringIO object, which behaves like a file handle opened for reading text.



In [3]:

    
import requests
import io

r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')
f = io.StringIO(r.text)
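
To see the file-handle behaviour in isolation, here is a minimal sketch that needs no network access. The string sample_text is a made-up stand-in for r.text, and the names sample_f and sample are only for illustration; io.StringIO and np.loadtxt are the same calls used in this notebook.

```python
import io

import numpy as np

# Hypothetical stand-in for r.text: a header row, then numeric rows.
sample_text = "Depth,GR\n2793.0,77.45\n2793.5,78.26\n"

# Wrap the string so it can be read like an open text file.
sample_f = io.StringIO(sample_text)

# loadtxt reads from the in-memory buffer just as it would from a file on disk.
sample = np.loadtxt(sample_f, skiprows=1, delimiter=',')
print(sample.shape)
```

Nothing is written to disk at any point; the buffer lives entirely in memory.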

We can't just load the whole thing, because we want NumPy to handle only an array of floats, and this file also contains text metadata columns (we can't tell that yet; I just happen to know it, and it's normal for CSV files).
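
As a rough illustration of the problem, the sketch below tries to load a tiny made-up excerpt that has a text column. The string mixed_text and the flag load_failed are hypothetical names used only here; the point is that loadtxt, which parses floats by default, raises a ValueError when it meets text.

```python
import io

import numpy as np

# Hypothetical two-line excerpt with a text column, like 'Formation'.
mixed_text = "Facies,Formation,Depth\n3,A1 SH,2793.0\n"

try:
    np.loadtxt(io.StringIO(mixed_text), skiprows=1, delimiter=',')
    load_failed = False
except ValueError as err:
    # 'A1 SH' cannot be parsed as a float, so the load is rejected.
    load_failed = True
    print(err)
```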

Let's look at the first few rows:



In [4]:

    
r.text.split('\n')[:5]









    Out[4]:





['Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS',
 '3,A1 SH,SHRIMPLIN,2793.0,77.45,0.664,9.9,11.915,4.6,1,1.0',
 '3,A1 SH,SHRIMPLIN,2793.5,78.26,0.661,14.2,12.565,4.1,1,0.979',
 '3,A1 SH,SHRIMPLIN,2794.0,79.05,0.658,14.8,13.05,3.6,1,0.957',
 '3,A1 SH,SHRIMPLIN,2794.5,86.1,0.655,13.9,13.115,3.5,1,0.936']

For convenience later, we'll make a list of the features we're going to use.



In [5]:

    
# Take the header row and drop the first three columns
# (Facies, Formation, Well Name), which are not features.
features = r.text.split('\n')[0].split(',')[3:]
features









    Out[5]:





['Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']

Now we'll load the data we want. First the feature vectors, X...



In [6]:

    
X = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[3,4,5,6,7,8,9,10])

And the label vector, y:



In [7]:

    
_ = f.seek(0)  # Reset the file reader.
y = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[0])



In [8]:

    
X.shape, y.shape









    Out[8]:





((3232, 8), (3232,))
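
The two loadtxt calls above read the same text twice. One possible alternative, sketched here under the assumption that a single pass is preferable, is to read the label and feature columns together and slice the result. The excerpt reuses the header and first two data rows printed earlier; the names sample_text, cols, data, X_demo and y_demo are illustrative only, and casting the labels to int is my choice, not something the notebook does.

```python
import io

import numpy as np

# Header plus the first two data rows printed earlier in this notebook.
sample_text = (
    "Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS\n"
    "3,A1 SH,SHRIMPLIN,2793.0,77.45,0.664,9.9,11.915,4.6,1,1.0\n"
    "3,A1 SH,SHRIMPLIN,2793.5,78.26,0.661,14.2,12.565,4.1,1,0.979\n"
)

# Read the label column (0) and the eight feature columns (3 to 10) together,
# skipping the two text columns in between.
cols = [0] + list(range(3, 11))
data = np.loadtxt(io.StringIO(sample_text), skiprows=1, delimiter=',', usecols=cols)

y_demo = data[:, 0].astype(int)  # facies labels, cast to integers (an assumption)
X_demo = data[:, 1:]             # the remaining columns are the features
print(X_demo.shape, y_demo.shape)
```

With this approach there is no need to call seek(0) between reads, because the buffer is consumed only once.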

We have data! We're almost ready to train; we just have to get our test / train subsets sorted.