Lists

Before coming into the Notebook, spend some time in an interactive session learning about sequences (strings, lists), and doing basic indexing, slicing, append(), in, etc.

Then you can come in here...



In [1]:

    
layers = [0.23, 0.34, 0.45, 0.25, 0.23, 0.35]



In [2]:

    
uppers = layers[:-1]
lowers = layers[1:]



In [3]:

    
rcs = []
for pair in zip(lowers, uppers):
    rc = (pair[1] - pair[0]) / (pair[1] + pair[0])
    rcs.append(rc)



In [4]:

    
rcs









    Out[4]:





[-0.1929824561403509,
 -0.1392405063291139,
 0.28571428571428575,
 0.04166666666666665,
 -0.2068965517241379]

Functions

Definition, inputs, side-effects, returning, scope, docstrings



In [5]:

    
# Exercise
def compute_rc(layers):
    """
    Computes reflection coefficients given
    a list of layer impedances.
    """
    uppers = layers[:-1]
    lowers = layers[1:]
    rcs = []
    for pair in zip(lowers, uppers):
        rc = (pair[1] - pair[0]) / (pair[1] + pair[0])
        rcs.append(rc)
    return rcs



In [6]:

    
compute_rc(layers)









    Out[6]:





[-0.1929824561403509,
 -0.1392405063291139,
 0.28571428571428575,
 0.04166666666666665,
 -0.2068965517241379]

Put in a file and import into a new notebook

Numpy

Before continuing, do some basic NumPy array stuff in the interpreter.

Let's make a really big 'log' from random numbers:



In [7]:

    
import numpy as np  # Just like importing file



In [8]:

    
biglog = np.random.random(10000000)
%timeit compute_rc(biglog)









    



1 loops, best of 3: 5.65 s per loop

Note that the log has to be fairly big for the benchmarking to work properly, because otherwise the CPU caches the computation and this skews the results.

Now we can re-write our function using arrays instead of lists.



In [9]:

    
# Exercise
def compute_rc_vector(layers):
    uppers = layers[:-1]
    lowers = layers[1:]
    return (lowers - uppers) / (uppers + lowers)



In [10]:

    
%timeit compute_rc_vector(biglog)









    



10 loops, best of 3: 90 ms per loop

60 times faster on my machine!

Aside: more performance with `numba`



In [11]:

    
from numba import jit



In [12]:

    
@jit
def compute_rc_numba(layers):
    uppers = layers[:-1]
    lowers = layers[1:]
    return (lowers - uppers) / (uppers + lowers)



In [13]:

    
%timeit compute_rc_numba(biglog)









    



The slowest run took 12.52 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 88.2 ms per loop

OK, we'll make a fake example.



In [14]:

    
def compute_rc_slow(layers):
    uppers = layers[:-1]
    lowers = layers[1:]
    rcs = np.zeros_like(uppers)
    for i in range(rcs.size):
        rcs[i] = (lowers[i] - uppers[i]) / (uppers[i] + lowers[i])
    return rcs



In [15]:

    
%timeit compute_rc_slow(biglog)









    



1 loops, best of 3: 6.54 s per loop



In [16]:

    
@jit
def compute_rc_faster(layers):
    uppers = layers[:-1]
    lowers = layers[1:]
    rcs = np.zeros_like(uppers)
    for i in range(rcs.size):
        rcs[i] = (lowers[i] - uppers[i]) / (uppers[i] + lowers[i])
    return rcs



In [17]:

    
%timeit compute_rc_faster(biglog)









    



The slowest run took 29.06 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 64.4 ms per loop

However, you can't speed up our original list-based function this way.



In [18]:

    
@jit
def compute_rc_hopeful(layers):
    """
    Computes reflection coefficients given
    a list of layer impedances.
    """
    uppers = layers[:-1]
    lowers = layers[1:]
    rcs = []
    for pair in zip(lowers, uppers):
        rc = (pair[1] - pair[0]) / (pair[1] + pair[0])
        rcs.append(rc)
    return rcs



In [19]:

    
%timeit compute_rc_hopeful(biglog)









    



1 loops, best of 3: 5.33 s per loop

Plotting basics



In [20]:

    
import matplotlib.pyplot as plt
%matplotlib inline

Not we can only plot part of biglog because it contains too many points for matplotlib (and for our screen!). If we really wanted to plot it, we'd have to find a way to upscale it.



In [21]:

    
plt.plot(biglog[:500])









    Out[21]:





[<matplotlib.lines.Line2D at 0x10d6c7588>]



In [22]:

    
fig = plt.figure(figsize=(15,2))
ax = fig.add_subplot(111)
ax.plot(biglog[:500])
ax.set_title("big log")
plt.show()

Objected oriented basics

The point is that we often want to store data along with relevant functions (methods) in one 'thing' — an object.

Build this up, piece by piece.

Start with __init__() which is required anyway. Only define self.layers.

Then add rcs(), then plot(), and finally __len__(), once you discover that len(l) doesn't work, because this object doesn't have that property.



In [34]:

    
class Layers(object):
    
    def __init__(self, layers, label=None):
        # Just make sure we end up with an array
        self.layers = np.array(layers)
        self.label = label or "My log"
        self.length = self.layers.size  # But storing len in an attribute is unexpected...
        
    def __len__(self):  # ...better to do this.
        return len(self.layers)
        
    def rcs(self):
        uppers = self.layers[:-1]
        lowers = self.layers[1:]
        return (lowers-uppers) / (uppers+lowers)
    
    def plot(self, lw=0.5, color='#6699ff'):
        fig = plt.figure(figsize=(2,6))
        ax = fig.add_subplot(111)
        ax.barh(range(len(self.layers)), self.layers, color=color, lw=lw, align='edge', height=1.0, alpha=1.0, zorder=10)
        ax.grid(zorder=2)
        ax.set_ylabel('Layers')
        ax.set_title(self.label)
        ax.set_xlim([-0.5,1.0])
        ax.set_xlabel('Measurement (units)')
        ax.invert_yaxis()  
        #ax.set_xticks(ax.get_xticks()[::2])    # take out every second tick
        ax.spines['right'].set_visible(False)  # hide the spine on the right
        ax.yaxis.set_ticks_position('left')    # Only show ticks on the left and bottom spines
        
        plt.show()



In [35]:

    
l = Layers(layers, label='Well # 1')



In [36]:

    
l.rcs()









    Out[36]:





array([ 0.19298246,  0.13924051, -0.28571429, -0.04166667,  0.20689655])



In [37]:

    
len(l)









    Out[37]:





6



In [38]:

    
l.plot()



In [39]:

    
rel_interval = np.cumsum(l.rcs(), dtype=float) 
len(rel_interval)









    Out[39]:





5



In [40]:

    
relative_layers = np.insert(rel_interval, 0, 0)



In [41]:

    
relative_layers









    Out[41]:





array([ 0.        ,  0.19298246,  0.33222296,  0.04650868,  0.00484201,
        0.21173856])



In [42]:

    
relative = Layers(relative_layers, "relative")



In [43]:

    
relative.layers









    Out[43]:





array([ 0.        ,  0.19298246,  0.33222296,  0.04650868,  0.00484201,
        0.21173856])



In [44]:

    
relative.plot()

Web scraping basics



In [ ]:

    
url = "http://en.wikipedia.org/wiki/Jurassic"

Use View Source in your browser to figure out where the age range is on the page, and what it looks like.

Try to find the same string here.



In [ ]:

    
import requests
r = requests.get(url)
r.text[:500]

Using a regular expression:



In [ ]:

    
import re

s = re.search(r'<i>(.+?million years ago)</i>', r.text)
text = s.group(1)

Exercise: Make a function to get the start and end ages of any geologic period, taking the name of the period as an argument.



In [ ]:

    
def get_age(period):
    url = "http://en.wikipedia.org/wiki/" + period
    r = requests.get(url)
    start, end = re.search(r'<i>([\.0-9]+)–([\.0-9]+)&#160;million years ago</i>', r.text).groups()
    return float(start), float(end)



In [ ]:

    
period = "Jurassic"
get_age(period)



In [ ]:

    
def duration(period):
    t0, t1 = get_age(period)
    duration = t0 - t1
    response = "According to Wikipedia, the {0} period was {1:.2f} Ma long.".format(period, duration)
    return response



In [ ]:

    
duration('Cretaceous')

Functional programming basics

In Python, functions are first class objects — you can pass them around like other objects.

Sometimes it's convenient to think in terms of map() and reduce(). This pattern is fundamental in data analytics and some packages, such as pandas.

A couple of very common functions, sorted() and filter(), always works with a function as one of their arguments. It can be confusing the first time you see it.

With no functional argument, sorted does what you'd expect:



In [ ]:

    
l = [0.001, 1, 3, 51, 41 , 601]
sorted(l)

What if we want to sort based on the number of characters? Don't ask why, we just do. Then we write a function that returns a key which, when sorted, will give the ordering we want.



In [ ]:

    
def strlen(n):
    return len(str(n))



In [ ]:

    
sorted(l, key=strlen)

We could rewrite that tiny function as a lambda, which is basically a little unnamed function:



In [ ]:

    
sorted(l, key=lambda n: len(str(n)))

When would you make a named function vs a lambda? It all depends on whether you want to use it again or not.

`map` and `lambda`

You can think of map as 'apply this function to all of these items'.



In [ ]:

    
def sq(n):
    return n**2

# In Python 3, map produces an iterator, not a list.
# So we must cast to list to inspect its contents.
list(map(sq, l))

We can get around defining that tiny function sq() with a lambda, which you can think of as a temporary, 'throwaway' function:



In [ ]:

    
list(map(lambda n: n**2, l))

In practice, we'd often write this as a list comprehension. Then we can skip the creation of the funciton or lambda entirely:



In [ ]:

    
[n**2 for n in l]

One of the advantages of map is that it is 'lazy', so if you map a function to a giant list, you don't get a giant list back, you get an iterator. A list-comp would give you a giant list, possibly jamming the memory on your box.

`reduce`

reduce takes a sequence and applies some function to it recursively. You could think of it like a running sum, say, but for any function, not just summing.



In [ ]:

    
def runsum(a, b):
    return a + b

# For some reason reduce is not in the main namespace like map
from functools import reduce

reduce(runsum, l)



In [ ]:

    
def runmult(a, b):
    return a * b

reduce(runmult, l)

`partial` for making curry

'Preapplying' an argument to a function is sometimes useful. For example, we might have a general function for raising a number a to the power b, but then want another function which raises numbers to the 3rd power.

I could do this by simply calling the first function from the second:



In [ ]:

    
def power(a, b):
    return a**b

def cuber(a):
    return power(a, 3)

cuber(2)

But some people might find it more inuitive to do it this way:



In [ ]:

    
from functools import partial

cuber = partial(power, b=3)

cuber(2)

This is called 'currying' the function power().