In [1]:
# Pretty, big plots that are inline with my cells.
# See the matplotlib section below for more info.
import matplotlib as mpl
import seaborn
%matplotlib inline
mpl.rcParams['figure.figsize'] = (14, 5)
seaborn.set_style("darkgrid")
Quick note : this section will go fairly slowly on Numpy, and will speed up for Scipy and Matplotlib. This is to give you an idea of how things are done - just a feel for the language and syntax. Beyond that, your applications may be very different from mine, and by far the best thing to do then is to Google whatever you'd like to do. For Voronoi tessellations, for example, looking up voronoi tessellation python will give you the function scipy.spatial.Voronoi, along with examples on how to use it.
The core scientific Python stack consists of Numpy, Scipy, and Matplotlib. Numpy brings to Python what Matlab is well known for : strong array manipulation and matrix operations, elementary mathematical functions, random numbers, ...
Let's import numpy and create a few arrays.
In [2]:
import numpy as np
A = np.ones(10) # an array of ones ( floating point type )
B = np.arange(5, 10, 0.2) # like range(), but allows non-integer stride
print(A, "\n")
print(B, "\n")
In [3]:
C = np.linspace(0, 2 * np.pi, 10) # c evenly spaced points between a and b ( inclusive )
D = np.sin(C)
E = np.random.normal(0, 1, size=5) # also beta, binomial, gamma,
# poisson, uniform, lognormal,
# negative binomial, geometric, ...
print(C, "\n")
print(D, "\n")
print(E)
Before we go on, a note on namespaces. A namespace is a way to "partition" your functions into different spaces. This is particularly useful because you may have functions with the same name, but they may not do the same thing.
In [4]:
# Another function for sin can be found in the Python math library
import math
print(math.sin(0.5))
print(np.sin(0.5))
math.sin(0.5) == np.sin(0.5)
Out[4]:
In [5]:
# BUT
x = np.linspace(0, 2 * np.pi, 20)
print(np.sin(x))
print(math.sin(x))
Note how this function doesn't like to be passed lists or arrays. It's thrown us an error, and provided us with a traceback - a "chronological" trace of where the error occurred, and what caused it.
This one's not particularly long, as the error is immediate ( and not in a subfunction ), but it does tell us which line it's on. It seems that math.sin() doesn't like to be passed anything but length-1 arrays - that is, single floats or ints. Not particularly useful when we're trying to be array-oriented, which is what numpy excels at.
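If you really do need a scalar-only function applied across an array, np.vectorize will wrap it for you - a convenience ( essentially a loop ), not a speed-up :
vsin = np.vectorize(math.sin) # wrap the scalar-only function so it accepts arrays
print(vsin(np.linspace(0, 1, 5)))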
In [6]:
# Numpy arrays can be easily manipulated
blah = np.random.lognormal(0, 1, size=(4, 3))
print(np.dot(blah.T, blah), "\n")
print(blah.shape, "\n")
print(type(blah))
The numpy array is its own type, but it's still a container. It expands on the capabilities of the standard Python list. Access is easier though : for multidimensional arrays, you just use a comma to separate dimensions.
In [7]:
print(type(blah[0, 0]), "\n")
print(blah[:, 2::2])
The : operator is still used. Standard arithmetic is element-wise. Dimensions are broadcast.
In [8]:
print(blah[:, 1] * blah[:, 2] - 2.5, "\n")
print(blah.sum(), blah.mean(), blah.min(), blah.max(), blah.var(), blah.cumsum().std())
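Broadcasting deserves a quick illustration - a minimal sketch, with v and M as throwaway arrays :
v = np.arange(3) # shape (3,)
M = np.ones((4, 3)) # shape (4, 3)
print(M + v) # v is stretched across the four rows of M
print(blah - blah.mean(axis=0)) # the same mechanism centres each column of blah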
In an interactive environment ( IPython console, Notebook, ... ), you can use tab-completion.
In [9]:
# We can manipulate the shapes of arrays
print(blah.shape, "\n")
print(blah.ravel().shape, "\n") # Unravel / flatten the array
print(blah.reshape(2, 6).shape, "\n")
print(blah.reshape(6, 2).T)
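One handy detail : reshape() will infer a single dimension if you pass it -1.
print(blah.reshape(2, -1).shape) # numpy works out the second dimension : (2, 6)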
Linear Algebra
Numpy has a module called linalg, which offers things like norms, decompositions, matrix powers, inner and outer products, trace, eigenvalues, etc...
In [10]:
# Linear algebra
A = np.random.normal(size=(3, 3))
print(np.linalg.inv(A))
In [11]:
# Modules can be imported under their own namespace for short
import numpy.linalg as lin
print(lin.inv(A), "\n")
print(lin.eigvals(A))
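As a further sketch, here's solving a linear system and checking the residual ( b is a made-up right-hand side ) :
b = np.ones(3)
x = lin.solve(A, b) # solve the system A x = b for x
print(lin.norm(np.dot(A, x) - b)) # residual norm - should be ~0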
In [12]:
# If we want to do some real linear algebra, we can skip elementwise
# syntax by declaring an array as a numpy matrix :
B = np.matrix(A)
print(B * B, "\n") # now MATRIX multiplication
print(A * A) # still ELEMENTWISE multiplication
In [13]:
# A was our array, B was our matrix of the same array
# Let's confirm these operations are the same
print((np.dot(A, A) == B * B).all())
print((A * A == B * B).all())
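On Python 3.5+ with a recent numpy, the @ operator performs matrix multiplication on plain arrays, so you can often skip np.matrix entirely :
print((A @ A == np.dot(A, A)).all()) # @ is matrix multiplication for ndarrays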
We can do logical tests on arrays.
In [14]:
a = np.arange(-2, 2, 0.5)
print(a)
# Test elementwise for things greater than or equal to 0
big = a >= 0
print(big)
# We can index using this result
print(a[big])
In [15]:
# We can index using an array of logicals, but what if I want indices ?
big2 = np.where(a >= 0)
print(big)
print(big2, "\n")
# We can index using this also
print(a[big])
print(a[big2])
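np.where also has a three-argument form that chooses elementwise between two values :
print(np.where(a >= 0, a, 0)) # keep non-negative entries, clamp the rest to zero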
Exercise : let's put some of this into practice.
Create an array x from 0 to 4 pi over 100 points
Create y as the cosine of x, plus one
Reshape y to be 25x4 in shape
Calculate the means of the columns of y, call this means
Find the means that are greater than 1.01, call this bigmeans
Reshape means to give a 2x2 array, and take its logarithm
Take the dot product of matrix means and vector bigmeans
Hints :
np.linspace, np.pi, np.reshape, np.mean or simply myarray.mean() using the axis= argument, np.log, np.dot or np.matrix
In [16]:
# Create x and y = cos(x) + 1
x = np.linspace(0, 4 * np.pi, 100)
y = np.cos(x) + 1
# Reshape into four columns, calculate the means
y = y.reshape(25, 4)
means = y.mean(axis=0)
print("Means : {}".format(means))
# Find means that are greater than 1.01
bigmeans = means[means > 1.01]
print("Big means : {}".format(bigmeans))
# Reshape the four means, take logs, take dot product
means = np.log(means.reshape(2, 2))
print("Dot product : {}".format(np.dot(means, bigmeans)))
Pandas is a data analysis library that makes handling datasets less painful. We won't cover it in depth here - just enough to manage our dataset for the next few exercises.
In [17]:
import pandas as pd
# Read a CSV file, show the first few rows
data = pd.read_csv("sheep.csv")
data.head()
Out[17]:
In [18]:
# Show columns of the dataframe
print(data.columns, "\n")
# We can access columns by their name directly with a . in between
# Here, we can also use data["ShannonDiversity"][:5], dict-style
print(data.Directionality[:5], "\n")
# Mean of each column
print(data.mean())
In [19]:
# How many entries are in the dataset ?
print(data.count(), "\n")
# The dataset contains some N/A values in the Virulence column.
# Let's get rid of everything that contains an N/A
print("Before N/A dropping : {} entries".format(len(data)))
data = data.dropna(how="any") # how="all" - remove only if all the data is missing
print("After N/A dropping : {} entries".format(len(data)))
In [20]:
# Let's also get rid of the image ID column, we don't really need it
data.drop("Image", axis=1, inplace=True)
data.head(5)
# Here, we specify that we're dropping a *column* ( along the axis = 1 )
Out[20]:
In [21]:
# Let's grab the foci count in sheep with high directionality coefficients
high_directionality = data[data.Directionality < 0.3].FociCount
# and plot its moving average over ten points
high_directionality.rolling(10).mean().plot()
Out[21]:
Pandas can do SQL-like operations, such as joins or group-bys.
In [22]:
# This plot wasn't very meaningful, let's group by individual sheep first.
# Here, we find the mean value of each column for each individual sheep.
vir = data.groupby("Sheep").mean()
# Let's compare some of their values in a bar plot.
vir[["Lacunarity", "Entropy", "Directionality"]].plot(kind="bar", ylim=(0, 0.8))
Out[22]:
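Group-by covers one half of the SQL analogy; a join, for completeness, looks like this ( a minimal sketch with two throwaway frames, unrelated to the sheep dataset ) :
left = pd.DataFrame({"Sheep": [1, 2], "Flock": ["A", "B"]})
right = pd.DataFrame({"Sheep": [1, 2], "Weight": [35.2, 41.0]})
print(pd.merge(left, right, on="Sheep")) # SQL-style inner join on the Sheep column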
Exercise : let's load up a real dataset and have a look at it.
Read CoffeeTimeSeries.csv and print a summary so we can see what's going on. Hints : pd.read_csv(), .head(), maybe .describe()
Create an array time containing the Timestamp column as a Datetime object. Hint : pd.to_datetime() turns timestamps into datetimes; pass dayfirst=True because these are European dates
Create a list of the weekdays for each entry in time, call it weekdays. Hints : datetimes have a method .isoweekday(); use a list comprehension or a loop
Plot a histogram of the hours at which coffees are had. Hint : data.hist()
In [23]:
# Read data
data = pd.read_csv("CoffeeTimeSeries.csv")
data.head(3)
Out[23]:
In [24]:
# Time array
time = pd.to_datetime(data.Timestamp, dayfirst=True)
print(time[:3])
In [25]:
# List of weekdays
weekdays = [day.isoweekday() for day in time]
print(weekdays[:20])
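pandas also has a vectorised way of doing this ( note that .dt.weekday counts from 0 = Monday, unlike isoweekday's 1 = Monday ) :
print(time.dt.weekday[:5] + 1) # +1 to match the isoweekday convention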
In [26]:
# When, during the day, are coffees had ?
data.hist(column="Hour", bins=np.arange(6, 20), align="left", width=1)
Out[26]:
Before we continue on to Scipy, let's look into some plotting. Matplotlib aims to feel familiar to Matlab users.
In [27]:
# This is the main matplotlib import
import matplotlib.pyplot as plt
%matplotlib inline
# In IPython Notebooks, you can use this "magic" command to get figures inline :
# %matplotlib inline
# Spyder users should already have inline plots in IPython kernels
# This magic command ( as defined by the % in front of it ) keeps
# my plots directly below IPython Notebook cells. All magic commands
# are only valid in an IPython environment.
# The following makes figures bigger and increases label sizes for legibility
mpl.rcParams["figure.figsize"] = (14, 5)
mpl.rcParams["xtick.labelsize"] = 14
mpl.rcParams["ytick.labelsize"] = 14
In [28]:
# This cell contains a few optional things.
# seaborn is a fantastic visualisation library in its own right, but here
# I'm using it as prettifying library for matplotlib, amongst other things.
# Importing it gives me beautiful plots, but it isn't necessary.
import seaborn
# seaborn doesn't come with Anaconda. It is in the Conda repo, however,
# so you can install it from the terminal using either
# conda install seaborn
# or
# pip install seaborn
# Set seaborn thrusters to max awesome
seaborn.set_style("darkgrid")
In [29]:
x = np.random.exponential(3, size=10000)
plt.subplot(1, 2, 1)
plt.plot(x[::50])
plt.axhline(x.mean(), color="red")
plt.subplot(1, 2, 2)
plt.hist(x, 50)
plt.axvline(x.mean(), c="r") # c="r" is shorter, both are valid
plt.show()
# In an interactive setting, you will need to call plt.show()
# In IPython, you can call "%matplotlib inline" after importing and you're good to go.
# From here on, I'll drop plt.show().
In [30]:
# Like Matlab, we can feed style arguments to the plot function
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + np.random.normal(scale = 0.2, size = 100)
# Calling plot() several times without creating a new figure puts them on the same one
plt.plot(x, y, "r.")
plt.plot(x, np.sin(x), "k-")
Out[30]:
In [31]:
# 100 x 3 matrix on a scatter plot, third dimension plotted as size
x = np.random.uniform(low = 0, high = 1, size = (100, 3))
# Transparency ? Sure.
plt.scatter(x[:, 0], x[:, 1], s=x[:, 2]*500, alpha = 0.4)
# Set axis limits
plt.xlim([0, 1])
plt.ylim([0, 1])
Out[31]:
In [32]:
# Area plots ?
x = np.linspace(0, 6 * np.pi, 300)
y = np.exp(-0.2*x) * np.sin(x)
plt.plot(x, y, "k", linewidth=4)
plt.fill_between(x, y, y2=0, facecolor="red", alpha=0.7)
Out[32]:
In [33]:
# Error bars ? We have ten measurements of the above process, the decaying sine wave
noise = np.random.normal(size = (len(y), 10)).T
# Let's add some Gaussian noise to our observations, using broadcasting
measuredy = y + 0.05 * noise
# Let's assume we know our error is Gaussian, for simplicity. Compute mean and std :
estmean = measuredy.mean(axis=0)
eststd = measuredy.std(axis=0)
# Plot the estimated mean with two standard deviation error bars, and the real signal
plt.plot(x, y, "r", lw=3)
plt.errorbar(x, estmean, yerr = eststd * 2, lw=1)
Out[33]:
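A common alternative to error bars is a shaded confidence band, reusing fill_between from the area plot above ( a sketch based on the same estmean and eststd ) :
plt.plot(x, estmean, lw=2)
plt.fill_between(x, estmean - 2 * eststd, estmean + 2 * eststd, alpha=0.3)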
In [34]:
# Reset plotting style
seaborn.set_style("white")
In [35]:
import scipy as sp
sp.__version__
Out[35]:
In [36]:
# Two dimensional, or image plots ?
import scipy.ndimage as nd
# Initial 2D noisy signal
X = np.random.normal(size=(256, 256))
plt.figure()
plt.imshow(X, cmap="gray")
plt.title("Original 2D Gaussian noise")
# We'll use a 2D Gaussian for smoothing, with identity as covariance matrix
# We'll grab a scipy function for it for now
plt.figure()
temp = np.zeros_like(X)
temp[128, 128] = 1.
plt.imshow(nd.filters.gaussian_filter(temp, 20), cmap="coolwarm")
plt.title("Gaussian kernel")
# Generate the Perlin noise
perlin = np.zeros_like(X)
for i in 2**np.arange(6) :
    perlin += nd.filters.gaussian_filter(X, int(i), mode="wrap") * i**2
# and plot it several ways
plt.figure()
plt.imshow(perlin, cmap="gray")
plt.title("Perlin Noise")
plt.figure()
plt.imshow(perlin, cmap="bone")
plt.contour(perlin, linewidths=3, cmap="jet")
plt.title("Greyscale, with contours")
#plt.xlim([0, 256])
#plt.ylim([0, 256])
#plt.axes().set_aspect('equal', 'datalim')
Out[36]:
In [37]:
# And, of course ( stolen from Jake VanderPlas )
def norm(x, x0, sigma):
    return np.exp(-0.5 * (x - x0) ** 2 / sigma ** 2)
def sigmoid(x, x0, alpha):
    return 1. / (1. + np.exp(- (x - x0) / alpha))
# define the curves
x = np.linspace(0, 1, 100)
y1 = np.sqrt(norm(x, 0.7, 0.05)) + 0.2 * (1.5 - sigmoid(x, 0.8, 0.05))
y2 = 0.2 * norm(x, 0.5, 0.2) + np.sqrt(norm(x, 0.6, 0.05)) + 0.1 * (1 - sigmoid(x, 0.75, 0.05))
y3 = 0.05 + 1.4 * norm(x, 0.85, 0.08)
y3[x > 0.85] = 0.05 + 1.4 * norm(x[x > 0.85], 0.85, 0.3)
with plt.xkcd() :
    plt.plot(x, y1, c='gray')
    plt.plot(x, y2, c='blue')
    plt.plot(x, y3, c='red')
    plt.text(0.3, -0.1, "Yard")
    plt.text(0.5, -0.1, "Steps")
    plt.text(0.7, -0.1, "Door")
    plt.text(0.9, -0.1, "Inside")
    plt.text(0.05, 1.1, "fear that\nthere's\nsomething\nbehind me")
    plt.plot([0.15, 0.2], [1.0, 0.2], '-k', lw=0.5)
    plt.text(0.25, 0.8, "forward\nspeed")
    plt.plot([0.32, 0.35], [0.75, 0.35], '-k', lw=0.5)
    plt.text(0.9, 0.4, "embarrassment")
    plt.plot([1.0, 0.8], [0.55, 1.05], '-k', lw=0.5)
    plt.title("Walking back to my\nfront door at night:")
    plt.xlim([0, 1])
    plt.ylim([0, 1.5])
In [38]:
# Reset plot style
seaborn.set_style("darkgrid")
Exercise : let's visualise the coffee dataset.
Plot the number of coffees against time. Hint : plt.plot()
Plot a histogram of the coffees had on each day of the week. Hint : plt.hist(), use argument bins to get seven bins
Plot the number of days between each coffee. Hints : np.diff(), .astype(float)
Draw a pie chart of how many coffees each person had. Hints : plt.figure(figsize=(10,10)) for a square axis, plt.pie()
In [39]:
# Coffees against time
plt.subplot(121)
plt.plot(time.values, data.Coffees, lw=3)
plt.gcf().autofmt_xdate()
# Coffees on each day of the week
plt.subplot(122)
plt.hist(weekdays, bins=7, align="left")
plt.xticks(np.arange(1, 8), ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]);
In [40]:
# Number of days between each coffee
timediff = (np.diff(time) / 8.64e13).astype(float) # 8.64e13 nanoseconds in a day
plt.scatter(range(len(timediff)), timediff)
plt.axhline(10, color="black")
Out[40]:
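An equivalent, more explicit conversion divides the timedeltas by one day directly :
timediff = np.diff(time) / np.timedelta64(1, "D") # timedelta64 division gives float days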
In [41]:
# Names of the contributors
contributors = np.unique(data.Name)
# Number of contributions
contributions = [np.sum(data.Name == person) for person in contributors]
# Plot ( on square axes )
plt.figure(figsize=(10,10))
plt.pie(contributions, labels=contributors);
Scipy does all sorts of awesome stuff : integration, interpolation, optimisation, FFTs, signal processing, more linear algebra, stats, image processing, ... Here are just a few examples. Don't study the code; it's easier to Google what you're after than to try to learn every command in scipy.
In [42]:
import scipy.spatial as spatial
# Convex hulls
ax = plt.axes()
points = np.random.normal(0.5, 0.2, size=(20,2))
plt.scatter(points[:, 0], points[:, 1], s=200, c="r")
hull = spatial.ConvexHull(points)
for i in hull.simplices :
    plt.plot(points[i, 0], points[i, 1], "k", linewidth=4)
spatial.voronoi_plot_2d(spatial.Voronoi(points), ax)
plt.title("Convex Hull and Voronoi Tessellation")
Out[42]:
In [43]:
from scipy.integrate import odeint
# Ordinary Differential Equations
r0 = 16.
gamma = .5
t = np.linspace(0, 8, 200)
def SIR(y, t) :
    S = - r0 * gamma * y[0] * y[1]
    I = r0 * gamma * y[0] * y[1] - gamma * y[1]
    R = gamma * y[1]
    return [S, I, R]
solution = odeint(SIR, [0.99, .01, 0.], t)
plt.figure()
plt.plot(t, solution, linewidth=3)
plt.title("Measles, Single Epidemic")
plt.xlabel("Time (weeks)")
plt.ylabel("Proportion of Population")
plt.legend(["Susceptible", "Infected", "Recovered"])
Out[43]:
In [44]:
from scipy.signal import detrend
from scipy import fftpack
# Signal Processing - create trending signal, detrend, FFT
x = np.linspace(0, 1, 300)
trend = np.zeros_like(x)
trend[100:200] = np.linspace(0, 10, 100)
trend[200:] = np.linspace(10, -20, 100)
y = np.random.normal(loc = 3 * np.sin(2 * np.pi * 20 * x)) + trend # Signal is a sine wave with noise and trend
yt = detrend(y, bp = [100, 200]) # detrend, giving break points
Yfft = fftpack.rfft(yt) # Calculate FFT
freqs = fftpack.rfftfreq(len(yt), x[1] - x[0]) # Frequencies of the FFT
plt.figure()
plt.subplot2grid((2, 2), (0, 0))
plt.title("Detrending")
plt.plot(x, y, lw=2)
plt.plot(x, trend, lw=4)
plt.subplot2grid((2, 2), (0, 1))
plt.plot(x, yt, lw=2)
plt.subplot2grid((2, 2), (1, 0), colspan=2)
plt.title("FFT")
plt.plot(freqs, (np.abs(Yfft)**2), lw=3)
plt.tight_layout()
In [45]:
import scipy.stats as st
# Distributions
plt.figure()
plt.subplot(131)
plt.title("Normal PDF")
plt.plot(np.arange(-5, 5, 0.1), st.norm.pdf(np.arange(-5, 5, 0.1), 0, 1), lw=2)
plt.subplot(132)
plt.title("Beta CDF")
plt.plot(st.beta.cdf(np.linspace(0, 1, 100), 0.5, 0.5), lw=2)
plt.subplot(133)
plt.title("Erlang PDF")
plt.plot(np.linspace(0, 10, 100), st.erlang.pdf(np.linspace(0, 10, 100), 2, loc=0), lw=2)
Out[45]:
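Each distribution object also exposes .rvs() for sampling and .fit() for parameter estimation - a quick sketch :
samples = st.gamma.rvs(2.0, size=1000) # draw from a Gamma distribution with shape parameter 2
print(st.gamma.fit(samples)) # recover ( shape, loc, scale ) estimates from the samples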
In [46]:
# Statistical Tests
a = np.random.normal(0, 1.5, size=1000)
b = np.random.normal(.2, 1, size=1000)
plt.figure()
plt.hist(a, 30, alpha=0.7)
plt.hist(b, 30, alpha=0.7)
plt.title("p = " + str(st.ttest_ind(a, b)[1]))
Out[46]:
There's a lot more you can do with numpy, scipy, and matplotlib. The documentation's generally great, so look around for what you're after - it's probably already been implemented.
Exercise : some final analysis of the coffee data.
Perform a linear regression of data.Time against data.Coffees and find out how many coffees, on average, are had over a period of one day. Note that our time array contains elements of type pandas.tslib.Timestamp. They're not that easy to use, so we're going to use data.Time, which is a time in seconds ( of type int ). Hint : in scipy.stats, call linregress(). Grab the slope. Because data.Time is in seconds, you'll get coffees per second.
Use the weekdays array to compute the average numbers of coffees had on each day of the week, and plot it. Hint : weekdays is a list, which has the method .count()
Resample the number of coffees at daily intervals, call it resampled. Hint : scipy.interpolate.interp1d()
Compute the power spectrum of the derivative of the resampled coffee numbers. Hint : scipy.fftpack contains the functions you need; rfft and rfftfreq will be useful here
In [47]:
# Linear regression :
import scipy.stats as st
gradient, intercept, r2, pval, _ = st.linregress(data.Time, data.Coffees)
plt.plot(data.Time, data.Coffees, lw=2)
plt.plot(data.Time, data.Time * gradient + intercept, lw=2)
plt.title("Average coffees per day : %.01f" % (gradient * 24 * 60 * 60))
plt.legend(["Data", "Linear fit, R2 = %.02f, p = %e" % (r2, pval)], loc="upper left")
Out[47]:
In [48]:
# Interpolation :
import scipy.interpolate as interp
# Create an interpolator
interpolator = interp.interp1d(data.Time, data.Coffees)
# The time array I want to sample at
newTime = np.arange(data.Time.iloc[0], data.Time.iloc[-1], 24 * 60 * 60)
# Sample at these points
resampled = interpolator(newTime)
plt.plot(data.Time, data.Coffees)
plt.plot(newTime[::20], resampled[::20], "o")
Out[48]:
In [49]:
# Number of coffees on each day of the week
coffees_on_each_day = [weekdays.count(i) for i in range(1, 8)]
plt.plot(coffees_on_each_day, lw=3);
plt.xticks(range(7), ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
Out[49]:
In [50]:
# FFT of derivative
from scipy import fftpack
# Derivative of the number of coffees per day
derivative = np.diff(resampled)
# Calculate the smoothed power spectrum and the relevant frequencies
# Here, we smooth using a convolution with a Hanning window
ps = np.abs(fftpack.rfft(derivative))[1:]**2 # Calculate power spectrum, remove mean component
coffeeps = np.convolve(ps, np.hanning(5), "same") # smooth
coffeefreqs = fftpack.rfftfreq(len(derivative), 1)[1:]
plt.plot(coffeefreqs, coffeeps, lw=2)
plt.axvline(coffeefreqs[np.where(coffeeps == coffeeps.max())], alpha=0.5, color="red", lw=3)
plt.title("Most power at frequency f = %f days" % (1/coffeefreqs[np.where(coffeeps == coffeeps.max())]))
Out[50]: