There are various packaged 'distributions' of python, which contain pretty much everything you need for scientific analysis. I use Anaconda by Continuum Analytics. To install Anaconda, follow the instructions here. Everything below should work with the default installation of Anaconda, but I'm not sure about the other options.
Python is a language, and you have to learn it. As programming languages go, it's famously straightforward and 'readable' (i.e. you can normally work out what a function does just by reading the code, which is not the case in all languages). If you're already fluent in a programming language, python should be very easy. If not, it will be difficult... but less difficult than, say, Matlab or C++. These links should get you started:
What you're reading at the moment is a 'jupyter notebook'. This is an extension of 'ipython', which provides a nice, simple interface for writing python code, making notes and showing plots within your browser. It's based around 'cells' (the embedded, grey boxes below), which can be either 'code' or 'markdown'. All this writing is in a 'markdown' cell, where you can write titles, notes and equations (in LaTeX form), and generally keep all your code and analysis organised and annotated. Look, an equation: \begin{align} \mathrm{python} >> \frac{\sqrt{\mathrm{(Matlab + Excel)^{4.5}}}}{1} \end{align}
Learn about the notebooks here.
You can start a notebook from the terminal, or use the 'launcher' app included with the Anaconda install, which gives you a nice button to click.
Go through the examples below to start getting to grips with python.
Short version: use python 3. Everything except a few older, obscure scientific packages has been updated to work with python 3, and that's where all future development will happen. Unless you're sure the package you need only works with python 2, use python 3.
Long version: read this.
In python, you have to really understand exactly what you're trying to do to your data to be able to do it. There is no magic 'analyse' button, which will process your data in a black box. This is good, because it makes sure you really understand what you're doing to your data. This is bad, because you have to take the time to really understand how to treat your data. The extra time is worth it.
In [1]:
# This is a code cell. This text is a comment (preceded by a '#').
# Commented lines are ignored by python. I'll use comments to try
# and explain new concepts in the cells below.
# Once you've written some code in a cell, press [Ctrl] + [Return]
# to run it. You can replace [Ctrl] with [Shift] to run the code and
# move the focus to the next cell, or [Alt] to run the code and
# create a new cell underneath.
# There are lots of other shortcut keys that make the notebook
# interface really nice to use. Have a look at the full list in
# Help > Shortcut Keys
Basic python can't do that much. Its strength comes from the huge number of 'Libraries' or 'Packages' which add additional functionality. A short list of useful 'basic' scientific libraries can be found here, although there are literally thousands of packages that cover almost everything you can think of. There are several packages that I find myself using on a regular basis.
The 'Scientific Python' family: numpy, scipy, matplotlib and pandas (all used below).
The 'Scikit' Family: scikit-learn (used below) and scikit-image.
Other: uncertainties (used below), among many more.
In [2]:
# how to import a library
import numpy as np
# 'import' tells python to bring numpy into your 'workspace'.
# after import, you can access numpy's functions by typing
# numpy.function_name.
# To make things simpler, you can tell python to call numpy
# something else ('np', in this case) by using 'as' after
# the import statement.
In [3]:
# a few numpy examples:
a = np.linspace(0,100,50) # generates an array with 50 evenly spaced points between 0 and 100.
z = np.zeros((50,50)) # generates a 50 x 50 array filled with zeros.
l = np.logspace(-5,5,100) # generates 100 logarithmically spaced data points between 10^-5 and 10^5.
r = np.random.normal(0,1,20) # generates 20 random numbers with a mean of 0 and a standard deviation of 1.
u = np.random.uniform(0,10,20) # generates 20 random numbers uniformly distributed between 0 and 10.
# note: variables are assigned using '='. variables are
# overwritten when they are re-assigned.
# if you write a statement without assigning it to a variable,
# its output will be printed below the code cell.
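For example (using the 'r' array created above), an expression that isn't assigned to anything will show its value below the cell:
r.mean() # the mean of the 20 random numbers in 'r'; nothing is assigned here,
         # so the result is printed below the cell (it should be roughly 0).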
In [4]:
# let's make some data! (so much easier than lab work...)
x = np.linspace(0,20,50)
y = x * 2.4 + np.random.normal(5, 2, len(x))
In [5]:
import matplotlib.pyplot as plt
# note the difference in import here: we're only importing part
# of the matplotlib library (pyplot), and calling it 'plt'.
%matplotlib inline
# this line tells jupyter notebook that we want to see the plots presented 'in line' in the browser.
In [6]:
fig, ax = plt.subplots(1, 1) # make a figure with 1 subplot 'axis'
# a 'figure' can contain multiple 'axis' objects.
# note the assignment here: the plt.subplots function returns a tuple.
# tuples can be expanded to multiple variables in assignment, i.e.
# running a,b = (1,2) is the same as running a = 1 and b = 2 separately.
ax.scatter(x, y, c='purple', marker='x') # make a scatter plot of the x,y data we generated above.
# 'c' sets the colour
# 'marker' sets the marker style ('x' = crosses)
ax.plot(x, y, ls='dotted', lw=2, c='black') # add a line connecting all the dots.
# 'ls' sets the line style
# 'lw' sets the line width
ax.set_xlim(-1,21) # set the x axis limits.
Out[6]:
Matplotlib colours can be defined in multiple ways: by name ('purple'), by single-letter shorthand ('k'), as a hex string ('#1f77b4'), or as an (r, g, b) or (r, g, b, alpha) tuple.
For a full description, see here.
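For example, here's a quick sketch (reusing the x and y arrays from above) showing a few of the ways a colour can be specified:
fig, ax = plt.subplots(1, 1)
ax.scatter(x, y, c='purple') # a named colour
ax.plot(x, y, c='k') # a single-letter shorthand ('k' = black)
ax.axhline(y.mean(), c='#1f77b4') # a hex string
ax.axvline(x.mean(), c=(0.8, 0.2, 0.2, 0.5)) # an (r, g, b, alpha) tuple, each value between 0 and 1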
In [7]:
# first, we must define a function that describes the line.
def linefunction(x, m, c):
    # functions are defined using 'def', followed by the
    # function name, inputs in brackets, and a colon.
    # note that these lines are indented. Indentation has meaning
    # in python. Indented lines are part of the preceding statement
    # (usually functions or loops).
    y = x * m + c # the work happens here.
    return y # 'return' defines what you get back from the function.
In [8]:
# import a least-squares curve fitting function
from scipy.optimize import curve_fit
# note the different form of import here. We're bringing in a single
# function (curve_fit) from part (optimize) of a library (scipy),
# rather than the entire library.
In [9]:
# fit the x,y data:
p, cov = curve_fit(linefunction, x, y)
# calculate the fitted line
# p = fitted parameters
# cov = covariance matrix of parameters
y_fit = linefunction(x, *p)
# note the '*' notation here, which can be used to expand a list of
# parameters into function inputs
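If the '*' expansion is new to you, here's a minimal, self-contained example:
def add(a, b, c):
    return a + b + c

params = [1, 2, 3]
add(*params) # equivalent to add(1, 2, 3) - the list is expanded into the three inputs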
In [10]:
# plot the fit
fig, ax = plt.subplots(1, 1)
ax.scatter(x, y, label='data', s=5, alpha=0.5)
# 'label' names the data, and is used to create a legend
# 's' sets marker size
# 'alpha' sets marker opacity (1 = 100%)
ax.plot(x, y_fit, c='r', lw=2, label='fit')
ax.set_xlim(-1,21)
ax.set_xlabel('x') # set x label
ax.set_ylabel('y') # set y label
ax.legend(loc='upper left') # make a legend
# loc defines legend position
Out[10]:
curve_fit can be applied to any function that takes (x, ...) as inputs. All sorts of other fitting routines are available in the scipy.optimize package: generic minimisers and zero-finders (e.g. Newton-Raphson-style optimisers, or the 'minimize' function), non-linear least squares (nls), ordinary least squares (ols), orthogonal distance regression (odr) and more. These other fitting functions allow you to accommodate measurement errors in fits (odr), set parameter constraints (minimize), and much more!
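To give a flavour of the more general tools, here's a sketch that fits the same straight line with the 'minimize' function by minimising the sum of squared residuals. This is just an illustration of the interface (the starting guess is an arbitrary choice), not a recommendation over curve_fit:
from scipy.optimize import minimize

def sum_sq_resid(params, x, y):
    m, c = params
    return np.sum((y - linefunction(x, m, c))**2)

fit = minimize(sum_sq_resid, x0=[1, 0], args=(x, y)) # x0 is the starting guess for (m, c)
fit.x # the best-fit (m, c), which should be close to the curve_fit result above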
In [11]:
# this can also be done with ready-made functions, which also give you fit statistics!
from scipy.stats import linregress
linregress(x, y)
Out[11]:
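In recent versions of scipy the object returned by linregress has named attributes, so you can pull the numbers out individually:
lr = linregress(x, y)
lr.slope, lr.intercept, lr.rvalue**2 # gradient, intercept and R-squared of the fit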
In [12]:
# make data
y2 = x**2.4 + np.random.normal(-40, 30, len(x))
# define power function:
def powerfn(x, p, c):
    return x**p + c
p, cov = curve_fit(powerfn, x, y2)
fig, ax = plt.subplots(1,1)
ax.scatter(x, y2)
ax.plot(x, powerfn(x, *p), c='r', lw=2)
# calculate R2
R2 = 1 - (np.sum((y2 - powerfn(x, *p))**2)) / np.sum((y2 - y2.mean())**2)
ax.text(x=.02, y=.98, s='$R^2$: {:.3f}'.format(R2), transform=ax.transAxes, ha='left', va='top')
# 'text' places arbitrary text within the axes.
# in this case, I'm using x,y coordinates relative to the axis size, as specified using 'transform'
# 'ha' sets the horizontal alignment of the text
# 'va' sets the vertical alignment of the text
# 'format' replaces the '{:}' statement with a number in the specified format ('.3f' = float with 3 decimal places)
# for format strings, see: https://docs.python.org/3.5/library/string.html#formatstrings
# the '$' in the string tell matplotlib that I'm giving it a LaTeX text string - useful for superscripts,
# or any mathematical typing. See: http://matplotlib.org/users/mathtext.html
ax.text(.02, .88, '$y = x^{' + '{:.2f}'.format(p[0]) + '}$' + '{:.2f}'.format(p[1]),
        transform=ax.transAxes, ha='left', va='top')
# note: I've split this function call over two lines. Because the split happens
# inside the brackets, python still reads it as a single statement. Aligning the
# second line with the opening bracket isn't strictly required, but it keeps things readable.
Out[12]:
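A couple of standalone format-string examples, in case the syntax above is unfamiliar:
'{:.3f}'.format(3.14159) # -> '3.142' (float with 3 decimal places)
'{:.2e}'.format(12345.6) # -> '1.23e+04' (scientific notation with 2 decimal places)
'{} of {}'.format(3, 10) # -> '3 of 10' (empty braces just insert the value as-is)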
In [13]:
# make a couple of example 'populations'
pop1 = np.random.normal(11, 2, 1000)
pop2 = np.random.normal(7, 4, 1000)
# two independent populations, 1000 measurements of each.
# pop1 mean = 11, std = 2
# pop2 mean = 7, std = 4
In [14]:
# have a look at the data
fig, ax = plt.subplots(1, 1)
# combine populations into single array
cpop = np.concatenate([pop1,pop2])
bins = np.linspace(cpop.min(), cpop.max(), 30) # define bin edges spanning the range of both populations
ax.hist(pop1, bins, alpha=0.5, label='pop1')
ax.hist(pop2, bins, alpha=0.5, label='pop2')
ax.legend()
Out[14]:
In [15]:
from scipy import stats
Before we can test whether these populations are different using a T-test, we must check two things: that each population is (approximately) normally distributed, and whether the two populations have equal variance.
In [16]:
# test for normal distribution
stats.anderson(pop1, 'norm')
Out[16]:
In [17]:
stats.anderson(pop2, 'norm')
Out[17]:
In [18]:
# What do these numbers mean?
# The Anderson-Darling test checks the null hypothesis that the data
# come from a particular distribution (in this case, normal). The test
# statistic is compared against the 'critical_values': if the statistic
# is smaller than the critical value for a given 'significance_level',
# the null hypothesis of normality cannot be rejected at that level.
# In this case, the statistic for both populations is below the critical
# values, so we can treat both as 'normal'. Good enough for a t-test!
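In recent versions of scipy the result has named attributes, so you can make this comparison explicitly:
res = stats.anderson(pop1, 'norm')
res.statistic < res.critical_values
# an array of True/False, one entry per significance level in res.significance_level.
# True means normality cannot be rejected at that level.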
In [19]:
# test for equal variance
stats.levene(pop1, pop2)
# Levene tests the null hypothesis of equal variance.
# p << 0.001, so we must reject the null hypothesis.
Out[19]:
In [20]:
# run a t-test that does not assume equal variance:
stats.ttest_ind(pop1, pop2, equal_var=False)
Out[20]:
The 'pretty' histogram. These can be useful beyond just looking better, and can offer a more accurate estimate of population structure when used appropriately. The probability density functions (PDFs) are created using 'kernel density estimators'. These take your data and describe them as the sum of multiple 'kernels' of a set width, one centred on each data point. The kernel is normally a gaussian, but can be any function that appropriately describes your data (e.g. for count statistics, you might want to use a poisson kernel instead).
In [21]:
kd1 = stats.gaussian_kde(pop1)
kd2 = stats.gaussian_kde(pop2)
# gaussian_kde is a simple gaussian probability distribution estimator included in scipy.stats.
# Other options are available. See: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/
In [22]:
fig, ax = plt.subplots(1, 1)
# combine populations into single array
cpop = np.concatenate([pop1,pop2])
bins = np.linspace(cpop.min(), cpop.max(), 100) # define 100 evenly spaced points spanning the range of both populations, at which to evaluate the KDEs
ax.plot(bins, kd1(bins), label='pop1')
ax.plot(bins, kd2(bins), label='pop2')
ax.legend()
Out[22]:
The above PDFs are the sum of numerous gaussian 'kernels', each placed at the position of one of your data points on the x-axis.
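To make that concrete, here's a sketch that builds a similar curve 'by hand': place a gaussian at each data point, add them up, and normalise. The kernel width here is an arbitrary choice, so the curve won't match gaussian_kde exactly (which picks its bandwidth automatically):
bw = 0.5 # an arbitrary kernel width, for illustration only
manual_kde = np.zeros(len(bins))
for point in pop1:
    manual_kde += stats.norm.pdf(bins, loc=point, scale=bw) # one gaussian per data point
manual_kde /= len(pop1) # normalise so the curve integrates to ~1

fig, ax = plt.subplots(1, 1)
ax.plot(bins, manual_kde, label='hand-made KDE')
ax.plot(bins, kd1(bins), ls='dashed', label='gaussian_kde')
ax.legend()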
In [23]:
import pandas as pd
# superb library for dealing with real-world data.
# adds 'spreadsheet' type functionality to python,
# along with numerous data import/export options.
In [24]:
# dataset of measurements of various dimensions of three iris species.
iris = pd.read_excel('data/iris.xlsx')
# dataset info: https://en.wikipedia.org/wiki/Iris_flower_data_set
In [25]:
groups = iris.groupby('species')
# this splits the 'iris' dataset into subsets by species.
In [26]:
groups.aggregate([np.mean, np.std, len])
Out[26]:
In [27]:
# groups contains a list of tuples containing (species, data)
# and you can use it in a loop:
for species, data in groups:
    print(species, data.loc[:,'sepal length (cm)'].mean())
# This is a loop. Loops are defined by 'for' or 'while'. Look these up.
# Here, I'm assigning each component of each tuple in the 'groups' list
# to 'species' and 'data'.
# This loop simply prints the species name, and the mean 'sepal length'
# of each species.
# Note the pandas indexing scheme here: you use .loc[row_index, column_index].
# You can use boolean arrays instead of specific indices here.
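For example, a boolean index can answer a one-off question without the groupby (assuming the species column contains the value 'setosa' - check iris['species'].unique() if unsure):
setosa_lengths = iris.loc[iris['species'] == 'setosa', 'sepal length (cm)']
setosa_lengths.mean() # mean sepal length of just the setosa rows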
In [28]:
# a more interesting application of a loop:
fig, axs = plt.subplots(2,2, figsize=[8,5]) # make 4 subplots - axs in this case is a 2x2 array
# 'figsize' specifies figure (width, height) in inches (not cm... grrr...)
axs = axs.flatten() # transform axs from a 2x2 array to a 4x1 array, so it works with a loop.
for species, data in groups:
    for parameter, ax in zip(data.columns[1:], axs): # indexing here removes the 'species' column from the data.
        data.loc[:,parameter].plot.kde(ax=ax, label=species) # pandas has a kernel density estimator built in.
        ax.set_title(parameter)
fig.tight_layout() # spaces the axes out so labels don't overlap
axs[0].legend()
# a few more advanced concepts in here. what's going on:
# two loops: one going through all the groups, the other
# going through all the measured parameters. You have to
# combine the parameters and the axes into a form that the
# loop can work with. 'zip' does this.
Out[28]:
In [29]:
# an example of what 'zip' does:
list(zip(['a','b','c'],[1,2,3]))
Out[29]:
Enter scikit-learn: the comprehensive clustering/PCA/machine-learning package for python. You can do a lot with this package, and it's surprisingly accessible.
This is fantastic for morphometrics (e.g. example here), or geochemical analyses (e.g. finding geochemically distinct populations within large datasets). It has a really nice implementation of principal component analysis (PCA), which you can see applied to the iris dataset here.
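As a taste of the PCA side, here's a minimal sketch applied to the iris measurements (the scaling step and the choice of two components are my assumptions, not fixed requirements):
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

vals = scale(iris.loc[:, ['sepal length (cm)', 'sepal width (cm)',
                          'petal length (cm)', 'petal width (cm)']].values)
pca = PCA(n_components=2)
pcs = pca.fit_transform(vals) # project the four measurements onto two principal components

fig, ax = plt.subplots(1, 1)
ax.scatter(pcs[:, 0], pcs[:, 1])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')

pca.explained_variance_ratio_ # fraction of the total variance captured by each component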
In [30]:
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import scale
In [31]:
data = iris.loc[:,['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].values
sdata = scale(data) # scale the data so each parameter has the same variance (necessary so that parameters measured on different scales are weighted equally by the clustering)
In [32]:
# estimate data 'bandwidth'
bw = estimate_bandwidth(sdata)
# set up clustering algorithm
ms = MeanShift(bandwidth=bw)
In [33]:
# apply algorithm to data
ms.fit(sdata)
Out[33]:
In [34]:
# determine how many clusters it found
n_clusters = len(ms.cluster_centers_)
n_clusters
Out[34]:
It only found two statistically distinguishable clusters based on these measurements, but there are actually three species.
In [35]:
# take a look at them!
ms.cluster_centers_
Out[35]:
We can see from the cluster centers that the main differences are in the first two columns (sepal length and sepal width). Let's plot this, and colour it by group:
In [36]:
fig, ax = plt.subplots(1,1)
ax.scatter(iris.loc[:,'sepal length (cm)'], iris.loc[:,'sepal width (cm)'], c=ms.labels_)
Out[36]:
Note: we're looking at this in two dimensions, because that's where the clustering algorithm found the most difference. However, the algorithm took all four parameters into account when processing the data - there just isn't much to set the petal lengths/widths apart. You can give these algorithms as many 'dimensions' as you like to work with.
This is an example of one possible clustering algorithm - MeanShift. There are many others. These algorithms tend to be 'naive' - you give them the data, some parameters (and, in some cases, how many clusters you're expecting), and it finds statistically distinguishable populations within your multi-dimensional dataset. There are also more sophisticated 'machine learning' algorithms, which you can 'train' based on a subset of your data, and then use to 'classify' the rest.
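To give a flavour of that 'train then classify' workflow, here's a minimal sketch using a k-nearest-neighbours classifier on the scaled iris data (the 30% test split and 5 neighbours are arbitrary choices on my part):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold back 30% of the flowers to test the classifier on
X_train, X_test, y_train, y_test = train_test_split(sdata, iris['species'], test_size=0.3)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train) # 'train' on the subset
knn.score(X_test, y_test) # fraction of the held-back flowers classified correctly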
In [37]:
import uncertainties.unumpy as un
In [38]:
realdata = un.uarray(nominal_values=np.random.uniform(0,10,15), std_devs=np.abs(np.random.normal(.5,.23,15)))
# each nominal_value has an associated standard deviation!
# (np.abs guards against the occasional negative draw, which uarray would reject)
In [39]:
constant = un.uarray(10.3, 0.02) # define a constant
In [40]:
y = realdata * constant # errors are propagated!
y
Out[40]:
In [41]:
# NOTE: this assumes the errors are standard deviations, and that the data are normally distributed.
In [42]:
# plot it!
fig, ax = plt.subplots(1,1)
ax.errorbar(x=un.nominal_values(realdata), y=un.nominal_values(y),
            xerr=un.std_devs(realdata), yerr=un.std_devs(y),
            lw=0, elinewidth=1, marker='o', ecolor=(0,0,0,0.5))
# new plot parameters:
# 'elinewidth': width of error bar lines
# 'ecolor': colour of error bars, defined here as (r,g,b,alpha)
# note the use of 'un.nominal_values' and 'un.std_devs' to extract the
# values and errors separately.
Out[42]:
In [43]:
from scipy.integrate import odeint
# function for solving systems of ordinary differential equations
def MassSpring(state, t):
    # unpack the state vector
    x = state[0]
    xd = state[1]
    # these are our constants
    k = -2.5 # Newtons per metre
    m = 1.5 # Kilograms
    g = 9.8 # metres per second squared
    # compute acceleration xdd
    xdd = ((k*x)/m) + g
    # return the two state derivatives
    return [xd, xdd]
state0 = [0.0, 0.0]
t = np.arange(0.0, 10.0, 0.1)
state = odeint(MassSpring, state0, t)
plt.plot(t, state, lw=2)
plt.xlabel('TIME (sec)')
plt.ylabel('STATES')
plt.title('Mass-Spring System')
plt.legend(('$x$ (m)', r'$\dot{x}$ (m/sec)'))
Out[43]:
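For reference, the second-order equation being solved above is: \begin{align} \ddot{x} = \frac{k x}{m} + g \end{align} odeint only handles first-order equations, so it's rewritten as the pair $\dot{x} = x_d$ and $\dot{x_d} = \frac{k x}{m} + g$, which is exactly what the MassSpring function returns.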