There are various packaged 'distributions' of python, which contain pretty much everything you need for scientific analysis. I use Anaconda by Continuum Analytics. To install Anaconda, follow the instructions here. Everything below should work with the default installation of Anaconda, but I'm not sure about the other options.
Python is a language, and you have to learn it. As programming languages go, it's famously straightforward and 'readable' (i.e. you can normally work out what a function does just by reading the code, which is not the case in all languages). If you're already fluent in a programming language, python should be very easy. If not, it will be difficult... but less difficult than, say, Matlab or C++. These links should get you started:
What you're reading at the moment is a 'jupyter notebook'. This is an extension of 'ipython', which provides a nice, simple interface for writing python code, making notes and showing plots within your browser. It's based around 'cells' (the embedded, grey boxes below), which can be either 'code' or 'markdown'. All this writing is in a 'markdown' cell, where you can write titles, notes and equations (in LaTeX form), and generally keep all your code and analysis organised and annotated. Look, an equation: \begin{align} \mathrm{python} >> \frac{\sqrt{\mathrm{(Matlab + Excel)^{4.5}}}}{1} \end{align}
Learn about the notebooks here.
You can start a notebook from the terminal, or use the 'launcher' app included with the Anaconda install, which gives you a nice button to click.
Go through the examples below to start getting to grips with python.
Short version: use python 3. Everything except a few older, obscure scientific packages has been updated to work with python 3, and that's where all future development will happen. Unless you're sure the package you need only works with python 2, use python 3.
Long version: read this.
In python, you have to really understand exactly what you're trying to do to your data to be able to do it. There is no magic 'analyse' button, which will process your data in a black box. This is good, because it makes sure you really understand what you're doing to your data. This is bad, because you have to take the time to really understand how to treat your data. The extra time is worth it.
In [1]:
# This is a code cell. This text is a comment (preceded by a '#').
# Commented lines are ignored by python. I'll use comments to try
# and explain new concepts in the cells below.
# Once you've written some code in a cell, press [Ctrl] + [Return]
# to run it. You can replace [Ctrl] with [Shift] to run the code and
# move the focus to the next cell, or [Alt] to run the code and
# create a new cell underneath.
# There are lots of other shortcut keys that make the notebook
# interface really nice to use. Have a look at the full list in
# Help > Shortcut Keys
Basic python can't do that much. Its strength comes from the huge number of 'Libraries' or 'Packages' which add additional functionality. A short list of useful 'basic' scientific libraries can be found here, although there are literally thousands of packages that cover almost everything you can think of. There are several packages that I find myself using on a regular basis.
The 'Scientific Python' family: numpy, scipy, matplotlib and pandas (all used below).
The 'Scikit' Family: scikit-learn (used below) and scikit-image.
Other: uncertainties (used below), among many more.
In [2]:
# how to import a library
import numpy as np
# 'import' tells python to bring numpy into your 'workspace'.
# after import, you can access numpy's functions by typing
# numpy.function_name.
# To make things simpler, you can tell python to call numpy
# something else ('np', in this case) by using 'as' after
# the import statement.
In [3]:
# a few numpy examples:
a = np.linspace(0,100,50) # generates an array with 50 evenly spaced points between 0 and 100.
z = np.zeros((50,50)) # generates a 50 x 50 array filled with zeros.
l = np.logspace(-5,5,100) # generates 100 logarithmically spaced data points between 10^-5 and 10^5.
r = np.random.normal(0,1,20) # generates 20 random numbers with a mean of 0 and a standard deviation of 1.
u = np.random.uniform(0,10,20) # generates 20 random numbers uniformly distributed between 0 and 10.
# note: variables are assigned using '='. variables are
# overwritten when they are re-assigned.
# if you write a statement without assigning it to a variable,
# its output will be printed below the code cell.
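For example (using the 'r' array created above), an expression that isn't assigned to anything will show its value below the cell:
r.mean() # the mean of the 20 random numbers in 'r'; nothing is assigned here,
         # so the result is printed below the cell (it should be roughly 0).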
In [4]:
# let's make some data! (so much easier than lab work...)
x = np.linspace(0,20,50)
y = x * 2.4 + np.random.normal(5, 2, len(x))
In [5]:
import matplotlib.pyplot as plt
# note the difference in import here: we're only importing part
# of the matplotlib library (pyplot), and calling it 'plt'.
%matplotlib inline
# this line tells jupyter notebook that we want to see the plots presented 'in line' in the browser.
In [6]:
fig, ax = plt.subplots(1, 1) # make a figure with 1 subplot 'axis'
# a 'figure' can contain multiple 'axis' objects.
# note the assignment here: the plt.subplots function returns a tuple.
# tuples can be expanded to multiple variables in assignment, i.e.
# running a,b = (1,2) is the same as running a = 1 and b = 2 separately.
ax.scatter(x, y, c='purple', marker='x') # make a scatter plot of the x,y data we generated above.
# 'c' sets the colour
# 'marker' sets the marker style ('x' = crosses)
ax.plot(x, y, ls='dotted', lw=2, c='black') # add a line connecting all the dots.
# 'ls' sets the line style
# 'lw' sets the line width
ax.set_xlim(-1,21) # set the x axis limits.
Out[6]:
Matplotlib colours can be defined in multiple ways: by name ('purple'), by single-letter shorthand ('k'), as a hex string ('#1f77b4'), or as an (r, g, b) or (r, g, b, alpha) tuple.
For a full description, see here.
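For example, here's a quick sketch (reusing the x and y arrays from above) showing a few of the ways a colour can be specified:
fig, ax = plt.subplots(1, 1)
ax.scatter(x, y, c='purple') # a named colour
ax.plot(x, y, c='k') # a single-letter shorthand ('k' = black)
ax.axhline(y.mean(), c='#1f77b4') # a hex string
ax.axvline(x.mean(), c=(0.8, 0.2, 0.2, 0.5)) # an (r, g, b, alpha) tuple, each value between 0 and 1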
In [7]:
# first, we must define a function that describes the line.
def linefunction(x, m, c):
    # functions are defined using 'def', followed by the
    # function name, inputs in brackets, and a colon.
    # note that these lines are indented. Indentation has meaning
    # in python. Indented lines are part of the preceding statement
    # (usually functions or loops).
    y = x * m + c # the work happens here.
    return y # 'return' defines what you get back from the function.
In [8]:
# import a least-squares curve fitting function
from scipy.optimize import curve_fit
# note the different form of import here. We're bringing in a single
# function (curve_fit) from part (optimize) of a library (scipy),
# rather than the entire library.
In [9]:
# fit the x,y data:
p, cov = curve_fit(linefunction, x, y)
# calculate the fitted line
# p = fitted parameters
# cov = covariance matrix of parameters
y_fit = linefunction(x, *p)
# note the '*' notation here, which can be used to expand a list of
# parameters into function inputs
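If the '*' expansion is new to you, here's a minimal, self-contained example:
def add(a, b, c):
    return a + b + c

params = [1, 2, 3]
add(*params) # equivalent to add(1, 2, 3) - the list is expanded into the three inputs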
In [10]:
# plot the fit
fig, ax = plt.subplots(1, 1)
ax.scatter(x, y, label='data', s=5, alpha=0.5)
# 'label' names the data, and is used to create a legend
# 's' sets marker size
# 'alpha' sets marker opacity (1 = 100%)
ax.plot(x, y_fit, c='r', lw=2, label='fit')
ax.set_xlim(-1,21)
ax.set_xlabel('x') # set x label
ax.set_ylabel('y') # set y label
ax.legend(loc='upper left') # make a legend
# loc defines legend position
Out[10]:
curve_fit can be applied to any function that takes (x, ...) as inputs. All sorts of other fitting routines are available in the scipy.optimize package: generic minimisers and zero-finders (e.g. Newton-Raphson-style optimisers, or the 'minimize' function), non-linear least squares (nls), ordinary least squares (ols), orthogonal distance regression (odr) and more. These other fitting functions allow you to accommodate measurement errors in fits (odr), set parameter constraints (minimize), and much more!
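To give a flavour of the more general tools, here's a sketch that fits the same straight line with the 'minimize' function by minimising the sum of squared residuals. This is just an illustration of the interface (the starting guess is an arbitrary choice), not a recommendation over curve_fit:
from scipy.optimize import minimize

def sum_sq_resid(params, x, y):
    m, c = params
    return np.sum((y - linefunction(x, m, c))**2)

fit = minimize(sum_sq_resid, x0=[1, 0], args=(x, y)) # x0 is the starting guess for (m, c)
fit.x # the best-fit (m, c), which should be close to the curve_fit result above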
In [11]:
# this can also be done with ready-made functions, which also give you fit statistics!
from scipy.stats import linregress
linregress(x, y)
Out[11]:
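In recent versions of scipy the object returned by linregress has named attributes, so you can pull the numbers out individually:
lr = linregress(x, y)
lr.slope, lr.intercept, lr.rvalue**2 # gradient, intercept and R-squared of the fit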
In [12]:
# make data
y2 = x**2.4 + np.random.normal(-40, 30, len(x))
# define power function:
def powerfn(x, p, c):
    return x**p + c
p, cov = curve_fit(powerfn, x, y2)
fig, ax = plt.subplots(1,1)
ax.scatter(x, y2)
ax.plot(x, powerfn(x, *p), c='r', lw=2)
# calculate R2
R2 = 1 - (np.sum((y2 - powerfn(x, *p))**2)) / np.sum((y2 - y2.mean())**2)
ax.text(x=.02, y=.98, s='$R^2$: {:.3f}'.format(R2), transform=ax.transAxes, ha='left', va='top')
# 'text' places arbitrary text within the axes.
# in this case, I'm using x,y coordinates relative to the axis size, as specified using 'transform'
# 'ha' sets the horizontal alignment of the text
# 'va' sets the vertical alignment of the text
# 'format' replaces the '{:}' statement with a number in the specified format ('.3f' = float with 3 decimal places)
# for format strings, see: https://docs.python.org/3.5/library/string.html#formatstrings
# the '$' in the string tell matplotlib that I'm giving it a LaTeX text string - useful for superscripts,
# or any mathematical typing. See: http://matplotlib.org/users/mathtext.html
ax.text(.02, .88, '$y = x^{' + '{:.2f}'.format(p[0]) + '}$' + '{:.2f}'.format(p[1]),
        transform=ax.transAxes, ha='left', va='top')
# note: I've split this function call over two lines. Because the split happens
# inside the brackets, python still reads it as a single statement. Aligning the
# second line with the opening bracket isn't strictly required, but it keeps things readable.
Out[12]:
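A couple of standalone format-string examples, in case the syntax above is unfamiliar:
'{:.3f}'.format(3.14159) # -> '3.142' (float with 3 decimal places)
'{:.2e}'.format(12345.6) # -> '1.23e+04' (scientific notation with 2 decimal places)
'{} of {}'.format(3, 10) # -> '3 of 10' (empty braces just insert the value as-is)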
In [13]:
# make a couple of example 'populations'
pop1 = np.random.normal(11, 2, 1000)
pop2 = np.random.normal(7, 4, 1000)
# two independent populations, 1000 measurements of each.
# pop1 mean = 11, std = 2
# pop2 mean = 7, std = 4
In [14]:
# have a look at the data
fig, ax = plt.subplots(1, 1)
# combine populations into single array
cpop = np.concatenate([pop1,pop2])
bins = np.linspace(cpop.min(), cpop.max(), 30) # define bin edges spanning the range of both populations
ax.hist(pop1, bins, alpha=0.5, label='pop1')
ax.hist(pop2, bins, alpha=0.5, label='pop2')
ax.legend()
Out[14]:
In [15]:
from scipy import stats
Before we can test whether these populations are different using a T-test, we must check two things: that each population is (approximately) normally distributed, and whether the two populations have equal variance.
In [16]:
# test for normal distribution
stats.anderson(pop1, 'norm')
Out[16]:
In [17]:
stats.anderson(pop2, 'norm')
Out[17]:
In [18]:
# What do these numbers mean?
# The Anderson-Darling test checks the null hypothesis that the data
# come from a particular distribution (in this case, normal). The test
# statistic is compared against the 'critical_values': if the statistic
# is smaller than the critical value for a given 'significance_level',
# the null hypothesis of normality cannot be rejected at that level.
# In this case, the statistic for both populations is below the critical
# values, so we can treat both as 'normal'. Good enough for a t-test!
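In recent versions of scipy the result has named attributes, so you can make this comparison explicitly:
res = stats.anderson(pop1, 'norm')
res.statistic < res.critical_values
# an array of True/False, one entry per significance level in res.significance_level.
# True means normality cannot be rejected at that level.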
In [19]:
# test for equal variance
stats.levene(pop1, pop2)
# Levene tests the null hypothesis of equal variance.
# p << 0.001, so we must reject the null hypothesis.
Out[19]:
In [20]:
# run a t-test that does not assume equal variance:
stats.ttest_ind(pop1, pop2, equal_var=False)
Out[20]:
The 'pretty' histogram. These can be useful beyond just looking better, and can offer a more accurate estimate of population structure when used appropriately. The probability density functions (PDFs) are created using 'kernel density estimators'. These take your data and describe them as the sum of multiple 'kernels' of a set width, one centred on each data point. The kernel is normally a gaussian, but can be any function that appropriately describes your data (e.g. for count statistics, you might want to use a poisson kernel instead).
In [21]:
kd1 = stats.gaussian_kde(pop1)
kd2 = stats.gaussian_kde(pop2)
# gaussian_kde is a simple gaussian probability distribution estimator included in scipy.stats.
# Other options are available. See: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/
In [22]:
fig, ax = plt.subplots(1, 1)
# combine populations into single array
cpop = np.concatenate([pop1,pop2])
bins = np.linspace(cpop.min(), cpop.max(), 100) # define 100 evenly spaced points spanning the range of both populations, at which to evaluate the KDEs
ax.plot(bins, kd1(bins), label='pop1')
ax.plot(bins, kd2(bins), label='pop2')
ax.legend()
Out[22]:
The above PDFs are the sum of numerous gaussian 'kernels', each placed at the position of one of your data points on the x-axis.
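To make that concrete, here's a sketch that builds a similar curve 'by hand': place a gaussian at each data point, add them up, and normalise. The kernel width here is an arbitrary choice, so the curve won't match gaussian_kde exactly (which picks its bandwidth automatically):
bw = 0.5 # an arbitrary kernel width, for illustration only
manual_kde = np.zeros(len(bins))
for point in pop1:
    manual_kde += stats.norm.pdf(bins, loc=point, scale=bw) # one gaussian per data point
manual_kde /= len(pop1) # normalise so the curve integrates to ~1

fig, ax = plt.subplots(1, 1)
ax.plot(bins, manual_kde, label='hand-made KDE')
ax.plot(bins, kd1(bins), ls='dashed', label='gaussian_kde')
ax.legend()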
In [23]:
import pandas as pd
# superb library for dealing with real-world data.
# adds 'spreadsheet' type functionality to python,
# along with numerous data import/export options.
In [24]:
# dataset of measurements of various dimensions of three iris species.
iris = pd.read_excel('data/iris.xlsx')
# dataset info: https://en.wikipedia.org/wiki/Iris_flower_data_set
In [25]:
groups = iris.groupby('species')
# this splits the 'iris' dataset into subsets by species.
In [26]:
groups.aggregate([np.mean, np.std, len])
Out[26]:
In [27]:
# groups contains a list of tuples containing (species, data)
# and you can use it in a loop:
for species, data in groups:
    print(species, data.loc[:,'sepal length (cm)'].mean())
# This is a loop. Loops are defined by 'for' or 'while'. Look these up.
# Here, I'm assigning each component of each tuple in the 'groups' list
# to 'species' and 'data'.
# This loop simply prints the species name, and the mean 'sepal length'
# of each species.
# Note the pandas indexing scheme here: you use .loc[row_index, column_index].
# You can use boolean arrays instead of specific indices here.
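For example, a boolean index can answer a one-off question without the groupby (assuming the species column contains the value 'setosa' - check iris['species'].unique() if unsure):
setosa_lengths = iris.loc[iris['species'] == 'setosa', 'sepal length (cm)']
setosa_lengths.mean() # mean sepal length of just the setosa rows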
In [28]:
# a more interesting application of a loop:
fig, axs = plt.subplots(2,2, figsize=[8,5]) # make 4 subplots - axs in this case is a 2x2 array
# 'figsize' specifies figure (width, height) in inches (not cm... grrr...)
axs = axs.flatten() # transform axs from a 2x2 array to a 4x1 array, so it works with a loop.
for species, data in groups:
    for parameter, ax in zip(data.columns[1:], axs): # indexing here removes the 'species' column from the data.
        data.loc[:,parameter].plot.kde(ax=ax, label=species) # pandas has a kernel density estimator built in.
        ax.set_title(parameter)
fig.tight_layout() # spaces the axes out so labels don't overlap
axs[0].legend()
# a few more advanced concepts in here. what's going on:
# two loops: one going through all the groups, the other
# going through all the measured parameters. You have to
# combine the parameters and the axes into a form that the
# loop can work with. 'zip' does this.
Out[28]:
In [29]:
# an example of what 'zip' does:
list(zip(['a','b','c'],[1,2,3]))
Out[29]:
Enter scikit-learn: the comprehensive clustering/PCA/machine-learning package for python. You can do a lot with this package, and it's surprisingly accessible.
This is fantastic for morphometrics (e.g. example here), or geochemical analyses (e.g. finding geochemically distinct populations within large datasets). It has a really nice implementation of principal component analysis (PCA), which you can see applied to the iris dataset here.
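As a taste of the PCA side, here's a minimal sketch applied to the iris measurements (the scaling step and the choice of two components are my assumptions, not fixed requirements):
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

vals = scale(iris.loc[:, ['sepal length (cm)', 'sepal width (cm)',
                          'petal length (cm)', 'petal width (cm)']].values)
pca = PCA(n_components=2)
pcs = pca.fit_transform(vals) # project the four measurements onto two principal components

fig, ax = plt.subplots(1, 1)
ax.scatter(pcs[:, 0], pcs[:, 1])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')

pca.explained_variance_ratio_ # fraction of the total variance captured by each component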
In [30]:
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import scale
In [31]:
data = iris.loc[:,['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']].values
sdata = scale(data) # scale the data so each parameter has the same variance (necessary so that parameters measured on different scales are weighted equally by the clustering)
In [32]:
# estimate data 'bandwidth'
bw = estimate_bandwidth(sdata)
# set up clustering algorithm
ms = MeanShift(bandwidth=bw)
In [33]:
# apply algorithm to data
ms.fit(sdata)
Out[33]:
In [34]:
# determine how many clusters it found
n_clusters = len(ms.cluster_centers_)
n_clusters
Out[34]:
It only found two statistically distinguishable clusters based on these measurements, but there are actually three species.
In [35]:
# take a look at them!
ms.cluster_centers_
Out[35]:
We can see from the cluster centers that the main differences are in the first two columns (sepal length and sepal width). Let's plot this, and colour it by group:
In [36]:
fig, ax = plt.subplots(1,1)
ax.scatter(iris.loc[:,'sepal length (cm)'], iris.loc[:,'sepal width (cm)'], c=ms.labels_)
Out[36]:
Note: we're looking at this in two dimensions, because that's where the clustering algorithm found the most difference. However, the algorithm took all four parameters into account when processing the data - there just isn't much to set the petal lengths/widths apart. You can give these algorithms as many 'dimensions' as you like to work with.
This is an example of one possible clustering algorithm - MeanShift. There are many others. These algorithms tend to be 'naive' - you give them the data, some parameters (and, in some cases, how many clusters you're expecting), and it finds statistically distinguishable populations within your multi-dimensional dataset. There are also more sophisticated 'machine learning' algorithms, which you can 'train' based on a subset of your data, and then use to 'classify' the rest.
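To give a flavour of that 'train then classify' workflow, here's a minimal sketch using a k-nearest-neighbours classifier on the scaled iris data (the 30% test split and 5 neighbours are arbitrary choices on my part):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold back 30% of the flowers to test the classifier on
X_train, X_test, y_train, y_test = train_test_split(sdata, iris['species'], test_size=0.3)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train) # 'train' on the subset
knn.score(X_test, y_test) # fraction of the held-back flowers classified correctly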
In [37]:
import uncertainties.unumpy as un
In [38]:
realdata = un.uarray(nominal_values=np.random.uniform(0,10,15), std_devs=np.abs(np.random.normal(.5,.23,15)))
# each nominal_value has an associated standard deviation!
# (np.abs guards against the occasional negative draw, which uarray would reject)
In [39]:
constant = un.uarray(10.3, 0.02) # define a constant
In [40]:
y = realdata * constant # errors are propagated!
y
Out[40]:
In [41]:
# NOTE: this assumes the errors are standard deviations, and that the data are normally distributed.
In [42]:
# plot it!
fig, ax = plt.subplots(1,1)
ax.errorbar(x=un.nominal_values(realdata), y=un.nominal_values(y),
            xerr=un.std_devs(realdata), yerr=un.std_devs(y),
            lw=0, elinewidth=1, marker='o', ecolor=(0,0,0,0.5))
# new plot parameters:
# 'elinewidth': width of error bar lines
# 'ecolor': colour of error bars, defined here as (r,g,b,alpha)
# note the use of 'un.nominal_values' and 'un.std_devs' to extract the
# values and errors separately.
Out[42]:
In [43]:
from scipy.integrate import odeint
# function for solving systems of ordinary differential equations
def MassSpring(state, t):
    # unpack the state vector
    x = state[0]
    xd = state[1]
    # these are our constants
    k = -2.5 # Newtons per metre
    m = 1.5 # Kilograms
    g = 9.8 # metres per second squared
    # compute acceleration xdd
    xdd = ((k*x)/m) + g
    # return the two state derivatives
    return [xd, xdd]
state0 = [0.0, 0.0]
t = np.arange(0.0, 10.0, 0.1)
state = odeint(MassSpring, state0, t)
plt.plot(t, state, lw=2)
plt.xlabel('TIME (sec)')
plt.ylabel('STATES')
plt.title('Mass-Spring System')
plt.legend(('$x$ (m)', r'$\dot{x}$ (m/sec)'))
Out[43]:
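For reference, the second-order equation being solved above is: \begin{align} \ddot{x} = \frac{k x}{m} + g \end{align} odeint only handles first-order equations, so it's rewritten as the pair $\dot{x} = x_d$ and $\dot{x_d} = \frac{k x}{m} + g$, which is exactly what the MassSpring function returns.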