Scientific Python essentials

Introduction to GIS scripting
May, 2017

© 2017, Stijn Van Hoey (mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons

Introduction

There is a large variety of packages available in Python to support research. Importing a package is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up (imported), it can be used or called to perform many tasks.

In this notebook, we will focus on two fundamental packages within most scientific applications:

  1. Numpy
  2. Pandas

Furthermore, if plotting is required, this will be done with the matplotlib package (we only use a few basic plotting functions in this tutorial):


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-white')

Numpy

Introduction

NumPy is the fundamental package for scientific computing with Python.

Some background information for the enthusiasts:

  • a powerful N-dimensional array/vector/matrix object
  • sophisticated (broadcasting) functions
  • function implementation in C/Fortran assuring good performance if vectorized
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

In short: Numpy is the Python package to do fast calculations!

It is a community agreement to import the numpy package under the prefix np to identify the usage of numpy functions. Type `np.` and press TAB to check the available functions of numpy:


In [ ]:
import numpy as np

In [ ]:
# np. # explore the namespace

Numpy provides many mathematical functions, which operate element-wise on a so-called numpy.ndarray data type (in short: array).
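
For illustration, applying such a function to an array returns a new array with the operation applied to each element (a minimal sketch with made-up numbers):


In [ ]:
np.sqrt(np.array([1., 4., 9.]))  # element-wise square root -> array([1., 2., 3.])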

REMEMBER:
  • There is a lot of functionality in Numpy. Knowing **how to find a specific function** is more important than knowing all functions...

You were looking for some function to derive quantiles of an array...


In [ ]:
np.lookfor("quantile")

Different methods to read the manual:


In [ ]:
#?np.percentile

In [ ]:
# help(np.percentile)

In [ ]:
# use SHIFT + TAB

Showcases

  • You like to play board games, but you want to better know your chances of rolling a certain combination (sum) with 2 dice:

In [ ]:
throws = 1000 # number of rolls with the dice

stone1 = np.random.randint(1, 7, throws) # outcome of each roll of die 1 (integers 1 to 6)
stone2 = np.random.randint(1, 7, throws) # outcome of each roll of die 2
total = stone1 + stone2  # sum of both dice for each roll
histogram = plt.hist(total, bins=11) # plot as histogram (possible sums: 2 to 12)
  • Consider a random 10x2 matrix representing Cartesian coordinates (between 0 and 1); how would you convert them to polar coordinates?

In [ ]:
# random numbers (X, Y in 2 columns)
Z = np.random.random((10,2))
X, Y = Z[:,0], Z[:,1]

# Distance 
R = np.sqrt(X**2 + Y**2)
# Angle
T = np.arctan2(Y, X) # Array of angles in radians
Tdegree = T*180/(np.pi) # If you like degrees more

# NEXT PART (purely for illustration)
# plot the cartesian coordinates
plt.figure(figsize=(14, 6))
ax1 = plt.subplot(121)
ax1.plot(Z[:,0], Z[:,1], 'o')
ax1.set_title("Cartesian")
# plot the polar coordinates
ax2 = plt.subplot(122, polar=True)
ax2.plot(T, R, 'o')
ax2.set_title("Polar")
  • Rescale the values of a given array to the range [0-1] and mark zero values as NaN:

In [ ]:
nete_bodem = np.load("../data/nete_bodem.npy")
plt.imshow(nete_bodem)
plt.colorbar(shrink=0.6)

In [ ]:
nete_bodem_rescaled = (nete_bodem - nete_bodem.min())/(nete_bodem.max() - nete_bodem.min()) # rescale
nete_bodem_rescaled[nete_bodem_rescaled == 0.0] = np.nan  # set the zero values to NaN

In [ ]:
plt.imshow(nete_bodem_rescaled)
plt.colorbar(shrink=0.6)

(Remark: There is no GIS component in the previous manipulation; these are pure element-wise operations on an array!)

Creating numpy arrays


In [ ]:
np.array([1, 1.5, 2, 2.5])  #np.array(anylist)
R comparison:

One could compare the numpy array to the R vector. It contains a single data type (character, float, integer) and operations are element-wise.
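
A quick illustration of the single-data-type behaviour (a minimal sketch): mixing integers and floats results in a float array, and mixing in a string turns everything into strings:


In [ ]:
print(np.array([1, 2, 3]).dtype)    # integers stay integers
print(np.array([1, 1.5, 2]).dtype)  # integers are upcast to floats
print(np.array([1, 'a']).dtype)     # everything becomes a string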

Provide a range of values, with a begin, end (not included) and step size:


In [ ]:
np.arange(5, 12, 2)

Provide a range of values, with a begin, end (included) and the total number of values:


In [ ]:
np.linspace(2, 13, 3)

Create arrays filled with zeros or ones:


In [ ]:
np.zeros((5, 2)), np.ones(5)

Request the shape or the size of the arrays:


In [ ]:
np.zeros((5, 2)).shape, np.zeros((5, 2)).size

And creating random numbers:


In [ ]:
np.random.rand(5,5)  # check with np.random. + TAB for sampling from other distributions!

Reading in from a binary file:


In [ ]:
nete_bodem = np.load("../data/nete_bodem.npy")

In [ ]:
plt.imshow(nete_bodem)

Reading in from a text-file:


In [ ]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")

In [ ]:
plt.imshow(nete_bodem_subset)

Slicing (accessing values in arrays)

This is equivalent to the slicing of a list:


In [ ]:
my_array = np.random.randint(2, 10, 10)
my_array

In [ ]:
my_array[:5], my_array[4:], my_array[-2:]

In [ ]:
my_array[0:7:2]

In [ ]:
sequence = np.arange(0, 11, 1)
sequence, sequence[::2], sequence[1::3],

Assign new values to items


In [ ]:
my_array[:2] = 10
my_array

In [ ]:
my_array = my_array.reshape(5, 2)
my_array

With multiple dimensions, we get the option to slice along each of these dimensions:


In [ ]:
my_array[0, :]
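
Similarly, you can select a single column; here the second column of the 5x2 array created above:


In [ ]:
my_array[:, 1]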

Aggregation calculations


In [ ]:
my_array = np.random.randint(2, 10, 10)
my_array

In [ ]:
print('Mean value is', np.mean(my_array))
print('Median value is',  np.median(my_array))
print('Std is', np.std(my_array))
print('Variance is', np.var(my_array))
print('Min is', my_array.min())
print('Index of minimum value is', my_array.argmin())
print('Max is', my_array.max())
print('Sum is', np.sum(my_array))
print('Prod', np.prod(my_array))
print('Unique values in this array are:', np.unique(my_array))
print('85th percentile value is: ', np.percentile(my_array, 85))

In [ ]:
my_other_array = np.random.randint(2, 10, 10).reshape(2, 5)
my_other_array

Use the argument `axis` to define the axis along which to calculate a specific statistic:


In [ ]:
my_other_array.max(), my_other_array.max(axis=1), my_other_array.max(axis=0)

Element-wise operations


In [ ]:
my_array = np.random.randint(2, 10, 10)

In [ ]:
my_array

In [ ]:
print('Cumsum is', np.cumsum(my_array))
print('CumProd is', np.cumprod(my_array))
print('CumProd of 5 first elements is', np.cumprod(my_array)[4])

In [ ]:
np.exp(my_array), np.sin(my_array)

In [ ]:
my_array%3  # == 0

Should you use the function available from the numpy library, or the method of the array object?


In [ ]:
np.cumsum(my_array) == my_array.cumsum()

In [ ]:
my_array.dtype
EXERCISE:
  • Check the documentation of both `np.cumsum()` and `my_array.cumsum()`. What is the difference?
  • Why do we use brackets () to run `cumsum` and we do not use brackets when asking for the `dtype`?
REMEMBER:
  • `np.cumsum(my_array)` calls a function from the numpy library with the array `my_array` as input
  • `my_array.cumsum()` is a method/function available to the object `my_array`
  • `dtype` is an attribute/characteristic of the object `my_array`
  • It is all about calling a **method/function()** on an **object** to perform an action. The available methods are provided by the packages (or any function you write and import).
  • Objects also have **attributes**, defining the characteristics of the object (these are not actions)

In [ ]:
my_array.cumsum()

In [ ]:
my_array.max(axis=0)

In [ ]:
my_array * my_array  # element-wise
REMEMBER:
  • The operations work on all elements of the array at once; you don't need a `for` loop

What is the added value of the numpy implementation compared to 'basic' python?


In [ ]:
a_list = range(1000)
%timeit [i**2 for i in a_list]

In [ ]:
an_array = np.arange(1000)
%timeit an_array**2

Boolean indexing and filtering (!)

This is a fancy term for making selections based on a condition!

Let's start with an array that contains random values:


In [ ]:
row_array = np.random.randint(1, 20, 10)
row_array

Conditions can be checked (element-wise):


In [ ]:
row_array > 5

In [ ]:
boolean_mask = row_array > 5
boolean_mask

You can use this as a filter to select elements of an array:


In [ ]:
row_array[boolean_mask]

or to change the values in the array that satisfy the condition:


In [ ]:
row_array[boolean_mask] = 20
row_array

or in one statement - changing the values that are equal to 20 into -20:


In [ ]:
row_array[row_array == 20] = -20
row_array
R comparison:

This is similar to conditional filtering in R on vectors...
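
Conditions can also be combined element-wise with `&` (and) and `|` (or); note that each condition needs its own parentheses. A small sketch with a fresh random array:


In [ ]:
another_array = np.random.randint(1, 20, 10)
another_array[(another_array > 5) & (another_array < 15)]  # values between 5 and 15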

Understanding conditional selections and assignments is CRUCIAL!

This requires some practice...


In [ ]:
AR = np.random.randint(0, 20, 15)
AR
EXERCISE:
  • Count the number of values in AR that are larger than 10 (note: you can count with True = 1 and False = 0)

In [ ]:
# %load ../notebooks/_solutions/02-scientific-python-introduction52.py
EXERCISE:
  • Change all even numbers of `AR` into zero-values.

In [ ]:
# %load ../notebooks/_solutions/02-scientific-python-introduction53.py
EXERCISE:
  • Change the values at all even positions of the array `AR` into the value 30

In [ ]:
# %load ../notebooks/_solutions/02-scientific-python-introduction54.py
EXERCISE:
  • Select all values above the 75th `percentile` of the following array AR2 and take the square root of these values

In [ ]:
AR2 = np.random.random(10)
AR2

In [ ]:
# %load ../notebooks/_solutions/02-scientific-python-introduction56.py
EXERCISE:
  • Convert all -99. values of the array AR3 into NaN values (note that NaN values can be represented in float arrays as `np.nan`)

In [ ]:
AR3 = np.array([-99., 2., 3., 6., 8, -99., 7., 5., 6., -99.])

In [ ]:
# %load ../notebooks/_solutions/02-scientific-python-introduction58.py
EXERCISE:
  • Get an overview of the unique values present in the array `nete_bodem_subset`

In [ ]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")

In [ ]:
# %load ../notebooks/_solutions/02-scientific-python-introduction60.py
EXERCISE:
  • Reclassify the values of the array `nete_bodem_subset` (binary filter):
    • values lower than or equal to 100000 should be 0
    • values higher than 100000 should be 1

In [ ]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")

In [ ]:
# %load ../notebooks/_solutions/02-scientific-python-introduction62.py
REMEMBER:
  • No need to retain everything, but have the reflex to search in the documentation (online docs, SHIFT-TAB, help(), lookfor())!!
  • Conditional selection (boolean indexing) is crucial!

This is just touching the surface of Numpy in order to proceed to the next phase (Pandas and GeoPandas)...

More extensive material on Numpy is available online.

Pandas: data analysis in Python

Introduction

For data-intensive work in Python, the Pandas library has become essential. The name pandas is derived from 'panel data', though many users probably don't know that.

What is pandas?

  • Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
  • Pandas can also be thought of as R's data.frame in Python.
  • Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data,...

Pandas documentation is available on: http://pandas.pydata.org/pandas-docs/stable/


In [ ]:
# community agreement: import as pd
import pandas as pd

Data exploration

Reading in data to DataFrame


In [ ]:
surveys_df = pd.read_csv("../data/surveys.csv")

In [ ]:
surveys_df.head()  # Try also tail()

In [ ]:
surveys_df.shape

In [ ]:
surveys_df.columns

In [ ]:
surveys_df.info()

In [ ]:
surveys_df.dtypes

In [ ]:
surveys_df.describe()
R comparison:

See the similarities and differences with the R `data.frame` - e.g. you would use `summary(df)` instead of `df.describe()` :-)


In [ ]:
surveys_df["weight"].hist(bins=20)

Series and DataFrames

A Pandas Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:


In [ ]:
a_series = pd.Series([0.1, 0.2, 0.3, 0.4])
a_series

In [ ]:
a_series.index, a_series.values

A Series has an index and values (a numpy array!), and you can give the series a name (amongst other things):


In [ ]:
a_series.name = "example_series"

In [ ]:
a_series

In [ ]:
a_series[2]

Unlike the NumPy array, though, this index can be something other than integers:


In [ ]:
a_series2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
a_series2['c']

A DataFrame is a tabular data structure (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.
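
For illustration, a small DataFrame can be built from a dictionary of Series that share the same index (a minimal sketch with made-up values):


In [ ]:
pd.DataFrame({'col_a': pd.Series([1, 2, 3]),
              'col_b': pd.Series([0.1, 0.2, 0.3])})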
Note that in the IPython notebook, the dataframe will display in a rich HTML view:


In [ ]:
surveys_df.head()

In [ ]:
surveys_df["species_id"].head()

If you select a single column of a DataFrame, you end up with... a Series:


In [ ]:
type(surveys_df), type(surveys_df["species_id"])

Aggregation and element-wise calculations

Completely analogous to Numpy, aggregation statistics are available:


In [ ]:
print('Mean weight is', surveys_df["weight"].mean())
print('Median weight is',  surveys_df["weight"].median())
print('Std of weight is', surveys_df["weight"].std())
print('Variance of weight is', surveys_df["weight"].var())
print('Min is', surveys_df["weight"].min())
print('Index of minimum value is', surveys_df["weight"].argmin())
print('Max is', surveys_df["weight"].max())
print('Sum is', surveys_df["weight"].sum())
print('85th percentile value is: ', surveys_df["weight"].quantile(0.85))

Calculations are element-wise, e.g. adding the normalized weight (relative to its mean) as an additional column:


In [ ]:
surveys_df['weight_normalised'] = surveys_df["weight"]/surveys_df["weight"].mean()

Pandas and Numpy collaborate well (Numpy functions can be applied to the DataFrame values, as these are actually numpy arrays):


In [ ]:
np.sqrt(surveys_df["hindfoot_length"]).head()

Groupby provides the functionality to do an aggregation or calculation for each group:


In [ ]:
surveys_df.groupby('sex')[['hindfoot_length', 'weight']].mean() # Try yourself with min, max,...
R comparison:

Similar to grouping in R, i.e. working with factors
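
You can also compute several statistics per group at once by passing a list of function names to `agg` (a small sketch using the same data):


In [ ]:
surveys_df.groupby('sex')['weight'].agg(['mean', 'min', 'max'])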

Slicing

ATTENTION!:

One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy.

We now have to distinguish between:
  • selection by **label**: loc
  • selection by **position**: iloc

In [ ]:
# example dataframe from scratch
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries = countries.set_index('country')
countries

The shortcut []


In [ ]:
countries['area'] # single []

In [ ]:
countries[['area', 'population']] # double [[]]

In [ ]:
countries['France':'Netherlands']

Systematic indexing with loc and iloc

When using [] like above, you can only select from one axis at a time (rows or columns, not both). For more advanced indexing, you have some extra attributes:

  • loc: selection by label
  • iloc: selection by position

In [ ]:
countries.loc['Germany', 'area']

In [ ]:
countries.loc['France':'Germany', ['area', 'population']]

Selecting by position with iloc works similarly to indexing numpy arrays:


In [ ]:
countries.iloc[0:2,1:3]

Boolean indexing

In short, similar to Numpy:


In [ ]:
countries['area'] > 100000

Selecting by conditions:


In [ ]:
countries[countries['area'] > 100000]

In [ ]:
countries['size'] = np.nan # create an empty new column
countries

In [ ]:
countries.loc[countries['area'] > 100000, "size"] = 'LARGE'
countries.loc[countries['area'] <= 100000, "size"] = 'SMALL'

In [ ]:
countries

Combining DataFrames (!)

An important way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id), as is done in databases. Combining DataFrames using a common field is called joining. Joining DataFrames in this way is often useful when one DataFrame is a "lookup table" containing additional data that we want to include in the other.

As an example, consider the availability of the species information in a separate lookup-table:


In [ ]:
species_df = pd.read_csv("../data/species.csv", delimiter=";")
EXERCISE:
  • Check the other `read_` functions that are available in the Pandas package yourself.

In [ ]:
species_df.head()

In [ ]:
surveys_df.head()

We see that both tables have a common identifier column (species_id), which we can use to join the two tables together with the command merge:


In [ ]:
merged_left = pd.merge(surveys_df, species_df, how="left", on="species_id")

In [ ]:
merged_left.head()
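
The `how` argument controls the type of join; for example, an inner join would keep only the records with a matching `species_id` in both tables (a minimal sketch):


In [ ]:
merged_inner = pd.merge(surveys_df, species_df, how="inner", on="species_id")
merged_inner.shape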

Optional section: Pandas is great with time series


In [ ]:
flowdata = pd.read_csv("../data/vmm_flowdata.csv", index_col=0, 
                       parse_dates=True)

In [ ]:
flowdata.head()
REMEMBER:
  • `pd.read_csv` provides a lot of built-in functionality to support this kind of data handling when reading in a file! Check the **help** of the read_csv function...

The index provides many attributes to work with:


In [ ]:
flowdata.index.year, flowdata.index.dayofweek, flowdata.index.dayofyear #,...

Subselecting periods can be done by the string representation of dates:


In [ ]:
flowdata["2012-01-01 09:00":"2012-01-04 19:00"].plot()

or shorter when possible:


In [ ]:
flowdata["2009"].plot()

Combinations with other selection criteria are possible, e.g. to get all months with 30 days in the year 2009:


In [ ]:
flowdata.loc[(flowdata.index.days_in_month == 30) & (flowdata.index.year == 2009), "L06_347"].plot()

Select all 'daytime' data (between 8h and 20h) for all days:


In [ ]:
flowdata[(flowdata.index.hour > 8) & (flowdata.index.hour < 20)].head()
# OR USE flowdata.between_time('08:00', '20:00')

A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).


In [ ]:
flowdata.resample('A').mean().plot()

A practical example is: Plot the monthly minimum and maximum of the daily average values of the LS06_348 column


In [ ]:
daily = flowdata['LS06_348'].resample('D').mean() # calculate the daily average value
daily.resample('M').agg(['min', 'max']).plot()  # calculate the monthly minimum and maximum values

Other plots are supported as well, e.g. a bar plot of the mean of each station in the year 2013:


In [ ]:
flowdata['2013'].mean().plot(kind='barh')
