Scientific Python essentials

Introduction to GIS scripting
May, 2017

© 2017, Stijn Van Hoey (mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons

Introduction

There is a large variety of packages available in Python to support research. Importing a package is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up (imported), it can be used or called to perform many tasks.

In this notebook, we will focus on two fundamental packages within most scientific applications:

  1. Numpy
  2. Pandas

Furthermore, whenever plotting is required, this will be done with the matplotlib package (we only use a handful of basic plotting functions in this tutorial):


In [136]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-white')

Numpy

Introduction

NumPy is the fundamental package for scientific computing with Python.

Some highlights for the enthusiasts:

  • a powerful N-dimensional array/vector/matrix object
  • sophisticated (broadcasting) functions
  • function implementation in C/Fortran assuring good performance if vectorized
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

In short: Numpy is the Python package to do fast calculations!

It is a community agreement to import the numpy package under the alias np, so the usage of numpy functions is easy to identify. Type np. and press TAB to check the available functions of numpy:


In [2]:
import numpy as np

In [3]:
# np. # explore the namespace

Numpy provides many mathematical functions, which operate element-wise on a so-called numpy.ndarray data type (in short: array).
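
For example, applying such a function to an array evaluates it for every element at once (a minimal sketch):


In [ ]:
values = np.array([1., 4., 9., 16.])
np.sqrt(values)  # element-wise square root -> array([ 1.,  2.,  3.,  4.])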

REMEMBER:
  • There is a lot of functionality in Numpy. Knowing **how to find a specific function** is more important than knowing all functions...

You were looking for some function to derive quantiles of an array...


In [4]:
np.lookfor("quantile")


Search results for 'quantile'
-----------------------------
numpy.percentile
    Compute the qth percentile of the data along the specified axis.
numpy.nanpercentile
    Compute the qth percentile of the data along the specified axis,

There are different ways to consult the documentation:


In [5]:
#?np.percentile

In [6]:
# help(np.percentile)

In [7]:
# use SHIFT + TAB

Showcases

  • You like to play board games, and you want to know your chances of rolling a certain combination (sum) with 2 dice:

In [8]:
throws = 1000 # number of rolls with the dice

stone1 = np.random.randint(1, 7, throws) # outcome of the rolls with die 1 (integers 1 to 6; the upper bound is exclusive)
stone2 = np.random.randint(1, 7, throws) # outcome of the rolls with die 2
total = stone1 + stone2  # sum of each roll
histogram = plt.hist(total, bins=11) # plot as histogram (one bin per possible sum from 2 to 12)


  • Consider a random 10x2 matrix representing Cartesian coordinates (between 0 and 1); how do you convert them to polar coordinates?

In [9]:
# random numbers (X, Y in 2 columns)
Z = np.random.random((10,2))
X, Y = Z[:,0], Z[:,1]

# Distance 
R = np.sqrt(X**2 + Y**2)
# Angle
T = np.arctan2(Y, X) # Array of angles in radians
Tdegree = T*180/(np.pi) # If you like degrees more

# NEXT PART (purely for illustration)
# plot the cartesian coordinates
plt.figure(figsize=(14, 6))
ax1 = plt.subplot(121)
ax1.plot(Z[:,0], Z[:,1], 'o')
ax1.set_title("Cartesian")
# plot the polar coordinates
ax2 = plt.subplot(122, polar=True)
ax2.plot(T, R, 'o')
ax2.set_title("Polar")


Out[9]:
<matplotlib.text.Text at 0x7f6c39e27748>
  • Rescale the values of a given array to the range [0-1] and mark zero values as NaN:

In [10]:
nete_bodem = np.load("../data/nete_bodem.npy")
plt.imshow(nete_bodem)
plt.colorbar(shrink=0.6)


Out[10]:
<matplotlib.colorbar.Colorbar at 0x7f6c39e09ef0>

In [11]:
nete_bodem_rescaled = (nete_bodem - nete_bodem.min())/(nete_bodem.max() - nete_bodem.min()) # rescale
nete_bodem_rescaled[nete_bodem_rescaled == 0.0] = np.nan  # replace zero values by NaN

In [12]:
plt.imshow(nete_bodem_rescaled)
plt.colorbar(shrink=0.6)


Out[12]:
<matplotlib.colorbar.Colorbar at 0x7f6c39dc1550>

(Remark: there is no GIS component in the previous manipulation; these are pure element-wise operations on an array!)

Creating numpy arrays


In [13]:
np.array([1, 1.5, 2, 2.5])  #np.array(anylist)


Out[13]:
array([ 1. ,  1.5,  2. ,  2.5])
R comparison:

One could compare the numpy array to the R vector. It contains a single data type (character, float, integer) and operations are element-wise.
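
A minimal sketch of this single-data-type behaviour: mixing integers and floats upcasts every element to float.


In [ ]:
mixed = np.array([1, 2.5, 3])
mixed, mixed.dtype  # -> (array([ 1. ,  2.5,  3. ]), dtype('float64'))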

Provide a range of values, with a begin, end and stepsize:


In [14]:
np.arange(5, 12, 2)


Out[14]:
array([ 5,  7,  9, 11])

Provide a range of values, with a begin, end and number of values in between:


In [15]:
np.linspace(2, 13, 3)


Out[15]:
array([  2. ,   7.5,  13. ])

Create arrays filled with zeros or ones:


In [16]:
np.zeros((5, 2)), np.ones(5)


Out[16]:
(array([[ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.]]), array([ 1.,  1.,  1.,  1.,  1.]))

Request the shape or the size of the arrays:


In [17]:
np.zeros((5, 2)).shape, np.zeros((5, 2)).size


Out[17]:
((5, 2), 10)

And creating random numbers:


In [18]:
np.random.rand(5,5)  # check with np.random. + TAB for sampling from other distributions!


Out[18]:
array([[ 0.38438942,  0.97434155,  0.38502669,  0.15532264,  0.823198  ],
       [ 0.78587399,  0.27499523,  0.70757633,  0.47966322,  0.55844034],
       [ 0.4711012 ,  0.85971443,  0.92337492,  0.7120357 ,  0.58327473],
       [ 0.99297042,  0.29729801,  0.14433942,  0.97605846,  0.23636861],
       [ 0.05894808,  0.68050613,  0.70513088,  0.30843527,  0.06710415]])

Reading in from a binary file:


In [19]:
nete_bodem = np.load("../data/nete_bodem.npy")

In [20]:
plt.imshow(nete_bodem)


Out[20]:
<matplotlib.image.AxesImage at 0x7f6c39cfc390>

Reading in from a text-file:


In [21]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")

In [22]:
plt.imshow(nete_bodem_subset)


Out[22]:
<matplotlib.image.AxesImage at 0x7f6c38b93dd8>

Slicing (accessing values in arrays)

This is equivalent to the slicing of a list:


In [23]:
my_array = np.random.randint(2, 10, 10)
my_array


Out[23]:
array([7, 2, 7, 2, 5, 4, 5, 6, 4, 2])

In [24]:
my_array[:5], my_array[4:], my_array[-2:]


Out[24]:
(array([7, 2, 7, 2, 5]), array([5, 4, 5, 6, 4, 2]), array([4, 2]))

In [25]:
my_array[0:7:2]


Out[25]:
array([7, 7, 5, 5])

In [26]:
sequence = np.arange(0, 11, 1)
sequence, sequence[::2], sequence[1::3],


Out[26]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 array([ 0,  2,  4,  6,  8, 10]),
 array([ 1,  4,  7, 10]))

Assign new values to items


In [27]:
my_array[:2] = 10
my_array


Out[27]:
array([10, 10,  7,  2,  5,  4,  5,  6,  4,  2])

In [28]:
my_array = my_array.reshape(5, 2)
my_array


Out[28]:
array([[10, 10],
       [ 7,  2],
       [ 5,  4],
       [ 5,  6],
       [ 4,  2]])

With multiple dimensions, we get the option to slice along each of these dimensions:


In [29]:
my_array[0, :]


Out[29]:
array([10, 10])
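
In the same way you can take a column, or a block of rows and columns (a small sketch using the same 5x2 array):


In [ ]:
my_array[:, 1], my_array[1:3, :]  # the second column; rows 1 and 2 with all columns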

Aggregation calculations


In [30]:
my_array = np.random.randint(2, 10, 10)
my_array


Out[30]:
array([2, 8, 9, 6, 4, 8, 4, 8, 7, 3])

In [31]:
print('Mean value is', np.mean(my_array))
print('Median value is',  np.median(my_array))
print('Std is', np.std(my_array))
print('Variance is', np.var(my_array))
print('Min is', my_array.min())
print('Element of minimum value is', my_array.argmin())
print('Max is', my_array.max())
print('Sum is', np.sum(my_array))
print('Prod', np.prod(my_array))
print('Unique values in this array are:', np.unique(my_array))
print('85% Percentile value is: ', np.percentile(my_array, 85))


Mean value is 5.9
Median value is 6.5
Std is 2.34307490277
Variance is 5.49
Min is 2
Element of minimum value is 0
Max is 9
Sum is 59
Prod 18579456
Unique values in this array are: [2 3 4 6 7 8 9]
85% Percentile value is:  8.0

In [32]:
my_other_array = np.random.randint(2, 10, 10).reshape(2, 5)
my_other_array


Out[32]:
array([[4, 6, 4, 8, 2],
       [2, 3, 4, 7, 4]])

Use the axis argument to define the axis along which to calculate a specific statistic:


In [33]:
my_other_array.max(), my_other_array.max(axis=1), my_other_array.max(axis=0)


Out[33]:
(8, array([8, 7]), array([4, 6, 4, 8, 4]))

Element-wise operations


In [8]:
my_array = np.random.randint(2, 10, 10)

In [9]:
my_array


Out[9]:
array([7, 9, 3, 9, 5, 3, 3, 2, 3, 5])

In [10]:
print('Cumsum is', np.cumsum(my_array))
print('CumProd is', np.cumprod(my_array))
print('CumProd of 5 first elements is', np.cumprod(my_array)[4])


Cumsum is [ 7 16 19 28 33 36 39 41 44 49]
CumProd is [      7      63     189    1701    8505   25515   76545  153090  459270
 2296350]
CumProd of 5 first elements is 8505

In [11]:
np.exp(my_array), np.sin(my_array)


Out[11]:
(array([  1.09663316e+03,   8.10308393e+03,   2.00855369e+01,
          8.10308393e+03,   1.48413159e+02,   2.00855369e+01,
          2.00855369e+01,   7.38905610e+00,   2.00855369e+01,
          1.48413159e+02]),
 array([ 0.6569866 ,  0.41211849,  0.14112001,  0.41211849, -0.95892427,
         0.14112001,  0.14112001,  0.90929743,  0.14112001, -0.95892427]))

In [12]:
my_array%3  # == 0


Out[12]:
array([1, 0, 0, 0, 2, 0, 0, 2, 0, 2])

Should you use the function available from the numpy library, or the method of the array object?


In [13]:
np.cumsum(my_array) == my_array.cumsum()


Out[13]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [49]:
my_array.dtype


Out[49]:
dtype('int64')
EXERCISE:
  • Check the documentation of both `np.cumsum()` and `my_array.cumsum()`. What is the difference?
  • Why do we use parentheses () to run `cumsum`, but no parentheses when asking for the `dtype`?
REMEMBER:
  • `np.cumsum(my_array)` calls a function from the numpy library with an array, e.g. `my_array`, as input
  • `my_array.cumsum()` is a method/function available to the object `my_array`
  • `dtype` is an attribute/characteristic of the object `my_array`
  • It is all about calling a **method/function()** on an **object** to perform an action. The available methods are provided by the packages (or any function you write and import).
  • Objects also have **attributes**, defining the characteristics of the object (these are not actions)

In [17]:
my_array.cumsum()


Out[17]:
array([ 7, 16, 19, 28, 33, 36, 39, 41, 44, 49])

In [18]:
my_array.max(axis=0)


Out[18]:
9

In [19]:
my_array * my_array  # element-wise


Out[19]:
array([49, 81,  9, 81, 25,  9,  9,  4,  9, 25])
REMEMBER:
  • The operations work on all elements of the array at the same time; you don't need a `for` loop

What is the added value of the numpy implementation compared to 'basic' python?


In [20]:
a_list = range(1000)
%timeit [i**2 for i in a_list]


361 µs ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [21]:
an_array = np.arange(1000)
%timeit an_array**2


1.38 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Boolean indexing and filtering (!)

This is a fancy term for making selections based on a condition!

Let's start with an array that contains random values:


In [45]:
row_array = np.random.randint(1, 20, 10)
row_array


Out[45]:
array([ 5,  7,  8, 15, 11, 16,  9,  3,  5, 10])

Conditions can be checked (element-wise):


In [46]:
row_array > 5


Out[46]:
array([False,  True,  True,  True,  True,  True,  True, False, False,  True], dtype=bool)

In [47]:
boolean_mask = row_array > 5
boolean_mask


Out[47]:
array([False,  True,  True,  True,  True,  True,  True, False, False,  True], dtype=bool)

You can use this as a filter to select elements of an array:


In [48]:
row_array[boolean_mask]


Out[48]:
array([ 7,  8, 15, 11, 16,  9, 10])

or also to change the values in the array that satisfy the condition:


In [49]:
row_array[boolean_mask] = 20
row_array


Out[49]:
array([ 5, 20, 20, 20, 20, 20, 20,  3,  5, 20])

or, in short, changing the values equal to 20 into -20:


In [50]:
row_array[row_array == 20] = -20
row_array


Out[50]:
array([  5, -20, -20, -20, -20, -20, -20,   3,   5, -20])
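
An alternative to this select-and-assign pattern is np.where, which builds a new array from a condition without modifying the original (a small sketch using the array above):


In [ ]:
np.where(row_array == -20, 0, row_array)  # where the condition holds take 0, elsewhere keep the original value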
R comparison:

This is similar to conditional filtering in R on vectors...

Understanding conditional selections and assignments is CRUCIAL!

This requires some practice...


In [51]:
AR = np.random.randint(0, 20, 15)
AR


Out[51]:
array([ 5,  3, 12, 14, 11,  0,  8, 10, 16,  1, 13,  8,  8,  7, 16])
EXERCISE:
  • Count the number of values in AR that are larger than 10 (note: you can count with True = 1 and False = 0)

In [52]:
sum(AR > 10)


Out[52]:
6
EXERCISE:
  • Change all even numbers of `AR` into zero-values.

In [53]:
AR[AR%2 == 0] = 0
AR


Out[53]:
array([ 5,  3,  0,  0, 11,  0,  0,  0,  0,  1, 13,  0,  0,  7,  0])
EXERCISE:
  • Set the values at all even positions (the 2nd, 4th, ... elements) of the array AR to 30

In [54]:
AR[1::2] = 30
AR


Out[54]:
array([ 5, 30,  0, 30, 11, 30,  0, 30,  0, 30, 13, 30,  0, 30,  0])
EXERCISE:
  • Select all values above the 75th percentile of the following array AR2 and take the square root of these values

In [55]:
AR2 = np.random.random(10)
AR2


Out[55]:
array([ 0.2170166 ,  0.32711233,  0.1024719 ,  0.46837866,  0.73870644,
        0.23983147,  0.84006343,  0.27242343,  0.29360816,  0.32338526])

In [56]:
np.sqrt(AR2[AR2 > np.percentile(AR2, 75)])


Out[56]:
array([ 0.68438196,  0.85948033,  0.91654974])
EXERCISE:
  • Convert all -99. values of the array AR3 into NaN values (note that NaN values can be stored in float arrays as `np.nan`)

In [57]:
AR3 = np.array([-99., 2., 3., 6., 8, -99., 7., 5., 6., -99.])

In [58]:
AR3[AR3 == -99.] = np.nan
AR3


Out[58]:
array([ nan,   2.,   3.,   6.,   8.,  nan,   7.,   5.,   6.,  nan])
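
This works because AR3 is a float array; an integer array cannot hold NaN, so you would have to cast it to float first (a small sketch):


In [ ]:
AR_int = np.array([-99, 2, 3])
# AR_int[AR_int == -99] = np.nan  # would fail: NaN cannot be stored in an integer array
AR_float = AR_int.astype(float)   # cast to float first
AR_float[AR_float == -99.] = np.nan
AR_float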
EXERCISE:
  • Get an overview of the unique values present in the array `nete_bodem_subset`

In [59]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")

In [60]:
np.unique(nete_bodem_subset)


Out[60]:
array([      0.,   20000.,   30000.,   80000.,  100000.,  110000.,
        120000.,  130000.,  140000.,  150000.])
EXERCISE:
  • Reclassify the values of the array `nete_bodem_subset` (binary filter):
    • values lower than or equal to 100000 should be 0
    • values higher than 100000 should be 1

In [61]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")

In [62]:
nete_bodem_subset[nete_bodem_subset <= 100000.] = 0
nete_bodem_subset[nete_bodem_subset > 100000.] = 1

plt.imshow(nete_bodem_subset)
plt.colorbar(shrink=0.8)


Out[62]:
<matplotlib.colorbar.Colorbar at 0x7f6c39cacf98>
REMEMBER:
  • No need to retain everything, but have the reflex to search in the documentation (online docs, SHIFT-TAB, help(), lookfor())!!
  • Conditional selection (boolean indexing) is crucial!

This is just scratching the surface of Numpy, enough to proceed to the next phase (Pandas and GeoPandas)...

More extensive material on Numpy is available online.

Pandas: data analysis in Python

Introduction

For data-intensive work in Python, the Pandas library has become essential. Pandas originally meant Panel Data, though many users probably don't know that.

What is pandas?

  • Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
  • Pandas can also be thought of as R's data.frame in Python.
  • Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data,...

Pandas documentation is available on: http://pandas.pydata.org/pandas-docs/stable/


In [22]:
# community agreement: import as pd
import pandas as pd
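
As a first taste of such a labeled table with heterogeneous column types, a minimal sketch (the values are made up, purely for illustration):


In [ ]:
pd.DataFrame({'species': ['NL', 'DM', 'PF'],
              'weight': [45.0, 42.5, 8.0],
              'censused': [True, True, False]})  # three columns with three different data types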

Data exploration

Reading in data to DataFrame


In [202]:
surveys_df = pd.read_csv("../data/surveys.csv")

In [203]:
surveys_df.head()  # Try also tail()


Out[203]:
   record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight
0          1      7   16  1977        2          NL    M             32.0     NaN
1          2      7   16  1977        3          NL    M             33.0     NaN
2          3      7   16  1977        2          DM    F             37.0     NaN
3          4      7   16  1977        7          DM    M             36.0     NaN
4          5      7   16  1977        3          DM    M             35.0     NaN

In [204]:
surveys_df.shape


Out[204]:
(35549, 9)

In [205]:
surveys_df.columns


Out[205]:
Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
       'hindfoot_length', 'weight'],
      dtype='object')

In [206]:
surveys_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
record_id          35549 non-null int64
month              35549 non-null int64
day                35549 non-null int64
year               35549 non-null int64
plot_id            35549 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB

In [207]:
surveys_df.dtypes


Out[207]:
record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

In [209]:
surveys_df.describe()


Out[209]:
          record_id         month           day          year       plot_id  hindfoot_length        weight
count  35549.000000  35549.000000  35549.000000  35549.000000  35549.000000     31438.000000  32283.000000
mean   17775.000000      6.474022     16.105966   1990.475231     11.397001        29.287932     42.672428
std    10262.256696      3.396583      8.256691      7.493355      6.799406         9.564759     36.631259
min        1.000000      1.000000      1.000000   1977.000000      1.000000         2.000000      4.000000
25%     8888.000000      4.000000      9.000000   1984.000000      5.000000        21.000000     20.000000
50%    17775.000000      6.000000     16.000000   1990.000000     11.000000        32.000000     37.000000
75%    26662.000000      9.000000     23.000000   1997.000000     17.000000        36.000000     48.000000
max    35549.000000     12.000000     31.000000   2002.000000     24.000000        70.000000    280.000000
R comparison:

See the similarities and differences with the R `data.frame` - e.g. you would use `summary(df)` instead of `df.describe()` :-)


In [210]:
surveys_df["weight"].hist(bins=20)


Out[210]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3792e84a8>

Series and DataFrames

A Pandas Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:


In [175]:
a_series = pd.Series([0.1, 0.2, 0.3, 0.4])
a_series


Out[175]:
0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

In [176]:
a_series.index, a_series.values


Out[176]:
(RangeIndex(start=0, stop=4, step=1), array([ 0.1,  0.2,  0.3,  0.4]))

A Series has an index and values (a numpy array!), and you can give the series a name (amongst other things):


In [177]:
a_series.name = "example_series"

In [178]:
a_series


Out[178]:
0    0.1
1    0.2
2    0.3
3    0.4
Name: example_series, dtype: float64

In [179]:
a_series[2]


Out[179]:
0.29999999999999999

Unlike the NumPy array, though, this index can be something other than integers:


In [180]:
a_series2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
a_series2['c']


Out[180]:
2

A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.
Note that in the IPython notebook, the dataframe will display in a rich HTML view:


In [181]:
surveys_df.head()


Out[181]:
   record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight
0          1      7   16  1977        2          NL    M             32.0     NaN
1          2      7   16  1977        3          NL    M             33.0     NaN
2          3      7   16  1977        2          DM    F             37.0     NaN
3          4      7   16  1977        7          DM    M             36.0     NaN
4          5      7   16  1977        3          DM    M             35.0     NaN
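
To make the "multiple Series sharing the same index" picture concrete, a minimal sketch building a small DataFrame from two Series (hypothetical values):


In [ ]:
length = pd.Series([32.0, 33.0, 37.0], name='hindfoot_length')
weight = pd.Series([40.0, 48.0, 45.0], name='weight')
pd.DataFrame({'hindfoot_length': length, 'weight': weight})  # both Series share the index 0, 1, 2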

In [184]:
surveys_df["species_id"].head()


Out[184]:
0    NL
1    NL
2    DM
3    DM
4    DM
Name: species_id, dtype: object

If you select a single column of a DataFrame, you end up with... a Series:


In [186]:
type(surveys_df), type(surveys_df["species_id"])


Out[186]:
(pandas.core.frame.DataFrame, pandas.core.series.Series)

Aggregation and element-wise calculations

Completely similar to Numpy, aggregation statistics are available:


In [224]:
print('Mean weight is', surveys_df["weight"].mean())
print('Median weight is',  surveys_df["weight"].median())
print('Std of weight is', surveys_df["weight"].std())
print('Variance of weight is', surveys_df["weight"].var())
print('Min is', surveys_df["weight"].min())
print('Element of minimum value is', surveys_df["weight"].argmin())
print('Max is', surveys_df["weight"].max())
print('Sum is', surveys_df["weight"].sum())
print('85% Percentile value is: ', surveys_df["weight"].quantile(0.85))


Mean weight is 42.672428213
Median weight is 37.0
Std of weight is 36.6312594746
Variance of weight is 1341.84917069
Min is 4.0
Element of minimum value is 217
Max is 280.0
Sum is 1377594.0
85% Percentile value is:  53.0

Calculations are element-wise, e.g. adding the normalized weight (relative to its mean) as an additional column:


In [225]:
surveys_df['weight_normalised'] = surveys_df["weight"]/surveys_df["weight"].mean()

Pandas and Numpy collaborate well (Numpy functions can be applied to DataFrame columns, as the underlying values are actually numpy arrays):


In [257]:
np.sqrt(surveys_df["hindfoot_length"]).head()


Out[257]:
0    5.656854
1    5.744563
2    6.082763
3    6.000000
4    5.916080
Name: hindfoot_length, dtype: float64

Groupby provides the functionality to do an aggregation or calculation for each group:


In [259]:
surveys_df.groupby('sex')[['hindfoot_length', 'weight']].mean() # Try yourself with min, max,...


Out[259]:
     hindfoot_length     weight
sex
F          28.836780  42.170555
M          29.709578  42.995379
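
The same groupby construction also accepts several aggregation functions at once via agg (a small sketch; any of the statistics used above will work):


In [ ]:
surveys_df.groupby('sex')['weight'].agg(['count', 'mean', 'max'])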
R comparison:

Similar to grouping operations in R, i.e. working with factors

Slicing

ATTENTION!:

One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy.

We now have to distinguish between:
  • selection by **label**: loc
  • selection by **position**: iloc

In [308]:
# example dataframe from scratch
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries = countries.set_index('country')
countries


Out[308]:
                  area    capital  population
country
Belgium          30510   Brussels        11.3
France          671308      Paris        64.3
Germany         357050     Berlin        81.3
Netherlands      41526  Amsterdam        16.9
United Kingdom  244820     London        64.9

The shortcut []


In [283]:
countries['area'] # single []


Out[283]:
country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64

In [284]:
countries[['area', 'population']] # double [[]]


Out[284]:
                  area  population
country
Belgium          30510        11.3
France          671308        64.3
Germany         357050        81.3
Netherlands      41526        16.9
United Kingdom  244820        64.9

In [285]:
countries['France':'Netherlands']


Out[285]:
               area    capital  population
country
France       671308      Paris        64.3
Germany      357050     Berlin        81.3
Netherlands   41526  Amsterdam        16.9

Systematic indexing with loc and iloc

When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:

  • loc: selection by label
  • iloc: selection by position

In [286]:
countries.loc['Germany', 'area']


Out[286]:
357050

In [287]:
countries.loc['France':'Germany', ['area', 'population']]


Out[287]:
           area  population
country
France   671308        64.3
Germany  357050        81.3

Selecting by position with iloc works similarly to indexing numpy arrays:


In [288]:
countries.iloc[0:2,1:3]


Out[288]:
          capital  population
country
Belgium  Brussels        11.3
France      Paris        64.3

Boolean indexing

In short, similar to Numpy:


In [289]:
countries['area'] > 100000


Out[289]:
country
Belgium           False
France             True
Germany            True
Netherlands       False
United Kingdom     True
Name: area, dtype: bool

Selecting by conditions:


In [305]:
countries[countries['area'] > 100000]


Out[305]:
                  area capital  population
country
France          671308   Paris        64.3
Germany         357050  Berlin        81.3
United Kingdom  244820  London        64.9

In [313]:
countries['size'] = np.nan # create an empty new column
countries


Out[313]:
                  area    capital  population  size
country
Belgium          30510   Brussels        11.3   NaN
France          671308      Paris        64.3   NaN
Germany         357050     Berlin        81.3   NaN
Netherlands      41526  Amsterdam        16.9   NaN
United Kingdom  244820     London        64.9   NaN

In [312]:
countries.loc[countries['area'] > 100000, "size"] = 'LARGE'
countries.loc[countries['area'] <= 100000, "size"] = 'SMALL'

In [304]:
countries


Out[304]:
                  area    capital  population   size
country
Belgium          30510   Brussels        11.3  SMALL
France          671308      Paris        64.3  LARGE
Germany         357050     Berlin        81.3  LARGE
Netherlands      41526  Amsterdam        16.9  SMALL
United Kingdom  244820     London        64.9  LARGE

Combining DataFrames (!)

An important way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id), as is done in databases. Combining DataFrames using a common field is called joining. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.

As an example, consider the availability of the species information in a separate lookup-table:


In [293]:
species_df = pd.read_csv("../data/species.csv", delimiter=";")
EXERCISE:
  • Check the other `read_` functions that are available in the Pandas package yourself.

In [294]:
species_df.head()


Out[294]:
   species_id             genus          species                 taxa
0          AB        Amphispiza        bilineata                 Bird
1          AH  Ammospermophilus          harrisi  Rodent-not censused
2          AS        Ammodramus       savannarum                 Bird
3          BA           Baiomys          taylori               Rodent
4          CB   Campylorhynchus  brunneicapillus                 Bird

In [295]:
surveys_df.head()


Out[295]:
   record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight  weight_normalised
0          1      7   16  1977        2          NL    M             32.0     NaN                NaN
1          2      7   16  1977        3          NL    M             33.0     NaN                NaN
2          3      7   16  1977        2          DM    F             37.0     NaN                NaN
3          4      7   16  1977        7          DM    M             36.0     NaN                NaN
4          5      7   16  1977        3          DM    M             35.0     NaN                NaN

We see that both tables have a common identifier column (species_id), which we can use to join the two tables together with the command merge:


In [296]:
merged_left = pd.merge(surveys_df, species_df, how="left", on="species_id")

In [297]:
merged_left.head()


Out[297]:
   record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight  weight_normalised      genus   species    taxa
0          1      7   16  1977        2          NL    M             32.0     NaN                NaN    Neotoma  albigula  Rodent
1          2      7   16  1977        3          NL    M             33.0     NaN                NaN    Neotoma  albigula  Rodent
2          3      7   16  1977        2          DM    F             37.0     NaN                NaN  Dipodomys  merriami  Rodent
3          4      7   16  1977        7          DM    M             36.0     NaN                NaN  Dipodomys  merriami  Rodent
4          5      7   16  1977        3          DM    M             35.0     NaN                NaN  Dipodomys  merriami  Rodent
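
With how="left", every row of surveys_df is kept, even when its species_id has no match in the lookup table. As a quick sketch of what the how argument changes, compare with an "inner" join, which only keeps the matching records (the resulting row counts depend on the data files):


In [ ]:
merged_inner = pd.merge(surveys_df, species_df, how="inner", on="species_id")
len(merged_left), len(merged_inner)  # the inner join drops records without a species match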

Optional section: Pandas is great with time series


In [253]:
flowdata = pd.read_csv("../data/vmm_flowdata.csv", index_col=0, 
                       parse_dates=True)

In [254]:
flowdata.head()


Out[254]:
                      L06_347  LS06_347  LS06_348
Time
2009-01-01 00:00:00  0.137417  0.097500  0.016833
2009-01-01 03:00:00  0.131250  0.088833  0.016417
2009-01-01 06:00:00  0.113500  0.091250  0.016750
2009-01-01 09:00:00  0.135750  0.091500  0.016250
2009-01-01 12:00:00  0.140917  0.096167  0.017000
REMEMBER:
  • `pd.read_csv` provides a lot of built-in functionality to support this kind of date parsing and indexing when reading in a file! Check the **help** of the read_csv function...

The index provides many attributes to work with:


In [123]:
flowdata.index.year, flowdata.index.dayofweek, flowdata.index.dayofyear #,...


Out[123]:
(array([2009, 2009, 2009, ..., 2013, 2013, 2013], dtype=int32),
 array([3, 3, 3, ..., 1, 1, 2], dtype=int32),
 array([1, 1, 1, ..., 1, 1, 2], dtype=int32))
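
These attributes combine nicely with groupby, e.g. a sketch of the average flow per month of the year (aggregated over all years):


In [ ]:
flowdata.groupby(flowdata.index.month).mean().plot(kind='bar')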

Subselecting periods can be done by the string representation of dates:


In [124]:
flowdata["2012-01-01 09:00":"2012-01-04 19:00"].plot()


Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb37a3702b0>

or shorter when possible:


In [125]:
flowdata["2009"].plot()


Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb37a22a6a0>

Combinations with other selection criteria are possible, e.g. to get all months with 30 days in the year 2009:


In [126]:
flowdata.loc[(flowdata.index.days_in_month == 30) & (flowdata.index.year == 2009), "L06_347"].plot()


Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb37a06c160>

Select all 'daytime' data (between 8h and 20h) for all days, station "L06_347":


In [127]:
flowdata[(flowdata.index.hour > 8) & (flowdata.index.hour < 20)].head()
# OR USE flowdata.between_time('08:00', '20:00')


Out[127]:
                      L06_347  LS06_347  LS06_348
Time
2009-01-01 09:00:00  0.135750  0.091500  0.016250
2009-01-01 12:00:00  0.140917  0.096167  0.017000
2009-01-01 15:00:00  0.099167  0.091667  0.017583
2009-01-01 18:00:00  0.132667  0.090167  0.016250
2009-01-02 09:00:00  0.141917  0.097083  0.016417

A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).


In [128]:
flowdata.resample('A').mean().plot()


Out[128]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb379fb3470>

A practical example: plot the monthly minimum and maximum of the daily average values of the LS06_348 column:


In [129]:
daily = flowdata['LS06_348'].resample('D').mean() # calculate the daily average value
daily.resample('M').agg(['min', 'max']).plot()  # calculate the monthly minimum and maximum values


Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb379e7b6a0>

Other plots are supported as well, e.g. a bar plot of the mean of the stations in year 2013


In [131]:
flowdata['2013'].mean().plot(kind='barh')


Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb379df3a58>
