Scientific Python essentials
Introduction to GIS scripting
May 2017. © 2017, Stijn Van Hoey (stijnvanhoey@gmail.com). Licensed under the Creative Commons CC BY 4.0 license.
There is a large variety of packages available in Python to support research. Importing a package is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up (imported), it can be used or called to perform many tasks.
In this notebook, we will focus on two packages fundamental to most scientific applications: NumPy and Pandas.
Furthermore, any plotting that is required will be done with the matplotlib package (we only use plot and imshow in this tutorial):
In [136]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-white')
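The %matplotlib inline magic makes the figures appear directly below the code cells inside the notebook.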
NumPy is the fundamental package for scientific computing with Python.
In short: Numpy is the Python package for fast numerical calculations!
It is a community agreement to import the numpy package with the prefix np, to identify the usage of numpy functions. Use TAB completion (type np. and press TAB) to check the available functions of numpy:
In [2]:
import numpy as np
In [3]:
# np. # explore the namespace
Numpy provides many mathematical functions, which operate element-wise on the so-called numpy.ndarray data type (in short: array).
Suppose you are looking for a function to derive quantiles of an array:
In [4]:
np.lookfor("quantile")
There are different ways to read the manual:
In [5]:
#?np.percentile
In [6]:
# help(np.percentile)
In [7]:
# use SHIFT + TAB
In [8]:
throws = 1000 # number of rolls of the pair of dice
stone1 = np.random.randint(1, 7, throws) # outcomes of the rolls of dice 1 (integers 1 to 6)
stone2 = np.random.randint(1, 7, throws) # outcomes of the rolls of dice 2
total = stone1 + stone2 # sum of each outcome
histogram = plt.hist(total, bins=np.arange(1.5, 13.5)) # histogram with one bin per possible sum (2 to 12)
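Note that these are random draws, so the histogram will look slightly different on each run. If reproducible results are preferred, the random generator can be seeded first (an optional step, not part of the original example):
np.random.seed(42) # fix the seed so repeated runs give identical 'random' outcomes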
In [9]:
# random numbers (X, Y in 2 columns)
Z = np.random.random((10,2))
X, Y = Z[:,0], Z[:,1]
# Distance
R = np.sqrt(X**2 + Y**2)
# Angle
T = np.arctan2(Y, X) # Array of angles in radians
Tdegree = T*180/(np.pi) # If you like degrees more
# NEXT PART (purely for illustration)
# plot the cartesian coordinates
plt.figure(figsize=(14, 6))
ax1 = plt.subplot(121)
ax1.plot(Z[:,0], Z[:,1], 'o')
ax1.set_title("Cartesian")
# plot the polar coordinates
ax2 = plt.subplot(122, polar=True)
ax2.plot(T, R, 'o')
ax2.set_title("Polar")
Out[9]:
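As an aside, numpy also has a built-in function for the radians-to-degrees conversion, so the manual multiplication above could equally be written as:
Tdegree = np.degrees(T) # equivalent to T*180/np.pi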
In [10]:
nete_bodem = np.load("../data/nete_bodem.npy")
plt.imshow(nete_bodem)
plt.colorbar(shrink=0.6)
Out[10]:
In [11]:
nete_bodem_rescaled = (nete_bodem - nete_bodem.min())/(nete_bodem.max() - nete_bodem.min()) # rescale
nete_bodem_rescaled[nete_bodem_rescaled == 0.0] = np.nan # set the zero values to NaN
In [12]:
plt.imshow(nete_bodem_rescaled)
plt.colorbar(shrink=0.6)
Out[12]:
(Remark: there is no GIS component in the previous manipulation; these are purely element-wise operations on an array!)
In [13]:
np.array([1, 1.5, 2, 2.5]) #np.array(anylist)
Out[13]:
One could compare the numpy array to the R vector. It contains a single data type (character, float, integer) and operations are element-wise.
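A minimal sketch of both properties, mixing types on purpose: the input is coerced to a single dtype and arithmetic is applied element-wise:
mixed = np.array([1, 2.5, 3]) # the integers are upcast; dtype becomes float64
print(mixed.dtype)
print(mixed + 1) # element-wise: [2. 3.5 4. ]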
Provide a range of values, with a begin, an end, and a step size:
In [14]:
np.arange(5, 12, 2)
Out[14]:
Provide evenly spaced values, with a begin, an end, and the total number of values:
In [15]:
np.linspace(2, 13, 3)
Out[15]:
Create arrays filled with zeros or with ones:
In [16]:
np.zeros((5, 2)), np.ones(5)
Out[16]:
Request the shape or the size of the arrays:
In [17]:
np.zeros((5, 2)).shape, np.zeros((5, 2)).size
Out[17]:
And creating random numbers:
In [18]:
np.random.rand(5,5) # check with np.random. + TAB for sampling from other distributions!
Out[18]:
Reading in data from a binary file:
In [19]:
nete_bodem = np.load("../data/nete_bodem.npy")
In [20]:
plt.imshow(nete_bodem)
Out[20]:
Reading in data from a text file:
In [21]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")
In [22]:
plt.imshow(nete_bodem_subset)
Out[22]:
Slicing a numpy array is equivalent to the slicing of a list:
In [23]:
my_array = np.random.randint(2, 10, 10)
my_array
Out[23]:
In [24]:
my_array[:5], my_array[4:], my_array[-2:]
Out[24]:
In [25]:
my_array[0:7:2]
Out[25]:
In [26]:
sequence = np.arange(0, 11, 1)
sequence, sequence[::2], sequence[1::3]
Out[26]:
Assign new values to items
In [27]:
my_array[:2] = 10
my_array
Out[27]:
In [28]:
my_array = my_array.reshape(5, 2)
my_array
Out[28]:
With multiple dimensions, we get the option to slice along each of these dimensions:
In [29]:
my_array[0, :]
Out[29]:
In [30]:
my_array = np.random.randint(2, 10, 10)
my_array
Out[30]:
In [31]:
print('Mean value is', np.mean(my_array))
print('Median value is', np.median(my_array))
print('Std is', np.std(my_array))
print('Variance is', np.var(my_array))
print('Min is', my_array.min())
print('Index of the minimum value is', my_array.argmin())
print('Max is', my_array.max())
print('Sum is', np.sum(my_array))
print('Prod', np.prod(my_array))
print('Unique values in this array are:', np.unique(my_array))
print('85th percentile value is: ', np.percentile(my_array, 85))
In [32]:
my_other_array = np.random.randint(2, 10, 10).reshape(2, 5)
my_other_array
Out[32]:
Use the argument axis to define the axis along which to calculate a specific statistic (axis=0 aggregates over the rows, giving one result per column; axis=1 aggregates over the columns):
In [33]:
my_other_array.max(), my_other_array.max(axis=1), my_other_array.max(axis=0)
Out[33]:
In [8]:
my_array = np.random.randint(2, 10, 10)
In [9]:
my_array
Out[9]:
In [10]:
print('Cumsum is', np.cumsum(my_array))
print('CumProd is', np.cumprod(my_array))
print('CumProd of the first 5 elements is', np.cumprod(my_array)[4])
In [11]:
np.exp(my_array), np.sin(my_array)
Out[11]:
In [12]:
my_array%3 # == 0
Out[12]:
Should you use the function from the numpy library or the method of the array object itself? Both give the same result:
In [13]:
np.cumsum(my_array) == my_array.cumsum()
Out[13]:
In [49]:
my_array.dtype
Out[49]:
In [17]:
my_array.cumsum()
Out[17]:
In [18]:
my_array.max(axis=0)
Out[18]:
In [19]:
my_array * my_array # element-wise
Out[19]:
What is the added value of the numpy implementation compared to 'basic' python?
In [20]:
a_list = range(1000)
%timeit [i**2 for i in a_list]
In [21]:
an_array = np.arange(1000)
%timeit an_array**2
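The speed-up comes from numpy storing the values in a contiguous block of memory with a fixed data type, so the squaring happens in compiled code instead of looping over Python objects one by one.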
Boolean indexing is a fancy term for making selections based on a condition!
Let's start with an array that contains random values:
In [45]:
row_array = np.random.randint(1, 20, 10)
row_array
Out[45]:
Conditions can be checked (element-wise):
In [46]:
row_array > 5
Out[46]:
In [47]:
boolean_mask = row_array > 5
boolean_mask
Out[47]:
You can use this as a filter to select elements of an array:
In [48]:
row_array[boolean_mask]
Out[48]:
or, also to change the values in the array corresponding to these conditions:
In [49]:
row_array[boolean_mask] = 20
row_array
Out[49]:
In short, setting the values that are equal to 20 to -20:
In [50]:
row_array[row_array == 20] = -20
row_array
Out[50]:
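A related helper worth knowing (not used further in this tutorial) is np.where, which builds a new array by choosing between two alternatives based on a condition:
np.where(row_array < 0, 0, row_array) # hypothetical clean-up: replace the negative values by 0, keep the rest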
This is similar to conditional filtering in R on vectors...
This requires some practice...
In [51]:
AR = np.random.randint(0, 20, 15)
AR
Out[51]:
In [52]:
sum(AR > 10)
Out[52]:
In [53]:
AR[AR%2 == 0] = 0
AR
Out[53]:
In [54]:
AR[1::2] = 30
AR
Out[54]:
In [55]:
AR2 = np.random.random(10)
AR2
Out[55]:
In [56]:
np.sqrt(AR2[AR2 > np.percentile(AR2, 75)])
Out[56]:
In [57]:
AR3 = np.array([-99., 2., 3., 6., 8, -99., 7., 5., 6., -99.])
In [58]:
AR3[AR3 == -99.] = np.nan
AR3
Out[58]:
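Once NaN values are present, regular aggregation functions propagate them; numpy provides NaN-aware variants for exactly this situation:
print(np.mean(AR3)) # nan, because the missing values propagate
print(np.nanmean(AR3)) # mean calculated over the valid elements only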
In [59]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")
In [60]:
np.unique(nete_bodem_subset)
Out[60]:
In [61]:
nete_bodem_subset = np.loadtxt("../data/nete_bodem_subset.out")
In [62]:
nete_bodem_subset[nete_bodem_subset <= 100000.] = 0
nete_bodem_subset[nete_bodem_subset > 100000.] = 1
plt.imshow(nete_bodem_subset)
plt.colorbar(shrink=0.8)
Out[62]:
This is just touching the surface of Numpy in order to proceed to the next phase (Pandas and GeoPandas)...
More extensive material on Numpy is available online, e.g. in the official Numpy documentation.
For data-intensive work in Python, the Pandas library has become essential. The name Pandas is derived from 'panel data', although many users probably don't know that.
What is pandas?
Pandas documentation is available at: http://pandas.pydata.org/pandas-docs/stable/
In [22]:
# community agreement: import as pd
import pandas as pd
Reading data into a DataFrame:
In [202]:
surveys_df = pd.read_csv("../data/surveys.csv")
In [203]:
surveys_df.head() # Try also tail()
Out[203]:
In [204]:
surveys_df.shape
Out[204]:
In [205]:
surveys_df.columns
Out[205]:
In [206]:
surveys_df.info()
In [207]:
surveys_df.dtypes
Out[207]:
In [209]:
surveys_df.describe()
Out[209]:
See the similarities and differences with the R `data.frame` - e.g. you would use `summary(df)` instead of `df.describe()` :-)
In [210]:
surveys_df["weight"].hist(bins=20)
Out[210]:
A Pandas Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:
In [175]:
a_series = pd.Series([0.1, 0.2, 0.3, 0.4])
a_series
Out[175]:
In [176]:
a_series.index, a_series.values
Out[176]:
A Series has an index and values (a numpy array!), and you can give the series a name (amongst other things):
In [177]:
a_series.name = "example_series"
In [178]:
a_series
Out[178]:
In [179]:
a_series[2]
Out[179]:
Unlike the NumPy array, though, this index can be something other than integers:
In [180]:
a_series2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
a_series2['c']
Out[180]:
A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.
Note that in the IPython notebook, the dataframe will display in a rich HTML view:
In [181]:
surveys_df.head()
Out[181]:
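To make the "multiple Series sharing an index" picture concrete, here is a small made-up example that assembles a DataFrame from two Series:
population = pd.Series([11.3, 64.3, 81.3], index=['Belgium', 'France', 'Germany'])
area = pd.Series([30510, 671308, 357050], index=['Belgium', 'France', 'Germany'])
pd.DataFrame({'population': population, 'area': area}) # the columns are aligned on the shared index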
In [184]:
surveys_df["species_id"].head()
Out[184]:
If you select a single column of a DataFrame, you end up with... a Series:
In [186]:
type(surveys_df), type(surveys_df["species_id"])
Out[186]:
Just as in Numpy, aggregation statistics are available:
In [224]:
print('Mean weight is', surveys_df["weight"].mean())
print('Median weight is', surveys_df["weight"].median())
print('Std of weight is', surveys_df["weight"].std())
print('Variance of weight is', surveys_df["weight"].var())
print('Min is', surveys_df["weight"].min())
print('Index of the minimum value is', surveys_df["weight"].argmin())
print('Max is', surveys_df["weight"].max())
print('Sum is', surveys_df["weight"].sum())
print('85th percentile value is: ', surveys_df["weight"].quantile(0.85))
Calculations are element-wise, e.g. adding the normalized weight (relative to its mean) as an additional column:
In [225]:
surveys_df['weight_normalised'] = surveys_df["weight"]/surveys_df["weight"].mean()
Pandas and Numpy collaborate well (Numpy methods can be applied on the DataFrame values, as these are actually numpy arrays):
In [257]:
np.sqrt(surveys_df["hindfoot_length"]).head()
Out[257]:
Groupby provides the functionality to do an aggregation or calculation for each group:
In [259]:
surveys_df.groupby('sex')[['hindfoot_length', 'weight']].mean() # Try yourself with min, max,...
Out[259]:
This is similar to groupby in R, i.e. working with factors.
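groupby combines naturally with agg, which applies several statistics in one go; a sketch of this pattern:
surveys_df.groupby('sex')['weight'].agg(['mean', 'median', 'count'])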
In [308]:
# example dataframe from scratch
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries = countries.set_index('country')
countries
Out[308]:
In [283]:
countries['area'] # single []
Out[283]:
In [284]:
countries[['area', 'population']] # double [[]]
Out[284]:
In [285]:
countries['France':'Netherlands']
Out[285]:
When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
loc: selection by label
iloc: selection by position
In [286]:
countries.loc['Germany', 'area']
Out[286]:
In [287]:
countries.loc['France':'Germany', ['area', 'population']]
Out[287]:
Selecting by position with iloc works similarly to indexing numpy arrays:
In [288]:
countries.iloc[0:2,1:3]
Out[288]:
In short, similar to Numpy:
In [289]:
countries['area'] > 100000
Out[289]:
Selecting by conditions:
In [305]:
countries[countries['area'] > 100000]
Out[305]:
In [313]:
countries['size'] = np.nan # create an empty new column
countries
Out[313]:
In [312]:
countries.loc[countries['area'] > 100000, "size"] = 'LARGE'
countries.loc[countries['area'] <= 100000, "size"] = 'SMALL'
In [304]:
countries
Out[304]:
An important way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id), as is done in databases. Combining DataFrames using a common field is called joining. Joining DataFrames in this way is often useful when one DataFrame is a "lookup table" containing additional data that we want to include in the other.
As an example, consider the availability of the species information in a separate lookup-table:
In [293]:
species_df = pd.read_csv("../data/species.csv", delimiter=";")
In [294]:
species_df.head()
Out[294]:
In [295]:
surveys_df.head()
Out[295]:
We see that both tables do have a common identifier column (species_id), which we can use to join the two tables together with the command merge:
In [296]:
merged_left = pd.merge(surveys_df, species_df, how="left", on="species_id")
In [297]:
merged_left.head()
Out[297]:
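With how="left" all records of surveys_df are kept, and NaN is filled in where a species_id has no match in the lookup table. The other join types follow the usual database conventions; as a sketch, an inner join only keeps the rows with a match in both tables:
merged_inner = pd.merge(surveys_df, species_df, how="inner", on="species_id")
merged_inner.shape # fewer rows than merged_left whenever some species_id values have no match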
Pandas also provides excellent support for time series. By parsing the dates while reading (parse_dates=True) and using the date column as index, we get a DataFrame with a DatetimeIndex:
In [253]:
flowdata = pd.read_csv("../data/vmm_flowdata.csv", index_col=0,
parse_dates=True)
In [254]:
flowdata.head()
Out[254]:
The index provides many attributes to work with:
In [123]:
flowdata.index.year, flowdata.index.dayofweek, flowdata.index.dayofyear #,...
Out[123]:
Subselecting periods can be done by the string representation of dates:
In [124]:
flowdata["2012-01-01 09:00":"2012-01-04 19:00"].plot()
Out[124]:
or shorter when possible:
In [125]:
flowdata["2009"].plot()
Out[125]:
Combinations with other selection criteria are possible, e.g. to get all months with 30 days in the year 2009:
In [126]:
flowdata.loc[(flowdata.index.days_in_month == 30) & (flowdata.index.year == 2009), "L06_347"].plot()
Out[126]:
Select all 'daytime' data (between 8h and 20h) for all days, station "L06_347":
In [127]:
flowdata[(flowdata.index.hour >= 8) & (flowdata.index.hour < 20)].head()
# OR USE flowdata.between_time('08:00', '20:00')
Out[127]:
A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).
In [128]:
flowdata.resample('A').mean().plot()
Out[128]:
A practical example is: plot the monthly minimum and maximum of the daily average values of the LS06_348 column:
In [129]:
daily = flowdata['LS06_348'].resample('D').mean() # calculate the daily average value
daily.resample('M').agg(['min', 'max']).plot() # calculate the monthly minimum and maximum values
Out[129]:
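The frequency strings used here ('D' for daily, 'M' for month end, 'A' for year end) are pandas offset aliases; many more resampling frequencies are documented in the pandas documentation.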
Other plots are supported as well, e.g. a bar plot of the mean of the stations in the year 2013:
In [131]:
flowdata['2013'].mean().plot(kind='barh')
Out[131]:
Acknowledgments and Material