Session 3: Python Data Analysis World

Python has a very strong community in the data analytics and scientific computing world. There are many great Python packages supporting different analyses, but a few are particularly important:

  • Workhorses - numpy, pandas, scipy
  • Spatial tools - shapely, ogr/gdal, geopandas, etc.
  • Environments and Visualisation - Jupyter, matplotlib

You will have access to all of these after installing Anaconda and the additional packages described in Session 0. (The additional packages relate to spatial analysis; you can skip them if you don't need them.)

Where possible, veneer-py functions will accept and return objects that are directly usable by these packages. In particular, time series and other tabular data structures are returned as pandas DataFrame objects.

This session gives very brief introductions to most of these packages. In most cases, the links in Session 0 are relevant for more information.

numpy

numpy represents multi-dimensional arrays and operations on those arrays. The arrays are typed (e.g. float, double-precision float, integer) and are indexed by integers (one per dimension).

In veneer-py, we use pandas DataFrames more than numpy arrays, but the basic array operations in numpy are the foundation on which pandas is built.
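The typed, integer-indexed nature of numpy arrays can be sketched with a couple of lines (a minimal illustration, separate from the veneer-py workflow):

```python
import numpy as np

# Arrays carry a single element type (dtype) shared by all entries
a = np.array([1.5, 2.5, 3.5])
print(a.dtype)        # float64

# Indexing is by integer, one index per dimension
m = np.arange(12).reshape(3, 4)  # a 3x4 array holding 0..11
print(m[1, 2])        # element at row 1, column 2 -> 6
```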

You can create an array of random numbers using the functions under the np.random namespace. The following example creates 100 random floats drawn from a normal distribution.

Note: numpy is typically imported as np.


In [3]:
import numpy as np
random = np.random.normal(size=100)
random


Out[3]:
array([-0.32846804,  0.24233466,  0.56216504,  1.61970292,  1.61387102,
       -0.20260723,  0.40419901,  1.04339317, -0.04147032, -0.23362834,
       -0.33834912, -0.3403258 , -0.92226102,  0.93534043,  1.34859119,
       -2.00442134, -0.44725056,  0.87007699, -1.05769217, -0.64963042,
        0.44078006, -1.27336806,  2.04514494, -0.48511948, -0.47662417,
       -0.82265906, -2.30960877, -0.61543794,  0.81831407, -0.13247048,
        0.33658916,  0.98143053,  0.90600701, -1.26208505,  0.53828062,
        0.95511353, -0.50796315,  1.20768909, -0.65747147,  1.29378467,
       -0.99321247,  0.05004184,  0.64125027, -0.33449546,  1.61972184,
        0.56414875, -1.05181545, -0.6281049 , -0.25006024, -0.40256336,
        0.60213267, -0.44373157, -1.1478386 , -0.10437203,  0.17834442,
       -0.70958483, -1.76063764, -0.30497265,  0.75522514, -0.29467826,
        0.67125189,  1.17539477,  0.98764732, -0.35971512, -0.33272206,
        1.44759658, -0.6425866 , -2.64347229,  0.84533837,  0.94233639,
        0.91725454,  0.59159355, -0.37466308, -0.35004473,  0.50183837,
       -0.38612063, -0.9689832 , -2.25973279,  0.97413203,  0.26193202,
        1.29195804,  0.52753274,  0.79260475, -0.5766819 , -0.4571203 ,
        1.03198971, -0.44982421, -1.55817231,  0.20016313,  0.23892575,
        1.93247695,  0.6686423 ,  1.7152531 , -2.16294   , -1.2162558 ,
        0.57183848, -1.1964297 ,  0.13300458, -0.79852854, -0.26931035])

By default, the functions in np.random return one-dimensional arrays. You can check an array's shape with .shape and change it with .reshape()


In [4]:
random.shape


Out[4]:
(100,)

In [6]:
threed = random.reshape(10,5,2)
threed


Out[6]:
array([[[-0.32846804,  0.24233466],
        [ 0.56216504,  1.61970292],
        [ 1.61387102, -0.20260723],
        [ 0.40419901,  1.04339317],
        [-0.04147032, -0.23362834]],

       [[-0.33834912, -0.3403258 ],
        [-0.92226102,  0.93534043],
        [ 1.34859119, -2.00442134],
        [-0.44725056,  0.87007699],
        [-1.05769217, -0.64963042]],

       [[ 0.44078006, -1.27336806],
        [ 2.04514494, -0.48511948],
        [-0.47662417, -0.82265906],
        [-2.30960877, -0.61543794],
        [ 0.81831407, -0.13247048]],

       [[ 0.33658916,  0.98143053],
        [ 0.90600701, -1.26208505],
        [ 0.53828062,  0.95511353],
        [-0.50796315,  1.20768909],
        [-0.65747147,  1.29378467]],

       [[-0.99321247,  0.05004184],
        [ 0.64125027, -0.33449546],
        [ 1.61972184,  0.56414875],
        [-1.05181545, -0.6281049 ],
        [-0.25006024, -0.40256336]],

       [[ 0.60213267, -0.44373157],
        [-1.1478386 , -0.10437203],
        [ 0.17834442, -0.70958483],
        [-1.76063764, -0.30497265],
        [ 0.75522514, -0.29467826]],

       [[ 0.67125189,  1.17539477],
        [ 0.98764732, -0.35971512],
        [-0.33272206,  1.44759658],
        [-0.6425866 , -2.64347229],
        [ 0.84533837,  0.94233639]],

       [[ 0.91725454,  0.59159355],
        [-0.37466308, -0.35004473],
        [ 0.50183837, -0.38612063],
        [-0.9689832 , -2.25973279],
        [ 0.97413203,  0.26193202]],

       [[ 1.29195804,  0.52753274],
        [ 0.79260475, -0.5766819 ],
        [-0.4571203 ,  1.03198971],
        [-0.44982421, -1.55817231],
        [ 0.20016313,  0.23892575]],

       [[ 1.93247695,  0.6686423 ],
        [ 1.7152531 , -2.16294   ],
        [-1.2162558 ,  0.57183848],
        [-1.1964297 ,  0.13300458],
        [-0.79852854, -0.26931035]]])

You can perform basic arithmetic on arrays, using scalars or other arrays.

For example, given the following two arrays:


In [7]:
a1 = np.array([20.0,12.0,77.0,77.0])
a2 = np.array([25.0,6.0,80.0,80.0])

In [8]:
# You can add:

a1 + a2


Out[8]:
array([  45.,   18.,  157.,  157.])

In [9]:
# Multiply (element wise):

a1 * a2


Out[9]:
array([  500.,    72.,  6160.,  6160.])

In [10]:
# Compute a dot product

a1.dot(a2)


Out[10]:
12892.0

In [17]:
# You can also perform matrix operations
# First tell numpy that your array is a matrix,
# Then transpose to compatible shapes
# Then multiply
np.matrix(a1).transpose() * np.matrix(a2)


Out[17]:
matrix([[  500.,   120.,  1600.,  1600.],
        [  300.,    72.,   960.,   960.],
        [ 1925.,   462.,  6160.,  6160.],
        [ 1925.,   462.,  6160.,  6160.]])
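Note that np.matrix is discouraged in recent numpy releases in favour of plain arrays; the same outer product can be computed with np.outer, or with broadcasting. A minimal sketch, using the arrays from above:

```python
import numpy as np

a1 = np.array([20.0, 12.0, 77.0, 77.0])
a2 = np.array([25.0, 6.0, 80.0, 80.0])

# Outer product without np.matrix
outer = np.outer(a1, a2)

# Equivalent via broadcasting: a column vector times a row vector
outer_bc = a1[:, np.newaxis] * a2

print(outer[0])  # [ 500.  120. 1600. 1600.]
```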


pandas

Pandas DataFrame objects are one of the key data types used in veneer-py.

A DataFrame is a tabular, two-dimensional data structure, which can be indexed in a range of ways, including by date or date/time. DataFrames are arranged in named columns, each with a particular type (e.g. double, string, integer), and in this sense they are more flexible than numpy arrays.

Each column in a DataFrame is a pandas Series, which is useful in its own right.
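To see the DataFrame/Series relationship concretely, here is a minimal sketch using a small hand-made DataFrame (the values and site names are illustrative, not model output):

```python
import pandas as pd

# A small DataFrame with a date index and named, typed columns
df = pd.DataFrame({
    'flow': [514.15, 483.35, 1815.10],
    'site': ['SR2', 'SR3', 'SP'],
}, index=pd.date_range('1901-01-01', periods=3))

# Selecting a column by name returns a pandas Series
flow = df['flow']
print(type(flow).__name__)  # Series
print(flow.max())           # 1815.1
```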


In [20]:
import veneer
v = veneer.Veneer(port=9876)
downstream_flow_vol = v.retrieve_multiple_time_series(criteria={'RecordingVariable':'Downstream Flow Volume'})


*** /runs/latest ***
*** /runs/1/location/Default Link 4/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/SR2/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/SR3/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/Default Link 5/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/Ungauged Inflow/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/Storage/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/SP/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/MFR/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/Water User/element/Downstream Flow Volume/variable/Downstream Flow Volume ***
*** /runs/1/location/End Of System/element/Downstream Flow Volume/variable/Downstream Flow Volume ***

Pandas DataFrames have a tabular presentation in the Jupyter notebook.

It's also possible to slice subsets of rows:


In [22]:
downstream_flow_vol[0:10] # <-- Look at first 10 rows (timesteps)


Out[22]:
Default Link #4:Downstream Flow Volume Default Link #5:Downstream Flow Volume End Of System:Downstream Flow Volume MFR:Downstream Flow Volume SP:Downstream Flow Volume SR2:Downstream Flow Volume SR3:Downstream Flow Volume Storage:Downstream Flow Volume Ungauged Inflow:Downstream Flow Volume Water User:Downstream Flow Volume
1901-01-01 0.000000 514.15 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 514.15 0
1901-01-02 0.000000 483.35 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 483.35 0
1901-01-03 6349.027111 1815.10 6349.027111 6349.027111 6349.027111 6349.027111 6349.027111 6349.027111 1815.10 0
1901-01-04 0.000000 2857.05 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2857.05 0
1901-01-05 0.000000 2763.25 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2763.25 0
1901-01-06 316601.314118 11478.60 316601.314118 316601.314118 316601.314118 316601.314118 316601.314118 316601.314118 11478.60 0
1901-01-07 0.000000 8350.30 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8350.30 0
1901-01-08 0.000000 5412.40 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5412.40 0
1901-01-09 0.000000 3761.10 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3761.10 0
1901-01-10 0.000000 2825.20 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2825.20 0

In [27]:
downstream_flow_vol[0::3000] # <-- Look at every 3000th timestep


Out[27]:
Default Link #4:Downstream Flow Volume Default Link #5:Downstream Flow Volume End Of System:Downstream Flow Volume MFR:Downstream Flow Volume SP:Downstream Flow Volume SR2:Downstream Flow Volume SR3:Downstream Flow Volume Storage:Downstream Flow Volume Ungauged Inflow:Downstream Flow Volume Water User:Downstream Flow Volume
1901-01-01 0.0 514.15 0.0 0.0 0.0 0.0 0.0 0.0 514.15 0
1909-03-20 0.0 397.60 0.0 0.0 0.0 0.0 0.0 0.0 397.60 0
1917-06-06 0.0 3583.30 0.0 0.0 0.0 0.0 0.0 0.0 3583.30 0
1925-08-23 0.0 23529.80 0.0 0.0 0.0 0.0 0.0 0.0 23529.80 0
1933-11-09 0.0 2212.35 0.0 0.0 0.0 0.0 0.0 0.0 2212.35 0
1942-01-26 0.0 807.80 0.0 0.0 0.0 0.0 0.0 0.0 807.80 0
1950-04-14 0.0 1311.80 0.0 0.0 0.0 0.0 0.0 0.0 1311.80 0
1958-07-01 0.0 11866.05 0.0 0.0 0.0 0.0 0.0 0.0 11866.05 0
1966-09-17 0.0 4036.90 0.0 0.0 0.0 0.0 0.0 0.0 4036.90 0
1974-12-04 0.0 947.45 0.0 0.0 0.0 0.0 0.0 0.0 947.45 0
1983-02-20 0.0 2293.20 0.0 0.0 0.0 0.0 0.0 0.0 2293.20 0
1991-05-09 0.0 963.55 0.0 0.0 0.0 0.0 0.0 0.0 963.55 0
1999-07-26 0.0 10372.60 0.0 0.0 0.0 0.0 0.0 0.0 10372.60 0
2007-10-12 0.0 3394.30 0.0 0.0 0.0 0.0 0.0 0.0 3394.30 0

You can quickly get stats for each column in a DataFrame:


In [29]:
downstream_flow_vol.mean()


Out[29]:
Default Link #4:Downstream Flow Volume     377.669594
Default Link #5:Downstream Flow Volume    5315.220367
End Of System:Downstream Flow Volume       377.669594
MFR:Downstream Flow Volume                 377.669594
SP:Downstream Flow Volume                  377.669594
SR2:Downstream Flow Volume                 377.669594
SR3:Downstream Flow Volume                 377.669594
Storage:Downstream Flow Volume             377.669594
Ungauged Inflow:Downstream Flow Volume    5315.220367
Water User:Downstream Flow Volume            0.000000
dtype: float64

You can get the same stats along rows:


In [31]:
downstream_flow_vol.mean(axis=1)


Out[31]:
1901-01-01       102.830000
1901-01-02        96.670000
1901-01-03      4807.338978
1901-01-04       571.410000
1901-01-05       552.650000
1901-01-06    223916.639883
1901-01-07      1670.060000
1901-01-08      1082.480000
1901-01-09       752.220000
1901-01-10       565.040000
1901-01-11       457.520000
1901-01-12       394.310000
1901-01-13       355.810000
1901-01-14       331.100000
1901-01-15       314.230000
1901-01-16       301.700000
1901-01-17       291.690000
1901-01-18      1587.390000
1901-01-19      1158.920000
1901-01-20       810.810000
1901-01-21       613.340000
1901-01-22       499.730000
1901-01-23       432.740000
1901-01-24       391.860000
1901-01-25       365.470000
1901-01-26       347.270000
1901-01-27       333.690000
1901-01-28       322.770000
1901-01-29       313.390000
1901-01-30       304.990000
                  ...      
2013-12-02       297.430000
2013-12-03       290.290000
2013-12-04       283.360000
2013-12-05       276.500000
2013-12-06       269.850000
2013-12-07       263.410000
2013-12-08       257.040000
2013-12-09       250.880000
2013-12-10       244.860000
2013-12-11       238.980000
2013-12-12       233.240000
2013-12-13       227.570000
2013-12-14       222.110000
2013-12-15       216.790000
2013-12-16       211.540000
2013-12-17       206.430000
2013-12-18       201.460000
2013-12-19       196.630000
2013-12-20       191.870000
2013-12-21       187.250000
2013-12-22       182.700000
2013-12-23       178.290000
2013-12-24       174.020000
2013-12-25       169.820000
2013-12-26       165.690000
2013-12-27       161.700000
2013-12-28       157.780000
2013-12-29       153.930000
2013-12-30       150.220000
2013-12-31       146.580000
dtype: float64
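mean() is one of several built-in summaries; describe() bundles count, mean, standard deviation and quartiles into a single call. A minimal sketch on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10.0, 20.0, 30.0, 40.0]})

# describe() returns count, mean, std, min, quartiles and max per column
stats = df.describe()
print(stats.loc['mean', 'a'])  # 2.5
print(stats.loc['max', 'b'])   # 40.0
```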


Jupyter and visualisation

It is worth spending some time exploring the capabilities of the Jupyter notebook.

In terms of managing your work:

  • The Edit and Insert menus have useful functions for rearranging cells, creating new cells, etc
  • The Cell menu has functions for running all cells in a notebook, all cells above a particular point and all cells below a point.
  • The Kernel menu controls the execution and lifecycle of the Python session. (Here, Kernel refers to the IPython session connected to the notebook. The Restart command clears all variables, even though earlier output remains visible in the notebook.)

At this stage, most visualisation in Python notebooks is handled by matplotlib.

Matplotlib is powerful, but the learning curve can be steep.


In [38]:
import matplotlib.pyplot as plt
%matplotlib inline

Typically, you'll create a single plot from a single cell.


In [40]:
plt.hist(np.random.normal(size=500))


Out[40]:
(array([  12.,   28.,   67.,   95.,  100.,   94.,   68.,   19.,   13.,    4.]),
 array([-2.56560177, -1.99247094, -1.41934011, -0.84620928, -0.27307845,
         0.30005238,  0.87318321,  1.44631404,  2.01944487,  2.5925757 ,
         3.16570653]),
 <a list of 10 Patch objects>)

The matplotlib subplots functionality also allows you to create matrices of plots.


In [51]:
methods=[np.random.uniform,np.random.normal,np.random.exponential]

n=len(methods)

# Create n sets of random numbers, where n is the number of methods specified
random_sets = [method(size=1000) for method in methods]

for i in range(n):
    # Arrange subplots 2 rows x 3 columns
    # Access the i'th column on the first row
    ax = plt.subplot(2,3,i+1)
    # Plot the random numbers
    ax.plot(random_sets[i])
    # Access the i'th column on the second row
    ax = plt.subplot(2,3,n+i+1)
    # Plot a histogram of the corresponding numbers
    ax.hist(random_sets[i])


