SQL-type operations

If you know something about relational databases and SQL, you may have heard of JOIN and GROUP BY. pandas offers analogous operations on DataFrames and Series.


In [1]:
import pandas as pd

Joining


In [2]:
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)
gl = pd.read_csv('../data/co2-mm-gl.csv', na_values=-99.99, index_col='Date', parse_dates=True)
# pd.read_csv('https://python.g-node.org/wiki/_media/co2-mm-mlo.csv')

In [3]:
ml, gl = mlo[['Average']], gl[['Average']]

ml.columns, gl.columns = ['Average_mlo'], ['Average_gl']

ml = ml[ml.index >= '1980-01']

ml, gl = ml.head(), gl.head()

We can concatenate the two DataFrames.


In [4]:
pd.concat([ml, gl])


Out[4]:
Average_gl Average_mlo
Date
1980-01-01 NaN 337.90
1980-02-01 NaN 338.34
1980-03-01 NaN 340.01
1980-04-01 NaN 340.93
1980-05-01 NaN 341.48
1980-01-01 338.45 NaN
1980-02-01 339.14 NaN
1980-03-01 339.46 NaN
1980-04-01 339.86 NaN
1980-05-01 340.30 NaN

Alternatively, the above can be obtained with the aptly named append() method (note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat() is the recommended way).


In [5]:
ml.append(gl)


Out[5]:
Average_gl Average_mlo
Date
1980-01-01 NaN 337.90
1980-02-01 NaN 338.34
1980-03-01 NaN 340.01
1980-04-01 NaN 340.93
1980-05-01 NaN 341.48
1980-01-01 338.45 NaN
1980-02-01 339.14 NaN
1980-03-01 339.46 NaN
1980-04-01 339.86 NaN
1980-05-01 340.30 NaN

By default, concat() concatenates along the rows (axis 0), which is what we just did: the two frames were stacked on top of each other. What we actually want here is to 'concatenate' along the columns (axis 1). That is really a join operation on the index (the key), Date.


In [6]:
pd.concat([ml, gl], axis=1)


Out[6]:
Average_mlo Average_gl
Date
1980-01-01 337.90 338.45
1980-02-01 338.34 339.14
1980-03-01 340.01 339.46
1980-04-01 340.93 339.86
1980-05-01 341.48 340.30
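As an aside (not in the original notebook), concat() also takes a join keyword controlling how the two indexes are aligned; here both frames share the same five dates, so the result is unchanged:

pd.concat([ml, gl], axis=1, join='inner')   # join='outer' (default) keeps the union of the indexes, 'inner' the intersection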

Indeed, we could alternatively use the join() method.


In [7]:
ml.join(gl)


Out[7]:
Average_mlo Average_gl
Date
1980-01-01 337.90 338.45
1980-02-01 338.34 339.14
1980-03-01 340.01 339.46
1980-04-01 340.93 339.86
1980-05-01 341.48 340.30

Wait! Which frame's index is used? By default, join() performs a left join, keeping the index of the calling frame. A JOIN can be left, right, outer, or inner (picture unions and intersections).


In [8]:
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)
mlo = mlo[['Average']]
mlo.head()


Out[8]:
Average
Date
1958-03-01 315.71
1958-04-01 317.45
1958-05-01 317.50
1958-06-01 NaN
1958-07-01 315.86

In [9]:
mlo.join(gl)


Out[9]:
Average Average_gl
Date
1958-03-01 315.71 NaN
1958-04-01 317.45 NaN
1958-05-01 317.50 NaN
1958-06-01 NaN NaN
1958-07-01 315.86 NaN
1958-08-01 314.93 NaN
1958-09-01 313.20 NaN
1958-10-01 NaN NaN
1958-11-01 313.33 NaN
1958-12-01 314.67 NaN
1959-01-01 315.62 NaN
1959-02-01 316.38 NaN
1959-03-01 316.71 NaN
1959-04-01 317.72 NaN
1959-05-01 318.29 NaN
1959-06-01 318.15 NaN
1959-07-01 316.54 NaN
1959-08-01 314.80 NaN
1959-09-01 313.84 NaN
1959-10-01 313.26 NaN
1959-11-01 314.80 NaN
1959-12-01 315.58 NaN
1960-01-01 316.43 NaN
1960-02-01 316.97 NaN
1960-03-01 317.58 NaN
1960-04-01 319.02 NaN
1960-05-01 320.03 NaN
1960-06-01 319.59 NaN
1960-07-01 318.18 NaN
1960-08-01 315.91 NaN
... ... ...
2014-07-01 399.10 NaN
2014-08-01 397.03 NaN
2014-09-01 395.38 NaN
2014-10-01 396.03 NaN
2014-11-01 397.28 NaN
2014-12-01 398.91 NaN
2015-01-01 399.98 NaN
2015-02-01 400.28 NaN
2015-03-01 401.54 NaN
2015-04-01 403.28 NaN
2015-05-01 403.96 NaN
2015-06-01 402.80 NaN
2015-07-01 401.31 NaN
2015-08-01 398.93 NaN
2015-09-01 397.63 NaN
2015-10-01 398.29 NaN
2015-11-01 400.16 NaN
2015-12-01 401.85 NaN
2016-01-01 402.52 NaN
2016-02-01 404.04 NaN
2016-03-01 404.83 NaN
2016-04-01 407.42 NaN
2016-05-01 407.70 NaN
2016-06-01 406.81 NaN
2016-07-01 404.39 NaN
2016-08-01 402.25 NaN
2016-09-01 401.03 NaN
2016-10-01 401.57 NaN
2016-11-01 403.53 NaN
2016-12-01 404.48 NaN

706 rows × 2 columns


In [10]:
mlo.join(gl, how='inner')


Out[10]:
Average Average_gl
Date
1980-01-01 337.90 338.45
1980-02-01 338.34 339.14
1980-03-01 340.01 339.46
1980-04-01 340.93 339.86
1980-05-01 341.48 340.30
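To make the four flavours explicit, here is a quick sketch (not from the original notebook; outputs omitted):

mlo.join(gl, how='left')    # default: keep the calling frame's (mlo's) index
mlo.join(gl, how='right')   # keep gl's index
mlo.join(gl, how='outer')   # union of the two indexes
mlo.join(gl, how='inner')   # intersection, as above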

Hands-on exercise

  1. How else, still using the join() method, could we obtain Average and Average_gl for dates between 1980-01 and 1980-05?

Grouping

We introduce the split-apply-combine approach:

  • split the data into groups;

  • apply a function to each group independently;

  • combine the results into an appropriate data structure.


In [11]:
z = pd.Series([0.5, 0.8, 0.6, 0.3], index=pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['x', 'y']))

In [12]:
z


Out[12]:
x  y
0  0    0.5
   1    0.8
1  0    0.6
   1    0.3
dtype: float64

In [13]:
z.groupby('x')


Out[13]:
<pandas.core.groupby.SeriesGroupBy object at 0x7f216e09c518>

In [14]:
z.groupby('x').apply(lambda u: u.min())


Out[14]:
x
0    0.5
1    0.3
dtype: float64

The function in question is an aggregation (for each group, return the minimum value). The length of the result is the number of different groups (here, number of unique x values). Aggregation reduces the size of the data structure.
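The same result can be spelled out step by step; the following sketch (not part of the original notebook) makes the three stages explicit:

pieces = {key: grp for key, grp in z.groupby('x')}          # split: one sub-Series per value of x
minima = {key: grp.min() for key, grp in pieces.items()}    # apply: aggregate each piece
pd.Series(minima).rename_axis('x')                          # combine: a Series indexed by x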


In [15]:
z.groupby(level=0).apply(lambda u: u.min())


Out[15]:
x
0    0.5
1    0.3
dtype: float64

If the aggregation function is available as a method of the GroupBy object, it can be called directly.


In [16]:
z.groupby('x').min()


Out[16]:
x
0    0.5
1    0.3
dtype: float64
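Several aggregations can also be requested at once with agg() (a brief aside, not in the original notebook):

z.groupby('x').agg(['min', 'max', 'mean'])   # one column per aggregation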

With a hierarchical index, the level parameter can even be passed directly to certain aggregation functions (this shortcut was deprecated in pandas 1.3 and removed in pandas 2.0; prefer groupby(level=...)).


In [17]:
z.min(level='x')


Out[17]:
x
0    0.5
1    0.3
dtype: float64

Counting the number of records in each group is also an aggregation.


In [18]:
z.groupby('x').size()


Out[18]:
x
0    2
1    2
dtype: int64

For each unique value of x (0 and 1), we have two entries.

The grouping object (here, x) is referred to as the key.
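A related aggregation, noted here as an aside (not in the original notebook), is count(), which only counts non-missing values in each group:

z.groupby('x').count()   # non-NaN values per group; equals size() here since z has no NaN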


In [19]:
z.groupby(['x', 'y']).size()


Out[19]:
x  y
0  0    1
   1    1
1  0    1
   1    1
dtype: int64

In [20]:
z.groupby('y').describe()


Out[20]:
count mean std min 25% 50% 75% max
y
0 2.0 0.55 0.070711 0.5 0.525 0.55 0.575 0.6
1 2.0 0.55 0.353553 0.3 0.425 0.55 0.675 0.8

In [21]:
import numpy as np

In [22]:
z.groupby('y').apply(lambda u: np.std(u, ddof=1))


Out[22]:
y
0    0.070711
1    0.353553
dtype: float64

In [23]:
z.groupby('y').apply(lambda u: np.std(u))


Out[23]:
y
0    0.05
1    0.25
dtype: float64

In [24]:
z.groupby('y').apply(np.std)


Out[24]:
y
0    0.05
1    0.25
dtype: float64

Filtering is another kind of operation that can be applied group-wise: groups for which the function returns False are dropped. Filtering may therefore reduce the size of the data structure.


In [25]:
z


Out[25]:
x  y
0  0    0.5
   1    0.8
1  0    0.6
   1    0.3
dtype: float64

In [26]:
z.groupby('y').apply(lambda u: u.min() > 0.4)


Out[26]:
y
0     True
1    False
dtype: bool

In [27]:
z.groupby('y').filter(lambda u: u.min() > 0.4)


Out[27]:
x  y
0  0    0.5
1  0    0.6
dtype: float64

The third kind of operation is transformation, where the size of the data structure is preserved.


In [28]:
z.groupby('y').transform(lambda u: u.min())


Out[28]:
x  y
0  0    0.5
   1    0.3
1  0    0.5
   1    0.3
dtype: float64
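As another illustration (a sketch, not from the original notebook), transform() can broadcast a group-level quantity back to every row, for example each element's share of its y-group total:

z.groupby('y').transform(lambda u: u / u.sum())   # same shape as z, values scaled within each y group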

Hands-on exercise

  1. Run z2 = z.append(pd.Series([1.5, 1.8, 1.6, 1.3], index=pd.MultiIndex.from_product([[2, 3], [0, 1]], names=['x', 'y']))) (on pandas ≥ 2.0, where Series.append() has been removed, build z2 with pd.concat instead). For each y group, return each element's deviation from the group mean.
  2. Consider the resulting series. For each x, return the mean.
  3. Filter out the negative x means.