If you know something about relational databases and SQL, you may have heard of JOIN and GROUP BY.
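For readers coming from SQL, the rough correspondence is JOIN ≈ join()/merge() and GROUP BY ≈ groupby(). A minimal sketch with two made-up tables (names and values chosen only for illustration):
import pandas as pd

a = pd.DataFrame({'key': [1, 2], 'u': [10, 20]})
b = pd.DataFrame({'key': [1, 2], 'v': [0.1, 0.2]})

# SQL: SELECT * FROM a JOIN b ON a.key = b.key
pd.merge(a, b, on='key')

# SQL: SELECT key, AVG(u) FROM a GROUP BY key
a.groupby('key')['u'].mean()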
In [1]:
import pandas as pd
In [2]:
# monthly CO2 data: Mauna Loa (mlo) and global (gl) series, indexed by date
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)
gl = pd.read_csv('../data/co2-mm-gl.csv', na_values=-99.99, index_col='Date', parse_dates=True)
# alternatively, read directly from the URL:
# pd.read_csv('https://python.g-node.org/wiki/_media/co2-mm-mlo.csv')
In [3]:
# keep only the 'Average' column and give each series a distinguishing name
ml, gl = mlo[['Average']], gl[['Average']]
ml.columns, gl.columns = ['Average_mlo'], ['Average_gl']
# restrict the Mauna Loa frame to 1980 onwards, then keep the first five rows of each
ml = ml[ml.index >= '1980-01']
ml, gl = ml.head(), gl.head()
We can concatenate the two DataFrames.
In [4]:
pd.concat([ml, gl])
Out[4]:
Alternatively, the same result can be obtained with the self-describing append() method (note that append() was deprecated in pandas 1.4 and removed in 2.0, where concat() is the recommended replacement).
In [5]:
ml.append(gl)
Out[5]:
By default, concat() concatenates along the rows (axis 0). To place the two series side by side we instead concatenate along the columns (axis 1); this is really a join operation on the shared index (or key), Date.
In [6]:
pd.concat([ml, gl], axis=1)
Out[6]:
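Note that concat() itself has a join parameter ('outer' by default); passing join='inner' would keep only the dates present in both frames. A small variant not shown in the original cells:
pd.concat([ml, gl], axis=1, join='inner')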
Indeed, we could alternatively use the join() method.
In [7]:
ml.join(gl)
Out[7]:
Wait! Which frame's index is used? By default, join() performs a left join, keeping the calling frame's index. A join can be left, right, outer, or inner (think of unions and intersections of the two indexes).
In [8]:
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)
mlo = mlo[['Average']]
mlo.head()
Out[8]:
In [9]:
mlo.join(gl)
Out[9]:
In [10]:
mlo.join(gl, how='inner')
Out[10]:
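So far we have seen the default left join (keep the calling frame's, i.e. mlo's, index) and an inner join (keep only the dates present in both). The remaining how options follow the same pattern; a quick sketch, outputs omitted:
mlo.join(gl, how='outer')   # union of the two indexes
mlo.join(gl, how='right')   # keep gl's index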
Exercise: using the join() method, could we obtain Average and Average_gl for dates between 1980-01 and 1980-05?

We now introduce the split-apply-combine approach:
split the data into groups;
apply a function to each group independently;
combine the results into an appropriate data structure.
In [11]:
# a toy Series with a two-level (hierarchical) index, levels named 'x' and 'y'
z = pd.Series([0.5, 0.8, 0.6, 0.3], index=pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['x', 'y']))
In [12]:
z
Out[12]:
In [13]:
z.groupby('x')
Out[13]:
In [14]:
z.groupby('x').apply(lambda u: u.min())
Out[14]:
The function in question is an aggregation: for each group, it returns the minimum value. The length of the result equals the number of distinct groups (here, the number of unique values of x). Aggregation reduces the size of the data structure.
In [15]:
z.groupby(level=0).apply(lambda u: u.min())
Out[15]:
If the GroupBy object provides the aggregation as a method, it can be called directly.
In [16]:
z.groupby('x').min()
Out[16]:
With a hierarchical index, the level parameter can even be passed directly to certain aggregation functions.
In [17]:
z.min(level='x')
Out[17]:
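Note that the level argument of reduction methods such as min() was deprecated in pandas 1.3 and removed in 2.0; in recent versions the equivalent is the groupby form we already used:
z.groupby(level='x').min()   # same result as z.min(level='x') in older pandas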
Counting the number of records in each group is also an aggregation.
In [18]:
z.groupby('x').size()
Out[18]:
For each unique value of x (0 and 1), we have two entries.
The grouping object (here, x) is referred to as the key.
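A closely related aggregation is count(), which, unlike size(), excludes missing values; for this small Series without missing data the two agree:
z.groupby('x').count()   # number of non-missing entries per group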
In [19]:
z.groupby(['x', 'y']).size()
Out[19]:
In [20]:
z.groupby('y').describe()
Out[20]:
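describe() computes a whole battery of statistics per group. To pick a custom set of aggregations, agg() can be used (a small sketch, not part of the original cells):
z.groupby('y').agg(['min', 'max', 'mean'])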
In [21]:
import numpy as np
In [22]:
z.groupby('y').apply(lambda u: np.std(u, ddof=1))
Out[22]:
In [23]:
z.groupby('y').apply(lambda u: np.std(u))
Out[23]:
In [24]:
z.groupby('y').apply(np.std)
Out[24]:
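The two results above differ because np.std defaults to ddof=0 (population standard deviation), while pandas' std() uses ddof=1 (sample standard deviation). The built-in groupby method therefore matches the ddof=1 call:
z.groupby('y').std()   # pandas std uses ddof=1, matching np.std(u, ddof=1)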
Filtering is another kind of operation that can be applied group by group. It may reduce the size of the data structure, since entire groups can be dropped.
In [25]:
z
Out[25]:
In [26]:
z.groupby('y').apply(lambda u: u.min() > 0.4)
Out[26]:
In [27]:
z.groupby('y').filter(lambda u: u.min() > 0.4)
Out[27]:
The third kind of operation is transformation, where the size of the data structure is preserved.
In [28]:
z.groupby('y').transform(lambda u: u.min())
Out[28]:
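To see that transform() preserves the shape while an aggregation reduces it, compare (a quick check, not from the original cells):
z.groupby('y').transform('min').shape   # (4,) -- same length as z
z.groupby('y').min().shape              # (2,) -- one value per group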
Exercise: create an extended series
z2 = z.append(pd.Series([1.5, 1.8, 1.6, 1.3], index=pd.MultiIndex.from_product([[2, 3], [0, 1]], names=['x', 'y'])))
(in pandas 2.0+, where append() no longer exists, use pd.concat instead). For each y, return the deviation from the mean of every element.