If you know something about relational databases and SQL, you may have heard of JOIN and GROUP BY.
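For readers coming from SQL, the rough correspondence is JOIN ≈ join()/merge() and GROUP BY ≈ groupby(). A minimal sketch with two made-up tables (names and values chosen only for illustration):
import pandas as pd

a = pd.DataFrame({'key': [1, 2], 'u': [10, 20]})
b = pd.DataFrame({'key': [1, 2], 'v': [0.1, 0.2]})

# SQL: SELECT * FROM a JOIN b ON a.key = b.key
pd.merge(a, b, on='key')

# SQL: SELECT key, AVG(u) FROM a GROUP BY key
a.groupby('key')['u'].mean()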
In [1]:
import pandas as pd
In [2]:
# monthly CO2 data: Mauna Loa (mlo) and global (gl) series, indexed by date
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)
gl = pd.read_csv('../data/co2-mm-gl.csv', na_values=-99.99, index_col='Date', parse_dates=True)
# alternatively, read directly from the URL:
# pd.read_csv('https://python.g-node.org/wiki/_media/co2-mm-mlo.csv')
In [3]:
# keep only the 'Average' column and give each series a distinguishing name
ml, gl = mlo[['Average']], gl[['Average']]
ml.columns, gl.columns = ['Average_mlo'], ['Average_gl']
# restrict the Mauna Loa frame to 1980 onwards, then keep the first five rows of each
ml = ml[ml.index >= '1980-01']
ml, gl = ml.head(), gl.head()
We can concatenate the two DataFrames.
In [4]:
pd.concat([ml, gl])
Out[4]:
Alternatively, the same result can be obtained with the self-describing append() method (note that append() was deprecated in pandas 1.4 and removed in 2.0, where concat() is the recommended replacement).
In [5]:
ml.append(gl)
Out[5]:
By default, concat() concatenates along the rows (axis 0). To place the two series side by side we instead concatenate along the columns (axis 1); this is really a join operation on the shared index (or key), Date.
In [6]:
pd.concat([ml, gl], axis=1)
Out[6]:
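Note that concat() itself has a join parameter ('outer' by default); passing join='inner' would keep only the dates present in both frames. A small variant not shown in the original cells:
pd.concat([ml, gl], axis=1, join='inner')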
Indeed, we could alternatively use the join() method.
In [7]:
ml.join(gl)
Out[7]:
Wait! Which frame's index is used? By default, join() performs a left join, keeping the calling frame's index. A join can be left, right, outer, or inner (think of unions and intersections of the two indexes).
In [8]:
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)
mlo = mlo[['Average']]
mlo.head()
Out[8]:
In [9]:
mlo.join(gl)
Out[9]:
In [10]:
mlo.join(gl, how='inner')
Out[10]:
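So far we have seen the default left join (keep the calling frame's, i.e. mlo's, index) and an inner join (keep only the dates present in both). The remaining how options follow the same pattern; a quick sketch, outputs omitted:
mlo.join(gl, how='outer')   # union of the two indexes
mlo.join(gl, how='right')   # keep gl's index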
Exercise: using the join() method, could we obtain Average and Average_gl for dates between 1980-01 and 1980-05?

We now introduce the split-apply-combine approach:
split the data into groups;
apply a function to each group independently;
combine the results into an appropriate data structure.
In [11]:
# a toy Series with a two-level (hierarchical) index, levels named 'x' and 'y'
z = pd.Series([0.5, 0.8, 0.6, 0.3], index=pd.MultiIndex.from_product([[0, 1], [0, 1]], names=['x', 'y']))
In [12]:
z
Out[12]:
In [13]:
z.groupby('x')
Out[13]:
In [14]:
z.groupby('x').apply(lambda u: u.min())
Out[14]:
The function in question is an aggregation: for each group, it returns the minimum value. The length of the result equals the number of distinct groups (here, the number of unique values of x). Aggregation reduces the size of the data structure.
In [15]:
z.groupby(level=0).apply(lambda u: u.min())
Out[15]:
If the GroupBy object provides the aggregation as a method, it can be called directly.
In [16]:
z.groupby('x').min()
Out[16]:
With a hierarchical index, the level parameter can even be passed directly to certain aggregation functions.
In [17]:
z.min(level='x')
Out[17]:
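Note that the level argument of reduction methods such as min() was deprecated in pandas 1.3 and removed in 2.0; in recent versions the equivalent is the groupby form we already used:
z.groupby(level='x').min()   # same result as z.min(level='x') in older pandas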
Counting the number of records in each group is also an aggregation.
In [18]:
z.groupby('x').size()
Out[18]:
For each unique value of x (0 and 1), we have two entries.
The grouping object (here, x) is referred to as the key.
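A closely related aggregation is count(), which, unlike size(), excludes missing values; for this small Series without missing data the two agree:
z.groupby('x').count()   # number of non-missing entries per group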
In [19]:
z.groupby(['x', 'y']).size()
Out[19]:
In [20]:
z.groupby('y').describe()
Out[20]:
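describe() computes a whole battery of statistics per group. To pick a custom set of aggregations, agg() can be used (a small sketch, not part of the original cells):
z.groupby('y').agg(['min', 'max', 'mean'])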
In [21]:
import numpy as np
In [22]:
z.groupby('y').apply(lambda u: np.std(u, ddof=1))
Out[22]:
In [23]:
z.groupby('y').apply(lambda u: np.std(u))
Out[23]:
In [24]:
z.groupby('y').apply(np.std)
Out[24]:
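The two results above differ because np.std defaults to ddof=0 (population standard deviation), while pandas' std() uses ddof=1 (sample standard deviation). The built-in groupby method therefore matches the ddof=1 call:
z.groupby('y').std()   # pandas std uses ddof=1, matching np.std(u, ddof=1)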
Filtering is another kind of operation that can be applied group by group. It may reduce the size of the data structure, since entire groups can be dropped.
In [25]:
z
Out[25]:
In [26]:
z.groupby('y').apply(lambda u: u.min() > 0.4)
Out[26]:
In [27]:
z.groupby('y').filter(lambda u: u.min() > 0.4)
Out[27]:
The third kind of operation is transformation, where the size of the data structure is preserved.
In [28]:
z.groupby('y').transform(lambda u: u.min())
Out[28]:
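To see that transform() preserves the shape while an aggregation reduces it, compare (a quick check, not from the original cells):
z.groupby('y').transform('min').shape   # (4,) -- same length as z
z.groupby('y').min().shape              # (2,) -- one value per group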
Exercise: create an extended series
z2 = z.append(pd.Series([1.5, 1.8, 1.6, 1.3], index=pd.MultiIndex.from_product([[2, 3], [0, 1]], names=['x', 'y'])))
(in pandas 2.0+, where append() no longer exists, use pd.concat instead). For each y, return the deviation from the mean of every element.