Much of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data is stored in files or databases is not the way you need it for a data processing application.
pandas, along with the Python standard library, provides you with a high-level, flexible, and high-performance set of core manipulations and algorithms, enabling you to wrangle data into the right form without much trouble.
In [1]:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
In [2]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data2': range(3)})
df1
Out[2]:
In [3]:
df2
Out[3]:
This is an example of a many-to-one merge: df1 has multiple rows labeled 'a' and 'b', whereas df2 has only one row for each value in the key column. Calling merge with these objects, we obtain:
In [4]:
pd.merge(df1, df2)
Out[4]:
If not specified, merge uses the overlapping column names as the keys. It’s a good practice to specify explicitly, though:
In [5]:
pd.merge(df1, df2, on='key')
Out[5]:
If the column names are different in each object, you can specify them separately:
In [6]:
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
In [7]:
df4 = DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
In [8]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[8]:
You probably noticed that the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersection. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:
In [9]:
pd.merge(df1, df2, how='outer')
Out[9]:
In [10]:
pd.merge(df1, df2, how='left')
Out[10]:
In [11]:
pd.merge(df1, df2, how='right')
Out[11]:
Many-to-many merges have well-defined, though not necessarily intuitive, behavior. Here's an example:
In [20]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
print(df1)
print(df2)
pd.merge(df1, df2, on='key', how='left')
Out[20]:
Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects the distinct key values appearing in the result:
In [21]:
pd.merge(df1, df2, how='inner')
Out[21]:
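As a quick check on the arithmetic above, you can count the 'b' rows in the result yourself. A minimal sketch; merged is a hypothetical name for the left-join result from In [20]:
merged = pd.merge(df1, df2, on='key', how='left')
# 3 'b' rows in df1 times 2 'b' rows in df2 gives 6 'b' rows
(merged['key'] == 'b').sum()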
To merge with multiple keys, pass a list of column names:
In [22]:
left = DataFrame({'key1': ['foo', 'foo', 'bar'],
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]})
right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                   'key2': ['one', 'one', 'one', 'two'],
                   'rval': [4, 5, 6, 7]})
In [23]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[23]:
To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key (even though it’s not actually implemented that way).
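To make the tuple analogy concrete, here is a minimal sketch (not how merge is actually implemented) that builds the combined keys by hand:
# form (key1, key2) tuples; a multi-key join matches rows on these
left_keys = list(zip(left['key1'], left['key2']))
right_keys = list(zip(right['key1'], right['key2']))
# the key combinations an inner join would keep
set(left_keys) & set(right_keys)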
A last issue to consider in merge operations is the treatment of overlapping column names. While you can address the overlap manually (see the later section on renaming axis labels), merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects:
In [24]:
pd.merge(left, right, on='key1')
Out[24]:
In [25]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[25]:
In some cases, the merge key or keys in a DataFrame will be found in its index. You can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:
In [27]:
left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
print(left1)
print(right1)
pd.merge(left1, right1, left_on='key', right_index=True)
Out[27]:
Since the default merge method is to intersect the join keys, you can instead form the union of them with an outer join:
In [28]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[28]:
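Using the indexes of both sides of the merge is also possible. A minimal sketch with hypothetical frames left2 and right2:
left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                  index=['a', 'c', 'e'], columns=['Ohio', 'Nevada'])
right2 = DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                   index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
# pass left_index and right_index together to join on both indexes
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)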
Another kind of data combination operation is concatenation, also referred to as binding or stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:
In [29]:
arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1)
Out[29]:
In [30]:
np.concatenate([arr, arr], axis=0)
Out[30]:
With labeled pandas objects, concatenation raises further questions: should the result take the union or the intersection of the labels on the other axis, and how do you identify the concatenated pieces in the result? The concat function in pandas provides a consistent way to address each of these concerns. I'll give a number of examples to illustrate how it works. Suppose we have three Series with no index overlap:
In [31]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = Series([5, 6], index=['f', 'g'])
In [32]:
pd.concat([s1, s2, s3])
Out[32]:
By default concat works along axis=0, producing another Series. If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns):
In [33]:
pd.concat([s1, s2, s3], axis=1)
Out[33]:
In this case there is no overlap on the other axis, which as you can see is the sorted union (the 'outer' join) of the indexes. You can instead intersect them by passing join='inner':
In [35]:
s4 = pd.concat([s1 * 5, s3])
s4
Out[35]:
In [36]:
pd.concat([s1, s4], axis=1)
Out[36]:
In [37]:
pd.concat([s1, s4], axis=1, join='inner')
Out[37]:
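One practical issue with concatenation is that the pieces are not identifiable in the result. You can create a hierarchical index on the concatenation axis with the keys argument (the same argument appears again in an example below); a minimal sketch:
# keys become the outermost level of a MultiIndex
pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])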
There are a number of fundamental operations for rearranging tabular data, variously referred to as reshape or pivot operations.
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions: stack, which pivots the columns in the data into the rows, and unstack, which pivots the rows into the columns.
We will illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
In [40]:
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data
Out[40]:
Using the stack method on this data pivots the columns into the rows, producing a Series:
In [43]:
result = data.stack()
result
Out[43]:
From a hierarchically-indexed Series, you can rearrange the data back into a DataFrame with unstack:
In [44]:
result.unstack()
Out[44]:
By default the innermost level is unstacked (same with stack). You can unstack a different level by passing a level number or name:
In [45]:
result.unstack(0)
Out[45]:
In [46]:
result.unstack('state')
Out[46]:
Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:
In [47]:
s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2.unstack()
Out[47]:
Stacking filters out missing data by default, so the operation is easily invertible:
In [49]:
data2.unstack().stack()
Out[49]:
In [50]:
data2.unstack().stack(dropna=False)
Out[50]:
When unstacking in a DataFrame, the level unstacked becomes the lowest level in the result:
In [53]:
df = DataFrame({'left': result, 'right': result + 5},
               columns=pd.Index(['left', 'right'], name='side'))
df
Out[53]:
In [54]:
df.unstack('state')
Out[54]:
In [55]:
df.unstack('state').stack('side')
Out[55]:
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:
In [57]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})
data
Out[57]:
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate or not:
In [58]:
data.duplicated()
Out[58]:
Relatedly, drop_duplicates returns a DataFrame containing only the rows where duplicated is False:
In [59]:
data.drop_duplicates()
Out[59]:
Both of these methods by default consider all of the columns; alternatively you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:
In [60]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])
Out[60]:
duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:
In [62]:
data.drop_duplicates(['k1', 'k2'], keep='last')
Out[62]:
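Both methods also accept keep=False, which marks or drops every row that has any duplicate rather than keeping one representative; a minimal sketch:
# drop all rows whose ('k1', 'k2') combination occurs more than once
data.drop_duplicates(['k1', 'k2'], keep=False)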
For many data sets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data about various kinds of meat:
In [63]:
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                           'corned beef', 'Bacon', 'pastrami', 'honey ham',
                           'nova lox'],
                  'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[63]:
Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:
In [64]:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats above are capitalized and others are not. Thus, we also need to convert each value to lower case:
In [66]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data
Out[66]:
We could also have passed a function that does all the work:
In [67]:
data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[67]:
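A vectorized alternative to map(str.lower), assuming a reasonably recent pandas, is the .str accessor, which also passes missing values through instead of raising:
# Series.str.lower() lowercases each element; NaN stays NaN
data['food'].str.lower().map(meat_to_animal)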
Filling in missing data with the fillna method can be thought of as a special case of more general value replacement. While map, as you’ve seen above, can be used to modify a subset of values in an object, replace provides a simpler and more flexible way to do so. Let’s consider this Series:
In [68]:
data = Series([1., -999., 2., -999., -1000., 3.])
data
Out[68]:
The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series:
In [69]:
data.replace(-999, np.nan)
Out[69]:
If you want to replace multiple values at once, you instead pass a list and then the substitute value:
In [70]:
data.replace([-999, -1000], np.nan)
Out[70]:
To use a different replacement for each value, pass a list of substitutes:
In [71]:
data.replace([-999, -1000], [np.nan, 0])
Out[71]:
The argument passed can also be a dict:
In [72]:
data.replace({-999: np.nan, -1000: 0})
Out[72]: