Shaping Data

Much of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data is stored in files or databases is not the way you need it for a data processing application.

pandas, together with the Python standard library, provides a high-level, flexible, and high-performance set of core manipulation tools and algorithms that let you wrangle data into the right form without much trouble.


In [1]:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd

In [2]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data2': range(3)})
df1


Out[2]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

In [3]:
df2


Out[3]:
   data2 key
0      0   a
1      1   b
2      2   d

Merging / Joining

This is an example of a many-to-one merge situation; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects, we obtain:


In [4]:
pd.merge(df1, df2)


Out[4]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

If not specified, merge uses the overlapping column names as the keys. It’s a good practice to specify explicitly, though:


In [5]:
pd.merge(df1, df2, on='key')


Out[5]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

If the column names are different in each object, you can specify them separately:


In [6]:
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

In [7]:
df4 = DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

In [8]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')


Out[8]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

You probably noticed that the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersection. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:


In [9]:
pd.merge(df1, df2, how='outer')


Out[9]:
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0

In [10]:
pd.merge(df1, df2, how='left')


Out[10]:
   data1 key  data2
0      0   b    1.0
1      1   b    1.0
2      2   a    0.0
3      3   c    NaN
4      4   a    0.0
5      5   a    0.0
6      6   b    1.0

In [11]:
pd.merge(df1, df2, how='right')


Out[11]:
   data1 key  data2
0    0.0   b      1
1    1.0   b      1
2    6.0   b      1
3    2.0   a      0
4    4.0   a      0
5    5.0   a      0
6    NaN   d      2

Many-to-Many Merges

Many-to-many merges have well-defined though not necessarily intuitive behavior. Here’s an example:


In [20]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
print(df1)
print(df2)
pd.merge(df1, df2, on='key', how='left')


   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b
   data2 key
0      0   a
1      1   b
2      2   a
3      3   b
4      4   d
Out[20]:
    data1 key  data2
0       0   b    1.0
1       0   b    3.0
2       1   b    1.0
3       1   b    3.0
4       2   a    0.0
5       2   a    2.0
6       3   c    NaN
7       4   a    0.0
8       4   a    2.0
9       5   b    1.0
10      5   b    3.0

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects the distinct key values appearing in the result:


In [21]:
pd.merge(df1, df2, how='inner')


Out[21]:
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2

Compound Key

To merge with multiple keys ("compound key"), pass a list of column names:


In [22]:
left = DataFrame({'key1': ['foo', 'foo', 'bar'], 
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]})

right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 
                   'key2': ['one', 'one', 'one', 'two'],
                   'rval': [4, 5, 6, 7]})

In [23]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')


Out[23]:
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0

To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key (even though it’s not actually implemented that way).
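
To make this mental model concrete, here is a small sketch (illustration only; merge does not literally build Python sets of tuples) that forms the compound keys by hand for the left and right frames above and compares the key sets an inner and an outer join would use:

left_keys = set(zip(left['key1'], left['key2']))     # {('foo', 'one'), ('foo', 'two'), ('bar', 'one')}
right_keys = set(zip(right['key1'], right['key2']))  # {('foo', 'one'), ('bar', 'one'), ('bar', 'two')}

left_keys & right_keys   # intersection: the key pairs an inner join keeps
left_keys | right_keys   # union: the key pairs an outer join keeps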

Overlapping Column Names

A last issue to consider in merge operations is the treatment of overlapping column names. While you can address the overlap manually (see the later section on renaming axis labels), merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects:


In [24]:
pd.merge(left, right, on='key1')


Out[24]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

In [25]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))


Out[25]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

Merge on Index

In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:


In [27]:
left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
print(left1)
print(right1)
pd.merge(left1, right1, left_on='key', right_index=True)


  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
   group_val
a        3.5
b        7.0
Out[27]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

Since the default merge method is to intersect the join keys, you can instead form the union of them with an outer join:


In [28]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')


Out[28]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
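
The same mechanism extends to using the index on both sides at once: pass left_index=True and right_index=True together. Here is a minimal sketch using two small hypothetical frames (not defined elsewhere in this section):

lefth = DataFrame({'val1': [1, 2, 3]}, index=['a', 'b', 'c'])   # hypothetical
righth = DataFrame({'val2': [10, 20]}, index=['a', 'b'])        # hypothetical

# both indexes serve as the join keys; how='outer' keeps 'c' with NaN for val2
pd.merge(lefth, righth, left_index=True, right_index=True, how='outer')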

Concatenation / Stacking

Another kind of data combination operation is alternatively referred to as concatenation, binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:


In [29]:
arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1)


Out[29]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [30]:
np.concatenate([arr, arr], axis=0)


Out[30]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

With labeled pandas objects, concatenation raises questions that raw NumPy arrays don't: how should indexes on the other axis be aligned (union or intersection), and how can the concatenated pieces be identified in the result? The concat function in pandas provides a consistent way to address these concerns. I'll give a number of examples to illustrate how it works. Suppose we have three Series with no index overlap:


In [31]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = Series([5, 6], index=['f', 'g'])

In [32]:
pd.concat([s1, s2, s3])


Out[32]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default concat works along axis=0, producing another Series. If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns):


In [33]:
pd.concat([s1, s2, s3], axis=1)


Out[33]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

In this case there is no overlap on the other axis; the resulting index is the sorted union (the 'outer' join) of the input indexes. You can instead intersect them by passing join='inner':


In [35]:
s4 = pd.concat([s1 * 5, s3])
s4


Out[35]:
a    0
b    5
f    5
g    6
dtype: int64

In [36]:
pd.concat([s1, s4], axis=1)


Out[36]:
     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6

In [37]:
pd.concat([s1, s4], axis=1, join='inner')


Out[37]:
   0  1
a  0  0
b  1  5
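
concat can also label the pieces being combined. Passing keys= attaches an identifier to each input object, producing a hierarchical index along the concatenation axis; the same pattern is used later to build data2 in the reshaping examples. A minimal sketch reusing s1 and s3 from above:

# keys= labels each piece, yielding a MultiIndex on the rows
pd.concat([s1, s3], keys=['first', 'second'])
# first   a    0
#         b    1
# second  f    5
#         g    6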

Split-Apply-Combine

Let's load some real-world data to illustrate splitting, transforming, and grouping:


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/eu_trade_sums.csv")


IOError: File data/eu_trade_sums.csv does not exist
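
Since that CSV isn't available here, the split-apply-combine pattern can still be sketched on a small, hypothetical DataFrame: groupby splits the rows by key, an aggregation is applied to each group, and the results are combined into a new object:

# hypothetical data standing in for the missing trade file
trade = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR', 'IT'],
                      'flow':    ['import', 'export', 'import', 'export', 'import'],
                      'value':   [100, 120, 80, 90, 60]})

trade.groupby('country')['value'].sum()     # split by country, sum each group's values
trade.groupby(['country', 'flow']).size()   # number of rows for each (country, flow) pair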


Reshaping

There are a number of fundamental operations for rearranging tabular data. These are variously referred to as reshape or pivot operations.

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

  • stack: this “rotates” or pivots from the columns in the data to the rows
  • unstack: this pivots from the rows into the columns

We will illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:


In [40]:
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'), 
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data


Out[40]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

Using the stack method on this data pivots the columns into the rows, producing a Series:


In [43]:
result = data.stack()
result


Out[43]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

From a hierarchically-indexed Series, you can rearrange the data back into a DataFrame with unstack:


In [44]:
result.unstack()


Out[44]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

By default the innermost level is unstacked (same with stack). You can unstack a different level by passing a level number or name:


In [45]:
result.unstack(0)


Out[45]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

In [46]:
result.unstack('state')


Out[46]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:


In [47]:
s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2.unstack()


Out[47]:
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

Stacking filters out missing data by default, so the operation is easily invertible:


In [49]:
data2.unstack().stack()


Out[49]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [50]:
data2.unstack().stack(dropna=False)


Out[50]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When unstacking in a DataFrame, the level unstacked becomes the lowest level in the result:


In [53]:
df = DataFrame({'left': result, 'right': result + 5}, 
               columns=pd.Index(['left', 'right'], name='side'))
df


Out[53]:
side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

In [54]:
df.unstack('state')


Out[54]:
side    left           right
state   Ohio Colorado   Ohio Colorado
number
one        0        3      5        8
two        1        4      6        9
three      2        5      7       10

In [55]:
df.unstack('state').stack('side')


Out[55]:
state         Ohio  Colorado
number side
one    left      0         3
       right     5         8
two    left      1         4
       right     6         9
three  left      2         5
       right     7        10

Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:


In [57]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})
data


Out[57]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate or not:


In [58]:
data.duplicated()


Out[58]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:


In [59]:
data.drop_duplicates()


Out[59]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

Both of these methods by default consider all of the columns; alternatively you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:


In [60]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])


Out[60]:
    k1  k2  v1
0  one   1   0
3  two   3   3

duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:


In [62]:
data.drop_duplicates(['k1', 'k2'], keep='last')


Out[62]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

Transforming Data Using a Function or Mapping

For many data sets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about some kinds of meat:


In [63]:
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 
                           'corned beef', 'Bacon', 'pastrami', 'honey ham',
                           'nova lox'],
                  'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data


Out[63]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:


In [64]:
meat_to_animal = { 'bacon': 'pig', 
                   'pulled pork': 'pig', 
                   'pastrami': 'cow', 
                   'corned beef': 'cow', 
                   'honey ham': 'pig', 
                   'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats above are capitalized and others are not. Thus, we also need to convert each value to lower case:


In [66]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data


Out[66]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

We could also have passed a function that does all the work:


In [67]:
data['food'].map(lambda x: meat_to_animal[x.lower()])


Out[67]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
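
Note that map(str.lower) works here because every entry in the column is a string. If the column could contain missing values, the vectorized string accessor is a safer first step, since Series.str.lower() propagates NaN rather than raising an error:

# equivalent transformation using the .str accessor (NaN-safe)
data['food'].str.lower().map(meat_to_animal)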

Replacing Values

Filling in missing data with the fillna method can be thought of as a special case of more general value replacement. While map, as you’ve seen above, can be used to modify a subset of values in an object, replace provides a simpler and more flexible way to do so. Let’s consider this Series:


In [68]:
data = Series([1., -999., 2., -999., -1000., 3.])
data


Out[68]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series:


In [69]:
data.replace(-999, np.nan)


Out[69]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list of values followed by the substitute value:


In [70]:
data.replace([-999, -1000], np.nan)


Out[70]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:


In [71]:
data.replace([-999, -1000], [np.nan, 0])


Out[71]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:


In [72]:
data.replace({-999: np.nan, -1000: 0})


Out[72]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
