Shaping Data

Much of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data is stored in files or databases is not the way you need it for a data processing application.

pandas, together with the Python standard library, provides a high-level, flexible, and high-performance set of core manipulation tools and algorithms that let you wrangle data into the right form without much trouble.


In [1]:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd

In [2]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data2': range(3)})
df1


Out[2]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

In [3]:
df2


Out[3]:
   data2 key
0      0   a
1      1   b
2      2   d

Merging / Joining

This is an example of a many-to-one merge situation; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects, we obtain:


In [4]:
pd.merge(df1, df2)


Out[4]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

If not specified, merge uses the overlapping column names as the keys. It’s a good practice to specify explicitly, though:


In [5]:
pd.merge(df1, df2, on='key')


Out[5]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

If the column names are different in each object, you can specify them separately:


In [6]:
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

In [7]:
df4 = DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

In [8]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')


Out[8]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

You probably noticed that the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersection. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:


In [9]:
pd.merge(df1, df2, how='outer')


Out[9]:
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0

In [10]:
pd.merge(df1, df2, how='left')


Out[10]:
   data1 key  data2
0      0   b    1.0
1      1   b    1.0
2      2   a    0.0
3      3   c    NaN
4      4   a    0.0
5      5   a    0.0
6      6   b    1.0

In [11]:
pd.merge(df1, df2, how='right')


Out[11]:
   data1 key  data2
0    0.0   b      1
1    1.0   b      1
2    6.0   b      1
3    2.0   a      0
4    4.0   a      0
5    5.0   a      0
6    NaN   d      2

Many-to-Many Merges

Many-to-many merges have well-defined though not necessarily intuitive behavior. Here’s an example:


In [20]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
print(df1)
print(df2)
pd.merge(df1, df2, on='key', how='left')


   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b
   data2 key
0      0   a
1      1   b
2      2   a
3      3   b
4      4   d
Out[20]:
    data1 key  data2
0       0   b    1.0
1       0   b    3.0
2       1   b    1.0
3       1   b    3.0
4       2   a    0.0
5       2   a    2.0
6       3   c    NaN
7       4   a    0.0
8       4   a    2.0
9       5   b    1.0
10      5   b    3.0

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects the distinct key values appearing in the result:


In [21]:
pd.merge(df1, df2, how='inner')


Out[21]:
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2

Compound Key

To merge with multiple keys ("compound key"), pass a list of column names:


In [22]:
left = DataFrame({'key1': ['foo', 'foo', 'bar'], 
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]})

right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 
                   'key2': ['one', 'one', 'one', 'two'],
                   'rval': [4, 5, 6, 7]})

In [23]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')


Out[23]:
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0

To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key (even though it’s not actually implemented that way).
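
To make this mental model concrete, here is a small sketch (illustration only; merge does not literally build Python sets of tuples) that forms the compound keys by hand for the left and right frames above and compares the key sets an inner and an outer join would use:

left_keys = set(zip(left['key1'], left['key2']))     # {('foo', 'one'), ('foo', 'two'), ('bar', 'one')}
right_keys = set(zip(right['key1'], right['key2']))  # {('foo', 'one'), ('bar', 'one'), ('bar', 'two')}

left_keys & right_keys   # intersection: the key pairs an inner join keeps
left_keys | right_keys   # union: the key pairs an outer join keeps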

Overlapping Column Names

A last issue to consider in merge operations is the treatment of overlapping column names. While you can address the overlap manually (see the later section on renaming axis labels), merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects:


In [24]:
pd.merge(left, right, on='key1')


Out[24]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

In [25]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))


Out[25]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

Merge on Index

In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:


In [27]:
left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
print(left1)
print(right1)
pd.merge(left1, right1, left_on='key', right_index=True)


  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
   group_val
a        3.5
b        7.0
Out[27]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

Since the default merge method is to intersect the join keys, you can instead form the union of them with an outer join:


In [28]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')


Out[28]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
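
The same mechanism extends to using the index on both sides at once: pass left_index=True and right_index=True together. Here is a minimal sketch using two small hypothetical frames (not defined elsewhere in this section):

lefth = DataFrame({'val1': [1, 2, 3]}, index=['a', 'b', 'c'])   # hypothetical
righth = DataFrame({'val2': [10, 20]}, index=['a', 'b'])        # hypothetical

# both indexes serve as the join keys; how='outer' keeps 'c' with NaN for val2
pd.merge(lefth, righth, left_index=True, right_index=True, how='outer')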

Concatenation / Stacking

Another kind of data combination operation is alternatively referred to as concatenation, binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:


In [29]:
arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1)


Out[29]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [30]:
np.concatenate([arr, arr], axis=0)


Out[30]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

With labeled pandas objects, concatenation raises questions that raw NumPy arrays don't: how should indexes on the other axis be aligned (union or intersection), and how can the concatenated pieces be identified in the result? The concat function in pandas provides a consistent way to address these concerns. I'll give a number of examples to illustrate how it works. Suppose we have three Series with no index overlap:


In [31]:
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = Series([5, 6], index=['f', 'g'])

In [32]:
pd.concat([s1, s2, s3])


Out[32]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default concat works along axis=0, producing another Series. If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns):


In [33]:
pd.concat([s1, s2, s3], axis=1)


Out[33]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

In this case there is no overlap on the other axis; the resulting index is the sorted union (the 'outer' join) of the input indexes. You can instead intersect them by passing join='inner':


In [35]:
s4 = pd.concat([s1 * 5, s3])
s4


Out[35]:
a    0
b    5
f    5
g    6
dtype: int64

In [36]:
pd.concat([s1, s4], axis=1)


Out[36]:
     0  1
a  0.0  0
b  1.0  5
f  NaN  5
g  NaN  6

In [37]:
pd.concat([s1, s4], axis=1, join='inner')


Out[37]:
   0  1
a  0  0
b  1  5
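
concat can also label the pieces being combined. Passing keys= attaches an identifier to each input object, producing a hierarchical index along the concatenation axis; the same pattern is used later to build data2 in the reshaping examples. A minimal sketch reusing s1 and s3 from above:

# keys= labels each piece, yielding a MultiIndex on the rows
pd.concat([s1, s3], keys=['first', 'second'])
# first   a    0
#         b    1
# second  f    5
#         g    6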

Split-Apply-Combine

Let's load some real-world data to illustrate splitting, transforming, and grouping:


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/eu_trade_sums.csv")


IOError: File data/eu_trade_sums.csv does not exist
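
Since that CSV isn't available here, the split-apply-combine pattern can still be sketched on a small, hypothetical DataFrame: groupby splits the rows by key, an aggregation is applied to each group, and the results are combined into a new object:

# hypothetical data standing in for the missing trade file
trade = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR', 'IT'],
                      'flow':    ['import', 'export', 'import', 'export', 'import'],
                      'value':   [100, 120, 80, 90, 60]})

trade.groupby('country')['value'].sum()     # split by country, sum each group's values
trade.groupby(['country', 'flow']).size()   # number of rows for each (country, flow) pair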


Reshaping

There are a number of fundamental operations for rearranging tabular data. These are variously referred to as reshape or pivot operations.

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

  • stack: this “rotates” or pivots from the columns in the data to the rows
  • unstack: this pivots from the rows into the columns

We will illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:


In [40]:
data = DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Ohio', 'Colorado'], name='state'), 
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data


Out[40]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

Using the stack method on this data pivots the columns into the rows, producing a Series:


In [43]:
result = data.stack()
result


Out[43]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

From a hierarchically-indexed Series, you can rearrange the data back into a DataFrame with unstack:


In [44]:
result.unstack()


Out[44]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

By default the innermost level is unstacked (same with stack). You can unstack a different level by passing a level number or name:


In [45]:
result.unstack(0)


Out[45]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

In [46]:
result.unstack('state')


Out[46]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

Unstacking might introduce missing data if all of the values in the level aren’t found in each of the subgroups:


In [47]:
s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2.unstack()


Out[47]:
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

Stacking filters out missing data by default, so the operation is easily invertible:


In [49]:
data2.unstack().stack()


Out[49]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [50]:
data2.unstack().stack(dropna=False)


Out[50]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When unstacking in a DataFrame, the level unstacked becomes the lowest level in the result:


In [53]:
df = DataFrame({'left': result, 'right': result + 5}, 
               columns=pd.Index(['left', 'right'], name='side'))
df


Out[53]:
side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

In [54]:
df.unstack('state')


Out[54]:
side    left           right
state   Ohio Colorado   Ohio Colorado
number
one        0        3      5        8
two        1        4      6        9
three      2        5      7       10

In [55]:
df.unstack('state').stack('side')


Out[55]:
state         Ohio  Colorado
number side
one    left      0         3
       right     5         8
two    left      1         4
       right     6         9
three  left      2         5
       right     7        10

Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:


In [57]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})
data


Out[57]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate or not:


In [58]:
data.duplicated()


Out[58]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:


In [59]:
data.drop_duplicates()


Out[59]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

Both of these methods by default consider all of the columns; alternatively you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:


In [60]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])


Out[60]:
    k1  k2  v1
0  one   1   0
3  two   3   3

duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one:


In [62]:
data.drop_duplicates(['k1', 'k2'], keep='last')


Out[62]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

Transforming Data Using a Function or Mapping

For many data sets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about some kinds of meat:


In [63]:
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 
                           'corned beef', 'Bacon', 'pastrami', 'honey ham',
                           'nova lox'],
                  'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data


Out[63]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:


In [64]:
meat_to_animal = { 'bacon': 'pig', 
                   'pulled pork': 'pig', 
                   'pastrami': 'cow', 
                   'corned beef': 'cow', 
                   'honey ham': 'pig', 
                   'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats above are capitalized and others are not. Thus, we also need to convert each value to lower case:


In [66]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data


Out[66]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

We could also have passed a function that does all the work:


In [67]:
data['food'].map(lambda x: meat_to_animal[x.lower()])


Out[67]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
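
Note that map(str.lower) works here because every entry in the column is a string. If the column could contain missing values, the vectorized string accessor is a safer first step, since Series.str.lower() propagates NaN rather than raising an error:

# equivalent transformation using the .str accessor (NaN-safe)
data['food'].str.lower().map(meat_to_animal)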

Replacing Values

Filling in missing data with the fillna method can be thought of as a special case of more general value replacement. While map, as you’ve seen above, can be used to modify a subset of values in an object, replace provides a simpler and more flexible way to do so. Let’s consider this Series:


In [68]:
data = Series([1., -999., 2., -999., -1000., 3.])
data


Out[68]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series:


In [69]:
data.replace(-999, np.nan)


Out[69]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, pass a list of values followed by the substitute value:


In [70]:
data.replace([-999, -1000], np.nan)


Out[70]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:


In [71]:
data.replace([-999, -1000], [np.nan, 0])


Out[71]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:


In [72]:
data.replace({-999: np.nan, -1000: 0})


Out[72]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
