Tidying of data is required for many reasons including these:
Moving away from a list of problems with data that need to be addressed, there are several characteristics of data that can be considered good, tidy, and ready for analysis, which are as follows:
In [2]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
import datetime
# set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)
Data is "missing" in pandas when it has a value of NaN (also seen as np.nan, the form from NumPy). The NaN value indicates that, for a particular index label in a Series, no value is specified.
In pandas, there are a number of reasons why a value can be NaN:
In [3]:
# create a DataFrame with 5 rows and 3 columns
df = pd.DataFrame(np.arange(0,15).reshape(5,3),index=['a','b','c','d','e'], columns=['col1','col2','col3'])
df
Out[3]:
In [4]:
# add some columns and rows to the DataFrame
# column c4 with NaN values
df['c4'] = np.nan
# row 'f' with 15 through 18
df.loc['f'] = np.arange(15,19)
# row 'g' with all NaN
df.loc['g'] = np.nan
# column c5 with NaN's
df['c5'] = np.nan
# change value in col 'c4' row 'a'
df.loc['a', 'c4'] = 20
df
Out[4]:
This DataFrame object exhibits the following characteristics that will support most of the examples that follow:
In [5]:
# which items are NaN?
df.isnull()
Out[5]:
In [6]:
# count the number of NaN values in each column
df.isnull().sum()
Out[6]:
In [7]:
# total count of NaN values
df.isnull().sum().sum()
Out[7]:
In [8]:
# number of non-NaN values in each column
df.count()
Out[8]:
In [9]:
# and can be used to count the NaN values
(len(df)-df.count()).sum()
Out[9]:
In [10]:
# which items are not null
df.notnull()
Out[10]:
In [11]:
# select the non-NaN items in column c4
df.c4[df.c4.notnull()]
Out[11]:
In [12]:
# .dropna() will also return the non-NaN values
# this gets all non-NaN items in column c4
df.c4.dropna()
Out[12]:
In [13]:
# dropna returns a copy with the NaN values dropped
# the source DataFrame / column is not changed
df.c4
Out[13]:
When applied to a DataFrame object, dropna() will drop all rows from the DataFrame object that have at least one NaN value. If you want to drop only rows where all values are NaN, you can use the how="all" parameter.
In [14]:
# on a DataFrame this will drop entire rows
# where there is at least one NaN
# in this case, all the rows
df.dropna()
Out[14]:
In [15]:
# using how='all', only rows that have all values
# as NaN will be dropped
df.dropna(how='all')
Out[15]:
In [16]:
# flip to drop columns instead of rows
df.dropna(how='all',axis=1) # c5 column will be dropped
Out[16]:
In [19]:
# make a copy of df
df2 = df.copy()
# replace two NaN cells with values
df2.loc['g', 'col1'] = 0
df2.loc['g', 'col3'] = 0
df2
Out[19]:
In [20]:
# now drop columns with any NaN values
df2.dropna(how='any',axis=1)
Out[20]:
In [21]:
# thresh=5 keeps only the columns that have at least 5 non-NaN values
df.dropna(thresh=5,axis=1)
Out[21]:
Note that the .dropna() method returns a copy of the DataFrame object, and the data is dropped from that copy. If you want to drop the data in the actual DataFrame, use the inplace=True parameter.
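A brief sketch of the difference, using the df built above (these lines are illustrative rather than an executed cell, and the dropped / df_in_place names are only for demonstration):
# .dropna() returns a new object; df itself is unchanged
dropped = df.dropna(how='all')
# with inplace=True the rows are dropped from the object itself
# and the method returns None
df_in_place = df.copy()
df_in_place.dropna(how='all', inplace=True)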
The NaN values are handled differently in pandas than in NumPy. NumPy functions, when encountering a NaN value, will return NaN. pandas functions will typically ignore the NaN values and continue processing as though those values were not part of the Series object.
More specifically, the way that pandas handles NaN values is as follows:
In [22]:
# create a NumPy array with one NaN value
a = np.array([1,2,np.nan,3])
# create a Series from the array
s = pd.Series(a)
# the mean of each is different
a.mean(), s.mean()
Out[22]:
In [23]:
# demonstrate sum,mean and cumsum handling of NaN
# get one column
s = df.c4
s.sum() # NaN values treated as 0
Out[23]:
In [24]:
s.mean() # NaN values are excluded from the mean
Out[24]:
In [25]:
# NaN values are skipped, but positions with NaN are preserved in the result Series
s.cumsum()
Out[25]:
In [26]:
# in arithmetic, a NaN value will result in NaN
df.c4 + 1
Out[26]:
In [27]:
# return a new DataFrame with NaN values filled with 0
filled = df.fillna(0)
filled
Out[27]:
In [28]:
# having replaced NaN with 0 can make
# operations such as mean have different results
filled.mean()
Out[28]:
It is also possible to limit the number of NaN values that are filled using the limit parameter. When filling with a value, at most that many NaN values will be filled in each column; when filling with a method such as ffill, the limit is the maximum number of consecutive NaN values that will be filled in each gap.
In [29]:
# only fills the first two NaN values in each column with 0
df.fillna(0,limit=2)
Out[29]:
In [30]:
# extract the c4 column and fill NaNs forward
df.c4.fillna(method="ffill")
Out[30]:
In [31]:
# perform a backward fill
df.c4.fillna(method="bfill")
Out[31]:
You can also use the convenience methods .ffill() and .bfill() to perform the same fills.
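For example (a brief sketch rather than an executed cell from this notebook):
# .ffill() and .bfill() are shorthand for the fillna calls above
df.c4.ffill()   # same as df.c4.fillna(method="ffill")
df.c4.bfill()   # same as df.c4.fillna(method="bfill")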
In [33]:
# create a new series of values to be
# used to fill NaN values where the index label matches
fill_values = pd.Series([100,101,102], index=['a','e','g'])
fill_values
Out[33]:
In [34]:
# using c4, fill using fill_values
# a, e and g will be filled with matching values
df.c4.fillna(fill_values)
Out[34]:
In [35]:
# fill NaN values in each column with the
# mean of the values in that column
df.fillna(df.mean())
Out[35]:
In [36]:
# linear interpolate the NaN values from 1 through 2
s = pd.Series([1,np.nan,np.nan,np.nan,2])
s.interpolate()
Out[36]:
The interpolated values are calculated by taking the first value before and after any sequence of NaN values, computing the increment between them, and then adding that increment cumulatively in place of each NaN. In this case, 1.0 and 2.0 are the surrounding values, giving an increment of (2.0-1.0)/(5-1)=0.25, which is then added cumulatively through all of the NaN values.
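A quick hand-check of that calculation (a sketch, not an executed cell from this notebook):
# the increment added in place of each NaN
step = (2.0 - 1.0) / (5 - 1)           # 0.25
# the resulting values: 1.0, 1.25, 1.5, 1.75, 2.0
[1.0 + step * i for i in range(5)]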
The .interpolate() method also allows a specific method of interpolation to be specified. One of the common methods is time-based interpolation.
In [37]:
# create a time series, but missing one date in the Series
ts = pd.Series([1,np.nan,3],index=[datetime.datetime(2014,1,1),datetime.datetime(2014,2,1),datetime.datetime(2014,4,1)])
ts
Out[37]:
In [38]:
ts.interpolate()
Out[38]:
The important thing to note is that the series is missing an entry for 2014-03-01. If we were expecting to interpolate evenly spaced (for example, monthly) values, there would be two values to calculate, one for 2014-02-01 and another for 2014-03-01, resulting in one more interval in the interpolation.
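As a rough hand-check of what method="time" should produce for 2014-02-01 (a sketch, not an executed cell; the day counts are taken from the index dates):
# 2014-02-01 is 31 days into the 90 days between the two known points
1 + (3 - 1) * 31.0 / 90.0   # approximately 1.689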
In [39]:
# this accounts for the fact that we don't have
# an entry for 2014-03-01
ts.interpolate(method="time")
Out[39]:
Interpolation can also calculate values relative to the index values when the index labels are numeric.
In [40]:
# a series to demonstrate index label based interpolation
s = pd.Series([0,np.nan,100], index=[0,1,10])
s
Out[40]:
In [41]:
# linear interpolate
s.interpolate()
Out[41]:
In [42]:
# interpolate based upon the values in the index
s.interpolate(method="values")
Out[42]:
Now the value calculated for the NaN is interpolated using relative positioning based upon the labels in the index. The NaN value has a label of 1, which is one tenth of the way between 0 and 10, so the interpolated value is 0 + (100-0)/10, or 10.
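A quick check of that arithmetic (a sketch, not an executed cell):
# positional (linear) interpolation treats the NaN as the midpoint: 50.0
# index-based interpolation weights by the labels themselves:
# label 1 is one tenth of the way from 0 to 10
0 + (100 - 0) * (1.0 - 0) / (10 - 0)   # 10.0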
Often it is considered best to err on the side of having duplicates instead of missing data, especially if the data can be considered idempotent. However, duplicate data increases the size of the dataset, and if it is not idempotent, then it would not be appropriate to process the duplicates.
In [43]:
# a DataFrame with lots of duplicate data
data = pd.DataFrame({'a':['x'] * 3 + ['y'] * 4,'b':[1,1,2,3,3,4,4]})
data
Out[43]:
In [44]:
# reports which rows are duplicates, based upon
# whether the data in all columns has been seen before
data.duplicated()
Out[44]:
In [45]:
# drop duplicate rows retaining first row of the duplicates
data.drop_duplicates()
Out[45]:
The default operation is to keep the first row of the duplicates. If you want to keep the last row of the duplicates, you can use the keep='last' parameter.
In [49]:
# drop duplicate rows, keeping the last instance of any duplicated data
data.drop_duplicates(keep="last")
Out[49]:
In [50]:
# add a column c with values 0..6
# this makes .duplicated() report no duplicate rows
data['c'] = range(7)
data.duplicated()
Out[50]:
In [51]:
# but if we specify duplicates to be dropped in columns a & b
# they will be dropped
data.drop_duplicates(['a','b'])
Out[51]:
Transformation is required for the following reasons:
To address these situations we can take one or more of the following actions:
pandas provides a generic ability to map values using a lookup table, via the .map() method. This method performs the mapping by matching the values of the outer Series to the index labels of the inner Series, returning a new Series with the index labels of the outer Series but the values from the inner Series:
In [52]:
# create two Series objects to demonstrate mapping
x = pd.Series({"one":1,"two":2,"three":3})
y = pd.Series({1:"a",2:"b",3:"c"})
x
Out[52]:
In [53]:
y
Out[53]:
In [54]:
# map values in x to values in y
x.map(y)
Out[54]:
In [55]:
# three in x will not align / map to a value in y
x = pd.Series({"one":1,"two":2,"three":3})
y = pd.Series({1:"a",2:"b"})
x.map(y)
Out[55]:
In [56]:
# create a Series to demonstrate replace
s = pd.Series([0.,1.,2.,3.,4.])
s
Out[56]:
In [57]:
# replace all items with the value 2 with the value 5
s.replace(2,5)
Out[57]:
In [58]:
# replace all items with new values
s.replace([0,1,2,3,4],[4,3,2,1,0])
Out[58]:
In [59]:
# replace using entries in a dictionary
s.replace({0:10,1:100})
Out[59]:
If using .replace() on a DataFrame, it is possible to specify different values to be replaced in each column. This is performed by passing a Python dictionary to the .replace() method, where the keys of the dictionary represent the names of the columns in which the replacement is to occur, and the values of the dictionary are the values that you want to replace. The second parameter to the method is the value that any matches will be replaced with.
In [60]:
# DataFrame with two columns
df = pd.DataFrame({'a':[0,1,2,3,4],'b':[5,6,7,8,9]})
df
Out[60]:
In [61]:
# specify different replacement values for each column
df.replace({'a':1,'b':8}, 100)
Out[61]:
Replacing specific values in each of the columns is very convenient, as it provides a shorthand for what otherwise would require coding a loop through all the columns.
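As a rough sketch of the loop it replaces (using the df from the previous cell; df_loop is just an illustrative name):
# the hand-rolled equivalent of df.replace({'a': 1, 'b': 8}, 100)
df_loop = df.copy()
for col, value in {'a': 1, 'b': 8}.items():
    df_loop.loc[df_loop[col] == value, col] = 100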
In [64]:
# demonstrate replacement with pad method
# set first item to 10, to have a distinct replacement value
s[0] = 10
s
Out[64]:
In [65]:
# replace items with the values 1, 2 and 3, filling from the
# most recent prior value that is not being replaced (10)
s.replace([1,2,3],method='pad')
Out[65]:
pandas provides the ability to apply functions to individual items, entire columns, or entire rows, providing incredible flexibility in transformation.
Functions can be applied using the conveniently named .apply() method which, given a Python function, will iteratively call the function, passing in each value from a Series, each Series representing a column of a DataFrame, or each row of a DataFrame as a Series.
In [66]:
# demonstrate applying a function to every item of a series
s = pd.Series(np.arange(0,5))
s.apply(lambda x: x * 2)
Out[66]:
In [67]:
# demonstrate applying a sum on each column
df = pd.DataFrame(np.arange(12).reshape(4,3),columns=['a','b','c'])
df
Out[67]:
In [68]:
# calculate the sum of items in each column
df.apply(lambda col: col.sum())
Out[68]:
In [69]:
# calculate the sum of items in each row
df.apply(lambda row: row.sum(),axis=1)
Out[69]:
A common practice is to take the result of an apply operation and add it as a new column of the DataFrame. This is convenient as you can add into the DataFrame the results of one or more successive calculations.
In [70]:
# create a new column 'interim' with a * b
df['interim'] = df.apply(lambda r: r.a * r.b,axis=1)
df
Out[70]:
In [72]:
# and now a 'result' column with 'interim' + 'c'
df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)
df
Out[72]:
In [73]:
# replace column a with the sum of columns a,b and c
df.a = df.a + df.b + df.c
df
Out[73]:
An important point to note is that a pandas DataFrame is not a spreadsheet where cells are assigned formulas and are recalculated whenever the cells referenced by a formula change. If you want that to happen, you will need to execute the formulas whenever the dependent data changes. On the flip side, this is more efficient than a spreadsheet, as every little change does not cause a cascade of operations to occur.
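A brief sketch of this point using the interim and result columns created above (not an executed cell; the change to column 'b' is only illustrative):
# changing a value that 'interim' and 'result' were derived from
# does NOT recalculate them
df.loc[0, 'b'] = 1000
# the derived columns keep their old values until the formulas
# are explicitly re-executed
df['interim'] = df.apply(lambda r: r.a * r.b, axis=1)
df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)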
The .apply() method will always apply the provided function to all of the items, rows, or columns. If you want to apply the function to only a subset of these, first perform a Boolean selection to filter out the items you do not want to process.
In [75]:
# create a 3 X 5 dataframe
df = pd.DataFrame(np.arange(0,15).reshape(3,5))
df.loc[1,2]= np.nan
df
Out[75]:
In [76]:
# demonstrate applying a function to only rows having
# a count of 0 NaN values
df.dropna().apply(lambda x:x.sum(), axis=1)
Out[76]:
The .apply() method is always passed an entire row or column. If you want to apply a function to each individual item in the DataFrame, one by one, then .applymap() is the method to use.
In [77]:
# use applymap to format all items of the DataFrame
df.applymap(lambda x: '%.2f' % x)
Out[77]: