Tidying of data is required for many reasons including these:
Moving away from a list of problems with data that need to be addressed, there are several characteristics of data that can be considered good, tidy, and ready for analysis, which are as follows:
In [2]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
import datetime
# set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)
Data is "missing" in pandas when it has a value of NaN (also seen as np.nan, the form from NumPy). The NaN value indicates that, for a particular index label in a Series, no value is specified.
In pandas, there are a number of reasons why a value can be NaN:
In [3]:
# create a DataFrame with 5 rows and 3 columns
df = pd.DataFrame(np.arange(0,15).reshape(5,3),index=['a','b','c','d','e'], columns=['col1','col2','col3'])
df
Out[3]:
In [4]:
# add some columns and rows to the DataFrame
# column c4 with NaN values
df['c4'] = np.nan
# row 'f' with 15 through 18
df.loc['f'] = np.arange(15,19)
# row 'g' with all NaN
df.loc['g'] = np.nan
# column c5 with NaN's
df['c5'] = np.nan
# change value in col 'c4' row 'a'
df.loc['a', 'c4'] = 20
df
Out[4]:
This DataFrame object exhibits the following characteristics that will support most of the examples that follow:
In [5]:
# which items are NaN?
df.isnull()
Out[5]:
In [6]:
# count the number of NaN values in each column
df.isnull().sum()
Out[6]:
In [7]:
# total count of NaN values
df.isnull().sum().sum()
Out[7]:
In [8]:
# number of non-NaN values in each column
df.count()
Out[8]:
In [9]:
# and can be used to count the NaN values
(len(df)-df.count()).sum()
Out[9]:
In [10]:
# which items are not null
df.notnull()
Out[10]:
In [11]:
# select the non-NaN items in column c4
df.c4[df.c4.notnull()]
Out[11]:
In [12]:
# .dropna() will also return the non-NaN values
# this gets all non-NaN items in column c4
df.c4.dropna()
Out[12]:
In [13]:
# dropna returns a copy with the NaN values dropped
# the source DataFrame / column is not changed
df.c4
Out[13]:
When applied to a DataFrame object, dropna() will drop all rows from the DataFrame object that have at least one NaN value. If you want to drop only rows where all values are NaN, you can use the how="all" parameter.
In [14]:
# on a DataFrame this will drop entire rows
# where there is at least one NaN
# in this case, all the rows
df.dropna()
Out[14]:
In [15]:
# using how='all', only rows that have all values
# as NaN will be dropped
df.dropna(how='all')
Out[15]:
In [16]:
# flip to drop columns instead of rows
df.dropna(how='all',axis=1) # c5 column will be dropped
Out[16]:
In [19]:
# make a copy of df
df2 = df.copy()
# replace two NaN cells with values
df2.loc['g', 'col1'] = 0
df2.loc['g', 'col3'] = 0
df2
Out[19]:
In [20]:
# now drop columns with any NaN values
df2.dropna(how='any',axis=1)
Out[20]:
In [21]:
# thresh=5 keeps only the columns that have at least 5 non-NaN values
df.dropna(thresh=5,axis=1)
Out[21]:
Note that the .dropna() method returns a copy of the DataFrame object, and the data is dropped from that copy. If you want to drop the data in the actual DataFrame, use the inplace=True parameter.
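A brief sketch of the difference, using the df built above (these lines are illustrative rather than an executed cell, and the dropped / df_in_place names are only for demonstration):
# .dropna() returns a new object; df itself is unchanged
dropped = df.dropna(how='all')
# with inplace=True the rows are dropped from the object itself
# and the method returns None
df_in_place = df.copy()
df_in_place.dropna(how='all', inplace=True)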
The NaN values are handled differently in pandas than in NumPy. NumPy functions, when encountering a NaN value, will return NaN. pandas functions will typically ignore the NaN values and continue processing as though those values were not part of the Series object.
More specifically, the way that pandas handles NaN values is as follows:
In [22]:
# create a NumPy array with one NaN value
a = np.array([1,2,np.nan,3])
# create a Series from the array
s = pd.Series(a)
# the mean of each is different
a.mean(), s.mean()
Out[22]:
In [23]:
# demonstrate sum,mean and cumsum handling of NaN
# get one column
s = df.c4
s.sum() # NaN values treated as 0
Out[23]:
In [24]:
s.mean() # NaN values are excluded from the mean
Out[24]:
In [25]:
# NaN values are skipped, but positions with NaN are preserved in the result Series
s.cumsum()
Out[25]:
In [26]:
# in arithmetic, a NaN value will result in NaN
df.c4 + 1
Out[26]:
In [27]:
# return a new DataFrame with NaN values filled with 0
filled = df.fillna(0)
filled
Out[27]:
In [28]:
# having replaced NaN with 0 can make
# operations such as mean have different results
filled.mean()
Out[28]:
It is also possible to limit the number of NaN values that are filled using the limit parameter. When filling with a value, at most that many NaN values will be filled in each column; when filling with a method such as ffill, the limit is the maximum number of consecutive NaN values that will be filled in each gap.
In [29]:
# only fills the first two NaN values in each column with 0
df.fillna(0,limit=2)
Out[29]:
In [30]:
# extract the c4 column and fill NaNs forward
df.c4.fillna(method="ffill")
Out[30]:
In [31]:
# perform a backward fill
df.c4.fillna(method="bfill")
Out[31]:
You can also use the convenience methods .ffill() and .bfill() to perform the same fills.
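For example (a brief sketch rather than an executed cell from this notebook):
# .ffill() and .bfill() are shorthand for the fillna calls above
df.c4.ffill()   # same as df.c4.fillna(method="ffill")
df.c4.bfill()   # same as df.c4.fillna(method="bfill")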
In [33]:
# create a new series of values to be
# used to fill NaN values where the index label matches
fill_values = pd.Series([100,101,102], index=['a','e','g'])
fill_values
Out[33]:
In [34]:
# using c4, fill using fill_values
# a, e and g will be filled with matching values
df.c4.fillna(fill_values)
Out[34]:
In [35]:
# fill NaN values in each column with the
# mean of the values in that column
df.fillna(df.mean())
Out[35]:
In [36]:
# linear interpolate the NaN values from 1 through 2
s = pd.Series([1,np.nan,np.nan,np.nan,2])
s.interpolate()
Out[36]:
The interpolated values are calculated by taking the first value before and after any sequence of NaN values, computing the increment between them, and then adding that increment cumulatively in place of each NaN. In this case, 1.0 and 2.0 are the surrounding values, giving an increment of (2.0-1.0)/(5-1)=0.25, which is then added cumulatively through all of the NaN values.
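A quick hand-check of that calculation (a sketch, not an executed cell from this notebook):
# the increment added in place of each NaN
step = (2.0 - 1.0) / (5 - 1)           # 0.25
# the resulting values: 1.0, 1.25, 1.5, 1.75, 2.0
[1.0 + step * i for i in range(5)]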
The .interpolate() method also allows a specific method of interpolation to be specified. One of the common methods is time-based interpolation.
In [37]:
# create a time series, but missing one date in the Series
ts = pd.Series([1,np.nan,3],index=[datetime.datetime(2014,1,1),datetime.datetime(2014,2,1),datetime.datetime(2014,4,1)])
ts
Out[37]:
In [38]:
ts.interpolate()
Out[38]:
The important thing to note is that the series is missing an entry for 2014-03-01. If we were expecting to interpolate evenly spaced (for example, monthly) values, there would be two values to calculate, one for 2014-02-01 and another for 2014-03-01, resulting in one more interval in the interpolation.
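As a rough hand-check of what method="time" should produce for 2014-02-01 (a sketch, not an executed cell; the day counts are taken from the index dates):
# 2014-02-01 is 31 days into the 90 days between the two known points
1 + (3 - 1) * 31.0 / 90.0   # approximately 1.689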
In [39]:
# this accounts for the fact that we don't have
# an entry for 2014-03-01
ts.interpolate(method="time")
Out[39]:
Interpolation can also calculate values relative to the index values when the index labels are numeric.
In [40]:
# a series to demonstrate index label based interpolation
s = pd.Series([0,np.nan,100], index=[0,1,10])
s
Out[40]:
In [41]:
# linear interpolate
s.interpolate()
Out[41]:
In [42]:
# interpolate based upon the values in the index
s.interpolate(method="values")
Out[42]:
Now the value calculated for the NaN is interpolated using relative positioning based upon the labels in the index. The NaN value has a label of 1, which is one tenth of the way between 0 and 10, so the interpolated value is 0 + (100-0)/10, or 10.
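A quick check of that arithmetic (a sketch, not an executed cell):
# positional (linear) interpolation treats the NaN as the midpoint: 50.0
# index-based interpolation weights by the labels themselves:
# label 1 is one tenth of the way from 0 to 10
0 + (100 - 0) * (1.0 - 0) / (10 - 0)   # 10.0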
Often it is considered best to err on the side of having duplicates instead of missing data, especially if the data can be considered idempotent. However, duplicate data increases the size of the dataset, and if it is not idempotent, then it would not be appropriate to process the duplicates.
In [43]:
# a DataFrame with lots of duplicate data
data = pd.DataFrame({'a':['x'] * 3 + ['y'] * 4,'b':[1,1,2,3,3,4,4]})
data
Out[43]:
In [44]:
# reports which rows are duplicates, based upon
# whether the data in all columns has been seen before
data.duplicated()
Out[44]:
In [45]:
# drop duplicate rows retaining first row of the duplicates
data.drop_duplicates()
Out[45]:
The default operation is to keep the first row of the duplicates. If you want to keep the last row of the duplicates, you can use the keep='last' parameter.
In [49]:
# drop duplicate rows, keeping the last instance of any duplicated data
data.drop_duplicates(keep="last")
Out[49]:
In [50]:
# add a column c with values 0..6
# this makes .duplicated() report no duplicate rows
data['c'] = range(7)
data.duplicated()
Out[50]:
In [51]:
# but if we specify duplicates to be dropped in columns a & b
# they will be dropped
data.drop_duplicates(['a','b'])
Out[51]:
Transformation is required for the following reasons:
To address these situations we can take one or more of the following actions:
pandas provides a generic ability to map values using a lookup table, via the .map() method. This method performs the mapping by matching the values of the outer Series to the index labels of the inner Series, returning a new Series with the index labels of the outer Series but the values from the inner Series:
In [52]:
# create two Series objects to demonstrate mapping
x = pd.Series({"one":1,"two":2,"three":3})
y = pd.Series({1:"a",2:"b",3:"c"})
x
Out[52]:
In [53]:
y
Out[53]:
In [54]:
# map values in x to values in y
x.map(y)
Out[54]:
In [55]:
# three in x will not align / map to a value in y
x = pd.Series({"one":1,"two":2,"three":3})
y = pd.Series({1:"a",2:"b"})
x.map(y)
Out[55]:
In [56]:
# create a Series to demonstrate replace
s = pd.Series([0.,1.,2.,3.,4.])
s
Out[56]:
In [57]:
# replace all items with the value 2 with the value 5
s.replace(2,5)
Out[57]:
In [58]:
# replace all items with new values
s.replace([0,1,2,3,4],[4,3,2,1,0])
Out[58]:
In [59]:
# replace using entries in a dictionary
s.replace({0:10,1:100})
Out[59]:
If using .replace() on a DataFrame, it is possible to specify different values to be replaced in each column. This is performed by passing a Python dictionary to the .replace() method, where the keys of the dictionary represent the names of the columns in which the replacement is to occur, and the values of the dictionary are the values that you want to replace. The second parameter to the method is the value that any matches will be replaced with.
In [60]:
# DataFrame with two columns
df = pd.DataFrame({'a':[0,1,2,3,4],'b':[5,6,7,8,9]})
df
Out[60]:
In [61]:
# specify different replacement values for each column
df.replace({'a':1,'b':8}, 100)
Out[61]:
Replacing specific values in each of the columns is very convenient, as it provides a shorthand for what otherwise would require coding a loop through all the columns.
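As a rough sketch of the loop it replaces (using the df from the previous cell; df_loop is just an illustrative name):
# the hand-rolled equivalent of df.replace({'a': 1, 'b': 8}, 100)
df_loop = df.copy()
for col, value in {'a': 1, 'b': 8}.items():
    df_loop.loc[df_loop[col] == value, col] = 100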
In [64]:
# demonstrate replacement with pad method
# set first item to 10, to have a distinct replacement value
s[0] = 10
s
Out[64]:
In [65]:
# replace items with the values 1, 2 and 3, filling from the
# most recent prior value that is not being replaced (10)
s.replace([1,2,3],method='pad')
Out[65]:
pandas provides the ability to apply functions to individual items, entire columns, or entire rows, providing incredible flexibility in transformation.
Functions can be applied using the conveniently named .apply() method which, given a Python function, will iteratively call the function, passing in each value from a Series, each Series representing a column of a DataFrame, or each row of a DataFrame as a Series.
In [66]:
# demonstrate applying a function to every item of a series
s = pd.Series(np.arange(0,5))
s.apply(lambda x: x * 2)
Out[66]:
In [67]:
# demonstrate applying a sum on each column
df = pd.DataFrame(np.arange(12).reshape(4,3),columns=['a','b','c'])
df
Out[67]:
In [68]:
# calculate the sum of items in each column
df.apply(lambda col: col.sum())
Out[68]:
In [69]:
# calculate the sum of items in each row
df.apply(lambda row: row.sum(),axis=1)
Out[69]:
A common practice is to take the result of an apply operation and add it as a new column of the DataFrame. This is convenient as you can add into the DataFrame the results of one or more successive calculations.
In [70]:
# create a new column 'interim' with a * b
df['interim'] = df.apply(lambda r: r.a * r.b,axis=1)
df
Out[70]:
In [72]:
# and now a 'result' column with 'interim' + 'c'
df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)
df
Out[72]:
In [73]:
# replace column a with the sum of columns a,b and c
df.a = df.a + df.b + df.c
df
Out[73]:
An important point to note is that a pandas DataFrame is not a spreadsheet where cells are assigned formulas and are recalculated whenever the cells referenced by a formula change. If you want that to happen, you will need to execute the formulas whenever the dependent data changes. On the flip side, this is more efficient than a spreadsheet, as every little change does not cause a cascade of operations to occur.
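A brief sketch of this point using the interim and result columns created above (not an executed cell; the change to column 'b' is only illustrative):
# changing a value that 'interim' and 'result' were derived from
# does NOT recalculate them
df.loc[0, 'b'] = 1000
# the derived columns keep their old values until the formulas
# are explicitly re-executed
df['interim'] = df.apply(lambda r: r.a * r.b, axis=1)
df['result'] = df.apply(lambda r: r.interim + r.c, axis=1)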
The .apply() method will always apply the provided function to all of the items, rows, or columns. If you want to apply the function to only a subset of these, first perform a Boolean selection to filter out the items you do not want to process.
In [75]:
# create a 3 X 5 dataframe
df = pd.DataFrame(np.arange(0,15).reshape(3,5))
df.loc[1,2]= np.nan
df
Out[75]:
In [76]:
# demonstrate applying a function to only rows having
# a count of 0 NaN values
df.dropna().apply(lambda x:x.sum(), axis=1)
Out[76]:
The .apply() method is always passed an entire row or column. If you want to apply a function to each individual item in the DataFrame, one by one, then .applymap() is the method to use.
In [77]:
# use applymap to format all items of the DataFrame
df.applymap(lambda x: '%.2f' % x)
Out[77]: