Title: Missing Data In Pandas Dataframes
Slug: pandas_missing_data
Summary: Missing Data In Pandas Dataframes
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

import modules


In [6]:
import pandas as pd
import numpy as np

Create dataframe with missing values


In [7]:
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 
        'age': [42, np.nan, 36, 24, 73], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df


Out[7]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
1 NaN NaN NaN NaN NaN NaN
2 Tina Ali 36.0 f NaN NaN
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0

Drop missing observations


In [8]:
df_no_missing = df.dropna()
df_no_missing


Out[8]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0

Drop rows where all cells in that row is NA


In [9]:
df_cleaned = df.dropna(how='all')
df_cleaned


Out[9]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
2 Tina Ali 36.0 f NaN NaN
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0

Create a new column full of missing values


In [10]:
df['location'] = np.nan
df


Out[10]:
first_name last_name age sex preTestScore postTestScore location
0 Jason Miller 42.0 m 4.0 25.0 NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 Tina Ali 36.0 f NaN NaN NaN
3 Jake Milner 24.0 m 2.0 62.0 NaN
4 Amy Cooze 73.0 f 3.0 70.0 NaN

Drop column if they only contain missing values


In [11]:
df.dropna(axis=1, how='all')


Out[11]:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
1 NaN NaN NaN NaN NaN NaN
2 Tina Ali 36.0 f NaN NaN
3 Jake Milner 24.0 m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0

Drop rows that contain less than five observations

This is really mostly useful for time series


In [12]:
df.dropna(thresh=5)


Out[12]:
first_name last_name age sex preTestScore postTestScore location
0 Jason Miller 42.0 m 4.0 25.0 NaN
3 Jake Milner 24.0 m 2.0 62.0 NaN
4 Amy Cooze 73.0 f 3.0 70.0 NaN

Fill in missing data with zeros


In [13]:
df.fillna(0)


Out[13]:
first_name last_name age sex preTestScore postTestScore location
0 Jason Miller 42.0 m 4.0 25.0 0.0
1 0 0 0.0 0 0.0 0.0 0.0
2 Tina Ali 36.0 f 0.0 0.0 0.0
3 Jake Milner 24.0 m 2.0 62.0 0.0
4 Amy Cooze 73.0 f 3.0 70.0 0.0

Fill in missing in preTestScore with the mean value of preTestScore

inplace=True means that the changes are saved to the df right away


In [14]:
df["preTestScore"].fillna(df["preTestScore"].mean(), inplace=True)
df


Out[14]:
first_name last_name age sex preTestScore postTestScore location
0 Jason Miller 42.0 m 4.0 25.0 NaN
1 NaN NaN NaN NaN 3.0 NaN NaN
2 Tina Ali 36.0 f 3.0 NaN NaN
3 Jake Milner 24.0 m 2.0 62.0 NaN
4 Amy Cooze 73.0 f 3.0 70.0 NaN

Fill in missing in postTestScore with each sex's mean value of postTestScore


In [15]:
df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True)
df


Out[15]:
first_name last_name age sex preTestScore postTestScore location
0 Jason Miller 42.0 m 4.0 25.0 NaN
1 NaN NaN NaN NaN 3.0 NaN NaN
2 Tina Ali 36.0 f 3.0 70.0 NaN
3 Jake Milner 24.0 m 2.0 62.0 NaN
4 Amy Cooze 73.0 f 3.0 70.0 NaN

Select some raws but ignore the missing data points


In [16]:
# Select the rows of df where age is not NaN and sex is not NaN
df[df['age'].notnull() & df['sex'].notnull()]


Out[16]:
first_name last_name age sex preTestScore postTestScore location
0 Jason Miller 42.0 m 4.0 25.0 NaN
2 Tina Ali 36.0 f 3.0 70.0 NaN
3 Jake Milner 24.0 m 2.0 62.0 NaN
4 Amy Cooze 73.0 f 3.0 70.0 NaN