Testing for data analysis

In a data analysis context, we want to test our code, as usual, but also our data (i.e., expected schema; e.g., data types) and our statistics (i.e., expected properties of distributions; e.g., value ranges). We focus on a defensive programming approach, by running expectation checks.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/tidy_who.csv')

In [3]:
df.sample(5)


Out[3]:
country g_whoregion year cases type sex age_range
99556 Viet Nam WPR 1990 NaN sp f 5564
133367 Guatemala AMR 1997 NaN sn m 3544
64284 Guatemala AMR 1980 NaN sp f 1524
372475 Maldives SEA 1987 NaN rel m 65
392146 Benin AFR 1996 NaN rel f 2534

Testing code

As far as code is concerned (when we implement operations to transform data), please refer to the lesson on testing, debugging, and profiling.

In the first notebook, we came across pd.testing.assert_frame_equal(); be aware that pd.testing.assert_series_equal() and pd.testing.assert_index_equal() are also available.


In [4]:
pd.testing.assert_index_equal(df.index, df.index)

Testing data


In [5]:
df['year'].dtype


Out[5]:
dtype('int64')

In [6]:
assert df['year'].dtype == 'int'

In [7]:
df['sex'].dtype


Out[7]:
dtype('O')

In [8]:
assert df['sex'].dtype == 'object'

Testing statistics


In [9]:
assert df['year'].max() <= 2017

In [10]:
assert df['cases'].min() == 0

When datasets are large, it might be difficult to carry out exact tests (for example, using pd.testing.assert_series_equal()). It might then be reasonable to test for properties of a series, rather than element-wise equality.


In [11]:
df['cases'].describe()


Out[11]:
count     81381.000000
mean        667.482496
std        4490.566875
min           0.000000
25%           3.000000
50%          28.000000
75%         200.000000
max      250051.000000
Name: cases, dtype: float64

Make use of visual checks too: For example, it is generally a lot more straightforward to spot outliers if you plot your data!


In [12]:
assert df['sex'].nunique() > 1

Handling missing data

Some data are missing, either because they exist but were not collected or because they never existed. How can we detect missing data (null values)?


In [13]:
df_sub = df[(df.country == 'Greece') & (df.year > 2014) & (df.age_range == 65)]
df_sub


Out[13]:
country g_whoregion year cases type sex age_range
48827 Greece EUR 2015 NaN sp m 65
102545 Greece EUR 2015 NaN sp f 65
156263 Greece EUR 2015 NaN sn m 65
209981 Greece EUR 2015 NaN sn f 65
263699 Greece EUR 2015 NaN ep m 65
317417 Greece EUR 2015 NaN ep f 65
371135 Greece EUR 2015 86.0 rel m 65
424853 Greece EUR 2015 42.0 rel f 65

In [14]:
df_sub['cases'].isnull()


Out[14]:
48827      True
102545     True
156263     True
209981     True
263699     True
317417     True
371135    False
424853    False
Name: cases, dtype: bool

In [15]:
df_sub['cases'].notnull()


Out[15]:
48827     False
102545    False
156263    False
209981    False
263699    False
317417    False
371135     True
424853     True
Name: cases, dtype: bool

In [16]:
df_sub['cases'].isnull().value_counts()


Out[16]:
True     6
False    2
Name: cases, dtype: int64

When summing data, null (missing) values are treated as zero.


In [17]:
df_sub['cases'].sum()


Out[17]:
128.0

In [18]:
df_sub.fillna('NA')


Out[18]:
country g_whoregion year cases type sex age_range
48827 Greece EUR 2015 NA sp m 65
102545 Greece EUR 2015 NA sp f 65
156263 Greece EUR 2015 NA sn m 65
209981 Greece EUR 2015 NA sn f 65
263699 Greece EUR 2015 NA ep m 65
317417 Greece EUR 2015 NA ep f 65
371135 Greece EUR 2015 86 rel m 65
424853 Greece EUR 2015 42 rel f 65

In [19]:
df_sub['cases'].fillna('0')


Out[19]:
48827      0
102545     0
156263     0
209981     0
263699     0
317417     0
371135    86
424853    42
Name: cases, dtype: object

In [20]:
df_sub.dropna()


Out[20]:
country g_whoregion year cases type sex age_range
371135 Greece EUR 2015 86.0 rel m 65
424853 Greece EUR 2015 42.0 rel f 65

Hands-on exercises

  1. What type would you expect the variable cases to be?
  2. Write an expectation check to ensure that the number of missing values for cases is less than the total number of observations.
  3. What is the ratio of non-null values for cases in regions EUR and AFR (together)?

Reference