Testing for data analysis

In a data analysis context, we want to test our code, as usual, but also our data (i.e., expected schema; e.g., data types) and our statistics (i.e., expected properties of distributions; e.g., value ranges). We focus on a defensive programming approach, by running expectation checks.



In [1]:

    
import pandas as pd



In [2]:

    
df = pd.read_csv('../data/tidy_who.csv')



In [3]:

    
df.sample(5)









    Out[3]:







  
    
      
      country
      g_whoregion
      year
      cases
      type
      sex
      age_range
    
  
  
    
      99556
      Viet Nam
      WPR
      1990
      NaN
      sp
      f
      5564
    
    
      133367
      Guatemala
      AMR
      1997
      NaN
      sn
      m
      3544
    
    
      64284
      Guatemala
      AMR
      1980
      NaN
      sp
      f
      1524
    
    
      372475
      Maldives
      SEA
      1987
      NaN
      rel
      m
      65
    
    
      392146
      Benin
      AFR
      1996
      NaN
      rel
      f
      2534

Testing code

As far as code is concerned (when we implement operations to transform data), please refer to the lesson on testing, debugging, and profiling.

In the first notebook, we came across pd.testing.assert_frame_equal(); be aware that pd.testing.assert_series_equal() and pd.testing.assert_index_equal() are also available.



In [4]:

    
pd.testing.assert_index_equal(df.index, df.index)

Testing data



In [5]:

    
df['year'].dtype









    Out[5]:





dtype('int64')



In [6]:

    
assert df['year'].dtype == 'int'



In [7]:

    
df['sex'].dtype









    Out[7]:





dtype('O')



In [8]:

    
assert df['sex'].dtype == 'object'

Testing statistics



In [9]:

    
assert df['year'].max() <= 2017



In [10]:

    
assert df['cases'].min() == 0

When datasets are large, it might be difficult to carry out exact tests (for example, using pd.testing.assert_series_equal()). It might then be reasonable to test for properties of a series, rather than element-wise equality.



In [11]:

    
df['cases'].describe()









    Out[11]:





count     81381.000000
mean        667.482496
std        4490.566875
min           0.000000
25%           3.000000
50%          28.000000
75%         200.000000
max      250051.000000
Name: cases, dtype: float64

Make use of visual checks too: For example, it is generally a lot more straightforward to spot outliers if you plot your data!



In [12]:

    
assert df['sex'].nunique() > 1

Handling missing data

Some data are missing, either because they exist but were not collected or because they never existed. How can we detect missing data (null values)?



In [13]:

    
df_sub = df[(df.country == 'Greece') & (df.year > 2014) & (df.age_range == 65)]
df_sub









    Out[13]:







  
    
      
      country
      g_whoregion
      year
      cases
      type
      sex
      age_range
    
  
  
    
      48827
      Greece
      EUR
      2015
      NaN
      sp
      m
      65
    
    
      102545
      Greece
      EUR
      2015
      NaN
      sp
      f
      65
    
    
      156263
      Greece
      EUR
      2015
      NaN
      sn
      m
      65
    
    
      209981
      Greece
      EUR
      2015
      NaN
      sn
      f
      65
    
    
      263699
      Greece
      EUR
      2015
      NaN
      ep
      m
      65
    
    
      317417
      Greece
      EUR
      2015
      NaN
      ep
      f
      65
    
    
      371135
      Greece
      EUR
      2015
      86.0
      rel
      m
      65
    
    
      424853
      Greece
      EUR
      2015
      42.0
      rel
      f
      65



In [14]:

    
df_sub['cases'].isnull()









    Out[14]:





48827      True
102545     True
156263     True
209981     True
263699     True
317417     True
371135    False
424853    False
Name: cases, dtype: bool



In [15]:

    
df_sub['cases'].notnull()









    Out[15]:





48827     False
102545    False
156263    False
209981    False
263699    False
317417    False
371135     True
424853     True
Name: cases, dtype: bool



In [16]:

    
df_sub['cases'].isnull().value_counts()









    Out[16]:





True     6
False    2
Name: cases, dtype: int64

When summing data, null (missing) values are treated as zero.



In [17]:

    
df_sub['cases'].sum()









    Out[17]:





128.0



In [18]:

    
df_sub.fillna('NA')









    Out[18]:







  
    
      
      country
      g_whoregion
      year
      cases
      type
      sex
      age_range
    
  
  
    
      48827
      Greece
      EUR
      2015
      NA
      sp
      m
      65
    
    
      102545
      Greece
      EUR
      2015
      NA
      sp
      f
      65
    
    
      156263
      Greece
      EUR
      2015
      NA
      sn
      m
      65
    
    
      209981
      Greece
      EUR
      2015
      NA
      sn
      f
      65
    
    
      263699
      Greece
      EUR
      2015
      NA
      ep
      m
      65
    
    
      317417
      Greece
      EUR
      2015
      NA
      ep
      f
      65
    
    
      371135
      Greece
      EUR
      2015
      86
      rel
      m
      65
    
    
      424853
      Greece
      EUR
      2015
      42
      rel
      f
      65



In [19]:

    
df_sub['cases'].fillna('0')









    Out[19]:





48827      0
102545     0
156263     0
209981     0
263699     0
317417     0
371135    86
424853    42
Name: cases, dtype: object



In [20]:

    
df_sub.dropna()









    Out[20]:







  
    
      
      country
      g_whoregion
      year
      cases
      type
      sex
      age_range
    
  
  
    
      371135
      Greece
      EUR
      2015
      86.0
      rel
      m
      65
    
    
      424853
      Greece
      EUR
      2015
      42.0
      rel
      f
      65

Hands-on exercises

What type would you expect the variable cases to be?
Write an expectation check to ensure that the number of missing values for cases is less than the total number of observations.
What is the ratio of non-null values for cases in regions EUR and AFR (together)?

Reference

Tutorial on "Best Testing Practices for Data Science" by Eric J. Ma at PyCon 2017: https://www.youtube.com/watch?v=yACtdj1_IxE

	country	g_whoregion	year	cases	type	sex	age_range
99556	Viet Nam	WPR	1990	NaN	sp	f	5564
133367	Guatemala	AMR	1997	NaN	sn	m	3544
64284	Guatemala	AMR	1980	NaN	sp	f	1524
372475	Maldives	SEA	1987	NaN	rel	m	65
392146	Benin	AFR	1996	NaN	rel	f	2534

	country	g_whoregion	year	cases	type	sex	age_range
48827	Greece	EUR	2015	NaN	sp	m	65
102545	Greece	EUR	2015	NaN	sp	f	65
156263	Greece	EUR	2015	NaN	sn	m	65
209981	Greece	EUR	2015	NaN	sn	f	65
263699	Greece	EUR	2015	NaN	ep	m	65
317417	Greece	EUR	2015	NaN	ep	f	65
371135	Greece	EUR	2015	86.0	rel	m	65
424853	Greece	EUR	2015	42.0	rel	f	65