In a data analysis context, we want to test our code, as usual, but also our data (the expected schema, such as data types) and our statistics (the expected properties of distributions, such as value ranges). We take a defensive programming approach and run expectation checks.
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('../data/tidy_who.csv')
In [3]:
df.sample(5)
Out[3]:
As far as code is concerned (when we implement operations to transform data), please refer to the lesson on testing, debugging, and profiling.
In the first notebook, we came across pd.testing.assert_frame_equal(); be aware that pd.testing.assert_series_equal() and pd.testing.assert_index_equal() are also available.
In [4]:
pd.testing.assert_index_equal(df.index, df.index)
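As a minimal sketch of how these helpers behave, the cell below compares two small series that hold the same values with different data types. The toy data are hypothetical and are not taken from tidy_who.csv.

```python
import pandas as pd

# Hypothetical toy data: same values, different dtypes (int64 vs float64)
expected = pd.Series([1, 2, 3], name="cases")
actual = pd.Series([1.0, 2.0, 3.0], name="cases")

# The strict comparison raises AssertionError because the dtypes differ
try:
    pd.testing.assert_series_equal(actual, expected)
    strict_passed = True
except AssertionError:
    strict_passed = False

# Relaxing the dtype check compares the values only, so this passes
pd.testing.assert_series_equal(actual, expected, check_dtype=False)

# Index comparison: same labels in the same order
pd.testing.assert_index_equal(actual.index, expected.index)
```

These functions return nothing when the objects match and raise an AssertionError with a description of the first difference otherwise.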
In [5]:
df['year'].dtype
Out[5]:
In [6]:
assert df['year'].dtype == 'int'
In [7]:
df['sex'].dtype
Out[7]:
In [8]:
assert df['sex'].dtype == 'object'
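Comparing a dtype against a string such as 'int' may depend on the platform's default integer width. One possible alternative, sketched here on hypothetical toy data, is to use the type predicates in pandas.api.types and to check the expected column names at the same time.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({"year": [2015, 2016, 2017], "sex": ["f", "m", "f"]})

# Schema check: the columns we rely on are present
expected_columns = {"year", "sex"}
assert expected_columns.issubset(toy.columns)

# Type checks that do not depend on the exact integer width
assert is_integer_dtype(toy["year"])
assert is_numeric_dtype(toy["year"])
assert not is_numeric_dtype(toy["sex"])
```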
In [9]:
assert df['year'].max() <= 2017
In [10]:
assert df['cases'].min() == 0
When datasets are large, it might be difficult to carry out exact tests (for example, using pd.testing.assert_series_equal()). It might then be reasonable to test for properties of a series, rather than element-wise equality.
In [11]:
df['cases'].describe()
Out[11]:
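The idea of testing properties rather than individual elements could be wrapped in a small helper. The function check_count_series below is a hypothetical illustration, not part of pandas, and the thresholds are arbitrary assumptions.

```python
import pandas as pd


def check_count_series(s, lower=0, upper=None, max_null_share=1.0):
    """Return a list of violated expectations (empty if all of them hold)."""
    problems = []
    values = s.dropna()
    if not values.empty and values.min() < lower:
        problems.append(f"minimum {values.min()} is below {lower}")
    if upper is not None and not values.empty and values.max() > upper:
        problems.append(f"maximum {values.max()} is above {upper}")
    null_share = s.isna().mean()
    if null_share > max_null_share:
        problems.append(f"share of nulls {null_share:.2f} exceeds {max_null_share}")
    return problems


# Hypothetical toy series with one missing value
toy = pd.Series([0, 12, 7, None, 150], name="cases")

no_violations = check_count_series(toy, lower=0, upper=1000, max_null_share=0.5)
violations = check_count_series(toy, upper=100)  # 150 exceeds the upper bound
```

Returning a list of messages instead of raising on the first failure makes it possible to report all violated expectations at once.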
Make use of visual checks too: For example, it is generally a lot more straightforward to spot outliers if you plot your data!
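A minimal sketch of such a visual check, assuming matplotlib is installed, is a box plot of a numeric column. The toy values are hypothetical and include one deliberately extreme observation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with one extreme value
toy = pd.Series([3, 5, 4, 6, 5, 4, 250], name="cases")

fig, ax = plt.subplots(figsize=(3, 4))
# Points beyond 1.5 times the interquartile range are drawn as individual markers
bp = ax.boxplot(toy.dropna().to_numpy())
ax.set_ylabel("cases")

# The points that the plot draws as outliers
flagged = list(bp["fliers"][0].get_ydata())
```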
In [12]:
assert df['sex'].nunique() > 1
Some data are missing, either because they exist but were not collected or because they never existed. How can we detect missing data (null values)?
In [13]:
df_sub = df[(df.country == 'Greece') & (df.year > 2014) & (df.age_range == 65)]
df_sub
Out[13]:
In [14]:
df_sub['cases'].isnull()
Out[14]:
In [15]:
df_sub['cases'].notnull()
Out[15]:
In [16]:
df_sub['cases'].isnull().value_counts()
Out[16]:
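Because Boolean values sum as 0 and 1, the same detection can be summarised per column or as a share of rows. The sketch below uses a hypothetical toy frame rather than the real subset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing values in 'cases'
toy = pd.DataFrame({
    "country": ["Greece", "Greece", "Greece"],
    "year": [2015, 2016, 2017],
    "cases": [3.0, np.nan, np.nan],
})

null_counts = toy.isnull().sum()          # number of nulls in each column
n_missing = toy["cases"].isnull().sum()   # number of nulls in one column
share_missing = toy["cases"].isnull().mean()  # fraction of rows that are null

# Rows that contain at least one null value
rows_with_nulls = toy[toy.isnull().any(axis=1)]
```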
When summing data, null (missing) values are skipped by default, which has the same effect as treating them as zero.
In [17]:
df_sub['cases'].sum()
Out[17]:
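The skipna and min_count parameters of sum() control this behaviour. The sketch below, on hypothetical toy series, shows how to make a sum return a null value instead of silently ignoring missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series
partly_missing = pd.Series([4.0, np.nan, 6.0], name="cases")
all_missing = pd.Series([np.nan, np.nan], name="cases")

default_sum = partly_missing.sum()             # nulls are skipped
strict_sum = partly_missing.sum(skipna=False)  # any null makes the result NaN

empty_sum = all_missing.sum()                  # 0.0, although nothing was observed
guarded_sum = all_missing.sum(min_count=1)     # NaN unless at least one value is present
```

Using min_count=1 is one way to distinguish a genuine total of zero from a total computed over no observations at all.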
In [18]:
df_sub.fillna('NA')
Out[18]:
In [19]:
df_sub['cases'].fillna(0)
Out[19]:
In [20]:
df_sub.dropna()
Out[20]:
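Filling a numeric column with text mixes strings and numbers in one column, which tends to break later arithmetic. As a hedged sketch on hypothetical toy data, the cell below fills with a number instead and restricts dropna() to the column of interest.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one null in each column
toy = pd.DataFrame({
    "country": ["Greece", "Greece", None],
    "cases": [3.0, np.nan, 5.0],
})

# Filling with a number keeps the column numeric
filled = toy["cases"].fillna(0)

# Drop only the rows where 'cases' is missing
complete_cases = toy.dropna(subset=["cases"])

# Drop the rows where any column is missing
complete_rows = toy.dropna()
```

Whether zero is a sensible replacement depends on why the value is missing, so a fill like this is best treated as an explicit modelling decision.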
[...] cases to be?
[...] cases is less than the total number of observations.
[...] cases in regions EUR and AFR (together)?