Sometimes they're the same person; at other times they're not. Tasks related to testing can often be assigned to either role, but there are some tasks more naturally suited to each.
Schema checks are all about making sure that the data columns you expect are all present, and that they have the expected data types.
Data should be provided to you as two files. The first is the actual data matrix. The second is a metadata specification file, minimally containing the name of the CSV file it describes and the list of columns present. Why the duplication? The list of columns is essentially an implicit contract between your data provider and you, and provides a verifiable description of the data matrix's columns.
We're going to use a few datasets from Boston's open data repository. Let's first take a look at Boston's annual budget data, while pretending we're the person who curated the data, the "data provider".
In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [3]:
import pandas as pd
df = pd.read_csv('data/boston_budget.csv')
df.head()
Out[3]:
To get the columns of a DataFrame object df, access its df.columns attribute. This is a list-like object that can be iterated over.
In [4]:
df.columns
Out[4]:
Structure:

key1: value
key2:
- value1
- value2
- subkey1:
  - value3
Example YAML-formatted schema:

filename: boston_budget.csv
columns:
- "Fiscal Year"
- "Service (Cabinet)"
- "Department"
- "Program #"
...
- "Fund"
- "Amount"
YAML-formatted text can be read as dictionaries.
In [5]:
spec = """
filename: boston_budget.csv
columns:
- "Fiscal Year"
- "Service (Cabinet)"
- "Department"
- "Program #"
- "Program"
- "Expense Type"
- "ACCT #"
- "Expense Category (Account)"
- "Fund"
- "Amount"
"""
In [6]:
import yaml
metadata = yaml.safe_load(spec)
metadata
Out[6]:
You can also go the other way: take a dictionary and dump it back out as YAML-formatted text.
In [8]:
print(yaml.dump(metadata))
By keeping things YAML-formatted, you preserve human-readability and machine-readability simultaneously.
Providing metadata should already be part of your workflow when doing analytics; YAML is a strong suggestion for the format, but the exact schema will depend on your use case.
Let's now switch roles, and pretend that we're on the side of the "analyst" and no longer the "data provider".
How would you check that the columns match the spec? Basically, check that every element in df.columns is present inside the metadata['columns'] list.
def check_schema(df, meta_columns):
    for col in df.columns:
        assert col in meta_columns, f'"{col}" not in metadata column spec'
In your test file, outside the function definition, write another test function, test_budget_schemas(), that explicitly runs a test for just the budget data.
def test_budget_schemas():
    columns = read_metadata('data/metadata_budget.yml')['columns']
    df = pd.read_csv('data/boston_budget.csv')
    check_schema(df, columns)
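If you haven't already written a read_metadata helper in an earlier exercise, a minimal sketch of it (assuming the spec file is the YAML format shown above) could be:

import yaml

def read_metadata(handle):
    # Read a YAML schema spec file and return its contents as a dictionary.
    with open(handle, 'r') as f:
        return yaml.safe_load(f)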
Now, run the test. Do you get the following error? Can you spot the error?
    def check_schema(df, meta_columns):
        for col in df.columns:
>           assert col in meta_columns, f'"{col}" not in metadata column spec'
E           AssertionError: " Amount" not in metadata column spec
E           assert ' Amount' in ['Fiscal Year', 'Service (Cabinet)', 'Department', 'Program #', 'Program', 'Expense Type', ...]

test_datafuncs_soln.py:63: AssertionError
=================================== 1 failed, 7 passed in 0.91 seconds ===================================
If there is even a slight mis-spelling, this kind of check will help you pinpoint where it is. Note how the "Amount" column in the data is spelled with an extra leading space.
At this point, I would contact the data provider to correct errors like this.
It is a logical practice to keep one schema spec file per table provided to you. However, it is also possible to take advantage of YAML "documents" to keep multiple schema specs inside a single YAML file (sketched below).
The choice is yours: in cases where there are a lot of data files, it may make sense (for the sake of file-system sanity) to consolidate the specs into a few files that represent logical groupings of data.
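As a sketch of the multi-document approach, a single spec file (hypothetically named all_specs.yml, with abbreviated column lists) could hold one YAML document per table, separated by --- markers:

---
filename: boston_budget.csv
columns:
- "Fiscal Year"
- "Amount"
---
filename: boston_ei.csv
columns:
- "Year"
- "Month"

It can then be read with yaml.safe_load_all, which yields one dictionary per document:

import yaml

with open('all_specs.yml', 'r') as f:
    # Build a lookup from filename to its spec, one dict per "---"-separated document.
    specs = {spec['filename']: spec for spec in yaml.safe_load_all(f)}

specs['boston_budget.csv']['columns']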
Inside datafuncs.py, write a function with the signature autospec(handle) that takes in the file path of a CSV file and automatically generates a basic schema spec file for it (at minimum, the filename and its list of columns).
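One possible sketch of autospec (the naming convention for the generated spec file is a hypothetical choice, not prescribed by the exercise):

import os
import pandas as pd
import yaml

def autospec(handle):
    # Read the CSV and build a minimal spec: filename plus the list of columns.
    df = pd.read_csv(handle)
    metadata = {'filename': os.path.basename(handle),
                'columns': list(df.columns)}
    # Hypothetical naming convention: data/foo.csv -> data/metadata_foo.yml
    dirname, fname = os.path.split(handle)
    spec_path = os.path.join(dirname, 'metadata_' + fname.replace('.csv', '.yml'))
    with open(spec_path, 'w') as f:
        yaml.dump(metadata, f, default_flow_style=False)
    return spec_path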
Now, let's go "meta". Write a "meta-test" that ensures that every CSV file in the data/ directory has a schema file associated with it. (The function need not check each schema.) Until we finish filling out the rest of the exercises, this test can be allowed to fail; mark it as a test to skip with the @pytest.mark.skip decorator:
@pytest.mark.skip(reason="no way of currently testing this")
def test_my_func():
    ...
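Applied to this exercise, a sketch of the meta-test (assuming the hypothetical convention above, where every data/foo.csv has a matching data/metadata_foo.yml) might look like:

import os
import pytest

@pytest.mark.skip(reason="schema spec files not yet generated for every CSV")
def test_all_csvs_have_schema_spec():
    csv_files = [f for f in os.listdir('data') if f.endswith('.csv')]
    for fname in csv_files:
        # Hypothetical convention: data/foo.csv is described by data/metadata_foo.yml
        spec_path = os.path.join('data', 'metadata_' + fname.replace('.csv', '.yml'))
        assert os.path.exists(spec_path), f'no schema spec found for {fname}'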
Now that we're done with the schema checks, let's do some sanity checks on the data as well. This is a personal favourite of mine, as some of the activities here overlap with the early stages of exploratory data analysis.
We're going to switch datasets here, and move to a 'corrupted' version of the Boston Economic Indicators dataset. Its file path is: ./data/boston_ei-corrupt.csv.
In [9]:
import pandas as pd
import seaborn as sns
sns.set_style('white')
%matplotlib inline
df = pd.read_csv('data/boston_ei-corrupt.csv')
df.head()
Out[9]:
In [9]:
# First, we check for missing data.
import missingno as msno
msno.matrix(df)
Immediately it's clear that there are a number of rows with empty values! Nothing beats a quick visual check like this one.
We can get a table version of this using another package called pandas_summary.
In [10]:
# We can do the same using pandas-summary.
from pandas_summary import DataFrameSummary
dfs = DataFrameSummary(df)
dfs.summary()
Out[10]:
dfs.summary() returns a Pandas DataFrame; this means we can write tests for data completeness!
# In test_datafuncs.py
from pandas_summary import DataFrameSummary

def check_data_completeness(df):
    df_summary = DataFrameSummary(df).summary()
    for col in df_summary.columns:
        assert df_summary.loc['missing', col] == 0, f'{col} has missing values'

def test_boston_ei():
    df = pd.read_csv('data/boston_ei.csv')
    check_data_completeness(df)
In the Economic Indicators dataset, there are four "rate" columns: ['labor_force_part_rate', 'hotel_occup_rate', 'hotel_avg_daily_rate', 'unemp_rate'], which must have values between 0 and 1.
Add a utility function to test_datafuncs.py, check_data_range(data, lower=0, upper=1), which checks the range of the data such that:
- data is a list-like object.
- data <= upper
- data >= lower
- upper and lower have default values of 1 and 0 respectively.
Then, add to the test_boston_ei() function tests for each of these four columns, using the check_data_range() function.
# In test_datafuncs.py
def check_data_range(data, lower=0, upper=1):
    assert min(data) >= lower, f"minimum value less than {lower}"
    assert max(data) <= upper, f"maximum value greater than {upper}"

def test_boston_ei():
    df = pd.read_csv('data/boston_ei.csv')
    check_data_completeness(df)

    zero_one_cols = ['labor_force_part_rate', 'hotel_occup_rate',
                     'hotel_avg_daily_rate', 'unemp_rate']
    for col in zero_one_cols:
        check_data_range(df[col])
Most of what follows is a demonstration of the kinds of tools that may be useful to you. Feel free to take a break from coding, as these checks aren't necessarily easy to automate.
We can take the EDA portion further, by doing an empirical cumulative distribution plot for each data column.
In [ ]:
import numpy as np

def compute_dimensions(length):
    """
    Given an integer, compute the "square-est" pair of dimensions for plotting.

    Examples:
    - length: 17 => rows: 4, cols: 5
    - length: 14 => rows: 4, cols: 4

    This is a utility function; can be tested separately.
    """
    sqrt = np.sqrt(length)
    floor = int(np.floor(sqrt))
    ceil = int(np.ceil(sqrt))
    if floor ** 2 >= length:
        return (floor, floor)
    elif floor * ceil >= length:
        return (floor, ceil)
    else:
        return (ceil, ceil)

compute_dimensions(length=17)

assert compute_dimensions(17) == (4, 5)
assert compute_dimensions(16) == (4, 4)
assert compute_dimensions(15) == (4, 4)
assert compute_dimensions(11) == (3, 4)
In [14]:
# Next, let's visualize the empirical CDF for each column of data.
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

def empirical_cumdist(data, ax, title=None):
    """
    Plots the empirical cumulative distribution of values.
    """
    x, y = np.sort(data), np.arange(1, len(data) + 1) / len(data)
    ax.scatter(x, y)
    ax.set_title(title)

data_cols = [i for i in df.columns if i not in ['Year', 'Month']]
n_rows, n_cols = compute_dimensions(len(data_cols))

fig = plt.figure(figsize=(n_cols * 3, n_rows * 3))
gs = GridSpec(n_rows, n_cols)
for i, col in enumerate(data_cols):
    ax = plt.subplot(gs[i])
    empirical_cumdist(df[col], ax, title=col)
plt.tight_layout()
plt.show()
It's often a good idea to standardize numerical data (that aren't count data). The term standardize often refers to the statistical procedure of subtracting the mean and dividing by the standard deviation, yielding an empirical distribution of data centered on 0 and having standard deviation of 1.
Write a test for a function that standardizes a column of data. Then, write the function.
Note: This function is also implemented in the scikit-learn library as part of its preprocessing module. However, if you make the engineering decision not to import an entire library just to use one function, you can re-implement it on your own.
def standard_scaler(x):
    return (x - x.mean()) / x.std()

def test_standard_scaler():
    x = np.arange(10)  # arbitrary test data
    std = standard_scaler(x)
    assert np.allclose(std.mean(), 0)
    assert np.allclose(std.std(), 1)
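For reference, if you do decide to take on the dependency, a rough scikit-learn equivalent is sklearn.preprocessing.scale; note that it normalizes by the population standard deviation, so its output can differ slightly from a pandas .std()-based implementation:

from sklearn.preprocessing import scale

# z-score a single column; scale() accepts 1-D array-likes and returns a numpy array.
scaled = scale(df['labor_force_part_rate'].dropna())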
In [25]:
data_cols = [i for i in df.columns if i not in ['Year', 'Month']]
n_rows, n_cols = compute_dimensions(len(data_cols))

fig = plt.figure(figsize=(n_cols * 3, n_rows * 3))
gs = GridSpec(n_rows, n_cols)
for i, col in enumerate(data_cols):
    ax = plt.subplot(gs[i])
    empirical_cumdist(standard_scaler(df[col]), ax, title=col)
plt.tight_layout()
plt.show()
In [15]:
from collections import Counter

def empirical_catdist(data, ax, title=None):
    d = Counter(data)
    print(d)
    x = range(len(d.keys()))
    labels = list(d.keys())
    y = list(d.values())
    ax.bar(x, y)
    ax.set_xticks(x)
    ax.set_xticklabels(labels)

smartphone_df = pd.read_csv('data/smartphone_sanitization.csv')
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
empirical_catdist(smartphone_df['site'], ax=ax)
From Wikipedia:
In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). It is named after Andrey Kolmogorov and Nikolai Smirnov.
In [16]:
from scipy.stats import ks_2samp
import numpy.random as npr
# Simulate a normal distribution with 10000 draws.
normal_rvs = npr.normal(size=10000)
result = ks_2samp(normal_rvs, df['labor_force_part_rate'].dropna())
result.pvalue < 0.05
Out[16]:
In [17]:
fig = plt.figure()
ax = fig.add_subplot(111)
empirical_cumdist(normal_rvs, ax=ax)
empirical_cumdist(df['hotel_occup_rate'], ax=ax)
In [36]:
data_cols = [i for i in df.columns if i not in ['Year', 'Month']]
n_rows, n_cols = compute_dimensions(len(data_cols))

fig = plt.figure(figsize=(n_cols * 3, n_rows * 3))
gs = GridSpec(n_rows, n_cols)
for i, col in enumerate(data_cols):
    ax = plt.subplot(gs[i])
    test = ks_2samp(normal_rvs, standard_scaler(df[col]))
    empirical_cumdist(normal_rvs, ax)
    empirical_cumdist(standard_scaler(df[col]), ax, title=f"{col}, p={round(test.pvalue, 2)}")
plt.tight_layout()
plt.show()