Best Data Testing Practices for Data Science

Eric J. Ma

MIT Biological Engineering

How to use these notebooks

  • Follow along with Jupyter notebooks in GitHub: ericmjl/data-testing-tutorial
  • Most of what we will do is in the terminal & your favourite text editor.

Why tests?

  • We make assumptions about our code & data.
  • There are cases where those assumptions are violated.
  • Therefore, automated testing of those assumptions is important.

Tests: A Definition

A contract between your current self and your future self. What you expect to be right now should still be right in the future. What you expect to be wrong now should still be wrong in the future. Unless the requirements have changed!

Let's discuss!

What needs to be tested for:

  • code?
  • data?
  • statistics?

For code, what needs to be tested?

  • Given some example input(s), the output is correct.
  • Counter-examples should show up as incorrect.
  • Boundary cases are accounted for using defensive programming.
  • All lines of stable code are subject to at least one test.
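
A minimal sketch of the first three points, assuming a hypothetical function min_max_scale (not part of the tutorial repository). The tests are plain functions with assert statements, so pytest can collect them; the boundary-case test uses try/except only to keep the sketch free of imports, where pytest.raises would be the usual choice.

```python
def min_max_scale(values):
    """Scale a sequence of numbers linearly onto the range [0, 1]."""
    values = list(values)
    # Defensive programming: reject boundary cases the formula cannot handle.
    if len(values) == 0:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be identical")
    return [(v - lo) / (hi - lo) for v in values]


def test_example_input_gives_correct_output():
    # Known input, known output.
    assert min_max_scale([0, 5, 10]) == [0.0, 0.5, 1.0]


def test_counter_example_is_incorrect():
    # A plausible wrong answer (the unscaled input) must not come back.
    assert min_max_scale([0, 5, 10]) != [0, 5, 10]


def test_boundary_cases_raise():
    # Empty input and constant input should both be rejected loudly.
    for bad_input in ([], [3, 3, 3]):
        try:
            min_max_scale(bad_input)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad_input!r}")
```

Every line of min_max_scale is exercised by at least one of these three tests, which is what the last bullet asks for.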

For data, what needs to be tested?

  • Data types are appropriate. (Types)
  • Data has not been tampered with. (Integrity)
  • Missing values are accounted for. (Completeness)
  • Data schema is complete. (Structure)
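
The four checks above could be sketched as follows. The data, column names, and helper functions are all made up for illustration, and the sketch uses only the standard library; the same assertions translate directly to a DataFrame. In practice the recorded digest would be stored when the file was first received, not recomputed from the file under test.

```python
import csv
import hashlib
import io

# Hypothetical raw file contents; the second row has a missing temperature.
RAW = "id,temperature,site\n1,20.5,A\n2,,B\n3,19.0,A\n"
EXPECTED_COLUMNS = ["id", "temperature", "site"]


def digest(text):
    """Return the SHA-256 hex digest of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Assumption: stands in for a digest written down when the data arrived.
RECORDED_DIGEST = digest(RAW)


def read_rows(text):
    """Parse CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    return reader.fieldnames, list(reader)


def test_structure():
    columns, _ = read_rows(RAW)
    assert columns == EXPECTED_COLUMNS


def test_types():
    _, rows = read_rows(RAW)
    for row in rows:
        int(row["id"])  # raises ValueError if not an integer
        if row["temperature"] != "":
            float(row["temperature"])  # raises ValueError if not numeric


def test_completeness():
    # Missing values are allowed only where we already know about them.
    _, rows = read_rows(RAW)
    missing = [row["id"] for row in rows if row["temperature"] == ""]
    assert missing == ["2"]


def test_integrity():
    assert digest(RAW) == RECORDED_DIGEST
    # Any edit to the data, however small, changes the digest.
    assert digest(RAW.replace("20.5", "25.0")) != RECORDED_DIGEST
```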

For statistical analysis & ML, what else needs to be done?

  • Check the underlying distributions of numeric (integer or float) data.
  • Classify data as categorical, ordinal, count, compositional, or continuous.
  • Convert categorical/ordinal values represented as strings to numerical representations.
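
A rough sketch of what these steps might look like in code. The level ordering, function names, and example values are assumptions chosen for illustration; real projects would typically lean on their data-frame or ML library for encoding.

```python
import math

# Assumed ordering for a hypothetical ordinal variable.
ORDINAL_LEVELS = {"low": 0, "medium": 1, "high": 2}


def encode_ordinal(labels, levels=ORDINAL_LEVELS):
    """Map ordinal string labels to integers that preserve their order."""
    unknown = set(labels) - set(levels)
    if unknown:
        raise ValueError(f"unknown levels: {sorted(unknown)}")
    return [levels[label] for label in labels]


def one_hot(labels):
    """Encode unordered categories as 0/1 indicator rows."""
    categories = sorted(set(labels))
    rows = [[int(label == c) for c in categories] for label in labels]
    return categories, rows


def is_count_data(values):
    """Count data: non-negative integers."""
    return all(isinstance(v, int) and v >= 0 for v in values)


def is_compositional(row):
    """Compositional data: non-negative parts that sum to 1."""
    return all(v >= 0 for v in row) and math.isclose(sum(row), 1.0)


def test_encodings():
    assert encode_ordinal(["low", "high", "medium"]) == [0, 2, 1]
    categories, rows = one_hot(["cat", "dog", "cat"])
    assert categories == ["cat", "dog"]
    assert rows == [[1, 0], [0, 1], [1, 0]]


def test_data_classes():
    assert is_count_data([0, 3, 12])
    assert not is_count_data([1, -2, 3])
    assert is_compositional([0.2, 0.3, 0.5])
    assert not is_compositional([0.5, 0.6])
```

Ordinal encoding keeps the order of the levels, while one-hot encoding deliberately does not, which is why the two kinds of variable get different treatment.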

What you can expect

Coding

  • You'll be implementing only simple functions. Nothing complicated.
  • Sample solutions are in the *_soln.py files.

Tutorial Material

  • Covered with interspersed lectures.
  • Simple exercises designed to get you familiar with how to write tests.
  • A set of tools + code to bootstrap testing for another project.

Bonus Material

  • Self-paced material for the final hour of the tutorial or at home.
  • More advanced testing topics:
    • File integrity
    • Test coverage
    • Property-based tests
  • More superpowers for data testing!

Take-Homes

  • You'll get a ton of practice with pytest and assertion statements.
  • You'll be left with self-paced learning material on using hypothesis for property-based testing.
  • You will have a starter set of tools for writing tests for your code and data.

If nothing else, I want you to not be afraid to write a test. If that's all you take away, this tutorial can be deemed a success.

Let's get going!