Mini-Projects

I have designed a set of mini-projects that should help you recap what you've learned throughout this tutorial. Whether you work through them in order, or just work through a select few, is your choice.

If you're feeling up to the challenge, feel free to also explore the data in-depth in this notebook.

Otherwise, you'll get enough practice for this tutorial by following the exercise prompts below. The easier ones come with a lot of specific instructions; the more difficult ones leave most of the decisions in your hands. Be sure to try both!

Project 1: Schema & Value Checking (Easier)

In this mini-project, we will explore a dataset collected as part of a study of smartphone sanitization methods. The data are found in the file data/smartphone_sanitization_corrupted.csv. As the filename suggests, I have corrupted the data intentionally!

  1. The data provider did not write the YAML specification file for the data. Write the YAML spec for the columns that should be present. Then, write the test for the columns as a defensive check. (recap)
  2. For each row, the column site should take only one of three values: phone, junction, or case. Write a test that checks that the site column contains only these three values.
  3. The colonies_ and morphologies_ columns cannot contain negative values. Write a test that checks for this.
  4. The treatment column can take only one of eight values: ethanol, phonesoap, bleachwipe, quatricide, kimwipe, FBM_2, CB30, and cellblaster. Write a test that checks for this.
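The value checks in steps 2–4 could be sketched along these lines. The count column names (colonies_pre and friends) are assumptions for illustration; check the actual CSV header before relying on them.

```python
# A minimal sketch of the value checks, assuming columns named `site`,
# `treatment`, and count columns prefixed `colonies_` / `morphologies_`.
import pandas as pd

ALLOWED_SITES = {"phone", "junction", "case"}
ALLOWED_TREATMENTS = {
    "ethanol", "phonesoap", "bleachwipe", "quatricide",
    "kimwipe", "FBM_2", "CB30", "cellblaster",
}


def check_sites(df: pd.DataFrame) -> None:
    # Every value in `site` must be one of the three allowed sites.
    assert set(df["site"].unique()) <= ALLOWED_SITES


def check_counts_nonnegative(df: pd.DataFrame) -> None:
    # Colony/morphology counts cannot be negative.
    count_cols = [
        c for c in df.columns
        if c.startswith("colonies_") or c.startswith("morphologies_")
    ]
    for col in count_cols:
        assert (df[col].dropna() >= 0).all(), f"negative values in {col}"


def check_treatments(df: pd.DataFrame) -> None:
    # Every value in `treatment` must be one of the eight allowed treatments.
    assert set(df["treatment"].unique()) <= ALLOWED_TREATMENTS


# A toy frame standing in for the real CSV:
df = pd.DataFrame({
    "site": ["phone", "case"],
    "treatment": ["ethanol", "kimwipe"],
    "colonies_pre": [12, 0],
})
check_sites(df)
check_counts_nonnegative(df)
check_treatments(df)
```

Dropping these functions into a test file (prefixed `test_`) makes them runnable under pytest against the real data.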

Once you're done with this, try making a plot of the pre-colony and post-colony counts in a Jupyter notebook.

Project 2: Schema & Value Checking (Challenging)

In this mini-project, we will explore a dataset of financial aid applicants to a conference. The dataset has been anonymized, and all personally-identifiable data have been removed. The data are found in the file data/finaid-applications.csv, and the metadata spec file can be found in data/metadata_finaid.yml.

  1. The data provider has provided a metadata_finaid.yml file. Check that the columns are correct. There might be a bit of a challenge here: there is extra nested metadata, but the schema is guaranteed to be stable.
  2. The data are a bit corrupted. Do some forensics to figure out how the data are corrupted. If you're stuck, you can reference the corrupt_data_changes.md file for the answer, but it's better if you can figure it out. Once you've done that, write tests to check for file integrity.
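A sketch of the column check in step 1 might look like the following. The spec's nesting here (a top-level columns key) is an assumption; inspect data/metadata_finaid.yml to find where the column list actually lives in the extra nested metadata.

```python
# A sketch of checking DataFrame columns against a YAML spec.
# The spec layout and column names below are illustrative assumptions.
import io

import pandas as pd
import yaml

spec_text = """
description: financial aid applications
columns:
  - name
  - country_of_citizenship
  - country_of_residence
"""


def check_columns(df: pd.DataFrame, spec: dict) -> None:
    # The schema is guaranteed stable, so an exact set comparison is fine.
    assert set(df.columns) == set(spec["columns"])


spec = yaml.safe_load(spec_text)
df = pd.read_csv(io.StringIO(
    "name,country_of_citizenship,country_of_residence\na,X,Y\n"
))
check_columns(df, spec)
```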

Once you're done with this, treat yourself to a data analysis task: plot the proportions of all "international students", i.e. students who are not living in their country of citizenship.

Project 3: File Integrity Checking

In this mini-project, we will continue the file integrity checks by figuring out how to automate routine testing tasks.

  • Write a function that extends the code from Notebook 3 (File Integrity Checks), such that it records the hash of every data file (only CSV files) under the data/ directory.
  • If you've refactored a big chunk of code into functions, write tests for each of those functions.
  • Write a test for data file integrity. The test function should test that the hash of every file in the database is consistent with its current hash on disk.

Project 4: Meta-testing - Missing Value Checks (Challenging)

In this mini-project, we will build some tools that automatically generate meta-level tests.

  • Write a function that automatically produces missingno.matrix() plots for every CSV file in the data/ directory.
  • Write a function that automatically reports every column in every CSV file that contains missing data, using the pandas_summary package.
  • Write a function that tests that every CSV file under data/ has its own corresponding metadata YAML spec file.

Project 5: Property-Based Testing

In this mini-project, we will practice the use of property-based tests. (TBD)

