I have designed a set of mini-projects that should help you recap what you've learned throughout this tutorial. Whether you work through them in order, or just work through a select few, is your choice.
If you're feeling up to the challenge, feel free to also explore the data in depth in this notebook.
Otherwise, you'll get enough practice for this tutorial by following the exercise prompts below. The easy ones have a lot of specific instructions; the more difficult ones will leave a lot of decisions/choices in your hands. Be sure to try both!
In this mini-project, we will explore a dataset collected as part of a study of smartphone sanitization methods. The data are found in the file data/smartphone_sanitization_corrupted.csv. As the file name suggests, the data have been intentionally corrupted!
- site should be only one of three values: phone, junction, and case. Write a test that checks that the site column has only these three values.
- colonies_ and morphologies_ columns cannot contain negative values. Write a test that checks for this.
- treatment column can only be one of eight values: ethanol, phonesoap, bleachwipe, quatricide, kimwipe, FBM_2, CB30 and cellblaster.

Once you're done with this, try making a plot of the pre-colony and post-colony counts in a Jupyter notebook.
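If you get stuck, here is one possible way to phrase the three checks as plain functions that take a DataFrame. Treat it as a sketch rather than the reference solution. The function names are made up for illustration. Missing values are ignored in the value checks, and count columns are coerced to numbers so that stray text does not crash the comparison. Both are assumptions you may want to tighten.

```python
import pandas as pd

VALID_SITES = {"phone", "junction", "case"}
VALID_TREATMENTS = {
    "ethanol", "phonesoap", "bleachwipe", "quatricide",
    "kimwipe", "FBM_2", "CB30", "cellblaster",
}
COUNT_PREFIXES = ("colonies_", "morphologies_")


def unexpected_values(df, column, allowed):
    """Return the set of non-missing values in `column` that are not allowed."""
    return set(df[column].dropna().unique()) - set(allowed)


def negative_count_columns(df):
    """Return the colonies_/morphologies_ columns that hold a negative value."""
    count_cols = [c for c in df.columns if c.startswith(COUNT_PREFIXES)]
    # Coerce to numeric so non-numeric debris becomes NaN instead of raising.
    return [
        c for c in count_cols
        if (pd.to_numeric(df[c], errors="coerce") < 0).any()
    ]


def check_sanitization_data(df):
    """Raise AssertionError on the first violated constraint."""
    bad_sites = unexpected_values(df, "site", VALID_SITES)
    assert not bad_sites, f"unexpected site values: {bad_sites}"

    bad_treatments = unexpected_values(df, "treatment", VALID_TREATMENTS)
    assert not bad_treatments, f"unexpected treatment values: {bad_treatments}"

    bad_counts = negative_count_columns(df)
    assert not bad_counts, f"negative values in columns: {bad_counts}"
```

To try it on the real file, load it with `pd.read_csv("data/smartphone_sanitization_corrupted.csv")` and pass the result to `check_sanitization_data`. Since the file is corrupted on purpose, expect an `AssertionError` until the data are cleaned.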
In this mini-project, we will explore a dataset of financial aid applicants to a conference. The dataset has been anonymized, and all personally identifiable data have been removed. The data are found in the file data/finaid-applications.csv, and the metadata spec file can be found in data/metadata_finaid.yml.
- metadata_finaid.yml file. Check that the columns are correct. There might be a bit of a challenge here: there is extra nested metadata, but the schema is guaranteed to be stable.
- corrupt_data_changes.md file for the answer, but it's better if you can figure it out. Once you've done that, write tests to check for file integrity.

Once you're done with this, treat yourself to a data analysis task: plot the proportions of all "international students", i.e. students who are not living in their country of citizenship.
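For the column check, the comparison itself can be kept separate from the file loading. The sketch below works on an already-parsed metadata mapping. The layout it expects (a top-level "columns" list whose entries are either plain names or mappings with a "name" key) is purely an assumption, since the real spec has its own nesting. Adapt `expected_columns` once you have looked at the YAML file. Both function names are hypothetical.

```python
def expected_columns(metadata):
    """Extract column names from a parsed metadata mapping.

    ASSUMPTION: `metadata["columns"]` is a list whose entries are either
    plain names or mappings with a "name" key. Adjust to the real schema.
    """
    names = []
    for entry in metadata["columns"]:
        names.append(entry["name"] if isinstance(entry, dict) else str(entry))
    return names


def compare_columns(actual, expected):
    """Return (missing, unexpected) column names, each as a sorted list."""
    actual, expected = set(actual), set(expected)
    missing = sorted(expected - actual)      # in the spec, not in the data
    unexpected = sorted(actual - expected)   # in the data, not in the spec
    return missing, unexpected
```

With PyYAML installed, the mapping would typically come from `yaml.safe_load` applied to the opened spec file, and the actual names from the `columns` attribute of the DataFrame returned by `pd.read_csv`. A test can then assert that both returned lists are empty.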
In this mini-project, we will continue the file integrity checks by figuring out how to automate routine testing tasks.
data/ directory. In this mini-project, we will build some tools to automatically spit out meta-level tests.
- missingno.matrix() plots for every CSV file in the data/ directory.
- pandas_summary package.
- data/ has its own corresponding metadata YAML spec file.
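As a starting point for the last item, here is a small sketch that lists the CSV files lacking a spec file. The naming rule is an assumption: the default maps foo.csv to metadata_foo.yml, which does not match the finaid example above (finaid-applications.csv pairs with metadata_finaid.yml). For that reason the rule is passed in as a function you can swap out. The names `default_spec_name` and `csvs_missing_spec` are hypothetical.

```python
from pathlib import Path


def default_spec_name(csv_path):
    # ASSUMED convention: foo.csv -> metadata_foo.yml. Replace as needed.
    return f"metadata_{csv_path.stem}.yml"


def csvs_missing_spec(data_dir, spec_name=default_spec_name):
    """Return the names of CSV files in `data_dir` without a spec file.

    `spec_name` maps a CSV path to the file name of its expected spec,
    which is looked up in the same directory.
    """
    data_dir = Path(data_dir)
    missing = []
    for csv_path in sorted(data_dir.glob("*.csv")):
        if not (data_dir / spec_name(csv_path)).is_file():
            missing.append(csv_path.name)
    return missing
```

A meta-level test could then be as short as asserting that `csvs_missing_spec("data")` returns an empty list.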