Introduction

Data is dirty. Any dataset that isn't properly curated and stored can suffer from many problems: mixed data types, improper encoding or escaping, an uneven number of fields, and so on. None of these problems are unsolvable. In fact, most of us are pretty good at cleaning data. Normally, when we know little or nothing about a given dataset, we proceed in a very predictable manner. We first try to read the data naively and see if the parser raises errors. If it does, we fix our function calls. Once those are fixed, we run some sanity checks on the data, and end up filtering the dataset, sometimes quite heavily.

The problem with this process is that it is iterative, and worse, it is reactive. Everybody on the team has to go through it if they are to use the dataset. Sure, one can simply clean the data up and dump it in a new file with just a few lines of code. But we shouldn't have to run that script every time we encounter a new dataset. We would be much more comfortable if the data were cleaned as it is read. It is much more efficient if data cleaning is a part of data ingestion.

This can be achieved by having a centralized schema for every dataset. This schema can house the rules that the clean dataset must follow, so as to further aid its analysis. Of course, such a schema can be expressed via a simple Python script which is shared with everyone who is doing analysis on the dataset in question. But the number of datasets that someone has to deal with over the timeline of a particular project can quickly get out of hand, and so can their cleaning scripts. Secondly, and more importantly, cleaning data via ad-hoc Python scripts is non-trivial. Readable as Python scripts might be, it's not always easy for everyone on the team to change the cleaning process. Moreover, there are no Python libraries that offer an abstraction at the level of cleaning and validating data.
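For instance, an ad-hoc cleaning script for a single dataset might look something like the following. This is purely a hypothetical sketch; the file names, column names, and rules are made up for illustration:

import pandas as pd

# Read the raw file, then patch up problems one by one.
df = pd.read_csv("raw_measurements.csv", encoding="latin-1")
df["id"] = df["id"].astype(str)              # IDs are labels, not numbers
df = df.dropna(subset=["sepal_length"])      # drop incomplete records
df = df[df["sepal_length"] > 0]              # remove impossible values
df.to_csv("clean_measurements.csv", index=False)

Such a script works, but it hard-codes every decision, and everyone who uses the data has to find it, trust it, and remember to run it.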

Therefore, if one has to go through the process of data validation and cleaning in a customizable, modular way, one has to make sure that:

  • the specifications for all datasets are in one place, not in different scripts.
  • datasets are grouped under a suitable name that pertains to a particular project (in PySemantic such a group is called a Project, as we shall see).
  • strict validation and cleaning rules are applied to all aspects of a dataset.
  • the process of validation and cleaning is identically reproducible by everyone who works on the data.

PySemantic makes all that happen.

1. Getting Started

Let's get our hands dirty. We'll explore more features as we go along. Before you proceed further, please make sure that you have gone through the quick start section here.

By now you should have added a project named pysemantic_demo, and used the Project object to load the iris dataset. Let's take a more detailed look at what is happening here.

1.1 The Project class

A first-class citizen of the pysemantic namespace is the Project class. This class has everything you need to add, remove, read, or write datasets. In PySemantic, all datasets are classified under projects, represented by instances of the Project class. Each project is identified by a unique name. This name is used to instantiate the Project class and perform operations on all datasets registered under it. You can think of these "projects" in PySemantic in much the same way as an IDE organizes software projects. Each project in an IDE has a set of files containing source code, a set of build tools, and a few other things that make the project self-contained. Similarly, each project in PySemantic has its own datasets, which in turn have their schema and their validation rules. Currently, for this example, the iris dataset is loaded naively, without any rules.


In [1]:
from pysemantic import Project

In [2]:
demo = Project("pysemantic_demo")

In [3]:
iris = demo.load_dataset("iris")

In [4]:
iris.head(5)


Out[4]:
   Sepal Length  Sepal Width  Petal Length  Petal Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

This is the Fisher iris dataset that we know so well. Now imagine that someone was curating more samples of these flowers for us, and sending us the measurements for 150 more flowers (sepal length, sepal width, petal length, petal width, and the species). That would amount to 150 more rows in the dataset. Now suppose that our data acquisition methods were flawed, and the data that came in was dirty. A sample of such a dirty dataset can be found here. Try loading this file into a pandas dataframe directly, using the pandas.read_csv function. Notice that there's a column called id, which contains 10-digit strings. These IDs could correspond to identifiers automatically generated by the system storing the data. If they're really just IDs, they should be read as strings, but there is no way for pandas to know that these are as good as strings (other examples of this are phone numbers and zip codes). In pandas, this can be fixed by using the dtype argument of pandas.read_csv.
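In plain pandas, that fix would look something like the sketch below (the path is a placeholder, matching the one used in the spec that follows):

import pandas as pd

# Force the id column to be parsed as strings instead of integers,
# so the values are treated as labels rather than numbers.
bad_iris = pd.read_csv("/absolute/path/to/bad_iris.csv", dtype={"id": str})

To make this preference persist in PySemantic, we can add this dataset to our data dictionary (demo_specs.yaml) by adding the following lines to it: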

bad_iris:
  path: /absolute/path/to/bad_iris.csv
  dtypes:
    - id: !!python/name:__builtin__.str

The last line tells pandas that the column id is to be read as a string, not as the default integer. Any type can thus be specified for any column by adding a line formatted as follows:

- column_name: yaml-dump-of-python-type

for the given column. (Similarly, we can specify types for the other columns in the dataset too, but this isn't required since the defaults work fine for them.)
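If you're wondering what the "yaml dump" of a Python type looks like, PyYAML itself can produce it. The snippet below is just an aside; it assumes PyYAML is installed and that you are on Python 2, where the built-in module is called __builtin__ (on Python 3 the module name would be builtins):

import yaml

# Dump Python types to their YAML representations.
print(yaml.dump(float))   # prints: !!python/name:__builtin__.float ''
print(yaml.dump(str))     # prints: !!python/name:__builtin__.str ''

The trailing '' is just an empty scalar; what matters for the data dictionary is the !!python/name:... tag.

You can try out how the Project object picks up these new specifications by doing the following: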


In [6]:
demo.reload_data_dict() # Re-reads the data dictionary specifications
bad_iris = demo.load_dataset("bad_iris")
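If everything went through, the id column of the reloaded dataframe should now contain strings. A quick way to check this (plain pandas, nothing PySemantic-specific; shown here as a suggestion rather than part of the original session):

# String columns in pandas report the 'object' dtype.
print(bad_iris["id"].dtype)    # expected: object
print(bad_iris["id"].head())   # the ten-digit IDs, now stored as strings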