Organization

Project Structure (10 min)

Here are some characteristics of files that you should pay attention to

  1. File history
  2. File function
  3. File format
  4. File origin

We will start thinking about how to capture these characteristics using folders and simple naming schemes.

Activity: Sanity-check the data

Let's take a look at the data in Excel before we start working on it. Does everything look correct?

Discuss with your partner: Are we sure that we everything is correct by just looking at the file in excel?

Should we fix any of the problems we see in this file? Why or why not?

What if we were to change this file?

  • Changes made by hand (in GUI applications) are incredibly hard to track. Some tools are better than others, but this is outside our control.
  • If we changed our only copy of the raw data, we may never be able to reproduce what we did.
  • We want to track exactly what changed in the file, and if we must do it by hand, we should make that easy to see.

Some ideas to detect changes:

  • Include the date stamp in the file name
    • 2017-03-15_01_dleehr-fixes_gapminderDataFiveYear_superDirty.xlsx
  • Color the changed cells to highlight the changes
  • Side-by-side comparison with original
  • Record exact changes in a README file.

Clearly label the file or make the location for the file that makes sense to future you and collaborators.

How to structure your project folders -- Exercise (5 min)

The most important way to delineate file types and organize the workflow of your project is to design your file structure at the very start of your project's history.

Often you do not know exactly where your project will lead, so it is okay if your intitial design evolves as the project does. There are a few standards that we will walk through that work well for most projects.

Create a folder that will house all of your projects and call it projects.

projects/

Now let's make a folder within that folder for our new project, called gapminder:

projects/
    gapminder/

Since we have data, we want to create a data folder.

projects/
    gapminder/
        data/

Since we have this Excel spreadsheet as our original, raw data, we will save it in a folder called 00_raw to indicate that it is the starting point for our project. Create a subfolder in data/ called 00_raw and put the gapminderDataFiveYear_superDirty.xlsx file there. Now you should have a folder structure that looks like this:

projects/
    gapminder/
        data/
            00_raw/
                gapminderDataFiveYear_superDirty.xlsx

Since we want to change something in this file, we will create a new folder in data/ that shows that it is different from 00_raw/, and we will call it 01_cleaning/. Make a subfolder in data/ called 01_cleaning/ and put a copy of the gapminderDataFiveYear_superDirty.xlsx file there. Then, make an empty subfolder called 02_cleaned in the data/ folder.

projects/
    gapminder/
        data/
            00_raw/
                gapminderDataFiveYear_superDirty.xlsx
            01_cleaning/
                gapminderDataFiveYear_superDirty.xlsx
            02_cleaned/

With this structure, we can capture and communicate most of the characteristics of data as it exsts in our project: history, function, and format. In addition, there is a clear understanding of how to proceed to the next step, data cleaning.

Why 01_cleaning and 02_cleaned?

Realistically, we may not completely clean a file in one pass. Sometimes we'll get interrupted, and sometimes we'll catch things later and need to revise the file.

  1. Copy the original raw data into 01_cleaning (e.g. the xlsx file)
  2. Edit the file in 01_cleaning, saving there and recording changes.
  3. Export a cleaned version in CSV or Tab-delimited format to 02_cleaned.

If we catch things later, we don't put a file in 02_cleaned until we're done with it. And if we have more cleaning to do, we start from 01_cleaning and produce a new file in 02_cleaned.

Record keeping

There are other characteristics of the data that don't fit into file and directory names, such as the origin of the data. A common solution to keeping this detailed information in the project is adding metadata files. We will cover practices for metadata in the next section.

Good, Better, Best

  • Good Separate input, code, and from output using different folders or file names and keep track of all the input->output changes in separate files. Use descriptive file/directory names that convey function and history
  • Better Consistent folder structure across projects, ISO 8601 date stamps in file and directory names, read-only raw data
  • Best Script to generate folder structures, Version control to track modifications {: .checklist}

Metadata (10 min)

Why READMEs

Every project should describe to users what the purpose of the project is. This is commonly done in a README file. As the starting point for a project the README file is formatted as plain text (or markdown) to make it easily readable. A README file should include the following information:

  • The project name
  • The date the README was created
  • Contact information for the person(s) who maintains the project
  • Three or four sentences about the goal of the project
  • If the project uses data from an external source, where the data is from

Think about the beginning of this lesson, when we had nothing but a file with a name. These are the things that would have made it easy to make sense of that data.

So, before we make any modifications to the raw data, we need a practice for how to record the initial state of the data, as well as our modifications.

Adding a Top Level README -- Exercise (5 min)

To add a README to our project, open a text editor. For Mac users this can be BBEdit, NotePad++ for Windows users.

Now, let's make a README

  • Open text editor
  • Start writing
Project name
Today's date
Maintainer's contact info
Data Origin
3-4 sentences about the goal of the project
  • Save as README in the project directory.

This file serves as the starting point for future you, or anyone who receives this data.

Adding a README in a Subdirectory

README files in subdirectories are a good idea too. Often there are many files, and it's distracting to fill the top-level README with details about smaller pieces of the project.

  • For raw data directories, you should include the location (e.g URL) where the file was retrieved or generated.
  • For modified data directories, you should include the exact tools and steps used to modify the data, along with dates
  • For other directories like code or documentation, the README should communicate what the directories contain.

Keeping the READMEs up-to-date

  • Dates are good on file names here
  • When changing something in a directory, you should add a line at the README

Self-documenting Projects

READMEs are commentary on what we consider the "real work", and realistically can be an afterthought. We've all had projects under a deadline or someone asking for a result, and the documentation step is easy to defer until later.

Later never comes, or we forget the details by the time it does. So another good practice is to use good, descriptive names on files, directories, and in code. These are for our benefit, not the computer.

Modifying Data (10 min)

Cleaning data by hand

Now that we've created a README for our data, it's time to begin the cleaning process. Open the data file in 01_cleaning using Excel (Or if you prefer a text editor, here is the tab-delimited version). Let's take a look at the current state of our data, and find some parts of it worth cleaning.

As you can see, this data is messy. We can fix various mistakes, fill in missing values, etc. However, how can we remember what has been done? How do we know what's different between this data file and our raw data?

Recording our actions

Since we've use a GUI editor in order to make these modifications manually, we must also record our actions manually with a README file. READMEs are useful for giving a person context about the data contained within a folder. In this case, the context we wish to give are the steps that have been taken in order to create this cleaned dataset.

Pitfalls of manually cleaning data

You might think that this is a pretty labor-intensive process. Maybe you're even questioning the value of writing each of these steps down manually. It's true - recording each modification by hand is time-consuming. It's also easy to forget steps, mis-represent what you've done, and generally poorly convey the true state of the data. For this reason, it's much preferred to use a script to automate the cleaning of your data. We'll cover this in the next section.

Another challenge with using GUIs is that they tend to have different ways of doing the same thing. For example, bring up the Save As menu in the editor that you're using. You might notice that there are many options for saving this data on top of simply specifying a file name and location. In addition, some GUIs have different labels for each row / column. Excel begins its rows at 1, while python begins its rows with 0. It is important to record all information about the program used to modify the data, as well as any considerations that should be taken when interpreting the cleaned data.


In [ ]: