Discuss with your partner: Are we sure that we everything is correct by just looking at the file in excel?
Some ideas to detect changes:
2017-03-15_01_dleehr-fixes_gapminderDataFiveYear_superDirty.xlsx
Clearly label the file or make the location for the file that makes sense to future you and collaborators.
The most important way to delineate file types and organize the workflow of your project is to design your file structure at the very start of your project's history.
Often you do not know exactly where your project will lead, so it is okay if your intitial design evolves as the project does. There are a few standards that we will walk through that work well for most projects.
Create a folder that will house all of your projects and call it projects
.
projects/
Now let's make a folder within that folder for our new project, called gapminder
:
projects/
gapminder/
Since we have data, we want to create a data
folder.
projects/
gapminder/
data/
Since we have this Excel spreadsheet as our original, raw data, we will save it in a folder called 00_raw
to indicate that it is the starting point for our project. Create a subfolder in data/
called 00_raw
and put the gapminderDataFiveYear_superDirty.xlsx
file there. Now you should have a folder structure that looks like this:
projects/
gapminder/
data/
00_raw/
gapminderDataFiveYear_superDirty.xlsx
Since we want to change something in this file, we will create a new folder in data/
that shows that it is different from 00_raw/
, and we will call it 01_cleaning/
. Make a subfolder in data/
called 01_cleaning/
and put a copy of the gapminderDataFiveYear_superDirty.xlsx
file there. Then, make an empty subfolder called 02_cleaned
in the data/
folder.
projects/
gapminder/
data/
00_raw/
gapminderDataFiveYear_superDirty.xlsx
01_cleaning/
gapminderDataFiveYear_superDirty.xlsx
02_cleaned/
With this structure, we can capture and communicate most of the characteristics of data as it exsts in our project: history, function, and format. In addition, there is a clear understanding of how to proceed to the next step, data cleaning.
01_cleaning
and 02_cleaned
?Realistically, we may not completely clean a file in one pass. Sometimes we'll get interrupted, and sometimes we'll catch things later and need to revise the file.
01_cleaning
(e.g. the xlsx file)01_cleaning
, saving there and recording changes.02_cleaned
.If we catch things later, we
don't put a file in 02_cleaned
until we're done with it. And if we have more cleaning to do, we start from 01_cleaning
and produce a new file in 02_cleaned
.
There are other characteristics of the data that don't fit into file and directory names, such as the origin of the data. A common solution to keeping this detailed information in the project is adding metadata files. We will cover practices for metadata in the next section.
Good, Better, Best
- Good Separate input, code, and from output using different folders or file names and keep track of all the input->output changes in separate files. Use descriptive file/directory names that convey function and history
- Better Consistent folder structure across projects, ISO 8601 date stamps in file and directory names, read-only raw data
- Best Script to generate folder structures, Version control to track modifications {: .checklist}
Every project should describe to users what the purpose of the project is. This is commonly done in a README file. As the starting point for a project the README file is formatted as plain text (or markdown) to make it easily readable. A README file should include the following information:
Think about the beginning of this lesson, when we had nothing but a file with a name. These are the things that would have made it easy to make sense of that data.
So, before we make any modifications to the raw data, we need a practice for how to record the initial state of the data, as well as our modifications.
To add a README to our project, open a text editor. For Mac users this can be BBEdit, NotePad++ for Windows users.
Now, let's make a README
Project name
Today's date
Maintainer's contact info
Data Origin
3-4 sentences about the goal of the project
README
in the project directory.This file serves as the starting point for future you, or anyone who receives this data.
README files in subdirectories are a good idea too. Often there are many files, and it's distracting to fill the top-level README with details about smaller pieces of the project.
READMEs are commentary on what we consider the "real work", and realistically can be an afterthought. We've all had projects under a deadline or someone asking for a result, and the documentation step is easy to defer until later.
Later never comes, or we forget the details by the time it does. So another good practice is to use good, descriptive names on files, directories, and in code. These are for our benefit, not the computer.
Now that we've created a README for our data, it's time to begin the cleaning process. Open the data file in 01_cleaning
using Excel (Or if you prefer a text editor, here is the tab-delimited version). Let's take a look at the current state of our data, and find some parts of it worth cleaning.
As you can see, this data is messy. We can fix various mistakes, fill in missing values, etc. However, how can we remember what has been done? How do we know what's different between this data file and our raw data?
Since we've use a GUI editor in order to make these modifications manually, we must also record our actions manually with a README file. READMEs are useful for giving a person context about the data contained within a folder. In this case, the context we wish to give are the steps that have been taken in order to create this cleaned dataset.
You might think that this is a pretty labor-intensive process. Maybe you're even questioning the value of writing each of these steps down manually. It's true - recording each modification by hand is time-consuming. It's also easy to forget steps, mis-represent what you've done, and generally poorly convey the true state of the data. For this reason, it's much preferred to use a script to automate the cleaning of your data. We'll cover this in the next section.
Another challenge with using GUIs is that they tend to have different ways of doing the same thing. For example, bring up the Save As
menu in the editor that you're using. You might notice that there are many options for saving this data on top of simply specifying a file name and location. In addition, some GUIs have different labels for each row / column. Excel begins its rows at 1
, while python begins its rows with 0
. It is important to record all information about the program used to modify the data, as well as any considerations that should be taken when interpreting the cleaned data.
In [ ]: