You can use Python string functions to do some basic data cleaning. (For data sets with more complex cleaning needs, you might want to use a power tool like Open Refine.)
Here, we're going to write a function that takes in one row of data as input, cleans up the pieces of data in the row, then returns a cleaned version of that row. As we loop over the data in the file, we'll call this function on each row, then write out the clean row to a new file.
Let's break down our tasks:
We're going to be working with the FDIC's list of failed banks.
First, we need to write our cleaning function -- let's call our function clean_row(). We need to decide whether the row it parses will be a dictionary (using csv.DictReader) or a list (using csv.reader).
Let's use a dictionary.
Here are the fields that we are going to include in our output file. The ones that need cleaning are in bold.
In [ ]:
# first line defines the function and the argument
# ("row" is an arbitrary variable name)
"""
For the bank and institution name:
- strip whitespace
- uppercase the name
- replace '&' with 'AND'
n.b.: you can chain string methods together
"""
# strip whitespace and upcase the city
# return a dictionary of clean data
# the keys ~must~ match the headers of our output file
In [ ]:
# import the csv library
# open the two files
# create a DictReader object
# create a DictWriter object
# the fieldnames must exactly match the keys in the dictionary being returned
# by our cleaning function
# write out header row
# loop over the rows of raw data
# "row" is an arbitrary variable name
# call the cleaning function on the row
# write out the clean row
The Closing Date field in the bank failure data is in this format: 6-Sep-2011. In other words, day, then abbreviated month as text, then year.
Python's built-in datetime module has two methods that can help us reformat them: strftime() and strptime().
Your task: Add some code to the cleaning function to reformat the closing date in yyyy-mm-dd format. This will require doing some research into a module that we haven't discussed yet. (Good practice for when you're coding on your own.)
Breaking it down into smaller tasks:
datetime from the datetime module: from datetime import datetimestrptime() to turn a 6-Sep-2011-type string into a Python date objectyyyy-mm-dd using strftime()Google is your friend here. Try searching for things like "python strptime example." (Freebie: Here's a handy guide to the date directives.) Noodle around in a cell. Get something working for one date -- a test string -- before setting your solution loose on the whole file. Try new things, see what happens, fail, find solutions. It's all part of the learning process.
In [ ]: