Working with data files

Reading and writing data files is a common task, and Python offers native support for working with many kinds of data files. Today, we're going to be working mainly with CSVs.

Import the csv module

We're going to be working with delimited text files, so the first thing we need to do is import this functionality from the standard library.



In [ ]:

Opening a file to read the contents

We're going to use something called a with statement to open a file and read the contents. The open() function takes at least two arguments: The path to the file you're opening and what "mode" you're opening it in.

To start with, we're going to use the 'r' mode to read the data. We'll use the default arguments for delimiter -- comma -- and we don't need to specify a quote character.

Important: If you open a data file in w (write) mode, anything that's already in the file will be erased.

The file we're using -- MLB roster data from 2017 -- lives at data/mlb.csv.

Once we have the file open, we're going to use some functionality from the csv module to iterate over the lines of data and print each one.

Specifically, we're going to use the csv.reader method, which returns a list of lines in the data file. Each line, in turn, is a list of the "cells" of data in that line.

Then we're going to loop over the lines of data and print each line. We can also use bracket notation to retrieve elements from inside each line of data.



In [ ]:

    
# open the MLB data file `as` mlb

    
    # create a reader object

    
    # loop over the rows in the file

        
        # assign variables to each element in the row (shortcut!)

        
        # print the row, which is a list

Simple filtering

If you wanted to filter your data, you could use an if statement inside your with block.



In [ ]:

    
# open the MLB data file `as` mlb

    
    # create a reader object


    # move past the header row

    
    # loop over the rows in the file


        # assign variables to each element in the row (shortcut!)

        
        # print the line of data ~only~ if the player is on the Twins

        
            # print the row, which is a list

Exercise

Read in the MLB data, print only the names and salaries of players who make at least $1 million. (Hint: Use type coercion!)



In [ ]:

    
# open the MLB data file `as` mlb

    
    # create a reader object

    
    # move past the header row

    
    # loop over the rows in the file


        # assign variables to each element in the row (shortcut!)

        
        # print the line of data ~only~ if the player is on the Twins

        
            # print the row, which is a list

DictReader: Another way to read CSV files

Sometimes it's more convenient to work with data files as a list of dictionaries instead of a list of lists. That way, you don't have to remember the position of each "column" of data -- you can just reference the column name. To do it, we'll use a csv.DictReader object instead of a csv.reader object. Otherwise the code is much the same.



In [ ]:

    
# open the MLB data file `as` mlb

    
    # create a reader object

    
    # loop over the rows in the file


        # print just the player's name (the column header is "NAME")

Writing to CSV files

You can also use the csv module to create csv files -- same idea, you just need to change the mode to 'w'. As with reading, there's a list-based writing method and a dictionary-based method.



In [ ]:

    
# define the column names


# let's make a few rows of data to write


# open an output file in write mode

    
    # create a writer object

    
    # write the header row

    
    # loop over the data and write to file

Using DictWriter to write data

Similar to using the list-based method, except that you need to ensure that the keys in your dictionaries of data match exactly a list of fieldnames.



In [ ]:

    
# define the column names


# let's make a few rows of data to write

# open an output file in write mode

    
    # create a writer object -- pass the list of column names to the `fieldnames` keyword argument

    
    # use the writeheader method to write the header row

    
    # loop over the data and write to file

You can open multiple files for reading/writing

Sometimes you want to open multiple files at the same time. One thing you might want to do: Opening a file of raw data in read mode, clean each row in a loop and write out the clean data to a new file.

You can open multiple files in the same with block -- just separate your open() functions with a comma.

For this example, we're not going to do any cleaning -- we're just going to copy the contents of one file to another.



In [ ]:

    
# open the MLB data file `as` mlb
# also, open `mlb-copy.csv` to write to

    
    # create a reader object

    
    # create a writer object
    # we're going to use the `fieldnames` attribute of the DictReader object
    # as our output headers, as well
    # b/c we're basically just making a copy

    
    # write header row

    
    # loop over the rows in the file

        
        # what type of object is `row`?
        # how would we find out?
        
        # write row to output file