Let's scrape a practice table

The latest Mountain Goats album is called Goths. (It's good!) I made a simple HTML table with the track listing -- let's scrape it into a CSV.

Import the modules we'll need


In [ ]:

Read in the file, see what we're working with

We'll use the read() method to get the contents of the file.


In [ ]:
# in a with block, open the HTML file

    
    # use .read() to get the contents of the file -- it'll be a string


    # print the string to see what's there
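
Here's one way this cell could be filled in. It's a minimal sketch, not the notebook's own solution: the practice file's name isn't given here, so the example writes a small stand-in table (the `sample_html` string and the temp file are assumptions) and then reads it back.

```python
import os
import tempfile

# Hypothetical stand-in for the practice HTML file; the real filename isn't given here.
sample_html = "<table><tr><td>1</td><td>Rain in Soho</td></tr></table>"

# Write the sample to a temporary file so this sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample_html)
    html_path = tmp.name

# In a with block, open the HTML file; it closes automatically when the block ends.
with open(html_path, "r", encoding="utf-8") as html_file:
    # .read() returns the whole file as a single string
    html_code = html_file.read()

    # print the string to see what's there
    print(html_code)

# Clean up the temporary file.
os.remove(html_path)
```

With the real file, only the second `with` block is needed, with the file's actual path in place of `html_path`.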

Parse the table with BeautifulSoup

Right now, Python isn't interpreting our table as data -- it's just a string. We need to use BeautifulSoup to parse that string into data objects that Python can understand. Once the string is parsed, we'll be working with a "tree" of data that we can navigate.


In [ ]:
# use the type() function to see what kind of object `html_code` is

    
    # feed the file's contents (the string of HTML) to BeautifulSoup
    # (BeautifulSoup will warn you if you don't specify a parser)


    # use the type() function to see what kind of object `soup` is
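
A sketch of what those steps might look like, assuming the `bs4` package is installed. The `html_code` string below is a small made-up sample standing in for the contents of the practice file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the string read from the practice file.
html_code = """
<table>
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>Rain in Soho</td></tr>
</table>
"""

# Before parsing, the HTML is just a plain string.
print(type(html_code))

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html_code, "html.parser")

# After parsing, it's a BeautifulSoup object: a tree we can navigate.
print(type(soup))

# Navigating the tree: the first <td> inside the first <table>.
first_cell = soup.table.td.string
print(first_cell)
```

The choice of `"html.parser"` is just the one that needs no extra install; other parsers such as `lxml` can be passed the same way if they're available.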

Decide how to target the table

BeautifulSoup has several methods for targeting elements -- by position on the page, by attribute, etc. Right now we just want to find the correct table.


In [ ]:
# by position on the page
    # find_all returns a list of matching elements, and we want the second ([1]) one

    
    # by class name
    # => with `find`, you can pass a dictionary of element attributes to match on

    
    # by ID

    
    # by style
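
The four approaches could be sketched like this. The class name, ID and style value here are invented for the example (the real table's attributes may differ), and the sample page puts a throwaway table first so that the one we want is the second.

```python
from bs4 import BeautifulSoup

# Hypothetical page: attribute values are made up for illustration.
html_code = """
<table><tr><td>Some other table</td></tr></table>
<table class="song-table" id="album" style="width: 100%;">
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>Rain in Soho</td></tr>
</table>
"""
soup = BeautifulSoup(html_code, "html.parser")

# by position on the page: find_all returns a list, and we want the second one
by_position = soup.find_all("table")[1]

# by class name: pass a dictionary of attributes to match on
by_class = soup.find("table", {"class": "song-table"})

# by ID
by_id = soup.find("table", {"id": "album"})

# by style: the value has to match the attribute text exactly
by_style = soup.find("table", {"style": "width: 100%;"})

# All four should point at the same element in the tree.
print(by_position is by_class is by_id is by_style)
```

Matching on ID or class tends to hold up better than position if the page layout changes, though which one is available depends on the page you're scraping.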

Loop over the table rows

Let's print a list of track numbers and song titles. Look at the structure of the table -- rows are represented by tr tags, and within each row the cells are represented by td tags. The find_all() method returns a list, and we know how to iterate over a list: with a for loop. Let's do that.


In [ ]:
# find the rows in the table
    # slice to skip the header row

    
    # loop over the rows


        # get the table cells in the row

        
        # assign them to variables

        
        # use the .string attribute to get the text in the cell
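
One possible version of the loop, as a sketch. The two-column sample table is an assumption (the real table may have more columns), and the `tracks` list is only there so the results can be inspected afterward.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table with a header row and two data rows.
html_code = """
<table>
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>Rain in Soho</td></tr>
  <tr><td>2</td><td>Andrew Eldritch Is Moving Back to Leeds</td></tr>
</table>
"""
soup = BeautifulSoup(html_code, "html.parser")
table = soup.find("table")

# find the rows in the table, slicing to skip the header row
rows = table.find_all("tr")[1:]

tracks = []

# loop over the rows
for row in rows:
    # get the table cells in the row
    cells = row.find_all("td")

    # assign them to variables, using .string to get the text in each cell
    track_number = cells[0].string
    title = cells[1].string

    tracks.append((track_number, title))
    print(track_number, title)
```

Note that `.string` only returns text when a cell holds a single piece of text; for cells with nested tags, `.get_text()` may be the safer choice.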

Write data to file

Let's put it all together and open a file to write the data to.


In [ ]:
# set up a writer object

    

    


        # get the table cells in the row

        
        # assign them to variables

        
        # write out the dictionary to file
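
Putting the pieces together might look something like the sketch below. It assumes a `csv.DictWriter` (the "write out the dictionary" comment hints at one), a made-up two-column table, and a hypothetical output file named `goths.csv` in a temporary directory.

```python
import csv
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample table standing in for the practice file's contents.
html_code = """
<table>
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>Rain in Soho</td></tr>
  <tr><td>2</td><td>Andrew Eldritch Is Moving Back to Leeds</td></tr>
</table>
"""
soup = BeautifulSoup(html_code, "html.parser")
rows = soup.find("table").find_all("tr")[1:]

# Hypothetical output path; a temp directory keeps the sketch self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "goths.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as outfile:
    # set up a writer object and write the header row
    writer = csv.DictWriter(outfile, fieldnames=["track", "title"])
    writer.writeheader()

    for row in rows:
        # get the table cells in the row
        cells = row.find_all("td")

        # assign them to variables
        track = cells[0].string
        title = cells[1].string

        # write out the dictionary to file
        writer.writerow({"track": track, "title": title})

# Read the file back in to check what was written.
with open(csv_path, newline="", encoding="utf-8") as infile:
    written = list(csv.DictReader(infile))
print(written)
```

Passing `newline=""` to `open()` follows the csv module's guidance and avoids blank lines between rows on some platforms.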