The latest Mountain Goats album is called Goths. (It's good!) I made a simple HTML table with the track listing -- let's scrape it into a CSV.
In [ ]:
# in a with block, open the HTML file
# .read() the contents of the file -- you'll get back a string
# print the string to see what's there
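One way the steps above might look, as a minimal sketch. The file name `goths.html` and the table markup are assumptions (the real file isn't shown here), so this example first writes a tiny stand-in file to a temporary folder and then reads it back.

```python
import os
import tempfile

# ASSUMPTION: stand-in markup with placeholder rows -- not the real track listing
sample_html = """<table id="album-info">
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>First Song</td></tr>
  <tr><td>2</td><td>Second Song</td></tr>
</table>
"""

# ASSUMPTION: hypothetical file name, written to a temp folder so the example runs anywhere
html_path = os.path.join(tempfile.mkdtemp(), 'goths.html')
with open(html_path, 'w') as outfile:
    outfile.write(sample_html)

# in a with block, open the HTML file
with open(html_path, 'r') as infile:
    # .read() the contents of the file -- it'll be a string
    html_code = infile.read()

# print the string to see what's there
print(html_code)
```

With your own file, you'd skip the stand-in part and pass the path of the HTML file straight to `open()`.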
In [ ]:
# use the type() function to see what kind of object `html_code` is
# feed the file's contents (the string of HTML) to BeautifulSoup
# BeautifulSoup will warn you if you don't specify a parser
# use the type() function to see what kind of object `soup` is
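A sketch of what this cell could look like. The `html_code` string here is a made-up stand-in for the file contents, and `'html.parser'` (the parser that ships with Python) is one reasonable choice, not necessarily the one used originally.

```python
from bs4 import BeautifulSoup

# ASSUMPTION: stand-in for the string you read in from the HTML file
html_code = """<table>
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>First Song</td></tr>
</table>"""

# use the type() function to see what kind of object `html_code` is
print(type(html_code))

# feed the string of HTML to BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(html_code, 'html.parser')

# use the type() function to see what kind of object `soup` is
print(type(soup))
```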
In [ ]:
# by position on the page
# find_all returns a list of matching elements, and we want the second ([1]) one
# by class name
# => with `find`, you can pass a dictionary of element attributes to match on
# by ID
# by style
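Here's a sketch of those four approaches side by side. The page below is invented for illustration: the class name `song-table`, the ID `album-info` and the style value are all assumptions, so swap in whatever attributes your table actually has.

```python
from bs4 import BeautifulSoup

# ASSUMPTION: made-up page with two tables; the second one is the table we want
html_code = """
<table><tr><td>Some other table</td></tr></table>
<table id="album-info" class="song-table" style="width: 95%;">
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>First Song</td></tr>
  <tr><td>2</td><td>Second Song</td></tr>
</table>
"""

soup = BeautifulSoup(html_code, 'html.parser')

# by position on the page
# find_all returns a list of matching elements, and we want the second ([1]) one
by_position = soup.find_all('table')[1]

# by class name
# => with `find`, you can pass a dictionary of element attributes to match on
by_class = soup.find('table', {'class': 'song-table'})

# by ID
by_id = soup.find('table', {'id': 'album-info'})

# by style (the string has to match the attribute value exactly)
by_style = soup.find('table', {'style': 'width: 95%;'})

print(by_id)
```

All four lookups land on the same table here. Matching on an ID is usually the sturdiest option when the page gives you one, since position and inline styles tend to change more often.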
Let's print a list of track numbers and song titles. Look at the structure of the table -- rows are represented by `tr` tags, and within each row the cells are represented by `td` tags. The `find_all()` method returns a list, and we know how to iterate over lists: with a `for` loop. Let's do that.
In [ ]:
# find the rows in the table
# slice to skip the header row
# loop over the rows
# get the table cells in the row
# assign them to variables
# use the .string attribute to get the text in the cell
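A sketch of that loop, run against a made-up table with placeholder rows. The column order (track number first, then title) is an assumption about the real table.

```python
from bs4 import BeautifulSoup

# ASSUMPTION: stand-in table with placeholder rows -- not the real track listing
html_code = """
<table id="album-info">
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>First Song</td></tr>
  <tr><td>2</td><td>Second Song</td></tr>
</table>
"""

soup = BeautifulSoup(html_code, 'html.parser')
table = soup.find('table', {'id': 'album-info'})

# find the rows in the table
rows = table.find_all('tr')

# slice to skip the header row, then loop over the rows
for row in rows[1:]:

    # get the table cells in the row
    cells = row.find_all('td')

    # assign them to variables
    # use the .string attribute to get the text in the cell
    track_number = cells[0].string
    title = cells[1].string

    print(track_number, title)
```

Skipping the header row matters here because its cells are `th` tags, so `row.find_all('td')` would come back empty and `cells[0]` would raise an `IndexError`.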
In [ ]:
# set up a writer object
# get the table cells in the row
# assign them to variables
# write out the dictionary to file
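Putting it together, here's one way the CSV step might go, using `csv.DictWriter` from the standard library. The output file name `goths.csv`, the column headers `track` and `title`, and the sample table are all assumptions for the sake of a runnable example.

```python
import csv
import os
import tempfile

from bs4 import BeautifulSoup

# ASSUMPTION: stand-in table with placeholder rows -- not the real track listing
html_code = """
<table id="album-info">
  <tr><th>Track</th><th>Title</th></tr>
  <tr><td>1</td><td>First Song</td></tr>
  <tr><td>2</td><td>Second Song</td></tr>
</table>
"""

soup = BeautifulSoup(html_code, 'html.parser')
rows = soup.find('table', {'id': 'album-info'}).find_all('tr')

# ASSUMPTION: hypothetical output path in a temp folder
csv_path = os.path.join(tempfile.mkdtemp(), 'goths.csv')

# newline='' lets the csv module manage line endings itself
with open(csv_path, 'w', newline='') as outfile:

    # set up a writer object and write the header row
    writer = csv.DictWriter(outfile, fieldnames=['track', 'title'])
    writer.writeheader()

    # skip the header row of the HTML table
    for row in rows[1:]:

        # get the table cells in the row
        cells = row.find_all('td')

        # assign them to variables
        track_number = cells[0].string
        title = cells[1].string

        # write out the dictionary to file
        writer.writerow({'track': track_number, 'title': title})
```

The keys in each dictionary have to match the `fieldnames` list, or `DictWriter` will raise a `ValueError`.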