Homework assignment #3

These problem sets focus on using the Beautiful Soup library to scrape web pages.

Problem set #1: Basic scraping

I've made a web page for you to scrape. It's available at http://static.decontextualize.com/widgets2016.html. The page concerns the catalog of a famous widget company. You'll be answering several questions about this web page. In the cell below, I've written some code so that you end up with a variable called html_str that contains the HTML source code of the page, and a variable called document that stores a Beautiful Soup object.



In [267]:

    
from bs4 import BeautifulSoup
from urllib.request import urlopen
html_str = urlopen("http://static.decontextualize.com/widgets2016.html").read()
document = BeautifulSoup(html_str, "html.parser")

Now, in the cell below, use Beautiful Soup to write an expression that evaluates to the number of <h3> tags contained in widgets2016.html.



In [268]:

    
h3_tags = document.find_all('h3')
print(type(h3_tags))
print([tag.string for tag in h3_tags])
# find_all() returns a list-like ResultSet, so len() counts the <h3> tags directly
len(h3_tags)









    



<class 'bs4.element.ResultSet'>
['Forensic Widgets', 'Wondrous widgets', 'Mood widgets', 'Hallowed widgets']






    Out[268]:





4

Now, in the cell below, write an expression or series of statements that displays the telephone number beneath the "Widget Catalog" header.



In [269]:

    
telephone_tag = document.find('a', attrs={'class': 'tel'})
# .string returns the text inside the <a class="tel"> tag
telephone_tag.string



    Out[269]:



'212-555-9912'

In the cell below, use Beautiful Soup to write some code that prints the names of all the widgets on the page. After your code has executed, widget_names should evaluate to a list that looks like this (though not necessarily in this order):

Skinner Widget
Widget For Furtiveness
Widget For Strawman
Jittery Widget
Silver Widget
Divided Widget
Manicurist Widget
Infinite Widget
Yellow-Tipped Widget
Unshakable Widget
Self-Knowledge Widget
Widget For Cinema



In [270]:

    
all_widget = document.find_all('td', attrs={'class': 'wname'})
# Using a for loop
for widget in all_widget:
    print(widget.string)
print("##########")

# Using a list comprehension
widget_names = [tag.string for tag in all_widget]
widget_names









    



Skinner Widget
Widget For Furtiveness
Widget For Strawman
Jittery Widget
Silver Widget
Divided Widget
Manicurist Widget
Infinite Widget
Yellow-Tipped Widget
Unshakable Widget
Self-Knowledge Widget
Widget For Cinema
##########






    Out[270]:





['Skinner Widget',
 'Widget For Furtiveness',
 'Widget For Strawman',
 'Jittery Widget',
 'Silver Widget',
 'Divided Widget',
 'Manicurist Widget',
 'Infinite Widget',
 'Yellow-Tipped Widget',
 'Unshakable Widget',
 'Self-Knowledge Widget',
 'Widget For Cinema']

Problem set #2: Widget dictionaries

For this problem set, we'll continue to use the HTML page from the previous problem set. In the cell below, I've made an empty list and assigned it to a variable called widgets. Write code that populates this list with dictionaries, one dictionary per widget in the source file. The keys of each dictionary should be partno, wname, price, and quantity, and each value should be the text of the corresponding column in that widget's row. After executing the cell, your list should look something like this:

[{'partno': 'C1-9476',
  'price': '$2.70',
  'quantity': '512',
  'wname': 'Skinner Widget'},
 {'partno': 'JDJ-32/V',
  'price': '$9.36',
  'quantity': '967',
  'wname': 'Widget For Furtiveness'},
  ...several items omitted...
 {'partno': '5B-941/F',
  'price': '$13.26',
  'quantity': '919',
  'wname': 'Widget For Cinema'}]

And this expression:

widgets[5]['partno']

... should evaluate to:

LH-74/O



In [271]:

    
widgets = []

# your code here
# Each widget lives in a <tr class="winfo"> row
search_table = document.find_all('tr', attrs={'class': 'winfo'})
for row in search_table:
    widget_info = {}
    partno_tag = row.find('td', attrs={'class': 'partno'})
    price_tag = row.find('td', attrs={'class': 'price'})
    quantity_tag = row.find('td', attrs={'class': 'quantity'})
    wname_tag = row.find('td', attrs={'class': 'wname'})
    widget_info['partno'] = partno_tag.string
    widget_info['price'] = price_tag.string
    widget_info['quantity'] = quantity_tag.string
    # The assignment asks for the key 'wname'
    widget_info['wname'] = wname_tag.string
    widgets.append(widget_info)
widgets
# end your code



    Out[271]:



[{'partno': 'C1-9476',
  'price': '$2.70',
  'quantity': '512',
  'wname': 'Skinner Widget'},
 {'partno': 'JDJ-32/V',
  'price': '$9.36',
  'quantity': '967',
  'wname': 'Widget For Furtiveness'},
 {'partno': 'YP4-325/J',
  'price': '$5.17',
  'quantity': '787',
  'wname': 'Widget For Strawman'},
 {'partno': 'ZZ-274',
  'price': '$12.39',
  'quantity': '895',
  'wname': 'Jittery Widget'},
 {'partno': 'QO-794',
  'price': '$14.31',
  'quantity': '309',
  'wname': 'Silver Widget'},
 {'partno': 'LH-74/O',
  'price': '$6.79',
  'quantity': '981',
  'wname': 'Divided Widget'},
 {'partno': 'VK-486',
  'price': '$8.97',
  'quantity': '441',
  'wname': 'Manicurist Widget'},
 {'partno': 'R4K-990',
  'price': '$11.73',
  'quantity': '320',
  'wname': 'Infinite Widget'},
 {'partno': 'MZ-556/B',
  'price': '$2.35',
  'quantity': '948',
  'wname': 'Yellow-Tipped Widget'},
 {'partno': 'QV-730',
  'price': '$3.76',
  'quantity': '59',
  'wname': 'Unshakable Widget'},
 {'partno': 'T1-9731',
  'price': '$7.11',
  'quantity': '790',
  'wname': 'Self-Knowledge Widget'},
 {'partno': '5B-941/F',
  'price': '$13.26',
  'quantity': '919',
  'wname': 'Widget For Cinema'}]



In [272]:

    
#test 
widgets[5]['partno']









    Out[272]:





'LH-74/O'
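
The row-to-dictionary pattern can also be driven by a list of column names instead of one lookup per column. The sketch below is only an illustration: sample_html is a hypothetical two-row table that assumes the same class names (winfo, partno, wname, price, quantity) as the real page, and sample_doc, columns, and sample_widgets are names made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the class names used on the real page.
sample_html = """
<table class="widgetlist">
  <tr class="winfo">
    <td class="partno">C1-9476</td>
    <td class="wname">Skinner Widget</td>
    <td class="price">$2.70</td>
    <td class="quantity">512</td>
  </tr>
  <tr class="winfo">
    <td class="partno">JDJ-32/V</td>
    <td class="wname">Widget For Furtiveness</td>
    <td class="price">$9.36</td>
    <td class="quantity">967</td>
  </tr>
</table>
"""

columns = ["partno", "wname", "price", "quantity"]
sample_doc = BeautifulSoup(sample_html, "html.parser")

sample_widgets = []
for row in sample_doc.find_all("tr", attrs={"class": "winfo"}):
    record = {}
    for column in columns:
        # Look up this column's <td> by its class and keep its text.
        cell = row.find("td", attrs={"class": column})
        record[column] = str(cell.string)
    sample_widgets.append(record)

print(sample_widgets[1]["partno"])
```

Looping over the column names keeps the dictionary keys and the class names in one place, which helps only as long as the two really are identical.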

In the cell below, duplicate your code from the previous question. Modify the code to ensure that the values for price and quantity in each dictionary are floating-point numbers and integers, respectively. I.e., after executing the cell, your code should display something like this:

[{'partno': 'C1-9476',
  'price': 2.7,
  'quantity': 512,
  'widgetname': 'Skinner Widget'},
 {'partno': 'JDJ-32/V',
  'price': 9.36,
  'quantity': 967,
  'widgetname': 'Widget For Furtiveness'},
 ... some items omitted ...
 {'partno': '5B-941/F',
  'price': 13.26,
  'quantity': 919,
  'widgetname': 'Widget For Cinema'}]

(Hint: Use the float() and int() functions. You may need to use string slices to convert the price field to a floating-point number.)
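
As a minimal sketch of the hint above, the two helpers below (parse_price and parse_quantity are hypothetical names, not part of the assignment) show the conversion on its own. The price helper assumes every price string starts with exactly one "$" character, which is what the page's price column appears to contain.

```python
def parse_price(price_str):
    """Convert a price string such as '$2.70' to a float.

    Assumes the string starts with a single '$' character.
    """
    return float(price_str[1:])  # slice off the '$', then convert


def parse_quantity(quantity_str):
    """Convert a quantity string such as '512' to an int."""
    return int(quantity_str)


print(parse_price("$2.70"))     # 2.7
print(parse_quantity("512"))    # 512
```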



In [273]:

    
widgets = []

# your code here
search_table = document.find_all('tr', attrs={'class': 'winfo'})
for row in search_table:
    widget_info = {}
    partno_tag = row.find('td', attrs={'class': 'partno'})
    price_tag = row.find('td', attrs={'class': 'price'})
    quantity_tag = row.find('td', attrs={'class': 'quantity'})
    widget_tag = row.find('td', attrs={'class': 'wname'})
    widget_info['partno'] = partno_tag.string
    # Slice off the leading '$' before converting the price to a float
    widget_info['price'] = float(price_tag.string[1:])
    widget_info['quantity'] = int(quantity_tag.string)
    widget_info['widget'] = widget_tag.string
    widgets.append(widget_info)
widgets
# end your code









    Out[273]:





[{'partno': 'C1-9476',
  'price': 2.7,
  'quantity': 512,
  'widget': 'Skinner Widget'},
 {'partno': 'JDJ-32/V',
  'price': 9.36,
  'quantity': 967,
  'widget': 'Widget For Furtiveness'},
 {'partno': 'YP4-325/J',
  'price': 5.17,
  'quantity': 787,
  'widget': 'Widget For Strawman'},
 {'partno': 'ZZ-274',
  'price': 12.39,
  'quantity': 895,
  'widget': 'Jittery Widget'},
 {'partno': 'QO-794',
  'price': 14.31,
  'quantity': 309,
  'widget': 'Silver Widget'},
 {'partno': 'LH-74/O',
  'price': 6.79,
  'quantity': 981,
  'widget': 'Divided Widget'},
 {'partno': 'VK-486',
  'price': 8.97,
  'quantity': 441,
  'widget': 'Manicurist Widget'},
 {'partno': 'R4K-990',
  'price': 11.73,
  'quantity': 320,
  'widget': 'Infinite Widget'},
 {'partno': 'MZ-556/B',
  'price': 2.35,
  'quantity': 948,
  'widget': 'Yellow-Tipped Widget'},
 {'partno': 'QV-730',
  'price': 3.76,
  'quantity': 59,
  'widget': 'Unshakable Widget'},
 {'partno': 'T1-9731',
  'price': 7.11,
  'quantity': 790,
  'widget': 'Self-Knowledge Widget'},
 {'partno': '5B-941/F',
  'price': 13.26,
  'quantity': 919,
  'widget': 'Widget For Cinema'}]

Great! I hope you're having fun. In the cell below, write an expression or series of statements that uses the widgets list created in the cell above to calculate the total number of widgets that the factory has in its warehouse.

Expected output: 7928



In [274]:

    
new_list = []
for items in widgets:
    new_list.append(items['quantity'])
sum(new_list)









    Out[274]:





7928

In the cell below, write some Python code that prints the names of widgets whose price is above $9.30.

Expected output:

Widget For Furtiveness
Jittery Widget
Silver Widget
Infinite Widget
Widget For Cinema



In [275]:

    
for widget in widgets:
    if widget['price'] > 9.30:
        print(widget['widget'])









    



Widget For Furtiveness
Jittery Widget
Silver Widget
Infinite Widget
Widget For Cinema
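
The total and the price filter can also be written more compactly with a generator expression and a list comprehension. This is just one possible sketch: sample_records is a hypothetical three-item list copied from the first rows of the output above, so the numbers it prints are for that sample only, not for the full catalog.

```python
# Sample records copied from the first three rows of the widgets output.
sample_records = [
    {"widget": "Skinner Widget", "price": 2.7, "quantity": 512},
    {"widget": "Widget For Furtiveness", "price": 9.36, "quantity": 967},
    {"widget": "Widget For Strawman", "price": 5.17, "quantity": 787},
]

# A generator expression feeds sum() without building an intermediate list.
total_quantity = sum(item["quantity"] for item in sample_records)

# A list comprehension keeps only the names whose price exceeds the threshold.
pricey_names = [item["widget"] for item in sample_records if item["price"] > 9.30]

print(total_quantity)  # 2266
print(pricey_names)    # ['Widget For Furtiveness']
```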

Problem set #3: Sibling rivalries

In the following problem set, you will yet again be working with the data in widgets2016.html. In order to accomplish the tasks in this problem set, you'll need to learn about Beautiful Soup's .find_next_sibling() method. Here's some information about that method, cribbed from the notes:

Often, the tags we're looking for don't have a distinguishing characteristic, like a class attribute, that allows us to find them using .find() and .find_all(), and the tags also aren't in a parent-child relationship. This can be tricky! For example, take the following HTML snippet (which I've assigned to a string called example_html):



In [276]:

    
example_html = """
<h2>Camembert</h2>
<p>A soft cheese made in the Camembert region of France.</p>

<h2>Cheddar</h2>
<p>A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.</p>
"""

If our task was to create a dictionary that maps the name of each cheese to the description in the <p> tag directly after it, .find() and .find_all() alone would leave us out of luck. Fortunately, Beautiful Soup has a .find_next_sibling() method, which searches for the next tag that is a sibling of the tag it's called on (i.e., the two tags share a parent) and that also matches particular criteria. So, for example, to accomplish the task outlined above:



In [277]:

    
example_doc = BeautifulSoup(example_html, "html.parser")
cheese_dict = {}
for h2_tag in example_doc.find_all('h2'):
    cheese_name = h2_tag.string
    cheese_desc_tag = h2_tag.find_next_sibling('p')
    cheese_dict[cheese_name] = cheese_desc_tag.string

cheese_dict









    Out[277]:





{'Camembert': 'A soft cheese made in the Camembert region of France.',
 'Cheddar': 'A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.'}
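
One detail worth keeping in mind is that .find_next_sibling() returns None when no later sibling matches. The sketch below illustrates a simple guard for that case; partial_html is a hypothetical snippet in which the second heading has no description, and partial_doc and descriptions are names invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second heading has no <p> after it.
partial_html = """
<h2>Camembert</h2>
<p>A soft cheese.</p>
<h2>Cheddar</h2>
"""

partial_doc = BeautifulSoup(partial_html, "html.parser")
descriptions = {}
for h2_tag in partial_doc.find_all("h2"):
    p_tag = h2_tag.find_next_sibling("p")
    # Guard against None so a missing description doesn't raise AttributeError.
    descriptions[str(h2_tag.string)] = str(p_tag.string) if p_tag is not None else None

print(descriptions)
```

Note that the method keeps scanning past siblings that don't match, so a heading with a missing description would pick up the <p> that belongs to a later heading if one exists.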

With that knowledge in mind, let's go back to our widgets. In the cell below, write code that uses Beautiful Soup, and in particular the .find_next_sibling() method, to print the part numbers of the widgets that are in the table just beneath the header "Hallowed widgets".

Expected output:

MZ-556/B
QV-730
T1-9731
5B-941/F



In [278]:

    
for h3_tag in document.find_all('h3'):
    # Compare the heading text itself rather than testing membership in the tag
    if h3_tag.string == "Hallowed widgets":
        table = h3_tag.find_next_sibling('table', {'class': 'widgetlist'})
        for partno_tag in table.find_all('td', {'class': 'partno'}):
            print(partno_tag.string)









    



MZ-556/B
QV-730
T1-9731
5B-941/F

Okay, now, the final task. If you can accomplish this, you are truly an expert web scraper. I'll have little web scraper certificates made up and I'll give you one, if you manage to do this thing. And I know you can do it!

In the cell below, I've created a variable category_counts and assigned to it an empty dictionary. Write code to populate this dictionary so that its keys are "categories" of widgets (i.e., the contents of the <h3> tags on the page, such as "Forensic Widgets", "Mood widgets", and "Hallowed widgets") and the value for each key is the number of widgets that occur in that category. I.e., after your code has been executed, the dictionary category_counts should look like this:

{'Forensic Widgets': 3,
 'Hallowed widgets': 4,
 'Mood widgets': 2,
 'Wondrous widgets': 3}



In [287]:

    
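category_counts = {}
for h3_tag in document.find_all('h3'):
    category_name = h3_tag.string
    # The widget table for a category is the next sibling <table> after its <h3>
    table = h3_tag.find_next_sibling('table', {'class': 'widgetlist'})
    # Count one part-number cell per widget row in that table
    partno_tags = table.find_all('td', {'class': 'partno'})
    category_counts[category_name] = len(partno_tags)
category_counts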









    Out[287]:





{'Forensic Widgets': 3,
 'Hallowed widgets': 4,
 'Mood widgets': 2,
 'Wondrous widgets': 3}

Congratulations! You're done.


