Homework #5

This homework presents a sophisticated scenario in which you must design a SQL schema, insert data into it, and issue queries against it.

The scenario

In the year 20XX, I have won the lottery and decided to leave my programming days behind me in order to pursue my true calling as a cat cafe tycoon. This webpage lists the locations of my cat cafes and all the cats that are currently in residence at these cafes.

I'm interested in doing more detailed analysis of my cat cafe holdings and the cats that are currently being cared for by my cafes. For this reason, I've hired you to convert this HTML page into a workable SQL database. (Why don't I just do it myself? Because I am far too busy hanging out with adorable cats in all of my beautiful, beautiful cat cafes.)

Specifically, I want to know the answers to the following questions:

  • What's the name of the youngest cat at any location?
  • In which zip codes can I find a lilac-colored tabby?
  • What's the average weight of cats currently residing at any location (grouped by location)?
  • Which location has the most cats with tortoiseshell coats?

Because I'm not paying you very much, and because I am a merciful person who has considerable experience in these matters, I've decided to write the queries for you. (See below.) Your job is just to scrape the data from the web page, create the appropriate tables in PostgreSQL, and insert the data into those tables.

Before you continue, scroll down to "The Queries" below to examine the queries as I wrote them.

Problem set #1: Scraping the data

Your first goal is to create two data structures, both lists of dictionaries: one for the list of locations and one for the list of cats. You'll get these from scraping two <table> tags in the HTML: the first table has a class of cafe-list, the second has a class of cat-list.

Before you do anything else, though, execute the following cell to import Beautiful Soup and create a BeautifulSoup object with the content of the web page:


In [12]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("http://static.decontextualize.com/cats.html").read()
document = BeautifulSoup(html, "html.parser")

Let's tackle the list of cafes first. In the cell below, write some code that creates a list of dictionaries with information about each cafe, assigning it to the variable cafe_list. I've written some of the code for you; you just need to fill in the rest. The list should end up looking like this:

[{'name': 'Hang In There', 'zip': '11237'},
 {'name': 'Independent Claws', 'zip': '11201'},
 {'name': 'Paws and Play', 'zip': '11215'},
 {'name': 'Tall Tails', 'zip': '11222'},
 {'name': 'Cats Meow', 'zip': '11231'}]

In [13]:
cafe_list = list()
cafe_table = document.find('table', {'class': 'cafe-list'})
tbody = cafe_table.find('tbody')
for tr_tag in tbody.find_all('tr'):
    for td_tag in tr_tag.find_all('td'):
        cafe_name = tr_tag.find('td', {'class': 'name'}).string
        cafe_zip = tr_tag.find('td', {'class': 'zip'}).string
        cafe_dict={'name': cafe_name, 'zip': cafe_zip}
    cafe_list.append(cafe_dict)
cafe_list


Out[13]:
[{'name': 'Hang In There', 'zip': '11237'},
 {'name': 'Independent Claws', 'zip': '11201'},
 {'name': 'Paws and Play', 'zip': '11215'},
 {'name': 'Tall Tails', 'zip': '11222'},
 {'name': 'Cats Meow', 'zip': '11231'}]

Great! In the following cell, write some code that creates a list of cats from the <table> tag on the page, storing them as a list of dictionaries in a variable called cat_list. Again, I've written a bit of the code for you. Expected output:

[{'birthdate': '2015-05-20',
  'color': 'black',
  'locations': ['Paws and Play', 'Independent Claws*'],
  'name': 'Sylvester',
  'pattern': 'colorpoint',
  'weight': 10.46},
 {'birthdate': '2000-01-03',
  'color': 'cinnamon',
  'locations': ['Independent Claws*'],
  'name': 'Jasper',
  'pattern': 'solid',
  'weight': 8.06},
 {'birthdate': '2006-02-27',
  'color': 'brown',
  'locations': ['Independent Claws*'],
  'name': 'Luna',
  'pattern': 'tortoiseshell',
  'weight': 10.88},
[...many records omitted for brevity...]
 {'birthdate': '1999-01-09',
  'color': 'white',
  'locations': ['Cats Meow*', 'Independent Claws', 'Tall Tails'],
  'name': 'Lafayette',
  'pattern': 'tortoiseshell',
  'weight': 9.3}]

Note: Observe the data types of the values in each dictionary! Make sure to explicitly convert values retrieved from .string attributes of Beautiful Soup tag objects to strs using the str() function.


In [14]:
cat_list = list()
cat_table = document.find('table', {'class': 'cat-list'})
tbody = cat_table.find('tbody')
for tr_tag in tbody.find_all('tr'):
    cat_dict = {}
    cat_dict['birthdate'] = tr_tag.find('td', {'class': 'birthdate'}).string
    cat_dict['color'] = tr_tag.find('td', {'class': 'color'}).string
    cat_dict['locations'] = tr_tag.find('td', {'class': 'locations'}).string.split(",")
    cat_dict['name'] = tr_tag.find('td', {'class': 'name'}).string
    cat_dict['pattern'] = tr_tag.find('td', {'class': 'pattern'}).string
    cat_dict['weight'] = float(tr_tag.find('td', {'class': 'weight'}).string)
    cat_list.append(cat_dict)
cat_list


Out[14]:
[{'birthdate': '2015-05-20',
  'color': 'black',
  'locations': ['Paws and Play', ' Independent Claws*'],
  'name': 'Sylvester',
  'pattern': 'colorpoint',
  'weight': 10.46},
 {'birthdate': '2000-01-03',
  'color': 'cinnamon',
  'locations': ['Independent Claws*'],
  'name': 'Jasper',
  'pattern': 'solid',
  'weight': 8.06},
 {'birthdate': '2006-02-27',
  'color': 'brown',
  'locations': ['Independent Claws*'],
  'name': 'Luna',
  'pattern': 'tortoiseshell',
  'weight': 10.88},
 {'birthdate': '2015-08-13',
  'color': 'white',
  'locations': ['Tall Tails*', ' Hang In There'],
  'name': 'Georges',
  'pattern': 'tabby',
  'weight': 9.4},
 {'birthdate': '2003-09-13',
  'color': 'red',
  'locations': ['Paws and Play*'],
  'name': 'Millie',
  'pattern': 'bicolor',
  'weight': 9.27},
 {'birthdate': '2009-07-30',
  'color': 'cream',
  'locations': ['Hang In There*'],
  'name': 'Lisa',
  'pattern': 'colorpoint',
  'weight': 8.84},
 {'birthdate': '2011-12-15',
  'color': 'cream',
  'locations': ['Hang In There*', ' Cats Meow', ' Tall Tails'],
  'name': 'Oscar',
  'pattern': 'solid',
  'weight': 8.44},
 {'birthdate': '2015-12-30',
  'color': 'lilac',
  'locations': ['Paws and Play*', ' Cats Meow'],
  'name': 'Scaredy',
  'pattern': 'tabby',
  'weight': 8.83},
 {'birthdate': '2013-10-16',
  'color': 'blue',
  'locations': ['Independent Claws', ' Paws and Play*'],
  'name': 'Charlotte',
  'pattern': 'tabby',
  'weight': 9.54},
 {'birthdate': '2011-02-07',
  'color': 'white',
  'locations': ['Independent Claws*', ' Hang In There'],
  'name': 'Whiskers',
  'pattern': 'colorpoint',
  'weight': 9.47},
 {'birthdate': '2007-05-28',
  'color': 'lilac',
  'locations': ['Independent Claws', ' Cats Meow*', ' Paws and Play'],
  'name': 'Patches',
  'pattern': 'colorpoint',
  'weight': 10.12},
 {'birthdate': '1998-12-01',
  'color': 'black',
  'locations': ['Independent Claws', ' Paws and Play*'],
  'name': 'Charly',
  'pattern': 'bicolor',
  'weight': 9.2},
 {'birthdate': '2013-04-09',
  'color': 'black',
  'locations': ['Hang In There', ' Tall Tails*'],
  'name': 'Mimi',
  'pattern': 'solid',
  'weight': 9.19},
 {'birthdate': '2000-06-28',
  'color': 'black',
  'locations': ['Hang In There*'],
  'name': 'Samantha',
  'pattern': 'solid',
  'weight': 7.49},
 {'birthdate': '2014-08-16',
  'color': 'fawn',
  'locations': ['Independent Claws*', ' Paws and Play'],
  'name': 'Bella',
  'pattern': 'solid',
  'weight': 10.47},
 {'birthdate': '2009-01-28',
  'color': 'cinnamon',
  'locations': ['Tall Tails*'],
  'name': 'Angel',
  'pattern': 'tortoiseshell',
  'weight': 9.72},
 {'birthdate': '2015-12-16',
  'color': 'cinnamon',
  'locations': ['Paws and Play*'],
  'name': 'Garfield',
  'pattern': 'calico',
  'weight': 7.18},
 {'birthdate': '1997-06-12',
  'color': 'red',
  'locations': ['Paws and Play*', ' Tall Tails'],
  'name': 'Oliver',
  'pattern': 'bicolor',
  'weight': 9.44},
 {'birthdate': '2010-12-22',
  'color': 'red',
  'locations': ['Hang In There', ' Independent Claws*'],
  'name': 'Moritz',
  'pattern': 'tortoiseshell',
  'weight': 10.1},
 {'birthdate': '2009-01-10',
  'color': 'cream',
  'locations': ['Cats Meow', ' Independent Claws*', ' Tall Tails'],
  'name': 'Stevens',
  'pattern': 'bicolor',
  'weight': 8.12},
 {'birthdate': '2010-11-30',
  'color': 'cinnamon',
  'locations': ['Independent Claws*'],
  'name': 'Sassy',
  'pattern': 'solid',
  'weight': 8.67},
 {'birthdate': '2003-04-14',
  'color': 'red',
  'locations': ['Hang In There*', ' Tall Tails'],
  'name': 'Lily',
  'pattern': 'bicolor',
  'weight': 10.01},
 {'birthdate': '2014-05-15',
  'color': 'blue',
  'locations': ['Paws and Play*', ' Independent Claws', ' Tall Tails'],
  'name': "O'Malley",
  'pattern': 'tortoiseshell',
  'weight': 8.86},
 {'birthdate': '2009-01-18',
  'color': 'cinnamon',
  'locations': ['Tall Tails*'],
  'name': 'Maru',
  'pattern': 'bicolor',
  'weight': 10.09},
 {'birthdate': '2004-03-04',
  'color': 'lilac',
  'locations': ['Tall Tails', ' Hang In There', ' Cats Meow*'],
  'name': 'Oreo',
  'pattern': 'bicolor',
  'weight': 10.76},
 {'birthdate': '2005-05-17',
  'color': 'brown',
  'locations': ['Cats Meow', ' Paws and Play', ' Tall Tails*'],
  'name': 'Molly',
  'pattern': 'colorpoint',
  'weight': 9.58},
 {'birthdate': '2007-06-25',
  'color': 'brown',
  'locations': ['Hang In There', ' Independent Claws', ' Cats Meow*'],
  'name': 'Minette',
  'pattern': 'calico',
  'weight': 8.84},
 {'birthdate': '2011-02-20',
  'color': 'red',
  'locations': ['Paws and Play*'],
  'name': 'Smudge',
  'pattern': 'colorpoint',
  'weight': 12.0},
 {'birthdate': '2007-12-02',
  'color': 'black',
  'locations': ['Cats Meow', ' Hang In There', ' Independent Claws*'],
  'name': 'Jack',
  'pattern': 'colorpoint',
  'weight': 9.92},
 {'birthdate': '2009-05-06',
  'color': 'red',
  'locations': ['Paws and Play*'],
  'name': 'Lolcat',
  'pattern': 'bicolor',
  'weight': 8.88},
 {'birthdate': '1998-07-18',
  'color': 'cinnamon',
  'locations': ['Independent Claws*'],
  'name': 'Diana',
  'pattern': 'colorpoint',
  'weight': 7.81},
 {'birthdate': '1997-11-29',
  'color': 'brown',
  'locations': ['Tall Tails*'],
  'name': 'Minka',
  'pattern': 'tabby',
  'weight': 9.37},
 {'birthdate': '1997-03-09',
  'color': 'lilac',
  'locations': ['Independent Claws', ' Cats Meow', ' Hang In There*'],
  'name': 'Carlos',
  'pattern': 'tabby',
  'weight': 9.92},
 {'birthdate': '1998-03-24',
  'color': 'cinnamon',
  'locations': ['Tall Tails', ' Independent Claws', ' Hang In There*'],
  'name': 'Roquefort',
  'pattern': 'colorpoint',
  'weight': 9.73},
 {'birthdate': '2005-12-25',
  'color': 'brown',
  'locations': ['Independent Claws', ' Paws and Play*'],
  'name': 'Chanel',
  'pattern': 'tabby',
  'weight': 9.04},
 {'birthdate': '2003-12-07',
  'color': 'black',
  'locations': ['Hang In There*', ' Independent Claws'],
  'name': 'Jiji',
  'pattern': 'tabby',
  'weight': 10.35},
 {'birthdate': '1997-01-04',
  'color': 'fawn',
  'locations': ['Paws and Play*', ' Cats Meow'],
  'name': 'Poppy',
  'pattern': 'calico',
  'weight': 9.89},
 {'birthdate': '2010-10-30',
  'color': 'cinnamon',
  'locations': ['Tall Tails*'],
  'name': 'Berlioz',
  'pattern': 'colorpoint',
  'weight': 11.37},
 {'birthdate': '2010-02-26',
  'color': 'fawn',
  'locations': ['Hang In There', ' Tall Tails', ' Independent Claws*'],
  'name': 'Max',
  'pattern': 'colorpoint',
  'weight': 8.72},
 {'birthdate': '1999-01-09',
  'color': 'white',
  'locations': ['Cats Meow*', ' Independent Claws', ' Tall Tails'],
  'name': 'Lafayette',
  'pattern': 'tortoiseshell',
  'weight': 9.3}]

Problem set #2: Designing the schema

Before you do anything else, use psql to create a new database for this homework assignment using the following command:

CREATE DATABASE catcafes;

In the following cell, connect to the database using pg8000. (You may need to provide additional arguments to the .connect() method, depending on the distribution of PostgreSQL you're using.)


In [15]:
import pg8000
conn = pg8000.connect(user='postgres', password='12345', database="catcafes")

Here's a cell you can run if something goes wrong and you need to rollback the current query session:


In [19]:
conn.rollback()

In the cell below, you're going to create three tables, necessary to represent the data you scraped above. I've given the basic framework of the Python code and SQL statements to create these tables. I've given the entire CREATE TABLE statement for the cafe table, but for the other two, you'll need to supply the field names and the data types for each column. If you're unsure what to call the fields, or what fields should be in the tables, consult the queries in "The Queries" below. Hints:

  • Many of these fields will be varchars. Don't worry too much about how many characters you need—it's okay just to eyeball it.
  • Feel free to use a varchar type to store the birthdate field. No need to dig too deep into PostgreSQL's date types for this particular homework assignment.
  • Cats and locations are in a many-to-many relationship. You'll need to create a linking table to represent this relationship. (That's why there's space for you to create three tables.)
  • The linking table will need a field to keep track of whether or not a particular cafe is the "current" cafe for a given cat.

In [20]:
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE cafe (
  id serial,
  name varchar(40),
  zip varchar(5)
)
""")

cursor.execute("""
CREATE TABLE cat (
  id serial,
  name varchar(50),
  birthdate varchar(50),
  weight float,
  color varchar(50),
  pattern varchar(50)
  
)
""")

cursor.execute("""
CREATE TABLE cat_cafe (
    cat_id int,
    cafe_id int,
    active bool
)
""")
conn.commit()

After executing the above cell, issuing a \d command in psql should yield something that looks like the following:

             List of relations
 Schema |    Name     |   Type   |  Owner  
--------+-------------+----------+---------
 public | cafe        | table    | allison
 public | cafe_id_seq | sequence | allison
 public | cat         | table    | allison
 public | cat_cafe    | table    | allison
 public | cat_id_seq  | sequence | allison
(5 rows)

If something doesn't look right, you can always use the DROP TABLE command to drop the tables and start again. (You can also issue a DROP DATABASE catcafes command to drop the database altogether.) Don't worry if it takes a few tries to get it right—happens to the best and most expert among us. You'll probably have to drop the database and start again from scratch several times while completing this homework.

Note: If you try to issue a DROP TABLE or DROP DATABASE command and psql seems to hang forever, it could be that PostgreSQL is waiting for current connections to close before proceeding with your command. To fix this, create a cell with the code conn.close() in your notebook and execute it. After the DROP commands have completed, make sure to run the cell containing the pg8000.connect() call again.

Problem set #3: Inserting the data

In the cell below, I've written the code to insert the cafes into the cafe table, using data from the cafe_list variable that we made earlier. If the code you wrote to create that table was correct, the following cell should execute without error or incident. Execute it before you continue.


In [21]:
cafe_name_id_map = {}
for item in cafe_list:
    cursor.execute("INSERT INTO cafe (name, zip) VALUES (%s, %s) RETURNING id",
                  [str(item['name']), str(item['zip'])])
    rowid = cursor.fetchone()[0]
    cafe_name_id_map[str(item['name'])] = rowid
conn.commit()

Issuing SELECT * FROM cafe in the psql client should yield something that looks like this:

 id |       name        |  zip  
----+-------------------+-------
  1 | Hang In There     | 11237
  2 | Independent Claws | 11201
  3 | Paws and Play     | 11215
  4 | Tall Tails        | 11222
  5 | Cats Meow         | 11231
(5 rows)

(The id values may be different depending on how many times you've cleaned the table out with DELETE.)

Note that the code in the cell above created a dictionary called cafe_name_id_map. What's in it? Let's see:


In [22]:
cafe_name_id_map


Out[22]:
{'Cats Meow': 5,
 'Hang In There': 1,
 'Independent Claws': 2,
 'Paws and Play': 3,
 'Tall Tails': 4}

The dictionary maps the name of the cat cafe to its ID in the database. You'll need these values later when you're adding records to the linking table (cat_cafe).

Now the tricky part. (Yes, believe it or not, this is the tricky part. The other stuff has all been easy by comparison.) In the cell below, write the Python code to insert each cat's data from the cat_list variable (created in Problem Set #1) into the cat table. The code should also insert the relevant data into the cat_cafe table. Hints:

  • You'll need to get the id of each cat record using the RETURNING clause of the INSERT statement and the .fetchone() method of the cursor object.
  • How do you know whether or not the current location is the "active" location for a particular cat? The page itself contains some explanatory text that might be helpful here. You might need to use some string checking and manipulation functions in order to make this determination and transform the string as needed.
  • The linking table stores an ID only for both the cat and the cafe. Use the cafe_name_id_map dictionary to get the id of the cafes inserted earlier.

In [23]:
conn.rollback()

In [24]:
cat_insert = "Insert into cat (name, birthdate, weight, color, pattern) values (%s, %s, %s, %s, %s) returning id"
cat_cafe_insert = "Insert into cat_cafe (cat_id, cafe_id, active) values (%s, %s, %s)"

for cat in cat_list:
    cursor.execute(cat_insert, [str(cat['name']), str(cat['birthdate']), float(cat['weight']), str(cat['color']), str(cat['pattern'])])
    catrowid = cursor.fetchone()[0]
    for cafe in cat['locations']:
        for place in cafe_name_id_map:
            if place in cafe:
                if '*' in cafe:
                    cursor.execute(cat_cafe_insert, [catrowid, cafe_name_id_map[place], True])
                else:
                    cursor.execute(cat_cafe_insert, [catrowid, cafe_name_id_map[place], False])
conn.commit()

Issuing a SELECT * FROM cat LIMIT 10 in psql should yield something that looks like this:

 id |   name    | birthdate  | weight |  color   |    pattern    
----+-----------+------------+--------+----------+---------------
  1 | Sylvester | 2015-05-20 |  10.46 | black    | colorpoint
  2 | Jasper    | 2000-01-03 |   8.06 | cinnamon | solid
  3 | Luna      | 2006-02-27 |  10.88 | brown    | tortoiseshell
  4 | Georges   | 2015-08-13 |   9.40 | white    | tabby
  5 | Millie    | 2003-09-13 |   9.27 | red      | bicolor
  6 | Lisa      | 2009-07-30 |   8.84 | cream    | colorpoint
  7 | Oscar     | 2011-12-15 |   8.44 | cream    | solid
  8 | Scaredy   | 2015-12-30 |   8.83 | lilac    | tabby
  9 | Charlotte | 2013-10-16 |   9.54 | blue     | tabby
 10 | Whiskers  | 2011-02-07 |   9.47 | white    | colorpoint
(10 rows)

And a SELECT * FROM cat_cafe LIMIT 10 in psql should look like this:

 cat_id | cafe_id | active 
--------+---------+--------
      1 |       3 | f
      1 |       2 | t
      2 |       2 | t
      3 |       2 | t
      4 |       4 | t
      4 |       1 | f
      5 |       3 | t
      6 |       1 | t
      7 |       1 | t
      7 |       5 | f
(10 rows)

Again, the exact values for the ID columns might be different, depending on how many times you've deleted and dropped the tables.

The Queries

Okay. To verify your work, run the following queries and check their output. If you've correctly scraped the data and imported it into SQL, running the cells should produce exactly the expected output, as indicated. If not, then you performed one of the steps above incorrectly; check your work and try again. (Note: Don't modify these cells, just run them! This homework was about scraping and inserting data, not querying it.)

What's the name of the youngest cat at any location?

Expected output: Scaredy


In [25]:
cursor.execute("SELECT max(birthdate) FROM cat")
birthdate = cursor.fetchone()[0]
cursor.execute("SELECT name FROM cat WHERE birthdate = %s", [birthdate])
print(cursor.fetchone()[0])


Scaredy

In which zip codes can I find a lilac-colored tabby?

Expected output: 11237, 11215


In [26]:
cursor.execute("""SELECT DISTINCT(cafe.zip)
    FROM cat
    JOIN cat_cafe ON cat.id = cat_cafe.cat_id
    JOIN cafe ON cafe.id = cat_cafe.cafe_id
    WHERE cat.color = 'lilac' AND cat.pattern = 'tabby' AND cat_cafe.active = true
""")
print(', '.join([x[0] for x in cursor.fetchall()]))


11237, 11215

What's the average weight of cats currently residing at all locations?

Expected output:

Independent Claws: 9.33
Paws and Play: 9.28
Tall Tails: 9.82
Hang In There: 9.25
Cats Meow: 9.76

In [27]:
cursor.execute("""
    SELECT cafe.name, avg(cat.weight)
    FROM cat
    JOIN cat_cafe ON cat.id = cat_cafe.cat_id
    JOIN cafe ON cafe.id = cat_cafe.cafe_id
    WHERE cat_cafe.active = true
    GROUP BY cafe.name
    """)
for rec in cursor.fetchall():
    print(rec[0]+":", "%0.2f" % rec[1])


Hang In There: 9.25
Independent Claws: 9.33
Paws and Play: 9.28
Tall Tails: 9.82
Cats Meow: 9.75

Which location has the most cats with tortoiseshell coats?

Expected output: Independent Claws


In [28]:
cursor.execute("""
    SELECT cafe.name
    FROM cat
    JOIN cat_cafe ON cat.id = cat_cafe.cat_id
    JOIN cafe ON cafe.id = cat_cafe.cafe_id
    WHERE cat_cafe.active = true AND cat.pattern = 'tortoiseshell'
    GROUP BY cafe.name
    ORDER BY count(cat.name) DESC
    LIMIT 1
""")
print(cursor.fetchone()[0])


Independent Claws

Did they all work? Great job! You're done.


In [ ]: