This homework presents a sophisticated scenario in which you must design a SQL schema, insert data into it, and issue queries against it.
In the year 20XX, I have won the lottery and decided to leave my programming days behind me in order to pursue my true calling as a cat cafe tycoon. This webpage lists the locations of my cat cafes and all the cats that are currently in residence at these cafes.
I'm interested in doing more detailed analysis of my cat cafe holdings and the cats that are currently being cared for by my cafes. For this reason, I've hired you to convert this HTML page into a workable SQL database. (Why don't I just do it myself? Because I am far too busy hanging out with adorable cats in all of my beautiful, beautiful cat cafes.)
Specifically, I want to know the answers to the following questions:
Because I'm not paying you very much, and because I am a merciful person who has considerable experience in these matters, I've decided to write the queries for you. (See below.) Your job is just to scrape the data from the web page, create the appropriate tables in PostgreSQL, and insert the data into those tables.
Before you continue, scroll down to "The Queries" below to examine the queries as I wrote them.
Your first goal is to create two data structures, both lists of dictionaries: one for the list of locations and one for the list of cats. You'll get these from scraping two <table>
tags in the HTML: the first table has a class of cafe-list
, the second has a class of cat-list
.
Before you do anything else, though, execute the following cell to import Beautiful Soup and create a BeautifulSoup object with the content of the web page:
In [12]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("http://static.decontextualize.com/cats.html").read()
document = BeautifulSoup(html, "html.parser")
Let's tackle the list of cafes first. In the cell below, write some code that creates a list of dictionaries with information about each cafe, assigning it to the variable cafe_list
. I've written some of the code for you; you just need to fill in the rest. The list should end up looking like this:
[{'name': 'Hang In There', 'zip': '11237'},
{'name': 'Independent Claws', 'zip': '11201'},
{'name': 'Paws and Play', 'zip': '11215'},
{'name': 'Tall Tails', 'zip': '11222'},
{'name': 'Cats Meow', 'zip': '11231'}]
In [13]:
cafe_list = list()
cafe_table = document.find('table', {'class': 'cafe-list'})
tbody = cafe_table.find('tbody')
for tr_tag in tbody.find_all('tr'):
for td_tag in tr_tag.find_all('td'):
cafe_name = tr_tag.find('td', {'class': 'name'}).string
cafe_zip = tr_tag.find('td', {'class': 'zip'}).string
cafe_dict={'name': cafe_name, 'zip': cafe_zip}
cafe_list.append(cafe_dict)
cafe_list
Out[13]:
Great! In the following cell, write some code that creates a list of cats from the <table>
tag on the page, storing them as a list of dictionaries in a variable called cat_list
. Again, I've written a bit of the code for you. Expected output:
[{'birthdate': '2015-05-20',
'color': 'black',
'locations': ['Paws and Play', 'Independent Claws*'],
'name': 'Sylvester',
'pattern': 'colorpoint',
'weight': 10.46},
{'birthdate': '2000-01-03',
'color': 'cinnamon',
'locations': ['Independent Claws*'],
'name': 'Jasper',
'pattern': 'solid',
'weight': 8.06},
{'birthdate': '2006-02-27',
'color': 'brown',
'locations': ['Independent Claws*'],
'name': 'Luna',
'pattern': 'tortoiseshell',
'weight': 10.88},
[...many records omitted for brevity...]
{'birthdate': '1999-01-09',
'color': 'white',
'locations': ['Cats Meow*', 'Independent Claws', 'Tall Tails'],
'name': 'Lafayette',
'pattern': 'tortoiseshell',
'weight': 9.3}]
Note: Observe the data types of the values in each dictionary! Make sure to explicitly convert values retrieved from .string
attributes of Beautiful Soup tag objects to str
s using the str()
function.
In [14]:
cat_list = list()
cat_table = document.find('table', {'class': 'cat-list'})
tbody = cat_table.find('tbody')
for tr_tag in tbody.find_all('tr'):
cat_dict = {}
cat_dict['birthdate'] = tr_tag.find('td', {'class': 'birthdate'}).string
cat_dict['color'] = tr_tag.find('td', {'class': 'color'}).string
cat_dict['locations'] = tr_tag.find('td', {'class': 'locations'}).string.split(",")
cat_dict['name'] = tr_tag.find('td', {'class': 'name'}).string
cat_dict['pattern'] = tr_tag.find('td', {'class': 'pattern'}).string
cat_dict['weight'] = float(tr_tag.find('td', {'class': 'weight'}).string)
cat_list.append(cat_dict)
cat_list
Out[14]:
Before you do anything else, use psql
to create a new database for this homework assignment using the following command:
CREATE DATABASE catcafes;
In the following cell, connect to the database using pg8000
. (You may need to provide additional arguments to the .connect()
method, depending on the distribution of PostgreSQL you're using.)
In [15]:
import pg8000
conn = pg8000.connect(user='postgres', password='12345', database="catcafes")
Here's a cell you can run if something goes wrong and you need to rollback the current query session:
In [19]:
conn.rollback()
In the cell below, you're going to create three tables, necessary to represent the data you scraped above. I've given the basic framework of the Python code and SQL statements to create these tables. I've given the entire CREATE TABLE
statement for the cafe
table, but for the other two, you'll need to supply the field names and the data types for each column. If you're unsure what to call the fields, or what fields should be in the tables, consult the queries in "The Queries" below. Hints:
varchar
s. Don't worry too much about how many characters you need—it's okay just to eyeball it.varchar
type to store the birthdate
field. No need to dig too deep into PostgreSQL's date types for this particular homework assignment.
In [20]:
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE cafe (
id serial,
name varchar(40),
zip varchar(5)
)
""")
cursor.execute("""
CREATE TABLE cat (
id serial,
name varchar(50),
birthdate varchar(50),
weight float,
color varchar(50),
pattern varchar(50)
)
""")
cursor.execute("""
CREATE TABLE cat_cafe (
cat_id int,
cafe_id int,
active bool
)
""")
conn.commit()
After executing the above cell, issuing a \d
command in psql
should yield something that looks like the following:
List of relations
Schema | Name | Type | Owner
--------+-------------+----------+---------
public | cafe | table | allison
public | cafe_id_seq | sequence | allison
public | cat | table | allison
public | cat_cafe | table | allison
public | cat_id_seq | sequence | allison
(5 rows)
If something doesn't look right, you can always use the DROP TABLE
command to drop the tables and start again. (You can also issue a DROP DATABASE catcafes
command to drop the database altogether.) Don't worry if it takes a few tries to get it right—happens to the best and most expert among us. You'll probably have to drop the database and start again from scratch several times while completing this homework.
Note: If you try to issue a
DROP TABLE
orDROP DATABASE
command andpsql
seems to hang forever, it could be that PostgreSQL is waiting for current connections to close before proceeding with your command. To fix this, create a cell with the codeconn.close()
in your notebook and execute it. After theDROP
commands have completed, make sure to run the cell containing thepg8000.connect()
call again.
In the cell below, I've written the code to insert the cafes into the cafe
table, using data from the cafe_list
variable that we made earlier. If the code you wrote to create that table was correct, the following cell should execute without error or incident. Execute it before you continue.
In [21]:
cafe_name_id_map = {}
for item in cafe_list:
cursor.execute("INSERT INTO cafe (name, zip) VALUES (%s, %s) RETURNING id",
[str(item['name']), str(item['zip'])])
rowid = cursor.fetchone()[0]
cafe_name_id_map[str(item['name'])] = rowid
conn.commit()
Issuing SELECT * FROM cafe
in the psql
client should yield something that looks like this:
id | name | zip
----+-------------------+-------
1 | Hang In There | 11237
2 | Independent Claws | 11201
3 | Paws and Play | 11215
4 | Tall Tails | 11222
5 | Cats Meow | 11231
(5 rows)
(The id
values may be different depending on how many times you've cleaned the table out with DELETE
.)
Note that the code in the cell above created a dictionary called cafe_name_id_map
. What's in it? Let's see:
In [22]:
cafe_name_id_map
Out[22]:
The dictionary maps the name of the cat cafe to its ID in the database. You'll need these values later when you're adding records to the linking table (cat_cafe
).
Now the tricky part. (Yes, believe it or not, this is the tricky part. The other stuff has all been easy by comparison.) In the cell below, write the Python code to insert each cat's data from the cat_list
variable (created in Problem Set #1) into the cat
table. The code should also insert the relevant data into the cat_cafe
table. Hints:
id
of each cat record using the RETURNING
clause of the INSERT
statement and the .fetchone()
method of the cursor object.cafe_name_id_map
dictionary to get the id
of the cafes inserted earlier.
In [23]:
conn.rollback()
In [24]:
cat_insert = "Insert into cat (name, birthdate, weight, color, pattern) values (%s, %s, %s, %s, %s) returning id"
cat_cafe_insert = "Insert into cat_cafe (cat_id, cafe_id, active) values (%s, %s, %s)"
for cat in cat_list:
cursor.execute(cat_insert, [str(cat['name']), str(cat['birthdate']), float(cat['weight']), str(cat['color']), str(cat['pattern'])])
catrowid = cursor.fetchone()[0]
for cafe in cat['locations']:
for place in cafe_name_id_map:
if place in cafe:
if '*' in cafe:
cursor.execute(cat_cafe_insert, [catrowid, cafe_name_id_map[place], True])
else:
cursor.execute(cat_cafe_insert, [catrowid, cafe_name_id_map[place], False])
conn.commit()
Issuing a SELECT * FROM cat LIMIT 10
in psql
should yield something that looks like this:
id | name | birthdate | weight | color | pattern
----+-----------+------------+--------+----------+---------------
1 | Sylvester | 2015-05-20 | 10.46 | black | colorpoint
2 | Jasper | 2000-01-03 | 8.06 | cinnamon | solid
3 | Luna | 2006-02-27 | 10.88 | brown | tortoiseshell
4 | Georges | 2015-08-13 | 9.40 | white | tabby
5 | Millie | 2003-09-13 | 9.27 | red | bicolor
6 | Lisa | 2009-07-30 | 8.84 | cream | colorpoint
7 | Oscar | 2011-12-15 | 8.44 | cream | solid
8 | Scaredy | 2015-12-30 | 8.83 | lilac | tabby
9 | Charlotte | 2013-10-16 | 9.54 | blue | tabby
10 | Whiskers | 2011-02-07 | 9.47 | white | colorpoint
(10 rows)
And a SELECT * FROM cat_cafe LIMIT 10
in psql
should look like this:
cat_id | cafe_id | active
--------+---------+--------
1 | 3 | f
1 | 2 | t
2 | 2 | t
3 | 2 | t
4 | 4 | t
4 | 1 | f
5 | 3 | t
6 | 1 | t
7 | 1 | t
7 | 5 | f
(10 rows)
Again, the exact values for the ID columns might be different, depending on how many times you've deleted and dropped the tables.
Okay. To verify your work, run the following queries and check their output. If you've correctly scraped the data and imported it into SQL, running the cells should produce exactly the expected output, as indicated. If not, then you performed one of the steps above incorrectly; check your work and try again. (Note: Don't modify these cells, just run them! This homework was about scraping and inserting data, not querying it.)
Expected output: Scaredy
In [25]:
cursor.execute("SELECT max(birthdate) FROM cat")
birthdate = cursor.fetchone()[0]
cursor.execute("SELECT name FROM cat WHERE birthdate = %s", [birthdate])
print(cursor.fetchone()[0])
In [26]:
cursor.execute("""SELECT DISTINCT(cafe.zip)
FROM cat
JOIN cat_cafe ON cat.id = cat_cafe.cat_id
JOIN cafe ON cafe.id = cat_cafe.cafe_id
WHERE cat.color = 'lilac' AND cat.pattern = 'tabby' AND cat_cafe.active = true
""")
print(', '.join([x[0] for x in cursor.fetchall()]))
In [27]:
cursor.execute("""
SELECT cafe.name, avg(cat.weight)
FROM cat
JOIN cat_cafe ON cat.id = cat_cafe.cat_id
JOIN cafe ON cafe.id = cat_cafe.cafe_id
WHERE cat_cafe.active = true
GROUP BY cafe.name
""")
for rec in cursor.fetchall():
print(rec[0]+":", "%0.2f" % rec[1])
In [28]:
cursor.execute("""
SELECT cafe.name
FROM cat
JOIN cat_cafe ON cat.id = cat_cafe.cat_id
JOIN cafe ON cafe.id = cat_cafe.cafe_id
WHERE cat_cafe.active = true AND cat.pattern = 'tortoiseshell'
GROUP BY cafe.name
ORDER BY count(cat.name) DESC
LIMIT 1
""")
print(cursor.fetchone()[0])
Did they all work? Great job! You're done.
In [ ]: