This homework assignment takes the form of an IPython Notebook. There are a number of exercises below, with notebook cells that need to be completed in order to meet particular criteria. Your job is to fill in the cells as appropriate. You'll need to download this notebook file to your computer before you can complete the assignment. To do so, follow these steps:
These problem sets address SQL, with a focus on joins and aggregates.
I've prepared a SQL version of the MovieLens data for you to use in this homework. Download this .psql file here. You'll be importing this data into your own local copy of PostgreSQL.
To import the data, follow these steps:
psql
.CREATE DATABASE homework2;
\c homework2
.psql
file you downloaded earlier by typing \i
followed by the path to the .psql
file.After you run the \i
command, you should see the following output:
CREATE TABLE
CREATE TABLE
CREATE TABLE
COPY 100000
COPY 1682
COPY 943
The table schemas for the data look like this:
Table "public.udata"
Column | Type | Modifiers
-----------+---------+-----------
user_id | integer |
item_id | integer |
rating | integer |
timestamp | integer |
Table "public.uuser"
Column | Type | Modifiers
------------+-----------------------+-----------
user_id | integer |
age | integer |
gender | character varying(1) |
occupation | character varying(80) |
zip_code | character varying(10) |
Table "public.uitem"
Column | Type | Modifiers
--------------------+------------------------+-----------
movie_id | integer | not null
movie_title | character varying(81) | not null
release_date | date |
video_release_date | character varying(32) |
imdb_url | character varying(134) |
unknown | integer | not null
action | integer | not null
adventure | integer | not null
animation | integer | not null
childrens | integer | not null
comedy | integer | not null
crime | integer | not null
documentary | integer | not null
drama | integer | not null
fantasy | integer | not null
film_noir | integer | not null
horror | integer | not null
musical | integer | not null
mystery | integer | not null
romance | integer | not null
scifi | integer | not null
thriller | integer | not null
war | integer | not null
western | integer | not null
Run the cell below to create a connection object. This should work whether you have pg8000
installed or psycopg2
.
In [100]:
import pg8000
conn = pg8000.connect(user='postgres', password='password', database="homework2_radhika")
If you get an error stating that database "homework2" does not exist
, make sure that you followed the instructions above exactly. If necessary, drop the database you created (with, e.g., DROP DATABASE your_database_name
) and start again.
In all of the cells below, I've provided the necessary Python scaffolding to perform the query and display the results. All you need to do is write the SQL statements.
As noted in the tutorial, if your SQL statement has a syntax error, you'll need to rollback your connection before you can fix the error and try the query again. As a convenience, I've included the following cell, which performs the rollback process. Run it whenever you hit trouble.
In [101]:
conn.rollback()
In the cell below, fill in the string assigned to the variable statement
with a SQL query that finds all movies that belong to both the science fiction (scifi
) and horror genres. Return these movies in reverse order by their release date. (Hint: movies are located in the uitem
table. A movie's membership in a genre is indicated by a value of 1
in the uitem
table column corresponding to that genre.) Run the cell to execute the query.
Expected output:
Deep Rising (1998)
Alien: Resurrection (1997)
Hellraiser: Bloodline (1996)
Robert A. Heinlein's The Puppet Masters (1994)
Body Snatchers (1993)
Army of Darkness (1993)
Body Snatchers (1993)
Alien 3 (1992)
Heavy Metal (1981)
Alien (1979)
Night of the Living Dead (1968)
Blob, The (1958)
In [28]:
conn.rollback()
In [102]:
cursor = conn.cursor()
statement = "SELECT movie_title, release_date from uitem WHERE horror=1 AND scifi=1 ORDER BY release_date DESC;"
cursor.execute(statement)
for row in cursor:
print(row[0])
In the cell below, fill in the string assigned to the statement
variable with a SQL query that returns the number of movies that are either musicals or children's movies (columns musical
and childrens
respectively). Hint: use the count(*)
aggregate.
Expected output: 157
In [103]:
conn.rollback()
In [104]:
cursor = conn.cursor()
statement = "SELECT count(*) FROM uitem WHERE musical=1 OR childrens=1;"
cursor.execute(statement)
for row in cursor:
print(row[0])
Nicely done. Now, in the cell below, fill in the indicated string with a SQL statement that returns all occupations, along with their count, from the uuser
table that have more than fifty users listed for that occupation. (I.e., the occupation librarian
is listed for 51 users, so it should be included in these results. There are only 12 lawyers, so lawyer
should not be included in the result.)
Expected output:
administrator 79
programmer 66
librarian 51
student 196
other 105
engineer 67
educator 95
Hint: use GROUP BY
and HAVING
. (If you're stuck, try writing the query without the HAVING
first.)
In [115]:
conn.rollback()
In [113]:
cursor = conn.cursor()
statement = "SELECT occupation, count(occupation) FROM uuser GROUP BY occupation HAVING count(occupation) > 50;"
cursor.execute(statement)
for row in cursor:
print(row[0], row[1])
TA-COMMENT This looks great! If you pass in the following string as statement, you get the exact same order as Allison:
'SELECT occupation, count(*) FROM uuser GROUP BY occupation HAVING count(occupation) > 50;'
In the cell below, fill in the indicated string with a query that finds the titles of movies in the Documentary genre released before 1992 that received a rating of 5 from any user. Expected output:
Madonna: Truth or Dare (1991)
Koyaanisqatsi (1983)
Paris Is Burning (1990)
Thin Blue Line, The (1988)
Hints:
JOIN
the udata
and uitem
tables.DISTINCT()
to get a list of unique movie titles (no title should be listed more than once).uitem.release_date < '1992-01-01'
.
In [144]:
conn.rollback()
In [136]:
cursor = conn.cursor()
#statement = "SELECT movie_id from uitem limit 5;"
statement = "SELECT DISTINCT(uitem.movie_title), item_id, rating FROM udata JOIN uitem on item_id = movie_id WHERE documentary=1 AND uitem.release_date < '1992-01-01' AND udata.rating = 5;"
cursor.execute(statement)
for row in cursor:
print(row[0])
TA-COMMENT
Alternate SQL query: "SELECT DISTINCT(uitem.movie_title) FROM uitem JOIN udata ON uitem.movie_id = udata.item_id WHERE uitem.documentary = 1 AND uitem.release_date < '1992-01-01' AND rating = '5';"
You included several other items to the immediate right of SELECT
-- however, note that those items aren't actually being displayed in your output! That's an important hint that those parts of your query should not be included. So, in your query, you should not have included item_id and rating immediately to the right of SELECT
-- they are causing your output to be ordered differently.
This one's tough, so prepare yourself. Go get a cup of coffee. Stretch a little bit. Deep breath. There you go.
In the cell below, fill in the indicated string with a query that produces a list of the ten lowest rated movies in the Horror genre. For the purposes of this problem, take "lowest rated" to mean "has the lowest average rating." The query should display the titles of the movies, not their ID number. (So you'll have to use a JOIN
.)
Expected output:
Amityville 1992: It's About Time (1992) 1.00
Beyond Bedlam (1993) 1.00
Amityville: Dollhouse (1996) 1.00
Amityville: A New Generation (1993) 1.00
Amityville 3-D (1983) 1.17
Castle Freak (1995) 1.25
Amityville Curse, The (1990) 1.25
Children of the Corn: The Gathering (1996) 1.32
Machine, The (1994) 1.50
Body Parts (1991) 1.62
In [175]:
conn.rollback()
In [176]:
cursor = conn.cursor()
statement = "SELECT uitem.movie_title, avg(rating) FROM udata JOIN uitem on item_id = movie_id WHERE horror=1 GROUP BY uitem.movie_title ORDER BY avg(udata.rating) limit 10;"
cursor.execute(statement)
for row in cursor:
print(row[0], "%0.2f" % row[1])
BONUS: Extend the query above so that it only includes horror movies that have ten or more ratings. Fill in the query as indicated below.
Expected output:
Children of the Corn: The Gathering (1996) 1.32
Body Parts (1991) 1.62
Amityville II: The Possession (1982) 1.64
Jaws 3-D (1983) 1.94
Hellraiser: Bloodline (1996) 2.00
Tales from the Hood (1995) 2.04
Audrey Rose (1977) 2.17
Addiction, The (1995) 2.18
Halloween: The Curse of Michael Myers (1995) 2.20
Phantoms (1998) 2.23
In [185]:
conn.rollback()
In [186]:
cursor = conn.cursor()
statement = "SELECT uitem.movie_title, avg(rating) FROM udata JOIN uitem on item_id = movie_id WHERE horror=1 GROUP BY uitem.movie_title HAVING count(rating) > 10 ORDER BY avg(udata.rating) limit 10;"
cursor.execute(statement)
for row in cursor:
print(row[0], "%0.2f" % row[1])
Great work! You're done.