Grade: 6 / 6 -- but search "TA-COMMENT" to see a few notes on some of the problems.

Homework 2: Working with SQL (Data and Databases 2016)

This homework assignment takes the form of an IPython Notebook. There are a number of exercises below, with notebook cells that need to be completed in order to meet particular criteria. Your job is to fill in the cells as appropriate. You'll need to download this notebook file to your computer before you can complete the assignment. To do so, follow these steps:

  1. Make sure you're viewing this notebook in Github.
  2. Ctrl+click (or right click) on the "Raw" button in the Github interface, and select "Save Link As..." or your browser's equivalent. Save the file in a convenient location on your own computer.
  3. Rename the notebook file to include your own name somewhere in the filename (e.g., Homework_2_Allison_Parrish.ipynb).
  4. Open the notebook on your computer using your locally installed version of IPython Notebook.
  5. When you've completed the notebook to your satisfaction, e-mail the completed file to the address of the teaching assistant (as discussed in class).

Setting the scene

These problem sets address SQL, with a focus on joins and aggregates.

I've prepared a SQL version of the MovieLens data for you to use in this homework. Download this .psql file here. You'll be importing this data into your own local copy of PostgreSQL.

To import the data, follow these steps:

  • Launch psql.
  • At the prompt, type CREATE DATABASE homework2;
  • Connect to the database you just created by typing \c homework2
  • Import the .psql file you downloaded earlier by typing \i followed by the path to the .psql file.

After you run the \i command, you should see the following output:

CREATE TABLE
CREATE TABLE
CREATE TABLE
COPY 100000
COPY 1682
COPY 943

The table schemas for the data look like this:

      Table "public.udata"
  Column   |  Type   | Modifiers 
-----------+---------+-----------
 user_id   | integer | 
 item_id   | integer | 
 rating    | integer | 
 timestamp | integer |
              Table "public.uuser"
   Column   |         Type          | Modifiers 
------------+-----------------------+-----------
 user_id    | integer               | 
 age        | integer               | 
 gender     | character varying(1)  | 
 occupation | character varying(80) | 
 zip_code   | character varying(10) |
                  Table "public.uitem"
       Column       |          Type          | Modifiers 
--------------------+------------------------+-----------
 movie_id           | integer                | not null
 movie_title        | character varying(81)  | not null
 release_date       | date                   | 
 video_release_date | character varying(32)  | 
 imdb_url           | character varying(134) | 
 unknown            | integer                | not null
 action             | integer                | not null
 adventure          | integer                | not null
 animation          | integer                | not null
 childrens          | integer                | not null
 comedy             | integer                | not null
 crime              | integer                | not null
 documentary        | integer                | not null
 drama              | integer                | not null
 fantasy            | integer                | not null
 film_noir          | integer                | not null
 horror             | integer                | not null
 musical            | integer                | not null
 mystery            | integer                | not null
 romance            | integer                | not null
 scifi              | integer                | not null
 thriller           | integer                | not null
 war                | integer                | not null
 western            | integer                | not null

Run the cell below to create a connection object. This should work whether you have pg8000 installed or psycopg2.


In [100]:
import pg8000
conn = pg8000.connect(user='postgres', password='password', database="homework2_radhika")

If you get an error stating that database "homework2" does not exist, make sure that you followed the instructions above exactly. If necessary, drop the database you created (with, e.g., DROP DATABASE your_database_name) and start again.

In all of the cells below, I've provided the necessary Python scaffolding to perform the query and display the results. All you need to do is write the SQL statements.

As noted in the tutorial, if your SQL statement has a syntax error, you'll need to rollback your connection before you can fix the error and try the query again. As a convenience, I've included the following cell, which performs the rollback process. Run it whenever you hit trouble.


In [101]:
conn.rollback()

Problem set 1: WHERE and ORDER BY

In the cell below, fill in the string assigned to the variable statement with a SQL query that finds all movies that belong to both the science fiction (scifi) and horror genres. Return these movies in reverse order by their release date. (Hint: movies are located in the uitem table. A movie's membership in a genre is indicated by a value of 1 in the uitem table column corresponding to that genre.) Run the cell to execute the query.

Expected output:

Deep Rising (1998)
Alien: Resurrection (1997)
Hellraiser: Bloodline (1996)
Robert A. Heinlein's The Puppet Masters (1994)
Body Snatchers (1993)
Army of Darkness (1993)
Body Snatchers (1993)
Alien 3 (1992)
Heavy Metal (1981)
Alien (1979)
Night of the Living Dead (1968)
Blob, The (1958)

In [28]:
conn.rollback()

In [102]:
cursor = conn.cursor()
statement = "SELECT movie_title, release_date  from uitem WHERE horror=1 AND scifi=1 ORDER BY release_date DESC;"
cursor.execute(statement)
for row in cursor:
    print(row[0])


Deep Rising (1998)
Alien: Resurrection (1997)
Hellraiser: Bloodline (1996)
Robert A. Heinlein's The Puppet Masters (1994)
Body Snatchers (1993)
Army of Darkness (1993)
Body Snatchers (1993)
Alien 3 (1992)
Heavy Metal (1981)
Alien (1979)
Night of the Living Dead (1968)
Blob, The (1958)

Problem set 2: Aggregation, GROUP BY and HAVING

In the cell below, fill in the string assigned to the statement variable with a SQL query that returns the number of movies that are either musicals or children's movies (columns musical and childrens respectively). Hint: use the count(*) aggregate.

Expected output: 157


In [103]:
conn.rollback()

In [104]:
cursor = conn.cursor()
statement = "SELECT count(*) FROM uitem WHERE musical=1 OR childrens=1;"
cursor.execute(statement)
for row in cursor:
    print(row[0])


157

Nicely done. Now, in the cell below, fill in the indicated string with a SQL statement that returns all occupations, along with their count, from the uuser table that have more than fifty users listed for that occupation. (I.e., the occupation librarian is listed for 51 users, so it should be included in these results. There are only 12 lawyers, so lawyer should not be included in the result.)

Expected output:

administrator 79
programmer 66
librarian 51
student 196
other 105
engineer 67
educator 95

Hint: use GROUP BY and HAVING. (If you're stuck, try writing the query without the HAVING first.)


In [115]:
conn.rollback()

In [113]:
cursor = conn.cursor()
statement = "SELECT occupation, count(occupation) FROM uuser GROUP BY occupation HAVING count(occupation) > 50;"
cursor.execute(statement)
for row in cursor:
    print(row[0], row[1])


programmer 66
other 105
librarian 51
engineer 67
administrator 79
student 196
educator 95

TA-COMMENT This looks great! If you pass in the following string as statement, you get the exact same order as Allison:

'SELECT occupation, count(*) FROM uuser GROUP BY occupation HAVING count(occupation) > 50;'

Problem set 3: Joining tables

In the cell below, fill in the indicated string with a query that finds the titles of movies in the Documentary genre released before 1992 that received a rating of 5 from any user. Expected output:

Madonna: Truth or Dare (1991)
Koyaanisqatsi (1983)
Paris Is Burning (1990)
Thin Blue Line, The (1988)

Hints:

  • JOIN the udata and uitem tables.
  • Use DISTINCT() to get a list of unique movie titles (no title should be listed more than once).
  • The SQL expression to include in order to find movies released before 1992 is uitem.release_date < '1992-01-01'.

In [144]:
conn.rollback()

In [136]:
cursor = conn.cursor()
#statement = "SELECT movie_id from uitem limit 5;"
statement = "SELECT DISTINCT(uitem.movie_title), item_id, rating FROM udata JOIN uitem on item_id = movie_id WHERE documentary=1 AND uitem.release_date < '1992-01-01' AND udata.rating = 5;"
cursor.execute(statement)
for row in cursor:
    print(row[0])


Koyaanisqatsi (1983)
Madonna: Truth or Dare (1991)
Paris Is Burning (1990)
Thin Blue Line, The (1988)

TA-COMMENT Alternate SQL query: "SELECT DISTINCT(uitem.movie_title) FROM uitem JOIN udata ON uitem.movie_id = udata.item_id WHERE uitem.documentary = 1 AND uitem.release_date < '1992-01-01' AND rating = '5';"

You included several other items to the immediate right of SELECT -- however, note that those items aren't actually being displayed in your output! That's an important hint that those parts of your query should not be included. So, in your query, you should not have included item_id and rating immediately to the right of SELECT -- they are causing your output to be ordered differently.

Problem set 4: Joins and aggregations... together at last

This one's tough, so prepare yourself. Go get a cup of coffee. Stretch a little bit. Deep breath. There you go.

In the cell below, fill in the indicated string with a query that produces a list of the ten lowest rated movies in the Horror genre. For the purposes of this problem, take "lowest rated" to mean "has the lowest average rating." The query should display the titles of the movies, not their ID number. (So you'll have to use a JOIN.)

Expected output:

Amityville 1992: It's About Time (1992) 1.00
Beyond Bedlam (1993) 1.00
Amityville: Dollhouse (1996) 1.00
Amityville: A New Generation (1993) 1.00
Amityville 3-D (1983) 1.17
Castle Freak (1995) 1.25
Amityville Curse, The (1990) 1.25
Children of the Corn: The Gathering (1996) 1.32
Machine, The (1994) 1.50
Body Parts (1991) 1.62

In [175]:
conn.rollback()

In [176]:
cursor = conn.cursor()
statement = "SELECT uitem.movie_title, avg(rating) FROM udata JOIN uitem on item_id = movie_id WHERE horror=1 GROUP BY uitem.movie_title ORDER BY avg(udata.rating) limit 10;"
cursor.execute(statement)
for row in cursor:
    print(row[0], "%0.2f" % row[1])


Beyond Bedlam (1993) 1.00
Amityville: Dollhouse (1996) 1.00
Amityville: A New Generation (1993) 1.00
Amityville 1992: It's About Time (1992) 1.00
Amityville 3-D (1983) 1.17
Amityville Curse, The (1990) 1.25
Castle Freak (1995) 1.25
Children of the Corn: The Gathering (1996) 1.32
Machine, The (1994) 1.50
Body Parts (1991) 1.62

BONUS: Extend the query above so that it only includes horror movies that have ten or more ratings. Fill in the query as indicated below.

Expected output:

Children of the Corn: The Gathering (1996) 1.32
Body Parts (1991) 1.62
Amityville II: The Possession (1982) 1.64
Jaws 3-D (1983) 1.94
Hellraiser: Bloodline (1996) 2.00
Tales from the Hood (1995) 2.04
Audrey Rose (1977) 2.17
Addiction, The (1995) 2.18
Halloween: The Curse of Michael Myers (1995) 2.20
Phantoms (1998) 2.23

In [185]:
conn.rollback()

In [186]:
cursor = conn.cursor()
statement = "SELECT uitem.movie_title, avg(rating) FROM udata JOIN uitem on item_id = movie_id WHERE horror=1 GROUP BY uitem.movie_title HAVING count(rating) > 10 ORDER BY avg(udata.rating) limit 10;"
cursor.execute(statement)
for row in cursor:
    print(row[0], "%0.2f" % row[1])


Children of the Corn: The Gathering (1996) 1.32
Body Parts (1991) 1.62
Amityville II: The Possession (1982) 1.64
Jaws 3-D (1983) 1.94
Hellraiser: Bloodline (1996) 2.00
Tales from the Hood (1995) 2.04
Audrey Rose (1977) 2.17
Addiction, The (1995) 2.18
Halloween: The Curse of Michael Myers (1995) 2.20
Phantoms (1998) 2.23

Great work! You're done.