Homework 2: Working with SQL (Data and Databases 2016)

This homework assignment takes the form of an IPython Notebook. There are a number of exercises below, with notebook cells that need to be completed in order to meet particular criteria. Your job is to fill in the cells as appropriate. You'll need to download this notebook file to your computer before you can complete the assignment. To do so, follow these steps:

  1. Make sure you're viewing this notebook in Github.
  2. Ctrl+click (or right click) on the "Raw" button in the Github interface, and select "Save Link As..." or your browser's equivalent. Save the file in a convenient location on your own computer.
  3. Rename the notebook file to include your own name somewhere in the filename (e.g., Homework_2_Allison_Parrish.ipynb).
  4. Open the notebook on your computer using your locally installed version of IPython Notebook.
  5. When you've completed the notebook to your satisfaction, e-mail the completed file to the address of the teaching assistant (as discussed in class).

Setting the scene

These problem sets address SQL, with a focus on joins and aggregates.

I've prepared a SQL version of the MovieLens data for you to use in this homework. Download this .psql file here. You'll be importing this data into your own local copy of PostgreSQL.

To import the data, follow these steps:

  • Launch psql.
  • At the prompt, type CREATE DATABASE homework2;
  • Connect to the database you just created by typing \c homework2
  • Import the .psql file you downloaded earlier by typing \i followed by the path to the .psql file.

After you run the \i command, you should see the following output:

CREATE TABLE
CREATE TABLE
CREATE TABLE
COPY 100000
COPY 1682
COPY 943

The table schemas for the data look like this:

      Table "public.udata"
  Column   |  Type   | Modifiers 
-----------+---------+-----------
 user_id   | integer | 
 item_id   | integer | 
 rating    | integer | 
 timestamp | integer |
              Table "public.uuser"
   Column   |         Type          | Modifiers 
------------+-----------------------+-----------
 user_id    | integer               | 
 age        | integer               | 
 gender     | character varying(1)  | 
 occupation | character varying(80) | 
 zip_code   | character varying(10) |
                  Table "public.uitem"
       Column       |          Type          | Modifiers 
--------------------+------------------------+-----------
 movie_id           | integer                | not null
 movie_title        | character varying(81)  | not null
 release_date       | date                   | 
 video_release_date | character varying(32)  | 
 imdb_url           | character varying(134) | 
 unknown            | integer                | not null
 action             | integer                | not null
 adventure          | integer                | not null
 animation          | integer                | not null
 childrens          | integer                | not null
 comedy             | integer                | not null
 crime              | integer                | not null
 documentary        | integer                | not null
 drama              | integer                | not null
 fantasy            | integer                | not null
 film_noir          | integer                | not null
 horror             | integer                | not null
 musical            | integer                | not null
 mystery            | integer                | not null
 romance            | integer                | not null
 scifi              | integer                | not null
 thriller           | integer                | not null
 war                | integer                | not null
 western            | integer                | not null

Run the cell below to create a connection object. This should work whether you have pg8000 installed or psycopg2.


In [2]:
import pg8000
conn = pg8000.connect(database="homework2")

If you get an error stating that database "homework2" does not exist, make sure that you followed the instructions above exactly. If necessary, drop the database you created (with, e.g., DROP DATABASE your_database_name) and start again.

In all of the cells below, I've provided the necessary Python scaffolding to perform the query and display the results. All you need to do is write the SQL statements.

As noted in the tutorial, if your SQL statement has a syntax error, you'll need to rollback your connection before you can fix the error and try the query again. As a convenience, I've included the following cell, which performs the rollback process. Run it whenever you hit trouble.


In [3]:
conn.rollback()

Problem set 1: WHERE and ORDER BY

In the cell below, fill in the string assigned to the variable statement with a SQL query that finds all movies that belong to both the science fiction (scifi) and horror genres. Return these movies in reverse order by their release date. (Hint: movies are located in the uitem table. A movie's membership in a genre is indicated by a value of 1 in the uitem table column corresponding to that genre.) Run the cell to execute the query.

Expected output:

Deep Rising (1998)
Alien: Resurrection (1997)
Hellraiser: Bloodline (1996)
Robert A. Heinlein's The Puppet Masters (1994)
Body Snatchers (1993)
Army of Darkness (1993)
Body Snatchers (1993)
Alien 3 (1992)
Heavy Metal (1981)
Alien (1979)
Night of the Living Dead (1968)
Blob, The (1958)

In [4]:
cursor = conn.cursor()
statement = "SELECT movie_title FROM uitem WHERE scifi = 1 AND horror = 1 ORDER BY release_date DESC"
cursor.execute(statement)
for row in cursor:
    print(row[0])


Deep Rising (1998)
Alien: Resurrection (1997)
Hellraiser: Bloodline (1996)
Robert A. Heinlein's The Puppet Masters (1994)
Body Snatchers (1993)
Army of Darkness (1993)
Body Snatchers (1993)
Alien 3 (1992)
Heavy Metal (1981)
Alien (1979)
Night of the Living Dead (1968)
Blob, The (1958)

Problem set 2: Aggregation, GROUP BY and HAVING

In the cell below, fill in the string assigned to the statement variable with a SQL query that returns the number of movies that are either musicals or children's movies (columns musical and childrens respectively). Hint: use the count(*) aggregate.

Expected output: 157


In [10]:
cursor = conn.cursor()
statement = "SELECT COUNT(*) FROM uitem WHERE musical = 1 OR childrens = 1"
cursor.execute(statement)
for row in cursor:
    print(row[0])


157

Nicely done. Now, in the cell below, fill in the indicated string with a SQL statement that returns all occupations, along with their count, from the uuser table that have more than fifty users listed for that occupation. (I.e., the occupation librarian is listed for 51 users, so it should be included in these results. There are only 12 lawyers, so lawyer should not be included in the result.)

Expected output:

administrator 79
programmer 66
librarian 51
student 196
other 105
engineer 67
educator 95

Hint: use GROUP BY and HAVING. (If you're stuck, try writing the query without the HAVING first.)


In [5]:
cursor = conn.cursor()
statement = "SELECT DISTINCT(occupation), COUNT(*) FROM uuser GROUP BY occupation HAVING COUNT(*) > 50"
cursor.execute(statement)
for row in cursor:
    print(row[0], row[1])


administrator 79
student 196
programmer 66
librarian 51
other 105
educator 95
engineer 67

Problem set 3: Joining tables

In the cell below, fill in the indicated string with a query that finds the titles of movies in the Documentary genre released before 1992 that received a rating of 5 from any user. Expected output:

Madonna: Truth or Dare (1991)
Koyaanisqatsi (1983)
Paris Is Burning (1990)
Thin Blue Line, The (1988)

Hints:

  • JOIN the udata and uitem tables.
  • Use DISTINCT() to get a list of unique movie titles (no title should be listed more than once).
  • The SQL expression to include in order to find movies released before 1992 is uitem.release_date < '1992-01-01'.

In [6]:
cursor = conn.cursor()
statement = "SELECT DISTINCT(movie_title) FROM udata JOIN uitem ON uitem.movie_id = udata.item_id WHERE EXTRACT(YEAR FROM release_date) < 1992 AND rating = 5 GROUP BY movie_title"
# if "any" has to be taken in the sense of "every": 
# statement = "SELECT movie_title FROM uitem JOIN udata ON uitem.movie_id = udata.item_id  WHERE EXTRACT(YEAR FROM release_date) < 1992 GROUP BY movie_title HAVING MIN(rating) = 5"
cursor.execute(statement)
for row in cursor:
    print(row[0])


Cape Fear (1962)
Butch Cassidy and the Sundance Kid (1969)
My Own Private Idaho (1991)
Bride of Frankenstein (1935)
Belle de jour (1967)
Streetcar Named Desire, A (1951)
Ninotchka (1939)
Victor/Victoria (1982)
Tie Me Up! Tie Me Down! (1990)
Until the End of the World (Bis ans Ende der Welt) (1991)
Old Yeller (1957)
It's a Wonderful Life (1946)
Grease (1978)
His Girl Friday (1940)
Bananas (1971)
Jaws 2 (1978)
Return of Martin Guerre, The (Retour de Martin Guerre, Le) (1982)
Miller's Crossing (1990)
Annie Hall (1977)
Murder, My Sweet (1944)
Snow White and the Seven Dwarfs (1937)
20,000 Leagues Under the Sea (1954)
Graduate, The (1967)
Princess Bride, The (1987)
Local Hero (1983)
When Harry Met Sally... (1989)
Godfather, The (1972)
Star Trek: The Motion Picture (1979)
Tin Men (1987)
Alien (1979)
Philadelphia Story, The (1940)
Jaws (1975)
Field of Dreams (1989)
Funny Face (1957)
Ben-Hur (1959)
Shall We Dance? (1937)
Cat on a Hot Tin Roof (1958)
Spellbound (1945)
My Fair Lady (1964)
Escape from New York (1981)
Grease 2 (1982)
Dead Poets Society (1989)
Day the Earth Stood Still, The (1951)
Clockwork Orange, A (1971)
Charade (1963)
Sword in the Stone, The (1963)
Winnie the Pooh and the Blustery Day (1968)
Vertigo (1958)
Pretty Woman (1990)
Dial M for Murder (1954)
My Favorite Year (1982)
Paris, Texas (1984)
Dances with Wolves (1990)
Love in the Afternoon (1957)
Gay Divorcee, The (1934)
Tin Drum, The (Blechtrommel, Die) (1979)
Heavy Metal (1981)
Mediterraneo (1991)
High Noon (1952)
Cinderella (1950)
Lawrence of Arabia (1962)
Weekend at Bernie's (1989)
Star Trek: The Wrath of Khan (1982)
Unbearable Lightness of Being, The (1988)
Laura (1944)
Days of Thunder (1990)
Adventures of Robin Hood, The (1938)
Apartment, The (1960)
Doors, The (1991)
Blue Angel, The (Blaue Engel, Der) (1930)
Giant (1956)
Raise the Red Lantern (1991)
Big Sleep, The (1946)
8 1/2 (1963)
12 Angry Men (1957)
Gaslight (1944)
Nightmare on Elm Street, A (1984)
American Werewolf in London, An (1981)
Being There (1979)
Brazil (1985)
Enchanted April (1991)
Cyrano de Bergerac (1990)
Heathers (1989)
Carrie (1976)
Sleeper (1973)
Star Wars (1977)
Dumbo (1941)
Once Upon a Time in America (1984)
Young Frankenstein (1974)
Manhattan (1979)
Killing Fields, The (1984)
Shining, The (1980)
39 Steps, The (1935)
Double vie de Véronique, La (Double Life of Veronique, The) (1991)
To Kill a Mockingbird (1962)
Night of the Living Dead (1968)
Rebel Without a Cause (1955)
Davy Crockett, King of the Wild Frontier (1955)
So Dear to My Heart (1949)
Bedknobs and Broomsticks (1971)
Sound of Music, The (1965)
Jean de Florette (1986)
Private Benjamin (1980)
Raiders of the Lost Ark (1981)
Arsenic and Old Lace (1944)
Duck Soup (1933)
Citizen Kane (1941)
My Man Godfrey (1936)
Bonnie and Clyde (1967)
Omen, The (1976)
Father of the Bride (1950)
Fox and the Hound, The (1981)
Cool Hand Luke (1967)
Little Princess, The (1939)
Bringing Up Baby (1938)
Rosencrantz and Guildenstern Are Dead (1990)
Innocents, The (1961)
Chinatown (1974)
E.T. the Extra-Terrestrial (1982)
Walkabout (1971)
GoodFellas (1990)
Bridge on the River Kwai, The (1957)
Monty Python's Life of Brian (1979)
Nosferatu (Nosferatu, eine Symphonie des Grauens) (1922)
Faster Pussycat! Kill! Kill! (1965)
Band Wagon, The (1953)
Old Man and the Sea, The (1958)
Silence of the Lambs, The (1991)
Great Escape, The (1963)
Sex, Lies, and Videotape (1989)
Terminator 2: Judgment Day (1991)
Paths of Glory (1957)
Full Metal Jacket (1987)
Evil Dead II (1987)
Escape to Witch Mountain (1975)
Magnificent Seven, The (1954)
Monty Python and the Holy Grail (1974)
Two or Three Things I Know About Her (1966)
Platoon (1986)
Young Guns II (1990)
Drop Dead Fred (1991)
Wizard of Oz, The (1939)
Ran (1985)
Grifters, The (1990)
For Whom the Bell Tolls (1943)
Sting, The (1973)
Real Genius (1985)
Trust (1990)
North by Northwest (1959)
Koyaanisqatsi (1983)
Singin' in the Rain (1952)
Pollyanna (1960)
Ruling Class, The (1972)
Treasure of the Sierra Madre, The (1948)
This Is Spinal Tap (1984)
Amadeus (1984)
Manon of the Spring (Manon des sources) (1986)
One Flew Over the Cuckoo's Nest (1975)
Alice in Wonderland (1951)
Cat People (1982)
Nikita (La Femme Nikita) (1990)
Mary Poppins (1964)
Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991)
Third Man, The (1949)
Bad Taste (1987)
Notorious (1946)
On Golden Pond (1981)
Fantasia (1940)
Thin Man, The (1934)
Thin Blue Line, The (1988)
Madonna: Truth or Dare (1991)
Women, The (1939)
Forbidden Planet (1956)
Blues Brothers, The (1980)
Glory (1989)
Seventh Seal, The (Sjunde inseglet, Det) (1957)
Empire Strikes Back, The (1980)
My Left Foot (1989)
Die xue shuang xiong (Killer, The) (1989)
My Life as a Dog (Mitt liv som hund) (1985)
Body Snatcher, The (1945)
Amityville Horror, The (1979)
Gandhi (1982)
Young Guns (1988)
Batman (1989)
Crossfire (1947)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)
Good, The Bad and The Ugly, The (1966)
Manchurian Candidate, The (1962)
Lost Horizon (1937)
Gone with the Wind (1939)
Return of the Pink Panther, The (1974)
Ghost and Mrs. Muir, The (1947)
Cinema Paradiso (1988)
Blade Runner (1982)
Raging Bull (1980)
Dirty Dancing (1987)
Beauty and the Beast (1991)
Hunt for Red October, The (1990)
Rear Window (1954)
Man Who Would Be King, The (1975)
Blob, The (1958)
Star Trek VI: The Undiscovered Country (1991)
Aristocats, The (1970)
Cape Fear (1991)
All About Eve (1950)
Right Stuff, The (1983)
Room with a View, A (1986)
To Catch a Thief (1955)
Big Blue, The (Grand bleu, Le) (1988)
Transformers: The Movie, The (1986)
Affair to Remember, An (1957)
African Queen, The (1951)
Robin Hood: Prince of Thieves (1991)
Three Caballeros, The (1945)
To Be or Not to Be (1942)
Raising Arizona (1987)
Star Trek IV: The Voyage Home (1986)
Cook the Thief His Wife & Her Lover, The (1989)
Withnail and I (1987)
Birds, The (1963)
Highlander (1986)
Die Hard 2 (1990)
Wild Bunch, The (1969)
2001: A Space Odyssey (1968)
Casablanca (1942)
Somewhere in Time (1980)
Willy Wonka and the Chocolate Factory (1971)
Quiet Man, The (1952)
Sabrina (1954)
Fog, The (1980)
Howling, The (1981)
Apocalypse Now (1979)
Golden Earrings (1947)
Better Off Dead... (1985)
Some Like It Hot (1959)
Stripes (1981)
Around the World in 80 Days (1956)
Fish Called Wanda, A (1988)
East of Eden (1955)
Patton (1970)
Mr. Smith Goes to Washington (1939)
Sunset Blvd. (1950)
Godfather: Part II, The (1974)
Gigi (1958)
Die Hard (1988)
Paris Is Burning (1990)
Fried Green Tomatoes (1991)
Henry V (1989)
Top Hat (1935)
Little Lord Fauntleroy (1936)
Psycho (1960)
M (1931)
Pink Floyd - The Wall (1982)
Swiss Family Robinson (1960)
Delicatessen (1991)
Top Gun (1986)
Parent Trap, The (1961)
Rebecca (1940)
Once Upon a Time in the West (1969)
Akira (1988)
Alphaville (1965)
Believers, The (1987)
Sophie's Choice (1982)
Star Trek III: The Search for Spock (1984)
Mark of Zorro, The (1940)
M*A*S*H (1970)
Foreign Correspondent (1940)
Touch of Evil (1958)
Ghost (1990)
Purple Noon (1960)
Ballad of Narayama, The (Narayama Bushiko) (1958)
Wings of Desire (1987)
Deer Hunter, The (1978)
Pinocchio (1940)
Christmas Carol, A (1938)
Love Bug, The (1969)
Roman Holiday (1953)
Terminator, The (1984)
Great Race, The (1965)
Up in Smoke (1978)
They Made Me a Criminal (1939)
Night on Earth (1991)
Great Dictator, The (1940)
Home Alone (1990)
Meet Me in St. Louis (1944)
Candidate, The (1972)
Conan the Barbarian (1981)
Breakfast at Tiffany's (1961)
Star Trek V: The Final Frontier (1989)
Harold and Maude (1971)
Diva (1981)
Meet John Doe (1941)
Indiana Jones and the Last Crusade (1989)
Bread and Chocolate (Pane e cioccolata) (1973)
American in Paris, An (1951)
Abyss, The (1989)
Some Kind of Wonderful (1987)
Pump Up the Volume (1990)
It Happened One Night (1934)
Stalker (1979)
Picnic (1955)
Aliens (1986)
Shaggy Dog, The (1959)
Down by Law (1986)
Stand by Me (1986)
Inspector General, The (1949)
Back to the Future (1985)
Oliver & Company (1988)
Audrey Rose (1977)
Switchblade Sisters (1975)
Maltese Falcon, The (1941)

Problem set 4: Joins and aggregations... together at last

This one's tough, so prepare yourself. Go get a cup of coffee. Stretch a little bit. Deep breath. There you go.

In the cell below, fill in the indicated string with a query that produces a list of the ten lowest rated movies in the Horror genre. For the purposes of this problem, take "lowest rated" to mean "has the lowest average rating." The query should display the titles of the movies, not their ID number. (So you'll have to use a JOIN.)

Expected output:

Amityville 1992: It's About Time (1992) 1.00
Beyond Bedlam (1993) 1.00
Amityville: Dollhouse (1996) 1.00
Amityville: A New Generation (1993) 1.00
Amityville 3-D (1983) 1.17
Castle Freak (1995) 1.25
Amityville Curse, The (1990) 1.25
Children of the Corn: The Gathering (1996) 1.32
Machine, The (1994) 1.50
Body Parts (1991) 1.62

In [54]:
conn.rollback()
cursor = conn.cursor()
statement = "SELECT movie_title), AVG(rating) FROM udata JOIN uitem ON uitem.movie_id = udata.item_id WHERE horror = 1 GROUP BY movie_title ORDER BY AVG(rating) LIMIT 10"
cursor.execute(statement)
for row in cursor:
    print(row[0], "%0.2f" % row[1])


Amityville 1992: It's About Time (1992) 1.00
Beyond Bedlam (1993) 1.00
Amityville: Dollhouse (1996) 1.00
Amityville: A New Generation (1993) 1.00
Amityville 3-D (1983) 1.17
Castle Freak (1995) 1.25
Amityville Curse, The (1990) 1.25
Children of the Corn: The Gathering (1996) 1.32
Machine, The (1994) 1.50
Body Parts (1991) 1.62

BONUS: Extend the query above so that it only includes horror movies that have ten or more ratings. Fill in the query as indicated below.

Expected output:

Children of the Corn: The Gathering (1996) 1.32
Body Parts (1991) 1.62
Amityville II: The Possession (1982) 1.64
Jaws 3-D (1983) 1.94
Hellraiser: Bloodline (1996) 2.00
Tales from the Hood (1995) 2.04
Audrey Rose (1977) 2.17
Addiction, The (1995) 2.18
Halloween: The Curse of Michael Myers (1995) 2.20
Phantoms (1998) 2.23

In [51]:
cursor = conn.cursor()
statement = "SELECT movie_title, AVG(rating) FROM udata JOIN uitem ON uitem.movie_id = udata.item_id WHERE horror = 1 GROUP BY movie_title HAVING COUNT(rating) > 10 ORDER BY AVG(rating) LIMIT 10;"
cursor.execute(statement)
for row in cursor:
    print(row[0], "%0.2f" % row[1])


Children of the Corn: The Gathering (1996) 1.32
Body Parts (1991) 1.62
Amityville II: The Possession (1982) 1.64
Jaws 3-D (1983) 1.94
Hellraiser: Bloodline (1996) 2.00
Tales from the Hood (1995) 2.04
Audrey Rose (1977) 2.17
Addiction, The (1995) 2.18
Halloween: The Curse of Michael Myers (1995) 2.20
Phantoms (1998) 2.23

Great work! You're done.