Lab 6: SQL


In [ ]:
!pip install ipython-sql
%load_ext sql
%sql sqlite:///./lab06.sqlite
    
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite:///lab05.sqlite")
connection = engine.connect()

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab06.ok')

Rapidgram

The date: March, 2017. All of the students at Berkeley are obsessed with the hot new social networking app, Rapidgram, where users can share text and image posts. You've been hired as Rapidgram's very first Data Scientist, in charge of analyzing their petabyte-scale user data, in order to sell it to credit card companies (I mean, they had to monetize somehow). But before you get into that, you need to learn more about their database schema.

First, run the next few cells to generate a snapshot of their data. It will be saved locally as the file lab05.sqlite.


In [ ]:
%%sql
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS follows;
CREATE TABLE users (
    USERID   INT            NOT NULL,                 
    NAME     VARCHAR (256)  NOT NULL,                 
    YEAR     FLOAT          NOT NULL,
    PRIMARY KEY (USERID)
);
CREATE TABLE follows (
    USERID   INT            NOT NULL, 
    FOLLOWID INT            NOT NULL,
    PRIMARY KEY (USERID, FOLLOWID)
);

In [ ]:
%%capture
count = 0
users = ["Ian", "Daniel", "Sarah", "Kelly", "Sam", "Alison", "Henry", "Joey", "Mark", "Joyce", "Natalie", "John"]
years = [1, 3, 4, 3, 4, 2, 5, 2, 1, 3, 4, 2]

for username, year in zip(users, years):
    count += 1
    %sql INSERT INTO users VALUES ($count, '$username', $year);

In [ ]:
%%capture
follows =  [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 
            0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 
            0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 
            1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 
            0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
            0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
            1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
            1, 1, 0, 1]

for i in range(12):
    for j in range(12):
        if i != j and follows[i + j*12]:
            %sql INSERT INTO follows VALUES ($i+1, $j+1);

Question 1: Joey's Followers

How many people follow Joey?


In [ ]:
q1 = """
...
"""
%sql $q1

In [ ]:
#SOLUTION
q1 = """
SELECT COUNT(*) FROM follows, users
    WHERE users.name="Joey"
    AND (users.userid=follows.followid)
"""
%sql $q1

In [ ]:
q1_answer = connection.execute(q1).fetchall()
_ = ok.grade('q1')
_ = ok.backup()

Question 2: I Ain't no Followback Girl

How many people does Joey follow?


In [ ]:
q2 = """
...
"""
%sql $q2

In [ ]:
#SOLUTION
q2 = """
SELECT COUNT(*) FROM follows, users
    WHERE users.name="Joey"
    AND (users.userid=follows.userid)
"""
%sql $q2

In [ ]:
q2_answer = connection.execute(q2).fetchall()
_ = ok.grade('q2')
_ = ok.backup()

Question 3: Know your Audience

What are the names of Joey's followers?


In [ ]:
q3 = """
...
"""
%sql $q3

In [ ]:
#SOLUTION
q3 = """
SELECT u1.name
    FROM follows, users as u1, users as u2
    WHERE follows.userid=u1.userid
    AND follows.followid=u2.userid
    AND u2.name="Joey"
"""
%sql $q3

In [ ]:
q3_answer = connection.execute(q3).fetchall()
_ = ok.grade('q3')
_ = ok.backup()

Question 4: Popularity Contest

How many followers does each user have? You'll need to use GROUP BY to solve this. List only the top 5 users by number of followers.


In [ ]:
q4 = """
...
"""
%sql $q4

In [ ]:
#SOLUTION
q4 = """
SELECT name, COUNT(*) as friends
    FROM follows, users
    WHERE follows.followid=users.userid
    GROUP BY name
    ORDER BY friends DESC
    LIMIT 5
"""
%sql $q4

In [ ]:
q4_answer = connection.execute(q4).fetchall()
_ = ok.grade('q4')
_ = ok.backup()

Question 5: Randomness

Rapidgram wants to get a random sample of their userbase. Specifically, they want to look at exactly one-third of the follow-relations in their data. A Rapidgram engineer suggests the following SQL query:


In [ ]:
q5a = """
SELECT u1.name as follower, u2.name as followee
    FROM follows, users as u1, users as u2
    WHERE follows.userid=u1.userid
    AND follows.followid=u2.userid
    AND RANDOM() < 0.33
"""

Do you think this query will work as intended? Why or why not? Try designing a better query below:


In [ ]:
q5b = """
...
"""
%sql $q5b

In [ ]:
#SOLUTION
q5b = """
SELECT u1.name as follower, u2.name as followee
    FROM follows, users as u1, users as u2
    WHERE follows.userid=u1.userid
    AND follows.followid=u2.userid
    ORDER BY RANDOM() LIMIT 72*1/3
"""
%sql $q5b

In [ ]:
q5_answers = [connection.execute(q5b).fetchall() for _ in range(100)]
_ = ok.grade('q5')
_ = ok.backup()

Question 6: More Randomness

Rapidgram leadership wants to give more priority to more experienced users, so they decide to weight a survey of users towards students who have spend a greater number of years at berkeley. They want to take a sample of 10 students, weighted such that a student's chance of being in the sample is proportional to their number of years spent at berkeley - for instance, a student with 6 years has three times the chance of a student with 2 years, who has twice the chance of a student with only one year.

To take this sample, they've provided you with a helpful temporary view. You can run the cell below to see its functionality.


In [ ]:
q6a = """
WITH RECURSIVE generate_series(value) AS (
  SELECT 0
  UNION ALL
  SELECT value+1 FROM generate_series
   WHERE value+1<=10
)
SELECT value
    FROM generate_series
"""
%sql $q6a

Using the generate_series view, get a sample of ten students, weighted in this manner.


In [ ]:
q6b = """
WITH RECURSIVE generate_series(value) AS (
  SELECT 0
  UNION ALL
  SELECT value+1 FROM generate_series
   WHERE value+1<=12
)
SELECT name
    FROM ...
    WHERE ...
    ORDER BY ...
    LIMIT 10
"""
%sql $q6b

In [ ]:
#SOLUTION
q6b = """
WITH RECURSIVE generate_series(value) AS (
  SELECT 0
  UNION ALL
  SELECT value+1 FROM generate_series
   WHERE value+1<=12
)
SELECT name
    FROM generate_series, users
    WHERE value < year
    ORDER BY RANDOM()
    LIMIT 10
"""
%sql $q6b

In [ ]:
q6_answers = [connection.execute(q6b).fetchall() for _ in range(100)]
_ = ok.grade('q6')
_ = ok.backup()

Question 7: Older and Wiser (challenge)

List every person who has been at Berkeley longer - that is, their year is greater - than their average follower.


In [ ]:
q7 = """
SELECT name FROM (
    SELECT ...
)
WHERE year > avg_follower_years
"""
%sql $q7

In [ ]:
#SOLUTION
q7 = """
SELECT name FROM 
    (SELECT u1.name, u1.year, AVG(u2.year) as avg_follower_years
        FROM follows, users as u1, users as u2
        WHERE follows.userid=u1.userid
        AND follows.followid=u2.userid
        GROUP BY u1.name)
    WHERE year > avg_follower_years
"""
%sql $q7

In [ ]:
q7_answer = connection.execute(q7).fetchall()
_ = ok.grade('q7')
_ = ok.backup()

In [ ]:
_ = ok.grade_all()
_ = ok.submit()

In [ ]: