Lab 6: SQL


In [2]:
!pip install ipython-sql
%load_ext sql
%sql sqlite:///./lab06.sqlite
    
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite:///lab06.sqlite")
connection = engine.connect()

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab06.ok')


The sql extension is already loaded. To reload it, use:
  %reload_ext sql
=====================================================================
Assignment: Lab 6
OK, version v1.10.1
=====================================================================

Rapidgram

The date: March, 2017. All of the students at Berkeley are obsessed with the hot new social networking app, Rapidgram, where users can share text and image posts. You've been hired as Rapidgram's very first Data Scientist, in charge of analyzing their petabyte-scale user data, in order to sell it to credit card companies (I mean, they had to monetize somehow). But before you get into that, you need to learn more about their database schema.

First, run the next few cells to generate a snapshot of their data. It will be saved locally as the file lab05.sqlite.


In [3]:
%%sql
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS follows;
CREATE TABLE users (
    USERID   INT            NOT NULL,                 
    NAME     VARCHAR (256)  NOT NULL,                 
    YEAR     FLOAT          NOT NULL,
    PRIMARY KEY (USERID)
);
CREATE TABLE follows (
    USERID   INT            NOT NULL, 
    FOLLOWID INT            NOT NULL,
    PRIMARY KEY (USERID, FOLLOWID)
);


Done.
Done.
Done.
Done.
Out[3]:
[]

In [4]:
%%capture
count = 0
users = ["Ian", "Daniel", "Sarah", "Kelly", "Sam", "Alison", "Henry", "Joey", "Mark", "Joyce", "Natalie", "John"]
years = [1, 3, 4, 3, 4, 2, 5, 2, 1, 3, 4, 2]

for username, year in zip(users, years):
    count += 1
    %sql INSERT INTO users VALUES ($count, '$username', $year);

In [5]:
%%capture
follows =  [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 
            0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 
            0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 
            1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 
            0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
            0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
            1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
            1, 1, 0, 1]

for i in range(12):
    for j in range(12):
        if i != j and follows[i + j*12]:
            %sql INSERT INTO follows VALUES ($i+1, $j+1);

Question 1: Joey's Followers

How many people follow Joey?


In [11]:
q1 = """
...
"""
%sql $q1


Done.
Out[11]:
COUNT(*)
7

In [12]:
q1_answer = connection.execute(q1).fetchall()
_ = ok.grade('q1')
_ = ok.backup()


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'lab06_master.ipynb'.
Backup... 100% complete
Backup successful for user: jsimonian@berkeley.edu
URL: https://okpy.org/cal/data100/sp17/lab06/backups/0RgZvL
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit

Question 2: I Ain't no Followback Girl

How many people does Joey follow?


In [ ]:
q2 = """
...
"""
%sql $q2

In [ ]:
q2_answer = connection.execute(q2).fetchall()
_ = ok.grade('q2')
_ = ok.backup()
q2_answer

Question 3: Know your Audience

What are the names of Joey's followers?


In [ ]:
q3 = """
...
"""
%sql $q3

In [ ]:
q3_answer = connection.execute(q3).fetchall()
_ = ok.grade('q3')
_ = ok.backup()

Question 4: Popularity Contest

How many followers does each user have? You'll need to use GROUP BY to solve this. List only the top 5 users by number of followers.


In [ ]:
q4 = """
...
"""
%sql $q4

In [ ]:
q4_answer = connection.execute(q4).fetchall()
_ = ok.grade('q4')
_ = ok.backup()

Question 5: Randomness

Rapidgram wants to get a random sample of their userbase. Specifically, they want to look at exactly one-third of the follow-relations in their data. A Rapidgram engineer suggests the following SQL query:


In [ ]:
q5a = """
SELECT u1.name as follower, u2.name as followee
    FROM follows, users as u1, users as u2
    WHERE follows.userid=u1.userid
    AND follows.followid=u2.userid
    AND RANDOM() < 0.33
"""

Do you think this query will work as intended? Why or why not? Try designing a better query below:


In [ ]:
q5b = """
...
"""
%sql $q5b

In [ ]:
q5_answers = [connection.execute(q5b).fetchall() for _ in range(100)]
_ = ok.grade('q5')
_ = ok.backup()

Question 6: More Randomness

Rapidgram leadership wants to give more priority to more experienced users, so they decide to weight a survey of users towards students who have spend a greater number of years at berkeley. They want to take a sample of 10 students, weighted such that a student's chance of being in the sample is proportional to their number of years spent at berkeley - for instance, a student with 6 years has three times the chance of a student with 2 years, who has twice the chance of a student with only one year.

To take this sample, they've provided you with a helpful temporary view. You can run the cell below to see its functionality.


In [ ]:
q6a = """
WITH RECURSIVE generate_series(value) AS (
  SELECT 0
  UNION ALL
  SELECT value+1 FROM generate_series
   WHERE value+1<=10
)
SELECT value
    FROM generate_series
"""
%sql $q6a

Using the generate_series view, get a sample of ten students, weighted in this manner.


In [ ]:
q6b = """
WITH RECURSIVE generate_series(value) AS (
  SELECT 0
  UNION ALL
  SELECT value+1 FROM generate_series
   WHERE value+1<=12
)
SELECT name
    FROM ...
    WHERE ...
    ORDER BY ...
    LIMIT 10
"""
%sql $q6b

In [ ]:
q6_answers = [connection.execute(q6b).fetchall() for _ in range(100)]
_ = ok.grade('q6')
_ = ok.backup()

Question 7: Older and Wiser (challenge)

List every person who has been at Berkeley longer - that is, their year is greater - than their average follower.


In [ ]:
q7 = """
SELECT name FROM (
    SELECT ...
)
WHERE year > avg_follower_years
"""
%sql $q7

In [ ]:
q7_answer = connection.execute(q7).fetchall()
_ = ok.grade('q7')
_ = ok.backup()

In [ ]:
_ = ok.grade_all()
_ = ok.submit()

In [ ]: