In [2]:
!pip install ipython-sql
%load_ext sql
%sql sqlite:///./lab06.sqlite
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite:///lab06.sqlite")
connection = engine.connect()
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab06.ok')
The date: March, 2017. All of the students at Berkeley are obsessed with the hot new social networking app, Rapidgram, where users can share text and image posts. You've been hired as Rapidgram's very first Data Scientist, in charge of analyzing their petabyte-scale user data, in order to sell it to credit card companies (I mean, they had to monetize somehow). But before you get into that, you need to learn more about their database schema.
First, run the next few cells to generate a snapshot of their data. It will be saved locally as the file lab05.sqlite.
In [3]:
%%sql
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS follows;
CREATE TABLE users (
USERID INT NOT NULL,
NAME VARCHAR (256) NOT NULL,
YEAR FLOAT NOT NULL,
PRIMARY KEY (USERID)
);
CREATE TABLE follows (
USERID INT NOT NULL,
FOLLOWID INT NOT NULL,
PRIMARY KEY (USERID, FOLLOWID)
);
Out[3]:
In [4]:
%%capture
count = 0
users = ["Ian", "Daniel", "Sarah", "Kelly", "Sam", "Alison", "Henry", "Joey", "Mark", "Joyce", "Natalie", "John"]
years = [1, 3, 4, 3, 4, 2, 5, 2, 1, 3, 4, 2]
for username, year in zip(users, years):
count += 1
%sql INSERT INTO users VALUES ($count, '$username', $year);
In [5]:
%%capture
follows = [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,
0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 0, 1]
for i in range(12):
for j in range(12):
if i != j and follows[i + j*12]:
%sql INSERT INTO follows VALUES ($i+1, $j+1);
In [11]:
q1 = """
...
"""
%sql $q1
Out[11]:
In [12]:
q1_answer = connection.execute(q1).fetchall()
_ = ok.grade('q1')
_ = ok.backup()
In [ ]:
q2 = """
...
"""
%sql $q2
In [ ]:
q2_answer = connection.execute(q2).fetchall()
_ = ok.grade('q2')
_ = ok.backup()
q2_answer
In [ ]:
q3 = """
...
"""
%sql $q3
In [ ]:
q3_answer = connection.execute(q3).fetchall()
_ = ok.grade('q3')
_ = ok.backup()
In [ ]:
q4 = """
...
"""
%sql $q4
In [ ]:
q4_answer = connection.execute(q4).fetchall()
_ = ok.grade('q4')
_ = ok.backup()
In [ ]:
q5a = """
SELECT u1.name as follower, u2.name as followee
FROM follows, users as u1, users as u2
WHERE follows.userid=u1.userid
AND follows.followid=u2.userid
AND RANDOM() < 0.33
"""
Do you think this query will work as intended? Why or why not? Try designing a better query below:
In [ ]:
q5b = """
...
"""
%sql $q5b
In [ ]:
q5_answers = [connection.execute(q5b).fetchall() for _ in range(100)]
_ = ok.grade('q5')
_ = ok.backup()
Rapidgram leadership wants to give more priority to more experienced users, so they decide to weight a survey of users towards students who have spend a greater number of years at berkeley. They want to take a sample of 10 students, weighted such that a student's chance of being in the sample is proportional to their number of years spent at berkeley - for instance, a student with 6 years has three times the chance of a student with 2 years, who has twice the chance of a student with only one year.
To take this sample, they've provided you with a helpful temporary view. You can run the cell below to see its functionality.
In [ ]:
q6a = """
WITH RECURSIVE generate_series(value) AS (
SELECT 0
UNION ALL
SELECT value+1 FROM generate_series
WHERE value+1<=10
)
SELECT value
FROM generate_series
"""
%sql $q6a
Using the generate_series view, get a sample of ten students, weighted in this manner.
In [ ]:
q6b = """
WITH RECURSIVE generate_series(value) AS (
SELECT 0
UNION ALL
SELECT value+1 FROM generate_series
WHERE value+1<=12
)
SELECT name
FROM ...
WHERE ...
ORDER BY ...
LIMIT 10
"""
%sql $q6b
In [ ]:
q6_answers = [connection.execute(q6b).fetchall() for _ in range(100)]
_ = ok.grade('q6')
_ = ok.backup()
In [ ]:
q7 = """
SELECT name FROM (
SELECT ...
)
WHERE year > avg_follower_years
"""
%sql $q7
In [ ]:
q7_answer = connection.execute(q7).fetchall()
_ = ok.grade('q7')
_ = ok.backup()
In [ ]:
_ = ok.grade_all()
_ = ok.submit()
In [ ]: