Homework 4: SQL, FEC Data, and Small Donors

Due: 11:59pm Tuesday, March 14

Note: The due date has changed from March 7 to March 14. Happy studying!

In this homework, we're going to explore the Federal Election Commission's data on the money exchanged during the 2016 election.

This homework has two main parts:

Answering questions and computing descriptive statistics on the data
Conducting a hypothesis test

This is very similar to what you've done before in this class. However, in this homework almost all of our computations will be done using SQL.

Getting Started

For this assignment, you're going to use a popular cloud services provider: Heroku. This will give you some experience provisioning a database in the cloud and working on that database from your computer.

Since the free tier of Heroku's Postgres service limits users to 10,000 rows of data, we've provided a subset of the FEC dataset for you to work with.

If you're interested, you can download and load the entire dataset from http://www.fec.gov/finance/disclosure/ftpdet.shtml. It is about 4GB and contains around 24 million rows. (With Heroku and other cloud services, it is relatively straightforward to rent clusters of machines to work on much larger datasets. In particular, it would be easy to rerun your analyses in this assignment on the full dataset.)

Provisioning the Postgres DB

Visit https://signup.heroku.com/postgres-home-button and sign up for an account if you don't have one already.
Now, install the Heroku CLI: https://devcenter.heroku.com/articles/heroku-cli. Then, run heroku login to log into Heroku from your CLI.
Now, visit https://dashboard.heroku.com/apps and click New -> App. Name the app whatever you want.
You should be sent to the app details page. Click Resources in the navbar, then in the Add-on search bar, type "Postgres". You should be able to select Heroku Postgres. Make sure the free tier (Hobby Dev) is selected and click Provision. Now you should see Heroku Postgres :: Database in your Add-ons list.

Loading the data into the Heroku DB

(1) Run the lines below in your terminal to install necessary libraries.

conda install -y psycopg2
conda install -y postgresql
pip install ipython-sql

(2) Click the Heroku Postgres :: Database link in your app's Add-ons list.

(3) In the Heroku Data page you got redirected to, you should see the name of your database. Scroll down to Administration and click View Credentials. These are the credentials that allow you to connect to the database. The last entry of the list contains a line that looks like:

    heroku pg:psql db_name --app app_name

In your terminal, take that command and add "< fec.sql" to the end to get something like:

    heroku pg:psql db_name --app app_name < fec.sql

Run that command. It will run the commands in fec.sql, which load the dataset into the database. Now you should be able to run the command without the "< fec.sql" to have a postgres prompt. Try typing "\d+" at the prompt. You should get something like:

    ds100-hw4-db::DATABASE=> \d+
                                 List of relations
     Schema |     Name     | Type  |     Owner      |    Size    | Description
    --------+--------------+-------+----------------+------------+-------------
     public | cand         | table | vibrgrsqevmzkj | 16 kB      |
     public | comm         | table | vibrgrsqevmzkj | 168 kB     |
     public | indiv        | table | vibrgrsqevmzkj | 904 kB     |
     public | indiv_sample | table | vibrgrsqevmzkj | 600 kB     |
     public | inter_comm   | table | vibrgrsqevmzkj | 296 kB     |
     public | link         | table | vibrgrsqevmzkj | 8192 bytes |
    (6 rows)

Congrats! You now have a Postgres database running containing the data you need for this project.

Part 1: Descriptive Statistics



In [ ]:

    
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('hw4.ok')

Now, let's connect to your Postgres database. On your Heroku Postgres details, look at the credentials for the database. Take the long URI in the credentials and replace the portion of the code that reads <replace_me> with the URI.

It should start with postgres://.



In [ ]:

    
my_URI = "postgres://sam:@localhost:5432/fec"
%load_ext sql
%sql $my_URI
engine = sqlalchemy.create_engine(my_URI)
connection = engine.connect()

Table Descriptions

Here is a list of the tables in the database. Each table links to the documentation on the FEC page for the dataset.

Note that the table names here are slightly different from the ones in lecture. Consult the FEC page for the descriptions of the tables to find out what the correspondence is.

cand: Candidates table. Contains names and party affiliation.
comm: Committees table. Contains committee names and types.
link: Committee to candidate links.
indiv: Individual contributions. Contains recipient committee ID and transaction amount.
inter_comm: Committee-to-candidate and committee-to-committee contributions. Contains donor and recipient IDs and transaction amount.
indiv_sample: Sample of individual contributions to Hillary Clinton and Bernie Sanders. Used in Part 2 only.

Writing SQL queries

You can write SQL directly in the notebook by using the %sql magic, as demonstrated in the next cell.

Be careful when doing this. If you try to run a SQL query that returns a lot of rows (100k or more is a good rule of thumb) your browser will probably crash.

This is why in this homework, we will strongly prefer using SQL as much as possible, only materializing the SQL queries when they are small.

Because of this, your queries should work even as the size of your data goes into the terabyte range! This is the primary advantage of working with SQL as opposed to only dataframes.



In [ ]:

    
# We use `LIMIT 5` to avoid displaying a huge table.
# Although our tables shouldn't get too large to display,
# this is generally good practice when working in the
# notebook environment.  Jupyter notebooks don't handle
# very large outputs well.
%sql SELECT * from cand LIMIT 5

For longer queries, you can save your query into a string, then use it in the %sql statement. The $query in the %sql statement pulls in the value in the Python variable query.



In [ ]:

    
query = '''
SELECT cand_id, cand_name
FROM cand
WHERE cand_pty_affiliation = 'REP'
LIMIT 5
'''

%sql $query

In addition, you can assign the SQL statement to a variable and then call .DataFrame() on it to get a Pandas DataFrame.

However, it will often be more efficient to express your computation directly in SQL. For this homework, we will be grading your SQL expressions so be sure to do all computation in SQL (unless otherwise requested).



In [ ]:

    
res = %sql select * from cand limit 5
res_df = res.DataFrame()
res_df['cand_id']

Question 1a

We are interested in finding the PACs that donated large sums to the candidates. To begin to answer this question, we will look at the inter_comm table. We'll find all the transactions that exceed \$5,000. However, if there are a lot of transactions like that, it might not be useful to list them all. So before actually finding the transactions, find out how many such transactions there are. Use only SQL to compute the answer.

(It should be a table with a single column called count and a single entry, the number of transactions.)

We will be grading the query string query_q1a. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q1a = '''
SELECT count(*)
FROM inter_comm
WHERE transaction_amt > 5000
'''
q1a = %sql $query_q1a
q1a



In [ ]:

    
_ = ok.grade('q01a')
_ = ok.backup()

Question 1b

Having seen that there aren't too many transactions that exceed \$5,000, let's find them all. Using only SQL, construct a table containing the recipient committee's name, the ID of the donor committee, and the transaction amount, for transactions that exceed $5,000 dollars. Sort the transactions in decreasing order by amount.

We will be grading the query string query_q1b. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q1b = '''
SELECT 
    cmte_id AS donor_cmte_id, 
    name AS recipient_name, 
    transaction_amt AS transaction_amt
FROM inter_comm
WHERE transaction_amt > 5000
ORDER BY transaction_amt DESC
'''
q1b = %sql $query_q1b
q1b



In [ ]:

    
_ = ok.grade('q01b')
_ = ok.backup()

Question 1c

Of course, individual transactions could be misleading. A more interesting question is: How much did each group give in total to each committee? Find the total transaction amounts after grouping by the recipient committee's name and the ID of the donor committee. This time, just use LIMIT 20 to limit your results to the top 20 total donations.

We will be grading the query string query_q1c. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q1c = '''
SELECT 
    cmte_id AS donor_cmte_id, 
    name AS recipient_name, 
    sum(transaction_amt) AS total_transaction_amt 
FROM inter_comm 
GROUP BY cmte_id, name 
ORDER BY sum(transaction_amt) DESC
LIMIT 20
'''
q1c = %sql $query_q1c
q1c



In [ ]:

    
ok.grade('q01c')
_ = ok.backup()

If you peruse the results of your last query, you should notice that some names are listed twice with slightly different spellings. Perhaps this causes some contributions to be split extraneously.

Question 1d

Find a field that uniquely identifies recipient committees and repeat your analysis from the previous question using that new identifier.

We will be grading the query string query_q1d. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q1d = """
SELECT cmte_id AS donor_id,
    other_id AS recipient_id,
    sum(transaction_amt) AS total_transaction_amt
FROM inter_comm
GROUP BY cmte_id, other_id
ORDER BY sum(transaction_amt) DESC
LIMIT 20
"""
q1d = %sql $query_q1d
q1d



In [ ]:

    
_ = ok.grade('q01d')
_ = ok.backup()

Question 1e

Of course, your results are probably not very informative. Let's join these results with the comm table (perhaps twice?) to get the names of the committees involved in these transactions. As before, limit your results to the top 20 by total donation.

We will be grading the query string query_q1e. You may modify our template but the result should contain the same information with the same names.

Remember that the name column of inter_comm is not consistent. We found this out in 1(c) where we found that the same committees were named slightly differently. Because of this, you cannot use the name column of inter_comm to get the names of the committees.



In [ ]:

    
# SOLUTION
query_q1e = '''
SELECT 
    donor.cmte_nm as donor_name, 
    recipient.cmte_nm as recipient_name, 
    sum(transaction_amt) AS total_transaction_amt 
FROM inter_comm, comm AS donor, comm AS recipient
WHERE inter_comm.cmte_id = donor.cmte_id AND inter_comm.other_id = recipient.cmte_id
GROUP BY donor.cmte_nm, recipient.cmte_nm
ORDER BY sum(transaction_amt) DESC 
LIMIT 20
'''
q1e = %sql $query_q1e
q1e



In [ ]:

    
_ = ok.grade('q01e')
_ = ok.backup()

Question 2

What is the distribution of committee by state? Write a SQL query which computes for each state the number of committees in the comm table that are registered in that state. Display the results in descending order by count.

We will be grading the query string query_q2. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q2 = '''
SELECT 
    cmte_st AS state, 
    count(*) AS count
FROM comm 
GROUP BY cmte_st 
ORDER BY count(*) DESC
'''
q2 = %sql $query_q2
q2



In [ ]:

    
_ = ok.grade('q02')
_ = ok.backup()

Question 3

Political Action Committees are major sources funding for campaigns. They typically represent business, labor, or ideological interests and influence campaigns through their funding. Because of this, we'd like to know how much money each committee received from PACs.

For each committee, list the total amount of donations they got from Political Action Committees. If they got no such donations, the total should be listed as null. Order the result by pac_donations, then cmte_nm.

We will be grading you on the query string query_q3. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q3 = '''
WITH pac_donations(cmte_id, pac_donations) AS
(
  SELECT i.other_id, SUM(i.transaction_amt)
  FROM inter_comm i
  WHERE i.entity_tp = 'PAC'
  GROUP BY i.other_id
)
SELECT 
    c.cmte_nm AS cmte_name, 
    i.pac_donations AS pac_donations
FROM comm c LEFT OUTER JOIN pac_donations i ON c.cmte_id = i.cmte_id
ORDER BY pac_donations, cmte_nm
LIMIT 20
'''
q3 = %sql $query_q3
q3



In [ ]:

    
_ = ok.grade('q03')
_ = ok.backup()

Question 4

Committees can also contribute to other committees. When does this happen? Perhaps looking at the data can help us figure it out.

Find the names of the top 10 (directed) committee pairs that are affiliated with the Republican Party, who have the highest number of intercommittee transactions. By directed, we mean that a transaction where C1 donates to C2 is not the same as one where C2 donates to C1.

We will be grading you on the query string query_q4. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q4 = '''
SELECT 
    cfrom.cmte_nm AS from_cmte_name, 
    cto.cmte_nm AS to_cmte_name
FROM comm cfrom, comm cto, inter_comm ict
WHERE cfrom.cmte_id = ict.cmte_id
    AND cfrom.cmte_pty_affiliation = 'REP'
    AND cto.cmte_pty_affiliation = 'REP'
    AND cto.cmte_id = ict.other_id
GROUP BY cfrom.cmte_id, cfrom.cmte_nm, cto.cmte_id, cto.cmte_nm
ORDER BY count(*) DESC
LIMIT 10
'''

q4 = %sql $query_q4
q4



In [ ]:

    
_ = ok.grade('q04')
_ = ok.backup()

Question 5

Some committees received donations from a common contributor. Perhaps they were ideologically similar.

Find the names of distinct candidate pairs that share a common committee contributor from Florida.

If you list a pair ("Washington", "Lincoln") you should also list ("Lincoln, Washington").

Save the result in q5.

Hint: In SQL, the "not equals" operator is <> (it's != in Python).

We will be grading you on the query string query_q5. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q5 = '''
SELECT DISTINCT
    c1.cand_name AS cand_1,
    c2.cand_name AS cand_2
FROM inter_comm i1, inter_comm i2, cand c1, cand c2
WHERE i1.cand_id <> i2.cand_id
    AND i1.cmte_id = i2.cmte_id
    AND i2.state='FL' AND i1.state='FL'
    AND c1.cand_id = i1.cand_id
    AND c2.cand_id = i2.cand_id
ORDER BY cand_1
'''

q5 = %sql $query_q5
q5



In [ ]:

    
_ = ok.grade('q05')
_ = ok.backup()

Part 2: Hypothesis Testing and Bootstrap in SQL

In this part, we're going to perform a hypothesis test using SQL!

This article describes a statement by Hillary Clinton where where she claims that the majority of her campaign was funded by small donors. The article argues that her statement is false, so we ask a slightly different question:

Is there a difference in the proportion of money contributed by small donors between Hillary Clinton's and Bernie Sanders' campaigns?

For these questions, we define small donors as individuals that donated $200 or less to a campaign.

For review, we suggest looking over this chapter on Hypothesis Testing from the Data 8 textbook: https://www.inferentialthinking.com/chapters/10/testing-hypotheses.html

Question 6

Before we begin, please think about and answer the following questions.

For each question, state "Yes" or "No", followed by a one-sentence explanation.

(a) If we were working with the entire FEC dataset instead of a sample, would we still conduct a hypothesis test? Why or why not?

(b) If we were working with the entire FEC dataset instead of a sample, would we still conduct bootstrap resampling? Why or why not?

(c) Let's suppose we take our sample and compute the proportion of money contributed by small donors to Hillary and Bernie's campaign. We find that the difference is 0.0 — they received the exact same proportion of small donations. Would we still need to conduct a hypothesis test? Why or why not?

(d) Let's suppose we take our sample and compute the proportion of money contributed by small donors to Hillary and Bernie's campaign. We find that the difference is 0.3. Would we still need to conduct a hypothesis test? Why or why not?

Solution:

(a) No. If we had the entire dataset, we could directly compute the difference in proportion. Since the dataset contains all transactions, we have the entire population we care about.

(b) No, for the same reason in (a).

(c) Yes. Even though our sample difference is 0, we do not know if the population difference is 0. (However, an hypothesis test would probably not reject the null hypothesis in this case.)

(d) Yes, for the same reason in (c).

Question 7

We've taken a sample of around 2700 rows of the original FEC data for individual contributions that only include contributions to Clinton and Sanders. This sample is stored in the table indiv_sample.

The individual contributions of donors are linked to committees, not candidates directly. Hillary's primary committee was called HILLARY FOR AMERICA, and Bernie's was BERNIE 2016.

Fill in the SQL query below to compute the total contributions for each candidate's committee.

We will be grading you on the query string query_q7. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
query_q7 = """
SELECT 
    comm.cmte_nm AS cmte_nm, 
    sum(indiv_sample.transaction_amt) AS total_transaction_amt
FROM indiv_sample, comm 
WHERE indiv_sample.cmte_id = comm.cmte_id
GROUP BY comm.cmte_id, comm.cmte_nm
HAVING
    cmte_nm = 'HILLARY FOR AMERICA' OR
    cmte_nm = 'BERNIE 2016' 
"""

# Do not change anything below this line

res = %sql $query_q7
q7 = res.DataFrame().set_index("cmte_nm")
q7 # q7 will be graded



In [ ]:

    
_ = ok.grade('q07')
_ = ok.backup()

Question 8

We want to know what proportion of this money came from small donors — individuals who donated \$200 or less. For example, if Hillary raised \$1000, and \$300 of that came from small donors, her proportion of small donors would be 0.3.

Compute this proportion for each candidate by filling in the SQL query below. The resulting table should have two columns:

cmte_id which contains the Hillary's and Bernie's committee IDs
cmte_name which contains the Hillary's and Bernie's committee names
prop_funds which contains the proportion of funds contributed by small donors.

You may not create a dataframe for this problem. By keeping the calculations in SQL, this query will also work on the original dataset of individual contributions (~ 3GB).

Hint: Try using Postgres' CASE statement to filter out transactions under $200.

Hint: Remember that you can append ::float to a column name to convert its values to float. You'll have to do this to perform division correctly.

We will be grading you on the query string query_q8. You may modify our template but the result should contain the same information with the same names.



In [ ]:

    
# SOLUTION
query_q8 = '''
SELECT 
    comm.cmte_id AS cmte_id,
    comm.cmte_nm AS cmte_name,
    SUM (
        CASE WHEN transaction_amt::float <= 200.0 THEN transaction_amt::float 
        ELSE 0.0 END
        ) / SUM(transaction_amt::float) AS prop_funds
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
GROUP BY comm.cmte_id, comm.cmte_nm
HAVING
    cmte_nm = 'HILLARY FOR AMERICA' OR
    cmte_nm = 'BERNIE 2016'
'''

# Do not change anything below this line

res = %sql $query_q8
small_donor_funds_prop = res.DataFrame()
small_donor_funds_prop



In [ ]:

    
_ = ok.grade('q08')
_ = ok.backup()

Question 9

Let's now do a bit of EDA. Fill in the SQL statements below to make histograms of the transaction amounts for both Hillary and Bernie.

Note that we do take your entire result and put it into a dataframe. This is not scalable. If indiv_sample was large, your computer would run out of memory trying to store it in a dataframe. The better way to compute the histogram would be to use SQL to generate bins and count the number of contributions in each bin using the built-in width_bucket function.



In [ ]:

    
# SOLUTION
query_q9a = """
SELECT transaction_amt  
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
    AND cmte_nm = 'HILLARY FOR AMERICA'
"""

# Do not change anything below this line

res = %sql $query_q9a
hillary_contributions = res.DataFrame()
print(hillary_contributions.head())

# Make the Plot
sns.distplot(hillary_contributions)
plt.title('Distribution of Contribution Amounts to Hillary')
plt.xlim((-50, 3000))
plt.ylim((0, 0.02))



In [ ]:

    
# SOLUTION
query_q9b = """
SELECT transaction_amt  
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
    AND cmte_nm = 'BERNIE 2016'
"""

# Do not change anything below this line

res = %sql $query_q9b
bernie_contributions = res.DataFrame()
print(bernie_contributions.head())


sns.distplot(bernie_contributions)
plt.title('Distribution of Contribution Amounts to Bernie')
plt.xlim((-50, 3000))
plt.ylim((0, 0.02))



In [ ]:

    
_ = ok.grade('q09')
_ = ok.backup()

Question 10

Looks like there is a difference. Let's see if it's statistically significant.

State appropriate null and alternative hypotheses for this problsm.

SOLUTION:

Null hypothesis: There is no difference in the proportion of funds contributed by small donors to Hillary and Bernie.

Alternative: There is a difference.

Constructing a Bootstrap CI

We want to create a bootstrap confidence interval of the proportion of funds contributed to Hillary Clinton by small donors.

To do this in SQL, we need to number the rows we want to bootstrap.

The following cell creates a view called hillary. Views are like tables. However, instead of storing the rows in the database, Postgres will recompute the values in the view each time you query it.

It adds a row_id column to each row in indiv_sample corresponding to a contribution to Hillary. Note that we use your hillary_cmte_id variable by including $hillary_cmte_id in the SQL.

We'll do the same for Bernie, creating a view called bernie.



In [ ]:

    
%%sql 

DROP VIEW IF EXISTS hillary CASCADE;
DROP VIEW IF EXISTS bernie CASCADE;

CREATE VIEW hillary AS 
    SELECT row_number() over () AS row_id, indiv_sample.*
    FROM indiv_sample, comm
    WHERE indiv_sample.cmte_id = comm.cmte_id 
        AND comm.cmte_nm = 'HILLARY FOR AMERICA';

CREATE VIEW bernie AS 
    SELECT row_number() over () AS row_id, indiv_sample.*
    FROM indiv_sample, comm
    WHERE indiv_sample.cmte_id = comm.cmte_id 
        AND comm.cmte_nm = 'BERNIE 2016';

SELECT * FROM hillary LIMIT 5

Question 11

Let's contruct a view containing the rows we want to sample for each bootstrap trial. For example, if we want to create 100 bootstrap samples of 3 contributions to Hillary, we want something that looks like:

trial_id | row_id
======== | ======
1        | 1002
1        | 208
1        | 1
2        | 1524
2        | 1410
2        | 1023
3        | 423
3        | 68
3        | 925
...      | ...
100      | 10

This will let us later construct a join on the hillary view that computes the bootstrap sample for each trial by sampling with replacement.

Create a view called hillary_design that contains two columns: trial_id and row_id. It should contain the IDs corresponding to 500 samples of the entire hillary view. The hillary view contains 1524 rows, so the hillary_design view should have a total of 500 * 1524 = 762000 rows.

Hint: Recall how we generated a matrix of random numbers in class. Start with that, then start tweaking it until you get the view you want. Our solution uses the Postgres functions generate_series, floor, and random.



In [ ]:

    
# SOLUTION
n_hillary_rows = 1524
n_trials = 500
seed = 0.42

query_q11 = """
CREATE VIEW hillary_design AS
SELECT trial_id, (FLOOR( $n_hillary_rows * RANDOM() ) + 1) :: INTEGER AS row_id
FROM
    GENERATE_SERIES(1, $n_trials) as a(trial_id),
    GENERATE_SERIES(1, $n_hillary_rows) as sample_size;
"""
# Fill in the $ variables set in the above string
import string
query_q11 = string.Template(query_q11).substitute(locals())

%sql drop view if exists hillary_design cascade
%sql SET SEED TO $seed
%sql $query_q11
%sql select * from hillary_design limit 5



In [ ]:

    
_ = ok.grade('q11')
_ = ok.backup()

Question 12

Construct a view called hillary_trials that uses the hillary and hillary_design views to compute the total amount contributed by small donors for each trial as well as the overall amount.

It should have three columns:

trial_id: The number of the trial, from 1 to 500
small_donor_sum: The total contributions from small donors in the trial
total: The total contributions of all donations in the trial

Hint: Our solution uses the CASE WHEN statement inside of a SUM() function call to compute the small_donor_sum.



In [ ]:

    
# SOLUTION
query_q12 = '''
CREATE VIEW hillary_trials as
SELECT d.trial_id,
  SUM(CASE WHEN transaction_amt <= 200
      THEN transaction_amt
      ELSE 0
      END) AS small_donor_sum,
  SUM(transaction_amt) AS total
FROM hillary_design d, hillary h
WHERE d.row_id = h.row_id
GROUP BY d.trial_id
'''

# Do not change anything below this line

%sql drop view if exists hillary_trials cascade
%sql SET SEED TO $seed
%sql $query_q12
%sql select * from hillary_trials limit 5



In [ ]:

    
_ = ok.grade('q12')
_ = ok.backup()

Question 13

Now, create a view called hillary_props that contains two columns:

trial_id: The number of the trial, from 1 to 500
small_donor_prop: The proportion contributed by small donors for each trial

Hint: Remember that you can append ::float to a column name to convert its values to float. You'll have to do this to perform division correctly.



In [ ]:

    
# SOLUTION
query_q13 = '''
CREATE VIEW hillary_props as
SELECT
    trial_id,
    small_donor_sum::float / total::float AS small_donor_prop
FROM hillary_trials
'''

%sql drop view if exists hillary_props cascade
%sql SET SEED TO $seed
%sql $query_q13
%sql select * from hillary_props limit 5



In [ ]:

    
_ = ok.grade('q13')
_ = ok.backup()

Question 14

Now, repeat the process to bootstrap Bernie's proportion of funds raised by small donors.

You should be able to mostly copy-paste your code for Hillary's bootstrap CI.



In [ ]:

    
# SOLUTION
n_bernie_rows = 1173
n_trials = 500

create_bernie_design = """
CREATE VIEW bernie_design AS
SELECT trial_id, (FLOOR( $n_bernie_rows * RANDOM() ) + 1) :: INTEGER AS row_id
FROM
    GENERATE_SERIES(1, $n_trials) as a(trial_id),
    GENERATE_SERIES(1, $n_bernie_rows) as sample_size;
"""

create_bernie_trials = '''
CREATE VIEW bernie_trials as
SELECT d.trial_id,
  SUM(CASE WHEN transaction_amt <= 200
      THEN transaction_amt
      ELSE 0
      END) AS small_donor_sum,
  SUM(transaction_amt) AS total
FROM bernie_design d, bernie h
WHERE d.row_id = h.row_id
GROUP BY d.trial_id
'''

create_bernie_props = '''
CREATE VIEW bernie_props as
SELECT trial_id, small_donor_sum::float / total::float AS small_donor_prop
FROM bernie_trials
'''
# Do not change anything below this line

# Fill in the $ variables set in the above string
import string
create_bernie_design = (string.Template(create_bernie_design)
                        .substitute(locals()))

%sql drop view if exists bernie_design cascade
%sql $create_bernie_design

%sql drop view if exists bernie_trials cascade
%sql $create_bernie_trials

%sql drop view if exists bernie_props
%sql $create_bernie_props

%sql SET SEED TO $seed
%sql select * from bernie_props limit 5



In [ ]:

    
_ = ok.grade('q14')
_ = ok.backup()

Plotting the sample distribution

Run the following cell to make a plot of the distribution of proportions for both Hillary and Bernie.

Again, this would not be scalable if we took many bootstrap samples. However, 500 floats is reasonable to fit in memory.



In [ ]:

    
res = %sql select * from hillary_props
hillary_trials_df = res.DataFrame()

res = %sql select * from bernie_props
bernie_trials_df = res.DataFrame()

ax = plt.subplot(1,2,1)
sns.distplot(hillary_trials_df['small_donor_prop'], ax=ax)
plt.title('Hillary Bootstrap Prop')
plt.xlim(0.1, 0.9)
plt.ylim(0, 25)

ax = plt.subplot(1,2,2)
sns.distplot(bernie_trials_df['small_donor_prop'], ax=ax)
plt.title('Bernie Bootstrap Prop')
plt.xlim(0.1, 0.9)
plt.ylim(0, 25)

Computing the Confidence Interval

Run the following cell to compute confidence intervals based on your hillary_props and bernie_props views. Think about what the intervals mean.



In [ ]:

    
compute_hillary_ci = '''
SELECT percentile_disc(0.025) WITHIN GROUP (ORDER BY small_donor_prop) as ci_lower,
       percentile_disc(0.975) WITHIN GROUP (ORDER BY small_donor_prop) as ci_upper
FROM hillary_props
'''
compute_bernie_ci = '''
SELECT percentile_disc(0.025) WITHIN GROUP (ORDER BY small_donor_prop) as ci_lower,
       percentile_disc(0.975) WITHIN GROUP (ORDER BY small_donor_prop) as ci_upper
FROM bernie_props
'''


%sql SET SEED TO $seed
hillary_ci = %sql $compute_hillary_ci
bernie_ci = %sql $compute_bernie_ci

print(hillary_ci)
print(bernie_ci)

Question 15

Based on your confidence intervals, should we reject the null?

Are there any other factors that should be taken into consideration when making this conclusion?

SOLUTION: The 95% confidence interval for Hillary's proportion of small donor contributions has no overlap with the CI for Bernie. Thus, we reject the null and conclude that Hillary and Bernie received different proportions of funds from small donors.

This is also supported by the distribution of donation amounts we computed during our EDA, and by the relatively large absolute difference in the proportions in our sample.

Congrats! You finished the homework.

Submitting your assignment

First, run the next cell to run all the tests at once.



In [ ]:

    
_ = ok.grade_all()

Then, we'll submit the assignment to OkPy so that the staff will know to grade it. You can submit as many times as you want, and you can choose which submission you want us to grade by going to https://okpy.org/cal/data100/sp17/. After you've done that, make sure you've pushed your changes to Github as well!



In [ ]:

    
# Now, we'll submit to okpy
_ = ok.submit()