Note: The due date has changed from March 7 to March 14. Happy studying!
In this homework, we're going to explore the Federal Election Commission's data on the money exchanged during the 2016 election.
This homework has two main parts:
This is very similar to what you've done before in this class. However, in this homework almost all of our computations will be done using SQL.
For this assignment, you're going to use a popular cloud services provider: Heroku. This will give you some experience provisioning a database in the cloud and working on that database from your computer.
Since the free tier of Heroku's Postgres service limits users to 10,000 rows of data, we've provided a subset of the FEC dataset for you to work with.
If you're interested, you can download and load the entire dataset from http://www.fec.gov/finance/disclosure/ftpdet.shtml. It is about 4GB and contains around 24 million rows. (With Heroku and other cloud services, it is relatively straightforward to rent clusters of machines to work on much larger datasets. In particular, it would be easy to rerun your analyses in this assignment on the full dataset.)
heroku login to log into Heroku from your CLI.(2) Click the Heroku Postgres :: Database link in your app's Add-ons list.
(3) In the Heroku Data page you got redirected to, you should see the name of your database. Scroll down to Administration and click View Credentials. These are the credentials that allow you to connect to the database. The last entry of the list contains a line that looks like:
heroku pg:psql db_name --app app_name
In your terminal, take that command and add "< fec.sql" to the end
to get something like:
heroku pg:psql db_name --app app_name < fec.sql
Run that command. It will run the commands in fec.sql, which load the dataset into the database.
Now you should be able to run the command without the "< fec.sql" to
have a postgres prompt. Try typing "\d+" at the prompt. You should get
something like:
ds100-hw4-db::DATABASE=> \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------------+-------+----------------+------------+-------------
public | cand | table | vibrgrsqevmzkj | 16 kB |
public | comm | table | vibrgrsqevmzkj | 168 kB |
public | indiv | table | vibrgrsqevmzkj | 904 kB |
public | indiv_sample | table | vibrgrsqevmzkj | 600 kB |
public | inter_comm | table | vibrgrsqevmzkj | 296 kB |
public | link | table | vibrgrsqevmzkj | 8192 bytes |
(6 rows)
Congrats! You now have a Postgres database running containing the data you need for this project.
In [ ]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('hw4.ok')
Now, let's connect to your Postgres database. On your Heroku Postgres details,
look at the credentials for the database. Take the long URI in the credentials and
replace the portion of the code that reads <replace_me> with the URI.
It should start with postgres://.
In [ ]:
my_URI = "postgres://sam:@localhost:5432/fec"
%load_ext sql
%sql $my_URI
engine = sqlalchemy.create_engine(my_URI)
connection = engine.connect()
Here is a list of the tables in the database. Each table links to the documentation on the FEC page for the dataset.
Note that the table names here are slightly different from the ones in lecture. Consult the FEC page for the descriptions of the tables to find out what the correspondence is.
cand: Candidates table. Contains names and party affiliation.comm: Committees table. Contains committee names and types.link: Committee to candidate links.indiv: Individual contributions. Contains recipient committee ID and transaction amount.inter_comm: Committee-to-candidate and committee-to-committee contributions. Contains donor and recipient IDs and transaction amount.indiv_sample: Sample of individual contributions to Hillary Clinton and Bernie Sanders. Used in Part 2 only.You can write SQL directly in the notebook by using the %sql magic, as demonstrated in the next cell.
Be careful when doing this. If you try to run a SQL query that returns a lot of rows (100k or more is a good rule of thumb) your browser will probably crash.
This is why in this homework, we will strongly prefer using SQL as much as possible, only materializing the SQL queries when they are small.
Because of this, your queries should work even as the size of your data goes into the terabyte range! This is the primary advantage of working with SQL as opposed to only dataframes.
In [ ]:
# We use `LIMIT 5` to avoid displaying a huge table.
# Although our tables shouldn't get too large to display,
# this is generally good practice when working in the
# notebook environment. Jupyter notebooks don't handle
# very large outputs well.
%sql SELECT * from cand LIMIT 5
For longer queries, you can save your query into a string, then use it in the
%sql statement. The $query in the %sql statement pulls in the value in
the Python variable query.
In [ ]:
query = '''
SELECT cand_id, cand_name
FROM cand
WHERE cand_pty_affiliation = 'REP'
LIMIT 5
'''
%sql $query
In addition, you can assign the SQL statement to a variable and then call .DataFrame() on it to get a Pandas DataFrame.
However, it will often be more efficient to express your computation directly in SQL. For this homework, we will be grading your SQL expressions so be sure to do all computation in SQL (unless otherwise requested).
In [ ]:
res = %sql select * from cand limit 5
res_df = res.DataFrame()
res_df['cand_id']
We are interested in finding the PACs that donated large sums to the candidates. To begin to answer this question, we will look at the inter_comm table. We'll find all the transactions that exceed \$5,000. However, if there are a lot of transactions like that, it might not be useful to list them all. So before actually finding the transactions, find out how many such transactions there are. Use only SQL to compute the answer.
(It should be a table with a single column called count and a single entry, the number of transactions.)
We will be grading the query string query_q1a. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q1a = '''
SELECT count(*)
FROM inter_comm
WHERE transaction_amt > 5000
'''
q1a = %sql $query_q1a
q1a
In [ ]:
_ = ok.grade('q01a')
_ = ok.backup()
Having seen that there aren't too many transactions that exceed \$5,000, let's find them all. Using only SQL, construct a table containing the recipient committee's name, the ID of the donor committee, and the transaction amount, for transactions that exceed $5,000 dollars. Sort the transactions in decreasing order by amount.
We will be grading the query string query_q1b. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q1b = '''
SELECT
cmte_id AS donor_cmte_id,
name AS recipient_name,
transaction_amt AS transaction_amt
FROM inter_comm
WHERE transaction_amt > 5000
ORDER BY transaction_amt DESC
'''
q1b = %sql $query_q1b
q1b
In [ ]:
_ = ok.grade('q01b')
_ = ok.backup()
Of course, individual transactions could be misleading. A more interesting question is: How much did each group give in total to each committee? Find the total transaction amounts after grouping by the recipient committee's name and the ID of the donor committee. This time, just use LIMIT 20 to limit your results to the top 20 total donations.
We will be grading the query string query_q1c. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q1c = '''
SELECT
cmte_id AS donor_cmte_id,
name AS recipient_name,
sum(transaction_amt) AS total_transaction_amt
FROM inter_comm
GROUP BY cmte_id, name
ORDER BY sum(transaction_amt) DESC
LIMIT 20
'''
q1c = %sql $query_q1c
q1c
In [ ]:
ok.grade('q01c')
_ = ok.backup()
If you peruse the results of your last query, you should notice that some names are listed twice with slightly different spellings. Perhaps this causes some contributions to be split extraneously.
In [ ]:
# SOLUTION
query_q1d = """
SELECT cmte_id AS donor_id,
other_id AS recipient_id,
sum(transaction_amt) AS total_transaction_amt
FROM inter_comm
GROUP BY cmte_id, other_id
ORDER BY sum(transaction_amt) DESC
LIMIT 20
"""
q1d = %sql $query_q1d
q1d
In [ ]:
_ = ok.grade('q01d')
_ = ok.backup()
Of course, your results are probably not very informative. Let's join these results with the comm table (perhaps twice?) to get the names of the committees involved in these transactions. As before, limit your results to the top 20 by total donation.
We will be grading the query string query_q1e. You may modify our template but the result should contain the same information with the same names.
Remember that the name column of inter_comm is not consistent. We found this out in 1(c) where we found that the same committees were named slightly differently. Because of this, you cannot use the name column of inter_comm to get the names of the committees.
In [ ]:
# SOLUTION
query_q1e = '''
SELECT
donor.cmte_nm as donor_name,
recipient.cmte_nm as recipient_name,
sum(transaction_amt) AS total_transaction_amt
FROM inter_comm, comm AS donor, comm AS recipient
WHERE inter_comm.cmte_id = donor.cmte_id AND inter_comm.other_id = recipient.cmte_id
GROUP BY donor.cmte_nm, recipient.cmte_nm
ORDER BY sum(transaction_amt) DESC
LIMIT 20
'''
q1e = %sql $query_q1e
q1e
In [ ]:
_ = ok.grade('q01e')
_ = ok.backup()
What is the distribution of committee by state? Write a SQL query which computes for each state the number of committees in the comm table that are registered in that state. Display the results in descending order by count.
We will be grading the query string query_q2. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q2 = '''
SELECT
cmte_st AS state,
count(*) AS count
FROM comm
GROUP BY cmte_st
ORDER BY count(*) DESC
'''
q2 = %sql $query_q2
q2
In [ ]:
_ = ok.grade('q02')
_ = ok.backup()
Political Action Committees are major sources funding for campaigns. They typically represent business, labor, or ideological interests and influence campaigns through their funding. Because of this, we'd like to know how much money each committee received from PACs.
For each committee, list the total amount of donations they got from Political Action Committees. If they got no such donations, the total should be listed as null. Order the result by pac_donations, then cmte_nm.
We will be grading you on the query string query_q3. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q3 = '''
WITH pac_donations(cmte_id, pac_donations) AS
(
SELECT i.other_id, SUM(i.transaction_amt)
FROM inter_comm i
WHERE i.entity_tp = 'PAC'
GROUP BY i.other_id
)
SELECT
c.cmte_nm AS cmte_name,
i.pac_donations AS pac_donations
FROM comm c LEFT OUTER JOIN pac_donations i ON c.cmte_id = i.cmte_id
ORDER BY pac_donations, cmte_nm
LIMIT 20
'''
q3 = %sql $query_q3
q3
In [ ]:
_ = ok.grade('q03')
_ = ok.backup()
Committees can also contribute to other committees. When does this happen? Perhaps looking at the data can help us figure it out.
Find the names of the top 10 (directed) committee pairs that are affiliated with the Republican Party, who have the highest number of intercommittee transactions. By directed, we mean that a transaction where C1 donates to C2 is not the same as one where C2 donates to C1.
We will be grading you on the query string query_q4. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q4 = '''
SELECT
cfrom.cmte_nm AS from_cmte_name,
cto.cmte_nm AS to_cmte_name
FROM comm cfrom, comm cto, inter_comm ict
WHERE cfrom.cmte_id = ict.cmte_id
AND cfrom.cmte_pty_affiliation = 'REP'
AND cto.cmte_pty_affiliation = 'REP'
AND cto.cmte_id = ict.other_id
GROUP BY cfrom.cmte_id, cfrom.cmte_nm, cto.cmte_id, cto.cmte_nm
ORDER BY count(*) DESC
LIMIT 10
'''
q4 = %sql $query_q4
q4
In [ ]:
_ = ok.grade('q04')
_ = ok.backup()
Some committees received donations from a common contributor. Perhaps they were ideologically similar.
Find the names of distinct candidate pairs that share a common committee contributor from Florida.
If you list a pair ("Washington", "Lincoln") you should also list ("Lincoln, Washington").
Save the result in q5.
Hint: In SQL, the "not equals" operator is <> (it's != in Python).
We will be grading you on the query string query_q5. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q5 = '''
SELECT DISTINCT
c1.cand_name AS cand_1,
c2.cand_name AS cand_2
FROM inter_comm i1, inter_comm i2, cand c1, cand c2
WHERE i1.cand_id <> i2.cand_id
AND i1.cmte_id = i2.cmte_id
AND i2.state='FL' AND i1.state='FL'
AND c1.cand_id = i1.cand_id
AND c2.cand_id = i2.cand_id
ORDER BY cand_1
'''
q5 = %sql $query_q5
q5
In [ ]:
_ = ok.grade('q05')
_ = ok.backup()
In this part, we're going to perform a hypothesis test using SQL!
This article describes a statement by Hillary Clinton where where she claims that the majority of her campaign was funded by small donors. The article argues that her statement is false, so we ask a slightly different question:
Is there a difference in the proportion of money contributed by small donors between Hillary Clinton's and Bernie Sanders' campaigns?
For these questions, we define small donors as individuals that donated $200 or less to a campaign.
For review, we suggest looking over this chapter on Hypothesis Testing from the Data 8 textbook: https://www.inferentialthinking.com/chapters/10/testing-hypotheses.html
Before we begin, please think about and answer the following questions.
For each question, state "Yes" or "No", followed by a one-sentence explanation.
(a) If we were working with the entire FEC dataset instead of a sample, would we still conduct a hypothesis test? Why or why not?
(b) If we were working with the entire FEC dataset instead of a sample, would we still conduct bootstrap resampling? Why or why not?
(c) Let's suppose we take our sample and compute the proportion of money contributed by small donors to Hillary and Bernie's campaign. We find that the difference is 0.0 — they received the exact same proportion of small donations. Would we still need to conduct a hypothesis test? Why or why not?
(d) Let's suppose we take our sample and compute the proportion of money contributed by small donors to Hillary and Bernie's campaign. We find that the difference is 0.3. Would we still need to conduct a hypothesis test? Why or why not?
Solution:
(a) No. If we had the entire dataset, we could directly compute the difference in proportion. Since the dataset contains all transactions, we have the entire population we care about.
(b) No, for the same reason in (a).
(c) Yes. Even though our sample difference is 0, we do not know if the population difference is 0. (However, an hypothesis test would probably not reject the null hypothesis in this case.)
(d) Yes, for the same reason in (c).
We've taken a sample of around 2700 rows of the original FEC data for individual
contributions that only include contributions to Clinton and Sanders.
This sample is stored in the table indiv_sample.
The individual contributions of donors are linked to committees,
not candidates directly. Hillary's primary committee was called
HILLARY FOR AMERICA, and Bernie's was BERNIE 2016.
Fill in the SQL query below to compute the total contributions for each candidate's committee.
We will be grading you on the query string query_q7. You may modify our template but the result should contain the same information with the same names.
In [ ]:
query_q7 = """
SELECT
comm.cmte_nm AS cmte_nm,
sum(indiv_sample.transaction_amt) AS total_transaction_amt
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
GROUP BY comm.cmte_id, comm.cmte_nm
HAVING
cmte_nm = 'HILLARY FOR AMERICA' OR
cmte_nm = 'BERNIE 2016'
"""
# Do not change anything below this line
res = %sql $query_q7
q7 = res.DataFrame().set_index("cmte_nm")
q7 # q7 will be graded
In [ ]:
_ = ok.grade('q07')
_ = ok.backup()
We want to know what proportion of this money came from small donors — individuals who donated \$200 or less. For example, if Hillary raised \$1000, and \$300 of that came from small donors, her proportion of small donors would be 0.3.
Compute this proportion for each candidate by filling in the SQL query below. The resulting table should have two columns:
cmte_id which contains the Hillary's and Bernie's committee IDscmte_name which contains the Hillary's and Bernie's committee namesprop_funds which contains the proportion of funds contributed by
small donors.You may not create a dataframe for this problem. By keeping the calculations in SQL, this query will also work on the original dataset of individual contributions (~ 3GB).
Hint: Try using Postgres' CASE statement to filter out transactions under
$200.
Hint: Remember that you can append ::float to a column name to convert its
values to float. You'll have to do this to perform division correctly.
We will be grading you on the query string query_q8. You may modify our template but the result should contain the same information with the same names.
In [ ]:
# SOLUTION
query_q8 = '''
SELECT
comm.cmte_id AS cmte_id,
comm.cmte_nm AS cmte_name,
SUM (
CASE WHEN transaction_amt::float <= 200.0 THEN transaction_amt::float
ELSE 0.0 END
) / SUM(transaction_amt::float) AS prop_funds
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
GROUP BY comm.cmte_id, comm.cmte_nm
HAVING
cmte_nm = 'HILLARY FOR AMERICA' OR
cmte_nm = 'BERNIE 2016'
'''
# Do not change anything below this line
res = %sql $query_q8
small_donor_funds_prop = res.DataFrame()
small_donor_funds_prop
In [ ]:
_ = ok.grade('q08')
_ = ok.backup()
Let's now do a bit of EDA. Fill in the SQL statements below to make histograms of the transaction amounts for both Hillary and Bernie.
Note that we do take your entire result and put it into a dataframe.
This is not scalable. If indiv_sample was large, your computer
would run out of memory trying to store it in a dataframe. The better way to
compute the histogram would be to use SQL to generate bins and count the number
of contributions in each bin using the built-in
width_bucket function.
In [ ]:
# SOLUTION
query_q9a = """
SELECT transaction_amt
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
AND cmte_nm = 'HILLARY FOR AMERICA'
"""
# Do not change anything below this line
res = %sql $query_q9a
hillary_contributions = res.DataFrame()
print(hillary_contributions.head())
# Make the Plot
sns.distplot(hillary_contributions)
plt.title('Distribution of Contribution Amounts to Hillary')
plt.xlim((-50, 3000))
plt.ylim((0, 0.02))
In [ ]:
# SOLUTION
query_q9b = """
SELECT transaction_amt
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
AND cmte_nm = 'BERNIE 2016'
"""
# Do not change anything below this line
res = %sql $query_q9b
bernie_contributions = res.DataFrame()
print(bernie_contributions.head())
sns.distplot(bernie_contributions)
plt.title('Distribution of Contribution Amounts to Bernie')
plt.xlim((-50, 3000))
plt.ylim((0, 0.02))
In [ ]:
_ = ok.grade('q09')
_ = ok.backup()
SOLUTION:
Null hypothesis: There is no difference in the proportion of funds contributed by small donors to Hillary and Bernie.
Alternative: There is a difference.
We want to create a bootstrap confidence interval of the proportion of funds contributed to Hillary Clinton by small donors.
To do this in SQL, we need to number the rows we want to bootstrap.
The following cell creates a view called hillary. Views are like tables.
However, instead of storing the rows in the database, Postgres will recompute
the values in the view each time you query it.
It adds a row_id column to each row in indiv_sample
corresponding to a contribution to Hillary. Note that we use your
hillary_cmte_id variable by including $hillary_cmte_id in the SQL.
We'll do the same for Bernie, creating a view called bernie.
In [ ]:
%%sql
DROP VIEW IF EXISTS hillary CASCADE;
DROP VIEW IF EXISTS bernie CASCADE;
CREATE VIEW hillary AS
SELECT row_number() over () AS row_id, indiv_sample.*
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
AND comm.cmte_nm = 'HILLARY FOR AMERICA';
CREATE VIEW bernie AS
SELECT row_number() over () AS row_id, indiv_sample.*
FROM indiv_sample, comm
WHERE indiv_sample.cmte_id = comm.cmte_id
AND comm.cmte_nm = 'BERNIE 2016';
SELECT * FROM hillary LIMIT 5
Let's contruct a view containing the rows we want to sample for each bootstrap trial. For example, if we want to create 100 bootstrap samples of 3 contributions to Hillary, we want something that looks like:
trial_id | row_id
======== | ======
1 | 1002
1 | 208
1 | 1
2 | 1524
2 | 1410
2 | 1023
3 | 423
3 | 68
3 | 925
... | ...
100 | 10
This will let us later construct a join on the hillary view that computes the
bootstrap sample for each trial by sampling with replacement.
Create a view called hillary_design that contains two columns: trial_id
and row_id. It should contain the IDs corresponding to
500 samples of the entire hillary view. The hillary view contains 1524
rows, so the hillary_design view should have a total of
500 * 1524 = 762000 rows.
Hint: Recall how we generated a matrix of random numbers in class. Start with
that, then start tweaking it until you get the view you want. Our solution uses
the Postgres functions generate_series, floor, and random.
In [ ]:
# SOLUTION
n_hillary_rows = 1524
n_trials = 500
seed = 0.42
query_q11 = """
CREATE VIEW hillary_design AS
SELECT trial_id, (FLOOR( $n_hillary_rows * RANDOM() ) + 1) :: INTEGER AS row_id
FROM
GENERATE_SERIES(1, $n_trials) as a(trial_id),
GENERATE_SERIES(1, $n_hillary_rows) as sample_size;
"""
# Fill in the $ variables set in the above string
import string
query_q11 = string.Template(query_q11).substitute(locals())
%sql drop view if exists hillary_design cascade
%sql SET SEED TO $seed
%sql $query_q11
%sql select * from hillary_design limit 5
In [ ]:
_ = ok.grade('q11')
_ = ok.backup()
Construct a view called hillary_trials that uses the hillary
and hillary_design views to compute the total amount contributed
by small donors for each trial as well as the overall amount.
It should have three columns:
trial_id: The number of the trial, from 1 to 500small_donor_sum: The total contributions from small donors in the trialtotal: The total contributions of all donations in the trialHint: Our solution uses the CASE WHEN statement inside of a SUM() function
call to compute the small_donor_sum.
In [ ]:
# SOLUTION
query_q12 = '''
CREATE VIEW hillary_trials as
SELECT d.trial_id,
SUM(CASE WHEN transaction_amt <= 200
THEN transaction_amt
ELSE 0
END) AS small_donor_sum,
SUM(transaction_amt) AS total
FROM hillary_design d, hillary h
WHERE d.row_id = h.row_id
GROUP BY d.trial_id
'''
# Do not change anything below this line
%sql drop view if exists hillary_trials cascade
%sql SET SEED TO $seed
%sql $query_q12
%sql select * from hillary_trials limit 5
In [ ]:
_ = ok.grade('q12')
_ = ok.backup()
Now, create a view called hillary_props that contains two columns:
trial_id: The number of the trial, from 1 to 500small_donor_prop: The proportion contributed by small donors for each trialHint: Remember that you can append ::float to a column name to convert its
values to float. You'll have to do this to perform division correctly.
In [ ]:
# SOLUTION
query_q13 = '''
CREATE VIEW hillary_props as
SELECT
trial_id,
small_donor_sum::float / total::float AS small_donor_prop
FROM hillary_trials
'''
%sql drop view if exists hillary_props cascade
%sql SET SEED TO $seed
%sql $query_q13
%sql select * from hillary_props limit 5
In [ ]:
_ = ok.grade('q13')
_ = ok.backup()
In [ ]:
# SOLUTION
n_bernie_rows = 1173
n_trials = 500
create_bernie_design = """
CREATE VIEW bernie_design AS
SELECT trial_id, (FLOOR( $n_bernie_rows * RANDOM() ) + 1) :: INTEGER AS row_id
FROM
GENERATE_SERIES(1, $n_trials) as a(trial_id),
GENERATE_SERIES(1, $n_bernie_rows) as sample_size;
"""
create_bernie_trials = '''
CREATE VIEW bernie_trials as
SELECT d.trial_id,
SUM(CASE WHEN transaction_amt <= 200
THEN transaction_amt
ELSE 0
END) AS small_donor_sum,
SUM(transaction_amt) AS total
FROM bernie_design d, bernie h
WHERE d.row_id = h.row_id
GROUP BY d.trial_id
'''
create_bernie_props = '''
CREATE VIEW bernie_props as
SELECT trial_id, small_donor_sum::float / total::float AS small_donor_prop
FROM bernie_trials
'''
# Do not change anything below this line
# Fill in the $ variables set in the above string
import string
create_bernie_design = (string.Template(create_bernie_design)
.substitute(locals()))
%sql drop view if exists bernie_design cascade
%sql $create_bernie_design
%sql drop view if exists bernie_trials cascade
%sql $create_bernie_trials
%sql drop view if exists bernie_props
%sql $create_bernie_props
%sql SET SEED TO $seed
%sql select * from bernie_props limit 5
In [ ]:
_ = ok.grade('q14')
_ = ok.backup()
In [ ]:
res = %sql select * from hillary_props
hillary_trials_df = res.DataFrame()
res = %sql select * from bernie_props
bernie_trials_df = res.DataFrame()
ax = plt.subplot(1,2,1)
sns.distplot(hillary_trials_df['small_donor_prop'], ax=ax)
plt.title('Hillary Bootstrap Prop')
plt.xlim(0.1, 0.9)
plt.ylim(0, 25)
ax = plt.subplot(1,2,2)
sns.distplot(bernie_trials_df['small_donor_prop'], ax=ax)
plt.title('Bernie Bootstrap Prop')
plt.xlim(0.1, 0.9)
plt.ylim(0, 25)
In [ ]:
compute_hillary_ci = '''
SELECT percentile_disc(0.025) WITHIN GROUP (ORDER BY small_donor_prop) as ci_lower,
percentile_disc(0.975) WITHIN GROUP (ORDER BY small_donor_prop) as ci_upper
FROM hillary_props
'''
compute_bernie_ci = '''
SELECT percentile_disc(0.025) WITHIN GROUP (ORDER BY small_donor_prop) as ci_lower,
percentile_disc(0.975) WITHIN GROUP (ORDER BY small_donor_prop) as ci_upper
FROM bernie_props
'''
%sql SET SEED TO $seed
hillary_ci = %sql $compute_hillary_ci
bernie_ci = %sql $compute_bernie_ci
print(hillary_ci)
print(bernie_ci)
SOLUTION: The 95% confidence interval for Hillary's proportion of small donor contributions has no overlap with the CI for Bernie. Thus, we reject the null and conclude that Hillary and Bernie received different proportions of funds from small donors.
This is also supported by the distribution of donation amounts we computed during our EDA, and by the relatively large absolute difference in the proportions in our sample.
Congrats! You finished the homework.
In [ ]:
_ = ok.grade_all()
Then, we'll submit the assignment to OkPy so that the staff will know to grade it. You can submit as many times as you want, and you can choose which submission you want us to grade by going to https://okpy.org/cal/data100/sp17/. After you've done that, make sure you've pushed your changes to Github as well!
In [ ]:
# Now, we'll submit to okpy
_ = ok.submit()