"First Python Notebook" data preparation

By Ben Welsh

This notebook prepares a dataset to be used in the forthcoming class "First Python Notebook: Scripting your way to the story."

The course is scheduled to be taught for the first time at the Oct. 2 2016 "watchdog workshop" organized by Investigative Reporters and Editors at San Diego State University's school of journalism.

It will focus on teaching the basic skills of Python by analyzing the contributors to Proposition 64, which asks California voters this November to decide if the growth and sale of marijuana should be legalized.

The preparation effort is necessary because the raw data published at californiacivicdata.org is still difficult for beginners to navigate.

The goal is to create something simple so the class can focus on using Python to interrogate the data.

Import Python tools


In [1]:
import os
import requests
from datetime import datetime
from clint.textui import progress

In [2]:
import pandas
pandas.set_option('display.float_format', lambda x: '%.2f' % x)
pandas.set_option('display.max_columns', None)

Download raw data

Pull down raw CAL-ACCESS files published by the [California Civic Data Coalition's API](http://www.californiacivicdata.org/.


In [3]:
def download_csv(name):
    """
    Accepts the name of a calaccess.download CSV and returns its path.
    """
    path = os.path.join(os.getcwd(), '{}.csv'.format(name))
    if not os.path.exists(path):
        url = "http://calaccess.download/latest/{}.csv".format(name)
        r = requests.get(url, stream=True)
        with open(path, 'w') as f:
            total_length = int(r.headers.get('content-length'))
            for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1): 
                if chunk:
                    f.write(chunk)
                    f.flush()
    return path

Itemized receipts reported by campaigns


In [4]:
rcpt_path = download_csv("rcpt_cd")

In [5]:
ff_path = download_csv("filer_filings_cd")

Hack down RCPT_CD table outside of Python

The raw RCPT_CD table is nearly 2 gigabytes in size and too large to load into this Python notebook. Outside of this script, I split it into slices of 1,000,000 rows using the Linux command line.

$ split -l 1000000 rcpt_cd.csv rcpt_cd_part

I then cleaned up the file names a little and added the header row to all files using vim.

Import and trim the RCPT_CD table

The source file is believed to expand in chronological order as new filings are made with the state. Since we are only interested in studying contributions to a 2016 ballot, only the three latest slices are imported and trimmed.


In [6]:
def rcpt_part_to_dataframe(part_name):
    """
    Import a slide of the RCPT_CD table prepared for this notebook.
    """
    file_name = "rcpt_cd_parta{}.csv".format(part_name)
    path = os.path.join(os.getcwd(), file_name)
    return pandas.read_csv(path, sep=',', dtype="unicode")

In [7]:
itemized_receipts_df_h = rcpt_part_to_dataframe("h")

In [8]:
itemized_receipts_df_i = rcpt_part_to_dataframe("i")

In [9]:
itemized_receipts_df_j = rcpt_part_to_dataframe("j")

Concatenate them together into one DataFrame.


In [10]:
recent_itemized_receipts = pandas.concat([
    itemized_receipts_df_h,
    itemized_receipts_df_i,
    itemized_receipts_df_j
])

Remove amended filings


In [11]:
def remove_amended_filings(df):
    """
    Accepts a dataframe with FILING_ID and AMEND_ID files.
    
    Returns only the highest amendment for each unique filing id.
    """
    max_amendments = df.groupby('FILING_ID')['AMEND_ID'].agg("max").reset_index()
    merged_df = pandas.merge(df, max_amendments, how='inner', on=['FILING_ID', 'AMEND_ID'])
    print "Removed {} amendments".format(len(df)-len(merged_df))
    print "DataFrame now contains {} rows".format(len(merged_df))
    return merged_df

In [12]:
real_recent_itemized_receipts = remove_amended_filings(recent_itemized_receipts)


Removed 659557 amendments
DataFrame now contains 1960681 rows

Filter the table down to monetary contributions reported via Schedule A.


In [13]:
real_sked_a = real_recent_itemized_receipts[
    real_recent_itemized_receipts['FORM_TYPE'] == 'A'
]

Reduce the number of the headers down to the ones we want to keep


In [14]:
trimmed_itemized = real_sked_a[[
    'FILING_ID',
    'AMEND_ID',
    'CTRIB_NAMF',
    'CTRIB_NAML',
    'CTRIB_CITY',
    'CTRIB_ST',
    'CTRIB_ZIP4',
    'CTRIB_EMP',
    'CTRIB_OCC',
    'RCPT_DATE',
    'AMOUNT',
]]

Rename the ugly ones


In [15]:
clean_itemized = trimmed_itemized.rename(
    index=str,
    columns={
        "CTRIB_NAMF": "FIRST_NAME",
        "CTRIB_NAML": "LAST_NAME",
        "CTRIB_CITY": "CITY",
        "CTRIB_ST": "STATE",
        "CTRIB_ZIP4": "ZIPCODE",
        "CTRIB_EMP": "EMPLOYER",
        "CTRIB_OCC": "OCCUPATION",
        "RCPT_DATE": "DATE"
    }
)

Import and trim the FILER_FILINGS_CD table

Download and read in the table with links between the RCPT_CD table's FILING_ID field and the FILER_ID field that identifies which committee reported the receipt.


In [16]:
filer_filings_df = pandas.read_csv(ff_path, sep=',', index_col=False, dtype='unicode')

Since this table does not indicate if the filing is an amendment, let's just reduce it to the distinct connections between filers and filings.


In [17]:
filer_to_filing = filer_filings_df[['FILER_ID', 'FILING_ID']].drop_duplicates()

Import the shortlist of Prop. 64 committees we want to study

California's Proposition 64 asks voters if the growth and sale marijuana should be legalized in the state. As of September 20, California's Secretary of State reports that 16 million has been raised to campaign in support the measure, and 2 million to oppose it.

Here are the committees the state lists as supporting the measure.


In [19]:
supporting_committees = pandas.DataFrame([
    {"COMMITTEE_ID":"1343793","COMMITTEE_NAME":"Californians for Responsible Marijuana Reform, Sponsored by Drug Policy Action, Yes on Prop. 64"},
    {"COMMITTEE_ID":"1376077","COMMITTEE_NAME":"Californians for Sensible Reform, Sponsored by Ghost Management Group, LLC dba Weedmaps"},
    {"COMMITTEE_ID":"1385506","COMMITTEE_NAME":"Drug Policy Action - Non Profit 501c4, Yes on Prop. 64"},
    {"COMMITTEE_ID":"1385745","COMMITTEE_NAME":"Fund for Policy Reform (Nonprofit 501(C)(4))"},
    {"COMMITTEE_ID":"1371855","COMMITTEE_NAME":"Marijuana Policy Project of California"},
    {"COMMITTEE_ID":"1382525","COMMITTEE_NAME":"New Approach PAC (MPO)"},
    {"COMMITTEE_ID":"1386560","COMMITTEE_NAME":"The Adult Use Campaign for Proposition 64"},
    {"COMMITTEE_ID":"1381808","COMMITTEE_NAME":"Yes on 64, Californians to Control, Regulate and Tax Adult Use of Marijuana While Protecting Children, Sponsored by Business, Physicians, Environmental and Social-Justice Advocate Organizations"}
])
supporting_committees['COMMITTEE_POSITION'] = 'SUPPORT'

Here are the committees the state lists as opposing the measure.


In [20]:
opposing_committees = pandas.DataFrame([
    {"COMMITTEE_ID":"1382568","COMMITTEE_NAME":"No on Prop. 64, Sponsored by California Public Safety Institute"},
    {"COMMITTEE_ID":"1387789","COMMITTEE_NAME":"Sam Action, Inc., a Committee Against Proposition 64 with Help from Citizens (NonProfit 501(C)(4))"}
])
opposing_committees['COMMITTEE_POSITION'] = 'OPPOSE'

In [22]:
prop_64_committees = pandas.concat([supporting_committees, opposing_committees])

Join Prop. 64 committees to the contributions they've received


In [23]:
prop_64_filings = filer_to_filing.merge(
    prop_64_committees,
    how="inner",
    left_on='FILER_ID',
    right_on="COMMITTEE_ID"
)

In [24]:
prop_64_itemized = prop_64_filings.merge(
    clean_itemized,
    how="inner",
    left_on="FILING_ID",
    right_on="FILING_ID"
)

In [25]:
print len(prop_64_itemized)


318

Export the data to a CSV file


In [26]:
prop_64_itemized.drop('FILER_ID', axis=1, inplace=True)

In [27]:
prop_64_itemized.to_csv("./prop_64_contributions.csv", index=False)

Upload the data to the World Wide Web

I then uploaded that CSV export to Google Sheets, where it is published online here.

Caveats to keep in mind:

  • The file contains only contributions reported through June 30, 2016, when the most recent quarterly F460 filings were made. It does not include "late" expenditures reported via Form 497 form in the weeks leading up to the election.

  • It is also only monetary contributions reported via the Form 460's Schedule A. It does not include non-monetary contributions, nor does it include transfers and other cash infusions reported via Schedule I. That can add up to a significant amount of missing money. For instance, hundreds of thousands of dollars from George Soros were passed into the Drug Policy Action PAC from a previous ballot measure campaign in this filing's Schedule I.

  • Some campaigns here, such as Drug Policy Action, have passed on their contributions to other ballot measure committees. The result is that some real dollars are reflected twice. First, when they are reported as a receipt by the initial committee. Then again when they are reported as a receipt by the second committee after the first has passed it along.

  • As is always the case, these receipts are only those of "large donors" that campaigns are required to report when they surpass 200 dollars in gifts. Donors who have given less than that amount do not appear, and the summary sheets reported by the campaigns remain the authoritative source of their total fundraising.