The raw data tables mix together filings from different reporting periods (e.g. quarterlys vs. semi-annual vs. pre-elections). But we need these filings to be sorted (or at least sortable) so that or users, for example, can compare the performance of two candidates in the same reporting period.
There are two vectors at play here:
This notebook is pulling data from the downloads-website's dev database, which was last updated on...
In [2]:
from calaccess_processed.models.tracking import ProcessedDataVersion
In [3]:
ProcessedDataVersion.objects.latest()
Out[3]:
Will also need to execute some raw SQL, so I'll import a helper function in order to make the results more readable:
In [4]:
from project import sql_to_agate
Let's start by examining the distinct values of the statement type on CVR_CAMPAIGN_DISCLOSURE_CD
. And let's narrow the scope to only the Form 460 filings.
In [5]:
sql_to_agate(
"""
SELECT UPPER("STMT_TYPE"), COUNT(*)
FROM "CVR_CAMPAIGN_DISCLOSURE_CD"
WHERE "FORM_TYPE" = 'F460'
GROUP BY 1
ORDER BY COUNT(*) DESC;
"""
).print_table()
Not all of these values are defined, as previously noted in our docs:
PR
might be pre-electionQS
is pro probably quarterly statementYE
might be...I don't know "Year-end"?S
is probably semi-annualMaybe come back later and look at the actual filings. There aren't that many.
There's another similar-named column on FILER_FILINGS_CD
, but this seems to be a completely different thing:
In [6]:
sql_to_agate(
"""
SELECT FF."STMNT_TYPE", LU."CODE_DESC", COUNT(*)
FROM "FILER_FILINGS_CD" FF
JOIN "LOOKUP_CODES_CD" LU
ON FF."STMNT_TYPE" = LU."CODE_ID"
AND LU."CODE_TYPE" = 10000
GROUP BY 1, 2;
"""
).print_table()
One of the tables that caught my eye is FILING_PERIOD_CD
, which appears to have a row for each quarterly filing period:
In [7]:
sql_to_agate(
"""
SELECT *
FROM "FILING_PERIOD_CD"
"""
).print_table()
Every period is described as a quarter, and the records are equally divided among them:
In [8]:
sql_to_agate(
"""
SELECT "PERIOD_DESC", COUNT(*)
FROM "FILING_PERIOD_CD"
GROUP BY 1;
"""
).print_table()
The difference between every START_DATE
and END_DATE
is actually a three-month interval:
In [9]:
sql_to_agate(
"""
SELECT "END_DATE" - "START_DATE" AS duration, COUNT(*)
FROM "FILING_PERIOD_CD"
GROUP BY 1;
"""
).print_table()
And they have covered every year between 1973 and 2334 (how optimistic!):
In [10]:
sql_to_agate(
"""
SELECT DATE_PART('year', "START_DATE")::int as year, COUNT(*)
FROM "FILING_PERIOD_CD"
GROUP BY 1
ORDER BY 1 DESC;
"""
).print_table()
Filings are linked to filing periods via FILER_FILINGS_CD.PERIOD_ID
. While that column is not always populated, it is if you limit your results to just the Form 460 filings:
In [11]:
sql_to_agate(
"""
SELECT ff."PERIOD_ID", fp."START_DATE", fp."END_DATE", fp."PERIOD_DESC", COUNT(*)
FROM "FILER_FILINGS_CD" ff
JOIN "CVR_CAMPAIGN_DISCLOSURE_CD" cvr
ON ff."FILING_ID" = cvr."FILING_ID"
AND ff."FILING_SEQUENCE" = cvr."AMEND_ID"
AND cvr."FORM_TYPE" = 'F460'
JOIN "FILING_PERIOD_CD" fp
ON ff."PERIOD_ID" = fp."PERIOD_ID"
GROUP BY 1, 2, 3, 4
ORDER BY fp."START_DATE" DESC;
"""
).print_table()
Also, is Schwarzenegger running this cycle? Who else could be filing from so far into the future?
AAANNNNYYYway...Also need to check to make sure the join between FILER_FILINGS_CD
and CVR_CAMPAIGN_DISCLOSURE_CD
isn't filtering out too many filings:
In [12]:
sql_to_agate(
"""
SELECT cvr."FILING_ID", cvr."FORM_TYPE", cvr."FILER_NAML"
FROM "CVR_CAMPAIGN_DISCLOSURE_CD" cvr
LEFT JOIN "FILER_FILINGS_CD" ff
ON cvr."FILING_ID" = ff."FILING_ID"
AND cvr."AMEND_ID" = ff."FILING_SEQUENCE"
WHERE cvr."FORM_TYPE" = 'F460'
AND (ff."FILING_ID" IS NULL OR ff."FILING_SEQUENCE" IS NULL)
ORDER BY cvr."FILING_ID";
"""
).print_table(max_column_width=60)
So only a handful, mostly local campaigns or just nonsense test data.
So another important thing to check is how well these the dates from the filing period look-up records line up with the dates on the Form 460 filing records. It would be bad if the CVR_CAMPAIGN_DISCLOSURE_CD.FROM_DATE
were before FILING_PERIOD_CD.START_DATE
or if CVR_CAMPAIGN_DISCLOSURE_CD.THRU_DATE
were after FILING_PERIOD_CD.END_DATE
.
In [13]:
sql_to_agate(
"""
SELECT
CASE
WHEN cvr."FROM_DATE" < fp."START_DATE" THEN 'filing from_date before period start_date'
WHEN cvr."THRU_DATE" > fp."END_DATE" THEN 'filing thru_date after period end_date'
ELSE 'okay'
END as test,
COUNT(*)
FROM "CVR_CAMPAIGN_DISCLOSURE_CD" cvr
JOIN "FILER_FILINGS_CD" ff
ON cvr."FILING_ID" = ff."FILING_ID"
AND cvr."AMEND_ID" = ff."FILING_SEQUENCE"
JOIN "FILING_PERIOD_CD" fp
ON ff."PERIOD_ID" = fp."PERIOD_ID"
WHERE cvr."FORM_TYPE" = 'F460'
GROUP BY 1;
"""
).print_table(max_column_width=60)
So half of the time, the THRU_DATE
on the filing is later than the FROM_DATE
on the filing period. How big of a difference can exist between these two dates?
In [14]:
sql_to_agate(
"""
SELECT
cvr."THRU_DATE" - fp."END_DATE" as date_diff,
COUNT(*)
FROM "CVR_CAMPAIGN_DISCLOSURE_CD" cvr
JOIN "FILER_FILINGS_CD" ff
ON cvr."FILING_ID" = ff."FILING_ID"
AND cvr."AMEND_ID" = ff."FILING_SEQUENCE"
JOIN "FILING_PERIOD_CD" fp
ON ff."PERIOD_ID" = fp."PERIOD_ID"
WHERE cvr."FORM_TYPE" = 'F460'
AND cvr."THRU_DATE" > fp."END_DATE"
GROUP BY 1
ORDER BY COUNT(*) DESC;
"""
).print_table(max_column_width=60)
Ugh. Looks like, in most of the problem cases, the from date can be a whole quarter later than the end date of the filing period. Let's take a closer look at these...
In [15]:
sql_to_agate(
"""
SELECT
cvr."FILING_ID",
cvr."AMEND_ID",
cvr."FROM_DATE",
cvr."THRU_DATE",
fp."START_DATE",
fp."END_DATE"
FROM "CVR_CAMPAIGN_DISCLOSURE_CD" cvr
JOIN "FILER_FILINGS_CD" ff
ON cvr."FILING_ID" = ff."FILING_ID"
AND cvr."AMEND_ID" = ff."FILING_SEQUENCE"
JOIN "FILING_PERIOD_CD" fp
ON ff."PERIOD_ID" = fp."PERIOD_ID"
WHERE cvr."FORM_TYPE" = 'F460'
AND 90 < cvr."THRU_DATE" - fp."END_DATE"
AND cvr."THRU_DATE" - fp."END_DATE" < 93
ORDER BY cvr."THRU_DATE" DESC;
"""
).print_table(max_column_width=60)
So, actually, this sort of makes sense: Quarterly filings are for three month intervals, while the semi-annual filings are for six month intervals. And FILING_PERIOD_CD
only has records for three month intervals. Let's test this theory by getting the distinct CVR_CAMPAIGN_DISCLOSURE_CD.STMT_TYPE
values from these records:
In [16]:
sql_to_agate(
"""
SELECT UPPER(cvr."STMT_TYPE"), COUNT(*)
FROM "CVR_CAMPAIGN_DISCLOSURE_CD" cvr
JOIN "FILER_FILINGS_CD" ff
ON cvr."FILING_ID" = ff."FILING_ID"
AND cvr."AMEND_ID" = ff."FILING_SEQUENCE"
JOIN "FILING_PERIOD_CD" fp
ON ff."PERIOD_ID" = fp."PERIOD_ID"
WHERE cvr."FORM_TYPE" = 'F460'
AND 90 < cvr."THRU_DATE" - fp."END_DATE"
AND cvr."THRU_DATE" - fp."END_DATE" < 93
GROUP BY 1
ORDER BY COUNT(*) DESC;
"""
).print_table(max_column_width=60)
At least this is mostly true.
In [ ]: