A Jupyter notebook (Python 3) by Peter Donovan, info@soilcarboncoalition.org
Open data is not just a thing or a tool. It's a behavior, based on beliefs. This notebook is a way of sharing my methods and assumptions, and if you use the same or similar tools (R instead of Python, for example) you can retrace these steps. I hope it can also serve as a guide for me as well as for others who want to do similar things.
With crop insurance, as with any data set, looking at the data is a good way of learning about its particulars if not its intentions. Some knowledge of the context or domain of the data is usually required.
For background on federal crop insurance, the following may be a start:
Dennis Shields' 2015 report from the Congressional Research Service: https://fas.org/sgp/crs/misc/R40532.pdf
Environmental Working Group's material on crop insurance, which includes interactive maps showing rate of return (payouts compared to premiums) on some crops by county from 2001 through 2014: http://www.ewg.org/research/crop-insurance-lottery. The average federal subsidy for crop insurance premiums is about 60%.
The Natural Resources Defense Council has a 2013 paper on crop insurance, https://www.nrdc.org/sites/default/files/soil-matters-IP.pdf. This paper suggests that crop insurance could be reformed to reward low-risk farming that also provides environmental benefits.
A starting hypothesis: federally subsidized crop insurance, while it sustains the economic viability of many farm businesses, might also tend to replace soil health and function as the foundation of a viable agriculture.
To investigate the hypothesis, we'll start by compiling data.
First, we get data. Download and unzip the data files from the USDA Risk Management Agency website: http://www.rma.usda.gov/data/cause.html. The complete data for each year is under the "Summary of Business with Month of Loss" header. So far I am using the 2014 through 2016 data. You can get the column headers from the same web page as a Word or PDF document.
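If you prefer to fetch and extract an archive from within Python, here is a minimal sketch; the exact zip file name and URL path are assumptions, so copy the real link from the page above.
#hedged sketch: download and extract one year's archive
#the URL below is an assumption -- copy the actual link from the RMA page
import urllib.request
import zipfile

url = 'http://www.rma.usda.gov/data/col/colsom_2014.zip'  #assumed; verify on the site
local_zip = 'colsom_2014.zip'
urllib.request.urlretrieve(url, local_zip)
with zipfile.ZipFile(local_zip) as z:
    z.extractall('.')  #should yield a text file such as colsom14.txt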
In [1]:
#some usual imports, including some options for displaying large currency amounts with commas and only 2 decimals
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.float_format', '{:,.2f}'.format)
pd.set_option('display.precision',2)
From http://www.rma.usda.gov/data/cause we see that years 2010 through 2016 are available as zip archives under Summary of Business. With a slower connection it is better to download and extract the zip archives outside of this notebook. Each archive contains a text file such as colsom14.txt, which will serve as the example for this notebook.
Unzip the file and inspect it with a text editor. Pipe characters separate the fields, sometimes with runs of spaces before or after them. There are no column headers; we'll add those next.
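You can also preview the first few raw lines in Python instead of a text editor (the path is where I extracted my copy; adjust it for yours):
#peek at the first few raw lines to confirm the pipe-delimited layout
with open('/Users/Peter/Documents/atlas/RMA/colsom14.txt') as f:
    for _ in range(3):
        print(f.readline().rstrip())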
In [2]:
df = pd.read_csv('/Users/Peter/Documents/atlas/RMA/colsom14.txt',sep='|',header=None)
df.shape #this counts rows, columns in our dataframe
Out[2]:
The column headers are supplied in a Word document (Record layout: Word) from the same web page. They differ for 2010-2014 and from 2015 forward. Format them as a Python list of strings as follows, and add them to the dataframe.
In [3]:
the_columns_2014 = ['Crop Year Identifier', 'State Code', 'State Abbreviation ',
'County Code', 'County Name', 'Crop Code', 'Crop Name',
'Insurance Plan Code', 'Insurance Plan Name Abbreviation',
'Coverage Category', 'Stage Code', 'Cause of Loss Code',
'Cause of Loss Description', 'Month of Loss', 'Month of Loss Name',
'Policies Earning Premium', 'Policies Indemnified', 'Net Planted Acres',
'Liability', 'Total Premium', 'Subsidy',
'Determined Acres', 'Indemnity Amount', 'Loss Ratio']
the_columns_15_16 = ['Crop Year Identifier', 'State Code', 'State Abbreviation ',
'County Code', 'County Name', 'Crop Code', 'Crop Name',
'Insurance Plan Code', 'Insurance Plan Name Abbreviation',
'Coverage Category', 'Stage Code', 'Cause of Loss Code',
'Cause of Loss Description', 'Month of Loss', 'Month of Loss Name',
'Policies Earning Premium', 'Policies Indemnified', 'Net Planted Acres',
'Net Endorsed Acres', 'Liability', 'Total Premium', 'Subsidy',
'Determined Acres', 'Indemnity Amount', 'Loss Ratio']
df.columns = the_columns_2014 #this adds our column headers
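If you load several years in one pass, one way to pick the right header list is to match on the number of columns in the file; the only layout difference is the extra 'Net Endorsed Acres' column from 2015 on. A sketch:
#choose the header list that matches the number of columns actually read
if df.shape[1] == len(the_columns_2014):
    df.columns = the_columns_2014
elif df.shape[1] == len(the_columns_15_16):
    df.columns = the_columns_15_16
else:
    raise ValueError('unexpected number of columns: {}'.format(df.shape[1]))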
There are spaces on either side of some of the fields. We can use str.strip() to remove them.
In [4]:
#strip excess white space from the string columns (numeric columns don't need it)
cols_w_spaces = ['County Name','Crop Name','Insurance Plan Name Abbreviation','Cause of Loss Description']
for item in cols_w_spaces:
    df[item] = df[item].str.strip()
#check the result
print(list(df.loc[1187]))
The state and county location codes are numeric (int64). FIPS (Federal Information Processing Standard) codes for counties are 5-digit strings, so we'll pad the codes with zeros using str.zfill(). This will come in handy for mapping, when we will want to merge or join our data with county boundaries using the FIPS code.
In [5]:
#convert to strings and pad with zeros: 2 digits for state, 3 for county
df['State Code'] = df['State Code'].astype(str).str.zfill(2)
df['County Code'] = df['County Code'].astype(str).str.zfill(3)
#add FIPS or id column and test
df['FIPS'] = df['State Code'] + df['County Code']
df['FIPS'][10] #to make sure we have a 5-digit string, not a number
Out[5]:
In [6]:
#group by county and sum the indemnities, largest first
counties = df.groupby(['FIPS','County Name'])
aggregated = counties.agg({'Indemnity Amount': np.sum})
aggregated.sort_values('Indemnity Amount',ascending=False)
Out[6]:
In [7]:
#convert both index levels (FIPS, County Name) back into ordinary columns
aggregated.reset_index(inplace=True)
In [8]:
#rename columns for convenience
aggregated.rename(columns={'County Name': 'name', 'FIPS': 'id', 'Indemnity Amount': 'indemnity'}, inplace=True)
#convert to $millions
aggregated['indemnity']=aggregated['indemnity']/1000000
#reorder columns and write to tab-separated tsv file for d3 mapping
aggregated = aggregated[['id','name','indemnity']]
aggregated.to_csv('/Users/Peter/Documents/atlas/RMA/indemnity2014.tsv', sep='\t', index=False)
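The FIPS string in the id column is what a map will join on. If you want to sanity-check that join in pandas first, here is a sketch with a tiny stand-in table in place of real county boundary data:
#hedged sketch: county_fips_table stands in for real county boundary/attribute data keyed by FIPS
county_fips_table = pd.DataFrame({'FIPS': ['01001', '01003'], 'county': ['Autauga', 'Baldwin']})
county_fips_table.merge(aggregated, left_on='FIPS', right_on='id', how='left')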
In [9]:
#total indemnities by cause of loss, largest first
df.groupby('Cause of Loss Description').agg({'Indemnity Amount':np.sum}).sort_values('Indemnity Amount',ascending=False)
Out[9]:
In [10]:
causes_2014 = df.groupby('Cause of Loss Description')['Indemnity Amount'].sum()
causes_2014.sort_values(ascending=False)
Out[10]:
In [11]:
#to generate a table of total indemnities by Cause of Loss, you can export a csv
causes_2014.to_csv('/Users/Peter/Documents/atlas/RMA/causes_2014.csv')
'Excess Moisture/Precip/Rain' and 'Drought' account for by far the largest indemnity totals. Let's filter the dataframe by these two causes, so we can see which counties had indemnities for both, and how much.
In [12]:
rain = df[df['Cause of Loss Description']=='Excess Moisture/Precip/Rain']
drought = df[df['Cause of Loss Description']=='Drought']
print(rain.shape, drought.shape)
Now group each dataframe by county, summing the indemnity amounts.
In [13]:
g_rain = rain.groupby(['FIPS','County Name']).agg({'Indemnity Amount':np.sum})
g_drought = drought.groupby(['FIPS','County Name']).agg({'Indemnity Amount':np.sum})
together=pd.concat([g_rain,g_drought],axis=1)
together.columns = ['moisture','drought']
together.head()
Out[13]:
Let's add two columns, a total, and a ratio of moisture to drought.
In [14]:
#counties with indemnities for only one of the two causes get NaN here, and drop out of the ratio filter below
together['total'] = together.moisture + together.drought
together['ratio'] = together.moisture / together.drought
together.head(20)
Out[14]:
In [15]:
#keep counties where neither cause dominates: moisture-to-drought ratio between 0.25 and 4
mixed = together[(together.ratio < 4) & (together.ratio > .25)]
mixed.shape
Out[15]:
In [16]:
#convert both index levels back into ordinary columns
mixed.reset_index(inplace=True)
In [17]:
mixed = mixed.rename(columns={'total':'indemnity'})
In [18]:
#convert to $millions, as before
mixed.indemnity = mixed.indemnity/1000000
In [19]:
mixed.to_csv('/Users/Peter/Documents/atlas/RMA/moisture_plus_drought_2014.tsv', sep='\t', index=False)
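The 2015 and 2016 archives follow the same steps with the longer column list. A sketch of reading all three years into one dataframe, assuming the extracted text files follow the colsomYY.txt naming pattern:
#hedged sketch: read 2014-2016 into one dataframe, matching each file to its header list
#2014 rows will carry NaN in 'Net Endorsed Acres', which the 2014 layout does not include
frames = []
for year, cols in [(14, the_columns_2014), (15, the_columns_15_16), (16, the_columns_15_16)]:
    d = pd.read_csv('/Users/Peter/Documents/atlas/RMA/colsom{}.txt'.format(year), sep='|', header=None)
    d.columns = cols
    frames.append(d)
all_years = pd.concat(frames, ignore_index=True)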