Import USGS use data & Create usage table

Here we download the raw usage data for years 2000, 2005, and 2010 from the USGS water-use site and synthesize them into a single tidy table listing, for each year and county FIPS code, total annual withdrawals in MGal/day tagged by use sector, source class (surface or groundwater), and source type (fresh or saline). This table is formatted as follows:

STATE  FIPS   YEAR  UseClass      SrcClass     SrcType  Amount
AL     01001  2000  PublicSupply  Surface      Fresh    0.00
AL     01001  2000  Industrial    Groundwater  Saline   0.00

Workflow:

  • Pull the raw data file for a given year, in tab-delimited format, from the USGS server into a pandas dataframe.
    • Care must be taken that the FIPS codes are preserved as text, not numbers, so that leading zeros survive.
  • Fix some inconsistencies in nomenclature across years in the USGS data. This includes:
    • For year 2000, remapping 'IT', 'LA', 'LS', and 'PE' fields to 'IR', 'AQ', 'LI' and 'PC', respectively (reference).
      • IT -> IR (Irrigated cropland)
      • LA -> AQ (Aquaculture)
      • LS -> LI (Livestock)
      • PE -> PC (Thermoelectric power closed-loop)
    • For year 2005, remapping 'LA' and 'LS' fields to 'AQ' and 'LI', respectively (reference).
      • LA -> AQ (Aquaculture)
      • LS -> LI (Livestock)
    • Adding year columns to the 2000 and 2005 datasets.
    • Fixing a capitalization inconsistency in the total withdrawal fields (Wtotl to WTotl).
    • Removing extraneous fields.
  • Re-arrange data into a tidy format to facilitate additional analyses.
  • Append all tables into a single dataframe, with records tagged by the year of the dataset.
  • Remove rows with null values
  • Save the table to a file for later analysis.
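The FIPS caveat above matters because pandas will otherwise parse a code like 01001 as an integer and drop the leading zero. A minimal illustration of the problem and the fix, using a small inline sample rather than the actual USGS file:

```python
import io
import pandas as pd

raw = "FIPS\tAmount\n01001\t5.70\n01003\t19.70\n"

# Read without a dtype hint: FIPS is parsed as an integer and loses its leading zero
bad = pd.read_csv(io.StringIO(raw), sep='\t')
print(bad['FIPS'].iloc[0])   # 1001

# Read with the dtype hint and zfill, as done below: the 5-character code survives
good = pd.read_csv(io.StringIO(raw), sep='\t', dtype={'FIPS': str})
good['FIPS'] = good['FIPS'].apply(lambda x: str(x).zfill(5))
print(good['FIPS'].iloc[0])  # 01001
```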

This table can then easily be joined, by YEAR and FIPS code, to other accounting data tables, and summarized by state.
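As a sketch of the kind of summary this format enables (using made-up rows in the output schema, not actual USGS values):

```python
import pandas as pd

# Hypothetical tidy rows in the output format described above
dfUse = pd.DataFrame({
    'STATE':    ['AL', 'AL', 'NC'],
    'FIPS':     ['01001', '01003', '37001'],
    'YEAR':     [2000, 2000, 2000],
    'UseClass': ['PublicSupply', 'PublicSupply', 'Irrigation'],
    'SrcClass': ['Surface', 'Groundwater', 'Surface'],
    'SrcType':  ['Fresh', 'Fresh', 'Fresh'],
    'Amount':   [5.7, 19.7, 3.2],
})

# Total freshwater withdrawals by state, year, and use class (MGal/day)
summary = (dfUse[dfUse.SrcType == 'Fresh']
           .groupby(['STATE', 'YEAR', 'UseClass'])['Amount']
           .sum()
           .reset_index())
print(summary)
```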


In [1]:
#Import modules required for analysis
import os
import pandas as pd
import numpy as np

In [2]:
#Create prefix remapping dictionaries to standardize column names for all years to 2010
remapDict = {'IT-':'IR-', #Irrigated (total)
             'LA-':'AQ-', #Aquaculture
             'LS-':'LI-', #Livestock
             'PE-':'PC-', #Closed-loop thermo electric
             'Wtotl':'WTotl' #Capitalization mismatch
            }

In [3]:
#Create a function for importing a table and standardizing all field names
def importTable(year, remapDict):
    
    #Set the download URL using with the year
    theURL = 'http://water.usgs.gov/watuse/data/{0}/usco{0}.txt'.format(year) 
    
    #Download the dataset to a data frame (keeping the FIPS attribute as a string)
    df = pd.read_table(theURL,dtype={'FIPS':str})
    
    #Ensure that the FIPS column maintains 5 characters
    df['FIPS'] = df['FIPS'].apply(lambda x: str(x).zfill(5))
    
    #Remove the STATEFIPS and COUNTYFIPS columns (as they are not needed)
    df.drop("STATEFIPS",axis=1,inplace=True)
    df.drop("COUNTYFIPS",axis=1,inplace=True)
    
    #Use the remap dictionary to rename columns
    
    #Get the current column names as a list
    colNames = df.columns.values.tolist()
    
    for inFld,outFld in remapDict.items():
        #This loops through each item in colNames and replaces it with a revised one
        colNames_update = [x.replace(inFld,outFld) for x in colNames]
        colNames = colNames_update

    #Update the column names in the data frame
    df.columns = colNames
    
    #Add year field, if not present
    if "YEAR" not in df.columns.values: 
        df.insert(1,"YEAR",year)
        
    #Remove unnamed columns, which oddly appear in 2005 and 2010 datasets
    if "Unnamed" in df.columns.values[-1]:
        df.drop(df.columns.values[-1],axis=1,inplace=True)
        
    #Status
    print("{} records and {} attributes returned for year {}".format(df.shape[0],df.shape[1],year))
        
    #Return the data frame
    return df

In [4]:
#Get the tables
df2000 = importTable(2000,remapDict)
df2005 = importTable(2005,remapDict)
df2010 = importTable(2010,remapDict)


3222 records and 69 attributes returned for year 2000
3224 records and 106 attributes returned for year 2005
3224 records and 115 attributes returned for year 2010

In [5]:
#CHECK: example output with a column for each supply/source/type grouping.
df2000.head()


Out[5]:
STATE YEAR FIPS TP-TotPop PS-TOPop PS-WGWFr PS-WSWFr PS-WFrTo DO-SSPop DO-WGWFr ... PC-WTotl TO-WGWFr TO-WGWSa TO-WGWTo TO-WSWFr TO-WSWSa TO-WSWTo TO-WFrTo TO-WSaTo TO-WTotl
0 AL 2000 01001 43.67 39.30 5.70 0.00 5.70 4.37 2.95 ... 0.0 10.65 0.0 10.65 26.60 0.0 26.60 37.25 0.0 37.25
1 AL 2000 01003 140.42 107.61 19.70 0.00 19.70 32.81 2.46 ... 0.0 27.04 0.0 27.04 7.59 0.0 7.59 34.63 0.0 34.63
2 AL 2000 01005 29.04 22.27 4.85 0.00 4.85 6.77 0.51 ... 0.0 6.36 0.0 6.36 1.06 0.0 1.06 7.42 0.0 7.42
3 AL 2000 01007 20.83 17.93 3.78 0.00 3.78 2.90 0.22 ... 0.0 4.00 0.0 4.00 0.00 0.0 0.00 4.00 0.0 4.00
4 AL 2000 01009 51.02 45.90 2.55 29.26 31.81 5.12 0.38 ... 0.0 2.93 0.0 2.93 29.33 0.0 29.33 32.26 0.0 32.26

5 rows × 69 columns


In [6]:
#"Melt" data so each use/class/type becomes a new row
df2000m = pd.melt(df2000,id_vars=['STATE','FIPS','YEAR'],var_name='Class',value_name='Amount')
df2005m = pd.melt(df2005,id_vars=['STATE','FIPS','YEAR'],var_name='Class',value_name='Amount')
df2010m = pd.melt(df2010,id_vars=['STATE','FIPS','YEAR'],var_name='Class',value_name='Amount')

In [7]:
#CHECK: Observe how the data has been 'melted'
df2000m.head()


Out[7]:
STATE FIPS YEAR Class Amount
0 AL 01001 2000 TP-TotPop 43.67
1 AL 01003 2000 TP-TotPop 140.42
2 AL 01005 2000 TP-TotPop 29.04
3 AL 01007 2000 TP-TotPop 20.83
4 AL 01009 2000 TP-TotPop 51.02

In [8]:
#Merge all records into a single table
dfUse = pd.concat([df2000m, df2005m, df2010m],ignore_index=True)

In [10]:
#CHECK: the dfUse frame should have as many rows as the melted 2000, 2005, and 2010 tables combined
dfUse.shape


Out[10]:
(905812, 5)

In [11]:
#Create remap dictionaries. These enable extraction of specific sub-attributes
# contained in the Class field. For example, PS-WGWFr refers to Fresh groundwater
# withdrawn for Public Supply.
useClassMap = {'TP':'Population',
               'PS':'PublicSupply',
               'DO':'Domestic',
               'IN':'Industrial',
               'IR':'Irrigation',
               'IC':'Irrigation_Crop',
               'IG':'Irrigation_Golf',
               'LI':'Livestock',
               'AQ':'Aquaculture',
               'MI':'Mining',
               'PT':'ThermoElec',
               'PO':'ThermoElec_OnceThru',
               'PC':'ThermoElec_Recirc',
               'TO':'Total'
              }

srcClassMap = {'-WGW':'Groundwater',
               '-WSW':'Surface'
              }

srcTypeMap = {'Fr':'Fresh',
              'Sa':'Saline'
             }

In [12]:
#Set the use classes (PS, DO, LI, etc.)
for code,value in useClassMap.items():
    dfUse.loc[dfUse.Class.str.startswith(code), 'UseClass'] = value

In [13]:
#Set the source classes (surface or groundwater)
for code,value in srcClassMap.items():
    dfUse.loc[dfUse.Class.str.contains(code), 'SrcClass'] = value

In [14]:
#Set the source types (fresh or saline)
for code,value in srcTypeMap.items():
    dfUse.loc[dfUse.Class.str.endswith(code), 'SrcType'] = value
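For reference, the three remap loops above could also be collapsed into a single vectorized pass with a regular expression over the prefix-infix-suffix Class codes. This is a sketch (not part of the original workflow), shown on a few sample codes; rows whose code does not carry all three parts come back as NaN, mirroring the nulls produced by the loops:

```python
import pandas as pd

# Sample Class codes in the USGS prefix-infix-suffix format
s = pd.Series(['PS-WGWFr', 'IN-WSWSa', 'TP-TotPop'])

# Capture the use prefix, source infix (GW/SW), and type suffix (Fr/Sa) in one pass
parts = s.str.extract(r'^(\w{2})-W(GW|SW)(Fr|Sa)$', expand=True)
parts.columns = ['UseCode', 'SrcCode', 'TypeCode']
print(parts)
```

The extracted codes could then be mapped through the same dictionaries to produce the UseClass, SrcClass, and SrcType columns.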

In [22]:
#CHECK: List classes with null UseClass remaps
dfUse[pd.isnull(dfUse['UseClass'])].Class.unique()


Out[22]:
array(['COUNTY'], dtype=object)

In [24]:
#CHECK: List classes with null SrcClass remaps
dfUse[pd.isnull(dfUse['SrcClass'])].Class.unique()


Out[24]:
array(['TP-TotPop', 'PS-TOPop', 'PS-WFrTo', 'DO-SSPop', 'DO-WFrTo',
       'IN-WFrTo', 'IN-WSaTo', 'IN-WTotl', 'IR-IrSpr', 'IR-IrMic',
       'IR-IrSur', 'IR-IrTot', 'IR-WFrTo', 'AQ-WFrTo', 'LI-WFrTo',
       'MI-WFrTo', 'MI-WSaTo', 'MI-WTotl', 'PT-WFrTo', 'PT-WSaTo',
       'PT-WTotl', 'PC-WFrTo', 'PC-WSaTo', 'PC-WTotl', 'TO-WFrTo',
       'TO-WSaTo', 'TO-WTotl', 'PS-GWPop', 'PS-SWPop', 'PS-WSaTo',
       'PS-WTotl', 'DO-PSDel', 'DO-TOTAL ', 'IC-WFrTo', 'IC-IrSpr',
       'IC-IrMic', 'IC-IrSur', 'IC-IrTot', 'IG-WFrTo', 'IG-IrSpr',
       'IG-IrMic', 'IG-IrSur', 'IG-IrTot', 'PT-Power', 'PO-WFrTo',
       'PO-WSaTo', 'PO-WTotl', 'PO-Power', 'PC-Power', 'COUNTY',
       'DO-SSPCp', 'DO-PSPCp', 'AQ-WSaTo', 'AQ-WTotl'], dtype=object)

In [25]:
#CHECK: List classes with null SrcType remaps
dfUse[pd.isnull(dfUse['SrcType'])].Class.unique()


Out[25]:
array(['TP-TotPop', 'PS-TOPop', 'PS-WFrTo', 'DO-SSPop', 'DO-WFrTo',
       'IN-WGWTo', 'IN-WSWTo', 'IN-WFrTo', 'IN-WSaTo', 'IN-WTotl',
       'IR-IrSpr', 'IR-IrMic', 'IR-IrSur', 'IR-IrTot', 'IR-WFrTo',
       'AQ-WFrTo', 'LI-WFrTo', 'MI-WGWTo', 'MI-WSWTo', 'MI-WFrTo',
       'MI-WSaTo', 'MI-WTotl', 'PT-WSWTo', 'PT-WFrTo', 'PT-WSaTo',
       'PT-WTotl', 'PO-WSWTo', 'PC-WSWTo', 'PC-WFrTo', 'PC-WSaTo',
       'PC-WTotl', 'TO-WGWTo', 'TO-WSWTo', 'TO-WFrTo', 'TO-WSaTo',
       'TO-WTotl', 'PS-GWPop', 'PS-SWPop', 'PS-WGWTo', 'PS-WSWTo',
       'PS-WSaTo', 'PS-WTotl', 'DO-PSDel', 'DO-TOTAL ', 'IC-WFrTo',
       'IC-IrSpr', 'IC-IrMic', 'IC-IrSur', 'IC-IrTot', 'IG-WFrTo',
       'IG-IrSpr', 'IG-IrMic', 'IG-IrSur', 'IG-IrTot', 'PT-WGWTo',
       'PT-Power', 'PO-WGWTo', 'PO-WFrTo', 'PO-WSaTo', 'PO-WTotl',
       'PO-Power', 'PC-WGWTo', 'PC-Power', 'COUNTY', 'DO-SSPCp',
       'DO-PSPCp', 'AQ-WGWTo', 'AQ-WSWTo', 'AQ-WSaTo', 'AQ-WTotl'], dtype=object)

In [26]:
#Remove rows (axis=0) with a null value in any column
dfOutput = dfUse.dropna(axis=0,how='any')

In [27]:
#CHECK: see that new columns were added to decode the Class
dfOutput.head()


Out[27]:
STATE FIPS YEAR Class Amount UseClass SrcClass SrcType
6444 AL 01001 2000 PS-WGWFr 5.7 PublicSupply Groundwater Fresh
6445 AL 01003 2000 PS-WGWFr 19.7 PublicSupply Groundwater Fresh
6446 AL 01005 2000 PS-WGWFr 4.85 PublicSupply Groundwater Fresh
6447 AL 01007 2000 PS-WGWFr 3.78 PublicSupply Groundwater Fresh
6448 AL 01009 2000 PS-WGWFr 2.55 PublicSupply Groundwater Fresh

In [28]:
#Set the output location and filename
dataDir = '../../Data'
outFN = 'UsageDataTidy.csv'
dfOutput.to_csv(os.path.join(dataDir, outFN), index=False)
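When reading the saved table back for later analysis, note that to_csv does not preserve dtypes, so the same FIPS dtype hint is needed again. A minimal round-trip sketch (using an in-memory buffer in place of the actual output file):

```python
import io
import pandas as pd

# Round-trip a small frame to show the dtype hint is needed on re-read
df = pd.DataFrame({'FIPS': ['01001'], 'Amount': [5.7]})
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Without dtype={'FIPS': str}, the leading zero would be lost on re-read
dfBack = pd.read_csv(buf, dtype={'FIPS': str})
print(dfBack['FIPS'].iloc[0])  # 01001
```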