Import agate, python based data analysis


In [1]:
import agate

The well database is a big one so I'm going to force agate to take NrdPermit field as a text and PostalCD aka zip code as text also. Finally, limiting the number of rows agate checks helps it to run this code faster as well.


In [2]:
tester = agate.TypeTester(force = {
        'NrdPermit': agate.Text(),
        'PostalCD': agate.Text()
    }, limit=100)
wells = agate.Table.from_csv('wells.csv', column_types=tester)

In [3]:
print(wells)


|---------------+---------------|
|  column_names | column_types  |
|---------------+---------------|
|  X            | Number        |
|  Y            | Number        |
|  WellID       | Number        |
|  RegCD        | Text          |
|  Replacemen   | Number        |
|  Status       | Text          |
|  Useid        | Text          |
|  NrdName      | Text          |
|  NrdID        | Number        |
|  Countyname   | Text          |
|  CountyID     | Number        |
|  Township     | Number        |
|  Range        | Number        |
|  RangeDir     | Text          |
|  Section      | Number        |
|  SubSection   | Text          |
|  FootageNS    | Number        |
|  FootageDir   | Text          |
|  FootageEW    | Number        |
|  FootageD_1   | Text          |
|  NrdPermit    | Text          |
|  Acres        | Number        |
|  SeriesType   | Text          |
|  SeriesEnd    | Date          |
|  PumpRate     | Number        |
|  PColDiam     | Number        |
|  PumpDepth    | Number        |
|  TotalDepth   | Number        |
|  SWL          | Number        |
|  PWL          | Number        |
|  CertifID     | Number        |
|  OwnerID      | Number        |
|  FirstName    | Text          |
|  LastName     | Text          |
|  Address      | Text          |
|  CityNameID   | Text          |
|  StateRID     | Text          |
|  PostalCD     | Text          |
|  RegDate      | Date          |
|  Compdate     | Date          |
|  LastChgDat   | Date          |
|  DecommDate   | Date          |
|  LatDD        | Number        |
|  LongDD       | Number        |
|  CalcGPS      | Number        |
|---------------+---------------|

Since I'm going to find if a well is active during a given year, I wanted to find my min and max values for completion of those wells.


In [4]:
oldest = wells.aggregate(agate.Min('Compdate'))

In [5]:
print(oldest)


1895-01-01

In [6]:
newest = wells.aggregate(agate.Max('Compdate'))

In [7]:
print(newest)


2016-02-25

So this piece finds whether a well is active or not in a given year, and the code is a custom Python class that inherits various methods from agate.Computation class. Using a custom Computation class allows me to provide the function with multiple inputs as the typical agate.Formula() is limited to one.

  1. The init function is a typical one for custom classes, which basically says start a new instance of this class with these variables. _column_start is when the well was completed, and _column_end is when the well was decomissioned. The final value year is the year that I want to check whether a well is active or not.
  2. The get_computed_data_type function tells agate that the returned value should be a Boolean aka True or False.
  3. The validate function tells agate that our input columns should be dates and throws an error if they aren't.
  4. Finally the run function does the actual computation, which is explained in the comments below.

In [8]:
class ActiveWell(agate.Computation):
    """
    Computes whether a well is active in a year
    """
    def __init__(self, column_name_start, column_name_end, year):
        self._column_start = column_name_start
        self._column_end = column_name_end
        self._year = year

    def get_computed_data_type(self, table):
        """
        The return value is a boolean.
        """
        return agate.Boolean()

    def validate(self, table):
        """
        Verify the columns are dates.
        """
        columnStart = table.columns[self._column_start]

        if not isinstance(columnStart.data_type, agate.Date):
            raise agate.DataTypeError('Can only be applied to date data.')
            
        columnEnd = table.columns[self._column_end]

        if not isinstance(columnEnd.data_type, agate.Date):
            raise agate.DataTypeError('Can only be applied to date data.')

    def run(self, table):
        new_column = []
        
        for row in table.rows:
            #get the start value from the start column aka the year the well was constructed
            start_val = row[self._column_start]
            #some of the wells don't have constructed dates so I manually set the nulls at 1700
            if start_val is not None:
                start_year = start_val.year
            else:
                start_year = 1700
            #same process with the deconstructed well date
            end_val = row[self._column_end]
            if end_val is not None:
                end_year = end_val.year
            else:
                end_year = 5000
            #if the well was constructed before this year and hasn't been deconstructed yet, return as an active well
            if start_year < self._year < end_year:
                status = True
                new_column.append(status)
            else:
                status = False
                new_column.append(status)
        return new_column

To check my ActiveWell class, I computed 2010 active wells. After filtering so only the active wells remain, I grouped by each count and then counted its active wells. Also, the total active wells helped me debug.


In [9]:
check_wells = wells.compute([
    ('2010active', ActiveWell('Compdate', 'DecommDate', 2010))
        ])
active_wells = check_wells.where(lambda row: row['2010active'] == True)
county_groups = active_wells.group_by('Countyname')
well_counts = county_groups.aggregate([
        ('count', agate.Count())
    ])
total = well_counts.aggregate(agate.Sum('count'))
print(total)
print_table = well_counts.select(['Countyname', 'count'])
print_table.print_table(10)


99011
|-------------+--------|
|  Countyname | count  |
|-------------+--------|
|  Kearney    | 2,331  |
|  Sheridan   |   684  |
|  Hall       | 4,517  |
|  Dawson     | 4,080  |
|  Morrill    |   999  |
|  Adams      | 2,467  |
|  Box Butte  | 1,309  |
|  Keith      | 1,277  |
|  Deuel      |   354  |
|  Holt       | 3,200  |
|  ...        |   ...  |
|-------------+--------|

The years in the array were chosen due to the USDA ag census was conducted on those dates. Therefore, like the test 2010 analysis above, this program loops through each year in the array, finds the active wells in that year, groups by each count and counts the number of active wells. Finally, the counties' wells counts are outputted in a csv file.


In [13]:
years = [1997, 2002, 2007, 2012]
for year in years:
    print("Begin {0} well analysis".format(year))
    year_category = '{0}active'.format(year)
    count_category = 'count{0}'.format(year)
    check_wells = wells.compute([
            (year_category, ActiveWell('Compdate', 'DecommDate', year))
        ])
    
    active_wells = check_wells.where(lambda row: row[year_category] == True)
    county_groups = active_wells.group_by('Countyname')
    well_counts = county_groups.aggregate([
        ('Wells', agate.Count())
    ])
    print_table = well_counts.select(['Countyname', 'Wells'])
    print_table.print_table(10)
    total = well_counts.aggregate(agate.Sum('Wells'))
    print("Total active wells in {0} is {1}".format(year, total))
    well_counts.to_csv('wells{0}.csv'.format(year))


Begin 1997 well analysis
|---------------+--------|
|  Countyname   | Wells  |
|---------------+--------|
|  Hall         | 4,323  |
|  Thayer       | 1,454  |
|  Scotts Bluff |   747  |
|  Dawson       | 3,799  |
|  Frontier     |   681  |
|  Adams        | 2,179  |
|  Box Butte    | 1,161  |
|  Keith        | 1,066  |
|  Deuel        |   329  |
|  Holt         | 2,660  |
|  ...          |   ...  |
|---------------+--------|
Total active wells in 1997 is 87188
Begin 2002 well analysis
|---------------+--------|
|  Countyname   | Wells  |
|---------------+--------|
|  Hall         | 4,364  |
|  Thayer       | 1,533  |
|  Scotts Bluff |   786  |
|  Dawson       | 3,850  |
|  Adams        | 2,262  |
|  Box Butte    | 1,198  |
|  Keith        | 1,150  |
|  Deuel        |   338  |
|  Holt         | 2,779  |
|  Saline       | 1,127  |
|  ...          |   ...  |
|---------------+--------|
Total active wells in 2002 is 90660
Begin 2007 well analysis
|-------------+--------|
|  Countyname | Wells  |
|-------------+--------|
|  Kearney    | 2,326  |
|  Sheridan   |   678  |
|  Hall       | 4,522  |
|  Dawson     | 4,083  |
|  Morrill    |   997  |
|  Adams      | 2,427  |
|  Box Butte  | 1,306  |
|  Keith      | 1,265  |
|  Deuel      |   355  |
|  Holt       | 3,060  |
|  ...        |   ...  |
|-------------+--------|
Total active wells in 2007 is 97566
Begin 2012 well analysis
|-------------+--------|
|  Countyname | Wells  |
|-------------+--------|
|  Kearney    | 2,324  |
|  Sheridan   |   682  |
|  Hall       | 4,520  |
|  Dawson     | 4,052  |
|  Morrill    | 1,000  |
|  Adams      | 2,516  |
|  Box Butte  | 1,311  |
|  Keith      | 1,279  |
|  Deuel      |   353  |
|  Holt       | 3,327  |
|  ...        |   ...  |
|-------------+--------|
Total active wells in 2012 is 100019

In [ ]: