Week 4: Building Classes

This week we will continue to add to our palette of operations by using classes. This is both an introduction to a set of operations (which we will later use as provided in pandas) and a discussion of how to write classes and the advantages of doing so.

Note: In this notebook, I will be duplicating cells numerous times and expanding them. In particular, I will do this for the Dataset class. You won't need to do this in your work; you can just update the definition cell as needed. (Any existing instances of the class will need to be re-created after you do, though.)

First, let's take a brief moment to talk about how white-space delimits flow control in Python. Compare these two operations:


In [1]:
for i in range(10):
    b = i * 2
    print(b)


0
2
4
6
8
10
12
14
16
18

with:


In [2]:
for i in range(10):
    b = i * 2
print(b)


18

As you can see, in the first example the print call is inside the loop, so it runs on every iteration; in the second, it is outside the loop, so it runs once, after the loop has finished. The only difference between the two is the indentation of the print statement.

Classes

We're going to start out by creating classes. Classes are essentially "patterns" for objects: objects are instantiated from a class, and each one can then be modified independently of the others. Here is a trivial example, where we define a Dataset class that does ... nothing. The pass statement is there only because Python requires a class body to contain at least one statement.

You can define attributes on a class, which are values or variables, and methods, which are functions.


In [3]:
class Dataset:
    pass

We can now call the class like a function; the call returns a new object, an instance of the class:


In [4]:
d = Dataset()
d


Out[4]:
<__main__.Dataset at 0x7f8d404712b0>

We can do this again:


In [5]:
f = Dataset()

In [6]:
d is f


Out[6]:
False

So we've checked; we now have two different objects. But if we then try setting a new variable to d, we can see that we are referencing, not copying, the original object.


In [7]:
g = d

In [8]:
g is d


Out[8]:
True
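
To make the difference between referencing and copying concrete, here is a small sketch. The Box class and its value attribute are made up for illustration; copy.copy comes from the standard library.

```python
import copy

class Box:
    """A throwaway class used only for this illustration."""
    pass

a = Box()
a.value = 1

b = a             # b is a second name for the same object
c = copy.copy(a)  # c is a new object with the same attributes

b.value = 2       # changing the object through b also changes a

print(a.value)         # 2
print(c.value)         # 1
print(b is a, c is a)  # True False
```

Assignment never copies an object by itself; when a separate object is wanted, it has to be requested explicitly.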

Let's get ourselves set up. We'll set up matplotlib, import our libraries, and get to work creating our Dataset object.


In [9]:
%matplotlib inline

In [10]:
import numpy as np
import csv
import matplotlib.pyplot as plt

We'll make our initial pass at the class do a tiny bit more. We will now define a "special" method, __init__, which is called whenever an instance of the class is created. One gotcha here is that the first argument, self, refers to the instance being created; Python supplies it implicitly, so the caller does not pass it. So we will accept a data argument in __init__, and set self.data equal to it. (Note that this does not make a copy of data -- so any in-place changes to the original object will be reflected in self.data!)


In [11]:
class Dataset:
    def __init__(self, data):
        self.data = data
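
As a quick illustration of the point about copies, the sketch below uses a made-up list named values and a made-up instance name d_demo. It shows that the instance holds a reference to whatever it was given, so appending to the original list is visible through the data attribute.

```python
class Dataset:
    def __init__(self, data):
        self.data = data

values = [1, 2, 3]            # made-up data for illustration
d_demo = Dataset(values)

values.append(4)              # modify the original list in place

print(d_demo.data)            # [1, 2, 3, 4]
print(d_demo.data is values)  # True
```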

Even though we're going to use data as a dictionary like we get from our favorite data file, it does not need to be. In fact, it can be anything, including a number:


In [12]:
d = Dataset( 1.095 )

In [13]:
d.data


Out[13]:
1.095

We'll talk about typing and ensuring correct types at a later time. For now, let's supply an empty dictionary to the Dataset object, and check that it's accessible.


In [14]:
d = Dataset( {} )

In [15]:
d.data


Out[15]:
{}

Great. We'll read in our data, exactly as we have done in previous weeks.


In [16]:
with open("data-readonly/Building_Inventory.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)
    data = {}
    for column in header:
        data[column] = []
    for row in reader:
        for column, value in zip(header, row):
            data[column].append(value)
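
To see what this loop produces without opening the real file, here is the same pattern applied to a tiny in-memory CSV. The column names and values are invented for illustration, and io.StringIO stands in for the open file.

```python
import csv
import io

# A made-up, three-line CSV standing in for the real file.
text = "Name,Floors\nAnnex,2\nTower,31\n"

with io.StringIO(text) as f:
    reader = csv.reader(f)
    header = next(reader)                     # first row holds the column names
    columns = {name: [] for name in header}   # one empty list per column
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)

print(columns)
# {'Name': ['Annex', 'Tower'], 'Floors': ['2', '31']}
```

Note that every value, including the floor counts, is still a string at this point.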

Now, pass that in!


In [17]:
d = Dataset(data)

One thing that we have done in the past is to convert the data types after reading them in. We will add another method, called convert, that does an in-place conversion of a given column.


In [18]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
d = Dataset(data)

We can use the value types we have used before, and call convert with them.


In [19]:
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

print("Before: {}".format(type(d.data["Total Floors"])))

for key in d.data.keys():
    d.convert(key, value_types.get(key, "str"))

print("After: {}".format(type(d.data["Total Floors"])))


Before: <class 'list'>
After: <class 'numpy.ndarray'>
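
What convert does to a single column can be seen on a small, made-up list of strings. This sketch also shows that the conversion raises a ValueError if any entry cannot be parsed as the requested type.

```python
import numpy as np

raw = ["3", "1", "12"]                 # made-up values, as read from a CSV
floors = np.array(raw, dtype="int")    # same call that convert makes

print(floors.tolist())   # [3, 1, 12]
print(floors.sum())      # 16

try:
    np.array(["3", "n/a"], dtype="int")
except ValueError as err:
    print("Could not convert:", err)
```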

It's becoming obvious that we need a convenient way to see all the columns we have, so let's add a .columns() method.


In [20]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
        
d = Dataset(data)

value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

for key in d.columns():
    d.convert(key, value_types.get(key, "str"))

In [21]:
d.columns()


Out[21]:
dict_keys(['Usage Description 3', 'Senate Dist', 'Rep Full Name', 'Usage Description 2', 'Year Acquired', 'Address', 'Congressional Full Name', 'Agency Name', 'Bldg Status', 'Rep Dist', 'County', 'Square Footage', 'City', 'Floors Below Grade', 'Location Name', 'Congress Dist', 'Total Floors', 'Floors Above Grade', 'Senator Full Name', 'Year Constructed', 'Zip code', 'Usage Description'])

We will now start adding functions we're familiar with, starting with the filter_eq function from last week.

Note: Here, we are returning new Dataset instances, with copies of the arrays. This is fine for the size of data (and expected number of filtering operations) we have in this example. For much larger examples, it will not be, and we will address that in future classes.


In [22]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
    
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
        
d = Dataset(data)

value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

for key in d.columns():
    d.convert(key, value_types.get(key, "str"))

d2 = d.filter_eq("Agency Name", "Department of Natural Resources")

In [23]:
d2.data["Senate Dist"]


Out[23]:
array([47, 47, 47, ..., 48, 59, 46])
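
The key step in filter_eq is the boolean mask called good. On a pair of small, made-up arrays (the names and values here are hypothetical), it looks like this:

```python
import numpy as np

agency = np.array(["DNR", "DOT", "DNR", "ISP"])    # made-up values
footage = np.array([100.0, 2500.0, 40.0, 900.0])   # made-up values

good = (agency == "DNR")        # one True/False entry per row

print(good.tolist())            # [True, False, True, False]
print(footage[good].tolist())   # [100.0, 40.0]
```

Indexing with the mask keeps only the rows where it is True, and applying the same mask to every column keeps the rows aligned across columns.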

Seems like we need an easy way to figure out how many items there are in the dataset. We'll use a size method.


In [24]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
    
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def size(self):
        for key in self.data:
            return self.data[key].size
        
d = Dataset(data)

value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))


Pre-filter: 8849
Post-filter: 3248

We'll add on the other methods we developed last week: less than, greater than, not equal, and stats.

Our stats method, however, will operate on all of our columns at once. Because some columns hold strings, we have to add a quick check that skips any column we cannot perform mathematical operations on.


In [25]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
    
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def size(self):
        for key in self.data:
            return self.data[key].size
    
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    
d = Dataset(data)

value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))


Pre-filter: 8849
Post-filter: 3248
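
The dtype check in stats compares against the names "float" and "int", which match NumPy's default floating-point and integer types. One possible alternative, sketched here on made-up arrays and not what the class above uses, is np.issubdtype, which accepts numeric types of any width:

```python
import numpy as np

# Made-up columns: default int, 32-bit float, and strings.
columns = {
    "Total Floors": np.array([1, 2, 31]),
    "Square Footage": np.array([100.0, 2500.0], dtype="float32"),
    "City": np.array(["Urbana", "Chicago"]),
}

numeric = {}
for name, values in columns.items():
    numeric[name] = bool(np.issubdtype(values.dtype, np.number))
    print(name, numeric[name])
# Total Floors True
# Square Footage True
# City False
```

With the string comparison used in the class, the float32 column in this sketch would be skipped, since "float" refers to the default 64-bit type.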

Formatting Strings

Formatting strings helps to ensure uniform output, particularly when examining multiple different datasets and their numeric values. We will use the .format operation; for more information, see the Python documentation.

We'll start out with simple floating point representations, with the f, e and default examples.


In [26]:
val = -0.950714287492
print("{0} or {0:+.3f} or {0:0.5e} or {0:7.5f}".format(val))


-0.950714287492 or -0.951 or -9.50714e-01 or -0.95071

Something that will come in particularly handy for us is string-padding. We can pad out a string to a given length, for instance 20.


In [27]:
print("{0:20s} -> 1".format("hi"))


hi                   -> 1
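
Padding can also be combined with an alignment character: < for left, > for right, and ^ for centered. A short sketch, with square brackets added only to make the padding visible:

```python
name = "hi"

left = "[{0:<10s}]".format(name)
right = "[{0:>10s}]".format(name)
center = "[{0:^10s}]".format(name)
number = "[{0:>10.2f}]".format(3.14159)

print(left)     # [hi        ]
print(right)    # [        hi]
print(center)   # [    hi    ]
print(number)   # [      3.14]
```

Strings are left-aligned by default and numbers are right-aligned, so the alignment character is only needed to override that.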

Now, let's compare the stats for our two different datasets -- filtered and unfiltered.


In [28]:
stats1 = d.stats()
stats2 = d2.stats()
for column in d.columns():
    if column not in stats1: continue
    print("Column '{0:25s}'".format(column))
    for s1, s2 in zip(stats1[column], stats2[column]):
        print("    {0} vs {1}".format(s1, s2))


Column 'Senate Dist              '
    0 vs 6
    60 vs 60
    11.828633795912243 vs 9.126223508578196
    46.36874223076054 vs 48.10036945812808
Column 'Year Acquired            '
    0 vs 0
    2016 vs 2015
    320.1951802951111 vs 405.14285352287663
    1919.0273477229066 vs 1891.7413793103449
Column 'Square Footage           '
    0.0 vs 0.0
    1200000.0 vs 183175.0
    38195.906418657236 vs 5467.006308903433
    11497.949937846084 vs 1272.1345443349753
Column 'Floors Below Grade       '
    0 vs 0
    4 vs 1
    0.3940977107879141 vs 0.21641263681269748
    0.16318228048367048 vs 0.04926108374384237
Column 'Congress Dist            '
    0 vs 0
    18 vs 18
    4.141781466447568 vs 3.1053213758379408
    13.332127924059217 vs 14.387007389162562
Column 'Total Floors             '
    0 vs 0
    31 vs 7
    1.5418272357631746 vs 0.5712692170735618
    1.6418804384676235 vs 1.1499384236453203
Column 'Floors Above Grade       '
    0 vs 0
    30 vs 6
    1.2861155426316557 vs 0.39829383571374816
    1.4585828907221154 vs 1.0972906403940887
Column 'Year Constructed         '
    0 vs 0
    2017 vs 2016
    336.43416029812164 vs 448.10887281720636
    1911.4525935133913 vs 1869.280172413793
Column 'Zip code                 '
    1235 vs 1235
    68297 vs 68297
    1096.040039528023 vs 1346.8355268792554
    61819.58831506385 vs 61867.06650246305

Hmm, we can make this a bit nicer; let's print all of our values as floating point numbers of the same width, and label them.


In [29]:
stats1 = d.stats()
stats2 = d2.stats()
for column in d.columns():
    if column not in stats1: continue
    print("Column '{0:25s}'".format(column))
    mi1, ma1, st1, me1 = stats1[column]
    mi2, ma2, st2, me2 = stats2[column]
    print("    min: {0:20.3f} vs {1:20.3f}".format(mi1, mi2))
    print("    max: {0:20.3f} vs {1:20.3f}".format(ma1, ma2))
    print("    std: {0:20.3f} vs {1:20.3f}".format(st1, st2))
    print("    avg: {0:20.3f} vs {1:20.3f}".format(me1, me2))


Column 'Senate Dist              '
    min:                0.000 vs                6.000
    max:               60.000 vs               60.000
    std:               11.829 vs                9.126
    avg:               46.369 vs               48.100
Column 'Year Acquired            '
    min:                0.000 vs                0.000
    max:             2016.000 vs             2015.000
    std:              320.195 vs              405.143
    avg:             1919.027 vs             1891.741
Column 'Square Footage           '
    min:                0.000 vs                0.000
    max:          1200000.000 vs           183175.000
    std:            38195.906 vs             5467.006
    avg:            11497.950 vs             1272.135
Column 'Floors Below Grade       '
    min:                0.000 vs                0.000
    max:                4.000 vs                1.000
    std:                0.394 vs                0.216
    avg:                0.163 vs                0.049
Column 'Congress Dist            '
    min:                0.000 vs                0.000
    max:               18.000 vs               18.000
    std:                4.142 vs                3.105
    avg:               13.332 vs               14.387
Column 'Total Floors             '
    min:                0.000 vs                0.000
    max:               31.000 vs                7.000
    std:                1.542 vs                0.571
    avg:                1.642 vs                1.150
Column 'Floors Above Grade       '
    min:                0.000 vs                0.000
    max:               30.000 vs                6.000
    std:                1.286 vs                0.398
    avg:                1.459 vs                1.097
Column 'Year Constructed         '
    min:                0.000 vs                0.000
    max:             2017.000 vs             2016.000
    std:              336.434 vs              448.109
    avg:             1911.453 vs             1869.280
Column 'Zip code                 '
    min:             1235.000 vs             1235.000
    max:            68297.000 vs            68297.000
    std:             1096.040 vs             1346.836
    avg:            61819.588 vs            61867.067

We can now absorb the compare function back into our class:


In [30]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
    
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def size(self):
        for key in self.data:
            return self.data[key].size
    
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    
    def compare(self, other):
        stats1 = self.stats()
        stats2 = other.stats()
        for column in self.columns():
            if column not in stats1: continue
            print("Column '{0:25s}'".format(column))
            for s1, s2 in zip(stats1[column], stats2[column]):
                print("    {0} vs {1}".format(s1, s2))
    
d = Dataset(data)

value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))
d.compare(d2)


Pre-filter: 8849
Post-filter: 3248
Column 'Senate Dist              '
    0 vs 6
    60 vs 60
    11.828633795912243 vs 9.126223508578196
    46.36874223076054 vs 48.10036945812808
Column 'Year Acquired            '
    0 vs 0
    2016 vs 2015
    320.1951802951111 vs 405.14285352287663
    1919.0273477229066 vs 1891.7413793103449
Column 'Square Footage           '
    0.0 vs 0.0
    1200000.0 vs 183175.0
    38195.906418657236 vs 5467.006308903433
    11497.949937846084 vs 1272.1345443349753
Column 'Floors Below Grade       '
    0 vs 0
    4 vs 1
    0.3940977107879141 vs 0.21641263681269748
    0.16318228048367048 vs 0.04926108374384237
Column 'Congress Dist            '
    0 vs 0
    18 vs 18
    4.141781466447568 vs 3.1053213758379408
    13.332127924059217 vs 14.387007389162562
Column 'Total Floors             '
    0 vs 0
    31 vs 7
    1.5418272357631746 vs 0.5712692170735618
    1.6418804384676235 vs 1.1499384236453203
Column 'Floors Above Grade       '
    0 vs 0
    30 vs 6
    1.2861155426316557 vs 0.39829383571374816
    1.4585828907221154 vs 1.0972906403940887
Column 'Year Constructed         '
    0 vs 0
    2017 vs 2016
    336.43416029812164 vs 448.10887281720636
    1911.4525935133913 vs 1869.280172413793
Column 'Zip code                 '
    1235 vs 1235
    68297 vs 68297
    1096.040039528023 vs 1346.8355268792554
    61819.58831506385 vs 61867.06650246305

Splitting Datasets

We now have a base dataset, which we will want to split. This is a common operation: instead of just filtering and "throwing away" the rows that don't match some characteristic, we create a new dataset for each unique value in a column.

Last week, we did this with agencies, in an ad hoc way. We will do it again this week, but in a formalized way, by adding a split method to our Dataset object. This will iterate over all the unique values in a column and return new Dataset objects for each.


In [31]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
    
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def size(self):
        for key in self.data:
            return self.data[key].size
    
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    
    def compare(self, other):
        stats1 = self.stats()
        stats2 = other.stats()
        for column in self.columns():
            if column not in stats1: continue
            print("Column '{0:25s}'".format(column))
            for s1, s2 in zip(stats1[column], stats2[column]):
                print("    {0} vs {1}".format(s1, s2))
                
    def split(self, column):
        new_datasets = {}
        for split_value in np.unique(self.data[column]):
            new_datasets[split_value] = self.filter_eq(column, split_value)
        return new_datasets
    
d = Dataset(data)

value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))


Pre-filter: 8849
Post-filter: 3248
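
The idea behind split can be seen on two small, made-up arrays (the names and values here are hypothetical): np.unique finds the distinct values, and one boolean mask per value selects the matching rows.

```python
import numpy as np

agency = np.array(["DNR", "DOT", "DNR", "ISP", "DOT"])   # made-up values
footage = np.array([100.0, 2500.0, 40.0, 900.0, 10.0])   # made-up values

groups = {}
for value in np.unique(agency):            # distinct values, in sorted order
    groups[str(value)] = footage[agency == value]

for value, subset in groups.items():
    print(value, subset.tolist())
# DNR [100.0, 40.0]
# DOT [2500.0, 10.0]
# ISP [900.0]
```

This makes one pass over the data per unique value, which is fine at this size; the note on copies made for filter_eq applies here as well.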

Let's try this on Agency Name.


In [32]:
splits = d.split("Agency Name")

In [33]:
splits.keys()


Out[33]:
dict_keys(['Appellate Court / Fifth District', 'Illinois Board of Higher Education', "Governor's Office", 'Department of Natural Resources', 'Appellate Court / Third District', 'Illinois Courts', 'Chicago State University', "Department of Veterans' Affairs", 'Office of the Attorney General', 'Department of Public Health', 'Department of State Police', 'Northern Illinois University', 'Western Illinois University', 'Office of the Secretary of State', 'Department of Military Affairs', 'Department of Central Management Services', 'Department of Juvenile Justice', 'Department of Corrections', 'IL State Board of Education', 'Historic Preservation Agency', 'Illinois Community College Board', 'Department of Human Services', 'Southern Illinois University', 'Illinois Medical District Commission', 'Governors State University', 'Appellate Court / Fourth District', 'Department of Agriculture', 'Appellate Court / Second District', 'Department of Revenue', 'Eastern Illinois University', 'Department of Transportation', 'Northeastern Illinois University', 'Illinois Emergency Management Agency', 'Illinois State University', 'University of Illinois'])

Let's see how the minimum and maximum building square footage compare across agencies.


In [34]:
for agency in splits:
    stats = splits[agency].stats()
    print("For {0:45s} min sq footage = {1: 10.1f} max sq footage = {2: 10.1f}".format(agency, stats["Square Footage"][0], stats["Square Footage"][1]))


For Appellate Court / Fifth District              min sq footage =    15124.0 max sq footage =    15124.0
For Illinois Board of Higher Education            min sq footage =     2464.0 max sq footage =   332000.0
For Governor's Office                             min sq footage =    45120.0 max sq footage =    45120.0
For Department of Natural Resources               min sq footage =        0.0 max sq footage =   183175.0
For Appellate Court / Third District              min sq footage =     3700.0 max sq footage =    15000.0
For Illinois Courts                               min sq footage =    54540.0 max sq footage =    54540.0
For Chicago State University                      min sq footage =      196.0 max sq footage =   185458.0
For Department of Veterans' Affairs               min sq footage =        0.0 max sq footage =   185525.0
For Office of the Attorney General                min sq footage =    60500.0 max sq footage =    60500.0
For Department of Public Health                   min sq footage =     2160.0 max sq footage =     5000.0
For Department of State Police                    min sq footage =       36.0 max sq footage =   254636.0
For Northern Illinois University                  min sq footage =        0.0 max sq footage =   298474.0
For Western Illinois University                   min sq footage =       55.0 max sq footage =   300097.0
For Office of the Secretary of State              min sq footage =       34.0 max sq footage =   452523.0
For Department of Military Affairs                min sq footage =        0.0 max sq footage =   299772.0
For Department of Central Management Services     min sq footage =      150.0 max sq footage =  1200000.0
For Department of Juvenile Justice                min sq footage =       48.0 max sq footage =    90920.0
For Department of Corrections                     min sq footage =        0.0 max sq footage =   307288.0
For IL State Board of Education                   min sq footage =    19147.0 max sq footage =    19147.0
For Historic Preservation Agency                  min sq footage =        0.0 max sq footage =   255000.0
For Illinois Community College Board              min sq footage =     4557.0 max sq footage =    99068.0
For Department of Human Services                  min sq footage =        0.0 max sq footage =   140000.0
For Southern Illinois University                  min sq footage =        0.0 max sq footage =   344652.0
For Illinois Medical District Commission          min sq footage =     9200.0 max sq footage =    22000.0
For Governors State University                    min sq footage =     2000.0 max sq footage =   500000.0
For Appellate Court / Fourth District             min sq footage =    16400.0 max sq footage =    16400.0
For Department of Agriculture                     min sq footage =        0.0 max sq footage =   149400.0
For Appellate Court / Second District             min sq footage =    43330.0 max sq footage =    43330.0
For Department of Revenue                         min sq footage =   913236.0 max sq footage =   913236.0
For Eastern Illinois University                   min sq footage =      511.0 max sq footage =   155202.0
For Department of Transportation                  min sq footage =        0.0 max sq footage =   277091.0
For Northeastern Illinois University              min sq footage =     1569.0 max sq footage =   148662.0
For Illinois Emergency Management Agency          min sq footage =     5650.0 max sq footage =    50000.0
For Illinois State University                     min sq footage =      153.0 max sq footage =   230710.0
For University of Illinois                        min sq footage =       32.0 max sq footage =   779732.0

And, we'll also add a .plot method to make this a bit easier, too.


In [35]:
class Dataset:
    def __init__(self, data):
        self.data = data
        
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
        
    def columns(self):
        return self.data.keys()
    
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    
    def size(self):
        for key in self.data:
            return self.data[key].size
    
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    
    def compare(self, other):
        stats1 = self.stats()
        stats2 = other.stats()
        for column in self.columns():
            if column not in stats1: continue
            print("Column '{0:25s}'".format(column))
            for s1, s2 in zip(stats1[column], stats2[column]):
                print("    {0} vs {1}".format(s1, s2))
                
    def split(self, column):
        new_datasets = {}
        for split_value in np.unique(self.data[column]):
            new_datasets[split_value] = self.filter_eq(column, split_value)
        return new_datasets
    
    def plot(self, x_column, y_column):
        plt.plot(self.data[x_column], self.data[y_column], '.')
    
d = Dataset(data)

value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}

for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))


Pre-filter: 8849
Post-filter: 3248

We'll finish up with a couple of plots that are difficult to examine!


In [36]:
splits = d.split("Agency Name")
for agency in splits:
    splits[agency].plot("Year Acquired", "Square Footage")



In [37]:
d2 = d.filter_gt("Year Acquired", 0)
splits = d2.split("Agency Name")
for agency in splits:
    splits[agency].plot("Year Acquired", "Square Footage")
plt.yscale("log")