This week we are going to continue to add to our palette of operations by using classes. This is going to be both an introduction to a set of operations (which we will later use as provided in pandas) as well as a discussion of how to write classes, and the advantages of doing so.
Note: In this notebook, I will be duplicating cells numerous times and expanding them -- in particular, for the Dataset class. You won't need to do this in your own work; you can simply update the definition cell as needed. (Although any existing instances of that class will then need to be re-created.)
First, let's take a brief moment to talk about how white-space delimits flow control in Python. Compare these two operations:
In [1]:
for i in range(10):
    b = i * 2
    print(b)
with:
In [2]:
for i in range(10):
    b = i * 2
print(b)
As you can see, in the first one the print call is inside the loop, as is visually clear; in the second, it is outside the loop. The difference is identified solely by the indentation of the two statements.
We're going to start out by creating classes. Classes are essentially "patterns" for objects: objects are instantiated from a class, and can then deviate from it. Here is a simple, trivial example, where we define a Dataset class that does ... nothing. The pass statement is there just to satisfy Python's rule that a class body cannot be empty.
You can define attributes on a class, which are values or variables, and methods, which are functions.
In [3]:
class Dataset:
    pass
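To illustrate the distinction between attributes and methods mentioned above, here is a small, hypothetical class (the name Circle and its contents are just for illustration, not part of our Dataset):

```python
class Circle:
    def __init__(self, radius):
        # attribute: a value stored on the instance
        self.radius = radius
    def area(self):
        # method: a function attached to the class
        return 3.141592653589793 * self.radius ** 2

c = Circle(2.0)
print(c.radius)  # access an attribute
print(c.area())  # call a method
```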
We can now call the class like a function; doing so creates (instantiates) an object:
In [4]:
d = Dataset()
d
Out[4]:
We can do this again:
In [5]:
f = Dataset()
In [6]:
d is f
Out[6]:
So we've checked; we now have two different objects. But if we assign d to a new variable, we can see that the new name references, rather than copies, the original object.
In [7]:
g = d
In [8]:
g is d
Out[8]:
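This referencing behavior isn't specific to our class; assignment in Python never copies an object. A quick sketch with a plain list makes the consequence visible:

```python
d = [1, 2]
g = d          # g and d now reference the same list object
g.append(3)    # modify the list through g
print(d)       # the change is visible through d as well: [1, 2, 3]
print(g is d)  # True: same object, two names
```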
Let's get ourselves set up. We'll set up matplotlib, import our libraries, and get to work creating our Dataset object.
In [9]:
%matplotlib inline
In [10]:
import numpy as np
import csv
import matplotlib.pyplot as plt
We'll make our initial pass at the class do a tiny bit more. We will now define a "special" method, __init__, which is called whenever an instance of the class is created. One gotcha here: the self argument implicitly refers to the instance, and does not need to be supplied when calling. So we will accept a data argument in the constructor and set self.data equal to it. (Note that this does not make a copy of data -- so any in-place changes will be reflected!)
In [11]:
class Dataset:
    def __init__(self, data):
        self.data = data
Even though we're going to use data as a dictionary like we get from our favorite data file, it does not need to be. In fact, it can be anything, including a number:
In [12]:
d = Dataset( 1.095 )
In [13]:
d.data
Out[13]:
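To see the note above in action -- that self.data references, rather than copies, the argument -- here is a small sketch:

```python
class Dataset:
    def __init__(self, data):
        self.data = data

source = {"a": [1, 2, 3]}
d = Dataset(source)
source["a"].append(4)  # mutate the original dict in place
print(d.data["a"])     # the change is visible through the Dataset: [1, 2, 3, 4]
```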
We'll talk about typing and ensuring correct types at a later time. For now, let's supply an empty dictionary to the Dataset object, and check that it's accessible.
In [14]:
d = Dataset( {} )
In [15]:
d.data
Out[15]:
Great. We'll read in our data, exactly as we have done in previous weeks.
In [16]:
with open("data-readonly/Building_Inventory.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)
    data = {}
    for column in header:
        data[column] = []
    for row in reader:
        for column, value in zip(header, row):
            data[column].append(value)
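As an aside, the standard library's csv.DictReader can accomplish the same column-oriented read, since it yields each row as a dictionary keyed by the header. A self-contained sketch (the two-row sample here is a stand-in for the real file):

```python
import csv
import io

# A tiny stand-in for the real CSV file, so the sketch runs on its own
sample = io.StringIO("Agency Name,Total Floors\nDNR,2\nDNR,3\n")

reader = csv.DictReader(sample)
data = {column: [] for column in reader.fieldnames}
for row in reader:
    for column, value in row.items():
        data[column].append(value)
print(data["Total Floors"])  # ['2', '3']
```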
Now, pass that in!
In [17]:
d = Dataset(data)
One thing that we have done in the past is convert the data types after reading them in. We will add a method called convert that does an in-place conversion of a given column to a given dtype.
In [18]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)

d = Dataset(data)
We can use the value types we have used before, and call convert with them.
In [19]:
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
print("Before: {}".format(type(d.data["Total Floors"])))
for key in d.data.keys():
    d.convert(key, value_types.get(key, "str"))
print("After: {}".format(type(d.data["Total Floors"])))
It's becoming obvious that we need a convenient way to see all the columns we have, so let's add a .columns() method.
In [20]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
    def columns(self):
        return self.data.keys()

d = Dataset(data)
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
In [21]:
d.columns()
Out[21]:
We will now start adding functions we're familiar with, starting with the filter_eq function from last week.
Note: Here, we are returning new Dataset instances, with copies of the arrays. This is fine for the size of data (and expected number of filtering operations) we have in this example. For much larger examples, it will not be, and we will address that in future classes.
In [22]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
    def columns(self):
        return self.data.keys()
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)

d = Dataset(data)
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
In [23]:
d2.data["Senate Dist"]
Out[23]:
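Under the hood, filter_eq relies on NumPy boolean masking: comparing an array to a value gives a boolean array, which can then index any other array of the same length. A minimal sketch of that mechanism on its own:

```python
import numpy as np

values = np.array(["a", "b", "a", "c"])
other = np.array([1, 2, 3, 4])
good = (values == "a")  # boolean mask: True where the condition holds
print(good)             # [ True False  True False]
print(other[good])      # selects the matching rows: [1 3]
```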
It seems we need an easy way to figure out how many items there are in the dataset. We'll add a size method, which returns the length of the first column it finds (all columns have the same length).
In [24]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
    def columns(self):
        return self.data.keys()
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def size(self):
        for key in self.data:
            return self.data[key].size

d = Dataset(data)
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))
We'll add on the other methods we developed last week: less than, greater than, not equal, and stats.
Our stats method, however, will operate on all of our columns at once. Additionally, we add a quick check to make sure that a column supports mathematical operations before computing on it.
In [25]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
    def columns(self):
        return self.data.keys()
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def size(self):
        for key in self.data:
            return self.data[key].size
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics

d = Dataset(data)
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))
Formatting strings helps to ensure uniform output, particularly when examining multiple different datasets and their numeric values. We will use the .format operation; for more information, see the Python documentation.
We'll start out with simple floating point representations, using the f and e presentation types alongside the default representation.
In [26]:
val = -0.950714287492
print("{0} or {0:+.3f} or {0:0.5e} or {0:7.5f}".format(val))
Something that will come in particularly handy for us is string-padding. We can pad out a string to a given length, for instance 20.
In [27]:
print("{0:20s} -> 1".format("hi"))
Now, let's compare the stats for our two different datasets -- filtered and unfiltered.
In [28]:
stats1 = d.stats()
stats2 = d2.stats()
for column in d.columns():
    if column not in stats1: continue
    print("Column '{0:25s}'".format(column))
    for s1, s2 in zip(stats1[column], stats2[column]):
        print("    {0} vs {1}".format(s1, s2))
Hmm, we can do this a bit nicer; let's make our floating point values all the same size, and label them.
In [29]:
stats1 = d.stats()
stats2 = d2.stats()
for column in d.columns():
    if column not in stats1: continue
    print("Column '{0:25s}'".format(column))
    mi1, ma1, st1, me1 = stats1[column]
    mi2, ma2, st2, me2 = stats2[column]
    print("    min: {0:20.3f} vs {1:20.3f}".format(mi1, mi2))
    print("    max: {0:20.3f} vs {1:20.3f}".format(ma1, ma2))
    print("    std: {0:20.3f} vs {1:20.3f}".format(st1, st2))
    print("    avg: {0:20.3f} vs {1:20.3f}".format(me1, me2))
We can now absorb the compare function back into our class:
In [30]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
    def columns(self):
        return self.data.keys()
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def size(self):
        for key in self.data:
            return self.data[key].size
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    def compare(self, other):
        stats1 = self.stats()
        stats2 = other.stats()
        for column in self.columns():
            if column not in stats1: continue
            print("Column '{0:25s}'".format(column))
            for s1, s2 in zip(stats1[column], stats2[column]):
                print("    {0} vs {1}".format(s1, s2))

d = Dataset(data)
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))
d.compare(d2)
We now have a base dataset, which we will want to split out -- this is a common operation, wherein instead of just filtering and "throwing away" datasets based on their characteristics, we want to create new ones for each unique value.
Last week, we did this with agencies, in an ad hoc way. We will do it again this week, but in a formalized way, by adding a split method to our Dataset object. This will iterate over all the unique values in a column and return new Dataset objects for each.
In [31]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
    def columns(self):
        return self.data.keys()
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def size(self):
        for key in self.data:
            return self.data[key].size
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    def compare(self, other):
        stats1 = self.stats()
        stats2 = other.stats()
        for column in self.columns():
            if column not in stats1: continue
            print("Column '{0:25s}'".format(column))
            for s1, s2 in zip(stats1[column], stats2[column]):
                print("    {0} vs {1}".format(s1, s2))
    def split(self, column):
        new_datasets = {}
        for split_value in np.unique(self.data[column]):
            new_datasets[split_value] = self.filter_eq(column, split_value)
        return new_datasets

d = Dataset(data)
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))
Let's try this on Agency Name.
In [32]:
splits = d.split("Agency Name")
In [33]:
splits.keys()
Out[33]:
Let's see how the min/max buildings compare.
In [34]:
for agency in splits:
    stats = splits[agency].stats()
    print("For {0:45s} min sq footage = {1: 10.1f} max sq footage = {2: 10.1f}".format(agency, stats["Square Footage"][0], stats["Square Footage"][1]))
And, we'll also add a .plot method to make this a bit easier, too.
In [35]:
class Dataset:
    def __init__(self, data):
        self.data = data
    def convert(self, column, dtype):
        self.data[column] = np.array(self.data[column], dtype=dtype)
    def columns(self):
        return self.data.keys()
    def filter_eq(self, column, value):
        good = (self.data[column] == value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_lt(self, column, value):
        good = (self.data[column] < value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_gt(self, column, value):
        good = (self.data[column] > value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def filter_ne(self, column, value):
        good = (self.data[column] != value)
        new_data = {}
        for column in self.data:
            new_data[column] = self.data[column][good]
        return Dataset(new_data)
    def size(self):
        for key in self.data:
            return self.data[key].size
    def stats(self):
        statistics = {}
        for key in self.data:
            if self.data[key].dtype not in ("float", "int"):
                continue
            values = self.data[key]
            statistics[key] = (values.min(), values.max(), values.std(), values.mean())
        return statistics
    def compare(self, other):
        stats1 = self.stats()
        stats2 = other.stats()
        for column in self.columns():
            if column not in stats1: continue
            print("Column '{0:25s}'".format(column))
            for s1, s2 in zip(stats1[column], stats2[column]):
                print("    {0} vs {1}".format(s1, s2))
    def split(self, column):
        new_datasets = {}
        for split_value in np.unique(self.data[column]):
            new_datasets[split_value] = self.filter_eq(column, split_value)
        return new_datasets
    def plot(self, x_column, y_column):
        plt.plot(self.data[x_column], self.data[y_column], '.')

d = Dataset(data)
value_types = {'Zip code': 'int',
               'Congress Dist': 'int',
               'Senate Dist': 'int',
               'Year Acquired': 'int',
               'Year Constructed': 'int',
               'Square Footage': 'float',
               'Total Floors': 'int',
               'Floors Above Grade': 'int',
               'Floors Below Grade': 'int'}
for key in d.columns():
    d.convert(key, value_types.get(key, "str"))
print("Pre-filter: {}".format(d.size()))
d2 = d.filter_eq("Agency Name", "Department of Natural Resources")
print("Post-filter: {}".format(d2.size()))
We'll finish up with a couple of plots that are difficult to examine!
In [36]:
splits = d.split("Agency Name")
for agency in splits:
    splits[agency].plot("Year Acquired", "Square Footage")
In [37]:
d2 = d.filter_gt("Year Acquired", 0)
splits = d2.split("Agency Name")
for agency in splits:
splits[agency].plot("Year Acquired", "Square Footage")
plt.yscale("log")