Lists are collections of heterogeneous objects. They can be appended to, iterated over, etc, and we will use them for lots of fun things. They're useful especially when you don't know in advance how big something is going to be or what types of objects will be in it.
We'll set a simple one up that includes the numbers 1 through 9.
In [1]:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Now let's call dir on it to see what things we can do to it. Note that this will include lots of things starting with two underscores; for the most part these are "hidden" methods that we will use implicitly when we do things. The main methods you'll use directly are the ones that don't start with underscores.
In [2]:
dir(a)
Out[2]:
Lists can be reversed in-place. This means the return value is empty (None) but that the list has been changed. An important thing that this means is that lists are mutable -- you can change them without copying them into a new thing.
In [3]:
a.reverse()
In [4]:
a
Out[4]:
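To make the in-place behavior concrete, here is a small sketch that contrasts methods that change the list itself with built-ins that hand back a new object. The name nums is just a throwaway list for illustration.

```python
nums = [1, 2, 3]

# reverse() changes nums itself and returns None.
result = nums.reverse()
print(result)   # None
print(nums)     # [3, 2, 1]

# sorted() and reversed() leave nums alone and give back something new.
in_order = sorted(nums)
flipped = list(reversed(nums))
print(in_order)  # [1, 2, 3]
print(flipped)   # [1, 2, 3]
print(nums)      # [3, 2, 1]
```

A common slip is writing nums = nums.reverse(), which replaces the list with None.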
We can sort them, too. Here the sorting is trivial -- it'll end up just reversing it back to what it was. But, we can sort a more complex list as well.
In [5]:
a.sort()
In [6]:
a
Out[6]:
Because lists are mutable, we can insert things into them. Lists are zero-indexed, which means that the very first place is 0, not 1. This makes insertion a lot easier if you think about the position you're inserting at -- 0 is the first (so it pre-empts the first item in the list) and so on. Here, we'll insert at position 3, which is between the numbers 3 and 4 in this list.
In [7]:
a.insert(3, 3.9)
In [8]:
a
Out[8]:
We can also append values.
In [9]:
a.append(10)
In [10]:
a
Out[10]:
We can also remove an item; note that using pop here will not only remove the item but also return it. If we were to use del instead, the item would be removed without being returned.
In [11]:
a.pop(3)
Out[11]:
In [12]:
a
Out[12]:
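As a quick side-by-side, this sketch shows pop and del on a throwaway list (items is a made-up name), plus remove, which deletes by value rather than by position.

```python
items = ["a", "b", "c", "d"]

# pop removes the item at a position and returns it.
removed = items.pop(1)
print(removed)  # b
print(items)    # ['a', 'c', 'd']

# del removes the item at a position and returns nothing.
del items[0]
print(items)    # ['c', 'd']

# remove deletes the first item equal to the given value.
items.remove("d")
print(items)    # ['c']
```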
We can also use negative indices, which count from the end of the list. An index of -1 means "the last" item.
In [13]:
a.pop(-1)
Out[13]:
We can also slice a list using [start:stop:step]. Here we start at position 1, stop before position 5, and take every other item.
In [14]:
a[1:5:2]
Out[14]:
Here we just start at the beginning and take every other item.
In [15]:
a[::2]
Out[15]:
Every other item, starting from the second:
In [16]:
a[1::2]
Out[16]:
We can also iterate in reverse:
In [17]:
a[::-1]
Out[17]:
In reverse, but every second.
In [18]:
a[::-2]
Out[18]:
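The slices above all follow the same start:stop:step pattern, where any of the three parts can be left out. Here is a sketch on a short list of letters (a made-up example) so the positions are easy to follow.

```python
letters = ["a", "b", "c", "d", "e", "f"]

print(letters[1:5:2])  # ['b', 'd']  positions 1 and 3; stop is exclusive
print(letters[::2])    # ['a', 'c', 'e']
print(letters[1::2])   # ['b', 'd', 'f']
print(letters[::-1])   # ['f', 'e', 'd', 'c', 'b', 'a']
print(letters[::-2])   # ['f', 'd', 'b']

# Slicing builds a new list; the original is unchanged.
print(letters)         # ['a', 'b', 'c', 'd', 'e', 'f']
```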
Lists can include objects of different types.
In [19]:
a.append("blast off")
In [20]:
a
Out[20]:
In [21]:
a.pop(-1)
Out[21]:
A common problem you may run into is that sometimes, numbers look like strings. This can cause problems, as we'll see:
In [22]:
a.append('10')
In [23]:
a
Out[23]:
If it were the number 10, this would work. Unfortunately, strings and numbers can't be sorted together.
In [24]:
a.sort()
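One way around this, sketched here on a throwaway list called mixed, is to convert everything to a common type before sorting, or to tell the sort how to compare items with key. This assumes every string in the list really does look like an integer.

```python
mixed = [3, 1, "10", 2]

# Sorting numbers and strings together raises a TypeError in Python 3.
try:
    mixed.sort()
except TypeError as err:
    print("cannot sort:", err)

# Option 1: convert everything to int first, then sort.
as_numbers = sorted(int(v) for v in mixed)
print(as_numbers)  # [1, 2, 3, 10]

# Option 2: keep the original items, but compare them by their int value.
by_value = sorted(mixed, key=int)
print(by_value)    # [1, 2, 3, '10']
```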
Dictionaries (dict objects) are hashes, where a key is looked up to find a value. Both keys and values can be of heterogeneous types within a given dict; there are some restrictions on what can be used as a key. (The type must be "hashable," which among other things means that it can't be a list.)
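To see the "hashable" restriction in action, here is a small sketch. The dict name lookup and its contents are made up for illustration.

```python
lookup = {}

# Strings, numbers, and tuples of hashable things all work as keys.
lookup["name"] = "string key"
lookup[(40.1, -88.2)] = "tuple key"

# A list is mutable and therefore not hashable, so this fails.
try:
    lookup[[1, 2]] = "list key"
except TypeError as err:
    print("not allowed:", err)

print(len(lookup))  # 2
```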
We can initialize an empty dict with the curly brackets, {}, and then we can assign things to this dict.
In [25]:
b = {}
Here, we can just use an integer key that gets us to a string.
In [26]:
b[0] = 'a'
If we look at the dict, we can see what it includes.
In [27]:
b
Out[27]:
We can see a view on what all the keys are using .keys():
In [28]:
b.keys()
Out[28]:
If we just want to see what all the values are, we can use .values():
In [29]:
b.values()
Out[29]:
If we ask for a key that doesn't exist, we get a KeyError:
In [30]:
b[1]
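If a missing key is something we expect, there are a couple of ways to avoid the error. This sketch uses a separate dict called letters so it doesn't disturb b.

```python
letters = {0: "a"}

# Direct lookup of a missing key raises KeyError.
try:
    letters[1]
except KeyError as err:
    print("missing key:", err)

# .get() returns a fallback instead of raising.
fallback = letters.get(1, "not found")
print(fallback)  # not found

# "in" checks whether a key is present.
print(0 in letters)  # True
print(1 in letters)  # False
```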
Earlier, I noted that lists can't be used as keys in dicts, but they can be used as values. For example:
In [31]:
b = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}
In [32]:
b
Out[32]:
We can also iterate over the keys in a dict, simply by iterating over the dict itself. This statement will return each of the keys in turn, and we can see what value it is associated with.
In [33]:
for key in b:
    print(b[key])
Sets are unordered collections of unique objects. We'll create two of them that share some values.
In [34]:
c = set([1,2,3,4,5])
d = set([4,5,6,7,8])
We can now subtract one from the other, to see all objects in one but not the other.
In [35]:
c - d
Out[35]:
We can also union them:
In [36]:
e = c.union(d)
In [37]:
e
Out[37]:
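Subtraction and union are two of several set operations. Here is a sketch of the main ones, using the names left and right (chosen here so as not to overwrite c and d).

```python
left = {1, 2, 3, 4, 5}
right = {4, 5, 6, 7, 8}

print(left - right)  # in left but not right: {1, 2, 3}
print(right - left)  # in right but not left: {6, 7, 8}
print(left & right)  # in both: {4, 5}
print(left | right)  # in either, same as left.union(right)
print(left ^ right)  # in exactly one of the two
```

Note that subtraction depends on the order of the two sets, while union and intersection do not.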
An interesting feature of sets is that they accept iterables. This means that if you give one a string, it will treat each character of the string as an independent object. So we can create two sets from two strings, and see what they contain -- all the unique characters in each of the strings.
In [38]:
s1 = "Hello there, how are you?"
s2 = "I am fine, how are you doing today?"
v1 = set(s1)
v2 = set(s2)
In [39]:
v1
Out[39]:
In [40]:
v2
Out[40]:
Let's see how many there are in each:
In [41]:
len(s1), len(v1)
Out[41]:
In [42]:
len(s2), len(v2)
Out[42]:
If we combine the two sets, we can see how many unique characters appear across both strings:
In [43]:
len(v1.union(v2))
Out[43]:
In [44]:
for value in a:
    print(value)
If we iterate over a dictionary, we get the keys. We can also explicitly iterate over keys:
In [45]:
for name in b.keys():
    print(b[name])
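When we want the key and the value together, .items() saves the extra lookup. This is a small sketch on a made-up dict called rows.

```python
rows = {0: [1, 2, 3], 1: [4, 5, 6]}

# .items() yields (key, value) pairs, which we can unpack in the loop.
for key, values in rows.items():
    print(key, values)

pairs = list(rows.items())
print(pairs)  # [(0, [1, 2, 3]), (1, [4, 5, 6])]
```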
If we iterate over a set, we get all the values in that set. Note, however, that this iteration order is not guaranteed to be consistent, and should not be relied upon.
In [46]:
for value in v1:
    print(value)
We will start out using a dataset from the Illinois Open Data repository about the buildings under state ownership in Illinois. You can download it here: https://data.illinois.gov/Housing/Building-Inventory/utd5-tdr2 by clicking "Export" or by going to our class data repository.
At this point in the class, we will be utilizing very simple data reading and visualization techniques, so that we have the opportunity to see basic data structures, simple visualization, and so forth, before we start getting into pandas and other more advanced libraries.
We will use the built-in csv module to read in the data.
In [47]:
import csv
Here, we'll use next to get the first line, then we will proceed to read the rest. The first line in this file is the header.
In [48]:
f = open("Building_Inventory.csv")
csv_reader = csv.reader(f)
header = next(csv_reader)
In [49]:
header
Out[49]:
We will now pre-initialize a dict with the header values so that we can subsequently iterate and fill it. This will help us transform from a row-based store to a column-based store.
In [50]:
data = {}
for name in header:
    data[name] = []
data
Out[50]:
We're going to use zip to simultaneously iterate over two iterables; this works as follows, where you can see it "zip" up the two items and yield each pair in turn.
In [51]:
list1 = ['a', 'b', 'c', 'd']
list2 = [1, 2, 3, 4]
for v in zip(list1, list2):
    print(v)
Now, for every row, we append to the appropriate list.
In [52]:
for row in csv_reader:
    for name, value in zip(header, row):
        data[name].append(value)
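The whole row-to-column transformation can be tried without the real file. This sketch feeds csv.reader a tiny in-memory sample; the two columns and their values are made up, and the names sample, sample_header, and columns are chosen here so they don't overwrite the real data.

```python
import csv
import io

# A made-up three-line CSV standing in for the real file.
sample = io.StringIO("Agency Name,Square Footage\nDNR,144\nDOT,2000\n")

reader = csv.reader(sample)
sample_header = next(reader)  # the first line is the header

# One empty list per column name.
columns = {}
for name in sample_header:
    columns[name] = []

# Each row is split across the columns by pairing names with values.
for row in reader:
    for name, value in zip(sample_header, row):
        columns[name].append(value)

print(columns)
# {'Agency Name': ['DNR', 'DOT'], 'Square Footage': ['144', '2000']}
```

With a real file, opening it as `with open(path, newline="") as f:` closes the file automatically once the block ends.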
This gives us results like this:
In [53]:
data['Zip code']
Out[53]:
We have one name/list pair for every header entry:
In [54]:
data.keys()
Out[54]:
We can see how many zip code entries (rows) there are:
In [55]:
len(data['Zip code'])
Out[55]:
As well as how many unique zip codes there are.
In [56]:
len(set(data['Zip code']))
Out[56]:
Now, the same thing with congressional districts:
In [57]:
len(set(data['Congress Dist']))
Out[57]:
In [58]:
len(set(data['Congressional Full Name']))
Out[58]:
There's a special data structure called a Counter that we can use to figure out how many of each unique item there are.
In [59]:
from collections import Counter
In [60]:
c = Counter(data['Zip code'])
It associates each item with a count, as it iterates, and then we can get that information back.
In [61]:
max(c.values())
Out[61]:
We can sort by the most common:
In [62]:
c.most_common()
Out[62]:
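To see what a Counter holds, it helps to try it on a list small enough to count by hand. The zip codes below are sample values for illustration, not taken from the dataset.

```python
from collections import Counter

zips = ["61801", "61820", "61801", "62701", "61801", "61820"]
counts = Counter(zips)

print(counts["61801"])        # 3
print(counts.most_common(2))  # [('61801', 3), ('61820', 2)]

# Unlike a plain dict, asking for something never seen gives 0, not a KeyError.
print(counts["00000"])        # 0
```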
We can try to compute the square footage of each building in multiples of 100, but we'll see that...
In [63]:
data['Square Footage'][0] / 100
...it's currently all strings. What this means is that we need to convert it. We could do this by making another list, but we'll try instead using numpy arrays. Numpy arrays can be thought of as "lists" that aren't expandable, that contain objects that are all the same size (except for "object" arrays, which we won't cover) and that have a number of operations that take advantage of these assumptions.
First we import numpy as np, as per convention.
In [64]:
import numpy as np
Now, we can convert our list to integers:
In [65]:
square_footage = np.array(data['Square Footage'], dtype='int')
In [66]:
square_footage
Out[66]:
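Here is the same conversion on a tiny made-up list, showing both the failure with strings and the arithmetic that works once we have an array. The names raw and sqft are only for this sketch.

```python
import numpy as np

raw = ["144", "2000", "350"]

# Dividing a string by a number fails.
try:
    raw[0] / 100
except TypeError as err:
    print("cannot divide:", err)

# Converting to an integer array lets us operate on every value at once.
sqft = np.array(raw, dtype="int")
hundreds = sqft / 100
print(sqft.max())  # 2000
print(hundreds)
```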
There are a few operations we can call on numpy arrays, such as min/max, which we'll look at now:
In [67]:
square_footage.max()
Out[67]:
In [68]:
square_footage.min()
Out[68]:
Numpy arrays also allow for slicing. We'll look at every 10th value.
In [69]:
square_footage[::10]
Out[69]:
Now let's find out the 10 most common square footages.
In [70]:
Counter(data['Square Footage']).most_common(10)
Out[70]:
Huh, that's odd! There are a lot of buildings that are all about 12' x 12' (144 square feet). Let's see more about them, and find out which agencies they are in.
In [71]:
agencies = Counter()
for agency, sqfoot in zip(data['Agency Name'], data['Square Footage']):
    if int(sqfoot) == 144:
        agencies[agency] += 1
In [72]:
agencies
Out[72]:
Interesting. Lots of Department of Natural Resources. I bet these are picnic bench shelters!
Now let's get the year acquired in ints.
In [73]:
year_acquired = np.array(data['Year Acquired'], dtype='int')
We now need to set up our matplotlib plotting.
In [74]:
%matplotlib inline
In [75]:
import matplotlib.pyplot as plt
And, let's make our first plot! We will just pass the two arrays in, and everything that could possibly go wrong will!
In [76]:
plt.plot(year_acquired, square_footage)
Out[76]:
Alright, let's try that again. This time, let's make it ever-so-slightly better. We'll use circle markers and we'll not connect the lines.
One item to note here is that these circle markers are set in plot coordinates, not data coordinates. This means that the relative size, overlap, etc, will all be related to the plot characteristics. This is not ideal.
In [77]:
plt.plot(year_acquired, square_footage, 'og')
Out[77]:
We see some obvious outliers in this plot. Let's take a look and see if we can clean up the data a bit. We will use indexing by boolean arrays here.
In [78]:
square_footage == 144
Out[78]:
In [79]:
year_acquired == 0
Out[79]:
In [80]:
good = (year_acquired > 0)
In [81]:
year_acquired[good].min()
Out[81]:
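Boolean indexing is easier to follow on a handful of values. In this sketch, years is a made-up array where 0 stands in for a missing year.

```python
import numpy as np

years = np.array([0, 1950, 1753, 0, 2001])

# A comparison gives one True/False per element.
mask = years > 0
print(mask)  # [False  True  True False  True]

# Indexing with the mask keeps only the True positions.
print(years[mask])        # [1950 1753 2001]
print(years[mask].min())  # 1753

# np.where gives the positions where the condition holds.
positions = np.where(years == 1753)
print(positions[0])       # [2]
```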
In [82]:
np.where(year_acquired == 1753)
Out[82]:
In [83]:
for h in header:
    print("{}: {}".format(h, data[h][2799]))
Alright, we've done a bit of cleaning, seen that the state owns a building from 1753, and we can make some more plots. We'll also do some scale modification here.
In [84]:
plt.plot(year_acquired[good], square_footage[good], '.g')
plt.title("State Buildings")
plt.xlabel("Year Acquired")
plt.ylabel("Square Footage")
plt.yscale("log")
Let's pull out all the "zero square footage" bits, as well. We'll use a logical operation to do this.
In [85]:
good_sqf = square_footage > 0.0
gpos = good & good_sqf
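The & operator combines two boolean arrays element by element, so a row is kept only when both conditions hold. A minimal sketch with made-up values:

```python
import numpy as np

years = np.array([0, 1950, 1753, 2001])
sqft = np.array([144, 0, 900, 12000])

# Parentheses matter: & binds more tightly than the comparisons.
both = (years > 0) & (sqft > 0)
print(both)         # [False False  True  True]
print(years[both])  # [1753 2001]
print(sqft[both])   # [  900 12000]
```

The Python keyword and does not work here; it tries to treat a whole array as a single True or False and raises an error.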
The hexbin plot is our next step: this shows the density of points at any given location. We will make it relatively coarse, with a gridsize of 32.
In [86]:
plt.clf()
plt.hexbin(year_acquired[gpos], square_footage[gpos], yscale='log', bins='log', gridsize=32, cmap='viridis')
plt.title("State Buildings")
plt.xlabel("Year Acquired")
plt.ylabel("Square Footage")
plt.colorbar()
plt.yscale("log")
fig = plt.gcf()
We can now make modifications to the plot to change different aspects. Let's start by getting the figure's axes.
In [87]:
fig.axes
Out[87]:
In [88]:
ax = fig.axes[0]
fig
Out[88]:
We will work more with ticks later, but for now, we'll just experiment a little bit with them and how they can be modified. Let's first see the current locations.
In [89]:
for xtick in ax.xaxis.majorTicks:
    print(xtick.get_loc())
In [90]:
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
fig
Out[90]:
In [91]:
ax.xaxis.set_visible(True)
ax.yaxis.set_visible(True)
fig
Out[91]:
As an example, we can also turn off our major ticks. We'll pick up from this to see more plot modifications next week! Note here that we're also setting them en masse rather than modifying in-place.
In [92]:
new_ticks = []
for tick in ax.yaxis.majorTicks:
    tick.set_visible(False)
    new_ticks.append(tick)
ax.yaxis.majorTicks = new_ticks
fig
Out[92]:
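As a possible alternative to touching the tick objects directly, matplotlib also has axes-level methods for reading tick locations and hiding ticks. This is only a sketch on a separate made-up figure (demo_fig and demo_ax are names chosen here), not a replacement for the cells above.

```python
import matplotlib.pyplot as plt

# A small stand-in plot with made-up values.
demo_fig, demo_ax = plt.subplots()
demo_ax.plot([1900, 1950, 2000], [10, 1000, 100000], ".g")
demo_ax.set_yscale("log")

# Read the current x tick locations as an array.
tick_locations = demo_ax.get_xticks()
print(tick_locations)

# Hide the major tick marks and their labels on the left-hand y axis.
demo_ax.tick_params(axis="y", which="major", left=False, labelleft=False)

plt.close(demo_fig)
```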