Lecture 5: Advanced Data Structures

CSCI 1360E: Foundations for Informatics and Analytics

Overview and Objectives

We've covered list, tuples, sets, and dictionaries. These are the foundational data structures in Python. In this lecture, we'll go over some more advanced topics that are related to these datasets. By the end of this lecture, you should be able to

  • Compare and contrast generators and comprehensions, and how to construct them
  • Explain the benefits of generators, especially in the case of huge datasets
  • Loop over multiple lists simultaneously with zip() and index them with enumerate()

Part 1: List Comprehensions

Here's some good news: if we get right down to it, having done loops and lists already, there's nothing new here.

Here's the bad news: it's a different, and possibly less-easy-to-understand, but much more concise way of creating lists. We'll go over it bit by bit.

Let's look at an example from a previous lecture: creating a list of squares.


In [1]:
squares = []
for element in range(10):
    squares.append(element ** 2)
print(squares)


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

I know this is repetitive, but let's break down what we have.

for element in range(10):

  • It's a standard "for" loop.
  • The thing we're iterating over is at the end: range(10), or a list[-like thing] of numbers [0, 10) by 1s.
  • In each loop, the current element from range(10) is stored in element.

squares.append(element ** 2)

  • Inside the loop, we append a new item to our list squares
  • The item is computed by taking the current its, element, and computing its square

We'll see these same pieces show up again, just in a slightly different order.


In [2]:
squares = [element ** 2 for element in range(10)]
print(squares)


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

There it is: a list comprehension. Let's break it down.

  • Notice, first, that the entire expression is surrounded by the square brackets [ ] of a list. This is for the exact reason you'd think: we're building a list!
  • The "for" loop is completely intact, too; the entire header appears just as before.
  • The biggest wrinkle is the loop body. It appears right after the opening bracket, before the loop header. The rationale for this is that it's easy to see from the start of the line that
    1. We're building a list (revealed by the opening square bracket), and
    2. The list is built by successfully squaring a variable element

Let's say we have some dictionary of word counts:


In [3]:
word_counts = {
    'the': 10,
    'race': 2,
    'is': 3,
    'on': 5
}

and we want to generate a list of sentences:


In [4]:
sentences = ['"{}" appears {} times.'.format(word, count) for word, count in word_counts.items()]
print(sentences)


['"on" appears 5 times.', '"race" appears 2 times.', '"the" appears 10 times.', '"is" appears 3 times.']
  • Start with the loop header--you see it on the far right: for word, count in word_counts.items()
  • Then look at the loop body: '"{}" appears {} times.'.format(word, count)
  • All wrapped in square brackets [ ]
  • Assigned to the variable sentences

Part 2: Generators

Generators are cool twists on lists (see what I did there). They've been around since Python 2 but took on a whole new life in Python 3.

That said, if you ever get confused about generators, just think of them as lists. This can potentially get you in trouble with weird errors, but 90% of the time it'll work every time.

Let's start with an example you're probably already quite familiar with: range()


In [5]:
x = range(10)

As we know, this will create a list[-like thing] with the numbers 0 through 9, inclusive, and assign it to the variable x.

Now you'll see why I've been using the "list[-like thing]" notation: it's not really a list!


In [6]:
print(x)
print(type(x))


range(0, 10)
<class 'range'>

To get a list, we've been casting the generator to a list:


In [7]:
list(x)


Out[7]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

and we get a vanilla Python list.

So range() gives us a generator! Great! ...what does that mean, exactly?

For most practical purposes, generators and lists are indistinguishable. However, there are some key differences to be aware of:

  • Generators are "lazy". This means when you call range(10), not all 10 numbers are immediately computed; in fact, none of them are. They're computed on-the-fly in the loop itself! This really comes in handy if, say, you wanted to loop through 1 trillion numbers, or call range(1000000000000). With vanilla lists, this would immediately create 1 trillion numbers in memory and store them, taking up a whole lot of space. With generators, only 1 number is ever computed at a given loop iteration. Huge memory savings!
  • Generators only work once. This is where you can get into trouble. Let's say you're trying to identify the two largest numbers in a generator of numbers. You'd loop through once and identify the largest number, then use that as a point of comparison to loop through again to find the second-largest number (you could do it with just one loop, but for the sake of discussion let's assume you did it this way). With a list, this would work just fine. Not with a generator, though. You'd need to explicitly recreate the generator.

How do we build generators? Aside from range(), that is.

Remember list comprehensions? Just replace the brackets of a list comprehension [ ] with parentheses ( ).


In [8]:
x = [i for i in range(10)]  # Brackets -> list
print(x)


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [9]:
x = (i for i in range(10))  # Parentheses -> generator
print(x)


<generator object <genexpr> at 0x10415a410>

Also--where have we seen parentheses before? TUPLES! You can think of a generator as a sort of tuple. After all, like a tuple, a generator is immutable (cannot be changed once created). Be careful with this, though: all generators are very like tuple, but not all tuples are like generators.

In sum, use lists if:

  • you're working with a relatively small amount of elements
  • you want to add to / edit / remove from the elements
  • you need direct access to arbitrary elements, e.g. some_list[431]

On the other hand, use generators if:

  • you're working with a giant collection of elements
  • you'll only loop through the elements once or twice
  • when looping through elements, you're fine going in sequential order

Part 3: Other looping mechanisms

There are a few other advanced looping mechanisms in Python that are a little complex, but can make your life a lot easier when used correctly (especially if you're a convert from something like C++ or Java).

zip()

zip() is a small method that packs a big punch. It "zips" multiple lists together into something of one big mega-list for the sole purpose of being able to iterate through them all simultaneously.

We've already seen something like this before: the items() method in dictionaries. Dictionaries are more or less two lists stacked right up against each other: one list holds the keys, and the corresponding elements of the other list holds the values for each key. items() lets us loop through both simultaneously, giving us the corresponding elements from each list, one at a time:


In [10]:
d = {
    'uga': 'University of Georgia',
    'gt': 'Georgia Tech',
    'upitt': 'University of Pittsburgh',
    'cmu': 'Carnegie Mellon University'
}
for key, value in d.items():
    print("'{}' stands for '{}'.".format(key, value))


'gt' stands for 'Georgia Tech'.
'cmu' stands for 'Carnegie Mellon University'.
'uga' stands for 'University of Georgia'.
'upitt' stands for 'University of Pittsburgh'.

zip() does pretty much the same thing, but on steroids: rather than just "zipping" together two lists, it can zip together as many as you want.

Here's an example: first names, last names, and favorite programming languages.


In [11]:
first_names = ['Shannon', 'Jen', 'Natasha', 'Benjamin']
last_names = ['Quinn', 'Benoit', 'Romanov', 'Button']
fave_langs = ['Python', 'Java', 'Assembly', 'Go']

I want to loop through these three lists simultaneously, so I can print out the person's first name, last name, and their favorite language on the same line. Since I know they're the same length, I could just do a range(len(fname)), but this is arguably more elegant:


In [12]:
for fname, lname, lang in zip(first_names, last_names, fave_langs):
    print("{} {}'s favorite language is {}.".format(fname, lname, lang))


Shannon Quinn's favorite language is Python.
Jen Benoit's favorite language is Java.
Natasha Romanov's favorite language is Assembly.
Benjamin Button's favorite language is Go.

enumerate()

Of course, there are always those situations where it's really, really nice to have an index variable in the loop. Let's take a look at that previous example:


In [13]:
for fname, lname, lang in zip(first_names, last_names, fave_langs):
    print("{} {}'s favorite language is {}.".format(fname, lname, lang))


Shannon Quinn's favorite language is Python.
Jen Benoit's favorite language is Java.
Natasha Romanov's favorite language is Assembly.
Benjamin Button's favorite language is Go.

This is great if all I want to do is loop through the lists simultaneously. But what if the ordering of the elements matters? For example, I want to prefix each sentence with the line number. How can I track what index I'm on in a loop if I don't use range()?

enumerate() handles this. By wrapping the object we loop over inside enumerate(), on each loop iteration we not only get the next object of interest, but also the index of that object. To wit:


In [14]:
x = ['a', 'list', 'of', 'strings']
for index, element in enumerate(x):
    print("Found '{}' at index {}.".format(element, index))


Found 'a' at index 0.
Found 'list' at index 1.
Found 'of' at index 2.
Found 'strings' at index 3.

This comes in handy anytime you need to loop through a list or generator, but also need to know what index you're on.

Review Questions

Some questions to discuss and consider:

1: I want a list of all possible combinations of (x, y) values for the range [0, 9]. Show how this can be done with a list comprehension using two for-loops.

2: Without consulting Google University, consider how generators might work under the hood. How do you think they're implemented?

3: Go back to the example with three lists (first names, last names, and programming languages). Show how you could use enumerate to prepend a line number (the current index of the lists) to the sentence printed for each person, e.g.: "17: Joe Schmo's favorite language is C++."


Course Administrivia

"Cell was changed and shouldn't have" errors on your assignments. If you're getting these errors, it's because you put your code in the wrong cell. Make sure you edit only the cells that say # YOUR CODE HERE or YOUR ANSWER HERE. Also, be sure to delete or comment out the line that says raise NotImplementedError().

If you need to re-fetch an assignment, you have to delete the entire directory of the old version. For example, in the case where errors are found in the assignment and a new version needs to be pushed, you'll have to delete your current version as well as the folder it's in--so, everything--in order to re-fetch a new version.

Feedback is welcome! This is a new course being taught a new way using nascent technology. Let me know if you're running into problems or something should be improved!

Additional Resources

  1. Matthes, Eric. Python Crash Course. 2016. ISBN-13: 978-1593276034
  2. Grus, Joel. Data Science from Scratch. 2015. ISBN-13: 978-1491901427