Lecture 6: Advanced Data Structures

CSCI 1360: Foundations for Informatics and Analytics

Overview and Objectives

We've covered list, tuples, sets, and dictionaries. These are the foundational data structures in Python. In this lecture, we'll go over some more advanced topics that are related to these datasets. By the end of this lecture, you should be able to

  • Compare and contrast generators and comprehensions, and how to construct them
  • Explain the benefits of generators, especially in the case of huge datasets
  • Loop over multiple lists simultaneously with zip() and index them with enumerate()
  • Dive into advanced iterations using the itertools module

Part 1: List Comprehensions

Here's some good news: if we get right down to it, having done loops and lists already, there's nothing new here.

Here's the bad news: it's a different, and possibly less-easy-to-understand, but much more concise way of creating lists. We'll go over it bit by bit.

Let's look at an example from a previous lecture: creating a list of squares.


In [1]:
squares = []
for element in range(10):
    squares.append(element ** 2)
print(squares)


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Let's break it down.

for element in range(10):

  • It's a standard "for" loop header.
  • The thing we're iterating over is at the end: range(10), or a list[-like thing] of numbers [0, 10) by 1s.
  • In each loop, the current element from range(10) is stored in element.

squares.append(element ** 2)

  • Inside the loop, we append a new item to our list squares
  • The item is computed by taking the current its, element, and computing its square

We'll see these same pieces show up again, just in a slightly different order.


In [2]:
squares = [element ** 2 for element in range(10)]
print(squares)


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

There it is: a list comprehension. Let's break it down.

  • Notice, first, that the entire expression is surrounded by the square brackets [ ] of a list. This is for the exact reason you'd think: we're building a list!
  • The "for" loop is completely intact, too; the entire header appears just as before.
  • The biggest wrinkle is the loop body. It appears right after the opening bracket, before the loop header. The rationale for this is that it's easy to see from the start of the line that
    1. We're building a list (revealed by the opening square bracket), and
    2. The list is built by successfully squaring a variable element

Let's say we have some dictionary of word counts:


In [1]:
word_counts = {
    'the': 10,
    'race': 2,
    'is': 3,
    'on': 5
}

and we want to generate a list of sentences:


In [4]:
sentences = ['"{}" appears {} times.'.format(word, count) for word, count in word_counts.items()]
print(sentences)


['"on" appears 5 times.', '"race" appears 2 times.', '"the" appears 10 times.', '"is" appears 3 times.']
  • Start with the loop header--you see it on the far right: for word, count in word_counts.items()
  • Then look at the loop body: '"{}" appears {} times.'.format(word, count)
  • All wrapped in square brackets [ ]
  • Assigned to the variable sentences

Here's another example: going from a dictionary of word counts to a list of squared counts.


In [2]:
squared_counts = [value ** 2 for key, value in word_counts.items()]
print(squared_counts)


[25, 100, 9, 4]
  • We used the items() method on the dictionary again, which gives us a list of tuples
  • Since we know the items in the list are tuples of two elements, we use unpacking
  • We provide our element-by-element construction of the list with our statement value ** 2, squaring the value

Part 2: Generators

Generators are cool twists on lists (see what I did there). They've been around since Python 2 but took on a whole new life in Python 3.

That said, if you ever get confused about generators, just think of them as lists. This can potentially get you in trouble with weird errors, but 90% of the time it'll work every time.

Let's start with an example you're probably already quite familiar with: range()


In [5]:
x = range(10)

As we know, this will create a list[-like thing] with the numbers 0 through 9, inclusive, and assign it to the variable x.

Now you'll see why I've been using the "list[-like thing]" notation: it's not really a list!


In [6]:
print(x)
print(type(x))


range(0, 10)
<class 'range'>

To get a list, we've been casting the generator to a list:


In [7]:
list(x)


Out[7]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

and we get a vanilla Python list.

So range() gives us a generator! Great! ...what does that mean, exactly?

For most practical purposes, generators and lists are indistinguishable. However, there are some key differences to be aware of:

  • Generators are "lazy". This means when you call range(10), not all 10 numbers are immediately computed; in fact, none of them are. They're computed on-the-fly in the loop itself! This really comes in handy if, say, you wanted to loop through 1 trillion numbers, or call range(1000000000000). With vanilla lists, this would immediately create 1 trillion numbers in memory and store them, taking up a whole lot of space. With generators, only 1 number is ever computed at a given loop iteration. Huge memory savings!
  • Generators only work once. This is where you can get into trouble. Let's say you're trying to identify the two largest numbers in a generator of numbers. You'd loop through once and identify the largest number, then use that as a point of comparison to loop through again to find the second-largest number (you could do it with just one loop, but for the sake of discussion let's assume you did it this way). With a list, this would work just fine. Not with a generator, though. You'd need to explicitly recreate the generator.

How do we build generators? Aside from range(), that is.

Remember list comprehensions? Just replace the brackets of a list comprehension [ ] with parentheses ( ).


In [8]:
x = [i for i in range(10)]  # Brackets -> list
print(x)


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [9]:
x = (i for i in range(10))  # Parentheses -> generator
print(x)


<generator object <genexpr> at 0x10415a410>

Also--where have we seen parentheses before? TUPLES! You can think of a generator as a sort of tuple. After all, like a tuple, a generator is immutable (cannot be changed once created). Be careful with this, though: all generators are very like tuples, but not all tuples are like generators.

In sum, use lists if:

  • you're working with a relatively small amount of elements
  • you want to add to / edit / remove from the elements
  • you need direct access to arbitrary elements, e.g. some_list[431]

On the other hand, use generators if:

  • you're working with a giant collection of elements
  • you'll only loop through the elements once or twice
  • when looping through elements, you're fine going in sequential order

Part 3: Other looping mechanisms

There are a few other advanced looping mechanisms in Python that are a little complex, but can make your life a lot easier when used correctly (especially if you're a convert from something like C++ or Java).

zip()

zip() is a small method that packs a big punch. It "zips" multiple lists together into something of one big mega-list for the sole purpose of being able to iterate through them all simultaneously.

We've already seen something like this before: the items() method in dictionaries. Dictionaries are more or less two lists stacked right up against each other: one list holds the keys, and the corresponding elements of the other list holds the values for each key. items() lets us loop through both simultaneously, giving us the corresponding elements from each list, one at a time:


In [10]:
d = {
    'uga': 'University of Georgia',
    'gt': 'Georgia Tech',
    'upitt': 'University of Pittsburgh',
    'cmu': 'Carnegie Mellon University'
}
for key, value in d.items():
    print("'{}' stands for '{}'.".format(key, value))


'gt' stands for 'Georgia Tech'.
'cmu' stands for 'Carnegie Mellon University'.
'uga' stands for 'University of Georgia'.
'upitt' stands for 'University of Pittsburgh'.

zip() does pretty much the same thing, but on steroids: rather than just "zipping" together two lists, it can zip together as many as you want.

Here's an example: first names, last names, and favorite programming languages.


In [11]:
first_names = ['Shannon', 'Jen', 'Natasha', 'Benjamin']
last_names = ['Quinn', 'Benoit', 'Romanov', 'Button']
fave_langs = ['Python', 'Java', 'Assembly', 'Go']

I want to loop through these three lists simultaneously, so I can print out the person's first name, last name, and their favorite language on the same line. Since I know they're the same length, I could just do a range(len(fname)), but this is arguably more elegant:


In [12]:
for fname, lname, lang in zip(first_names, last_names, fave_langs):
    print("{} {}'s favorite language is {}.".format(fname, lname, lang))


Shannon Quinn's favorite language is Python.
Jen Benoit's favorite language is Java.
Natasha Romanov's favorite language is Assembly.
Benjamin Button's favorite language is Go.

enumerate()

Of course, there are always those situations where it's really, really nice to have an index variable in the loop. Let's take a look at that previous example:


In [13]:
for fname, lname, lang in zip(first_names, last_names, fave_langs):
    print("{} {}'s favorite language is {}.".format(fname, lname, lang))


Shannon Quinn's favorite language is Python.
Jen Benoit's favorite language is Java.
Natasha Romanov's favorite language is Assembly.
Benjamin Button's favorite language is Go.

This is great if all I want to do is loop through the lists simultaneously. But what if the ordering of the elements matters? For example, I want to prefix each sentence with the line number. How can I track what index I'm on in a loop if I don't use range()?

enumerate() handles this. By wrapping the object we loop over inside enumerate(), on each loop iteration we not only get the next object of interest, but also the index of that object. To wit:


In [14]:
x = ['a', 'list', 'of', 'strings']
for index, element in enumerate(x):
    print("Found '{}' at index {}.".format(element, index))


Found 'a' at index 0.
Found 'list' at index 1.
Found 'of' at index 2.
Found 'strings' at index 3.

This comes in handy anytime you need to loop through a list or generator, but also need to know what index you're on.

break and continue

  • With for loops, you specify how many times to run the loop.
  • With while loops, you iterate until some condition is met.

For the vast majority of cases, this works well. But sometimes you need just a little more control for extenuating circumstances.

Take the example of a web server:

  • Listens for incoming requests
  • Serves those requests (e.g. returns web pages)
  • Goes back to listening for more requests

This is essentially implemented using a purposefully-infinite loop:


In [ ]:
while True:
    # Listen for incoming requests
    
    # Handle the request

How do you get out of this infinite loop? With a break statement.


In [2]:
import numpy as np

def handle_request():
    return np.random.randint(100)

loops = 0
while True:
    loops += 1
    req = handle_request()
    break
        
print("Exiting program after {} loop.".format(loops))


Exiting program after 1 loops.

Just break. That will snap whatever loop you're currently in and immediately dump you out just after it.

Same thing with for loops:


In [3]:
for i in range(100000):  # Loop 100,000 times!
    break
print(i)


0

Similar to break is continue, though you use this when you essentially want to "skip" certain iterations.

continue will also halt the current iteration, but instead of ending the loop entirely, it basically skips you on to the next iteration of the loop without executing any code that may be below it.


In [4]:
for i in range(100):
    continue
    print("This will never be printed.")
print(i)


99

Notice how the print statement inside the loop is never executed, but our loop counter i is still incremented through the very end.

Part 4: itertools

Aside: Welcome to the big wide world of Beyond Core Python.

Technically we're still in base Python, but in order to use more advanced iterating tools, we have to actually pull in an external package--itertools.

Justin Duke has an excellent web tutorial, which I'll reproduce in part here. Let's say we have a couple of lists we want to operate on:


In [3]:
letters = ['a', 'b', 'c', 'd', 'e', 'f']
booleans = [1, 0, 1, 0, 0, 1]
numbers = [23, 20, 44, 32, 7, 12]
decimals = [0.1, 0.7, 0.4, 0.4, 0.5]

Now: I want you to string all four of these lists together, end-to-end, into one long list. How would you do it?

Here's a way simpler way, though it requires pulling in an external package. You can do this with the keyword import:


In [4]:
import itertools

Now, we have access to all the functions available in the itertools package--to use them, just type the package name, a dot ".", and the function you want to call.

In this example, we want to use the itertools.chain() function:


In [6]:
monster = itertools.chain(letters, booleans, numbers, decimals)
print(monster)


<itertools.chain object at 0x106c7bda0>

Err, what's an itertools.chain object?

Don't panic--any thoughts as to what kind of object this might be?

It's an iterable, and we know how to handle those!


In [7]:
for item in monster:
    print(item, end = " ")


a b c d e f 1 0 1 0 0 1 23 20 44 32 7 12 0.1 0.7 0.4 0.4 0.5 

And there they are--all four lists, joined at the hip.

Another phenomenal function is combinations.

If you've ever taken a combinatorics class, or are at all interested in the idea of finding all the possible combinations of a certain collection of things, this is your function.

A common task in data science is finding combinations of configuration values that work well together--e.g., plotting your data in two dimensions. Which two dimensions will give the nicest plot?

Here's a list of numbers. How many possible pairings are there?


In [8]:
items = ['one', 'two', 'three', 'four', 'five', 'six']

In [9]:
combos = itertools.combinations(items, 2)  # The "2" means pairs.
for combo in combos:
    print(combo)


('one', 'two')
('one', 'three')
('one', 'four')
('one', 'five')
('one', 'six')
('two', 'three')
('two', 'four')
('two', 'five')
('two', 'six')
('three', 'four')
('three', 'five')
('three', 'six')
('four', 'five')
('four', 'six')
('five', 'six')

It doesn't have to be pairs; we can also try to find all the triplets of items.


In [10]:
combos = itertools.combinations(items, 3)  # Now it's a 3.
for combo in combos:
    print(combo)


('one', 'two', 'three')
('one', 'two', 'four')
('one', 'two', 'five')
('one', 'two', 'six')
('one', 'three', 'four')
('one', 'three', 'five')
('one', 'three', 'six')
('one', 'four', 'five')
('one', 'four', 'six')
('one', 'five', 'six')
('two', 'three', 'four')
('two', 'three', 'five')
('two', 'three', 'six')
('two', 'four', 'five')
('two', 'four', 'six')
('two', 'five', 'six')
('three', 'four', 'five')
('three', 'four', 'six')
('three', 'five', 'six')
('four', 'five', 'six')

Review Questions

Some questions to discuss and consider:

1: I want a list of all possible combinations of (x, y) values for the range [0, 9]. Show how this can be done with a list comprehension using two for-loops.

2: Without consulting Google University, consider how generators might work under the hood. How do you think they're implemented?

3: Go back to the example with three lists (first names, last names, and programming languages). Show how you could use enumerate to prepend a line number (the current index of the lists) to the sentence printed for each person, e.g.: "17: Joe Schmo's favorite language is C++."

4: Consider that you have a list of lists--representing a matrix--and you want to convert each "row" of the matrix to a tuple, where the first element is an integer row index, and the second element is the row itself (the original list). Assuming the list-of-lists matrix already exists, how could you add the list of indices to it?

5: How would you implement itertools.combinations on your own, just using loops?

Course Administrivia

"Cell was changed and shouldn't have" errors on your assignments. If you're getting these errors, it's because you put your code in the wrong cell. Make sure you edit only the cells that say # YOUR CODE HERE or YOUR ANSWER HERE. Also, be sure to delete or comment out the line that says raise NotImplementedError().

If you need to re-fetch an assignment, you have to delete the entire directory of the old version. For example, in the case where errors are found in the assignment and a new version needs to be pushed, you'll have to delete your current version as well as the folder it's in--so, everything--in order to re-fetch a new version.

First review session on Thursday! Come with questions. I'll have some exercises for everyone to work on as well--exercises that could very likely end up on the midterm or final in some form.

How is A1 going? A2 will be out on Thursday as well.

Additional Resources

  1. Matthes, Eric. Python Crash Course. 2016. ISBN-13: 978-1593276034
  2. Grus, Joel. Data Science from Scratch. 2015. ISBN-13: 978-1491901427
  3. Duke, Justin. *A Gentle Introduction to itertools. http://jmduke.com/posts/a-gentle-introduction-to-itertools/