Lesson 4: Data Structures and File Parsing


Table of Contents

  1. Data structures I: Lists
  2. Data structures II: Dictionaries
  3. String parsing with .split()
  4. Test your understanding: practice set 4

1. Data structures I: Lists


What is a data structure?

A data structure is a way of storing large amounts of data (numbers, strings, etc.) in an organized manner, making storage and retrieval easier. Python provides several built-in data structures, but we'll just go over the two most common ones: lists and dictionaries.

What is a list?

A list is one type of built-in data structure in Python that is specialized for storing data in an ordered, sequential manner. We've already seen an example of lists when we used the range() function:


In [0]:
print range(5)

Now we'll go over more formally what a list is and how it can be used.

Side note for people who have used other programming languages: Lists are similar to what other programming languages call arrays. There are some subtle (but important) differences between Python lists and the fixed-size arrays of languages like C or Java (under the hood, a Python list is a dynamically resizing array, not a linked list, and it can hold elements of different types), but for most purposes they perform the same role. The most obvious difference you might notice is that you don't need to specify ahead of time how large your list will be. This is because the size of the list grows dynamically as you add things to it (it also shrinks automatically as you take things out).
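To make the "grows and shrinks" point concrete, here is a minimal sketch. The list name and gene labels are made up for illustration, and .append() and .pop() are explained later in this lesson.

```python
# print(x) with one argument works in both Python 2 and Python 3
samples = []               # hypothetical list; no size is declared up front
print(len(samples))        # 0

samples.append("geneA")    # the list grows as we add elements
samples.append("geneB")
print(len(samples))        # 2

samples.pop()              # ...and shrinks as we remove them
print(len(samples))        # 1
```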


How lists work

A Python list looks something like this when we print it out:

[3, "cat", 56.9, 4, 10, True]

However, it may be more helpful to think of a list as a row of numbered cells:

index:   0     1      2     3    4    5
value:   3   "cat"   56.9   4    10   True

Here, each thing in the list (element) is stored in its own cell, and each cell is given a sequential integer index, starting at 0.

We use only one variable name to refer to the whole list. To access a specific element in the list, we use the index of the element with the following syntax: listName[index]. For example:


In [0]:
myList = [3, "cat", 56.9, 4, 10, True]  # recreating the list above
print myList[0]
print myList[1]

This ability to access a potentially huge amount of data using just one variable name is part of what makes data structures so useful. Imagine if you had 20,000 gene IDs you wanted to use in your code -- it wouldn't be feasible to create a separate variable name for each one. Instead, you can just dump all the gene IDs into a single list, and access them by index. We'll see how to actually do this sort of thing later in this lesson!


Important side note: "in-place" functions

Before we talk more about lists, we need to briefly introduce the idea of "in-place" functions.

The functions we've seen so far do not modify variables directly -- they simply "return" a value. For example, line.rstrip('\n') does nothing to the original string line, it just returns a modified version. To actually change line, you need to say line = line.rstrip('\n'), which overwrites line with the new value. Here's a similar example in code:


In [0]:
line = "ATGCGTA***********"

line.rstrip("*")
print line

line = line.rstrip("*")
print line

Below, we're going to see a few examples of functions that do directly modify the variable that they act on. These functions are called "in-place" functions, and I'll make a note of it wherever we encounter them.
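As a quick preview of the difference, here is a small sketch with made-up variable names. It contrasts a string function that returns a new value with a list function that changes the list directly (.append() is covered in the next section).

```python
# print(x) with one argument works in both Python 2 and Python 3
seq = "atgc"
upper = seq.upper()    # returns a new string; seq itself is untouched
print(seq)             # atgc
print(upper)           # ATGC

reads = [30, 10]
reads.append(20)       # in-place: no "reads = ..." needed
print(reads)           # [30, 10, 20]
```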


Using lists

Creating a new list

A list can expand or shrink as we add or remove things from it. Before we can do this, we need to create the list itself. Most often we'll just start off with an empty list, but sometimes it can be useful to pre-fill the list with certain values. Here are the three main ways to create a list:

myList = []                          # create a new empty list
myList = [element1, element2, etc]   # create a new list with some things already in it
myList = range(num)                  # create a new list automatically filled with a range of numbers

Example:


In [0]:
shoppingList = ["pizza", "ice cream", "cat food"]
print shoppingList


Accessing elements in a list

As we saw above, an element of a list can be accessed using its index. Do not try to access an index that's not yet in the list -- this will give an error.

someData = myList[index]

You can also index backwards using negative indices:

lastElement = myList[-1]

Examples:


In [0]:
contestRanking = ["Sally", "Billy", "Tommy", "Wilfred"]

print "First place goes to", contestRanking[0], "!"
print "Congrats also to our second and third place winners,", contestRanking[1], "and", contestRanking[2]
print "And in last place...", contestRanking[-1]

In [0]:
contestRanking = ["Sally", "Billy", "Tommy", "Wilfred"]

print "6th place goes to", contestRanking[5]


Adding to a list

After creating a list, you can add additional elements to the end using .append(). This is an in-place function, meaning that it directly modifies the list.

myList.append(element)

You can also insert an element at a specified index using .insert(). The element that was at that index, and every element after it, will shift up one index. (in-place function)

myList.insert(index, value)

Example:


In [0]:
shoppingList = ["pizza", "ice cream", "cat food"]
shoppingList.append("english muffins")

print shoppingList

shoppingList.insert(2, "lembas")
print shoppingList


Removing from a list

After creating a list, you can remove elements from it using .pop(). Elements that come after the removed element will shift down one index so that there are no empty spaces in the list. .pop() also returns the element that was "popped". (in-place function)

myList.pop(index)  # removes (and returns) the element at the specified index
myList.pop()       # removes (and returns) the last element

You can also remove the first occurrence of a specified element using .remove(). Elements that come after will shift down one index. (in-place function)

myList.remove(element) 

Examples:


In [0]:
toDoList = ["Water plants", "feed cat", "do dishes", "make python lesson"]
doNext = toDoList.pop()

print doNext
print toDoList

In [0]:
pizzaToppings = ["peppers", "sausage", "bananas", "pepperoni"]
pizzaToppings.remove("bananas")

print pizzaToppings
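The examples above don't show .pop() with an index, or what .remove() does when an element appears more than once. Here is a small sketch of both, using a made-up list:

```python
# print(x) with one argument works in both Python 2 and Python 3
queue = ["a", "b", "c", "b"]   # hypothetical list with a repeated element

first = queue.pop(0)           # removes and returns the element at index 0
print(first)                   # a
print(queue)                   # ['b', 'c', 'b']

queue.remove("b")              # removes only the FIRST "b"
print(queue)                   # ['c', 'b']
```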


Checking if something is in the list

To check if a particular element is in a list, you can just use a special logical operator "in" (note that this is used differently in a logical statement as compared to a for loop):

if element in myList:
    ... do something

Example:


In [0]:
shoppingList = ["pizza", "ice cream", "cat food"]
item = "20lb bag of reese's"

if item in shoppingList:
    print "I'LL TAKE IT"
else:
    print "Not today..."


Iterating through a list

This should be pretty familiar by now. Since a list is an iterable, we can loop through it using a for loop:

for element in myList:
    ... do something

Example:


In [0]:
currentCats = ["Mittens", "Tatertot", "Meatball", "Star Destroyer"]

for cat in currentCats:
    print cat


Other list operations

listLen = len(myList)   # Get the length of a list

myList.sort()           # Sort (in-place function)

myList.reverse()        # Reverse (in-place function)
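Here is a short sketch of these three operations on a made-up list of numbers:

```python
# print(x) with one argument works in both Python 2 and Python 3
scores = [42, 7, 19]   # hypothetical data

print(len(scores))     # 3

scores.sort()          # in-place: scores itself is now sorted
print(scores)          # [7, 19, 42]

scores.reverse()       # in-place: scores itself is now reversed
print(scores)          # [42, 19, 7]
```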

Related data structure: Tuples

There is another data structure in Python, called a tuple, that is similar to a list but not exactly the same. We won't focus too much on tuples in this class. Like a list, a tuple is a sequence of items, but unlike a list, a tuple is immutable: once you have defined it, you cannot change, add, or remove its items. Tuples are useful when you know exactly how many items you need and you aren't going to change them.

Tuples are defined similarly to lists, except that they use rounded parentheses instead of square brackets, and can also be defined without any parentheses. For example:

myTuple = ()                   # create an empty tuple
myTuple = "a", "b", "c", "d"   # create a tuple of strings
myTuple = ("a", "b", "c", "d") # this makes the same exact tuple
myTuple = ("a", "b", 3, 4)     # tuples can also include different data types
myTuple = ("a",)               # to make a tuple with a single value, you need a comma

For the purposes of this class, we won't go through all the different methods for tuples, except that they can be accessed similarly to lists, i.e.

myTuple[index]

For more info on tuples, see http://www.tutorialspoint.com/python/python_tuples.htm. Just note that you cannot change individual elements in a tuple!
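To see what "immutable" means in practice, here is a minimal sketch with made-up names. It uses try/except to catch the error so the code keeps running; that construct isn't covered in this lesson, so treat it as an illustration only.

```python
# print(x) with one argument works in both Python 2 and Python 3
point = (3, 4)
print(point[0])                    # 3 (reading by index works like a list)

try:
    point[0] = 10                  # trying to change an element...
except TypeError:
    print("Cannot modify a tuple") # ...raises a TypeError

single = ("a",)                    # the trailing comma makes this a tuple
notATuple = ("a")                  # without it, this is just the string "a"
print(type(single) is tuple)       # True
print(type(notATuple) is tuple)    # False
```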

2. Data structures II: Dictionaries


The next data structure we'll talk about is the dictionary. There are two key differences between dictionaries and lists:

  1. In a dictionary, you retrieve elements using a key rather than an index
  2. Dictionaries are unordered

Difference 1: Dictionaries are indexed by keys

With a dictionary (sometimes called a hash table or hash in other programming languages), you access elements by a name ("key") that you pick, rather than by a position. Each key is paired with one value.

Keys are usually strings or numbers. The only restriction is that each key must be unique.

To retrieve a value from the dictionary, we use the following notation: dictName[key]. For example:


In [0]:
myDict = {'age':3, 'animal':'cat', 'num':56.9, 203:4, 'count':10, 'flag':True}   # creates a dictionary of key:value pairs

print myDict['animal']
print myDict[203]

Difference 2: Dictionaries are unordered

Lists are all about keeping elements in some order. Though you may change the ordering from time to time, it's still in some predictable order at all times.

You should think of dictionaries more like magic grab bags. You mark each piece of data with a key, then throw it in the bag. When you want that data back, you just tell the bag the key and it spits out the data assigned to that key. There's no intrinsic order to the things in the bag, they're all just kind of jumbled around in there. Because of this, dictionaries aren't great for situations where we need to keep data in a specific order -- but they're very convenient in other situations, as we'll see in a minute.

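Technical side note: Ok, so in reality, there is an order to your dictionary. But it is an order that Python picks that obeys complex rules and is essentially unpredictable by us. So as far as we're concerned, it may as well be unordered. (This describes Python 2, which this lesson uses. Starting with Python 3.7, dictionaries keep their keys in insertion order.)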

For example, here's what happens when we print a dictionary -- you can see that the elements are not maintained in the same order they were added:


In [0]:
myDict = {'age':3, 'animal':'cat', 'num':56.9, 203:4, 'count':10, 'flag':True}
print myDict

Using dictionaries


Creating a dictionary

Create a new empty dictionary:

myDict = {}

Create a new dictionary with some elements:

myDict = {key1: value1, key2: value2}

Example:


In [0]:
myDict = {"Joe": 25, "Sally": 35}
print myDict


Adding to a dictionary

Add a new key-value pair to an existing dictionary:

myDict[newKey] = newVal

Note that if the specified key is already in the dictionary, the associated value will be overwritten by the new value we assign here!


In [0]:
myDict = {"Joe": 25, "Sally": 35}
myDict["Bobby"] = 65
myDict["Joe"] = 104
print myDict


Removing from a dictionary

Delete a key-value pair from an existing dictionary:

del myDict[existingKey]

Note that trying to delete a key that is not in the dictionary will give an error.


Check if something is already in the dictionary

This works the same way it did with lists -- just use the in operator:

if someKey in myDict:
    ... do something
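Here is a small sketch (with made-up data) showing that in looks at the keys of a dictionary, not the values, and how it can guard a deletion so that a missing key doesn't cause an error:

```python
# print(x) with one argument works in both Python 2 and Python 3
ages = {"Joe": 25, "Sally": 35}   # hypothetical dictionary

print("Joe" in ages)              # True  (it is a key)
print(25 in ages)                 # False (25 is a value, not a key)

del ages["Joe"]                   # fine: "Joe" is in the dictionary
print(ages)                       # {'Sally': 35}

if "Bobby" in ages:               # check first, so del never fails
    del ages["Bobby"]
else:
    print("Bobby is not in the dictionary")
```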


Iterating through a dictionary

A dictionary is an iterable, and the iterable unit is the key. So every time we loop, a new key from the dictionary is assigned to the for loop variable.

for key in myDict:
    ... do something

In [0]:
myDict = {"Joe": 25, "Sally": 35}

for person in myDict:
    print "Name:", person, "- Age:", myDict[person]


Special dictionary functions

Often, we'll want to quickly get a list of all the keys or values. Python provides the following functions to do this.

Get a list of the keys only:

keyList = myDict.keys()

Get a list of the values only:

valueList = myDict.values()
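A quick sketch of both functions follows. Because dictionaries don't promise any particular order here, the example wraps each result in sorted() (an assumption added for this example, not something required by .keys() or .values()) so the output is predictable.

```python
# print(x) with one argument works in both Python 2 and Python 3
myDict = {"Joe": 25, "Sally": 35}

names = sorted(myDict.keys())     # sorted() returns a new, sorted list
ages = sorted(myDict.values())

print(names)                      # ['Joe', 'Sally']
print(ages)                       # [25, 35]
```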

When are dictionaries useful?

One of the most natural applications of the dictionary is to create a "lookup table". There are many examples of lookup tables in our everyday lives -- things like a phonebook, the index at the back of a textbook, or (surprise surprise) a regular old dictionary. What these examples have in common is that they allow you to take one piece of information that you already know (a friend's name, a topic, a word) and use it to quickly look up some information that you don't know, but need (a phone number, a page number, a definition).

Here's a simple toy example of creating a phonebook using a dictionary:


In [0]:
phonebook = {}
phonebook["Joe Shmo"] = "958-273-7324"
phonebook["Sally Shmo"] = "958-273-9594"
phonebook["George Smith"] = "253-586-9933"

name = raw_input("Lookup number for: ")
print phonebook[name]

(Notice that we can store the name of a key in a variable, and then use that variable to access the desired element. In this case, the variable "name" holds the name that we input in the terminal, e.g. Sally Shmo.)
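One thing to watch for with a lookup like this: asking for a key that isn't in the dictionary gives an error. A possible way to handle that, sketched here with a hard-coded name standing in for the typed input, is to check with in first:

```python
# print(x) with one argument works in both Python 2 and Python 3
phonebook = {"Joe Shmo": "958-273-7324", "Sally Shmo": "958-273-9594"}

name = "George Smith"     # stands in for the name typed by the user

if name in phonebook:
    message = phonebook[name]
else:
    message = "No entry for " + name

print(message)            # No entry for George Smith
```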

The real power of the dictionary comes when we start generating our lookup tables automatically from data files (instead of creating it manually like we did here). This allows us to very easily cross-reference data across multiple files. We'll look at a full example of this using real data at the end of this lesson!

3. String parsing with .split()


Before we can really dig in to analyzing some data files, there's one more tool we need: .split(). This is a simple and useful function that allows you to split any string into separate parts based on some delimiter. For example:


In [0]:
sentence = "Hello, how are you today?"

print sentence.split()
print sentence.split(",")
print sentence.split("o")

[ Definition ] .split()

Purpose: Splits a string into parts based on a specified delimiter. If no delimiter is given, splits on whitespace (spaces, tabs, and newlines). Returns a list.

Syntax:

result = string.split()
result = string.split(delimiter)

Notes:

  • The delimiter itself is not included in the output
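The sketch below, using made-up strings, illustrates the note above and one subtle difference: splitting with no delimiter treats any run of whitespace as a single separator, while splitting on an explicit delimiter produces an empty string between two delimiters in a row.

```python
# print(x) with one argument works in both Python 2 and Python 3
row = "a,b,,c"
commaParts = row.split(",")
print(commaParts)                 # ['a', 'b', '', 'c']  (no commas in the output)

spaced = "a  b\tc\n"
whitespaceParts = spaced.split()
print(whitespaceParts)            # ['a', 'b', 'c']

singleSpaceParts = "a  b".split(" ")
print(singleSpaceParts)           # ['a', '', 'b']
```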

Using .split() to parse text files

Most data files come in a tabular format, where the data is arranged in rows and columns in some consistent way. For example, you might have a file where each row is a gene, and each column is some type of information about the gene. If you open up this file in a plain text editor, you'll see something like this:

ucscID  geneName    numReads    proteinProduct
uc007afd.1  Mrpl15  368 internal-out-of-frame
uc007afh.1  Lypla1  783 n-term-trunc
uc007afi.1  Tcea1   3852    canonical
uc007afn.1  Atp6v1h 1407    n-term-trunc
uc007agb.1  Pcmtd1  65  uorf

This might look a bit messy to read by eye, but in fact this is a perfect format for reading into our code. The important point is that on each line, the data belonging to each "column" is separated by a consistent delimiter. In this case, the delimiter is a single tab (other common delimiters are commas and spaces). Using the .split() function, we can split up each line from this file into its separate "column" components so that each piece of information can be used separately. This is often what we mean when we say we're "parsing" a data file -- we're breaking it up into meaningful parts.

Here's an example of splitting up one of the lines above:


In [0]:
line = "uc007afd.1    Mrpl15    368    internal-out-of-frame"
print line.split()

Example: parsing a file with multiple columns

In the same folder as this notebook, you should have a file called init_sites.txt. This file contains some real data on translation initiation sites from a ribosome profiling study in mouse (Ingolia et al., Cell 2011). Here's what the first few lines look like:

knownGene   GeneName    InitCodon[nt]   DisttoCDS[codons]   FramevsCDS  InitContext[-3to+4] CDSLength[codons]   HarrPeakStart   HarrPeakWidth   #HarrReads  PeakScore   Codon   Product
uc007zzs.1  Cbr3    36  -23 -1  GCCACGG 22  35  3   379 4.75    nearcog uorf
uc009akk.1  Rac1    196 0   0   CAGATGC 192 195 3   3371    4.70    aug canonical
uc009eyb.1  Saps1   204 -91 1   GCCACGG 23  203 3   560 4.68    nearcog uorf
uc008wzq.1  Ppp1cb  96  0   0   AAGATGG 327 94  4   3218    4.56    aug canonical
uc007hnl.1  Pa2g4   38  -23 0   AGCCTGT 14  37  4   6236    4.54    nearcog uorf
uc007hnl.1  Pa2g4   40  -22 -1  CCTGTGG 17  37  4   6236    4.54    nearcog uorf
...
...

(Note: the header looks like it's on two lines here due to text wrapping, but it's actually just one line!)

Let's say the only info I want from this file is the initiation context of each translation initiation site. This is contained in column 6 of each row (under the header label "InitContext[-3to+4]"). How can I extract this information?

Think for a moment how you might do this, then take a look at the solution:


In [0]:
inFile = open("init_sites.txt", 'r')  #named inFile so we don't hide Python's built-in input() function
inFile.readline()  #skip header line

for line in inFile:
    line = line.rstrip('\r\n')
    data = line.split()  #splits line on whitespace (includes tabs), returns a list
    print data[5]  #remember, list indexing starts at 0, so the 6th column = index 5 in the list!

inFile.close()

If instead we were interested in some other column of the file, we just need to switch data[5] to whichever index holds our information of interest, e.g. data[0] to get the "knownGene" column or data[2] to get the "InitCodon[nt]" column. Remember, though, that the lines of a file are always read in as strings, so you will need to convert numbers using int() or float() as appropriate. So for the InitCodon[nt] column, we will probably want to say int(data[2]) before doing any computations on those numbers.
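Putting the pieces of this lesson together, here is a minimal sketch of turning tabular lines into a lookup table. To keep it self-contained it uses a hard-coded list of tab-separated lines (copied from the small gene table shown earlier) instead of reading a file, and the variable names are made up for this example.

```python
# print(x) with one argument works in both Python 2 and Python 3
lines = [
    "ucscID\tgeneName\tnumReads\tproteinProduct\n",
    "uc007afd.1\tMrpl15\t368\tinternal-out-of-frame\n",
    "uc007afh.1\tLypla1\t783\tn-term-trunc\n",
    "uc007afi.1\tTcea1\t3852\tcanonical\n",
]

header = lines.pop(0)               # remove the header line before looping

readsByGene = {}                    # lookup table: gene name -> number of reads
for line in lines:
    line = line.rstrip('\r\n')
    data = line.split('\t')         # split on the tab delimiter
    geneName = data[1]
    numReads = int(data[2])         # convert the string "368" to the number 368
    readsByGene[geneName] = numReads

print(readsByGene["Tcea1"])                            # 3852
print(readsByGene["Mrpl15"] + readsByGene["Lypla1"])   # 1151
```

Because numReads was converted with int(), the last line adds the two counts as numbers. Without the conversion, + would glue the two strings together instead.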

4. Test your understanding: practice set 4


For the following blocks of code, first try to guess what the output will be, and then run the code yourself. These examples may introduce some ideas and common pitfalls that were not explicitly covered in the text above, so be sure to complete this section.


In [0]:
# RUN THIS BLOCK FIRST TO SET UP VARIABLES! (and re-run it if the lists/dictionary are changed in subsequent code blocks)
fruits = {"apple":"red", "banana":"yellow", "grape":"purple"}
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]
str1 = "Good morning, Mr. Mitsworth."

In [0]:
print len(ages)

In [0]:
print len(ages) == len(names)

In [0]:
print names[-1]

In [0]:
for age in ages:
    print age

In [0]:
for i in range(len(names)):
    print names[i],"is",ages[i]

In [0]:
if "Willard" not in names:
    names.append("Willard")
print names

In [0]:
ages.sort()
print ages

In [0]:
ages = ages.sort()
print ages

In [0]:
parts = str1.split()
print parts
print str1

In [0]:
parts = str1.split(",")
print parts

In [0]:
oldList = [2, 2, 6, 1, 2, 6]
newList = []

for item in oldList:
    if item not in newList:
        newList.append(item)
print newList

In [0]:
print fruits["banana"]

In [0]:
query = "apple"
print fruits[query]

In [0]:
print fruits[0]

In [0]:
print fruits.keys()

In [0]:
print fruits.values()

In [0]:
for key in fruits:
    print fruits[key]

In [0]:
del fruits["banana"]
print fruits

In [0]:
print fruits["pear"]

In [0]:
fruits["apple"] = fruits["apple"] + " or green"
print fruits["apple"]

In [0]:
fruits["pear"] = "green"
print fruits["pear"]