What is a data structure?
A data structure is a way of storing large amounts of data (numbers, strings, etc.) in an organized manner, which makes storage and retrieval easier. There are several different data structures available in Python, but we'll just go over the two most common ones: lists and dictionaries.
What is a list?
A list is one type of built-in data structure in Python that is specialized for storing data in an ordered, sequential manner. We've already seen an example of lists when we used the range() function:
In [0]:
print range(5)
Now we'll go over more formally what a list is and how it can be used.
Side note for people who have used other programming languages: Lists are similar to what other programming languages call arrays. There are some subtle (but important) differences between lists and arrays (for example, a Python list can hold a mix of data types, and it is a dynamically resizing array rather than a fixed-size one), but for most purposes they perform the same role. The most obvious difference you might notice is that you don't need to specify ahead of time how large your list will be. This is because the size of the list grows dynamically as you add things to it (it also shrinks automatically as you take things out).
Here, each thing in the list (element) is stored in its own cell, and each cell is given a sequential integer index, starting at 0.
We use only one variable name to refer to the whole list. To access a specific element in the list, we use the index of the element with the following syntax: listName[index]. For example:
In [0]:
myList = [3, "cat", 56.9, 4, 10, True] # recreating the list above
print myList[0]
print myList[1]
This ability to access a potentially huge amount of data using just one variable name is part of what makes data structures so useful. Imagine if you had 20,000 gene IDs you wanted to use in your code -- it wouldn't be feasible to create a separate variable name for each one. Instead, you can just dump all the gene IDs into a single list, and access them by index. We'll see how to actually do this sort of thing later in this lesson!
Before we talk more about lists, we need to briefly introduce the idea of "in-place" functions.
The functions we've seen so far do not modify variables directly -- they simply "return" a value. For example, line.rstrip('\n') does nothing to the original string line, it just returns a modified version. To actually change line, you need to say line = line.rstrip('\n'), which overwrites line with the new value. Here's a similar example in code:
In [0]:
line = "ATGCGTA***********"
line.rstrip("*")
print line
line = line.rstrip("*")
print line
Below, we're going to see a few examples of functions that do directly modify the variable that they act on. These functions are called "in-place" functions, and I'll make a note of it wherever we encounter them.
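Here is a minimal sketch of the difference, putting a string function that returns a new value next to a list function that works in place (.append(), which is covered later in this lesson). The variable names are made up for illustration, and print is written with parentheses so the block should run under both Python 2 and Python 3.

```python
# rstrip() returns a new string; the original is left untouched
seq = "ATGCGTA***"
stripped = seq.rstrip("*")
print(seq)       # still has the trailing stars
print(stripped)  # stars removed

# append() is in-place: it changes the list itself and returns None
bases = ["A", "T"]
result = bases.append("G")
print(bases)   # the list now has three elements
print(result)  # None, so avoid writing bases = bases.append("G")
```

The last line is the thing to watch for: assigning the result of an in-place function back to the variable replaces your list with None.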
Creating a new list
A list can expand or shrink as we add or remove things from it. Before we can do this, we need to create the list itself. Most often we'll just start off with an empty list, but sometimes it can be useful to pre-fill the list with certain values. Here are the three main ways to create a list:
myList = [] # create a new empty list
myList = [element1, element2, etc] # create a new list with some things already in it
myList = range(num) # create a new list automatically filled with a range of numbers
Example:
In [0]:
shoppingList = ["pizza", "ice cream", "cat food"]
print shoppingList
Accessing elements in a list
As we saw above, an element of a list can be accessed using its index. Do not try to access an index that's not yet in the list -- this will give an error (an IndexError).
someData = myList[index]
You can also index backwards using negative indices:
lastElement = myList[-1]
Examples:
In [0]:
contestRanking = ["Sally", "Billy", "Tommy", "Wilfred"]
print "First place goes to", contestRanking[0], "!"
print "Congrats also to our second and third place winners,", contestRanking[1], "and", contestRanking[2]
print "And in last place...", contestRanking[-1]
In [0]:
contestRanking = ["Sally", "Billy", "Tommy", "Wilfred"]
print "6th place goes to", contestRanking[5]
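One possible way to avoid the error above is to compare the index against the length of the list before using it. This is only a sketch; the variable names place and message are made up for illustration, and print uses parentheses so the block should run under both Python 2 and Python 3.

```python
contestRanking = ["Sally", "Billy", "Tommy", "Wilfred"]
place = 6  # hypothetical placing we want to look up

# valid indices run from 0 to len(list) - 1, so place 6 needs index 5
if place - 1 < len(contestRanking):
    message = str(place) + "th place goes to " + contestRanking[place - 1]
else:
    message = "Nobody finished in place " + str(place)
print(message)
```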
Adding to a list
After creating a list, you can add additional elements to the end using .append(). This is an in-place function, meaning that it directly modifies the list.
myList.append(element)
You can also insert an element at a specified index using .insert(). Elements at or after that index will shift up one index. (in-place function)
myList.insert(index, value)
Example:
In [0]:
shoppingList = ["pizza", "ice cream", "cat food"]
shoppingList.append("english muffins")
print shoppingList
shoppingList.insert(2, "lembas")
print shoppingList
Removing from a list
After creating a list, you can remove elements from it using .pop(). Elements that come after the removed element will shift down one index so that there are no empty spaces in the list. .pop() also returns the element that was "popped". (in-place function)
myList.pop(index) # removes (and returns) the element at the specified index
myList.pop() # removes (and returns) the last element
You can also remove the first occurrence of a specified element using .remove(). Elements that come after will shift down one index. (in-place function)
myList.remove(element)
Examples:
In [0]:
toDoList = ["Water plants", "feed cat", "do dishes", "make python lesson"]
doNext = toDoList.pop()
print doNext
print toDoList
In [0]:
pizzaToppings = ["peppers", "sausage", "bananas", "pepperoni"]
pizzaToppings.remove("bananas")
print pizzaToppings
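The examples above don't show .pop() with an index, or what .remove() does when an element appears more than once. Here is a small sketch of both; the list contents are made up, and print uses parentheses so the block should run under both Python 2 and Python 3.

```python
toppings = ["peppers", "bananas", "sausage", "bananas"]

removed = toppings.pop(0)   # remove by index; returns the removed element
print(removed)              # peppers
print(toppings)             # ['bananas', 'sausage', 'bananas']

toppings.remove("bananas")  # remove by value; only the first match goes
print(toppings)             # ['sausage', 'bananas']
```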
Checking if something is in the list
To check if a particular element is in a list, you can just use a special logical operator "in" (note that this is used differently in a logical statement as compared to a for loop):
if element in myList:
    ... do something
Example:
In [0]:
shoppingList = ["pizza", "ice cream", "cat food"]
item = "20lb bag of reese's"
if item in shoppingList:
    print "I'LL TAKE IT"
else:
    print "Not today..."
Iterating through a list
This should be pretty familiar by now. Since a list is an iterable, we can loop through it using a for loop:
for element in myList:
    ... do something
Example:
In [0]:
currentCats = ["Mittens", "Tatertot", "Meatball", "Star Destroyer"]
for cat in currentCats:
    print cat
Other list operations
listLen = len(myList) # Get the length of a list
myList.sort() # Sort (in-place function)
myList.reverse() # Reverse (in-place function)
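A quick sketch of these three operations on a made-up list of numbers (print uses parentheses so the block should run under both Python 2 and Python 3):

```python
scores = [42, 7, 19, 3]

print(len(scores))  # 4

scores.sort()       # in-place: the list itself is reordered
print(scores)       # [3, 7, 19, 42]

scores.reverse()    # in-place: flips the current order
print(scores)       # [42, 19, 7, 3]
```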
Related data structure: Tuples
There is another data structure in Python that is similar to lists, but not exactly the same, called tuples. We won't focus too much on these for this class, but essentially a tuple is, like a list, a sequence of items, but unlike a list, these items are immutable, meaning that once you have defined your tuple, you cannot change the items in it. These structures are useful in cases when you know exactly how many items you want to use, and if you aren't going to be changing these items.
Tuples are defined similarly to lists, except that they use rounded parentheses instead of square brackets, and can also be defined without any parentheses. For example:
myTuple = () # create an empty tuple
myTuple = "a", "b", "c", "d" # create a tuple of strings
myTuple = ("a", "b", "c", "d") # this makes the same exact tuple
myTuple = ("a", "b", 3, 4) # tuples can also include different data types
myTuple = ("a",) # to make a tuple with a single value, you need a trailing comma
For the purposes of this class, we won't go through all the different methods for tuples, except that they can be accessed similarly to lists, i.e.
myTuple[index]
For more info on tuples, see http://www.tutorialspoint.com/python/python_tuples.htm. Just note that you cannot change individual elements in a tuple!
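As a small illustration, the sketch below indexes a tuple and shows why the comma matters for a single-value tuple. The variable names are made up, and print uses parentheses so the block should run under both Python 2 and Python 3.

```python
point = (3, 4)
first = point[0]         # indexing works just like a list
print(first)             # 3
print(len(point))        # 2

single = ("a",)          # the trailing comma makes this a tuple
notATuple = ("a")        # without the comma, this is just the string "a"
print(len(single))       # 1
print(notATuple == "a")  # True

# point[0] = 10  # uncommenting this line raises a TypeError: tuples are immutable
```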
The next data structure we'll talk about is the dictionary. There are two key differences between dictionaries and lists:
Difference 1: Dictionaries are indexed by keys
With a dictionary (sometimes called a hash table in other programming languages), you access elements by a name ("key") that you pick:
Keys can be strings or numbers. The only restriction is that each key must be unique.
To retrieve a value from the dictionary, we use the following notation: dictName[key]. For example:
In [0]:
myDict = {'age':3, 'animal':'cat', 'num':56.9, 203:4, 'count':10, 'flag':True} # creates the dictionary shown above
print myDict['animal']
print myDict[203]
Difference 2: Dictionaries are unordered
Lists are all about keeping elements in some order. Though you may change the ordering from time to time, it's still in some predictable order at all times.
You should think of dictionaries more like magic grab bags. You mark each piece of data with a key, then throw it in the bag. When you want that data back, you just tell the bag the key and it spits out the data assigned to that key. There's no intrinsic order to the things in the bag, they're all just kind of jumbled around in there. Because of this, dictionaries aren't great for situations where we need to keep data in a specific order -- but they're very convenient in other situations, as we'll see in a minute.
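Technical side note: Ok, so in reality, there is an order to your dictionary. But in the Python 2 used in this lesson, it is an order that Python picks that obeys complex rules and is essentially unpredictable by us. So as far as we're concerned, it may as well be unordered. (Python 3.7 and later do keep dictionaries in insertion order, but you should not count on that here.)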
For example, here's what happens when we print a dictionary -- you can see that the elements are not maintained in the same order they were added:
In [0]:
myDict = {'age':3, 'animal':'cat', 'num':56.9, 203:4, 'count':10, 'flag':True}
print myDict
In [0]:
myDict = {"Joe": 25, "Sally": 35}
print myDict
Adding to a dictionary
Add a new key-value pair to an existing dictionary:
myDict[newKey] = newVal
Note that if the specified key is already in the dictionary, the associated value will be overwritten by the new value we assign here!
In [0]:
myDict = {"Joe": 25, "Sally": 35}
myDict["Bobby"] = 65
myDict["Joe"] = 104
print myDict
Removing from a dictionary
Delete a key-value pair from an existing dictionary:
del myDict[existingKey]
Note that trying to delete a key that is not in the dictionary will give an error.
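A short sketch of deleting from a dictionary, including one way to guard against the error by checking for the key first. The names and ages are made up, and print uses parentheses so the block should run under both Python 2 and Python 3.

```python
myDict = {"Joe": 25, "Sally": 35, "Bobby": 65}

del myDict["Bobby"]  # removes the key "Bobby" and its value
print(myDict)

# "Wilfred" was never added, so deleting it directly would give an error;
# checking with "in" first avoids that
if "Wilfred" in myDict:
    del myDict["Wilfred"]
else:
    print("Wilfred is not in the dictionary")
```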
Check if something is already in the dictionary
This works the same way it did with lists -- just use the in operator:
if someKey in myDict:
    ... do something
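One detail worth seeing in action: with a dictionary, in looks at the keys, not the values. The sketch below uses made-up names, and print uses parentheses so the block should run under both Python 2 and Python 3.

```python
myDict = {"Joe": 25, "Sally": 35}

hasJoe = "Joe" in myDict  # True: "Joe" is a key
has25 = 25 in myDict      # False: 25 is a value, not a key
print(hasJoe)
print(has25)

name = "Bobby"  # hypothetical key that was never added
if name in myDict:
    print(myDict[name])
else:
    print(name + " is not in the dictionary")
```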
Iterating through a dictionary
A dictionary is an iterable, and the iterable unit is the key. So every time we loop, a new key from the dictionary is assigned to the for loop variable.
for key in myDict:
    ... do something
In [0]:
myDict = {"Joe": 25, "Sally": 35}
for person in myDict:
    print "Name:", person, "- Age:", myDict[person]
Special dictionary functions
Often, we'll want to quickly get a list of all the keys or values. Python provides the following functions to do this.
Get a list of the keys only:
keyList = myDict.keys()
Get a list of the values only:
valueList = myDict.values()
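A brief sketch of both functions in use. Because the order of a dictionary is not predictable in Python 2, this example passes the results through the built-in sorted() function (not otherwise used in this lesson), which returns a new sorted list. print uses parentheses so the block should run under both Python 2 and Python 3.

```python
myDict = {"Joe": 25, "Sally": 35}

names = sorted(myDict.keys())   # sorted() returns a new, sorted list
ages = sorted(myDict.values())

print(names)  # ['Joe', 'Sally']
print(ages)   # [25, 35]
```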
One of the most natural applications of the dictionary is to create a "lookup table". There are many examples of lookup tables in our everyday lives -- things like a phonebook, the index at the back of a textbook, or (surprise surprise) a regular old dictionary. What these examples have in common is that they allow you to take one piece of information that you already know (a friend's name, a topic, a word) and use it to quickly look up some information that you don't know, but need (a phone number, a page number, a definition).
Here's a simple toy example of creating a phonebook using a dictionary:
In [0]:
phonebook = {}
phonebook["Joe Shmo"] = "958-273-7324"
phonebook["Sally Shmo"] = "958-273-9594"
phonebook["George Smith"] = "253-586-9933"
name = raw_input("Lookup number for: ")
print phonebook[name]
(Notice that we can store the name of a key in a variable, and then use that variable to access the desired element. In this case, the variable "name" holds the name that we input in the terminal, e.g. Sally Shmo.)
The real power of the dictionary comes when we start generating our lookup tables automatically from data files (instead of creating it manually like we did here). This allows us to very easily cross-reference data across multiple files. We'll look at a full example of this using real data at the end of this lesson!
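As a rough sketch of what the automatic version might look like, the block below builds a lookup table in a loop. A short list of strings stands in for the lines of a file; the IDs and gene names are taken from the sample table shown later in this lesson, and the names lines and idToName are made up for illustration. It relies on .split(), which is described in the next section. print uses parentheses so the block should run under both Python 2 and Python 3.

```python
# stand-in for lines read from a tab-delimited file: ID, then gene name
lines = ["uc007afd.1\tMrpl15\n", "uc007afh.1\tLypla1\n", "uc007afi.1\tTcea1\n"]

idToName = {}
for line in lines:
    line = line.rstrip("\n")
    parts = line.split("\t")       # parts[0] is the ID, parts[1] is the gene name
    idToName[parts[0]] = parts[1]  # the ID becomes the key, the name the value

query = "uc007afh.1"
if query in idToName:
    print(idToName[query])  # Lypla1
else:
    print("ID not found")
```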
In [0]:
sentence = "Hello, how are you today?"
print sentence.split()
print sentence.split(",")
print sentence.split("o")
.split()
Purpose: Splits a string into parts based on a specified delimiter. If no delimiter is given, splits on whitespace (spaces, tabs, and newlines). Returns a list.
Syntax:
result = string.split()
result = string.split(delimiter)
Notes:
Using .split() to parse text files
Most data files come in a tabular format, where the data is arranged in rows and columns in some consistent way. For example, you might have a file where each row is a gene, and each column is some type of information about the gene. If you open up this file in a plain text editor, you'll see something like this:
ucscID geneName numReads proteinProduct
uc007afd.1 Mrpl15 368 internal-out-of-frame
uc007afh.1 Lypla1 783 n-term-trunc
uc007afi.1 Tcea1 3852 canonical
uc007afn.1 Atp6v1h 1407 n-term-trunc
uc007agb.1 Pcmtd1 65 uorf
This might look a bit messy to read by eye, but in fact this is a perfect format for reading into our code. The important point is that on each line, the data belonging to each "column" is separated by a consistent delimiter. In this case, the delimiter is a single tab (other common delimiters are commas and spaces). Using the .split() function, we can split up each line from this file into its separate "column" components so that each piece of information can be used separately. This is often what we mean when we say we're "parsing" a data file -- we're breaking it up into meaningful parts.
Here's an example of splitting up one of the lines above:
In [0]:
line = "uc007afd.1 Mrpl15 368 internal-out-of-frame"
print line.split()
In the same folder as this notebook, you should have a file called init_sites.txt. This file contains some real data on translation initiation sites from a ribosome profiling study in mouse (Ingolia et al., Cell 2011). Here's what the first few lines look like:
knownGene GeneName InitCodon[nt] DisttoCDS[codons] FramevsCDS InitContext[-3to+4] CDSLength[codons] HarrPeakStart HarrPeakWidth #HarrReads PeakScore Codon Product
uc007zzs.1 Cbr3 36 -23 -1 GCCACGG 22 35 3 379 4.75 nearcog uorf
uc009akk.1 Rac1 196 0 0 CAGATGC 192 195 3 3371 4.70 aug canonical
uc009eyb.1 Saps1 204 -91 1 GCCACGG 23 203 3 560 4.68 nearcog uorf
uc008wzq.1 Ppp1cb 96 0 0 AAGATGG 327 94 4 3218 4.56 aug canonical
uc007hnl.1 Pa2g4 38 -23 0 AGCCTGT 14 37 4 6236 4.54 nearcog uorf
uc007hnl.1 Pa2g4 40 -22 -1 CCTGTGG 17 37 4 6236 4.54 nearcog uorf
...
...
(Note: the header looks like it's on two lines here due to text wrapping, but it's actually just one line!)
Let's say the only info I want from this file is the initiation context of each translation initiation site. This is contained in column 6 of each row (under the header label "InitContext[-3to+4]"). How can I extract this information?
Think for a moment how you might do this, then take a look at the solution:
In [0]:
input = open("init_sites.txt", 'r')
input.readline() #skip header line
for line in input:
    line = line.rstrip('\r\n')
    data = line.split() #splits line on whitespace (includes tabs), returns a list
    print data[5] #remember, list indexing starts at 0, so the 6th column = index 5 in the list!
input.close()
If instead we were interested in some other column of the file, we just need to switch data[5] to whichever index holds our information of interest, e.g. data[0] to get the "knownGene" column or data[2] to get the "InitCodon[nt]". Remember, though, that the lines of a file are always read in as strings, so you will need to convert numbers using int() or float() as appropriate. So for the InitCodon[nt], we will probably want to say int(data[2]) before doing any computations on those numbers.
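To see why the conversion matters, here is a small sketch that uses two data lines copied from the preview of init_sites.txt above instead of opening the file. The tab characters between columns are an assumption about the file's delimiter, and the names lines, total and peakScore are made up for illustration. print uses parentheses so the block should run under both Python 2 and Python 3.

```python
# two lines from the file preview, written with tabs between columns (assumed)
lines = [
    "uc007zzs.1\tCbr3\t36\t-23\t-1\tGCCACGG\t22\t35\t3\t379\t4.75\tnearcog\tuorf\n",
    "uc009akk.1\tRac1\t196\t0\t0\tCAGATGC\t192\t195\t3\t3371\t4.70\taug\tcanonical\n",
]

total = 0
for line in lines:
    data = line.rstrip("\r\n").split()
    total = total + int(data[2])  # convert the string to an integer before adding

print(total)                  # 232 (36 + 196)

# after the loop, data still holds the columns of the last line
print(data[2] + data[2])      # 196196: without int(), + just glues the strings together
peakScore = float(data[10])   # use float() for numbers with a decimal point
print(peakScore)              # 4.7
```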
For the following blocks of code, first try to guess what the output will be, and then run the code yourself. These examples may introduce some ideas and common pitfalls that were not explicitly covered in the text above, so be sure to complete this section.
In [0]:
# RUN THIS BLOCK FIRST TO SET UP VARIABLES! (and re-run it if the lists/dictionary are changed in subsequent code blocks)
fruits = {"apple":"red", "banana":"yellow", "grape":"purple"}
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves"]
ages = [65, 34, 96, 47]
str1 = "Good morning, Mr. Mitsworth."
In [0]:
print len(ages)
In [0]:
print len(ages) == len(names)
In [0]:
print names[-1]
In [0]:
for age in ages:
    print age
In [0]:
for i in range(len(names)):
    print names[i],"is",ages[i]
In [0]:
if "Willard" not in names:
    names.append("Willard")
print names
In [0]:
ages.sort()
print ages
In [0]:
ages = ages.sort()
print ages
In [0]:
parts = str1.split()
print parts
print str1
In [0]:
parts = str1.split(",")
print parts
In [0]:
oldList = [2, 2, 6, 1, 2, 6]
newList = []
for item in oldList:
    if item not in newList:
        newList.append(item)
print newList
In [0]:
print fruits["banana"]
In [0]:
query = "apple"
print fruits[query]
In [0]:
print fruits[0]
In [0]:
print fruits.keys()
In [0]:
print fruits.values()
In [0]:
for key in fruits:
    print fruits[key]
In [0]:
del fruits["banana"]
print fruits
In [0]:
print fruits["pear"]
In [0]:
fruits["apple"] = fruits["apple"] + " or green"
print fruits["apple"]
In [0]:
fruits["pear"] = "green"
print fruits["pear"]