Python Quickstart

Workshop on Web Scraping and Text Processing with Python

by Radhika Saksena, Princeton University, saksena@princeton.edu, radhika.saksena@gmail.com

Disclaimer: The code examples presented in this workshop are for educational purposes only. Please seek advice from a legal expert about the legal implications of using this code for web scraping.

1. First things first

This notebook describes some Python basics which we will be using throughout the workshop. Please go through this material and try out the code examples using IPython Notebook which comes with Anaconda (https://store.continuum.io/cshop/anaconda/).

1.1 Executing code in IPython Notebook

  • Click within an existing "Code" Cell or write new code in a "Code" Cell.
  • Press Shift-Enter to execute the Python code contained in the Cell.

1.2 Python Indentation

  • Indentation is significant in Python. Instead of curly braces to demarcate a code block (as in C++, Java, R, etc.), consecutive statements with the same level of indentation are identified as being in the same block.

  • Any number of spaces is valid indentation, as long as it is used consistently within a block. Four spaces for each level of indentation is the convention among Python programmers.

  • In IPython Notebook, simply use the [tab] key for each new level of indentation. This gets converted to four spaces automatically.
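
As a minimal sketch of the rules above (the variable names are made up for illustration), the two indented statements below form the body of a for loop, while the final, unindented print() runs once after the loop has finished. Loops are covered properly in section 6.

```python
# Hypothetical example: the names total and doubled are illustrative only.
total = 0
for n in [1, 2, 3]:
    # These two lines share one indentation level, so both belong to the loop body.
    doubled = 2 * n
    total += doubled

# Back at the outer indentation level: this runs once, after the loop ends.
print(total)  # 12
```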

1.3 Comments in Python

  • Single-line comments start with a # symbol and end with the end of the line.
  • Comments can be placed on a line by themselves.

In [22]:
# Assign value 1 to variable x
x = 1
  • Comments can also be placed on the same line as the code as shown here.

In [ ]:
x = 1 # Assign value 1 to variable x
  • Python has no dedicated multi-line comment syntax. A triple-quoted string that is not assigned to a variable is commonly used for this purpose.

In [ ]:
"""This is a multi-line comment.
Assign value 1 to variable x."""
x = 1

1.4 Python's print() function

The print function is used to print variables and expressions to screen. print() offers a lot of functionality which we'll encounter during the workshop. For now, note that:

  • You can pass anything to the print() function and it will attempt to print its arguments.

In [24]:
print(1) # Print a constant


1

In [25]:
x = 2014
print(x) # Print an integer variable


2014

In [26]:
xstr = "Hello World."
print(xstr) # Print a string


Hello World.

In [27]:
print(x,xstr) # Print multiple objects; in Python 2 the parentheses make this a tuple


(2014, 'Hello World.')

In [28]:
print("String 1" + " " + "String2") # Concatenate multiple strings and print them


String 1 String2
  • For web-scraping and text-processing type tasks, we'd like better control over how things get printed out, such as the number of decimal places when printing out floating point numbers. Use the format() method on the string to be printed out to control the output format.

In [29]:
x = 1
print("Formatted integer is {0:06d}".format(x)) # Note the format specification, 06d, for the integer.


Formatted integer is 000001

In [30]:
y = 12.66666666667
print("Formatted floating point number is {0:2.3f}".format(y)) # Note the format specification, 2.3f, for the floating point number.


Formatted floating point number is 12.667

In [31]:
iStr = "Hello World"
fStr = "Goodbye World"
print("Initial string: {0:s} . Final string: {1:s}.".format(iStr,fStr)) # Note the format specification, s, for the string.


Initial string: Hello World . Final string: Goodbye World.

In [32]:
print("Initial string: {0} . Final string: {1}.".format(iStr,fStr)) # In this case, omitting the s format specifier works too.


Initial string: Hello World . Final string: Goodbye World.
  • In Python 2, print is a statement rather than a function, so the parentheses are optional. The following syntax is therefore valid in Python 2 (but is a syntax error in Python 3):

In [34]:
x = 1
print "Formatted integer is {0:06d}".format(x)


Formatted integer is 000001

In [33]:
y = 12.66666666667
print "Formatted floating point number is {0:2.3f}".format(y)


Formatted floating point number is 12.667

2. Numeric Variable Types

2.1 Integers


In [35]:
year = 2014
print(year)


2014

In [36]:
print("The year is %d." % year)


The year is 2014.

In [37]:
print(type(year))


<type 'int'>

In [38]:
help(year)


Help on int object:

class int(object)
 |  int(x=0) -> int or long
 |  int(x, base=10) -> int or long
 |  
 |  Convert a number or string to an integer, or return 0 if no arguments
 |  are given.  If x is floating point, the conversion truncates towards zero.
 |  If x is outside the integer range, the function returns a long instead.
 |  
 |  If x is not a number or if base is given, then x must be a string or
 |  Unicode object representing an integer literal in the given base.  The
 |  literal can be preceded by '+' or '-' and be surrounded by whitespace.
 |  The base defaults to 10.  Valid bases are 0 and 2-36.  Base 0 means to
 |  interpret the base from the string as an integer literal.
 |  >>> int('0b100', base=0)
 |  4
 |  
 |  Methods defined here:
 |  
 |  __abs__(...)
 |      x.__abs__() <==> abs(x)
 |  
 |  __add__(...)
 |      x.__add__(y) <==> x+y
 |  
 |  __and__(...)
 |      x.__and__(y) <==> x&y
 |  
 |  __cmp__(...)
 |      x.__cmp__(y) <==> cmp(x,y)
 |  
 |  __coerce__(...)
 |      x.__coerce__(y) <==> coerce(x, y)
 |  
 |  __div__(...)
 |      x.__div__(y) <==> x/y
 |  
 |  __divmod__(...)
 |      x.__divmod__(y) <==> divmod(x, y)
 |  
 |  __float__(...)
 |      x.__float__() <==> float(x)
 |  
 |  __floordiv__(...)
 |      x.__floordiv__(y) <==> x//y
 |  
 |  __format__(...)
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getnewargs__(...)
 |  
 |  __hash__(...)
 |      x.__hash__() <==> hash(x)
 |  
 |  __hex__(...)
 |      x.__hex__() <==> hex(x)
 |  
 |  __index__(...)
 |      x[y:z] <==> x[y.__index__():z.__index__()]
 |  
 |  __int__(...)
 |      x.__int__() <==> int(x)
 |  
 |  __invert__(...)
 |      x.__invert__() <==> ~x
 |  
 |  __long__(...)
 |      x.__long__() <==> long(x)
 |  
 |  __lshift__(...)
 |      x.__lshift__(y) <==> x<<y
 |  
 |  __mod__(...)
 |      x.__mod__(y) <==> x%y
 |  
 |  __mul__(...)
 |      x.__mul__(y) <==> x*y
 |  
 |  __neg__(...)
 |      x.__neg__() <==> -x
 |  
 |  __nonzero__(...)
 |      x.__nonzero__() <==> x != 0
 |  
 |  __oct__(...)
 |      x.__oct__() <==> oct(x)
 |  
 |  __or__(...)
 |      x.__or__(y) <==> x|y
 |  
 |  __pos__(...)
 |      x.__pos__() <==> +x
 |  
 |  __pow__(...)
 |      x.__pow__(y[, z]) <==> pow(x, y[, z])
 |  
 |  __radd__(...)
 |      x.__radd__(y) <==> y+x
 |  
 |  __rand__(...)
 |      x.__rand__(y) <==> y&x
 |  
 |  __rdiv__(...)
 |      x.__rdiv__(y) <==> y/x
 |  
 |  __rdivmod__(...)
 |      x.__rdivmod__(y) <==> divmod(y, x)
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __rfloordiv__(...)
 |      x.__rfloordiv__(y) <==> y//x
 |  
 |  __rlshift__(...)
 |      x.__rlshift__(y) <==> y<<x
 |  
 |  __rmod__(...)
 |      x.__rmod__(y) <==> y%x
 |  
 |  __rmul__(...)
 |      x.__rmul__(y) <==> y*x
 |  
 |  __ror__(...)
 |      x.__ror__(y) <==> y|x
 |  
 |  __rpow__(...)
 |      y.__rpow__(x[, z]) <==> pow(x, y[, z])
 |  
 |  __rrshift__(...)
 |      x.__rrshift__(y) <==> y>>x
 |  
 |  __rshift__(...)
 |      x.__rshift__(y) <==> x>>y
 |  
 |  __rsub__(...)
 |      x.__rsub__(y) <==> y-x
 |  
 |  __rtruediv__(...)
 |      x.__rtruediv__(y) <==> y/x
 |  
 |  __rxor__(...)
 |      x.__rxor__(y) <==> y^x
 |  
 |  __str__(...)
 |      x.__str__() <==> str(x)
 |  
 |  __sub__(...)
 |      x.__sub__(y) <==> x-y
 |  
 |  __truediv__(...)
 |      x.__truediv__(y) <==> x/y
 |  
 |  __trunc__(...)
 |      Truncating an Integral returns itself.
 |  
 |  __xor__(...)
 |      x.__xor__(y) <==> x^y
 |  
 |  bit_length(...)
 |      int.bit_length() -> int
 |      
 |      Number of bits necessary to represent self in binary.
 |      >>> bin(37)
 |      '0b100101'
 |      >>> (37).bit_length()
 |      6
 |  
 |  conjugate(...)
 |      Returns self, the complex conjugate of any int.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  denominator
 |      the denominator of a rational number in lowest terms
 |  
 |  imag
 |      the imaginary part of a complex number
 |  
 |  numerator
 |      the numerator of a rational number in lowest terms
 |  
 |  real
 |      the real part of a complex number
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T


In [ ]:
help(int)

2.2 Floating Point Numbers


In [39]:
mean = (1.0 + 0.7 + 2.1)/3.0
print(mean)


1.26666666667

In [40]:
print("The mean is %6.2f." % mean)


The mean is   1.27.

In [41]:
print(type(mean))


<type 'float'>

In [ ]:
help(mean)

In [ ]:
help(float)

Beware of integer division in Python 2: when both operands are integers, the / operator discards the fractional part of the result.


In [ ]:
mean = (1 + 2)/2
print(mean)

To fix this, either explicitly use floating point variables:


In [ ]:
mean = (1.0 + 2.0)/2
print(mean)

Or change the behaviour of the division operator:


In [ ]:
from __future__ import division
mean = (1 + 2)/2
print(mean)
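
The cells above depend on which Python version runs them. As a version-independent sketch (the variable names are illustrative), the // operator always performs floor division and converting one operand with float() always gives true division, so the results below should be the same under Python 2 and Python 3.

```python
total = 1 + 2   # an integer
count = 2       # an integer

floor_mean = total // count       # floor division: the fractional part is discarded
true_mean = float(total) / count  # converting one operand to float gives true division

print(floor_mean)  # 1
print(true_mean)   # 1.5
```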

3. Basic Operators

3.1 Arithmetic Operators

Standard arithmetic operators for addition (+), subtraction (-), multiplication (*) and division (/) are supported in Python. We have already seen use of the addition (+) and division (/) operators. Some more operators that are commonly encountered are demonstrated below.


In [43]:
x = 2**3  # ** is the exponentiation operator
print(x)


8

In [44]:
x = 9 % 4 # % is the modulus operator
print(x)


1

In [45]:
x = 9 // 4 # // is the operator for floor division
print(x)


2
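
The modulus and floor division operators are two halves of the same calculation. The sketch below uses the built-in divmod() function, which returns both results at once, to show how they fit together; the variable names are illustrative.

```python
a = 9
b = 4

quotient, remainder = divmod(a, b)  # same as (a // b, a % b)
print(quotient)   # 2
print(remainder)  # 1

# For integers, the two results always recombine to the original value.
reconstructed = quotient * b + remainder
print(reconstructed == a)  # True
```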

3.2 Assignment Operators

In addition to the simple assignment operator (=), Python provides compound assignment operators (+=, -=, *=, /=, %=, etc.) that combine an arithmetic operation with assignment. For example:


In [46]:
x = 2.0
y = 5.0
y += x # y = y + x
print(y)

y %= x # y = y%x
print(y)


7.0
1.0

3.3 Comparison Operators


In [47]:
x = 1
y = 1
x == y # Check for equality


Out[47]:
True

In [48]:
x = 1
y = 1
x != y # Check for inequality


Out[48]:
False

In [50]:
x = 0.5
y = 1.0
x > y # Check if x greater than y


Out[50]:
False

In [51]:
x < y # Check if x less than y


Out[51]:
True

In [52]:
x >= y # Check if x greater than or equal to y


Out[52]:
False

In [53]:
x <= y # Check if x less than or equal to y


Out[53]:
True

3.4 Logical Operators

Logical operators such as and, or, not allow specification of composite conditions, for example in if statements as we will see shortly.


In [54]:
a = 99
b = 99
(a == b) and (a <= 100) # use the and operator to check if both the operands are true


Out[54]:
True

In [55]:
a = True
b = False
a and b


Out[55]:
False

In [56]:
a = True
b = False
a or b # use the or operator to check if at least one of the two operands is true


Out[56]:
True

In [57]:
a = 100
b = 100
a == b
not(a == b) # use the not operator to reverse a logical statement


Out[57]:
False

4. Strings

  • A string is a sequence of characters.
  • Strings are specified by using single quotes (' ') or double quotes (" "). Multi-line strings can be specified with triple quotes.

In [58]:
pythonStr = 'A first Python string.' # String specified with single quotes.
print(type(pythonStr))
print(pythonStr)


<type 'str'>
A first Python string.

In [59]:
pythonStr = "A first Python string" # String specified with double quotes.
print(type(pythonStr))
print(pythonStr)


<type 'str'>
A first Python string

In [61]:
pythonStr = """A multi-line string.
A first Python string."""               # Multi-line string specified with triple quotes.
print(type(pythonStr))
print(pythonStr)


<type 'str'>
A multi-line string.
A first Python string.
  • Strings can be concatenated using the addition(+) operator.

In [62]:
str1 = " Rock "
str2 = " Paper "
str3 = " Scissors "
longStr = str1 + str2 + str3
print(longStr)


 Rock  Paper  Scissors 
  • Strings can also be repeated with the multiplication (*) operator.

In [63]:
str1 = "Rock,Paper,Scissors\n"
repeatStr = str1*5
print(repeatStr)


Rock,Paper,Scissors
Rock,Paper,Scissors
Rock,Paper,Scissors
Rock,Paper,Scissors
Rock,Paper,Scissors

  • The len() function returns the length of a string.

In [65]:
str1 = "Python"
lenStr1 = len(str1)
print("The length of str1 is " + str(lenStr1) + ".")


The length of str1 is 6.
  • Since a Python string is a sequence of characters, individual characters in the string can be indexed. Note that, unlike in R, indexing of Python sequences starts at 0 and goes up to one less than the length of the sequence.

In [68]:
str1 = "Python"
print(str1[0]) # Print the first character element of the string.


P

In [69]:
print(str1[len(str1)-1]) # Print the last character element of the string.


n

In [70]:
print(str1[2:4]) # Print a 2-character slice of the string, starting at index 2 up to, but not including, index 4.


th
  • Strings are immutable. That is, an existing instance of a string cannot be modified. Instead, a new string that contains the modification should be created.

In [67]:
str1 = "Python"
str1[1] = "3" # Error, strings can't be modified.


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-67-eb0ede1872df> in <module>()
      1 str1 = "Python"
----> 2 str1[1] = "3" # Error, strings can't be modified.

TypeError: 'str' object does not support item assignment
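
To get the effect of the failed assignment above, a new string has to be built. One possible sketch uses slicing and concatenation; the name str2 is chosen here for illustration.

```python
str1 = "Python"

# Build a new string from two slices of the old one plus the replacement character.
str2 = str1[:1] + "3" + str1[2:]

print(str1)  # Python (the original is unchanged)
print(str2)  # P3thon
```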

In [71]:
str1 = "Python"
print(str1.upper()) # Convert str1 to all uppercase.


PYTHON

In [72]:
str2 = "PYTHON"
print(str2.lower()) # Convert str2 to all lowercase.


python

In [73]:
str3 = "Rock,Paper,Scissors,Lizard,Spock"
print(str3.split(",")) # Split str3 using "," as the separator. A list of string elements is returned.


['Rock', 'Paper', 'Scissors', 'Lizard', 'Spock']

In [74]:
str4 = "The original string has trailing spaces.\t\n"
print("***"+str4.strip()+"***") # strip() removes leading and trailing whitespace (here, the trailing tab and newline).


***The original string has trailing spaces.***
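
In text-processing tasks these string methods are usually combined. The following is a sketch on a made-up input line (raw_line and fields are illustrative names): the line is split on commas, then each piece is stripped of whitespace and converted to lowercase. It uses a for loop and a list, both of which are introduced later in this notebook.

```python
raw_line = "  Rock, PAPER ,scissors \n"  # made-up input line

fields = []
for field in raw_line.split(","):         # split on commas first
    fields.append(field.strip().lower())  # then trim whitespace and normalise case

print(fields)  # ['rock', 'paper', 'scissors']
```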

5. Python Data Structures

5.1 Lists

  • A list is an ordered, indexed collection of items. The items can be of arbitrary, mixed types. Note the square brackets in the pyList list declaration below. The len() function returns the length of the list.

In [75]:
# pyList contains an integer, string and floating point number
pyList = [2014,"02 June", 74.5]

# Print all the elements of pyList
print(pyList)
print("\n")

# Print the length of pyList obtained using the len() function
print("Length of pyList is: {0}.\n".format(len(pyList)))


[2014, '02 June', 74.5]


Length of pyList is: 3.

  • List elements can be individually referenced using their index in the list. Python indexing starts at 0 and runs up to the length of the sequence - 1. Square brackets are used to specify the index into the list. This notation can also be used to assign values to the elements of the list: in contrast to strings, lists are mutable.

In [85]:
print(pyList)
print("\n")

# Print the first element of pyList. Remember, indexing starts with 0.
print("First element of pyList: {0}.\n".format(pyList[0]))

# Print the last element of pyList. Last element can be conveniently indexed using -1.
print("Last element of pyList: {0}.\n".format(pyList[-1]))

# Also the last element has index = (length of list - 1)
check = (pyList[2] == pyList[-1])
print("Is pyList[2] equal to pyList[-1]?\n{0}.\n".format(check))

# Assign a new value to the third element of the list
pyList[2] = -99.0
print("Modified element of pyList[2]: {0}.\n".format(pyList[2]))


[2014, '02 June', 74.5]


First element of pyList: 2014.

Last element of pyList: 74.5.

Is pyList[2] equal to pyList[-1]?
True.

Modified element of pyList[2]: -99.0.

  • Python lists can be sliced using the slice notation of two indices separated by a colon. An omitted first index indicates 0 and an omitted second index indicates the length of the list/sequence.

In [86]:
pyList = ["rock","paper","scissors","lizard","Spock"]

print(pyList[2:4]) # Print the elements of pyList from index 2 up to, but not including, index 4.


['scissors', 'lizard']

In [87]:
print(pyList[:2])  # Print the first two elements of pyList.


['rock', 'paper']

In [88]:
print(pyList[2:]) # Print all the elements of pyList starting from index 2 (the third element).


['scissors', 'lizard', 'Spock']

In [89]:
print(pyList[:])  # Print all the elements of pyList


['rock', 'paper', 'scissors', 'lizard', 'Spock']
  • Python slice notation can also be used to assign into lists.

In [8]:
pyList = ["rock","paper","scissors","lizard","Spock"]

pyList[2:4] = ["gu","pa"] # Replace the elements at indices 2 and 3 of pyList

print("Contents of pyList after slice assignment:")
print(pyList)
print("\n")

pyList[:] = [] # Clear pyList, replace all items with an empty list

print("Contents of pyList after clearing:")
print(pyList)


Contents of pyList after slice assignment:
['rock', 'paper', 'gu', 'pa', 'Spock']


Contents of pyList after clearing:
[]
  • Python lists come with useful methods to add elements: append() adds a single item and extend() adds all the items of another list.

In [9]:
pyList = ["rock","paper"]
print("Printing Python list pyList:")
print(pyList)
print("\n")

pyList.append("scissors")
print("Appended the string 'scissors' to pyList:")
print(pyList)
print("\n")

anotherList = ["lizard","Spock"]
pyList.extend(anotherList)
print("Extended pyList:")
print(pyList)
print("\n")


Printing Python list pyList:
['rock', 'paper']


Appended the string 'scissors' to pyList:
['rock', 'paper', 'scissors']


Extended pyList:
['rock', 'paper', 'scissors', 'lizard', 'Spock']
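
The difference between the two methods is easiest to see when the argument is itself a list. In this sketch (listA, listB and extra are illustrative names), append() nests the list as a single element while extend() adds its items one by one.

```python
listA = ["rock", "paper"]
listB = ["rock", "paper"]
extra = ["lizard", "Spock"]

listA.append(extra)  # adds the list itself as ONE new (nested) element
listB.extend(extra)  # adds each item of the list as a separate element

print(listA)       # ['rock', 'paper', ['lizard', 'Spock']]
print(len(listA))  # 3
print(listB)       # ['rock', 'paper', 'lizard', 'Spock']
print(len(listB))  # 4
```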


  • Python lists can be concatenated using the "+" operator (similar to strings).

In [10]:
pyList1 = ["rock","paper","scissors"]
pyList2 = ["lizard","Spock"]
newList = pyList1 + pyList2
print("New list:")
print(newList)


New list:
['rock', 'paper', 'scissors', 'lizard', 'Spock']
  • Python lists can be nested - list within a list within a list and so on. An index needs to be specified for each level of nesting.

In [19]:
pyLists = [["rock","paper","scissors"], ["ji","gu","pa"]]

# Print the first element (0-th index) of pyLists which is itself a list
print("pyLists[0] = ")
print(pyLists[0])
print("\n")

# Print the 0-th index element of the first list element in pyLists
print("pyLists[0][0] = " + pyLists[0][0] + ".")
print("\n")

# Print the second element of pyLists which is itself a list
print("pyLists[1] = ")
print(pyLists[1])
print("\n")

# Print the 0-th index element of the second list element in pyLists
print("pyLists[1][0] = " + pyLists[1][0] + ".")
print("\n")


pyLists[0] = 
['rock', 'paper', 'scissors']


pyLists[0][0] = rock.


pyLists[1] = 
['ji', 'gu', 'pa']


pyLists[1][0] = ji.



In [21]:
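pyList = [1,3,4,2]
pyList.sort(reverse=True) # Sort the list in place, in descending order
sum(pyList)               # Sum of the elements (10); not displayed, as only the last expression is echoed
2*(pyList)                # Repeat the list twice, like string repetition
#2**(pyList)              # Exponentiation is not defined for lists; uncommenting this raises a TypeError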


Out[21]:
[4, 3, 2, 1, 4, 3, 2, 1]

5.2 Tuples

  • Tuples are another sequence data type consisting of arbitrary items separated by commas. In contrast to lists, tuples are immutable, i.e., they cannot be modified. See below for a declaration of a tuple. Note the parentheses in the declaration.

In [91]:
# pyTuple contains an integer, string and floating point number
pyTuple = (2014,"02 June", 74.5)

# Print all the elements of pyTuple
print("pyTuple is: ")
print(pyTuple)
print("\n")

# Print the length of pyTuple obtained using the len() function
print("Length of pyTuple is: {0}.\n".format(len(pyTuple)))


pyTuple is: 
(2014, '02 June', 74.5)


Length of pyTuple is: 3.

  • Tuples are immutable. Attempting to change elements of a tuple will result in errors.

In [92]:
pyTuple[1] = "31 December" # Error as pyTuple is a tuple and hence, immutable


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-92-5df9ae13c0e7> in <module>()
----> 1 pyTuple[1] = "31 December" # Error as pyTuple is a tuple and hence, immutable

TypeError: 'tuple' object does not support item assignment
  • Tuples can be packed from and unpacked into individual elements.

In [93]:
pyTuple = "rock", "paper", "scissors" # pack the strings into a tuple named pyTuple
print(pyTuple)


('rock', 'paper', 'scissors')

In [95]:
str0,str1,str2 = pyTuple # unpack the tuple into strings named str0, str1, str2
print("str0 = " + str0 + ".")
print("str1 = " + str1 + ".")
print("str2 = " + str2 + ".")


str0 = rock.
str1 = paper.
str2 = scissors.
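
Unpacking is especially handy inside a for loop (see section 6.2) when each item of a list is itself a tuple. The sketch below uses made-up data; pairs, name, code and lines are illustrative names.

```python
pairs = [("Canada", "CAN"), ("Argentina", "ARG"), ("Austria", "AUT")]  # illustrative data

lines = []
for name, code in pairs:  # each tuple is unpacked into name and code
    lines.append(name + " -> " + code)

print(lines[0])  # Canada -> CAN
```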
  • One can declare tuples of tuples.

In [96]:
pyTuples = (("rock","paper","scissors"),("ji","gu","pa"))
print("pyTuples[0] = {0}.".format(pyTuples[0])) # Print the first sub-tuple in pyTuples.
print("pyTuples[1] = {0}.".format(pyTuples[1])) # Print the second sub-tuple in pyTuples.


pyTuples[0] = ('rock', 'paper', 'scissors').
pyTuples[1] = ('ji', 'gu', 'pa').
  • One can declare a tuple of lists.

In [97]:
pyNested = (["rock","paper","scissors"],["ji","gu","pa"])
pyNested[0][2] = "lizard" # OK, list within the tuple is mutable
print(pyNested[0]) # Print first list element of the tuple


['rock', 'paper', 'lizard']
  • One can also declare a list of tuples.

In [ ]:
pyNested = [("rock","paper","scissors"),("ji","gu","pa")]
pyNested[0][2] = "lizard" # Error, tuples are immutable
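
The cell above has no recorded output. The following sketch shows one way to observe the failure without stopping the notebook, using a try/except block (a construct not otherwise covered in this quickstart); message is an illustrative name. It also shows that the outer list remains mutable, so a whole tuple can be replaced.

```python
pyNested = [("rock", "paper", "scissors"), ("ji", "gu", "pa")]

try:
    pyNested[0][2] = "lizard"  # fails: the inner tuple is immutable
except TypeError as err:
    message = str(err)
    print("TypeError: " + message)

# The outer list is still mutable, so an entire tuple can be swapped for a new one.
pyNested[0] = ("rock", "paper", "lizard")
print(pyNested[0])
```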

5.3 Dictionaries

  • A Python dictionary is an unordered set of key:value pairs that acts as an associative array. The keys must be immutable and are unique within one dictionary. In contrast to lists and tuples, which are indexed by position, dictionaries are indexed by keys. Note the use of curly braces in the declaration of the dictionary below.

In [98]:
pyDict = {"Canada":"CAN","Argentina":"ARG","Austria":"AUT"}
print("pyDict: {0}.".format(pyDict))


pyDict: {'Canada': 'CAN', 'Argentina': 'ARG', 'Austria': 'AUT'}.

In [99]:
print("pyDict['Argentina']: " + pyDict['Argentina'] + ".") # Print the value corresponding to key 'Argentina'


pyDict['Argentina']: ARG.

In [100]:
print(pyDict.keys()) # Return all the keys in the dictionary as a list.


['Canada', 'Argentina', 'Austria']

In [101]:
print(pyDict.values()) # Return all the values in the dictionary as a list.


['CAN', 'ARG', 'AUT']

In [102]:
print(pyDict.items()) # Return key, value pairs from the dictionary as a list of tuples.


[('Canada', 'CAN'), ('Argentina', 'ARG'), ('Austria', 'AUT')]
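
Indexing a dictionary with a key that does not exist raises a KeyError. As a brief sketch of how this is commonly handled (the variable names and the "Brazil" entry are made up for illustration), the in operator tests for a key and the get() method returns a fallback value when the key is missing.

```python
pyDict = {"Canada": "CAN", "Argentina": "ARG", "Austria": "AUT"}

has_canada = "Canada" in pyDict  # True: the key exists
has_brazil = "Brazil" in pyDict  # False: no such key

# get() returns the fallback value instead of raising KeyError for a missing key.
brazil_code = pyDict.get("Brazil", "unknown")
print(brazil_code)  # unknown

# Assigning to a new key adds a key:value pair to the dictionary.
pyDict["Brazil"] = "BRA"
print(pyDict["Brazil"])  # BRA
```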

Parsing hierarchical data structures involving Python dictionaries will be very useful when working with the JSON data format and APIs such as the Twitter API.

  • Values in a dictionary can be any object including other dictionaries.

In [103]:
pyDicts = {"Canada":{"Alpha-2":"CA","Alpha-3":"CAN","Numeric":"124"},
           "Argentina":{"Alpha-2":"AR","Alpha-3":"ARG","Numeric":"032"},
           "Austria":{"Alpha-2":"AT","Alpha-3":"AUT","Numeric":"040"}}

print("pyDicts['Canada'] = {0}.".format(pyDicts['Canada']))


pyDicts['Canada'] = {'Numeric': '124', 'Alpha-2': 'CA', 'Alpha-3': 'CAN'}.

In [104]:
print("pyDicts['Canada']['Alpha-2'] = {0}.".format(pyDicts['Canada']['Alpha-2']))


pyDicts['Canada']['Alpha-2'] = CA.
  • Values in a dictionary can also be lists.

In [105]:
pyNested = {"Canada":[2011,2008,2006,2004,2000 ],"Argentina":[2013,2011,2009,2007,2005],"Austria":[2013,2008,2006,2002,1999]}
print("pyNested['Canada'] = {0}".format(pyNested['Canada']))


pyNested['Canada'] = [2011, 2008, 2006, 2004, 2000]

In [106]:
print("pyNested['Austria'][4] = {0}.".format(pyNested['Austria'][4]))


pyNested['Austria'][4] = 1999.
  • Lastly, we can have lists of dictionaries.

In [107]:
pyNested = [{"year":2011,"countries":["Canada","Argentina"]},
            {"year":2008,"countries":["Canada","Austria"]},
            {"year":2006,"countries":["Canada","Austria"]},
            {"year":2013,"countries":["Argentina","Austria"]}]
print("pyNested[0] = {0}".format(pyNested[0]))


pyNested[0] = {'countries': ['Canada', 'Argentina'], 'year': 2011}

In [108]:
print("pyNested[0]['year'] = {0}, pyNested[0]['countries'] = {1}.".format(pyNested[0]['year'],pyNested[0]['countries']))


pyNested[0]['year'] = 2011, pyNested[0]['countries'] = ['Canada', 'Argentina'].
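
These nested structures are the same shape that Python's standard json module produces when it parses JSON text. The sketch below parses a made-up JSON string (json_text is illustrative and not taken from any real API) and then indexes the result exactly like the list of dictionaries above.

```python
import json

# Made-up JSON text, shaped like the list of dictionaries above.
json_text = '[{"year": 2011, "countries": ["Canada", "Argentina"]}, {"year": 2008, "countries": ["Canada", "Austria"]}]'

records = json.loads(json_text)  # parse the text into Python lists and dictionaries

first_year = records[0]["year"]             # index the list, then the dictionary
second_countries = records[1]["countries"]  # the value here is itself a list

print(first_year)             # 2011
print(len(second_countries))  # 2
```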

6. Control Flow

6.1 if Statements

  • An if statement, optionally followed by one or more elif clauses and a final else clause, allows the execution of the script to be altered based on some condition. Here is an example.

In [109]:
pyNested = [{"year":2011,"countries":["Canada","Argentina"]},
            {"year":2008,"countries":["Canada","Austria"]},
            {"year":2006,"countries":["Canada","Austria"]},
            {"year":2013,"countries":["Argentina","Austria"]}]

# Check if the first dictionary element of pyNested corresponds to the year 2008 or 2011
if(pyNested[0]["year"] == 2008):
    print("Countries corresponding to year 2008 are: {0}.".format(pyNested[0]["countries"]))
elif(pyNested[0]["year"] == 2011):
    print("Countries corresponding to year 2011 are: {0}.".format(pyNested[0]["countries"]))
else:
    print("The first element does not correspond to either 2008 or 2011.")


Countries corresponding to year 2011 are: ['Canada', 'Argentina'].
  • Scripting languages, such as Python, make it easy to automate repetitive tasks. In this workshop, we'll use two of Python's syntactic constructs for iteration - the for loop and the while loop.

6.2 for Statements

  • Given an iterable, such as a list, the for loop construct can iterate over each of its values as shown below.

In [110]:
countryList = ["Canada", "United States of America", "Mexico"]
for country in countryList: # Loop over countryList, set country to next element in list.
    print(country)


Canada
United States of America
Mexico

In [111]:
countryDict = {"Canada":"124","United States":"840","Mexico":"484"}
print("Country\t\tISO 3166-1 Numeric Code")
for country,code in countryDict.items(): # Loop over all the key and value pairs in the dictionary
    print("{0:12s}\t\t{1:12s}".format(country,code))


Country		ISO 3166-1 Numeric Code
Canada      		124         
United States		840         
Mexico      		484         

6.3 range() Function

  • Another common use of the for loop is to iterate over an index which takes specific values. The range() function generates integers within the range specified by its arguments.

In [112]:
countryList = ["Canada", "United States of America", "Mexico"]
for i in range(0,3): # Loop over values of i in the range 0 up to, but not including, 3
    print(countryList[i])


Canada
United States of America
Mexico
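
range() accepts one, two or three arguments. In the sketch below, list() is wrapped around each call only so that the generated values are displayed explicitly whichever Python version is used; indices is an illustrative name.

```python
print(list(range(3)))         # [0, 1, 2]  one argument: start defaults to 0
print(list(range(1, 4)))      # [1, 2, 3]  start and stop
print(list(range(0, 10, 5)))  # [0, 5]     start, stop and step

countryList = ["Canada", "United States of America", "Mexico"]
indices = list(range(len(countryList)))  # one index per list element
print(indices)                # [0, 1, 2]
```

Using range(len(countryList)) instead of a hard-coded range(0,3) keeps the loop correct if the list changes length.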

6.4 while Statement

  • Another syntactic construct used for iteration is the while loop, which repeats its body for as long as its condition is true. It is generally used in conjunction with the comparison and logical operators which we saw earlier.

In [115]:
countryList = ["Canada", "United States of America", "Mexico"]
# iterate over countryList backwards, starting from the last element
while(countryList):
    print(countryList[-1])
    countryList.pop()


Mexico
United States of America
Canada

In [114]:
i = 0
countryList = ["Canada", "United States of America", "Mexico"]
while(i < len(countryList)):
    print("Iteration variable i = {0}, Country = {1}.".format(i,countryList[i]))
    i += 1


Iteration variable i = 0, Country = Canada.
Iteration variable i = 1, Country = United States of America.
Iteration variable i = 2, Country = Mexico.

6.5 break and continue Statements

  • Now, if some condition is evaluated within the for/while loop and based on that, we wish to exit the loop, we can use the break statement. Note that the break statement exits the innermost loop which contains it.

In [116]:
countryList = ["Canada", "United States of America", "Mexico"]
for country in countryList:
    if(country == "United States of America"):
        # if the country name matches, then break out of the for loop
        break
    else:
        # do some processing
        print(country)


Canada
  • If, instead of exiting the loop, one merely wishes to skip that iteration, then use the continue statement as shown here.

In [117]:
countryList = ["Canada", "United States of America", "Mexico"]
for country in countryList:
    if(country == "United States of America"):
        # if the country name matches, then skip to the next iteration of the for loop
        continue
    else:
        # do some processing
        print(country)


Canada
Mexico

7. Python File I/O

  • This is a quick intro to reading and writing plain text files in Python. As we proceed through the workshop, we'll look at more sophisticated ways of reading and writing files, including text in non-English languages and specialized Python modules for formats such as CSV and JSON.

7.1 Writing to a File

  • To write to a file, open it with the open() function using the "w" mode, which creates the file or overwrites an existing one. Use the write() method of the file object as shown below. write() takes a string, just like the formatted strings passed to print(), but unlike print() it does not automatically append a newline.

In [118]:
filename = "tmp.txt"
fout = open(filename,"w") # The 'w' option indicates that the file is being opened for writing

for i in range(0,5): # Write five lines to the file
    # Do some processing
    fout.write("i = {0}.\n".format(i))
    
fout.close() # Once the file has been written, close the file
  • Alternative syntax for writing to file using 'with open' is shown below.

In [119]:
filename = "tmp.txt"
with open(filename,"w") as fout:
    for i in range(0,5):
        fout.write("i = {0}.\n".format(i))

# No explicit fout.close() is needed here: the with block closes the file automatically.

7.2 Reading from a file

  • To read a file line by line, open it with the open() function using the "r" mode. Make sure that the file exists. Once the file has been read, close it using the close() method of the file object; this frees the system resources held by the open file.

In [121]:
filename = "tmp.txt"
fin = open(filename,"r") # The 'r' option indicates that the file is being opened to be read

for line in fin: # Read in each line from the file
    # Do some processing
    print(line)

fin.close() # Once the file has been read, close the file


i = 0.

i = 1.

i = 2.

i = 3.

i = 4.

  • The code below demonstrates another way to open a file and read each line. With this syntax, the file is automatically closed after the with block.

In [122]:
filename = "tmp.txt"
with open(filename,"r") as fin:
    for line in fin:
        # Do some processing
        print(line)


i = 0.

i = 1.

i = 2.

i = 3.

i = 4.

  • An input file can also be read in as one string by using the read() method.
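
A minimal sketch of read() follows. To stay self-contained it first writes a small file into a temporary directory; the path and the names tmp_dir, contents and lines are illustrative assumptions, not part of the workshop material.

```python
import os
import tempfile

# Write a small file to a temporary directory so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "tmp.txt")  # hypothetical path for illustration
with open(filename, "w") as fout:
    for i in range(0, 3):
        fout.write("i = {0}.\n".format(i))

with open(filename, "r") as fin:
    contents = fin.read()      # the whole file as a single string

print(len(contents))           # 21 characters, newlines included
lines = contents.splitlines()  # split the string back into lines, dropping the newlines
print(lines)                   # ['i = 0.', 'i = 1.', 'i = 2.']
```

Reading everything at once is convenient for small files; for very large files, looping over the file object line by line, as in the cells above, uses less memory.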

7.3 The csv module

  • Python's csv module provides convenient functionality for reading and writing CSV files, similar to that available in R. The CSV files can then be imported into other packages such as R and Excel.
  • Here is a short example of using the csv module to write consecutive rows into a comma-separated file. The delimiter can be set to any single character.

In [ ]:
import csv

with open("game.csv","wb") as csvfile: # In Python 2, open the file in binary mode ("wb") for csv.writer
    csvwriter = csv.writer(csvfile,delimiter=',')
    csvwriter.writerow(["rock","paper","scissor"])
    csvwriter.writerow(["ji","gu","pa"])
    csvwriter.writerow(["rock","paper","scissor","lizard","Spock"])

In [ ]:
cat game.csv
  • And this is an example of reading the game.csv file. Each row of the CSV file is read in as a list of strings.

In [ ]:
import csv

with open("game.csv","r") as csvfile:
    csvreader = csv.reader(csvfile,delimiter=",")
    for row in csvreader:
        print(row)
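
The cells above target Python 2. For readers on Python 3, here is a hedged sketch of the same round trip. It assumes Python 3, where the file is opened in text mode with newline="" rather than in binary mode, and it writes to a temporary, hypothetical path (game_py3.csv) so that game.csv is left untouched.

```python
import csv
import os
import tempfile

rows = [["rock", "paper", "scissors"], ["ji", "gu", "pa"]]

# Hypothetical file location, chosen so the sketch does not overwrite game.csv.
path = os.path.join(tempfile.mkdtemp(), "game_py3.csv")

# Python 3: text mode with newline="" lets the csv module control line endings.
with open(path, "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=",")
    for row in rows:
        csvwriter.writerow(row)

with open(path, "r", newline="") as csvfile:
    rows_read = list(csv.reader(csvfile, delimiter=","))  # each row is a list of strings

print(rows_read == rows)  # True
```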