Workshop on Web Scraping and Text Processing with Python
by Radhika Saksena, Princeton University, saksena@princeton.edu, radhika.saksena@gmail.com
Disclaimer: The code examples presented in this workshop are for educational purposes only. Please seek advice from a legal expert about the legal implications of using this code for web scraping.
This notebook describes some Python basics which we will be using throughout the workshop. Please go through this material and try out the code examples using IPython Notebook, which comes with Anaconda (https://store.continuum.io/cshop/anaconda/).
You can edit the code in an existing "Code" cell or write new code in a new "Code" cell.
Indentation is significant in Python. Instead of curly braces to demarcate a code block (as in C++, Java, R, etc.), consecutive statements with the same level of indentation are identified as being in the same block.
Any number of spaces is valid indentation. Four spaces for each level of indentation is conventional among programmers.
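As a minimal sketch of how indentation delimits blocks, consider the snippet below. The names temperature, message and advice are illustrative assumptions and not part of the workshop material. The two indented lines under if form one block, the two under else form another, and the final print is back at the outer level, so it runs in either case.

```python
temperature = 30  # hypothetical value, for illustration only

if temperature > 25:
    # These two lines share the same indentation, so they form one block.
    message = "warm"
    advice = "drink water"
else:
    # This block runs only when the condition above is false.
    message = "cool"
    advice = "wear a jacket"

# Back at the outer indentation level: this line runs in either case.
print("It is {0}, so {1}.".format(message, advice))
```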
In [22]:
# Assign value 1 to variable x
x = 1
In [ ]:
x = 1 # Assign value 1 to variable x
In [ ]:
"""This is a multi-line comment.
Assign value 1 to variable x."""
x = 1
The print function is used to print variables and expressions to screen. print() offers a lot of functionality which we'll encounter during the workshop. For now, note the following basic uses:
In [24]:
print(1) # Print a constant
In [25]:
x = 2014
print(x) # Print an integer variable
In [26]:
xstr = "Hello World."
print(xstr) # Print a string
In [27]:
print(x,xstr) # Print multiple objects
In [28]:
print("String 1" + " " + "String2") # Concatenate multiple strings and print them
In [29]:
x = 1
print("Formatted integer is {0:06d}".format(x)) # Note the format specification, 06d, for the integer.
In [30]:
y = 12.66666666667
print("Formatted floating point number is {0:2.3f}".format(y)) # Note the format specification, 2.3f, for the floating point number.
In [31]:
iStr = "Hello World"
fStr = "Goodbye World"
print("Initial string: {0:s} . Final string: {1:s}.".format(iStr,fStr)) # Note the format specification, s, for the string.
In [32]:
print("Initial string: {0} . Final string: {1}.".format(iStr,fStr)) # In this case, omitting the s format specifier works too.
In [34]:
x = 1
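print "Formatted integer is {0:06d}".format(x) # Python 2 print statement (no parentheses); this form is a syntax error in Python 3
In [33]:
y = 12.66666666667
print "Formatted floating point number is {0:2.3f}".format(y) # Python 2 print statement; a syntax error in Python 3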
In [35]:
year = 2014
print(year)
In [36]:
print("The year is %d." % year)
In [37]:
print(type(year))
In [38]:
help(year)
In [ ]:
help(int)
In [39]:
mean = (1.0 + 0.7 + 2.1)/3.0
print(mean)
In [40]:
print("The mean is %6.2f." % mean)
In [41]:
print(type(mean))
In [ ]:
help(mean)
In [ ]:
help(float)
Beware of integer division in Python 2: when both operands of / are integers, the fractional part of the result is discarded.
In [ ]:
mean = (1 + 2)/2
print(mean)
To fix this, either explicitly use floating point variables:
In [ ]:
mean = (1.0 + 2.0)/2
print(mean)
Or change the behaviour of the division operator:
In [ ]:
from __future__ import division
mean = (1 + 2)/2
print(mean)
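The difference between the two kinds of division can also be checked without changing the behaviour of /. The sketch below is only an illustration with made-up variable names (total, count). It relies on two built-in behaviours: // always performs floor division, and converting one operand with float() forces true division in both Python 2 and Python 3.

```python
total = 1 + 2
count = 2

floor_mean = total // count        # floor division: the fractional part is discarded
true_mean = float(total) / count   # true division: one operand is a float

print(floor_mean)
print(true_mean)
```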
Standard arithmetic operators for addition (+), subtraction (-), multiplication (*) and division (/) are supported in Python. We have already seen use of the addition (+) and division (/) operators. Some more operators that are commonly encountered are demonstrated below.
In [43]:
x = 2**3 # ** is the exponentiation operator
print(x)
In [44]:
x = 9 % 4 # % is the modulus operator
print(x)
In [45]:
x = 9 // 4 # // is the operator for floor division
print(x)
In addition to using = (the simple assignment operator) for assigning values to variables, one can use a composite assignment operator (+=, -=, etc.) that combines the simple assignment operator with any of these arithmetic operators. For example:
In [46]:
x = 2.0
y = 5.0
y += x # y = y + x
print(y)
y %= x # y = y%x
print(y)
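The same pattern applies to the other arithmetic operators. As a short sketch (the variable name count and its starting value are arbitrary choices for illustration), each line below updates count in place, and the comment shows the equivalent long form.

```python
count = 10
count -= 3    # count = count - 3
count *= 2    # count = count * 2
count //= 4   # count = count // 4 (floor division)
count **= 2   # count = count ** 2
print(count)
```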
In [47]:
x = 1
y = 1
x == y # Check for equality
Out[47]:
In [48]:
x = 1
y = 1
x != y # Check for inequality
Out[48]:
In [50]:
x = 0.5
y = 1.0
x > y # Check if x greater than y
Out[50]:
In [51]:
x < y # Check if x less than y
Out[51]:
In [52]:
x >= y # Check if x is greater than or equal to y
Out[52]:
In [53]:
x <= y # Check if x is less than or equal to y
Out[53]:
Logical operators such as and, or, not allow specification of composite conditions, for example in if statements as we will see shortly.
In [54]:
a = 99
b = 99
(a == b) and (a <= 100) # use the and operator to check if both the operands are true
Out[54]:
In [55]:
a = True
b = False
a and b
Out[55]:
In [56]:
a = True
b = False
a or b # use the or operator to check if at least one of the two operands is true
Out[56]:
In [57]:
a = 100
b = 100
a == b
not(a == b) # use the not operator to reverse a logical statement
Out[57]:
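Comparison and logical operators can be combined into a single composite condition. The sketch below uses the Gregorian leap-year rule purely as a familiar illustration; the names year and is_leap are assumptions introduced here, not part of the workshop code.

```python
year = 2014

# A year is a leap year if it is divisible by 4,
# and either not divisible by 100 or divisible by 400.
is_leap = (year % 4 == 0) and ((year % 100 != 0) or (year % 400 == 0))

print("Is {0} a leap year? {1}".format(year, is_leap))
```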
In [58]:
pythonStr = 'A first Python string.' # String specified with single quotes.
print(type(pythonStr))
print(pythonStr)
In [59]:
pythonStr = "A first Python string" # String specified with double quotes.
print(type(pythonStr))
print(pythonStr)
In [61]:
pythonStr = """A multi-line string.
A first Python string.""" # Multi-line string specified with triple quotes.
print(type(pythonStr))
print(pythonStr)
In [62]:
str1 = " Rock "
str2 = " Paper "
str3 = " Scissors "
longStr = str1 + str2 + str3
print(longStr)
In [63]:
str1 = "Rock,Paper,Scissors\n"
repeatStr = str1*5
print(repeatStr)
In [65]:
str1 = "Python"
lenStr1 = len(str1)
print("The length of str1 is " + str(lenStr1) + ".")
In [68]:
str1 = "Python"
print(str1[0]) # Print the first character element of the string.
In [69]:
print(str1[len(str1)-1]) # Print the last character element of the string.
In [70]:
print(str1[2:4]) # Print a 2-character slice of the string, starting at index 2 up to but not including index 4.
In [67]:
str1 = "Python"
str1[1] = "3" # Error, strings can't be modified.
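Since a string cannot be changed in place, one common workaround is to build a new string from slices of the old one. The following is a minimal sketch of that idea; the name newStr is an illustrative assumption.

```python
str1 = "Python"

# Build a new string from slices of the old one instead of assigning to str1[1].
newStr = str1[:1] + "3" + str1[2:]

print(newStr)
print(str1)   # The original string is unchanged
```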
In [71]:
str1 = "Python"
print(str1.upper()) # Convert str1 to all uppercase.
In [72]:
str2 = "PYTHON"
print(str2.lower()) # Convert str2 to all lowercase.
In [73]:
str3 = "Rock,Paper,Scissors,Lizard,Spock"
print(str3.split(",")) # Split str3 using "," as the separator. A list of string elements is returned.
In [74]:
str4 = "The original string has trailing spaces.\t\n"
print("***"+str4.strip()+"***") # Print stripped string with leading and trailing whitespace characters removed.
In [75]:
# pyList contains an integer, string and floating point number
pyList = [2014,"02 June", 74.5]
# Print all the elements of pyList
print(pyList)
print("\n")
# Print the length of pyList obtained using the len() function
print("Length of pyList is: {0}.\n".format(len(pyList)))
In [85]:
print(pyList)
print("\n")
# Print the first element of pyList. Remember, indexing starts with 0.
print("First element of pyList: {0}.\n".format(pyList[0]))
# Print the last element of pyList. Last element can be conveniently indexed using -1.
print("Last element of pyList: {0}.\n".format(pyList[-1]))
# Also the last element has index = (length of list - 1)
check = (pyList[2] == pyList[-1])
print("Is pyList[2] equal to pyList[-1]?\n{0}.\n".format(check))
# Assign a new value to the third element of the list
pyList[2] = -99.0
print("Modified element of pyList[2]: {0}.\n".format(pyList[2]))
In [86]:
pyList = ["rock","paper","scissors","lizard","Spock"]
print(pyList[2:4]) # Print elements of pyList starting at index 2, up to but not including index 4.
In [87]:
print(pyList[:2]) # Print the first two elements of pyList.
In [88]:
print(pyList[2:]) # Print all the elements of pyList starting at index 2.
In [89]:
print(pyList[:]) # Print all the elements of pyList
In [8]:
pyList = ["rock","paper","scissors","lizard","Spock"]
pyList[2:4] = ["gu","pa"] # Replace the elements at indices 2 and 3 of pyList
print("Contents of pyList after slice assignment:")
print(pyList)
print("\n")
pyList[:] = [] # Clear pyList, replace all items with an empty list
print("Modified contents of pyList:")
print(pyList)
In [9]:
pyList = ["rock","paper"]
print("Printing Python list pyList:")
print(pyList)
print("\n")
pyList.append("scissors")
print("Appended the string 'scissors' to pyList:")
print(pyList)
print("\n")
anotherList = ["lizard","Spock"]
pyList.extend(anotherList)
print("Extended pyList:")
print(pyList)
print("\n")
In [10]:
pyList1 = ["rock","paper","scissors"]
pyList2 = ["lizard","Spock"]
newList = pyList1 + pyList2
print("New list:")
print(newList)
In [19]:
pyLists = [["rock","paper","scissors"], ["ji","gu","pa"]]
# Print the first element (0-th index) of pyLists which is itself a list
print("pyLists[0] = ")
print(pyLists[0])
print("\n")
# Print the 0-th index element of the first list element in pyLists
print("pyLists[0][0] = " + pyLists[0][0] + ".")
print("\n")
# Print the second element of pyLists which is itself a list
print("pyLists[1] = ")
print(pyLists[1])
print("\n")
# Print the 0-th index element of the second list element in pyLists
print("pyLists[1][0] = " + pyLists[1][0] + ".")
print("\n")
In [21]:
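pyList = [1,3,4,2]
pyList.sort(reverse=True) # Sort the list in place, in descending order
sum(pyList) # Sum of all the elements
2*(pyList) # Repeat the list twice (this is not element-wise multiplication)
#2**(pyList) # Error, exponentiation is not defined for lists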
Out[21]:
In [91]:
# pyTuple contains an integer, string and floating point number
pyTuple = (2014,"02 June", 74.5)
# Print all the elements of pyTuple
print("pyTuple is: ")
print(pyTuple)
print("\n")
# Print the length of pyTuple obtained using the len() function
print("Length of pyTuple is: {0}.\n".format(len(pyTuple)))
In [92]:
pyTuple[1] = "31 December" # Error as pyTuple is a tuple and hence, immutable
In [93]:
pyTuple = "rock", "paper", "scissors" # pack the strings into a tuple named pyTuple
print(pyTuple)
In [95]:
str0,str1,str2 = pyTuple # unpack the tuple into strings named str0, str1, str2
print("str0 = " + str0 + ".")
print("str1 = " + str1 + ".")
print("str2 = " + str2 + ".")
In [96]:
pyTuples = (("rock","paper","scissors"),("ji","gu","pa"))
print("pyTuples[0] = {0}.".format(pyTuples[0])) # Print the first sub-tuple in pyTuples.
print("pyTuples[1] = {0}.".format(pyTuples[1])) # Print the second sub-tuple in pyTuples.
In [97]:
pyNested = (["rock","paper","scissors"],["ji","gu","pa"])
pyNested[0][2] = "lizard" # OK, list within the tuple is mutable
print(pyNested[0]) # Print first list element of the tuple
In [ ]:
pyNested = [("rock","paper","scissors"),("ji","gu","pa")]
pyNested[0][2] = "lizard" # Error, tuples are immutable
In [98]:
pyDict = {"Canada":"CAN","Argentina":"ARG","Austria":"AUT"}
print("pyDict: {0}.".format(pyDict))
In [99]:
print("pyDict['Argentina']: " + pyDict['Argentina'] + ".") # Print the value corresponding to key 'Argentina'
In [100]:
print(pyDict.keys())
In [101]:
print(pyDict.values()) # Return all the values in the dictionary as a list.
In [102]:
print(pyDict.items()) # Return key, value pairs from the dictionary as a list of tuples.
Parsing hierarchical data structures involving Python dictionaries will be very useful when working with the JSON data format and APIs such as the Twitter API.
In [103]:
pyDicts = {"Canada":{"Alpha-2":"CA","Alpha-3":"CAN","Numeric":"124"},
           "Argentina":{"Alpha-2":"AR","Alpha-3":"ARG","Numeric":"032"},
           "Austria":{"Alpha-2":"AT","Alpha-3":"AUT","Numeric":"040"}}
print("pyDicts['Canada'] = {0}.".format(pyDicts['Canada']))
In [104]:
print("pyDicts['Canada']['Alpha-2'] = {0}.".format(pyDicts['Canada']['Alpha-2']))
In [105]:
pyNested = {"Canada":[2011,2008,2006,2004,2000 ],"Argentina":[2013,2011,2009,2007,2005],"Austria":[2013,2008,2006,2002,1999]}
print("pyNested['Canada'] = {0}".format(pyNested['Canada']))
In [106]:
print("pyNested['Austria'][4] = {0}.".format(pyNested['Austria'][4]))
In [107]:
pyNested = [{"year":2011,"countries":["Canada","Argentina"]},
            {"year":2008,"countries":["Canada","Austria"]},
            {"year":2006,"countries":["Canada","Austria"]},
            {"year":2013,"countries":["Argentina","Austria"]}]
print("pyNested[0] = {0}".format(pyNested[0]))
In [108]:
print("pyNested[0]['year'] = {0}, pyNested[0]['countries'] = {1}.".format(pyNested[0]['year'],pyNested[0]['countries']))
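To connect this to JSON, the sketch below parses a small JSON string with the standard library json module and indexes the result in the same way as the nested structures above. The JSON text and the names jsonText, records, firstYear and secondCountries are made up for illustration; real API responses will have their own layout.

```python
import json

# Hypothetical JSON text, shaped like a list of dictionaries.
jsonText = ('[{"year": 2011, "countries": ["Canada", "Argentina"]},'
            ' {"year": 2008, "countries": ["Canada", "Austria"]}]')

# json.loads converts a JSON array to a list and a JSON object to a dictionary.
records = json.loads(jsonText)

firstYear = records[0]["year"]
secondCountries = records[1]["countries"]

print("records[0]['year'] = {0}.".format(firstYear))
print("records[1]['countries'] = {0}.".format(secondCountries))
```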
In [109]:
pyNested = [{"year":2011,"countries":["Canada","Argentina"]},
            {"year":2008,"countries":["Canada","Austria"]},
            {"year":2006,"countries":["Canada","Austria"]},
            {"year":2013,"countries":["Argentina","Austria"]}]
# Check if the first dictionary element of pyNested corresponds to year 2008 or 2011
if(pyNested[0]["year"] == 2008):
    print("Countries corresponding to year 2008 are: {0}.".format(pyNested[0]["countries"]))
elif(pyNested[0]["year"] == 2011):
    print("Countries corresponding to year 2011 are: {0}.".format(pyNested[0]["countries"]))
else:
    print("The first element does not correspond to either 2008 or 2011.")
In [110]:
countryList = ["Canada", "United States of America", "Mexico"]
for country in countryList: # Loop over countryList, set country to next element in list.
    print(country)
In [111]:
countryDict = {"Canada":"124","United States":"840","Mexico":"484"}
print("Country\t\tISO 3166-1 Numeric Code")
for country,code in countryDict.items(): # Loop over all the key and value pairs in the dictionary
    print("{0:12s}\t\t{1:12s}".format(country,code))
In [112]:
countryList = ["Canada", "United States of America", "Mexico"]
for i in range(0,3): # Loop over values of i in the range 0 up to, but not including, 3
    print(countryList[i])
In [115]:
countryList = ["Canada", "United States of America", "Mexico"]
# iterate over countryList backwards, starting from the last element
while(countryList):
    print(countryList[-1])
    countryList.pop()
In [114]:
i = 0
countryList = ["Canada", "United States of America", "Mexico"]
while(i < len(countryList)):
    print("Iteration variable i = {0}, Country = {1}.".format(i,countryList[i]))
    i += 1
In [116]:
countryList = ["Canada", "United States of America", "Mexico"]
for country in countryList:
    if(country == "United States of America"):
        # if the country name matches, then break out of the for loop
        break
    else:
        # do some processing
        print(country)
In [117]:
countryList = ["Canada", "United States of America", "Mexico"]
for country in countryList:
    if(country == "United States of America"):
        # if the country name matches, then skip to the next iteration of the for loop
        continue
    else:
        # do some processing
        print(country)
In [118]:
filename = "tmp.txt"
fout = open(filename,"w") # The 'w' option indicates that the file is being opened for writing
for i in range(0,5): # Write five lines to the file
    # Do some processing
    fout.write("i = {0}.\n".format(i))
fout.close() # Once the file has been written, close the file
In [119]:
filename = "tmp.txt"
with open(filename,"w") as fout: # The file is closed automatically when the with block ends
    for i in range(0,5):
        fout.write("i = {0}.\n".format(i))
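One way to see what the with statement does is to inspect the closed attribute of the file object inside and after the block. This is a minimal sketch; it writes to a temporary file with a hypothetical name (with_demo.txt) so that tmp.txt is left untouched, and the names insideClosed and afterClosed are introduced only for illustration.

```python
import os
import tempfile

# Hypothetical temporary path, used so that tmp.txt is not overwritten.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(path, "w") as fout:
    fout.write("i = 0.\n")
    insideClosed = fout.closed   # The file is still open inside the block

afterClosed = fout.closed        # The with statement has closed the file on exit

print(insideClosed)
print(afterClosed)
```

Because the file is closed on leaving the block, an explicit call to close() is not needed when using with.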
In [121]:
filename = "tmp.txt"
fin = open(filename,"r") # The 'r' option indicates that the file is being opened to be read
for line in fin: # Read in each line from the file
    # Do some processing
    print(line)
fin.close() # Once the file has been read, close the file
In [122]:
filename = "tmp.txt"
with open(filename,"r") as fin:
    for line in fin:
        # Do some processing
        print(line)
In [ ]:
import csv
with open("game.csv","wb") as csvfile: # "wb" (binary mode) is what the Python 2 csv module expects
    csvwriter = csv.writer(csvfile,delimiter=',')
    csvwriter.writerow(["rock","paper","scissor"])
    csvwriter.writerow(["ji","gu","pa"])
    csvwriter.writerow(["rock","paper","scissor","lizard","Spock"])
In [ ]:
cat game.csv
In [ ]:
import csv
with open("game.csv","r") as csvfile:
    csvreader = csv.reader(csvfile,delimiter=",")
    for row in csvreader:
        print(row)
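The CSV cells above follow Python 2 conventions. Under Python 3, the csv module works with text-mode files, and its documentation recommends passing newline="" to open(). The sketch below shows a write and read round trip in that style. It assumes Python 3 and uses a temporary file with a hypothetical name (game_py3.csv) so that game.csv is not overwritten.

```python
import csv
import os
import tempfile

# Hypothetical temporary path, used so that game.csv is not overwritten.
path = os.path.join(tempfile.mkdtemp(), "game_py3.csv")

# Python 3: open in text mode and pass newline="" when using the csv module.
with open(path, "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=",")
    csvwriter.writerow(["rock", "paper", "scissors"])
    csvwriter.writerow(["ji", "gu", "pa"])

# Read the rows back; each row is returned as a list of strings.
with open(path, "r", newline="") as csvfile:
    rows = list(csv.reader(csvfile, delimiter=","))

print(rows)
```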