Lecture notes from week 6

Programming for the Behavioral Sciences

Text and string processing. The goal of this lecture is to count the words in parts of Hamlet. What is the most frequently used word in this play, for instance?

More information about basic string and text operations can be found here:

http://www.pythonforbeginners.com/basics/string-manipulation-in-python

https://www.tutorialspoint.com/python/python_strings.htm

Basic string operations


In [4]:
# Create a string
my_str = 'This is my string'

# Split it...
my_str_split = my_str.split()
print(my_str_split)

# ... and restore (join) it again
my_str_joined = ' '.join(my_str_split) # ' ' means join with space
print(my_str_joined)

# Find the index of the first occurrence of the substring 'str'
print(my_str.find('str')) 

# Replace all occurrences of 'str' with 'th'
print(my_str.replace('str','th'))

# Print the length of the string. %.2f formats the value with two decimals (as a float).
print(my_str + ' contains %.2f characters' % len(my_str))


['This', 'is', 'my', 'string']
This is my string
11
This is my thing
This is my string contains 17.00 characters
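
When counting words it also helps to normalize them first. Below is a minimal sketch (not part of the original notebook) that combines lower() with strip(); string.punctuation is the standard library's string of ASCII punctuation characters.

import string

word = 'Hamlet,'

# Lowercase the word and strip punctuation from both ends
clean = word.lower().strip(string.punctuation)
print(clean)  # prints: hamlet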

Count words in Hamlet


In [12]:
filename = 'img/hamlet.txt'

# Open file for reading
f = open(filename,'r')

# Create an empty dictionary
worddict = {}

# Loop over each line in file
for line in f:
    # Loop over each word in line
    words = line.split()
    for word in words:
        # Make all words lowercase (so that The and the are counted as the same, for instance)
        w = word.lower()
        if w in worddict:
            worddict[w] +=1
        else:
            worddict[w] = 1

# Close the file
f.close()  

# Create a list from the dictionary
wordlist = []
for key, value in worddict.items():
    wordlist.append([value, key])
    
# Sort worddict such that the most common words are on the top   
wordlist.sort(reverse=True)  

# Print the 10 most common words
print(wordlist[:10])

# Write output to file       
filename_out = 'my_hamlet_frequencies.txt'        
f_out = open(filename_out,'w')
for w in wordlist:
    f_out.write(w[1] + '\t' + str(w[0]) + '\n')
f_out.close()


[[1083, 'the'], [939, 'and'], [727, 'to'], [670, 'of'], [540, 'a'], [523, 'i'], [519, 'my'], [433, 'you'], [420, 'in'], [358, 'ham.']]
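
The if/else bookkeeping above can be written more compactly with dict.get(), and sorted() with a key function avoids building [value, key] pairs by hand. A minimal sketch of the same counting loop, assuming the same img/hamlet.txt file:

worddict = {}
with open('img/hamlet.txt', 'r') as f:
    for line in f:
        for word in line.split():
            w = word.lower()
            # get() returns 0 for words that have not been seen yet
            worddict[w] = worddict.get(w, 0) + 1

# Sort the (word, count) pairs by descending count
wordlist = sorted(worddict.items(), key=lambda kv: kv[1], reverse=True)
print(wordlist[:10])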

Use collections to do the same thing


In [10]:
import re
from collections import Counter

# Read all the words into a list
words = re.findall(r'\w+', open('img/hamlet.txt').read().lower())

# Present the 10 most common words
Counter(words).most_common(10)


Out[10]:
[('the', 1091),
 ('and', 969),
 ('to', 767),
 ('of', 675),
 ('i', 633),
 ('a', 571),
 ('you', 558),
 ('my', 520),
 ('in', 451),
 ('it', 421)]

Compare the results using our own implementation and Counter. Why do they give slightly different results?
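
Hint: line.split() only splits on whitespace, so punctuation stays attached to words (note 'ham.' in the first list), whereas the regular expression \w+ matches runs of word characters only and drops punctuation. A small sketch illustrating the difference on an example line:

import re

line = 'To be, or not to be: that is the question.'

# split() keeps punctuation attached to the words
print(line.lower().split())              # ['to', 'be,', 'or', 'not', 'to', 'be:', ...]

# \w+ matches word characters only, so punctuation is dropped
print(re.findall(r'\w+', line.lower()))  # ['to', 'be', 'or', 'not', 'to', 'be', ...]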