Programming for the Behavioral Sciences
Text and string processing. The goal of this lecture is to count the words in parts of Hamlet. What is the most frequently used word in this text, for instance?
More information about basic string and text operations can be found here:
http://www.pythonforbeginners.com/basics/string-manipulation-in-python
In [4]:
# Create a string
my_str = 'This is my string'
# Split it...
my_str_split = my_str.split()
print(my_str_split)
# ... and restore (join) it again
my_str_joined = ' '.join(my_str_split) # ' ' means join with space
print(my_str_joined)
# Find the index of the first occurrence of the substring 'str' (find returns -1 if it is absent)
print(my_str.find('str'))
# Replace all occurrences of 'str' with 'th'
print(my_str.replace('str','th'))
# Print the length of the string. %.2f formats the value as a float with two decimals,
# so the integer length is printed with a trailing .00; use %d to print it as an integer.
print(my_str + ' contains %.2f characters' % len(my_str))
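The formatting step in the last line can be written in several ways. The following is a minimal sketch comparing three of them; the variable names `n_chars`, `as_float`, `as_int` and `as_fstring` are introduced here for illustration only, and the f-string variant assumes Python 3.6 or later.

```python
# Illustrative sketch: the same string length formatted three ways
my_str = 'This is my string'
n_chars = len(my_str)

# %.2f treats the number as a float and prints two decimals
as_float = '%s contains %.2f characters' % (my_str, n_chars)
# %d prints the number as an integer
as_int = '%s contains %d characters' % (my_str, n_chars)
# An f-string inserts the values directly (assumes Python 3.6+)
as_fstring = f'{my_str} contains {n_chars} characters'

print(as_float)
print(as_int)
print(as_fstring)
```

Since a character count is always a whole number, the integer forms are probably the more natural choice here.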
In [12]:
filename = 'img/hamlet.txt'
# Open file for reading
f = open(filename,'r')
# Create an empty dictionary
worddict = {}
# Loop over each line in file
for line in f:
    # Loop over each word in line
    words = line.split()
    for word in words:
        # Make all words lowercase (so that The and the are counted as the same, for instance)
        w = word.lower()
        if w in worddict:
            worddict[w] += 1
        else:
            worddict[w] = 1
# Close the file after reading
f.close()
# Create a list from the dictionary
wordlist = []
# Note: use worddict.items() in Python 3 (iteritems() exists only in Python 2)
for key, value in worddict.items():
    wordlist.append([value, key])
# Sort wordlist such that the most common words are on the top
wordlist.sort(reverse=True)
# Print the 10 most common words
print(wordlist[:10])
# Write output to file
filename_out = 'my_hamlet_frequencies.txt'
f_out = open(filename_out,'w')
for w in wordlist:
    # One line per word: word, tab, count
    f_out.write('\t'.join((w[1], str(w[0]))) + '\n')
f_out.close()
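The counting loop above can also be wrapped in a function and combined with a `with` block, which closes the file automatically. The following is a minimal sketch under stated assumptions: the function name `count_words` and the sample text are made up for illustration, and `io.StringIO` stands in for the opened file so that the example runs without `hamlet.txt`.

```python
import io

def count_words(lines):
    """Count lowercase, whitespace-separated words in an iterable of lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            w = word.lower()
            # get() returns 0 for words not seen before
            counts[w] = counts.get(w, 0) + 1
    return counts

# Made-up sample text standing in for an open file (assumption for illustration)
sample = io.StringIO('The king and the queen\nThe play')
with sample as f:  # the with block closes f when it ends
    counts = count_words(f)

# Sort (word, count) pairs so that the most common words come first
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top[:3])
```

With the real file, `with open(filename, 'r') as f:` could replace the `io.StringIO` stand-in, and no explicit `f.close()` would be needed.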
In [10]:
import re
from collections import Counter
# Read all the words into a list
words = re.findall(r'\w+', open('img/hamlet.txt').read().lower())
# Present the 10 most common words
Counter(words).most_common(10)
Out[10]:
Compare the results using our own implementation and Counter. Why do they give slightly different results?
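One way to start investigating is to split the same short line with both methods and print the two token lists side by side. This is only a sketch: the sample line is typed in here for illustration and is not read from `hamlet.txt`, and the names `split_tokens` and `regex_tokens` are hypothetical.

```python
import re

# Illustrative sample line (an assumption, not taken from the file)
line = 'To be, or not to be: that is the question.'

# Method used in our own implementation: split on whitespace
split_tokens = line.lower().split()
# Method used in the Counter example: sequences of word characters
regex_tokens = re.findall(r'\w+', line.lower())

print(split_tokens)
print(regex_tokens)
```

Look at where the two lists differ and consider how that would change the counts for a whole text.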