In the code below, we use os.walk, which yields one tuple for each directory it visits: the directory's path, a list of its subdirectories, and a list of its files. The first tuple in this case looks like this:
('./texts', [], ['anc-088.txt', 'anc-089.txt', 'anc-090.txt', 'anc-091.txt', 'bro-001.txt', 'bro-002.txt', 'bro-003.txt', 'bro-004.txt', 'lau-013.txt', 'lau-014.txt', 'loh-157.txt', 'loh-158.txt', 'loh-159.txt', 'loh-160.txt', 'loh-161.txt', 'loh-162.txt', 'loh-162b.txt', 'loh-163.txt', 'loh-164.txt', 'loh-165.txt', 'uls-001.txt', 'uls-002.txt', 'uls-003.txt'])
Here, the path is the one we provided, ./texts, and it contains no directories, only these files. Calling next() on the generator (the .next() method in Python 2) gives us this first tuple, and the list of files is in its third position, index 2 when counting from zero. Our focus is only on the length of that list: 23.
In [8]:
import os
fileCount = len(next(os.walk('./texts'))[2])
print(fileCount)
For the record, here are our texts:
In [7]:
print(next(os.walk('./texts'))[2])
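To make the shape of that tuple concrete, here is a minimal sketch that unpacks it by name instead of by index. It assumes Python 3 and builds a throwaway directory with three hypothetical file names, so it does not depend on ./texts being present.

```python
import os
import tempfile

# Build a temporary directory holding three empty files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.txt"):
        open(os.path.join(tmp, name), "w").close()

    # os.walk is a generator; next() pulls the tuple for the top directory.
    dirpath, dirnames, filenames = next(os.walk(tmp))
    file_count = len(filenames)
    sorted_names = sorted(filenames)

print(file_count)
print(sorted_names)
```

Unpacking into dirpath, dirnames, and filenames does the same work as indexing with [2], but it keeps the meaning of each position visible.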
Let's get some basic information about these texts: how long they are and how many unique words occur in each one:
In [26]:
import glob
import re

files = {}
for fpath in glob.glob("./texts/*.txt"):
    with open(fpath) as f:
        # Replace everything except letters, apostrophes, and hyphens with a space
        fixed_text = re.sub("[^a-zA-Z'-]", " ", f.read())
    words = fixed_text.split()
    # Store (total word count, unique word count) for each file
    files[fpath] = (len(words), len(set(words)))
for fname in sorted(files):
    print(fname + '\t' + str(files[fname][0]) + '\t' + str(files[fname][1]))
| TEXT | Length | Unique |
|---|---|---|
| anc-088 | 333 | 148 |
| anc-089 | 153 | 86 |
| anc-090 | 138 | 75 |
| anc-091 | 174 | 111 |
| bro-001 | 117 | 74 |
| bro-002 | 122 | 79 |
| bro-003 | 136 | 90 |
| bro-004 | 67 | 50 |
| lau-013 | 384 | 185 |
| lau-014 | 676 | 218 |
| loh-157 | 367 | 180 |
| loh-158 | 193 | 111 |
| loh-159 | 279 | 149 |
| loh-160 | 760 | 306 |
| loh-161 | 295 | 141 |
| loh-162 | 331 | 154 |
| loh-162b | 194 | 114 |
| loh-163 | 209 | 120 |
| loh-164 | 1024 | 339 |
| loh-165 | 904 | 308 |
| uls-001 | 112 | 68 |
| uls-002 | 188 | 102 |
| uls-003 | 234 | 121 |
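The table uses short labels such as anc-088 rather than full file paths. One way those labels could be derived is sketched below; the helper names label and table_row are hypothetical and not part of the original notebook.

```python
import os

def label(fpath):
    """Strip the directory and the .txt extension from a file path."""
    return os.path.splitext(os.path.basename(fpath))[0]

def table_row(fpath, length, unique):
    """Format one row in the same pipe-delimited style as the table."""
    return "| {} | {} | {} |".format(label(fpath), length, unique)

print(table_row("./texts/anc-089.txt", 153, 86))
```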
Let's take a look at the second text on that list, Ancelet 89, which is 153 words long with 86 unique words:
I went to meet an old man in Marrero, and he told me a story. He went to look for a treasure with some other men. And there was a controller who had brought a Bible to control the spirits. And when they arrived at the site, they saw a big horse coming through the woods with a man riding it, and when he dismounted, it was no longer a man on the horse. It was a dog. And he said the dog came and rubbed itself against his legs. He said it was growling. He said he knew the dog was touching him, but he didn't feel anything. It was like there was just a wind. And he said they all took off running. He lost his hat and his glasses and he tore all his clothes. And even the controller ran off and he never saw his Bible again after that.
If we take a look at all the words that occur more than once in the text, we arrive at the following enumeration:
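A count of this kind can be produced with collections.Counter. The following is a minimal sketch, not necessarily how the original enumeration was generated. It reuses the cleanup pattern from the length count and additionally lowercases the text (an assumption), so that "He" and "he" are counted together.

```python
import re
from collections import Counter

# The text of Ancelet 89, copied from above.
text = (
    "I went to meet an old man in Marrero, and he told me a story. "
    "He went to look for a treasure with some other men. "
    "And there was a controller who had brought a Bible to control the spirits. "
    "And when they arrived at the site, they saw a big horse coming through "
    "the woods with a man riding it, and when he dismounted, it was no longer "
    "a man on the horse. It was a dog. "
    "And he said the dog came and rubbed itself against his legs. "
    "He said it was growling. "
    "He said he knew the dog was touching him, but he didn't feel anything. "
    "It was like there was just a wind. "
    "And he said they all took off running. "
    "He lost his hat and his glasses and he tore all his clothes. "
    "And even the controller ran off and he never saw his Bible again after that."
)

# Same cleanup as the length count, plus lowercasing (an assumption).
tokens = re.sub("[^a-zA-Z'-]", " ", text).lower().split()
counts = Counter(tokens)

# Keep only the words that occur more than once, most frequent first.
repeated = [(word, n) for word, n in counts.most_common() if n > 1]
for word, n in repeated:
    print(word, n)
```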
With the possible exceptions of he/his, as indicative of gender roles, and said, as a possible polysyndetic chaining syntax and/or traditionalizing/authorizing claim, most of the frequently used words contribute little to the meaning of the text. Many of them are function words, in contrast with lexical words, and are often dropped from algorithmic analyses that focus on content.
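Dropping function words is usually done by filtering against a stop list. The sketch below uses a tiny hand-made list that is purely illustrative, and it is applied to a single sentence from the text. Whether words like he, his, or said belong on such a list is an analytical choice, as noted above.

```python
import re

# A tiny, hypothetical stop list; real analyses use much longer ones.
FUNCTION_WORDS = {"and", "the", "a", "it", "itself", "against", "he", "his"}

sentence = "And he said the dog came and rubbed itself against his legs."

# Same cleanup as before, plus lowercasing (an assumption).
tokens = re.sub("[^a-zA-Z'-]", " ", sentence).lower().split()

# Keep only the tokens that are not on the stop list.
content_words = [t for t in tokens if t not in FUNCTION_WORDS]
print(content_words)
```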
If we focus on nouns and verbs we can reduce the text to the following:
man (3), dog (3), went (2), saw (2), horse (2), controller (2), bible (2)
Another way to represent this text is as a row that begins with the text's name, followed by one number per word (in the order above) giving how many times that word occurs:
ANC-089, 3, 3, 2, 2, 2, 2, 2
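Building such a row amounts to looking up each vocabulary word in a table of counts. A minimal sketch, using the counts listed above; the helper name to_row is hypothetical.

```python
# Counts for ANC-089, taken from the list above.
counts = {"man": 3, "dog": 3, "went": 2, "saw": 2,
          "horse": 2, "controller": 2, "bible": 2}

# The order of this list fixes the order of the columns.
vocabulary = ["man", "dog", "went", "saw", "horse", "controller", "bible"]

def to_row(name, counts, vocabulary):
    """Return [name, count, count, ...], using 0 for words that do not occur."""
    return [name] + [counts.get(word, 0) for word in vocabulary]

row = to_row("ANC-089", counts, vocabulary)
print(", ".join(str(item) for item in row))
```

Using counts.get(word, 0) matters once several texts share one vocabulary, since most texts will lack some of the words.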
What happens when we compare several texts along these seven dimensions: man, dog, went, saw, horse, controller, bible? Here's our current text again:
ANC-089, 3, 3, 2, 2, 2, 2, 2
And here are some others from the collection:
ANC-088, 3, 0, 2, 0, 0, 3, 0 #Words in common: man, went, controller
LAU-013, 3, 0, 3, 0, 0, 0, 0 #Words in common: man, went
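The "words in common" notes can be derived from the rows themselves: a word is shared when both texts have a nonzero count in that column. A minimal sketch, with the rows copied from above and a hypothetical helper name:

```python
vocabulary = ["man", "dog", "went", "saw", "horse", "controller", "bible"]

# Rows copied from above, keyed by text name.
rows = {
    "ANC-089": [3, 3, 2, 2, 2, 2, 2],
    "ANC-088": [3, 0, 2, 0, 0, 3, 0],
    "LAU-013": [3, 0, 3, 0, 0, 0, 0],
}

def words_in_common(a, b):
    """List the vocabulary words that occur at least once in both texts."""
    return [word for word, x, y in zip(vocabulary, rows[a], rows[b])
            if x > 0 and y > 0]

print(words_in_common("ANC-089", "ANC-088"))
print(words_in_common("ANC-089", "LAU-013"))
```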
If we add two dimensions, bull and shovel, both of which appear in the two additional texts, we see the following:
ANC-089, 3, 3, 2, 2, 2, 2, 2, 0, 0
ANC-088, 3, 0, 2, 0, 0, 3, 0, 3, 1
LAU-013, 3, 0, 3, 0, 0, 0, 0, 3, 2
The latter two texts would appear to have a great deal in common. For the record, both of these texts treat the matter of getting treasure out of the ground with a religious figure. However, in Laudun 13 the person doing the preaching is a relative of the narrator, not an external figure called a "spirit controller," as in the two Ancelet texts. (We will have to deal with the problem of synonyms later.)
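One way to put a number on "a great deal in common" is cosine similarity between the rows. This measure is an assumption of this sketch, not one named in the text; other measures could serve equally well. The nine-dimensional rows are copied from above.

```python
import math

# Rows over: man, dog, went, saw, horse, controller, bible, bull, shovel.
rows = {
    "ANC-089": [3, 3, 2, 2, 2, 2, 2, 0, 0],
    "ANC-088": [3, 0, 2, 0, 0, 3, 0, 3, 1],
    "LAU-013": [3, 0, 3, 0, 0, 0, 0, 3, 2],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Compare every pair of texts once.
names = sorted(rows)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(cosine(rows[a], rows[b]), 3))
```

With these rows, the ANC-088 and LAU-013 pair scores highest of the three pairs, which agrees with the reading above.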
Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: http://ipython.org