Counting Tales

Towards an Algorithmic Analysis of Folk Narratives

Counting Texts

In the code below, we use os.walk, which yields one tuple for each directory it visits: the directory's path, a list of its subdirectories, and a list of its files. For our directory, the tuple looks like this:

('./texts', [], ['anc-088.txt', 'anc-089.txt', 'anc-090.txt', 'anc-091.txt', 'bro-001.txt', 'bro-002.txt', 'bro-003.txt', 'bro-004.txt', 'lau-013.txt', 'lau-014.txt', 'loh-157.txt', 'loh-158.txt', 'loh-159.txt', 'loh-160.txt', 'loh-161.txt', 'loh-162.txt', 'loh-162b.txt', 'loh-163.txt', 'loh-164.txt', 'loh-165.txt', 'uls-001.txt', 'uls-002.txt', 'uls-003.txt'])

Here, the path is the one we provided, ./texts, and it contains no directories, only these files. We pull this first tuple from the generator with next(); the list of files sits in the third position of the tuple (index 2 when counting from zero). Our focus is only on the length of that list: 23.


In [8]:
import os

# next() takes the first tuple from the os.walk generator; index 2 is the file list.
fileCount = len(next(os.walk('./texts'))[2])

print(fileCount)


23
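
As a small, self-contained illustration of what os.walk produces, the sketch below builds a temporary directory and unpacks the first tuple by name rather than by index. The directory is a stand-in for ./texts and the file names are made up, so none of this reflects the actual collection.

```python
import os
import tempfile

# Build a throwaway directory with three empty files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a-001.txt", "a-002.txt", "b-001.txt"]:
        open(os.path.join(tmp, name), "w").close()

    # os.walk is a generator; next() pulls the tuple for the top directory.
    dirpath, dirnames, filenames = next(os.walk(tmp))
    file_count = len(filenames)

print(sorted(filenames))
print(file_count)
```

Unpacking into dirpath, dirnames, and filenames makes it explicit that the file list is the third element of the tuple.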

For the record, here are our texts:


In [7]:
print(next(os.walk('./texts'))[2])


['anc-088.txt', 'anc-089.txt', 'anc-090.txt', 'anc-091.txt', 'bro-001.txt', 'bro-002.txt', 'bro-003.txt', 'bro-004.txt', 'lau-013.txt', 'lau-014.txt', 'loh-157.txt', 'loh-158.txt', 'loh-159.txt', 'loh-160.txt', 'loh-161.txt', 'loh-162.txt', 'loh-162b.txt', 'loh-163.txt', 'loh-164.txt', 'loh-165.txt', 'uls-001.txt', 'uls-002.txt', 'uls-003.txt']

Let's get some basic information about these texts: how long they are and how many unique words occur in each one:


In [26]:
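import glob
import re

# Map each file path to (total word count, unique word count).
files = {}
for fpath in glob.glob("./texts/*.txt"):
    with open(fpath) as f:
        # Replace everything except letters, apostrophes, and hyphens with spaces.
        fixed_text = re.sub("[^a-zA-Z'-]", " ", f.read())
    # Counts are case-sensitive: "He" and "he" are distinct words here.
    words = fixed_text.split()
    files[fpath] = (len(words), len(set(words)))

for fname in sorted(files):
    print(fname + '\t' + str(files[fname][0]) + '\t' + str(files[fname][1]))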


./texts/anc-088.txt	333	148
./texts/anc-089.txt	153	86
./texts/anc-090.txt	138	75
./texts/anc-091.txt	174	111
./texts/bro-001.txt	117	74
./texts/bro-002.txt	122	79
./texts/bro-003.txt	136	90
./texts/bro-004.txt	67	50
./texts/lau-013.txt	384	185
./texts/lau-014.txt	676	218
./texts/loh-157.txt	367	180
./texts/loh-158.txt	193	111
./texts/loh-159.txt	279	149
./texts/loh-160.txt	760	306
./texts/loh-161.txt	295	141
./texts/loh-162.txt	331	154
./texts/loh-162b.txt	194	114
./texts/loh-163.txt	209	120
./texts/loh-164.txt	1024	339
./texts/loh-165.txt	904	308
./texts/uls-001.txt	112	68
./texts/uls-002.txt	188	102
./texts/uls-003.txt	234	121

Text      Length  Unique
anc-088   333     148
anc-089   153     86
anc-090   138     75
anc-091   174     111
bro-001   117     74
bro-002   122     79
bro-003   136     90
bro-004   67      50
lau-013   384     185
lau-014   676     218
loh-157   367     180
loh-158   193     111
loh-159   279     149
loh-160   760     306
loh-161   295     141
loh-162   331     154
loh-162b  194     114
loh-163   209     120
loh-164   1024    339
loh-165   904     308
uls-001   112     68
uls-002   188     102
uls-003   234     121

A Sample Text

Let's take a look at the second text on that list, Ancelet 89 (anc-089), which is 153 words long with 86 unique words:

I went to meet an old man in Marrero, and he told me a story. He went to look for a treasure with some other men. And there was a controller who had brought a Bible to control the spirits. And when they arrived at the site, they saw a big horse coming through the woods with a man riding it, and when he dismounted, it was no longer a man on the horse. It was a dog. And he said the dog came and rubbed itself against his legs. He said it was growling. He said he knew the dog was touching him, but he didn't feel anything. It was like there was just a wind. And he said they all took off running. He lost his hat and his glasses and he tore all his clothes. And even the controller ran off and he never saw his Bible again after that.

Words, Words, Words

If we take a look at all the words that occur more than once in the text, we arrive at the following enumeration:

  • double digits: he (12), and (11)
  • a lot: a (9), was (7), the (7), it (5), his (5), said (4)
  • three times: to (3), they (3), man (3), dog (3)
  • twice: with, when, went, there, saw, off, horse, controller, bible, all
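
The enumeration above can be reproduced with a short sketch like the following. It assumes that words are lowercased before counting (so that "He" and "he" are tallied together, as the list above does) and that a word is any run of letters, apostrophes, or hyphens; both choices are assumptions made for this illustration.

```python
import re
from collections import Counter

# The Ancelet 89 text quoted above.
text = """I went to meet an old man in Marrero, and he told me a story. He went to look for a treasure with some other men. And there was a controller who had brought a Bible to control the spirits. And when they arrived at the site, they saw a big horse coming through the woods with a man riding it, and when he dismounted, it was no longer a man on the horse. It was a dog. And he said the dog came and rubbed itself against his legs. He said it was growling. He said he knew the dog was touching him, but he didn't feel anything. It was like there was just a wind. And he said they all took off running. He lost his hat and his glasses and he tore all his clothes. And even the controller ran off and he never saw his Bible again after that."""

# Lowercase, then keep runs of letters, apostrophes, and hyphens as words.
words = re.findall(r"[a-z'-]+", text.lower())
counts = Counter(words)

# Keep only the words that occur more than once.
repeated = {word: count for word, count in counts.items() if count > 1}

# Print from most to least frequent, breaking ties alphabetically.
for word, count in sorted(repeated.items(), key=lambda item: (-item[1], item[0])):
    print(word, count)
```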

Words that Count

With the possible exceptions of he/his, which may indicate gender roles, and said, which may reflect a polysyndetic chaining syntax and/or a traditionalizing or authorizing claim, many of the most frequently used words do not contribute to the meaning of the text. Many of these are considered function words, in contrast with lexical words, and they are often dropped from algorithmic analyses that focus on content.
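
Dropping function words is usually done with a stop list. The sketch below applies a small hand-picked list to the repeated words of Ancelet 89; the list itself (FUNCTION_WORDS) is a hypothetical one assembled for this single text, not a standard resource, and treating "was" as a function word is a judgment call.

```python
# Counts of the repeated words, copied from the enumeration above.
repeated = {
    "he": 12, "and": 11, "a": 9, "was": 7, "the": 7, "it": 5, "his": 5,
    "said": 4, "to": 3, "they": 3, "man": 3, "dog": 3, "with": 2,
    "when": 2, "went": 2, "there": 2, "saw": 2, "off": 2, "horse": 2,
    "controller": 2, "bible": 2, "all": 2,
}

# Hypothetical, hand-picked stop list for this one text (an assumption).
FUNCTION_WORDS = {
    "he", "and", "a", "was", "the", "it", "his", "to",
    "they", "with", "when", "there", "off", "all",
}

# Keep only the words that are not on the stop list.
content = {word: count for word, count in repeated.items()
           if word not in FUNCTION_WORDS}
print(content)
```

Note that said survives this particular filter, which is one reason it needs separate consideration.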

If we focus on nouns and verbs we can reduce the text to the following:

man (3), dog (3), went (2), saw (2), horse (2), controller (2), bible (2)

Another way to represent this text is as a row: the text's name comes first, and each item that follows gives the number of times one of the words occurs, in the order above:

ANC-089, 3, 3, 2, 2, 2, 2, 2
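
A minimal sketch of how such a row could be built from a dictionary of counts. The helper name to_row is hypothetical, and the counts are copied from the reduced list above.

```python
# Counts for Ancelet 89, copied from the reduced list above.
counts = {"man": 3, "dog": 3, "went": 2, "saw": 2,
          "horse": 2, "controller": 2, "bible": 2}

# The order of the dimensions fixes the order of the columns.
dimensions = ["man", "dog", "went", "saw", "horse", "controller", "bible"]


def to_row(name, counts, dimensions):
    """Return the text's name followed by one count per dimension (0 if absent)."""
    return [name] + [counts.get(word, 0) for word in dimensions]


row = to_row("ANC-089", counts, dimensions)
print(", ".join(str(item) for item in row))
```

Using counts.get(word, 0) means a text that never uses a word still gets a value in that column, which matters once several texts share the same dimensions.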

Comparing Word Counts

What happens when we compare several texts along these seven dimensions: man, dog, went, saw, horse, controller, bible? Here's our current text again:

ANC-089, 3, 3, 2, 2, 2, 2, 2

And here are some others from the collection:

ANC-088, 3, 0, 2, 0, 0, 3, 0 #Words in common: man, went, controller
LAU-013, 3, 0, 3, 0, 0, 0, 0 #Words in common: man, went
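
The "words in common" notes can be derived from the rows themselves: a word is shared when both texts have a nonzero count in that column. A small sketch, with a hypothetical helper named words_in_common:

```python
dimensions = ["man", "dog", "went", "saw", "horse", "controller", "bible"]

# Rows copied from the text above.
rows = {
    "ANC-089": [3, 3, 2, 2, 2, 2, 2],
    "ANC-088": [3, 0, 2, 0, 0, 3, 0],
    "LAU-013": [3, 0, 3, 0, 0, 0, 0],
}


def words_in_common(first, second):
    """List the dimensions on which both texts have a nonzero count."""
    return [word
            for word, x, y in zip(dimensions, rows[first], rows[second])
            if x > 0 and y > 0]


print(words_in_common("ANC-089", "ANC-088"))
print(words_in_common("ANC-089", "LAU-013"))
```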

Adding Words/Dimensions

If we add two dimensions, bull and shovel, both of which appear in the two additional texts, we see the following:

ANC-089, 3, 3, 2, 2, 2, 2, 2, 0, 0
ANC-088, 3, 0, 2, 0, 0, 3, 0, 3, 1
LAU-013, 3, 0, 3, 0, 0, 0, 0, 3, 2

The latter two texts would appear to have a great deal in common. For the record, both of these texts treat the matter of getting treasure out of the ground with the help of a religious figure. However, in Laudun 13 the one preaching is a relative of the narrator, not an external figure called a "spirit controller," as in the two Ancelet texts. (We will have to deal with the problem of synonyms later.)
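
One way to put a number on that impression is cosine similarity, which measures how closely two rows point in the same direction regardless of their length. This measure is not part of the analysis above; it is offered here only as an illustrative assumption, using the three nine-dimensional rows as given.

```python
import math

# Rows copied from the text above (nine dimensions each).
rows = {
    "ANC-089": [3, 3, 2, 2, 2, 2, 2, 0, 0],
    "ANC-088": [3, 0, 2, 0, 0, 3, 0, 3, 1],
    "LAU-013": [3, 0, 3, 0, 0, 0, 0, 3, 2],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


pairs = [("ANC-088", "LAU-013"), ("ANC-089", "ANC-088"), ("ANC-089", "LAU-013")]
for first, second in pairs:
    print(first, second, round(cosine(rows[first], rows[second]), 3))
```

On these rows the ANC-088 and LAU-013 pair scores highest of the three, in line with the observation above, though with only nine dimensions the figures should be read as suggestive at most.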

Resources

Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: http://ipython.org