Counting Tales

Towards an Algorithmic Analysis of Folk Narratives

Counting Texts

In the code below, we use os.walk, which yields one tuple for each directory it visits: the directory's path, a list of its subdirectories, and a list of its files. For our directory, the first (and only) tuple looks like this:

('./texts', [], ['anc-088.txt', 'anc-089.txt', 'anc-090.txt', 'anc-091.txt', 'bro-001.txt', 'bro-002.txt', 'bro-003.txt', 'bro-004.txt', 'lau-013.txt', 'lau-014.txt', 'loh-157.txt', 'loh-158.txt', 'loh-159.txt', 'loh-160.txt', 'loh-161.txt', 'loh-162.txt', 'loh-162b.txt', 'loh-163.txt', 'loh-164.txt', 'loh-165.txt', 'uls-001.txt', 'uls-002.txt', 'uls-003.txt'])

Here, the path is the one we provided, ./texts, and it contains no subdirectories, only these files. The list of files is in the third position of the tuple (index 2 when counting from zero), and we get to it by taking the first tuple the walk yields and indexing into it. Our focus is only on the length of that list: 23.



In [8]:

    
import os

# os.walk is a generator; next() fetches the tuple for the top-level directory
fileCount = len(next(os.walk('./texts'))[2])

print(fileCount)
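
To make the shape of that tuple concrete, here is a minimal, self-contained sketch that builds a throwaway directory and unpacks the first tuple os.walk yields. The file names are placeholders made up for illustration, and the sketch assumes Python 3.

```python
import os
import tempfile

# Build a temporary directory holding three placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a-001.txt", "a-002.txt", "b-001.txt"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("placeholder")

    # os.walk is a generator; next() returns the tuple for the top directory.
    root, dirs, filenames = next(os.walk(tmp))
    file_count = len(filenames)
    sorted_names = sorted(filenames)

print(file_count)
```

Unpacking into three names makes it clearer than indexing with [2] which element is the list of files.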

For the record, here are our texts:



In [7]:

    
print(next(os.walk('./texts'))[2])

['anc-088.txt', 'anc-089.txt', 'anc-090.txt', 'anc-091.txt', 'bro-001.txt', 'bro-002.txt', 'bro-003.txt', 'bro-004.txt', 'lau-013.txt', 'lau-014.txt', 'loh-157.txt', 'loh-158.txt', 'loh-159.txt', 'loh-160.txt', 'loh-161.txt', 'loh-162.txt', 'loh-162b.txt', 'loh-163.txt', 'loh-164.txt', 'loh-165.txt', 'uls-001.txt', 'uls-002.txt', 'uls-003.txt']

Let's get some basic information about these texts: how long they are and how many unique words occur in each one:



In [26]:

    
import glob
import re

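files = {}
for fpath in glob.glob("./texts/*.txt"):
    with open(fpath) as f:
        # Replace everything except letters, apostrophes, and hyphens with spaces
        fixed_text = re.sub(r"[^a-zA-Z'-]", " ", f.read())
    words = fixed_text.split()
    # Store (total word count, unique word count) for each file
    files[fpath] = (len(words), len(set(words)))

# Print one tab-separated row per file: path, length, unique words
for fname in sorted(files):
    print(fname + '\t' + str(files[fname][0]) + '\t' + str(files[fname][1]))
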
./texts/anc-088.txt	333	148
./texts/anc-089.txt	153	86
./texts/anc-090.txt	138	75
./texts/anc-091.txt	174	111
./texts/bro-001.txt	117	74
./texts/bro-002.txt	122	79
./texts/bro-003.txt	136	90
./texts/bro-004.txt	67	50
./texts/lau-013.txt	384	185
./texts/lau-014.txt	676	218
./texts/loh-157.txt	367	180
./texts/loh-158.txt	193	111
./texts/loh-159.txt	279	149
./texts/loh-160.txt	760	306
./texts/loh-161.txt	295	141
./texts/loh-162.txt	331	154
./texts/loh-162b.txt	194	114
./texts/loh-163.txt	209	120
./texts/loh-164.txt	1024	339
./texts/loh-165.txt	904	308
./texts/uls-001.txt	112	68
./texts/uls-002.txt	188	102
./texts/uls-003.txt	234	121

TEXT	Length	Unique
anc-088	333	148
anc-089	153	86
anc-090	138	75
anc-091	174	111
bro-001	117	74
bro-002	122	79
bro-003	136	90
bro-004	67	50
lau-013	384	185
lau-014	676	218
loh-157	367	180
loh-158	193	111
loh-159	279	149
loh-160	760	306
loh-161	295	141
loh-162	331	154
loh-162b	194	114
loh-163	209	120
loh-164	1024	339
loh-165	904	308
uls-001	112	68
uls-002	188	102
uls-003	234	121

A Sample Text

Let's take a look at the second text on that list, Ancelet 89, which is 153 words long with 86 unique words:

I went to meet an old man in Marrero, and he told me a story. He went to look for a treasure with some other men. And there was a controller who had brought a Bible to control the spirits. And when they arrived at the site, they saw a big horse coming through the woods with a man riding it, and when he dismounted, it was no longer a man on the horse. It was a dog. And he said the dog came and rubbed itself against his legs. He said it was growling. He said he knew the dog was touching him, but he didn't feel anything. It was like there was just a wind. And he said they all took off running. He lost his hat and his glasses and he tore all his clothes. And even the controller ran off and he never saw his Bible again after that.

Words, Words, Words

If we take a look at all the words that occur more than once in the text, we arrive at the following enumeration:

double digits: he (12), and (11)
a lot: a (9), was (7), the (7), it (5), his (5), said (4)
three times: to (3), they (3), man (3), dog (3)
twice: with, when, went, there, saw, off, horse, controller, bible, all
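
An enumeration like this one can be produced with a short script. The sketch below reuses the cleanup pattern from the earlier cell and counts words with collections.Counter. Lowercasing the text first is an assumption on our part, made so that "He" and "he" are counted together.

```python
import re
from collections import Counter

text = (
    "I went to meet an old man in Marrero, and he told me a story. "
    "He went to look for a treasure with some other men. "
    "And there was a controller who had brought a Bible to control the spirits. "
    "And when they arrived at the site, they saw a big horse coming through "
    "the woods with a man riding it, and when he dismounted, it was no longer "
    "a man on the horse. It was a dog. "
    "And he said the dog came and rubbed itself against his legs. "
    "He said it was growling. "
    "He said he knew the dog was touching him, but he didn't feel anything. "
    "It was like there was just a wind. "
    "And he said they all took off running. "
    "He lost his hat and his glasses and he tore all his clothes. "
    "And even the controller ran off and he never saw his Bible again after that."
)

# Keep letters, apostrophes, and hyphens only; lowercase so "He" matches "he".
words = re.sub(r"[^a-zA-Z'-]", " ", text).lower().split()
counts = Counter(words)

# Keep only the words that occur more than once, most frequent first.
repeated = [(word, n) for word, n in counts.most_common() if n > 1]
for word, n in repeated:
    print(word, n)
```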

Words that Count

With the possible exceptions of he/his, which may be indicative of gender roles, and said, which may signal a polysyndetic chaining syntax and/or a traditionalizing/authorizing claim, many of the most frequently used words do not contribute to the meaning of the text. Many of these words are considered function words, in contrast with lexical words, and are often dropped from algorithmic analyses that focus on content.

If we focus on nouns and verbs we can reduce the text to the following:

man (3), dog (3), went (2), saw (2), horse (2), controller (2), bible (2)
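
One rough way to approximate this reduction without part-of-speech tagging is to filter the repeated words against a stop list. The stop list in this sketch is hypothetical and hand-picked so that the result matches the list above (note that it drops the verbs was and said as well); a real analysis would more likely rely on a published stop list or a tagger.

```python
# Repeated-word counts copied from the enumeration above.
counts = {
    "he": 12, "and": 11, "a": 9, "was": 7, "the": 7, "it": 5, "his": 5,
    "said": 4, "to": 3, "they": 3, "man": 3, "dog": 3, "with": 2,
    "when": 2, "went": 2, "there": 2, "saw": 2, "off": 2, "horse": 2,
    "controller": 2, "bible": 2, "all": 2,
}

# Hypothetical, hand-picked stop list (an assumption, not a standard resource).
stop_words = {
    "he", "and", "a", "was", "the", "it", "his", "said", "to", "they",
    "with", "when", "there", "off", "all",
}

# Keep only the words that are not on the stop list.
content_words = {w: n for w, n in counts.items() if w not in stop_words}

print(", ".join("{} ({})".format(w, n) for w, n in content_words.items()))
```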

Another way to represent this text is as a row that begins with the text's name, followed by one number for each of the words above (in the same order) giving how many times that word occurs:

ANC-089, 3, 3, 2, 2, 2, 2, 2
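
A row like this can be built from a dictionary of counts and a fixed word order. In the sketch below, the names vocabulary and to_row are our own, introduced only for illustration.

```python
# The seven words, in the order used above.
vocabulary = ["man", "dog", "went", "saw", "horse", "controller", "bible"]

# Counts for Ancelet 89, taken from the list above.
counts = {"man": 3, "dog": 3, "went": 2, "saw": 2,
          "horse": 2, "controller": 2, "bible": 2}


def to_row(name, counts, vocabulary):
    """Return [name, count_1, count_2, ...], using 0 for absent words."""
    return [name] + [counts.get(word, 0) for word in vocabulary]


row = to_row("ANC-089", counts, vocabulary)
line = ", ".join(str(item) for item in row)
print(line)
```

Using counts.get(word, 0) means a text that lacks one of the words still gets a full-length row, which matters once several texts are lined up against the same vocabulary.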

Comparing Word Counts

What happens when we compare several texts along these seven dimensions: man, dog, went, saw, horse, controller, bible? Here's our current text again:

ANC-089, 3, 3, 2, 2, 2, 2, 2

And here are some others from the collection:

ANC-088, 3, 0, 2, 0, 0, 3, 0 #Words in common: man, went, controller
LAU-013, 3, 0, 3, 0, 0, 0, 0 #Words in common: man, went
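
The "words in common" notes can be derived mechanically: a word is shared when both rows have a nonzero count in that position. A minimal sketch, with words_in_common as a helper name of our own:

```python
vocabulary = ["man", "dog", "went", "saw", "horse", "controller", "bible"]

# Rows copied from the comparison above.
rows = {
    "ANC-089": [3, 3, 2, 2, 2, 2, 2],
    "ANC-088": [3, 0, 2, 0, 0, 3, 0],
    "LAU-013": [3, 0, 3, 0, 0, 0, 0],
}


def words_in_common(row_a, row_b):
    """Return the vocabulary words that occur at least once in both rows."""
    return [word for word, a, b in zip(vocabulary, row_a, row_b)
            if a > 0 and b > 0]


for name in ("ANC-088", "LAU-013"):
    shared = words_in_common(rows["ANC-089"], rows[name])
    print(name, "shares:", ", ".join(shared))
```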

Adding Words/Dimensions

If we add two dimensions, bull and shovel, both of which appear in the two additional texts, we see the following:

ANC-089, 3, 3, 2, 2, 2, 2, 2, 0, 0
ANC-088, 3, 0, 2, 0, 0, 3, 0, 3, 1
LAU-013, 3, 0, 3, 0, 0, 0, 0, 3, 2

The latter two texts would appear to have a great deal in common. For the record, both of these texts treat the matter of getting treasure out of the ground with a religious figure. However, in Laudun 13 the one doing the preaching is a relative of the narrator, not an external figure called a "spirit controller," as in the two Ancelet texts. (We will have to deal with the problem of synonyms later.)
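
One common way to put a number on "a great deal in common" is cosine similarity between the rows. This measure is our own choice for illustration, not one the discussion above commits to, and the function name is likewise ours.

```python
import math

# Nine-dimensional rows copied from the table above
# (man, dog, went, saw, horse, controller, bible, bull, shovel).
rows = {
    "ANC-089": [3, 3, 2, 2, 2, 2, 2, 0, 0],
    "ANC-088": [3, 0, 2, 0, 0, 3, 0, 3, 1],
    "LAU-013": [3, 0, 3, 0, 0, 0, 0, 3, 2],
}


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


names = sorted(rows)
scores = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        scores[(first, second)] = cosine_similarity(rows[first], rows[second])
        print(first, second, round(scores[(first, second)], 3))
```

Under this measure, the pair ANC-088 and LAU-013 scores highest of the three pairs, which agrees with the impression that those two texts are the closest.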

Resources

Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: http://ipython.org