The problem of the distribution of epithet docs

Because most epithets have few representative documents, I will create another feature table, this time cutting out the docs that belong to those thinly attested epithets.

As the following table shows, there is a long tail of epithets with few surviving representatives.


In [1]:
from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_index
import pandas

# Count how many author IDs carry each epithet
epithet_frequencies = []
for epithet, _ids in get_epithet_index().items():
    epithet_frequencies.append((epithet, len(_ids)))
df = pandas.DataFrame(epithet_frequencies)  # column 0: epithet, column 1: doc count
df.sort_values(1, ascending=False)


Out[1]:
0 1
9 Historici/-ae 340
43 Philosophici/-ae 245
39 Comici 150
16 Tragici 85
32 Grammatici 74
29 Rhetorici 67
52 Epici/-ae 67
1 Scriptores Ecclesiastici 63
5 Lyrici/-ae 57
46 Medici 46
47 Sophistae 43
28 Poetae 40
50 Theologici 36
17 Elegiaci 33
25 Alchemistae 26
37 Epigrammatici/-ae 23
19 Mathematici 15
23 Astronomici 14
3 Astrologici 14
30 Iambici 14
8 Geographi 13
26 Oratores 12
48 Epistolographi 10
45 Biographi 10
20 Apologetici 9
42 Paradoxographi 9
36 Periegetae 9
4 Philologi 9
14 Scriptores Erotici 8
12 Mechanici 8
7 Poetae Philosophi 8
24 Mythographi 7
38 Tactici 6
6 Chronographi 6
53 Parodii 5
35 Scriptores Rerum Naturalium 5
2 Paroemiographi 5
27 Musici 5
51 Gnomici 4
0 Poetae Medici 4
31 Atticistae 4
15 Geometri 4
11 Lexicographi 4
44 Polyhistorici 3
54 Bucolici 3
40 Scriptores Fabularum 2
33 Gnostici 2
22 Mimographi 2
21 Onirocritici 2
13 Hagiographi 2
10 Doxographi 2
41 Hymnographi 1
34 Nomographi 1
18 Poetae Didactici 1
49 Choliambographi 1

Wikipedia on the long tail:

The specific cutoff of what part of a distribution is the "long tail" is often arbitrary, but in some cases may be specified objectively; see segmentation of rank-size distributions.

So I'll do this semi-objectively: I will cut out the epithets whose document counts have a negative standard score (that is, fall below the mean). In practice, that means dropping epithets with fewer than 26 representative documents (26 is where the z-score, -0.064, first dips below zero).
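
For reference, the standard score is z = (x - mean) / std. Computed over the 55 epithet counts above (mean ~= 29.78 and std ~= 58.71; scipy.stats.zscore uses the population standard deviation), the count 26 gives z = (26 - 29.78) / 58.71 ~= -0.064, matching the printout below.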

See the following printout for the full z-score distribution.


In [2]:
from scipy import stats

# z-score each epithet's doc count, sorted descending
distribution = sorted(df[1], reverse=True)
zscores = stats.zscore(distribution)
list(zip(distribution, zscores))


Out[2]:
[(340, 5.2838254196858783),
 (245, 3.6657274348154809),
 (150, 2.047629449945084),
 (85, 0.94050977608639141),
 (74, 0.75315106204876658),
 (67, 0.63392278947936886),
 (67, 0.63392278947936886),
 (63, 0.56579234801114164),
 (57, 0.46359668580880081),
 (46, 0.27623797177117587),
 (43, 0.22514014067000546),
 (40, 0.17404230956883504),
 (36, 0.10591186810060781),
 (33, 0.054814036999437377),
 (26, -0.064414235569960288),
 (23, -0.11551206667113072),
 (15, -0.25177294960758517),
 (14, -0.26880555997464201),
 (14, -0.26880555997464201),
 (14, -0.26880555997464201),
 (13, -0.28583817034169878),
 (12, -0.30287078070875562),
 (10, -0.33693600144286923),
 (10, -0.33693600144286923),
 (9, -0.35396861180992606),
 (9, -0.35396861180992606),
 (9, -0.35396861180992606),
 (9, -0.35396861180992606),
 (8, -0.37100122217698284),
 (8, -0.37100122217698284),
 (8, -0.37100122217698284),
 (7, -0.38803383254403967),
 (6, -0.40506644291109645),
 (6, -0.40506644291109645),
 (5, -0.42209905327815328),
 (5, -0.42209905327815328),
 (5, -0.42209905327815328),
 (5, -0.42209905327815328),
 (4, -0.43913166364521011),
 (4, -0.43913166364521011),
 (4, -0.43913166364521011),
 (4, -0.43913166364521011),
 (4, -0.43913166364521011),
 (3, -0.45616427401226689),
 (3, -0.45616427401226689),
 (2, -0.47319688437932372),
 (2, -0.47319688437932372),
 (2, -0.47319688437932372),
 (2, -0.47319688437932372),
 (2, -0.47319688437932372),
 (2, -0.47319688437932372),
 (1, -0.4902294947463805),
 (1, -0.4902294947463805),
 (1, -0.4902294947463805),
 (1, -0.4902294947463805)]

In [22]:
# Make the set of epithets to drop: those with fewer than 26 docs
to_drop = set(df.loc[df[1] < 26, 0])
to_drop


Out[22]:
{'Apologetici',
 'Astrologici',
 'Astronomici',
 'Atticistae',
 'Biographi',
 'Bucolici',
 'Choliambographi',
 'Chronographi',
 'Doxographi',
 'Epigrammatici/-ae',
 'Epistolographi',
 'Geographi',
 'Geometri',
 'Gnomici',
 'Gnostici',
 'Hagiographi',
 'Hymnographi',
 'Iambici',
 'Lexicographi',
 'Mathematici',
 'Mechanici',
 'Mimographi',
 'Musici',
 'Mythographi',
 'Nomographi',
 'Onirocritici',
 'Oratores',
 'Paradoxographi',
 'Parodii',
 'Paroemiographi',
 'Periegetae',
 'Philologi',
 'Poetae Didactici',
 'Poetae Medici',
 'Poetae Philosophi',
 'Polyhistorici',
 'Scriptores Erotici',
 'Scriptores Fabularum',
 'Scriptores Rerum Naturalium',
 'Tactici'}

Make vectorizer

Now, when loading documents, drop those belonging to an epithet in the to_drop set.


In [23]:
import datetime as dt
import os
import time

from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_of_author
from cltk.corpus.greek.tlg.parse_tlg_indices import get_id_author
import pandas
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
def stream_lemmatized_files(corpus_dir):
    """Yield (TLG ID, text) for every doc in a directory."""
    user_dir = os.path.expanduser('~/cltk_data/user_data/' + corpus_dir)
    files = os.listdir(user_dir)

    for file in files:
        filepath = os.path.join(user_dir, file)
        with open(filepath) as fo:
            #TODO rm words less than 3 chars long
            yield file[3:-4], fo.read()  # strip the filename prefix/extension to get the ID
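
The TODO above could be handled at read time. A minimal sketch, assuming the lemmatized files are space-separated tokens (read_and_filter is a hypothetical helper, not part of this notebook):

def read_and_filter(filepath, min_chars=3):
    # Hypothetical helper: drop tokens shorter than min_chars
    with open(filepath) as fo:
        tokens = fo.read().split()
    return ' '.join(t for t in tokens if len(t) >= min_chars)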

In [25]:
t0 = dt.datetime.utcnow()

map_id_author = get_id_author()

df = pandas.DataFrame(columns=['id', 'author', 'text', 'epithet'])

for _id, text in stream_lemmatized_files('tlg_lemmatized_no_accents_no_stops'):
    author = map_id_author[_id]
    epithet = get_epithet_of_author(_id)
    if epithet in to_drop:
        continue
    df = df.append({'id': _id, 'author': author, 'text': text, 'epithet': epithet}, ignore_index=True)

print(df.shape)
print('... finished in {}'.format(dt.datetime.utcnow() - t0))
print('Number of texts:', len(df))


(1587, 4)
... finished in 0:00:09.806514
Number of texts: 1587

In [26]:
text_list = df['text'].tolist()

# Make a list of the indices of short texts to drop
# For pres, get distributions of words per doc
short_text_drop_index = [index for index, text in enumerate(text_list)
                         if len(text) <= 500]  # ~100 words
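
The index list is computed but never applied in this notebook. If it were, a minimal sketch of applying it (assuming df and text_list are still row-aligned):

# Hypothetical application of the short-text filter (not run here)
drop = set(short_text_drop_index)
keep = [i for i in range(len(text_list)) if i not in drop]
text_list = [text_list[i] for i in keep]
df = df.iloc[keep].reset_index(drop=True)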

In [27]:
t0 = dt.datetime.utcnow()

# TODO: Consider using generator to CV http://stackoverflow.com/a/21600406

# time & size counts, w/ 50 texts:
# 0:01:15 & 202M @ ngram_range=(1, 3), min_df=2, max_features=500
# 0:00:26 & 80M @ ngram_range=(1, 2), analyzer='word', min_df=2, max_features=5000
# 0:00:24 & 81M @ ngram_range=(1, 2), analyzer='word', min_df=2, max_features=50000

# time & size counts, w/ 1823 texts:
# 0:02:18 & 46MB @ ngram_range=(1, 1), analyzer='word', min_df=2, max_features=500000
# 0:2:01 & 47 @ ngram_range=(1, 1), analyzer='word', min_df=2, max_features=1000000

# max features in the lemmatized data set: 551428
max_features = 100000
ngrams = 1
vectorizer = CountVectorizer(ngram_range=(1, ngrams), analyzer='word', 
                             min_df=2, max_features=max_features)
term_document_matrix = vectorizer.fit_transform(text_list)  # input is a list of strings, 1 per document

# save matrix
vector_fp = os.path.expanduser('~/cltk_data/user_data/vectorizer_test_features{0}_ngrams{1}.pickle'.format(max_features, ngrams))
joblib.dump(term_document_matrix, vector_fp)

print('... finished in {}'.format(dt.datetime.utcnow() - t0))


... finished in 0:00:51.457374
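
On the generator TODO above: fit_transform accepts any iterable of strings, so the in-memory text_list could be traded for a second pass over the files. A sketch reusing the stream and filter from earlier (not what was run here):

def stream_texts(corpus_dir):
    # Yield only the text, skipping dropped epithets (mirrors the loading loop)
    for _id, text in stream_lemmatized_files(corpus_dir):
        if get_epithet_of_author(_id) not in to_drop:
            yield text

term_document_matrix = vectorizer.fit_transform(stream_texts('tlg_lemmatized_no_accents_no_stops'))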

Transform term matrix into feature table


In [28]:
# Put BoW vectors into a new df
term_document_matrix = joblib.load(vector_fp)  # scipy.sparse.csr.csr_matrix

In [29]:
term_document_matrix.shape


Out[29]:
(1587, 100000)

In [30]:
term_document_matrix_array = term_document_matrix.toarray()

In [31]:
dataframe_bow = pandas.DataFrame(term_document_matrix_array, columns=vectorizer.get_feature_names())
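
Note that toarray() densifies a 1587 x 100000 count matrix, which costs on the order of a gigabyte of RAM. On newer pandas versions (0.25+), a sparse-backed frame is an alternative; a sketch, not what was run here:

dataframe_bow = pandas.DataFrame.sparse.from_spmatrix(
    term_document_matrix, columns=vectorizer.get_feature_names())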

In [32]:
ids_list = df['id'].tolist()

In [33]:
len(ids_list)


Out[33]:
1587

In [34]:
dataframe_bow.shape


Out[34]:
(1587, 100000)

In [35]:
dataframe_bow['id'] = ids_list

In [36]:
authors_list = df['author'].tolist()
dataframe_bow['author'] = authors_list

In [37]:
epithets_list = df['epithet'].tolist()
dataframe_bow['epithet'] = epithets_list

In [38]:
# For pres, show the epithet of each doc (including None)
dataframe_bow['epithet']


Out[38]:
0                  Historici/-ae
1                        Tragici
2                        Tragici
3                         Comici
4                           None
5                           None
6                  Historici/-ae
7               Philosophici/-ae
8                      Sophistae
9                     Theologici
10                 Historici/-ae
11      Scriptores Ecclesiastici
12                          None
13                    Lyrici/-ae
14              Philosophici/-ae
15                       Tragici
16                          None
17                          None
18                        Medici
19                 Historici/-ae
20                 Historici/-ae
21                        Medici
22                    Lyrici/-ae
23      Scriptores Ecclesiastici
24                       Tragici
25                          None
26                    Grammatici
27                 Historici/-ae
28                        Comici
29                        Comici
                  ...           
1557               Historici/-ae
1558                  Grammatici
1559                    Elegiaci
1560               Historici/-ae
1561               Historici/-ae
1562               Historici/-ae
1563                        None
1564               Historici/-ae
1565                        None
1566            Philosophici/-ae
1567            Philosophici/-ae
1568                    Elegiaci
1569                  Lyrici/-ae
1570                 Alchemistae
1571            Philosophici/-ae
1572            Philosophici/-ae
1573                      Comici
1574                      Comici
1575            Philosophici/-ae
1576                  Lyrici/-ae
1577                   Sophistae
1578                   Epici/-ae
1579            Philosophici/-ae
1580            Philosophici/-ae
1581               Historici/-ae
1582            Philosophici/-ae
1583                  Lyrici/-ae
1584               Historici/-ae
1585                        None
1586                      Comici
Name: epithet, dtype: object
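
For the actual distribution of epithets, value_counts does the counting; a one-liner (dropna=False keeps the None rows in the tally):

dataframe_bow['epithet'].value_counts(dropna=False)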

In [39]:
t0 = dt.datetime.utcnow()

# Remove rows whose epithet is None (drops 334 of the 1587 rows)
# On selecting None in pandas: http://stackoverflow.com/a/24489602
dataframe_bow = dataframe_bow[dataframe_bow.epithet.notnull()]

print('... finished in {}'.format(dt.datetime.utcnow() - t0))


... finished in 0:00:00.489680

In [40]:
t0 = dt.datetime.utcnow()

dataframe_bow.to_csv(os.path.expanduser('~/cltk_data/user_data/tlg_bow.csv'))

print('... finished in {}'.format(dt.datetime.utcnow() - t0))


... finished in 0:03:55.564356

In [41]:
dataframe_bow.shape


Out[41]:
(1253, 100003)

In [42]:
dataframe_bow.head(10)


Out[42]:
αʹ ααα ααπτος ααπτους ααρων αασαμην αατος ααω αβαθης αβακιον ... ϲωμα ϲωματα ϲωματι ϲωματοϲ ϲωματων ϲωμαϲι ϲωμαϲιν id author epithet
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1459 Lepidus Hist. Historici/-ae
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0825 Melito Trag. Tragici
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0331 [Polyidus] Trag. Tragici
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0417 Archippus Comic. Comici
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 2475 Menecrates Hist. Historici/-ae
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 4075 Marinus Phil. Philosophici/-ae
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 2127 Troilus Soph. Sophistae
9 0 0 0 0 4 0 1 0 0 0 ... 0 0 0 0 0 0 0 2074 Apollinaris Theol. Theologici
10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 2173 Antileon Hist. Historici/-ae
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1419 Hermas Scr. Eccl., Pastor Hermae Scriptores Ecclesiastici

10 rows × 100003 columns


In [43]:
# write dataframe_bow to disk, for fast reuse while classifying
# 2.3G
fp_df = os.path.expanduser('~/cltk_data/user_data/tlg_bow_df.pickle')
joblib.dump(dataframe_bow, fp_df)


Out[43]:
['/root/cltk_data/user_data/tlg_bow_df.pickle']
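
For the fast reuse mentioned in the comment, the pickle loads straight back in the classification step:

dataframe_bow = joblib.load(fp_df)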
