Learning Syllabification

Speakers have relatively clear intuitions about how words in their native language should be divided up into syllables. These intuitions are quite systematic across speakers of the same language, and can be applied to new words that a speaker has never heard before (e.g. English speakers have the intuition that the fake word 'haldapet' should be syllabified as 'hal.da.pet', and not some other way). This tells us that speakers have a systematic way of grouping sounds into syllables, and that this system can be applied to new data. A lot of research has been done on syllabification in English, and there are well-developed analyses of the rule system which underlies syllabification in English, but syllable structure has recieved much less attention in less well studied languages. The aim of this lab is to discover the rules governing syllable structure in Indonesian.

Downloading and Cleaning the Data


In [1]:
import pandas as pd

In [2]:
# I downloaded an Indonesian dictionary in SQL db format from the the Indonesian Ministry for Edcuation and Culture

import sqlite3
conn = sqlite3.connect('KBBI.db')
c = conn.cursor()

In [3]:
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(c.fetchall())


[(u'datakata',), (u'android_metadata',), (u'bookmark',)]

In [4]:
# Taking a look at the data, we see that there are three columns: an index '_id', keyword 'katakunci' and definition
# 'artikata'
kbbi = pd.pandas.read_sql('SELECT * FROM datakata',conn, index_col='_id')
print kbbi.shape
kbbi.head(2)


(35969, 2)
Out[4]:
katakunci artikata
_id
1 a <b><sup>1</sup>A, a</b&gt...
2 a <b><sup>3</sup>a-</b> ...

In [5]:
#  I'll reset the column names to English for clarity
kbbi.columns = ['keyword','definition']
kbbi.head(2)


Out[5]:
keyword definition
_id
1 a <b><sup>1</sup>A, a</b&gt...
2 a <b><sup>3</sup>a-</b> ...

In [6]:
# The 'definition' column is filled contains unusual coding.  We are only interested in words where there are syllable
# boundaries.  Let's find some words that have multiple syllables and determine a strategy for parsing out the 
# syllabified string. The best way to do this is to search for keywords that contain multiple vowels. Below is
# a function which will detect multiple syllables.

def vowel_count(keyword):
    vowels = ['a','e','i','o','u']
    count = 0
    for letter in keyword:
        if letter in vowels:
            count += 1
    return count

In [7]:
# Now let's create a column listing a count of the vowels in each keyword.
kbbi['vowel_count'] = kbbi['keyword'].apply(lambda x: vowel_count(x))
kbbi.head(10)


Out[7]:
keyword definition vowel_count
_id
1 a <b><sup>1</sup>A, a</b&gt... 1
2 a <b><sup>3</sup>a-</b> ... 1
3 ab <b><sup>1</sup>ab</b> ... 1
4 ab <b><sup>2</sup>ab</b> ... 1
5 ab <b><sup>3</sup>ab-</b>... 1
6 aba <b>aba</b> <i>n</i> ay... 2
7 aba-aba <b>aba-aba</b> <i>n</i&gt... 4
8 abad <b>abad</b> <i>n</i> &... 2
9 abadi <b>aba·di</b> <i>a</i>... 3
10 abadiah <b>aba·di·ah</b> <i>Ar n<... 4

In [8]:
# Just looking at the first few examples we can see that the data is quite inconsistent in terms of what
# information is included. We can see that for some keywords, e.g. 'abadiah', information about syllabification
# is provided, since we see the string 'aba·di·ah' buried within the definition.  Monosyllabic forms like 'ab' lack
# an explicit syllabification.  Moreover, some keywords, like the form 'aba-aba' (a reduplicated form) are clearly
# polysyllabic, but nevertheless lack explicit syllabification. It would appear from these limited examples that
# polysyllabic for which syllabification is marked contain the symbol '·'.  If this symbol is indeed unique to
# entries with syllabification, it should be the case that a search for entries containing '·' will only 
# give us forms for which there is more than one vowel.  Let's test this hypothesis.

def syl_dot(string):
    if '·' in string:
        return True
    else:
        return False

In [9]:
# I had to encode the string in 'definition' as 'utf-8' since, in 'ascii', I could not search for the symbol '·' 
kbbi['definition'] = kbbi['definition'].str.encode(encoding='utf-8')
kbbi['syllable_divider'] = kbbi['definition'].apply(lambda x: syl_dot(x))

In [10]:
# now let's check to see that all rows with '·' contain more than one vowel
divider_words = kbbi[kbbi['syllable_divider'] == True]
divider_words['vowel_count'].value_counts()


Out[10]:
2    16760
3     9729
4     4900
5     1862
6      596
7      145
1       97
8       37
9        5
Name: vowel_count, dtype: int64

In [11]:
# I expected that the words containing only a single vowel (words that should be monosyllabic) would never contain
# the syllable divider symbol '·', but this was not the case. I'm going to take a closer look at these forms to see
# why they contain this symbol.

divided_one_V = divider_words[divider_words['vowel_count'] == 1]
divided_one_V


Out[11]:
keyword definition vowel_count syllable_divider
_id
930 am <b>am</b> <i>a</i> &lt... 1 True
2818 bank <b>bank</b> <i>n</i> b... 1 True
3225 bel <b><sup>1</sup>bel</b>... 1 True
4157 bis <b><sup>1</sup>bis</b>... 1 True
4301 bom <b><sup>1</sup>bom</b>... 1 True
4313 bon <b><sup>1</sup>bon</b>... 1 True
4375 bor <b><sup>1</sup>bor</b>... 1 True
4592 buk <b><sup>2</sup>buk, me·nge·b... 1 True
4952 cam <b><sup>1</sup>cam</b>... 1 True
5083 cap <b><sup>1</sup>cap</b>... 1 True
5139 cas <b><sup>1</sup>cas</b>... 1 True
5140 cas <b><sup>2</sup>cas</b>... 1 True
5143 cat <b>cat</b> <i>n</i> &l... 1 True
5227 cek <b><sup>3</sup>cek</b>... 1 True
5890 cor <b>cor</b> <i>v,</i> &... 1 True
6055 dab <b>dab, me·nge·dab</b> <i>v ... 1 True
6463 deg <b>deg, deg-deg·an</b> <i>v ... 1 True
6731 dep <b>dep</b> /dép/ <i>v,</i... 1 True
7398 dor <b>dor</b> <i>n</i> ti... 1 True
7419 dot <b>dot</b> <i>n</i> al... 1 True
7445 drel <b>drel</b> /drél/ <i>n</... 1 True
7453 dril <b><sup>2</sup>dril</b&gt... 1 True
7456 drop <b>drop</b> <i>v cak</i&g... 1 True
7471 dub <b>dub</b> <i>v,</i> &... 1 True
7530 dum <b>dum</b> <i>n cak</i&gt... 1 True
7552 dup <b>dup, me·nge·dup</b> <i>v ... 1 True
8783 film <b>film</b> <i>n</i> &... 1 True
10469 gol <b>gol</b> <b>1</b> &l... 1 True
10477 golf <b>golf</b> <i>n</i> c... 1 True
10741 gung <b>gung</b> <i>n</i> &... 1 True
... ... ... ... ...
28153 saf <b>saf</b> <i>n</i> de... 1 True
28183 sah <b>sah</b> <b>1</b> &l... 1 True
29119 sel <b><sup>1</sup>sel</b>... 1 True
29120 sel <b><sup>2</sup>sel</b>... 1 True
29578 sen <b><sup>1</sup>sen</b>... 1 True
29832 sep <b>sep</b> /sép/ <i>ark n&lt... 1 True
30236 set <b><sup>2</sup>set</b>... 1 True
30809 sir <b><sup>2</sup>sir</b>... 1 True
30965 skor <b>skor</b> <i>n</i> &... 1 True
30968 skors <b>skors</b> <i>v,</i>... 1 True
31045 sol <b><sup>1</sup>sol</b>... 1 True
31080 som <b>som</b> <i>n cak</i&gt... 1 True
31590 suh <b>suh</b> <i>n</i> pa... 1 True
31727 sun <b><sup>2</sup>sun</b>... 1 True
32020 syak <b>syak</b> <b>1</b> &... 1 True
32080 syur <b><sup>2</sup>syur</b&gt... 1 True
32570 tap <b><sup>1</sup>tap</b>... 1 True
32885 teh <b>teh</b> /téh/ <i>n</i&... 1 True
33094 tem <b>tem</b> /tém/ <i>n cak&lt... 1 True
33614 tes <b>tes</b> /tés/ <i>n</i&... 1 True
33699 tik <b><sup>2</sup>tik</b>... 1 True
33729 tim <b><sup>2</sup>tim</b>... 1 True
33822 tip <b><sup>2</sup>tip</b>... 1 True
34028 top <b><sup>3</sup>top</b>... 1 True
34069 tos <b>tos</b> <i>n cak</i&gt... 1 True
34170 trek <b>trek</b> /trék/ <i>n</... 1 True
34178 tren <b>tren</b> /trén/ <i>n</... 1 True
34276 truk <b>truk</b> <i>n</i> m... 1 True
34397 tum <b><sup>2</sup>tum</b>... 1 True
35783 yang <b><sup>2</sup>yang</b&gt... 1 True

97 rows × 4 columns


In [12]:
# Inspection on the string reveals that these definitions contain morphologically complex forms containing the 
# keyword as a base or root (i.e. forms which contain suffixes, prefixes, etc.).  These complex forms are 
# polysyllabic and contain explicit syllabification.  For example, the entry for the keyword 'am' (below) contains
# information about the derived form meng-am-kan, which in turn contains the the syllablic marker '·' 

print divided_one_V['definition'][930]


<b>am</b> <i>a</i> <b>1</b> tidak terbatas pd orang atau golongan tertentu; umum; awam: <i>orang --;</i> <b>2</b> tidak terbatas pd bidang tertentu: <i>pengetahuan --;<br></i><b>meng·am·kan</b> <i>v</i> <b>1</b> menyediakan untuk orang; <b>2</b> mengumumkan; menyampaikan (menyiarkan, memberitahukan) kpd khalayak

In [13]:
# In forms where a syllabification for the keyword itself is provided, it appears that the syllabified form
# is provided in the first 'chunk' of the text string. With this in mind, let's parse the text before the first ' '
# in 'definition'.  I suspect that this chunk of text will only contain '·' in cases where a syllabification of the
# keyword itself is being provided (rather than some other form buried deeper in the definition text).

kbbi['string_1'] = kbbi['definition'].apply(lambda x: x.split(' ')[0])

In [14]:
# Now we can reset the column syllable divider based on whether the divider symbol occurs in the 'string_1' column

kbbi['syllable_divider'] = kbbi['string_1'].apply(lambda x: syl_dot(x))
kbbi.head(10)


Out[14]:
keyword definition vowel_count syllable_divider string_1
_id
1 a <b><sup>1</sup>A, a</b&gt... 1 False <b><sup>1</sup>A,
2 a <b><sup>3</sup>a-</b> ... 1 False <b><sup>3</sup>a-</b>
3 ab <b><sup>1</sup>ab</b> ... 1 False <b><sup>1</sup>ab</b>
4 ab <b><sup>2</sup>ab</b> ... 1 False <b><sup>2</sup>ab</b>
5 ab <b><sup>3</sup>ab-</b>... 1 False <b><sup>3</sup>ab-</b>
6 aba <b>aba</b> <i>n</i> ay... 2 False <b>aba</b>
7 aba-aba <b>aba-aba</b> <i>n</i&gt... 4 False <b>aba-aba</b>
8 abad <b>abad</b> <i>n</i> &... 2 False <b>abad</b>
9 abadi <b>aba·di</b> <i>a</i>... 3 True <b>aba·di</b>
10 abadiah <b>aba·di·ah</b> <i>Ar n<... 4 True <b>aba·di·ah</b>

In [15]:
# Again, we expect that at syllable dividers will only occur in words with more than 2 vowels

divider_words = kbbi[kbbi['syllable_divider'] == True]
divider_words['vowel_count'].value_counts()


Out[15]:
2    16244
3     9706
4     4854
5     1855
6      596
7      145
8       37
9        5
1        1
Name: vowel_count, dtype: int64

In [16]:
# Almost perfectly as we expected: Only one word is left containing a divider and only one vowel.  
# Let's give it a look to see what's going on.

divided_one_V = divider_words[divider_words['vowel_count'] == 1]
divided_one_V


Out[16]:
keyword definition vowel_count syllable_divider string_1
_id
11174 henry <b>hen·ry</b> <i>n El</i&... 1 True <b>hen·ry</b>

In [17]:
# The exception that proves the rule! This is a borrowing where 'y' (a vowel not used in Indo. orthography) is
# used. Let's delete this row.

kbbi.drop(11174,axis=0,inplace=True)
kbbi.reset_index(drop=True, inplace=True)

In [18]:
# before looking more closely at how words are syllabified, let's try to clean up some of the messy encoding in  
# in string_1.  First, let's convert the syllable boundary into a symbol which does not cause us problems as we
# convert from one encoding to another (the current symbol gets rendered as '\xc2\xb7').  Let's use the symbol $ for
# syllable boundaries

kbbi['string_1'] = kbbi['string_1'].apply(lambda x: x.replace('\xc2\xb7','!'))

In [19]:
# we can also chop up the encoding to figure out what is a word and what is not
kbbi['chunks'] = kbbi['string_1'].apply(lambda x: x.split(';'))
chunks = []
for line in kbbi['chunks']:
    chunks = chunks + line
chunks = pd.Series(chunks)
chunks = chunks.value_counts().head(30)

In [20]:
print chunks.index
del kbbi['chunks']


Index([u'&lt', u'b&gt', u'/b&gt', u'', u'/sup&gt', u'sup&gt', u'1&lt', u'2&lt',
       u'3&lt', u',', u'4&lt', u'i&gt', u'5&lt', u'6&lt', u',&lt', u'ta!ta',
       u'su!suk&lt', u'ta!pak', u'bu!ras&lt', u'je!nang&lt', u'tun!tung&lt',
       u'te!la&lt', u'ka!la&lt', u'su!ri&lt', u'ke!tam&lt', u'can!da&lt',
       u'la!wang&lt', u'ulas&lt', u'ba!dar&lt', u'se!pah&lt'],
      dtype='object')

In [21]:
# now that we have a better sense of what tags there are, we can create a function to delete them:
def tag_delete(string):
    tags = ['&lt','b&gt','/','b&gt', '/sup&gt', 'sup','&gt', '1','&lt', '2','3',',', '4','&lt', 'i&gt', '5', '6', ',&lt',';', 'A', '7','8','9']
    for tag in tags:
        string = string.replace(tag,'')
    return string

kbbi['string_1'] = kbbi['string_1'].apply(lambda x: tag_delete(x))

In [22]:
kbbi.head(10)


Out[22]:
keyword definition vowel_count syllable_divider string_1
0 a <b><sup>1</sup>A, a</b&gt... 1 False
1 a <b><sup>3</sup>a-</b> ... 1 False a-
2 ab <b><sup>1</sup>ab</b> ... 1 False ab
3 ab <b><sup>2</sup>ab</b> ... 1 False ab
4 ab <b><sup>3</sup>ab-</b>... 1 False ab-
5 aba <b>aba</b> <i>n</i> ay... 2 False aba
6 aba-aba <b>aba-aba</b> <i>n</i&gt... 4 False aba-aba
7 abad <b>abad</b> <i>n</i> &... 2 False abad
8 abadi <b>aba·di</b> <i>a</i>... 3 True aba!di
9 abadiah <b>aba·di·ah</b> <i>Ar n<... 4 True aba!di!ah

In [23]:
# it looks like that cleaned things up pretty well, but let's jus make sure that we don't still have unusual symbols
# somewhere in the string:

lists = kbbi['string_1'].tolist()
characters = set()
for lst in lists:
    for ch in lst:
        characters.add(ch)
print characters


set(['\xc3', '\xa9', '!', ')', '(', '-', ':', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z'])

In [24]:
# many of these characters should not be part of a phonetic transcription
bad_characters = ['\xc3', '\xa9', ')', '(', ':', 'x','q']

# there are also caps which may or may not be parts of actual words, we will return to these shortly
caps = ['E', 'I', 'M', 'L', 'T','B', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z']

# first, however, let's delete the characters which are definitely not part of a syllabified transcription
def bad_ch_delete(string):
    bad_characters = ['\xc3', '\xa9', ')', '(', '\xb7', ':', '\xc2', 'x']
    for ch in bad_characters:
        string = string.replace(ch,'')
    return string

kbbi['string_1'] = kbbi['string_1'].apply(lambda x: bad_ch_delete(x))

In [25]:
# now let's look at words with caps
lists = kbbi['string_1'].tolist()
caps_list = []
for i in lists:
    for cap in caps:
        if cap in i:
            caps_list.append(i)
print caps_list


['B', 'Ba!al', 'Ba!bet', 'Ba!dui', 'Bai!tul!ha!ram', 'Bai!tul!lah', 'Bai!tul!mak!dis', 'Bai!tul!mak!mur', 'Bai!tul!mu!ka!das', 'Bai!tul!mu!ka!dis', 'Ba!ru!na', 'Ba!tak', 'Be', 'Be!lan!da', 'Bi!bel', 'Brah!ma', 'Bud!dha', 'Bud!dhis', 'Bud!dhis!me', 'Bur!ju!sum!bu!lat', 'C', 'Cak!ra!bi!ra!wa', 'Cel!si!us', 'Ci!na', 'Co!kor!da', 'D', 'D', 'Dal', 'Da!nuh', 'De!lu', 'De!sem!ber', 'Deu!te!ro!ka!no!ni!ka', 'Di!gul', 'di!nul-Is!lam', 'di!nul-Is!lam', 'E', 'E', 'Ehe', 'Ehe', 'Eu!ra!sia', 'Eu!ra!sia', 'F', 'Fah!ren!heit', 'Feb!ru!a!ri', 'G', 'Ga!lu!ngan', 'Ge!mi!ni', 'H', 'Hab!syi', 'Ha!mal', 'Ham!ba!li', 'Ha!na!fi', 'Hi!na!ya!na', 'Hin!di', 'Hin!du', 'Hut', 'I', 'I', 'Ida', 'Ida', 'Idul!ad!ha', 'Idul!ad!ha', 'Idul!fit!ri', 'Idul!fit!ri', 'Idul!kur!ban', 'Idul!kur!ban', 'Ila!hi', 'Ila!hi', 'In!do!ne!sia', 'In!do!ne!sia', 'Ing!gris', 'Ing!gris', 'In!jil', 'In!jil', 'In!su!lin!de', 'In!su!lin!de', 'Isa', 'Isa', 'Is!lam', 'Is!lam', 'J', 'Ja!bar', 'Ja!ba!ri!ah', 'Jai!nis!me', 'Ja!nu!a!ri', 'Jau!za', 'Je!pun', 'Jib!ra!il', 'Jib!ril', 'Ji!ma!kir', 'Ji!ma!wal', 'Jo!gi', 'Jo!har', 'Ju!ja', 'Ju!li', 'Ju!ma!dil!a!khir', 'Ju!ma!dil!a!wal', 'Jum!at', 'Ju!ni', 'K', 'Ka!a!bah', 'Ka!bah', 'Ka!bil', 'Ka!da!ri!ah', 'Ka!di!ri!ah', 'Kak!bah', 'Ka!ma!ja!ya', 'Ka!mis', 'Kan!ser', 'Ka!pe!la', 'Kap!ri!kor!nus', 'Kar!ka!ta', 'Ka!to!lik', 'Ke!jo!ra', 'Ke!ling', 'Kli!won', 'Kris!ten', 'Kris!tus', 'Kub!ti', 'Kum!ba', 'Ku!ni!ngan', 'Kur!an', 'L', 'L', 'Lak!smi', 'Lak!smi', 'Las!pa!ra!gi!na!se', 'Las!pa!ra!gi!na!se', 'La!wa!la!ta', 'La!wa!la!ta', 'Le!ba!ran', 'Le!ba!ran', 'Le!gi', 'Le!gi', 'Leo', 'Leo', 'Lib!ra', 'Lib!ra', 'M', 'M', 'M', 'M', 'Ma!ha', 'Ma!ha', 'Ma!ha!ya!na', 'Ma!ha!ya!na', 'Ma!ka!ra', 'Ma!ka!ra', 'Ma!li!ki', 'Ma!li!ki', 'Ma!li!kul!ja!bar', 'Ma!li!kul!ja!bar', 'Ma!li!kul!mu!luk', 'Ma!li!kul!mu!luk', 'Man!dar', 'Man!dar', 'Ma!ret', 'Ma!ret', 'Ma!rikh', 'Ma!rikh', 'Mars', 'Mars', 'Ma!se!hi', 'Ma!se!hi', 'Mas!ji!dil!ak!sa', 'Mas!ji!dil!ak!sa', 'Mas!ji!dil!ha!ram', 'Mas!ji!dil!ha!ram', 'Ma!ya', 'Ma!ya', 'Ma!yang', 'Ma!yang', 'Me!di!te!ra!nia', 'Me!di!te!ra!nia', 'Me!du!sa', 'Me!du!sa', 'Me!ga!li!ti!kum', 'Me!ga!li!ti!kum', 'Mei', 'Mei', 'Me!la!ne!sia', 'Me!la!ne!sia', 'Me!la!yu', 'Me!la!yu', 'Meng!ka!ra', 'Meng!ka!ra', 'Men!se!ren!da!hi', 'Men!se!ren!da!hi', 'Mer!ku!ri!us', 'Mer!ku!ri!us', 'Me!sa', 'Me!sa', 'Mi!na', 'Mi!na', 'Mi!na', 'Mi!na', 'Ming!gu', 'Ming!gu', 'Muhammad', 'Muhammad', 'Mu!ha!ram', 'Mu!ha!ram', 'Mur!ba', 'Mur!ba', 'N', 'N', 'Nas!ra!ni', 'Na!zi', 'Ne!ger', 'Neg!ro', 'Ne!gus', 'Nep!tu!nus', 'No!vem!ber', 'Nu!zu!lul', 'Nye!pi', 'O', 'O', 'O', 'Oe!di!pus-kom!pleks', 'Ok!to!ber', 'Olan!da', 'Olim!pi!a!de', 'Ori!on', 'P', 'Pa!ing', 'Pan!ca!si!la', 'Pan!ca!si!la!is', 'Pan!te!kos!ta', 'Pas!kah', 'Pi!ses', 'Plu!to', 'Pon', 'Pro!tes!tan', 'Pro!tes!tan!tis!me', 'Q', 'Qur!an', 'R', 'Ra!bi!ul!a!khir', 'Ra!bi!ul!a!wal', 'Ra!bu', 'Ra!bul!i!zat', 'Ra!jab', 'Ra!ma!dan', 'Ra!su!lul!lah', 'Re!au!mur', 'Re!bo', 'Re!jab', 'Ro!hul!ku!dus', 'Ro!ma!wi', 'Ru!ah', 'Ru!ma!wi', 'Ru!mi', 'Ru!wah', 'S', 'Sa!ban', 'Sa!bi', 'Sab!tu', 'Sa!far', 'Sa!gi!ta!ri!us', 'Sai!lan', 'Sa!ka', 'Sa!kai', 'Sang!se!ker!ta', 'San!sker!ta', 'San!skrit', 'Sa!par', 'Sar!tan', 'Sa!tur!nus', 'Sa!ur', 'Se!lan', 'Se!la!sa', 'Se!la!tan', 'Se!lon', 'Se!long', 'Se!mang', 'Se!nen', 'Se!nin', 'Sep!tem!ber', 'Se!ran!dib', 'Se!ra!ni', 'Se!ri!kan!di', 'Si!nan!sa!ri', 'Sin!ter!klas', 'Si!nyo!ko!las', 'Si!wa!rat!ri', 'Skor!pio', 'Sri!kan!di', 'Stam!bul', 'Sud!ra', 'Sun!bu!lat', 'Su!ra', 'Su!war!na!dwi!pa', 'Sya!ban', 'Sya!fii', 'Sya!ka', 'Syak!ban', 'Syam', 'Sya!wal', 'Syi!wa', 'Syi!wa!rat!ri', 'T', 'T', 'Ta!ra!wih', 'Ta!ra!wih', 'Tau!rat', 'Tau!rat', 'Tau!ret', 'Tau!ret', 'Tau!rit', 'Tau!rit', 'Tau!rus', 'Tau!rus', 'Ti!ja!ni!ah', 'Ti!ja!ni!ah', 'Tu!ba!gus', 'Tu!ba!gus', 'Tuhan', 'Tuhan', 'Tu!la', 'Tu!la', 'U', 'U', 'Ura!nus', 'Ur!du', 'Uta!rid', 'V', 'Va!len!tine', 'Va!ti!kan', 'Ve!nus', 'Vir!go', 'Vis!nu', 'W', 'Wa!ge', 'Wai!saki', 'Wa!si', 'Wa!wu', 'We!da', 'Wi!di!wa!sa', 'Wis!nu', 'Wri!sa!ba', 'X', 'X', 'Y', 'Ya!hu!di', 'Ya!hu!di!ah', 'Yah!we', 'Yak!juj', 'Ye!ho!va', 'Yu!na!ni', 'Yu!pi!ter', 'Z', 'Za!ba!ni!ah', 'Za!bur', 'Zang!gi', 'Zen', 'Zend-ves!ta', 'Zi!on', 'Zi!o!nis', 'Zo!hal', 'Zoh!rah', 'Zoh!rat', 'Zu!ha!ra', 'Zul!hi!jah', 'Zul!ka!i!dah', 'Zu!lu']

In [26]:
# Lot's of these are proper nouns.  A few of these do not contain information about syllabification.  Let's delete 
# those ones, and convert the others to lowercase.
caps_list = [x for x in caps_list if '!' not in x]

def caps_finder(string):
    for word in caps_list:
        if word in string:
            return True
        else:
            return False
        
kbbi['caps'] = kbbi['string_1'].apply(lambda x: caps_finder(x))
kbbi = kbbi[kbbi['caps'] == False]
del kbbi['caps']

# now we can convert the remaining symbols to lowercase
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: x.lower())

In [27]:
kbbi.head()


Out[27]:
keyword definition vowel_count syllable_divider string_1
0 a <b><sup>1</sup>A, a</b&gt... 1 False
1 a <b><sup>3</sup>a-</b> ... 1 False a-
2 ab <b><sup>1</sup>ab</b> ... 1 False ab
3 ab <b><sup>2</sup>ab</b> ... 1 False ab
4 ab <b><sup>3</sup>ab-</b>... 1 False ab-

In [28]:
# Back to syllables: We expect that in most cases, the number of syllable dividers in string_1 will be equal to the
# number of vowels-1, since e.g. a 2 syllable word like 'batu' would have a single syllable division.
# This won't always be the case, since Indonesian allows the diphthongs 'ai' and 'au' in final syllables
# (and rarely elsewhere).  In these limited cases the number of syllable boundaries should be equal to
# the number of vowels.  Let's test this out by creating a new column counting boundaries.  We can compare the
# value in this column to value in vowel_count.

kbbi['divider_count'] = kbbi['string_1'].apply(lambda x: x.count('!'))

In [29]:
kbbi.head(10)


Out[29]:
keyword definition vowel_count syllable_divider string_1 divider_count
0 a <b><sup>1</sup>A, a</b&gt... 1 False 0
1 a <b><sup>3</sup>a-</b> ... 1 False a- 0
2 ab <b><sup>1</sup>ab</b> ... 1 False ab 0
3 ab <b><sup>2</sup>ab</b> ... 1 False ab 0
4 ab <b><sup>3</sup>ab-</b>... 1 False ab- 0
5 aba <b>aba</b> <i>n</i> ay... 2 False aba 0
6 aba-aba <b>aba-aba</b> <i>n</i&gt... 4 False aba-aba 0
7 abad <b>abad</b> <i>n</i> &... 2 False abad 0
8 abadi <b>aba·di</b> <i>a</i>... 3 True aba!di 1
9 abadiah <b>aba·di·ah</b> <i>Ar n<... 4 True aba!di!ah 2

In [30]:
# let's make a separate column counting the difference between the vowel count and syllable boundary count.
# We expect this difference to be equal to 1 in the vast majority of cases, and equal to 2 in a relatively small number
# of cases where a diphthong is present.

kbbi['diff_numV_numB'] = kbbi['vowel_count'] -  kbbi['divider_count']

# Let's look at the new column values for words with explicit syllable boundaries marked
kbbi['diff_numV_numB'][kbbi['syllable_divider'] == True].value_counts()


Out[30]:
1    29806
2     3281
3      278
4       52
5        5
6        1
0        1
Name: diff_numV_numB, dtype: int64

In [31]:
# The count roughly matches expectations; however, there are several entries where the number of vowels in the 
# keyword far exceeds the number of syllables.  I suspect these are cases where the authors of the dictionary
# failed to transcribe a syllable boundary or two.  Let's take a look, starting with the entries in which the
# number of vowels in the keyword exceeds the number of syllable boundaries by 6.

syl_words = kbbi[kbbi['syllable_divider']==True]
check_words = syl_words[syl_words['diff_numV_numB'] >= 3]
check_words


Out[31]:
keyword definition vowel_count syllable_divider string_1 divider_count diff_numV_numB
15 abaimana <b>abai·ma·na</b> <i>ark n&l... 5 True abai!ma!na 2 3
125 abulia <b>abu·lia</b> <i>n Dok</... 4 True abu!lia 1 3
171 adagio <b>ada·gio</b> <i>n Mus</... 4 True ada!gio 1 3
205 adempauze <b>adem·pau·ze</b> <i>Bld n&... 5 True adem!pau!ze 2 3
240 adinamia <b>adi·na·mia</b> <i>a Dok&l... 5 True adi!na!mia 2 3
258 adiwidia <b>adi·wi·dia</b> <i>n</i... 5 True adi!wi!dia 2 3
288 aduhai <b>adu·hai 1</b> <i>p</i&... 4 True adu!hai 1 3
322 aeronautika <b>ae·ro·nau·ti·ka</b> /aéronautik... 7 True ae!ro!nau!ti!ka 4 3
328 aeroterapia <b>ae·ro·te·ra·pia</b> /aérotérapi... 7 True ae!ro!te!ra!pia 4 3
333 afasia <b>afa·sia</b> <i>n Dok</... 4 True afa!sia 1 3
350 afonia <b>afo·nia</b> <i>n Dok</... 4 True afo!nia 1 3
368 agalaksia <b>aga·lak·sia</b> <i>a Dok&... 5 True aga!lak!sia 2 3
394 agiria <b>agi·ria</b> <i>n Dok</... 4 True agi!ria 1 3
412 agonia <b>ago·nia</b> <i>Dok</i&... 4 True ago!nia 1 3
415 agorafobia <b>ago·ra·fo·bia</b> <i>n Ps... 6 True ago!ra!fo!bia 3 3
462 ahli negara <b>ah·li ne·ga·ra</b> <i>n&l... 5 True ah!li 1 4
541 akasia <b>aka·sia</b> <i>n</i&gt... 4 True aka!sia 1 3
542 akatalepsia <b>aka·ta·lep·sia</b> /akatalépsia... 6 True aka!ta!lep!sia 3 3
682 alabio <b>ala·bio</b> <i>n Tern<... 4 True ala!bio 1 3
688 alai-belai <b>alai-be·lai</b> <i>ark n&... 6 True alai-be!lai 1 5
689 alaihi salam <b>alai·hi sa·lam</b> <i>Ar ... 6 True alai!hi 1 5
690 alaika salam <b>alai·ka sa·lam</b> <i>Ar ... 6 True alai!ka 1 5
691 alaikum salam <b>a·lai·kum sa·lam</b> <i>A... 6 True a!lai!kum 2 4
694 alalia <b>ala·lia</b> <i>n Ling<... 4 True ala!lia 1 3
750 aleksia <b>alek·sia</b> /aléksia/ <i&gt... 4 True alek!sia 1 3
803 alinea <b>ali·nea</b> /alinéa/ <i>n... 4 True ali!nea 1 3
879 alopesia <b>alo·pe·sia</b> /alopésia/ <i... 5 True alo!pe!sia 2 3
892 alter ego <b>al·ter ego</b> /alter égo/ <... 4 True al!ter 1 3
907 alu-aluan <b>alu-a·lu·an</b> <i>n</... 5 True alu-a!lu!an 2 3
966 ambai-ambai <b><sup>1</sup>am·bai-am·bai... 6 True am!bai-am!bai 2 4
... ... ... ... ... ... ... ...
33608 terus terang <b>te·rus te·rang</b> <i>v&l... 4 True te!rus 1 3
33674 tiang pancang <b>ti·ang pan·cang</b> <i>n&... 4 True ti!ang 1 3
33771 tindak lanjut <b>tin·dak lan·jut</b> <i>v&... 4 True tin!dak 1 3
33869 titik berat <b>ti·tik be·rat</b> <i>n&lt... 4 True ti!tik 1 3
34068 tosan aji <b>to·san a·ji</b> <i>n</... 4 True to!san 1 3
34141 transmigrasi lokal <b>trans·mig·ra·si lo·kal</b> <... 6 True trans!mig!ra!si 3 3
34184 trias politika <b>tri·as po·li·ti·ka</b> <i&gt... 6 True tri!as 1 5
34290 tuan rumah <b>tu·an ru·mah</b> <i>n<... 4 True tu!an 1 3
34327 tugas karya <b>tu·gas kar·ya, me·nu·gas·kar·ya·kan&l... 4 True tu!gas 1 3
34346 tujuh bulan <b>tu·juh bu·lan</b>, <b>me·... 4 True tu!juh 1 3
34347 tujuh hari <b>tu·juh ha·ri</b> <i>n<... 4 True tu!juh 1 3
34424 tumpang sari <b>tum·pang sari</b> <i>v&lt... 4 True tum!pang 1 3
34488 tunggang langgang <b>tung·gang lang·gang</b> <i&g... 4 True tung!gang 1 3
34537 tupai-tupai <b>tu·pai-tu·pai</b> <i>n&lt... 6 True tu!pai-tu!pai 2 4
34680 ugal-ugalan <b>ugal-ugal·an</b> <i>a<... 5 True ugal-ugal!an 1 4
34736 ular-ularan <b>ular-ular·an</b> <i>n<... 5 True ular-ular!an 1 4
34925 uniseluler <b>uni·se·luler</b> /unisélulér/ &... 5 True uni!se!luler 2 3
34931 universalia <b>uni·ver·sa·lia</b> <i>n&l... 6 True uni!ver!sa!lia 3 3
34942 unjuk rasa <b>un·juk ra·sa</b> <i>n<... 4 True un!juk 1 3
34996 uraemia <b>ura·e·mia</b> /uraémia/ <i&g... 5 True ura!e!mia 2 3
35020 uremia <b>ure·mia</b> /urémia/ <i>n... 4 True ure!mia 1 3
35108 utopia <b>uto·pia</b> <i>n</i&gt... 4 True uto!pia 1 3
35495 warga negara <b>war·ga ne·ga·ra</b> <i>n&... 5 True war!ga 1 4
35564 wawas diri <b>wa·was di·ri</b> <i>v<... 4 True wa!was 1 3
35612 wesi aji <b>we·si a·ji ? besi aji 4 True we!si 1 3
35720 wulu cumbu <b>wu·lu cum·bu</b> <i>Jw n&... 4 True wu!lu 1 3
35774 yakjuj wa makjuj <b>Yak·juj wa Mak·juj</b> <i&gt... 5 True yak!juj 1 4
35832 yupa prasasti <b>yu·pa pra·sas·ti</b> <i>n... 5 True yu!pa 1 4
35893 zend-avesta <b>Zend-Aves·ta</b> /zéndavésta/ &... 4 True zend-ves!ta 1 3
35932 zirkonium oksida <b>zir·ko·ni·um ok·si·da</b> <i... 7 True zir!ko!ni!um 3 4

336 rows × 7 columns


In [32]:
# it looks like several of these entries are just poorly marked.  For example, the word 'abaimana' is syllabified
# as abai.ma.na, whereas it should be syllabified as a.bai.ma.na.  While there are a few examples where the authors
# of the dictionary neglected to mark a syllable boundary, there are far more examples where the mismatch between
# vowels and syllable boundaries is due to something that can be readily fixed using a string search.  Some examples
# are described below.

In [33]:
# For many of the keywords in the dataset, no syllabification is provided.  We want to remove any polysyllabic words
# for which information about syllabification is absent; however, there also exist many monosyllabic words which
# lack a syllable boundary because they contain a single boundary. The vowel counts we did above will help us 
# distinguish between monosyllabic words and words which are polysyllabic and  just lack information about 
# syllabification.  With this in mind, we might delete all keywords containing more than one vowel and also
# lack a syllable boundary. There is a potential problem with removing all such words: Indonesian contains 
# numerous words ending with the vowel sequences 'ai' and 'au'.  These sequences are tricky to deal with because
# when the only vowels in a word are 'ai' or 'au' speakers may disagree about whether they constitute 1 or 2 syllables,
# and, in fact, this may depend on syntactic factors.  The words 'kau' ('you') and 'mau' ('want') are both forms
# for which this is the case (i.e. mau ~ ma.u; kau ~ ka.u).  To deal with this, let's first get 
# rid of keywords which are without question not monosyllabic: words containing 3 or more vowels which
# are not explicity syllabified.
kbbi = kbbi[(kbbi.syllable_divider==True) | ((kbbi.vowel_count<=2) & (kbbi.syllable_divider==False))]

In [34]:
kbbi[(kbbi.vowel_count==2) & (kbbi.syllable_divider==False)]


Out[34]:
keyword definition vowel_count syllable_divider string_1 divider_count diff_numV_numB
5 aba &lt;b&gt;aba&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ay... 2 False aba 0 2
7 abad &lt;b&gt;abad&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; &... 2 False abad 0 2
11 abah &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abah&lt;/b&gt... 2 False abah 0 2
12 abah &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abah&lt;/b&gt... 2 False abah 0 2
20 aban &lt;b&gt;aban&lt;/b&gt; &lt;i&gt;n Antr&lt;/i&... 2 False aban 0 2
21 abang &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abang&lt;/b&g... 2 False abang 0 2
22 abang &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abang&lt;/b&g... 2 False abang 0 2
26 abar &lt;b&gt;abar&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; a... 2 False abar 0 2
39 aben &lt;b&gt;aben&lt;/b&gt; /abén/, &lt;b&gt;meng·... 2 False aben 0 2
41 abet &lt;b&gt;abet&lt;/b&gt; &lt;i&gt;Jk n&lt;/i&gt... 2 False abet 0 2
43 abid &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abid&lt;/b&gt... 2 False abid 0 2
44 abid &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abid&lt;/b&gt... 2 False abid 0 2
48 abing &lt;b&gt;abing&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ... 2 False abing 0 2
52 abis &lt;b&gt;abis&lt;/b&gt; &lt;i&gt;n Geo&lt;/i&g... 2 False abis 0 2
67 abon &lt;b&gt;abon&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; m... 2 False abon 0 2
111 abu &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abu&lt;/b&gt;... 2 False abu 0 2
112 abu &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abu&lt;/b&gt;... 2 False abu 0 2
113 abu &lt;b&gt;&lt;sup&gt;3&lt;/sup&gt;abu&lt;/b&gt;... 2 False abu 0 2
117 abuh &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abuh&lt;/b&gt... 2 False abuh 0 2
118 abuh &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abuh&lt;/b&gt... 2 False abuh 0 2
119 abuk &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abuk&lt;/b&gt... 2 False abuk 0 2
120 abuk &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abuk&lt;/b&gt... 2 False abuk 0 2
121 abuk &lt;b&gt;&lt;sup&gt;3&lt;/sup&gt;abuk&lt;/b&gt... 2 False abuk 0 2
123 abul &lt;b&gt;abul&lt;/b&gt; &lt;i&gt;v &lt;/i&gt;m... 2 False abul 0 2
127 abur &lt;b&gt;abur&lt;/b&gt; &lt;i&gt;a&lt;/i&gt; b... 2 False abur 0 2
128 abus &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abus&lt;/b&gt... 2 False abus 0 2
129 abus &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abus&lt;/b&gt... 2 False abus 0 2
130 acah &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;acah&lt;/b&gt... 2 False acah 0 2
131 acah &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;acah&lt;/b&gt... 2 False acah 0 2
132 acah &lt;b&gt;&lt;sup&gt;3&lt;/sup&gt;acah, peng·a·... 2 False acah 0 2
... ... ... ... ... ... ... ...
35087 usur &lt;b&gt;usur&lt;/b&gt; &lt;i&gt;Ar num&lt;/i&... 2 False usur 0 2
35088 usus &lt;b&gt;usus&lt;/b&gt;&lt;i&gt; n&lt;/i&gt; a... 2 False ususi 0 2
35089 usut &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;usut&lt;/b&gt... 2 False usut 0 2
35090 usut &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;usut&lt;/b&gt... 2 False usut 0 2
35094 utan &lt;b&gt;utan ? hutan 2 False utan 0 2
35095 utang &lt;b&gt;utang&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ... 2 False utang 0 2
35100 utas &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;utas&lt;/b&gt... 2 False utas 0 2
35101 utas &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;utas&lt;/b&gt... 2 False utas 0 2
35104 utih &lt;b&gt;utih&lt;/b&gt; &lt;i&gt;ark&lt;/i&gt;... 2 False utih 0 2
35105 utik &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;utik&lt;/b&gt... 2 False utik 0 2
35106 utik &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;utik&lt;/b&gt... 2 False utik 0 2
35113 utuh &lt;b&gt;utuh&lt;/b&gt; &lt;i&gt;a&lt;/i&gt; (... 2 False utuh 0 2
35114 utus &lt;b&gt;utus&lt;/b&gt; &lt;i&gt;v,&lt;/i&gt; ... 2 False utus 0 2
35117 uwak &lt;b&gt;uwak&lt;/b&gt; ? &lt;b&gt;1uak 2 False uwak 0 2
35119 uwur &lt;b&gt;uwur&lt;/b&gt; &lt;i&gt;Jw n&lt;/i&gt... 2 False uwur 0 2
35121 uyung &lt;b&gt;uyung ? huyung 2 False uyung 0 2
35123 uzur &lt;b&gt;uzur&lt;/b&gt; &lt;b&gt;1&lt;/b&gt; &... 2 False uzur 0 2
35177 veem &lt;b&gt;veem&lt;/b&gt; &lt;i&gt;Bld n&lt;/i&g... 2 False veem 0 2
35242 via &lt;b&gt;via&lt;/b&gt; &lt;i&gt;p&lt;/i&gt; le... 2 False via 0 2
35398 wahib &lt;b&gt;wahib&lt;/b&gt; &lt;i&gt;Ar n&lt;/i&g... 2 False wahib 0 2
35402 wai &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;wai&lt;/b&gt;... 2 False wai 0 2
35403 wai &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;wai&lt;/b&gt;... 2 False wai 0 2
35552 wati &lt;b&gt;-wati&lt;/b&gt; lihat &lt;b&gt;-wan 2 False -wati 0 2
35555 wau &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;wau&lt;/b&gt;... 2 False wau 0 2
35556 wau &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;wau&lt;/b&gt;... 2 False wau 0 2
35557 wau &lt;b&gt;&lt;sup&gt;3&lt;/sup&gt;wau&lt;/b&gt;... 2 False wau 0 2
35621 wiah &lt;b&gt;-wiah&lt;/b&gt; lihat &lt;b&gt;1-i 2 False -wiah 0 2
35782 yang-yang &lt;b&gt;yang-yang&lt;/b&gt; lihat &lt;b&gt;akar 2 False yang-yang 0 2
35859 zai &lt;b&gt;zai&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; na... 2 False zai 0 2
35911 zig-zag &lt;b&gt;zig-zag&lt;/b&gt; &lt;i&gt;a&lt;/i&gt... 2 False zig-zag 0 2

1158 rows × 7 columns


In [35]:
# keywords containing more than one word: There are several keywords which actually contain two words separated by 
# a space.  In many of these cases, syllabification is only provided for one of the words.  It is almost always the 
# case that the words in such keywords independently exist as keywords elsewhere in the dictionary, by deleting them 
# from the database, we will not loose much information.
# Likewise, keywords containing a hyphen contain words that, in the vast majority of cases, have their own independent
# entries, and therefore we won't be losing important datapoints by deleting them.

def space_hyphen_finder(string):
    if ' ' in string:
        return True
    elif '-' in string:
        return True
    elif len(string) == 0:
        return True
    else:            
        return False
        
kbbi['spaces'] = kbbi['string_1'].apply(lambda x: space_hyphen_finder(x))
kbbi = kbbi[kbbi['spaces'] == False]
del kbbi['spaces']

In [36]:
# now, to help us with the next couple of steps, let's add '#' to mark word boundaries.  Later on, we will also need
# these boundaries to train our model.

kbbi['string_1'] = kbbi['string_1'].apply(lambda x: '#' + x + '#')

In [37]:
kbbi.head(30)


Out[37]:
keyword definition vowel_count syllable_divider string_1 divider_count diff_numV_numB
2 ab &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;ab&lt;/b&gt; ... 1 False #ab# 0 1
3 ab &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;ab&lt;/b&gt; ... 1 False #ab# 0 1
5 aba &lt;b&gt;aba&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ay... 2 False #aba# 0 2
7 abad &lt;b&gt;abad&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; &... 2 False #abad# 0 2
8 abadi &lt;b&gt;aba·di&lt;/b&gt; &lt;i&gt;a&lt;/i&gt;... 3 True #aba!di# 1 2
9 abadiah &lt;b&gt;aba·di·ah&lt;/b&gt; &lt;i&gt;Ar n&lt;... 4 True #aba!di!ah# 2 2
10 abadiat &lt;b&gt;aba·di·at ? abadiah 4 True #aba!di!at# 2 2
11 abah &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abah&lt;/b&gt... 2 False #abah# 0 2
12 abah &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abah&lt;/b&gt... 2 False #abah# 0 2
15 abaimana &lt;b&gt;abai·ma·na&lt;/b&gt; &lt;i&gt;ark n&l... 5 True #abai!ma!na# 2 3
16 abaka &lt;b&gt;aba·ka&lt;/b&gt; &lt;i&gt;n&lt;/i&gt;... 3 True #aba!ka# 1 2
17 abaktinal &lt;b&gt;abak·ti·nal&lt;/b&gt; &lt;i&gt;a Bio&... 4 True #abak!ti!nal# 2 2
18 abakus &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;aba·kus&lt;/b... 3 True #aba!kus# 1 2
19 abakus &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;aba·kus&lt;/b... 3 True #aba!kus# 1 2
20 aban &lt;b&gt;aban&lt;/b&gt; &lt;i&gt;n Antr&lt;/i&... 2 False #aban# 0 2
21 abang &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abang&lt;/b&g... 2 False #abang# 0 2
22 abang &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abang&lt;/b&g... 2 False #abang# 0 2
23 abangan &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abang·an&lt;/... 3 True #abang!an# 1 2
24 abangan &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abang·an&lt;/... 3 True #abang!an# 1 2
25 abangga &lt;b&gt;abang·ga&lt;/b&gt; &lt;i&gt;n Ark &lt... 3 True #abang!ga# 1 2
26 abar &lt;b&gt;abar&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; a... 2 False #abar# 0 2
27 abatoar &lt;b&gt;aba·to·ar&lt;/b&gt; &lt;i&gt;n&lt;/i&... 4 True #aba!to!ar# 2 2
29 abdas &lt;b&gt;ab·das&lt;/b&gt;,&lt;b&gt; ber·ab·das... 2 True #ab!das# 1 1
30 abdi &lt;b&gt;ab·di&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ... 2 True #ab!di# 1 1
31 abdikasi &lt;b&gt;ab·di·ka·si&lt;/b&gt; &lt;i&gt;n&lt;/... 4 True #ab!di!ka!si# 3 1
32 abdomen &lt;b&gt;ab·do·men&lt;/b&gt; &lt;i&gt;n Bio&lt... 3 True #ab!do!men# 2 1
33 abdominal &lt;b&gt;ab·do·mi·nal&lt;/b&gt; &lt;i&gt;a&lt;... 4 True #ab!do!mi!nal# 3 1
34 abdu &lt;b&gt;ab·du&lt;/b&gt; &lt;i&gt;kl n&lt;/i&g... 2 True #ab!du# 1 1
35 abduksi &lt;b&gt;ab·duk·si&lt;/b&gt; &lt;i&gt;n&lt;/i&... 3 True #ab!duk!si# 2 1
36 abduktor &lt;b&gt;ab·duk·tor&lt;/b&gt; &lt;i&gt;n Dok&l... 3 True #ab!duk!tor# 2 1

In [38]:
# There are plenty of words on this list which contain the sequence vowel-consonant-vowel. Without exception, 
# such sequences are always syllabified as V.CV, so we can write a function to insert a syllable boundary.

words = kbbi['string_1'][(kbbi.vowel_count==2) & (kbbi.syllable_divider==False)].tolist()

characters = set()
for word in words:
    for ch in word:
        characters.add(ch)
print characters


set(['#', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'z'])

In [39]:
# let's generate all possible VCV sequences, then we can create a function to replace them with the right
# syllabification
vowels = ['a','e','i','o','u']
consonants = ['c','b','d','g','f','h', 'k', 'j', 'm', 'l', 'n', 'q', 'p', 's', 'r', 't', 'w', 'v', 'y', 'z']
VCVs = []
for v in vowels:
    for c in consonants:
        for v1 in vowels:
            sequence = ('#' + v + c + v1,'#' + v + '!' + c + v1)
            VCVs.append(sequence)

def syllabify_VCV(string):
    for VCV in VCVs:
        string = string.replace(VCV[0],VCV[1])
    return string

kbbi['string_1'] = kbbi['string_1'].apply(lambda x: syllabify_VCV(x))

In [40]:
kbbi.head()


Out[40]:
keyword definition vowel_count syllable_divider string_1 divider_count diff_numV_numB
2 ab &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;ab&lt;/b&gt; ... 1 False #ab# 0 1
3 ab &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;ab&lt;/b&gt; ... 1 False #ab# 0 1
5 aba &lt;b&gt;aba&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ay... 2 False #a!ba# 0 2
7 abad &lt;b&gt;abad&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; &... 2 False #a!bad# 0 2
8 abadi &lt;b&gt;aba·di&lt;/b&gt; &lt;i&gt;a&lt;/i&gt;... 3 True #a!ba!di# 1 2

In [41]:
# words with ..ia.. (e.g. 'widia') this sequence, which occurs in many borrowings, is syllabified as i.a or i.ya for 
# many speakers.  I will treat all instances of this vowel sequence as belonging to separate syllables.

#def insert_dot_ia(string):
#    string = string.replace('ia','i!a')
#    return string
#kbbi['string_1'][] = kbbi['string_1'].apply(lambda x: insert_dot_ia(x))

In [42]:
# another orthographic peculiarity of Indonesian is that, in limited cases, the glides 'w' and 'y' occur as the second
# segment of a sequence following a consonant (e.g. Widya).  In some cases, the pronunication of the glide is actually
# [i.y] not [y].  I'm going to try to correct or remove any such examples using the help of a native speaker.

# first I'll generate a list of possible C+glide sequences
CGl = []
for c in consonants:
    cluster_y = c + 'y'
    cluster_w = c + 'w'
    CGl.append(cluster_y)
    CGl.append(cluster_w)
    
# now, I'll check to see which of these is attested in out wordlist
gl_clusters = set()
for word in kbbi['string_1']:
    for cg in CGl:
        if cg in word:
            gl_clusters.add(cg)
gl_clusters


Out[42]:
{'by', 'dw', 'dy', 'fy', 'hw', 'hy', 'kw', 'ny', 'py', 'sw', 'sy', 'wy'}

In [43]:
# some sequences represent individual sounds that are represented with two letters ('digraph'). These include [sy] 
# -- which is typically pronounced as [s], but may also be pronounced as a palatal, and [sw].  
#I will convert this sequence to a single symbol later on.  Likewise, [ny] is actually a single sound--the palatal 
# nasal stop--and I will also convert it to a single symbol below.  Let's take a closer look at the remaining sounds:

C_glide = ['by', 'dw', 'dy', 'fy', 'hw', 'hy', 'kw', 'py', 'wy']
for word in kbbi['string_1']:
    for Cgl in C_glide:
        if Cgl in word:
            print word


#am!byar#
#byar!pet#
#dang!hyang#
#dwi!ar!ti#
#dwi!ba!ha!sa#
#dwi!ba!ha!sa!wan#
#dwi!dar!ma#
#dwi!da!sa!war!sa#
#dwi!fung!si#
#dwi!gan!da#
#dwi!gu!na#
#dwi!ling!ga#
#dwi!mat!ra#
#dwi!ming!gu#
#dwi!mu!ka#
#dwi!pe!ran#
#dwi!pur!wa#
#dwi!se!gi#
#dwi!ta!rung#
#dwi!tung!gal#
#dwi!war!na#
#fyord#
#gam!byong#
#gom!byok#
#gram!byang#
#kem!pyang#
#kwar!tet#
#kwar!tir#
#kwa!si!or!kor#
#kwe!ni#
#kwe!ti!au#
#kwiz#
#kwo!si!en#
#nahwu#
#om!byok#
#pra!sa!wya#
#rom!pyok#
#su!war!na!dwi!pa#
#wa!no!dya#

In [44]:
# there are several word-final sequences of consonants that are written in Indonesian, but not pronounced. For
# example a consonant is deleted or a vowel is inserted to break up the cluster in actual speech. 
# I want to double check that the authors of the kbbi got the transcription of these correct.

final_CC = []
for C1 in consonants:
    for C2 in consonants:
        CC = C1 + C2 + '#'
        final_CC.append(CC)

CC_clusters = []
for word in kbbi['string_1']:
    for cc in final_CC:
        if cc in word and cc != 'ng#': # 'ng' is a single sound written as a digraph
            CC_clusters.append(cc)
            print word


#ab!sorb#
#ab!surd#
#a!daks#
#ad!mi!tans#
#a!do!le!sens#
#a!fiks#
#a!gens#
#akh#
#a!larm#
#a!lo!leks#
#a!lo!morf#
#am!bu!lans#
#a!morf#
#an!te!fiks#
#an!te!liks#
#an!ti!kli!maks#
#an!ti!tank#
#an!traks#
#a!pen!diks#
#a!po!ka!lips#
#a!rasy#
#a!va!lans#
#bakh#
#ba!lans#
#bank#
#bar!zakh#
#bi!hausy#
#bi!kon!veks#
#bi!o!film#
#bi!ro!faks#
#bi!seps#
#boks#
#bom!seks#
#bo!raks#
#de!fe!rens#
#de!si!nens#
#di!a!morf#
#dif!lu!ens#
#dis!kli!maks#
#dis!kor!dans#
#dra!ma!turg#
#du!pleks#
#ek!lips#
#eks#
#eks#
#eks#
#eks!tern#
#ek!to!derm#
#ek!to!term#
#e!ku!i!noks#
#e!lips#
#e!mi!tans#
#en!do!derm#
#en!do!term#
#en!si!form#
#en!to!derm#
#erg#
#e!sens#
#faks#
#far!sakh#
#fa!sakh#
#film#
#fi!o!laks#
#firn#
#fluks#
#flu!o!re!sens#
#fos!fo!re!sens#
#front#
#fyord#
#gi!ga!hertz#
#gips#
#glans#
#golf#
#hart#
#helm#
#hertz#
#he!te!ro!doks#
#ho!mo!seks#
#ho!mo!term#
#im!pe!dans#
#im!puls#
#in!deks#
#in!duk!tans#
#in!fiks#
#in!flo!re!sens#
#in!ten!dans#
#in!tens#
#in!ter!fe!rens#
#in!tern#
#i!so!hips#
#i!so!leks#
#i!so!morf#
#i!so!term#
#kalk#
#kamp#
#kans#
#ka!pa!si!tans#
#ka!ra!paks#
#karst#
#ka!ta!falk#
#ke!lo!e!lek!tro!volt#
#kelp#
#ke!mi!lu!mi!ne!sens#
#kers#
#kiln#
#ki!lo!hertz#
#ki!lo!volt#
#ki!lo!watt#
#kits#
#kli!maks#
#klo!ro!form#
#ko!balt#
#ko!deks#
#ko!laps#
#kolt#
#kom!pleks#
#kom!pleks#
#kon!duk!tans#
#kon!fiks#
#kon!form#
#kong!kurs#
#kon!kurs#
#kon!si!de!rans#
#kon!teks#
#kon!veks#
#korps#
#kor!teks#
#ko!teks#
#krans#
#ku!ad!ru!pleks#
#ku!ark#
#ku!art#
#ku!in!te!sens#
#kult#
#kurs#
#ku!teks#
#lam!bert#
#lan!dors#
#lar!naks#
#lars#
#la!teks#
#leg!horn#
#lens#
#lift#
#luks#
#malt#
#man!sukh#
#ma!rikh#
#mark#
#mars#
#mars#
#mat!riks#
#me!ga!ohm#
#me!ga!watt#
#mens#
#me!ri!karp#
#me!so!derm#
#me!so!morf#
#me!so!to!raks#
#me!ta!morf#
#mik!ro!film#
#mik!rohm#
#mik!ro!watt#
#mi!li!volt#
#mo!dern#
#morf#
#mu!a!rikh#
#mul!ti!kom!pleks#
#mul!ti!pleks#
#mu!wa!rikh#
#na!palm#
#na!sakh#
#non!sens#
#o!be!lisk#
#ohm#
#o!niks#
#or!do!nans#
#or!to!doks#
#palm#
#pam!pi!ni!form#
#pa!ra!doks#
#pa!ra!laks#
#pas!ca!mo!dern#
#pat!ri!ark#
#pels#
#peny#
#pers#
#pet!ro!maks#
#pi!ri!form#
#plat!form#
#plu!ri!form#
#poi!ki!lo!term#
#pons#
#pra!mo!dern#
#pre!fiks#
#pro!to!raks#
#psalm#
#pseu!do!morf#
#pu!be!sens#
#pulp#
#punk#
#ra!di!ans#
#ra!diks#
#re!ak!tans#
#re!doks#
#re!fleks#
#re!kurs#
#re!laks#
#rem!burs#
#re!nai!sans#
#re!sis!tans#
#res!pons#
#ret!ro!fleks#
#re!vans#
#ri!leks#
#sa!ins#
#se!fa!lo!to!raks#
#sekh#
#se!kors#
#seks#
#sfinks#
#silt#
#sim!pleks#
#si!mul!fiks#
#si!ne!ma!pleks#
#si!ne!pleks#
#sir!kum!fiks#
#sir!kum!fleks#
#skors#
#spons#
#sport#
#sprint#
#start#
#su!fiks#
#!ra!fiks#
#syaikh#
#syekh#
#talk#
#tank#
#ta!rikh#
#tar!kasy#
#ta!wa!rikh#
#teks#
#te!leks#
#te!le!ost#
#term#
#ter!mo!fos!fo!re!sens#
#ter!mo!lu!mi!ne!sens#
#to!raks#
#trans#
#trans!lu!sens#
#tri!pleks#
#trips#
#tuts#
#ul!tra!mi!kro!sko!piks#
#ul!tra!mo!dern#
#u!ni!form#
#u!ni!seks#
#vi!de!o!teks#
#volt#
#vor!teks#
#wals#
#wa!ter!mark#
#watt#
#werst#
#wi!ra!bank#
#yard#
#yog!hurt#
#yolk#
#zi!go!morf#
#zink#
#zir!nikh#

In [45]:
# many of these clusters are only orthographic, but are reduced to a single consonant in pronunciation.  For the time
# being I am going to delete all rows with these CCS, except some of the most common of these, for which I have a good. 
# sense of the pronounciation.  At some future point, I plan to edit these based on actual pronunciation.
CC_clusters = pd.Series(CC_clusters)
CC_clusters.value_counts()


Out[45]:
ks#    78
ns#    46
rm#    19
kh#    16
rs#    13
lt#     9
ps#     9
rf#     9
rn#     8
lm#     7
nk#     6
rt#     6
rk#     4
lk#     4
tt#     4
st#     3
ls#     3
hm#     3
tz#     3
sy#     3
rd#     3
nt#     2
rg#     2
ts#     2
lp#     2
mp#     1
rp#     1
rb#     1
sk#     1
ny#     1
lf#     1
ln#     1
ft#     1
dtype: int64

In [46]:
# deleting final clusters (these can be tweaked later)
final_CCs_to_omit = [u'ks#', u'ns#', u'rm#', u'kh#', u'rs#', u'lt#', u'ps#', u'rf#', u'rn#',
       u'lm#', u'nk#', u'rt#', u'rk#', u'lk#', u'tt#', u'st#', u'ls#', u'hm#',
       u'tz#', u'sy#', u'rd#', u'nt#', u'rg#', u'ts#', u'lp#', u'mp#', u'rp#',
       u'rb#', u'sk#', u'ny#', u'lf#', u'ln#', u'ft#']

def delete_final_CC(string):
    for cc in final_CCs_to_omit:
        if cc in word:
            return True
        else:
            return False

kbbi['final_CC'] = kbbi['string_1'].apply(lambda x: delete_final_CC(x))
kbbi = kbbi[kbbi['final_CC'] == False]

In [47]:
# set of words I would like to take a closer look at are non-native diphthongs like 'eu'.  Let's start by doing a 
# quick search to see what vowel-vowel sequences occur in the data.  We'll check for any unusual clusters

VVs = []
for V1 in vowels:
    for V2 in vowels:
        non_hiatus_VV = V1 + V2
        VVs.append(non_hiatus_VV)

VVs_attested = set()
VV_words = []
for word in kbbi['string_1']:
    for vv in VVs:
        if vv in word and vv != 'ai' and vv != 'au': # 'ai' and 'au' is a common syllable rime, so we don't suspect
                                                     # that they have been mistranscribed
            VVs_attested.add(vv)
            VV_words.append(word)

In [48]:
## attested vvs
print VVs_attested


set(['oo', 'eo', 'ei', 'oi', 'ee', 'iu', 'oe', 'ea', 'oa', 'ii', 'eu', 'ui', 'ao', 'io', 'ia', 'ae', 'ie', 'ue', 'ua', 'ou'])

In [49]:
VV_words[:10]


Out[49]:
['#ab!lep!sia#',
 '#a!bu!lia#',
 '#a!da!gio#',
 '#a!di!na!mia#',
 '#a!di!wi!dia#',
 '#ad!ven!ti!sia#',
 '#ad!ver!bia#',
 '#ae!o!lus#',
 '#ae!ra!si#',
 '#ae!ra!tor#']

In [50]:
### all of the words with these clusters are rarely used terms.  For the time being, I am going to delete these terms.
def delete_borrowing(string):
    for word in VV_words:
        if string == word:
            return True
        else:
            return False

kbbi['strange_VV'] = kbbi['string_1'].apply(lambda x: delete_borrowing(x))
kbbi = kbbi[kbbi['strange_VV']== False]
del kbbi['strange_VV']
del kbbi['final_CC']

In [51]:
kbbi.head(50)


Out[51]:
keyword definition vowel_count syllable_divider string_1 divider_count diff_numV_numB
2 ab &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;ab&lt;/b&gt; ... 1 False #ab# 0 1
3 ab &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;ab&lt;/b&gt; ... 1 False #ab# 0 1
5 aba &lt;b&gt;aba&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ay... 2 False #a!ba# 0 2
7 abad &lt;b&gt;abad&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; &... 2 False #a!bad# 0 2
8 abadi &lt;b&gt;aba·di&lt;/b&gt; &lt;i&gt;a&lt;/i&gt;... 3 True #a!ba!di# 1 2
9 abadiah &lt;b&gt;aba·di·ah&lt;/b&gt; &lt;i&gt;Ar n&lt;... 4 True #a!ba!di!ah# 2 2
10 abadiat &lt;b&gt;aba·di·at ? abadiah 4 True #a!ba!di!at# 2 2
11 abah &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abah&lt;/b&gt... 2 False #a!bah# 0 2
12 abah &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abah&lt;/b&gt... 2 False #a!bah# 0 2
15 abaimana &lt;b&gt;abai·ma·na&lt;/b&gt; &lt;i&gt;ark n&l... 5 True #a!bai!ma!na# 2 3
16 abaka &lt;b&gt;aba·ka&lt;/b&gt; &lt;i&gt;n&lt;/i&gt;... 3 True #a!ba!ka# 1 2
17 abaktinal &lt;b&gt;abak·ti·nal&lt;/b&gt; &lt;i&gt;a Bio&... 4 True #a!bak!ti!nal# 2 2
18 abakus &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;aba·kus&lt;/b... 3 True #a!ba!kus# 1 2
19 abakus &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;aba·kus&lt;/b... 3 True #a!ba!kus# 1 2
20 aban &lt;b&gt;aban&lt;/b&gt; &lt;i&gt;n Antr&lt;/i&... 2 False #a!ban# 0 2
21 abang &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abang&lt;/b&g... 2 False #a!bang# 0 2
22 abang &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abang&lt;/b&g... 2 False #a!bang# 0 2
23 abangan &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abang·an&lt;/... 3 True #a!bang!an# 1 2
24 abangan &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abang·an&lt;/... 3 True #a!bang!an# 1 2
25 abangga &lt;b&gt;abang·ga&lt;/b&gt; &lt;i&gt;n Ark &lt... 3 True #a!bang!ga# 1 2
26 abar &lt;b&gt;abar&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; a... 2 False #a!bar# 0 2
27 abatoar &lt;b&gt;aba·to·ar&lt;/b&gt; &lt;i&gt;n&lt;/i&... 4 True #a!ba!to!ar# 2 2
29 abdas &lt;b&gt;ab·das&lt;/b&gt;,&lt;b&gt; ber·ab·das... 2 True #ab!das# 1 1
30 abdi &lt;b&gt;ab·di&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ... 2 True #ab!di# 1 1
31 abdikasi &lt;b&gt;ab·di·ka·si&lt;/b&gt; &lt;i&gt;n&lt;/... 4 True #ab!di!ka!si# 3 1
32 abdomen &lt;b&gt;ab·do·men&lt;/b&gt; &lt;i&gt;n Bio&lt... 3 True #ab!do!men# 2 1
33 abdominal &lt;b&gt;ab·do·mi·nal&lt;/b&gt; &lt;i&gt;a&lt;... 4 True #ab!do!mi!nal# 3 1
34 abdu &lt;b&gt;ab·du&lt;/b&gt; &lt;i&gt;kl n&lt;/i&g... 2 True #ab!du# 1 1
35 abduksi &lt;b&gt;ab·duk·si&lt;/b&gt; &lt;i&gt;n&lt;/i&... 3 True #ab!duk!si# 2 1
36 abduktor &lt;b&gt;ab·duk·tor&lt;/b&gt; &lt;i&gt;n Dok&l... 3 True #ab!duk!tor# 2 1
37 abdul &lt;b&gt;ab·dul ? abdu 2 True #ab!dul# 1 1
38 abece &lt;b&gt;abe·ce&lt;/b&gt; /abécé/ &lt;i&gt;n&l... 3 True #a!be!ce# 1 2
39 aben &lt;b&gt;aben&lt;/b&gt; /abén/, &lt;b&gt;meng·... 2 False #a!ben# 0 2
40 aberasi &lt;b&gt;abe·ra·si&lt;/b&gt; &lt;i&gt;n&lt;/i&... 4 True #a!be!ra!si# 2 2
41 abet &lt;b&gt;abet&lt;/b&gt; &lt;i&gt;Jk n&lt;/i&gt... 2 False #a!bet# 0 2
42 abian &lt;b&gt;abi·an&lt;/b&gt; &lt;i&gt;n&lt;/i&gt;... 3 True #a!bi!an# 1 2
43 abid &lt;b&gt;&lt;sup&gt;1&lt;/sup&gt;abid&lt;/b&gt... 2 False #a!bid# 0 2
44 abid &lt;b&gt;&lt;sup&gt;2&lt;/sup&gt;abid&lt;/b&gt... 2 False #a!bid# 0 2
45 abidin &lt;b&gt;abi·din&lt;/b&gt; &lt;i&gt;Ar n&lt;/i... 3 True #a!bi!din# 1 2
46 abilah &lt;b&gt;abi·lah&lt;/b&gt; &lt;i&gt;ark n&lt;/... 3 True #a!bi!lah# 1 2
47 abimana &lt;b&gt;abi·ma·na ? abaimana 4 True #a!bi!ma!na# 2 2
48 abing &lt;b&gt;abing&lt;/b&gt; &lt;i&gt;n&lt;/i&gt; ... 2 False #a!bing# 0 2
49 abiogenesis &lt;b&gt;abi·o·ge·ne·sis&lt;/b&gt; /abiogénési... 6 True #a!bi!o!ge!ne!sis# 4 2
50 abiosfer &lt;b&gt;abi·o·sfer&lt;/b&gt; /abiosfér/ &lt;i... 4 True #a!bi!o!sfer# 2 2
51 abiotik &lt;b&gt;abi·o·tik&lt;/b&gt; &lt;b&gt;1&lt;/b&... 4 True #a!bi!o!tik# 2 2
52 abis &lt;b&gt;abis&lt;/b&gt; &lt;i&gt;n Geo&lt;/i&g... 2 False #a!bis# 0 2
53 abisal &lt;b&gt;abi·sal&lt;/b&gt; &lt;i&gt;n&lt;/i&gt... 3 True #a!bi!sal# 1 2
54 abiseka &lt;b&gt;abi·se·ka&lt;/b&gt; /abiséka/ &lt;i&g... 4 True #a!bi!se!ka# 2 2
55 abiturien &lt;b&gt;abi·tu·ri·en &lt;/b&gt;/abiturién/ &l... 5 True #a!bi!tu!ri!en# 3 2
56 abjad &lt;b&gt;ab·jad&lt;/b&gt; &lt;i&gt;n&lt;/i&gt;... 2 True #ab!jad# 1 1

Generating a dataset


In [52]:
# the data seems pretty clean now, so let's try to extract features.  We want to set up a classification model
# which tells us for any given position in a word, whether there is a syllable boundary in that position.  The 
# way that I am going to do this is looking at adjacent sounds within a certain window size.  Here are two functions
# to accomplish this:

# first we need a function to change digraphs (like the palatal nasal 'ny') to a single character
def prepare_string(word):
    word = word.lower()
    word = word.replace('ng','N')
    word = word.replace('ny','Y')
    word = word.replace('sy','s')
    word = word.replace('kh','k')
    word = word.replace('tj','c')
    word = word.replace('dj','j')
    return(word)

# then we need a function to create windows of a designated size, with information about whether the window contains
# a syllable boundary in its center position (e.g. between the second and third segment if the left margin = 2
# and the right margin = 3
def window_gen(strings,left_margin,right_margin):
    windows = []
    for string in strings:
        string = prepare_string(string)
        char_ind = []
        syl_ind = []
        for indx,ch in enumerate(string):
            if ch == '!':
                syl_ind.append(indx)
            else:
                char_ind.append(indx)
        for i,j in enumerate(char_ind):
            left = range((i-left_margin),i)
            right = range(i,(right_margin+i))
            window_range = left + right
            try:
                window_index = [char_ind[z] for z in window_range]
                window = ''.join([str(string[p]) for p in window_index])
                is_boundary = bool([n for n in syl_ind if n > char_ind[(i-1)] and n < char_ind[(i)]])
                windows.append((window,is_boundary)) 
            except:
                pass
    return windows

In [53]:
# using window gen, we can generate a number of new dataframes with various window sizes
strings = kbbi['string_1'].tolist()
win_1_0 = window_gen(strings,1,0)
win_0_1 = window_gen(strings,0,1)
win_1_1 = window_gen(strings,1,1)
win_1_2 = window_gen(strings,1,2)
win_2_1 = window_gen(strings,2,1)
win_2_2 = window_gen(strings,2,2)
win_2_3 = window_gen(strings,2,3)
win_3_2 = window_gen(strings,3,2)
win_3_3 = window_gen(strings,3,3)

In [54]:
# let's create a data dictionary
data = {}
win10_windows = []
win10_boundaries = []
for i in win_1_0:
    win10_windows.append(i[0])
    win10_boundaries.append(i[1])
data['win10_windows'] = win10_windows
data['win10_boundaries'] = win10_boundaries

win01_windows = []
win01_boundaries = []
for i in win_0_1:
    win01_windows.append(i[0])
    win01_boundaries.append(i[1])
data['win01_windows'] = win01_windows
data['win01_boundaries'] = win01_boundaries

win11_windows = []
win11_boundaries = []
for i in win_1_1:
    win11_windows.append(i[0])
    win11_boundaries.append(i[1])
data['win11_windows'] = win11_windows
data['win11_boundaries'] = win11_boundaries

win12_windows = []
win12_boundaries = []
for i in win_1_2:
    win12_windows.append(i[0])
    win12_boundaries.append(i[1])
data['win12_windows'] = win12_windows
data['win12_boundaries'] = win12_boundaries

win21_windows = []
win21_boundaries = []
for i in win_2_1:
    win21_windows.append(i[0])
    win21_boundaries.append(i[1])
data['win21_windows'] = win21_windows
data['win21_boundaries'] = win21_boundaries

win22_windows = []
win22_boundaries = []
for i in win_2_2:
    win22_windows.append(i[0])
    win22_boundaries.append(i[1])
data['win22_windows'] = win22_windows
data['win22_boundaries'] = win22_boundaries

win32_windows = []
win32_boundaries = []
for i in win_3_2:
    win32_windows.append(i[0])
    win32_boundaries.append(i[1])
data['win32_windows'] = win32_windows
data['win32_boundaries'] = win32_boundaries

win23_windows = []
win23_boundaries = []
for i in win_2_3:
    win23_windows.append(i[0])
    win23_boundaries.append(i[1])
data['win23_windows'] = win23_windows
data['win23_boundaries'] = win23_boundaries

win33_windows = []
win33_boundaries = []
for i in win_3_3:
    win33_windows.append(i[0])
    win33_boundaries.append(i[1])
data['win33_windows'] = win33_windows
data['win33_boundaries'] = win33_boundaries

In [55]:
# these individual functions decompose sounds into phonological features, we will include these within the function
# below:
def sonorant(character):
    if character in ['a','e','i','o','u','y','w','m','N','Y','l','r','h','q']:
        return 1
    else:
        return 0
    
def continuant(character):
    if character in ['s','z','f']:
         return 1
    else:
         return 0
    
def consonant(character):
    if character in ['p','t','k','q','h','c','b','d','g','j','s','z','f','m','n','Y','N','l','r']:
        return 1
    else:
        return 0
    
def strident(character):
    if character in ['s','j','c','z']:
        return 1
    else:
        return 0
    
###populate place: labial, coronal, palatal, velar, glottal 
def labial(character):
    if character in ['p','m','f','b','w','u','o']:
        return 1
    else:
        return 0   

def coronal(character):
    if character in ['t','d','n','s','j','c','Y','i','e','r','l','z']:
        return 1
    else:
        return 0 

def palatal(character):
    if character in ['s','j','c','i','Y','e']:
        return 1
    else:
        return 0

def velar(character):
    if character in ['u','k','g','N','o']:
        return 1
    else:
        return 0
    
def glottal(character):
    if character in ['h','q']:
        return 1
    else:
        return 0

###nasality
def nasal(character):
    if character in ['m','n','Y','N']:
        return 1
    else:
        return 0
            
###populate obstruent voicing 
###i assume that [voice] is only phonologically active in obstruents
def obs_voice(character):
    if character in ['b','d','g','j','z']:
        return 1
    else:
        return 0          
            
### populate lateral/rhotic
def lateral(character):
    if character == 'l':
        return 1
    else:
        return 0
def rhotic(character):
    if character == 'r':
        return 1
    else:
        return 0

###populate vowel height
###I assume at mid is not an active feature
def high(character):
    if character in ['i','u']:
        return 1
    else:
        return 0
def low(character):
    if character == 'a':
        return 1
    else:
        return 0
def word_boundary(character):
    if character == '#':
         return 1
    else:
        return 0

In [550]:
# now we need to a function to decompose sounds into phonological features
def matrix_creater(data, left_window, right_window):
    name_w = 'win'+str(left_window)+str(right_window)+'_windows'
    name_b = 'win'+str(left_window)+str(right_window)+'_boundaries'
    string_list = data[name_w]
    prediction_list = data[name_b]
    
    ch_count = left_window + right_window
    df = pd.DataFrame(string_list,columns=['string'])
    for num in range(0,ch_count):
        df['sonorant'+str(num)] = df['string'].apply(lambda x: sonorant(x[num]))
        df['continuant'+str(num)] = df['string'].apply(lambda x: continuant(x[num]))
        df['consonant'+str(num)] = df['string'].apply(lambda x: consonant(x[num]))
        df['strident'+str(num)] = df['string'].apply(lambda x: strident(x[num]))
        df['labial'+str(num)] = df['string'].apply(lambda x: labial(x[num]))
        df['coronal'+str(num)] = df['string'].apply(lambda x: coronal(x[num]))
        df['palatal'+str(num)] = df['string'].apply(lambda x: palatal(x[num]))
        df['velar'+str(num)] = df['string'].apply(lambda x: velar(x[num]))
        df['glottal'+str(num)] = df['string'].apply(lambda x: glottal(x[num]))
        df['nasal'+str(num)] = df['string'].apply(lambda x: nasal(x[num]))
        df['obs_voice'+str(num)] = df['string'].apply(lambda x: obs_voice(x[num]))
        df['lateral'+str(num)] = df['string'].apply(lambda x: lateral(x[num]))
        df['rhotic'+str(num)] = df['string'].apply(lambda x: rhotic(x[num]))
        df['high'+str(num)] = df['string'].apply(lambda x: high(x[num]))
        df['low'+str(num)] = df['string'].apply(lambda x: low(x[num]))
        df['word_boundary'+str(num)] = df['string'].apply(lambda x: word_boundary(x[num]))
    df1 = pd.DataFrame(data=prediction_list, columns=['pred'])
    df = pd.concat([df,df1],axis=1)
    df.drop_duplicates(subset='string', inplace=True)
    return df

In [604]:
#df01 = matrix_creater(data,0,1)
#df10 = matrix_creater(data,1,0)
df11 = matrix_creater(data,1,1)
df12 = matrix_creater(data,1,2)
#df21 = matrix_creater(data,2,1)
df22 = matrix_creater(data,2,2)
#df23 = matrix_creater(data,2,3)
#df32 = matrix_creater(data,3,2)
df33 = matrix_creater(data,3,3)

Running the model and looking at results


In [560]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# cross validation
from sklearn.cross_validation import train_test_split, KFold

# models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# evaluation
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import classification_report

In [607]:
y = df12['pred']
X = df12.drop('pred',axis=1)

In [574]:
words = X['string']
del X['string']

In [575]:
# let's see what our baseline is
float(y.value_counts()[0])/(float(y.value_counts()[0]) + float(y.value_counts()[1]))


Out[575]:
0.6505772360072022

In [564]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=30)

In [536]:
def evaluate_model(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cnf_mtx = confusion_matrix(y_test, y_pred)
    acc_scr = accuracy_score(y_test, y_pred)
    cls_rep = classification_report(y_test, y_pred)
    print cnf_mtx
    print cls_rep
    print "Accuracy Score: ", acc_scr
    print "*****************************"
    return acc_scr

In [537]:
# global dictionary of models
models = {}

In [ ]:
# I run a number of models below.  I have not added many comments.  In general, none of the models significantly
# outperform the Decision Tree Classifier.  This is illustrated by the examples below.
# Given that the goal of this project is to discover the simplest 
# model that account for syllabification, I will use go ahead and use the decision tree classifier.

In [538]:
max_depths = [n for n in range(1,30)]
criteria = ['gini', 'entropy']
for max_depth in max_depths:
    for criterion in criteria:
        print "Max Depth: ", max_depth
        print "Criterion: ", criterion
        evaluate_model(DecisionTreeClassifier(criterion=criterion, max_depth=max_depth))


Max Depth:  1
Criterion:  gini
[[3802 1639]
 [ 529 2470]]
             precision    recall  f1-score   support

      False       0.88      0.70      0.78      5441
       True       0.60      0.82      0.69      2999

avg / total       0.78      0.74      0.75      8440

Accuracy Score:  0.743127962085
*****************************
Max Depth:  1
Criterion:  entropy
[[3228 2213]
 [ 287 2712]]
             precision    recall  f1-score   support

      False       0.92      0.59      0.72      5441
       True       0.55      0.90      0.68      2999

avg / total       0.79      0.70      0.71      8440

Accuracy Score:  0.703791469194
*****************************
Max Depth:  2
Criterion:  gini
[[4589  852]
 [ 536 2463]]
             precision    recall  f1-score   support

      False       0.90      0.84      0.87      5441
       True       0.74      0.82      0.78      2999

avg / total       0.84      0.84      0.84      8440

Accuracy Score:  0.835545023697
*****************************
Max Depth:  2
Criterion:  entropy
[[4891  550]
 [ 622 2377]]
             precision    recall  f1-score   support

      False       0.89      0.90      0.89      5441
       True       0.81      0.79      0.80      2999

avg / total       0.86      0.86      0.86      8440

Accuracy Score:  0.861137440758
*****************************
Max Depth:  3
Criterion:  gini
[[5111  330]
 [ 676 2323]]
             precision    recall  f1-score   support

      False       0.88      0.94      0.91      5441
       True       0.88      0.77      0.82      2999

avg / total       0.88      0.88      0.88      8440

Accuracy Score:  0.880805687204
*****************************
Max Depth:  3
Criterion:  entropy
[[4905  536]
 [ 446 2553]]
             precision    recall  f1-score   support

      False       0.92      0.90      0.91      5441
       True       0.83      0.85      0.84      2999

avg / total       0.88      0.88      0.88      8440

Accuracy Score:  0.8836492891
*****************************
Max Depth:  4
Criterion:  gini
[[5028  413]
 [ 345 2654]]
             precision    recall  f1-score   support

      False       0.94      0.92      0.93      5441
       True       0.87      0.88      0.88      2999

avg / total       0.91      0.91      0.91      8440

Accuracy Score:  0.91018957346
*****************************
Max Depth:  4
Criterion:  entropy
[[4809  632]
 [ 153 2846]]
             precision    recall  f1-score   support

      False       0.97      0.88      0.92      5441
       True       0.82      0.95      0.88      2999

avg / total       0.92      0.91      0.91      8440

Accuracy Score:  0.906990521327
*****************************
Max Depth:  5
Criterion:  gini
[[5196  245]
 [ 348 2651]]
             precision    recall  f1-score   support

      False       0.94      0.95      0.95      5441
       True       0.92      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Accuracy Score:  0.929739336493
*****************************
Max Depth:  5
Criterion:  entropy
[[5032  409]
 [ 211 2788]]
             precision    recall  f1-score   support

      False       0.96      0.92      0.94      5441
       True       0.87      0.93      0.90      2999

avg / total       0.93      0.93      0.93      8440

Accuracy Score:  0.92654028436
*****************************
Max Depth:  6
Criterion:  gini
[[5150  291]
 [ 172 2827]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.92      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.945142180095
*****************************
Max Depth:  6
Criterion:  entropy
[[5163  278]
 [ 199 2800]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.943483412322
*****************************
Max Depth:  7
Criterion:  gini
[[5149  292]
 [ 168 2831]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.92      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.945497630332
*****************************
Max Depth:  7
Criterion:  entropy
[[5165  276]
 [ 198 2801]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.943838862559
*****************************
Max Depth:  8
Criterion:  gini
[[5175  266]
 [ 179 2820]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.947274881517
*****************************
Max Depth:  8
Criterion:  entropy
[[5197  244]
 [ 209 2790]]
             precision    recall  f1-score   support

      False       0.96      0.96      0.96      5441
       True       0.92      0.93      0.92      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.946327014218
*****************************
Max Depth:  9
Criterion:  gini
[[5203  238]
 [ 182 2817]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.950236966825
*****************************
Max Depth:  9
Criterion:  entropy
[[5188  253]
 [ 179 2820]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.948815165877
*****************************
Max Depth:  10
Criterion:  gini
[[5204  237]
 [ 188 2811]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.949644549763
*****************************
Max Depth:  10
Criterion:  entropy
[[5200  241]
 [ 140 2859]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.94      2999

avg / total       0.96      0.95      0.96      8440

Accuracy Score:  0.954857819905
*****************************
Max Depth:  11
Criterion:  gini
[[5195  246]
 [ 183 2816]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.949170616114
*****************************
Max Depth:  11
Criterion:  entropy
[[5172  269]
 [ 121 2878]]
             precision    recall  f1-score   support

      False       0.98      0.95      0.96      5441
       True       0.91      0.96      0.94      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.953791469194
*****************************
Max Depth:  12
Criterion:  gini
[[5190  251]
 [ 197 2802]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.92      0.93      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.94691943128
*****************************
Max Depth:  12
Criterion:  entropy
[[5186  255]
 [ 159 2840]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.950947867299
*****************************
Max Depth:  13
Criterion:  gini
[[5173  268]
 [ 214 2785]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.942890995261
*****************************
Max Depth:  13
Criterion:  entropy
[[5180  261]
 [ 183 2816]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Accuracy Score:  0.947393364929
*****************************
Max Depth:  14
Criterion:  gini
[[5183  258]
 [ 247 2752]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.92      0.92      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.940165876777
*****************************
Max Depth:  14
Criterion:  entropy
[[5188  253]
 [ 225 2774]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.92      0.92      0.92      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.94336492891
*****************************
Max Depth:  15
Criterion:  gini
[[5167  274]
 [ 271 2728]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.935426540284
*****************************
Max Depth:  15
Criterion:  entropy
[[5197  244]
 [ 262 2737]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.95      5441
       True       0.92      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.940047393365
*****************************
Max Depth:  16
Criterion:  gini
[[5169  272]
 [ 322 2677]]
             precision    recall  f1-score   support

      False       0.94      0.95      0.95      5441
       True       0.91      0.89      0.90      2999

avg / total       0.93      0.93      0.93      8440

Accuracy Score:  0.929620853081
*****************************
Max Depth:  16
Criterion:  entropy
[[5187  254]
 [ 285 2714]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.90      0.91      2999

avg / total       0.94      0.94      0.94      8440

Accuracy Score:  0.936137440758
*****************************
Max Depth:  17
Criterion:  gini
[[5174  267]
 [ 357 2642]]
             precision    recall  f1-score   support

      False       0.94      0.95      0.94      5441
       True       0.91      0.88      0.89      2999

avg / total       0.93      0.93      0.93      8440

Accuracy Score:  0.926066350711
*****************************
Max Depth:  17
Criterion:  entropy
[[5176  265]
 [ 325 2674]]
             precision    recall  f1-score   support

      False       0.94      0.95      0.95      5441
       True       0.91      0.89      0.90      2999

avg / total       0.93      0.93      0.93      8440

Accuracy Score:  0.93009478673
*****************************
Max Depth:  18
Criterion:  gini
[[5178  263]
 [ 378 2621]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.91      0.87      0.89      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.924052132701
*****************************
Max Depth:  18
Criterion:  entropy
[[5173  268]
 [ 346 2653]]
             precision    recall  f1-score   support

      False       0.94      0.95      0.94      5441
       True       0.91      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Accuracy Score:  0.927251184834
*****************************
Max Depth:  19
Criterion:  gini
[[5168  273]
 [ 379 2620]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.91      0.87      0.89      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.922748815166
*****************************
Max Depth:  19
Criterion:  entropy
[[5161  280]
 [ 366 2633]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.88      0.89      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.92345971564
*****************************
Max Depth:  20
Criterion:  gini
[[5166  275]
 [ 403 2596]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.87      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.919668246445
*****************************
Max Depth:  20
Criterion:  entropy
[[5160  281]
 [ 380 2619]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.87      0.89      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.921682464455
*****************************
Max Depth:  21
Criterion:  gini
[[5167  274]
 [ 420 2579]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.917772511848
*****************************
Max Depth:  21
Criterion:  entropy
[[5153  288]
 [ 397 2602]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.87      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.918838862559
*****************************
Max Depth:  22
Criterion:  gini
[[5163  278]
 [ 427 2572]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.916469194313
*****************************
Max Depth:  22
Criterion:  entropy
[[5150  291]
 [ 406 2593]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.917417061611
*****************************
Max Depth:  23
Criterion:  gini
[[5161  280]
 [ 431 2568]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.915758293839
*****************************
Max Depth:  23
Criterion:  entropy
[[5160  281]
 [ 403 2596]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.87      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.918957345972
*****************************
Max Depth:  24
Criterion:  gini
[[5170  271]
 [ 436 2563]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.85      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.916232227488
*****************************
Max Depth:  24
Criterion:  entropy
[[5164  277]
 [ 415 2584]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.918009478673
*****************************
Max Depth:  25
Criterion:  gini
[[5167  274]
 [ 439 2560]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.85      0.88      2999

avg / total       0.92      0.92      0.91      8440

Accuracy Score:  0.915521327014
*****************************
Max Depth:  25
Criterion:  entropy
[[5155  286]
 [ 426 2573]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.915639810427
*****************************
Max Depth:  26
Criterion:  gini
[[5167  274]
 [ 437 2562]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.85      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.915758293839
*****************************
Max Depth:  26
Criterion:  entropy
[[5144  297]
 [ 404 2595]]
             precision    recall  f1-score   support

      False       0.93      0.95      0.94      5441
       True       0.90      0.87      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.916943127962
*****************************
Max Depth:  27
Criterion:  gini
[[5170  271]
 [ 445 2554]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.85      0.88      2999

avg / total       0.91      0.92      0.91      8440

Accuracy Score:  0.915165876777
*****************************
Max Depth:  27
Criterion:  entropy
[[5153  288]
 [ 427 2572]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.91      0.92      0.91      8440

Accuracy Score:  0.91528436019
*****************************
Max Depth:  28
Criterion:  gini
[[5166  275]
 [ 452 2547]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.93      5441
       True       0.90      0.85      0.88      2999

avg / total       0.91      0.91      0.91      8440

Accuracy Score:  0.913862559242
*****************************
Max Depth:  28
Criterion:  entropy
[[5151  290]
 [ 425 2574]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.91      0.92      0.91      8440

Accuracy Score:  0.91528436019
*****************************
Max Depth:  29
Criterion:  gini
[[5172  269]
 [ 432 2567]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.91      0.86      0.88      2999

avg / total       0.92      0.92      0.92      8440

Accuracy Score:  0.916943127962
*****************************
Max Depth:  29
Criterion:  entropy
[[5155  286]
 [ 429 2570]]
             precision    recall  f1-score   support

      False       0.92      0.95      0.94      5441
       True       0.90      0.86      0.88      2999

avg / total       0.91      0.92      0.91      8440

Accuracy Score:  0.91528436019
*****************************

Logistic Regression


In [539]:
C = [.01,.1,1,10,100]
for c in C:
    print "C: ", c
    evaluate_model(LogisticRegression(C=c))


C:  0.01
[[5115  326]
 [ 624 2375]]
             precision    recall  f1-score   support

      False       0.89      0.94      0.92      5441
       True       0.88      0.79      0.83      2999

avg / total       0.89      0.89      0.89      8440

Accuracy Score:  0.887440758294
*****************************
C:  0.1
[[5074  367]
 [ 581 2418]]
             precision    recall  f1-score   support

      False       0.90      0.93      0.91      5441
       True       0.87      0.81      0.84      2999

avg / total       0.89      0.89      0.89      8440

Accuracy Score:  0.887677725118
*****************************
C:  1
[[5045  396]
 [ 552 2447]]
             precision    recall  f1-score   support

      False       0.90      0.93      0.91      5441
       True       0.86      0.82      0.84      2999

avg / total       0.89      0.89      0.89      8440

Accuracy Score:  0.887677725118
*****************************
C:  10
[[5031  410]
 [ 530 2469]]
             precision    recall  f1-score   support

      False       0.90      0.92      0.91      5441
       True       0.86      0.82      0.84      2999

avg / total       0.89      0.89      0.89      8440

Accuracy Score:  0.888625592417
*****************************
C:  100
[[5031  410]
 [ 527 2472]]
             precision    recall  f1-score   support

      False       0.91      0.92      0.91      5441
       True       0.86      0.82      0.84      2999

avg / total       0.89      0.89      0.89      8440

Accuracy Score:  0.888981042654
*****************************

KNN model


In [506]:
n_values = [n for n in range(1,20,2)]
for n_value in n_values:
    print "Number of Neighbors: ", n_value
    evaluate_model(KNeighborsClassifier(n_neighbors=n_value))


Number of Neighbors:  1
[[5054  387]
 [ 408 2591]]
             precision    recall  f1-score   support

      False       0.93      0.93      0.93      5441
       True       0.87      0.86      0.87      2999

avg / total       0.91      0.91      0.91      8440

Number of Neighbors:  3
[[5153  288]
 [ 226 2773]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.95      5441
       True       0.91      0.92      0.92      2999

avg / total       0.94      0.94      0.94      8440

Number of Neighbors:  5
[[5172  269]
 [ 208 2791]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.94      0.94      0.94      8440

Number of Neighbors:  7
[[5164  277]
 [ 194 2805]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.94      0.92      2999

avg / total       0.94      0.94      0.94      8440

Number of Neighbors:  9
[[5182  259]
 [ 190 2809]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Number of Neighbors:  11
[[5174  267]
 [ 186 2813]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Number of Neighbors:  13
[[5174  267]
 [ 168 2831]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Number of Neighbors:  15
[[5173  268]
 [ 170 2829]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Number of Neighbors:  17
[[5174  267]
 [ 175 2824]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Number of Neighbors:  19
[[5175  266]
 [ 177 2822]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.91      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Random Forest Classifier


In [516]:
max_depth = [n for n in range(1,20,2)]
n_estimators = [n for n in range(10,300,25)]
for max_n in max_depth:
    for n_estimator in n_estimators:
        print "Max Tree Depth: ", max_n
        print "Number of Trees: ", n_estimator
        evaluate_model(RandomForestClassifier(max_depth=max_n,n_estimators=n_estimator))


Max Tree Depth:  1
Number of Trees:  10
[[5393   48]
 [2193  806]]
             precision    recall  f1-score   support

      False       0.71      0.99      0.83      5441
       True       0.94      0.27      0.42      2999

avg / total       0.79      0.73      0.68      8440

Max Tree Depth:  1
Number of Trees:  35
[[5386   55]
 [2121  878]]
             precision    recall  f1-score   support

      False       0.72      0.99      0.83      5441
       True       0.94      0.29      0.45      2999

avg / total       0.80      0.74      0.70      8440

Max Tree Depth:  1
Number of Trees:  60
[[5438    3]
 [2853  146]]
             precision    recall  f1-score   support

      False       0.66      1.00      0.79      5441
       True       0.98      0.05      0.09      2999

avg / total       0.77      0.66      0.54      8440

Max Tree Depth:  1
Number of Trees:  85
[[5417   24]
 [2577  422]]
             precision    recall  f1-score   support

      False       0.68      1.00      0.81      5441
       True       0.95      0.14      0.24      2999

avg / total       0.77      0.69      0.61      8440

Max Tree Depth:  1
Number of Trees:  110
[[5423   18]
 [2611  388]]
             precision    recall  f1-score   support

      False       0.68      1.00      0.80      5441
       True       0.96      0.13      0.23      2999

avg / total       0.77      0.69      0.60      8440

Max Tree Depth:  1
Number of Trees:  135
[[5433    8]
 [2800  199]]
             precision    recall  f1-score   support

      False       0.66      1.00      0.79      5441
       True       0.96      0.07      0.12      2999

avg / total       0.77      0.67      0.56      8440

Max Tree Depth:  1
Number of Trees:  160
[[5430   11]
 [2597  402]]
             precision    recall  f1-score   support

      False       0.68      1.00      0.81      5441
       True       0.97      0.13      0.24      2999

avg / total       0.78      0.69      0.60      8440

Max Tree Depth:  1
Number of Trees:  185
[[5419   22]
 [2506  493]]
             precision    recall  f1-score   support

      False       0.68      1.00      0.81      5441
       True       0.96      0.16      0.28      2999

avg / total       0.78      0.70      0.62      8440

Max Tree Depth:  1
Number of Trees:  210
[[5419   22]
 [2567  432]]
             precision    recall  f1-score   support

      False       0.68      1.00      0.81      5441
       True       0.95      0.14      0.25      2999

avg / total       0.78      0.69      0.61      8440

Max Tree Depth:  1
Number of Trees:  235
[[5420   21]
 [2490  509]]
             precision    recall  f1-score   support

      False       0.69      1.00      0.81      5441
       True       0.96      0.17      0.29      2999

avg / total       0.78      0.70      0.63      8440

Max Tree Depth:  1
Number of Trees:  260
[[5428   13]
 [2544  455]]
             precision    recall  f1-score   support

      False       0.68      1.00      0.81      5441
       True       0.97      0.15      0.26      2999

avg / total       0.78      0.70      0.62      8440

Max Tree Depth:  1
Number of Trees:  285
[[5425   16]
 [2543  456]]
             precision    recall  f1-score   support

      False       0.68      1.00      0.81      5441
       True       0.97      0.15      0.26      2999

avg / total       0.78      0.70      0.62      8440

Max Tree Depth:  3
Number of Trees:  10
[[5278  163]
 [ 875 2124]]
             precision    recall  f1-score   support

      False       0.86      0.97      0.91      5441
       True       0.93      0.71      0.80      2999

avg / total       0.88      0.88      0.87      8440

Max Tree Depth:  3
Number of Trees:  35
[[5239  202]
 [ 697 2302]]
             precision    recall  f1-score   support

      False       0.88      0.96      0.92      5441
       True       0.92      0.77      0.84      2999

avg / total       0.90      0.89      0.89      8440

Max Tree Depth:  3
Number of Trees:  60
[[5159  282]
 [ 647 2352]]
             precision    recall  f1-score   support

      False       0.89      0.95      0.92      5441
       True       0.89      0.78      0.84      2999

avg / total       0.89      0.89      0.89      8440

Max Tree Depth:  3
Number of Trees:  85
[[5243  198]
 [ 771 2228]]
             precision    recall  f1-score   support

      False       0.87      0.96      0.92      5441
       True       0.92      0.74      0.82      2999

avg / total       0.89      0.89      0.88      8440

Max Tree Depth:  3
Number of Trees:  110
[[5240  201]
 [ 762 2237]]
             precision    recall  f1-score   support

      False       0.87      0.96      0.92      5441
       True       0.92      0.75      0.82      2999

avg / total       0.89      0.89      0.88      8440

Max Tree Depth:  3
Number of Trees:  135
[[5248  193]
 [ 720 2279]]
             precision    recall  f1-score   support

      False       0.88      0.96      0.92      5441
       True       0.92      0.76      0.83      2999

avg / total       0.89      0.89      0.89      8440

Max Tree Depth:  3
Number of Trees:  160
[[5223  218]
 [ 708 2291]]
             precision    recall  f1-score   support

      False       0.88      0.96      0.92      5441
       True       0.91      0.76      0.83      2999

avg / total       0.89      0.89      0.89      8440

Max Tree Depth:  3
Number of Trees:  185
[[5243  198]
 [ 730 2269]]
             precision    recall  f1-score   support

      False       0.88      0.96      0.92      5441
       True       0.92      0.76      0.83      2999

avg / total       0.89      0.89      0.89      8440

Max Tree Depth:  3
Number of Trees:  210
[[5229  212]
 [ 734 2265]]
             precision    recall  f1-score   support

      False       0.88      0.96      0.92      5441
       True       0.91      0.76      0.83      2999

avg / total       0.89      0.89      0.89      8440

Max Tree Depth:  3
Number of Trees:  235
[[5235  206]
 [ 731 2268]]
             precision    recall  f1-score   support

      False       0.88      0.96      0.92      5441
       True       0.92      0.76      0.83      2999

avg / total       0.89      0.89      0.89      8440

Max Tree Depth:  3
Number of Trees:  260
[[5218  223]
 [ 755 2244]]
             precision    recall  f1-score   support

      False       0.87      0.96      0.91      5441
       True       0.91      0.75      0.82      2999

avg / total       0.89      0.88      0.88      8440

Max Tree Depth:  3
Number of Trees:  285
[[5248  193]
 [ 752 2247]]
             precision    recall  f1-score   support

      False       0.87      0.96      0.92      5441
       True       0.92      0.75      0.83      2999

avg / total       0.89      0.89      0.89      8440

Max Tree Depth:  5
Number of Trees:  10
[[5196  245]
 [ 602 2397]]
             precision    recall  f1-score   support

      False       0.90      0.95      0.92      5441
       True       0.91      0.80      0.85      2999

avg / total       0.90      0.90      0.90      8440

Max Tree Depth:  5
Number of Trees:  35
[[5262  179]
 [ 543 2456]]
             precision    recall  f1-score   support

      False       0.91      0.97      0.94      5441
       True       0.93      0.82      0.87      2999

avg / total       0.92      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  60
[[5193  248]
 [ 539 2460]]
             precision    recall  f1-score   support

      False       0.91      0.95      0.93      5441
       True       0.91      0.82      0.86      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  85
[[5192  249]
 [ 544 2455]]
             precision    recall  f1-score   support

      False       0.91      0.95      0.93      5441
       True       0.91      0.82      0.86      2999

avg / total       0.91      0.91      0.90      8440

Max Tree Depth:  5
Number of Trees:  110
[[5203  238]
 [ 521 2478]]
             precision    recall  f1-score   support

      False       0.91      0.96      0.93      5441
       True       0.91      0.83      0.87      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  135
[[5161  280]
 [ 548 2451]]
             precision    recall  f1-score   support

      False       0.90      0.95      0.93      5441
       True       0.90      0.82      0.86      2999

avg / total       0.90      0.90      0.90      8440

Max Tree Depth:  5
Number of Trees:  160
[[5223  218]
 [ 510 2489]]
             precision    recall  f1-score   support

      False       0.91      0.96      0.93      5441
       True       0.92      0.83      0.87      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  185
[[5216  225]
 [ 550 2449]]
             precision    recall  f1-score   support

      False       0.90      0.96      0.93      5441
       True       0.92      0.82      0.86      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  210
[[5207  234]
 [ 513 2486]]
             precision    recall  f1-score   support

      False       0.91      0.96      0.93      5441
       True       0.91      0.83      0.87      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  235
[[5194  247]
 [ 516 2483]]
             precision    recall  f1-score   support

      False       0.91      0.95      0.93      5441
       True       0.91      0.83      0.87      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  260
[[5227  214]
 [ 514 2485]]
             precision    recall  f1-score   support

      False       0.91      0.96      0.93      5441
       True       0.92      0.83      0.87      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  5
Number of Trees:  285
[[5227  214]
 [ 510 2489]]
             precision    recall  f1-score   support

      False       0.91      0.96      0.94      5441
       True       0.92      0.83      0.87      2999

avg / total       0.91      0.91      0.91      8440

Max Tree Depth:  7
Number of Trees:  10
[[5219  222]
 [ 331 2668]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.92      0.89      0.91      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  35
[[5234  207]
 [ 378 2621]]
             precision    recall  f1-score   support

      False       0.93      0.96      0.95      5441
       True       0.93      0.87      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  60
[[5230  211]
 [ 347 2652]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.93      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  85
[[5229  212]
 [ 367 2632]]
             precision    recall  f1-score   support

      False       0.93      0.96      0.95      5441
       True       0.93      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  110
[[5225  216]
 [ 347 2652]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.92      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  135
[[5214  227]
 [ 348 2651]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.92      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  160
[[5223  218]
 [ 353 2646]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.92      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  185
[[5229  212]
 [ 351 2648]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.93      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  210
[[5223  218]
 [ 346 2653]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.92      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  235
[[5225  216]
 [ 346 2653]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.92      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  260
[[5231  210]
 [ 343 2656]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.93      0.89      0.91      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  7
Number of Trees:  285
[[5226  215]
 [ 346 2653]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.93      0.88      0.90      2999

avg / total       0.93      0.93      0.93      8440

Max Tree Depth:  9
Number of Trees:  10
[[5176  265]
 [ 261 2738]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  35
[[5219  222]
 [ 290 2709]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.95      5441
       True       0.92      0.90      0.91      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  60
[[5208  233]
 [ 254 2745]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.96      5441
       True       0.92      0.92      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  85
[[5232  209]
 [ 306 2693]]
             precision    recall  f1-score   support

      False       0.94      0.96      0.95      5441
       True       0.93      0.90      0.91      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  110
[[5213  228]
 [ 254 2745]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.96      5441
       True       0.92      0.92      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  135
[[5227  214]
 [ 277 2722]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.96      5441
       True       0.93      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  160
[[5218  223]
 [ 271 2728]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.95      5441
       True       0.92      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  185
[[5224  217]
 [ 268 2731]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.96      5441
       True       0.93      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  210
[[5222  219]
 [ 272 2727]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.96      5441
       True       0.93      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  235
[[5229  212]
 [ 281 2718]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.95      5441
       True       0.93      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  260
[[5230  211]
 [ 280 2719]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.96      5441
       True       0.93      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  9
Number of Trees:  285
[[5221  220]
 [ 279 2720]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.95      5441
       True       0.93      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  11
Number of Trees:  10
[[5187  254]
 [ 202 2797]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.92      0.93      0.92      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  35
[[5207  234]
 [ 188 2811]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  60
[[5197  244]
 [ 190 2809]]
             precision    recall  f1-score   support

      False       0.96      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  85
[[5195  246]
 [ 186 2813]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  110
[[5192  249]
 [ 178 2821]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  135
[[5204  237]
 [ 184 2815]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  160
[[5194  247]
 [ 174 2825]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  185
[[5194  247]
 [ 178 2821]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  210
[[5196  245]
 [ 182 2817]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  235
[[5196  245]
 [ 183 2816]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  260
[[5196  245]
 [ 176 2823]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  11
Number of Trees:  285
[[5198  243]
 [ 176 2823]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  10
[[5200  241]
 [ 175 2824]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  35
[[5195  246]
 [ 168 2831]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  60
[[5203  238]
 [ 163 2836]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  85
[[5202  239]
 [ 163 2836]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  110
[[5195  246]
 [ 165 2834]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  135
[[5202  239]
 [ 163 2836]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  160
[[5197  244]
 [ 164 2835]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  185
[[5198  243]
 [ 157 2842]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  210
[[5204  237]
 [ 162 2837]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  235
[[5194  247]
 [ 160 2839]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  260
[[5201  240]
 [ 156 2843]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  13
Number of Trees:  285
[[5204  237]
 [ 159 2840]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  10
[[5195  246]
 [ 173 2826]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  35
[[5195  246]
 [ 161 2838]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  60
[[5201  240]
 [ 154 2845]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.94      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  85
[[5194  247]
 [ 151 2848]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  110
[[5201  240]
 [ 155 2844]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.94      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  135
[[5207  234]
 [ 161 2838]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  160
[[5202  239]
 [ 163 2836]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  185
[[5208  233]
 [ 160 2839]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.94      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  210
[[5204  237]
 [ 158 2841]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.94      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  235
[[5205  236]
 [ 165 2834]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  260
[[5201  240]
 [ 163 2836]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  15
Number of Trees:  285
[[5200  241]
 [ 162 2837]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  10
[[5183  258]
 [ 175 2824]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  35
[[5187  254]
 [ 174 2825]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  60
[[5193  248]
 [ 179 2820]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  85
[[5190  251]
 [ 180 2819]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  110
[[5181  260]
 [ 161 2838]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  135
[[5187  254]
 [ 168 2831]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  160
[[5190  251]
 [ 175 2824]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  185
[[5185  256]
 [ 173 2826]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  210
[[5190  251]
 [ 177 2822]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  235
[[5189  252]
 [ 170 2829]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  260
[[5189  252]
 [ 172 2827]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  17
Number of Trees:  285
[[5188  253]
 [ 171 2828]]
             precision    recall  f1-score   support

      False       0.97      0.95      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  19
Number of Trees:  10
[[5182  259]
 [ 229 2770]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.92      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  19
Number of Trees:  35
[[5184  257]
 [ 205 2794]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.92      0.93      0.92      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  19
Number of Trees:  60
[[5176  265]
 [ 203 2796]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  19
Number of Trees:  85
[[5177  264]
 [ 202 2797]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.95      0.94      0.94      8440

Max Tree Depth:  19
Number of Trees:  110
[[5176  265]
 [ 200 2799]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.95      0.94      0.95      8440

Max Tree Depth:  19
Number of Trees:  135
[[5182  259]
 [ 200 2799]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.92      0.93      0.92      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  19
Number of Trees:  160
[[5179  262]
 [ 204 2795]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.95      0.94      0.94      8440

Max Tree Depth:  19
Number of Trees:  185
[[5178  263]
 [ 196 2803]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  19
Number of Trees:  210
[[5180  261]
 [ 206 2793]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.94      0.94      0.94      8440

Max Tree Depth:  19
Number of Trees:  235
[[5180  261]
 [ 197 2802]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  19
Number of Trees:  260
[[5179  262]
 [ 201 2798]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.95      0.95      0.95      8440

Max Tree Depth:  19
Number of Trees:  285
[[5180  261]
 [ 206 2793]]
             precision    recall  f1-score   support

      False       0.96      0.95      0.96      5441
       True       0.91      0.93      0.92      2999

avg / total       0.94      0.94      0.94      8440

AdaBoost


In [517]:
from sklearn.ensemble import AdaBoostClassifier

In [522]:
max_depth = [n for n in range(2,16,2)]
for depth in max_depth:
    print "Depth: ", depth
    AdaBoostClassifier(RandomForestClassifier(max_depth = 13), n_estimators=185)
    evaluate_model(AdaBoostClassifier(RandomForestClassifier(max_depth = depth), n_estimators=185))


Depth:  2
[[5214  227]
 [ 183 2816]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.93      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Depth:  4
[[5200  241]
 [ 162 2837]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.95      0.93      2999

avg / total       0.95      0.95      0.95      8440

Depth:  6
[[5178  263]
 [ 271 2728]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  8
[[5166  275]
 [ 269 2730]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  10
[[5167  274]
 [ 268 2731]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  12
[[5165  276]
 [ 267 2732]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  14
[[5167  274]
 [ 269 2730]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440


In [525]:
max_depth = [n for n in range(2,16,2)]
for depth in max_depth:
    print "Depth: ", depth
    ExtraTreesClassifier(max_depth = 13, n_estimators=185)
    evaluate_model(AdaBoostClassifier(RandomForestClassifier(max_depth = depth), n_estimators=185))


Depth:  2
[[5210  231]
 [ 175 2824]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Depth:  4
[[5203  238]
 [ 170 2829]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.96      5441
       True       0.92      0.94      0.93      2999

avg / total       0.95      0.95      0.95      8440

Depth:  6
[[5173  268]
 [ 254 2745]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.92      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  8
[[5170  271]
 [ 272 2727]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  10
[[5168  273]
 [ 269 2730]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  12
[[5166  275]
 [ 268 2731]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.94      0.94      0.94      8440

Depth:  14
[[5164  277]
 [ 272 2727]]
             precision    recall  f1-score   support

      False       0.95      0.95      0.95      5441
       True       0.91      0.91      0.91      2999

avg / total       0.93      0.93      0.93      8440

Support Vector Classifier


In [529]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
C = [.01,.1,1,10,100]
for kernel in kernels:
    for c in C:
        print "C penalty: ", c
        print "Kernel: ", kernel
        evaluate_model(SVC(C=c,kernel=kernel))


C penalty:  0.01
Kernel:  linear
[[5150  291]
 [ 591 2408]]
             precision    recall  f1-score   support

      False       0.90      0.95      0.92      5441
       True       0.89      0.80      0.85      2999

avg / total       0.90      0.90      0.89      8440

C penalty:  0.1
Kernel:  linear
[[5083  358]
 [ 554 2445]]
             precision    recall  f1-score   support

      False       0.90      0.93      0.92      5441
       True       0.87      0.82      0.84      2999

avg / total       0.89      0.89      0.89      8440

C penalty:  1
Kernel:  linear
[[5048  393]
 [ 554 2445]]
             precision    recall  f1-score   support

      False       0.90      0.93      0.91      5441
       True       0.86      0.82      0.84      2999

avg / total       0.89      0.89      0.89      8440

C penalty:  10
Kernel:  linear
[[5049  392]
 [ 554 2445]]
             precision    recall  f1-score   support

      False       0.90      0.93      0.91      5441
       True       0.86      0.82      0.84      2999

avg / total       0.89      0.89      0.89      8440

C penalty:  100
Kernel:  linear
[[5049  392]
 [ 553 2446]]
             precision    recall  f1-score   support

      False       0.90      0.93      0.91      5441
       True       0.86      0.82      0.84      2999

avg / total       0.89      0.89      0.89      8440

C penalty:  0.01
Kernel:  poly
//anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
[[5441    0]
 [2999    0]]
             precision    recall  f1-score   support

      False       0.64      1.00      0.78      5441
       True       0.00      0.00      0.00      2999

avg / total       0.42      0.64      0.51      8440

C penalty:  0.1
Kernel:  poly
[[5441    0]
 [2999    0]]
             precision    recall  f1-score   support

      False       0.64      1.00      0.78      5441
       True       0.00      0.00      0.00      2999

avg / total       0.42      0.64      0.51      8440

C penalty:  1
Kernel:  poly
[[5275  166]
 [ 788 2211]]
             precision    recall  f1-score   support

      False       0.87      0.97      0.92      5441
       True       0.93      0.74      0.82      2999

avg / total       0.89      0.89      0.88      8440

C penalty:  10
Kernel:  poly
[[5222  219]
 [ 246 2753]]
             precision    recall  f1-score   support

      False       0.96      0.96      0.96      5441
       True       0.93      0.92      0.92      2999

avg / total       0.94      0.94      0.94      8440

C penalty:  100
Kernel:  poly
[[5212  229]
 [ 143 2856]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.97      5441
       True       0.93      0.95      0.94      2999

avg / total       0.96      0.96      0.96      8440

C penalty:  0.01
Kernel:  rbf
[[5151  290]
 [ 649 2350]]
             precision    recall  f1-score   support

      False       0.89      0.95      0.92      5441
       True       0.89      0.78      0.83      2999

avg / total       0.89      0.89      0.89      8440

C penalty:  0.1
Kernel:  rbf
[[5179  262]
 [ 561 2438]]
             precision    recall  f1-score   support

      False       0.90      0.95      0.93      5441
       True       0.90      0.81      0.86      2999

avg / total       0.90      0.90      0.90      8440

C penalty:  1
Kernel:  rbf
[[5232  209]
 [ 268 2731]]
             precision    recall  f1-score   support

      False       0.95      0.96      0.96      5441
       True       0.93      0.91      0.92      2999

avg / total       0.94      0.94      0.94      8440

C penalty:  10
Kernel:  rbf
[[5220  221]
 [ 153 2846]]
             precision    recall  f1-score   support

      False       0.97      0.96      0.97      5441
       True       0.93      0.95      0.94      2999

avg / total       0.96      0.96      0.96      8440

C penalty:  100
Kernel:  rbf
[[5215  226]
 [ 117 2882]]
             precision    recall  f1-score   support

      False       0.98      0.96      0.97      5441
       True       0.93      0.96      0.94      2999

avg / total       0.96      0.96      0.96      8440

C penalty:  0.01
Kernel:  sigmoid
[[5441    0]
 [2999    0]]
             precision    recall  f1-score   support

      False       0.64      1.00      0.78      5441
       True       0.00      0.00      0.00      2999

avg / total       0.42      0.64      0.51      8440

C penalty:  0.1
Kernel:  sigmoid
[[5441    0]
 [2999    0]]
             precision    recall  f1-score   support

      False       0.64      1.00      0.78      5441
       True       0.00      0.00      0.00      2999

avg / total       0.42      0.64      0.51      8440

C penalty:  1
Kernel:  sigmoid
[[5441    0]
 [2999    0]]
             precision    recall  f1-score   support

      False       0.64      1.00      0.78      5441
       True       0.00      0.00      0.00      2999

avg / total       0.42      0.64      0.51      8440

C penalty:  10
Kernel:  sigmoid
[[5441    0]
 [2999    0]]
             precision    recall  f1-score   support

      False       0.64      1.00      0.78      5441
       True       0.00      0.00      0.00      2999

avg / total       0.42      0.64      0.51      8440

C penalty:  100
Kernel:  sigmoid
[[5441    0]
 [2999    0]]
             precision    recall  f1-score   support

      False       0.64      1.00      0.78      5441
       True       0.00      0.00      0.00      2999

avg / total       0.42      0.64      0.51      8440

Decision tree depth : fit on bad predictions from previous test set

Examining the exceptions


In [540]:
DTC = DecisionTreeClassifier(max_depth=6)
model = DTC.fit(X_train,y_train)
y_pred = model.predict(X_test)

In [541]:
y_pred = pd.DataFrame(y_pred3)
y_test = pd.DataFrame(y_test)
X_df = pd.concat([words,y_test,y_pred],axis=1)

In [ ]:


In [542]:
X_df.columns = ['string','actual','predicted']

In [543]:
X_df['Correct'] = X_df['actual'] + X_df['predicted']

In [544]:
X_df['Correct'] = X_df['Correct'].map({2:1,1:0,0:1})

In [545]:
X_df


Out[545]:
string actual predicted Correct
0 b##a NaN False NaN
1 ##ab NaN True NaN
2 #ab# False False 1.0
3 NaN NaN False NaN
4 NaN NaN True NaN
5 NaN NaN False NaN
6 a##a NaN False NaN
7 NaN NaN False NaN
8 #aba NaN True NaN
9 aba# False True 0.0
10 d##a NaN True NaN
11 NaN NaN False NaN
12 NaN NaN False NaN
13 abad NaN False NaN
14 bad# False True 0.0
15 i##a False False 1.0
16 NaN NaN False NaN
17 NaN NaN False NaN
18 NaN NaN False NaN
19 badi NaN False NaN
20 adi# False False 1.0
21 h##a False True 0.0
22 NaN NaN False NaN
23 NaN NaN True NaN
24 NaN NaN False NaN
25 NaN NaN False NaN
26 adia False True 0.0
27 diah True False 0.0
28 iah# False True 0.0
29 t##a NaN False NaN
... ... ... ... ...
261076 zoof NaN NaN NaN
261077 oofi NaN NaN NaN
261084 oofo True NaN NaN
261092 zoog NaN NaN NaN
261093 ooga True NaN NaN
261112 zool NaN NaN NaN
261113 oolo NaN NaN NaN
261120 zoon NaN NaN NaN
261121 oono True NaN NaN
261129 zoos True NaN NaN
261130 oose NaN NaN NaN
261140 ##zu False NaN NaN
261141 #zua NaN NaN NaN
261142 zuad NaN NaN NaN
261149 zuam NaN NaN NaN
261154 #zuh NaN NaN NaN
261155 zuha NaN NaN NaN
261162 zuhu True NaN NaN
261163 uhud False NaN NaN
261173 #zul False NaN NaN
261174 zulf NaN NaN NaN
261183 zulh False NaN NaN
261185 lhij False NaN NaN
261192 zulk NaN NaN NaN
261194 lkai NaN NaN NaN
261202 zulm NaN NaN NaN
261204 lmat NaN NaN NaN
261218 #zur False NaN NaN
261240 #zus NaN NaN NaN
261241 zus# NaN NaN NaN

31152 rows × 4 columns


In [413]:
bad_predictions = X_df[X_df['Correct']==0]

In [414]:
bad_predictions.index


Out[414]:
Int64Index([   9,   14,   21,   26,   27,   28,   41,   76,   77,   86,
            ...
            8067, 8084, 8127, 8153, 8167, 8286, 8295, 8303, 8338, 8415],
           dtype='int64', length=444)

In [415]:
# Let's create a dataset consisting of only incorrectly predicted data
y = y_test[y_test.index.isin(bad_predictions.index)]
X = X_test[X_test.index.isin(bad_predictions.index)]

In [416]:
#What's out baseline?
float(y[0].value_counts()[0])/(float(y[0].value_counts()[0]) + float(y[0].value_counts()[1]))


Out[416]:
0.536036036036036

In [417]:
X_train, X_test, y_train, y_test = train_test_split(X, y[0], test_size=0.33, random_state=30)

In [ ]:

Decision tree with bad prediction dataset


In [602]:
# now let's run the data through a decision tree classifier model
clf = DecisionTreeClassifier(random_state=30, max_depth=3)
cross_val = cross_val_score(clf, X_train, y_train, cv=2)

In [603]:
evaluate_model(clf)


[[59 20]
 [19 85]]
             precision    recall  f1-score   support

      False       0.76      0.75      0.75        79
       True       0.81      0.82      0.81       104

avg / total       0.79      0.79      0.79       183

Accuracy Score:  0.786885245902
*****************************
Out[603]:
0.78688524590163933

In [420]:
print classification_report(y_test,y_pred)


             precision    recall  f1-score   support

      False       0.88      0.94      0.91        77
       True       0.92      0.86      0.89        70

avg / total       0.90      0.90      0.90       147


In [421]:
from sklearn.tree import export_graphviz
with open('tree.dot', 'w') as dotfile:
    # Creating a dot file that we can write to for our output.
    export_graphviz(decision_tree = model, out_file = dotfile, feature_names=X.columns)
    # Writing to our dot file we just created.
import graphviz
with open("tree.dot") as f:
    # Opening our file where the decision trees decision information is store.
    dot_graph = f.read()
    # setting a variable equal to the contents of the read dot file.
graphviz.Source(dot_graph) 
# Equavalent of plt.show() for graphviz.


Out[421]:
Tree 0 sonorant3 <= 0.5 gini = 0.4965 samples = 297 value = [161, 136] 1 high1 <= 0.5 gini = 0.1813 samples = 119 value = [107, 12] 0->1 True 4 consonant2 <= 0.5 gini = 0.4227 samples = 178 value = [54, 124] 0->4 False 2 gini = 0.0775 samples = 99 value = [95, 4] 1->2 3 gini = 0.48 samples = 20 value = [12, 8] 1->3 5 gini = 0.2999 samples = 49 value = [40, 9] 4->5 6 gini = 0.1935 samples = 129 value = [14, 115] 4->6

In [ ]:

Examining the exceptions


In [371]:
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
X_df = pd.concat([words,y_test,y_pred],axis=1)

In [372]:
X_df.columns = ['string','actual','predicted']

In [373]:
X_df['Correct'] = X_df['actual'] + X_df['predicted']

In [374]:
X_df['Correct'] = X_df['Correct'].map({2:1,1:0,0:1})

In [375]:
bad_predictions = X_df[X_df['Correct']==0]

In [376]:
bad_predictions


Out[376]:
string actual predicted Correct
28 iah# False True 0.0

In [ ]:


In [ ]:


In [ ]: