Speakers have relatively clear intuitions about how words in their native language should be divided into syllables. These intuitions are quite systematic across speakers of the same language, and they can be applied to words a speaker has never heard before (e.g. English speakers share the intuition that the fake word 'haldapet' should be syllabified as 'hal.da.pet', and not some other way). This tells us that speakers have a systematic way of grouping sounds into syllables, and that this system extends to new data. A great deal of research has been done on syllabification in English, and there are well-developed analyses of the rule system underlying it, but syllable structure has received much less attention in less well studied languages. The aim of this lab is to discover the rules governing syllable structure in Indonesian.
In [1]:
import pandas as pd
In [2]:
# I downloaded an Indonesian dictionary in SQLite format from the Indonesian Ministry of Education and Culture
import sqlite3
conn = sqlite3.connect('KBBI.db')
c = conn.cursor()
In [3]:
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(c.fetchall())
In [4]:
# Taking a look at the data, we see that there are three columns: an index '_id', a keyword 'katakunci', and a
# definition 'artikata'
kbbi = pd.read_sql('SELECT * FROM datakata', conn, index_col='_id')
print kbbi.shape
kbbi.head(2)
Out[4]:
In [5]:
# I'll reset the column names to English for clarity
kbbi.columns = ['keyword','definition']
kbbi.head(2)
Out[5]:
In [6]:
# The 'definition' column contains some unusual encoding. We are only interested in words for which syllable
# boundaries are marked. Let's find some words that have multiple syllables and work out a strategy for parsing out the
# syllabified string. A good first step is to identify keywords that contain multiple vowels. Below is
# a function which counts the vowels in a keyword (a proxy for its syllable count).
def vowel_count(keyword):
vowels = ['a','e','i','o','u']
count = 0
for letter in keyword:
if letter in vowels:
count += 1
return count
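# a quick sanity check (hypothetical usage, not part of the original analysis) on a keyword we will see
# below: 'abadiah' contains four vowels
print vowel_count('abadiah')  # expect 4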
In [7]:
# Now let's create a column listing a count of the vowels in each keyword.
kbbi['vowel_count'] = kbbi['keyword'].apply(lambda x: vowel_count(x))
kbbi.head(10)
Out[7]:
In [8]:
# Just looking at the first few examples, we can see that the data is quite inconsistent in terms of what
# information is included. For some keywords, e.g. 'abadiah', information about syllabification
# is provided, since we see the string 'aba·di·ah' buried within the definition. Monosyllabic forms like 'ab' lack
# an explicit syllabification. Moreover, some keywords, like the form 'aba-aba' (a reduplicated form), are clearly
# polysyllabic but nevertheless lack explicit syllabification. It would appear from these limited examples that
# polysyllabic forms for which syllabification is marked contain the symbol '·'. If this symbol is indeed unique to
# entries with syllabification, then a search for entries containing '·' should only
# return forms with more than one vowel. Let's test this hypothesis.
def syl_dot(string):
if '·' in string:
return True
else:
return False
In [9]:
# I had to encode the 'definition' strings as 'utf-8', since in 'ascii' I could not search for the symbol '·'
kbbi['definition'] = kbbi['definition'].str.encode(encoding='utf-8')
kbbi['syllable_divider'] = kbbi['definition'].apply(lambda x: syl_dot(x))
In [10]:
# now let's check to see that all rows with '·' contain more than one vowel
divider_words = kbbi[kbbi['syllable_divider'] == True]
divider_words['vowel_count'].value_counts()
Out[10]:
In [11]:
# I expected that the words containing only a single vowel (words that should be monosyllabic) would never contain
# the syllable divider symbol '·', but this was not the case. I'm going to take a closer look at these forms to see
# why they contain this symbol.
divided_one_V = divider_words[divider_words['vowel_count'] == 1]
divided_one_V
Out[11]:
In [12]:
# Inspection of these strings reveals that the definitions contain morphologically complex forms built on the
# keyword as a base or root (i.e. forms with prefixes, suffixes, etc.). These complex forms are
# polysyllabic and contain explicit syllabification. For example, the entry for the keyword 'am' (below) contains
# information about the derived form meng-am-kan, which in turn contains the syllable marker '·'
print divided_one_V['definition'][930]
In [13]:
# In forms where a syllabification for the keyword itself is provided, it appears that the syllabified form
# is provided in the first 'chunk' of the text string. With this in mind, let's parse the text before the first ' '
# in 'definition'. I suspect that this chunk of text will only contain '·' in cases where a syllabification of the
# keyword itself is being provided (rather than some other form buried deeper in the definition text).
kbbi['string_1'] = kbbi['definition'].apply(lambda x: x.split(' ')[0])
In [14]:
# Now we can recompute the 'syllable_divider' column based on whether the divider symbol occurs in 'string_1'
kbbi['syllable_divider'] = kbbi['string_1'].apply(lambda x: syl_dot(x))
kbbi.head(10)
Out[14]:
In [15]:
# Again, we expect that syllable dividers will only occur in words with more than one vowel
divider_words = kbbi[kbbi['syllable_divider'] == True]
divider_words['vowel_count'].value_counts()
Out[15]:
In [16]:
# Almost exactly as expected: only one word remains that contains a divider but only one vowel.
# Let's take a look to see what's going on.
divided_one_V = divider_words[divider_words['vowel_count'] == 1]
divided_one_V
Out[16]:
In [17]:
# The exception that proves the rule! This is a borrowing in which 'y' (a letter not used as a vowel in Indonesian
# orthography) is used as a vowel. Let's delete this row.
kbbi.drop(11174,axis=0,inplace=True)
kbbi.reset_index(drop=True, inplace=True)
In [18]:
# Before looking more closely at how words are syllabified, let's try to clean up some of the messy encoding
# in string_1. First, let's convert the syllable boundary into a symbol which does not cause us problems as we
# convert from one encoding to another (the current symbol gets rendered as '\xc2\xb7'). Let's use the symbol '!'
# for syllable boundaries
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: x.replace('\xc2\xb7','!'))
In [19]:
# we can also chop up these strings to figure out what is an actual word and what is leftover markup
kbbi['chunks'] = kbbi['string_1'].apply(lambda x: x.split(';'))
chunks = []
for line in kbbi['chunks']:
chunks = chunks + line
chunks = pd.Series(chunks)
chunks = chunks.value_counts().head(30)
In [20]:
print chunks.index
del kbbi['chunks']
In [21]:
# now that we have a better sense of what tags there are, we can create a function to delete them:
def tag_delete(string):
tags = ['<','b>','/','b>', '/sup>', 'sup','>', '1','<', '2','3',',', '4','<', 'i>', '5', '6', ',<',';', 'A', '7','8','9']
for tag in tags:
string = string.replace(tag,'')
return string
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: tag_delete(x))
In [22]:
kbbi.head(10)
Out[22]:
In [23]:
# it looks like that cleaned things up pretty well, but let's just make sure that we don't still have unusual symbols
# somewhere in the string:
lists = kbbi['string_1'].tolist()
characters = set()
for lst in lists:
for ch in lst:
characters.add(ch)
print characters
In [24]:
# many of these characters should not be part of a phonetic transcription
bad_characters = ['\xc3', '\xa9', ')', '(', ':', 'x','q']
# there are also capital letters which may or may not be parts of actual words; we will return to these shortly
caps = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
# first, however, let's delete the characters which are definitely not part of a syllabified transcription
def bad_ch_delete(string):
bad_characters = ['\xc3', '\xa9', ')', '(', '\xb7', ':', '\xc2', 'x']
for ch in bad_characters:
string = string.replace(ch,'')
return string
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: bad_ch_delete(x))
In [25]:
# now let's look at words with caps
lists = kbbi['string_1'].tolist()
caps_list = []
for i in lists:
for cap in caps:
if cap in i:
caps_list.append(i)
print caps_list
In [26]:
# Lots of these are proper nouns. A few of them do not contain information about syllabification. Let's delete
# those, and convert the others to lowercase.
caps_list = [x for x in caps_list if '!' not in x]
def caps_finder(string):
    for word in caps_list:
        if word in string:
            return True
    return False
kbbi['caps'] = kbbi['string_1'].apply(lambda x: caps_finder(x))
kbbi = kbbi[kbbi['caps'] == False]
del kbbi['caps']
# now we can convert the remaining symbols to lowercase
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: x.lower())
In [27]:
kbbi.head()
Out[27]:
In [28]:
# Back to syllables: We expect that in most cases the number of syllable dividers in string_1 will be equal to the
# number of vowels minus 1, since e.g. a 2-syllable word like 'batu' has a single syllable division.
# This won't always be the case, since Indonesian allows the diphthongs 'ai' and 'au' in final syllables
# (and rarely elsewhere). In these limited cases the number of vowels will exceed the number of syllable
# boundaries by 2. Let's test this out by creating a new column counting boundaries. We can compare the
# value in this column to the value in vowel_count.
kbbi['divider_count'] = kbbi['string_1'].apply(lambda x: x.count('!'))
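# a minimal worked example of the arithmetic (hypothetical strings in the '!' notation, not rows from the data):
for s in ['ba!tu', 'pan!tai']:
    print s, vowel_count(s), s.count('!'), vowel_count(s) - s.count('!')
# 'ba!tu'   -> 2 vowels, 1 divider, difference 1
# 'pan!tai' -> 3 vowels, 1 divider, difference 2 (final diphthong 'ai')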
In [29]:
kbbi.head(10)
Out[29]:
In [30]:
# let's make a separate column counting the difference between the vowel count and syllable boundary count.
# We expect this difference to be equal to 1 in the vast majority of cases, and equal to 2 in a relatively small number
# of cases where a diphthong is present.
kbbi['diff_numV_numB'] = kbbi['vowel_count'] - kbbi['divider_count']
# Let's look at the new column values for words with explicit syllable boundaries marked
kbbi['diff_numV_numB'][kbbi['syllable_divider'] == True].value_counts()
Out[30]:
In [31]:
# The counts roughly match expectations; however, there are several entries where the number of vowels in the
# keyword far exceeds the number of syllable boundaries. I suspect these are cases where the authors of the dictionary
# failed to transcribe a syllable boundary or two. Let's take a look, starting with the entries in which the
# number of vowels in the keyword exceeds the number of syllable boundaries by 3 or more.
syl_words = kbbi[kbbi['syllable_divider']==True]
check_words = syl_words[syl_words['diff_numV_numB'] >= 3]
check_words
Out[31]:
In [32]:
# it looks like several of these entries are just poorly marked. For example, the word 'abaimana' is syllabified
# as abai.ma.na, whereas it should be syllabified as a.bai.ma.na. While there are a few examples where the authors
# of the dictionary neglected to mark a syllable boundary, there are far more examples where the mismatch between
# vowels and syllable boundaries is due to something that can be readily fixed using a string search. Some examples
# are described below.
In [33]:
# For many of the keywords in the dataset, no syllabification is provided. We want to remove any polysyllabic words
# for which information about syllabification is absent; however, there are also many monosyllabic words which
# lack a syllable boundary simply because they contain a single syllable. The vowel counts we did above will help us
# distinguish between monosyllabic words and words which are polysyllabic and merely lack information about
# syllabification. With this in mind, we might delete all keywords which contain more than one vowel and also
# lack a syllable boundary. There is a potential problem with removing all such words: Indonesian contains
# numerous words ending with the vowel sequences 'ai' and 'au'. These sequences are tricky to deal with because,
# when the only vowels in a word are 'ai' or 'au', speakers may disagree about whether they constitute 1 or 2 syllables,
# and, in fact, this may depend on syntactic factors. The words 'kau' ('you') and 'mau' ('want') are both forms
# for which this is the case (i.e. mau ~ ma.u; kau ~ ka.u). To deal with this, let's first get
# rid of keywords which are without question not monosyllabic: words containing 3 or more vowels which
# are not explicitly syllabified.
kbbi = kbbi[(kbbi.syllable_divider==True) | ((kbbi.vowel_count<=2) & (kbbi.syllable_divider==False))]
In [34]:
kbbi[(kbbi.vowel_count==2) & (kbbi.syllable_divider==False)]
Out[34]:
In [35]:
# Keywords containing more than one word: several keywords actually contain two words separated by
# a space. In many of these cases, syllabification is only provided for one of the words. It is almost always the
# case that the words in such keywords independently exist as keywords elsewhere in the dictionary, so by deleting
# them from the database we will not lose much information.
# Likewise, keywords containing a hyphen contain words that, in the vast majority of cases, have their own independent
# entries, and therefore we won't be losing important datapoints by deleting them.
def space_hyphen_finder(string):
if ' ' in string:
return True
elif '-' in string:
return True
elif len(string) == 0:
return True
else:
return False
kbbi['spaces'] = kbbi['string_1'].apply(lambda x: space_hyphen_finder(x))
kbbi = kbbi[kbbi['spaces'] == False]
del kbbi['spaces']
In [36]:
# now, to help us with the next couple of steps, let's add '#' to mark word boundaries. Later on, we will also need
# these boundaries to train our model.
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: '#' + x + '#')
In [37]:
kbbi.head(30)
Out[37]:
In [38]:
# There are plenty of words on this list which contain the sequence vowel-consonant-vowel. Without exception,
# such sequences are syllabified as V.CV, so we can write a function to insert a syllable boundary
# (starting with word-initial VCV sequences).
words = kbbi['string_1'][(kbbi.vowel_count==2) & (kbbi.syllable_divider==False)].tolist()
characters = set()
for word in words:
for ch in word:
characters.add(ch)
print characters
In [39]:
# let's generate all possible word-initial VCV sequences; then we can create a function to replace them with the
# right syllabification
vowels = ['a','e','i','o','u']
consonants = ['c','b','d','g','f','h', 'k', 'j', 'm', 'l', 'n', 'q', 'p', 's', 'r', 't', 'w', 'v', 'y', 'z']
VCVs = []
for v in vowels:
for c in consonants:
for v1 in vowels:
sequence = ('#' + v + c + v1,'#' + v + '!' + c + v1)
VCVs.append(sequence)
def syllabify_VCV(string):
for VCV in VCVs:
string = string.replace(VCV[0],VCV[1])
return string
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: syllabify_VCV(x))
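# The same word-initial V.CV insertion could also be done with a single regular expression. A minimal alternative
# sketch (not used above), assuming the '#' word boundaries and '!' syllable marker introduced earlier:
import re
def syllabify_VCV_re(string):
    # insert '!' between a word-initial vowel and a following consonant-vowel sequence
    return re.sub(r'#([aeiou])([bcdfghjklmnpqrstvwyz])([aeiou])', r'#\1!\2\3', string)
# e.g. syllabify_VCV_re('#abad#') returns '#a!bad#'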
In [40]:
kbbi.head()
Out[40]:
In [41]:
# Words with ..ia.. (e.g. 'widia'): this sequence, which occurs in many borrowings, is syllabified as i.a or i.ya
# for many speakers. I will treat all instances of this vowel sequence as belonging to separate syllables.
#def insert_dot_ia(string):
# string = string.replace('ia','i!a')
# return string
#kbbi['string_1'][] = kbbi['string_1'].apply(lambda x: insert_dot_ia(x))
In [42]:
# Another orthographic peculiarity of Indonesian is that, in limited cases, the glides 'w' and 'y' occur as the second
# segment of a sequence following a consonant (e.g. Widya). In some cases, the pronunciation of the glide is actually
# [i.y] rather than [y]. I'm going to try to correct or remove any such examples with the help of a native speaker.
# First I'll generate a list of possible C+glide sequences
CGl = []
for c in consonants:
cluster_y = c + 'y'
cluster_w = c + 'w'
CGl.append(cluster_y)
CGl.append(cluster_w)
# now, I'll check to see which of these is attested in our wordlist
gl_clusters = set()
for word in kbbi['string_1']:
for cg in CGl:
if cg in word:
gl_clusters.add(cg)
gl_clusters
Out[42]:
In [43]:
# Some sequences represent individual sounds that are written with two letters ('digraphs'). These include [sy]
# -- which is typically pronounced as [s], but may also be pronounced as a palatal -- and [sw].
# I will convert this sequence to a single symbol later on. Likewise, [ny] is actually a single sound -- the palatal
# nasal stop -- and I will also convert it to a single symbol below. Let's take a closer look at the remaining sounds:
C_glide = ['by', 'dw', 'dy', 'fy', 'hw', 'hy', 'kw', 'py', 'wy']
for word in kbbi['string_1']:
for Cgl in C_glide:
if Cgl in word:
print word
In [44]:
# There are several word-final consonant sequences that are written in Indonesian but not fully pronounced: in
# actual speech, either a consonant is deleted or a vowel is inserted to break up the cluster.
# I want to double check that the authors of the KBBI transcribed these correctly.
final_CC = []
for C1 in consonants:
for C2 in consonants:
CC = C1 + C2 + '#'
final_CC.append(CC)
CC_clusters = []
for word in kbbi['string_1']:
for cc in final_CC:
if cc in word and cc != 'ng#': # 'ng' is a single sound written as a digraph
CC_clusters.append(cc)
print word
In [45]:
# Many of these clusters are only orthographic and are reduced to a single consonant in pronunciation. For the time
# being I am going to delete all rows with these CCs, except for some of the most common ones, for which I have a good
# sense of the pronunciation. At some future point, I plan to edit these based on actual pronunciation.
CC_clusters = pd.Series(CC_clusters)
CC_clusters.value_counts()
Out[45]:
In [46]:
# deleting final clusters (these can be tweaked later)
final_CCs_to_omit = [u'ks#', u'ns#', u'rm#', u'kh#', u'rs#', u'lt#', u'ps#', u'rf#', u'rn#',
u'lm#', u'nk#', u'rt#', u'rk#', u'lk#', u'tt#', u'st#', u'ls#', u'hm#',
u'tz#', u'sy#', u'rd#', u'nt#', u'rg#', u'ts#', u'lp#', u'mp#', u'rp#',
u'rb#', u'sk#', u'ny#', u'lf#', u'ln#', u'ft#']
def delete_final_CC(string):
    for cc in final_CCs_to_omit:
        if cc in string:
            return True
    return False
kbbi['final_CC'] = kbbi['string_1'].apply(lambda x: delete_final_CC(x))
kbbi = kbbi[kbbi['final_CC'] == False]
In [47]:
# Another set of words I would like to take a closer look at are those with non-native diphthongs like 'eu'. Let's
# start with a quick search to see what vowel-vowel sequences occur in the data, and check for any unusual clusters.
VVs = []
for V1 in vowels:
for V2 in vowels:
non_hiatus_VV = V1 + V2
VVs.append(non_hiatus_VV)
VVs_attested = set()
VV_words = []
for word in kbbi['string_1']:
for vv in VVs:
if vv in word and vv != 'ai' and vv != 'au': # 'ai' and 'au' are common syllable rimes, so we don't suspect
# that they have been mistranscribed
VVs_attested.add(vv)
VV_words.append(word)
In [48]:
## attested vvs
print VVs_attested
In [49]:
VV_words[:10]
Out[49]:
In [50]:
### all of the words with these clusters are rarely used terms. For the time being, I am going to delete these terms.
def delete_borrowing(string):
    for word in VV_words:
        if string == word:
            return True
    return False
kbbi['strange_VV'] = kbbi['string_1'].apply(lambda x: delete_borrowing(x))
kbbi = kbbi[kbbi['strange_VV']== False]
del kbbi['strange_VV']
del kbbi['final_CC']
In [51]:
kbbi.head(50)
Out[51]:
In [52]:
# The data seems pretty clean now, so let's try to extract features. We want to set up a classification model
# which tells us, for any given position in a word, whether there is a syllable boundary in that position. The
# way I am going to do this is by looking at adjacent sounds within a certain window size. Here are two functions
# to accomplish this:
# first we need a function to change digraphs (like the palatal nasal 'ny') to a single character
def prepare_string(word):
word = word.lower()
word = word.replace('ng','N')
word = word.replace('ny','Y')
word = word.replace('sy','s')
word = word.replace('kh','k')
word = word.replace('tj','c')
word = word.replace('dj','j')
return(word)
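# hypothetical check (not part of the original notebook): prepare_string('#nyanyi#') returns '#YaYi#',
# with both 'ny' digraphs collapsed to the single symbol 'Y'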
# then we need a function to create windows of a designated size, together with information about whether the window
# contains a syllable boundary in its center position (e.g. between the second and third segment if the left margin = 2
# and the right margin = 3)
def window_gen(strings,left_margin,right_margin):
windows = []
for string in strings:
string = prepare_string(string)
char_ind = []
syl_ind = []
for indx,ch in enumerate(string):
if ch == '!':
syl_ind.append(indx)
else:
char_ind.append(indx)
for i,j in enumerate(char_ind):
left = range((i-left_margin),i)
right = range(i,(right_margin+i))
window_range = left + right
try:
window_index = [char_ind[z] for z in window_range]
window = ''.join([str(string[p]) for p in window_index])
is_boundary = bool([n for n in syl_ind if n > char_ind[(i-1)] and n < char_ind[(i)]])
windows.append((window,is_boundary))
except:
pass
return windows
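# a quick check of the output format on a single hypothetical string (note that the first pair wraps around to the
# final '#' because of Python's negative indexing):
print window_gen(['#ba!tu#'], 1, 1)
# [('##', False), ('#b', False), ('ba', False), ('at', True), ('tu', False), ('u#', False)]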
In [53]:
# using window_gen, we can generate a number of new datasets (lists of (window, boundary) pairs) with various window sizes
strings = kbbi['string_1'].tolist()
win_1_0 = window_gen(strings,1,0)
win_0_1 = window_gen(strings,0,1)
win_1_1 = window_gen(strings,1,1)
win_1_2 = window_gen(strings,1,2)
win_2_1 = window_gen(strings,2,1)
win_2_2 = window_gen(strings,2,2)
win_2_3 = window_gen(strings,2,3)
win_3_2 = window_gen(strings,3,2)
win_3_3 = window_gen(strings,3,3)
In [54]:
# let's create a data dictionary, keyed by window size, splitting each list of (window, is_boundary)
# tuples into a list of windows and a list of boundary labels
data = {}
window_sets = {'win10': win_1_0, 'win01': win_0_1, 'win11': win_1_1,
               'win12': win_1_2, 'win21': win_2_1, 'win22': win_2_2,
               'win23': win_2_3, 'win32': win_3_2, 'win33': win_3_3}
for name, window_list in window_sets.items():
    data[name + '_windows'] = [w for w, b in window_list]
    data[name + '_boundaries'] = [b for w, b in window_list]
In [55]:
# These individual functions decompose sounds into phonological features; we will apply them within the function
# defined below:
def sonorant(character):
if character in ['a','e','i','o','u','y','w','m','N','Y','l','r','h','q']:
return 1
else:
return 0
def continuant(character):
if character in ['s','z','f']:
return 1
else:
return 0
def consonant(character):
if character in ['p','t','k','q','h','c','b','d','g','j','s','z','f','m','n','Y','N','l','r']:
return 1
else:
return 0
def strident(character):
if character in ['s','j','c','z']:
return 1
else:
return 0
###populate place: labial, coronal, palatal, velar, glottal
def labial(character):
if character in ['p','m','f','b','w','u','o']:
return 1
else:
return 0
def coronal(character):
if character in ['t','d','n','s','j','c','Y','i','e','r','l','z']:
return 1
else:
return 0
def palatal(character):
if character in ['s','j','c','i','Y','e']:
return 1
else:
return 0
def velar(character):
if character in ['u','k','g','N','o']:
return 1
else:
return 0
def glottal(character):
if character in ['h','q']:
return 1
else:
return 0
###nasality
def nasal(character):
if character in ['m','n','Y','N']:
return 1
else:
return 0
###populate obstruent voicing
###I assume that [voice] is only phonologically active in obstruents
def obs_voice(character):
if character in ['b','d','g','j','z']:
return 1
else:
return 0
### populate lateral/rhotic
def lateral(character):
if character == 'l':
return 1
else:
return 0
def rhotic(character):
if character == 'r':
return 1
else:
return 0
###populate vowel height
###I assume that mid is not an active feature
def high(character):
if character in ['i','u']:
return 1
else:
return 0
def low(character):
if character == 'a':
return 1
else:
return 0
def word_boundary(character):
if character == '#':
return 1
else:
return 0
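# a small illustration (hypothetical check, not in the original): build the full feature vector for a single
# character, here the palatal nasal 'Y', by applying each of the functions above
feature_funcs = [sonorant, continuant, consonant, strident, labial, coronal, palatal, velar,
                 glottal, nasal, obs_voice, lateral, rhotic, high, low, word_boundary]
print [(f.__name__, f('Y')) for f in feature_funcs]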
In [550]:
# now we need a function that decomposes each window into phonological features and assembles a feature matrix
def matrix_creater(data, left_window, right_window):
name_w = 'win'+str(left_window)+str(right_window)+'_windows'
name_b = 'win'+str(left_window)+str(right_window)+'_boundaries'
string_list = data[name_w]
prediction_list = data[name_b]
ch_count = left_window + right_window
df = pd.DataFrame(string_list,columns=['string'])
for num in range(0,ch_count):
df['sonorant'+str(num)] = df['string'].apply(lambda x: sonorant(x[num]))
df['continuant'+str(num)] = df['string'].apply(lambda x: continuant(x[num]))
df['consonant'+str(num)] = df['string'].apply(lambda x: consonant(x[num]))
df['strident'+str(num)] = df['string'].apply(lambda x: strident(x[num]))
df['labial'+str(num)] = df['string'].apply(lambda x: labial(x[num]))
df['coronal'+str(num)] = df['string'].apply(lambda x: coronal(x[num]))
df['palatal'+str(num)] = df['string'].apply(lambda x: palatal(x[num]))
df['velar'+str(num)] = df['string'].apply(lambda x: velar(x[num]))
df['glottal'+str(num)] = df['string'].apply(lambda x: glottal(x[num]))
df['nasal'+str(num)] = df['string'].apply(lambda x: nasal(x[num]))
df['obs_voice'+str(num)] = df['string'].apply(lambda x: obs_voice(x[num]))
df['lateral'+str(num)] = df['string'].apply(lambda x: lateral(x[num]))
df['rhotic'+str(num)] = df['string'].apply(lambda x: rhotic(x[num]))
df['high'+str(num)] = df['string'].apply(lambda x: high(x[num]))
df['low'+str(num)] = df['string'].apply(lambda x: low(x[num]))
df['word_boundary'+str(num)] = df['string'].apply(lambda x: word_boundary(x[num]))
df1 = pd.DataFrame(data=prediction_list, columns=['pred'])
df = pd.concat([df,df1],axis=1)
df.drop_duplicates(subset='string', inplace=True)
return df
In [604]:
#df01 = matrix_creater(data,0,1)
#df10 = matrix_creater(data,1,0)
df11 = matrix_creater(data,1,1)
df12 = matrix_creater(data,1,2)
#df21 = matrix_creater(data,2,1)
df22 = matrix_creater(data,2,2)
#df23 = matrix_creater(data,2,3)
#df32 = matrix_creater(data,3,2)
df33 = matrix_creater(data,3,3)
In [560]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# cross validation
from sklearn.cross_validation import train_test_split, KFold, cross_val_score
# models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
# evaluation
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
In [607]:
y = df12['pred']
X = df12.drop('pred',axis=1)
In [574]:
words = X['string']
del X['string']
In [575]:
# let's see what our baseline is
float(y.value_counts()[0])/(float(y.value_counts()[0]) + float(y.value_counts()[1]))
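# note (not in the original): since y is a boolean series, the fraction of positions without a boundary
# can equivalently be computed as (~y).mean()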
Out[575]:
In [564]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=30)
In [536]:
def evaluate_model(model):
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cnf_mtx = confusion_matrix(y_test, y_pred)
acc_scr = accuracy_score(y_test, y_pred)
cls_rep = classification_report(y_test, y_pred)
print cnf_mtx
print cls_rep
print "Accuracy Score: ", acc_scr
print "*****************************"
return acc_scr
In [537]:
# global dictionary of models
models = {}
In [ ]:
# I run a number of models below. I have not added many comments. In general, none of the models significantly
# outperform the Decision Tree Classifier. This is illustrated by the examples below.
# Given that the goal of this project is to discover the simplest
# model that accounts for syllabification, I will go ahead and use the decision tree classifier.
In [538]:
max_depths = [n for n in range(1,30)]
criteria = ['gini', 'entropy']
for max_depth in max_depths:
for criterion in criteria:
print "Max Depth: ", max_depth
print "Criterion: ", criterion
evaluate_model(DecisionTreeClassifier(criterion=criterion, max_depth=max_depth))
In [539]:
C = [.01,.1,1,10,100]
for c in C:
print "C: ", c
evaluate_model(LogisticRegression(C=c))
In [506]:
n_values = [n for n in range(1,20,2)]
for n_value in n_values:
print "Number of Neighbors: ", n_value
evaluate_model(KNeighborsClassifier(n_neighbors=n_value))
In [516]:
max_depth = [n for n in range(1,20,2)]
n_estimators = [n for n in range(10,300,25)]
for max_n in max_depth:
for n_estimator in n_estimators:
print "Max Tree Depth: ", max_n
print "Number of Trees: ", n_estimator
evaluate_model(RandomForestClassifier(max_depth=max_n,n_estimators=n_estimator))
In [517]:
from sklearn.ensemble import AdaBoostClassifier
In [522]:
max_depth = [n for n in range(2,16,2)]
for depth in max_depth:
print "Depth: ", depth
AdaBoostClassifier(RandomForestClassifier(max_depth = 13), n_estimators=185)
evaluate_model(AdaBoostClassifier(RandomForestClassifier(max_depth = depth), n_estimators=185))
In [525]:
max_depth = [n for n in range(2,16,2)]
for depth in max_depth:
    print "Depth: ", depth
    evaluate_model(ExtraTreesClassifier(max_depth=depth, n_estimators=185))
In [529]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
C = [.01,.1,1,10,100]
for kernel in kernels:
for c in C:
print "C penalty: ", c
print "Kernel: ", kernel
evaluate_model(SVC(C=c,kernel=kernel))
In [540]:
DTC = DecisionTreeClassifier(max_depth=6)
model = DTC.fit(X_train,y_train)
y_pred = model.predict(X_test)
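# Since the aim is to discover interpretable syllabification rules, it can help to inspect which phonological
# features the fitted tree relies on. A hedged sketch (not part of the original analysis) using scikit-learn's
# feature_importances_ attribute:
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print importances.head(10)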
In [541]:
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
X_df = pd.concat([words,y_test,y_pred],axis=1)
In [ ]:
In [542]:
X_df.columns = ['string','actual','predicted']
In [543]:
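# 'actual' and 'predicted' are booleans, so their sum is 2 or 0 when they agree and 1 when they disagree;
# the map in the next cell converts this into a 1/0 'Correct' flag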
X_df['Correct'] = X_df['actual'] + X_df['predicted']
In [544]:
X_df['Correct'] = X_df['Correct'].map({2:1,1:0,0:1})
In [545]:
X_df
Out[545]:
In [413]:
bad_predictions = X_df[X_df['Correct']==0]
In [414]:
bad_predictions.index
Out[414]:
In [415]:
# Let's create a dataset consisting of only incorrectly predicted data
y = y_test[y_test.index.isin(bad_predictions.index)]
X = X_test[X_test.index.isin(bad_predictions.index)]
In [416]:
#What's our baseline?
float(y[0].value_counts()[0])/(float(y[0].value_counts()[0]) + float(y[0].value_counts()[1]))
Out[416]:
In [417]:
X_train, X_test, y_train, y_test = train_test_split(X, y[0], test_size=0.33, random_state=30)
In [ ]:
In [602]:
# now let's run the data through a decision tree classifier model
clf = DecisionTreeClassifier(random_state=30, max_depth=3)
cross_val = cross_val_score(clf, X_train, y_train, cv=2)
In [603]:
evaluate_model(clf)
Out[603]:
In [420]:
print classification_report(y_test,y_pred)
In [421]:
from sklearn.tree import export_graphviz
with open('tree.dot', 'w') as dotfile:
    # write the fitted tree's decision structure to a dot file
    export_graphviz(decision_tree=model, out_file=dotfile, feature_names=X.columns)
import graphviz
with open('tree.dot') as f:
    # read back the dot file we just created
    dot_graph = f.read()
graphviz.Source(dot_graph)
# the graphviz equivalent of plt.show()
Out[421]:
In [ ]:
In [371]:
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
X_df = pd.concat([words,y_test,y_pred],axis=1)
In [372]:
X_df.columns = ['string','actual','predicted']
In [373]:
X_df['Correct'] = X_df['actual'] + X_df['predicted']
In [374]:
X_df['Correct'] = X_df['Correct'].map({2:1,1:0,0:1})
In [375]:
bad_predictions = X_df[X_df['Correct']==0]
In [376]:
bad_predictions
Out[376]: