Speakers have relatively clear intuitions about how words in their native language should be divided into syllables. These intuitions are quite systematic across speakers of the same language, and they can be applied to words a speaker has never heard before (e.g. English speakers share the intuition that the fake word 'haldapet' should be syllabified as 'hal.da.pet', and not some other way). This tells us that speakers have a systematic way of grouping sounds into syllables, and that this system extends to new data. A great deal of research has been done on syllabification in English, and there are well-developed analyses of the rule system underlying it, but syllable structure has received much less attention in less well studied languages. The aim of this lab is to discover the rules governing syllable structure in Indonesian.
In [1]:
import pandas as pd
In [2]:
# I downloaded an Indonesian dictionary in SQLite format from the Indonesian Ministry of Education and Culture
import sqlite3
conn = sqlite3.connect('KBBI.db')
c = conn.cursor()
In [3]:
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(c.fetchall())
In [4]:
# Taking a look at the data, we see that there are three columns: an index '_id', a keyword 'katakunci', and a
# definition 'artikata'
kbbi = pd.read_sql('SELECT * FROM datakata', conn, index_col='_id')
print kbbi.shape
kbbi.head(2)
Out[4]:
In [5]:
# I'll reset the column names to English for clarity
kbbi.columns = ['keyword','definition']
kbbi.head(2)
Out[5]:
In [6]:
# The 'definition' column contains some unusual encoding. We are only interested in words for which syllable
# boundaries are marked. Let's find some words that have multiple syllables and work out a strategy for parsing out the
# syllabified string. A good first step is to identify keywords that contain multiple vowels. Below is
# a function which counts the vowels in a keyword (a proxy for its syllable count).
def vowel_count(keyword):
vowels = ['a','e','i','o','u']
count = 0
for letter in keyword:
if letter in vowels:
count += 1
return count
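# a quick sanity check (hypothetical usage, not part of the original analysis) on a keyword we will see
# below: 'abadiah' contains four vowels
print vowel_count('abadiah')  # expect 4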
In [7]:
# Now let's create a column listing a count of the vowels in each keyword.
kbbi['vowel_count'] = kbbi['keyword'].apply(lambda x: vowel_count(x))
kbbi.head(10)
Out[7]:
In [8]:
# Just looking at the first few examples, we can see that the data is quite inconsistent in terms of what
# information is included. For some keywords, e.g. 'abadiah', information about syllabification
# is provided, since we see the string 'aba·di·ah' buried within the definition. Monosyllabic forms like 'ab' lack
# an explicit syllabification. Moreover, some keywords, like the form 'aba-aba' (a reduplicated form), are clearly
# polysyllabic but nevertheless lack explicit syllabification. It would appear from these limited examples that
# polysyllabic forms for which syllabification is marked contain the symbol '·'. If this symbol is indeed unique to
# entries with syllabification, then a search for entries containing '·' should only
# return forms with more than one vowel. Let's test this hypothesis.
def syl_dot(string):
if '·' in string:
return True
else:
return False
In [9]:
# I had to encode the 'definition' strings as 'utf-8', since in 'ascii' I could not search for the symbol '·'
kbbi['definition'] = kbbi['definition'].str.encode(encoding='utf-8')
kbbi['syllable_divider'] = kbbi['definition'].apply(lambda x: syl_dot(x))
In [10]:
# now let's check to see that all rows with '·' contain more than one vowel
divider_words = kbbi[kbbi['syllable_divider'] == True]
divider_words['vowel_count'].value_counts()
Out[10]:
In [11]:
# I expected that the words containing only a single vowel (words that should be monosyllabic) would never contain
# the syllable divider symbol '·', but this was not the case. I'm going to take a closer look at these forms to see
# why they contain this symbol.
divided_one_V = divider_words[divider_words['vowel_count'] == 1]
divided_one_V
Out[11]:
In [12]:
# Inspection of these strings reveals that the definitions contain morphologically complex forms built on the
# keyword as a base or root (i.e. forms with prefixes, suffixes, etc.). These complex forms are
# polysyllabic and contain explicit syllabification. For example, the entry for the keyword 'am' (below) contains
# information about the derived form meng-am-kan, which in turn contains the syllable marker '·'
print divided_one_V['definition'][930]
In [13]:
# In forms where a syllabification for the keyword itself is provided, it appears that the syllabified form
# is provided in the first 'chunk' of the text string. With this in mind, let's parse the text before the first ' '
# in 'definition'. I suspect that this chunk of text will only contain '·' in cases where a syllabification of the
# keyword itself is being provided (rather than some other form buried deeper in the definition text).
kbbi['string_1'] = kbbi['definition'].apply(lambda x: x.split(' ')[0])
In [14]:
# Now we can recompute the 'syllable_divider' column based on whether the divider symbol occurs in 'string_1'
kbbi['syllable_divider'] = kbbi['string_1'].apply(lambda x: syl_dot(x))
kbbi.head(10)
Out[14]:
In [15]:
# Again, we expect that syllable dividers will only occur in words with more than one vowel
divider_words = kbbi[kbbi['syllable_divider'] == True]
divider_words['vowel_count'].value_counts()
Out[15]:
In [16]:
# Almost exactly as expected: only one word remains that contains a divider but only one vowel.
# Let's take a look to see what's going on.
divided_one_V = divider_words[divider_words['vowel_count'] == 1]
divided_one_V
Out[16]:
In [17]:
# The exception that proves the rule! This is a borrowing in which 'y' (a letter not used as a vowel in Indonesian
# orthography) is used as a vowel. Let's delete this row.
kbbi.drop(11174,axis=0,inplace=True)
kbbi.reset_index(drop=True, inplace=True)
In [18]:
# Before looking more closely at how words are syllabified, let's try to clean up some of the messy encoding
# in string_1. First, let's convert the syllable boundary into a symbol which does not cause us problems as we
# convert from one encoding to another (the current symbol gets rendered as '\xc2\xb7'). Let's use the symbol '!'
# for syllable boundaries
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: x.replace('\xc2\xb7','!'))
In [19]:
# we can also chop up these strings to figure out what is an actual word and what is leftover markup
kbbi['chunks'] = kbbi['string_1'].apply(lambda x: x.split(';'))
chunks = []
for line in kbbi['chunks']:
chunks = chunks + line
chunks = pd.Series(chunks)
chunks = chunks.value_counts().head(30)
In [20]:
print chunks.index
del kbbi['chunks']
In [21]:
# now that we have a better sense of what tags there are, we can create a function to delete them:
def tag_delete(string):
tags = ['<','b>','/','b>', '/sup>', 'sup','>', '1','<', '2','3',',', '4','<', 'i>', '5', '6', ',<',';', 'A', '7','8','9']
for tag in tags:
string = string.replace(tag,'')
return string
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: tag_delete(x))
In [22]:
kbbi.head(10)
Out[22]:
In [23]:
# it looks like that cleaned things up pretty well, but let's just make sure that we don't still have unusual symbols
# somewhere in the string:
lists = kbbi['string_1'].tolist()
characters = set()
for lst in lists:
for ch in lst:
characters.add(ch)
print characters
In [24]:
# many of these characters should not be part of a phonetic transcription
bad_characters = ['\xc3', '\xa9', ')', '(', ':', 'x','q']
# there are also capital letters which may or may not be parts of actual words; we will return to these shortly
caps = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
# first, however, let's delete the characters which are definitely not part of a syllabified transcription
def bad_ch_delete(string):
bad_characters = ['\xc3', '\xa9', ')', '(', '\xb7', ':', '\xc2', 'x']
for ch in bad_characters:
string = string.replace(ch,'')
return string
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: bad_ch_delete(x))
In [25]:
# now let's look at words with caps
lists = kbbi['string_1'].tolist()
caps_list = []
for i in lists:
for cap in caps:
if cap in i:
caps_list.append(i)
print caps_list
In [26]:
# Lots of these are proper nouns. A few of them do not contain information about syllabification. Let's delete
# those, and convert the others to lowercase.
caps_list = [x for x in caps_list if '!' not in x]
def caps_finder(string):
    for word in caps_list:
        if word in string:
            return True
    return False
kbbi['caps'] = kbbi['string_1'].apply(lambda x: caps_finder(x))
kbbi = kbbi[kbbi['caps'] == False]
del kbbi['caps']
# now we can convert the remaining symbols to lowercase
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: x.lower())
In [27]:
kbbi.head()
Out[27]:
In [28]:
# Back to syllables: We expect that in most cases the number of syllable dividers in string_1 will be equal to the
# number of vowels minus 1, since e.g. a 2-syllable word like 'batu' has a single syllable division.
# This won't always be the case, since Indonesian allows the diphthongs 'ai' and 'au' in final syllables
# (and rarely elsewhere). In these limited cases the number of vowels will exceed the number of syllable
# boundaries by 2. Let's test this out by creating a new column counting boundaries. We can compare the
# value in this column to the value in vowel_count.
kbbi['divider_count'] = kbbi['string_1'].apply(lambda x: x.count('!'))
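# a minimal worked example of the arithmetic (hypothetical strings in the '!' notation, not rows from the data):
for s in ['ba!tu', 'pan!tai']:
    print s, vowel_count(s), s.count('!'), vowel_count(s) - s.count('!')
# 'ba!tu'   -> 2 vowels, 1 divider, difference 1
# 'pan!tai' -> 3 vowels, 1 divider, difference 2 (final diphthong 'ai')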
In [29]:
kbbi.head(10)
Out[29]:
In [30]:
# let's make a separate column counting the difference between the vowel count and syllable boundary count.
# We expect this difference to be equal to 1 in the vast majority of cases, and equal to 2 in a relatively small number
# of cases where a diphthong is present.
kbbi['diff_numV_numB'] = kbbi['vowel_count'] - kbbi['divider_count']
# Let's look at the new column values for words with explicit syllable boundaries marked
kbbi['diff_numV_numB'][kbbi['syllable_divider'] == True].value_counts()
Out[30]:
In [31]:
# The counts roughly match expectations; however, there are several entries where the number of vowels in the
# keyword far exceeds the number of syllable boundaries. I suspect these are cases where the authors of the dictionary
# failed to transcribe a syllable boundary or two. Let's take a look, starting with the entries in which the
# number of vowels in the keyword exceeds the number of syllable boundaries by 3 or more.
syl_words = kbbi[kbbi['syllable_divider']==True]
check_words = syl_words[syl_words['diff_numV_numB'] >= 3]
check_words
Out[31]:
In [32]:
# it looks like several of these entries are just poorly marked. For example, the word 'abaimana' is syllabified
# as abai.ma.na, whereas it should be syllabified as a.bai.ma.na. While there are a few examples where the authors
# of the dictionary neglected to mark a syllable boundary, there are far more examples where the mismatch between
# vowels and syllable boundaries is due to something that can be readily fixed using a string search. Some examples
# are described below.
In [33]:
# For many of the keywords in the dataset, no syllabification is provided. We want to remove any polysyllabic words
# for which information about syllabification is absent; however, there are also many monosyllabic words which
# lack a syllable boundary simply because they contain a single syllable. The vowel counts we did above will help us
# distinguish between monosyllabic words and words which are polysyllabic and merely lack information about
# syllabification. With this in mind, we might delete all keywords which contain more than one vowel and also
# lack a syllable boundary. There is a potential problem with removing all such words: Indonesian contains
# numerous words ending with the vowel sequences 'ai' and 'au'. These sequences are tricky to deal with because,
# when the only vowels in a word are 'ai' or 'au', speakers may disagree about whether they constitute 1 or 2 syllables,
# and, in fact, this may depend on syntactic factors. The words 'kau' ('you') and 'mau' ('want') are both forms
# for which this is the case (i.e. mau ~ ma.u; kau ~ ka.u). To deal with this, let's first get
# rid of keywords which are without question not monosyllabic: words containing 3 or more vowels which
# are not explicitly syllabified.
kbbi = kbbi[(kbbi.syllable_divider==True) | ((kbbi.vowel_count<=2) & (kbbi.syllable_divider==False))]
In [34]:
kbbi[(kbbi.vowel_count==2) & (kbbi.syllable_divider==False)]
Out[34]:
In [35]:
# Keywords containing more than one word: several keywords actually contain two words separated by
# a space. In many of these cases, syllabification is only provided for one of the words. It is almost always the
# case that the words in such keywords independently exist as keywords elsewhere in the dictionary, so by deleting
# them from the database we will not lose much information.
# Likewise, keywords containing a hyphen contain words that, in the vast majority of cases, have their own independent
# entries, and therefore we won't be losing important datapoints by deleting them.
def space_hyphen_finder(string):
if ' ' in string:
return True
elif '-' in string:
return True
elif len(string) == 0:
return True
else:
return False
kbbi['spaces'] = kbbi['string_1'].apply(lambda x: space_hyphen_finder(x))
kbbi = kbbi[kbbi['spaces'] == False]
del kbbi['spaces']
In [36]:
# now, to help us with the next couple of steps, let's add '#' to mark word boundaries. Later on, we will also need
# these boundaries to train our model.
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: '#' + x + '#')
In [37]:
kbbi.head(30)
Out[37]:
In [38]:
# There are plenty of words on this list which contain the sequence vowel-consonant-vowel. Without exception,
# such sequences are syllabified as V.CV, so we can write a function to insert a syllable boundary
# (starting with word-initial VCV sequences).
words = kbbi['string_1'][(kbbi.vowel_count==2) & (kbbi.syllable_divider==False)].tolist()
characters = set()
for word in words:
for ch in word:
characters.add(ch)
print characters
In [39]:
# let's generate all possible word-initial VCV sequences; then we can create a function to replace them with the
# right syllabification
vowels = ['a','e','i','o','u']
consonants = ['c','b','d','g','f','h', 'k', 'j', 'm', 'l', 'n', 'q', 'p', 's', 'r', 't', 'w', 'v', 'y', 'z']
VCVs = []
for v in vowels:
for c in consonants:
for v1 in vowels:
sequence = ('#' + v + c + v1,'#' + v + '!' + c + v1)
VCVs.append(sequence)
def syllabify_VCV(string):
for VCV in VCVs:
string = string.replace(VCV[0],VCV[1])
return string
kbbi['string_1'] = kbbi['string_1'].apply(lambda x: syllabify_VCV(x))
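# The same word-initial V.CV insertion could also be done with a single regular expression. A minimal alternative
# sketch (not used above), assuming the '#' word boundaries and '!' syllable marker introduced earlier:
import re
def syllabify_VCV_re(string):
    # insert '!' between a word-initial vowel and a following consonant-vowel sequence
    return re.sub(r'#([aeiou])([bcdfghjklmnpqrstvwyz])([aeiou])', r'#\1!\2\3', string)
# e.g. syllabify_VCV_re('#abad#') returns '#a!bad#'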
In [40]:
kbbi.head()
Out[40]:
In [41]:
# Words with ..ia.. (e.g. 'widia'): this sequence, which occurs in many borrowings, is syllabified as i.a or i.ya
# for many speakers. I will treat all instances of this vowel sequence as belonging to separate syllables.
#def insert_dot_ia(string):
# string = string.replace('ia','i!a')
# return string
#kbbi['string_1'][] = kbbi['string_1'].apply(lambda x: insert_dot_ia(x))
In [42]:
# Another orthographic peculiarity of Indonesian is that, in limited cases, the glides 'w' and 'y' occur as the second
# segment of a sequence following a consonant (e.g. Widya). In some cases, the pronunciation of the glide is actually
# [i.y] rather than [y]. I'm going to try to correct or remove any such examples with the help of a native speaker.
# First I'll generate a list of possible C+glide sequences
CGl = []
for c in consonants:
cluster_y = c + 'y'
cluster_w = c + 'w'
CGl.append(cluster_y)
CGl.append(cluster_w)
# now, I'll check to see which of these is attested in our wordlist
gl_clusters = set()
for word in kbbi['string_1']:
for cg in CGl:
if cg in word:
gl_clusters.add(cg)
gl_clusters
Out[42]:
In [43]:
# Some sequences represent individual sounds that are written with two letters ('digraphs'). These include [sy]
# -- which is typically pronounced as [s], but may also be pronounced as a palatal -- and [sw].
# I will convert this sequence to a single symbol later on. Likewise, [ny] is actually a single sound -- the palatal
# nasal stop -- and I will also convert it to a single symbol below. Let's take a closer look at the remaining sounds:
C_glide = ['by', 'dw', 'dy', 'fy', 'hw', 'hy', 'kw', 'py', 'wy']
for word in kbbi['string_1']:
for Cgl in C_glide:
if Cgl in word:
print word
In [44]:
# There are several word-final consonant sequences that are written in Indonesian but not fully pronounced: in
# actual speech, either a consonant is deleted or a vowel is inserted to break up the cluster.
# I want to double check that the authors of the KBBI transcribed these correctly.
final_CC = []
for C1 in consonants:
for C2 in consonants:
CC = C1 + C2 + '#'
final_CC.append(CC)
CC_clusters = []
for word in kbbi['string_1']:
for cc in final_CC:
if cc in word and cc != 'ng#': # 'ng' is a single sound written as a digraph
CC_clusters.append(cc)
print word
In [45]:
# Many of these clusters are only orthographic and are reduced to a single consonant in pronunciation. For the time
# being I am going to delete all rows with these CCs, except for some of the most common ones, for which I have a good
# sense of the pronunciation. At some future point, I plan to edit these based on actual pronunciation.
CC_clusters = pd.Series(CC_clusters)
CC_clusters.value_counts()
Out[45]:
In [46]:
# deleting final clusters (these can be tweaked later)
final_CCs_to_omit = [u'ks#', u'ns#', u'rm#', u'kh#', u'rs#', u'lt#', u'ps#', u'rf#', u'rn#',
u'lm#', u'nk#', u'rt#', u'rk#', u'lk#', u'tt#', u'st#', u'ls#', u'hm#',
u'tz#', u'sy#', u'rd#', u'nt#', u'rg#', u'ts#', u'lp#', u'mp#', u'rp#',
u'rb#', u'sk#', u'ny#', u'lf#', u'ln#', u'ft#']
def delete_final_CC(string):
    for cc in final_CCs_to_omit:
        if cc in string:
            return True
    return False
kbbi['final_CC'] = kbbi['string_1'].apply(lambda x: delete_final_CC(x))
kbbi = kbbi[kbbi['final_CC'] == False]
In [47]:
# Another set of words I would like to take a closer look at are those with non-native diphthongs like 'eu'. Let's
# start with a quick search to see what vowel-vowel sequences occur in the data, and check for any unusual clusters.
VVs = []
for V1 in vowels:
for V2 in vowels:
non_hiatus_VV = V1 + V2
VVs.append(non_hiatus_VV)
VVs_attested = set()
VV_words = []
for word in kbbi['string_1']:
for vv in VVs:
if vv in word and vv != 'ai' and vv != 'au': # 'ai' and 'au' are common syllable rimes, so we don't suspect
# that they have been mistranscribed
VVs_attested.add(vv)
VV_words.append(word)
In [48]:
## attested vvs
print VVs_attested
In [49]:
VV_words[:10]
Out[49]:
In [50]:
### all of the words with these clusters are rarely used terms. For the time being, I am going to delete these terms.
def delete_borrowing(string):
    for word in VV_words:
        if string == word:
            return True
    return False
kbbi['strange_VV'] = kbbi['string_1'].apply(lambda x: delete_borrowing(x))
kbbi = kbbi[kbbi['strange_VV']== False]
del kbbi['strange_VV']
del kbbi['final_CC']
In [51]:
kbbi.head(50)
Out[51]:
In [52]:
# The data seems pretty clean now, so let's try to extract features. We want to set up a classification model
# which tells us, for any given position in a word, whether there is a syllable boundary in that position. The
# way I am going to do this is by looking at adjacent sounds within a certain window size. Here are two functions
# to accomplish this:
# first we need a function to change digraphs (like the palatal nasal 'ny') to a single character
def prepare_string(word):
word = word.lower()
word = word.replace('ng','N')
word = word.replace('ny','Y')
word = word.replace('sy','s')
word = word.replace('kh','k')
word = word.replace('tj','c')
word = word.replace('dj','j')
return(word)
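# hypothetical check (not part of the original notebook): prepare_string('#nyanyi#') returns '#YaYi#',
# with both 'ny' digraphs collapsed to the single symbol 'Y'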
# then we need a function to create windows of a designated size, together with information about whether the window
# contains a syllable boundary in its center position (e.g. between the second and third segment if the left margin = 2
# and the right margin = 3)
def window_gen(strings,left_margin,right_margin):
windows = []
for string in strings:
string = prepare_string(string)
char_ind = []
syl_ind = []
for indx,ch in enumerate(string):
if ch == '!':
syl_ind.append(indx)
else:
char_ind.append(indx)
for i,j in enumerate(char_ind):
left = range((i-left_margin),i)
right = range(i,(right_margin+i))
window_range = left + right
try:
window_index = [char_ind[z] for z in window_range]
window = ''.join([str(string[p]) for p in window_index])
is_boundary = bool([n for n in syl_ind if n > char_ind[(i-1)] and n < char_ind[(i)]])
windows.append((window,is_boundary))
except:
pass
return windows
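# a quick check of the output format on a single hypothetical string (note that the first pair wraps around to the
# final '#' because of Python's negative indexing):
print window_gen(['#ba!tu#'], 1, 1)
# [('##', False), ('#b', False), ('ba', False), ('at', True), ('tu', False), ('u#', False)]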
In [53]:
# using window_gen, we can generate a number of new datasets (lists of (window, boundary) pairs) with various window sizes
strings = kbbi['string_1'].tolist()
win_1_0 = window_gen(strings,1,0)
win_0_1 = window_gen(strings,0,1)
win_1_1 = window_gen(strings,1,1)
win_1_2 = window_gen(strings,1,2)
win_2_1 = window_gen(strings,2,1)
win_2_2 = window_gen(strings,2,2)
win_2_3 = window_gen(strings,2,3)
win_3_2 = window_gen(strings,3,2)
win_3_3 = window_gen(strings,3,3)
In [54]:
# let's create a data dictionary, keyed by window size, splitting each list of (window, is_boundary)
# tuples into a list of windows and a list of boundary labels
data = {}
window_sets = {'win10': win_1_0, 'win01': win_0_1, 'win11': win_1_1,
               'win12': win_1_2, 'win21': win_2_1, 'win22': win_2_2,
               'win23': win_2_3, 'win32': win_3_2, 'win33': win_3_3}
for name, window_list in window_sets.items():
    data[name + '_windows'] = [w for w, b in window_list]
    data[name + '_boundaries'] = [b for w, b in window_list]
In [55]:
# These individual functions decompose sounds into phonological features; we will apply them within the function
# defined below:
def sonorant(character):
if character in ['a','e','i','o','u','y','w','m','N','Y','l','r','h','q']:
return 1
else:
return 0
def continuant(character):
if character in ['s','z','f']:
return 1
else:
return 0
def consonant(character):
if character in ['p','t','k','q','h','c','b','d','g','j','s','z','f','m','n','Y','N','l','r']:
return 1
else:
return 0
def strident(character):
if character in ['s','j','c','z']:
return 1
else:
return 0
###populate place: labial, coronal, palatal, velar, glottal
def labial(character):
if character in ['p','m','f','b','w','u','o']:
return 1
else:
return 0
def coronal(character):
if character in ['t','d','n','s','j','c','Y','i','e','r','l','z']:
return 1
else:
return 0
def palatal(character):
if character in ['s','j','c','i','Y','e']:
return 1
else:
return 0
def velar(character):
if character in ['u','k','g','N','o']:
return 1
else:
return 0
def glottal(character):
if character in ['h','q']:
return 1
else:
return 0
###nasality
def nasal(character):
if character in ['m','n','Y','N']:
return 1
else:
return 0
###populate obstruent voicing
###I assume that [voice] is only phonologically active in obstruents
def obs_voice(character):
if character in ['b','d','g','j','z']:
return 1
else:
return 0
### populate lateral/rhotic
def lateral(character):
if character == 'l':
return 1
else:
return 0
def rhotic(character):
if character == 'r':
return 1
else:
return 0
###populate vowel height
###I assume that mid is not an active feature
def high(character):
if character in ['i','u']:
return 1
else:
return 0
def low(character):
if character == 'a':
return 1
else:
return 0
def word_boundary(character):
if character == '#':
return 1
else:
return 0
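# a small illustration (hypothetical check, not in the original): build the full feature vector for a single
# character, here the palatal nasal 'Y', by applying each of the functions above
feature_funcs = [sonorant, continuant, consonant, strident, labial, coronal, palatal, velar,
                 glottal, nasal, obs_voice, lateral, rhotic, high, low, word_boundary]
print [(f.__name__, f('Y')) for f in feature_funcs]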
In [550]:
# now we need a function that decomposes each window into phonological features and assembles a feature matrix
def matrix_creater(data, left_window, right_window):
name_w = 'win'+str(left_window)+str(right_window)+'_windows'
name_b = 'win'+str(left_window)+str(right_window)+'_boundaries'
string_list = data[name_w]
prediction_list = data[name_b]
ch_count = left_window + right_window
df = pd.DataFrame(string_list,columns=['string'])
for num in range(0,ch_count):
df['sonorant'+str(num)] = df['string'].apply(lambda x: sonorant(x[num]))
df['continuant'+str(num)] = df['string'].apply(lambda x: continuant(x[num]))
df['consonant'+str(num)] = df['string'].apply(lambda x: consonant(x[num]))
df['strident'+str(num)] = df['string'].apply(lambda x: strident(x[num]))
df['labial'+str(num)] = df['string'].apply(lambda x: labial(x[num]))
df['coronal'+str(num)] = df['string'].apply(lambda x: coronal(x[num]))
df['palatal'+str(num)] = df['string'].apply(lambda x: palatal(x[num]))
df['velar'+str(num)] = df['string'].apply(lambda x: velar(x[num]))
df['glottal'+str(num)] = df['string'].apply(lambda x: glottal(x[num]))
df['nasal'+str(num)] = df['string'].apply(lambda x: nasal(x[num]))
df['obs_voice'+str(num)] = df['string'].apply(lambda x: obs_voice(x[num]))
df['lateral'+str(num)] = df['string'].apply(lambda x: lateral(x[num]))
df['rhotic'+str(num)] = df['string'].apply(lambda x: rhotic(x[num]))
df['high'+str(num)] = df['string'].apply(lambda x: high(x[num]))
df['low'+str(num)] = df['string'].apply(lambda x: low(x[num]))
df['word_boundary'+str(num)] = df['string'].apply(lambda x: word_boundary(x[num]))
df1 = pd.DataFrame(data=prediction_list, columns=['pred'])
df = pd.concat([df,df1],axis=1)
df.drop_duplicates(subset='string', inplace=True)
return df
In [604]:
#df01 = matrix_creater(data,0,1)
#df10 = matrix_creater(data,1,0)
df11 = matrix_creater(data,1,1)
df12 = matrix_creater(data,1,2)
#df21 = matrix_creater(data,2,1)
df22 = matrix_creater(data,2,2)
#df23 = matrix_creater(data,2,3)
#df32 = matrix_creater(data,3,2)
df33 = matrix_creater(data,3,3)
In [560]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# cross validation
from sklearn.cross_validation import train_test_split, KFold, cross_val_score
# models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
# evaluation
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
In [607]:
y = df12['pred']
X = df12.drop('pred',axis=1)
In [574]:
words = X['string']
del X['string']
In [575]:
# let's see what our baseline is
float(y.value_counts()[0])/(float(y.value_counts()[0]) + float(y.value_counts()[1]))
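# note (not in the original): since y is a boolean series, the fraction of positions without a boundary
# can equivalently be computed as (~y).mean()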
Out[575]:
In [564]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=30)
In [536]:
def evaluate_model(model):
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cnf_mtx = confusion_matrix(y_test, y_pred)
acc_scr = accuracy_score(y_test, y_pred)
cls_rep = classification_report(y_test, y_pred)
print cnf_mtx
print cls_rep
print "Accuracy Score: ", acc_scr
print "*****************************"
return acc_scr
In [537]:
# global dictionary of models
models = {}
In [ ]:
# I run a number of models below. I have not added many comments. In general, none of the models significantly
# outperform the Decision Tree Classifier. This is illustrated by the examples below.
# Given that the goal of this project is to discover the simplest
# model that accounts for syllabification, I will go ahead and use the decision tree classifier.
In [538]:
max_depths = [n for n in range(1,30)]
criteria = ['gini', 'entropy']
for max_depth in max_depths:
for criterion in criteria:
print "Max Depth: ", max_depth
print "Criterion: ", criterion
evaluate_model(DecisionTreeClassifier(criterion=criterion, max_depth=max_depth))
In [539]:
C = [.01,.1,1,10,100]
for c in C:
print "C: ", c
evaluate_model(LogisticRegression(C=c))
In [506]:
n_values = [n for n in range(1,20,2)]
for n_value in n_values:
print "Number of Neighbors: ", n_value
evaluate_model(KNeighborsClassifier(n_neighbors=n_value))
In [516]:
max_depth = [n for n in range(1,20,2)]
n_estimators = [n for n in range(10,300,25)]
for max_n in max_depth:
for n_estimator in n_estimators:
print "Max Tree Depth: ", max_n
print "Number of Trees: ", n_estimator
evaluate_model(RandomForestClassifier(max_depth=max_n,n_estimators=n_estimator))
In [517]:
from sklearn.ensemble import AdaBoostClassifier
In [522]:
max_depth = [n for n in range(2,16,2)]
for depth in max_depth:
print "Depth: ", depth
AdaBoostClassifier(RandomForestClassifier(max_depth = 13), n_estimators=185)
evaluate_model(AdaBoostClassifier(RandomForestClassifier(max_depth = depth), n_estimators=185))
In [525]:
max_depth = [n for n in range(2,16,2)]
for depth in max_depth:
    print "Depth: ", depth
    evaluate_model(ExtraTreesClassifier(max_depth=depth, n_estimators=185))
In [529]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
C = [.01,.1,1,10,100]
for kernel in kernels:
for c in C:
print "C penalty: ", c
print "Kernel: ", kernel
evaluate_model(SVC(C=c,kernel=kernel))
In [540]:
DTC = DecisionTreeClassifier(max_depth=6)
model = DTC.fit(X_train,y_train)
y_pred = model.predict(X_test)
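# Since the aim is to discover interpretable syllabification rules, it can help to inspect which phonological
# features the fitted tree relies on. A hedged sketch (not part of the original analysis) using scikit-learn's
# feature_importances_ attribute:
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print importances.head(10)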
In [541]:
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
X_df = pd.concat([words,y_test,y_pred],axis=1)
In [ ]:
In [542]:
X_df.columns = ['string','actual','predicted']
In [543]:
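# 'actual' and 'predicted' are booleans, so their sum is 2 or 0 when they agree and 1 when they disagree;
# the map in the next cell converts this into a 1/0 'Correct' flag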
X_df['Correct'] = X_df['actual'] + X_df['predicted']
In [544]:
X_df['Correct'] = X_df['Correct'].map({2:1,1:0,0:1})
In [545]:
X_df
Out[545]:
In [413]:
bad_predictions = X_df[X_df['Correct']==0]
In [414]:
bad_predictions.index
Out[414]:
In [415]:
# Let's create a dataset consisting of only incorrectly predicted data
y = y_test[y_test.index.isin(bad_predictions.index)]
X = X_test[X_test.index.isin(bad_predictions.index)]
In [416]:
#What's our baseline?
float(y[0].value_counts()[0])/(float(y[0].value_counts()[0]) + float(y[0].value_counts()[1]))
Out[416]:
In [417]:
X_train, X_test, y_train, y_test = train_test_split(X, y[0], test_size=0.33, random_state=30)
In [ ]:
In [602]:
# now let's run the data through a decision tree classifier model
clf = DecisionTreeClassifier(random_state=30, max_depth=3)
cross_val = cross_val_score(clf, X_train, y_train, cv=2)
In [603]:
evaluate_model(clf)
Out[603]:
In [420]:
print classification_report(y_test,y_pred)
In [421]:
from sklearn.tree import export_graphviz
with open('tree.dot', 'w') as dotfile:
    # write the fitted tree's decision structure to a dot file
    export_graphviz(decision_tree=model, out_file=dotfile, feature_names=X.columns)
import graphviz
with open('tree.dot') as f:
    # read back the dot file we just created
    dot_graph = f.read()
graphviz.Source(dot_graph)
# the graphviz equivalent of plt.show()
Out[421]:
In [ ]:
In [371]:
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
X_df = pd.concat([words,y_test,y_pred],axis=1)
In [372]:
X_df.columns = ['string','actual','predicted']
In [373]:
X_df['Correct'] = X_df['actual'] + X_df['predicted']
In [374]:
X_df['Correct'] = X_df['Correct'].map({2:1,1:0,0:1})
In [375]:
bad_predictions = X_df[X_df['Correct']==0]
In [376]:
bad_predictions
Out[376]: