Place names (or 'toponyms') can reveal a great deal about the history of a geographic region. They stick around even as locals adopt new languages or populations are forced out by famine or conflict. Pick a region anywhere in the world at random and you'll likely find a smattering of place names originating from dead or moribund languages. In the Americas, this is likely to be a Na-Dené, Algic, or Uto-Aztecan language. Similarly, Western Europe is full of place names of prehistoric Celtic origin. In some cases, names are the only remaining evidence that a particular speech community once existed in a certain place, and for this reason, place names can be very useful for understanding the history of regions whose linguistic prehistory remains unclear.
In this post, I'm going to focus on a region of the world whose linguistic prehistory remains murky. Sumatra is a large island located in the far west of the Indonesian Archipelago. Local languages throughout much of the island belong to the Malayic branch of the Austronesian language family, a family spoken across an enormous region spanning from Madagascar off the east coast of Africa to Rapa Nui (Easter Island) off the coast of Chile. The Malayic languages have over 200 million speakers and include the national languages of Malaysia, Indonesia, and Brunei. Sumatra (and more specifically, the Jambi/Palembang region, which was likely the center of Sriwijaya, a Buddhist kingdom whose influence and power extended throughout Southeast Asia) is recognized by Indonesianists as the Malay homeland--the region where the Malay language first flourished and spread out to become an influential trade language.
Despite its prevalence in Sumatra over the past 1,000+ years, the Malay language is a relative newcomer to the island. Scholars estimate that Malay speakers began to settle in Sumatra in relatively large numbers (very) roughly two millennia ago, and that these speakers were preceded by an earlier wave of languages also belonging to the Austronesian family. However, even by generous estimates of how early Austronesian languages arrived on the island, it is clear from archaeological evidence that human populations inhabited Sumatra for thousands of years before Austronesian languages departed their homeland of Formosa (Taiwan), let alone spread to Southeast Asia.
In light of this history, we might expect to find place names in Sumatra originating from pre-Malay (or even pre-Austronesian) times. To find out, I have put together an abridged database of Sumatran place names using data from the GEONet Names Server (GNS), a database sanctioned by the U.S. Board on Geographic Names. The technique I employ to find relic place names involves filtering words of known origin from this database, then examining the remaining place names to see if they provide any hints as to the linguistic identity of earlier populations. I filter out place names of known origin through a matching algorithm, using a wordlist I compiled based on the KBBI (Kamus Besar Bahasa Indonesia), an authoritative Indonesian dictionary (Indonesian, itself a variety of Malay, will serve as a stand-in for Malay). Since there is considerable variation in the spelling and pronunciation of place names, I will use fuzzy matching algorithms to identify names of known origin. The three packages I will use are fuzzywuzzy, difflib, and pandas.
In [2]:
from fuzzywuzzy import fuzz, StringMatcher
import difflib
import pandas as pd
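Before diving into the data, here is a quick taste of what fuzzy matching does, using the standard library's difflib (the toy wordlist here is mine, for illustration only): a variant spelling still retrieves its modern dictionary form even though the strings don't match exactly.

```python
import difflib

# Toy wordlist standing in for the real dictionary data (illustrative only):
# a hypothetical variant spelling still finds its modern form.
wordlist = ['palembang', 'padang', 'medan']
difflib.get_close_matches('palimbang', wordlist, n=1, cutoff=0.6)
# → ['palembang']
```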
In [3]:
### loading partially cleaned csvs containing place names and dictionary
kbbi = pd.read_csv("/Users/admin/Desktop/loanwords/clean.kbbi.csv")
places = pd.read_csv("/Users/admin/Desktop/loanwords/concat_places.csv")
kbbi.columns = ['old_index', 'words']
The place names database contains full names of locations, a location index which can be used to locate each place on a map, and a column containing information regarding the type of geographic feature. Notice that many of the place names contain two words. I want to check each of these words individually for matches in the dictionary, so I have split rows whose place names contain more than one word. For example, the place 'Laut Seram' corresponds to two separate rows: one for 'Laut' and the other for 'Seram'. The database contains a total of 438,600 rows.
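The splitting step itself can be sketched in pandas along these lines (a minimal sketch using a toy frame; the column names are illustrative, not the actual GNS schema):

```python
import pandas as pd

# Toy frame standing in for the GNS data (column names are illustrative)
toy = pd.DataFrame({'full_name': ['Laut Seram', 'Medan'],
                    'feature_type': ['SEA', 'PPL']})

# One row per word: split each name on whitespace, then explode the lists
split = toy.assign(split_word=toy.full_name.str.split()).explode('split_word')
# split.split_word now contains 'Laut', 'Seram', 'Medan'
```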
In [4]:
places.head()
Out[4]:
The dictionary database comprises a list of words taken from the KBBI along with the original dictionary indexes. The database contains 50,000+ entries.
In [5]:
kbbi.head()
Out[5]:
Unlike English, the Indonesian spelling system is almost perfectly phonemic, i.e. in most cases there is a one-to-one relationship between a letter and the sound it represents. That said, there are a few irregularities in the orthography. To rid the data of as many of these irregularities as possible, I created the following function:
In [6]:
def prepare_string(word):
    word = word.lower()
    ### collapse digraphs that represent single sounds
    word = word.replace('ng', 'N')
    word = word.replace('ny', 'Y')
    word = word.replace('sy', 's')
    word = word.replace('kh', 'k')
    ### normalize old Dutch-era spellings to their modern equivalents
    word = word.replace('oe', 'u')
    word = word.replace('tj', 'c')
    word = word.replace('dj', 'j')
    return word
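As a quick sanity check, the function maps an old colonial-era spelling and its modern equivalent to the same normalized string (the function is repeated here so the snippet runs on its own):

```python
def prepare_string(word):  # repeated from above so this snippet is self-contained
    word = word.lower()
    word = word.replace('ng', 'N')
    word = word.replace('ny', 'Y')
    word = word.replace('sy', 's')
    word = word.replace('kh', 'k')
    word = word.replace('oe', 'u')
    word = word.replace('tj', 'c')
    word = word.replace('dj', 'j')
    return word

# Old Dutch-era spelling vs. modern spelling of 'tanjung' ('cape'):
prepare_string('Tandjoeng')  # → 'tanjuN'
prepare_string('Tanjung')    # → 'tanjuN'
```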
In [7]:
kbbi['prepared_entries'] = kbbi.words.apply(lambda x: prepare_string(str(x)))
places['prepared_places'] = places.split_word.apply(lambda x: prepare_string(str(x)))
The code above stores the cleaned strings as new columns in the dictionary and place name databases.
In [8]:
kbbi.head()
Out[8]:
In [9]:
places.head()
Out[9]:
In [10]:
def find_matches(place, wordlist_series):
    ### True if the place string occurs in at least one dictionary entry
    ### (regex=False so the string is matched literally, not as a pattern)
    bool_series = wordlist_series.str.contains(place, regex=False)
    return bool_series.any()
I use this function in conjunction with the .apply() method in pandas to produce a boolean for each place name, stored as a column in the place name database. Because of the large size of the database, running this function over all of it could take up to an hour. To demonstrate how the function works, I will apply it to a subset of the data, 'sample_df', containing the first 1,000 rows.
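To see the idea on a toy example, here is a simplified stand-alone version of the matching step (the toy dictionary is mine, for illustration):

```python
import pandas as pd

# Simplified stand-alone version of the matching step:
# True when the word occurs in at least one dictionary entry.
def has_match(place, wordlist_series):
    return wordlist_series.str.contains(place, regex=False).any()

toy_dict = pd.Series(['batu', 'gunung', 'sungai'])  # 'stone', 'mountain', 'river'
has_match('batu', toy_dict)  # → True
has_match('xyz', toy_dict)   # → False
```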
In [14]:
sample_df = places.loc[:999, :].copy()
sample_df['matches'] = sample_df.prepared_places.apply(lambda x: find_matches(x, kbbi.prepared_entries))
In [15]:
sample_df.head()
Out[15]:
Using the boolean values in the 'matches' column, I will filter out place names with a match in the dictionary.
In [16]:
sample_df = sample_df[sample_df.matches == False]
In [17]:
sample_df.head()
Out[17]:
Although we have gotten rid of all place names with matches in the dictionary, there are still plenty of place names which represent variants of items in the dictionary (i.e. words which are historically the same word, but are now spelled and/or pronounced differently). We want to get rid of these words too. The method I am going to use is called 'fuzzy matching'. I will use the fuzz.ratio function from the fuzzywuzzy package, which calculates a percentage similarity score, with 100 being a perfect match. This score is based on Levenshtein distance, which quantifies the distance between two words a and b as the number of deletions, substitutions, or insertions it would take to transform a into b.
Before applying this function, however, there is an important problem with the function that needs to be addressed. Levenshtein distance does not take the phonological similarity between two sounds into account. To illustrate this, consider two pairs of words:
Pair 1: bat, pat
Pair 2: bat, sat
In [19]:
fuzz.ratio('bat','pat')
Out[19]:
In [20]:
fuzz.ratio('bat','sat')
Out[20]:
In terms of Levenshtein distance, both pairs of words earn the same score. This is because, in both cases, the first item in the pair can be transformed into the second by substituting its first letter (b for p in Pair 1 and b for s in Pair 2). In terms of actual phonological distance, however, the words in Pair 1 are much closer to one another than the words in Pair 2. This is because the sound [b] is nearly identical to [p] in terms of both the physical gestures involved in its pronunciation and its acoustic properties. In both of these respects, [b] and [s] are quite different from one another.
Drawing this distinction might seem pedantic, but phonological distance is important in the context of discovering pairs of words which are variants of one another, or related historically. Changes in pronunciation that occur over time are gradual. For example, cases where [p] becomes [b] are extremely common across the world's languages, whereas [b] to [s] is an extremely rare if not unattested change. Since we are trying to establish whether place names which lack perfect matches in the dictionary are nevertheless etymologically related to words in the dictionary, it is important that we quantify phonological distance. To accomplish this, I have created a function which converts each sound into a matrix (implemented as a dictionary) of its component phonological features, i.e. where in the vocal tract it is pronounced, whether it is nasal, etc. (For more information about phonological features, see Wikipedia's page: https://en.wikipedia.org/wiki/Distinctive_feature)
In [21]:
def phono_matrix(string):
    ### characters bearing each phonological feature
    ### (n.b. I assume that [voice] is only phonologically active in
    ### obstruents, and that mid is not an active vowel-height feature)
    feature_sets = {
        ### manner features
        'sonorant':   set('aeiouywmnNYlrhq'),
        'continuant': set('lrywaeiouszf'),
        'consonant':  set('ptkqhcbdgjszfmnYNlr'),
        'syllabic':   set('aeiou'),
        'strident':   set('sjc'),
        ### place features: labial, coronal, palatal, velar, glottal
        'labial':     set('pmfbwuo'),
        'coronal':    set('tdnsjcYierl'),
        'palatal':    set('sjciYe'),
        'velar':      set('ukgNo'),
        'glottal':    set('hq'),
        ### nasality
        'nasal':      set('mnYN'),
        ### obstruent voicing
        'voice':      set('bdgj'),
        ### lateral/rhotic
        'lateral':    set('l'),
        'rhotic':     set('r'),
        ### vowel height
        'high':       set('iu'),
        'low':        set('a'),
    }
    string_of_matrixes = []
    for character in string:
        ### build one feature matrix (dictionary) per sound
        matrix = {feature: ('Y' if character in chars else 'N')
                  for feature, chars in feature_sets.items()}
        string_of_matrixes.append(matrix)
    return string_of_matrixes
In [22]:
kbbi['matrixes'] = kbbi.prepared_entries.apply(lambda x: phono_matrix(x))
sample_df['matrixes'] = sample_df.prepared_places.apply(lambda x: phono_matrix(x))
In [23]:
kbbi.matrixes.head()
Out[23]:
In [24]:
sample_df.matrixes.head()
Out[24]:
Now that each sound is represented by its features, we're almost at the point where we can compare sounds on a feature-by-feature basis to get a much more accurate measure of their phonological distance. To allow this comparison, I have created the following function, which takes a string of matrixes corresponding to individual sounds and generates multiple tiers, each corresponding to a single binary feature ('Y' for 'has feature' and 'N' for 'lacks feature'). Consider the tier for the feature [consonant] in the case of Pair 1 above: this tier will be identical for the words 'pat' and 'bat', and in both cases it will take the form 'YNY' (i.e. p/b = consonant, a = vowel, t = consonant).
In [26]:
def tier_builder(string_of_matrixes):
    features = ['sonorant', 'consonant', 'continuant', 'syllabic', 'strident',
                'labial', 'coronal', 'palatal', 'velar', 'glottal', 'nasal',
                'voice', 'lateral', 'rhotic']
    tier_dictionary = {}
    for feature in features:
        tier_dictionary[feature] = str()
        for matrix in string_of_matrixes:
            if matrix[feature] is not None:
                tier_dictionary[feature] = tier_dictionary[feature] + matrix[feature]
    return tier_dictionary
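To make the tier idea concrete, here is a stripped-down stand-alone illustration using just one feature, [consonant] (the real functions above do the same thing across the full feature set):

```python
# One-feature illustration of tier building: map each letter to
# 'Y' (consonant) or 'N' (vowel).
def consonant_tier(word):
    vowels = set('aeiou')
    return ''.join('N' if ch in vowels else 'Y' for ch in word)

consonant_tier('pat')  # → 'YNY'
consonant_tier('bat')  # → 'YNY' (identical tier, as described above)
```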
In [27]:
kbbi['tiers'] = kbbi.matrixes.apply(lambda x: tier_builder(x))
sample_df['tiers'] = sample_df.matrixes.apply(lambda x: tier_builder(x))
Let's take a closer look at the first entry in the dictionary and how it is represented in terms of tiers of phonological features:
In [30]:
kbbi.prepared_entries[0]
Out[30]:
In [31]:
kbbi.tiers[0]
Out[31]:
Now that we have complete phonological representations for each place name and dictionary item, we can apply the following function to calculate the phonological similarity between each place name and each dictionary item. The function works on a feature-by-feature basis, comparing the tiers we generated above using fuzzy matching: a similarity value is calculated for each feature tier, then the average across all tiers is returned, giving us a global measure of the phonological similarity between two words.
In [32]:
def similarity(word1_tiers, word2_tiers):
    features = ['sonorant', 'continuant', 'syllabic', 'strident', 'labial',
                'coronal', 'palatal', 'velar', 'glottal', 'nasal', 'voice',
                'lateral', 'rhotic']
    tier_similarity = {}
    for feature in features:
        tier_similarity[feature] = fuzz.ratio(word1_tiers[feature], word2_tiers[feature])
    tier_similarity = pd.Series(tier_similarity)
    return tier_similarity.mean()
Let's test this out with the word pairs 'bat'/'pat' and 'bat'/'sat'. Remember that these pairs received the same Levenshtein-based score before. If we convert them from strings into phonological tiers using the functions presented above, we expect that 'pat'/'bat' will get a higher similarity score than 'bat'/'sat'.
In [33]:
### Building feature matrixes:
pat = phono_matrix('pat')
pat = tier_builder(pat)
bat = phono_matrix('bat')
bat = tier_builder(bat)
sat = phono_matrix('sat')
sat = tier_builder(sat)
In [35]:
similarity(pat,bat)
Out[35]:
In [36]:
similarity(bat,sat)
Out[36]:
Now that we know that our similarity function works, we can return to the task of identifying place names that have near matches in the dictionary and are thus likely to be of known etymological origin. I have developed a function which scores the similarity of every word in the dictionary against a single place name, then returns the most phonologically similar dictionary word along with its similarity score. At this point, however, the function takes a very long time to execute over the entire list of place names.
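That scoring function isn't shown here, but a sketch of what such a best-match search might look like is below (the function name and toy wordlist are mine, and for brevity it compares raw strings with the standard library's difflib rather than the feature tiers used by similarity()):

```python
import pandas as pd
from difflib import SequenceMatcher

# Hypothetical sketch: return the closest dictionary entry and its score
# (0-100). A real version would score tier_builder(phono_matrix(...))
# output with the similarity() function instead of comparing raw strings.
def best_match(place, entries):
    scores = entries.apply(lambda w: SequenceMatcher(None, place, w).ratio() * 100)
    best = scores.idxmax()
    return entries[best], scores[best]

toy_dict = pd.Series(['batu', 'gunung', 'sungai'])
best_match('batoe', toy_dict)  # the old spelling 'batoe' matches 'batu' best
```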
Once I have similarity values for the entire place name database, I will be able to find those words which are least likely to have originated from Malay/Indonesian. I will then investigate the origins of these names. For any given place name X from an unknown source, there are two basic hypotheses regarding its origins:
Hypothesis 1: X was coined and thus does not originate from another language
Hypothesis 2: X originates from another language
These hypotheses predict different distributional properties: in the case of coinages, we predict that identical names will rarely arise independently in different places. On the other hand, names drawn from a language's ordinary vocabulary, particularly geographic terms (e.g. English 'river', 'mountain', 'hill'), should recur frequently and in diverse locations.
Bringing it all together, we are looking for words which both 1. have the lowest scores in terms of their nearest match with a dictionary entry (meaning they are highly unlikely to have a Malay/Indonesian etymological source); and 2. have high similarity scores with other words of unknown origin (since this would indicate a distribution in line with Hypothesis 2 above). We can calculate a score for 2 using the following function. (n.b. I haven't tested this function yet.)
In [ ]:
def find_partners(wordlist):
    ### similarity() expects feature tiers rather than raw strings,
    ### so convert each word once up front
    tiers = {word: tier_builder(phono_matrix(word)) for word in wordlist}
    ranking_dict = {}
    for item in wordlist:
        for item2 in wordlist:
            if item != item2:  ### skip trivial self-matches
                ranking_dict[item, item2] = similarity(tiers[item], tiers[item2])
    ranking_dict = pd.Series(ranking_dict)
    return ranking_dict.sort_values(ascending=False)
Once I have found words that exhibit this distribution, I can compare these words to other languages spoken in adjacent regions (using the techniques outlined above) to determine whether there are similarities that would suggest a non-Malay/Indonesian etymological source.