A Big Data Search for Prehistoric Placenames

Place names (or 'toponyms') can reveal a great deal about the history of a geographic region. They stick around even as locals adopt new languages or populations are forced out by famine or conflict. Pick a region anywhere in the world at random and you'll likely find a smattering of place names originating from a dead or moribund language. In the Americas this is likely to be a Na-Dené, Algic, or Uto-Aztecan language. Similarly, Western Europe is full of place names of prehistoric Celtic origin. In some cases, names are the only remaining evidence that a particular speech community once existed in a certain place, and for this reason place names can be very useful for understanding the history of regions whose linguistic prehistory is still unclear.

In this post, I'm going to focus on a region of the world whose linguistic prehistory remains murky. Sumatra is a large island located in the far west of the Indonesian Archipelago. Local languages throughout much of the island belong to the Malayic branch of the Austronesian language family, a family spoken across an enormous region stretching from Madagascar off the east coast of Africa to Rapa Nui (Easter Island) off the coast of Chile. The Malayic languages have over 200 million speakers and include the national languages of Malaysia, Indonesia and Brunei. Sumatra (and more specifically the Jambi/Palembang region, which was likely the center of Sriwijaya, a Buddhist kingdom whose influence and power extended throughout South East Asia) is recognized by Indonesianists as the Malay homeland--the region where the Malay language first flourished before spreading out to become an influential trade language.

Despite its prevalence in Sumatra over the past 1000+ years, the Malay language is a relative newcomer to the island. Scholars estimate that Malay speakers began to settle in Sumatra in relatively large numbers (very) roughly two millennia ago, and that they were preceded by an earlier wave of languages also belonging to the Austronesian family. However, even by the most generous estimates of how early Austronesian languages arrived on the island, it is clear from archeological evidence that human populations had inhabited Sumatra for thousands of years before Austronesian languages departed their homeland of Formosa (Taiwan), let alone spread to South East Asia.

In light of this history, we might expect to find place names in Sumatra originating from pre-Malay (or even pre-Austronesian) times. To find out, I have put together an abridged database of Sumatran place names using data from the GEONet Names Server (GNS), a database sanctioned by the U.S. Board on Geographic Names. The technique I employ to find relic place names involves filtering words of known origin out of this database, then examining the remaining names to see if they provide any hints as to the linguistic identity of earlier populations. I filter out place names of known origin with a matching algorithm that uses a wordlist I compiled from the KBBI (Kamus Besar Bahasa Indonesia), an authoritative Indonesian dictionary (Indonesian, itself a variety of Malay, will serve as a stand-in for Malay). Since there is considerable variation in the spelling and pronunciation of place names, I will use fuzzy matching to identify names of known origin. The three packages I will use are fuzzywuzzy, difflib and pandas.


In [2]:
from fuzzywuzzy import fuzz, StringMatcher
import difflib
import pandas as pd

In [3]:
### loading partially cleaned csvs containing place names and dictionary

kbbi = pd.read_csv("/Users/admin/Desktop/loanwords/clean.kbbi.csv")
places = pd.read_csv("/Users/admin/Desktop/loanwords/concat_places.csv")
kbbi.columns = ['old_index', 'words']

Databases:

The place name database contains the full name of each location, a location index which can be used to locate each place on a map, and a column describing the type of geographic feature. Notice that many of the place names contain two words. I want to check each of these words individually for matches in the dictionary, so place names containing more than one word have been split across multiple rows: the place 'Laut Seram', for example, corresponds to two separate rows, one for 'Laut' and the other for 'Seram' (a rough sketch of this split is given below). The database contains a total of 438,600 rows.
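
This split happened during the preliminary cleaning that produced concat_places.csv. As a rough sketch (not the exact preprocessing script), it can be done by giving every word of a multi-word name its own row; here 'raw_places' is a hypothetical DataFrame with one row per location and the columns shown in the table below.

# Sketch only: 'raw_places' is a hypothetical one-row-per-location DataFrame.
rows = []
for _, row in raw_places.iterrows():
    for word in str(row['Full_name']).split():
        rows.append({'Full_name': row['Full_name'],
                     'Location_Index': row['Location_Index'],
                     'Type_of_Feature': row['Type_of_Feature'],
                     'split_word': word})
places = pd.DataFrame(rows)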


In [4]:
places.head()


Out[4]:
Unnamed: 0 Full_name Location_Index Type_of_Feature split_word
0 0 Laut Seram 52MCC8882523631 SEA Laut
1 1 Laut Halmahera 52MED0000089470 SEA Laut
2 2 Laut Banda 51MVS6942418232 SEA Laut
3 3 Ijapo Mountains 54MWB0185296040 MTS Ijapo
4 4 Sungai Holander 54MWA0369804447 STM Sungai

The dictionary database comprises a list of words taken from the KBBI along with their original dictionary indexes. The database contains 50,000+ entries.


In [5]:
kbbi.head()


Out[5]:
old_index words
0 1 gabung
1 2 kuripan
2 3 mengaras
3 4 murung
4 5 terkencing

Step 1: Removing orthographic irregularities

Unlike English, the Indonesian spelling system is almost perfectly phonemic, i.e. in most cases there is a one-to-one relationship between a letter and the sound it represents. That said, there are a few irregularities in the orthography: the digraphs 'ng' and 'ny' each represent a single sound, and older (Dutch-era) spellings such as 'oe', 'tj' and 'dj' still survive in many place names. To rid the data of as many of these irregularities as possible, I created the following function, which collapses each digraph into a single symbol and normalizes the older spellings:


In [6]:
def prepare_string(word):
    word = word.lower()
    word = word.replace('ng','N')   # digraph ng -> single symbol
    word = word.replace('ny','Y')   # digraph ny -> single symbol
    word = word.replace('sy','s')
    word = word.replace('kh','k')
    word = word.replace('oe','u')   # Dutch-era spellings of u, c, j
    word = word.replace('tj','c')
    word = word.replace('dj','j')
    return(word)
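
To illustrate with a couple of strings of my own choosing (the values in the comments are what the function returns):

prepare_string('Djambi')   # -> 'jambi'  (colonial-era spelling of Jambi)
prepare_string('Sungai')   # -> 'suNai'  ('ng' collapsed to the single symbol N)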

In [7]:
kbbi['prepared_entries'] = kbbi.words.apply(lambda x: prepare_string(str(x)))
places['prepared_places'] = places.split_word.apply(lambda x: prepare_string(str(x)))

In the cell above, I generated new columns in the dictionary and place name databases to hold the cleaned strings.


In [8]:
kbbi.head()


Out[8]:
old_index words prepared_entries
0 1 gabung gabuN
1 2 kuripan kuripan
2 3 mengaras meNaras
3 4 murung muruN
4 5 terkencing terkenciN

In [9]:
places.head()


Out[9]:
Unnamed: 0 Full_name Location_Index Type_of_Feature split_word prepared_places
0 0 Laut Seram 52MCC8882523631 SEA Laut laut
1 1 Laut Halmahera 52MED0000089470 SEA Laut laut
2 2 Laut Banda 51MVS6942418232 SEA Laut laut
3 3 Ijapo Mountains 54MWB0185296040 MTS Ijapo ijapo
4 4 Sungai Holander 54MWA0369804447 STM Sungai suNai

Identifying place names of known origin: perfect matches

The following function takes a place name as input and compares it to each word in the dictionary wordlist, returning a boolean indicating whether any matches were found.


In [10]:
def find_matches(place, wordlist_series):
    # True if the place name occurs as a substring of any dictionary entry;
    # regex=False keeps characters in the name from being read as regex syntax.
    bool_series = wordlist_series.str.contains(place, case=True, regex=False)
    return(bool_series.any())

I use this function in conjunction with the .apply() method in pandas to add a column of booleans to the place name database. Given the size of the database, running this over every name could take up to an hour, so to demonstrate how the function works I will apply it to a subset of the data, 'sample_df', containing the first 1,000 rows.
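
As an aside, if the full run ever becomes a bottleneck, one possible speed-up (a sketch I haven't benchmarked, and not what is used below) is to join all of the prepared dictionary entries into a single newline-separated string once and rely on Python's fast substring search. This gives the same answer as find_matches as long as no place name contains a newline:

# Sketch of a faster equivalent: one big membership test instead of a Series scan.
dictionary_blob = '\n'.join(kbbi.prepared_entries.astype(str))

def find_matches_fast(place):
    # The newline separators prevent a place name from matching across
    # the boundary between two dictionary entries.
    return place in dictionary_blob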


In [14]:
sample_df = places.iloc[:1000].copy()   # .copy() avoids pandas' SettingWithCopyWarning
sample_df['matches'] = sample_df.prepared_places.apply(lambda x: find_matches(x, kbbi.prepared_entries))

In [15]:
sample_df.head()


Out[15]:
Unnamed: 0 Full_name Location_Index Type_of_Feature split_word prepared_places matches
0 0 Laut Seram 52MCC8882523631 SEA Laut laut True
1 1 Laut Halmahera 52MED0000089470 SEA Laut laut True
2 2 Laut Banda 51MVS6942418232 SEA Laut laut True
3 3 Ijapo Mountains 54MWB0185296040 MTS Ijapo ijapo False
4 4 Sungai Holander 54MWA0369804447 STM Sungai suNai True

Using the boolean values in the 'matches' column, I will filter out place names with an exact match in the dictionary.


In [16]:
sample_df = sample_df[sample_df.matches == False]

In [17]:
sample_df.head()


Out[17]:
Unnamed: 0 Full_name Location_Index Type_of_Feature split_word prepared_places matches
3 3 Ijapo Mountains 54MWB0185296040 MTS Ijapo ijapo False
15 15 Kaiserin Augusta 55MBR2643874132 STM Kaiserin kaiserin False
16 16 Hollander River 54MWA0369804447 STM Hollander hollander False
17 17 Kuase River 54MWA2363960865 STM Kuase kuase False
18 18 Hauser River 54MWA2363960865 STM Hauser hauser False

Identifying place names of known origin: near matches

Although we have gotten rid of all place names with exact matches in the dictionary, there are still plenty of place names which are variants of items in the dictionary (e.g. words which are historically the same word, but are now spelled and/or pronounced differently). We want to get rid of these words too. The method I am going to use is called 'fuzzy matching'. I will borrow the ratio function from the fuzzywuzzy package, which calculates a percentage similarity score, with 100 being a perfect match. This score is based on Levenshtein distance, which quantifies the distance between two words a and b as the number of deletions, substitutions or insertions it would take to transform a into b.

Before applying this function, however, there is an important problem that needs to be addressed: Levenshtein distance does not take the phonological similarity between two sounds into account. To illustrate this, consider two pairs of words:

Pair 1: bat, pat
Pair 2: bat, sat


In [19]:
fuzz.ratio('bat','pat')


Out[19]:
67

In [20]:
fuzz.ratio('bat','sat')


Out[20]:
67

In terms of Levenshtein distance, both pairs of words earn the same score. This is because, in both cases, the first item in the pair can be transformed into the second by substituting its first letter (p for b in Pair 1 and s for b in Pair 2). In terms of actual phonological distance, however, the words in Pair 1 are much closer to one another than the words in Pair 2. This is because the sound [b] is nearly identical to [p] both in terms of the physical gestures involved in its pronunciation and in terms of its acoustic properties. In both of these respects, [b] and [s] are quite different from one another.

Drawing this distinction might seem pedantic, but phonological distance is important when trying to discover pairs of words which are variants of one another or related historically. Changes in pronunciation that occur over time are gradual. For example, cases where [p] becomes [b] are extremely common across the world's languages, whereas a change of [b] to [s] is extremely rare if not unattested. Since we are trying to establish whether place names without perfect matches in the dictionary are nevertheless etymologically related to words in the dictionary, it is important that we quantify phonological distance. To accomplish this, I have created a function which converts each sound into a dictionary (a feature matrix) of its component phonological features (i.e. where in the vocal tract it is pronounced, whether it is nasal, etc.). For more information about phonological features see Wikipedia's page: https://en.wikipedia.org/wiki/Distinctive_feature


In [21]:
def phono_matrix(string):
    string_of_matrixes = []
    for character in string:
        matrix = {}
###populate manner: 
    ### sonorant
        if character in ['a','e','i','o','u','y','w','m','N','Y','l','r','h','q']:
            matrix['sonorant'] = 'Y'
        else:
            matrix['sonorant'] = 'N'
    
    ###continuant
        if character in ['l','r','y','w','a','e','i','o','u','s','z','f']:
            matrix['continuant'] = 'Y'
        else:
            matrix['continuant'] = 'N'
    
    ###consonant
        if character in ['p','t','k','q','h','c','b','d','g','j','s','z','f','m','n','Y','N','l','r']:
            matrix['consonant'] = 'Y'
        else:
            matrix['consonant'] = 'N'
    
    ###syllabic
        if character in ['a','e','i','o','u']:
            matrix['syllabic'] = 'Y'
        else:
            matrix['syllabic'] = 'N'

    ###strident
        if character in ['s','j','c']:
            matrix['strident'] = 'Y'
        else:
            matrix['strident'] = 'N'
    
###populate place: labial, coronal, palatal, velar, glottal 
    ###labial
        if character in ['p','m','f','b','w','u','o']:
            matrix['labial'] = 'Y'
        else:
            matrix['labial'] = 'N'   

    ###coronal
        if character in ['t','d','n','s','j','c','Y','i','e','r','l']:
            matrix['coronal'] = 'Y'
        else:
            matrix['coronal'] = 'N' 
 
    ###palatal
        if character in ['s','j','c','i','Y','e']:
            matrix['palatal'] = 'Y'
        else:
            matrix['palatal'] = 'N'
    ###velar
        if character in ['u','k','g','N','o']:
            matrix['velar'] = 'Y'
        else:
            matrix['velar'] = 'N'
    ###glottal
        if character in ['h','q']:
            matrix['glottal'] = 'Y'
        else:
            matrix['glottal'] = 'N'

###nasality
    ###nasal/oral
        if character in ['m','n','Y','N']:
            matrix['nasal'] = 'Y'
        else:
            matrix['nasal'] = 'N'
            
###populate obstruent voicing 
###I assume that [voice] is only phonologically active in obstruents
    ###voiced/voiceless obstruent
        if character in ['b','d','g','j']:
            matrix['voice'] = 'Y'
        else:
            matrix['voice'] = 'N'          
            
### populate lateral/rhotic
    ###lateral
        if character == 'l':
            matrix['lateral'] = 'Y'
        else:
            matrix['lateral'] = 'N'
    ###rhotic
        if character == 'r':
            matrix['rhotic'] = 'Y'
        else:
            matrix['rhotic'] = 'N'
        
###populate vowel height
###I assume that mid is not an active feature
    ### high
        if character in ['i','u']:
            matrix['high'] = 'Y'
        else:
            matrix['high'] = 'N'
    ### low
        if character == 'a':
            matrix['low'] = 'Y'
        else:
            matrix['low'] = 'N'
        
        string_of_matrixes.append(matrix)   # one feature matrix per character
    return(string_of_matrixes)
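
For a single sound the function returns a list containing one feature matrix. For [b], for instance, the matrix marks it as a voiced labial consonant and nothing else:

phono_matrix('b')[0]
# {'sonorant': 'N', 'continuant': 'N', 'consonant': 'Y', 'syllabic': 'N',
#  'strident': 'N', 'labial': 'Y', 'coronal': 'N', 'palatal': 'N', 'velar': 'N',
#  'glottal': 'N', 'nasal': 'N', 'voice': 'Y', 'lateral': 'N', 'rhotic': 'N',
#  'high': 'N', 'low': 'N'}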

In [22]:
kbbi['matrixes'] = kbbi.prepared_entries.apply(lambda x: phono_matrix(x))
sample_df['matrixes'] = sample_df.prepared_places.apply(lambda x: phono_matrix(x))

In [23]:
kbbi.matrixes.head()


Out[23]:
0    [{u'continuant': u'N', u'palatal': u'N', u'vel...
1    [{u'continuant': u'N', u'palatal': u'N', u'vel...
2    [{u'continuant': u'N', u'palatal': u'N', u'vel...
3    [{u'continuant': u'N', u'palatal': u'N', u'vel...
4    [{u'continuant': u'N', u'palatal': u'N', u'vel...
Name: matrixes, dtype: object

In [24]:
sample_df.matrixes.head()


Out[24]:
3     [{u'continuant': u'Y', u'palatal': u'Y', u'vel...
15    [{u'continuant': u'N', u'palatal': u'N', u'vel...
16    [{u'continuant': u'N', u'palatal': u'N', u'vel...
17    [{u'continuant': u'N', u'palatal': u'N', u'vel...
18    [{u'continuant': u'N', u'palatal': u'N', u'vel...
Name: matrixes, dtype: object

Now that each sound is represented by its features, we're almost at the point where we can compare sounds on a feature-by-feature basis to get a much more accurate measure of their phonological distance. To allow this comparison, I have created the following function, which takes a string of matrixes corresponding to individual sounds and generates multiple tiers, each corresponding to a single binary feature ('Y' for 'has feature' and 'N' for 'lacks feature'). Consider the tier for the feature [consonant] in the case of Pair 1 above: this tier will be identical for the words 'pat' and 'bat', and in both cases it will take the form 'YNY' (i.e. p/b = consonant, a = vowel, t = consonant).


In [26]:
def tier_builder(string_of_matrixes):
    features = ['sonorant','consonant','continuant','syllabic','strident','labial','coronal','palatal','velar', 'glottal','nasal','voice','lateral','rhotic']
    tier_dictionary = {}
    for feature in features:
        tier_dictionary[feature] = str()
        for matrix in string_of_matrixes:
            if matrix[feature] is not None:
                tier_dictionary[feature] = tier_dictionary[feature] + matrix[feature]
    return(tier_dictionary)
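
As a quick sanity check, the [consonant] tier for the 'YNY' example above comes out as expected:

tier_builder(phono_matrix('bat'))['consonant']   # -> 'YNY'
tier_builder(phono_matrix('pat'))['consonant']   # -> 'YNY'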

In [27]:
kbbi['tiers'] = kbbi.matrixes.apply(lambda x: tier_builder(x))
sample_df['tiers'] = sample_df.matrixes.apply(lambda x: tier_builder(x))

Let's take a closer look at the first entry in the dictionary and how it is represented in terms of tiers of phonological features:


In [30]:
kbbi.prepared_entries[0]


Out[30]:
'gabuN'

In [31]:
kbbi.tiers[0]


Out[31]:
{'consonant': 'YNYNY',
 'continuant': 'NYNYN',
 'coronal': 'NNNNN',
 'glottal': 'NNNNN',
 'labial': 'NNYYN',
 'lateral': 'NNNNN',
 'nasal': 'NNNNY',
 'palatal': 'NNNNN',
 'rhotic': 'NNNNN',
 'sonorant': 'NYNYY',
 'strident': 'NNNNN',
 'syllabic': 'NYNYN',
 'velar': 'YNNYY',
 'voice': 'YNYNN'}

Now that we have complete phonological representations for each place name and dictionary item, we can use the following function to measure how similar each place name is to each dictionary item. The function calculates phonological similarity on a feature-by-feature basis by applying fuzzy (Levenshtein-based) matching to each pair of tiers generated above.

A similarity value is calculated for each feature tier, and the average across all tiers is returned, giving us a global measure of phonological similarity between two words.


In [32]:
def similarity(word1_tiers,word2_tiers):
    features = ['sonorant','continuant','syllabic','strident','labial','coronal','palatal','velar', 'glottal','nasal','voice','lateral','rhotic']  
    tier_similarity = {}
    for feature in features:
        tier_similarity[feature] = fuzz.ratio(word1_tiers[feature],word2_tiers[feature])
    tier_similarity = pd.Series(tier_similarity)
    return(tier_similarity.mean())

Let's test this out with the word pairs 'bat'/'pat' and 'bat'/'sat'. Remember that these pairs received the same score for plain Levenshtein distance above. If we convert them from strings into phonological tiers using the functions presented above, we expect that 'pat'/'bat' will get a higher similarity score than 'bat'/'sat'.


In [33]:
### Building feature matrixes:
pat = phono_matrix('pat')
pat = tier_builder(pat)
bat = phono_matrix('bat')
bat = tier_builder(bat)
sat = phono_matrix('sat')
sat = tier_builder(sat)

In [35]:
similarity(pat,bat)


Out[35]:
97.46153846153847

In [36]:
similarity(bat,sat)


Out[36]:
84.76923076923077

Now that we know that our similarity function works, we can return to the task of identifying place names that have near matches in the dictionary and are thus likely to be of known etymological origin. (The scores above fall directly out of the tier comparison: 'bat' and 'pat' differ only on the voice tier, while 'bat' and 'sat' differ on six tiers, hence the lower score.) I have developed a function which scores the similarity of every word in the dictionary against a single place name and returns the most phonologically similar dictionary entry along with its similarity score. At this point, however, the function takes a very long time to execute over the entire list of place names.
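
That best-match function isn't shown here, but a minimal sketch of what it could look like is below. This is my own assumption about its shape rather than the exact code; it returns the index of the best dictionary entry (which can be looked up in kbbi.words) together with its score, and it scans the whole dictionary for every place name, which is exactly what makes it slow:

def best_match(place_tiers, dictionary_tiers):
    # Compare one place name's tiers against every dictionary entry's tiers and
    # keep the entry with the highest phonological similarity score.
    best_score, best_index = -1, None
    for index, entry_tiers in dictionary_tiers.iteritems():   # .items() in newer pandas
        score = similarity(place_tiers, entry_tiers)
        if score > best_score:
            best_score, best_index = score, index
    return(best_index, best_score)

# Usage over the sample (slow):
# sample_df['nearest'] = sample_df.tiers.apply(lambda t: best_match(t, kbbi.tiers))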

Next Steps:

Once I have similarity values for the entire place name database, I will be able to find those words which are least likely to have originated from Malay/Indonesian. I will then investigate the origins of these names. For any given place name X from an unknown source, there are two basic hypotheses regarding its origins:

Hypothesis 1: X was coined and thus does not originate from another language
Hypothesis 2: X originates from another language

These hypotheses predict different distributional properties. In the case of coinages (Hypothesis 1), we expect identical names to arise independently only rarely, so a coined name should not recur across many locations. Under Hypothesis 2, on the other hand, we expect names built on another language's vocabulary, especially its geographic terms (compare English 'river', 'mountain', 'hill', etc.), to occur frequently and in diverse locations.

Bringing it all together, we are looking for words which (1) have the lowest scores for their nearest match with a dictionary entry, meaning they are highly unlikely to have a Malay/Indonesian etymological source, and (2) have high similarity scores with other words of unknown origin, since this would indicate a distribution in line with Hypothesis 2 above. We can calculate a score for (2) using the following function. (n.b. I haven't tested this function yet.)


In [ ]:
def find_partners(wordlist):
    # Score every pair of words in 'wordlist' (assumed to already be run through
    # prepare_string) for phonological similarity. Tiers are built once per word,
    # and pairs of a word with itself are skipped.
    tiers = {word: tier_builder(phono_matrix(word)) for word in wordlist}
    ranking_dict = {}
    for item in wordlist:
        for item2 in wordlist:
            if item != item2:
                ranking_dict[(item, item2)] = similarity(tiers[item], tiers[item2])
    ranking_dict = pd.Series(ranking_dict)
    return(ranking_dict.sort_values(ascending=False))

Once I have found words that exhibit this distribution, I can compare these words to other languages spoken in adjacent regions (using the techniques outlined above) to determine whether there are similarities that would suggest a non-Malay/Indonesian etymological source.

