Preparing data from the Czech National Corpus for IPa analysis

This notebook takes the tagged list of Czech nouns from the syn2010 corpus of the Czech National Corpus (kindly supplied by Michal Kren) and tabulates paradigms, as a first step before carrying out analyses.

The input list is in the file "substantiva_syn2010".

The tagset is described here: http://korpus.cz/bonito/znacky.php


In [1]:
import pandas as pd
import numpy as np

We make a list of the tags that we will need for future reference:


In [2]:
the_tags = [x+str(y) for x in ['S','P'] for y in range(1,8)]
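
This yields the fourteen number+case tags:

    ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7',
     'P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7']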

First we open the file:


In [3]:
data_file = open('./substantiva_syn2010')

Selecting appropriate data

The first order of business is to:

  1. Select only nouns, by dropping anything with a capitalized lemma or a suspicious-looking tag
  2. Lowercase all forms, simplify the tags to number+case, and treat frequencies as integers.

We store the result in a list of lists called data.
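
For concreteness, each input line is tab-separated, with fields frequency, form, lemma, and tag. A hypothetical line (the frequency and the full tag are invented for illustration) would look like this; characters 4-5 of the positional tag, here S2, encode number and case:

    1234    analyzátoru    analyzátor    NNIS2-----A----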


In [4]:
data = []
for line in data_file:
    # Each line is tab-separated: frequency, form, lemma, tag
    items = line.rstrip().split('\t')
    # Keep only lowercase lemmas whose tag encodes a valid number+case
    # combination: characters 4-5 give number ('S'/'P') and case ('1'-'7')
    if items[2].islower() and items[3][3:5] in the_tags:
        data.append([int(items[0]), items[1].lower(), items[2], items[3][3:5]])

We turn this into a pandas DataFrame:


In [5]:
data = pd.DataFrame(data,columns=['freq','form','lemma','tag'])

We now have multiple lines with the same form in cases where the initial data contained both capitalized and uncapitalized versions of a form. To deal with this:

  1. We first group the data by form, lemma and tag
  2. Then we aggregate duplicate lines, summing the frequencies

In [6]:
data = data.groupby(['form','lemma','tag'], as_index=False)
data = data.agg(np.sum)   # sums the 'freq' column within each group
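
As an illustration (with made-up frequencies), two input lines whose forms differed only in capitalization collapse into a single row at this point:

    toy = pd.DataFrame([[10, 'hrad', 'hrad', 'S1'],
                        [5,  'hrad', 'hrad', 'S1']],
                       columns=['freq', 'form', 'lemma', 'tag'])
    toy.groupby(['form', 'lemma', 'tag'], as_index=False).agg(np.sum)
    #    form lemma tag  freq
    # 0  hrad  hrad  S1    15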

Dealing with overabundance

Now we want to deal with overabundance. The format of a paradigm cell will be a list of pairs "form:freq" separated by semicolons.

  1. First we create a new column combining form and frequency
  2. Then we group rows by lemma and tag
  3. Finally we aggregate the form/frequency pairs

In [7]:
data['formfreq'] = data['form'] + ':' + data['freq'].apply(str)
del data['form']
del data['freq']
data = data.groupby(['lemma','tag'], as_index=False)
data = data.agg(lambda l: ';'.join(l))
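
At this point each cell is a single string. For instance, the GEN.SG cell of analyzátor now looks something like this (frequencies are hypothetical):

    'analyzátora:52;analyzátoru:339'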

We need to deal with a small complication. The data contains not only overabundant cells, but also orthographic variants.

  • An example of an overabundant cell is the GEN.SG of analyzátor, which is listed as both analyzátora and analyzátoru. This is a morphologically significant fact: this lexeme hesitates between the two major strategies for forming the GEN.SG of hard masculine nouns.
  • An example of orthographic variants is the GEN.SG of aktualizace, which is listed as both aktualisace and aktualizace. Notice that the CNK lexicon lists these under a single citation form. This is not a morphologically significant fact: clearly there is hesitation between two spellings of the stem rather than two inflection strategies.

There is no fully satisfactory way of dealing with this situation without heavy manual editing. What we do here is rely on the fact that overabundance always targets the suffixal exponents, whereas orthographic variation targets stem-internal material. Thus we normalize the data in the following way:

  • If a cell contains multiple forms with different final segments, we keep these as distinct.
  • If a cell contains multiple forms with the same final segment, we keep only the form with the shortest Levenshtein distance to the lemma.

This is doubly unsatisfactory. First, there may be situations where an overabundant cell actually relies on two exponents that share the same final segment. We just happen to suspect that this is unlikely in Czech nominal declension. Second, there may be cases where the lemma uses variant A of the stem, some cell $c$ uses both variant A and variant B, and some other cell $c'$ uses only variant B. The present strategy will not catch this and still list variant B for cell $c'$.

One can hope that these situations will be caught at the next step, when analyzing by hand the inflection classes generated from the paradigms.


In [8]:
def levenshtein(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

def partition_by_ending(lst, length=1):
    """Partitions a list of form:frequency pairs on the basis of shared final 
        segments. Argument length gives the number of segments to consider."""
    res = {}
    for item in lst:
        # Group items by the final `length` characters of the form
        ending = item.split(':')[0][-length:]
        if ending in res:
            res[ending].append(item)
        else:
            res[ending] = [item]
    return list(res.values())

for row in data.index:
    lemma = data.loc[row,'lemma']
    items = data.loc[row,'formfreq'].split(';')
    if len(items) > 1:
        newlist = []
        # Partition the variant forms by their final two segments
        p = partition_by_ending(items, 2)
        for cell in p:
            # Within each partition, keep the form closest to the lemma
            # and cumulate the frequencies of all variants
            normalized_form = cell[0].split(':')[0]
            cum_freq = int(cell[0].split(':')[1])
            for pair in cell[1:]:
                (this_form, this_freq) = pair.split(':')
                cum_freq += int(this_freq)
                if levenshtein(normalized_form, lemma) > levenshtein(this_form, lemma):
                    normalized_form = this_form
            newlist.append(normalized_form + ':' + str(cum_freq))
        data.loc[row,'formfreq'] = ';'.join(newlist)
# Uncomment to inspect the cells that remain overabundant after normalization:
#        if len(newlist) > 1:
#            print(lemma, data.loc[row,'tag'], data.loc[row,'formfreq'], sep='\t')
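
To see how the normalization treats the two examples above, here is the behavior of partition_by_ending on them (frequencies are hypothetical):

    # Overabundance: distinct final segments, so the forms stay distinct
    partition_by_ending(['analyzátora:52', 'analyzátoru:339'], 2)
    # -> [['analyzátora:52'], ['analyzátoru:339']]

    # Orthographic variants: same final segment 'ce', so they are grouped;
    # the loop above then keeps the variant closest to the lemma and sums
    # the frequencies, yielding 'aktualizace:43'
    partition_by_ending(['aktualisace:3', 'aktualizace:40'], 2)
    # -> [['aktualisace:3', 'aktualizace:40']]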

From forms to paradigms

Now we want to go from a format with one row per (lemma,tag) pair to a format with one row per lemma. This is easily done with the pivot method.


In [9]:
data=data.pivot(index='lemma',columns='tag',values='formfreq')
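
A toy example of the reshape (values are hypothetical):

    toy = pd.DataFrame({'lemma': ['hrad', 'hrad'],
                        'tag': ['S1', 'S2'],
                        'formfreq': ['hrad:120', 'hradu:80']})
    toy.pivot(index='lemma', columns='tag', values='formfreq')
    # tag         S1        S2
    # lemma
    # hrad  hrad:120  hradu:80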

Generating output

The final step is to write to file:

  1. We reindex the DataFrame so that columns appear in the expected order
  2. We write to a csv file using more intuitive column names

In [10]:
data = data.reindex(columns=the_tags)
def new_col(s):
    """Converts a tag such as 'S2' to a readable column name such as 'SG.GEN'."""
    number = {'S':'SG','P':'PL'}
    case = {'1':'NOM','2':'GEN','3':'DAT','4':'ACC','5':'VOC','6':'LOC','7':'INS'}
    return number[s[0]]+'.'+case[s[1]]

new_col_dict = {t: new_col(t) for t in the_tags}

data = data.rename(columns=new_col_dict)
data.to_csv('./substantiva.pdgm.csv')
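
As a quick sanity check, one can reload the file and confirm the column layout:

    check = pd.read_csv('./substantiva.pdgm.csv', index_col='lemma')
    print(check.columns.tolist())
    # ['SG.NOM', 'SG.GEN', 'SG.DAT', 'SG.ACC', 'SG.VOC', 'SG.LOC', 'SG.INS',
    #  'PL.NOM', 'PL.GEN', 'PL.DAT', 'PL.ACC', 'PL.VOC', 'PL.LOC', 'PL.INS']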