This notebook takes the tagged list of Czech nouns from the syn2010 corpus of the Czech National Corpus (kindly supplied by Michal Kren) and tabulates paradigms, as a first step before carrying out analyses.
The input list is in the file "substantiva_syn2010".
The tagset is described here: http://korpus.cz/bonito/znacky.php
In [1]:
import pandas as pd
import numpy as np
We make a list of the tags that we will need for future reference:
In [2]:
the_tags = [x+str(y) for x in ['S','P'] for y in range(1,8)]
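This yields fourteen tags, 'S1' through 'P7', pairing number (S = singular, P = plural) with the seven cases (1 = nominative through 7 = instrumental).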
First we open the file:
In [3]:
data_file = open('./substantiva_syn2010')
The first order of business is to parse the file: we keep only the entries whose lemma is entirely lowercase (which filters out proper nouns) and whose tag encodes one of our fourteen number+case combinations, extracting the frequency, the lowercased form, the lemma, and the number+case portion of the tag.
We store the result in a list of lists called data.
In [4]:
data = []
for line in data_file:
    # Fields: frequency, form, lemma, full positional tag
    items = line.rstrip().split('\t')
    # Characters 4-5 of the positional tag encode number (S/P) and case (1-7)
    if items[2].islower() and items[3][3:5] in the_tags:
        data.append([int(items[0]), items[1].lower(), items[2], items[3][3:5]])
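Each line is thus expected to hold four tab-separated fields: a frequency, a form, a lemma, and the full positional tag. A made-up example line "1234 ženy žena NNFP1-----A----" (tab-separated) would yield the record [1234, 'ženy', 'žena', 'P1'].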
We turn this into a pandas DataFrame:
In [5]:
data = pd.DataFrame(data,columns=['freq','form','lemma','tag'])
We now have multiple lines with the same form in cases where the initial data contained both capitalized and uncapitalized versions of a form. To deal with this:
In [6]:
data = data.groupby(['form','lemma','tag'], as_index=False).agg(np.sum)
Now we want to deal with overabundance. The format of a paradigm cell will be a list of pairs "form:freq" separated by semicolons.
In [7]:
data['formfreq'] = data['form'] + ':' + data['freq'].astype(str)
del data['form']
del data['freq']
data = data.groupby(['lemma','tag'], as_index=False).agg(lambda l: ';'.join(l))
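At this point a cell with two competing forms looks like, e.g., "brambor:120;bramborů:45" (made-up frequencies for the two genitive plural forms of brambor 'potato').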
We need to deal with a small complication. The data contains not only overabundant cells, but also orthographic variants.
There is no fully satisfactory way of dealing with this situation without heavy manual editing. What we do here is rely on the fact that overabundance always targets the suffixal exponents, whereas orthographic variation targets stem-internal material. Thus we normalize the data in the following way: within each cell, forms that share their final two segments are treated as orthographic variants of one another; we sum their frequencies and keep only the variant closest to the lemma in Levenshtein distance, while forms with distinct endings are kept apart as genuinely overabundant realizations.
This is doubly unsatisfactory. First, there may be situations where an overabundant cell actually relies on two exponents that share the same final segments. We suspect that this is unlikely in Czech nominal declension. Second, there may be cases where the lemma uses variant A of the stem, some cell $c$ uses both variant A and variant B, and some other cell $c'$ uses only variant B. The present strategy will not catch this and will still list variant B for cell $c'$.
One can hope that these situations will be caught at the next step, when the inflection classes generated from the paradigms are analyzed by hand. A worked illustration of the procedure follows the code below.
In [8]:
def levenshtein(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a, b = b, a
        n, m = m, n
    current = list(range(n+1))
    for i in range(1, m+1):
        previous, current = current, [i] + [0]*n
        for j in range(1, n+1):
            add, delete = previous[j] + 1, current[j-1] + 1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change += 1
            current[j] = min(add, delete, change)
    return current[n]
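# A quick sanity check on a textbook example (not from the corpus):
# levenshtein('kitten', 'sitting') returns 3.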
def partition_by_ending(lst, length=1):
    """Partitions a list of form:frequency pairs on the basis of shared final
    segments. Argument length gives the number of segments to consider."""
    res = {}
    for item in lst:
        ending = item.split(':')[0][-length:]
        if ending in res:
            res[ending].append(item)
        else:
            res[ending] = [item]
    return list(res.values())
for row in data.index:
    lemma = data.loc[row,'lemma']
    items = data.loc[row,'formfreq'].split(';')
    if len(items) > 1:
        newlist = []
        p = partition_by_ending(items, 2)
        for cell in p:
            # Within a group of forms sharing their final two segments,
            # sum the frequencies and keep the form closest to the lemma.
            normalized_form = cell[0].split(':')[0]
            cum_freq = int(cell[0].split(':')[1])
            for pair in cell[1:]:
                (this_form, this_freq) = pair.split(':')
                cum_freq += int(this_freq)
                if levenshtein(normalized_form, lemma) > levenshtein(this_form, lemma):
                    normalized_form = this_form
            newlist.append(normalized_form + ':' + str(cum_freq))
        data.loc[row,'formfreq'] = ';'.join(newlist)
        # if len(newlist) > 1:
        #     print(lemma, data.loc[row,'tag'], data.loc[row,'formfreq'], sep='\t')
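To illustrate the normalization on made-up data (plausible Czech forms, invented frequencies): for the lemma oceán, partition_by_ending first groups the pairs by their final two segments,
partition_by_ending(['oceánu:10', 'oceanu:2', 'oceáně:5'], 2)
# -> [['oceánu:10', 'oceanu:2'], ['oceáně:5']]
and the loop then collapses the first group to 'oceánu:12', since 'oceánu' is closer to the lemma than 'oceanu', while 'oceáně:5' is kept as a genuinely overabundant realization of the cell.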
Now we want to go from a format with one row per (lemma,tag) pair to a format with one row per lemma. This is easily done with the pivot method.
In [9]:
data = data.pivot(index='lemma', columns='tag', values='formfreq')
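Each row now corresponds to a lemma and each column to one of the fourteen cells, schematically (made-up values):
# tag       S1          S2           ...   P7
# lemma
# oceán     oceán:530   oceánu:210   ...   oceány:88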
The final step is to write to file:
In [10]:
data = data.reindex(columns=the_tags)

def new_col(s):
    number = {'S':'SG','P':'PL'}
    case = {'1':'NOM','2':'GEN','3':'DAT','4':'ACC','5':'VOC','6':'LOC','7':'INS'}
    return number[s[0]] + '.' + case[s[1]]

new_col_dict = {}
for t in the_tags:
    new_col_dict[t] = new_col(t)

data = data.rename(columns=new_col_dict)
data.to_csv('./substantiva.pdgm.csv')
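As a check on the relabelling, new_col('S1') returns 'SG.NOM' and new_col('P7') returns 'PL.INS'; the resulting CSV thus has one row per lemma and fourteen readably named paradigm columns.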