Created by Dénes Csala | 2019 | MIT License
For improvement suggestions or spotted processing mistakes, drop me a message on Facebook.
If you would like to have your country/club data visualized in a similar manner, or for any other data visualization and analytics consultancy inquiries, contact me at mail@csaladen.es
This workbook guides you through the data cleaning stage for the Romania Kendo Stats visualization. This is a multi-stage process: you will need access to the raw data (liaise with the Secretary or another member in charge of data at the Romanian Kendo Association), and Python and Excel installed. Any required Python packages will be installed along the way, but we recommend using the Anaconda distribution of Python 3. If you would like to edit the visualization part, you will also need PowerBI Desktop.
The general structure of the repository is the following:
/data/raw: this is where you place the data downloaded from the official data source, sorted by years and competitions; only keep files that contain relevant match data
/data/ocr: this is where the data gets saved after OCR has been performed - this is necessary for some older files in image format
/data/manual: this is where manually extracted matches from old image files get placed - they should follow the 2018 CN format, i.e. all matches in one sheet
/data/export: this is where we save the data formatted for loading into the viz
/data/clean: this is where all the processed, cleaned data ends up - it should follow the 2018 CN format, i.e. all matches in one sheet
/scripts: this is the main code repository for all data processing scripts
/viz: this is where the visualization files get saved - they are created using PowerBI and load data from /data/clean
This section reads and cleans the RKA members list. Save as baseline.
In [76]:
import pandas as pd, numpy as np, json
import members_loader, matches_loader, clubs_loader, point_utils, save_utils
First, download members data (Evidenta membrilor.xlsx) from the official data source, and create a macro-enabled Excel file from the Google Sheet. Then write a simple macro to extract the cell comments from the Club column in order to get info about club Transfers. Follow the instructions here. Save the new file as Evidenta membrilor.xlsm in the /data/manual folder. Use the members_loader module to process this file.
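If you want to sanity-check the macro output from Python, cell comments can also be read directly with openpyxl. This is only an optional sketch, not part of members_loader; the sheet and column layout below are assumptions you should adjust to the real file.
In [ ]:
# Optional cross-check of the extracted comments (not part of the pipeline).
# Assumption: the member list is on the active sheet; adjust sheet/column as needed.
from openpyxl import load_workbook

wb = load_workbook('../data/manual/Evidenta membrilor.xlsm', keep_vba=True)
ws = wb.active
for row in ws.iter_rows(min_row=2):
    for cell in row:
        if cell.comment is not None:
            # transfer notes are stored as comments on the Club column
            print(cell.coordinate, cell.comment.text)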
In [77]:
members=members_loader.get_members('../data/manual/Evidenta membrilor.xlsm')
Members are loaded but a bit messy.
In [78]:
members.head(2)
Out[78]:
In [79]:
members_clean=members_loader.cleaner(members).reset_index(drop=False)
In [80]:
members_clean.to_csv('../data/clean/members.csv')
Matches are loaded from Excel sheets in the /data folder, organized by year and competition. We are always looking for match list data: the cleaner and the more concentrated, the better. Since this is not always possible, we have several demo import routines, stored in the matches_loader.py function library. Not all matches have textual data available; those need to be processed through OCR first. Raw Excel data that can be processed right away lives in the /data/raw folder, while the OCR-processed files live in /data/ocr. A separate workbook, ocr.ipynb, walks you through the OCR process.
In [81]:
matches={i:{} for i in range(1993,2019)}
competitions={
2018:['CR','CN','SL'],
2017:['CR','CN','SL'],
2016:['CR','CN','SL'],
2015:['CR','CN','SL'],
2014:['CR','CN','SL'],
2013:['CR','CN','SL'],
2012:['CR','CN'],
2011:['CR','CN'],
2010:['CR','CN'],
2009:['CR','CN'],
1998:['CR'],
1997:['CR'],
1993:['CR']
}
In [7]:
for year in competitions:
    for competition in competitions[year]:
        matches[year][competition]=matches_loader.get_matches(year,competition)
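For orientation, the records returned by matches_loader.get_matches are dicts that the rest of this notebook indexes roughly as sketched below. The shape is inferred from the processing code further down, not from matches_loader itself, and the peek assumes get_matches returns a list.
In [ ]:
# Rough shape of one match record, inferred from how it is used later in this notebook:
#   k['match_type']            raw category/phase string
#   k['aka'], k['shiro']       dicts with a 'name' key plus raw point columns
#   k['shinpan']               optional dict with 'fukushin1', 'shushin', 'fukushin2'
#   k['outcome']               optional outcome string
# Peek at one loaded record:
print(matches[2018]['CN'][0])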
Names in name_exceptions get replaced with their right hand side values before processing.
In [82]:
name_exceptions={'Atanasovski':'Atanasovski A. (MAC)',
'Dobrovicescu (SON)':'Dobrovicescu T. (SON)',
'Ianăș':'Ianăș F.',
'Crăciun (Tamang) Sujata':'Crăciun S.',
'Abe (Carțiș) Emilia':'Abe E.',
'Dinu (Ioniță) Claudia-Andreea':'Dinu A.',
'Mureșan (Egri) Melinda':'Mureșan M.',
'Grădișteanu (Gușu) Rebeca':'Grădișteanu R.',
'Józsa (Gușu) Rodiana':'Józsa R.',
'Arabadjiyski': 'Arabadjiyski A.',
'Dudaș Francisc Andrei':'Dudaș F.',
'Dudaș Francisc':'Dudaș F.',
'Mandia':'Mandia F.',
'Stanev':'Stanev A.',
'Mochalov':'Mochalov O.',
'Sozzi':'Sozzi A.',
'Crăciunel':'Crăciunel I.',
'Craciunel':'Crăciunel I.',
'Sagaev':'Sagaev L.',
'Buzás':'Búzás C.',
'Csala':'Csala T.',
'Dimitrov':'Dimitrov M.',
'Józsa':'Józsa L.',
'Creangă':'Creangă A.',
'Duțescu':'Duțescu M.',
'Furtună':'Furtună G.',
'Gârbea':'Gârbea I.',
'Stupu':'Stupu I.',
'Mahika-Voiconi':'Mahika-Voiconi S.',
'Mahika':'Mahika-Voiconi S.',
'Stanciu':'Stanciu F.',
'Vrânceanu':'Vrânceanu R.',
'Wolfs':'Wolfs J.',
'Ducarme':'Ducarme A.',
'Sbârcea':'Sbârcea B.',
'Mocian':'Mocian A.',
'Hatvani':'Hatvani L.',
'Dusan':'Dusan N.',
'Borota':'Borota V.',
'Tsushima':'Tsushima K.',
'Tráser':'Tráser T.',
'Colțea':'Colțea A.',
'Brîcov':'Brîcov A.',
'Yamamoto':'Yamamoto M.',
'Crăciun':'Crăciun D.'}
Names in name_equals get replaced with their right hand side values after processing.
In [83]:
name_equals={'Chirea M.':'Chirea A.',
'Ghinet C.':'Ghineț C.',
'Anghelescu A.':'Anghelescu M.',
'Domnița M.':'Domniță M.',
'Bejgu N.':'Beygu N.',
'Canceu A.':'Canceu Ad.',
'Dinu C.':'Dinu A.',
'Grapa D.':'Grapă D.',
'Cristea C.':'Cristea Că.',
'Cismas O.':'Cismaș O.',
'Garbea I.':'Gârbea I.',
'Vitali O.':'Oncea V.',
'Ah-hu W.':'Ah-hu S.',
'Horvát M.':'Horváth M.',
'Ionita A.':'Ioniță A.',
'Medvedschi I.':'Medvețchi I.',
'Mahika S.':'Mahika-Voiconi S.',
'Mate L.':'Máté L.',
'Hentea L.':'Hentea A.',
'Stupu I.':'Stupu A.',
'Ah-Hu S.':'Ah-hu S.',
'Alexa I.':'Alexa A.',
'Albert V.':'Albert J.',
'Angelescu M.':'Angelescu M.',
'Apostu D.':'Apostu T.',
'Brâcov A.':'Brîcov A.',
'Zaporojan R.':'Zaporojan O.',
'Vasile C.':'Vasile I.',
'Dițu I.':'Dițu A.',
'Tudor-Duicu C.':'Tudor D.',
'Sandu M.':'Sandu Mar.',
'Radulescu A.':'Rădulescu An.',
'Péter C.':'Péter Cso.',
'Movatz E.':'Movatz V.',
'Molinger B.':'Molinger P.',
'Mitelea C.':'Mițelea C.',
'Macavei I.':'Macaveiu A.',
'Macavei A.' : 'Macaveiu A.',
'Macaveiu I.' : 'Macaveiu A.',
'Luca T.':'Luca Tr.',
'Leca L.':'Leca F.',
'Gutu E.':'Guțu E.',
'Angelescu A.':'Angelescu M.',
'Mehelean L.':'Mahalean L.',
'Catoriu D.':'Cantoriu D.',
'Călina A.':'Călina C.',
'Ștefu I.' : 'Ștefu L.',
'Țarălungă A.' : 'Țarălungă D.',
'Buzás C.':'Búzás C.',
'Korenshi E.':'Korenschi E.',
'Pleșa R.':'Pleșea R.',
'Galos A.':'Galoș A.',
'Győrfi G.':'Györfi G.',
'Győrfi S.':'Györfi S.',
'Ghineț G.':'Ghineț C.',
'Hostina E.':'Hoștină E.',
'Hostină E.':'Hoștină E.',
'Ianăs F.':'Ianăș F.',
'Ianas F.':'Ianăș F.',
'Tamang S.':'Crăciun S.',
'Taralunga D.':'Țarălungă D.',
'Lacatus M.':'Lăcătuș M.',
'Máthé L.':'Máté L.',
'Burinaru A.':'Burinaru Al.',
'Nastase M.':'Năstase E.',
'Oprisan A.':'Oprișan A.',
'Pârlea A.':'Pîrlea A.',
'Parlea A.':'Pîrlea A.',
'Sabau D.':'Sabău D.',
'Spriu C.':'Spiru C.',
'Crețiu T.':'Crețiu-Codreanu T.',
'Crețiu M.':'Crețiu-Codreanu M.',
'Bíró S.':'Biró S.',
'Oprișan B.':'Oprișan A.',
'Székely J.':'Székely P.',
'Bărbulescu M.' : 'Bărbulescu E.',
'Bejenariu G.' : 'Bejenaru G.',
'Bojan V.' : 'Bojan Vl.',
'Moise A.' : 'Moise Ad.',
'Măgirdicean R.' : 'Magirdicean Ră.',
'Pall D.':'Páll D.',
'Stănculascu C.':'Stănculescu C.',
'Vrânceanu M.': 'Vrânceanu L.',
'Georgescu A.':'Georgescu An.',
'Wasicek V.':'Wasicheck W.',
'Wasicsec W.':'Wasicheck W.',
'Wasichek W.' : 'Wasicheck W.',
'Wasicsek W.':'Wasicheck W.',
'Zolfoghari A.':'Zolfaghari A.'}
Names in name_doubles handle situations where the default name abbreviation would lead to duplicates.
In [84]:
name_doubles={
'Cristea Cristina':'Cristea Cr.',
'Cristea Călin-Ștefan':'Cristea Că.',
'Sandu Marius-Cristian':'Sandu Mar.',
'Sandu Matei-Serban':'Sandu Mat.',
'Sandu Matei':'Sandu Mat.',
'Georgescu Andrei':'Georgescu An.',
'Georgescu Alexandra':'Georgescu Al.',
'Péter Csongor':'Péter Cso.',
'Péter Csanád':'Péter Csa.',
'Luca Mihnea':'Luca Mihn.',
'Luca Mihai-Cătălin':'Luca Miha.',
'Luca':'Luca Miha.',
'Luca M':'Luca Miha.',
'Luca M.':'Luca Miha.',
'Luca Mihai':'Luca Miha.',
'Luca Traian-Dan':'Luca Tr.',
'Luca Tudor':'Luca Tu.',
'Canceu Anamaria':'Canceu An.',
'Canceu Adriana-Maria':'Canceu Ad.',
'Cioată Daniel-Mihai':'Cioată M.',
'Cioată Dragoș':'Cioată D.',
'Burinaru Alexandra':'Burinaru Al.',
'Burinaru Andreea':'Burinaru An.',
'Kovács Andrei':'Kovács An.',
'Kovács Alexandru':'Kovács Al.',
'Cristea Adrian':'Cristea Ad.',
'Cristea Andrei':'Cristea An.',
'Cristea A.':'Cristea An.',
'Ungureanu Nicolae Marius':'Ungureanu M.',
'Ungureanu Nicoleta':'Ungureanu N.',
'Vincze Vlad':'Vincze Vl.',
'Vincze Valentina':'Vincze Va.',
'Bojan Vladimir':'Bojan Vl.',
'Bojan Voicu':'Bojan Vo.',
'Crețiu Codreanu Matei':'Crețiu-Codreanu M.',
'Crețiu Codreanu Tudor':'Crețiu-Codreanu T.',
'Pop Mugurel Voicu':'Pop-Mugurel V.',
'Pop Mihai':'Pop M.',
'Moise Alexandru':'Moise Al.',
'Moise Adrian':'Moise Ad.',
'Rădulescu Andrei-Savin':'Rădulescu An.',
'Rădulescu Adrian':'Rădulescu Ad.',
'Magirdicean Romeo':'Magirdicean Ro.',
'Magirdicean Răzvan Ionuț':'Magirdicean Ră.'}
Normalize Romanian characters and define a name cleaner function to produce Name IDs. A Name ID is a unique competitor name of the form: Surname, first letter of the given name. If the first letter leads to a non-unique ID, further letters are taken until the ID is unique (see name_doubles above). The cleaner works as follows: names in name_doubles are returned directly; otherwise the letters are normalized, name_exceptions are applied, the club suffix in parentheses is stripped, the short name (sname) is built from the surname and the first initial, and finally name_equals is applied to the short name.
In [85]:
letter_norm={'ţ':'ț','ş':'ș','Ş':'Ș'}
def name_cleaner(name):
    name=str(name)
    if name in name_doubles:
        return name_doubles[name]
    else:
        for letter in letter_norm:
            name=name.replace(letter,letter_norm[letter])
        if name in name_exceptions:
            name=name_exceptions[name]
        nc=name.replace('  ',' ').split('(')  #collapse double spaces, strip club suffix
        rname=nc[0].strip()
        rnames=rname.split(' ')
        sname=rnames[0]+' '+rnames[1][0]+'.'
        if sname in name_equals:
            sname=name_equals[sname]
        if sname in name_doubles:
            print(name,sname)
        return sname
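A quick sanity check of the cleaner on a few illustrative spellings; the expected outputs follow from the name_exceptions, name_equals and name_doubles tables above.
In [ ]:
name_cleaner('Dudaș Francisc Andrei')   # 'Dudaș F.' (via name_exceptions)
name_cleaner('Tamang Sujata')           # 'Crăciun S.' (short form remapped via name_equals)
name_cleaner('Sandu Matei-Serban')      # 'Sandu Mat.' (resolved directly via name_doubles)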
Names equalling any string in redflags_names get thrown out of the final dataset.
Names containing any string in redflags_names2 get thrown out of the final dataset.
In [86]:
redflags_names=['-','—','—',np.nan,'. ()','— ','- -.','- -. (-)','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S',
'Kashi','Sankon','București','Victorii:','Sakura','Taiken','Ikada','Sonkei','CRK','Museido',
'Ichimon','Bushi Tokukai 1','Competitori – Shiai-sha','Echipa - roşu','Numele şi prenumele',
'Victorii:','Victorii: 0','Victorii: 1','Victorii: 2','Victorii: 3','Victorii: 4',
'Victorii: 5','?','Kyobukan','2/5','2/6','3/8','Finala','Kyobukan (0/0/0)','―',
'(clasament final după meci de baraj)','CRK (Bucuresti)','Kaybukan','Isshin (Cluj)',
'Ikada (Bucureşti)','Kyobukan (Braşov)','Puncte:','KASHI','Budoshin','Isshin',
'— (—)','4. B.','4. Baraj: Stupu M - Hostina','4. Baraj: Moise KM - Korenschi M',
'Bushi Tokukai (2/8/17)','CRK 2 (1/6/14)', 'CRK 2','CRK 1','Loc I.:','Loc',
'Bushi Tokukai 2 (M Ciuc)','Echipa suport']
redflags_names2=['Bushi Tokukai','Eliminatoriu','finala','Finala','Fianala','Ikada','Ichimon','Pool',
'Locul ','Lotul ','Loc ','Grupa ','Isshin','Meciul ','Victorii:','L1','1','2','3','4','5','6','7','8','9','0']
Check that a name is not in the redflags. Entries that fail this check are ignored.
In [87]:
def name_ok(name):
    name=str(name)
    if name=='nan': return False
    if name not in redflags_names:
        if np.array([i not in name for i in redflags_names2]).all():
            return True
    return False
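A few illustrative checks of the filter: competitor names pass, while table artefacts and phase labels are rejected.
In [ ]:
name_ok('Dudaș Francisc Andrei')   # True
name_ok('Victorii: 3')             # False: listed in redflags_names
name_ok('Grupa A')                 # False: contains 'Grupa ' from redflags_names2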
Process all names for standardization. Create 3 variables:
all_players: cleaned Name ID -> {year -> set of raw (unclean) name forms seen}
all_players_r: reverse relationship: raw (unclean) name form -> cleaned Name ID
all_players_unsorted: unique set of all raw names processed
Process both competitor and shinpan names.
In [88]:
all_players={}
all_players_r={}
all_players_unsorted=set()
for year in matches:
    for competition in matches[year]:
        for match in matches[year][competition]:
            for color in ['aka','shiro']:
                name=match[color]['name']
                all_players_unsorted.add(name)
                if name_ok(name):
                    name=name_cleaner(name)
                    rname=match[color]['name']
                    if rname not in all_players_r:all_players_r[rname]=name
                    if name not in all_players: all_players[name]={}
                    if year not in all_players[name]:all_players[name][year]={'names':set()}
                    all_players[name][year]['names'].add(rname)
            if 'shinpan' in match:
                for color in ['fukushin1','shushin','fukushin2']:
                    aka=match['aka']['name']
                    shiro=match['shiro']['name']
                    if (name_ok(aka)) and\
                       (name_ok(shiro)) and\
                       (name_cleaner(aka) in all_players) and\
                       (name_cleaner(shiro) in all_players):
                        rname=match['shinpan'][color]
                        all_players_unsorted.add(rname)
                        if name_ok(rname):
                            name=name_cleaner(rname)
                            if rname not in all_players_r:all_players_r[rname]=name
                            if name not in all_players: all_players[name]={}
                            if year not in all_players[name]:all_players[name][year]={'names':set()}
                            all_players[name][year]['names'].add(rname)
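A quick way to inspect what the loop produced: all_players is keyed by Name ID and records the raw forms seen per year, while all_players_r maps each raw form back to its Name ID (the counts depend on the raw data).
In [ ]:
# all_players[name_id][year]['names'] -> set of raw name forms seen that year
# all_players_r[raw_form]             -> name_id
print(len(all_players), 'Name IDs,', len(all_players_unsorted), 'raw name forms')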
Link the processed names to the names in members. The name_linker dictionary contains Name IDs (short names) as keys and sets of long names as values. Ideally, each set should contain only one element, so that the mapping is unique.
In [89]:
name_linker={}
for i in members_clean.index:
    name=members_clean.loc[i]['name']
    try:
        cname=name_cleaner(name)
    except:
        print(name)
    if cname not in name_linker:name_linker[cname]=set()
    name_linker[cname].add(name)
Do the opposite mapping in names_abbr: long->short. Create exceptions for duplicate names.
In [90]:
names_abbr={}
for name in name_linker:
    if len(name_linker[name])>1:
        #only for dev to create exceptions for duplicate person names
        print(name,name_linker[name])
    for i in name_linker[name]:
        names_abbr[i]=name
Save club mappings by short name, by year.
In [91]:
names_abbr_list=[]
name_abbr2long={}
name_abbr2club={}
for i in members_clean.index:
    name=members_clean.loc[i]['name']
    club=members_clean.loc[i]['club']
    year=members_clean.loc[i]['year']
    names_abbr_list.append(names_abbr[name])
    name_abbr2long[names_abbr[name]]=name
    if names_abbr[name] not in name_abbr2club:name_abbr2club[names_abbr[name]]={}
    if year not in name_abbr2club[names_abbr[name]]:
        name_abbr2club[names_abbr[name]][year]=club
Add short names to members_clean.
In [92]:
members_clean['name_abbr']=names_abbr_list
Some names appear only in short form, so we need to add them to the long-name list manually. We parse all forms in which the name appears and choose the longest; we call this the inferred name.
In [93]:
for name in all_players:
    if name not in name_abbr2long:
        #infer using longest available name
        names={len(j):j for i in all_players[name] for j in all_players[name][i]['names']}
        if len(names)>0:
            inferred_name=names[max(names.keys())]
            if '(' in inferred_name:
                inferred_name=inferred_name[:inferred_name.find('(')-1]
            name_abbr2long[name]=inferred_name
Infer duplicates using the Levenshtein distance, first on the long names and then on the short Name IDs.
In [94]:
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
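A quick check of the distance function on two diacritic-only spelling variants that also appear in name_equals above; both pairs should come out as 2.
In [ ]:
print(levenshteinDistance('Ianas F.', 'Ianăș F.'))      # 2: 'a' vs 'ă' and 's' vs 'ș'
print(levenshteinDistance('Hostina E.', 'Hoștină E.'))  # 2: 's' vs 'ș' and 'a' vs 'ă'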
In [95]:
nkeys=np.sort(list(name_abbr2long.keys()))
for ii in range(len(name_abbr2long)):
    i=nkeys[ii]
    for jj in range(ii):
        j=nkeys[jj]
        if levenshteinDistance(name_abbr2long[i],name_abbr2long[j])<4:
            print(name_abbr2long[i],':',name_abbr2long[j],' - ',i,':',j)
In [96]:
nkeys=np.sort(list(name_abbr2long.keys()))
for ii in range(len(name_abbr2long)):
    i=nkeys[ii]
    for jj in range(ii):
        j=nkeys[jj]
        if levenshteinDistance(i,j)<3:
            print(i,':',j,' - ',name_abbr2long[i],':',name_abbr2long[j])
Infer clubs from the name when the club is part of the name used in the competition data. Club names in redflags_clubs get ignored. Clubs in club_equals get replaced after processing. The convention is a 3-letter all-caps club code for Romanian clubs, and a 3-letter club code followed by a / and a two-letter country code for foreign clubs.
In [97]:
redflags_clubs=['','N/A','RO1','RO2']
club_equals=clubs_loader.club_equals
Attach clubs to entries in all_players that carry a club in their competition name data but which we don't already know from the members register.
In [98]:
for name in all_players:
    #if we dont already know the club for this year from the members register
    if name not in name_abbr2club:
        for year in all_players[name]:
            for name_form in all_players[name][year]['names']:
                if '(' in name_form:
                    club=name_form.split('(')[1].strip()[:-1]
                    if club not in redflags_clubs:
                        if name not in name_abbr2club:name_abbr2club[name]={}
                        name_abbr2club[name][year]=club
    else:
        for year in all_players[name]:
            #else if no club info for particular year
            if year not in name_abbr2club[name]:
                for name_form in all_players[name][year]['names']:
                    if '(' in name_form:
                        club=name_form.split('(')[1].strip()[:-1]
                        if club not in redflags_clubs:
                            name_abbr2club[name][year]=club
Normalize club names and long names.
In [99]:
for name in name_abbr2club:
    for year in name_abbr2club[name]:
        if name_abbr2club[name][year] in club_equals:
            name_abbr2club[name][year]=club_equals[name_abbr2club[name][year]]
for name in name_abbr2long:
    name_abbr2long[name]=name_abbr2long[name].replace('  ',' ').strip()
If club still not found, fill the gaps between years. Forward fill first, then backward fill, if necessary.
In [103]:
for name in all_players:
    if name in name_abbr2club:
        years=np.sort(list(all_players[name].keys()))
        minyear1=min(years)
        maxyear1=max(years)
        minyear2=min(name_abbr2club[name].keys())
        maxyear2=max(name_abbr2club[name].keys())
        if len(years)>1:
            for year in range(min(minyear1,minyear2),max(maxyear1,maxyear2)+1):
                if year not in name_abbr2club[name]:
                    #get club from previous year
                    for y in range(years[0],year):
                        if y in name_abbr2club[name]:
                            name_abbr2club[name][year]=str(name_abbr2club[name][y])
                            break
                if year not in name_abbr2club[name]:
                    #if still not found, get club from next year
                    for y in np.arange(years[-1],year,-1):
                        if y in name_abbr2club[name]:
                            name_abbr2club[name][year]=str(name_abbr2club[name][y])
                            break
                if year not in name_abbr2club[name]:
                    #if still not found, get first known year
                    if year<minyear2:
                        name_abbr2club[name][year]=str(name_abbr2club[name][minyear2])
                    else:
                        name_abbr2club[name][year]=str(name_abbr2club[name][maxyear2])
We have extracted what was possible from the data. Now we save the short-name to long-name and club mappings (by year), then edit this file manually, if necessary. The manual_name_needed, manual_club_needed and manual_data_override variables used in the next cells carry over from previous manual-review runs of this notebook.
In [30]:
manual_data_needed=[]
for i in manual_name_needed.union(manual_club_needed):
    if i not in list(manual_data_override.index):
        dummy={'name':i,'long_name':'','club':''}
        if i in name_abbr2club:
            dummy['club']=name_abbr2club[i][max(list(name_abbr2club[i].keys()))]
        if i in manual_club_needed:
            if i in name_abbr2long:
                dummy['long_name']=name_abbr2long[i]
        manual_data_needed.append(dummy)
In [31]:
df=pd.DataFrame(manual_data_needed).set_index('name')
df=pd.concat([manual_data_override,df]).drop_duplicates().sort_index()
In [32]:
df.to_excel('../data/manual/members_manual.xlsx')
Extend with manual data
In [33]:
for i in df['long_name'].replace('',np.nan).dropna().index:
    name_abbr2long[i]=df.loc[i]['long_name']
    all_players_r[name_abbr2long[i]]=i
In [34]:
manual_club_needed=set()
for name in all_players:
    years=np.sort(list(all_players[name].keys()))
    minyear=min(years)
    maxyear=max(years)
    for year in range(minyear,maxyear+1):
        if name not in name_abbr2club:name_abbr2club[name]={}
        if year not in name_abbr2club[name]:
            if name in df['club'].replace('',np.nan).dropna().index:
                name_abbr2club[name][year]=df.loc[name]['club']
            else:
                name_abbr2club[name][year]='XXX'
Update and overwrite with club existence data
Extend members with unregistered members. These are probably inactive now, or from abroad, and only appear in the year(s) when they competed, so we only register them as known to be active in those years. This is in contrast with the inactive members from the registry, for whom we know when they went inactive.
In [35]:
unregistered_members=[]
for name in all_players:
    if name not in set(members_clean['name_abbr'].values):
        years=np.sort(list(name_abbr2club[name].keys()))
        for year in range(min(years),max(years)+1):
            if year in all_players[name]:
                iyear=year
            else:
                iyear=max(years)
            club,country=clubs_loader.club_cleaner(name_abbr2club[name][year])
            if country=='RO':
                activ='Active'
                dan=''#dan=0
            else:
                activ='Abroad'
                dan=''
            unregistered_members.append({'name':name_abbr2long[name],'name_abbr':name,
                'club':club,'active':activ,'year':year,'dan':dan,'country':country,'source':'matches'})
In [36]:
members_clean['country']='RO'
members_clean['source']='member list'
In [49]:
members_updated=pd.concat([members_clean,pd.DataFrame(unregistered_members)]).reset_index(drop=True)
Extend with 0 dan (mu dan) entries down to the starting year: for years in which a member appears in competition but is missing from the registry, add a 0-dan row, extrapolating the age from the nearest registered year.
In [50]:
members_mu_dan_extensions=[]
members_by_name=members_updated.set_index(['name_abbr'])
for year in matches:
    members_by_year=members_updated.set_index(['year']).loc[year]
    for competition in matches[year]:
        print(year,competition)
        for k in matches[year][competition]:
            aka=k['aka']['name']
            shiro=k['shiro']['name']
            if (name_ok(aka)) and\
               (name_ok(shiro)) and\
               (name_cleaner(aka) in all_players) and\
               (name_cleaner(shiro) in all_players):
                for a in ['aka','shiro']:
                    for h in k[a]:
                        if h=='name':
                            name=k[a][h]
                            rname=all_players_r[name]
                            if rname in list(members_by_name.index):
                                if rname not in members_by_year['name_abbr'].values:
                                    dummy=members_by_name.loc[[rname]]
                                    minyear=min(dummy['year'])
                                    maxyear=max(dummy['year'])
                                    if year>maxyear:
                                        dummy=dummy[dummy['year']==maxyear]
                                        yeardiff=min(dummy['year'])-year
                                    else:
                                        dummy=dummy[dummy['year']==minyear]
                                        yeardiff=year-max(dummy['year'])
                                    dummy=dummy.reset_index()
                                    dummy['year']=year
                                    dummy['dan']=0
                                    dummy['age']=dummy['age']+yeardiff
                                    dummy['source']='matches, mu dan'
                                    members_mu_dan_extensions.append(dummy)
                            #if only appears in competition in one year, then not in members table
                            else:
                                print(rname,year)
                                #fix in unregistered_members
Update members
In [51]:
members_mu_dan_extensions=pd.concat(members_mu_dan_extensions)
members_updated=pd.concat([members_updated,members_mu_dan_extensions]).reset_index(drop=True)
Prettify club names and IDs.
In [52]:
clubs=[]
pclubs=[]
countries=[]
for i in members_updated.index:
    club=members_updated.loc[i]['club']
    country=members_updated.loc[i]['country']
    year=members_updated.loc[i]['year']
    club,country=clubs_loader.club_cleaner(club,country)
    club,pclub=clubs_loader.club_year(club,country,year)
    clubs.append(club)
    pclubs.append(pclub)
    countries.append(country)
In [53]:
members_updated['club']=clubs
members_updated['pretty_club']=pclubs
members_updated['country']=countries
Fix unknown genders.
In [54]:
manual_mf_data_override=pd.read_excel('../data/manual/members_mf_manual.xlsx')
In [55]:
manual_mf_data_needed=members_updated[(members_updated['gen']!='M')&(members_updated['gen']!='F')][['name_abbr','name']]\
.drop_duplicates()
In [56]:
df=manual_mf_data_needed#.merge(manual_mf_data_override[['name_abbr','gen']],'outer').drop_duplicates()
df.to_excel('../data/manual/members_mf_manual.xlsx')
Update members with manual gender data.
In [57]:
members_updated=members_updated.reset_index(drop=True).drop_duplicates()
In [58]:
gens=[]
for i in members_updated.index:
    name=members_updated.loc[i]['name_abbr']
    if name in list(df.index):
        gens.append(df.loc[name])
    else:
        gens.append(members_updated.loc[i]['gen'])
In [59]:
members_updated['gen']=gens
Save to /data/export.
In [60]:
members_updated.to_csv('../data/export/members.csv')
In [61]:
clubs_updated=members_updated.groupby(['club','country','pretty_club','year'])[['name_abbr']].count()
clubs_updated=clubs_updated.reset_index().set_index('club').join(clubs_loader.club_year_df['Oraș'])
clubs_updated.to_csv('../data/export/clubs.csv')
In [62]:
master_matches=[]
for year in matches:
    members_by_year=members_updated.set_index(['year']).loc[year].drop_duplicates()
    for competition in matches[year]:
        print(year,competition)
        for k in matches[year][competition]:
            good=True
            match={'year':year,'competition':competition}
            match['match_category'],match['match_teams'],match['match_phase']=point_utils.match_cleaner(year,k['match_type'])
            if 'shinpan' in k:
                for color in ['fukushin1','shushin','fukushin2']:
                    if color in k['shinpan']:
                        if k['shinpan'][color] in all_players_r:
                            #normalize shinpan names
                            match[color]=name_abbr2long[all_players_r[k['shinpan'][color]]]
            aka=k['aka']['name']
            shiro=k['shiro']['name']
            if (name_ok(aka)) and\
               (name_ok(shiro)) and\
               (name_cleaner(aka) in all_players) and\
               (name_cleaner(shiro) in all_players):
                for a in ['aka','shiro']:
                    points=''
                    for h in k[a]:
                        if h=='name':
                            name=k[a][h]
                            #normalize competitor names
                            rname=all_players_r[name]
                            df=members_by_year[members_by_year['name_abbr']==rname]
                            match[a+' name']=name_abbr2long[rname]
                        else:
                            point=k[a][h]
                            if str(point)=='nan': point=''
                            points=points+point
                    good=point_utils.point_redflags(points)
                    if good:
                        match[a+' point1'],match[a+' point2'],match[a+' points'],\
                        match[a+' hansoku'],match['encho']=point_utils.points_cleaner(points)
            else:
                good=False
            if good:
                if 'outcome' in k:
                    match['encho']=point_utils.outcome_cleaner(k['outcome'])
                else:
                    match['encho']=False
                match['winner'],match['difference']=point_utils.outcome_from_points(match['aka points'],match['shiro points'])
                master_matches.append(match)
Clean up and save matches for display
In [63]:
data=pd.DataFrame(master_matches).reset_index(drop=True)
In [64]:
save_utils.save(data)