Created by Dénes Csala | 2019 | MIT License
For improvement suggestions or spotted processing mistakes, drop me a message on Facebook.
If you would like to have your country/club data visualized in a similar manner, or for any other data visualization and analytics consultancy inquiries, contact me at mail@csaladen.es
This workbook guides you through the data cleaning stage for the Romania Kendo Stats visualization. This is a multi-stage process: you will need access to the raw data (liaise with the Secretary or another member in charge of data at the Romanian Kendo Association), and Python and Excel installed. Any required Python packages will be installed along the way, but we recommend using the Anaconda distribution of Python 3. If you would like to edit the visualization part, you will also need PowerBI Desktop.
The general structure of the repository is the following:
/data/raw: this is where you place the data downloaded from the official data source, sorted by years and competitions; only keep files that contain relevant match data
/data/ocr: this is where the data gets saved after OCR has been performed - this is necessary for some older files in image format
/data/manual: this is where manually extracted matches from old image files get placed - they should follow the 2018 CN format, i.e. all matches in one sheet
/data/export: this is where we save the data formatted for loading into the viz
/data/clean: this is where all the processed, cleaned data ends up - it should follow the 2018 CN format, i.e. all matches in one sheet
/scripts: this is the main code repository for all data processing scripts
/viz: this is where the visualization files get saved - they are created using PowerBI and load data from /data/clean
This section reads and cleans the RKA members list. Save as baseline.
In [76]:
import pandas as pd, numpy as np, json
import members_loader, matches_loader, clubs_loader, point_utils, save_utils
First, download members data (Evidenta membrilor.xlsx) from the official data source, and create a macro-enabled Excel file from the Google Sheet. Then write a simple macro to extract the cell comments from the Club column in order to get info about club Transfers. Follow the instructions here. Save the new file as Evidenta membrilor.xlsm in the /data/manual folder. Use the members_loader module to process this file.
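If you want to sanity-check the macro output from Python, cell comments can also be read directly with openpyxl. This is only an optional sketch, not part of members_loader; the sheet and column layout below are assumptions you should adjust to the real file.
In [ ]:
# Optional cross-check of the extracted comments (not part of the pipeline).
# Assumption: the member list is on the active sheet; adjust sheet/column as needed.
from openpyxl import load_workbook

wb = load_workbook('../data/manual/Evidenta membrilor.xlsm', keep_vba=True)
ws = wb.active
for row in ws.iter_rows(min_row=2):
    for cell in row:
        if cell.comment is not None:
            # transfer notes are stored as comments on the Club column
            print(cell.coordinate, cell.comment.text)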
In [77]:
members=members_loader.get_members('../data/manual/Evidenta membrilor.xlsm')
Members are loaded but a bit messy.
In [78]:
members.head(2)
Out[78]:
In [79]:
members_clean=members_loader.cleaner(members).reset_index(drop=False)
In [80]:
members_clean.to_csv('../data/clean/members.csv')
Matches are loaded from Excel sheets in the /data folder, organized by year and competition. We are always looking for match list data: the cleaner and the more concentrated, the better. Since this is not always possible, we have several demo import routines, stored in the matches_loader.py function library. Not all matches have textual data available; those need to be processed through OCR first. Raw Excel data that can be processed right away lives in the /data/raw folder, while the OCR-processed files live in /data/ocr. A separate workbook, ocr.ipynb, walks you through the OCR process.
In [81]:
matches={i:{} for i in range(1993,2019)}
competitions={
2018:['CR','CN','SL'],
2017:['CR','CN','SL'],
2016:['CR','CN','SL'],
2015:['CR','CN','SL'],
2014:['CR','CN','SL'],
2013:['CR','CN','SL'],
2012:['CR','CN'],
2011:['CR','CN'],
2010:['CR','CN'],
2009:['CR','CN'],
1998:['CR'],
1997:['CR'],
1993:['CR']
}
In [7]:
for year in competitions:
    for competition in competitions[year]:
        matches[year][competition]=matches_loader.get_matches(year,competition)
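For orientation, the records returned by matches_loader.get_matches are dicts that the rest of this notebook indexes roughly as sketched below. The shape is inferred from the processing code further down, not from matches_loader itself, and the peek assumes get_matches returns a list.
In [ ]:
# Rough shape of one match record, inferred from how it is used later in this notebook:
#   k['match_type']            raw category/phase string
#   k['aka'], k['shiro']       dicts with a 'name' key plus raw point columns
#   k['shinpan']               optional dict with 'fukushin1', 'shushin', 'fukushin2'
#   k['outcome']               optional outcome string
# Peek at one loaded record:
print(matches[2018]['CN'][0])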
Names in name_exceptions get replaced with their right hand side values before processing.
In [82]:
name_exceptions={'Atanasovski':'Atanasovski A. (MAC)',
'Dobrovicescu (SON)':'Dobrovicescu T. (SON)',
'Ianăș':'Ianăș F.',
'Crăciun (Tamang) Sujata':'Crăciun S.',
'Abe (Carțiș) Emilia':'Abe E.',
'Dinu (Ioniță) Claudia-Andreea':'Dinu A.',
'Mureșan (Egri) Melinda':'Mureșan M.',
'Grădișteanu (Gușu) Rebeca':'Grădișteanu R.',
'Józsa (Gușu) Rodiana':'Józsa R.',
'Arabadjiyski': 'Arabadjiyski A.',
'Dudaș Francisc Andrei':'Dudaș F.',
'Dudaș Francisc':'Dudaș F.',
'Mandia':'Mandia F.',
'Stanev':'Stanev A.',
'Mochalov':'Mochalov O.',
'Sozzi':'Sozzi A.',
'Crăciunel':'Crăciunel I.',
'Craciunel':'Crăciunel I.',
'Sagaev':'Sagaev L.',
'Buzás':'Búzás C.',
'Csala':'Csala T.',
'Dimitrov':'Dimitrov M.',
'Józsa':'Józsa L.',
'Creangă':'Creangă A.',
'Duțescu':'Duțescu M.',
'Furtună':'Furtună G.',
'Gârbea':'Gârbea I.',
'Stupu':'Stupu I.',
'Mahika-Voiconi':'Mahika-Voiconi S.',
'Mahika':'Mahika-Voiconi S.',
'Stanciu':'Stanciu F.',
'Vrânceanu':'Vrânceanu R.',
'Wolfs':'Wolfs J.',
'Ducarme':'Ducarme A.',
'Sbârcea':'Sbârcea B.',
'Mocian':'Mocian A.',
'Hatvani':'Hatvani L.',
'Dusan':'Dusan N.',
'Borota':'Borota V.',
'Tsushima':'Tsushima K.',
'Tráser':'Tráser T.',
'Colțea':'Colțea A.',
'Brîcov':'Brîcov A.',
'Yamamoto':'Yamamoto M.',
'Crăciun':'Crăciun D.'}
Names in name_equals get replaced with their right hand side values after processing.
In [83]:
name_equals={'Chirea M.':'Chirea A.',
'Ghinet C.':'Ghineț C.',
'Anghelescu A.':'Anghelescu M.',
'Domnița M.':'Domniță M.',
'Bejgu N.':'Beygu N.',
'Canceu A.':'Canceu Ad.',
'Dinu C.':'Dinu A.',
'Grapa D.':'Grapă D.',
'Cristea C.':'Cristea Că.',
'Cismas O.':'Cismaș O.',
'Garbea I.':'Gârbea I.',
'Vitali O.':'Oncea V.',
'Ah-hu W.':'Ah-hu S.',
'Horvát M.':'Horváth M.',
'Ionita A.':'Ioniță A.',
'Medvedschi I.':'Medvețchi I.',
'Mahika S.':'Mahika-Voiconi S.',
'Mate L.':'Máté L.',
'Hentea L.':'Hentea A.',
'Stupu I.':'Stupu A.',
'Ah-Hu S.':'Ah-hu S.',
'Alexa I.':'Alexa A.',
'Albert V.':'Albert J.',
'Angelescu M.':'Angelescu M.',
'Apostu D.':'Apostu T.',
'Brâcov A.':'Brîcov A.',
'Zaporojan R.':'Zaporojan O.',
'Vasile C.':'Vasile I.',
'Dițu I.':'Dițu A.',
'Tudor-Duicu C.':'Tudor D.',
'Sandu M.':'Sandu Mar.',
'Radulescu A.':'Rădulescu An.',
'Péter C.':'Péter Cso.',
'Movatz E.':'Movatz V.',
'Molinger B.':'Molinger P.',
'Mitelea C.':'Mițelea C.',
'Macavei I.':'Macaveiu A.',
'Macavei A.' : 'Macaveiu A.',
'Macaveiu I.' : 'Macaveiu A.',
'Luca T.':'Luca Tr.',
'Leca L.':'Leca F.',
'Gutu E.':'Guțu E.',
'Angelescu A.':'Angelescu M.',
'Mehelean L.':'Mahalean L.',
'Catoriu D.':'Cantoriu D.',
'Călina A.':'Călina C.',
'Ștefu I.' : 'Ștefu L.',
'Țarălungă A.' : 'Țarălungă D.',
'Buzás C.':'Búzás C.',
'Korenshi E.':'Korenschi E.',
'Pleșa R.':'Pleșea R.',
'Galos A.':'Galoș A.',
'Győrfi G.':'Györfi G.',
'Győrfi S.':'Györfi S.',
'Ghineț G.':'Ghineț C.',
'Hostina E.':'Hoștină E.',
'Hostină E.':'Hoștină E.',
'Ianăs F.':'Ianăș F.',
'Ianas F.':'Ianăș F.',
'Tamang S.':'Crăciun S.',
'Taralunga D.':'Țarălungă D.',
'Lacatus M.':'Lăcătuș M.',
'Máthé L.':'Máté L.',
'Burinaru A.':'Burinaru Al.',
'Nastase M.':'Năstase E.',
'Oprisan A.':'Oprișan A.',
'Pârlea A.':'Pîrlea A.',
'Parlea A.':'Pîrlea A.',
'Sabau D.':'Sabău D.',
'Spriu C.':'Spiru C.',
'Crețiu T.':'Crețiu-Codreanu T.',
'Crețiu M.':'Crețiu-Codreanu M.',
'Bíró S.':'Biró S.',
'Oprișan B.':'Oprișan A.',
'Székely J.':'Székely P.',
'Bărbulescu M.' : 'Bărbulescu E.',
'Bejenariu G.' : 'Bejenaru G.',
'Bojan V.' : 'Bojan Vl.',
'Moise A.' : 'Moise Ad.',
'Măgirdicean R.' : 'Magirdicean Ră.',
'Pall D.':'Páll D.',
'Stănculascu C.':'Stănculescu C.',
'Vrânceanu M.': 'Vrânceanu L.',
'Georgescu A.':'Georgescu An.',
'Wasicek V.':'Wasicheck W.',
'Wasicsec W.':'Wasicheck W.',
'Wasichek W.' : 'Wasicheck W.',
'Wasicsek W.':'Wasicheck W.',
'Zolfoghari A.':'Zolfaghari A.'}
Names in name_doubles handle situations where the default name abbreviation would lead to duplicates.
In [84]:
name_doubles={
'Cristea Cristina':'Cristea Cr.',
'Cristea Călin-Ștefan':'Cristea Că.',
'Sandu Marius-Cristian':'Sandu Mar.',
'Sandu Matei-Serban':'Sandu Mat.',
'Sandu Matei':'Sandu Mat.',
'Georgescu Andrei':'Georgescu An.',
'Georgescu Alexandra':'Georgescu Al.',
'Péter Csongor':'Péter Cso.',
'Péter Csanád':'Péter Csa.',
'Luca Mihnea':'Luca Mihn.',
'Luca Mihai-Cătălin':'Luca Miha.',
'Luca':'Luca Miha.',
'Luca M':'Luca Miha.',
'Luca M.':'Luca Miha.',
'Luca Mihai':'Luca Miha.',
'Luca Traian-Dan':'Luca Tr.',
'Luca Tudor':'Luca Tu.',
'Canceu Anamaria':'Canceu An.',
'Canceu Adriana-Maria':'Canceu Ad.',
'Cioată Daniel-Mihai':'Cioată M.',
'Cioată Dragoș':'Cioată D.',
'Burinaru Alexandra':'Burinaru Al.',
'Burinaru Andreea':'Burinaru An.',
'Kovács Andrei':'Kovács An.',
'Kovács Alexandru':'Kovács Al.',
'Cristea Adrian':'Cristea Ad.',
'Cristea Andrei':'Cristea An.',
'Cristea A.':'Cristea An.',
'Ungureanu Nicolae Marius':'Ungureanu M.',
'Ungureanu Nicoleta':'Ungureanu N.',
'Vincze Vlad':'Vincze Vl.',
'Vincze Valentina':'Vincze Va.',
'Bojan Vladimir':'Bojan Vl.',
'Bojan Voicu':'Bojan Vo.',
'Crețiu Codreanu Matei':'Crețiu-Codreanu M.',
'Crețiu Codreanu Tudor':'Crețiu-Codreanu T.',
'Pop Mugurel Voicu':'Pop-Mugurel V.',
'Pop Mihai':'Pop M.',
'Moise Alexandru':'Moise Al.',
'Moise Adrian':'Moise Ad.',
'Rădulescu Andrei-Savin':'Rădulescu An.',
'Rădulescu Adrian':'Rădulescu Ad.',
'Magirdicean Romeo':'Magirdicean Ro.',
'Magirdicean Răzvan Ionuț':'Magirdicean Ră.'}
Normalize Romanian characters and define a name cleaner function to produce Name IDs. A Name ID is a unique competitor name of the form: Surname, first letter of the given name. If the first letter leads to a non-unique ID, further letters are taken until the ID is unique (see name_doubles above). The cleaner works as follows: names in name_doubles are returned directly; otherwise the letters are normalized, name_exceptions are applied, the club suffix in parentheses is stripped, the short name (sname) is built from the surname and the first initial, and finally name_equals is applied to the short name.
In [85]:
letter_norm={'ţ':'ț','ş':'ș','Ş':'Ș'}
def name_cleaner(name):
    name=str(name)
    if name in name_doubles:
        return name_doubles[name]
    else:
        for letter in letter_norm:
            name=name.replace(letter,letter_norm[letter])
        if name in name_exceptions:
            name=name_exceptions[name]
        nc=name.replace('  ',' ').split('(')  #collapse double spaces, strip club suffix
        rname=nc[0].strip()
        rnames=rname.split(' ')
        sname=rnames[0]+' '+rnames[1][0]+'.'
        if sname in name_equals:
            sname=name_equals[sname]
        if sname in name_doubles:
            print(name,sname)
        return sname
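A quick sanity check of the cleaner on a few illustrative spellings; the expected outputs follow from the name_exceptions, name_equals and name_doubles tables above.
In [ ]:
name_cleaner('Dudaș Francisc Andrei')   # 'Dudaș F.' (via name_exceptions)
name_cleaner('Tamang Sujata')           # 'Crăciun S.' (short form remapped via name_equals)
name_cleaner('Sandu Matei-Serban')      # 'Sandu Mat.' (resolved directly via name_doubles)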
Names equalling any string in redflags_names get thrown out of the final dataset.
Names containing any string in redflags_names2 get thrown out of the final dataset.
In [86]:
redflags_names=['-','—','—',np.nan,'. ()','— ','- -.','- -. (-)','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','R','S',
'Kashi','Sankon','București','Victorii:','Sakura','Taiken','Ikada','Sonkei','CRK','Museido',
'Ichimon','Bushi Tokukai 1','Competitori – Shiai-sha','Echipa - roşu','Numele şi prenumele',
'Victorii:','Victorii: 0','Victorii: 1','Victorii: 2','Victorii: 3','Victorii: 4',
'Victorii: 5','?','Kyobukan','2/5','2/6','3/8','Finala','Kyobukan (0/0/0)','―',
'(clasament final după meci de baraj)','CRK (Bucuresti)','Kaybukan','Isshin (Cluj)',
'Ikada (Bucureşti)','Kyobukan (Braşov)','Puncte:','KASHI','Budoshin','Isshin',
'— (—)','4. B.','4. Baraj: Stupu M - Hostina','4. Baraj: Moise KM - Korenschi M',
'Bushi Tokukai (2/8/17)','CRK 2 (1/6/14)', 'CRK 2','CRK 1','Loc I.:','Loc',
'Bushi Tokukai 2 (M Ciuc)','Echipa suport']
redflags_names2=['Bushi Tokukai','Eliminatoriu','finala','Finala','Fianala','Ikada','Ichimon','Pool',
'Locul ','Lotul ','Loc ','Grupa ','Isshin','Meciul ','Victorii:','L1','1','2','3','4','5','6','7','8','9','0']
Check that a name is not in the redflags. Entries that fail this check are ignored.
In [87]:
def name_ok(name):
    name=str(name)
    if name=='nan': return False
    if name not in redflags_names:
        if np.array([i not in name for i in redflags_names2]).all():
            return True
    return False
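A few illustrative checks of the filter: competitor names pass, while table artefacts and phase labels are rejected.
In [ ]:
name_ok('Dudaș Francisc Andrei')   # True
name_ok('Victorii: 3')             # False: listed in redflags_names
name_ok('Grupa A')                 # False: contains 'Grupa ' from redflags_names2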
Process all names for standardization. Create 3 variables:
all_players: cleaned Name ID -> {year -> set of raw (unclean) name forms seen}
all_players_r: reverse relationship: raw (unclean) name form -> cleaned Name ID
all_players_unsorted: unique set of all raw names processed
Process both competitor and shinpan names.
In [88]:
all_players={}
all_players_r={}
all_players_unsorted=set()
for year in matches:
    for competition in matches[year]:
        for match in matches[year][competition]:
            for color in ['aka','shiro']:
                name=match[color]['name']
                all_players_unsorted.add(name)
                if name_ok(name):
                    name=name_cleaner(name)
                    rname=match[color]['name']
                    if rname not in all_players_r:all_players_r[rname]=name
                    if name not in all_players: all_players[name]={}
                    if year not in all_players[name]:all_players[name][year]={'names':set()}
                    all_players[name][year]['names'].add(rname)
            if 'shinpan' in match:
                for color in ['fukushin1','shushin','fukushin2']:
                    aka=match['aka']['name']
                    shiro=match['shiro']['name']
                    if (name_ok(aka)) and\
                       (name_ok(shiro)) and\
                       (name_cleaner(aka) in all_players) and\
                       (name_cleaner(shiro) in all_players):
                        rname=match['shinpan'][color]
                        all_players_unsorted.add(rname)
                        if name_ok(rname):
                            name=name_cleaner(rname)
                            if rname not in all_players_r:all_players_r[rname]=name
                            if name not in all_players: all_players[name]={}
                            if year not in all_players[name]:all_players[name][year]={'names':set()}
                            all_players[name][year]['names'].add(rname)
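A quick way to inspect what the loop produced: all_players is keyed by Name ID and records the raw forms seen per year, while all_players_r maps each raw form back to its Name ID (the counts depend on the raw data).
In [ ]:
# all_players[name_id][year]['names'] -> set of raw name forms seen that year
# all_players_r[raw_form]             -> name_id
print(len(all_players), 'Name IDs,', len(all_players_unsorted), 'raw name forms')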
Link the processed names to the names in members. The name_linker dictionary contains Name IDs (short names) as keys and sets of long names as values. Ideally, each set should contain only one element, so that the mapping is unique.
In [89]:
name_linker={}
for i in members_clean.index:
    name=members_clean.loc[i]['name']
    try:
        cname=name_cleaner(name)
    except:
        print(name)
    if cname not in name_linker:name_linker[cname]=set()
    name_linker[cname].add(name)
Do the opposite mapping in names_abbr: long->short. Create exceptions for duplicate names.
In [90]:
names_abbr={}
for name in name_linker:
    if len(name_linker[name])>1:
        #only for dev to create exceptions for duplicate person names
        print(name,name_linker[name])
    for i in name_linker[name]:
        names_abbr[i]=name
Save club mappings by short name, by year.
In [91]:
names_abbr_list=[]
name_abbr2long={}
name_abbr2club={}
for i in members_clean.index:
    name=members_clean.loc[i]['name']
    club=members_clean.loc[i]['club']
    year=members_clean.loc[i]['year']
    names_abbr_list.append(names_abbr[name])
    name_abbr2long[names_abbr[name]]=name
    if names_abbr[name] not in name_abbr2club:name_abbr2club[names_abbr[name]]={}
    if year not in name_abbr2club[names_abbr[name]]:
        name_abbr2club[names_abbr[name]][year]=club
Add short names to members_clean.
In [92]:
members_clean['name_abbr']=names_abbr_list
Some names appear only in short form, so we need to add them to the long-name list manually. We parse all forms in which the name appears and choose the longest; we call this the inferred name.
In [93]:
for name in all_players:
    if name not in name_abbr2long:
        #infer using longest available name
        names={len(j):j for i in all_players[name] for j in all_players[name][i]['names']}
        if len(names)>0:
            inferred_name=names[max(names.keys())]
            if '(' in inferred_name:
                inferred_name=inferred_name[:inferred_name.find('(')-1]
            name_abbr2long[name]=inferred_name
Infer duplicates using the Levenshtein distance, first on the long names and then on the short Name IDs.
In [94]:
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
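A quick check of the distance function on two diacritic-only spelling variants that also appear in name_equals above; both pairs should come out as 2.
In [ ]:
print(levenshteinDistance('Ianas F.', 'Ianăș F.'))      # 2: 'a' vs 'ă' and 's' vs 'ș'
print(levenshteinDistance('Hostina E.', 'Hoștină E.'))  # 2: 's' vs 'ș' and 'a' vs 'ă'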
In [95]:
nkeys=np.sort(list(name_abbr2long.keys()))
for ii in range(len(name_abbr2long)):
    i=nkeys[ii]
    for jj in range(ii):
        j=nkeys[jj]
        if levenshteinDistance(name_abbr2long[i],name_abbr2long[j])<4:
            print(name_abbr2long[i],':',name_abbr2long[j],' - ',i,':',j)
In [96]:
nkeys=np.sort(list(name_abbr2long.keys()))
for ii in range(len(name_abbr2long)):
    i=nkeys[ii]
    for jj in range(ii):
        j=nkeys[jj]
        if levenshteinDistance(i,j)<3:
            print(i,':',j,' - ',name_abbr2long[i],':',name_abbr2long[j])
Infer clubs from the name when the club is part of the name used in the competition data. Club names in redflags_clubs get ignored. Clubs in club_equals get replaced after processing. The convention is a 3-letter all-caps club code for Romanian clubs, and a 3-letter club code followed by a / and a two-letter country code for foreign clubs.
In [97]:
redflags_clubs=['','N/A','RO1','RO2']
club_equals=clubs_loader.club_equals
Attach clubs to entries in all_players that carry a club in their competition name data but which we don't already know from the members register.
In [98]:
for name in all_players:
    #if we dont already know the club for this year from the members register
    if name not in name_abbr2club:
        for year in all_players[name]:
            for name_form in all_players[name][year]['names']:
                if '(' in name_form:
                    club=name_form.split('(')[1].strip()[:-1]
                    if club not in redflags_clubs:
                        if name not in name_abbr2club:name_abbr2club[name]={}
                        name_abbr2club[name][year]=club
    else:
        for year in all_players[name]:
            #else if no club info for particular year
            if year not in name_abbr2club[name]:
                for name_form in all_players[name][year]['names']:
                    if '(' in name_form:
                        club=name_form.split('(')[1].strip()[:-1]
                        if club not in redflags_clubs:
                            name_abbr2club[name][year]=club
Normalize club names and long names.
In [99]:
for name in name_abbr2club:
    for year in name_abbr2club[name]:
        if name_abbr2club[name][year] in club_equals:
            name_abbr2club[name][year]=club_equals[name_abbr2club[name][year]]
for name in name_abbr2long:
    name_abbr2long[name]=name_abbr2long[name].replace('  ',' ').strip()
If club still not found, fill the gaps between years. Forward fill first, then backward fill, if necessary.
In [103]:
for name in all_players:
    if name in name_abbr2club:
        years=np.sort(list(all_players[name].keys()))
        minyear1=min(years)
        maxyear1=max(years)
        minyear2=min(name_abbr2club[name].keys())
        maxyear2=max(name_abbr2club[name].keys())
        if len(years)>1:
            for year in range(min(minyear1,minyear2),max(maxyear1,maxyear2)+1):
                if year not in name_abbr2club[name]:
                    #get club from previous year
                    for y in range(years[0],year):
                        if y in name_abbr2club[name]:
                            name_abbr2club[name][year]=str(name_abbr2club[name][y])
                            break
                if year not in name_abbr2club[name]:
                    #if still not found, get club from next year
                    for y in np.arange(years[-1],year,-1):
                        if y in name_abbr2club[name]:
                            name_abbr2club[name][year]=str(name_abbr2club[name][y])
                            break
                if year not in name_abbr2club[name]:
                    #if still not found, get first known year
                    if year<minyear2:
                        name_abbr2club[name][year]=str(name_abbr2club[name][minyear2])
                    else:
                        name_abbr2club[name][year]=str(name_abbr2club[name][maxyear2])
We have extracted what was possible from the data. Now we save the short-name to long-name and club mappings (by year), then edit this file manually, if necessary. The manual_name_needed, manual_club_needed and manual_data_override variables used in the next cells carry over from previous manual-review runs of this notebook.
In [30]:
manual_data_needed=[]
for i in manual_name_needed.union(manual_club_needed):
    if i not in list(manual_data_override.index):
        dummy={'name':i,'long_name':'','club':''}
        if i in name_abbr2club:
            dummy['club']=name_abbr2club[i][max(list(name_abbr2club[i].keys()))]
        if i in manual_club_needed:
            if i in name_abbr2long:
                dummy['long_name']=name_abbr2long[i]
        manual_data_needed.append(dummy)
In [31]:
df=pd.DataFrame(manual_data_needed).set_index('name')
df=pd.concat([manual_data_override,df]).drop_duplicates().sort_index()
In [32]:
df.to_excel('../data/manual/members_manual.xlsx')
Extend with manual data
In [33]:
for i in df['long_name'].replace('',np.nan).dropna().index:
    name_abbr2long[i]=df.loc[i]['long_name']
    all_players_r[name_abbr2long[i]]=i
In [34]:
manual_club_needed=set()
for name in all_players:
    years=np.sort(list(all_players[name].keys()))
    minyear=min(years)
    maxyear=max(years)
    for year in range(minyear,maxyear+1):
        if name not in name_abbr2club:name_abbr2club[name]={}
        if year not in name_abbr2club[name]:
            if name in df['club'].replace('',np.nan).dropna().index:
                name_abbr2club[name][year]=df.loc[name]['club']
            else:
                name_abbr2club[name][year]='XXX'
Update and overwrite with club existence data
Extend members with unregistered members. These are probably inactive now, or from abroad, and only appear in the year(s) when they competed, so we only register them as known to be active in those years. This is in contrast with the inactive members from the registry, for whom we know when they went inactive.
In [35]:
unregistered_members=[]
for name in all_players:
    if name not in set(members_clean['name_abbr'].values):
        years=np.sort(list(name_abbr2club[name].keys()))
        for year in range(min(years),max(years)+1):
            if year in all_players[name]:
                iyear=year
            else:
                iyear=max(years)
            club,country=clubs_loader.club_cleaner(name_abbr2club[name][year])
            if country=='RO':
                activ='Active'
                dan=''#dan=0
            else:
                activ='Abroad'
                dan=''
            unregistered_members.append({'name':name_abbr2long[name],'name_abbr':name,
                'club':club,'active':activ,'year':year,'dan':dan,'country':country,'source':'matches'})
In [36]:
members_clean['country']='RO'
members_clean['source']='member list'
In [49]:
members_updated=pd.concat([members_clean,pd.DataFrame(unregistered_members)]).reset_index(drop=True)
Extend with 0 dan (mu dan) entries down to the starting year: for years in which a member appears in competition but is missing from the registry, add a 0-dan row, extrapolating the age from the nearest registered year.
In [50]:
members_mu_dan_extensions=[]
members_by_name=members_updated.set_index(['name_abbr'])
for year in matches:
    members_by_year=members_updated.set_index(['year']).loc[year]
    for competition in matches[year]:
        print(year,competition)
        for k in matches[year][competition]:
            aka=k['aka']['name']
            shiro=k['shiro']['name']
            if (name_ok(aka)) and\
               (name_ok(shiro)) and\
               (name_cleaner(aka) in all_players) and\
               (name_cleaner(shiro) in all_players):
                for a in ['aka','shiro']:
                    for h in k[a]:
                        if h=='name':
                            name=k[a][h]
                            rname=all_players_r[name]
                            if rname in list(members_by_name.index):
                                if rname not in members_by_year['name_abbr'].values:
                                    dummy=members_by_name.loc[[rname]]
                                    minyear=min(dummy['year'])
                                    maxyear=max(dummy['year'])
                                    if year>maxyear:
                                        dummy=dummy[dummy['year']==maxyear]
                                        yeardiff=min(dummy['year'])-year
                                    else:
                                        dummy=dummy[dummy['year']==minyear]
                                        yeardiff=year-max(dummy['year'])
                                    dummy=dummy.reset_index()
                                    dummy['year']=year
                                    dummy['dan']=0
                                    dummy['age']=dummy['age']+yeardiff
                                    dummy['source']='matches, mu dan'
                                    members_mu_dan_extensions.append(dummy)
                            #if only appears in competition in one year, then not in members table
                            else:
                                print(rname,year)
                                #fix in unregistered_members
Update members
In [51]:
members_mu_dan_extensions=pd.concat(members_mu_dan_extensions)
members_updated=pd.concat([members_updated,members_mu_dan_extensions]).reset_index(drop=True)
Prettify club names and IDs.
In [52]:
clubs=[]
pclubs=[]
countries=[]
for i in members_updated.index:
    club=members_updated.loc[i]['club']
    country=members_updated.loc[i]['country']
    year=members_updated.loc[i]['year']
    club,country=clubs_loader.club_cleaner(club,country)
    club,pclub=clubs_loader.club_year(club,country,year)
    clubs.append(club)
    pclubs.append(pclub)
    countries.append(country)
In [53]:
members_updated['club']=clubs
members_updated['pretty_club']=pclubs
members_updated['country']=countries
Fix unknown genders.
In [54]:
manual_mf_data_override=pd.read_excel('../data/manual/members_mf_manual.xlsx')
In [55]:
manual_mf_data_needed=members_updated[(members_updated['gen']!='M')&(members_updated['gen']!='F')][['name_abbr','name']]\
.drop_duplicates()
In [56]:
df=manual_mf_data_needed#.merge(manual_mf_data_override[['name_abbr','gen']],'outer').drop_duplicates()
df.to_excel('../data/manual/members_mf_manual.xlsx')
Update members with manual gender data.
In [57]:
members_updated=members_updated.reset_index(drop=True).drop_duplicates()
In [58]:
gens=[]
for i in members_updated.index:
    name=members_updated.loc[i]['name_abbr']
    if name in list(df.index):
        gens.append(df.loc[name])
    else:
        gens.append(members_updated.loc[i]['gen'])
In [59]:
members_updated['gen']=gens
Save to /data/export.
In [60]:
members_updated.to_csv('../data/export/members.csv')
In [61]:
clubs_updated=members_updated.groupby(['club','country','pretty_club','year'])[['name_abbr']].count()
clubs_updated=clubs_updated.reset_index().set_index('club').join(clubs_loader.club_year_df['Oraș'])
clubs_updated.to_csv('../data/export/clubs.csv')
In [62]:
master_matches=[]
for year in matches:
    members_by_year=members_updated.set_index(['year']).loc[year].drop_duplicates()
    for competition in matches[year]:
        print(year,competition)
        for k in matches[year][competition]:
            good=True
            match={'year':year,'competition':competition}
            match['match_category'],match['match_teams'],match['match_phase']=point_utils.match_cleaner(year,k['match_type'])
            if 'shinpan' in k:
                for color in ['fukushin1','shushin','fukushin2']:
                    if color in k['shinpan']:
                        if k['shinpan'][color] in all_players_r:
                            #normalize shinpan names
                            match[color]=name_abbr2long[all_players_r[k['shinpan'][color]]]
            aka=k['aka']['name']
            shiro=k['shiro']['name']
            if (name_ok(aka)) and\
               (name_ok(shiro)) and\
               (name_cleaner(aka) in all_players) and\
               (name_cleaner(shiro) in all_players):
                for a in ['aka','shiro']:
                    points=''
                    for h in k[a]:
                        if h=='name':
                            name=k[a][h]
                            #normalize competitor names
                            rname=all_players_r[name]
                            df=members_by_year[members_by_year['name_abbr']==rname]
                            match[a+' name']=name_abbr2long[rname]
                        else:
                            point=k[a][h]
                            if str(point)=='nan': point=''
                            points=points+point
                    good=point_utils.point_redflags(points)
                    if good:
                        match[a+' point1'],match[a+' point2'],match[a+' points'],\
                        match[a+' hansoku'],match['encho']=point_utils.points_cleaner(points)
            else:
                good=False
            if good:
                if 'outcome' in k:
                    match['encho']=point_utils.outcome_cleaner(k['outcome'])
                else:
                    match['encho']=False
                match['winner'],match['difference']=point_utils.outcome_from_points(match['aka points'],match['shiro points'])
                master_matches.append(match)
Clean up and save matches for display
In [63]:
data=pd.DataFrame(master_matches).reset_index(drop=True)
In [64]:
save_utils.save(data)