Merging "odds" and "player" data

Author: Carl Toews
File: merge_datasets.ipynb

Description:

An obvious metric for assessing the quality of learning algorithms is to compare their profitability against on-line betting markets. The notebook setup_mysql.ipynb loads into a MySQL database files from two distinct sources, one containing characteristics of the players (good for learning), the other containing odds for each match (good for assessing the learning). Each record from the latter presumably corresponds to a unique record from the former.

Unfortunately, there are a couple of issues that make establishing this correspondence difficult, including each of the following:

  1. Incorrect match dates
  2. Incorrect spellings of player names
  3. Listing the winner as the loser, and vice versa.
  4. Inconsistent tournament names
  5. Inconsistent tournament ID numbers

This notebook records some of the methods I used to establish the correspondence. It ultimately produces two DataFrames, one called matches (with player characteristics) and one called odds (with betting characteristics) of the same size and a one-to-one numerical key uniquely identifying each match across both datasets.


Examples of bad data

Bad player names


In [1]:
import IPython as IP
IP.display.Image("example_of_name_matching_problems_mod.png",width=400,height=200,embed=True)


Out[1]:

Bad match dates


In [2]:
IP.display.Image("../aux/bad_csv_data_mod.png",width=500,height=500,embed=True)


Out[2]:

Setup MySQL connection

Login credentials for connecting to MySQL database.


In [1]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "../data/odds_data/"
# we'll read and write pickle files here
pickle_dir = '../pickle_files/'

All import statements here.


In [2]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from   sqlalchemy import create_engine  # needed to define db interface
import sys # for defining behavior under errors
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline 
import pdb
#%qtconsole

Try to connect to the tennis database on the local mysql host. If successful, print out the MySQL version number, if unsuccessful, exit gracefully.


In [3]:
# create an engine for interacting with the MySQL database
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
    
# report what went wrong if this fails.    
except sqlalchemy.exc.DatabaseError as e:
    reason = e.message
    print("Error %s:" % (reason))
    sys.exit(1)

# close the connection
finally:            
    if connection:    
        connection.close()
    else:
        print("Failed to create connection.")


Database version : 
('5.7.16',)

Extract data from MySQL database

We'll focus on cleaning the seven years of data between 2010 and 2016.


In [4]:
# focus on most recent data; exclude Davis Cup stuff
startdate = '20100101'
enddate = '20161231'

with engine.begin() as connection:
    odds = pd.read_sql_query("""SELECT * FROM odds \
                                WHERE DATE >= '""" + startdate + """' \
                                AND DATE <= '""" + enddate + """';""", connection)
    
with engine.begin() as connection:
    matches = pd.read_sql_query("""SELECT * FROM matches \
                                WHERE tourney_date >= '""" + startdate + """' \
                                AND tourney_date <= '""" + enddate + """' \
                                AND tourney_name NOT LIKE 'Davis%%';""", connection)
    
# view results
IP.display.display(odds[['ATP','Location','Tournament','Date','Round',
                         'Winner','Loser']].sort_values('Date')[0:5])
IP.display.display(matches[['tourney_id','tourney_name','tourney_date','round',
                            'winner_name','loser_name']].sort_values('tourney_date')[0:5])


ATP Location Tournament Date Round Winner Loser
0 1 Brisbane Brisbane International 2010-01-04 1st Round Gasquet R. Nieminen J.
34 2 Chennai Chennai Open 2010-01-04 1st Round Tipsarevic J. Moya C.
33 2 Chennai Chennai Open 2010-01-04 1st Round Giraldo S. Phau B.
32 2 Chennai Chennai Open 2010-01-04 1st Round Hajek J. Greul S.
31 2 Chennai Chennai Open 2010-01-04 1st Round Cilic M. Kunitsyn I.
tourney_id tourney_name tourney_date round winner_name loser_name
1411 2010-339 Brisbane 2010-01-03 R32 Andy Roddick Peter Luczak
1440 2010-339 Brisbane 2010-01-03 SF Radek Stepanek Gael Monfils
1439 2010-339 Brisbane 2010-01-03 SF Andy Roddick Tomas Berdych
1438 2010-339 Brisbane 2010-01-03 QF Radek Stepanek Wayne Odesnik
1437 2010-339 Brisbane 2010-01-03 QF Gael Monfils James Blake

Note some issues already:

  1. Date in matches are often pegged to the start date of the tournament, while dates in odds are often tied to the match itself. Thus time-ordering leads to different results.
  2. The variable tourney_name in matches often corresponds to the variable Location in odds (but not always!)
  3. Rounds are denoted in different ways. "1st round" in odds can match to R32, R64, or R128 on matches.
  4. Name formats are different.
  5. While matches has a unique tournament ID for each tournament, odds recycles tournament ids each year.

Before doing any serious data processing, tidy up the strings (strip whitespace, convert to lowercase, replace dashes by space)


In [5]:
odds[['Location','Winner','Loser']] = \
    odds[['Location','Winner','Loser']].\
    apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
matches[['tourney_name','winner_name','loser_name']] = \
    matches[['tourney_name','winner_name','loser_name']].\
    apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)

Step I: pair unique tournaments from each set

Each unique tournament has an identifying key in both odds and matches, but the keys are different, and not one to one. Our first task is to associate unique tournaments with one another. We'll start by grouping both odds and matches by unique tournament number.


In [6]:
# matches tournament identifiers are unique
g_matches = matches.groupby('tourney_id')
# odds tournament identifiers are recycled every year
g_odds= odds.groupby(['ATP','fname'])

For each tournament (group), extract features to assist in "tourney-matching", including the number of matches in the tournament, the maxium and minimum listed dates, the location (in odds) and the tournament name (in matches.)


In [7]:
def extract_odds_features(group):
    sizes = len(group)
    min_date = group['Date'].min()
    max_date = group['Date'].max()
    location = group['Location'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date,\
                      'max_date':max_date,'location':location})

def extract_matches_features(group):
    sizes = len(group)
    min_date = group['tourney_date'].min()
    max_date = group['tourney_date'].max()
    tourney_name = group['tourney_name'].unique()[0]
    return pd.Series({'size': sizes,'min_date':min_date,\
                      'max_date':max_date,'tourney_name':tourney_name})
    
g_odds = g_odds.apply(extract_odds_features).reset_index()
g_matches = g_matches.apply(extract_matches_features).reset_index()

Define lookup-table connecting Location in odds to tourney_name in matches. Establishing this matching is not totally straightforward: in many cases, the two columns are identical, but in others they are not, with no obvious connection. (Eg. "Roland Garros" is the tourney_name listed for the French Open in matches, while it's Location is "Paris".) Of the 100 or so difference tennis tournaments, about 60 have a direct correspondence, and the others need some massaging. The derived lookup table is the product of considerable grubby data wrangling.


In [8]:
tourney_lookup = pd.read_pickle(pickle_dir + 'tourney_lookup.pkl')
print("Snapshot of lookup table:")
IP.display.display(tourney_lookup.sort_values('o_name')[15:25])


Snapshot of lookup table:
m_name o_name
17 casablanca casablanca
18 chengdu chengdu
19 chennai chennai
20 cincinnati masters cincinnati
21 costa do sauipe costa do sauipe
22 delray beach delray beach
23 doha doha
24 dubai dubai
25 dusseldorf dusseldorf
58 power horse cup dusseldorf

Define a function that will take as input a set of features from one tournament in odds, and try to match it to one tournament in matches. The algorithm make sure that dates are roughly the same, and that there is a correspondence between Location and tourney_name.


In [9]:
def get_tourney_ID(o_row):
    """
    function:  get_tourney_ID(o_row)
    Input:   row from dataframe g_odds
    Output:  a Series object with two elements: 1) a match ID,
                and 2), a flag of True if the sizes of the two tournmanets are identical
    """

    # calculate the diffence in start/stop dates between this tournament and those in `matches`.
    min_date_delta = np.abs(g_matches['min_date'] - o_row['min_date']).apply(lambda x: x.days)
    max_date_delta = np.abs(g_matches['max_date'] - o_row['max_date']).apply(lambda x: x.days)

    # find a list of candidate tournament names, based on lookup table 
    mtchs = (tourney_lookup['o_name']==o_row['location'])
    if sum(mtchs)>0:
        m_name = tourney_lookup.loc[mtchs,'m_name']
    else:
        print('no match found for record {}'.format(o_row['location']))
        return ['Nan','Nan']
    
    # the "right" tournament has the right name, and reasonable close start or stop dates
    idx = ((min_date_delta <=3) | (max_date_delta <=1)) & (g_matches['tourney_name'].isin(m_name))
    record = g_matches.loc[idx,'tourney_id']
    
    # if there are no matches, print some diagnostic information and don't assign a match
    if len(record)<1:
        print("Warning:  no match found for `odds` match {}, year {}".format(o_row.ATP, o_row.fname))
        print("min date delta: {}, max date delta: {}, g_matches: {}".format(np.min(min_date_delta), \
                                                                             np.min(max_date_delta), \
                                                                             g_matches.loc[g_matches['tourney_name'].isin(m_name),'tourney_name'])) 
        return pd.Series({'ID':'None','size':'NA'})
    
    # if there are too many matches, print a warning and don't assign a match.
    elif (len(record)>1):
        print("Warning:  multiple matches found for `odds` match {}".format(o_row.ATP))
        return pd.Series({'ID':'Multiple','size':'NA'})

    # otherwise, assign a match, and check if the sizes of the matches are consistent (a good double-check)
    else:
        size_flag = (g_matches.loc[idx,'size']==o_row['size'])
        return pd.Series({'ID':record.iloc[0],'size':size_flag.iloc[0]})

Perform the matches.


In [10]:
# add columns to g_odds to hold match ID and also info about size-correspondence
g_odds.insert(len(g_odds.columns),'ID','None')
g_odds.insert(len(g_odds.columns),'sizes_match','NA')

# perform the match
g_odds[['ID','sizes_match']] = g_odds.apply(get_tourney_ID,axis=1).values

Merge match numbers back into bigger odds table, checking that the sizes of each "matched" tournament are the same.


In [11]:
# add "size" columns to both dataframes
odds = pd.merge(g_odds[['ATP','fname','ID','size']],odds,how='inner',on=['ATP','fname'])
matches = pd.merge(g_matches[['tourney_id','size']],matches,how='inner',on=['tourney_id'])

# sum the sizes
if sum(g_odds['sizes_match']==True) != len(g_odds):
    print("Warning:  at least one tournament in `odds` is matched to a \
    tournament in `matches` of a different size.")
else:
    print("Sizes seem to match up.")


Sizes seem to match up.

Step II: Pair matches within each tournament.

To help in this process, we first insert an integer index key into both odds and matches. Also assign a match number to each match within a tournament in odds.


In [12]:
# for each tournament, label match numbers from 1 to n_tourneys
odds.insert(5,'match_num',0)
grouped = odds[['ID','match_num']].groupby('ID')
odds['match_num'] = grouped.transform(lambda x: 1+np.arange(len(x)))

# add keys to both odds and match data
odds.insert(len(odds.columns),'key',np.arange(len(odds)))
matches.insert(len(matches.columns),'key',np.arange(len(matches)))

We'll use round information to help with the task of pairing individual matches . To use round information effectively, we need to establish a correspondence between round signifiers in the two datasets. The exact correspondence depends on how many tennis matches are in each tournmanet.


In [13]:
# figure out how many discrete sizes there are
print("size in odds:  ", odds['size'].unique())
print("size in matches:  ", matches['size'].unique())
print("unique round designators in odds:  ", odds.Round.unique())
print("unique round designators in matches:  ", matches['round'].unique())


size in odds:   [ 31  27 127  95  55  47  15]
size in matches:   [ 55  27  31  47  95 127  15  23  26  64]
unique round designators in odds:   ['1st Round' '2nd Round' 'Quarterfinals' 'Semifinals' 'The Final'
 '3rd Round' '4th Round' 'Round Robin']
unique round designators in matches:   ['R64' 'R32' 'R16' 'QF' 'SF' 'F' 'R128' 'RR' 'BR']

In [14]:
# create a lookup table to be able to match on rounds
m_rounds = ['R128','R64','R32','R16','QF','SF','F','RR']
o_rounds = ['1st Round','2nd Round','3rd Round','4th Round', \
            'Quarterfinals','Semifinals','The Final','Round Robin']

round_lookup_small = pd.DataFrame({'m_rounds': m_rounds[2:-1],\
                                   'o_rounds':o_rounds[0:2]+o_rounds[4:-1]})

round_lookup_medium = pd.DataFrame({'m_rounds': m_rounds[1:-1],\
                                   'o_rounds':o_rounds[0:3]+o_rounds[4:-1]})

round_lookup_large = pd.DataFrame({'m_rounds': m_rounds[0:-1],\
                                   'o_rounds':o_rounds[0:-1]})

round_lookup_RR  = pd.DataFrame({'m_rounds':m_rounds[5:],\
                                 'o_rounds':o_rounds[5:]})

With the round lookup tables defind, define a function that takes a row of odds, and figures how to map its round information to the round information in matches.


In [15]:
def map_rounds(x):

    cur_name = x['Round']
    t_size = x['size']

    if t_size in [27,31]:
        new_name = round_lookup_small.loc[round_lookup_small.o_rounds==cur_name,'m_rounds']
    elif t_size in [47,55]:
        new_name = round_lookup_medium.loc[round_lookup_medium.o_rounds==cur_name,'m_rounds']
    elif t_size in [95, 127]:
        new_name = round_lookup_large.loc[round_lookup_large.o_rounds==cur_name,'m_rounds']
    else:
        new_name = round_lookup_RR.loc[round_lookup_RR.o_rounds==cur_name,'m_rounds'] 

    return new_name.iloc[0]

We'll apply that mapping to each row of the odds dataframe.


In [16]:
# translate round indentifier appropriately
odds.insert(4,'round','TBD')
odds['round'] = odds.apply(map_rounds,axis=1).values

IP.display.display(odds[0:4])
IP.display.display(matches[0:4])


ATP fname ID size round Location match_num Tournament Date Series ... LBL PSW PSL SJW SJL MaxW MaxL AvgW AvgL key
0 1 2010 2010-339 31 R32 brisbane 1 Brisbane International 2010-01-04 ATP250 ... 2.375 1.526 2.740 1.500 2.500 NaN NaN NaN NaN 0
1 1 2010 2010-339 31 R32 brisbane 2 Brisbane International 2010-01-04 ATP250 ... 1.571 2.140 1.813 2.250 1.571 NaN NaN NaN NaN 1
2 1 2010 2010-339 31 R32 brisbane 3 Brisbane International 2010-01-04 ATP250 ... 2.100 1.676 2.360 1.615 2.200 NaN NaN NaN NaN 2
3 1 2010 2010-339 31 R32 brisbane 4 Brisbane International 2010-01-04 ATP250 ... 1.500 2.580 1.581 2.500 1.500 NaN NaN NaN NaN 3

4 rows × 48 columns

tourney_id size tourney_name surface draw_size tourney_level tourney_date match_num winner_id winner_seed ... l_df l_svpt l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced match_type key
0 2010-1536 55 madrid masters Clay 56 M 2010-05-09 1 103794 NaN ... 2.0 46.0 25.0 11.0 8.0 7.0 4.0 9.0 atp_matches 0
1 2010-1536 55 madrid masters Clay 56 M 2010-05-09 2 104919 NaN ... 7.0 110.0 71.0 52.0 19.0 17.0 1.0 4.0 atp_matches 1
2 2010-1536 55 madrid masters Clay 56 M 2010-05-09 3 104527 15.0 ... 1.0 62.0 33.0 17.0 11.0 10.0 3.0 9.0 atp_matches 2
3 2010-1536 55 madrid masters Clay 56 M 2010-05-09 4 104022 10.0 ... 1.0 73.0 45.0 29.0 12.0 12.0 5.0 10.0 atp_matches 3

4 rows × 52 columns

Before going on, a quick sanity check: is the set of matches in matches that are in odds the same size as the set of matches in odds that are in matches? Does the size column have the correct data?


In [17]:
t1=odds.ID.drop_duplicates().sort_values()
t2=matches.tourney_id.drop_duplicates().sort_values()
m_sizes=matches.loc[matches.tourney_id.isin(t1),['tourney_id','size']].drop_duplicates()
o_sizes=odds.loc[odds.ID.isin(t2),['ID','size']].drop_duplicates()
#comp = pd.merge(o_sizes,m_sizes,how='outer',left_on='ID',right_on='tourney_id')
print('sum of sizes of tournaments in odds: ', np.sum(o_sizes['size']))
print('sum of sizes of tournaments in matches: ', np.sum(m_sizes['size']))


sum of sizes of tournaments in odds:  18448
sum of sizes of tournaments in matches:  18448

Looks good. Now pare down the matches dataframe to contain only records in odds.


In [18]:
matches = matches.loc[matches.tourney_id.isin(t1),:]
print("number of records in `odds`:  ", len(odds))
print("number of records in `matches`:  ", len(matches))


number of records in `odds`:   18448
number of records in `matches`:   18448

To do player name matching, we split each name, and attempt to match on substrings.


In [19]:
# extract dataframe with player names split into discrete 'words'
m_players = pd.merge(matches.winner_name.str.split(pat=' ',expand=True), \
                     matches.loser_name.str.split(pat=' ',expand=True), \
                    how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))  

# add on tournament, round, and match identifiers
m_players = pd.merge(matches[['tourney_id','match_num', 'round','key']], m_players,\
                    how='inner',left_index=True, right_index=True).sort_values(['tourney_id','round','1_W','1_L'])

# extract dataframe with player names split into discrete 'words'
o_players = pd.merge(odds.Winner.str.split(pat=' ',expand=True), \
                     odds.Loser.str.split(pat=' ',expand=True), \
                    how='inner',left_index=True, right_index=True,suffixes=('_W','_L')) 

# add on tournament and round identifiers
o_players = pd.merge(odds[['ID','round','match_num','key']], o_players,\
                    how='inner',left_index=True, right_index=True).sort_values(['ID','round','0_W','0_L'])

print("m_players: ")
IP.display.display(m_players[0:5])
print("o_players")
IP.display.display(o_players[0:5])


m_players: 
tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
54 2010-1536 55 F 54 rafael nadal None None None roger federer None None None
50 2010-1536 51 QF 50 nicolas almagro None None None jurgen melzer None None None
48 2010-1536 49 QF 48 roger federer None None None ernests gulbis None None None
49 2010-1536 50 QF 49 david ferrer None None None andy murray None None None
51 2010-1536 52 QF 51 rafael nadal None None None gael monfils None None None
o_players
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
7722 2010-1536 F 55 7722 nadal r. None None None federer r. None None None
7716 2010-1536 QF 49 7716 almagro n. None None None melzer j. None None None
7718 2010-1536 QF 51 7718 federer r. None None None gulbis e. None None None
7719 2010-1536 QF 52 7719 ferrer d. None None None murray a. None None None
7717 2010-1536 QF 50 7717 nadal r. None None None monfils g. None None None

Match individual matches as follows: assuming there is a match on both tournament number and round, then

  1. First try to match both winner and loser names
  2. In the compliment of the result, try to match just winner name or just loser names.
  3. Merge 1. and 2., see what's left over.

In [20]:
# try for an exact match on last names of both winner and loser
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
             o_players[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','round','1_W','1_L'],\
             right_on=['ID','round','0_W','0_L'],suffixes=['_m','_o'])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))


A total of 16821 matches down.  1627 remain.

In [21]:
A.key_o.unique().size


Out[21]:
16821

Take a quick peak at the remaining names to get a sense for what the issues are


In [22]:
IP.display.display(m_extras[0:10])
IP.display.display(o_extras[0:10])


tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
36 2010-1536 37 R32 36 guillermo garcia lopez None None jo wilfried tsonga None None
28 2010-1536 29 R32 28 andy murray None None None juan ignacio chela None None
6 2010-1536 7 R64 6 juan ignacio chela None None pablo cuevas None None None
11 2010-1536 12 R64 11 eduardo schwank None None None paul henri mathieu None None
57 2010-1720 3 R32 57 olivier rochus None None None juan martin del potro None
101 2010-2276 20 R16 101 illya marchenko None None None ivan dodig None None None
100 2010-2276 19 R16 100 jurgen melzer None None None olivier rochus None None None
82 2010-2276 1 R32 82 marin cilic None None None jan hajek None None None
85 2010-2276 4 R32 85 ivo karlovic None None None antonio veic None None None
127 2010-301 15 R16 127 michael lammer None None None juan carlos ferrero None None
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
7707 2010-1536 R32 40 7707 garcia lopez g. None None tsonga j.w. None None None
7705 2010-1536 R32 38 7705 murray a. None None None chela j.i. None None None
7680 2010-1536 R64 13 7680 chela j.i. None None None cuevas p. None None None
7679 2010-1536 R64 12 7679 schwank e. None None None mathieu p.h. None None None
15205 2010-1720 R32 11 15205 rochus o. None None None del potro j.m. None None
2293 2010-2276 R16 18 2293 cilic m. None None None hajek j. None None None
2294 2010-2276 R16 19 2294 karlovic i. None None None veic a. None None None
2289 2010-2276 R32 14 2289 marchenko i. None None None dodig i. None None None
2291 2010-2276 R32 16 2291 melzer j. None None None rochus o. None None None
627 2010-301 R16 13 627 lammer m. None None None ferrero j.c. None None None

Now match across all substrings in a name. To do so, we need a function that will return True if there is a match between one or more strings in two string lists.


In [23]:
def comp_str_lists(a,b):
    """checks to see if any of the strings in list a are also in list b"""
    for i in a:
        if i in b:
            return True        
    return False

We also need a function that will take each row of odds, and try to find a match for some appropriate subchunk of matches.


In [24]:
def comp_all_cols(o_row):
    """
    input:  row of o_players
    output: 
    """
    
    m_chunk = m_extras.loc[(m_extras.tourney_id==o_row['ID']) & (m_extras['round']==o_row['round'])]
    o_winner = list(o_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
    o_loser = list(o_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
    
    pairing = []
    if len(m_chunk)==0:
        print("warning:  no match/round pairing found for o_row key {}".format(o_row['key']))
        return 0
    
    for i, m_row in m_chunk.iterrows():
        
        m_winner = list(m_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
        m_loser = list(m_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
        
        pairing.append(comp_str_lists(o_winner,m_winner) & (comp_str_lists(o_loser,o_loser)))
    
    
    if sum(pairing) == 1:
        m_row = m_chunk.iloc[np.array(pairing),:]
        return pd.Series({'key_o':o_row['key'],'key_m':m_row['key'].iloc[0]})

    elif sum(pairing)<1:
        print("warning:  no name matches for o_row key {}".format(o_row['key']))
        return 0
    
    else:
        print("warning:  multiple name matches for o_row key {}".format(o_row['key']))
        return 0

In [25]:
new_matches = o_extras.apply(comp_all_cols,axis=1)


warning:  no name matches for o_row key 2293
warning:  no name matches for o_row key 2294
warning:  no name matches for o_row key 2289
warning:  no name matches for o_row key 2291
warning:  no name matches for o_row key 2694
warning:  no name matches for o_row key 2691
warning:  no name matches for o_row key 2896
warning:  no name matches for o_row key 2897
warning:  no name matches for o_row key 2894
warning:  no name matches for o_row key 2893
warning:  no name matches for o_row key 3943
warning:  no name matches for o_row key 8349
warning:  no name matches for o_row key 8407
warning:  no name matches for o_row key 1097
warning:  no name matches for o_row key 11325
warning:  no name matches for o_row key 11319
warning:  no name matches for o_row key 11314
warning:  multiple name matches for o_row key 18283
warning:  multiple name matches for o_row key 18288
warning:  no name matches for o_row key 15263
warning:  no name matches for o_row key 2157
warning:  no name matches for o_row key 2152
warning:  no name matches for o_row key 3158
warning:  no name matches for o_row key 1267
warning:  multiple name matches for o_row key 18304
warning:  multiple name matches for o_row key 18301

Update match list, and check progress.


In [26]:
new_matches = new_matches.loc[(new_matches.key_m!=0)&(new_matches.key_o!=0),:]
A = pd.concat([A[['key_m','key_o']],new_matches])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))


A total of 18422 matches down.  26 remain.

Take a peak at the remaining names and see what the problems are.


In [27]:
IP.display.display(m_extras.sort_values('0_W')[0:10])
IP.display.display(o_extras.sort_values('1_L')[0:10])


tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
1257 2010-424 1 R32 1257 andy roddick None None None ryler deheart None None None
1585 2010-499 19 R16 1585 benjamin becker None None None wayne odesnik None None None
1861 2010-520 27 R128 1861 carsten ball None None None philipp petzschner None None None
1902 2010-520 68 R64 1902 fabio fognini None None None gael monfils None None None
7129 2012-506 15 R32 7129 federico delbonis None None None frederico gil None None None
7069 2012-505 9 R32 7069 federico delbonis None None None nicolas massu None None None
7078 2012-505 18 R16 7078 federico delbonis None None None thomaz bellucci None None None
3070 2011-321 4 R32 3070 federico delbonis None None None florian mayer None None None
3084 2011-321 18 R16 3084 federico delbonis None None None sergiy stakhovsky None None None
3091 2011-321 25 QF 3091 federico delbonis None None None pavol cervenak None None None
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
2294 2010-2276 R16 19 2294 karlovic i. None None None veic a. None None None
3943 2010-499 R16 20 3943 odesnik w. None None None becker b. None None None
8349 2010-520 R128 24 8349 petzschner p. None None None ball c. None None None
2893 2010-424 R32 15 2893 kohlschreiber p. None None None sela d. None None None
8407 2010-520 R64 82 8407 monfils g. None None None fognini f. None None None
1267 2012-580 R128 21 1267 darcis s. None None None serra f. None None None
3158 2012-506 R32 13 3158 del bonis f. None None gil f. None None None
11314 2011-321 R32 14 11314 del bonis f. None None mayer f. None None None
2289 2010-2276 R32 14 2289 marchenko i. None None None dodig i. None None None
2691 2010-407 R32 14 2691 soderling r. None None None sijsling i. None None None

Some rounds are wrong. Try re-matching both winner and loser last names, without insisting on round information.


In [28]:
B = pd.merge(m_extras[['tourney_id','key','1_W','1_L']],\
             o_extras[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','1_W','1_L'],\
             right_on=['ID','0_W','0_L'],suffixes=['_m','_o'])
A = pd.concat([A,B[['key_m','key_o']]])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))

IP.display.display(m_extras[0:4])
IP.display.display(o_extras[0:4])


A total of 18431 matches down.  17 remain.
tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
1272 2010-424 16 R32 1272 fernando verdasco None None None yen hsun lu None None
1585 2010-499 19 R16 1585 benjamin becker None None None wayne odesnik None None None
1861 2010-520 27 R128 1861 carsten ball None None None philipp petzschner None None None
1902 2010-520 68 R64 1902 fabio fognini None None None gael monfils None None None
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
2897 2010-424 R16 19 2897 verdasco f. None None None lu y.h. None None None
3943 2010-499 R16 20 3943 odesnik w. None None None becker b. None None None
8349 2010-520 R128 24 8349 petzschner p. None None None ball c. None None None
8407 2010-520 R64 82 8407 monfils g. None None None fognini f. None None None

That solved some. Now try matching unusual names, ignorning rounds. This involves slightly modifying the comparison function.


In [29]:
def comp_all_cols_no_rounds(o_row):
    """
    input:  row of o_players
    output: 
    """
    
    m_chunk = m_extras.loc[(m_extras.tourney_id==o_row['ID'])]
    o_winner = list(o_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
    o_loser = list(o_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
    
    pairing = []
    if len(m_chunk)==0:
        print("warning:  no match/round pairing found for o_row key {}".format(o_row['key']))
        return 0
    
    for i, m_row in m_chunk.iterrows():
        
        m_winner = list(m_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
        m_loser = list(m_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
        
        pairing.append(comp_str_lists(o_winner,m_winner) & (comp_str_lists(o_loser,m_loser)))
    
    
    if sum(pairing) == 1:
        m_row = m_chunk.iloc[np.array(pairing),:]
        return pd.Series({'key_o':o_row['key'],'key_m':m_row['key'].iloc[0]})
    elif sum(pairing)<1:
        print("warning:  no name matches for o_row key {}".format(o_row['key']))
        return pd.Series({'key_o':0,'key_m':0})
    else:
        print("warning:  multiple name matches for o_row key {}".format(o_row['key']))
        print(m_chunk.iloc[np.array(pairing),:])
        return pd.Series({'key_o':0,'key_m':0})

In [30]:
new_matches = o_extras.apply(comp_all_cols_no_rounds,axis=1)
new_matches = new_matches.loc[(new_matches.key_m!=0)&(new_matches.key_o!=0),:]
A = pd.concat([A[['key_m','key_o']],new_matches])


warning:  no name matches for o_row key 3943
warning:  no name matches for o_row key 8349
warning:  no name matches for o_row key 8407
warning:  no name matches for o_row key 1097
warning:  no name matches for o_row key 11325
warning:  no name matches for o_row key 11319
warning:  no name matches for o_row key 11314
warning:  no name matches for o_row key 15263
warning:  no name matches for o_row key 2157
warning:  no name matches for o_row key 2152
warning:  no name matches for o_row key 3158
warning:  no name matches for o_row key 1267

Still some errors. Let's take a look.


In [31]:
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))

IP.display.display(m_extras)
IP.display.display(o_extras)


A total of 18436 matches down.  12 remain.
tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
1585 2010-499 19 R16 1585 benjamin becker None None None wayne odesnik None None None
1861 2010-520 27 R128 1861 carsten ball None None None philipp petzschner None None None
1902 2010-520 68 R64 1902 fabio fognini None None None gael monfils None None None
2410 2010-580 102 R32 2410 lukasz kubot None None None mikhail youzhny None None None
3091 2011-321 25 QF 3091 federico delbonis None None None pavol cervenak None None None
3084 2011-321 18 R16 3084 federico delbonis None None None sergiy stakhovsky None None None
3070 2011-321 4 R32 3070 federico delbonis None None None florian mayer None None None
5971 2012-341 15 R16 5971 ivo karlovic None None None marcel granollers None None None
7078 2012-505 18 R16 7078 federico delbonis None None None thomaz bellucci None None None
7069 2012-505 9 R32 7069 federico delbonis None None None nicolas massu None None None
7129 2012-506 15 R32 7129 federico delbonis None None None frederico gil None None None
7653 2012-580 38 R128 7653 florent serra None None None steve darcis None None None
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
3943 2010-499 R16 20 3943 odesnik w. None None None becker b. None None None
8349 2010-520 R128 24 8349 petzschner p. None None None ball c. None None None
8407 2010-520 R64 82 8407 monfils g. None None None fognini f. None None None
1097 2010-580 R32 105 1097 youzhny m. None None None kubot l. None None None
11325 2011-321 QF 25 11325 del bonis f. None None cervenak p. None None None
11319 2011-321 R16 19 11319 del bonis f. None None stakhovsky s. None None None
11314 2011-321 R32 14 11314 del bonis f. None None mayer f. None None None
15263 2012-341 R16 15 15263 granollers m. None None None karlovic i. None None None
2157 2012-505 R16 17 2157 del bonis f. None None bellucci t. None None None
2152 2012-505 R32 12 2152 del bonis f. None None massu n. None None None
3158 2012-506 R32 13 3158 del bonis f. None None gil f. None None None
1267 2012-580 R128 21 1267 darcis s. None None None serra f. None None None

Two big problems. One is 'delbonis' vs. 'del bonis'.


In [32]:
# solve the delbonis problem
o_extras.loc[o_extras['1_W']=='bonis',('0_W','1_W')]  = ['delbonis',None]

B = pd.merge(m_extras[['tourney_id','key','1_W','1_L']],\
             o_extras[['ID','round','key','0_W','0_L']],how='inner',\
             left_on=['tourney_id','1_W','1_L'],\
             right_on=['ID','0_W','0_L'],suffixes=['_m','_o'])
A = pd.concat([A,B[['key_m','key_o']]])

m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]

print("A total of {} matches down.  {} remain.".format(len(A),len(m_extras)))

IP.display.display(m_extras)
IP.display.display(o_extras)


A total of 18442 matches down.  6 remain.
/Users/ctoews/anaconda/envs/py3_6_2/lib/python3.6/site-packages/pandas/core/indexing.py:517: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
tourney_id match_num round key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
1585 2010-499 19 R16 1585 benjamin becker None None None wayne odesnik None None None
1861 2010-520 27 R128 1861 carsten ball None None None philipp petzschner None None None
1902 2010-520 68 R64 1902 fabio fognini None None None gael monfils None None None
2410 2010-580 102 R32 2410 lukasz kubot None None None mikhail youzhny None None None
5971 2012-341 15 R16 5971 ivo karlovic None None None marcel granollers None None None
7653 2012-580 38 R128 7653 florent serra None None None steve darcis None None None
ID round match_num key 0_W 1_W 2_W 3_W 4_W 0_L 1_L 2_L 3_L 4_L
3943 2010-499 R16 20 3943 odesnik w. None None None becker b. None None None
8349 2010-520 R128 24 8349 petzschner p. None None None ball c. None None None
8407 2010-520 R64 82 8407 monfils g. None None None fognini f. None None None
1097 2010-580 R32 105 1097 youzhny m. None None None kubot l. None None None
15263 2012-341 R16 15 15263 granollers m. None None None karlovic i. None None None
1267 2012-580 R128 21 1267 darcis s. None None None serra f. None None None

What remains is exclusively a mismatch of winner and loser. First sort and match keys, then check who really won and correct the data.


In [33]:
m_extras = m_extras.sort_values(['tourney_id','1_W'])
o_extras = o_extras.sort_values(['ID','0_L'])

dregs=pd.DataFrame(list(zip(m_extras['key'].values, \
                            o_extras['key'].values)),\
                        columns=['key_m','key_o'])

A = pd.concat([A,dregs[['key_m','key_o']]])

To see who really won, calculate who played the most rounds.


In [34]:
def find_winner(o_row):
    ID = o_row['ID']
    # nominal winner in `odds`
    name1 = o_row['0_W']
    # nominal loser in `odds`
    name2 = o_row['0_L']
    # number of rounds played by nominal winner
    rnds1 = len(odds.loc[(odds.ID==ID) & \
                         (odds.Winner.str.contains(name1) | \
                          odds.Loser.str.contains(name1)),:])
    # number of rounds played by nominal loser
    rnds2 = len(odds.loc[(odds.ID==ID) & \
                         (odds.Winner.str.contains(name2) | \
                          odds.Loser.str.contains(name2)),:])
    # if nominal winner played more rounds, `odds` is right and `matches` is wrong
    if rnds1>rnds2:
        print('Winner:  ', name1)
        return 'm'
    # otherwise, `odds` is wrong and `matches` is right.
    elif rnds1<rnds2:
        print('Winner:  ', name2)
        return 'o'
    else:
        print("function find_winner:  ambigous outcome")
        return np.nan

In [35]:
mistake_idx = o_extras.apply(find_winner,axis=1)


Winner:   becker
Winner:   ball
Winner:   fognini
Winner:   kubot
Winner:   karlovic
Winner:   serra

Correct mistakes any mistakes in either table.


In [36]:
# fix messed up `odds` records
o_errs = o_extras.loc[mistake_idx.values=='o',:]
if len(o_errs)!=0:
    temp = odds.loc[odds.key.isin(o_errs['key']),'Winner']
    odds.loc[odds.key.isin(o_errs['key']),'Winner']=\
        odds.loc[odds.key.isin(o_errs['key']),'Loser'].values
    odds.loc[odds.key.isin(o_errs['key']),'Loser']=temp.values

# fix messed up `matches` records
m_errs = m_extras.loc[mistake_idx.values=='m',:]
if len(m_errs)!=0:
    temp = matches.loc[matches.key.isin(o_errs['key']),'winner_name']
    matches.loc[matches.key.isin(o_errs['key']),'winner_name']=\
        matches.loc[matches.key.isin(o_errs['key']),'loser_name'].values
    matches.loc[matches.key.isin(o_errs['key']),'loser_name']=temp.values

In [37]:
#sanity check
print("odds has {} records".format(len(odds)))
print("our lookup table is of size {}".format(len(A)))
print("the table has {} unique keys for `matches`".format(len(A.key_m.unique())))
print("the table has {} unique keys for `odds`".format(len(A.key_o.unique())))


odds has 18448 records
our lookup table is of size 18448
the table has 18448 unique keys for `matches`
the table has 18448 unique keys for `odds`

Now we can assign a single key for both odds and matches which corresponds on the match level. We can also standardize match numbers within tournaments.


In [38]:
# take the key originally assigned to `matches` to be the main key
A.rename(columns = {'key_m':'key'},inplace=True)
# change name of `odds` key to match that in `A`
odds.rename(columns = {'key':'key_o'},inplace=True)
# add `matches` key to `odds`, and get rid of `odds` key
odds = pd.merge(odds,A,how='inner',on='key_o')
del odds['key_o']
# use the `odds` match numbers on `matches`
matches = matches.rename(columns={'match_num':'match_num_old'})
matches = pd.merge(matches,odds[['match_num','key']],how='inner',on='key')

Finally, we save our cleansed and matched datasets.


In [39]:
matches.to_pickle(pickle_dir + 'matches.pkl')
odds.to_pickle(pickle_dir + 'odds.pkl')

In [ ]: