Merging "odds" and "player" data
Author: Carl Toews
File: merge_datasets.ipynb
An obvious metric for assessing the quality of a learning algorithm is the profitability of its predictions against online betting markets. The notebook setup_mysql.ipynb loads files from two distinct sources into a MySQL database: one contains characteristics of the players (good for learning), the other contains odds for each match (good for assessing the learning). Each record in the latter presumably corresponds to a unique record in the former.
Unfortunately, a couple of issues make establishing this correspondence difficult, illustrated below: inconsistent player-name formats and bad match dates.
This notebook records some of the methods I used to establish the correspondence. It ultimately produces two DataFrames of the same size, one called matches (with player characteristics) and one called odds (with betting characteristics), together with a one-to-one numerical key uniquely identifying each match across both datasets.
In [212]:
import IPython as IP
IP.display.Image("example_of_name_matching_problems_mod.png",width=400,height=200,embed=True)
Out[212]:
Bad match dates
In [213]:
IP.display.Image("../aux/bad_csv_data_mod.png",width=500,height=500,embed=True)
Out[213]:
Login credentials for connecting to the MySQL database.
In [258]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "../data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "../data/odds_data/"
# we'll read and write pickle files here
pickle_dir = '../pickle_files/'
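Hard-coded credentials are fine for a local experiment; as an alternative, one could read them from environment variables so they stay out of the notebook. A minimal sketch (the variable names here are assumptions, not part of the project):

```python
import os

# Hypothetical alternative: pull connection settings from the environment,
# falling back to the hard-coded defaults above if the variables are unset.
db_name = os.environ.get("TENNIS_DB_NAME", "tennis")
username = os.environ.get("TENNIS_DB_USER", "testuser")
password = os.environ.get("TENNIS_DB_PASS", "test623")
```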
All import statements here.
In [215]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import sys # for defining behavior under errors
import numpy as np # numerical libraries
import scipy as sp
import pandas as pd # for data analysis
import pandas.io.sql as sql # for interfacing with MySQL database
import matplotlib as mpl # a big library with plotting functionality
import matplotlib.pyplot as plt # a subset of matplotlib with most of the useful tools
import IPython as IP
%matplotlib inline
import pdb
#%qtconsole
Try to connect to the tennis database on the local MySQL host. If successful, print the MySQL version number; if unsuccessful, exit gracefully.
In [216]:
# create an engine for interacting with the MySQL database
connection = None
try:
    eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/' + db_name
    engine = create_engine(eng_str)
    connection = engine.connect()
    version = connection.execute("SELECT VERSION()")
    print("Database version : ")
    print(version.fetchone())
# report what went wrong if this fails
except sqlalchemy.exc.DatabaseError as e:
    print("Error %s:" % str(e))
    sys.exit(1)
# close the connection
finally:
    if connection:
        connection.close()
    else:
        print("Failed to create connection.")
In [217]:
# focus on most recent data; exclude Davis Cup stuff
startdate = '20100101'
enddate = '20161231'
with engine.begin() as connection:
    odds = pd.read_sql_query("""SELECT * FROM odds
                                WHERE DATE >= '""" + startdate + """'
                                AND DATE <= '""" + enddate + """';""", connection)
with engine.begin() as connection:
    matches = pd.read_sql_query("""SELECT * FROM matches
                                   WHERE tourney_date >= '""" + startdate + """'
                                   AND tourney_date <= '""" + enddate + """'
                                   AND tourney_name NOT LIKE 'Davis%%';""", connection)
# view results
IP.display.display(odds[['ATP','Location','Tournament','Date','Round',
                         'Winner','Loser']].sort_values('Date')[0:5])
IP.display.display(matches[['tourney_id','tourney_name','tourney_date','round',
                            'winner_name','loser_name']].sort_values('tourney_date')[0:5])
Note some issues already:
- Dates in matches are often pegged to the start date of the tournament, while dates in odds are often tied to the match itself. Thus time-ordering leads to different results.
- tourney_name in matches often corresponds to the variable Location in odds (but not always!)
- A given round label in odds (e.g. "1st Round") can match to R32, R64, or R128 in matches, depending on the tournament size.
- matches has a unique tournament ID for each tournament; odds recycles tournament IDs each year.

Before doing any serious data processing, tidy up the strings (strip whitespace, convert to lowercase, replace dashes by spaces).
In [218]:
odds[['Location','Winner','Loser']] = \
odds[['Location','Winner','Loser']].\
apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
matches[['tourney_name','winner_name','loser_name']] = \
matches[['tourney_name','winner_name','loser_name']].\
apply(lambda x: x.str.strip().str.lower().str.replace('-',' '),axis=1)
Each unique tournament has an identifying key in both odds and matches, but the keys are different, and not one-to-one. Our first task is to associate unique tournaments with one another. We'll start by grouping both odds and matches by unique tournament number.
In [219]:
# matches tournament identifiers are unique
g_matches = matches.groupby('tourney_id')
# odds tournament identifiers are recycled every year
g_odds = odds.groupby(['ATP','fname'])
For each tournament (group), extract features to assist in "tourney-matching", including the number of matches in the tournament, the maximum and minimum listed dates, the location (in odds), and the tournament name (in matches).
In [220]:
def extract_odds_features(group):
    sizes = len(group)
    min_date = group['Date'].min()
    max_date = group['Date'].max()
    location = group['Location'].unique()[0]
    return pd.Series({'size': sizes, 'min_date': min_date,
                      'max_date': max_date, 'location': location})

def extract_matches_features(group):
    sizes = len(group)
    min_date = group['tourney_date'].min()
    max_date = group['tourney_date'].max()
    tourney_name = group['tourney_name'].unique()[0]
    return pd.Series({'size': sizes, 'min_date': min_date,
                      'max_date': max_date, 'tourney_name': tourney_name})

g_odds = g_odds.apply(extract_odds_features).reset_index()
g_matches = g_matches.apply(extract_matches_features).reset_index()
Define a lookup table connecting Location in odds to tourney_name in matches. Establishing this matching is not totally straightforward: in many cases the two columns are identical, but in others they are not, with no obvious connection. (E.g., "Roland Garros" is the tourney_name listed for the French Open in matches, while its Location in odds is "Paris".) Of the 100 or so different tennis tournaments, about 60 have a direct correspondence; the others need some massaging. The derived lookup table is the product of considerable grubby data wrangling.
In [257]:
tourney_lookup = pd.read_pickle(pickle_dir + 'tourney_lookup.pkl')
print("Snapshot of lookup table:")
IP.display.display(tourney_lookup.sort_values('o_name')[15:25])
Define a function that takes as input a set of features from one tournament in odds and tries to match it to one tournament in matches. The algorithm makes sure that dates are roughly the same and that there is a correspondence between Location and tourney_name.
In [222]:
def get_tourney_ID(o_row):
    """
    function: get_tourney_ID(o_row)
    Input: row from dataframe g_odds
    Output: a Series object with two elements: 1) a match ID,
    and 2) a flag of True if the sizes of the two tournaments are identical
    """
    # calculate the difference in start/stop dates between this tournament and those in `matches`
    min_date_delta = np.abs(g_matches['min_date'] - o_row['min_date']).apply(lambda x: x.days)
    max_date_delta = np.abs(g_matches['max_date'] - o_row['max_date']).apply(lambda x: x.days)
    # find a list of candidate tournament names, based on the lookup table
    mtchs = (tourney_lookup['o_name'] == o_row['location'])
    if sum(mtchs) > 0:
        m_name = tourney_lookup.loc[mtchs, 'm_name']
    else:
        print('no match found for record {}'.format(o_row['location']))
        return pd.Series({'ID': 'None', 'size': 'NA'})
    # the "right" tournament has the right name, and reasonably close start or stop dates
    idx = ((min_date_delta <= 3) | (max_date_delta <= 1)) & (g_matches['tourney_name'].isin(m_name))
    record = g_matches.loc[idx, 'tourney_id']
    # if there are no matches, print some diagnostic information and don't assign a match
    if len(record) < 1:
        print("Warning: no match found for `odds` match {}, year {}".format(o_row.ATP, o_row.fname))
        print("min date delta: {}, max date delta: {}, g_matches: {}".format(
            np.min(min_date_delta),
            np.min(max_date_delta),
            g_matches.loc[g_matches['tourney_name'].isin(m_name), 'tourney_name']))
        return pd.Series({'ID': 'None', 'size': 'NA'})
    # if there are too many matches, print a warning and don't assign a match
    elif len(record) > 1:
        print("Warning: multiple matches found for `odds` match {}".format(o_row.ATP))
        return pd.Series({'ID': 'Multiple', 'size': 'NA'})
    # otherwise, assign a match, and check whether the sizes of the matches are consistent (a good double-check)
    else:
        size_flag = (g_matches.loc[idx, 'size'] == o_row['size'])
        return pd.Series({'ID': record.iloc[0], 'size': size_flag.iloc[0]})
Perform the matches.
In [223]:
# add columns to g_odds to hold match ID and also info about size-correspondence
g_odds.insert(len(g_odds.columns),'ID','None')
g_odds.insert(len(g_odds.columns),'sizes_match','NA')
# perform the match
g_odds[['ID','sizes_match']] = g_odds.apply(get_tourney_ID,axis=1).values
Merge match numbers back into the bigger odds table, checking that the sizes of each "matched" tournament are the same.
In [224]:
# add "size" columns to both dataframes
odds = pd.merge(g_odds[['ATP','fname','ID','size']],odds,how='inner',on=['ATP','fname'])
matches = pd.merge(g_matches[['tourney_id','size']],matches,how='inner',on=['tourney_id'])
# compare the sizes
if sum(g_odds['sizes_match'] == True) != len(g_odds):
    print("Warning: at least one tournament in `odds` is matched to a "
          "tournament in `matches` of a different size.")
else:
    print("Sizes seem to match up.")
In [225]:
# within each tournament, number the matches from 1 to n
odds.insert(5,'match_num',0)
grouped = odds[['ID','match_num']].groupby('ID')
odds['match_num'] = grouped.transform(lambda x: 1+np.arange(len(x)))
# add keys to both odds and match data
odds.insert(len(odds.columns),'key',np.arange(len(odds)))
matches.insert(len(matches.columns),'key',np.arange(len(matches)))
We'll use round information to help with the task of pairing individual matches. To use round information effectively, we need to establish a correspondence between round signifiers in the two datasets. The exact correspondence depends on how many tennis matches are in each tournament.
In [226]:
# figure out how many discrete sizes there are
print("size in odds: ", odds['size'].unique())
print("size in matches: ", matches['size'].unique())
print("unique round designators in odds: ", odds.Round.unique())
print("unique round designators in matches: ", matches['round'].unique())
In [227]:
# create a lookup table to be able to match on rounds
m_rounds = ['R128','R64','R32','R16','QF','SF','F','RR']
o_rounds = ['1st Round','2nd Round','3rd Round','4th Round', \
'Quarterfinals','Semifinals','The Final','Round Robin']
round_lookup_small = pd.DataFrame({'m_rounds': m_rounds[2:-1],\
'o_rounds':o_rounds[0:2]+o_rounds[4:-1]})
round_lookup_medium = pd.DataFrame({'m_rounds': m_rounds[1:-1],\
'o_rounds':o_rounds[0:3]+o_rounds[4:-1]})
round_lookup_large = pd.DataFrame({'m_rounds': m_rounds[0:-1],\
'o_rounds':o_rounds[0:-1]})
round_lookup_RR = pd.DataFrame({'m_rounds':m_rounds[5:],\
'o_rounds':o_rounds[5:]})
With the round lookup tables defined, define a function that takes a row of odds and figures out how to map its round information to the round information in matches.
In [228]:
def map_rounds(x):
    cur_name = x['Round']
    t_size = x['size']
    if t_size in [27, 31]:
        new_name = round_lookup_small.loc[round_lookup_small.o_rounds == cur_name, 'm_rounds']
    elif t_size in [47, 55]:
        new_name = round_lookup_medium.loc[round_lookup_medium.o_rounds == cur_name, 'm_rounds']
    elif t_size in [95, 127]:
        new_name = round_lookup_large.loc[round_lookup_large.o_rounds == cur_name, 'm_rounds']
    else:
        new_name = round_lookup_RR.loc[round_lookup_RR.o_rounds == cur_name, 'm_rounds']
    return new_name.iloc[0]
We'll apply that mapping to each row of the odds dataframe.
In [229]:
# translate round identifiers appropriately
odds.insert(4,'round','TBD')
odds['round'] = odds.apply(map_rounds,axis=1).values
IP.display.display(odds[0:4])
IP.display.display(matches[0:4])
Before going on, a quick sanity check: is the set of matches in matches that are in odds the same size as the set of matches in odds that are in matches? Does the size column have the correct data?
In [230]:
t1=odds.ID.drop_duplicates().sort_values()
t2=matches.tourney_id.drop_duplicates().sort_values()
m_sizes=matches.loc[matches.tourney_id.isin(t1),['tourney_id','size']].drop_duplicates()
o_sizes=odds.loc[odds.ID.isin(t2),['ID','size']].drop_duplicates()
#comp = pd.merge(o_sizes,m_sizes,how='outer',left_on='ID',right_on='tourney_id')
print('sum of sizes of tournaments in odds: ', np.sum(o_sizes['size']))
print('sum of sizes of tournaments in matches: ', np.sum(m_sizes['size']))
Looks good. Now pare down the matches dataframe to contain only records in odds.
In [231]:
matches = matches.loc[matches.tourney_id.isin(t1),:]
print("number of records in `odds`: ", len(odds))
print("number of records in `matches`: ", len(matches))
To do player name matching, we split each name, and attempt to match on substrings.
In [232]:
# extract dataframe with player names split into discrete 'words'
m_players = pd.merge(matches.winner_name.str.split(pat=' ',expand=True), \
matches.loser_name.str.split(pat=' ',expand=True), \
how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))
# add on tournament, round, and match identifiers
m_players = pd.merge(matches[['tourney_id','match_num', 'round','key']], m_players,\
how='inner',left_index=True, right_index=True).sort_values(['tourney_id','round','1_W','1_L'])
# extract dataframe with player names split into discrete 'words'
o_players = pd.merge(odds.Winner.str.split(pat=' ',expand=True), \
odds.Loser.str.split(pat=' ',expand=True), \
how='inner',left_index=True, right_index=True,suffixes=('_W','_L'))
# add on tournament and round identifiers
o_players = pd.merge(odds[['ID','round','match_num','key']], o_players,\
how='inner',left_index=True, right_index=True).sort_values(['ID','round','0_W','0_L'])
print("m_players: ")
IP.display.display(m_players[0:5])
print("o_players")
IP.display.display(o_players[0:5])
Match individual matches as follows: assuming agreement on both tournament number and round, first try for an exact match on the last names of both winner and loser.
In [233]:
# try for an exact match on last names of both winner and loser
A = pd.merge(m_players[['tourney_id','round','key','1_W','1_L']],\
o_players[['ID','round','key','0_W','0_L']],how='inner',\
left_on=['tourney_id','round','1_W','1_L'],\
right_on=['ID','round','0_W','0_L'],suffixes=['_m','_o'])
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]
print("A total of {} matches down. {} remain.".format(len(A),len(m_extras)))
In [234]:
A.key_o.unique().size
Out[234]:
Take a quick peek at the remaining names to get a sense for what the issues are.
In [235]:
IP.display.display(m_extras[0:10])
IP.display.display(o_extras[0:10])
Now match across all substrings in a name. To do so, we need a function that will return True if there is a match between one or more strings in two string lists.
In [236]:
def comp_str_lists(a, b):
    """checks to see if any of the strings in list a are also in list b"""
    for i in a:
        if i in b:
            return True
    return False
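A quick check of the helper's behavior (the function is repeated here so the cell is self-contained; the names are illustrative):

```python
def comp_str_lists(a, b):
    """checks to see if any of the strings in list a are also in list b"""
    for i in a:
        if i in b:
            return True
    return False

# one shared token ('federer') is enough for a match
print(comp_str_lists(['roger', 'federer'], ['federer', 'r.']))  # True
# no shared tokens -> no match
print(comp_str_lists(['juan', 'martin'], ['del', 'potro']))     # False
```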
We also need a function that takes each row of odds and tries to find a match in the appropriate chunk of matches.
In [237]:
def comp_all_cols(o_row):
    """
    input: row of o_players
    output: Series with elements 'key_o' and 'key_m' (both 0 if no
    unambiguous pairing is found)
    """
    m_chunk = m_extras.loc[(m_extras.tourney_id == o_row['ID']) & (m_extras['round'] == o_row['round'])]
    o_winner = list(o_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
    o_loser = list(o_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
    pairing = []
    if len(m_chunk) == 0:
        print("warning: no match/round pairing found for o_row key {}".format(o_row['key']))
        return pd.Series({'key_o': 0, 'key_m': 0})
    for i, m_row in m_chunk.iterrows():
        m_winner = list(m_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
        m_loser = list(m_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
        # compare odds winner to matches winner, and odds loser to matches loser
        pairing.append(comp_str_lists(o_winner, m_winner) & comp_str_lists(o_loser, m_loser))
    if sum(pairing) == 1:
        m_row = m_chunk.iloc[np.array(pairing), :]
        return pd.Series({'key_o': o_row['key'], 'key_m': m_row['key'].iloc[0]})
    elif sum(pairing) < 1:
        print("warning: no name matches for o_row key {}".format(o_row['key']))
        return pd.Series({'key_o': 0, 'key_m': 0})
    else:
        print("warning: multiple name matches for o_row key {}".format(o_row['key']))
        return pd.Series({'key_o': 0, 'key_m': 0})
In [238]:
new_matches = o_extras.apply(comp_all_cols,axis=1)
Update match list, and check progress.
In [239]:
new_matches = new_matches.loc[(new_matches.key_m!=0)&(new_matches.key_o!=0),:]
A = pd.concat([A[['key_m','key_o']],new_matches])
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]
print("A total of {} matches down. {} remain.".format(len(A),len(m_extras)))
Take a peek at the remaining names and see what the problems are.
In [240]:
IP.display.display(m_extras.sort_values('0_W')[0:10])
IP.display.display(o_extras.sort_values('1_L')[0:10])
Some rounds are wrong. Try re-matching both winner and loser last names, without insisting on round information.
In [241]:
B = pd.merge(m_extras[['tourney_id','key','1_W','1_L']],\
o_extras[['ID','round','key','0_W','0_L']],how='inner',\
left_on=['tourney_id','1_W','1_L'],\
right_on=['ID','0_W','0_L'],suffixes=['_m','_o'])
A = pd.concat([A,B[['key_m','key_o']]])
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]
print("A total of {} matches down. {} remain.".format(len(A),len(m_extras)))
IP.display.display(m_extras[0:4])
IP.display.display(o_extras[0:4])
That solved some. Now try matching unusual names, ignoring rounds. This involves slightly modifying the comparison function.
In [242]:
def comp_all_cols_no_rounds(o_row):
    """
    input: row of o_players
    output: Series with elements 'key_o' and 'key_m' (both 0 if no
    unambiguous pairing is found)
    """
    m_chunk = m_extras.loc[(m_extras.tourney_id == o_row['ID'])]
    o_winner = list(o_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
    o_loser = list(o_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
    pairing = []
    if len(m_chunk) == 0:
        print("warning: no match/round pairing found for o_row key {}".format(o_row['key']))
        return pd.Series({'key_o': 0, 'key_m': 0})
    for i, m_row in m_chunk.iterrows():
        m_winner = list(m_row[['0_W','1_W','2_W','3_W','4_W']].dropna())
        m_loser = list(m_row[['0_L','1_L','2_L','3_L','4_L']].dropna())
        pairing.append(comp_str_lists(o_winner, m_winner) & comp_str_lists(o_loser, m_loser))
    if sum(pairing) == 1:
        m_row = m_chunk.iloc[np.array(pairing), :]
        return pd.Series({'key_o': o_row['key'], 'key_m': m_row['key'].iloc[0]})
    elif sum(pairing) < 1:
        print("warning: no name matches for o_row key {}".format(o_row['key']))
        return pd.Series({'key_o': 0, 'key_m': 0})
    else:
        print("warning: multiple name matches for o_row key {}".format(o_row['key']))
        print(m_chunk.iloc[np.array(pairing), :])
        return pd.Series({'key_o': 0, 'key_m': 0})
In [243]:
new_matches = o_extras.apply(comp_all_cols_no_rounds,axis=1)
new_matches = new_matches.loc[(new_matches.key_m!=0)&(new_matches.key_o!=0),:]
A = pd.concat([A[['key_m','key_o']],new_matches])
Still some errors. Let's take a look.
In [244]:
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]
print("A total of {} matches down. {} remain.".format(len(A),len(m_extras)))
IP.display.display(m_extras)
IP.display.display(o_extras)
Two big problems remain. The first is 'delbonis' vs. 'del bonis'.
In [252]:
# solve the delbonis problem
o_extras.loc[o_extras['1_W']=='bonis',['0_W','1_W']] = ['delbonis',None]
B = pd.merge(m_extras[['tourney_id','key','1_W','1_L']],\
o_extras[['ID','round','key','0_W','0_L']],how='inner',\
left_on=['tourney_id','1_W','1_L'],\
right_on=['ID','0_W','0_L'],suffixes=['_m','_o'])
A = pd.concat([A,B[['key_m','key_o']]])
m_extras = m_players.loc[~m_players.key.isin(A.key_m),:]
o_extras = o_players.loc[~o_players.key.isin(A.key_o),:]
print("A total of {} matches down. {} remain.".format(len(A),len(m_extras)))
IP.display.display(m_extras)
IP.display.display(o_extras)
What remains is exclusively a mismatch of winner and loser. First sort and match keys, then check who really won and correct the data.
In [246]:
m_extras = m_extras.sort_values(['tourney_id','1_W'])
o_extras = o_extras.sort_values(['ID','0_L'])
dregs=pd.DataFrame(list(zip(m_extras['key'].values, \
o_extras['key'].values)),\
columns=['key_m','key_o'])
A = pd.concat([A,dregs[['key_m','key_o']]])
To see who really won, calculate who played the most rounds.
In [247]:
def find_winner(o_row):
    ID = o_row['ID']
    name1 = o_row['0_W']
    name2 = o_row['0_L']
    # count the rounds each player appears in for this tournament
    rnds1 = len(odds.loc[(odds.ID == ID) &
                         (odds.Winner.str.contains(name1) |
                          odds.Loser.str.contains(name1)), :])
    rnds2 = len(odds.loc[(odds.ID == ID) &
                         (odds.Winner.str.contains(name2) |
                          odds.Loser.str.contains(name2)), :])
    if rnds1 > rnds2:
        print('Winner: ', name1)
        return 1
    elif rnds1 < rnds2:
        print('Winner: ', name2)
        return -1
    else:
        print("function find_winner: ambiguous outcome")
        return 0
In [251]:
mistake_idx = o_extras.apply(find_winner,axis=1)
o_errs = o_extras.loc[mistake_idx.values==-1,:]
# swap Winner and Loser for the records where the listed winner is wrong
temp = odds.loc[odds.key.isin(o_errs['key']),'Winner']
odds.loc[odds.key.isin(o_errs['key']),'Winner']=odds.loc[odds.key.isin(o_errs['key']),\
                                                         'Loser'].values
odds.loc[odds.key.isin(o_errs['key']),'Loser']=temp.values
Note that Odesnik is listed as the winner in the 2nd round, but he clearly lost, as Becker went on to play in the QF. We can use this insight to figure out who actually won the match.
We are now reduced to six non-matches, so we can finish our correspondence table by hand easily:
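A sketch of the hand-finishing step, with placeholder key values (the real pairs come from inspecting m_extras and o_extras above):

```python
import pandas as pd

# stand-in for the lookup table A built so far (placeholder values)
A = pd.DataFrame({'key_m': [1, 2], 'key_o': [5, 6]})

# hypothetical hand-matched (matches-key, odds-key) pairs; the actual
# values would be read off the remaining m_extras/o_extras rows
hand_pairs = pd.DataFrame({'key_m': [101, 202, 303],
                           'key_o': [110, 220, 330]})
A = pd.concat([A, hand_pairs], ignore_index=True)
print(len(A))  # 5
```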
In [142]:
#sanity check
print("odds has {} records".format(len(odds)))
print("our lookup table is of size {}".format(len(A)))
print("the table has {} unique keys for `matches`".format(len(A.key_m.unique())))
print("the table has {} unique keys for `odds`".format(len(A.key_o.unique())))
Now we can assign a single key, shared by odds and matches, which corresponds on the match level. We can also standardize match numbers within tournaments.
In [143]:
# merge the two dataframes on both keys
matches.rename(columns = {'key':'key_m'},inplace=True)
matches = pd.merge(matches,A,how='inner',on='key_m')
odds.rename(columns = {'key':'key_o'},inplace=True)
odds = pd.merge(odds,A,how='inner',on='key_o')
# use the `odds` match numbers on `matches`
matches.rename(columns = {'match_num':'match_num_old'},inplace=True)
matches = pd.merge(matches,odds[['match_num','key_o']],how='inner',on='key_o')
Finally, we save our cleansed and matched datasets.
In [144]:
#matches.to_pickle(pickle_dir + 'matches.pkl')
#odds.to_pickle(pickle_dir + 'odds.pkl')