Setup MySQL Database for Tennis Predictions
Author: Carl Toews
Date: September 30, 2017
File: setup_mysql_database.ipynb
Description:
This notebook provides Python code to create a MySQL database from CSV files with data about professional tennis matches. The data comes from two sources: basic player and outcome information is from https://github.com/JeffSackmann/tennis_atp, while odds data are from http://www.tennis-data.co.uk/alldata. The code in this notebook should be run before running the learning algorithms in the notebook tennis_predictions, which assumes that the MySQL database has already been created.
Details:
Path and file names can be configured in the first cell below. As it stands, this notebook assumes that file names and directory structures are as reflected in this github site.
The CSV files within Sackmann dataset are as follows:
* "atp_matches_xxxx.csv" (xxxx represents a year between 1968 and 2017)
* "atp_matches_futures_xxxx.csv" (xxxx is a year between 1991 and 2017)
* "atp_matches_qual_chall_xxxx.csv" (xxxx is a year between 1991 and 2017)
There is also a single CSV file with player info:
* "atp_players.csv"
and week-by-week rankings by decade, stored in files with names as follows:
* "atp_rankings_xxs.csv" (xx represents a decade--70,80,90,00, or 10)
The rankings of the current decade are stored in
* "atp_rankings_current.csv"
The CSV files within the directory "odds_data" are of the form
* "xxxx.csv" (xxxx a year between 2001 and 2017)
The notebook also assumes a MySQL database named "tennis" exists, and there there is a user called "testuser" with password "test623" who can create tables in this database. Again, username and password can be configured below.
Pathnames, database names, usernames, and passwords configurable here:
In [108]:
# name of database
db_name = "tennis"
# name of db user
username = "testuser"
# db password for db user
password = "test623"
# location of atp data files
atpfile_directory = "./data/tennis_atp-master/"
# location of odds data files
oddsfiles_directory = "./data/odds_data/"
In [109]:
import sqlalchemy # pandas-mysql interface library
import sqlalchemy.exc # exception handling
from sqlalchemy import create_engine # needed to define db interface
import glob # for file manipulation
import sys # for defining behavior under errors
from IPython.core.debugger import Tracer
import pdb
In [110]:
# create an engine for interacting with the MySQL database
try:
eng_str = 'mysql+mysqldb://' + username + ':' + password + '@localhost/'
engine = create_engine(eng_str)
connection = engine.connect()
version = connection.execute("SELECT VERSION()")
print("Database version : ")
print(version.fetchone())
# report what went wrong if this fails.
except sqlalchemy.exc.DatabaseError as e:
reason = e.message
print("Error %s:" % (reason))
sys.exit(1)
# close the connection
finally:
if connection:
connection.close()
else:
print("Failed to create connection.")
In [4]:
# create an engine for interacting with the MySQL database
try:
engine.execute("CREATE DATABASE IF NOT EXISTS " + db_name) #create db
engine.execute("USE " + db_name) # select new db
# report what went wrong if this fails.
except sqlalchemy.exc.DatabaseError as e:
reason = e.message
print("Error %s:" % (reason))
sys.exit(1)
matches"This cell first checks to see if the table "matches" already exists. If it does, the code does nothing. If it does not, the code creates it, and then reads in the data from the relevant .csv files, the location of which is specified above. Note that there are three different match types: regular matches, qualifying matches, and futures matches. By default, all three get read into the table, though you can edit this behavior in the opening lines. by default, all three types of matches will be read into the table "matches"
In [101]:
table_name = "matches"
db_name = "tennis"
engine.execute("USE " + db_name + ";")
#all_match_types = ['atp_matches','atp_matches_qual_chall','atp_matches_futures']
all_match_types = ['atp_matches'] # focus for the moment on major tournaments
# start a transaction with the database 'tennis'
with engine.begin() as connection:
query=("""SELECT count(*) FROM information_schema.TABLES WHERE
(TABLE_SCHEMA = '""" + db_name + """') AND (TABLE_NAME = '""" +
table_name + """');""")
res = connection.execute(query)
if (res.fetchone()[0]!=0):
print("Table " + table_name + " already exists, skipping table creation.")
else:
connection.execute("CREATE TABLE matches (tourney_id VARCHAR(256));")
connection.execute("ALTER TABLE matches ADD COLUMN tourney_name VARCHAR(256);")
connection.execute("ALTER TABLE matches ADD COLUMN surface VARCHAR(256);")
connection.execute("ALTER TABLE matches ADD COLUMN draw_size TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN tourney_level CHAR(1);")
connection.execute("ALTER TABLE matches ADD COLUMN tourney_date DATE;")
connection.execute("ALTER TABLE matches ADD COLUMN match_num SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN winner_id MEDIUMINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN winner_seed TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN winner_entry VARCHAR(2);")
connection.execute("ALTER TABLE matches ADD COLUMN winner_name VARCHAR(256);")
connection.execute("ALTER TABLE matches ADD COLUMN winner_hand CHAR(1);")
connection.execute("ALTER TABLE matches ADD COLUMN winner_ht TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN winner_ioc CHAR(3);")
connection.execute("ALTER TABLE matches ADD COLUMN winner_age DECIMAL(12,10);")
connection.execute("ALTER TABLE matches ADD COLUMN winner_rank SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN winner_rank_points SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN loser_id MEDIUMINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN loser_seed TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN loser_entry VARCHAR(2);")
connection.execute("ALTER TABLE matches ADD COLUMN loser_name VARCHAR(256);")
connection.execute("ALTER TABLE matches ADD COLUMN loser_hand CHAR(1);")
connection.execute("ALTER TABLE matches ADD COLUMN loser_ht TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN loser_ioc CHAR(3);")
connection.execute("ALTER TABLE matches ADD COLUMN loser_age DECIMAL(12,10);")
connection.execute("ALTER TABLE matches ADD COLUMN loser_rank SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN loser_rank_points SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN score VARCHAR(256);")
connection.execute("ALTER TABLE matches ADD COLUMN best_of TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN round VARCHAR(4);")
connection.execute("ALTER TABLE matches ADD COLUMN minutes SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_ace TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_df TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_svpt SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_1stIn SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_1stWon SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_2ndWon TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_SvGms TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_bpSaved TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN w_bpFaced TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_ace TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_df TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_svpt SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_1stIn SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_1stWon SMALLINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_2ndWon TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_SvGms TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_bpSaved TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN l_bpFaced TINYINT UNSIGNED;")
connection.execute("ALTER TABLE matches ADD COLUMN match_type VARCHAR(256);")
# loop through all three match types, and store data for each
for match_type in all_match_types:
# store names of all .csv files for this match-type
all_files = glob.glob(atpfile_directory + match_type + "_[0-9]*.csv")
#all_files = ["./data/tennis_atp-master/atp_matches_2017.csv"]
# load each file, and make sure missing data is stored as NULL
for file in all_files:
print(file)
query = """LOAD DATA LOCAL INFILE '""" + file + \
"""' INTO TABLE matches \
COLUMNS TERMINATED BY ',' \
LINES TERMINATED BY '\n' \
IGNORE 1 LINES \
(@v0, @v1, @v2, @v3, @v4, @v5, @v6, @v7, @v8, @v9, @v10, @v11, @v12, @v13, @v14, @v15, @v16, @v17, @v18, @v19, @v20, @v21, @v22, @v23, @v24, @v25, @v26, @v27, @v28, @v29, @v30, @v31, @v32, @v33, @v34, @v35, @v36, @v37, @v38, @v39, @v40, @v41, @v42, @v43, @v44, @v45, @v46, @v47, @v48) \
SET \
tourney_id = nullif(@v0,''), \
tourney_name = nullif(@v1,''), \
surface = nullif(@v2,''), \
draw_size = nullif(@v3,''), \
tourney_level = nullif(@v4,''), \
tourney_date = nullif(@v5,''), \
match_num = nullif(@v6,''), \
winner_id = nullif(@v7,''), \
winner_seed = nullif(@v8,''), \
winner_entry = nullif(@v9,''), \
winner_name = nullif(@v10,''), \
winner_hand = nullif(@v11,''), \
winner_ht = nullif(@v12,''), \
winner_ioc = nullif(@v13,''), \
winner_age = nullif(@v14,''), \
winner_rank = nullif(@v15,''), \
winner_rank_points = nullif(@v16,''), \
loser_id = nullif(@v17,''), \
loser_seed = nullif(@v18,''), \
loser_entry = nullif(@v19,''), \
loser_name = nullif(@v20,''), \
loser_hand = nullif(@v21,''), \
loser_ht = nullif(@v22,''), \
loser_ioc = nullif(@v23,''), \
loser_age = nullif(@v24,''), \
loser_rank = nullif(@v25,''), \
loser_rank_points = nullif(@v26,''), \
score = nullif(@v27,''), \
best_of = nullif(@v28,''), \
round = nullif(@v29,''), \
minutes = nullif(@v30,''), \
w_ace = nullif(@v31,''), \
w_df = nullif(@v32,''), \
w_svpt = nullif(@v33,''), \
w_1stIn = nullif(@v34,''), \
w_1stWon = nullif(@v35,''), \
w_2ndWon = nullif(@v36,''), \
w_SvGms = nullif(@v37,''), \
w_bpSaved = nullif(@v38,''), \
w_bpFaced = nullif(@v39,''), \
l_ace = nullif(@v40,''), \
l_df = nullif(@v41,''), \
l_svpt = nullif(@v42,''), \
l_1stIn = nullif(@v43,''), \
l_1stWon = nullif(@v44,''), \
l_2ndWon = nullif(@v45,''), \
l_SvGms = nullif(@v46,''), \
l_bpSaved = nullif(@v47,''), \
l_bpFaced = nullif(@v48,''), \
match_type = '""" + match_type + """' \
;"""
connection.execute(query)
In [6]:
# start a transaction with the "tennis" database
with engine.begin() as connection:
# find out if the table "players" exists
query="""\
SELECT count(*) FROM information_schema.TABLES WHERE (TABLE_SCHEMA = '""" + db_name + """') AND (TABLE_NAME = 'players');\
"""
res = connection.execute(query)
# if "players does not exist, create it
if (res.fetchone()[0]!=0):
print("Table PLAYERS already exists, skipping table creation.")
else:
connection.execute("CREATE TABLE players \
(player_id MEDIUMINT UNSIGNED, \
first_name VARCHAR(256), \
last_name VARCHAR(256), \
hand CHAR(1), \
birth_date DATE, \
country_code CHAR(3)) \
;")
# read in the appropriate .csv files
file = atpfile_directory + "atp_players.csv"
query = """LOAD DATA LOCAL INFILE '""" + file + """' INTO TABLE players \
COLUMNS TERMINATED BY ',' \
LINES TERMINATED BY '\n' \
(@v0, @v1, @v2, @v3, @v4, @v5) \
SET \
player_id = nullif(@v0,''), \
first_name = nullif(@v1,''), \
last_name = nullif(@v2,''), \
hand = nullif(@v3,''), \
birth_date = nullif(@v4,''), \
country_code = nullif(@v5,'')
;"""
connection.execute(query)
In [7]:
with engine.begin() as connection:
query="""\
SELECT count(*) FROM information_schema.TABLES WHERE (TABLE_SCHEMA = '""" + db_name + """') AND (TABLE_NAME = 'rankings');\
"""
r1 = connection.execute(query)
if (r1.fetchone()[0]!=0):
print("Table RANKINGS already exists, skipping table creation.")
else:
connection.execute("CREATE TABLE rankings \
(ranking_date DATE, \
ranking SMALLINT UNSIGNED, \
player_id MEDIUMINT UNSIGNED, \
ranking_points SMALLINT UNSIGNED) \
;")
# store in a list the names of all the relevant .csv files
all_files = glob.glob("/Users/ctoews/Projects/Data_Science/Tennis/tennis_atp-master/atp_rankings_*.csv")
# load each file, and make sure missing data is stored as NULL
for file in all_files:
query = """LOAD DATA LOCAL INFILE '""" + file + """' INTO TABLE rankings \
COLUMNS TERMINATED BY ',' \
LINES TERMINATED BY '\n' \
(@v0, @v1, @v2, @v3, @v4) \
SET
ranking_date = nullif(@v0,''), \
ranking = nullif(@v1,''), \
player_id = nullif(@v2,''), \
ranking_points = nullif(@v3,'') \
;"""
connection.execute(query)
In [114]:
table_name = "podds"
db_name = "tennis"
engine.execute("USE " + db_name + ";")
with engine.begin() as connection:
query=("""SELECT count(*) FROM information_schema.TABLES WHERE
(TABLE_SCHEMA = '""" + db_name + """') AND (TABLE_NAME = '""" +
table_name + """');""")
res = connection.execute(query)
if (res.fetchone()[0]!=0):
print("Table " + table_name + " already exists, skipping table creation.")
else:
query=("""CREATE TABLE """ + table_name + """
(ATP TINYINT UNSIGNED,
Location VARCHAR(256),
Tournament VARCHAR(256),
Date VARCHAR(256),
Series VARCHAR(256),
Court VARCHAR(256),
Surface VARCHAR(256),
Round VARCHAR(256),
BestOf TINYINT UNSIGNED,
Winner VARCHAR(256),
Loser VARCHAR(256),
WRank SMALLINT UNSIGNED,
LRank SMALLINT UNSIGNED,
WPts SMALLINT UNSIGNED,
LPts SMALLINT UNSIGNED,
W1 TINYINT UNSIGNED,
L1 TINYINT UNSIGNED,
W2 TINYINT UNSIGNED,
L2 TINYINT UNSIGNED,
W3 TINYINT UNSIGNED,
L3 TINYINT UNSIGNED,
W4 TINYINT UNSIGNED,
L4 TINYINT UNSIGNED,
W5 TINYINT UNSIGNED,
L5 TINYINT UNSIGNED,
Wsets TINYINT UNSIGNED,
Lsets TINYINT UNSIGNED,
Comment VARCHAR(256),
B365W DECIMAL(6,3),
B365L DECIMAL(6,3),
EXW DECIMAL(6,3),
EXL DECIMAL(6,3),
LBW DECIMAL(6,3),
LBL DECIMAL(6,3),
PSW DECIMAL(6,3),
PSL DECIMAL(6,3),
SJW DECIMAL(6,3),
SJL DECIMAL(6,3),
MaxW DECIMAL(6,3),
MaxL DECIMAL(6,3),
AvgW DECIMAL(6,3),
AvgL DECIMAL(6,3),
fname CHAR(4))
;""")
connection.execute(query)
# store in a list the names of all the relevant .csv files
all_files = glob.glob(oddsfiles_directory + "201[0-7].csv")
# load each file, and make sure missing data is stored as NULL
for file in all_files:
fname = file.split('/')[-1].split('.')[0]
#pdb.set_trace()
print("using file ", file)
query = ("""LOAD DATA LOCAL INFILE '""" + file + """' INTO TABLE """ + table_name +
""" COLUMNS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(@v0, @v1, @v2, @v3, @v4, @v5, @v6, @v7, @v8, @v9, @v10, @v11, @v12, @v13, @v14, @v15, @v16, @v17, @v18, @v19, @v20, @v21, @v22, @v23, @v24, @v25, @v26, @v27, @v28, @v29, @v30, @v31, @v32, @v33, @v34, @v35, @v36, @v37, @v38, @v39, @v40, @v41)
SET
ATP = nullif(@v0,''),
Location = nullif(@v1,''),
Tournament = nullif(@v2,''),
Date = nullif(@v3,''),
Series = nullif(@v4,''),
Court = nullif(@v5,''),
Surface = nullif(@v6,''),
Round = nullif(@v7,''),
BestOf = nullif(@v8,''),
Winner = nullif(@v9,''),
Loser = nullif(@v10,''),
WRank = nullif(@v11,''),
LRank = nullif(@v12,''),
WPts = nullif(@v13,''),
LPts = nullif(@v14,''),
W1 = nullif(@v15,''),
L1 = nullif(@v16,''),
W2 = nullif(@v17,''),
L2 = nullif(@v18,''),
W3 = nullif(@v19,''),
L3 = nullif(@v20,''),
W4 = nullif(@v21,''),
L4 = nullif(@v22,''),
W5 = nullif(@v23,''),
L5 = nullif(@v24,''),
Wsets = nullif(@v25,''),
Lsets = nullif(@v26,''),
Comment = nullif(@v27,''),
B365W = nullif(@v28,''),
B365L = nullif(@v29,''),
EXW = nullif(@v30,''),
EXL = nullif(@v31,''),
LBW = nullif(@v32,''),
LBL = nullif(@v33,''),
PSW = nullif(@v34,''),
PSL = nullif(@v35,''),
SJW = nullif(@v36,''),
SJL = nullif(@v37,''),
MaxW = nullif(@v38,''),
MaxL = nullif(@v39,''),
AvgW = nullif(@v40,''),
AvgL = nullif(@v41,''),
fname = '""" + fname + """';""")
connection.execute(query)
# modify date strings to be mysql "date" types
query = """UPDATE """ + table_name + """ SET Date = str_to_date( Date, '%%m/%%d/%%Y');"""
connection.execute(query)
# change the type to "date", explicitly
query = """ALTER TABLE """ + table_name + """ MODIFY COLUMN Date DATE;"""
connection.execute(query)
# add a primary key
#query = "ALTER TABLE " + table_name + " ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY;"
#connection.execute(query)
In [ ]: