The goal of this project is to investigate how nationality is related to different aspects of a tennis player's game and how certain match statistics affect the outcome of a match. For example, it is generally accepted that, because few clay courts are available for practice in the US, American players are not very good on this surface. However, the abundance of hard courts makes them serious contenders at tournaments played on surfaces that more closely resemble asphalt, such as the US Open or the Australian Open. On the flip side, Spanish players are known as masters of the clay court. This is attributed to the abundance of clay courts in Spain combined with cultural factors that make the typical Spanish style of play well suited to clay. In fact, one Spaniard, Rafael Nadal, has had so much success that he is widely known as “The King of Clay.”
This is one of the trends we would like to investigate. For the most part, these claims remain cultural hearsay within the tennis world. We're interested in analyzing match data and seeing if these general patterns actually have some quantitative basis. Using the data from thousands of matches, we would like to examine these assumptions, as well as look for new patterns that can be found. Do certain nationalities tend to spend more time on court? Are certain players more likely to hit aces than others? Who is more at risk for committing unforced errors? How important is saving break points on your serve to winning a match? Have these trends changed over time?
In [1]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for pandas
%matplotlib inline
# check versions
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
To look for and analyze these trends, we will be using work published by Jeff Sackmann, an author and entrepreneur in the sports statistics field. Jeff maintains a page on GitHub known as the “Match Charting Project,” in which he has recorded the results of tens of thousands of matches spanning over thirty years. These match results include data points such as court surface, amount of time on court, winner, loser, number of double faults, and many more. For the purposes of our research, we will only be examining the results of matches between 2013 and 2015. In addition, we will only be analyzing the results of players from seven countries: USA, Japan, Spain, Argentina, Australia, Serbia, and France. These countries give us a cross section of players from five continents and are among the major countries in the tennis world, with the most prominent players.
In [2]:
url2013 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2013.csv' # 2013 match data
url2014 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2014.csv' # 2014 match data
url2015 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2015.csv' # 2015 match data
# convert each csv file into data frames
df2013 = pd.read_csv(url2013)
df2014 = pd.read_csv(url2014)
df2015 = pd.read_csv(url2015)
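Because the three URLs differ only in the year, the loading step could also be written as a loop. The following is a minimal sketch, not the code used in this notebook: `season_url` and `load_seasons` are hypothetical helper names, and the URL pattern is simply copied from the three URLs above.

```python
import pandas as pd

# URL pattern copied from the three URLs used above (assumed to hold for each year we request)
BASE_URL = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv'

def season_url(year):
    """Return the CSV URL for one season (hypothetical helper)."""
    return BASE_URL.format(year=year)

def load_seasons(years, reader=pd.read_csv):
    """Read one dataframe per season and key the results by year.

    `reader` can be swapped out so the function can be tried without network access.
    """
    return {year: reader(season_url(year)) for year in years}

# usage (requires network access):
# frames = load_seasons(range(2013, 2016))
# df2013 = frames[2013]
```

Keeping the frames in a dictionary keyed by year avoids one variable per season if more years are added later.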
In [3]:
# a look at the plentiful amount of data provided by Jeff Sackmann (each year has the same data)
df2013.columns
Out[3]:
Since we are not examining every variable, the final dataframe that we will be working with will not include every column. The following code keeps only the variables we are interested in. In addition, each dataframe includes countries other than the ones we are interested in, so we create a variable, countries, which contains a list of the country codes we are examining.
In [4]:
## positions of the columns (variables) we want to extract from each dataframe
numlist=[1,2,5,10,11,12,13,14,15,20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48]
## extract the metrics we're interested in by position (.iloc) and assign them to a new dataframe for each year
atp2013 = df2013.iloc[:, numlist]
atp2014 = df2014.iloc[:, numlist]
atp2015 = df2015.iloc[:, numlist]
countries=['USA','FRA','ESP','SRB','JPN','AUS','ARG'] ## we'll use this later. It's a list of countries we're interested in analyzing
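Selecting columns by position depends on the column order of the CSV files staying fixed. As a hedged alternative, the columns can be selected by name. The sketch below uses a toy dataframe with made-up values; the names in `wanted` are only the ones this notebook uses later, not the full set picked out by numlist.

```python
import pandas as pd

# Column names taken from the ones used later in this notebook;
# the set chosen by numlist is larger, so this list is illustrative only.
wanted = ['surface', 'best_of', 'minutes', 'winner_ioc', 'loser_ioc',
          'w_df', 'l_df', 'w_bpFaced', 'w_bpSaved', 'l_bpFaced', 'l_bpSaved']

# Toy stand-in for one season's dataframe (made-up values)
toy = pd.DataFrame({
    'tourney_name': ['A', 'B'],
    'surface': ['Clay', 'Hard'],
    'best_of': [3, 5],
    'minutes': [95.0, 180.0],
    'winner_ioc': ['ESP', 'USA'],
    'loser_ioc': ['FRA', 'SRB'],
    'w_df': [2, 4], 'l_df': [3, 1],
    'w_bpFaced': [4, 6], 'w_bpSaved': [3, 5],
    'l_bpFaced': [8, 9], 'l_bpSaved': [5, 6],
})

subset = toy[wanted]  # label-based selection, unaffected by column order
print(subset.columns.tolist())
```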
In [5]:
# create our final master dataframe which combines the dataframes for each year
atp = pd.concat([atp2013, atp2014, atp2015]) # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
atp.head()
Out[5]:
Finally, to keep our research as controlled as possible, we removed Grand Slam matches, which are played best 3 out of 5 sets. Our final dataframe consists of the results of matches that were played best 2 out of 3.
In [6]:
atp= atp[atp['best_of'].isin([3])]
Now that we have our "Master DataFrame", we decided to look at nationality's effects on play based on a few different metrics:
1) Length of Time of a Match
2) Win Percentage per Surface
3) Unforced Errors Committed Over Time
4) Break Points Saved as a percentage of Break Points Faced
The challenge in this dataframe is that for each match, there is a winner and a loser. This means that a lot of the data manipulation involves tracking the players both in the wins and the losses column.
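One way to handle the winner/loser layout is to reshape the data so that every match appears once per player. This is only a sketch on a toy dataframe with made-up values, and it is not the approach the cells below take (they process the winner and loser columns separately).

```python
import pandas as pd

# Toy matches with made-up values; column names follow the dataset
matches = pd.DataFrame({
    'winner_ioc': ['ESP', 'USA', 'ESP'],
    'loser_ioc':  ['USA', 'FRA', 'ARG'],
    'minutes':    [110.0, 75.0, 98.0],
})

# One frame per role, with the role-specific column renamed to a shared name
won = matches.rename(columns={'winner_ioc': 'ioc'})[['ioc', 'minutes']].assign(won=True)
lost = matches.rename(columns={'loser_ioc': 'ioc'})[['ioc', 'minutes']].assign(won=False)

# Long format: every match appears twice, once for each player
long = pd.concat([won, lost], ignore_index=True)
print(long)
```

With the data in this shape, a single groupby on `ioc` covers wins and losses at the same time.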
Because Europeans play on clay courts for all of their junior careers, their games are more suited to this surface. Since clay is a slow surface, one would anticipate that a European player would play longer matches. The reverse should hold for Americans. Hard courts, which are notoriously fast, are the predominant surface in the US. This means American players are used to short points, so one would expect their matches to be shorter overall.
In [7]:
# get average length of time for matches that a country won
g = atp[['winner_ioc','minutes']].groupby('winner_ioc')
gm=g.mean()
minW= gm.loc[countries] #pull average match length for countries we want to examine
# because there is a winner_ioc and a loser_ioc, we have to rename the index to Country in order to combine the dataframes
minW.index.name='Country'
minW
Out[7]:
In [8]:
# now we do the same for the matches that the countries lost
l = atp[['loser_ioc','minutes']].groupby('loser_ioc')
gl=l.mean()
minL= gl.loc[countries]
minL.index.name='Country'
minL
Out[8]:
In [9]:
# Combine the average won and lost time to a total average
minL['minutesW']=minW['minutes']
minL['Average']= (minL['minutes']+minL['minutesW'])/2
minL
Out[9]:
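We are only interested in the Average column, so that is the one we used to create our graph. Note that it is the simple mean of the average length of matches won and the average length of matches lost, which equals the average over all matches only when a country has as many wins as losses.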
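To illustrate how the Average column can differ from an average taken over every match a country played, here is a small sketch on a toy dataframe with made-up values. It is an illustration under those assumptions, not a recomputation of the results shown here.

```python
import pandas as pd

# Toy matches with made-up values
matches = pd.DataFrame({
    'winner_ioc': ['ESP', 'ESP', 'ESP', 'USA'],
    'loser_ioc':  ['USA', 'FRA', 'USA', 'ESP'],
    'minutes':    [100.0, 120.0, 110.0, 60.0],
})

# Mean of the two group means, as in the Average column
won = matches.groupby('winner_ioc')['minutes'].mean()
lost = matches.groupby('loser_ioc')['minutes'].mean()
unweighted = (won + lost) / 2

# Mean over every appearance of a country, win or loss
appearances = pd.concat([
    matches[['winner_ioc', 'minutes']].rename(columns={'winner_ioc': 'ioc'}),
    matches[['loser_ioc', 'minutes']].rename(columns={'loser_ioc': 'ioc'}),
])
weighted = appearances.groupby('ioc')['minutes'].mean()

print(unweighted['ESP'], weighted['ESP'])
```

In this toy example ESP has three wins and one loss, so the two numbers differ; the gap shrinks as a country's win and loss counts get closer.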
In [10]:
minL['Average'].plot(kind='bar',
ylim= (80,105))
Out[10]:
This graph supports the popular consensus people have about players from certain countries. Spanish players are known as the clay court masters who play incredibly long points. As this graph shows, Spanish players play longer matches on average than players from every other country except Argentina, a country also known for its clay court play. Players from the United States, as expected, play among the shortest matches on average, behind only Japan, a country where hard courts are the predominant surface.
As certain players grow up on certain surfaces, it is to be expected that this would be reflected in their win percentages. Our next analysis examines just that. To manipulate the data so that we can graph the win percentages by surfaces for each country, we had to count how many times the player's country showed up in the wins column and loss column, and then compute their win percentage.
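The counting described above can also be sketched compactly with `pd.crosstab`. This is an alternative illustration on a toy dataframe with made-up values, not the sequence of steps the following cells use.

```python
import pandas as pd

# Toy matches with made-up values
matches = pd.DataFrame({
    'surface':    ['Clay', 'Clay', 'Hard', 'Hard', 'Clay'],
    'winner_ioc': ['ESP',  'ESP',  'USA',  'USA',  'USA'],
    'loser_ioc':  ['USA',  'USA',  'ESP',  'ESP',  'ESP'],
})
countries = ['ESP', 'USA']
surfaces = ['Clay', 'Hard']

# Count wins and losses per country and surface
wins = pd.crosstab(matches['winner_ioc'], matches['surface'])
losses = pd.crosstab(matches['loser_ioc'], matches['surface'])

# Align both tables on the same rows and columns; a missing pair counts as 0
wins = wins.reindex(index=countries, columns=surfaces, fill_value=0)
losses = losses.reindex(index=countries, columns=surfaces, fill_value=0)

win_pct = wins / (wins + losses)
print(win_pct)
```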
In [11]:
w = atp[['winner_ioc','surface']].groupby(['surface']) # group the results by surface
In [12]:
# count of how many times each country shows up in the wins column and turn into a dataframe
t= w.winner_ioc.value_counts()
tt=pd.DataFrame(t)
In [13]:
tt.head()
Out[13]:
In [14]:
# move surface out of the index to look at a country's wins per surface
tt=tt.unstack(level='surface')
tt.head()
Out[14]:
In [15]:
tt=tt.loc[countries] # pull out countries we are examining
tt.columns=['Clay Wins', 'Grass Wins', 'Hard Wins']
tt.index.rename('Country', inplace=True) # changing winner_ioc to country allows us to add losses to the dataframe in a coherent way
tt=tt.reset_index() # in order to merge the data on country, it must not be in the index
tt
Out[15]:
In [16]:
# compute losses by surface per country
l= atp[['loser_ioc','surface']].groupby(['surface'])
In [17]:
x= l.loser_ioc.value_counts()
xx=pd.DataFrame(x)
xx=xx.unstack(level='surface')
xx=xx.loc[countries]
xx.index.rename('Country', inplace=True)
xx.columns=['Clay Losses', 'Grass Losses', 'Hard Losses']
xx=xx.reset_index()
xx
Out[17]:
In [18]:
wl = pd.merge(tt,xx,
how= 'right',
on= 'Country') #merge the two dataframes
wl=wl.set_index('Country')
wl
Out[18]:
Now that we have the wins and losses by surface for each country, we can make a new dataframe that has the win percentage by surface by country. A for loop makes this simple.
In [19]:
surfaces=['Clay','Grass','Hard']
for surface in surfaces:
    wl[surface+' Win%']= wl[surface+' Wins']/(wl[surface+' Wins']+ wl[surface+' Losses'])
wl=wl[['Clay Win%','Grass Win%','Hard Win%']]
wl
Out[19]:
In [20]:
fig, ax= plt.subplots()
wl.plot(ax=ax,
kind='bar',
color=['red','green','blue'],
title= 'Win Percentage on Each Surface By Country')
ax.legend(loc=0)
ax.set_ylabel('Win Percentage')
ax.set_ylim(0.3,0.7)
ax.set_xticklabels( countries, rotation=45 )
Out[20]:
Again, this graph supports certain popular opinions. Americans have a higher win percentage on hard courts than grass and clay. Spanish players are incredibly successful on clay courts. Australians, also known for their hard court play, have an incredibly poor record on clay, but do well on grass, a fast surface. Serbians seem to have a very high win percentage overall. However, that number is most likely skewed, as Serbian Novak Djokovic, the number 1 player in the world, is known to dominate his competition, pushing SRB's win percentage higher than normal.
One could assume that a consequence of playing shorter points is that more unforced errors are made as a match goes on. However, unforced errors are not among the variables that Jeff Sackmann provides in his dataset, so we decided to use double faults, an unforced error on the serve, as a proxy. The hypothesis for this metric is that the longer a match lasts, the more errors a hard court player (American, Australian, Japanese) makes, and that this effect is weaker for clay court players. Therefore, if we plot the length of the match on the x axis and the number of double faults on the y axis, we would expect the trend line to be steeper for hard court players than for clay court players.
In [20]:
df= atp[['winner_ioc','w_df','loser_ioc','l_df','minutes']] #dataframe of only the variables we are looking at
df.head()
Out[20]:
In [21]:
# For matches that players lost: number of double faults and length of the match
dfl = df[['loser_ioc','l_df','minutes']]
dfl=dfl.set_index('loser_ioc')
dfl=dfl.loc[countries]
dfl.dropna(inplace=True)
# rename columns to make it simpler to combine dataframes
dfl.index.names= ['ioc']
dfl=dfl.rename(columns={'l_df':'df'})
dfl.head()
Out[21]:
In [22]:
# the same for matches in which the player won
dfw= df[['winner_ioc','w_df','minutes']]
dfw=dfw.set_index('winner_ioc')
dfw=dfw.loc[countries]
dfw.dropna(inplace=True)
# rename columns and index to make it easier to combine data
dfw.index.names= ['ioc']
dfw=dfw.rename(columns={'w_df':'df'})
dfw.head()
Out[22]:
In [23]:
#combine dataframes
dfs= pd.concat([dfw, dfl]) # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
dfs.head()
Out[23]:
In [24]:
# combine data and organize it by country by length of match
dfs=dfs.reset_index()
dfs=dfs.set_index(['ioc','minutes'])
dfs=dfs.sort_index()
dfs.head()
Out[24]:
In [25]:
fig, ax= plt.subplots(7, figsize=(10,20), sharex=True)
for i, country in enumerate(countries):
    # after .xs(country) the index holds only 'minutes'; reset_index() turns it back into a column
    dfs.xs(country).reset_index().plot.scatter('minutes','df', ax= ax[i],
                                               title=country)
Examining these scatter plots reveals no visible correlation between minutes played and number of double faults. This is likely because best of 3 set matches are relatively short, so players are not likely to get tired in a 2 out of 3 set match. It is also possible that the number of double faults is not an accurate proxy for unforced errors.
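The visual impression from the scatter plots could be checked numerically by fitting a trend line per country. The sketch below uses a toy dataframe with made-up values and a hypothetical helper named `trend`; it shows the technique only and says nothing about the actual data.

```python
import numpy as np
import pandas as pd

# Toy data (made-up): one row per player-match, with the same columns as dfs
toy = pd.DataFrame({
    'ioc':     ['USA', 'USA', 'USA', 'USA', 'ESP', 'ESP', 'ESP', 'ESP'],
    'minutes': [60.0, 80.0, 100.0, 120.0, 60.0, 80.0, 100.0, 120.0],
    'df':      [1.0, 2.0, 3.0, 4.0, 2.0, 1.0, 3.0, 2.0],
})

def trend(group):
    """Least-squares slope (double faults per minute) and Pearson correlation."""
    slope, intercept = np.polyfit(group['minutes'], group['df'], 1)
    r = group['minutes'].corr(group['df'])
    return pd.Series({'slope': slope, 'r': r})

# One row per country, with columns 'slope' and 'r'
summary = pd.DataFrame({ioc: trend(g) for ioc, g in toy.groupby('ioc')}).T
print(summary)
```

Under the hypothesis stated earlier, hard court countries would show a larger slope than clay court countries.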
Now we look at our last metric: break points saved. A break point in tennis is when the player returning serve can win the game by winning the next point. In tennis, there are certain points throughout a match that matter much more than other points. It's not at all uncommon to see a relatively even score in terms of total points won even though one player wins the match pretty handily. Especially in the men's game, a couple of break points, and therefore your ability to save them, are usually the difference between winning and losing. What we see at the top of the tennis game, in players like Andy Murray, Novak Djokovic, and Rafael Nadal, to name a few, is that they face fewer break points than other players and, most importantly, save most of the break points they do face on their service games.
The intuition behind investigating the differences among nationalities is that certain countries have players where service games are much more important to their success. For example, US players tend to play with big serves, so holding serve and saving break points should be very important to their success. On the other hand, Spanish players tend to rely less on big serves and more on their baseline game and ability to break their opponents' serve. If the BP Ratio is more important to US players as opposed to Spanish players, then we would expect a larger difference in BP Ratio between winners and losers for the US versus for Spain. It is, however, entirely possible that this metric is so important to success that no matter the nationality, winners have a certain BP Ratio and losers have a certain BP Ratio. It's also possible that there are simply no differences among nationalities.
In this final portion of the project, we're interested in investigating how important this Break Point Ratio is (we'll call it BP Ratio and it's just equal to total break points saved/total break points faced) to match success and how it varies across both winners and losers and players from different countries. Let's take a look.
First, let's start off by extracting the columns/variables we're interested in from the dataframe atp. Then, we will organize the data by country and compute some simple mean statistics for the resulting dataframe. Lastly, we'll add a new variable for BP Ratio to the dataframe.
As always, the data from our source is organized in terms of winners and losers by match. We'll first extract and organize the data for winners and then do the same for losers. Once we have both, we can combine the dataframes into one so we have all the data together organized by country. And lastly, we can run some simple plot methods on the dataframe to support our interpretations of the data with some visual representations.
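The BP Ratio defined above is a ratio of totals (total break points saved divided by total break points faced), which is not the same as averaging each match's own ratio. A small sketch on a toy dataframe with made-up values shows the difference.

```python
import pandas as pd

# Toy data (made-up): break points faced and saved by match winners
toy = pd.DataFrame({
    'Country':   ['USA', 'USA', 'ESP', 'ESP'],
    'w_bpFaced': [2, 8, 5, 5],
    'w_bpSaved': [2, 4, 3, 4],
})

# Ratio of totals: total saved / total faced per country
totals = toy.groupby('Country')[['w_bpFaced', 'w_bpSaved']].sum()
ratio_of_totals = totals['w_bpSaved'] / totals['w_bpFaced']

# Mean of per-match ratios: every match counts equally, however many break points it had
per_match = toy['w_bpSaved'] / toy['w_bpFaced']
mean_of_ratios = per_match.groupby(toy['Country']).mean()

print(ratio_of_totals)
print(mean_of_ratios)
```

Dividing the mean of break points saved by the mean of break points faced gives the ratio of totals as long as both columns have the same number of non-missing rows per country. The per-match version is undefined for matches in which no break points were faced.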
In [26]:
varlist = ['winner_ioc','w_bpFaced','w_bpSaved',] ## list of variables we want
In [27]:
countries=['USA','FRA','ESP','SRB','JPN','AUS','ARG']
W_BPdf = atp[varlist].copy() ## this dataframe is break point statistics for winners (a copy, so renaming its columns does not touch atp)
W_BPdf.columns = ['Country','w_bpFaced','w_bpSaved']
W_BPdf.head()
Out[27]:
In [28]:
comparison = W_BPdf['Country'].isin(countries)
selection = W_BPdf[comparison] ## now we select only the entries that correspond to countries we're interested in
selection.head()
Out[28]:
This data is now organized by country; however, it's still organized with respect to individual matches. What we have now is for each match what country the winner is from and how many break points the winning player faced and saved. What we want is a dataframe organized by how many break points, on average, winning players from each country both faced and saved. To accomplish this, we'll use the .groupby method.
In [29]:
df = selection.groupby('Country')
In [30]:
dm = df.mean() ## computes the mean of each variable over all matches won by players from each country
dm
Out[30]:
In [31]:
dm['w_bpratio'] = dm['w_bpSaved']/dm['w_bpFaced'] ## now we add our new variable, bp ratio, as a new column to dm
dm
Out[31]:
We now have a dataframe that is exactly what we want: the averages for break points faced, break points saved, and break point ratios for each country. Let's do the exact same with the data on losing players:
In [32]:
var2list = ['loser_ioc','l_bpFaced','l_bpSaved']
L_BPdf = atp[var2list].copy() ## a copy, so renaming its columns does not touch atp
L_BPdf.columns = ['Country','l_bpFaced','l_bpSaved']
L_BPdf.head()
Out[32]:
In [33]:
comparison2 = L_BPdf['Country'].isin(countries)
selection2 = L_BPdf[comparison2]
selection2.head()
Out[33]:
In [34]:
df2 = selection2.groupby('Country')
In [35]:
dm2 = df2.mean()
dm2
Out[35]:
In [36]:
dm2['l_bpratio'] = dm2['l_bpSaved']/dm2['l_bpFaced']
dm2
Out[36]:
In [37]:
result = pd.concat([dm, dm2], axis=1)
## This is a slightly different method than the merge method we learned in class: concat with axis=1 places the two dataframes side by side, aligned on their shared Country index.
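To show how this relates to merge, the following sketch builds two toy dataframes with made-up values (`dm_toy` and `dm2_toy` are hypothetical stand-ins for dm and dm2) and combines them both ways.

```python
import pandas as pd

# Toy stand-ins for dm and dm2 (made-up values), indexed by Country in different orders
dm_toy = pd.DataFrame({'w_bpratio': [0.70, 0.68]},
                      index=pd.Index(['ESP', 'USA'], name='Country'))
dm2_toy = pd.DataFrame({'l_bpratio': [0.55, 0.57]},
                       index=pd.Index(['USA', 'ESP'], name='Country'))

# concat along columns aligns rows by index label, not by position
by_concat = pd.concat([dm_toy, dm2_toy], axis=1)

# the same combination expressed as a merge on the index
by_merge = pd.merge(dm_toy, dm2_toy, left_index=True, right_index=True, how='outer')

print(by_concat)
print(by_merge)
```

Both results pair each country's winner ratio with its own loser ratio even though the two inputs list the countries in different orders.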
In [38]:
result
Out[38]:
We now have a dataframe that's a combination of the winner and loser data. To make things easier for plotting purposes, let's take only the bp ratio numbers and extract them to a new dataframe.
In [39]:
varlist3 = ['l_bpratio','w_bpratio']
result2 = result[varlist3]
result2
Out[39]:
In [40]:
fig, ax= plt.subplots()
result2.plot(ax=ax,
kind='bar',
color=['red','blue'],
title= 'Break Point Ratios by Country')
ax.legend(loc=2)
ax.set_ylabel('Percentage of Break Points Saved')
ax.set_xlabel('Country')
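ax.set_xticklabels( result2.index, rotation=45 ) ## label bars from the dataframe's own index: groupby sorts the countries alphabetically, so the order differs from the countries list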
Out[40]:
Our results indicate that the BP Ratio does in fact matter quite a bit to a player's success. Across the board, there are fairly large differences between loser bp ratios and winner bp ratios: for most countries, the difference between winners and losers is a full 10-15 percentage points. While there does seem to be a small amount of variation for these differences among nationalities (the BP Ratio seems to matter the least to the success of Australian players for some reason), for the most part, the differences are pretty significant and consistent between winning and losing players among different nationalities.
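The size of the winner/loser gap for each country can be read off the chart, but it can also be computed directly. The sketch below uses a toy dataframe with made-up values in the same layout as result2; the numbers are illustrative and are not our results.

```python
import pandas as pd

# Toy stand-in for result2 (made-up values)
toy = pd.DataFrame({'l_bpratio': [0.55, 0.52, 0.56],
                    'w_bpratio': [0.68, 0.60, 0.70]},
                   index=pd.Index(['ESP', 'AUS', 'USA'], name='Country'))

# Gap between winners and losers in percentage points, smallest first
gap = ((toy['w_bpratio'] - toy['l_bpratio']) * 100).round(1).sort_values()
print(gap)
```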
From this, we can conclude that, at least for the 2013 to 2015 seasons and the countries represented in this particular analysis, the "BP Ratio" as we have defined it in this project is a pretty significant factor in determining a player's success in a match. We cannot conclude with any reasonable degree of certainty any patterns in the BP Ratio among different countries other than that, for the most part, the ratios are relatively consistent.
The results of our analysis confirmed certain beliefs the tennis community holds about players from certain countries. After analyzing the average length of a 2 out of 3 set match, we see that Spanish players, who grow up playing on clay courts, have longer matches on average. Conversely, players from countries like Japan, USA, and Australia, who grow up on hard courts, play much shorter matches. Our analysis of success rate/win percentage similarly confirms certain aspects of this theory. Again, Spanish players have a higher win percentage on clay than on hard and grass. Americans have a higher win percentage on hard court than on clay or grass. Australian players have very little success on clay courts. Our comparison of double faults to length of match did not return any information useful for discerning a player's endurance based on where he grew up playing. Lastly, having a higher ratio of break points saved to break points faced is an important factor for winning matches, an outcome that is consistent across the data for winners and losers and varies little among players from different countries.