The Effect of Nationality on Tennis Play

Matt DeMichiel and Kevin Lee

The goal of this project is to investigate how nationality is related to different aspects of a tennis player's game and how certain match statistics affect the outcome of a match. For example, it is generally held that because few clay courts are available for practice in the US, American players are not very good on this surface. However, the abundance of hard courts makes them serious contenders at tournaments played on surfaces that more closely resemble asphalt, such as the US Open or the Australian Open. On the flip side, Spanish players are known as masters of the clay court. This is attributed to a combination of the abundance of clay courts in Spain and other cultural factors that make the general Spanish style of play advantageous on clay. In fact, one Spaniard, Rafael Nadal, has had so much success that he is widely known as “The King of Clay.”

This is one of the trends we would like to investigate. For the most part, these claims remain cultural hearsay within the tennis world. We're interested in analyzing match data and seeing if these general patterns actually have some quantitative basis. Using the data from thousands of matches, we would like to examine these assumptions, as well as look for new patterns that can be found. Do certain nationalities tend to spend more time on court? Are certain players more likely to hit aces than others? Who is more at risk for committing unforced errors? How important is saving break points on your serve to winning a match? Have these trends changed over time?

Preliminaries


In [1]:
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for pandas 

%matplotlib inline                     

# check versions
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())


Python version:  3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
Pandas version:  0.19.0
Today:  2016-12-22

Collecting the Data

To look for and analyze these trends, we will be using work published by Jeff Sackmann, an author and entrepreneur in the sports statistics field. Jeff maintains several tennis datasets on GitHub, including the “Match Charting Project”; the files used here come from his tennis_atp repository, which records the results of tens of thousands of matches spanning over thirty years. These match results include data points such as court surface, amount of time on court, winner, loser, and many more. For the purposes of our research, we will only be examining the results of matches between 2013 and 2015. In addition, we will only be analyzing the results of players from seven countries: USA, Japan, Spain, Argentina, Australia, Serbia, and France. These countries give us a cross section of players from several continents and include many of the most prominent players in the tennis world.


In [2]:
url2013 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2013.csv' # 2013 match data
url2014 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2014.csv' # 2014 match data
url2015 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2015.csv' # 2015 match data
# convert each csv file into data frames
df2013 = pd.read_csv(url2013)
df2014 = pd.read_csv(url2014)
df2015 = pd.read_csv(url2015)

In [3]:
# a look at the plentiful amount of data provided by Jeff Sackmann (each year has the same data)
df2013.columns


Out[3]:
Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed',
       'loser_entry', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'score', 'best_of',
       'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon',
       'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df',
       'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved',
       'l_bpFaced'],
      dtype='object')

Since we are not examining every variable, the final dataframe we work with does not include every column. The following code keeps only the variables we are interested in. In addition, each dataframe includes players from countries other than the ones we are studying. Therefore, we create a variable countries, a list of strings holding the country codes we are examining.


In [4]:
# positions of the columns we want to keep (they assume the column order shown in Out[3])
numlist=[1,2,5,10,11,12,13,14,15,20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48]
# select by position with .iloc; a plain df[numlist] would treat the integers as column labels
atp2013 = df2013.iloc[:, numlist]
atp2014 = df2014.iloc[:, numlist]
atp2015 = df2015.iloc[:, numlist]
## extracts the metrics we're interested in and assigns them to a new dataframe for each year
countries=['USA','FRA','ESP','SRB','JPN','AUS','ARG'] ## we'll use this later. It's a list of countries we're interested in analyzing
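
Selecting columns by position depends on the column order of the CSV files. As a hedged alternative, the sketch below selects by column name instead. The frame `toy` and the shortlist `keep` are made up for illustration and are not the notebook's actual selection.

```python
import pandas as pd

# Toy frame standing in for one season of match data (values are made up).
toy = pd.DataFrame({
    'tourney_id': ['2013-001', '2013-001'],
    'surface': ['Clay', 'Hard'],
    'winner_ioc': ['ESP', 'USA'],
    'loser_ioc': ['ARG', 'FRA'],
    'best_of': [3, 5],
    'minutes': [101.0, 150.0],
})

# Hypothetical shortlist of column names to keep.
keep = ['surface', 'winner_ioc', 'loser_ioc', 'best_of', 'minutes']

# Selecting by label does not depend on where each column sits in the file.
subset = toy[keep]
print(subset.columns.tolist())
```

Selecting by name also makes it easier to see at a glance which variables survive the cut.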

In [5]:
# create our final master dataframe which combines the dataframes for each year
atp = pd.concat([atp2013, atp2014, atp2015])  # DataFrame.append was removed in pandas 2.0
atp.head()


Out[5]:
tourney_name surface tourney_date winner_name winner_hand winner_ht winner_ioc winner_age winner_rank loser_name ... w_bpFaced l_ace l_df l_svpt l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced
0 Kitzbuhel Clay 20130728 Leonardo Mayer R 188.0 ARG 26.206708 78.0 Kenny De Schepper ... 0.0 3.0 4.0 78.0 45.0 30.0 16.0 10.0 10.0 12.0
1 Kitzbuhel Clay 20130728 Horacio Zeballos L 188.0 ARG 28.254620 54.0 Martin Fischer ... 16.0 4.0 2.0 81.0 44.0 30.0 21.0 14.0 5.0 9.0
2 Kitzbuhel Clay 20130728 Marcel Granollers R 190.0 ESP 27.296372 53.0 Mate Pavic ... 6.0 2.0 3.0 73.0 39.0 26.0 19.0 10.0 6.0 8.0
3 Kitzbuhel Clay 20130728 Guillermo Garcia Lopez R 188.0 ESP 30.151951 72.0 Thomaz Bellucci ... 15.0 9.0 3.0 116.0 62.0 38.0 27.0 16.0 6.0 12.0
4 Kitzbuhel Clay 20130728 Robin Haase R 190.0 NED 26.313484 68.0 Guillaume Rufin ... 6.0 2.0 7.0 88.0 48.0 32.0 15.0 11.0 7.0 11.0

5 rows × 37 columns

Finally, to keep our research as controlled as possible, we removed best 3 out of 5 set matches, the format used at the Grand Slams. Our final dataframe consists of the results of matches that were played best 2 out of 3.


In [6]:
atp = atp[atp['best_of'] == 3]  # keep only best-of-three matches

Now that we have our "Master DataFrame", we decided to look at nationality's effects on play based on a few different metrics:

1) Length of Time of a Match
2) Win Percentage per Surface
3) Unforced Errors Committed Over Time
4) Break Points Saved as a percentage of Break Points Faced

The challenge in this dataframe is that for each match, there is a winner and a loser. This means that a lot of the data manipulation involves tracking the players both in the wins and the losses column.
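
One way to handle the winner/loser layout, sketched here on made-up data, is to stack the winner and loser columns into a single "long" table with one row per player per match. The frame `matches`, the column `won`, and the name `long_form` are illustrative assumptions; the notebook itself processes winners and losers separately.

```python
import pandas as pd

# Made-up matches: one row per match, as in the source data.
matches = pd.DataFrame({
    'winner_ioc': ['ESP', 'USA', 'ARG'],
    'loser_ioc': ['USA', 'FRA', 'ESP'],
    'minutes': [120.0, 80.0, 95.0],
})

# One table for the winning side, one for the losing side, with a shared column name.
winners = matches[['winner_ioc', 'minutes']].rename(columns={'winner_ioc': 'ioc'})
winners['won'] = True
losers = matches[['loser_ioc', 'minutes']].rename(columns={'loser_ioc': 'ioc'})
losers['won'] = False

# Stack them: each match now contributes two rows, one per player.
long_form = pd.concat([winners, losers], ignore_index=True)
print(long_form.groupby('ioc')['minutes'].mean())
```

With this layout, per-country statistics come from a single groupby rather than two parallel computations that must be merged afterwards.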

Length of Time

Because many European players grow up on clay courts throughout their junior careers, their games are more suited to this surface. Since clay is a slow surface, one would anticipate that a European player would play longer matches. The reverse goes for Americans. Hard courts, which are notoriously fast, are the predominant surface in the US. This means American players are used to short points, so one would expect their matches to be shorter overall.


In [7]:
# get average length of time for matches that a country won
g = atp[['winner_ioc','minutes']].groupby('winner_ioc')
gm=g.mean()
minW = gm.loc[countries] # pull average match length for the countries we want to examine

# because there is a winner_ioc and a loser_ioc, we rename the index to Country in order to combine the dataframes
minW.index.name='Country'
minW


Out[7]:
minutes
Country
USA 92.684848
FRA 91.388489
ESP 97.551601
SRB 94.032787
JPN 96.401869
AUS 93.511628
ARG 99.451505

In [8]:
# now we do the same for the matches that the countries lost
l = atp[['loser_ioc','minutes']].groupby('loser_ioc')
gl=l.mean()
minL= gl.loc[countries]
minL.index.name='Country'
minL


Out[8]:
minutes
Country
USA 91.939474
FRA 93.676208
ESP 99.770167
SRB 96.153153
JPN 85.294872
AUS 94.548837
ARG 100.454225

In [9]:
# Combine the average won and lost time to a total average
minL['minutesW']=minW['minutes']
minL['Average']= (minL['minutes']+minL['minutesW'])/2
minL


Out[9]:
minutes minutesW Average
Country
USA 91.939474 92.684848 92.312161
FRA 93.676208 91.388489 92.532348
ESP 99.770167 97.551601 98.660884
SRB 96.153153 94.032787 95.092970
JPN 85.294872 96.401869 90.848370
AUS 94.548837 93.511628 94.030233
ARG 100.454225 99.451505 99.952865

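We are only interested in the Average column, so that is the one we used to create our graph. Note that it is the simple average of the mean length of matches won and the mean length of matches lost. It is not weighted by how many matches fall in each group, so it only approximates the average over all matches.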
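
A simple average of two group means equals the pooled average only when both groups are the same size. The sketch below uses made-up match lengths (not taken from the dataset) to show how the two can differ. The names `wins`, `losses`, `mean_of_means`, and `pooled_mean` are illustrative.

```python
import pandas as pd

# Made-up match lengths in minutes: three wins and one loss for a single country.
wins = pd.Series([80.0, 90.0, 100.0])
losses = pd.Series([120.0])

# Average of the two group means, as in the Average column.
mean_of_means = (wins.mean() + losses.mean()) / 2

# Average over every match, which weights each group by its number of matches.
pooled_mean = pd.concat([wins, losses]).mean()

print(mean_of_means, pooled_mean)  # 105.0 97.5
```

When a country has similar numbers of wins and losses the two figures are close, which is why the simple average is still a reasonable summary here.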


In [10]:
minL['Average'].plot(kind='bar',
                    ylim= (80,105))


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x119bb84a8>

This graph supports the popular consensus about players from certain countries. Spanish players are known as clay court masters who play incredibly long points. As this graph shows, Spanish players play longer matches on average than players from every other country except Argentina, a country also known for its clay court play. The United States, as expected, plays among the shortest matches on average; only Japan, a country where hard courts are the predominant surface, has a lower average.

Win Percentage per Surface

As certain players grow up on certain surfaces, it is to be expected that this would be reflected in their win percentages. Our next analysis examines just that. To manipulate the data so that we can graph the win percentages by surfaces for each country, we had to count how many times the player's country showed up in the wins column and loss column, and then compute their win percentage.


In [11]:
w = atp[['winner_ioc','surface']].groupby(['surface']) # group the results by surface

In [12]:
# count of how many times each country shows up in the wins column and turn into a dataframe
t= w.winner_ioc.value_counts()
tt=pd.DataFrame(t)

In [13]:
tt.head()


Out[13]:
winner_ioc
surface winner_ioc
Clay ESP 533
ARG 209
FRA 154
ITA 129
GER 121

In [14]:
# move surface out of the index to look at a country's wins per surface
tt=tt.unstack(level='surface')
tt.head()


Out[14]:
winner_ioc
surface Clay Grass Hard
winner_ioc
ALG 2.0 NaN 1.0
ARG 209.0 7.0 109.0
AUS 20.0 40.0 160.0
AUT 65.0 7.0 49.0
BAR 1.0 NaN 1.0

In [15]:
tt=tt.loc[countries] # pull out countries we are examining
tt.columns=['Clay Wins', 'Grass Wins', 'Hard Wins']
tt.index.rename('Country', inplace=True) # changing winner_ioc to country allows us to add losses to the dataframe in a coherent way
tt=tt.reset_index() # in order to merge the data on country, it must not be in the index
tt


Out[15]:
Country Clay Wins Grass Wins Hard Wins
0 USA 52.0 54.0 321.0
1 FRA 154.0 79.0 451.0
2 ESP 533.0 43.0 412.0
3 SRB 68.0 7.0 160.0
4 JPN 28.0 11.0 101.0
5 AUS 20.0 40.0 160.0
6 ARG 209.0 7.0 109.0

In [16]:
# compute losses by surface per country
l= atp[['loser_ioc','surface']].groupby(['surface'])

In [17]:
x= l.loser_ioc.value_counts()
xx=pd.DataFrame(x)
xx=xx.unstack(level='surface')
xx=xx.loc[countries]
xx.index.rename('Country', inplace=True)
xx.columns=['Clay Losses', 'Grass Losses', 'Hard Losses']
xx=xx.reset_index()
xx


Out[17]:
Country Clay Losses Grass Losses Hard Losses
0 USA 60.0 63.0 346.0
1 FRA 175.0 64.0 424.0
2 ESP 381.0 38.0 367.0
3 SRB 53.0 7.0 78.0
4 JPN 14.0 10.0 87.0
5 AUS 45.0 32.0 190.0
6 ARG 192.0 14.0 109.0

In [18]:
wl = pd.merge(tt,xx,
             how= 'right',
             on= 'Country') #merge the two dataframes
wl=wl.set_index('Country')
wl


Out[18]:
Clay Wins Grass Wins Hard Wins Clay Losses Grass Losses Hard Losses
Country
USA 52.0 54.0 321.0 60.0 63.0 346.0
FRA 154.0 79.0 451.0 175.0 64.0 424.0
ESP 533.0 43.0 412.0 381.0 38.0 367.0
SRB 68.0 7.0 160.0 53.0 7.0 78.0
JPN 28.0 11.0 101.0 14.0 10.0 87.0
AUS 20.0 40.0 160.0 45.0 32.0 190.0
ARG 209.0 7.0 109.0 192.0 14.0 109.0

Now that we have the wins and losses by surface for each country, we can make a new dataframe that has the win percentage by surface by country. A for loop makes this simple.


In [19]:
surfaces=['Clay','Grass','Hard']
for surface in surfaces:
    wl[surface+' Win%']= wl[surface+' Wins']/(wl[surface+' Wins']+ wl[surface+' Losses'])
wl=wl[['Clay Win%','Grass Win%','Hard Win%']]
wl


Out[19]:
Clay Win% Grass Win% Hard Win%
Country
USA 0.464286 0.461538 0.481259
FRA 0.468085 0.552448 0.515429
ESP 0.583151 0.530864 0.528883
SRB 0.561983 0.500000 0.672269
JPN 0.666667 0.523810 0.537234
AUS 0.307692 0.555556 0.457143
ARG 0.521197 0.333333 0.500000
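
As a quick check of the arithmetic, the sketch below recomputes the clay win percentage for two countries from the win and loss counts shown in the tables above. The small frame `check` is built by hand for illustration.

```python
import pandas as pd

# Clay wins and losses for ESP and USA, copied from the tables above.
check = pd.DataFrame(
    {'Clay Wins': [533.0, 52.0], 'Clay Losses': [381.0, 60.0]},
    index=pd.Index(['ESP', 'USA'], name='Country'),
)

# win percentage = wins / (wins + losses)
check['Clay Win%'] = check['Clay Wins'] / (check['Clay Wins'] + check['Clay Losses'])
print(check['Clay Win%'].round(6))
```

For Spain this is 533 / (533 + 381) = 533 / 914, which rounds to 0.583151, in agreement with the table.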

In [20]:
fig, ax= plt.subplots()
wl.plot(ax=ax,
        kind='bar',
        color=['red','green','blue'],
        title= 'Win Percentage on Each Surface By Country')
ax.legend(loc=0)
ax.set_ylabel('Win Percentage')
ax.set_ylim(0.3,0.7)
ax.set_xticklabels( countries, rotation=45 )


Out[20]:
[<matplotlib.text.Text at 0xaed78b1908>,
 <matplotlib.text.Text at 0xaed789c7b8>,
 <matplotlib.text.Text at 0xaed7934400>,
 <matplotlib.text.Text at 0xaed78a6f98>,
 <matplotlib.text.Text at 0xaed79345c0>,
 <matplotlib.text.Text at 0xaed7934ef0>,
 <matplotlib.text.Text at 0xaed7936940>]

Again, this graph supports certain popular opinions. Americans have a slightly higher win percentage on hard courts (about 48%) than on grass or clay (about 46%). Spanish players are very successful on clay courts. Australians, also known for their hard court play, have a poor record on clay, but do well on grass, a fast surface. Serbians seem to have a very high win percentage overall. However, that number is most likely skewed, as Serbian Novak Djokovic, the number 1 player in the world, is known to dominate his competition, pushing SRB's win percentage higher than it would otherwise be.

Unforced Errors Over Time

One could assume that a consequence of being used to shorter points is that a player makes more unforced errors as a match goes on. However, unforced errors are not one of the variables in Jeff Sackmann's dataset, so we decided to use double faults, an unforced error on the serve, as a proxy. The hypothesis is that the longer a match lasts, the more errors a hard court player (American, Australian, Japanese) makes, with a weaker effect for clay court players. Therefore, if we plot the length of the match on the x axis and the number of double faults on the y axis, we would expect the trend line to be steeper for hard court players than for clay court players.


In [20]:
df= atp[['winner_ioc','w_df','loser_ioc','l_df','minutes']] #dataframe of only the variables we are looking at
df.head()


Out[20]:
winner_ioc w_df loser_ioc l_df minutes
0 ARG 1.0 FRA 4.0 84.0
1 ARG 1.0 AUT 2.0 125.0
2 ESP 3.0 CRO 3.0 115.0
3 ESP 12.0 BRA 3.0 171.0
4 NED 0.0 FRA 7.0 110.0

In [21]:
# For matches that players lost: number of double faults and length of the match
dfl = df[['loser_ioc','l_df','minutes']]
dfl=dfl.set_index('loser_ioc')
dfl=dfl.loc[countries]
dfl.dropna(inplace=True)
# rename columns to make it simpler to combine dataframes
dfl.index.names= ['ioc']
dfl=dfl.rename(columns={'l_df':'df'})
dfl.head()


Out[21]:
df minutes
ioc
USA 1.0 92.0
USA 3.0 92.0
USA 0.0 65.0
USA 4.0 100.0
USA 2.0 114.0

In [22]:
# the same for matches in which the player won
dfw= df[['winner_ioc','w_df','minutes']]
dfw=dfw.set_index('winner_ioc')
dfw=dfw.loc[countries]
dfw.dropna(inplace=True)
# rename columns and index to make it easier to combine data
dfw.index.names= ['ioc']
dfw=dfw.rename(columns={'w_df':'df'})
dfw.head()


Out[22]:
df minutes
ioc
USA 3.0 114.0
USA 0.0 62.0
USA 3.0 74.0
USA 3.0 83.0
USA 2.0 80.0

In [23]:
#combine dataframes
dfs = pd.concat([dfw, dfl])  # DataFrame.append was removed in pandas 2.0
dfs.head()


Out[23]:
df minutes
ioc
USA 3.0 114.0
USA 0.0 62.0
USA 3.0 74.0
USA 3.0 83.0
USA 2.0 80.0

In [24]:
# combine data and organize it by country by length of match
dfs=dfs.reset_index()
dfs=dfs.set_index(['ioc','minutes'])
dfs=dfs.sort_index()
dfs.head()


Out[24]:
df
ioc minutes
ARG 10.0 0.0
11.0 0.0
11.0 0.0
19.0 2.0
22.0 0.0

In [25]:
fig, ax = plt.subplots(7, figsize=(10,20), sharex=True)
for i, country in enumerate(countries):
    # xs() leaves 'minutes' as the only index level, so a plain reset_index() turns it back into a column
    dfs.xs(country).reset_index().plot.scatter('minutes', 'df', ax=ax[i], title=country)


Examining these scatter plots reveals no obvious relationship between minutes played and number of double faults, and no clear difference in trend between countries. This may be because best of 3 set matches are relatively short, so players are not likely to get tired. It is also possible that the number of double faults is not an accurate proxy for unforced errors.
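
One way to put a number on what the scatter plots show is a Pearson correlation per country. The following is a minimal sketch on made-up values. The frame `toy` is an assumption for illustration, with column names mirroring `dfs` after `reset_index()`.

```python
import pandas as pd

# Made-up data: match length in minutes and double faults, four matches per country.
toy = pd.DataFrame({
    'ioc': ['USA'] * 4 + ['ESP'] * 4,
    'minutes': [60.0, 80.0, 100.0, 120.0, 70.0, 90.0, 110.0, 130.0],
    'df': [1.0, 2.0, 3.0, 4.0, 1.0, 3.0, 3.0, 1.0],
})

# Pearson correlation between match length and double faults, computed per country.
corr_by_country = {
    country: group['minutes'].corr(group['df'])
    for country, group in toy.groupby('ioc')
}
print(corr_by_country)
```

In this toy example the USA rows are perfectly linear and the ESP rows are uncorrelated by construction. Running the same computation on `dfs.reset_index()` would give one coefficient per country to compare against the visual impression.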

Break Points Saved

Now we look at our last metric: break points saved. A break point in tennis is a point that, if won by the player returning serve, wins him the game. Certain points in a match matter much more than others. It's not at all uncommon to see a relatively even score in terms of total points won while one player wins the match pretty handily. Especially in the men's game, a couple of break points, and therefore a player's ability to save them, are usually the difference between winning and losing. What we see at the top of the game, in players like Andy Murray, Novak Djokovic, and Rafael Nadal, to name a few, is that they face fewer break points than other players and, most importantly, save most of the break points they do face on their service games.

The intuition behind investigating the differences among nationalities is that certain countries have players where service games are much more important to their success. For example, US players tend to play with big serves, so holding serve and saving break points should be very important to their success. On the other hand, Spanish players tend to rely less on big serves and more on their baseline game and ability to break their opponents' serve. If the BP Ratio is more important to US players as opposed to Spanish players, then we would expect a larger difference in BP Ratio between winners and losers for the US versus for Spain. It is, however, entirely possible that this metric is so important to success that no matter the nationality, winners have a certain BP Ratio and losers have a certain BP Ratio. It's also possible that there are simply no differences among nationalities.

In this final portion of the project, we're interested in investigating how important this Break Point Ratio is (we'll call it BP Ratio and it's just equal to total break points saved/total break points faced) to match success and how it varies across both winners and losers and players from different countries. Let's take a look.
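
Dividing total break points saved by total break points faced gives a pooled ratio, which is not the same as averaging a ratio computed match by match. The sketch below uses made-up numbers to show the difference. The frame `toy` and its column names are illustrative assumptions.

```python
import pandas as pd

# Made-up break point counts for three matches; the first match has no break points at all.
toy = pd.DataFrame({
    'bpFaced': [0.0, 10.0, 2.0],
    'bpSaved': [0.0, 5.0, 2.0],
})

# Pooled ratio: total saved / total faced.
pooled_ratio = toy['bpSaved'].sum() / toy['bpFaced'].sum()

# Dividing the column means gives the same number, since both means share the same row count.
ratio_of_means = toy['bpSaved'].mean() / toy['bpFaced'].mean()

# Per-match ratios: 0/0 becomes NaN and is skipped by mean().
mean_of_ratios = (toy['bpSaved'] / toy['bpFaced']).mean()

print(pooled_ratio, ratio_of_means, mean_of_ratios)
```

Here the pooled ratio is 7/12 while the mean of the per-match ratios is 0.75, because the short match with two break points counts as much as the match with ten. The pooled definition weights every break point equally.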

First, let's start off by extracting the columns/variables we're interested in from the dataframe atp. Then, we will organize the data by country and compute some simple mean statistics for the resulting dataframe. Lastly, we'll add a new variable for BP Ratio to the dataframe.

As always, the data from our source is organized in terms of winners and losers by match. We'll first extract and organize the data for winners and then do the same for losers. Once we have both, we can combine the dataframes into one so we have all the data together organized by country. And lastly, we can run some simple plot methods on the dataframe to support our interpretations of the data with some visual representations.


In [26]:
varlist = ['winner_ioc','w_bpFaced','w_bpSaved',] ## list of variables we want

In [27]:
countries=['USA','FRA','ESP','SRB','JPN','AUS','ARG']

W_BPdf = atp[varlist] ## this dataframe is break point statistics for winners
W_BPdf.columns = ('Country','w_bpFaced','w_bpSaved')
W_BPdf.head()


Out[27]:
Country w_bpFaced w_bpSaved
0 ARG 0.0 0.0
1 ARG 16.0 12.0
2 ESP 6.0 6.0
3 ESP 15.0 10.0
4 NED 6.0 3.0

In [28]:
comparison = W_BPdf['Country'].isin(countries)
selection = W_BPdf[comparison] ## now we select only the entries that correspond to countries we're interested in 
selection.head()


Out[28]:
Country w_bpFaced w_bpSaved
0 ARG 0.0 0.0
1 ARG 16.0 12.0
2 ESP 6.0 6.0
3 ESP 15.0 10.0
6 ESP 2.0 2.0

This data is now organized by country; however, it's still organized with respect to individual matches. What we have now is for each match what country the winner is from and how many break points the winning player faced and saved. What we want is a dataframe organized by how many break points, on average, winning players from each country both faced and saved. To accomplish this, we'll use the .groupby method.


In [29]:
df = selection.groupby('Country')

In [30]:
dm = df.mean() ## computes the mean of each variable over all of the entries for each country
dm


Out[30]:
w_bpFaced w_bpSaved
Country
ARG 4.851735 3.261830
AUS 4.615023 3.281690
ESP 4.709845 3.266321
FRA 4.356509 3.087278
JPN 4.702899 3.173913
SRB 3.722467 2.409692
USA 3.624703 2.567696

In [31]:
dm['w_bpratio'] = dm['w_bpSaved']/dm['w_bpFaced'] ## now we add our new variable, bp ratio, as a new column to dm 
dm


Out[31]:
w_bpFaced w_bpSaved w_bpratio
Country
ARG 4.851735 3.261830 0.672302
AUS 4.615023 3.281690 0.711089
ESP 4.709845 3.266321 0.693509
FRA 4.356509 3.087278 0.708659
JPN 4.702899 3.173913 0.674884
SRB 3.722467 2.409692 0.647337
USA 3.624703 2.567696 0.708388

We now have a dataframe that is exactly what we want: the averages for break points faced, break points saved, and break point ratios for each country. Let's do the exact same with the data on losing players:


In [32]:
var2list = ['loser_ioc','l_bpFaced','l_bpSaved']
L_BPdf = atp[var2list]
L_BPdf.columns = ('Country','l_bpFaced','l_bpSaved')
L_BPdf.head()


Out[32]:
Country l_bpFaced l_bpSaved
0 FRA 12.0 10.0
1 AUT 9.0 5.0
2 CRO 8.0 6.0
3 BRA 12.0 6.0
4 FRA 11.0 7.0

In [33]:
comparison2 = L_BPdf['Country'].isin(countries)
selection2 = L_BPdf[comparison2]
selection2.head()


Out[33]:
Country l_bpFaced l_bpSaved
0 FRA 12.0 10.0
4 FRA 11.0 7.0
5 ARG 6.0 4.0
13 ARG 6.0 2.0
14 ESP 17.0 13.0

In [34]:
df2 = selection2.groupby('Country')

In [35]:
dm2 = df2.mean()
dm2


Out[35]:
l_bpFaced l_bpSaved
Country
ARG 8.032258 4.406452
AUS 7.298113 4.079245
ESP 7.893590 4.315385
FRA 7.618902 4.353659
JPN 7.185185 3.657407
SRB 8.110294 4.750000
USA 7.212446 4.122318

In [36]:
dm2['l_bpratio'] = dm2['l_bpSaved']/dm2['l_bpFaced']
dm2


Out[36]:
l_bpFaced l_bpSaved l_bpratio
Country
ARG 8.032258 4.406452 0.548594
AUS 7.298113 4.079245 0.558945
ESP 7.893590 4.315385 0.546695
FRA 7.618902 4.353659 0.571429
JPN 7.185185 3.657407 0.509021
SRB 8.110294 4.750000 0.585675
USA 7.212446 4.122318 0.571556

In [37]:
result = pd.concat([dm, dm2], axis=1) 
## This is a slightly different method than the merge method we learned in class; with axis=1 it places the two dataframes side by side, aligned on the Country index.

In [38]:
result


Out[38]:
w_bpFaced w_bpSaved w_bpratio l_bpFaced l_bpSaved l_bpratio
Country
ARG 4.851735 3.261830 0.672302 8.032258 4.406452 0.548594
AUS 4.615023 3.281690 0.711089 7.298113 4.079245 0.558945
ESP 4.709845 3.266321 0.693509 7.893590 4.315385 0.546695
FRA 4.356509 3.087278 0.708659 7.618902 4.353659 0.571429
JPN 4.702899 3.173913 0.674884 7.185185 3.657407 0.509021
SRB 3.722467 2.409692 0.647337 8.110294 4.750000 0.585675
USA 3.624703 2.567696 0.708388 7.212446 4.122318 0.571556

We now have a dataframe that's a combination of the winner and loser data. To make things easier for plotting purposes, let's take only the bp ratio numbers and extract them to a new dataframe.


In [39]:
varlist3 = ['l_bpratio','w_bpratio'] 
result2 = result[varlist3]
result2


Out[39]:
l_bpratio w_bpratio
Country
ARG 0.548594 0.672302
AUS 0.558945 0.711089
ESP 0.546695 0.693509
FRA 0.571429 0.708659
JPN 0.509021 0.674884
SRB 0.585675 0.647337
USA 0.571556 0.708388

In [40]:
fig, ax= plt.subplots()
result2.plot(ax=ax,
        kind='bar',
        color=['red','blue'],
        title= 'Break Point Ratios by Country')
ax.legend(loc=2)
ax.set_ylabel('Percentage of Break Points Saved')
ax.set_xlabel('Country')
# label bars with the dataframe's own (alphabetical) index; `countries` is in a different order
ax.set_xticklabels(result2.index, rotation=45)


Out[40]:
[<matplotlib.text.Text at 0x11abfa978>,
 <matplotlib.text.Text at 0x11abfce80>,
 <matplotlib.text.Text at 0x118010b38>,
 <matplotlib.text.Text at 0x118014588>,
 <matplotlib.text.Text at 0x118014f98>,
 <matplotlib.text.Text at 0x1180189e8>,
 <matplotlib.text.Text at 0x11801c438>]

Our results indicate that the BP Ratio does in fact matter quite a bit to a player's success. Across the board, there are fairly large differences between loser and winner BP Ratios: for most countries, winners save roughly 12 to 17 percentage points more of the break points they face than losers do. There is some variation among nationalities (the gap is smallest for Serbian players, at about 6 percentage points), but for the most part the differences between winning and losing players are sizable and consistent across nationalities.
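
The size of the winner/loser gap can be read directly off the table of ratios. The sketch below rebuilds that table by hand from the values shown in Out[39] and subtracts the two columns. The names `ratios` and `gap` are illustrative.

```python
import pandas as pd

# BP ratios by country, copied from the table above.
ratios = pd.DataFrame(
    {'l_bpratio': [0.548594, 0.558945, 0.546695, 0.571429, 0.509021, 0.585675, 0.571556],
     'w_bpratio': [0.672302, 0.711089, 0.693509, 0.708659, 0.674884, 0.647337, 0.708388]},
    index=pd.Index(['ARG', 'AUS', 'ESP', 'FRA', 'JPN', 'SRB', 'USA'], name='Country'),
)

# Gap between winners and losers, sorted from smallest to largest.
gap = (ratios['w_bpratio'] - ratios['l_bpratio']).sort_values()
print(gap.round(3))
```

Sorted this way, the smallest gap belongs to SRB (about 0.062) and the largest to JPN (about 0.166), with the other five countries between roughly 0.12 and 0.15.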

From this, we can conclude that, at least for the years 2013 through 2015 and the countries represented in this particular analysis, the "BP Ratio" as we have defined it in this project is a pretty significant factor in determining a player's success in a match. We cannot conclude with any reasonable degree of certainty that there are patterns in the BP Ratio among different countries, other than that for the most part the ratios are relatively consistent.

Conclusion

The results of our analysis are consistent with certain beliefs the tennis community holds about players from certain countries. After analyzing the average length of a 2 out of 3 set match, we see that Spanish and Argentine players, who grow up playing on clay courts, have longer matches on average. Conversely, players from Japan, the USA, and Australia, who grow up on hard courts, play somewhat shorter matches. Our analysis of win percentage similarly supports certain aspects of this theory. Spanish players have a higher win percentage on clay than on hard courts and grass. Americans have a higher win percentage on hard courts than on clay or grass. Australian players have very little success on clay courts. Our comparison of double faults to length of match did not return any valuable information about a player's endurance based on where he grew up playing. Lastly, a higher ratio of break points saved to break points faced is an important factor for winning matches, an outcome that holds for winners and losers alike and varies little among players from different countries.

