May 2016
Written by John Ockay at NYU's Stern School of Business
Contact: jfo262@stern.nyu.edu
In 2015, Novak Djokovic earned over $16 million through tennis alone, while Dudi Sela, the 100th-best tennis player in the world, earned just over $300 thousand. In 2010, the U.S. Tennis Association reported that the average annual cost of being a “highly competitive” professional tennis player was $143 thousand. After accounting for such expenses, a player who cannot stay within the top 100 on a consistent basis will find it very difficult to survive financially on tennis alone.
My project uses historical tennis data to investigate key indicators of future success in men's professional tennis. Before continuing to invest hundreds of thousands of dollars in his career, a young player can compare his statistics with the early careers of tennis stars (i.e., those who have achieved at least a top 20 ranking at some point during their careers). For the purposes of this project, I analyzed the early career statistics of every player who achieved a top 20 or top 10 ranking between 1995 and 2015. I looked at the following key indicators:
1) Professional seasons until a top 50 ranking: For those who became top 20 or top 10 players between 1995 and 2015, how many years did it take for them to achieve a top 50 ranking after they turned pro? Young tennis players can use this metric to compare their ranking progress with the early progress of successful tennis players.
2) Top 10 Victories: How many top 10 victories did tennis stars have in their first 3 and their first 5 professional seasons?
3) Top 50 Victories: How many top 50 victories did tennis stars have in their first 3 and their first 5 professional seasons?
4) Tournament Victories: How many tournament victories did tennis stars have in their first 3 and their first 5 professional seasons?
For each indicator, I also broke the data down by when each top 20 or top 10 player between 1995 and 2015 turned pro. I compared the early careers of tennis stars who turned pro before 1995, from 1995 to 1999, from 2000 to 2004, and after 2004. As we will see in the graphics below, young players today may not be expected to progress as quickly or have as many top 10 or tournament victories as players who turned pro before 1995.
Ultimately, I provide the means to compare the stats of any tennis player on the ATP tour with the average, historical data of players who eventually became top 20 or top 10 players in the world.
I used the following Python packages to import both online and local Excel data and to develop the relevant graphics:
In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import xlrd
To complete this project, I used data from two different sources. The first is Jeff Sackmann, an author and entrepreneur who has worked in sports statistics and test preparation and has a particular interest in tennis statistics. He created a website called TennisAbstract that contains historical match data and other tennis analytics, and he has uploaded match data from 1968 to 2016 to GitHub in CSV format. I pulled the data from GitHub into Python to conduct the analysis for my project: I imported match data from 1984 to 2015, combined it into one large table, removed unnecessary columns, and reduced the date column to the year to make filtering easier. This process is outlined below:
In [2]:
url2015 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2015.csv'
tennis2015 = pd.read_csv(url2015)
url2014 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2014.csv'
tennis2014 = pd.read_csv(url2014)
url2013 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2013.csv'
tennis2013 = pd.read_csv(url2013)
url2012 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2012.csv'
tennis2012 = pd.read_csv(url2012)
url2011 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2011.csv'
tennis2011 = pd.read_csv(url2011)
url2010 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2010.csv'
tennis2010 = pd.read_csv(url2010)
url2009 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2009.csv'
tennis2009 = pd.read_csv(url2009)
url2008 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2008.csv'
tennis2008 = pd.read_csv(url2008)
url2007 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2007.csv'
tennis2007 = pd.read_csv(url2007)
url2006 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2006.csv'
tennis2006 = pd.read_csv(url2006)
url2005 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2005.csv'
tennis2005 = pd.read_csv(url2005)
url2004 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2004.csv'
tennis2004 = pd.read_csv(url2004)
url2003 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2003.csv'
tennis2003 = pd.read_csv(url2003)
url2002 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2002.csv'
tennis2002 = pd.read_csv(url2002)
url2001 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2001.csv'
tennis2001 = pd.read_csv(url2001)
url2000 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2000.csv'
tennis2000 = pd.read_csv(url2000)
url1999 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1999.csv'
tennis1999 = pd.read_csv(url1999)
url1998 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1998.csv'
tennis1998 = pd.read_csv(url1998)
url1997 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1997.csv'
tennis1997 = pd.read_csv(url1997)
url1996 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1996.csv'
tennis1996 = pd.read_csv(url1996)
url1995 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1995.csv'
tennis1995 = pd.read_csv(url1995)
url1994 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1994.csv'
tennis1994 = pd.read_csv(url1994)
url1993 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1993.csv'
tennis1993 = pd.read_csv(url1993)
url1992 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1992.csv'
tennis1992 = pd.read_csv(url1992)
url1991 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1991.csv'
tennis1991 = pd.read_csv(url1991)
url1990 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1990.csv'
tennis1990 = pd.read_csv(url1990)
url1989 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1989.csv'
tennis1989 = pd.read_csv(url1989)
url1988 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1988.csv'
tennis1988 = pd.read_csv(url1988)
url1987 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1987.csv'
tennis1987 = pd.read_csv(url1987)
url1986 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1986.csv'
tennis1986 = pd.read_csv(url1986)
url1985 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1985.csv'
tennis1985 = pd.read_csv(url1985)
url1984 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1984.csv'
tennis1984 = pd.read_csv(url1984)
In [3]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the seasons the same way.
tennisALL = pd.concat([tennis1984, tennis1985, tennis1986, tennis1987, tennis1988,
                       tennis1989, tennis1990, tennis1991, tennis1992,
                       tennis1993, tennis1994, tennis1995, tennis1996,
                       tennis1997, tennis1998, tennis1999, tennis2000,
                       tennis2001, tennis2002, tennis2003, tennis2004,
                       tennis2005, tennis2006, tennis2007, tennis2008,
                       tennis2009, tennis2010, tennis2011, tennis2012,
                       tennis2013, tennis2014, tennis2015])
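The 32 near-identical download statements above follow one URL pattern, so they could also be written as a loop. The following is only a sketch under that assumption; `match_url` and `load_matches` are hypothetical helper names that do not appear in the original notebook, and the `reader` parameter exists only so the loader can be tried without a network connection.

```python
import pandas as pd

# URL pattern copied from the download cells above; only the year changes.
BASE_URL = ("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/"
            "master/atp_matches_{year}.csv")


def match_url(year):
    """Return the CSV URL for one season."""
    return BASE_URL.format(year=year)


def load_matches(first_year=1984, last_year=2015, reader=pd.read_csv):
    """Read every season from first_year to last_year and stack the results."""
    frames = [reader(match_url(year)) for year in range(first_year, last_year + 1)]
    return pd.concat(frames)


# tennisALL = load_matches()  # uncomment to download all 32 seasons
```

Calling `load_matches()` with its defaults should produce the same combined table as the cells above, without the per-year variables.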
In [4]:
tennisALL.head(2)
Out[4]:
In [5]:
tennisALL = tennisALL.drop(["tourney_id","draw_size","tourney_level","match_num","winner_id",
"winner_seed","winner_entry","winner_hand","winner_ht","winner_ioc",
"winner_age","winner_rank_points","loser_id","loser_seed","loser_entry",
"loser_hand","loser_ht","loser_ioc","loser_age","loser_rank_points","score",
"best_of","minutes","w_ace","w_df","w_svpt","w_1stIn","w_1stWon","w_2ndWon",
"w_SvGms","w_bpSaved","w_bpFaced","l_ace","l_df","l_svpt","l_1stIn","l_1stWon",
"l_2ndWon","l_SvGms","l_bpSaved","l_bpFaced"],axis=1)
tennisALL["tourney_date"] = tennisALL["tourney_date"].astype(str)
tennisALL["tourney_date"] = tennisALL["tourney_date"].str[:4]
tennisALL["tourney_date"] = tennisALL["tourney_date"].astype(int)
tennisALL
Out[5]:
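The date adjustment above keeps only the first four characters of `tourney_date`, which implies the column holds dates in YYYYMMDD form. Under that assumption, the same year could be obtained without going through strings. The sketch below uses a small made-up table, not the real match data.

```python
import pandas as pd

# Toy stand-in for the tourney_date column, assumed to hold YYYYMMDD integers.
matches = pd.DataFrame({"tourney_date": [19840102, 20150629, 20141110]})

# Integer division by 10000 drops the month and day digits.
matches["year"] = matches["tourney_date"] // 10000

# Equivalent route through real datetimes, useful if months are needed later.
parsed = pd.to_datetime(matches["tourney_date"].astype(str), format="%Y%m%d")
matches["year_from_datetime"] = parsed.dt.year
```

Both new columns hold the season as a number, so filters such as `matches["year"] == 2014` work the same way as in the functions that follow.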
I utilized the above data to develop functions that allowed me to easily obtain and organize any tennis player's early career statistics, such as top 10 or top 50 victories. Those functions are listed below with Roger Federer used as an example:
CareerWins: total career victories for any given tennis player
In [6]:
def CareerWins(player):
    return len(tennisALL[(tennisALL['winner_name']==player)])
CareerWins("Roger Federer")
Out[6]:
YearWins: victories within a specific season for any given tennis player
In [7]:
def YearWins(player,year):
    return len(tennisALL[(tennisALL['tourney_date']==year) & (tennisALL['winner_name']==player)])
YearWins("Roger Federer",2014)
Out[7]:
CareerTopTen: total career top 10 victories for any given tennis player
In [8]:
def CareerTopTen(player):
    return len(tennisALL[(tennisALL['loser_rank']<=10) & (tennisALL['winner_name']==player)])
CareerTopTen("Roger Federer")
Out[8]:
YearTopTen: top ten victories within a specific season for any given tennis player
In [9]:
def YearTopTen(player,year):
    return len(tennisALL[(tennisALL['loser_rank']<=10)&(tennisALL['winner_name']==player)
                         &(tennisALL['tourney_date']==year)])
YearTopTen("Roger Federer",2014)
Out[9]:
CareerTopFifty: total career top 50 victories for any given tennis player
In [10]:
def CareerTopFifty(player):
    return len(tennisALL[(tennisALL['loser_rank']<=50) & (tennisALL['winner_name']==player)])
CareerTopFifty("Roger Federer")
Out[10]:
YearTopFifty: top fifty victories within a specific season for any given tennis player
In [11]:
def YearTopFifty(player,year):
    return len(tennisALL[(tennisALL['loser_rank']<=50)&(tennisALL['winner_name']==player)
                         &(tennisALL['tourney_date']==year)])
YearTopFifty("Roger Federer",2014)
Out[11]:
CareerTourneyWins: total career tournament victories for any given tennis player
In [12]:
def CareerTourneyWins(player):
    return len(tennisALL[(tennisALL['round']=='F') & (tennisALL['winner_name']==player)])
CareerTourneyWins("Roger Federer")
Out[12]:
YearTourneyWins: tournament victories within a specific season for any given tennis player
In [13]:
def YearTourneyWins(player,year):
    return len(tennisALL[(tennisALL['round']=='F')&(tennisALL['winner_name']==player)
                         &(tennisALL['tourney_date']==year)])
YearTourneyWins("Roger Federer",2014)
Out[13]:
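The eight functions above differ only in which filters they combine, so they could be folded into one function with optional arguments. This is a sketch rather than the notebook's own approach; `count_wins` and its parameters are hypothetical names, and the toy table simply reuses the column names of `tennisALL`.

```python
import pandas as pd


def count_wins(matches, player, year=None, max_loser_rank=None, finals_only=False):
    """Count a player's wins, optionally narrowed by season, opponent rank, or round."""
    mask = matches["winner_name"] == player
    if year is not None:
        mask &= matches["tourney_date"] == year
    if max_loser_rank is not None:
        mask &= matches["loser_rank"] <= max_loser_rank
    if finals_only:
        mask &= matches["round"] == "F"
    return int(mask.sum())


# Hypothetical toy data with the same column names as tennisALL.
toy = pd.DataFrame({
    "winner_name": ["Player A", "Player A", "Player B", "Player A"],
    "loser_rank": [5, 40, 8, 120],
    "round": ["F", "SF", "F", "R32"],
    "tourney_date": [2014, 2014, 2014, 2013],
})

top10_wins_2014 = count_wins(toy, "Player A", year=2014, max_loser_rank=10)
```

With the real data, `count_wins(tennisALL, "Roger Federer", year=2014, max_loser_rank=10)` would play the role of `YearTopTen("Roger Federer",2014)`.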
While the data from Jeff Sackmann allowed me to easily collect data for individual tennis players, I needed to use an Excel document in order to better organize summary data for Top 20 and Top 10 players between the years of 1995 and 2015. In addition, I collected (from atpworld.com) and organized historical rankings data and the years in which players turned pro in order to investigate players' rankings progression. The Excel document contains the following worksheets:
In [14]:
path = ("C:/Users/John/Desktop/Data_Bootcamp/Project Data/ATP Tennis Data (1973-2016).xlsx")
TennisData = pd.ExcelFile(path)
In [15]:
Top20Summary = TennisData.parse("Top 20 Summary (1995 Onward)").set_index("Player")
Top20RankingProgression = TennisData.parse("Top 20 Ranking Progression").set_index("Player")
Top10RankingProgression = TennisData.parse("Top 10 Ranking Progression").set_index("Player")
Top20Top10Wins = TennisData.parse("Top 20 Top 10 Wins").set_index("Player")
Top10Top10Wins = TennisData.parse("Top 10 Top 10 Wins").set_index("Player")
Top20Top50Wins = TennisData.parse("Top 20 Top 50 Wins").set_index("Player")
Top10Top50Wins = TennisData.parse("Top 10 Top 50 Wins").set_index("Player")
Top20TournamentWins = TennisData.parse("Top 20 Tournament Wins").set_index("Player")
Top10TournamentWins = TennisData.parse("Top 10 Tournament Wins").set_index("Player")
AllPlayerSummary = TennisData.parse("All Player Summary").set_index("Player")
The following cells illustrate some of the data contained within the Excel document:
In [16]:
Top20Summary = Top20Summary.fillna("")
Top20Summary.head(5)
Out[16]:
In [17]:
Top20RankingProgression.head(5)
Out[17]:
In [18]:
Top10Top10Wins.head(5)
Out[18]:
In [19]:
AllPlayerSummary = AllPlayerSummary[["Years Until Top 50","Year Turned Pro"]]
AllPlayerSummary = AllPlayerSummary.fillna("Never")
AllPlayerSummary.head(5)
Out[19]:
In order to investigate a player's early career, it is essential to identify when that player turned professional. I developed the following function based on the data within the Excel document to facilitate that process. Again, Roger Federer is used as an example.
In [20]:
def YearTurnedPro(Player):
    # .ix was removed in pandas 1.0; .loc performs the same label-based lookup.
    return AllPlayerSummary.loc[Player,"Year Turned Pro"]
YearTurnedPro("Roger Federer")
Out[20]:
In [21]:
def RankingProgression(Player):
    # .ix was removed in pandas 1.0; .loc performs the same label-based lookup.
    print("Player Name:",Player)
    print("Year Turned Pro:",YearTurnedPro(Player))
    print("Rankings First Five Pro Years:",[Top20RankingProgression.loc[Player,"Year 1"],
          Top20RankingProgression.loc[Player,"Year 2"],Top20RankingProgression.loc[Player,"Year 3"],
          Top20RankingProgression.loc[Player,"Year 4"],Top20RankingProgression.loc[Player,"Year 5"]])
RankingProgression("Roger Federer")
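The ranking progression above can also be turned into a single number: the first professional season in which a player was ranked inside the top 50. The sketch below shows one possible way to derive it from year-end rankings. It is not necessarily how the "Years Until Top 50" column in the Excel document was built, and the table, player names, and function names are all hypothetical.

```python
import pandas as pd

# Hypothetical year-end rankings, laid out like the "Ranking Progression" sheets.
progression = pd.DataFrame(
    {"Year 1": [300, 150], "Year 2": [60, 90], "Year 3": [25, 70],
     "Year 4": [10, 55], "Year 5": [5, 48]},
    index=pd.Index(["Player A", "Player B"], name="Player"),
)

YEAR_COLUMNS = ["Year 1", "Year 2", "Year 3", "Year 4", "Year 5"]


def first_five_rankings(table, player):
    """Return the first five year-end rankings for one player as a list."""
    return table.loc[player, YEAR_COLUMNS].tolist()


def seasons_until_top(table, player, threshold=50):
    """Return the first season (1-based) ranked at or inside the threshold, else None."""
    for season, rank in enumerate(first_five_rankings(table, player), start=1):
        if rank <= threshold:
            return season
    return None
```

A result of `None` means the player did not reach the threshold within the five seasons recorded in the table.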
As mentioned in the introduction, a primary goal for this project was to compare the stats of young tennis players to the early career stats of players who eventually achieved top 20 or top 10 status. Utilizing data from the Excel document, the following cells include my analysis of the early careers of players who achieved top 20 or top 10 status between the years of 1995 and 2015.
First, I compared how long, on average, it took players who eventually became top 20 or top 10 players to achieve top 50 status upon turning pro. For all analyses below, in addition to comparing the top 20 group to the top 10 group, I also made comparisons based on the year in which tennis players turned pro.
In [64]:
# The years are compared as scalars; recent pandas versions reject a one-element list such as [1995].
YearsUntilTop50Comparison = pd.DataFrame({
    'Year Turned Pro': ['Before 1995', 'Between 1995 and 1999',
                        'Between 2000 and 2004', 'After 2004'],
    'Avg Years to Reach Top 50 (For Eventual Top 20 Players)': [
        Top20RankingProgression[(Top20RankingProgression['Year Turned Pro'] < 1995)]
        ["Years Until Top 50"].mean(),
        Top20RankingProgression[(Top20RankingProgression['Year Turned Pro'] >= 1995) &
                                (Top20RankingProgression['Year Turned Pro'] < 2000)]["Years Until Top 50"].mean(),
        Top20RankingProgression[(Top20RankingProgression['Year Turned Pro'] >= 2000) &
                                (Top20RankingProgression['Year Turned Pro'] < 2005)]["Years Until Top 50"].mean(),
        Top20RankingProgression[(Top20RankingProgression['Year Turned Pro'] >= 2005)]
        ["Years Until Top 50"].mean()],
    'Avg Years to Reach Top 50 (For Eventual Top 10 Players)': [
        Top10RankingProgression[(Top10RankingProgression['Year Turned Pro'] < 1995)]
        ["Years Until Top 50"].mean(),
        Top10RankingProgression[(Top10RankingProgression['Year Turned Pro'] >= 1995) &
                                (Top10RankingProgression['Year Turned Pro'] < 2000)]["Years Until Top 50"].mean(),
        Top10RankingProgression[(Top10RankingProgression['Year Turned Pro'] >= 2000) &
                                (Top10RankingProgression['Year Turned Pro'] < 2005)]["Years Until Top 50"].mean(),
        Top10RankingProgression[(Top10RankingProgression['Year Turned Pro'] >= 2005)]
        ["Years Until Top 50"].mean()]})
In [65]:
YearsUntilTop50Comparison = YearsUntilTop50Comparison.set_index("Year Turned Pro")
YearsUntilTop50ComparisonGraph = YearsUntilTop50Comparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Years")
plt.style.use('fivethirtyeight')
plt.legend(loc='lower right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("How Long Does it Take Tennis Stars to \nAchieve a Top 50 Ranking After Turning Pro?",fontsize=(16))
Out[65]:
As seen above, tennis stars who turned pro before 1995 achieved top 50 status more quickly than players who turned pro after 2004. As expected, players who eventually became top 10 achieved top 50 status more quickly than those players who only became top 20 in the world. The following two graphics outline similar information in table and histogram formats:
In [24]:
YearsUntilTop50Comparison.round(2)
Out[24]:
In [25]:
Top20RankingProgression.plot(kind="hist",bins=10,y="Years Until Top 50")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Years It Took Top 20 Players from 1995-2015 to Reach Top 50\n\n",fontsize=(12))
plt.xlabel("\nYears",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))
Out[25]:
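The era averages in this section are built by filtering the same table four times. As an alternative sketch, `pd.cut` can assign each player to an era once and `groupby` can then average within each era. The data below is made up for illustration; only the two column names and the four era labels come from the notebook.

```python
import pandas as pd

# Hypothetical rows shaped like the ranking-progression sheets.
players = pd.DataFrame({
    "Year Turned Pro": [1990, 1994, 1995, 1999, 2000, 2004, 2005, 2010],
    "Years Until Top 50": [2, 4, 3, 5, 4, 6, 5, 7],
})

ERA_LABELS = ["Before 1995", "Between 1995 and 1999",
              "Between 2000 and 2004", "After 2004"]

# Left-closed bins reproduce the cutoffs: <1995, 1995-1999, 2000-2004, >=2005.
era = pd.cut(
    players["Year Turned Pro"],
    bins=[-float("inf"), 1995, 2000, 2005, float("inf")],
    right=False,
    labels=ERA_LABELS,
)

# One mean per era, in the same order as the labels.
era_means = players.groupby(era, observed=False)["Years Until Top 50"].mean()
```

The same pattern would apply to the win and tournament tables by swapping in a different value column.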
Second, I compared how many top 10 wins, on average, tennis stars had within 3 years of turning pro.
In [26]:
Top10WinsComparison = pd.DataFrame({
    'Year Turned Pro': ['Before 1995', 'Between 1995 and 1999',
                        'Between 2000 and 2004', 'After 2004'],
    'Top 10 Wins in First 3 Pro Years (For Eventual Top 20 Players)': [
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] < 1995)]["Total First 3 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] >= 1995) &
                       (Top20Top10Wins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] >= 2000) &
                       (Top20Top10Wins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] >= 2005)]["Total First 3 Years"].mean()],
    'Top 10 Wins in First 3 Pro Years (For Eventual Top 10 Players)': [
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] < 1995)]["Total First 3 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] >= 1995) &
                       (Top10Top10Wins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] >= 2000) &
                       (Top10Top10Wins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] >= 2005)]["Total First 3 Years"].mean()]})
In [27]:
Top10WinsComparison = Top10WinsComparison.set_index("Year Turned Pro")
Top10WinsComparisonGraph = Top10WinsComparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 10 Victories")
plt.legend(loc='lower left',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 10 Victories Do Tennis Stars \nHave in First 3 Professional Seasons?",fontsize=(16))
Out[27]:
As seen above, players who turned pro prior to 2000 generally had more top 10 victories within their first 3 professional seasons compared to players turning pro after 2000. Due to players like Andy Murray, who had incredible success early in his career, the trend has started to change over the last decade. The following graphs highlight similar information in table and histogram format:
In [28]:
Top10WinsComparison.round(2)
Out[28]:
In [29]:
Top20Top10Wins.plot(kind="hist",bins=10,y="Total First 3 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 10 Victories in First 3 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 10 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))
Out[29]:
Third, I compared how many top 10 wins, on average, tennis stars had within 5 years of turning pro.
In [30]:
Top10WinsComparison2 = pd.DataFrame({
    'Year Turned Pro': ['Before 1995', 'Between 1995 and 1999',
                        'Between 2000 and 2004', 'After 2004'],
    'Top 10 Wins in First 5 Pro Years (For Eventual Top 20 Players)': [
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] < 1995)]["Total First 5 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] >= 1995) &
                       (Top20Top10Wins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] >= 2000) &
                       (Top20Top10Wins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro'] >= 2005)]["Total First 5 Years"].mean()],
    'Top 10 Wins in First 5 Pro Years (For Eventual Top 10 Players)': [
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] < 1995)]["Total First 5 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] >= 1995) &
                       (Top10Top10Wins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] >= 2000) &
                       (Top10Top10Wins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro'] >= 2005)]["Total First 5 Years"].mean()]})
In [31]:
Top10WinsComparison2 = Top10WinsComparison2.set_index("Year Turned Pro")
Top10WinsComparison2Graph = Top10WinsComparison2.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 10 Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 10 Victories Do Tennis Stars \nHave in First 5 Professional Seasons?",fontsize=(16))
Out[31]:
As seen above, players who turned pro prior to 2000 generally had more top 10 victories within their first 5 professional seasons compared to players turning pro after 2000. Due to players like Andy Murray, who had incredible success early in his career, the trend has started to change over the last decade. The following graphs highlight similar information in table and histogram format:
In [32]:
Top10WinsComparison2.round(2)
Out[32]:
In [33]:
Top20Top10Wins.plot(kind="hist",bins=10,y="Total First 5 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 10 Victories in First 5 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 10 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))
Out[33]:
Fourth, I compared how many top 50 wins, on average, tennis stars had within 3 years of turning pro.
In [34]:
Top50WinsComparison = pd.DataFrame({
    'Year Turned Pro': ['Before 1995', 'Between 1995 and 1999',
                        'Between 2000 and 2004', 'After 2004'],
    'Top 50 Wins in First 3 Pro Years (For Eventual Top 20 Players)': [
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] < 1995)]["Total First 3 Years"].mean(),
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 1995) &
                       (Top20Top50Wins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 2000) &
                       (Top20Top50Wins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 2005)]["Total First 3 Years"].mean()],
    'Top 50 Wins in First 3 Pro Years (For Eventual Top 10 Players)': [
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] < 1995)]["Total First 3 Years"].mean(),
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 1995) &
                       (Top10Top50Wins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 2000) &
                       (Top10Top50Wins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 2005)]["Total First 3 Years"].mean()]})
In [35]:
Top50WinsComparison = Top50WinsComparison.set_index("Year Turned Pro")
Top50WinsComparisonGraph = Top50WinsComparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 50 Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 50 Victories Do Tennis Stars \nHave in First 3 Professional Seasons?",fontsize=(16))
Out[35]:
As seen above, players who turned pro prior to 2000 generally had more top 50 victories within their first 3 professional seasons compared to players turning pro after 2000. Due to players like Andy Murray, who had incredible success early in his career, the trend has started to change over the last decade. The following graphs highlight similar information in table and histogram format:
In [36]:
Top50WinsComparison.round(2)
Out[36]:
In [37]:
Top20Top50Wins.plot(kind="hist",bins=10,y="Total First 3 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 50 Victories in First 3 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 50 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))
Out[37]:
Fifth, I compared how many top 50 wins, on average, tennis stars had within 5 years of turning pro.
In [38]:
Top50WinsComparison2 = pd.DataFrame({
    'Year Turned Pro': ['Before 1995', 'Between 1995 and 1999',
                        'Between 2000 and 2004', 'After 2004'],
    'Top 50 Wins in First 5 Pro Years (For Eventual Top 20 Players)': [
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] < 1995)]["Total First 5 Years"].mean(),
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 1995) &
                       (Top20Top50Wins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 2000) &
                       (Top20Top50Wins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
        Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 2005)]["Total First 5 Years"].mean()],
    'Top 50 Wins in First 5 Pro Years (For Eventual Top 10 Players)': [
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] < 1995)]["Total First 5 Years"].mean(),
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 1995) &
                       (Top10Top50Wins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 2000) &
                       (Top10Top50Wins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
        Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 2005)]["Total First 5 Years"].mean()]})
In [39]:
Top50WinsComparison2 = Top50WinsComparison2.set_index("Year Turned Pro")
Top50WinsComparison2Graph = Top50WinsComparison2.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 50 Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 50 Victories Do Tennis Stars \nHave in First 5 Professional Seasons?",fontsize=(16))
Out[39]:
As seen above, players who turned pro prior to 2000 generally had more top 50 victories within their first 5 professional seasons compared to players turning pro after 2000. Due to players like Andy Murray, who had incredible success early in his career, the trend has started to change over the last decade. The following graphs highlight similar information in table and histogram format:
In [40]:
Top50WinsComparison2.round(2)
Out[40]:
In [41]:
Top20Top50Wins.plot(kind="hist",bins=10,y="Total First 5 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 50 Victories in First 5 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 50 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))
Out[41]:
Sixth, I compared how many tournament wins, on average, tennis stars had within 3 years of turning pro.
In [42]:
TournamentWinsComparison = pd.DataFrame({
    'Year Turned Pro': ['Before 1995', 'Between 1995 and 1999',
                        'Between 2000 and 2004', 'After 2004'],
    'Tournament Wins in First 3 Pro Years (For Eventual Top 20 Players)': [
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] < 1995)]["Total First 3 Years"].mean(),
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 1995) &
                            (Top20TournamentWins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 2000) &
                            (Top20TournamentWins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 2005)]["Total First 3 Years"].mean()],
    'Tournament Wins in First 3 Pro Years (For Eventual Top 10 Players)': [
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] < 1995)]["Total First 3 Years"].mean(),
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 1995) &
                            (Top10TournamentWins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 2000) &
                            (Top10TournamentWins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 2005)]["Total First 3 Years"].mean()]},
    index=[1, 2, 3, 4])
In [43]:
TournamentWinsComparison = TournamentWinsComparison.set_index("Year Turned Pro")
TournamentWinsComparisonGraph = TournamentWinsComparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Tournament Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Tournament Victories Do Tennis Stars \nHave in First 3 Professional Seasons?",
fontsize=(16))
Out[43]:
As seen above, tennis stars that turned pro before 1995 had much more tournament success early in their careers compared with tennis stars that turned pro after 2000. The following graphs highlight similar information in table and histogram format:
In [44]:
TournamentWinsComparison.round(2)
Out[44]:
In [45]:
Top20TournamentWins.plot(kind="hist",bins=10,y="Total First 3 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Tournament Victories in First 3 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTournament Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))
Out[45]:
Lastly, I compared how many tournament wins, on average, tennis stars had within 5 years of turning pro.
In [46]:
TournamentWinsComparison2 = pd.DataFrame({
    'Year Turned Pro': ['Before 1995', 'Between 1995 and 1999',
                        'Between 2000 and 2004', 'After 2004'],
    'Tournament Wins in First 5 Pro Years (For Eventual Top 20 Players)': [
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] < 1995)]["Total First 5 Years"].mean(),
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 1995) &
                            (Top20TournamentWins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 2000) &
                            (Top20TournamentWins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
        Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 2005)]["Total First 5 Years"].mean()],
    'Tournament Wins in First 5 Pro Years (For Eventual Top 10 Players)': [
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] < 1995)]["Total First 5 Years"].mean(),
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 1995) &
                            (Top10TournamentWins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 2000) &
                            (Top10TournamentWins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
        Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 2005)]["Total First 5 Years"].mean()]},
    index=[1, 2, 3, 4])
In [47]:
TournamentWinsComparison2 = TournamentWinsComparison2.set_index("Year Turned Pro")
TournamentWinsComparison2Graph = TournamentWinsComparison2.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Tournament Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Tournament Victories Do Tennis Stars \nHave in First 5 Professional Seasons?",
fontsize=(16))
Out[47]:
Once again, while there has been a resurgence more recently, tennis stars who turned pro before 1995 generally had more early success in tournaments compared with tennis stars who turned pro after 2000. The following graphs highlight similar information in table and histogram format:
In [48]:
TournamentWinsComparison2.round(2)
Out[48]:
In [49]:
Top20TournamentWins.plot(kind="hist",bins=10,y="Total First 5 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Tournament Victories in First 5 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTournament Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))
Out[49]:
The following table summarizes the data from the graphics above. These numbers serve as the foundation for the final analysis of young players' early careers. As we saw above, with the exception of outliers like Andy Murray, tennis stars in recent years have taken longer to reach the top 50 and have had fewer top 10, top 50, and tournament victories in the first few years of their careers. Since the early 2000s, tennis has largely been dominated by four players (Roger Federer, Rafael Nadal, Andy Murray, and Novak Djokovic), and this is likely the primary explanation for the increased difficulty within men's professional tennis. With such dominant players at the top, it has been challenging for younger players to break into the top 10 or to win tournaments at a young age.
In [50]:
TennisSummary = [YearsUntilTop50Comparison["Avg Years to Reach Top 50 (For Eventual Top 20 Players)"],
Top10WinsComparison["Top 10 Wins in First 3 Pro Years (For Eventual Top 20 Players)"],
Top10WinsComparison2["Top 10 Wins in First 5 Pro Years (For Eventual Top 20 Players)"],
Top50WinsComparison["Top 50 Wins in First 3 Pro Years (For Eventual Top 20 Players)"],
Top50WinsComparison2["Top 50 Wins in First 5 Pro Years (For Eventual Top 20 Players)"],
TournamentWinsComparison["Tournament Wins in First 3 Pro Years (For Eventual Top 20 Players)"],
TournamentWinsComparison2["Tournament Wins in First 5 Pro Years (For Eventual Top 20 Players)"],
YearsUntilTop50Comparison["Avg Years to Reach Top 50 (For Eventual Top 10 Players)"],
Top10WinsComparison["Top 10 Wins in First 3 Pro Years (For Eventual Top 10 Players)"],
Top10WinsComparison2["Top 10 Wins in First 5 Pro Years (For Eventual Top 10 Players)"],
Top50WinsComparison["Top 50 Wins in First 3 Pro Years (For Eventual Top 10 Players)"],
Top50WinsComparison2["Top 50 Wins in First 5 Pro Years (For Eventual Top 10 Players)"],
TournamentWinsComparison["Tournament Wins in First 3 Pro Years (For Eventual Top 10 Players)"],
TournamentWinsComparison2["Tournament Wins in First 5 Pro Years (For Eventual Top 10 Players)"]]
result = pd.concat(TennisSummary,axis=1)
columns = [('All Top 20 Players between 1995 to 2015','All Top 10 Players between 1995 to 2015'),
("Avg Years to Reach Top 50",
"Top 10 Wins First 3 Pro Years",
"Top 10 Wins First 5 Pro Years",
"Top 50 Wins First 3 Pro Years",
"Top 50 Wins First 5 Pro Years",
"Tournament Wins First 3 Pro Years",
"Tournament Wins First 5 Pro Years")]
result.columns=pd.MultiIndex.from_product(columns)
result.round(2)
Out[50]:
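The summary table depends on the concatenated columns being in the same order that pd.MultiIndex.from_product generates its labels. A small sketch with invented numbers and shortened, hypothetical names shows that ordering: the last level varies fastest, so every metric for the first group comes before any metric for the second group.

```python
import pandas as pd

groups = ["Top 20", "Top 10"]
metrics = ["Avg Years to Reach Top 50", "Tournament Wins First 5 Pro Years"]

# from_product varies the last level fastest: both metrics for "Top 20",
# then both metrics for "Top 10".
columns = pd.MultiIndex.from_product([groups, metrics])

# Invented numbers, concatenated in that same order.
series_in_order = [
    pd.Series([2.0, 3.0], name="t20_years"),
    pd.Series([4.0, 1.5], name="t20_wins"),
    pd.Series([1.5, 2.5], name="t10_years"),
    pd.Series([6.0, 2.0], name="t10_wins"),
]
summary = pd.concat(series_in_order, axis=1)
summary.columns = columns

# Selecting the outer label returns only that group's metrics.
print(summary["Top 10"])
```

If the list passed to pd.concat were ordered differently, the labels would still be assigned positionally and the table would be mislabeled without raising an error, so the two orderings are worth checking together.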
As the final piece of the analysis, I developed a set of functions to compare the average early-career statistics of the top 20 and top 10 players between 1995 and 2015 with the stats of any selected tennis player. The following functions rely on helper functions defined earlier in this report (YearTurnedPro, YearTourneyWins, YearTopTen, and YearTopFifty) and summarize tennis stats for any player.
TourneyWinsFirst3ProYears: Outputs number of tournament victories in first 3 professional seasons for any selected tennis player
In [51]:
def TourneyWinsFirst3ProYears(player):
    # Sum tournament wins over the first 3 pro seasons
    return (YearTourneyWins(player,YearTurnedPro(player)) + YearTourneyWins(player,YearTurnedPro(player)+1) +
            YearTourneyWins(player,YearTurnedPro(player)+2))
TourneyWinsFirst3ProYears("Roger Federer")
Out[51]:
TourneyWinsFirst5ProYears: Outputs number of tournament victories in first 5 professional seasons for any selected tennis player
In [52]:
def TourneyWinsFirst5ProYears(player):
    # Sum tournament wins over the first 5 pro seasons
    return (YearTourneyWins(player,YearTurnedPro(player)) + YearTourneyWins(player,YearTurnedPro(player)+1) +
            YearTourneyWins(player,YearTurnedPro(player)+2) + YearTourneyWins(player,YearTurnedPro(player)+3) +
            YearTourneyWins(player,YearTurnedPro(player)+4))
TourneyWinsFirst5ProYears("Roger Federer")
Out[52]:
Top10WinsFirst3ProYears: Outputs number of top 10 victories in first 3 professional seasons for any selected tennis player
In [53]:
def Top10WinsFirst3ProYears(player):
    # Sum wins against top 10 opponents over the first 3 pro seasons
    return (YearTopTen(player,YearTurnedPro(player)) + YearTopTen(player,YearTurnedPro(player)+1) +
            YearTopTen(player,YearTurnedPro(player)+2))
Top10WinsFirst3ProYears("Roger Federer")
Out[53]:
Top10WinsFirst5ProYears: Outputs number of top ten victories in first 5 professional seasons for any selected tennis player
In [54]:
def Top10WinsFirst5ProYears(player):
    # Sum wins against top 10 opponents over the first 5 pro seasons
    return (YearTopTen(player,YearTurnedPro(player)) + YearTopTen(player,YearTurnedPro(player)+1) +
            YearTopTen(player,YearTurnedPro(player)+2) + YearTopTen(player,YearTurnedPro(player)+3) +
            YearTopTen(player,YearTurnedPro(player)+4))
Top10WinsFirst5ProYears("Roger Federer")
Out[54]:
Top50WinsFirst3ProYears: Outputs number of top 50 victories in first 3 professional seasons for any selected tennis player
In [55]:
def Top50WinsFirst3ProYears(player):
    # Sum wins against top 50 opponents over the first 3 pro seasons
    return (YearTopFifty(player,YearTurnedPro(player)) + YearTopFifty(player,YearTurnedPro(player)+1) +
            YearTopFifty(player,YearTurnedPro(player)+2))
Top50WinsFirst3ProYears("Roger Federer")
Out[55]:
Top50WinsFirst5ProYears: Outputs number of top 50 victories in first 5 professional seasons for any selected tennis player
In [56]:
def Top50WinsFirst5ProYears(player):
    # Sum wins against top 50 opponents over the first 5 pro seasons
    return (YearTopFifty(player,YearTurnedPro(player)) + YearTopFifty(player,YearTurnedPro(player)+1) +
            YearTopFifty(player,YearTurnedPro(player)+2) + YearTopFifty(player,YearTurnedPro(player)+3) +
            YearTopFifty(player,YearTurnedPro(player)+4))
Top50WinsFirst5ProYears("Roger Federer")
Out[56]:
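The six functions above share one pattern: add up a per-season count over the first N seasons after a player turned pro. A minimal sketch of that pattern as a single helper follows. The helper name wins_first_n_pro_years and the toy lookup functions are hypothetical; they stand in for YearTurnedPro and YearTourneyWins, whose real definitions appear earlier in the report, and the toy data is invented.

```python
def wins_first_n_pro_years(player, yearly_wins, year_turned_pro, n_years):
    """Sum yearly_wins(player, year) over the first n_years pro seasons."""
    first_year = year_turned_pro(player)
    return sum(yearly_wins(player, first_year + offset) for offset in range(n_years))


# Hypothetical stand-ins for YearTurnedPro and YearTourneyWins (invented data).
toy_pro_year = {"Player A": 2001}
toy_wins = {
    ("Player A", 2001): 0,
    ("Player A", 2002): 1,
    ("Player A", 2003): 2,
    ("Player A", 2004): 0,
    ("Player A", 2005): 4,
}


def toy_year_turned_pro(player):
    return toy_pro_year[player]


def toy_year_tourney_wins(player, year):
    # Seasons with no recorded wins count as zero.
    return toy_wins.get((player, year), 0)


first3 = wins_first_n_pro_years("Player A", toy_year_tourney_wins, toy_year_turned_pro, 3)
first5 = wins_first_n_pro_years("Player A", toy_year_tourney_wins, toy_year_turned_pro, 5)
print(first3, first5)
```

Passing the per-season function as an argument would let the same helper cover tournament wins, top 10 wins, and top 50 wins, at the cost of slightly less self-explanatory call sites.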
YearsUntilTop50: Identifies how long it took a player to achieve top 50 status
In [57]:
def YearsUntilTop50(Player):
    # .loc replaces the .ix indexer, which has been removed from pandas;
    # the lookup is by player name (the index) and column label
    value = AllPlayerSummary.loc[Player,"Years Until Top 50"]
    if type(value) == float:
        return int(value)
    else:
        return value
YearsUntilTop50("Roger Federer")
Out[57]:
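One case the lookup above may need to handle is a player with no recorded top 50 ranking. The sketch below uses a toy table in place of AllPlayerSummary. The "Never" marker and the missing value are assumptions made for illustration, since the report does not show how such players are stored, and years_until_top50 is a hypothetical variant rather than the function above.

```python
import pandas as pd

# Toy stand-in for AllPlayerSummary; the "Never" marker and the NaN entry are
# assumptions for illustration, not values taken from the report's data.
toy_summary = pd.DataFrame(
    {"Years Until Top 50": [2.0, "Never", float("nan")]},
    index=["Player A", "Player B", "Player C"],
)


def years_until_top50(summary, player):
    """Return whole years as an int; pass through anything that is not a number."""
    value = summary.loc[player, "Years Until Top 50"]
    # int(NaN) raises ValueError, so missing values are returned unchanged.
    if isinstance(value, float) and not pd.isna(value):
        return int(value)
    return value


print(years_until_top50(toy_summary, "Player A"))
print(years_until_top50(toy_summary, "Player B"))
```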
The final function summarizes all the data seen thus far in this report in a single table. We can now compare any tennis player to the average statistics of players who eventually became top 10 or top 20 players in the world. For young tennis players, this could be used to gauge success early in their careers.
In [58]:
def WillHeBeGreat(Player):
    # First row: the selected player. Second and third rows: averages for
    # eventual top 20 and top 10 players.
    Will = pd.DataFrame({'Player':[Player,"Average for Top 20 Players from 1995 to 2015",
            "Average for Top 10 Players from 1995 to 2015"],
        'Pro Years Until Top 50 Ranking':[YearsUntilTop50(Player),
            Top20RankingProgression["Years Until Top 50"].mean(),
            Top10RankingProgression["Years Until Top 50"].mean()],
        'Top 10 Wins First 3 Pro Years':[Top10WinsFirst3ProYears(Player),
            Top20Top10Wins["Total First 3 Years"].mean(),
            Top10Top10Wins["Total First 3 Years"].mean()],
        'Top 10 Wins First 5 Pro Years':[Top10WinsFirst5ProYears(Player),
            Top20Top10Wins["Total First 5 Years"].mean(),
            Top10Top10Wins["Total First 5 Years"].mean()],
        'Top 50 Wins First 3 Pro Years':[Top50WinsFirst3ProYears(Player),
            Top20Top50Wins["Total First 3 Years"].mean(),
            Top10Top50Wins["Total First 3 Years"].mean()],
        'Top 50 Wins First 5 Pro Years':[Top50WinsFirst5ProYears(Player),
            Top20Top50Wins["Total First 5 Years"].mean(),
            Top10Top50Wins["Total First 5 Years"].mean()],
        'Tournament Wins First 3 Pro Years':[TourneyWinsFirst3ProYears(Player),
            Top20TournamentWins["Total First 3 Years"].mean(),
            Top10TournamentWins["Total First 3 Years"].mean()],
        'Tournament Wins First 5 Pro Years':[TourneyWinsFirst5ProYears(Player),
            Top20TournamentWins["Total First 5 Years"].mean(),
            Top10TournamentWins["Total First 5 Years"].mean()],
        'Year Turned Pro':[YearTurnedPro(Player),"",""]})
    ColumnOrder = ['Year Turned Pro','Pro Years Until Top 50 Ranking','Top 10 Wins First 3 Pro Years',
        'Top 10 Wins First 5 Pro Years','Top 50 Wins First 3 Pro Years',
        'Top 50 Wins First 5 Pro Years','Tournament Wins First 3 Pro Years',
        'Tournament Wins First 5 Pro Years']
    Will = Will.set_index("Player")
    Will = Will[ColumnOrder]
    Will = Will.round(2)
    return Will
In [59]:
WillHeBeGreat("Andy Murray")
Out[59]:
Based on the information above, Andy Murray had an incredible start to his career. For example, it took him only 2 years to achieve a top 50 ranking, and he had 35 victories against top 10 opponents in his first 5 professional seasons (compared to an average of 10 for players who eventually became top 10 in the world). Looking at this table, it is not surprising that Andy Murray eventually became a top 5 player in the world with multiple Grand Slam titles. The following function allows us to see the same information in a more visual format:
In [60]:
def WillHeBeGreatGraph(Player):
    # Set the style before plotting so that it applies to this chart
    plt.style.use("ggplot")
    WillHeBeGreat(Player).plot(y=["Pro Years Until Top 50 Ranking",
                                  "Top 10 Wins First 3 Pro Years",
                                  "Top 10 Wins First 5 Pro Years",
                                  "Tournament Wins First 3 Pro Years",
                                  "Tournament Wins First 5 Pro Years"],kind="barh",figsize=(10,5))
    plt.ylabel("")
    plt.xlabel("Years or # of Victories")
    plt.title("Early Career of Selected Player vs. \nEarly Careers of Eventual Top 20 or Top 10 Players")
WillHeBeGreatGraph("Andy Murray")
As another example, we can look at the career of Steve Johnson. Steve Johnson is a perennial top 100 player in the world, but has struggled to make a significant impact on the ATP tour. Looking at his early career, it is not surprising that he has had difficulty becoming a top 20 player.
In [61]:
WillHeBeGreatGraph("Steve Johnson")
Finally, we can look at the early statistics for Roger Federer, one of the greatest men's tennis players of all time. As the graph illustrates, his play in the first few years of his career, particularly with respect to victories against top 10 opponents, made it clear that he would become a phenomenal tennis player someday.
In [62]:
WillHeBeGreatGraph("Roger Federer")
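The comparisons read off the charts above can also be expressed numerically by dividing a player's row by one of the average rows. The sketch below uses a toy table shaped like the output of WillHeBeGreat. The player name, the numbers, and the ratio_to_average helper are all invented for illustration.

```python
import pandas as pd

# Toy table shaped like the WillHeBeGreat output; all numbers are invented.
comparison = pd.DataFrame(
    {
        "Pro Years Until Top 50 Ranking": [2.0, 4.0, 3.0],
        "Top 10 Wins First 5 Pro Years": [12.0, 6.0, 8.0],
    },
    index=[
        "Player A",
        "Average for Top 20 Players from 1995 to 2015",
        "Average for Top 10 Players from 1995 to 2015",
    ],
)


def ratio_to_average(table, player, average_label):
    """Divide each of the player's stats by the matching average."""
    return table.loc[player] / table.loc[average_label]


ratios = ratio_to_average(
    comparison, "Player A", "Average for Top 10 Players from 1995 to 2015"
)
print(ratios.round(2))
```

A ratio above 1 is favorable for the victory counts, while for "Pro Years Until Top 50 Ranking" a ratio below 1 means the player reached the top 50 faster than average.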
Ultimately, my analysis demonstrates what it takes for a young tennis player to have legitimate expectations of long-term success on the men's tour. While recent dominance at the top of the rankings has given young players some leeway when analyzing their early stats, the inevitable retirements of top players like Rafael Nadal and Roger Federer will greatly increase expectations for young tennis players. Tennis is an incredibly expensive sport to play, and players cannot afford to simply survive on tour while ranked outside the top 100. Analyzing the early careers of tennis stars is another way for a young player to gauge success and to develop reasonable goals and expectations. This report consolidates that analysis in one place. It is no guarantee that a player will or will not eventually become a dominant player in the sport, but understanding the careers of players who went on to have great success can help a young player better understand what he needs to accomplish in order to excel in the sport.