Indicators of Future Success in Men's Professional Tennis

May 2016
Written by John Ockay at NYU's Stern School of Business
Contact: jfo262@stern.nyu.edu

Project Description

In 2015, Novak Djokovic earned over $16 million through tennis alone, while Dudi Sela, the 100th best tennis player in the world, earned just over $300 thousand. In 2010, the U.S. Tennis Association reported that the annual average cost to be a “highly competitive” professional tennis player was $143 thousand. After accounting for such expenses, it is clear that a player who cannot stay within the top 100 on a consistent basis will find it very difficult to survive financially on tennis alone.

My project uses historical tennis data to investigate key indicators of future success in men's professional tennis. Before continuing to invest hundreds of thousands of dollars in his career, a young player can compare his statistics with the early careers of tennis stars (i.e., those who have achieved at least a top 20 ranking at some point during their careers). For the purposes of this project, I analyzed the early career statistics of every player who achieved a top 20 or top 10 ranking between 1995 and 2015. I looked at the following key indicators:

1) Professional seasons until a top 50 ranking: For those who became top 20 or top 10 players between 1995 and 2015, how many years did it take for them to achieve a top 50 ranking after they turned pro? Young tennis players can use this metric to compare their ranking progress with the early progress of successful tennis players.

2) Top 10 Victories: How many top 10 victories did tennis stars have in their first 3 and their first 5 professional seasons?

3) Top 50 Victories: How many top 50 victories did tennis stars have in their first 3 and their first 5 professional seasons?

4) Tournament Victories: How many tournament victories did tennis stars have in their first 3 and their first 5 professional seasons?

For each indicator, I also broke the data down based on when a top 20 or top 10 player between 1995 and 2015 turned pro. I compared the early careers of tennis stars who turned pro before 1995, between 1995 and 1999, between 2000 and 2004, and after 2004. As we will see in the graphics below, young players today may not be expected to progress as quickly or have as many top 10 or tournament victories as players who turned pro before 1995.

Ultimately, I provide the means to compare the stats of any tennis player on the ATP tour with the average, historical data of players who eventually became top 20 or top 10 players in the world.

I used the following Python packages to import both online and local Excel data and to develop relevant graphics:


In [1]:
%matplotlib inline                  
import pandas as pd                 
import matplotlib.pyplot as plt     
import xlrd

Data (Part I)

To complete this project, I used data from two different sources. The first is Jeff Sackmann, an author and entrepreneur who has worked in the fields of sports statistics and test preparation and has a particular interest in tennis statistics. He created a website called TennisAbstract that contains historical match data and other tennis analytics, and he has uploaded match data from 1968 to 2016 to GitHub in CSV format. I pulled the data from GitHub into Python in order to conduct the analysis for my project. I imported match data from 1984 to 2015, combined the data into one large table, removed unnecessary columns, and reduced the date column to the year in order to simplify filtering. This process is outlined below:


In [2]:
url2015 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2015.csv'
tennis2015 = pd.read_csv(url2015)
url2014 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2014.csv'
tennis2014 = pd.read_csv(url2014)
url2013 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2013.csv'
tennis2013 = pd.read_csv(url2013)
url2012 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2012.csv'
tennis2012 = pd.read_csv(url2012)
url2011 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2011.csv'
tennis2011 = pd.read_csv(url2011)
url2010 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2010.csv'
tennis2010 = pd.read_csv(url2010)
url2009 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2009.csv'
tennis2009 = pd.read_csv(url2009)
url2008 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2008.csv'
tennis2008 = pd.read_csv(url2008)
url2007 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2007.csv'
tennis2007 = pd.read_csv(url2007)
url2006 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2006.csv'
tennis2006 = pd.read_csv(url2006)
url2005 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2005.csv'
tennis2005 = pd.read_csv(url2005)
url2004 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2004.csv'
tennis2004 = pd.read_csv(url2004)
url2003 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2003.csv'
tennis2003 = pd.read_csv(url2003)
url2002 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2002.csv'
tennis2002 = pd.read_csv(url2002)
url2001 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2001.csv'
tennis2001 = pd.read_csv(url2001)
url2000 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2000.csv'
tennis2000 = pd.read_csv(url2000)
url1999 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1999.csv'
tennis1999 = pd.read_csv(url1999)
url1998 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1998.csv'
tennis1998 = pd.read_csv(url1998)
url1997 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1997.csv'
tennis1997 = pd.read_csv(url1997)
url1996 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1996.csv'
tennis1996 = pd.read_csv(url1996)
url1995 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1995.csv'
tennis1995 = pd.read_csv(url1995)
url1994 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1994.csv'
tennis1994 = pd.read_csv(url1994)
url1993 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1993.csv'
tennis1993 = pd.read_csv(url1993)
url1992 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1992.csv'
tennis1992 = pd.read_csv(url1992)
url1991 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1991.csv'
tennis1991 = pd.read_csv(url1991)
url1990 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1990.csv'
tennis1990 = pd.read_csv(url1990)
url1989 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1989.csv'
tennis1989 = pd.read_csv(url1989)
url1988 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1988.csv'
tennis1988 = pd.read_csv(url1988)
url1987 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1987.csv'
tennis1987 = pd.read_csv(url1987)
url1986 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1986.csv'
tennis1986 = pd.read_csv(url1986)
url1985 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1985.csv'
tennis1985 = pd.read_csv(url1985)
url1984 = 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1984.csv'
tennis1984 = pd.read_csv(url1984)

In [3]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the yearly frames
# in the same order and keeps each file's original row index.
tennisALL = pd.concat([tennis1984, tennis1985, tennis1986, tennis1987, tennis1988,
                       tennis1989, tennis1990, tennis1991, tennis1992,
                       tennis1993, tennis1994, tennis1995, tennis1996,
                       tennis1997, tennis1998, tennis1999, tennis2000,
                       tennis2001, tennis2002, tennis2003, tennis2004,
                       tennis2005, tennis2006, tennis2007, tennis2008,
                       tennis2009, tennis2010, tennis2011, tennis2012,
                       tennis2013, tennis2014, tennis2015])
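
The year-by-year loading above follows a single URL pattern, so it could also be written as a loop. The following is only a sketch of that idea, not the code used for this project: the names `BASE_URL`, `match_url`, and `load_matches` are hypothetical, and the `reader` parameter is an assumption added so the function can be tried without network access.

```python
import pandas as pd

# URL pattern taken from the per-year URLs listed above
BASE_URL = ("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/"
            "master/atp_matches_{year}.csv")

def match_url(year):
    """Build the CSV URL for one season (hypothetical helper)."""
    return BASE_URL.format(year=year)

def load_matches(first_year, last_year, reader=pd.read_csv):
    """Read every season from first_year to last_year and stack the results.

    reader defaults to pd.read_csv; passing another callable is an
    assumption made here purely for offline testing.
    """
    frames = [reader(match_url(year)) for year in range(first_year, last_year + 1)]
    return pd.concat(frames)

# Requires network access:
# tennisALL = load_matches(1984, 2015)
```

Like the cell above, this keeps each file's own row index, so index values repeat across seasons.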

In [4]:
tennisALL.head(2)


Out[4]:
tourney_id tourney_name surface draw_size tourney_level tourney_date match_num winner_id winner_seed winner_entry ... w_bpFaced l_ace l_df l_svpt l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced
0 1984-610 Dallas WCT Carpet 16 A 19840424 1 100342 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1984-610 Dallas WCT Carpet 16 A 19840424 2 100529 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 49 columns


In [5]:
tennisALL = tennisALL.drop(["tourney_id","draw_size","tourney_level","match_num","winner_id",
                            "winner_seed","winner_entry","winner_hand","winner_ht","winner_ioc",
                            "winner_age","winner_rank_points","loser_id","loser_seed","loser_entry",
                            "loser_hand","loser_ht","loser_ioc","loser_age","loser_rank_points","score",
                            "best_of","minutes","w_ace","w_df","w_svpt","w_1stIn","w_1stWon","w_2ndWon",
                            "w_SvGms","w_bpSaved","w_bpFaced","l_ace","l_df","l_svpt","l_1stIn","l_1stWon",
                            "l_2ndWon","l_SvGms","l_bpSaved","l_bpFaced"],axis=1)
tennisALL["tourney_date"] = tennisALL["tourney_date"].astype(str)
tennisALL["tourney_date"] = tennisALL["tourney_date"].str[:4]
tennisALL["tourney_date"] = tennisALL["tourney_date"].astype(int)
tennisALL


Out[5]:
tourney_name surface tourney_date winner_name winner_rank loser_name loser_rank round
0 Dallas WCT Carpet 1984 Vitas Gerulaitis 20.0 Bill Scanlon 12.0 R16
1 Dallas WCT Carpet 1984 Kevin Curren 9.0 Mark Dickson 68.0 R16
2 Dallas WCT Carpet 1984 Eliot Teltscher 13.0 Henrik Sundstrom 23.0 R16
3 Dallas WCT Carpet 1984 Tim Mayotte 16.0 Tomas Smid 17.0 R16
4 Dallas WCT Carpet 1984 John Mcenroe 1.0 Vitas Gerulaitis 20.0 QF
5 Dallas WCT Carpet 1984 Kevin Curren 9.0 Johan Kriek 15.0 QF
6 Dallas WCT Carpet 1984 Jimmy Arias 6.0 Eliot Teltscher 13.0 QF
7 Dallas WCT Carpet 1984 Jimmy Connors 3.0 Tim Mayotte 16.0 QF
8 Dallas WCT Carpet 1984 John Mcenroe 1.0 Kevin Curren 9.0 SF
9 Dallas WCT Carpet 1984 Jimmy Connors 3.0 Jimmy Arias 6.0 SF
10 Dallas WCT Carpet 1984 John Mcenroe 1.0 Jimmy Connors 3.0 F
11 Richmond WCT Carpet 1984 John Mcenroe 1.0 Rodney Harmon 115.0 R16
12 Richmond WCT Carpet 1984 Stefan Edberg 53.0 Fritz Buehning 45.0 R16
13 Richmond WCT Carpet 1984 Vitas Gerulaitis 20.0 Aaron Krickstein 94.0 R16
14 Richmond WCT Carpet 1984 Eric Korita 48.0 Tom Cain 118.0 R16
15 Richmond WCT Carpet 1984 Greg Holmes 147.0 Kevin Curren 9.0 R16
16 Richmond WCT Carpet 1984 Mark Dickson 68.0 Bill Scanlon 12.0 R16
17 Richmond WCT Carpet 1984 Steve Denton 58.0 Manuel Orantes 74.0 R16
18 Richmond WCT Carpet 1984 Jimmy Arias 6.0 Vijay Amritraj 144.0 R16
19 Richmond WCT Carpet 1984 John Mcenroe 1.0 Stefan Edberg 53.0 QF
20 Richmond WCT Carpet 1984 Vitas Gerulaitis 20.0 Eric Korita 48.0 QF
21 Richmond WCT Carpet 1984 Mark Dickson 68.0 Greg Holmes 147.0 QF
22 Richmond WCT Carpet 1984 Steve Denton 58.0 Jimmy Arias 6.0 QF
23 Richmond WCT Carpet 1984 John Mcenroe 1.0 Vitas Gerulaitis 20.0 SF
24 Richmond WCT Carpet 1984 Steve Denton 58.0 Mark Dickson 68.0 SF
25 Richmond WCT Carpet 1984 John Mcenroe 1.0 Steve Denton 58.0 F
26 Taipei Carpet 1984 Brad Gilbert 31.0 Jim Gurfein 238.0 R32
27 Taipei Carpet 1984 Drew Gitlin 165.0 Jay Lapidus 86.0 R32
28 Taipei Carpet 1984 Kelvin Belcher 235.0 Tom Cain 223.0 R32
29 Taipei Carpet 1984 Marty Davis 57.0 Pat Dupre 235.0 R32
... ... ... ... ... ... ... ... ...
2928 Davis Cup G1 PO: THA vs CHN Hard 2015 Zhe Li 299.0 Warit Sornbutnark 1101.0 RR
2929 Davis Cup WG PO: IND vs CZE Hard 2015 Lukas Rosol 84.0 Yuki Bhambri 125.0 RR
2930 Davis Cup WG PO: IND vs CZE Hard 2015 Somdev Devvarman 164.0 Jiri Vesely 40.0 RR
2931 Davis Cup WG PO: IND vs CZE Hard 2015 Jiri Vesely 40.0 Yuki Bhambri 125.0 RR
2932 Davis Cup WG PO: SUI vs NED Hard 2015 Stanislas Wawrinka 4.0 Thiemo De Bakker 144.0 RR
2933 Davis Cup WG PO: SUI vs NED Hard 2015 Roger Federer 2.0 Jesse Huta Galung 436.0 RR
2934 Davis Cup WG PO: SUI vs NED Hard 2015 Roger Federer 2.0 Thiemo De Bakker 144.0 RR
2935 Davis Cup WG PO: SUI vs NED Hard 2015 Henri Laaksonen 359.0 Tim Van Rijthoven 1180.0 RR
2936 Davis Cup WG PO: RUS vs ITA Hard 2015 Teymuraz Gabashvili 58.0 Simone Bolelli 63.0 RR
2937 Davis Cup WG PO: RUS vs ITA Hard 2015 Fabio Fognini 28.0 Andrey Rublev 176.0 RR
2938 Davis Cup WG PO: RUS vs ITA Hard 2015 Fabio Fognini 28.0 Teymuraz Gabashvili 58.0 RR
2939 Davis Cup WG PO: RUS vs ITA Hard 2015 Paolo Lorenzi 88.0 Konstantin Kravchuk 151.0 RR
2940 Davis Cup WG PO: UZB vs USA Clay 2015 Denis Istomin 62.0 Steve Johnson 47.0 RR
2941 Davis Cup WG PO: UZB vs USA Clay 2015 Jack Sock 29.0 Farrukh Dustov 158.0 RR
2942 Davis Cup WG PO: UZB vs USA Clay 2015 Jack Sock 29.0 Denis Istomin 62.0 RR
2943 Davis Cup WG PO: COL vs JPN Clay 2015 Santiago Giraldo 59.0 Taro Daniel 124.0 RR
2944 Davis Cup WG PO: COL vs JPN Clay 2015 Kei Nishikori 6.0 Alejandro Falla 123.0 RR
2945 Davis Cup WG PO: COL vs JPN Clay 2015 Kei Nishikori 6.0 Santiago Giraldo 59.0 RR
2946 Davis Cup WG PO: COL vs JPN Clay 2015 Taro Daniel 124.0 Alejandro Falla 123.0 RR
2947 Davis Cup WG PO: DOM vs GER Hard 2015 Victor Estrella 57.0 Dustin Brown 108.0 RR
2948 Davis Cup WG PO: DOM vs GER Hard 2015 Philipp Kohlschreiber 34.0 Jose Hernandez 200.0 RR
2949 Davis Cup WG PO: DOM vs GER Hard 2015 Philipp Kohlschreiber 34.0 Victor Estrella 57.0 RR
2950 Davis Cup WG PO: DOM vs GER Hard 2015 Benjamin Becker 73.0 Roberto Cid 928.0 RR
2951 Davis Cup WG PO: BRA vs CRO Clay 2015 Thomaz Bellucci 30.0 Mate Delic 499.0 RR
2952 Davis Cup WG PO: BRA vs CRO Clay 2015 Borna Coric 33.0 Joao Souza 104.0 RR
2953 Davis Cup WG PO: BRA vs CRO Clay 2015 Borna Coric 33.0 Thomaz Bellucci 30.0 RR
2954 Davis Cup WG PO: POL vs SVK Hard 2015 Martin Klizan 36.0 Michal Przysiezny 143.0 RR
2955 Davis Cup WG PO: POL vs SVK Hard 2015 Jerzy Janowicz 61.0 Norbert Gombos 121.0 RR
2956 Davis Cup WG PO: POL vs SVK Hard 2015 Martin Klizan 36.0 Jerzy Janowicz 61.0 RR
2957 Davis Cup WG PO: POL vs SVK Hard 2015 Michal Przysiezny 143.0 Norbert Gombos 121.0 RR

107629 rows × 8 columns
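
The column trimming and date adjustment can also be expressed by naming the eight columns to keep and dividing the integer date by 10000. This is a minimal sketch under stated assumptions: `KEEP_COLUMNS`, `raw`, and `slim` are hypothetical names, and the two sample rows are copied from the first two matches shown above.

```python
import pandas as pd

# The eight columns that remain in the trimmed table above
KEEP_COLUMNS = ["tourney_name", "surface", "tourney_date", "winner_name",
                "winner_rank", "loser_name", "loser_rank", "round"]

# Two sample rows plus one column (match_num) that should be discarded
raw = pd.DataFrame({
    "tourney_name": ["Dallas WCT", "Dallas WCT"],
    "surface": ["Carpet", "Carpet"],
    "tourney_date": [19840424, 19840424],
    "winner_name": ["Vitas Gerulaitis", "Kevin Curren"],
    "winner_rank": [20.0, 9.0],
    "loser_name": ["Bill Scanlon", "Mark Dickson"],
    "loser_rank": [12.0, 68.0],
    "round": ["R16", "R16"],
    "match_num": [1, 2],
})

# Select by name instead of listing every column to drop
slim = raw[KEEP_COLUMNS].copy()

# YYYYMMDD stored as an integer: integer division by 10000 leaves the year
slim["tourney_date"] = slim["tourney_date"] // 10000
```

Selecting the columns to keep is shorter than listing the 41 to drop, but it fails with a KeyError if a kept column is missing from the source files, which may or may not be the behavior you want.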

I utilized the above data to develop functions that allowed me to easily obtain and organize any tennis player's early career statistics, such as top 10 or top 50 victories. Those functions are listed below with Roger Federer used as an example:

CareerWins: total career victories for any given tennis player (counted within the 1984 to 2015 match data loaded above)


In [6]:
def CareerWins(player):
    return len(tennisALL[(tennisALL['winner_name']==player)])

CareerWins("Roger Federer")


Out[6]:
1069

YearWins: victories within a specific season for any given tennis player


In [7]:
def YearWins(player,year):
    return len(tennisALL[(tennisALL['tourney_date']==year) & (tennisALL['winner_name']==player)])

YearWins("Roger Federer",2014)


Out[7]:
71

CareerTopTen: total career top 10 victories for any given tennis player


In [8]:
def CareerTopTen(player):
    return len(tennisALL[(tennisALL['loser_rank']<=10) & (tennisALL['winner_name']==player)])

CareerTopTen("Roger Federer")


Out[8]:
198

YearTopTen: top ten victories within a specific season for any given tennis player


In [9]:
def YearTopTen(player,year):
    return len(tennisALL[(tennisALL['loser_rank']<=10)&(tennisALL['winner_name']==player)
                        &(tennisALL['tourney_date']==year)])

YearTopTen("Roger Federer",2014)


Out[9]:
17

CareerTopFifty: total career top 50 victories for any given tennis player


In [10]:
def CareerTopFifty(player):
    return len(tennisALL[(tennisALL['loser_rank']<=50) & (tennisALL['winner_name']==player)])

CareerTopFifty("Roger Federer")


Out[10]:
671

YearTopFifty: top fifty victories within a specific season for any given tennis player


In [11]:
def YearTopFifty(player,year):
    return len(tennisALL[(tennisALL['loser_rank']<=50)&(tennisALL['winner_name']==player)
                        &(tennisALL['tourney_date']==year)])

YearTopFifty("Roger Federer",2014)


Out[11]:
49

CareerTourneyWins: total career tournament victories for any given tennis player


In [12]:
def CareerTourneyWins(player):
    return len(tennisALL[(tennisALL['round']=='F') & (tennisALL['winner_name']==player)])

CareerTourneyWins("Roger Federer")


Out[12]:
88

YearTourneyWins: tournament victories within a specific season for any given tennis player


In [13]:
def YearTourneyWins(player,year):
    return len(tennisALL[(tennisALL['round']=='F')&(tennisALL['winner_name']==player)
                        &(tennisALL['tourney_date']==year)])

YearTourneyWins("Roger Federer",2014)


Out[13]:
5
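
The eight functions above differ only in which filters they apply, so they could be folded into one helper with optional arguments. The sketch below is an illustration rather than part of the original analysis: `count_wins` and its parameters are hypothetical names, and the toy table uses placeholder players instead of real match data.

```python
import pandas as pd

def count_wins(matches, player, year=None, max_loser_rank=None, final_only=False):
    """Count a player's wins, optionally filtered by season, opponent rank, and round."""
    mask = matches["winner_name"] == player
    if year is not None:
        mask &= matches["tourney_date"] == year
    if max_loser_rank is not None:
        # Comparisons with a missing rank (NaN) evaluate to False, so unranked losers are excluded
        mask &= matches["loser_rank"] <= max_loser_rank
    if final_only:
        mask &= matches["round"] == "F"
    return int(mask.sum())

# Toy data with placeholder names, using the same columns as tennisALL
toy = pd.DataFrame({
    "tourney_date": [2014, 2014, 2015, 2015],
    "winner_name": ["Player A", "Player A", "Player A", "Player B"],
    "loser_name": ["Player B", "Player C", "Player B", "Player A"],
    "loser_rank": [8.0, 40.0, float("nan"), 3.0],
    "round": ["F", "SF", "R16", "F"],
})

career_wins = count_wins(toy, "Player A")                       # like CareerWins
year_top_ten = count_wins(toy, "Player A", 2014, 10)            # like YearTopTen
titles = count_wins(toy, "Player A", final_only=True)           # like CareerTourneyWins
```

With the real data, `count_wins(tennisALL, "Roger Federer", 2014, 10)` would play the role of `YearTopTen("Roger Federer",2014)`.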

Data (Part II)

While the data from Jeff Sackmann allowed me to easily collect data for individual tennis players, I needed an Excel document to better organize summary data for top 20 and top 10 players between 1995 and 2015. In addition, I collected (from atpworld.com) and organized historical rankings data and the years in which players turned pro in order to investigate players' rankings progression. The Excel document contains the following worksheets:

  1. All Player Summary: year-by-year historical rankings for all tennis players; year turned pro; years it took players to obtain top 50 status
  2. Top 10 Summary (1995 Onward): year-by-year historical rankings for players who achieved top 10 status between 1995 and 2015
  3. Top 10 Ranking Progression: years it took to achieve top 50 status after turning pro for players who eventually became top 10 in the world
  4. Top 10 Top 10 Wins: summary of top 10 victories in first 3 and 5 professional seasons for players who achieved top 10 status between 1995 and 2015
  5. Top 10 Top 50 Wins: summary of top 50 victories in first 3 and 5 professional seasons for players who achieved top 10 status between 1995 and 2015
  6. Top 10 Tournament Wins: summary of tournament victories in first 3 and 5 professional seasons for players who achieved top 10 status between 1995 and 2015
  7. Top 20 Summary (1995 Onward): year-by-year historical rankings for players who achieved top 20 status between 1995 and 2015
  8. Top 20 Ranking Progression: years it took to achieve top 50 status after turning pro for players who eventually became top 20 in the world
  9. Top 20 Top 10 Wins: summary of top 10 victories in first 3 and 5 professional seasons for players who achieved top 20 status between 1995 and 2015
  10. Top 20 Top 50 Wins: summary of top 50 victories in first 3 and 5 professional seasons for players who achieved top 20 status between 1995 and 2015
  11. Top 20 Tournament Wins: summary of tournament victories in first 3 and 5 professional seasons for players who achieved top 20 status between 1995 and 2015

In [14]:
path = ("C:/Users/John/Desktop/Data_Bootcamp/Project Data/ATP Tennis Data (1973-2016).xlsx")
TennisData = pd.ExcelFile(path)

In [15]:
Top20Summary = TennisData.parse("Top 20 Summary (1995 Onward)").set_index("Player")
Top20RankingProgression = TennisData.parse("Top 20 Ranking Progression").set_index("Player")
Top10RankingProgression = TennisData.parse("Top 10 Ranking Progression").set_index("Player")
Top20Top10Wins = TennisData.parse("Top 20 Top 10 Wins").set_index("Player")
Top10Top10Wins = TennisData.parse("Top 10 Top 10 Wins").set_index("Player")
Top20Top50Wins = TennisData.parse("Top 20 Top 50 Wins").set_index("Player")
Top10Top50Wins = TennisData.parse("Top 10 Top 50 Wins").set_index("Player")
Top20TournamentWins = TennisData.parse("Top 20 Tournament Wins").set_index("Player")
Top10TournamentWins = TennisData.parse("Top 10 Tournament Wins").set_index("Player")
AllPlayerSummary = TennisData.parse("All Player Summary").set_index("Player")

The following cells illustrate some of the data contained within the Excel document:


In [16]:
Top20Summary = Top20Summary.fillna("")
Top20Summary.head(5)


Out[16]:
1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 Year Turned Pro
Player
Albert Costa 221 ... 1993
Albert Portas ... 1994
Alex Corretja 235 86 76 ... 1991
Alexandr Dolgopolov ... 309 131 48 15 18 57 23 36 28 2006
Andre Agassi 91 25 3 7 4 10 9 24 ... 1986

5 rows × 34 columns


In [17]:
Top20RankingProgression.head(5)


Out[17]:
Year Turned Pro Year 1 Year 2 Year 3 Year 4 Year 5 Year 6 Years Until Top 50
Player
Albert Costa 1993 221 52 24 13 19 14 3
Albert Portas 1994 269 119 182 35 84 90 4
Alex Corretja 1991 235 86 76 22 48 23 4
Alexandr Dolgopolov 2006 265 233 309 131 48 15 5
Andre Agassi 1986 91 25 3 7 4 10 2

In [18]:
Top10Top10Wins.head(5)


Out[18]:
Year Turned Pro Year 1 Year 2 Year 3 Year 4 Year 5 Total First 3 Years Total First 5 Years
Player
Albert Costa 1993 0 1 2 3 5 3 11
Alex Corretja 1991 0 0 0 3 2 0 5
Andre Agassi 1986 0 0 2 1 4 2 7
Andrei Medvedev 1991 0 1 6 2 2 7 11
Andy Murray 2005 0 4 5 12 14 9 35

In [19]:
AllPlayerSummary = AllPlayerSummary[["Years Until Top 50","Year Turned Pro"]]
AllPlayerSummary = AllPlayerSummary.fillna("Never")
AllPlayerSummary.head(5)


Out[19]:
Years Until Top 50 Year Turned Pro
Player
A Benson Never 1983
A Escofet Never 1978
A Noffat Never 1983
A Uehara Never 1975
Pete Sampras 3 1988

In order to investigate a player's early career, it is essential to identify when that player turned professional. I developed the following function based on the data within the Excel document to facilitate that process. Again, Roger Federer is used as an example.


In [20]:
def YearTurnedPro(Player):
    # .ix was removed in pandas 1.0; .loc performs the same label-based lookup
    return AllPlayerSummary.loc[Player,"Year Turned Pro"]

YearTurnedPro("Roger Federer")


Out[20]:
1998

In [21]:
def RankingProgression(Player):
    # .ix was removed in pandas 1.0; .loc performs the same label-based lookup
    first_five = ["Year 1","Year 2","Year 3","Year 4","Year 5"]
    print("Player Name:",Player)
    print("Year Turned Pro:",YearTurnedPro(Player))
    print("Rankings First Five Pro Years:",
          Top20RankingProgression.loc[Player,first_five].tolist())
    
RankingProgression("Roger Federer")


Player Name: Roger Federer
Year Turned Pro: 1998
Rankings First Five Pro Years: [301, 64, 29, 13, 6]
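
Once the year a player turned pro is known, per-season counts for the first few professional seasons can be derived from the match data. The sketch below shows one plausible way to do this and is not necessarily how the worksheets were filled in: `early_career_top_wins` is a hypothetical name, Year 1 is assumed to be the calendar year in which the player turned pro (consistent with the Federer output above), and the toy matches are invented.

```python
import pandas as pd

def early_career_top_wins(matches, player, year_turned_pro, seasons=5, max_loser_rank=10):
    """Wins per season over opponents ranked max_loser_rank or better.

    Season 1 is assumed to be the calendar year the player turned pro.
    """
    counts = []
    for offset in range(seasons):
        season = year_turned_pro + offset
        in_season = matches[(matches["winner_name"] == player)
                            & (matches["tourney_date"] == season)
                            & (matches["loser_rank"] <= max_loser_rank)]
        counts.append(len(in_season))
    return counts

# Invented matches for a placeholder player who turned pro in 2001
toy_matches = pd.DataFrame({
    "tourney_date": [2001, 2002, 2002, 2003, 2005, 2006],
    "winner_name": ["Player A"] * 6,
    "loser_rank": [7.0, 12.0, 3.0, 10.0, 5.0, 1.0],
})

per_season = early_career_top_wins(toy_matches, "Player A", 2001)
total_first_3 = sum(per_season[:3])
total_first_5 = sum(per_season)
```

The two totals correspond to the "Total First 3 Years" and "Total First 5 Years" columns in the wins worksheets; the 2006 match falls outside the five-season window and is ignored.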

As mentioned in the introduction, a primary goal for this project was to compare the stats of young tennis players to the early career stats of players who eventually achieved top 20 or top 10 status. Utilizing data from the Excel document, the following cells include my analysis of the early careers of players who achieved top 20 or top 10 status between the years of 1995 and 2015.

First, I compared how long, on average, it took players who eventually became top 20 or top 10 players to achieve top 50 status upon turning pro. For all analyses below, in addition to comparing the top 20 group to the top 10 group, I also made comparisons based on the year in which tennis players turned pro.


In [64]:
# Years are compared against scalars; comparing a Series with a one-element list
# such as [1995] raises a length-mismatch error in current pandas.
YearsUntilTop50Comparison = pd.DataFrame({'Year Turned Pro':['Before 1995','Between 1995 and 1999',
                                                             'Between 2000 and 2004','After 2004'],
    'Avg Years to Reach Top 50 (For Eventual Top 20 Players)':
        [Top20RankingProgression[(Top20RankingProgression['Year Turned Pro']<1995)]
         ["Years Until Top 50"].mean(),
        Top20RankingProgression[(Top20RankingProgression['Year Turned Pro']>=1995)&
                         (Top20RankingProgression['Year Turned Pro']<2000)]["Years Until Top 50"].mean(),
        Top20RankingProgression[(Top20RankingProgression['Year Turned Pro']>=2000)&
                          (Top20RankingProgression['Year Turned Pro']<2005)]["Years Until Top 50"].mean(),
       Top20RankingProgression[(Top20RankingProgression['Year Turned Pro']>=2005)]
         ["Years Until Top 50"].mean()],
    'Avg Years to Reach Top 50 (For Eventual Top 10 Players)':
        [Top10RankingProgression[(Top10RankingProgression['Year Turned Pro']<1995)]
         ["Years Until Top 50"].mean(),
        Top10RankingProgression[(Top10RankingProgression['Year Turned Pro']>=1995)&
                         (Top10RankingProgression['Year Turned Pro']<2000)]["Years Until Top 50"].mean(),
        Top10RankingProgression[(Top10RankingProgression['Year Turned Pro']>=2000)&
                         (Top10RankingProgression['Year Turned Pro']<2005)]["Years Until Top 50"].mean(),
       Top10RankingProgression[(Top10RankingProgression['Year Turned Pro']>=2005)]
         ["Years Until Top 50"].mean()]})

In [65]:
YearsUntilTop50Comparison = YearsUntilTop50Comparison.set_index("Year Turned Pro")
YearsUntilTop50ComparisonGraph = YearsUntilTop50Comparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Years")
plt.style.use('fivethirtyeight')
plt.legend(loc='lower right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("How Long Does it Take Tennis Stars to \nAchieve a Top 50 Ranking After Turning Pro?",fontsize=(16))


Out[65]:
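[Figure: line chart "How Long Does it Take Tennis Stars to Achieve a Top 50 Ranking After Turning Pro?" (x-axis: Year Turned Pro; y-axis: Years)]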

As seen above, tennis stars who turned pro before 1995 achieved top 50 status more quickly than players who turned pro after 2004 (3.04 versus 3.80 years on average for eventual top 10 players, and 3.72 versus 4.38 years for eventual top 20 players). As expected, players who eventually became top 10 achieved top 50 status more quickly than players who only became top 20 in the world. The following table and histogram present similar information:


In [24]:
YearsUntilTop50Comparison.round(2)


Out[24]:
                        Avg Years to Reach Top 50         Avg Years to Reach Top 50
Year Turned Pro         (For Eventual Top 10 Players)     (For Eventual Top 20 Players)
Before 1995             3.04                              3.72
Between 1995 and 1999   3.35                              4.40
Between 2000 and 2004   3.88                              4.22
After 2004              3.80                              4.38

In [25]:
Top20RankingProgression.plot(kind="hist",bins=10,y="Years Until Top 50")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Years It Took Top 20 Players from 1995-2015 to Reach Top 50\n\n",fontsize=(12))
plt.xlabel("\nYears",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))


Out[25]:
[Figure: histogram "Number of Years It Took Top 20 Players from 1995-2015 to Reach Top 50" (x-axis: Years; y-axis: Frequency)]
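
The era averages tabulated above can also be computed by binning "Year Turned Pro" and grouping, which avoids repeating the filter for each era. This is a sketch of an alternative approach, not the method used in this notebook: `era_means` is a hypothetical name, and the sample rows are the five players shown in the Top 20 Ranking Progression preview, so the averages below do not match the full-data table.

```python
import pandas as pd

ERA_LABELS = ["Before 1995", "Between 1995 and 1999",
              "Between 2000 and 2004", "After 2004"]

def era_means(frame, value_column):
    """Average value_column within each turned-pro era."""
    # Right-closed bins: (-inf, 1994], (1994, 1999], (1999, 2004], (2004, inf)
    eras = pd.cut(frame["Year Turned Pro"],
                  bins=[-float("inf"), 1994, 1999, 2004, float("inf")],
                  labels=ERA_LABELS)
    # observed=False keeps eras with no players (their mean is NaN)
    return frame.groupby(eras, observed=False)[value_column].mean()

# Five rows copied from the Top 20 Ranking Progression preview
sample = pd.DataFrame(
    {"Year Turned Pro": [1993, 1994, 1991, 2006, 1986],
     "Years Until Top 50": [3, 4, 4, 5, 2]},
    index=["Albert Costa", "Albert Portas", "Alex Corretja",
           "Alexandr Dolgopolov", "Andre Agassi"])

sample_means = era_means(sample, "Years Until Top 50")
```

Applied to the full worksheets, `era_means(Top20RankingProgression, "Years Until Top 50")` and the Top 10 equivalent would give the two columns of the comparison table.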

Second, I compared how many top 10 wins, on average, tennis stars had within 3 years of turning pro.


In [26]:
# Years are compared against scalars; comparing a Series with a one-element list
# such as [1995] raises a length-mismatch error in current pandas.
Top10WinsComparison = pd.DataFrame({'Year Turned Pro':['Before 1995','Between 1995 and 1999',
                                                             'Between 2000 and 2004','After 2004'],
    'Top 10 Wins in First 3 Pro Years (For Eventual Top 20 Players)':
        [Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']<1995)]["Total First 3 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']>=1995)&
                      (Top20Top10Wins['Year Turned Pro']<2000)]["Total First 3 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']>=2000)&
                      (Top20Top10Wins['Year Turned Pro']<2005)]["Total First 3 Years"].mean(),
       Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']>=2005)]["Total First 3 Years"].mean()],
    'Top 10 Wins in First 3 Pro Years (For Eventual Top 10 Players)':
        [Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']<1995)]["Total First 3 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']>=1995)&
                      (Top10Top10Wins['Year Turned Pro']<2000)]["Total First 3 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']>=2000)&
                      (Top10Top10Wins['Year Turned Pro']<2005)]["Total First 3 Years"].mean(),
       Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']>=2005)]["Total First 3 Years"].mean()]})

In [27]:
Top10WinsComparison = Top10WinsComparison.set_index("Year Turned Pro")
Top10WinsComparisonGraph = Top10WinsComparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 10 Victories")
plt.legend(loc='lower left',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 10 Victories Do Tennis Stars \nHave in First 3 Professional Seasons?",fontsize=(16))


Out[27]:
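[Figure: line chart "How Many Top 10 Victories Do Tennis Stars Have in First 3 Professional Seasons?" (x-axis: Year Turned Pro; y-axis: Top 10 Victories)]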

As seen above, players who turned pro prior to 2000 generally had more top 10 victories within their first 3 professional seasons than players who turned pro after 2000. Due to players like Andy Murray, who had incredible success early in his career, the trend has started to change over the last decade. The following table and histogram present similar information:


In [28]:
Top10WinsComparison.round(2)


Out[28]:
                        Top 10 Wins in First 3 Pro Years   Top 10 Wins in First 3 Pro Years
Year Turned Pro         (For Eventual Top 10 Players)      (For Eventual Top 20 Players)
Before 1995             3.50                               2.60
Between 1995 and 1999   3.53                               2.30
Between 2000 and 2004   0.81                               0.96
After 2004              2.60                               1.56

In [29]:
Top20Top10Wins.plot(kind="hist",bins=10,y="Total First 3 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 10 Victories in First 3 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 10 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))


Out[29]:
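[Figure: histogram "Number of Top 10 Victories in First 3 Pro Seasons for Eventual Top 20 Players" (x-axis: Top 10 Victories; y-axis: Frequency)]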
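
A histogram like the one above shows where a young player's own count falls within the distribution. One simple way to put a number on that position is the share of eventual stars who had no more wins than the player in question. The sketch below is an illustration under stated assumptions: `share_with_at_most` is a hypothetical helper, and the five values are the "Total First 3 Years" entries from the Top 10 Top 10 Wins preview, not the full data.

```python
import pandas as pd

def share_with_at_most(values, threshold):
    """Fraction of non-missing values that are less than or equal to threshold."""
    clean = pd.Series(values).dropna()
    return float((clean <= threshold).mean())

# "Total First 3 Years" for Costa, Corretja, Agassi, Medvedev, and Murray (preview rows only)
first_3_year_wins = [3, 0, 2, 7, 9]

# A hypothetical young player with 2 top 10 wins in his first three seasons
share = share_with_at_most(first_3_year_wins, 2)
```

With the full worksheet, `share_with_at_most(Top20Top10Wins["Total First 3 Years"], 2)` would give the fraction of eventual top 20 players who had two or fewer top 10 wins at the same career stage.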

Third, I compared how many top 10 wins, on average, tennis stars had within 5 years of turning pro.


In [30]:
# Years are compared against scalars; comparing a Series with a one-element list
# such as [1995] raises a length-mismatch error in current pandas.
Top10WinsComparison2 = pd.DataFrame({'Year Turned Pro':['Before 1995','Between 1995 and 1999',
                                                             'Between 2000 and 2004','After 2004'],
    'Top 10 Wins in First 5 Pro Years (For Eventual Top 20 Players)':
        [Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']<1995)]["Total First 5 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']>=1995)&
                      (Top20Top10Wins['Year Turned Pro']<2000)]["Total First 5 Years"].mean(),
        Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']>=2000)&
                      (Top20Top10Wins['Year Turned Pro']<2005)]["Total First 5 Years"].mean(),
       Top20Top10Wins[(Top20Top10Wins['Year Turned Pro']>=2005)]["Total First 5 Years"].mean()],
    'Top 10 Wins in First 5 Pro Years (For Eventual Top 10 Players)':
        [Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']<1995)]["Total First 5 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']>=1995)&
                      (Top10Top10Wins['Year Turned Pro']<2000)]["Total First 5 Years"].mean(),
        Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']>=2000)&
                      (Top10Top10Wins['Year Turned Pro']<2005)]["Total First 5 Years"].mean(),
       Top10Top10Wins[(Top10Top10Wins['Year Turned Pro']>=2005)]["Total First 5 Years"].mean()]})

In [31]:
Top10WinsComparison2 = Top10WinsComparison2.set_index("Year Turned Pro")
Top10WinsComparison2Graph = Top10WinsComparison2.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 10 Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 10 Victories Do Tennis Stars \nHave in First 5 Professional Seasons?",fontsize=(16))


Out[31]:
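[Figure: line chart "How Many Top 10 Victories Do Tennis Stars Have in First 5 Professional Seasons?" (x-axis: Year Turned Pro; y-axis: Top 10 Victories)]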

As seen above, players who turned pro prior to 2000 generally had more top 10 victories within their first 5 professional seasons than players who turned pro between 2000 and 2004. Due to players like Andy Murray, who had incredible success early in his career, the trend has reversed for eventual top 10 players who turned pro after 2004, who averaged 14.80 top 10 victories in their first 5 seasons. The following table and histogram present similar information:


In [32]:
Top10WinsComparison2.round(2)


Out[32]:
Year Turned Pro | Top 10 Wins in First 5 Pro Years (For Eventual Top 10 Players) | Top 10 Wins in First 5 Pro Years (For Eventual Top 20 Players)
Before 1995 | 11.58 | 9.02
Between 1995 and 1999 | 11.53 | 7.83
Between 2000 and 2004 | 5.19 | 4.96
After 2004 | 14.80 | 6.00

In [33]:
Top20Top10Wins.plot(kind="hist",bins=10,y="Total First 5 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 10 Victories in First 5 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 10 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))


Out[33]:
<matplotlib.text.Text at 0xef586e6c18>

Fourth, I compared how many top 50 wins, on average, tennis stars had within 3 years of turning pro.


In [34]:
# Compare against scalar years; comparing a Series with a one-element list fails in recent pandas
Top50WinsComparison = pd.DataFrame({'Year Turned Pro':['Before 1995','Between 1995 and 1999',
                                                             'Between 2000 and 2004','After 2004'],
    'Top 50 Wins in First 3 Pro Years (For Eventual Top 20 Players)':
        [Top20Top50Wins[Top20Top50Wins['Year Turned Pro'] < 1995]["Total First 3 Years"].mean(),
         Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 1995) &
                        (Top20Top50Wins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
         Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 2000) &
                        (Top20Top50Wins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
         Top20Top50Wins[Top20Top50Wins['Year Turned Pro'] >= 2005]["Total First 3 Years"].mean()],
    'Top 50 Wins in First 3 Pro Years (For Eventual Top 10 Players)':
        [Top10Top50Wins[Top10Top50Wins['Year Turned Pro'] < 1995]["Total First 3 Years"].mean(),
         Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 1995) &
                        (Top10Top50Wins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
         Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 2000) &
                        (Top10Top50Wins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
         Top10Top50Wins[Top10Top50Wins['Year Turned Pro'] >= 2005]["Total First 3 Years"].mean()]})
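
The four cohort averages are computed with hand-written filters in each comparison cell. As a possible alternative, the same grouping could be sketched with pd.cut and groupby. This is only an illustration under stated assumptions: toy_wins is a hypothetical stand-in for a frame like Top20Top50Wins, its numbers are invented, and cohort_means is a helper name introduced here, not part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for a frame such as Top20Top50Wins (invented numbers).
toy_wins = pd.DataFrame({
    "Year Turned Pro": [1990, 1994, 1995, 1999, 2000, 2004, 2005, 2008],
    "Total First 3 Years": [20, 10, 12, 8, 6, 8, 9, 5],
})

# Left-closed bins reproduce the four cohorts: <1995, 1995-1999, 2000-2004, >=2005.
cohort_edges = [-float("inf"), 1995, 2000, 2005, float("inf")]
cohort_labels = ["Before 1995", "Between 1995 and 1999",
                 "Between 2000 and 2004", "After 2004"]

def cohort_means(frame, value_column):
    """Average value_column within each turned-pro cohort (illustrative helper)."""
    cohorts = pd.cut(frame["Year Turned Pro"], bins=cohort_edges,
                     labels=cohort_labels, right=False)
    return frame.groupby(cohorts, observed=False)[value_column].mean()

toy_means = cohort_means(toy_wins, "Total First 3 Years")
print(toy_means)
```

With right=False each bin includes its left edge, so a player who turned pro in 1995 lands in "Between 1995 and 1999", matching the >= and < filters used in the cells above.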

In [35]:
Top50WinsComparison = Top50WinsComparison.set_index("Year Turned Pro")
Top50WinsComparisonGraph = Top50WinsComparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 50 Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 50 Victories Do Tennis Stars \nHave in First 3 Professional Seasons?",fontsize=(16))


Out[35]:
<matplotlib.text.Text at 0xef59696470>

As seen above, players who turned pro prior to 2000 generally had more top 50 victories within their first 3 professional seasons than players who turned pro after 2000. Because of players like Andy Murray, who had incredible success early in his career, the trend has started to change over the last decade. The following table and histogram present the same information:


In [36]:
Top50WinsComparison.round(2)


Out[36]:
Year Turned Pro | Top 50 Wins in First 3 Pro Years (For Eventual Top 10 Players) | Top 50 Wins in First 3 Pro Years (For Eventual Top 20 Players)
Before 1995 | 17.35 | 13.58
Between 1995 and 1999 | 15.71 | 10.40
Between 2000 and 2004 | 7.12 | 6.74
After 2004 | 11.20 | 7.50

In [37]:
Top20Top50Wins.plot(kind="hist",bins=10,y="Total First 3 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 50 Victories in First 3 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 50 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))


Out[37]:
<matplotlib.text.Text at 0xef58a97128>

Fifth, I compared how many top 50 wins, on average, tennis stars had within 5 years of turning pro.


In [38]:
# Compare against scalar years; comparing a Series with a one-element list fails in recent pandas
Top50WinsComparison2 = pd.DataFrame({'Year Turned Pro':['Before 1995','Between 1995 and 1999',
                                                             'Between 2000 and 2004','After 2004'],
    'Top 50 Wins in First 5 Pro Years (For Eventual Top 20 Players)':
        [Top20Top50Wins[Top20Top50Wins['Year Turned Pro'] < 1995]["Total First 5 Years"].mean(),
         Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 1995) &
                        (Top20Top50Wins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
         Top20Top50Wins[(Top20Top50Wins['Year Turned Pro'] >= 2000) &
                        (Top20Top50Wins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
         Top20Top50Wins[Top20Top50Wins['Year Turned Pro'] >= 2005]["Total First 5 Years"].mean()],
    'Top 50 Wins in First 5 Pro Years (For Eventual Top 10 Players)':
        [Top10Top50Wins[Top10Top50Wins['Year Turned Pro'] < 1995]["Total First 5 Years"].mean(),
         Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 1995) &
                        (Top10Top50Wins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
         Top10Top50Wins[(Top10Top50Wins['Year Turned Pro'] >= 2000) &
                        (Top10Top50Wins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
         Top10Top50Wins[Top10Top50Wins['Year Turned Pro'] >= 2005]["Total First 5 Years"].mean()]})

In [39]:
Top50WinsComparison2 = Top50WinsComparison2.set_index("Year Turned Pro")
Top50WinsComparison2Graph = Top50WinsComparison2.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Top 50 Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Top 50 Victories Do Tennis Stars \nHave in First 5 Professional Seasons?",fontsize=(16))


Out[39]:
<matplotlib.text.Text at 0xef5aa860f0>

As seen above, players who turned pro prior to 2000 generally had more top 50 victories within their first 5 professional seasons than players who turned pro after 2000. Because of players like Andy Murray, who had incredible success early in his career, the trend has started to change over the last decade. The following table and histogram present the same information:


In [40]:
Top50WinsComparison2.round(2)


Out[40]:
Year Turned Pro | Top 50 Wins in First 5 Pro Years (For Eventual Top 10 Players) | Top 50 Wins in First 5 Pro Years (For Eventual Top 20 Players)
Before 1995 | 54.62 | 43.75
Between 1995 and 1999 | 53.71 | 37.03
Between 2000 and 2004 | 35.56 | 30.89
After 2004 | 55.00 | 29.25

In [41]:
Top20Top50Wins.plot(kind="hist",bins=10,y="Total First 5 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Top 50 Victories in First 5 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTop 50 Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))


Out[41]:
<matplotlib.text.Text at 0xef54c084e0>

Sixth, I compared how many tournament wins, on average, tennis stars had within 3 years of turning pro.


In [42]:
# Compare against scalar years; comparing a Series with a one-element list fails in recent pandas
TournamentWinsComparison = pd.DataFrame({'Year Turned Pro':['Before 1995','Between 1995 and 1999',
                                                             'Between 2000 and 2004','After 2004'],
    'Tournament Wins in First 3 Pro Years (For Eventual Top 20 Players)':
        [Top20TournamentWins[Top20TournamentWins['Year Turned Pro'] < 1995]["Total First 3 Years"].mean(),
         Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 1995) &
                             (Top20TournamentWins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
         Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 2000) &
                             (Top20TournamentWins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
         Top20TournamentWins[Top20TournamentWins['Year Turned Pro'] >= 2005]["Total First 3 Years"].mean()],
    'Tournament Wins in First 3 Pro Years (For Eventual Top 10 Players)':
        [Top10TournamentWins[Top10TournamentWins['Year Turned Pro'] < 1995]["Total First 3 Years"].mean(),
         Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 1995) &
                             (Top10TournamentWins['Year Turned Pro'] < 2000)]["Total First 3 Years"].mean(),
         Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 2000) &
                             (Top10TournamentWins['Year Turned Pro'] < 2005)]["Total First 3 Years"].mean(),
         Top10TournamentWins[Top10TournamentWins['Year Turned Pro'] >= 2005]["Total First 3 Years"].mean()]},
                                        index=[1, 2, 3, 4])

In [43]:
TournamentWinsComparison = TournamentWinsComparison.set_index("Year Turned Pro")
TournamentWinsComparisonGraph = TournamentWinsComparison.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Tournament Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Tournament Victories Do Tennis Stars \nHave in First 3 Professional Seasons?",
          fontsize=(16))


Out[43]:
<matplotlib.text.Text at 0xef5aca14e0>

As seen above, tennis stars who turned pro before 1995 had much more tournament success early in their careers than tennis stars who turned pro after 2000. The following table and histogram present the same information:


In [44]:
TournamentWinsComparison.round(2)


Out[44]:
Year Turned Pro | Tournament Wins in First 3 Pro Years (For Eventual Top 10 Players) | Tournament Wins in First 3 Pro Years (For Eventual Top 20 Players)
Before 1995 | 1.81 | 1.32
Between 1995 and 1999 | 1.00 | 0.63
Between 2000 and 2004 | 0.62 | 0.44
After 2004 | 0.80 | 0.31

In [45]:
Top20TournamentWins.plot(kind="hist",bins=10,y="Total First 3 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Tournament Victories in First 3 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTournament Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))


Out[45]:
<matplotlib.text.Text at 0xef5ac60c88>

Lastly, I compared how many tournament wins, on average, tennis stars had within 5 years of turning pro.


In [46]:
# Compare against scalar years; comparing a Series with a one-element list fails in recent pandas
TournamentWinsComparison2 = pd.DataFrame({'Year Turned Pro':['Before 1995','Between 1995 and 1999',
                                                             'Between 2000 and 2004','After 2004'],
    'Tournament Wins in First 5 Pro Years (For Eventual Top 20 Players)':
        [Top20TournamentWins[Top20TournamentWins['Year Turned Pro'] < 1995]["Total First 5 Years"].mean(),
         Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 1995) &
                             (Top20TournamentWins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
         Top20TournamentWins[(Top20TournamentWins['Year Turned Pro'] >= 2000) &
                             (Top20TournamentWins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
         Top20TournamentWins[Top20TournamentWins['Year Turned Pro'] >= 2005]["Total First 5 Years"].mean()],
    'Tournament Wins in First 5 Pro Years (For Eventual Top 10 Players)':
        [Top10TournamentWins[Top10TournamentWins['Year Turned Pro'] < 1995]["Total First 5 Years"].mean(),
         Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 1995) &
                             (Top10TournamentWins['Year Turned Pro'] < 2000)]["Total First 5 Years"].mean(),
         Top10TournamentWins[(Top10TournamentWins['Year Turned Pro'] >= 2000) &
                             (Top10TournamentWins['Year Turned Pro'] < 2005)]["Total First 5 Years"].mean(),
         Top10TournamentWins[Top10TournamentWins['Year Turned Pro'] >= 2005]["Total First 5 Years"].mean()]},
                                        index=[1, 2, 3, 4])

In [47]:
TournamentWinsComparison2 = TournamentWinsComparison2.set_index("Year Turned Pro")
TournamentWinsComparison2Graph = TournamentWinsComparison2.plot(kind="line",figsize=(10,5),fontsize=(14))
plt.ylabel("Tournament Victories")
plt.legend(loc='best',prop={'size':11}).get_frame().set_edgecolor('black')
plt.title("How Many Tournament Victories Do Tennis Stars \nHave in First 5 Professional Seasons?",
          fontsize=(16))


Out[47]:
<matplotlib.text.Text at 0xef5aec7438>

Once again, while there has been a resurgence more recently, tennis stars who turned pro before 1995 generally had more early success in tournaments than tennis stars who turned pro after 2000. The following table and histogram present the same information:


In [48]:
TournamentWinsComparison2.round(2)


Out[48]:
Year Turned Pro | Tournament Wins in First 5 Pro Years (For Eventual Top 10 Players) | Tournament Wins in First 5 Pro Years (For Eventual Top 20 Players)
Before 1995 | 5.96 | 4.38
Between 1995 and 1999 | 4.47 | 2.80
Between 2000 and 2004 | 3.50 | 2.48
After 2004 | 5.60 | 2.56

In [49]:
Top20TournamentWins.plot(kind="hist",bins=10,y="Total First 5 Years")
plt.legend(loc='upper right',prop={'size':12}).get_frame().set_edgecolor('black')
plt.title("Number of Tournament Victories in First 5 Pro Seasons\n for Eventual Top 20 Players",fontsize=(12))
plt.xlabel("\nTournament Victories",fontsize=(12))
plt.ylabel("Frequency\n",fontsize=(12))


Out[49]:
<matplotlib.text.Text at 0xef5ae94da0>

Summary

The following table summarizes the data from the above graphics. These numbers serve as the foundation for the ultimate analysis of young players' early careers. As we saw above, with the exception of outliers like Andy Murray, tennis stars in recent years have taken longer to reach top 50 status and have had fewer top 10, top 50, and tournament victories in the first few years of their careers. Since the early 2000s, tennis has largely been dominated by four players (Roger Federer, Rafael Nadal, Andy Murray, and Novak Djokovic), and this is likely the primary explanation for the increased difficulty of breaking through in men's professional tennis. With such dominant players at the top, it has been challenging for younger players to break into the top 10 or to win tournaments at a young age.


In [50]:
TennisSummary = [YearsUntilTop50Comparison["Avg Years to Reach Top 50 (For Eventual Top 20 Players)"],
            Top10WinsComparison["Top 10 Wins in First 3 Pro Years (For Eventual Top 20 Players)"],
            Top10WinsComparison2["Top 10 Wins in First 5 Pro Years (For Eventual Top 20 Players)"],
            Top50WinsComparison["Top 50 Wins in First 3 Pro Years (For Eventual Top 20 Players)"],
            Top50WinsComparison2["Top 50 Wins in First 5 Pro Years (For Eventual Top 20 Players)"],
            TournamentWinsComparison["Tournament Wins in First 3 Pro Years (For Eventual Top 20 Players)"],
            TournamentWinsComparison2["Tournament Wins in First 5 Pro Years (For Eventual Top 20 Players)"], 
            YearsUntilTop50Comparison["Avg Years to Reach Top 50 (For Eventual Top 10 Players)"],
            Top10WinsComparison["Top 10 Wins in First 3 Pro Years (For Eventual Top 10 Players)"],
            Top10WinsComparison2["Top 10 Wins in First 5 Pro Years (For Eventual Top 10 Players)"],
            Top50WinsComparison["Top 50 Wins in First 3 Pro Years (For Eventual Top 10 Players)"],
            Top50WinsComparison2["Top 50 Wins in First 5 Pro Years (For Eventual Top 10 Players)"],
            TournamentWinsComparison["Tournament Wins in First 3 Pro Years (For Eventual Top 10 Players)"],
            TournamentWinsComparison2["Tournament Wins in First 5 Pro Years (For Eventual Top 10 Players)"]]

result = pd.concat(TennisSummary,axis=1)

columns = [('All Top 20 Players between 1995 to 2015','All Top 10 Players between 1995 to 2015'),
                                                                ("Avg Years to Reach Top 50",
                                                                 "Top 10 Wins First 3 Pro Years",
                                                                 "Top 10 Wins First 5 Pro Years",
                                                                 "Top 50 Wins First 3 Pro Years",
                                                                 "Top 50 Wins First 5 Pro Years",
                                                                 "Tournament Wins First 3 Pro Years",
                                                                 "Tournament Wins First 5 Pro Years")]

result.columns=pd.MultiIndex.from_product(columns)
result.round(2)


Out[50]:
All Top 20 Players between 1995 to 2015
Year Turned Pro | Avg Years to Reach Top 50 | Top 10 Wins First 3 Pro Years | Top 10 Wins First 5 Pro Years | Top 50 Wins First 3 Pro Years | Top 50 Wins First 5 Pro Years | Tournament Wins First 3 Pro Years | Tournament Wins First 5 Pro Years
Before 1995 | 3.72 | 2.60 | 9.02 | 13.58 | 43.75 | 1.32 | 4.38
Between 1995 and 1999 | 4.40 | 2.30 | 7.83 | 10.40 | 37.03 | 0.63 | 2.80
Between 2000 and 2004 | 4.22 | 0.96 | 4.96 | 6.74 | 30.89 | 0.44 | 2.48
After 2004 | 4.38 | 1.56 | 6.00 | 7.50 | 29.25 | 0.31 | 2.56

All Top 10 Players between 1995 to 2015
Year Turned Pro | Avg Years to Reach Top 50 | Top 10 Wins First 3 Pro Years | Top 10 Wins First 5 Pro Years | Top 50 Wins First 3 Pro Years | Top 50 Wins First 5 Pro Years | Tournament Wins First 3 Pro Years | Tournament Wins First 5 Pro Years
Before 1995 | 3.04 | 3.50 | 11.58 | 17.35 | 54.62 | 1.81 | 5.96
Between 1995 and 1999 | 3.35 | 3.53 | 11.53 | 15.71 | 53.71 | 1.00 | 4.47
Between 2000 and 2004 | 3.88 | 0.81 | 5.19 | 7.12 | 35.56 | 0.62 | 3.50
After 2004 | 3.80 | 2.60 | 14.80 | 11.20 | 55.00 | 0.80 | 5.60
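
The summary cell relies on the order of the concatenated series matching the order produced by pd.MultiIndex.from_product, which varies the second level fastest. The sketch below illustrates that pairing on a small toy example. The series names and numbers are hypothetical and are not taken from the report's data.

```python
import pandas as pd

cohorts = ["Before 1995", "After 2004"]

# Hypothetical toy series (invented numbers), listed group by group:
# every "Top 20" series first, then every "Top 10" series.
top20_years = pd.Series([4.0, 4.5], index=cohorts)
top20_wins = pd.Series([2.5, 1.5], index=cohorts)
top10_years = pd.Series([3.0, 3.8], index=cohorts)
top10_wins = pd.Series([3.5, 2.6], index=cohorts)

toy_summary = pd.concat([top20_years, top20_wins, top10_years, top10_wins], axis=1)

# from_product yields (group, stat) pairs with the stat level varying fastest,
# so the column order above must follow the same group-then-stat pattern.
toy_summary.columns = pd.MultiIndex.from_product(
    [["Top 20 Players", "Top 10 Players"], ["Years to Top 50", "Top 10 Wins"]])

print(toy_summary)
print(toy_summary["Top 10 Players"])  # select one group of columns
```

If the series were concatenated in a different order, the labels would still be assigned positionally, so the numbers would silently end up under the wrong headings.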

As the final piece of the analysis, I developed a set of functions to compare the average early-career statistics of the top 20 and top 10 players between 1995 and 2015 with the stats of any selected tennis player. The following functions build on functions defined earlier in this report and summarize tennis stats for any player.

TourneyWinsFirst3ProYears: Outputs number of tournament victories in first 3 professional seasons for any selected tennis player


In [51]:
def TourneyWinsFirst3ProYears(player):
    return (YearTourneyWins(player,YearTurnedPro(player)) + YearTourneyWins(player,YearTurnedPro(player)+1) + 
     YearTourneyWins(player,YearTurnedPro(player)+2))

TourneyWinsFirst3ProYears("Roger Federer")


Out[51]:
0

TourneyWinsFirst5ProYears: Outputs number of tournament victories in first 5 professional seasons for any selected tennis player


In [52]:
def TourneyWinsFirst5ProYears(player):
    return (YearTourneyWins(player,YearTurnedPro(player)) + YearTourneyWins(player,YearTurnedPro(player)+1) + 
     YearTourneyWins(player,YearTurnedPro(player)+2) + YearTourneyWins(player,YearTurnedPro(player)+3) + 
           YearTourneyWins(player,YearTurnedPro(player)+4))

TourneyWinsFirst5ProYears("Roger Federer")


Out[52]:
4

Top10WinsFirst3ProYears: Outputs number of top 10 victories in first 3 professional seasons for any selected tennis player


In [53]:
def Top10WinsFirst3ProYears(player):
    return (YearTopTen(player,YearTurnedPro(player)) + YearTopTen(player,YearTurnedPro(player)+1) + 
     YearTopTen(player,YearTurnedPro(player)+2))

Top10WinsFirst3ProYears("Roger Federer")


Out[53]:
4

Top10WinsFirst5ProYears: Outputs number of top ten victories in first 5 professional seasons for any selected tennis player


In [54]:
def Top10WinsFirst5ProYears(player):
    return (YearTopTen(player,YearTurnedPro(player)) + YearTopTen(player,YearTurnedPro(player)+1) + 
     YearTopTen(player,YearTurnedPro(player)+2) + YearTopTen(player,YearTurnedPro(player)+3) + 
           YearTopTen(player,YearTurnedPro(player)+4))

Top10WinsFirst5ProYears("Roger Federer")


Out[54]:
19

Top50WinsFirst3ProYears: Outputs number of top 50 victories in first 3 professional seasons for any selected tennis player


In [55]:
def Top50WinsFirst3ProYears(player):
    return (YearTopFifty(player,YearTurnedPro(player)) + YearTopFifty(player,YearTurnedPro(player)+1) + 
     YearTopFifty(player,YearTurnedPro(player)+2))

Top50WinsFirst3ProYears("Roger Federer")


Out[55]:
26

Top50WinsFirst5ProYears: Outputs number of top 50 victories in first 5 professional seasons for any selected tennis player


In [56]:
def Top50WinsFirst5ProYears(player):
    return (YearTopFifty(player,YearTurnedPro(player)) + YearTopFifty(player,YearTurnedPro(player)+1) + 
     YearTopFifty(player,YearTurnedPro(player)+2) + YearTopFifty(player,YearTurnedPro(player)+3) + 
           YearTopFifty(player,YearTurnedPro(player)+4))

Top50WinsFirst5ProYears("Roger Federer")


Out[56]:
87
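
The six functions above share one pattern: sum a per-season statistic over the first 3 or 5 seasons after a player turned pro. A generic helper could express that pattern once. The sketch below is only an illustration: total_first_n_pro_years is a name introduced here, and the toy lookup functions and their data are hypothetical stand-ins for the report's YearTurnedPro and YearTourneyWins, which are defined earlier and not reproduced.

```python
# Hypothetical stand-in data (invented), keyed by player and season.
toy_turned_pro = {"Player A": 2001}
toy_titles_by_year = {("Player A", 2001): 0, ("Player A", 2002): 1,
                      ("Player A", 2003): 0, ("Player A", 2004): 2,
                      ("Player A", 2005): 3}

def toy_year_turned_pro(player):
    """Stand-in for YearTurnedPro: the season the player turned pro."""
    return toy_turned_pro[player]

def toy_year_tourney_wins(player, year):
    """Stand-in for YearTourneyWins: titles won in one season (0 if none recorded)."""
    return toy_titles_by_year.get((player, year), 0)

def total_first_n_pro_years(player, n_years, per_year_stat, year_turned_pro):
    """Sum per_year_stat over the first n_years seasons, starting with the pro debut season."""
    first_year = year_turned_pro(player)
    return sum(per_year_stat(player, first_year + offset) for offset in range(n_years))

first3 = total_first_n_pro_years("Player A", 3, toy_year_tourney_wins, toy_year_turned_pro)
first5 = total_first_n_pro_years("Player A", 5, toy_year_tourney_wins, toy_year_turned_pro)
print(first3, first5)
```

Passing the per-season function as an argument means the same helper could cover tournament wins, top 10 wins, and top 50 wins without repeating the year arithmetic.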

YearsUntilTop50: Identifies how long it took a player to achieve top 50 status


In [57]:
def YearsUntilTop50(Player):
    # .loc replaces the .ix indexer, which was deprecated and later removed from pandas
    years = AllPlayerSummary.loc[Player,"Years Until Top 50"]
    if type(years) == float:
        return int(years)
    else:
        return years
    
YearsUntilTop50("Roger Federer")


Out[57]:
3

The final function summarizes all the data seen thus far in this report in a single table. We can now compare any tennis player to the average statistics of players who eventually became top 10 or top 20 players in the world. For young tennis players, this could be used to gauge success early in their careers.


In [58]:
def WillHeBeGreat(Player):
    Will = pd.DataFrame({'Player':[Player,"Average for Top 20 Players from 1995 to 2015",
                                   "Average for Top 10 Players from 1995 to 2015"],
                         'Pro Years Until Top 50 Ranking':[YearsUntilTop50(Player),
                                                        Top20RankingProgression["Years Until Top 50"].mean(),
                                                        Top10RankingProgression["Years Until Top 50"].mean()],
                         'Top 10 Wins First 3 Pro Years':[Top10WinsFirst3ProYears(Player),
                                                        Top20Top10Wins["Total First 3 Years"].mean(),
                                                        Top10Top10Wins["Total First 3 Years"].mean()],
                         'Top 10 Wins First 5 Pro Years':[Top10WinsFirst5ProYears(Player),
                                                        Top20Top10Wins["Total First 5 Years"].mean(),
                                                        Top10Top10Wins["Total First 5 Years"].mean()],
                         'Top 50 Wins First 3 Pro Years':[Top50WinsFirst3ProYears(Player),
                                                        Top20Top50Wins["Total First 3 Years"].mean(),
                                                        Top10Top50Wins["Total First 3 Years"].mean()],
                         'Top 50 Wins First 5 Pro Years':[Top50WinsFirst5ProYears(Player),
                                                        Top20Top50Wins["Total First 5 Years"].mean(),
                                                        Top10Top50Wins["Total First 5 Years"].mean()],
                         'Tournament Wins First 3 Pro Years':[TourneyWinsFirst3ProYears(Player),
                                                        Top20TournamentWins["Total First 3 Years"].mean(),
                                                        Top10TournamentWins["Total First 3 Years"].mean()],
                         'Tournament Wins First 5 Pro Years':[TourneyWinsFirst5ProYears(Player),
                                                        Top20TournamentWins["Total First 5 Years"].mean(),
                                                        Top10TournamentWins["Total First 5 Years"].mean()],
                        'Year Turned Pro':[YearTurnedPro(Player),"",""]})
    ColumnOrder = ['Year Turned Pro','Pro Years Until Top 50 Ranking','Top 10 Wins First 3 Pro Years',
                  'Top 10 Wins First 5 Pro Years','Top 50 Wins First 3 Pro Years',
                  'Top 50 Wins First 5 Pro Years','Tournament Wins First 3 Pro Years',
                  'Tournament Wins First 5 Pro Years']
    Will = Will.set_index("Player")
    Will = Will[ColumnOrder]
    Will = Will.round(2)
    return Will

In [59]:
WillHeBeGreat("Andy Murray")


Out[59]:
Player | Year Turned Pro | Pro Years Until Top 50 Ranking | Top 10 Wins First 3 Pro Years | Top 10 Wins First 5 Pro Years | Top 50 Wins First 3 Pro Years | Top 50 Wins First 5 Pro Years | Tournament Wins First 3 Pro Years | Tournament Wins First 5 Pro Years
Andy Murray | 2005 | 2.00 | 10.00 | 35.00 | 41.00 | 118.00 | 4.00 | 14.00
Average for Top 20 Players from 1995 to 2015 | | 4.12 | 1.98 | 7.31 | 10.24 | 36.84 | 0.79 | 3.25
Average for Top 10 Players from 1995 to 2015 | | 3.39 | 2.77 | 10.22 | 13.88 | 49.64 | 1.22 | 4.92

Based on the information above, Andy Murray had an incredible start to his career. For example, it took him only 2 years to achieve a top 50 ranking, and he had 35 victories against top 10 opponents in his first 5 professional seasons (compared to an average of about 10 for players who eventually became top 10 in the world). Looking at this table, it is not surprising that Andy Murray eventually became a top 5 player in the world with multiple Grand Slam titles. The following function allows us to see the same information in a more visual format:


In [60]:
def WillHeBeGreatGraph(Player):
    # Set the style before plotting; a style only affects figures created afterwards
    plt.style.use("ggplot")
    WillHeBeGreat(Player).plot(y=["Pro Years Until Top 50 Ranking",
                                  "Top 10 Wins First 3 Pro Years",
                                  "Top 10 Wins First 5 Pro Years",
                                  "Tournament Wins First 3 Pro Years",
                                  "Tournament Wins First 5 Pro Years"],kind="barh",figsize=(10,5))
    plt.ylabel("")
    plt.xlabel("Years or # of Victories")
    plt.title("Early Career of Selected Player vs. \nEarly Careers of Eventual Top 20 or Top 10 Players")

WillHeBeGreatGraph("Andy Murray")
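
Because the chart places years and victory counts on one axis, a ratio against the cohort average can be a useful numeric complement. The sketch below divides Andy Murray's figures by the averages for eventual top 10 players, using values copied from the Out[59] table. The variable names are introduced here for illustration and are not part of the original functions.

```python
import pandas as pd

stat_names = ["Pro Years Until Top 50 Ranking",
              "Top 10 Wins First 3 Pro Years", "Top 10 Wins First 5 Pro Years",
              "Top 50 Wins First 3 Pro Years", "Top 50 Wins First 5 Pro Years",
              "Tournament Wins First 3 Pro Years", "Tournament Wins First 5 Pro Years"]

# Values copied from the Out[59] table for Andy Murray and the top 10 average row.
murray_stats = pd.Series([2.00, 10.00, 35.00, 41.00, 118.00, 4.00, 14.00], index=stat_names)
top10_average = pd.Series([3.39, 2.77, 10.22, 13.88, 49.64, 1.22, 4.92], index=stat_names)

# A ratio above 1 means more victories than the average eventual top 10 player;
# for "Pro Years Until Top 50 Ranking" a ratio below 1 means faster progress.
ratio_to_top10 = (murray_stats / top10_average).round(2)
print(ratio_to_top10)
```

Read this way, every victory statistic for Murray is more than twice the top 10 average, while his time to a top 50 ranking is well under it.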


As another example, we can look at the career of Steve Johnson. Steve Johnson is a perennial top 100 player in the world, but has struggled to make a significant impact on the ATP tour. Looking at his early career, it is not surprising that he has had difficulty becoming a top 20 player.


In [61]:
WillHeBeGreatGraph("Steve Johnson")


Finally, we can look at the early statistics for Roger Federer, one of the greatest men's tennis players of all time. As the graph illustrates, his play in the first few years of his career, particularly with respect to victories against top 10 opponents, made it clear that he would become a phenomenal tennis player someday.


In [62]:
WillHeBeGreatGraph("Roger Federer")


Conclusion

Ultimately, my analysis demonstrates what it takes for a young tennis player to have legitimate expectations of future, long-term success on the men's tour. While recent dominance at the top of the rankings has given young players some leeway when analyzing their early stats, the inevitable retirements of top players like Rafael Nadal and Roger Federer will greatly increase expectations for young tennis players. Tennis is an incredibly expensive sport to play, and players cannot afford to simply survive on tour while ranked outside the top 100. Analyzing the early careers of tennis stars is another way for a young player to gauge success and to develop reasonable goals and expectations. This report consolidates that analysis in one place. It is no guarantee that a player will or will not eventually become a dominant player in the sport, but understanding the careers of players who went on to have great success can help a young player better understand what he needs to accomplish in order to excel in the sport.