For my Data Bootcamp project, I analyzed some factors that could correlate with a player's win rate in the popular MOBA game League of Legends. Using the open-source library Cassiopeia as well as the developer key linked to my League of Legends account, I was able to pull the data of players in the Challenger tier (top 200 players) on the North American server.
Since my analysis was performed only on players in NA Challenger, its implications should not be generalized to the broader player base. Additionally, since the data covers ranked statistics for the current season (season 7) and there are various factors (team composition, champion-specific proficiency, team communication, etc.) that can lead to victory/defeat, this analysis is far from conclusive. Nevertheless, it can provide some interesting insights into the highest level of play. Hopefully you have as much fun reading this as I did making it!
A simple multivariable linear regression analysis will be performed with win rate as the dependent variable. The normality of the distribution will be evaluated, and the Ordinary Least Squares method will be used to generate regression results for the test model. Since this is a simple experiment/exercise in Python, this project will not explore heteroskedasticity, specification bias, collinearity, or other concerns. In turn, the results from this project should not be treated as particularly substantial.
Cassiopeia is an open-source library used to access the Riot Games API. Installing Cassiopeia also installs SQLAlchemy as a dependency.
pip install cassiopeia
In [1]:
import os # to access system environment variables
import pandas as pd # data management
import seaborn as sns # for data visualization
import matplotlib.pyplot as plt # for plots
import statsmodels.formula.api as smf # for regression output
import datetime as dt # date information
from cassiopeia import riotapi # to access Riot Games API
from cassiopeia.type.core.common import LoadPolicy # to utilize delayed loading policy
# IPython command, puts plots in notebook
%matplotlib inline
In [2]:
print('Last Updated', dt.date.today())
In [3]:
riotapi.set_region("NA") # sets the region to North America (rip Dyrus)
key = os.environ["DEV_KEY"] # grabs my API key from my environment variables
riotapi.set_api_key(key) # my dev key is specific to my account
riotapi.set_load_policy(LoadPolicy.lazy) # lazy -> delays loading certain objects for improved time + data usage
In [4]:
challenger_league = riotapi.get_challenger()
challenger_league
Out[4]:
In [5]:
challenger = [entry.summoner for entry in challenger_league]
In [6]:
challenger[0] # summoner.Summoner object of the highest-ranked player in Challenger
Out[6]:
In [7]:
challenger[0].name
Out[7]:
The method below returns a dictionary of ranked statistics for a player, keyed by champion played. Using the key [None] returns the aggregate ranked statistics across all champions played by the summoner. This aggregate data can be pulled in JSON format.
challenger[0].ranked_stats()
In [8]:
challenger[0].ranked_stats()[None]
Out[8]:
In [9]:
challenger[0].ranked_stats()[None].to_json() #returns data in JSON format
Out[9]:
In [10]:
df = pd.DataFrame()
for player in challenger:
    stats = pd.read_json(player.ranked_stats()[None].to_json(), typ='series')
    df = df.append(stats, ignore_index=True)
Below is a view of the dataframe for the top 5 players. Several of the fields are left as 0.0 because that information is not recorded for ranked statistics.
In [11]:
df.head()
Out[11]:
In [12]:
df.shape #200 player entries, 56 field columns
Out[12]:
In [13]:
df.columns
Out[13]:
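As a quick sketch (not part of the original notebook), the unrecorded all-zero fields mentioned above can be listed directly with a boolean mask. The tiny frame and field names here are illustrative stand-ins for the full `df`:

```python
import pandas as pd

# Minimal sketch: identify fields that were never recorded (all zeros).
# The two-row frame and column names below are hypothetical stand-ins
# for the 200-row Challenger dataframe built above.
demo = pd.DataFrame({'totalSessionsWon': [120.0, 98.0],
                     'botGamesPlayed': [0.0, 0.0]})

# (demo == 0).all() marks columns whose every value is 0.0
unrecorded = demo.columns[(demo == 0).all()]
print(list(unrecorded))  # ['botGamesPlayed']
```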
Note: the following cell was not re-run on the latest data, but it still correctly conveys that all fields are of datatype float.
In [16]:
df.dtypes #all values are floats
Out[16]:
In [14]:
df['averageAssists'] = df['totalAssists']/df['totalSessionsPlayed'] # will overwrite the 'averageAssists' field
df['averageKills'] = df['totalChampionKills']/df['totalSessionsPlayed']
df['averageDeaths'] = df['totalDeathsPerSession']/df['totalSessionsPlayed']
df['winRate'] = df['totalSessionsWon']/df['totalSessionsPlayed']
df['averageGoldEarned'] = df['totalGoldEarned']/df['totalSessionsPlayed']
As an example, the average assists of the top 5 players from the newly included averageAssists series can be seen below.
In [15]:
df['averageAssists'].head()
Out[15]:
In [16]:
df['winRate'].plot()
Out[16]:
From the descriptive statistics below, it is interesting to note that the win rates of all players in Challenger are above 50%. This makes sense, as the ranked positions of all players are reset at the end of each season and top players must climb the ranked ladder to get into Challenger. Thus, winning at least as many games as they lose seems like a necessary, but not sufficient, condition for making it into Challenger.
In [17]:
df['winRate'].describe()
Out[17]:
Before moving on to the test model, the normality of the win rate distribution is evaluated. For a normal distribution, the skewness is 0 (skewed neither left nor right) and the kurtosis, which measures the "tailedness" of the distribution, is 3; note that pandas reports excess kurtosis, for which a normal distribution scores 0. Since the win rate distribution has a positive skewness and a negative excess kurtosis, it is skewed to the right and platykurtic (the peak looks flatter). A sample of the 200 top players in a region is not expected to be normally distributed, and the skew towards higher win rates as well as the platykurtic shape (fewer and less influential outliers) seem to reflect this. Nevertheless, a multivariate linear regression will be performed on this dataset despite the failed normality condition.
In [18]:
df['winRate'].skew() # positive skewness -> skewed to the right (skewed to higher winrate)
Out[18]:
In [19]:
df['winRate'].kurt() # negative excess kurtosis -> platykurtic (pandas reports excess kurtosis; normal = 0)
Out[19]:
In [20]:
sns.distplot(df['winRate'], bins=50)
Out[20]:
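As a complementary check not performed in the original analysis, SciPy's `normaltest` (the D'Agostino-Pearson K² test, which combines skewness and kurtosis into one statistic) could formalize this normality assessment. The right-skewed sample below is synthetic, standing in for `df['winRate']`:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "win rate" sample standing in for df['winRate'];
# 200 values scaled to sit a bit above 0.5, like the Challenger data.
rng = np.random.default_rng(42)
sample = 0.5 + 0.1 * rng.gamma(2.0, 1.0, 200) / 4.0

# D'Agostino-Pearson K^2 test: a small p-value rejects normality
stat, p = stats.normaltest(sample)
print(p)
```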
WINRATE = B0 + B1(GOLD) + B2(KDA) + B3(TURRETS) + B4(DIFF) + B5(GAMES)
In [21]:
df['kda'] = (df['averageAssists'] + df['averageKills'])/df['averageDeaths']
df['averageTurretsKilled'] = df['totalTurretsKilled']/df['totalSessionsPlayed']
df['averageDamageDifferential'] = (df['totalDamageDealt'] - df['totalDamageTaken'] + df['totalHeal'])/df['totalSessionsPlayed']
In [22]:
df.shape #200 players, 63 columns (7 added, 1 overwritten)
Out[22]:
Based on the plot below, the win rate range seems to widen as the average gold earned increases. Additionally, the players with the highest win rates tend to have average gold earned between 11,000 to 13,000. It is important to note that games do vary in length (like when SKT steamrolled Royal Club in the Season 3 World Championship Finals), so the average gold earned can also vary.
In [23]:
fig, ax = plt.subplots()
ax.scatter(df['winRate'], df['averageGoldEarned'])
ax.set_title('Average Gold Earned vs. Win Rate', loc='left', fontsize=14)
ax.set_xlabel('Win Rate')
ax.set_ylabel('Average Gold Earned')
Out[23]:
Based on the plot below, the KDAs and win rates of most players seem to be clustered in the bottom left. As the KDA Ratio increases, there seems to be an upward trend in win rate. It should be noted that having a high KDA does not necessarily lead to a greater win rate, as being too aggressive or not focusing on objectives enough may be detrimental to a team's success.
In [24]:
fig, ax = plt.subplots()
ax.scatter(df['winRate'], df['kda'])
ax.set_title('KDA Ratio vs. Win Rate', loc='left', fontsize=14)
ax.set_xlabel('Win Rate')
ax.set_ylabel('KDA Ratio')
Out[24]:
Based on the plot below, it does not seem like the average turrets killed correlates with win rate. Players can also contribute to the team's victory without needing to kill a turret personally (such as letting minions finish off a turret after a push or having a laning partner take the turret).
In [25]:
fig, ax = plt.subplots()
ax.scatter(df['winRate'], df['averageTurretsKilled'])
ax.set_title('Average Turrets Killed vs. Win Rate', loc='left', fontsize=14)
ax.set_xlabel('Win Rate')
ax.set_ylabel('Average Turrets Killed')
Out[25]:
Based on the plot below, most of the data points seem to be clustered within the average damage differential range of 100,000 to 150,000. Aside from a few data points, it seems that the players with the highest win rates tend to be within that range. Players below that range seem to have win rates on the lower end of the spectrum, suggesting that they may not be contributing enough (in team fights or being effective in their particular roles). It is important to note that certain roles are not expected to contribute as much in terms of damage output, so a support main can still have a high WINRATE value despite having a consistently low DIFF.
In [26]:
fig, ax = plt.subplots()
ax.scatter(df['winRate'], df['averageDamageDifferential'])
ax.set_title('Average Damage Differential vs. Win Rate', loc='left', fontsize=14)
ax.set_xlabel('Win Rate')
ax.set_ylabel('Average Damage Differential')
Out[26]:
This plot suggests that the number of games played is negatively correlated with a player's win rate. As the number of games a player plays increases, that player's win rate decreases, approaching the lower limit of 50%. This relationship seems to make sense given the pool of NA Challenger players. With fewer ranked games played, a Challenger player has a greater portion of games played at lower tiers while climbing the ranked ladder, so that player's win rate would be higher. Additionally, players on win streaks can climb the ladder faster and can enter Challenger with a higher-than-average win rate.
In [27]:
fig, ax = plt.subplots()
ax.scatter(df['winRate'], df['totalSessionsPlayed'])
ax.set_title('Games Played vs. Win Rate', loc='left', fontsize=14)
ax.set_xlabel('Win Rate')
ax.set_ylabel('Games Played')
Out[27]:
Based on the test model established before,
WINRATE = B0 + B1(GOLD) + B2(KDA) + B3(TURRETS) + B4(DIFF) + B5(GAMES)
the regression output using the Ordinary Least Squares method is generated below with the statsmodels module. "winRate" is passed as the dependent variable while the other variables are passed as independent variables in the "model" formula string.
In [28]:
model = 'winRate ~ averageGoldEarned + kda + averageTurretsKilled + averageDamageDifferential + totalSessionsPlayed'
results = smf.ols(model, data=df).fit()
results.summary()
Out[28]:
WINRATE = 0.5016 + 0.000003407(GOLD) + 0.0180(KDA) + 0.0175(TURRETS) - 0.00000005839(DIFF) - 0.00009006(GAMES)
From our regression results, the R-squared statistic of 0.559 expresses that 55.9% of the variability in the data is accounted for by the linear regression model. Adjusting for the number of predictors in the model, the adjusted R-squared statistic is 0.547, or 54.7%. These values suggest that the model is somewhat good at explaining the variability in the data. Nevertheless, a couple of the explanatory variables have insignificant t-values and high p-values. A high R-squared statistic paired with few significant t-values suggests that the model might be suffering from a classic case of multicollinearity (two or more predictor variables are highly correlated). To deal with multicollinearity, the insignificant variables will be dropped in the second test model.
The variables with insignificant t-values and high p-values are GOLD and DIFF (with 194 degrees of freedom: |t| < approx. 2, p > alpha of 0.05). In turn, the second test model will only have the following independent variables: KDA, TURRETS, and GAMES.
WINRATE = B0 + B1(KDA) + B2(TURRETS) + B3(GAMES)
In [29]:
model2 = 'winRate ~ kda + averageTurretsKilled + totalSessionsPlayed'
results2 = smf.ols(model2, data=df).fit()
results2.summary()
Out[29]:
WINRATE = 0.5337 + 0.0175(KDA) + 0.0203(TURRETS) - 0.00009076(GAMES)
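As a sanity check, illustrative (hypothetical) player statistics can be plugged into the fitted equation above:

```python
# Coefficients from the fitted second model above
b0, b_kda, b_turrets, b_games = 0.5337, 0.0175, 0.0203, -0.00009076

# Hypothetical Challenger-like stats, chosen for illustration only
kda, turrets, games = 5.0, 1.2, 300

win_rate = b0 + b_kda * kda + b_turrets * turrets + b_games * games
print(round(win_rate, 3))  # 0.618 -> a predicted win rate of about 62%
```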
Using the second test model, R-squared drops slightly from 0.559 to 0.558 while adjusted R-squared rises slightly from 0.547 to 0.551, suggesting that this model explains the variability in the data somewhat better than the first test model.
Moreover, given the Durbin-Watson statistic of 2.080, we do not reject the null hypothesis of no autocorrelation (with 196 degrees of freedom and 3 regressors excluding the intercept: 1.704 < 2.080 < 2.296).
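For reference, statsmodels can compute the Durbin-Watson statistic directly from a fitted model's residuals (`results2.resid` in this notebook); the residuals below are a synthetic stand-in:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residuals standing in for results2.resid from the fit above
rng = np.random.default_rng(7)
resid = rng.normal(0.0, 0.05, 200)

# Durbin-Watson ranges from 0 to 4; values near 2 indicate
# no first-order autocorrelation in the residuals
dw = durbin_watson(resid)
print(round(dw, 3))
```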
Based on the second test model, KDA and TURRETS have a positive correlation with WINRATE, while GAMES has a negative correlation with WINRATE. Since the OLS regression expresses a high R-squared statistic of 0.558 and adjusted R-squared of 0.551, it can be argued that this model effectively explains the win rates of NA Challenger players. A Challenger player with a high KDA ratio, a high number of turrets killed on average, and a low number of games played is expected to have a relatively high win rate. The results fall in line with the expectations outlined earlier, but it was interesting to see that GOLD and DIFF were not significant variables within the model. This may be due to multicollinearity, which is not explored in this project; for instance, KDA and GOLD should be correlated, since kills and assists add to a player's gold count. While the scatterplot "Average Turrets Killed vs. Win Rate" suggested no correlation between TURRETS and WINRATE, the regression shows that TURRETS in fact has a significant, positive relationship with WINRATE. Finally, the inverse relationship between GAMES and WINRATE seen in the scatterplot "Games Played vs. Win Rate" was distinctly expressed in the regression, where GAMES had the largest (absolute) t-value of the independent variables.
From this experiment, nothing particularly ground-breaking was found. It is not surprising that a player's KDA ratio, average turrets killed, and games played are reflected in that player's ranked win rate. For more meaningful insights, data could be drawn from the individual ranked matches a player has played, but retrieving that information for 200 players would take significantly more time and work beyond the scope of this project. Additionally, this experiment could be made more rigorous by including White's general test for heteroskedasticity, omitted-variable tests, influence statistics, and other tests/measures. Thank you for spending the time to read this, and I hope this was enjoyable!