SeaFall Results

This SeaFall campaign ended after 14 games, so the leaderboard has a bit of history and depth. How did each player perform? How did each player reach their final glory total?

Visualizing Glory

The final score of each game is the only clear measure we have of player performance. We did not keep track of further gameplay details from game to game, such as gold acquired, relics or tablets recovered, or the amount of permanent enmity on the board. This means we are limited in how we can describe the success of each player. However, visualizing the


In [1]:
%matplotlib inline

import itertools
import matplotlib
import matplotlib.pyplot
import numpy
import pandas
import scipy.misc
import scipy.special
import scipy.stats
import seaborn
import trueskill
import xlrd

TrueSkill

Microsoft Research created a method for ranking player performance in multi-player games called TrueSkill. Let's apply this ranking to our game of SeaFall!

The goal of TrueSkill is to estimate a player's skill or ability in the game, which provides a different way of looking at the standings. The original implementation of TrueSkill was only concerned with wins, losses, and draws and did not incorporate the score of each game. For a free-for-all like SeaFall, each player's rank in each game is the input to the skill estimation. Since the rank does not reflect the total glory accumulated throughout the game, TrueSkill gives a different way to measure and compare each player's performance.
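
To make the rank input concrete, here is a minimal sketch of turning one game's glory into ranks. It assumes ties share the best available rank (so two players tied for second both get 2 and the next player gets 4), which is consistent with the rank columns visible in the table below. How those columns were actually computed is not recorded here, and `competition_ranks` is a hypothetical helper, not part of the notebook.

```python
def competition_ranks(scores):
    """Rank scores from highest to lowest; tied scores share the best rank."""
    ordered = sorted(scores, reverse=True)
    # list.index returns the first position of a value, so tied scores
    # all receive the rank of the first of them.
    return [ordered.index(score) + 1 for score in scores]


# Glory from game 1, in the order dave, kyle, scott, mike, joe.
game_1 = [11, 6, 6, 5, 4]
print(competition_ranks(game_1))  # [1, 2, 2, 4, 5]
```

Only the ordering survives this step: winning game 1 by five glory and winning it by one would produce the same ranks.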


In [2]:
df_glory = pandas.read_excel("./glory.xlsx")
df_glory


Out[2]:
game_number glory_target dave kyle scott mike joe dave_total kyle_total scott_total ... kyle_rank scott_rank mike_rank joe_rank dave_norm kyle_norm scott_norm mike_norm joe_norm total_glory
0 1 11 11 6 6 5 4 11 6 6 ... 2 2 4 5 1.000000 0.545455 0.545455 0.454545 0.363636 32
1 2 12 6 8 10 12 9 17 14 16 ... 4 2 1 3 0.500000 0.666667 0.833333 1.000000 0.750000 45
2 3 13 9 7 6 12 18 26 21 22 ... 4 5 2 1 0.692308 0.538462 0.461538 0.923077 1.384615 52
3 4 14 14 14 7 2 2 40 35 29 ... 1 3 4 4 1.000000 1.000000 0.500000 0.142857 0.142857 39
4 5 15 15 12 9 12 8 55 47 38 ... 2 4 2 5 1.000000 0.800000 0.600000 0.800000 0.533333 56
5 6 16 14 14 20 12 12 69 61 58 ... 2 1 4 4 0.875000 0.875000 1.250000 0.750000 0.750000 72
6 7 17 5 3 10 10 19 74 64 68 ... 5 2 2 1 0.294118 0.176471 0.588235 0.588235 1.117647 47
7 8 18 6 6 19 7 4 80 70 87 ... 3 1 2 5 0.333333 0.333333 1.055556 0.388889 0.222222 42
8 9 19 8 12 11 25 8 88 82 98 ... 2 3 1 4 0.421053 0.631579 0.578947 1.315789 0.421053 64
9 10 20 7 19 9 20 20 95 101 107 ... 3 4 1 1 0.350000 0.950000 0.450000 1.000000 1.000000 75
10 11 21 13 9 21 14 9 108 110 128 ... 4 1 2 4 0.619048 0.428571 1.000000 0.666667 0.428571 66
11 12 22 15 25 17 16 15 123 135 145 ... 1 2 3 4 0.681818 1.136364 0.772727 0.727273 0.681818 88
12 13 23 23 26 11 14 13 146 161 156 ... 1 5 3 4 1.000000 1.130435 0.478261 0.608696 0.565217 87
13 14 24 21 22 11 15 10 167 183 167 ... 1 4 3 5 0.875000 0.916667 0.458333 0.625000 0.416667 79

14 rows × 28 columns


In [3]:
dave = trueskill.Rating()
kyle = trueskill.Rating()
scott = trueskill.Rating()
mike = trueskill.Rating()
joe = trueskill.Rating()

In [4]:
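# Demo: rate a single five-player free-for-all.
# Without `ranks`, trueskill.rate treats the list order as the finishing order.
trueskill.quality([(dave,),(kyle,),(scott,),(mike,),(joe,)])
print(dave)
new_dave, new_kyle, new_scott, new_mike, new_joe = trueskill.rate([(dave,),(kyle,),(scott,),(mike,),(joe,)])
print(new_dave)
# Results come back in input order, so here `new_dave` receives the updated
# rating of the first group passed in (Scott's).
new_dave, new_kyle, new_scott, new_mike, new_joe = trueskill.rate([new_scott,new_kyle,new_dave,new_mike,new_joe])
print(new_dave)
# Explicit ranks (lower is better); equal ranks mark a tie.
new_dave, new_kyle, new_scott, new_mike, new_joe = trueskill.rate([(dave,),(kyle,),(scott,),(mike,),(joe,)], ranks=[0, 1, 1, 3, 4])
print(new_dave)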


trueskill.Rating(mu=25.000, sigma=8.333)
(trueskill.Rating(mu=34.363, sigma=6.136),)
(trueskill.Rating(mu=31.387, sigma=4.344),)
(trueskill.Rating(mu=32.687, sigma=6.176),)

In [5]:
def update_glory(ratings, s_glry):
    """Update all five ratings from one game's row of df_glory (lower rank is better)."""
    new_ratings = trueskill.rate(ratings, ranks=[s_glry["dave_rank"],
                                                 s_glry["kyle_rank"],
                                                 s_glry["scott_rank"],
                                                 s_glry["mike_rank"],
                                                 s_glry["joe_rank"]])

    return new_ratings

In [6]:
x = df_glory.loc[0]

In [7]:
mu, sigma = new_dave[0]

In [8]:
df_glory.shape[0]


Out[8]:
14

In [9]:
dave = trueskill.Rating()
kyle = trueskill.Rating()
scott = trueskill.Rating()
mike = trueskill.Rating()
joe = trueskill.Rating()

ratings = [(dave,),(kyle,),(scott,),(mike,),(joe,)]

for ind, row in df_glory.iterrows():
    
    ratings = update_glory(ratings, row)

In [10]:
df_q1 = df_glory.loc[(df_glory["game_number"] < 8)]
df_q2 = df_glory.loc[(df_glory["game_number"] >= 8)]

In [11]:
dave = trueskill.Rating()
kyle = trueskill.Rating()
scott = trueskill.Rating()
mike = trueskill.Rating()
joe = trueskill.Rating()

ratings_q1 = [(dave,),(kyle,),(scott,),(mike,),(joe,)]

for ind, row in df_q1.iterrows():
    
    ratings_q1 = update_glory(ratings_q1, row)

In [12]:
dave = trueskill.Rating()
kyle = trueskill.Rating()
scott = trueskill.Rating()
mike = trueskill.Rating()
joe = trueskill.Rating()

ratings_q2 = [(dave,),(kyle,),(scott,),(mike,),(joe,)]

for ind, row in df_q2.iterrows():
    
    ratings_q2 = update_glory(ratings_q2, row)

In [13]:
def ratings_mu_sigma(rating):
    
    mu, sigma = rating
    
    return (mu, sigma)

ratings_dict = {"player":["dave", "kyle", "scott", "mike", "joe"]}
ratings_tuple_array = [ratings_mu_sigma(rating[0]) for rating in ratings]
ratings_dict["mu"] = [rating[0] for rating in ratings_tuple_array]
ratings_dict["sigma"] = [rating[1] for rating in ratings_tuple_array]
ratings_tuple_array = [ratings_mu_sigma(rating[0]) for rating in ratings_q1]
ratings_dict["mu_q1"] = [rating[0] for rating in ratings_tuple_array]
ratings_dict["sigma_q1"] = [rating[1] for rating in ratings_tuple_array]
ratings_tuple_array = [ratings_mu_sigma(rating[0]) for rating in ratings_q2]
ratings_dict["mu_q2"] = [rating[0] for rating in ratings_tuple_array]
ratings_dict["sigma_q2"] = [rating[1] for rating in ratings_tuple_array]
ratings_tuple_array = [scipy.stats.norm.interval(0.95, loc=0, scale=sigma) for sigma in ratings_dict["sigma"]]
ratings_dict["err95"] = [err[1] for err in ratings_tuple_array]
ratings_tuple_array = [scipy.stats.norm.interval(0.95, loc=0, scale=sigma) for sigma in ratings_dict["sigma_q1"]]
ratings_dict["err95_q1"] = [err[1] for err in ratings_tuple_array]
ratings_tuple_array = [scipy.stats.norm.interval(0.95, loc=0, scale=sigma) for sigma in ratings_dict["sigma_q2"]]
ratings_dict["err95_q2"] = [err[1] for err in ratings_tuple_array]
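
The `err95` columns above are the half-widths of a central 95% interval around each `mu`, which for a normal distribution is roughly 1.96 times `sigma`. A minimal sketch of the same quantity using only the standard library, as a sanity check (`half_width` is a hypothetical helper, not part of the notebook):

```python
from statistics import NormalDist


def half_width(sigma, confidence=0.95):
    """Half-width of the central `confidence` interval of a zero-mean normal."""
    upper_quantile = 0.5 + confidence / 2  # 0.975 for a 95% interval
    return NormalDist(mu=0, sigma=sigma).inv_cdf(upper_quantile)


print(round(half_width(1.0), 3))     # 1.96
print(round(half_width(25 / 3), 2))  # 16.33, for the default TrueSkill sigma of 8.333
```

So a brand-new rating of `mu=25, sigma=8.333` spans roughly 8.7 to 41.3 at the 95% level, and the bars shrink as `sigma` falls with each game played.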

In [14]:
df_trueskill = pandas.DataFrame.from_dict(ratings_dict)
df_trueskill.set_index("player", inplace=True)

In [15]:
x = df_trueskill.index.values

In [16]:
list(x)


Out[16]:
['dave', 'kyle', 'scott', 'mike', 'joe']

In [17]:
fig, ax = matplotlib.pyplot.subplots(figsize=(7, 4))

ax.axhline(y=25, color="lightgray", linewidth=6, zorder=0)
# error bars show the 95% interval around each mu

ax.errorbar([-0.2, 0.8, 1.8, 2.8, 3.8], df_trueskill["mu_q1"], yerr=df_trueskill["err95_q1"], 
            linestyle="", linewidth=2, marker='o', markersize=12, label="games 1-7")
ax.errorbar([0, 1, 2, 3, 4], df_trueskill["mu_q2"], yerr=df_trueskill["err95_q2"], 
            linestyle="", linewidth=2, marker='o', markersize=12, label="games 8-14")
ax.errorbar([0.2, 1.2, 2.2, 3.2, 4.2], df_trueskill["mu"], yerr=df_trueskill["err95"], 
            linestyle="", linewidth=2, marker='o', markersize=12, label="games 1-14")

ax.set_xlim((-0.5, 4.5))
ax.set_ylabel("TrueSkill\n(25 is average performance)", fontsize=14)
ax.set_ylim((13, 37))
ax.set_xticks([0, 1, 2, 3, 4])
ax.set_xticklabels(df_trueskill.index.values, rotation=0, fontsize=14)
ax.set_title('TrueSkill in SeaFall', fontsize=20)
ax.legend(ncol=3, fancybox=True, loc="upper center")

fig.savefig("TrueSkillResults_games_1_to_14.svg", format="svg", dpi=1200)
fig.savefig("TrueSkillResults_games_1_to_14.png", format="png", dpi=1200)


Has anyone separated themselves from the pack (statistically)?

All the 95% confidence intervals overlap, but the difference between the means could still be statistically significant. Let's check this pairwise for all the players, comparing their TrueSkill over games 1-14. We'll do this by calculating the difference distribution between each pair of player TrueSkill distributions. If less than 0.05 of this difference distribution's probability lies below 0, then we'll consider the difference in skill statistically significant.
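
For a single pair of players, the check can be sketched with the standard library alone. This assumes the two ratings are independent normal distributions, so their difference is normal with the variances added. The function name and the ratings passed to it are made up for illustration and are not results from the campaign.

```python
from math import hypot
from statistics import NormalDist


def significantly_different(mu0, sigma0, mu1, sigma1, alpha=0.05):
    """True if the alpha-quantile of the |mu0 - mu1| difference distribution is above 0."""
    # The difference of two independent normals is normal, with
    # standard deviation sqrt(sigma0**2 + sigma1**2).
    diff = NormalDist(abs(mu0 - mu1), hypot(sigma0, sigma1))
    return diff.inv_cdf(alpha) > 0


# Hypothetical ratings, chosen only to show both outcomes.
print(significantly_different(30, 2, 20, 2))  # True
print(significantly_different(26, 3, 24, 3))  # False
```

Because the absolute difference is used, the result is the same whichever player is listed first.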


In [18]:
trueskill_diff = numpy.zeros((5,5))

for pair in itertools.product(df_trueskill.index.values, df_trueskill.index.values):
    row = df_trueskill.index.get_loc(pair[0])
    col = df_trueskill.index.get_loc(pair[1])

    mu0 = df_trueskill["mu"][pair[0]]
    mu1 = df_trueskill["mu"][pair[1]]
    sigma0 = df_trueskill["sigma"][pair[0]]
    sigma1 = df_trueskill["sigma"][pair[1]]

    # difference distribution, treating the two ratings as independent normals
    mudiff = numpy.abs(mu0-mu1)
    sigmadiff = numpy.sqrt(sigma0**2 + sigma1**2)

    # lower bound of the central 90% interval, i.e. the 5th percentile
    left_tail, _ = scipy.stats.norm.interval(0.90, loc=mudiff, scale=sigmadiff)

    if left_tail > 0:
        # the difference is significant
        trueskill_diff[row, col] = 1
        print(pair)


('kyle', 'joe')
('mike', 'joe')
('joe', 'kyle')
('joe', 'mike')

In [19]:
trueskill_diff


Out[19]:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.]])

After a 14-game SeaFall campaign, we can say there is a statistically significant difference between several players' TrueSkill ratings. Sadly, Joe ended his campaign with a series of bitter losses, which made his overall play for the campaign statistically worse than Kyle's or Mike's.

