Predicting World of Warcraft Avatar Leveling Behavior

This project utilizes the publicly available World of Warcraft Avatar History dataset to garner insight into the game itself as well as its player base. In this case, I have focused on one particular problem: the leveling behavior of avatars.

Problem

Can we predict whether or not an avatar in World of Warcraft will be leveled to the max based upon simple metrics describing play behavior? In particular, we would like to explore whether avatar location and guild preferences, as well as play behavior, can be used to predict whether or not an avatar will reach the maximum level allowable in WoW.

Data

This dataset represents one year of observations in 2008 for ~37,000 avatars from the Horde faction in WoW. These observations include the level, location, guild, race, and class of each avatar at a given instant in time. Our goal here is to explore this data for any interesting relationships with our main problem in mind. Ideally, we would like to be able to boil down this raw dataset to a few useful metrics that correlate with whether or not an avatar reaches max level.



In [1]:

    
import pandas as pd
#We don't like infinities, so set those to null
pd.set_option('mode.use_inf_as_null', True)
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

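#cPickle only exists in Python 2; fall back to pickle under Python 3
try:
    import cPickle
except ImportError:
    import pickle as cPickle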

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.metrics import precision_recall_fscore_support

from functions import plot_learning_curve, distill_wowah

import warnings
warnings.filterwarnings('ignore')

#Constants
#Max level before WLK
MAX_LEVEL1 = 70
#Max level after
MAX_LEVEL2 = 80
#Release of the Wrath of the Lich King expansion
WLK_RD = pd.to_datetime('11/18/2008')

Load and wrangle data



In [2]:

    
#Load data
wow_df = pd.read_csv('wowah_data.csv')

Get some basic info from the data



In [3]:

    
wow_df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10826734 entries, 0 to 10826733
Data columns (total 7 columns):
char          int64
 level        int64
 race         object
 charclass    object
 zone         object
 guild        int64
 timestamp    object
dtypes: int64(3), object(4)
memory usage: 578.2+ MB



In [4]:

    
wow_df.columns.values









    Out[4]:





array(['char', ' level', ' race', ' charclass', ' zone', ' guild',
       ' timestamp'], dtype=object)



In [5]:

    
#Note that some column names have leading whitespace. Strip that whitespace for clarity
wow_df.columns = [x.lstrip() for x in wow_df.columns.values]



In [6]:

    
#Having all the unique character names will be useful later
chars = np.sort(wow_df.char.unique())
nav = np.size(chars)
#Calculate the average number of data points per avatar
print(len(wow_df) / float(nav))

#Also, let's find the number of unique guilds
guilds = np.sort(wow_df.guild.unique())
ng = np.size(guilds)

#Also, let's find the number of unique locations
locs = np.sort(wow_df.zone.unique())
nl = np.size(locs)

So we have ~300 data points per avatar. Now let's explore the properties of the time sampling. For this, we need to convert the timestamp column to a numerical value.



In [7]:

    
#Now let's convert the timestamp column to numerical values
#First check if it's already been done
try:
    with open('timestamps.pkl', 'rb') as f:
        #Use cPickle for fast serialization
        dts = cPickle.load(f)
except IOError:
    dts = pd.to_datetime(wow_df['timestamp'])

    with open('timestamps.pkl', 'wb') as f:
        cPickle.dump(dts, f)
        
#Just replace old timestamp column since it's no longer necessary 
wow_df['timestamp'] = dts

Let's try plotting a level curve for one player to examine how they progressed.



In [8]:

    
av = wow_df.loc[wow_df['char'] == chars[1]]

#Find where max level is reached since curve is pretty boring after that
prog = av[av['level'] < MAX_LEVEL2]

plt.plot(prog['timestamp'], prog['level'])
plt.xticks(['02-2008', '04-2008', '06-2008', '08-2008', '10-2008', '12-2008'], 
           ['Feb 2008', 'Apr 2008', 'Jun 2008', 'Aug 2008', 'Oct 2008', 'Dec 2008'])
plt.xlabel('Time')
plt.ylabel('Player Level')









    Out[8]:





<matplotlib.text.Text at 0x140261110>

It looks like this avatar was at roughly level 54 when the observations began and progressed to close to the max level in about a year, with a long hiatus along the way. The hiatus corresponds to the pre-expansion level cap of 70 (MAX_LEVEL1), and progression resumes once Wrath of the Lich King is released. The expansion release within this timeframe will clearly need to be accounted for in order to make accurate predictions.
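One simple way to account for the expansion is to make "max level" depend on the observation time. The snippet below is only a sketch of that idea: it reuses the constants defined at the top of this notebook, while level_cap and at_cap are hypothetical helpers introduced here for illustration and are not part of the original analysis.

```python
import pandas as pd

# Values taken from the constants defined at the top of this notebook
MAX_LEVEL1 = 70                          # cap before Wrath of the Lich King
MAX_LEVEL2 = 80                          # cap after the expansion
WLK_RD = pd.to_datetime('11/18/2008')    # expansion release date

def level_cap(timestamp):
    """Return the level cap in force at a given time (hypothetical helper)."""
    return MAX_LEVEL1 if pd.to_datetime(timestamp) < WLK_RD else MAX_LEVEL2

def at_cap(level, timestamp):
    """True if an observed level equals the cap in force at that time."""
    return level == level_cap(timestamp)

print(level_cap('2008-06-01'), level_cap('2008-12-01'))
```

With a helper like this, a level-70 observation in March 2008 counts as max level, while the same level in December 2008 does not.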

Distilling the data on an avatar-by-avatar basis

Since we are looking for variations in level progression across a broad range of levels and times, we want to come up with some features that describe the progression of level and play over time. So, let's go through each avatar and see if it reached max level. We will also create a new dataframe called av_df to store other potentially useful metrics for each avatar. These will include such things as average level, max level, level range, whether or not the avatar changed guilds at all, most frequented guild, and most frequented location.

At the end of the day, we want to characterize the variations in avatar level and see if these variations can help us predict whether each avatar reached the max level or not.
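The distill_wowah helper comes from a local functions module that is not shown in this notebook. As a rough sketch of the kind of groupby/agg reduction involved, the following uses a made-up toy frame and a hypothetical distill_sketch function. The output column names mirror those of av_df, but the exact definitions are assumptions (for example, that nplays counts observations and that baseline_td is measured in days).

```python
import pandas as pd

# Toy stand-in for the raw observations (values are made up)
toy = pd.DataFrame({
    'char':  [1, 1, 1, 2, 2],
    'level': [10, 12, 15, 70, 70],
    'guild': [-1, 5, 5, 7, 7],
    'zone':  ['Durotar', 'The Barrens', 'The Barrens', 'Shattrath City', 'Nagrand'],
    'timestamp': pd.to_datetime(['2008-01-01', '2008-01-03', '2008-01-10',
                                 '2008-02-01', '2008-02-02']),
})

def distill_sketch(df):
    """Collapse per-observation rows into one row per avatar (illustrative only)."""
    g = df.groupby('char')
    out = g.agg(
        avglvl=('level', 'mean'),
        maxlvl=('level', 'max'),
        nguild=('guild', 'nunique'),
        nzon=('zone', 'nunique'),
        nplays=('level', 'size'),
        firstplay=('timestamp', 'min'),
        lastplay=('timestamp', 'max'),
    )
    # Range of levels observed for each avatar
    out['lvlrng'] = out['maxlvl'] - g['level'].min()
    # Assumed definition: days between first and last observation
    out['baseline_td'] = (out['lastplay'] - out['firstplay']).dt.total_seconds() / 86400.0
    return out.reset_index()

print(distill_sketch(toy))
```

The real function also produces quantities such as modguild, modzon, and the pre/post-WLK flags, which are left out here to keep the sketch short.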



In [9]:

    
#Use a function to distill raw WoWAH data into quantities grouped by character ID. This script essentially 
# uses the pandas groupby and agg functions to calculate metrics for each character over the time baseline
av_df = distill_wowah(wow_df, chars)
av_df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 37349 entries, 0 to 37348
Data columns (total 23 columns):
char                37349 non-null int64
nrace               37349 non-null int64
ncharclass          37349 non-null int64
race                37349 non-null object
charclass           37349 non-null object
avglvl              37349 non-null float64
maxlvl              37349 non-null int64
maxlvld             37349 non-null bool
maxlvld_preWLK      32537 non-null object
lvlrng              37349 non-null int64
nguild              37349 non-null int64
modguild            37349 non-null int64
modzon              37349 non-null object
nzon                37349 non-null int64
nplays              37349 non-null int64
lastplay            37349 non-null datetime64[ns]
firstplay           37349 non-null datetime64[ns]
baseline            37349 non-null timedelta64[ns]
baseline_td         37349 non-null float64
prog_baseline       37349 non-null timedelta64[ns]
prog_baseline_td    37349 non-null float64
preWLK              37349 non-null bool
postWLK             37349 non-null bool
dtypes: bool(3), datetime64[ns](2), float64(3), int64(9), object(4), timedelta64[ns](2)
memory usage: 6.1+ MB

So now we have a new dataframe that contains one row per avatar with some aggregate properties for each. We now want to clean the data up. As other explorations of this data have shown, there are some problematic avatars. First, a significant fraction of them logged on for only one play. Given that we want to explore time evolution with this dataset, singly observed avatars will not tell us much, so we want to get rid of these. Additionally, many avatars show mysterious race/class swaps even though these data predate race swapping in WoW. It is unclear whether these race and class swaps are real or whether they reflect inconsistencies in the data. In any case, such swaps would likely muddy any insights into race/class dependencies, so we will also remove all characters that show more than one race or class.



In [10]:

    
#Remove singly observed avatars
av_df = av_df[av_df['nplays'] > 1]
#Remove race changing avatars
av_df = av_df[av_df['nrace'] == 1]
#Remove class changing avatars
av_df = av_df[av_df['ncharclass'] == 1]

print(len(av_df))

So there were almost 10,000 problematic avatars, which removes roughly a quarter of the 37,349 avatars we started with. Still, we should be able to look for interesting relationships with this cleaned dataset.

Race and Class

Now let's see if there are any trends in terms of max leveling with race or class. Each race and class has its unique abilities, which we expect to affect the level progression of avatars. Let us explore the existence of these correlations with some plots. The following plots will show the fraction of avatars in a given class-race combination that reached max level (level 80).



In [11]:

    
grid = sns.FacetGrid(av_df, row='race', size=4, aspect=3)
grid.map(sns.barplot, 'charclass','maxlvld', alpha=.5, ci=None)









    Out[11]:





<seaborn.axisgrid.FacetGrid at 0x103f70c50>

So it does indeed look like whether or not an avatar reached max level depends on both race and class. Death Knights in particular reach max level at a high rate across several races. This probably reflects the fact that Wrath of the Lich King was released during the period covered by this dataset, so the class had a 'new' factor that led many players to level their first Death Knight. We also see race-specific preferences, such as the Troll Rogue, Tauren Mage, and Orc Paladin. In any case, both race and class will probably be useful features for the model.

Play rate

We can combine the number of plays and total playtime features to calculate an average 'rate' of play, plrate. This newly engineered feature will measure the amount of play independent of the window function of the observations. Essentially, we will divide the number of plays by the total time baseline. This will give us an idea how frequently each avatar was played. We expect that the play rate should be correlated with max leveling.



In [12]:

    
# Construct the play rate from the number of plays observed and the total time baseline for which
# the avatar was observed
av_df['plrate'] = np.log10(av_df['nplays'] / av_df['baseline_td'])

g = sns.FacetGrid(av_df, col='maxlvld')
g.map(sns.distplot, 'plrate', bins=10, kde=False, norm_hist=True)









    Out[12]:





<seaborn.axisgrid.FacetGrid at 0x129844090>

It appears that our intuition was correct. The play rate distributions are starkly different between the two sets of avatars. Avatars that reached max level (post WLK) exhibit a roughly lognormal distribution peaking at around 10 plays/day. In contrast, the avatars that did not reach max level show a much broader distribution with lower play rates on average. The exception is the lone peak at play rates >100/day, which would seem to represent avatars that started to level quickly after WLK was released but did not reach max level before these observations ended.

Progression rate

Now we will construct another feature: the progression rate, prate. For this, we want to quantify how quickly an avatar was leveled up over time. This will employ the lvlrng and prog_baseline_td features. The former is the range of levels observed for each avatar, and the latter measures the time each avatar was played before reaching the max level. Again, we expect this feature to correlate with max leveling, but we will investigate this using some plots.



In [13]:

    
#Construct progression rate
av_df['prate'] = np.log10(av_df['lvlrng'] / av_df['prog_baseline_td'])
#A zero level range or zero baseline yields -inf, inf, or NaN here, so keep only rows with a finite prate
av_df_npr = av_df[np.isfinite(av_df['prate'])]
g = sns.FacetGrid(av_df_npr, col='maxlvld')
g.map(sns.distplot, 'prate', bins=10, kde=False, norm_hist=True)









    Out[13]:





<seaborn.axisgrid.FacetGrid at 0x1235c6950>

Again, by combining the two features describing avatar level progression, we obtain a new feature correlated with max leveling. This tells us that avatars that reach max level progress at different rates than those that do not. Thus, we will include this feature in our final model.
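The need to drop non-finite progression rates follows directly from how prate is built. The sketch below uses a made-up four-row frame (not the real av_df) to show the three ways the log of a ratio can fail and how np.isfinite filters them out.

```python
import numpy as np
import pandas as pd

# Made-up avatars: the second never leveled, the third and fourth have a zero-length baseline
toy = pd.DataFrame({
    'lvlrng':           [20, 0, 3, 0],
    'prog_baseline_td': [10.0, 5.0, 0.0, 0.0],
})

# Silence the divide-by-zero and invalid-value warnings for this demonstration
with np.errstate(divide='ignore', invalid='ignore'):
    toy['prate'] = np.log10(toy['lvlrng'] / toy['prog_baseline_td'])

# 20/10 -> finite, 0/5 -> log10(0) = -inf, 3/0 -> +inf, 0/0 -> NaN
finite = toy[np.isfinite(toy['prate'])]
print(toy)
print(len(finite))
```

Only the first toy avatar survives the filter. The same reasoning applies to the other log rates in this notebook, since each is a ratio that can have a zero numerator or denominator.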

Guild

Does the guild affect the likelihood of max leveling?

Another feature we would like to explore is the guild behavior of the avatar. Its relation to max leveling may not be as straightforward as that of other features, but the culture and membership of a guild may well affect the likelihood that its constituent avatars reach max level. Note that an avatar can have been in multiple guilds throughout these observations, so we are forced to work with summary statistics: the number of guilds an avatar was found in per unit time, and the guild it was observed in most frequently.

First, we examine the former. For this, we use the nguild and baseline_td features. The former represents the number of unique guilds an avatar was part of, while the latter again measures the total time baseline over which the avatar was observed. Combining these gives us the grate feature, measuring the number of unique guilds an avatar belonged to per unit time.



In [14]:

    
#Number of guilds per unit time
av_df['grate'] = np.log10(av_df['nguild'] / av_df['baseline_td'])
g = sns.FacetGrid(av_df.loc[np.isfinite(av_df['grate'])], col='maxlvld')
g.map(sns.distplot, 'grate', bins=20, kde=False, norm_hist=True)









    Out[14]:





<seaborn.axisgrid.FacetGrid at 0x11b57d090>

It appears that those avatars who fluctuate between guilds less rapidly achieve max level more frequently than those who float back and forth between many guilds quickly. The stark difference between the two distributions indicates that this will likely be an important feature to consider in our model.

We will now investigate whether or not specific guilds promote max leveling more than others. For this, we show the distribution of avatars that did and didn't reach max level separated by the modguild feature, where modguild is simply an integer value corresponding to a unique guild ID.



In [15]:

    
#Most frequented guild
g = sns.FacetGrid(av_df, col='maxlvld')
g.map(sns.distplot, 'modguild', bins=ng, kde=False, norm_hist=True)









    Out[15]:





<seaborn.axisgrid.FacetGrid at 0x117d1c050>

In line with our initial assumption, some guilds tend to promote max leveling more than others. There also appears to be one guild that disfavors max leveling significantly more than the rest. Finally, an avatar is overwhelmingly less likely to reach max level if it did not belong to a guild (strong peak at -1). Thus, the social aspect of guilds is an immensely important one in WoW, and we should try to include the most frequented guild in our model.

Location

Finally, let us explore how another categorical feature relates to whether or not an avatar reached max level: the locations frequented. First, we take a look at the number of locations frequented by avatars from the two sets. As with the guilds, it may be important to factor in how many locations an avatar visits in a timespan. Thus, we will create a location rate feature, lrate, which measures how frequently an avatar moves between different locations. For this we again use the baseline_td feature in addition to nzon, the number of unique locations visited by an avatar.



In [16]:

    
#Number of locations per unit time
av_df['lrate'] = np.log10(av_df['nzon'] / av_df['baseline_td'])
g = sns.FacetGrid(av_df.loc[np.isfinite(av_df['lrate'])], col='maxlvld')
g.map(sns.distplot, 'lrate', bins=20, kde=False, norm_hist=True)









    Out[16]:





<seaborn.axisgrid.FacetGrid at 0x136830910>

Again, we see that when we control for the time baseline of observations, the location change rate correlates fairly strongly with max leveling. The distributions are markedly different for avatars that reached max level compared with those that did not.

As with specific guilds, we now want to see if specific locations visited tend to correlate more strongly with leveling to the max. For this, we need to transform the most frequented location feature, modzon, to a numerical value. To do so, we make dummy variables for each location and use those to convert modzon to modzonkey, which is just an integer ID corresponding to a unique location.



In [17]:

    
#Most frequented location
#For this, we need to convert the categorical modzon feature to a numerical one
#Use get_dummies for this
av_df_dum = pd.get_dummies(av_df['modzon'] , columns=['modzon'])

av_df['modzonkey'] = av_df_dum.values.argmax(1)

g = sns.FacetGrid(av_df, col='maxlvld')
g.map(sns.distplot, 'modzonkey', bins=nl, kde=False, norm_hist=True)









    Out[17]:





<seaborn.axisgrid.FacetGrid at 0x134641990>

Finally, we again see clear indications that frequenting certain locations is strongly correlated with reaching max level. Certain locations contain quests, beasts, and other game features that cater to particular level bands. Thus, we will want to include modzon in the features for the final model.
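The modzon-to-modzonkey conversion used for the plot above can be checked on a small example. The sketch below uses five made-up zone labels and compares the get_dummies plus argmax route with pandas categorical codes; the latter is an alternative suggested here, not something the original analysis uses.

```python
import pandas as pd

# Made-up most-frequented zones for five avatars
modzon = pd.Series(['Dalaran', 'Icecrown', 'Dalaran', 'Nagrand', 'Icecrown'])

# Route used in this notebook: one-hot encode, then take the position of the 1
dummies = pd.get_dummies(modzon)
key_from_dummies = dummies.values.argmax(1)

# Alternative (an assumption, not the author's code): sorted categorical codes
key_from_codes = modzon.astype('category').cat.codes.values

print(key_from_dummies, key_from_codes)
```

Both routes number the zones in sorted order, so the integer key carries no meaning beyond identifying a zone. The histogram of modzonkey is therefore best read as a bar chart over categories rather than as a continuous distribution.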

Fit model to predict whether avatars max leveled or not



In [18]:

    
#First, we need to convert the categorical variables to dummy variables for compatibility with standard algorithms
#Make dummy variables
av_df_dum = pd.get_dummies(av_df, columns=['race', 'charclass', 'modzon', 'modguild'])

#Get rid of NaN rates
av_df_dum = av_df_dum.loc[(np.isfinite(av_df_dum['plrate'])) & (np.isfinite(av_df_dum['prate']))
                          & (np.isfinite(av_df_dum['grate'])) & (np.isfinite(av_df_dum['lrate']))]

#We also want to get rid of many features. We want to remove all features that are cumulative over the time coverage 
# because the model loses its predictive power if it needs to know all these details over the specific time baseline
#We also remove the number of races and classes, since these have been cleaned to all be identically 1
X = av_df_dum.drop(['nrace', 'ncharclass', 'char', 'lastplay', 'firstplay', 'baseline', 'prog_baseline', 
                    'nzon', 'nguild', 'baseline_td', 'prog_baseline_td', 'nplays', 'lvlrng', 
                    'maxlvl', 'modzonkey',
                    'maxlvld', 'maxlvld_preWLK'
                    ], axis=1).copy()
X['preWLK'] = X['preWLK'].astype(int)
X['postWLK'] = X['postWLK'].astype(int)
y = av_df_dum['maxlvld']

#Next, split dataset up into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Thus, we are left with race, class, and most frequented guilds and locations as features for the model. Additionally, we include whether or not an avatar was created before or after the WLK expansion since this obviously has a huge bearing on the max level. Finally, we include only time baseline-independent numerical features: the average level and the play, progression, guild, and location rates. We do not want our model to depend on how long an avatar has been observed, since that would reduce its generality.

Now we fit a logistic regression model to the avatar data. Such a model produces a classification accuracy of ~98.4% on the training set.



In [19]:

    
#Fit a logistic regression model to the maxlvld data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
#Note: this is the accuracy on the training set, not on the held-out test set
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
acc_log









    Out[19]:





98.38

We can also see which features are most strongly associated with max leveling, positively or negatively. The following table displays each feature alongside its fitted logistic regression coefficient (the column labeled Correlation).



In [20]:

    
coeff_df = pd.DataFrame(X_train.columns)
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation', ascending=False)









    Out[20]:






  
    
      
     Feature                                  Correlation
4    prate                                     2.186520
6    lrate                                     2.023194
65   modzon_Icecrown                           1.619145
40   modzon_Dalaran                            1.080569
109  modzon_The Storm Peaks                    0.852271
310  modguild_291                              0.766691
56   modzon_Ghostlands                         0.756438
410  modguild_402                              0.656360
459  modguild_471                              0.607090
379  modguild_368                              0.598273
264  modguild_241                              0.574088
358  modguild_342                              0.536802
74   modzon_Naxxramas                          0.482008
57   modzon_Grizzly Hills                      0.477511
3    plrate                                    0.472671
207  modguild_156                              0.454525
402  modguild_393                              0.442921
220  modguild_174                              0.437169
239  modguild_204                              0.425061
372  modguild_359                              0.424679
317  modguild_298                              0.420009
341  modguild_324                              0.397221
90   modzon_Sholazar Basin                     0.386614
45   modzon_Dragonblight                       0.380228
113  modzon_Tirisfal Glades                    0.368718
287  modguild_266                              0.363341
466  modguild_485                              0.350170
452  modguild_463                              0.347421
352  modguild_336                              0.344439
269  modguild_247                              0.335891
...  ...                                       ...
377  modguild_365                             -0.785983
273  modguild_251                             -0.808040
128  modguild_-1                              -0.810566
21   charclass_Warrior                        -0.838627
230  modguild_191                             -0.846122
17   charclass_Priest                         -0.855880
181  modguild_115                             -0.916099
334  modguild_315                             -0.958422
5    grate                                    -0.993463
16   charclass_Paladin                        -1.006036
63   modzon_Howling Fjord                     -1.075910
20   charclass_Warlock                        -1.126207
13   charclass_Druid                          -1.152706
14   charclass_Hunter                         -1.203782
15   charclass_Mage                           -1.206192
12   charclass_Death Knight                   -1.207788
73   modzon_Nagrand                           -1.255032
18   charclass_Rogue                          -1.270057
19   charclass_Shaman                         -1.335956
60   modzon_Hellfire Peninsula                -1.369950
124  modzon_Zangarmarsh                       -1.497671
102  modzon_Terokkar Forest                   -1.532642
9    race_Tauren                              -2.104804
11   race_Undead                              -2.117047
10   race_Troll                               -2.228311
8    race_Orc                                 -2.302052
7    race_Blood Elf                           -2.451017
78   modzon_Plaguelands: The Scarlet Enclave  -3.326378
2    postWLK                                  -3.873891
1    preWLK                                   -4.673023
    
  

480 rows × 2 columns
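The values in the Correlation column come from logreg.coef_, so they are logistic regression coefficients, i.e. changes in log-odds. Under the standard logistic model, exponentiating a coefficient gives the multiplicative change in the odds of max leveling for a one-unit increase in that feature with the others held fixed. The short sketch below applies this to three coefficients copied from the table. It is meant as a reading aid only: scikit-learn's LogisticRegression applies L2 regularization by default and the features here are not standardized, so magnitudes should be compared with care.

```python
import numpy as np

# Coefficients copied from the table above
coefs = {'prate': 2.186520, 'grate': -0.993463, 'modguild_-1': -0.810566}

# exp(coefficient) = multiplicative change in the odds of reaching max level
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: float(np.exp(c)) for name, c in coefs.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 2))
```

Because prate and grate are base-10 logarithms, a one-unit increase corresponds to a tenfold increase in the underlying rate.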

Evaluating the model

So evidently, this model works pretty well: its classification accuracy on the training set is extremely high. Some interesting things to note:

1) The top few rows show the features with the largest positive coefficients. Apart from the progression and location rates, most of these are specific locations and guilds. The play rate also has a positive coefficient, so all three rates appear to be important factors in determining whether one of these avatars reaches max level or not.

2) On the other side of things, we see certain locations strongly correlated with not achieving max level. Few guilds are negatively correlated with max leveling.

3) The pre- and post-WLK flags have the most strongly negative coefficients of all.

But wait... is this score too good to be true? Let's investigate for possible overfitting using learning curves.



In [21]:

    
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)

plot_learning_curve(LogisticRegression(), 'Logistic Regression', X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)









    Out[21]:





<module 'matplotlib.pyplot' from '/Users/two-liter/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

This learning curve shows that the model fits both the training and cross-validation sets well. The fact that it does equally well on the CV set indicates that overfitting is probably not a problem.
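The plot_learning_curve helper is imported from a local functions module that is not shown here. A minimal sketch of the underlying computation, assuming it is built on scikit-learn's learning_curve, might look like the following. The data are synthetic (generated with make_classification) and the smaller number of splits is chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in data; the real notebook uses X and y built from av_df
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(), X_toy, y_toy, cv=cv,
    train_sizes=np.linspace(0.2, 1.0, 5))

# Average over the CV splits; a persistent gap between the two curves
# would be the usual sign of overfitting
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
for n, tr, te in zip(sizes, train_mean, cv_mean):
    print(n, round(tr, 3), round(te, 3))
```

If the training score stayed well above the cross-validation score as the training set grew, that would point to overfitting; curves that converge, as described above, suggest the opposite.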

Let's also examine the precision and recall of the model.



In [22]:

    
pr, rec, fsc, sup = precision_recall_fscore_support(y_test, y_pred, average=None)
print(pr)
print(rec)









    



[ 0.98831286  0.7615894 ]
[ 0.98920216  0.74675325]

It also appears that the model is quite precise and recalls accurately when an avatar did not reach max level (~99% for both). However, it is less precise when predicting max leveling (~76%) and recalls those avatars slightly worse still (~75%), so roughly 3/4 of max-leveled avatars are correctly recalled.
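With average=None, precision_recall_fscore_support returns one value per class in sorted label order, so the first entry of each printed array refers to avatars that did not reach max level and the second to those that did. To make the two quantities concrete, the sketch below recomputes them with plain NumPy on ten made-up labels; per_class_precision_recall is a hypothetical helper written for this illustration.

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred):
    """Precision and recall for the False and True classes, in that order."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    precision, recall = [], []
    for label in (False, True):
        # Correct predictions of this class
        tp = np.sum((y_pred == label) & (y_true == label))
        predicted = np.sum(y_pred == label)   # everything predicted as this class
        actual = np.sum(y_true == label)      # everything truly in this class
        precision.append(tp / predicted)
        recall.append(tp / actual)
    return np.array(precision), np.array(recall)

# Made-up labels: 6 avatars that did not max level, 4 that did
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
pr, rec = per_class_precision_recall(y_true, y_pred)
print(pr, rec)
```

Because max-leveled avatars are the minority class, a single misclassification moves their precision and recall much more than it moves the majority-class values, which is part of why the second entry in each array is noticeably lower.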

Removing Wrath of the Lich King

Evidently, our chosen model does not perform especially well at precisely and accurately recalling avatars that reached max level. Thus, maybe we need to tweak our model. In particular, it is likely that the effects of WLK on this dataset are discontinuous in time, and thus are probably muddying things. Therefore, we will try the above again using only pre-WLK data.



In [23]:

    
#Recalculate progression metrics restricting to pre-WLK
#In this case, we will just reconstruct the avatar dataframe using only observations from the original wow_df pre-WLK
wow_df = wow_df.loc[wow_df['timestamp'] < WLK_RD]

#Now re-distill using only pre-WLK data
av_prewlk_df = distill_wowah(wow_df, chars)

#Now calculate rates
#Number of guilds per unit time
av_prewlk_df['grate'] = np.log10(av_prewlk_df['nguild'] / av_prewlk_df['baseline_td'])
#Number of locations per unit time
av_prewlk_df['lrate'] = np.log10(av_prewlk_df['nzon'] / av_prewlk_df['baseline_td'])
#Construct play rate
av_prewlk_df['plrate'] = np.log10(av_prewlk_df['nplays'] / av_prewlk_df['baseline_td'])
#Construct progression rate
av_prewlk_df['prate'] = np.log10(av_prewlk_df['lvlrng'] / av_prewlk_df['prog_baseline_td'])

#Make dummy variables
av_df_dum = pd.get_dummies(av_prewlk_df, columns=['race', 'charclass', 'modzon', 'modguild'])

#Get rid of NaN rates
av_df_dum = av_df_dum.loc[(np.isfinite(av_df_dum['plrate'])) & (np.isfinite(av_df_dum['prate']))
                          & (np.isfinite(av_df_dum['grate'])) & (np.isfinite(av_df_dum['lrate']))]
#We also want to get rid of many features. We want to remove all features that are cumulative over the time coverage 
# because the model loses its predictive power if it needs to know all these details over the specific time baseline
#We also remove the number of races and classes, since these have been cleaned to all be identically 1
X = av_df_dum.drop(['nrace', 'ncharclass', 'char', 'lastplay', 'firstplay', 'baseline', 'prog_baseline', 
                    'nzon', 'nguild', 'baseline_td', 'prog_baseline_td', 'nplays', 'lvlrng', 
                    'maxlvl',
                    'maxlvld', 'maxlvld_preWLK', 'preWLK', 'postWLK',
                    ], axis=1).copy()

y = av_df_dum['maxlvld_preWLK'].astype(bool)

#Next, split dataset up into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [24]:

    
#Fit a logistic regression model to the maxlvld_preWLK data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
#Note: this is the accuracy on the training set, not on the held-out test set
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
acc_log









    Out[24]:





98.46



In [25]:

    
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)

plot_learning_curve(LogisticRegression(), 'Logistic Regression', X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)









    Out[25]:





<module 'matplotlib.pyplot' from '/Users/two-liter/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>



In [26]:

    
pr, rec, fsc, sup = precision_recall_fscore_support(y_test, y_pred, average=None)
print(pr)
print(rec)









    



[ 0.98847835  0.91483516]
[ 0.98769353  0.9198895 ]

We see that after removing WLK data from this set, the predictive power of this model is greatly improved. With precisions and recalls >90% for both classes of avatars and no apparent overfitting, we conclude that this model should do a decent job at predicting whether or not an avatar will reach max level given some time-series observations. The above analysis indicates that each expansion era needs to be treated completely separately in order to make a strong model.

Conclusions

We found that this model can do a good job at predicting whether or not an avatar reaches max level. Essentially, this provides a way to accurately predict whether or not an avatar will reach max level based upon the locations it visits, the guilds it occupies, its race and class, and play behavior. The accuracy, precision, and recall of this model are all >90%. In terms of real-world applicability, such a model could potentially be used to track individual avatars and estimate whether or not they will be played to max level. Developers could then trace these avatars to individual players and use this information to better understand why some players might stop playing (e.g. stop max leveling?). Modifications to gameplay, UI, and pay structure might then be implemented considering these results.
