How to use

  • Calculates the list of all superleaves of length 1-6 tiles (or loads a precomputed list), loads a log of moves from simulated games, and then calculates the expected value of each superleave based on how much was scored by racks containing that superleave.
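
At its core, the value of a leave is the average equity of all moves made from racks containing it, relative to the global per-move average. A minimal sketch of that final step (illustrative names only; the actual accumulation loop appears further down):

def leave_evs(equity_sum, count, global_mean_equity):
    # mean equity of moves whose rack contained the leave,
    # minus the global per-move mean equity
    return {leave: equity_sum[leave] / n - global_mean_equity
            for leave, n in count.items() if n > 0}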

To-do

  • The frequency/count estimate for superleaves is currently calculated incorrectly (5/8 - is this still true?)
  • The synergy calculation is broken.

Changelog

  • 5/8/20 - My superleave calculation was being too short-sighted - not considering the future value of keeping a blank on your rack, and failing to recognize the awfulness of gravity wells like UVWW. I added an adjustment factor that tracks the value of your leftover tiles when you make a play from a rack containing that superleave, which will hopefully help with recognizing the value of holding a ? and not holding awful combinations/the Q.
  • 11/27/19 - wow, it's been a while. Stopped loading all moves into memory (yikes) and instead wrote a much faster version that can go through 50M moves on my local machine in ~3 hours.
  • 1/27/19 - Determined that the speed of creation of the rack dataframes is a function of the length of the dataframe. From that, realized that we should organize leaves by least-frequent to most-frequent letter, such that sub-dataframes are created from the shortest racks possible.

In [1]:
import csv
from datetime import date
from itertools import combinations
import numpy as np
import pandas as pd
import pickle as pkl
import seaborn as sns
from string import ascii_uppercase
import time as time

%matplotlib inline

maximum_superleave_length = 6

log_file = '../logs/log_20200411.csv'
# log_file = '../logs/log_1m.csv'

todays_date = date.today().strftime("%Y%m%d")

In [2]:
todays_date


Out[2]:
'20200508'

Create a dictionary of all possible 1- to 6-tile leaves. Also, add functionality for sorting by an arbitrary key, allowing us to put the rarest letters first.


In [3]:
# tilebag = ['A']*9+['B']*2+['C']*2+['D']*4+['E']*12+\
#           ['F']*2+['G']*3+['H']*2+['I']*9+['J']*1+\
#           ['K']*1+['L']*4+['M']*2+['N']*6+['O']*8+\
#           ['P']*2+['Q']*1+['R']*6+['S']*4+['T']*6+\
#           ['U']*4+['V']*2+['W']*2+['X']*1+['Y']*2+\
#           ['Z']*1+['?']*2

# No superleave is longer than 6 letters, and so we only need to include
# 6 each of the As, Es, Is and Os. This shortens the time it takes to find all of
# the superleaves by 50%!
truncated_tilebag = \
          ['A']*6+['B']*2+['C']*2+['D']*4+['E']*6+\
          ['F']*2+['G']*3+['H']*2+['I']*6+['J']*1+\
          ['K']*1+['L']*4+['M']*2+['N']*6+['O']*6+\
          ['P']*2+['Q']*1+['R']*6+['S']*4+['T']*6+\
          ['U']*4+['V']*2+['W']*2+['X']*1+['Y']*2+\
          ['Z']*1+['?']*2
            
tiles = [x for x in ascii_uppercase] + ['?']

# potential future improvement: calculate optimal order of letters on the fly
# rarity_key = 'ZXKJQ?HYMFPWBCVSGDLURTNAOIE'
alphabetical_key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'
sort_func = lambda x: alphabetical_key.index(x)

On my home machine, the following code takes about 7 minutes to run in its entirety.


In [4]:
# t0 = time.time()

# leaves = {i:sorted(list(set(list(combinations(truncated_tilebag,i))))) for i in 
#           range(1,maximum_superleave_length+1)}

# # turn leaves from lists of letters into strings
# # algorithm runs faster if leaves non-alphabetical!
# for i in range(1,maximum_superleave_length+1):
#     leaves[i] = [''.join(sorted(leave, key=sort_func))
#                  for leave in leaves[i]]

# t1 = time.time()
# print('Calculated superleaves up to length {} in {} seconds'.format(
#     maximum_superleave_length,t1-t0))

# pkl.dump(leaves,open('all_leaves.p','wb'))

In [5]:
leaves = pkl.load(open('all_leaves.p','rb'))

How many superleaves are there of each length? See below:


In [6]:
for i in range(1,maximum_superleave_length+1):
    print(i,len(leaves[i]))


1 27
2 373
3 3509
4 25254
5 148150
6 737311
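
As a sanity check (a hand count, assuming leaves as loaded above): there are 27 tile types (26 letters plus the blank), giving 27 length-1 leaves, and 27·28/2 = 378 size-2 multisets minus the 5 pairs (JJ, KK, QQ, XX, ZZ) that would require two copies of a one-copy tile:

assert len(leaves[1]) == 27
assert len(leaves[2]) == 27 * 28 // 2 - 5   # 373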

Define metrics we're tallying for each subleave

Currently, we track the following metrics with each new rack:

  • Total points
  • Total equity
  • Count (how many times the subleave has appeared in the data set)
  • Bingo count

In [7]:
all_leaves = []

for i in range(1,maximum_superleave_length+1):
    all_leaves.extend(leaves[i])

In [8]:
def find_subleaves(rack, min_length=1, max_length=6, duplicates_allowed=False):
    """Return all subleaves of `rack` with lengths in [min_length, max_length],
    each as a string sorted by sort_func.

    With duplicates_allowed=False, each distinct subleave is returned once;
    with duplicates_allowed=True, a subleave appears once per way it can be
    drawn from the rack (e.g. 'A' appears twice for a rack with two As)."""
    if not duplicates_allowed:
        return [''.join(sorted(x, key=sort_func))
                for i in range(min_length, max_length+1)
                for x in set(combinations(rack, i))]
    else:
        return [''.join(sorted(x, key=sort_func))
                for i in range(min_length, max_length+1)
                for x in combinations(rack, i)]
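
For example, on a rack fragment with a duplicated tile (hypothetical input, shown for illustration):

find_subleaves('AA', min_length=1, max_length=2)                           # ['A', 'AA']
find_subleaves('AA', min_length=1, max_length=2, duplicates_allowed=True)  # ['A', 'A', 'AA']

The duplicates_allowed form is what lets a tile be counted once per copy when valuing leaves with repeated letters.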

tile_limit below is the minimum number of tiles remaining in the bag for a move to be factored into the superleave calculation (with tile_limit = 1, moves made with an empty bag are excluded). The idea is that moves with the bag empty tend to be worth less, and may not reflect the value of a letter in the rest of the game (most notably, if you have the blank and the bag is empty, you often can't bingo!). Moves also tend to be worth a little less at the beginning of a game, when there are fewer juicy spots to play.


In [9]:
t0 = time.time()

# minimum number of tiles remaining in the bag for a move to be counted
tile_limit = 1

bingo_count = {x:0 for x in all_leaves}
count = {x:0 for x in all_leaves}
equity = {x:0 for x in all_leaves}
points = {x:0 for x in all_leaves}
row_count = 0
total_equity = 0
total_points = 0

# log columns (see the stray header row in the output below):
# 0 playerID, 1 gameID, 2 turn, 3 rack, 4 play, 5 score, 6 totalscore,
# 7 tilesplayed, 8 leave, 9 equity, 10 tilesremaining
with open(log_file,'r') as f:
    moveReader = csv.reader(f)
    next(moveReader)  # skip the header row

    for i,row in enumerate(moveReader):
        if i%1000000==0:
            t = time.time()
            print('Processed {} rows in {} seconds'.format(i,t-t0))

        if i<10:
            print(i,row)

        try:
            if int(row[10]) >= tile_limit:

                total_equity += float(row[9])
                total_points += int(row[5])
                row_count += 1

                for subleave in find_subleaves(row[3],
                        max_length=maximum_superleave_length):
                    # a bingo plays all 7 tiles
                    bingo_count[subleave] += row[7] == '7'
                    count[subleave] += 1
                    equity[subleave] += float(row[9])
                    points[subleave] += int(row[5])
        except:
            # malformed rows (e.g. stray header rows from concatenated logs)
            # are printed and skipped
            print(i,row)

t1 = time.time()
print('{} seconds to populate dictionaries'.format(t1-t0))


Processed 0 rows in 0.6255381107330322 seconds
0 ['0', 'pCdgpioJfST7J6AcwFPhwi', '1', 'AUP?NAL', '8D PLANUlA', '72', '72', '7', '', '72.000', '86']
1 ['1', 'pCdgpioJfST7J6AcwFPhwi', '2', 'BZONDDR', 'J8 .DZ', '33', '33', '2', 'BDNOR', '31.475', '79']
2 ['0', 'pCdgpioJfST7J6AcwFPhwi', '3', 'EHEEIOI', 'G5 HEI.IE', '11', '83', '5', 'EO', '10.338', '77']
3 ['1', 'pCdgpioJfST7J6AcwFPhwi', '4', 'GFBDNOR', 'F10 FROND', '34', '67', '5', 'BG', '29.018', '72']
4 ['0', 'qnzjBBfJvUqBoehxRpX7SA', '1', 'ESRNLUI', '(exch LU)', '0', '0', '2', 'EINRS', '20.276', '86']
5 ['1', 'qnzjBBfJvUqBoehxRpX7SA', '2', 'IEYKHPI', '8E PIKI', '20', '20', '4', 'EHY', '22.207', '86']
6 ['0', 'qnzjBBfJvUqBoehxRpX7SA', '3', 'AEEINRS', 'E6 NA.ERIES', '70', '70', '7', '', '70.000', '82']
7 ['0', 'pCdgpioJfST7J6AcwFPhwi', '5', 'NILMREO', '6F L.MONIER', '64', '147', '7', '', '64.000', '67']
8 ['1', 'pCdgpioJfST7J6AcwFPhwi', '6', 'SAVBTBG', '15F STAB', '28', '95', '4', 'BGV', '17.250', '60']
9 ['1', 'qnzjBBfJvUqBoehxRpX7SA', '4', 'TGPAEHY', 'D11 PHAGE', '40', '60', '5', 'TY', '39.171', '75']
Processed 1000000 rows in 290.0632629394531 seconds
Processed 2000000 rows in 583.4354450702667 seconds
...
10243927 ['playerID', 'gameID', 'turn', 'rack', 'play', 'score', 'totalscore', 'tilesplayed', 'leave', 'equity', 'tilesremaining']
...
Processed 115000000 rows in 34400.72460794449 seconds
34615.24369597435 seconds to populate dictionaries

In [10]:
ev_df = pd.concat([pd.Series(points, name='points'),
                  pd.Series(equity, name='equity'),
                  pd.Series(count, name='count'),
                  pd.Series(bingo_count, name='bingo_count')],
                  axis=1)

In [11]:
mean_score = total_points/row_count
mean_equity = total_equity/row_count

In [12]:
ev_df['mean_score'] = ev_df['points']/ev_df['count']
ev_df['mean_equity'] = ev_df['equity']/ev_df['count']
ev_df['bingo pct'] = 100*ev_df['bingo_count']/ev_df['count']
# NOTE (see to-do): this divides by the number of distinct leaves rather than
# the number of qualifying moves, which is why 'pct' values can exceed 100
ev_df['pct'] = 100*ev_df['count']/len(ev_df)
ev_df['adjusted_mean_score'] = ev_df['mean_score']-mean_score
ev_df['ev'] = ev_df['mean_equity']-mean_equity

In [13]:
ev_df['ev'].to_csv('leave_values_' + todays_date + '.csv', index=True)
ev_df.to_csv('leave_summary_' + todays_date + '.csv', index=True)


/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: The signature of `Series.to_csv` was aligned to that of `DataFrame.to_csv`, and argument 'header' will change its default value from False to True: please pass an explicit value to suppress this warning.
  """Entry point for launching an IPython kernel.

Handle missing leave values

If a given superleave of length n is never observed in the trial games, impute its EV from its subleaves of length n-1:

  • if the majority of the subleaves' EVs are positive, take their maximum.
  • if positives and negatives are evenly split, take their average.
  • if the majority are negative, take their minimum.
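
For example (hypothetical EVs, using np as imported above):

sub_evs = [2.0, 1.5, -0.5]         # EVs of the length n-1 subleaves
signs = sum(np.sign(sub_evs))      # 1 + 1 - 1 = 1 > 0: majority positive
imputed_ev = max(sub_evs)          # 2.0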

In [14]:
ev_dict = ev_df['ev'].to_dict()

In [15]:
t0 = time.time()

# all_leaves is ordered shortest-first, so by the time we reach a leave of
# length n, every length n-1 subleave already has a filled-in EV
for leave in all_leaves:
    if pd.isnull(ev_dict[leave]):
        subleaves = find_subleaves(leave,
                                   min_length=len(leave)-1,
                                   max_length=len(leave)-1,
                                   duplicates_allowed=True)
        sub_evs = [ev_dict[subleave] for subleave in subleaves]
        # np.sign avoids the ZeroDivisionError that x/abs(x) raises
        # when a subleave's EV is exactly zero
        signs = sum(np.sign(sub_evs))

        if signs == 0:
            ev_dict[leave] = sum(sub_evs)/len(sub_evs)
        elif signs > 0:
            ev_dict[leave] = max(sub_evs)
        else:
            ev_dict[leave] = min(sub_evs)
        
t1 = time.time()
print('Filled in all NaN superleaves with best guesses in {} seconds'.format(t1-t0))


Filled in all NaN superleaves with best guesses in 0.6205301284790039 seconds

In [16]:
ev_df = ev_df.drop('ev', axis=1)
ev_df = pd.concat([ev_df,pd.Series(ev_dict,name='ev')], axis=1)

In [17]:
ev_df['ev'].to_csv('leave_values_' + todays_date + '_filled_nulls.csv', index=True)
ev_df.to_csv('leave_summary_' + todays_date + '_filled_nulls.csv', index=True)


/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: The signature of `Series.to_csv` was aligned to that of `DataFrame.to_csv`, and argument 'header' will change its default value from False to True: please pass an explicit value to suppress this warning.
  """Entry point for launching an IPython kernel.

In [18]:
ev_df


Out[18]:
              points        equity     count  bingo_count  mean_score  mean_equity  bingo pct          pct  adjusted_mean_score         ev
?          782820960  9.019760e+08  13841229      7322671   56.557186    65.165892  52.904775  1513.324492            17.485506  24.078203
A         2005089326  2.117354e+09  49868572     10037371   40.207474    42.458690  20.127649  5452.357690             1.135795   1.371000
B          461839888  4.765642e+08  12390310      1601088   37.274280    38.462657  12.922098  1354.688921            -1.797399  -2.625032
C          576503825  6.071711e+08  14491423      2644552   39.782416    41.898656  18.249084  1584.413158             0.710737   0.810967
D         1012212785  1.059361e+09  25452930      4829723   39.768026    41.620414  18.975116  2782.884551             0.696347   0.532724
...              ...           ...       ...          ...         ...          ...        ...          ...                  ...        ...
??WXYY             0  0.000000e+00         0            0         NaN          NaN        NaN     0.000000                  NaN  31.301190
?WXYYZ            59  6.361000e+01         1            0   59.000000    63.610000   0.000000     0.000109            19.928321  22.522311
??WXYZ            54  5.319400e+01         1            0   54.000000    53.194000   0.000000     0.000109            14.928321  12.106311
??WYYZ             0  0.000000e+00         0            0         NaN          NaN        NaN     0.000000                  NaN  32.654428
??XYYZ             0  0.000000e+00         0            0         NaN          NaN        NaN     0.000000                  NaN  30.771954

914624 rows × 10 columns
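
As a quick consistency check on the table above, every row should imply the same global mean equity, since ev = mean_equity − (global mean equity). From the '?' row: 65.165892 − 24.078203 = 41.087689; from the 'A' row: 42.458690 − 1.371000 = 41.087690. These agree up to rounding.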

Calculate rack synergy

In other words, how much better is the EV of this superleave than the value of each tile on its own?
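
Concretely, for a leave L with more than one tile:

    synergy(L) = ev(L) − Σ ev(t), summed over each tile t in L (counting duplicates)

so positive synergy means the tiles are worth more together than they are individually, and negative synergy means they get in each other's way.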


In [19]:
t0 = time.time()

synergy = {x:0 for x in all_leaves}

# note: the to-do list above flags this synergy calculation as broken
for leave in all_leaves:
    if len(leave)>1:
        subleaves = find_subleaves(leave, min_length=1, max_length=1, duplicates_allowed=True)
        sub_evs = [ev_dict[subleave] for subleave in subleaves]
        synergy[leave] = ev_dict[leave]-sum(sub_evs)
        
t1 = time.time()
print('Calculated "synergy" in {} seconds'.format(t1-t0))


Calculated "synergy" in 5.305424928665161 seconds

In [20]:
ev_df = pd.concat([ev_df,pd.Series(synergy, name='synergy')], axis=1)

In [21]:
# Index.rename returns a new Index rather than renaming in place,
# so assign the result back
ev_df.index = ev_df.index.rename('superleave')
ev_df.index


Out[21]:
Index(['?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
       ...
       '?WWXYZ', '??WWXZ', '??WWYY', '?WWYYZ', '??WWYZ', '??WXYY', '?WXYYZ',
       '??WXYZ', '??WYYZ', '??XYYZ'],
      dtype='object', name='superleave', length=914624)

In [22]:
ev_df


Out[22]:
              points     count  bingo_count       mean  bingo pct          pct         ev    synergy
?          782820960  13841229      7322671  56.557186  52.904775  1513.324492  17.485506   0.000000
A         2005089326  49868572     10037371  40.207474  20.127649  5452.357690   1.135795   0.000000
B          461839888  12390310      1601088  37.274280  12.922098  1354.688921  -1.797399   0.000000
C          576503825  14491423      2644552  39.782416  18.249084  1584.413158   0.710737   0.000000
D         1012212785  25452930      4829723  39.768026  18.975116  2782.884551   0.696347   0.000000
...              ...       ...          ...        ...        ...          ...        ...        ...
??WXYY             0         0            0        NaN        NaN     0.000000   7.466782 -24.600389
?WXYYZ            59         1            0  59.000000   0.000000     0.000109  19.928321   0.910221
??WXYZ            54         1            0  54.000000   0.000000     0.000109  14.928321 -22.587066
??WYYZ             0         0            0        NaN        NaN     0.000000  15.678321 -18.552548
??XYYZ             0         0            0        NaN        NaN     0.000000  14.833083 -24.823542

914624 rows × 8 columns

Save superleaves to an external file


In [23]:
ev_df['ev']


Out[23]:
?         17.485506
A          1.135795
B         -1.797399
C          0.710737
D          0.696347
            ...    
??WXYY     7.466782
?WXYYZ    19.928321
??WXYZ    14.928321
??WYYZ    15.678321
??XYYZ    14.833083
Name: ev, Length: 914624, dtype: float64

In [24]:
# note: this overwrites the pre-null-fill leave_values/leave_summary files saved
# above; index=False also drops the superleave strings, unlike the earlier saves
ev_df['ev'].to_csv('leave_values_' + todays_date + '.csv', index=False)
ev_df.to_csv('leave_summary_' + todays_date + '.csv', index=False)


/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: The signature of `Series.to_csv` was aligned to that of `DataFrame.to_csv`, and argument 'header' will change its default value from False to True: please pass an explicit value to suppress this warning.
  """Entry point for launching an IPython kernel.

In [ ]: