This notebook contains my solution to this puzzle, written by John Bohannon for the Boston Python User Group:

Consider these base-10 digits: 123456789. If you insert spaces between them, you get various sequences of numbers:

1 2 3 4 5 6 7 8 9
12 3 4 5 67 8 9
1 2 34 5 6 7 89
12 34 56 78 9
1 23456 78 9
12345 6789
etc.

1) Write a program that generates all possible combinations of those digits.

How many are there?

Now let's insert a maximum of 8 addition or subtraction operators between the numbers, like this:

1+2+3+4+5+6+7-8+9
12-3+4+5-67-8+9
1+2+34+5-6-7-89
12-34+56+78+9
1+23456-78-9
12345+6789
etc.

Notice that those arithmetic expressions equate to different values:

1+2+3+4+5+6+7-8+9 = 29
12-3+4+5-67-8+9 = -48
1+2+34+5-6-7-89 = -60
12-34+56+78+9 = 121
1+23456-78-9 = 23370
12345+6789 = 19134
etc.

2) Write a program that generates all possible expressions in this way.

How many sum to 100?

3) Write a program that finds all such expressions for any sum.

Which sum is the most popular, i.e. has the most expressions?

4) Bonus: We can haz pretty data viz?

License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html

The first step is to define OPERATORS and DIGITS. I am putting the sequence of digits into a NumPy array so I can take advantage of numpy.insert(), which can do multiple inserts with one function call.



In [210]:

    
from itertools import *
import numpy

OPERATORS = '+-'
DIGITS = numpy.array(list('123456789'))
NUM_DIGITS = len(DIGITS)

DIGITS









    Out[210]:





array(['1', '2', '3', '4', '5', '6', '7', '8', '9'], 
      dtype='|S1')

Assuming that we know how many operators we're going to insert in an expression, we want to enumerate all possible combinations of operators. itertools.product does what we want.



In [211]:

    
def operator_iter(n):
    """Iterates all sequences of n operators."""
    return product(OPERATORS, repeat=n)

list(operator_iter(2))









    Out[211]:





[('+', '+'), ('+', '-'), ('-', '+'), ('-', '-')]

Given the operators, we have to figure out where to put them. itertools.combinations enumerates the possible sequences of indices.



In [212]:

    
def index_iter(n):
    """Iterates all the places we could put n operators."""
    return combinations(range(1, NUM_DIGITS), n)

list(index_iter(2))









    Out[212]:





[(1, 2),
 (1, 3),
 (1, 4),
 (1, 5),
 (1, 6),
 (1, 7),
 (1, 8),
 (2, 3),
 (2, 4),
 (2, 5),
 (2, 6),
 (2, 7),
 (2, 8),
 (3, 4),
 (3, 5),
 (3, 6),
 (3, 7),
 (3, 8),
 (4, 5),
 (4, 6),
 (4, 7),
 (4, 8),
 (5, 6),
 (5, 7),
 (5, 8),
 (6, 7),
 (6, 8),
 (7, 8)]

Now we can use numpy.insert to combine the digits and operators.



In [213]:

    
def make_expr(index, operator):
    """Inserts the operators at the given locations and make a string."""
    return ''.join(numpy.insert(DIGITS, index, operator))

make_expr((2, 4), ('+', '-'))









    Out[213]:





'12+34-56789'

For a given number of operators, n, we map make_expr onto the product of the indices and the operators, yielding all possible expressions with n operators.



In [214]:

    
def expression_iter(n):
    """Iterates expressions with n operators."""
    return starmap(make_expr, product(index_iter(n), operator_iter(n)))

list(expression_iter(1))









    Out[214]:





['1+23456789',
 '1-23456789',
 '12+3456789',
 '12-3456789',
 '123+456789',
 '123-456789',
 '1234+56789',
 '1234-56789',
 '12345+6789',
 '12345-6789',
 '123456+789',
 '123456-789',
 '1234567+89',
 '1234567-89',
 '12345678+9',
 '12345678-9']

Finally, map expression_iter onto the range 0..NUMDIGITS-1 and chain together the results.



In [215]:

    
def all_expression_iter():
    """Iterates all expressions."""
    return chain(*imap(expression_iter, range(NUM_DIGITS)))

list(islice(all_expression_iter(), 10))









    Out[215]:





['123456789',
 '1+23456789',
 '1-23456789',
 '12+3456789',
 '12-3456789',
 '123+456789',
 '123-456789',
 '1234+56789',
 '1234-56789',
 '12345+6789']

forward maps from each string expression to its numerical value.



In [216]:

    
forward = dict(imap(lambda s: (s, eval(s)), all_expression_iter()))
min(forward.itervalues())









    Out[216]:





-23456788

counter maps from each numerical value to the number of times it appears.



In [217]:

    
from collections import Counter
counter = Counter(forward.itervalues())
counter[100]









    Out[217]:





11

most_frequent is a list of value, frequency pairs in descending order of frequency



In [218]:

    
from operator import itemgetter

most_frequent = sorted(counter.iteritems(), key=itemgetter(1), reverse=True)

for value, freq in most_frequent[:10]:
    print value, freq

Now we want to invert forward.



In [219]:

    
def invert_dict(forward):
    """Inverts a dictionary using a set comprehension."""
    reverse = {}
    set(reverse.setdefault(v, []).append(k) for k, v in forward.iteritems())
    return reverse

reverse maps from each numerical value to a list of expressions.



In [220]:

    
reverse = invert_dict(forward)
reverse[100]









    Out[220]:





['123-45-67+89',
 '1+23-4+5+6+78-9',
 '1+2+3-4+5+6+78+9',
 '123+45-67+8-9',
 '1+23-4+56+7+8+9',
 '12-3-4+5-6+7+89',
 '1+2+34-5+67-8+9',
 '12+3-4+5+67+8+9',
 '123+4-5+67-89',
 '123-4-5-6-7+8-9',
 '12+3+4+5-6-7+89']

A cumulative distribution function (CDF) is represented by a sorted list of values and a list of the corresponding cumulative probabilities.



In [221]:

    
def make_cdf(counter):
    """Makes a cumulative distribution function."""
    vals, freqs = zip(*sorted(counter.iteritems()))
    probs = numpy.cumsum(freqs, dtype=numpy.float)
    probs /= probs[-1]
    return vals, probs

abs_counter is a map from the magnitude of the values (abs) to frequence. cdf is the distribution of the absolute values.



In [222]:

    
abs_counter = Counter(imap(abs, forward.itervalues()))
cdf = make_cdf(abs_counter)
zip(*cdf)[:11]









    Out[222]:





[(0, 0.0016765736930345985),
 (1, 0.00823045267489712),
 (2, 0.010973936899862825),
 (3, 0.0172229843011736),
 (4, 0.019966468526139307),
 (5, 0.026063100137174212),
 (6, 0.029721079103795154),
 (7, 0.035665294924554183),
 (8, 0.038408779149519894),
 (9, 0.045419905502210027),
 (10, 0.048010973936899862)]

You can evaluate a CDF forward or backward in log time using bisect.



In [223]:

    
from bisect import bisect_left

def eval_cdf(cdf, x):
    """Evaluates a cumulative distribution function."""
    vals, probs = cdf
    return probs[bisect_left(vals, x)]

print 'x    CDF(x)'
for x in [10, 100, 1000, 10000, 100000]:
    print x, eval_cdf(cdf, x) * 100









    



x    CDF(x)
10 4.80109739369
100 42.7069044353
1000 83.6610272824
10000 95.5037341869
100000 98.7806736778

Let's try plotting the CDF.



In [224]:

    
import matplotlib.pyplot as pyplot

def plot_cdf(cdf, xscale='linear'):
    pyplot.plot(*cdf, linewidth=4, alpha=0.5)
    pyplot.xlabel('abs(value)')
    pyplot.ylabel('CDF')
    pyplot.xscale(xscale)
    pyplot.show()
    
plot_cdf(cdf)

On a linear scale, we can't see much. Let's try a log-x scale:



In [225]:

    
plot_cdf(cdf, xscale='log')

That looks like a long-tailed distribution. To see if it has a power-law tail, we can plot the complementary CDF (CCDF) on a log-log scale.



In [226]:

    
def plot_ccdf(cdf):
    ccdf = cdf[0], 1-cdf[1]
    pyplot.plot(*ccdf, linewidth=4, alpha=0.5)
    pyplot.xlabel('abs(value)')
    pyplot.ylabel('CCDF')
    pyplot.xscale('log')
    pyplot.yscale('log')
    pyplot.show()
    
plot_ccdf(cdf)

A straight line in this plot indicates a power-law tail. In this example is spans about 6 orders of magnitude.

So you might wonder whether this pattern persists. Let's try it with hexadecimal numbers.



In [40]:

    
DIGITS = numpy.array(list('123456789abcdef'))
NUM_DIGITS = len(DIGITS)

def hex_eval(expr):
    """Evaluates an expression with hexadecimal numbers."""
    expr = '0x' + expr.replace('+', '+0x').replace('-', '-0x')
    return eval(expr)

forward2 = dict(imap(lambda s: (s, hex_eval(s)), all_expression_iter()))



In [166]:

    
def num_and_log(n):
    """Returns a number and its number of digits (actually log base 10)."""
    return n, numpy.log10(abs(n))

num_and_log(len(forward2))









    Out[166]:





(4782969, 6.6796975660752738)



In [167]:

    
num_and_log(min(forward2.itervalues()))









    Out[167]:





(-9927935178558958, 15.996858932905857)



In [168]:

    
num_and_log(max(forward2.itervalues()))









    Out[168]:





(81985529216486895, 16.913737204383434)



In [169]:

    
abs_counter2 = Counter(imap(abs, forward2.itervalues()))
cdf2 = make_cdf(abs_counter2)
plot_ccdf(cdf2)

Yup, that looks like a power tail over about 14 orders of magnitude! The ripple pattern suggests that the tail is a mixture of distributions with geometrically increasing central tendency and decreasing probability. And it looks like the width of each ripple is a factor of 16 (there are 12 ripples; if each is a hexadecade, they should span 14.4 decades).



In [170]:

    
12 * numpy.log10(16)









    Out[170]:





14.449439791871097

That gives us a hint about the cause of the ripples. The value of each expression will tend to be dominated by the largest term. Here's a function that takes an expression and finds the number of digits in the dominant term.



In [171]:

    
import re

def dom_term_len(expr, val):
    return abs(val), max(imap(len, re.split('\+|-', expr)))
    
dom_term_len('1234+56789', 1234+56789)









    Out[171]:





(58023, 5)

dom_term_map maps from the number of digits in the largest term to a list of values.



In [172]:

    
dom_term_map = invert_dict(dict(starmap(dom_term_len, forward.iteritems())))



In [174]:

    
dom_term_map[6][:10]

Now we can compute the distribution of values within each decade (or hexadecade).



In [175]:

    
plot_cdf(make_cdf(Counter(dom_term_map[2])))



In [176]:

    
plot_cdf(make_cdf(Counter(dom_term_map[3])))



In [177]:

    
plot_cdf(make_cdf(Counter(dom_term_map[4])))



In [178]:

    
plot_cdf(make_cdf(Counter(dom_term_map[5])))



In [179]:

    
plot_cdf(make_cdf(Counter(dom_term_map[6])))



In [180]:

    
plot_cdf(make_cdf(Counter(dom_term_map[7])))



In [181]:

    
plot_cdf(make_cdf(Counter(dom_term_map[8])))

Within each decade (or hexadecade) the distribution of values is roughly uniform.

It looks like the distribution of values is a mixture of uniform distributions where each element of the mixture spans a decade. So why does the resulting mixture have a power-law tail? Because the number of values in each element of the mixture decreases geometrically. To see that, let's plot the number of values as a function of the size of the largest term.



In [190]:

    
pairs = sorted(starmap(lambda x, t: (x, len(t)), dom_term_map.iteritems()))
xs, ys = zip(*pairs)
pyplot.plot(xs, ys, linewidth=4, alpha=0.5)
pyplot.xlabel('Digits in largest term')
pyplot.ylabel('Number of values')
pyplot.show()

From 4 on up, that looks like a geometric decay. To confirm, let's look at the log ratio of successive elements.



In [192]:

    
diffs = numpy.diff(numpy.log10(ys))
pyplot.plot(xs[1:], diffs, linewidth=4, alpha=0.5)
pyplot.show()

That looks like it's converging at about -0.6, so we expect the mixture distribution, plotted on a log scale, to drop by 0.6 per decade. And it does.