This notebook contains my solution to this puzzle, written by John Bohannon for the Boston Python User Group:

Consider these base-10 digits: 123456789. If you insert spaces between them, you get various sequences of numbers:

1 2 3 4 5 6 7 8 9
12 3 4 5 67 8 9
1 2 34 5 6 7 89
12 34 56 78 9
1 23456 78 9
12345 6789
etc.

1) Write a program that generates all possible combinations of those digits.

How many are there?

Now let's insert a maximum of 8 addition or subtraction operators between the numbers, like this:

1+2+3+4+5+6+7-8+9
12-3+4+5-67-8+9
1+2+34+5-6-7-89
12-34+56+78+9
1+23456-78-9
12345+6789
etc.

Notice that those arithmetic expressions equate to different values:

1+2+3+4+5+6+7-8+9 = 29
12-3+4+5-67-8+9 = -48
1+2+34+5-6-7-89 = -60
12-34+56+78+9 = 121
1+23456-78-9 = 23370
12345+6789 = 19134
etc.

2) Write a program that generates all possible expressions in this way.

How many sum to 100?

3) Write a program that finds all such expressions for any sum.

Which sum is the most popular, i.e. has the most expressions?

4) Bonus: We can haz pretty data viz?

This solution is copyright 2014 Allen B. Downey

License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html

The first step is to define OPERATORS and DIGITS. I am putting the sequence of digits into a NumPy array so I can take advantage of numpy.insert(), which can do multiple inserts with one function call.


In [210]:
from itertools import *
import numpy

OPERATORS = '+-'
DIGITS = numpy.array(list('123456789'))
NUM_DIGITS = len(DIGITS)

DIGITS


Out[210]:
array(['1', '2', '3', '4', '5', '6', '7', '8', '9'], 
      dtype='|S1')

Assuming that we know how many operators we're going to insert in an expression, we want to enumerate all possible combinations of operators. itertools.product does what we want.


In [211]:
def operator_iter(n):
    """Iterates all sequences of n operators."""
    return product(OPERATORS, repeat=n)

list(operator_iter(2))


Out[211]:
[('+', '+'), ('+', '-'), ('-', '+'), ('-', '-')]

Given the operators, we have to figure out where to put them. itertools.combinations enumerates the possible sequences of indices.


In [212]:
def index_iter(n):
    """Iterates all the places we could put n operators."""
    return combinations(range(1, NUM_DIGITS), n)

list(index_iter(2))


Out[212]:
[(1, 2),
 (1, 3),
 (1, 4),
 (1, 5),
 (1, 6),
 (1, 7),
 (1, 8),
 (2, 3),
 (2, 4),
 (2, 5),
 (2, 6),
 (2, 7),
 (2, 8),
 (3, 4),
 (3, 5),
 (3, 6),
 (3, 7),
 (3, 8),
 (4, 5),
 (4, 6),
 (4, 7),
 (4, 8),
 (5, 6),
 (5, 7),
 (5, 8),
 (6, 7),
 (6, 8),
 (7, 8)]

Now we can use numpy.insert to combine the digits and operators.


In [213]:
def make_expr(index, operator):
    """Inserts the operators at the given locations and make a string."""
    return ''.join(numpy.insert(DIGITS, index, operator))

make_expr((2, 4), ('+', '-'))


Out[213]:
'12+34-56789'

For a given number of operators, n, we map make_expr onto the product of the indices and the operators, yielding all possible expressions with n operators.


In [214]:
def expression_iter(n):
    """Iterates expressions with n operators."""
    return starmap(make_expr, product(index_iter(n), operator_iter(n)))

list(expression_iter(1))


Out[214]:
['1+23456789',
 '1-23456789',
 '12+3456789',
 '12-3456789',
 '123+456789',
 '123-456789',
 '1234+56789',
 '1234-56789',
 '12345+6789',
 '12345-6789',
 '123456+789',
 '123456-789',
 '1234567+89',
 '1234567-89',
 '12345678+9',
 '12345678-9']

Finally, map expression_iter onto the range 0..NUMDIGITS-1 and chain together the results.


In [215]:
def all_expression_iter():
    """Iterates all expressions."""
    return chain(*imap(expression_iter, range(NUM_DIGITS)))

list(islice(all_expression_iter(), 10))


Out[215]:
['123456789',
 '1+23456789',
 '1-23456789',
 '12+3456789',
 '12-3456789',
 '123+456789',
 '123-456789',
 '1234+56789',
 '1234-56789',
 '12345+6789']

forward maps from each string expression to its numerical value.


In [216]:
forward = dict(imap(lambda s: (s, eval(s)), all_expression_iter()))
min(forward.itervalues())


Out[216]:
-23456788

counter maps from each numerical value to the number of times it appears.


In [217]:
from collections import Counter
counter = Counter(forward.itervalues())
counter[100]


Out[217]:
11

most_frequent is a list of value, frequency pairs in descending order of frequency


In [218]:
from operator import itemgetter

most_frequent = sorted(counter.iteritems(), key=itemgetter(1), reverse=True)

for value, freq in most_frequent[:10]:
    print value, freq


-21 27
1 26
45 26
-43 24
-9 24
23 23
-27 23
-25 23
-15 23
-3 23

Now we want to invert forward.


In [219]:
def invert_dict(forward):
    """Inverts a dictionary using a set comprehension."""
    reverse = {}
    set(reverse.setdefault(v, []).append(k) for k, v in forward.iteritems())
    return reverse

reverse maps from each numerical value to a list of expressions.


In [220]:
reverse = invert_dict(forward)
reverse[100]


Out[220]:
['123-45-67+89',
 '1+23-4+5+6+78-9',
 '1+2+3-4+5+6+78+9',
 '123+45-67+8-9',
 '1+23-4+56+7+8+9',
 '12-3-4+5-6+7+89',
 '1+2+34-5+67-8+9',
 '12+3-4+5+67+8+9',
 '123+4-5+67-89',
 '123-4-5-6-7+8-9',
 '12+3+4+5-6-7+89']

A cumulative distribution function (CDF) is represented by a sorted list of values and a list of the corresponding cumulative probabilities.


In [221]:
def make_cdf(counter):
    """Makes a cumulative distribution function."""
    vals, freqs = zip(*sorted(counter.iteritems()))
    probs = numpy.cumsum(freqs, dtype=numpy.float)
    probs /= probs[-1]
    return vals, probs

abs_counter is a map from the magnitude of the values (abs) to frequence. cdf is the distribution of the absolute values.


In [222]:
abs_counter = Counter(imap(abs, forward.itervalues()))
cdf = make_cdf(abs_counter)
zip(*cdf)[:11]


Out[222]:
[(0, 0.0016765736930345985),
 (1, 0.00823045267489712),
 (2, 0.010973936899862825),
 (3, 0.0172229843011736),
 (4, 0.019966468526139307),
 (5, 0.026063100137174212),
 (6, 0.029721079103795154),
 (7, 0.035665294924554183),
 (8, 0.038408779149519894),
 (9, 0.045419905502210027),
 (10, 0.048010973936899862)]

You can evaluate a CDF forward or backward in log time using bisect.


In [223]:
from bisect import bisect_left

def eval_cdf(cdf, x):
    """Evaluates a cumulative distribution function."""
    vals, probs = cdf
    return probs[bisect_left(vals, x)]

print 'x    CDF(x)'
for x in [10, 100, 1000, 10000, 100000]:
    print x, eval_cdf(cdf, x) * 100


x    CDF(x)
10 4.80109739369
100 42.7069044353
1000 83.6610272824
10000 95.5037341869
100000 98.7806736778

Let's try plotting the CDF.


In [224]:
import matplotlib.pyplot as pyplot

def plot_cdf(cdf, xscale='linear'):
    pyplot.plot(*cdf, linewidth=4, alpha=0.5)
    pyplot.xlabel('abs(value)')
    pyplot.ylabel('CDF')
    pyplot.xscale(xscale)
    pyplot.show()
    
plot_cdf(cdf)


On a linear scale, we can't see much. Let's try a log-x scale:


In [225]:
plot_cdf(cdf, xscale='log')


That looks like a long-tailed distribution. To see if it has a power-law tail, we can plot the complementary CDF (CCDF) on a log-log scale.


In [226]:
def plot_ccdf(cdf):
    ccdf = cdf[0], 1-cdf[1]
    pyplot.plot(*ccdf, linewidth=4, alpha=0.5)
    pyplot.xlabel('abs(value)')
    pyplot.ylabel('CCDF')
    pyplot.xscale('log')
    pyplot.yscale('log')
    pyplot.show()
    
plot_ccdf(cdf)


A straight line in this plot indicates a power-law tail. In this example is spans about 6 orders of magnitude.

So you might wonder whether this pattern persists. Let's try it with hexadecimal numbers.


In [40]:
DIGITS = numpy.array(list('123456789abcdef'))
NUM_DIGITS = len(DIGITS)

def hex_eval(expr):
    """Evaluates an expression with hexadecimal numbers."""
    expr = '0x' + expr.replace('+', '+0x').replace('-', '-0x')
    return eval(expr)

forward2 = dict(imap(lambda s: (s, hex_eval(s)), all_expression_iter()))

In [166]:
def num_and_log(n):
    """Returns a number and its number of digits (actually log base 10)."""
    return n, numpy.log10(abs(n))

num_and_log(len(forward2))


Out[166]:
(4782969, 6.6796975660752738)

In [167]:
num_and_log(min(forward2.itervalues()))


Out[167]:
(-9927935178558958, 15.996858932905857)

In [168]:
num_and_log(max(forward2.itervalues()))


Out[168]:
(81985529216486895, 16.913737204383434)

In [169]:
abs_counter2 = Counter(imap(abs, forward2.itervalues()))
cdf2 = make_cdf(abs_counter2)
plot_ccdf(cdf2)


Yup, that looks like a power tail over about 14 orders of magnitude! The ripple pattern suggests that the tail is a mixture of distributions with geometrically increasing central tendency and decreasing probability. And it looks like the width of each ripple is a factor of 16 (there are 12 ripples; if each is a hexadecade, they should span 14.4 decades).


In [170]:
12 * numpy.log10(16)


Out[170]:
14.449439791871097

That gives us a hint about the cause of the ripples. The value of each expression will tend to be dominated by the largest term. Here's a function that takes an expression and finds the number of digits in the dominant term.


In [171]:
import re

def dom_term_len(expr, val):
    return abs(val), max(imap(len, re.split('\+|-', expr)))
    
dom_term_len('1234+56789', 1234+56789)


Out[171]:
(58023, 5)

dom_term_map maps from the number of digits in the largest term to a list of values.


In [172]:
dom_term_map = invert_dict(dict(starmap(dom_term_len, forward.iteritems())))

In [174]:
dom_term_map[6][:10]


Out[174]:
[123374,
 123387,
 123432,
 123462,
 124245,
 345657,
 345666,
 345668,
 345670,
 345672]

Now we can compute the distribution of values within each decade (or hexadecade).


In [175]:
plot_cdf(make_cdf(Counter(dom_term_map[2])))



In [176]:
plot_cdf(make_cdf(Counter(dom_term_map[3])))



In [177]:
plot_cdf(make_cdf(Counter(dom_term_map[4])))



In [178]:
plot_cdf(make_cdf(Counter(dom_term_map[5])))



In [179]:
plot_cdf(make_cdf(Counter(dom_term_map[6])))



In [180]:
plot_cdf(make_cdf(Counter(dom_term_map[7])))



In [181]:
plot_cdf(make_cdf(Counter(dom_term_map[8])))


Within each decade (or hexadecade) the distribution of values is roughly uniform.

It looks like the distribution of values is a mixture of uniform distributions where each element of the mixture spans a decade. So why does the resulting mixture have a power-law tail? Because the number of values in each element of the mixture decreases geometrically. To see that, let's plot the number of values as a function of the size of the largest term.


In [190]:
pairs = sorted(starmap(lambda x, t: (x, len(t)), dom_term_map.iteritems()))
xs, ys = zip(*pairs)
pyplot.plot(xs, ys, linewidth=4, alpha=0.5)
pyplot.xlabel('Digits in largest term')
pyplot.ylabel('Number of values')
pyplot.show()


From 4 on up, that looks like a geometric decay. To confirm, let's look at the log ratio of successive elements.


In [192]:
diffs = numpy.diff(numpy.log10(ys))
pyplot.plot(xs[1:], diffs, linewidth=4, alpha=0.5)
pyplot.show()


That looks like it's converging at about -0.6, so we expect the mixture distribution, plotted on a log scale, to drop by 0.6 per decade. And it does.