# puzzle1

This notebook contains my solution to this puzzle, written by John Bohannon for the Boston Python User Group:

Consider these base-10 digits: 123456789. If you insert spaces between them, you get various sequences of numbers:

``````1 2 3 4 5 6 7 8 9
12 3 4 5 67 8 9
1 2 34 5 6 7 89
12 34 56 78 9
1 23456 78 9
12345 6789
etc.

``````

1) Write a program that generates all possible combinations of those digits.

How many are there?

Now let's insert a maximum of 8 addition or subtraction operators between the numbers, like this:

``````1+2+3+4+5+6+7-8+9
12-3+4+5-67-8+9
1+2+34+5-6-7-89
12-34+56+78+9
1+23456-78-9
12345+6789
etc.

``````

Notice that those arithmetic expressions equate to different values:

``````1+2+3+4+5+6+7-8+9 = 29
12-3+4+5-67-8+9 = -48
1+2+34+5-6-7-89 = -60
12-34+56+78+9 = 121
1+23456-78-9 = 23370
12345+6789 = 19134
etc.

``````

2) Write a program that generates all possible expressions in this way.

How many sum to 100?

3) Write a program that finds all such expressions for any sum.

Which sum is the most popular, i.e. has the most expressions?

4) Bonus: We can haz pretty data viz?

This solution is copyright 2014 Allen B. Downey

The first step is to define `OPERATORS` and `DIGITS`. I am putting the sequence of digits into a NumPy array so I can take advantage of `numpy.insert()`, which can do multiple inserts with one function call.

``````

In :

from itertools import *
import numpy

OPERATORS = '+-'
DIGITS = numpy.array(list('123456789'))
NUM_DIGITS = len(DIGITS)

DIGITS

``````
``````

Out:

array(['1', '2', '3', '4', '5', '6', '7', '8', '9'],
dtype='|S1')

``````

Assuming that we know how many operators we're going to insert in an expression, we want to enumerate all possible combinations of operators. `itertools.product` does what we want.

``````

In :

def operator_iter(n):
"""Iterates all sequences of n operators."""
return product(OPERATORS, repeat=n)

list(operator_iter(2))

``````
``````

Out:

[('+', '+'), ('+', '-'), ('-', '+'), ('-', '-')]

``````

Given the operators, we have to figure out where to put them. `itertools.combinations` enumerates the possible sequences of indices.

``````

In :

def index_iter(n):
"""Iterates all the places we could put n operators."""
return combinations(range(1, NUM_DIGITS), n)

list(index_iter(2))

``````
``````

Out:

[(1, 2),
(1, 3),
(1, 4),
(1, 5),
(1, 6),
(1, 7),
(1, 8),
(2, 3),
(2, 4),
(2, 5),
(2, 6),
(2, 7),
(2, 8),
(3, 4),
(3, 5),
(3, 6),
(3, 7),
(3, 8),
(4, 5),
(4, 6),
(4, 7),
(4, 8),
(5, 6),
(5, 7),
(5, 8),
(6, 7),
(6, 8),
(7, 8)]

``````

Now we can use `numpy.insert` to combine the digits and operators.

``````

In :

def make_expr(index, operator):
"""Inserts the operators at the given locations and make a string."""
return ''.join(numpy.insert(DIGITS, index, operator))

make_expr((2, 4), ('+', '-'))

``````
``````

Out:

'12+34-56789'

``````

For a given number of operators, `n`, we map `make_expr` onto the product of the indices and the operators, yielding all possible expressions with `n` operators.

``````

In :

def expression_iter(n):
"""Iterates expressions with n operators."""
return starmap(make_expr, product(index_iter(n), operator_iter(n)))

list(expression_iter(1))

``````
``````

Out:

['1+23456789',
'1-23456789',
'12+3456789',
'12-3456789',
'123+456789',
'123-456789',
'1234+56789',
'1234-56789',
'12345+6789',
'12345-6789',
'123456+789',
'123456-789',
'1234567+89',
'1234567-89',
'12345678+9',
'12345678-9']

``````

Finally, map `expression_iter` onto the range `0..NUMDIGITS-1` and chain together the results.

``````

In :

def all_expression_iter():
"""Iterates all expressions."""
return chain(*imap(expression_iter, range(NUM_DIGITS)))

list(islice(all_expression_iter(), 10))

``````
``````

Out:

['123456789',
'1+23456789',
'1-23456789',
'12+3456789',
'12-3456789',
'123+456789',
'123-456789',
'1234+56789',
'1234-56789',
'12345+6789']

``````

`forward` maps from each string expression to its numerical value.

``````

In :

forward = dict(imap(lambda s: (s, eval(s)), all_expression_iter()))
min(forward.itervalues())

``````
``````

Out:

-23456788

``````

`counter` maps from each numerical value to the number of times it appears.

``````

In :

from collections import Counter
counter = Counter(forward.itervalues())
counter

``````
``````

Out:

11

``````

`most_frequent` is a list of value, frequency pairs in descending order of frequency

``````

In :

from operator import itemgetter

most_frequent = sorted(counter.iteritems(), key=itemgetter(1), reverse=True)

for value, freq in most_frequent[:10]:
print value, freq

``````
``````

-21 27
1 26
45 26
-43 24
-9 24
23 23
-27 23
-25 23
-15 23
-3 23

``````

Now we want to invert `forward`.

``````

In :

def invert_dict(forward):
"""Inverts a dictionary using a set comprehension."""
reverse = {}
set(reverse.setdefault(v, []).append(k) for k, v in forward.iteritems())
return reverse

``````

`reverse` maps from each numerical value to a list of expressions.

``````

In :

reverse = invert_dict(forward)
reverse

``````
``````

Out:

['123-45-67+89',
'1+23-4+5+6+78-9',
'1+2+3-4+5+6+78+9',
'123+45-67+8-9',
'1+23-4+56+7+8+9',
'12-3-4+5-6+7+89',
'1+2+34-5+67-8+9',
'12+3-4+5+67+8+9',
'123+4-5+67-89',
'123-4-5-6-7+8-9',
'12+3+4+5-6-7+89']

``````

A cumulative distribution function (CDF) is represented by a sorted list of values and a list of the corresponding cumulative probabilities.

``````

In :

def make_cdf(counter):
"""Makes a cumulative distribution function."""
vals, freqs = zip(*sorted(counter.iteritems()))
probs = numpy.cumsum(freqs, dtype=numpy.float)
probs /= probs[-1]
return vals, probs

``````

`abs_counter` is a map from the magnitude of the values (abs) to frequence. `cdf` is the distribution of the absolute values.

``````

In :

abs_counter = Counter(imap(abs, forward.itervalues()))
cdf = make_cdf(abs_counter)
zip(*cdf)[:11]

``````
``````

Out:

[(0, 0.0016765736930345985),
(1, 0.00823045267489712),
(2, 0.010973936899862825),
(3, 0.0172229843011736),
(4, 0.019966468526139307),
(5, 0.026063100137174212),
(6, 0.029721079103795154),
(7, 0.035665294924554183),
(8, 0.038408779149519894),
(9, 0.045419905502210027),
(10, 0.048010973936899862)]

``````

You can evaluate a CDF forward or backward in log time using `bisect`.

``````

In :

from bisect import bisect_left

def eval_cdf(cdf, x):
"""Evaluates a cumulative distribution function."""
vals, probs = cdf
return probs[bisect_left(vals, x)]

print 'x    CDF(x)'
for x in [10, 100, 1000, 10000, 100000]:
print x, eval_cdf(cdf, x) * 100

``````
``````

x    CDF(x)
10 4.80109739369
100 42.7069044353
1000 83.6610272824
10000 95.5037341869
100000 98.7806736778

``````

Let's try plotting the CDF.

``````

In :

import matplotlib.pyplot as pyplot

def plot_cdf(cdf, xscale='linear'):
pyplot.plot(*cdf, linewidth=4, alpha=0.5)
pyplot.xlabel('abs(value)')
pyplot.ylabel('CDF')
pyplot.xscale(xscale)
pyplot.show()

plot_cdf(cdf)

``````
``````

``````

On a linear scale, we can't see much. Let's try a log-x scale:

``````

In :

plot_cdf(cdf, xscale='log')

``````
``````

``````

That looks like a long-tailed distribution. To see if it has a power-law tail, we can plot the complementary CDF (CCDF) on a log-log scale.

``````

In :

def plot_ccdf(cdf):
ccdf = cdf, 1-cdf
pyplot.plot(*ccdf, linewidth=4, alpha=0.5)
pyplot.xlabel('abs(value)')
pyplot.ylabel('CCDF')
pyplot.xscale('log')
pyplot.yscale('log')
pyplot.show()

plot_ccdf(cdf)

``````
``````

``````

A straight line in this plot indicates a power-law tail. In this example is spans about 6 orders of magnitude.

So you might wonder whether this pattern persists. Let's try it with hexadecimal numbers.

``````

In :

DIGITS = numpy.array(list('123456789abcdef'))
NUM_DIGITS = len(DIGITS)

def hex_eval(expr):
"""Evaluates an expression with hexadecimal numbers."""
expr = '0x' + expr.replace('+', '+0x').replace('-', '-0x')
return eval(expr)

forward2 = dict(imap(lambda s: (s, hex_eval(s)), all_expression_iter()))

``````
``````

In :

def num_and_log(n):
"""Returns a number and its number of digits (actually log base 10)."""
return n, numpy.log10(abs(n))

num_and_log(len(forward2))

``````
``````

Out:

(4782969, 6.6796975660752738)

``````
``````

In :

num_and_log(min(forward2.itervalues()))

``````
``````

Out:

(-9927935178558958, 15.996858932905857)

``````
``````

In :

num_and_log(max(forward2.itervalues()))

``````
``````

Out:

(81985529216486895, 16.913737204383434)

``````
``````

In :

abs_counter2 = Counter(imap(abs, forward2.itervalues()))
cdf2 = make_cdf(abs_counter2)
plot_ccdf(cdf2)

``````
``````

``````

Yup, that looks like a power tail over about 14 orders of magnitude! The ripple pattern suggests that the tail is a mixture of distributions with geometrically increasing central tendency and decreasing probability. And it looks like the width of each ripple is a factor of 16 (there are 12 ripples; if each is a hexadecade, they should span 14.4 decades).

``````

In :

12 * numpy.log10(16)

``````
``````

Out:

14.449439791871097

``````

That gives us a hint about the cause of the ripples. The value of each expression will tend to be dominated by the largest term. Here's a function that takes an expression and finds the number of digits in the dominant term.

``````

In :

import re

def dom_term_len(expr, val):
return abs(val), max(imap(len, re.split('\+|-', expr)))

dom_term_len('1234+56789', 1234+56789)

``````
``````

Out:

(58023, 5)

``````

`dom_term_map` maps from the number of digits in the largest term to a list of values.

``````

In :

dom_term_map = invert_dict(dict(starmap(dom_term_len, forward.iteritems())))

``````
``````

In :

dom_term_map[:10]

``````
``````

Out:

[123374,
123387,
123432,
123462,
124245,
345657,
345666,
345668,
345670,
345672]

``````

``````

In :

plot_cdf(make_cdf(Counter(dom_term_map)))

``````
``````

``````
``````

In :

plot_cdf(make_cdf(Counter(dom_term_map)))

``````
``````

``````
``````

In :

plot_cdf(make_cdf(Counter(dom_term_map)))

``````
``````

``````
``````

In :

plot_cdf(make_cdf(Counter(dom_term_map)))

``````
``````

``````
``````

In :

plot_cdf(make_cdf(Counter(dom_term_map)))

``````
``````

``````
``````

In :

plot_cdf(make_cdf(Counter(dom_term_map)))

``````
``````

``````
``````

In :

plot_cdf(make_cdf(Counter(dom_term_map)))

``````
``````

``````

It looks like the distribution of values is a mixture of uniform distributions where each element of the mixture spans a decade. So why does the resulting mixture have a power-law tail? Because the number of values in each element of the mixture decreases geometrically. To see that, let's plot the number of values as a function of the size of the largest term.

``````

In :

pairs = sorted(starmap(lambda x, t: (x, len(t)), dom_term_map.iteritems()))
xs, ys = zip(*pairs)
pyplot.plot(xs, ys, linewidth=4, alpha=0.5)
pyplot.xlabel('Digits in largest term')
pyplot.ylabel('Number of values')
pyplot.show()

``````
``````

``````

From 4 on up, that looks like a geometric decay. To confirm, let's look at the log ratio of successive elements.

``````

In :

diffs = numpy.diff(numpy.log10(ys))
pyplot.plot(xs[1:], diffs, linewidth=4, alpha=0.5)
pyplot.show()

``````
``````

``````

That looks like it's converging at about -0.6, so we expect the mixture distribution, plotted on a log scale, to drop by 0.6 per decade. And it does.