Peter Norvig
January 2014
I love ♡ xkcd! It reliably provides top-quality insights, humor, or both. It was a thrill for me when I got to introduce Randall Monroe for his talk at Google. But in xkcd #1313,
![](http://imgs.xkcd.com/comics/regex_golf.png)
I found that the hover text, "/bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents", contains a confusing contradiction. There are several last names (like "Nixon") that denote both elected presidents and opponents. So no regular expression could both match and not match "Nixon". I could only assume that Randall meant for these names to be winners and not losers (and in fact he later confirmed that was the correct interpretation).
So that got me thinking: can I come up with an algorithm to find a short regex that matches the winners and not the losers?
I started by finding a page that lists presidential elections through 2000. Adding the 2004-2012 results I get these lists of winners and losers:
In [29]:
def words(text): return set(text.lower().split())
winners = words('''washington adams jefferson jefferson madison madison monroe
monroe adams jackson jackson van-buren harrison polk taylor pierce buchanan
lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley
mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt
roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon
nixon carter reagan reagan bush clinton clinton bush bush obama obama''')
losers = words('''clinton jefferson adams pinckney pinckney clinton king adams
jackson adams clay van-buren van-buren clay cass scott fremont breckinridge
mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan
parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey
stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale
dukakis bush dole gore kerry mccain romney''')
We can see that there are names that are both winners and losers:
In [2]:
print(winners & losers)
Note that we are working with names, not people: Bill Clinton was a winner and George Clinton (the Revolutionary War leader, not the Funkadelic leader) was a loser, but they both count as 'clinton'
.
To avoid a contradiction and achieve Randall's intent, eliminate all winners from the set of losers:
In [30]:
losers = losers - winners
Next we need to be clear exactly what we're trying to achieve. We're looking for a Python regular expression which, when used with the re.search
function, will match all of the winners but none of the losers. We can define the function verify
to return True if a regex does that; if not it returns False and prints all the winners/losers it gets wrong:
In [31]:
import re
def verify(regex, winners, losers):
"Return true iff the regex matches all winners but no losers."
missed_winners = {W for W in winners if not re.search(regex, W)}
matched_losers = {L for L in losers if re.search(regex, L)}
if missed_winners:
print "Error: should match but did not:", ', '.join(missed_winners)
if matched_losers:
print "Error: should not match but did:", ', '.join(matched_losers)
return not (missed_winners or matched_losers)
We can use this to test xkcd's regex:
In [5]:
xkcd = "bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls"
verify(xkcd, winners, losers)
Out[5]:
That's interesting—we see that the xkcd regular expression incorrectly matches "fremont"
, representing John C. Fremont, opponent of James Buchanan in 1856. Could Randall have made an error? Is someone wrong on the Internet? Investigating the 1856 election, I see that Randall must have had Millard Fillmore, the third-party candidate, listed as the opponent. Fillmore is of course more famous, having served as the 13th president (although he never won an election; he became president when Taylor died in office). But Fillmore only got 8 electoral votes in 1856 and Fremont got 114, so I will stick with Fremont in my list of losers.
However, we can verify that Randall got it right under the interpretation that Fillmore, not Fremont, is the opponent. (Note: in the algebra of sets, (A & B)
means intersection, (A | B)
means union, and (A - B)
means set difference, or the elements of A
that are not in B
.)
In [6]:
verify(xkcd, winners, losers - {'fremont'} | {'fillmore'})
Out[6]:
We need a strategy to find a regex that matches all the winners but none of the losers. I came up with this approach:
"bu"
or "n.e"
or "r.e$"
)."bu|n.e|j"
).Note that there are four ways in which this strategy can fail to find the shortest possible regex:
"a|b|c|..."
. If there is a shorter regex that is not a disjunction, we can't find it. If the solutions end up looking bad, we can always go back and address one or more of these four issues.
The algorithm is below. We create an initial pool with regex_components(winners, losers)
, and will accumulate components into the list solution
, which starts empty. On each iteration choose the best component: the one with a maximum score. (I decided to score 4 points for each winner matched minus one point for each character in the component. The function matches(regex, strings)
returns a set of the strings that match.) We then add the best component to solution
, and remove from winners all the strings that are matched by best
. Finally, we update the pool, keeping only those components that still match one or more of the remaining winners. When there are no more winners left, OR together all the solution components to give the final regular expression string.
In [32]:
def findregex(winners, losers):
"Find a regex that matches all winners but no losers (sets of strings)."
# Make a pool of regex components, then pick from them to cover winners.
# On each iteration, add the 'best' component to 'solution',
# remove winners covered by best, and keep in 'pool' only components
# that still match some winner.
pool = regex_components(winners, losers)
solution = []
def score(r): return 4 * len(matches(r, winners)) - len(r)
while winners:
best = max(pool, key=score)
solution.append(best)
winners = winners - matches(best, winners)
pool = {r for r in pool if matches(r, winners)}
return OR(solution)
In [33]:
def matches(regex, strings):
"Return a set of all the strings that are matched by regex."
return {s for s in strings if re.search(regex, s)}
OR = '|'.join # Join a sequence of strings with '|' between them
In this notebook the following terminology is used:
'bu'
or 'a.a'
.'bu|a.a'
. Must match winners but not losers.'^word$'
Now we need to define what the regex_components
are. Here's what I came up with:
"winner"
, include "^winner$"
subparts('^it$')
== {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
dotify('it')
== {'it', 'i.', '.t', '..'}
Note that I only used a few of the regular expression mechanisms: '.'
, '^'
, and '$'
. I didn't try to use character classes, (e.g., [a-z]
), nor any of the repetition operators, or other advanced mechanisms. Why? I wanted to keep it simple, and I thought that the advanced features take too many characters. I can always add more complicated mechanisms later. Here is the code:
In [34]:
def regex_components(winners, losers):
"Return components that match at least one winner, but no loser."
wholes = {'^'+winner+'$' for winner in winners}
parts = {d for w in wholes for s in subparts(w) for d in dotify(s)}
return wholes | {p for p in parts if not matches(p, losers)}
def subparts(word):
"Return a set of subparts of word, consecutive characters up to length 4."
return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4))
def dotify(part):
"Return all ways to replace a subset of chars in part with '.'."
if part == '':
return {''}
else:
return {c+rest for rest in dotify(part[1:])
for c in replacements(part[0])}
def replacements(char):
"Return replacement characters for char (char + '.' unless char is special)."
if (char == '^' or char == '$'):
return char
else:
return char + '.'
Our program is complete!
But to make sure we understand what eachof these subfunctions does, and to help us debug any problems, here's a test suite:
In [37]:
def test1():
assert subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
assert subparts('this') == {'t', 'h', 'i', 's', 'th', 'hi', 'is', 'thi', 'his', 'this'}
assert dotify('it') == {'it', 'i.', '.t', '..'}
assert dotify('^it$') == {'^it$', '^i.$', '^.t$', '^..$'}
assert dotify('this') == {'this', 'thi.', 'th.s', 'th..', 't.is', 't.i.', 't..s', 't...',
'.his', '.hi.', '.h.s', '.h..', '..is', '..i.', '...s', '....'}
assert replacements('x') == 'x.'
assert replacements('^') == '^'
assert replacements('$') == '$'
assert regex_components({'win'}, {'losers', 'bin', 'won'}) == {
'^win$', '^win', '^wi.', 'wi.', 'wi', '^wi', 'win$', 'win', 'wi.$'}
assert regex_components({'win'}, {'losers', 'bin', 'won', 'wine'}) == {
'^win$', 'win$', 'wi.$'}
assert matches('a|b|c', {'a', 'b', 'c', 'd', 'e'}) == {'a', 'b', 'c'}
assert matches('a|b|c', {'any', 'bee', 'succeed', 'dee', 'eee!'}) == {
'any', 'bee', 'succeed'}
assert verify('a+b+', {'ab', 'aaabb'}, {'a', 'bee', 'a b'})
assert findregex({"ahahah", "ciao"}, {"ahaha", "bye"}) == 'a.$'
assert OR(['a', 'b', 'c']) == 'a|b|c'
assert OR(['a']) == 'a'
assert words('This is a TEST this is') == {'this', 'is', 'a', 'test'}
assert lines('Testing / 1 2 3 / Testing over') == {'TESTING', '1 2 3', 'TESTING OVER'}
return 'test1 passes'
test1()
Out[37]:
Now we can finally run findregex
, verify that it works, and compare the length of our solution to Randall's:
In [38]:
findregex(winners, losers)
Out[38]:
In [39]:
solution = findregex(winners, losers)
verify(solution, winners, losers)
Out[39]:
In [41]:
print len(solution), len(xkcd)
Ours is 10 characters shorter. So we've answered the challenge in the hover text; what else can we do?
We can define a convenience function to do this finding and verifying, and we might as well do it in both directions (e.g. separating winners from losers and losers from winners). We will also report the number of characters in the solution and the competitive ratio of the solution: the ratio between the length of a trivial solution and the solution found (a trivial solution for the set of winners {'one', 'two', 'three'}
is the disjunction '^(one|two|three)$'
).
In [42]:
def findboth(A, B):
"Find a regex to match A but not B, and vice-versa. Print summary."
for (W, L, legend) in [(A, B, 'A-B'), (B, A, 'B-A')]:
solution = findregex(W, L)
assert verify(solution, W, L)
ratio = len('^(' + OR(W) + ')$') / float(len(solution))
print '%3d chars, %4.1f ratio, %2d winners %s: %r' % (
len(solution), ratio , len(W), legend, solution)
In [43]:
findboth(winners, losers)
Let's try the top 10 boys and girls names for 2012:
In [16]:
boys = words('jacob mason ethan noah william liam jayden michael alexander aiden')
girls = words('sophia emma isabella olivia ava emily abigail mia madison elizabeth')
findboth(boys, girls)
In [17]:
nfl_in = words('colts saints chargers 49ers seahawks patriots panthers broncos chiefs eagles bengals packers')
nfl_out = words('''jets dolphins bills steelers ravens browns titans jaguars texans raiders cowboys giants
redskins bears lions vikings falcons buccaneers cardinals rams''')
findboth(nfl_in, nfl_out)
And separating the top ten best-selling drugs from the top 10 cities to visit:
In [18]:
drugs = words('lipitor nexium plavix advair ablify seroquel singulair crestor actos epogen')
cities = words('paris trinidad capetown riga zurich shanghai vancouver chicago adelaide auckland')
findboth(drugs, cities)
In [19]:
def lines(text): return {line.upper().strip() for line in text.split('/')}
starwars = lines('''The Phantom Menace / Attack of the Clones / Revenge of the Sith /
A New Hope / The Empire Strikes Back / Return of the Jedi''')
startrek = lines('''The Wrath of Khan / The Search for Spock / The Voyage Home /
The Final Frontier / The Undiscovered Country / Generations / First Contact /
Insurrection / Nemesis''')
findboth(starwars, startrek)
Neat—the solution ' T|E.P| N'
is one character shorter than Randall's. This solution is equivalent to 'E.P| [TN]'
, so it shares the ' [TN]'
component. You can see why I didn't bother to put character classes (like [TN]
) in my pool of regex components: ' T| N'
is the same nunber of characters as ' [TN]'
.
We can verify that Randall's solution is correct:
In [20]:
verify('M | [TN]|B', starwars, startrek)
Out[20]:
There are lots of examples to play with over at regex.alf.nu, like this one:
In [21]:
foo = words('''afoot catfoot dogfoot fanfoot foody foolery foolish fooster
footage foothot footle footpad footway hotfoot jawfoot mafoo nonfood padfoot
prefool sfoot unfool''')
bar = words('''Atlas Aymoro Iberic Mahran Ormazd Silipan altared chandoo crenel
crooked fardo folksy forest hebamic idgah manlike marly palazzi sixfold
tarrock unfold''')
findboth(foo, bar)
I assume that one is intended to find 'foo'
, not 'f.o'
. My solution might be the same number of characters, 3, but it is smaller in the amount of ink/pixels it uses.
I see three options:
My first inclination was "stop here", and that's what this notebook will do (a few paragraphs from now). But several correspondents offered very interesting suggestions, so I returned to the problem in a second notebook. You'll see some tricks for making code run faster, and some new algorithms to find better (shorter) regexes.
That was fun! I hope this page gives you an idea of how to think about problems like this. Let's review what we did:
verify
to prove that we really understand exactly what the problem is.component_regexes
).findregex
).Thanks to Randall Monroe for inspiring me to do this, to regex.alf.nu for inspiring Randall to do his strip, and to Davide Canton, Thomas Breuel, and Stefan Pochmann for providing suggestions to improve my code.
Here is the complete program (without the data; you can apply it to any sets of strings you want):
In [26]:
import re
def verify(regex, winners, losers):
"Return true iff the regex matches all winners but no losers."
missed_winners = {W for W in winners if not re.search(regex, W)}
matched_losers = {L for L in losers if re.search(regex, L)}
if missed_winners:
print "Error: should match but did not:", ', '.join(missed_winners)
if matched_losers:
print "Error: should not match but did:", ', '.join(matched_losers)
return not (missed_winners or matched_losers)
def findregex(winners, losers):
"Find a regex that matches all winners but no losers (sets of strings)."
# Make a pool of regex components, then pick from them to cover winners.
# On each iteration, add the 'best' component to 'solution',
# remove winners covered by best, and keep in 'pool' only components
# that still match some winner.
pool = regex_components(winners, losers)
solution = []
def score(r): return 4 * len(matches(r, winners)) - len(r)
while winners:
best = max(pool, key=score)
solution.append(best)
winners = winners - matches(best, winners)
pool = {r for r in pool if matches(r, winners)}
return OR(solution)
def matches(regex, strings):
"Return a set of all the strings that are matched by regex."
return {s for s in strings if re.search(regex, s)}
OR = '|'.join # Join a sequence of strings with '|' between them
def regex_components(winners, losers):
"Return components that match at least one winner, but no loser."
wholes = {'^'+winner+'$' for winner in winners}
parts = {d for w in wholes for s in subparts(w) for d in dotify(s)}
return wholes | {p for p in parts if not matches(p, losers)}
def subparts(word):
"Return a set of subparts of word, consecutive characters up to length 4."
return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4))
def dotify(part):
"Return all ways to replace a subset of chars in part with '.'."
if part == '':
return {''}
else:
return {c+rest for rest in dotify(part[1:])
for c in replacements(part[0])}
def replacements(char):
"Return replacement characters for char (char + '.' unless char is special)."
if (char == '^' or char == '$'):
return char
else:
return char + '.'
def words(text): return set(text.lower().split())
def lines(text): return {line.upper().strip() for line in text.split('/')}
################ Data
winners = words('''washington adams jefferson jefferson madison madison monroe
monroe adams jackson jackson van-buren harrison polk taylor pierce buchanan
lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley
mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt
roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon
nixon carter reagan reagan bush clinton clinton bush bush obama obama''')
losers = words('''clinton jefferson adams pinckney pinckney clinton king adams
jackson adams clay van-buren van-buren clay cass scott fremont breckinridge
mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan
parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey
stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale
dukakis bush dole gore kerry mccain romney''')
starwars = lines('''The Phantom Menace / Attack of the Clones / Revenge of the Sith /
A New Hope / The Empire Strikes Back / Return of the Jedi''')
startrek = lines('''The Wrath of Khan / The Search for Spock / The Voyage Home /
The Final Frontier / The Undiscovered Country / Generations / First Contact /
Insurrection / Nemesis''')
def findboth(A, B):
"Find a regex to match A but not B, and vice-versa. Print summary."
for (W, L, legend) in [(A, B, 'A-B'), (B, A, 'B-A')]:
solution = findregex(W, L)
assert verify(solution, W, L)
ratio = len('^(' + OR(W) + ')$') / float(len(solution))
print '%3d chars, %4.1f ratio, %2d winners %s: %r' % (
len(solution), ratio , len(W), legend, solution)
################ Tests
def test1():
assert subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
assert subparts('this') == {'t', 'h', 'i', 's', 'th', 'hi', 'is', 'thi', 'his', 'this'}
assert dotify('it') == {'it', 'i.', '.t', '..'}
assert dotify('^it$') == {'^it$', '^i.$', '^.t$', '^..$'}
assert dotify('this') == {'this', 'thi.', 'th.s', 'th..', 't.is', 't.i.', 't..s', 't...',
'.his', '.hi.', '.h.s', '.h..', '..is', '..i.', '...s', '....'}
assert replacements('x') == 'x.'
assert replacements('^') == '^'
assert replacements('$') == '$'
assert regex_components({'win'}, {'losers', 'bin', 'won'}) == {
'^win$', '^win', '^wi.', 'wi.', 'wi', '^wi', 'win$', 'win', 'wi.$'}
assert regex_components({'win'}, {'losers', 'bin', 'won', 'wine'}) == {
'^win$', 'win$', 'wi.$'}
assert matches('a|b|c', {'a', 'b', 'c', 'd', 'e'}) == {'a', 'b', 'c'}
assert matches('a|b|c', {'any', 'bee', 'succeed', 'dee', 'eee!'}) == {
'any', 'bee', 'succeed'}
assert verify('a+b+', {'ab', 'aaabb'}, {'a', 'bee', 'a b'})
assert findregex({"ahahah", "ciao"}, {"ahaha", "bye"}) == 'a.$'
assert OR(['a', 'b', 'c']) == 'a|b|c'
assert OR(['a']) == 'a'
assert words('This is a TEST this is') == {'this', 'is', 'a', 'test'}
assert lines('Testing / 1 2 3 / Testing over') == {'TESTING', '1 2 3', 'TESTING OVER'}
return 'test1 passes'
if __name__ == '__main__':
print test1()