A Coded Message

Just the other day my brother sent me an image of a curious coded message. The only explanation that came with it was the following challenge: "Can you crack this code?" Aha! Challenge accepted, sir!

This looks like symbols etched or burned (by laser??) onto a piece of wood. My brother is pretty crafty with his wood-working skills, but would he actually take the time to make this? Maybe?

My wife and I sat down and looked this over for a little while. She loves playing "Words with Friends" on her phone, and she had some good ideas about how to interpret the symbols for the smaller words. It looks to me like this note might be a poem, judging by the periodic sentences, plus what might be the author's name appended at the very end. After an hour we had not made much progress, so I decided to pursue a nerdy statistical approach. This will be an interesting effort to work on during my free time over the holiday.

Transcribe Symbols to Numbers

The first step for me was to transcribe the individual symbols in the message into a form amenable to manipulation by a machine. I scanned through the document and assigned a two-digit number to each symbol. I wrote that all down on a piece of paper while sitting on the couch. Then I spent way too much time taking photos of those symbols and editing them into the form seen below. Notice that there are exactly 26 distinct symbols in this code. This lends support to the idea that this code is a variant of the Substitution Cipher, where each original letter is replaced with a new symbol according to a predetermined scheme. In this case, the scheme would be some clever assignment of random Greek letters to real letters.

I don't suspect a Transposition Cipher layered on top of the substitution, since the ordering of words, lines, and punctuation bears a strong resemblance to real text, perhaps even a poem.
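To make the idea concrete, here is a minimal sketch of a substitution cipher. The shuffled alphabet below is made up purely for illustration; it is not the actual scheme used in the message.

import random
import string

# Build a random one-to-one mapping from each letter to a stand-in symbol.
random.seed(42)  # fixed seed so the example is repeatable
letters = list(string.ascii_lowercase)
shuffled = list(string.ascii_lowercase)
random.shuffle(shuffled)
cipher = dict(zip(letters, shuffled))

# Encode by substituting each letter; spaces and punctuation pass through unchanged.
encoded = ''.join(cipher.get(c, c) for c in 'a coded message')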

Next I transcribed the encoded message into numerical form, making sure to keep track of individual words. Each line of numbers listed below corresponds to an individual word from the coded message. I think it's a bit of a hassle at this point to worry too much about which words are placed on which line. But the least I can do now is to keep track of the index number of the last word on each line.


In [29]:
%matplotlib inline

import matplotlib.pyplot as plt
import prettyplotlib as ppl  # help matplotlib be all that it can be!  

import numpy as np
import pandas as pd
import requests

# Mysterious message transcribed to a number code for easier manipulation.
message = [[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  3, 10],
           [11,  5, 12, 13, 14, 14,  7,  6,  7],
           [13, 10,  2],
           [15, 11],
           
           [ 5, 16,  4, 11,  2,  4,  3,  3],
           [15, 17, 11, 18,  8, 14],
           [ 6,  5, 24,  2, 19, 20, 11, 21, 24, 17,  3, 21],
           [ 1, 22],           
           [10,  9, 17],
           
           [ 3, 14,  0,  7,  0, 19, 14,  5],           
           [23, 22],
           [ 8],
           [19,  6, 16,  2, 11, 17, 24,  5, 13, 12, 12],
           [ 8,  4],
           [24,  1, 22],
           
           [ 0,  4,  9, 18, 10, 22,  6, 10, 19,  4],
           [20, 15,  1, 10, 23],
           [18,  1],
           [ 3, 15, 21],
           [11, 23, 12, 25],
           
           [20, 19,  7, 23,  5, 11, 19,  7,  7],
           [12, 23, 19],
           [15, 16, 15, 21,  4,  8, 16,  5, 24],
           [ 7, 0, 23,  8],
           [24, 15, 24, 25,  2, 25]]

# Indices of the last word on each line of the original message.
line_ends = [3, 8, 14, 19, 24]

Counting Letters

Now seems like a great time to start counting how often each symbol appears in the coded message. And here's why (from Wikipedia):

Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies. Moreover, there is a characteristic distribution of letters that is roughly the same for almost all samples of that language.


In [30]:
def count_helper(message):
    """Helper function to compute symbols is a message.

    """
    # Ensure supplied message is any kind of iterator, not a scalar.
    if not hasattr(message, '__iter__'):
        message = [message]
    
    counts = {}
    total = 0
    
    for s in message:
        if s not in counts:
            counts[s] = 0

        counts[s] += 1
        total += 1
                
    return counts, total


def count_occurances(message):
    """Tally the symbols appearing in the supplied message.

    Note that for a nested message (a sequence of words), each distinct
    symbol is tallied once per word in which it appears.

    Parameters
    ----------
    message : sequence of symbols, OR a sequence of sequences of symbols.

    """
    counts_final = {}
    total_final = 0
    
    for s in message:
        counts_sub, total_sub = count_helper(s)

        # Tally each distinct symbol from this word (or lone symbol) once.
        for s_sub in counts_sub.keys():                
            if s_sub not in counts_final:
                counts_final[s_sub] = 0
            
            counts_final[s_sub] += 1
            total_final += 1
        
    return counts_final, total_final


# The two functions above get the job done, but they don't look very elegant.  Surely there must
# be a more Pythonic way.  Maybe I could get rid of that helper function, and replace it with a
# generator???
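
# For what it's worth, here is one more compact possibility, sketched with
# collections.Counter and itertools.chain (a sketch, not battle-tested).  The set()
# call preserves the behavior above: each distinct symbol is tallied once per word.
from collections import Counter
from itertools import chain

def count_occurances_compact(message):
    # Flatten the nested message, keeping each distinct symbol once per word.
    counts = Counter(chain.from_iterable(set(word) for word in message))
    return dict(counts), sum(counts.values())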

In [31]:
# Test run.
results_msg, total_msg = count_occurances(message)

# Print some results to the console.
symbols_msg = list(results_msg.keys())
counts_msg = np.asarray(list(results_msg.values()))

print('\nSymbol  Count')
print('-------------')
for k, v in results_msg.items():
    print('  {:02d}     {:2d}'.format(k, v))

print('\n Total: {:d}\n'.format(total_msg))


Symbol  Count
-------------
  00      4
  01      5
  02      6
  03      5
  04      5
  05      8
  06      5
  07      5
  08      6
  09      3
  10      5
  11      8
  12      4
  13      3
  14      3
  15      6
  16      3
  17      4
  18      3
  19      6
  20      3
  21      3
  22      4
  23      6
  24      5
  25      2

 Total: 120

Histogram

That long list of numbers doesn't intuitively convey the story I'm building here. Let's make a histogram! Note that I'm using the prettyplotlib package to give the matplotlib graphics a more appealing look. I'm going to implement my histogram display as a function to make life easier later on.


In [32]:
def hist_display(results):
    """Display a nice histogram of symbol frequencies.
    
    Parameters
    ----------
    results : dict, output from function count_occurances().
    """
    symbols = list(results.keys())
    num_symbols = len(symbols)

    counts = np.asarray(list(results.values()))

    total = np.sum(counts)
    
    # Make some pretty labels with symbol name and percent of total observations.
    labels = ['{:s} | {:04.1f}%'.format(str(s), results[s]/float(total)*100.) for s in symbols]
    labels = np.asarray(labels)  # np array of strings.
    
    # Sort symbols by count value.
    ix_sort = np.argsort(counts)

    # Generate the plot.
    fig, ax = plt.subplots(figsize=(6, 12))

    ppl.barh(ax, range(num_symbols), counts[ix_sort], yticklabels=labels[ix_sort],
             height=.7, align='edge', edgecolor='grey')

    ax.set_ylim(0, num_symbols)
    ax.set_xlabel('Counts')

Using the function defined just above results in a neat little histogram shown below.


In [33]:
hist_display(results_msg)


Google's N-Gram Data

Peter Norvig at Google compiled an astounding set of statistics from information gleaned from Google's book scanning effort.

But for this little task I need access to more basic data in a form suitable for programmatic manipulation. Peter Norvig makes available a large number of data files summarizing the frequency of occurrence of each letter, all as a function of the letter's position in a word. From Peter Norvig's web site:

Each column is given a name of the form "wordlength / start : end". For example, "4/2:3" means that the column counts the number of ngrams that occur in 4 letter words (such as "then"), and only in position 2 through 3 (such as the "he" in "then"). We aggregate counts with a notation involving a "*": the notation "*/2:3" refers to the second through third position within words of any length; "4/*" refers to any start positions in words of length 4; and "*/*" means any start position within words of any length. Finally, we also aggregate counts for positions near the ends of words: the notation "*/-3:-2" means the third-to-last through second-to-last position in words of any length (for example, this would be the bigram "he" for the words "hen", "then", "lexicographer", and "greatgrandfather").

I am going to focus here on N-grams of lengths one through three. The corresponding Google Fusion Tables generated by Peter Norvig are linked from his write-up.

For my first time through I am going to ignore dependencies on word size and on symbol position within a word.
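
Incidentally, since requests is already imported, the data file could be fetched programmatically rather than downloaded by hand. A minimal sketch, with a placeholder URL standing in for wherever the exported CSV actually lives:

import requests

url = 'http://example.com/ngrams1.csv'  # placeholder, not the real location
response = requests.get(url)
response.raise_for_status()  # fail loudly if the download went wrong

with open('ngrams1.csv', 'wb') as f:
    f.write(response.content)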

Side Note:

One interesting tool for accessing high-level results from the above N-Gram data is Google's Ngram Viewer. Querying the words horse, train, and automobile paints an interesting picture of how often each term appears in books over time.


In [34]:
fname = 'ngrams1.csv'
df = pd.read_csv(fname)

symbols_goog = df['1-gram'].values
counts_goog = df['*/*']

# Extract the raw counts; hist_display() converts them to percentages internally.
counts_goog = counts_goog.values

# Pack symbols and counts into a dict, mirroring the output of count_occurances().
results_goog = dict(zip(symbols_goog, counts_goog))

# Pretty histogram display.
hist_display(results_goog)


Decode

Decode the message by substituting letters for the two most frequent symbols in the frequency-ordered list.



In [35]:
# Testing with my own message.
text_test = """
My wife and I sat down and looked this over for a little while. She loves playing Words with Friends on
her phone and she had some good ideas about how to interpret the symbols for the smaller words. It looks
to me that this note might be a poem, judging by the periodic sentences, plus what might be the author's name
appended at the very end. After an hour we had not made much progress and so I decided to pursue a nerdy
statistical approach. This will be an interesting effort to work on during my free time over the holiday.
"""

text_test = text_test.lower()
for z in ".,'\n ":
    text_test = text_test.replace(z, '')

results_test, total_test = count_occurances(text_test)

hist_display(results_test)



In [36]:
ix_sort_msg  = np.argsort(counts_msg)[::-1]
ix_sort_goog = np.argsort(counts_goog)[::-1]

# Map the N_fix most frequent message symbols onto the N_fix most frequent English letters.
N_fix = 2
remap = {}

for k in range(N_fix):
    a = symbols_msg[ix_sort_msg[k]]
    b = symbols_goog[ix_sort_goog[k]]
    remap[a] = b.lower()

# A manual guess to seed the mapping: assume symbol 1 stands for 'a'.
remap[1] = 'a'

# Print a first decoding attempt, rendering unmapped symbols as underscores.
for word_msg in message:
    word_goog = ''
    for s in word_msg:
        word_goog += remap.get(s, '_')
        
    print(word_goog)


_a___e______
te_______
___
_t
e__t____
__t___
_e____t_____
a_
___
_______e
__
_
____t__e___
__
_a_
__________
__a__
_a
___
t___
____et___
___
_______e_
____
______

In [41]:
def decode_message(message, line_ends, mapping):
    """Decode a message word-by-word using a symbol-to-letter mapping.

    Unmapped symbols are rendered as underscores, and a line break is
    inserted after each word whose index appears in line_ends.
    """
    message_plain = ''
    
    for counter, word_code in enumerate(message):
        word_plain = ''
        
        for s in word_code:
            word_plain += mapping.get(s, '_') + ' '
            
        word_plain += ' '
        message_plain += word_plain
        
        if counter in line_ends:
            message_plain += '\n'
            
    return message_plain

In [42]:
m = decode_message(message, line_ends, remap)

In [43]:
line_ends


Out[43]:
[3, 8, 14, 19, 24]

In [44]:
print(m)


_ a _ _ _ e _ _ _ _ _ _  t e _ _ _ _ _ _ _  _ _ _  _ t  
e _ _ t _ _ _ _  _ _ t _ _ _  _ e _ _ _ _ t _ _ _ _ _  a _  _ _ _  
_ _ _ _ _ _ _ e  _ _  _  _ _ _ _ t _ _ e _ _ _  _ _  _ a _  
_ _ _ _ _ _ _ _ _ _  _ _ a _ _  _ a  _ _ _  t _ _ _  
_ _ _ _ e t _ _ _  _ _ _  _ _ _ _ _ _ _ e _  _ _ _ _  _ _ _ _ _ _  

