For the latest version of this document see https://github.com/markvanheeswijk/kryptos
This document describes how to get started with decoding Kryptos from scratch, gradually going all the way from ROT-13 to decoding Kryptos, discussing the relevant cryptanalysis principles and code along the way. Take your time to really understand each approach and its limitations. There will be some challenge ciphers to solve along the way.
First, we will look at the Caesar cipher and 4 possible ways to attack it:
Then we will move on to the Vigenere cipher, which is very similar to the Caesar cipher, except that the consecutive shifts of the plaintext characters are determined by a keyword. If the key has length N, then each Nth character of the plaintext will be shifted by the same shift, determined by the corresponding character in the key. Therefore, we can re-use many of the same principles which we already saw for solving the Caesar cipher.
Finally, we will look at the Keyed Caesar and Keyed Vigenere variants, which add an extra complication in the form of a keyed alphabet:
To conclude, we will use the principles discussed to solve Kryptos K1 and K2, using nothing but the 4 panels of Kryptos.
(for more references, see below)
The Caesar cipher is an example of a substution cipher, where each symbol in the plaintext is replaced by another symbol through a fixed mapping. In case of the Caesar cipher, the target alphabet is a shifted version of plaintext alphabet. A particular instance of this is ROT-13, where the target alphabet is shifted by 13 positions
ABCDEFGHIJKLMNOPQRSTUVWXYZ
becomes
NOPQRSTUVWXYZABCDEFGHIJKLM
and A gets replaced by N, B by O, etc.
Now, let's define a simple function for encrypting and decrypting using a Caesar cipher:
In [1]:
def rot(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", direction=1):
keyval = alphabet.find(key)
t = ""
for sc in s:
i = alphabet.find(sc)
t += alphabet[(i + keyval * direction) % len(alphabet)] if i > -1 else sc
return t
It takes a letter as key, finds its position in the alphabet, and shifts / rotates the alphabet by that amount. Symbols that are not in the alphabet are unaffected. Now, define some text and encrypt it using the function:
In [2]:
ptext = '"I HEAR AND I FORGET. I SEE AND I REMEMBER. I DO AND I UNDERSTAND." --CONFUSIUS'
ctext = rot(ptext, 'N')
ctext
Out[2]:
Since we know the key we can decrypt it:
In [3]:
decrypted_ctext = rot(ctext, 'N', direction = -1)
decrypted_ctext
Out[3]:
In [4]:
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for key in alphabet:
decrypted_ctext = rot(ctext, key, direction = -1)
print "%s:\t%s" % (key, decrypted_ctext)
In [5]:
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = [12,4]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# determine letter frequency in plaintext, and plot
ptext_freq = [ptext.count(c)/len(ptext) for c in alphabet]
pfig = plt.figure()
plt.bar(range(len(alphabet)), ptext_freq, tick_label = list(alphabet), align = 'center', color = 'b')
plt.title('plaintext letter frequency')
# determine letter frequency in ciphertext, and plot
ctext_freq = [ctext.count(c)/len(ctext) for c in alphabet]
cfig = plt.figure()
plt.bar(range(len(alphabet)), ctext_freq, tick_label = list(alphabet), align = 'center', color = 'r')
plt.title('ciphertext letter frequency')
Out[5]:
From these plots, it can clearly be seen that the Caesar cipher shifts the frequencies of the letters (which is what we would expect, knowing how Caesar works). Finding the right key now corresponds to finding that shift, which aligns the frequency plots as well as possible. This can be done in two ways:
In the next sections, we'll look at ways to find the alignment automatically. For now, let's align and determine the key manually by aligning the frequency plots. Use the slider to align the frequency plots.
Note: normally you would not have the frequency plot of the plaintext of course, and you would instead use the frequencies based on some sufficiently large text in the same language.
In [6]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def rotate(l, n):
return l[n:] + l[:n]
@interact(key_i = (0,25,1))
def plot_graphs(key_i = 0):
# decrypt ciphertext using this key
key = alphabet[key_i]
decrypted_ctext = rot(ctext, key, direction = -1)
# determine letter frequency in plaintext, and plot # FIXME: base on separate text
ptext_freq = np.array([ptext.count(c)/len(ptext) for c in alphabet])
pfig = plt.subplot(2,1,1)
plt.bar(range(len(alphabet)), ptext_freq, tick_label = list(alphabet), align = 'center', color = 'b')
plt.title('blue = plaintext letter frequency, green = decrypted ciphertext letter frequency')
# determine letter frequency in ciphertext, and plot
ctext_freq = np.array([decrypted_ctext.count(c)/len(decrypted_ctext) for c in alphabet])
cfig = plt.subplot(2,1,2)
plt.bar(range(len(alphabet)), ctext_freq, tick_label = rotate(list(alphabet), key_i), align = 'center', color = 'g')
plt.xlabel("key = %s" % key)
plt.show()
print decrypted_ctext
We just saw that finding the right key consists of aligning the letter-frequency distributions. If we can measure how well aligned the distributions are, it can be done automatically. This is where the Chi-squared Statistic comes in: it measures how different an observed distribution is from an expected distribution:
$$ \chi^2(O,E) = \sum_{i=A}^Z \frac{(O_i - E_i)^2}{E_i} $$where $O_i$ and $E_i$ are the observed and expected numbers of character $i$ in a text. It can also be expressed in terms of frequencies/probabilities:
$$ \chi^2 = N \sum_{i=A}^Z \frac{(O_i/N - p_i)^2}{p_i} = N \sum_{i=A}^Z \frac{(f_i - p_i)^2}{p_i} $$where $f_i$ and $p_i$ are the observed and expected relative frequencies of the characters, and $N$ is the length of the text.
For more details see:
Now, let's use it to find the key:
In [7]:
import numpy as np
# english letter frequencies, estimated from small sample text # FIXME: replace by better estimates
g_english = [0.0736, 0.0148, 0.0445, 0.0302, 0.102, 0.0227, 0.0122, 0.0277, 0.0855, 0.000557, 0.00237, 0.0342, 0.0206, 0.0717, 0.103, 0.0246, 0.00181, 0.0735, 0.0608, 0.0889, 0.0392, 0.0153, 0.0173, 0.000557, 0.032, 0.000278]
def chi_square(f, g, N):
chi2 = 0
chi2 = N * np.sum(np.array([(ff-gg)**2/gg for (ff,gg) in zip(f,g)]))
return chi2
In [8]:
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
results = []
for key in alphabet:
decrypted_ctext = rot(ctext, key, direction = -1)
f = np.array([decrypted_ctext.count(c) for c in alphabet])
N = np.sum(f) # note: not counting the non-alphabet characters here
f = f / N
chi2_decrypted_ctext = chi_square(f, g_english, N)
results.append((chi2_decrypted_ctext, key, decrypted_ctext))
print "%s:\t%f\t%s" % (key, chi2_decrypted_ctext, decrypted_ctext)
print "\nthe best key and plaintext:\n %s" % repr(min(results))
TL;DR: for a given decoding, estimate how much like English (or another language) it is. Best key is the one that gives the decoding that is most like English (or another language)...
The final way to solve is to, for each possible key, compute the likelihood that the decrypted text is plaintext in a particular language. This of course is unnecessary for something as simple as the Caesar cipher, but now think of a cipher where you would have thousands or millions of possible keys...
For now, let's focus on understanding how it works. Spoiler: in a next section, this approach will be used to automatically determine the key for Kryptos and decrypted the K1 and K2 parts of it.
In the following we will assume this to be English, but it could just as well be a different language with different statistics. This is something that you should decide or guess from context.
In order to measure how much like English a certain string is, we need to compare its properties with strings from the English language. The criterion that will be used here to measure this will be so-called n-gram statistics, which in short, measuring for each sequence of n symbols, what is the frequency / probability that it appears in an English text.
Then, the likelihood that the observed decrypted ciphertext is an English text, is the joint probability of the observing the ngrams in English. Assuming the observations are independent and identically distributed, the joint probability of the observed ngrams in the decrypted ciphertext as follows:
$$ \mathcal{L}(English \mid decrypt(ctext, key)) = p(decrypt(ctext, key)) \mid English) = \prod_i p(ngram_i \mid English) $$where $ngram_i$ is the $i^{th}$ sequence of $n$ letters in the decrypted ciphertext. For example, in case $decrypt(ctext, key) = \texttt{ANEXAMPLETEXT}$ and using 3-grams, the observed 3-grams are
$$ \texttt{ANE, NEX, EXA, XAM, AMP, MPL, PLE, LET, ETE, TEX, EXT}, $$and therefore the likelihood that it is an English text can be computed as
$$ \begin{align} \mathcal{L}(English \mid \texttt{ANEXAMPLETEXT}) &= p(\texttt{ANEXAMPLETEXT} \mid English) \\ &= p(\texttt{ANE} \mid English) \cdot p(\texttt{NEX} \mid English) \cdot \ldots \cdot p(\texttt{TEX} \mid English) \cdot p(\texttt{EXT} \mid English). \end{align} $$Since the numbers involved typically get very small, people often work with the log-probabilities instead:
$$ \begin{align} \log \mathcal{L}(English \mid decrypt(ctext, key)) &= \log \prod_i p(ngram_i \mid English)\\ &= \sum_i \log p(ngram_i \mid English) \end{align} $$This so-called log-likelihood that the decrypted ctext is English text (i.e. $\log \mathcal{L}(English \mid decrypt(ctext, key))$) can now be used as a criterion to determine the best key. However, we need to estimate the probabilities of certain ngrams appearing in the English language (i.e. all $\log p(ngram_i \mid English)$). For this, we analyze some reference text (the longer the better).
In [9]:
import re
def find_ngrams(text, n):
text = re.sub(r'[^A-Z]', r'', text.upper()) #remove everything that is not A-Z
return zip(*[text[i:] for i in range(n)]) #determine ngrams
The above function takes a text and a number n and returns that text as ngrams:
Now, using this function, we can easily create a file of ngrams and their counts in some reference text. These statistics are later used to estimate the likelihood that another string is from the same language.
In [10]:
def compute_ngram_statistics(infile, outfile, n):
# read in file, and remove newlines
s = ''
print "reading %s..." % infile
with open(infile) as f:
s += " ".join(line.strip() for line in f)
print "finding and counting %d-grams..." % n
# compute ngrams
qgs = find_ngrams(s, n)
# count them
d = {}
for qg in qgs:
if d.has_key("".join(qg)):
d["".join(qg)]+=1
else:
d["".join(qg)]=1
# convert to sorted list and write to file
print "writing results to %s..." % outfile
qg_list = sorted(list(d.items()), key=lambda x: x[1], reverse=True)
f = open(outfile, 'w')
for t in qg_list:
f.write("%s %d\n" % t)
print "done!"
In [11]:
compute_ngram_statistics('ngrams/gutenberg_sherlock.txt', 'ngrams/en_sherlock_1grams', 1)
compute_ngram_statistics('ngrams/gutenberg_sherlock.txt', 'ngrams/en_sherlock_2grams', 2)
compute_ngram_statistics('ngrams/gutenberg_sherlock.txt', 'ngrams/en_sherlock_3grams', 3)
compute_ngram_statistics('ngrams/gutenberg_sherlock.txt', 'ngrams/en_sherlock_4grams', 4)
Finally, let's crack Caesar using this. We use function ngram_score
(source: practicalcryptography.com) for computing the log-likelihood that a text is a particular language, given a file with ngram counts for that language.
In [12]:
from ngram_score import ngram_score
fitness = ngram_score('ngrams/en_sherlock_1grams') # load our ngram statistics
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
results = []
for key in alphabet:
decrypted_ctext = rot(ctext, key, direction = -1)
ll_decrypted_ctext = fitness.score(decrypted_ctext)
results.append((ll_decrypted_ctext, key, decrypted_ctext))
print "%s:\t%f\t%s" % (key, ll_decrypted_ctext, decrypted_ctext)
print "\nthe most likely key and plaintext:\n %s" % repr(max(results))
Perhaps a bit underwhelming to use this for cracking the Caesar cipher. However, now we have a useful tool in hand that we can use for other more complicated ciphers as well, like the Vigenere cipher which we will now look at.
Now we will move on to the Vigenere cipher, which is very similar to the Caesar cipher, except that the consecutive shifts of the plaintext characters are determined by a keyword. If the key has length $N$, then each $N^{th}$ character of the plaintext will be shifted by the same shift, determined by the corresponding character in the key. This relation of Vigenere being N interleaved Caesar ciphers is also visible from the code:
In [13]:
def vigenere(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", direction=1):
t = []
key_i = 0
for i in range(len(s)):
if s[i] in alphabet: # only process symbols from the specified alphabet
t.append(rot(s[i],key[key_i], alphabet, direction))
key_i = (key_i + 1) % len(key)
else:
t.append(s[i])
return "".join(t)
With Caesar we had:
In [14]:
ptext = '"I HEAR AND I FORGET. I SEE AND I REMEMBER. I DO AND I UNDERSTAND." --CONFUSIUS'
ctext = rot(ptext, 'N')
ctext
Out[14]:
which can be seen as:
"I HEAR AND I FORGET. I SEE AND I REMEMBER. I DO AND I UNDERSTAND." --CONFUSIUS
N NNNN NNN N NNNNNN N NNN NNN N NNNNNNNN N NN NNN N NNNNNNNNNN NNNNNNNNN
------------------------------------------------------------------------------
"V URNE NAQ V SBETRG. V FRR NAQ V ERZRZORE. V QB NAQ V HAQREFGNAQ." --PBASHFVHF
Whereas, for Vigenere we have:
In [15]:
ptext = '"I HEAR AND I FORGET. I SEE AND I REMEMBER. I DO AND I UNDERSTAND." --CONFUSIUS'
ctext = vigenere(ptext, 'NO')
ctext
Out[15]:
which can be seen as:
"I HEAR AND I FORGET. I SEE AND I REMEMBER. I DO AND I UNDERSTAND." --CONFUSIUS
N ONON ONO N ONONON O NON ONO N ONONONON O NO NON O NONONONONO NONONONON
------------------------------------------------------------------------------
"V VROE OAR V TBFTSG. W FSR OAR V FRARAOSE. W QC NBQ W HBQSEGGOAR." --PCATHGVIF
Since Vigenere is so similar to the Caesar cipher, we can re-use many of the same principles which we already saw for solving the Caesar cipher. However, before we can attack the Vigenere cipher in the same way as the Caesar cipher, we need to guess the length of the key. Again, for this we can use the frequency distributions, however, now we look at the frequency distribution of every $N^{th}$, since each $N^{th}$ character is shifted with the same key. In summary, we are going to try different key lengths and see for which key length, the N different frequency distributions look most like shifted English distributions. The technical way to measure this is through the Index of Coincidence (IC), which characterizes piece of text by the likelihood of drawing a pair of identical letters at random from that text.
YFMRFDYCYRBEEXEBKRTTKMBRYJKEQNEHXVLXHDMRIRBEEGLNWVKXDUGRSVZIVCUNWVWVLSEAJIMXDNLRYTWRVUSGXFNXKQAYUYIFHFWENKBIQAUGYNMRWKSVCKQQHEIAIZNJHDEAYIWAVQAPMRTTKMBRYJPMIFEQHPKPLOAYQPBSWTEYJWBGRYPNWVLXRFHRUIMZLAUFFCXLDNEGHFZVHEPBSUQRJFOGMVBAHZTLXZFTRESVGCMEHEAEHZXLHDSGIZNJHDEAYGWMQFSVSKPIHZCEDGBMRZPETTMWVFHRHZXLHDUFJJIHLRFRWVVXDXPUFSMXIDOZTEMSIFHRWFEWKQAYUYIFHFUFJUIXHMCUUFQRWPECJELWRZAEJGMEWUNTPVGARDDNote how similar this approach is to the principles used to crack the Caesar cipher.
As mentioned, the technical way to measure how much a text is like a possibly shifted English letter frequency distribution, is through the Index of Coincidence (IC), which characterizes piece of text by the likelihood of drawing (without replacement) a pair of identical letters at random from that text:
$$ IC = \sum_{i=1}^c \big( \frac{n_i}{N} \times \frac{n_i-1}{N-1} \big) $$where $n_i$ is the count of the $i^{th}$ character in the alphabet, $c$ is the number of characters in the alphabet and $N$ is the length of the text.
Each language has its own typical IC, and for N large enough you would get:
$$ IC = \sum_{i=1}^c f_i^2 $$where $f_i$ is the relative frequency of the $i^{th}$ character in the alphabet of that language. As you can see from the formula, a nice property of the IC is it is not affected by a substitution cipher. Therefore, it can also be used to determine the language of the plaintext if you are only given the ciphertext of a substitution cipher, or whether you are even dealing with a substitution cipher.
For more details, see:
The IC can be used to determine the most likely key length, using the fact that it is higher for plaintext (or plaintext under a particular substitution) than for random text. To illustrate how this can be used for determining the key length, consider this as a representation of a ciphertext encrypted with a key of length 5:
ABCDEABCDEABCDEABCDEABCDEABCDEABCDE
where A represents a ciphertext character encrypted with the first character of the key, B the second, etc.
Not knowing the key length, we could try a key length of 3 or 4 and obtain:
supposed key length=3
ABC
DEA
BCD
EAB
CDE
ABC
DEA
BCD
EAB
CDE
ABC
DE
and
supposed key length=4
ABCD
EABC
DEAB
CDEA
BCDE
ABCD
EABC
DEAB
CDE
It can be seen that in each column the characters are encrypted using different letters of the key. Therefore the ciphertext characters in that column will be quite random and the IC will be low. However, when trying key length 5, we get
supposed key length=5
ABCDE
ABCDE
ABCDE
ABCDE
ABCDE
ABCDE
ABCDE
Each column will now correspond to a particular substitution / Caesar cipher corresponding to that key character, and the IC will be higher. Looking at average IC of all columns, will indicate the key length.
For more details, see:
Now, suppose we have the following ciphertext:
VYCHVUYPESJMZCJZTXNOOEFSXMBQJTTNQAXWKBWTPDDKTUODCOPGJFNTMOPRLACBZGTWEAEEEOKWAKCXCHVOATLGFMVYZRWBNGHPQUCDRKSGXSGWZRZWMURLSTLCHLZWBAECCIMEESTLLPWVNCCMYLELPKAAPTCZKYCTZQOIKGICRMEQTSPWMGHKFTYOHRDCUBAQEQWTVRBPCFKGPWGGFEQTZFSDWDVCCLOYVPBFWGPVFPLYRTMCWVSDCCIHSBLGCLPWTVKPBNVNRLAVWVPQTOEACSYJIUVVPBXSFARC
Let's try different key lengths and determine the average IC for each of them.
In [16]:
from __future__ import division
import numpy as np
# computes the IC for a given string
def IC(s, alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
s_filtered = filter(s, alphabet) # we only care about the letters in our alphabet
ic = 0
for c in alphabet:
c_count = s_filtered.count(c)
N = len(s_filtered)
if c_count > 1:
ic += (c_count * (c_count - 1)) / (N * (N - 1) / len(alphabet))
return ic
# helper function to filter out non-alphabet characters
def filter(s, alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
return "".join([c for c in s if c in alphabet])
# computes the avg IC of subsequences of each n'th character
def mean_IC(s, n, alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
s_filtered = filter(s, alphabet) # we only care about the letters in our alphabet
s_filtered_subseq = [s_filtered[i::n] for i in range(n)]
ic = []
for i in range(n):
ic.append(IC(s_filtered_subseq[i]))
return np.mean(ic)
In [17]:
ctext = "VYCHVUYPESJMZCJZTXNOOEFSXMBQJTTNQAXWKBWTPDDKTUODCOPGJFNTMOPRLACBZGTWEAEEEOKWAKCXCHVOATLGFMVYZRWBNGHPQUCDRKSGXSGWZRZWMURLSTLCHLZWBAECCIMEESTLLPWVNCCMYLELPKAAPTCZKYCTZQOIKGICRMEQTSPWMGHKFTYOHRDCUBAQEQWTVRBPCFKGPWGGFEQTZFSDWDVCCLOYVPBFWGPVFPLYRTMCWVSDCCIHSBLGCLPWTVKPBNVNRLAVWVPQTOEACSYJIUVVPBXSFARC"
In [18]:
from matplotlib import pyplot as plt
mean_ic_n = []
for n in range(1,32):
mean_ic = mean_IC(ctext.upper(), n)
mean_ic_n.append(mean_ic)
print "%2d %02f" % (n, mean_ic)
plt.bar(range(1,len(mean_ic_n)+1), mean_ic_n, align = 'center')
plt.xlabel("key length")
plt.ylabel("average IC")
Out[18]:
Given the peaks at a key length of 13 and 26, it can be concluded that the key is likely 13 characters long.
Now we know that the key is likely 13 characters long, we can determine the most likely shifts for each of the 13 Caesar ciphers using the Chi-square criterion.
In [19]:
key_length = 13
for i in range(key_length):
ctext_i = ctext[i::key_length]
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
chi2_k = []
for key in alphabet:
decrypted_ctext_i_k = rot(ctext_i, key, direction = -1)
f = np.array([decrypted_ctext_i_k.count(c) for c in alphabet])
N = np.sum(f) # note: not counting the non-alphabet characters here
f = f / N
chi2_decrypted_ctext_i_k = chi_square(f, g_english, N)
chi2_k.append((chi2_decrypted_ctext_i_k, key, decrypted_ctext_i_k))
print "%s:\t%f\t%s" % (key, chi2_decrypted_ctext_i_k, decrypted_ctext_i_k)
print "the best key and plaintext:\n %s\n" % repr(min(chi2_k))
According to the Chi-square analysis the key is OCYPTANALYSIS.
In [20]:
vigenere(ctext, 'OCYPTANALYSIS', direction = -1)
Out[20]:
Pretty close, but no cigar. Let's try CRYPTANALYSIS.
In [21]:
vigenere(ctext, 'CRYPTANALYSIS', direction = -1)
Out[21]:
TZKHT BTWJN BKGDP GVFVO HBTWA ZFFZP JVMJT VTHAV VAMFB BSBKU JBUKL BDOKR MWZLH RHFFH WLEGF GISTR WBDSL VOHVY UQSWN PFOXR WKKAN YBBXY NGHAR TJBZN BQZUE VUGUR MQOST VPBFA WWHWM OFFKU AMQFI AFHVR WVUFE GZHYR AMTSS OFSEZ DKTKP RDICN CQAFA OPIKG QMYWA AJBXB OBTWF BVFVA LZKHT REAVF BISWS VUPVN AAAXT UFTFH AUQKS NHSJG QMRAR FUHYE NMTSV RCSVA BWXNE QXVZY NBTWF BVFKU VMEKA TFFVZ JQZKA FPBVB OBTWM BTHWN VWGKU ATCCI NLOGD RTWEG QMIGR YEHYR BKGDP GVFVP XVFAN HFGKB KMAXI AUSIR BBFGC EZDKN WIXQS GTPFG QIYST RVFRA MXDGF RTGZB WIXOH BBFVN CBQEP GJBXG XLQUI CISIG QMRGU EUVJR LBUGN GISRE CQELH NTGFS JZSAV ROHNB LTGWS GPHYV BAQUT VPB
more difficult to break (due to the text not being very typical?): IUOCA LGTAM DAQFN ISJTN MZGWS UZCBH GSSJT NM?LZ EYVKG LLZEE BJVPK EAGOW VQUXI EMVZB ZWING GTUSL IOOOC AYSTH FJGLS FDTSS PAEAT TFVWV VWRGS MWVVL OAOMP SFGWN MGEIL AONYV QMKDA NHDGG CFOWB TQCLL HIT?L JMQKH OVDFQ LKBUS AGLGM TTIWT MKGME XZGZW PWHPC PWOKT HFWZI ULLOD SVQGF ?ONMQ YELZI SXSUP AKLAT LOMKK AGFPV PAJTY FAIPL VEGSW GAXAF TZKGD WFMIO MVMKK IXQGK VLXIV FKGKG FDSOG TBZKE VFFVG KWVEO VGOJW ESFAI PLEIN VLGAX GRTZX QCJKE CPFFA OWSTJ VDGJG WSPoints to consider:
For solving Vigenere, we could also use Maximum Likelihood to evaluate how much like plaintext a decrypted ciphertext is. Combined with a way to search through possible keys, we can search for the best key. The advantage of using maximum likelihood is also that we can easily consider longer subsequences, rather than just look at the message as a bag of unordered letters where each letter has a frequency. By considering statistics of longer subsequences, we can better recognize when some text is written in a particular language.
This approach is adapted from:
Note that this is just one possible way to search through the possible keys. It does not use any information like IC for determining the likely key length N, or Chi-square for determining the likely values for the shifts of the N Caesar ciphers. Instead, for a range of key lengths, it exhaustively searches the possibilities for the first 3 characters of the key and keeps track of the best candidates. Then, it extends these subkeys one character at a time, while keeping track of the best candidates.
An alternative way would be to determine the key length and likely key, using IC and Chi-square criteria, and then gradually improve they key using a stochastic search algorithm like simulated annealing.
During the search, we need a way to keep track of the best results so far. This can be easily achieved using the $\texttt{nbest}$ class:
In [22]:
# keep a list of the N best things we have seen, discard anything else
# this class can also be imported using 'import nbest' from a separate file nbest.py
class nbest(object):
def __init__(self,N=1000):
self.store = []
self.N = N
def add(self,item):
self.store.append(item)
self.store.sort(reverse=True)
self.store = self.store[:self.N]
def __getitem__(self,k):
return self.store[k]
def __len__(self):
return len(self.store)
In [23]:
ctext = "VYCHVUYPESJMZCJZTXNOOEFSXMBQJTTNQAXWKBWTPDDKTUODCOPGJFNTMOPRLACBZGTWEAEEEOKWAKCXCHVOATLGFMVYZRWBNGHPQUCDRKSGXSGWZRZWMURLSTLCHLZWBAECCIMEESTLLPWVNCCMYLELPKAAPTCZKYCTZQOIKGICRMEQTSPWMGHKFTYOHRDCUBAQEQWTVRBPCFKGPWGGFEQTZFSDWDVCCLOYVPBFWGPVFPLYRTMCWVSDCCIHSBLGCLPWTVKPBNVNRLAVWVPQTOEACSYJIUVVPBXSFARC"
The search for the best key, as described above, is performed using the code below. Rather than solving and evaluating each shift individually, the criterion used to evaluate the quality of a keyword is the likelihood of the 4-grams in the decrypted ciphertext being English.
In [24]:
from itertools import permutations
from ngram_score import ngram_score
import re
import pprint as pp
qgram = ngram_score('ngrams/en_sherlock_4grams') # load our 4gram statistics
trigram = ngram_score('ngrams/en_sherlock_3grams') # load our 3gram statistics
# keep track of the 100 best keys
N=100
# test keys up to a length 15
for KLEN in range(3,16):
print "="*80
rec = nbest(N)
# exhaustively test all possible letters for first 3 entries of the key and keep track of the N best ones
# if KLEN=7, this will test e.g. FOOAAAA and BARAAAA
for i in permutations('ABCDEFGHIJKLMNOPQRSTUVWXYZ',3):
i = "".join(i)
key = ''.join(i) + 'A'*(KLEN-len(i))
decrypted_ctext = vigenere(ctext, key, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += trigram.score(decrypted_ctext[j:j+3])
rec.add((score,''.join(i), decrypted_ctext))
next_rec = nbest(N)
# for the remaining KLEN-3 characters of the key,
for i in range(0,KLEN-3):
# go over the N best keys found so far...
for k in xrange(N):
# ...and determine the best next character of the key, while keeping best N keys so far
for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
key = rec[k][1] + c
fullkey = key + 'A'*(KLEN-len(key))
decrypted_ctext = vigenere(ctext, fullkey, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += qgram.score(decrypted_ctext[j:j+len(key)])
next_rec.add((score,key, decrypted_ctext))
rec = next_rec
next_rec = nbest(N)
# show the results
bestscore = rec[0][0]
bestkey = rec[0][1]
#decrypted_ctext = rec[0][2]
# always show entire decrypted ctext, even if the above analysis is done only on part of the ctext, e.g. ctext[0:100]
decrypted_ctext = vigenere(ctext, bestkey, direction = -1)
print bestscore, 'klen', KLEN, ':"'+bestkey+'",', decrypted_ctext
# uncomment the following lines to see top-10 results
#pp.pprint(rec.store[0:10])
#print '\n'
Here we can see that using the different criterion and search algorithm the correct key and plaintext have successfully been found.
Instead of trying to determine the key character by character, we can also bruteforce it and keep track of the best key as before.
The wordlist used here was downloaded from: https://packetstormsecurity.com/Crackers/wordlists/dictionaries/
In [25]:
ctext = "VYCHVUYPESJMZCJZTXNOOEFSXMBQJTTNQAXWKBWTPDDKTUODCOPGJFNTMOPRLACBZGTWEAEEEOKWAKCXCHVOATLGFMVYZRWBNGHPQUCDRKSGXSGWZRZWMURLSTLCHLZWBAECCIMEESTLLPWVNCCMYLELPKAAPTCZKYCTZQOIKGICRMEQTSPWMGHKFTYOHRDCUBAQEQWTVRBPCFKGPWGGFEQTZFSDWDVCCLOYVPBFWGPVFPLYRTMCWVSDCCIHSBLGCLPWTVKPBNVNRLAVWVPQTOEACSYJIUVVPBXSFARC"
In [26]:
from ngram_score import ngram_score
import re
import pprint as pp
qgram = ngram_score('ngrams/en_sherlock_4grams') # load our 4gram statistics
trigram = ngram_score('ngrams/en_sherlock_3grams') # load our 3gram statistics
# keep track of the 100 best keys
N=100
rec = nbest(N)
f = open('wordlists/websters-dictionary','r')
i = 1
for key in f:
key = re.sub(r'[^A-Z]','',key.upper())
decrypted_ctext = vigenere(ctext, key, direction = -1)
score = qgram.score(decrypted_ctext)
rec.add((score,key, decrypted_ctext))
i += 1
if i % 10000 == 0:
bestscore = rec[0][0]
bestkey = rec[0][1]
decrypted_ctext = vigenere(ctext, bestkey, direction = -1)
print "%20s\t%5.2f,\t%s,\t%s,\t%s" % (key, bestscore, len(bestkey), bestkey, decrypted_ctext)
Again, the key was found successfully. Interestingly, before finding CRYPTANALYSIS, very similar words of the same length were identified as candidate keys:
- ACROPARALYSIS
- CHROMATOLYSIS
- CRYPTANALYSIS
The Caesar Cipher is a substitution cipher, which substitutes letters by letters at the same index in a shifted version of the alphabet, i.e. in case of rot13, a Caesar cipher with key 'N'
plain alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZ
cipher alphabet NOPQRSTUVWXYZABCDEFGHIJKLM (and other shifted versions of it, shown alphabet corresponds to key 'N')
A variant of the Caesar cipher is the Keyed Caesar Cipher, which uses an additional alphabet keyword to create a keyed cipher alphabet, i.e. in case the alphabet keyword is KRYPTOS:
plain alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZ
cipher alphabet KRYPTOSABCDEFGHIJLMNQUVWXZ (and other shifted versions of it, shown alphabet corresponds to key 'A')
RYPTOSABCDEFGHIJLMNQUVWXZK (and other shifted versions of it, shown alphabet corresponds to key 'B')
Finally, we can also use a keyed alphabet for both the plain alphabet and the cipher alphabet to obtain we will call a Caesar Cipher with a keyed alphabet here (note the difference with Keyed Caesar):
plain alphabet KRYPTOSABCDEFGHIJLMNQUVWXZ
cipher alphabet KRYPTOSABCDEFGHIJLMNQUVWXZ (and other shifted versions of it, shown alphabet corresponds to key 'K')
RYPTOSABCDEFGHIJLMNQUVWXZK (and other shifted versions of it, shown alphabet corresponds to key 'R')
First, let's define a function for computing a keyed alphabet. It takes an alphabet and a keyword consisting of symbols from that alphabet. Then, it removes all but the first occurence of each character in the keyword (i.e. CRYPTANALYSIS becomes CRYPTANLSI), and then appends the remaining characters of the alphabet in order to obtain the keyed alphabet (i.e. CRYPTANLSIBDEFGHJKMOQUVWXZ)
In [27]:
import re
def compute_keyed_alphabet(keyword, alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
# remove double keyword letters, keeping first occurence
keyword_unique = ""
for c in keyword:
if c not in keyword_unique:
keyword_unique += c
# compute cipher alphabet
keyed_alphabet = keyword_unique + re.sub("[%s]" % keyword, "", alphabet)
return keyed_alphabet
In [28]:
def keyed_caesar(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key="A", direction=1):
# compute cipher alphabet
keyed_alphabet = compute_keyed_alphabet(alpha_key, alphabet)
# set source and destination alphabet depending on direction
if direction == 1:
src_alphabet = alphabet
dst_alphabet = keyed_alphabet
else:
src_alphabet = keyed_alphabet
dst_alphabet = alphabet
# encrypt / decrypt
keyval = alphabet.find(key)
t = ""
for sc in s:
i = src_alphabet.find(sc)
t += dst_alphabet[(i + keyval * direction) % len(dst_alphabet)] if i > -1 else sc
return t
In [29]:
ptext = "THESCULPTUREHASBEENBOTHAPUZZLEANDAMYSTERYFORTHOSEWHOHOPETOCRACKTHECYPHEREDMESSAGESCONTAINEDWITHINTHESCULPTURESTWOTHOUSANDALPHABETICLETTERSINTHETWENTYYEARSSINCEKRYPTOSWASERECTEDTHREEOFTHEFOURSECTIONSHAVEBEENCONFIRMEDTOHAVEBEENSOLVEDNOONEHASYETBEENABLETOSOLVETHEREMAININGNINETYSEVENCHARACTERMESSAGE"
Now, like we did before with the normal Caesar cipher and look at the effect of the cipher on the frequency distribution:
In [30]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,9]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
keyed_alphabet = compute_keyed_alphabet("KRYPTOS", alphabet) # UNCOMMENT IF YOU WANT ORDER THE
def rotate(l, n):
return l[n:] + l[:n]
@interact(key_i = (0,25,1))
def plot_graphs(key_i = 0):
# encrypt ptext using this key
key = alphabet[key_i]
ctext = keyed_caesar(ptext, key, alpha_key="KRYPTOS", direction=1)
# determine letter frequency in plaintext, and plot
ptext_freq = np.array([ptext.count(c)/len(ptext) for c in alphabet])
pfig = plt.subplot(3,1,1)
plt.bar(range(len(alphabet)), ptext_freq, tick_label = list(alphabet), align = 'center', color = 'b')
plt.title('blue = plaintext letter frequency, red = ciphertext letter frequency')
# determine letter frequency in ciphertext, and plot, ordered by keyed alphabet
ctext_freq = np.array([ctext.count(c)/len(ctext) for c in keyed_alphabet])
cfig = plt.subplot(3,1,2)
plt.bar(range(len(keyed_alphabet)), ctext_freq, tick_label = rotate(list(keyed_alphabet), key_i), align = 'center', color = 'r')
plt.xlabel("key = %s" % key)
# determine letter frequency in ciphertext, and plot, ordered by alphabet
ctext_freq = np.array([ctext.count(c)/len(ctext) for c in alphabet])
cfig = plt.subplot(3,1,3)
plt.bar(range(len(alphabet)), ctext_freq, tick_label = rotate(list(alphabet), key_i), align = 'center', color = 'r')
plt.xlabel("key = %s" % key)
plt.show()
print ctext
We can see that due to the keyed alphabet, the effect of Keyed Caesar is not just a shift of the frequency distributions (unless we look at the frequencies in order of the keyed alphabet).
In [31]:
ctext = keyed_caesar(ptext, "N", alpha_key="KRYPTOS", direction=1)
ctext
Out[31]:
In [32]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,6]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def rotate(l, n):
return l[n:] + l[:n]
@interact(key_i = (0,25,1))
def plot_graphs(key_i = 0):
# decrypt ciphertext using this key
key = alphabet[key_i]
# NOTE THAT WE DO NOT KNOW THE KEY TO THE ALPHABET, SO WE CANNOT USE THAT!!!
decrypted_ctext = keyed_caesar(ctext, key, alpha_key = "A", direction=-1)
# determine letter frequency in plaintext, and plot # FIXME: base on separate text
ptext_freq = np.array([ptext.count(c)/len(ptext) for c in alphabet])
pfig = plt.subplot(2,1,1)
plt.bar(range(len(alphabet)), ptext_freq, tick_label = list(alphabet), align = 'center', color = 'b')
plt.title('blue = plaintext letter frequency, green = decrypted ciphertext letter frequency')
# determine letter frequency in ciphertext, and plot
ctext_freq = np.array([decrypted_ctext.count(c)/len(decrypted_ctext) for c in alphabet])
cfig = plt.subplot(2,1,2)
plt.bar(range(len(alphabet)), ctext_freq, tick_label = rotate(list(alphabet), key_i), align = 'center', color = 'g')
plt.xlabel("key = %s" % key)
plt.show()
print decrypted_ctext
For easy reference, this is the subtitution when keying the alphabet with 'KRYPTOS', and using key 'N':
plain alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZ
cipher alphabet KRYPTOSABCDEFGHIJLMNQUVWXZ (and other shifted versions of it, shown alphabet corresponds to key 'A')
GHIJLMNQUVWXZKRYPTOSABCDEF (and other shifted versions of it, shown alphabet corresponds to key 'N')
From this plot, it can be seen that when we do not know what was the keyword for creating the keyed alphabet, the cipher cannot be broken by simply aligning the frequency distributions and a more general approach for breaking the substitution is needed.
However, since it is a substitution cipher, we could find the substitution and entire cipher alphabet as follows:
More on this approach can be found here.
In [33]:
ctext = keyed_caesar(ptext, "N", alpha_key="KRYPTOS", direction=1)
ctext
Out[33]:
In [34]:
def substitution(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", direction=1):
if direction == 1:
src_alphabet = alphabet
dst_alphabet = key
else:
src_alphabet = key
dst_alphabet = alphabet
t = ""
for c in s:
if c in src_alphabet:
t += dst_alphabet[src_alphabet.find(c)]
else:
t += c
return t
In [35]:
import random
import re
from ngram_score import ngram_score
fitness = ngram_score('ngrams/en_sherlock_4grams') # load our quadgram statistics
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
maxkey = list(alphabet)
maxscore = -99e9
parentscore,parentkey = maxscore,maxkey[:]
print "Substitution Cipher solver, you may have to wait several iterations for the correct result. Halt execution by halting the kernel, using the stop button in the toolbar."
# keep going until we are killed by the user
i = 0
rec = nbest(1000)
MAX_ITER = 10
while i < MAX_ITER:
i = i+1
random.shuffle(parentkey)
# KEYED CAESAR WITH KEY "A" IS JUST A SUBSTITUTION CIPHER
deciphered = substitution(ctext, "".join(parentkey), alphabet, direction=-1)
parentscore = fitness.score(deciphered)
count = 0
while count < 1000:
a = random.randint(0,len(alphabet)-1)
b = random.randint(0,len(alphabet)-1)
child = parentkey[:]
# swap two characters in the child
child[a],child[b] = child[b],child[a]
deciphered = substitution(ctext, "".join(child), alphabet, direction=-1)
score = fitness.score(deciphered)
# if the child was better, replace the parent with it
if score > parentscore:
parentscore = score
parentkey = child[:]
count = 0
count = count+1
rec.add((score, child, deciphered))
# keep track of best score printed so far, and print if improved
if parentscore > maxscore:
maxscore = rec[0][0]
maxkey = rec[0][1]
#deciphered = rec[0][2]
deciphered = substitution(ctext, "".join(maxkey), alphabet, direction=-1)
print '\nbest score so far:',maxscore,'on iteration',i
print ' best key: '+ "".join(maxkey)
print ' plaintext: '+ deciphered
In [36]:
def caesar_keyed_alphabet(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key="", direction=1):
# compute keyed alphabet
keyed_alphabet = compute_keyed_alphabet(alpha_key, alphabet)
return rot(s, key, keyed_alphabet, direction)
In [37]:
ptext = "THESCULPTUREHASBEENBOTHAPUZZLEANDAMYSTERYFORTHOSEWHOHOPETOCRACKTHECYPHEREDMESSAGESCONTAINEDWITHINTHESCULPTURESTWOTHOUSANDALPHABETICLETTERSINTHETWENTYYEARSSINCEKRYPTOSWASERECTEDTHREEOFTHEFOURSECTIONSHAVEBEENCONFIRMEDTOHAVEBEENSOLVEDNOONEHASYETBEENABLETOSOLVETHEREMAININGNINETYSEVENCHARACTERMESSAGE"
alpha_key_ = "KRYPTOS"
In [38]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,9]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
keyed_alphabet = compute_keyed_alphabet(alpha_key_, alphabet)
def rotate(l, n):
return l[n:] + l[:n]
@interact(key_i = (0,25,1))
def plot_graphs(key_i = 0):
# encrypt plaintext using this key
key = keyed_alphabet[key_i]
ctext = caesar_keyed_alphabet(ptext, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key=alpha_key_, direction=1)
# determine letter frequency in plaintext, and plot
ptext_freq = np.array([ptext.count(c)/len(ptext) for c in keyed_alphabet])
pfig = plt.subplot(3,1,1)
plt.bar(range(len(keyed_alphabet)), ptext_freq, tick_label = list(keyed_alphabet), align = 'center', color = 'b')
plt.title('blue = plaintext letter frequency, red = ciphertext letter frequency')
# determine letter frequency in ciphertext, and plot, in order keyed alphabet
ctext_freq = np.array([ctext.count(c)/len(ctext) for c in keyed_alphabet])
cfig = plt.subplot(3,1,2)
plt.bar(range(len(keyed_alphabet)), ctext_freq, tick_label = rotate(list(keyed_alphabet), key_i), align = 'center', color = 'r')
plt.xlabel("key = %s" % key)
# determine letter frequency in ciphertext, and plot, in order of alphabet
ctext_freq = np.array([ctext.count(c)/len(ctext) for c in alphabet])
cfig = plt.subplot(3,1,3)
plt.bar(range(len(alphabet)), ctext_freq, tick_label = rotate(list(alphabet), key_i), align = 'center', color = 'r')
plt.xlabel("key = %s" % key)
plt.show()
print ctext
If we look at the frequencies in order of the keyed alphabet, then again the effect of Caesar with Keyed Alphabet can be seen as shifting the frequency distributions. However, this is not the case when the looking at the frequencies in order of the alphabet.
In [39]:
ptext = "THESCULPTUREHASBEENBOTHAPUZZLEANDAMYSTERYFORTHOSEWHOHOPETOCRACKTHECYPHEREDMESSAGESCONTAINEDWITHINTHESCULPTURESTWOTHOUSANDALPHABETICLETTERSINTHETWENTYYEARSSINCEKRYPTOSWASERECTEDTHREEOFTHEFOURSECTIONSHAVEBEENCONFIRMEDTOHAVEBEENSOLVEDNOONEHASYETBEENABLETOSOLVETHEREMAININGNINETYSEVENCHARACTERMESSAGE"
ctext = caesar_keyed_alphabet(ptext, key="N", alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key="KRYPTOS", direction=1)
ctext
Out[39]:
In [40]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,6]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def rotate(l, n):
return l[n:] + l[:n]
@interact(key_i = (0,25,1))
def plot_graphs(key_i = 0):
# decrypt ciphertext using this key
key = alphabet[key_i]
# NOTE THAT WE DO NOT KNOW THE KEY TO THE ALPHABET, SO WE CANNOT USE THAT!!!
decrypted_ctext = caesar_keyed_alphabet(ctext, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key="A", direction=-1)
# determine letter frequency in plaintext, and plot
ptext_freq = np.array([ptext.count(c)/len(ptext) for c in alphabet])
pfig = plt.subplot(2,1,1)
plt.bar(range(len(alphabet)), ptext_freq, tick_label = list(alphabet), align = 'center', color = 'b')
plt.title('blue = plaintext letter frequency, green = decrypted ciphertext letter frequency')
# determine letter frequency in ciphertext, and plot
ctext_freq = np.array([decrypted_ctext.count(c)/len(decrypted_ctext) for c in alphabet])
cfig = plt.subplot(2,1,2)
plt.bar(range(len(alphabet)), ctext_freq, tick_label = rotate(list(alphabet), key_i), align = 'center', color = 'g')
plt.xlabel("key = %s" % key)
plt.show()
print decrypted_ctext
In [41]:
ptext = "THESCULPTUREHASBEENBOTHAPUZZLEANDAMYSTERYFORTHOSEWHOHOPETOCRACKTHECYPHEREDMESSAGESCONTAINEDWITHINTHESCULPTURESTWOTHOUSANDALPHABETICLETTERSINTHETWENTYYEARSSINCEKRYPTOSWASERECTEDTHREEOFTHEFOURSECTIONSHAVEBEENCONFIRMEDTOHAVEBEENSOLVEDNOONEHASYETBEENABLETOSOLVETHEREMAININGNINETYSEVENCHARACTERMESSAGE"
ctext = caesar_keyed_alphabet(ptext, key="N", alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key="KRYPTOS", direction=1)
ctext
Out[41]:
In [42]:
import random
import re
from ngram_score import ngram_score
fitness = ngram_score('ngrams/en_sherlock_4grams') # load our quadgram statistics
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
maxkey = list(alphabet)
maxscore = -99e9
parentscore,parentkey = maxscore,maxkey[:]
print "Substitution Cipher solver, you may have to wait several iterations for the correct result."
# keep going until we are killed by the user
i = 0
rec = nbest(1000)
MAX_ITER = 10 # increase if needed
while i < MAX_ITER:
i = i+1
random.shuffle(parentkey)
# KEYED CAESAR WITH KEY "A" IS JUST A SUBSTITUTION CIPHER
deciphered = substitution(ctext, "".join(parentkey), alphabet, direction=-1)
parentscore = fitness.score(deciphered)
count = 0
while count < 1000:
a = random.randint(0,len(alphabet)-1)
b = random.randint(0,len(alphabet)-1)
child = parentkey[:]
# swap two characters in the child
child[a],child[b] = child[b],child[a]
deciphered = substitution(ctext, "".join(child), alphabet, direction=-1)
score = fitness.score(deciphered)
# if the child was better, replace the parent with it
if score > parentscore:
parentscore = score
parentkey = child[:]
count = 0
count = count+1
rec.add((score, child, deciphered))
# keep track of best score printed so far, and print if improved
if parentscore > maxscore:
maxscore = rec[0][0]
maxkey = rec[0][1]
#deciphered = rec[0][2]
deciphered = substitution(ctext, "".join(maxkey), alphabet, direction=-1)
print '\nbest score so far:',maxscore,'on iteration',i
print ' best key: '+ "".join(maxkey)
print ' plaintext: '+ deciphered
Now we will move on to the Keyed Vigenere cipher, which is very similar to the Caesar cipher with a keyed alphabet, except that the consecutive shifts of the plaintext characters are determined by a keyword. If the key has length $N$, then each $N^{th}$ character of the plaintext will be shifted by the same shift, determined by the corresponding character in the key. This relation of Vigenere being N interleaved Caesar ciphers with keyed alphabet is also visible from the code:
In [43]:
def keyed_vigenere(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key="", direction=1):
# compute keyed alphabet
keyed_alphabet = compute_keyed_alphabet(alpha_key, alphabet)
t = []
key_i = 0
for i in range(len(s)):
if s[i] in alphabet:
t_i = rot(s[i], key[key_i % len(key)],keyed_alphabet, direction)
t.append(t_i)
key_i += 1
else:
t.append(s[i])
return "".join(t)
First, let's load some Keyed Vigenere ciphertext, with unknown key and unknown alphabet key.
In [44]:
ctext = "BYDMXVPQAMCWFJGOTCCNXTKHOGSEIPAJTWGNBYDNBHCJWLBNCBHPNRNXNHSUOXHLTQQKYAUEMBOUUBWGNQVWUFKKBRVIITHVWNOWCSEXWOICJBPANTCNRPTNHSJVVLICIRSEIAPMQEEHIIVRETZGSEIQPLAXLXKOTWPCSOJGOSEEOKETKBDOOBHSJVPMOWHLTSHHPOVGVATKBSEVTFDVHONNQALNUKOXHPCDQKYAUZSPQPSKAGFPOQWBCWUHLPLMSAAZPBIECBXOTFMIKTITTICPOFTVHONKFIZDNCAJIVWDNXUWUHLCSEWTIHTXOECKBGUUMOHSUUCNLEQJETEOMABYDRUURUHLSWSJDAPMPPDNJIOBTCADIHYNBRQOGEFKQDAEBEHXDOFCNGXEFJVSELPZEHTSCLPPJAHEAPLEIZACVODENXXACTANMLAJITWXSORTPOWDDUEMGZITKOSHSNRONPYGCXVWNHVBUSNXDNZHKQTLWNEPKCJEUYLRXECNRATZEXBNHOCPQWUKYGNBYDTOTPUYAQALSKCNRAIRTITNBTPPLLCGTCEBTVOTWHGTXMIEHKQPNRTZKHYPFAMOPUJCDUOWCSEZCGQEMGVAJTBUHLSLITJVVLICIRSEIITSQFSNCDSSLGLRXKONNPRKAJVVAEABLGSJCJRLOWFOGQDHXTGEYTOIDLVHPITAND"
In [45]:
IC(ctext)
Out[45]:
The IC tells us that it is likely not a transposition of English plaintext.
In [46]:
from matplotlib import pyplot as plt
%matplotlib inline
mean_ic_n = []
for n in range(1,32):
mean_ic = mean_IC(ctext.upper(), n)
mean_ic_n.append(mean_ic)
print "%2d %02f" % (n, mean_ic)
plt.rcParams["figure.figsize"] = [12,4]
plt.bar(range(1,len(mean_ic_n)+1), mean_ic_n, align = 'center')
plt.xlabel("key length")
plt.ylabel("average IC")
Out[46]:
The average IC for different key lengths tells us that it is likely encrypted with a key of length 9. We therefore have to crack 9 Caesar ciphers with a Keyed Alphabet, each of them with their own shift as defined by the Vigenere key.
Now, let's look at the frequency distributions for the 9 subsequences:
In [47]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,16]
def rotate(l, n):
return l[n:] + l[:n]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key_length = 9
g_english = [0.0736, 0.0148, 0.0445, 0.0302, 0.102, 0.0227, 0.0122, 0.0277, 0.0855, 0.000557, 0.00237, 0.0342, 0.0206, 0.0717, 0.103, 0.0246, 0.00181, 0.0735, 0.0608, 0.0889, 0.0392, 0.0153, 0.0173, 0.000557, 0.032, 0.000278]
pfig = plt.subplot(key_length+1,1,1)
plt.bar(range(len(alphabet)), g_english, tick_label = rotate(list(alphabet), 0), align = 'center', color = 'b')
ctext_freq = []
for i in range(key_length):
ctext_i = ctext[i::key_length]
ctext_freq.append(np.array([ctext_i.count(c)/len(ctext_i) for c in alphabet]))
cfig = plt.subplot(key_length+1,1,i+2)
plt.bar(range(len(alphabet)), ctext_freq[i], tick_label = rotate(list(alphabet), 0), align = 'center', color = 'r')
Each distribution looks quite different as a result using Caesar ciphers keyed alphabet. Each of the 9 Caesar ciphers with keyed alphabet uses the same unknown keyed alphabet, and its own shift, as defined by the 9-letter Vigenere keyword. Therefore, we need to determine both the keyed alphabet, and the Vigenere keywords.
In [48]:
from ngram_score import ngram_score
import re
import pprint as pp
qgram = ngram_score('ngrams/en_sherlock_4grams') # load our 4gram statistics
trigram = ngram_score('ngrams/en_sherlock_3grams') # load our 3gram statistics
# keep track of the 100 best keys
N=100
rec = nbest(N)
f = open('wordlists/websters-dictionary','r')
L = 42 # for speed, only consider first L characters in evaluation
i = 1
for alphakey_ in f:
alphakey_ = re.sub(r'[^A-Z]','',alphakey_.upper())
decrypted_ctext = keyed_vigenere(ctext[:L], key = "LODESTONE", alpha_key = alphakey_, direction = -1)
score = qgram.score(decrypted_ctext)
rec.add((score,alphakey_, decrypted_ctext))
i += 1
if i % 10000 == 0:
bestscore = rec[0][0]
bestalphakey_ = rec[0][1]
decrypted_ctext = keyed_vigenere(ctext, key = "LODESTONE", alpha_key = bestalphakey_, direction = -1)
print "%20s\t%5.2f,\t%s,\t%s,\t%s" % (alphakey_, bestscore, len(bestalphakey_), bestalphakey_, decrypted_ctext)
KWRGHGPEOKDBXGAWVMJDFVCVDHJZAMQGRNFHKWRLAVKWKWVFSCHLIVGRZCIHDUAWHPUFOKDNTLLEFCWHVBKBJOZRHVWCYVVGWAWADSDKNWGMBLUUPNKAHOHFZAJICCLGYKIMGZFKFODLIYCVHKHBIMGEFWEZHKZWVLMGAYDSDORZIYDJZMKBPLEMDGPSWAOJKAHPABWEXUGFIORNHHWSHWIHYOOLFFDUAODNUFOKDIQJTHIFTHIOIPNCWNDSDJOGIXTIFVAOSCDWVQTCDVRNQJUOIHKSHWIXNCHNPBTWENMNPKVNDSDGAWXNGRZCIOSFXHDDTAEMFHKAXZFRDJUWQJKWRKFHYGAWXBIGFKAYMJRLDQDKZEENMLOZHVYASWEFMPOZVOZKFWMMGEFWEOSORWGVDLRGJCMJNUVMTCXZAVQEEWKNGRFUYNTADWERMJNAEBOKUYXZJGRKYVMJZWESQDIYPLZHUCKBPLEMDOGRRLKVBEZWMFDTZVRNSWOKHMKAHMHVDKXZPBJJTPFFZHVVRNKURLDTIHMZIFKAHMAQKWRZHFMJOZYSQMRVHCGJNPFFZVYWVFMCVGHVVLOLMJTAUEDBJGWADSDPWHYNTEXUDNIGAWXJMJJICCLGYKIMGJZUFHIRWEEODEOKHFDAVOQYQGEIXNILOBIOKWHWIBXUAOKSZKSWJNDJTWKGLWRKIP
Now, let's take the same ciphertext and suppose we only know the alphabet key.
In [49]:
ctext = "BYDMXVPQAMCWFJGOTCCNXTKHOGSEIPAJTWGNBYDNBHCJWLBNCBHPNRNXNHSUOXHLTQQKYAUEMBOUUBWGNQVWUFKKBRVIITHVWNOWCSEXWOICJBPANTCNRPTNHSJVVLICIRSEIAPMQEEHIIVRETZGSEIQPLAXLXKOTWPCSOJGOSEEOKETKBDOOBHSJVPMOWHLTSHHPOVGVATKBSEVTFDVHONNQALNUKOXHPCDQKYAUZSPQPSKAGFPOQWBCWUHLPLMSAAZPBIECBXOTFMIKTITTICPOFTVHONKFIZDNCAJIVWDNXUWUHLCSEWTIHTXOECKBGUUMOHSUUCNLEQJETEOMABYDRUURUHLSWSJDAPMPPDNJIOBTCADIHYNBRQOGEFKQDAEBEHXDOFCNGXEFJVSELPZEHTSCLPPJAHEAPLEIZACVODENXXACTANMLAJITWXSORTPOWDDUEMGZITKOSHSNRONPYGCXVWNHVBUSNXDNZHKQTLWNEPKCJEUYLRXECNRATZEXBNHOCPQWUKYGNBYDTOTPUYAQALSKCNRAIRTITNBTPPLLCGTCEBTVOTWHGTXMIEHKQPNRTZKHYPFAMOPUJCDUOWCSEZCGQEMGVAJTBUHLSLITJVVLICIRSEIITSQFSNCDSSLGLRXKONNPRKAJVVAEABLGSJCJRLOWFOGQDHXTGEYTOIDLVHPITAND"
In [50]:
from itertools import permutations
from ngram_score import ngram_score
import re
import pprint as pp
qgram = ngram_score('ngrams/en_sherlock_4grams') # load our 4gram statistics
trigram = ngram_score('ngrams/en_sherlock_3grams') # load our 3gram statistics
# keep track of the 100 best keys
N=100
L=42 # only process L characters for speed
alphakey = "ENIGMA"
for KLEN in [9]: #range(3,10):
print "="*80
rec = nbest(N)
# exhaustively test all possible letters for first 3 entries of the key and keep track of the N best ones
# if KLEN=7, this will test e.g. FOOAAAA and BARAAAA
for i in permutations('ABCDEFGHIJKLMNOPQRSTUVWXYZ',3):
i = "".join(i)
key = ''.join(i) + 'A'*(KLEN-len(i))
decrypted_ctext = keyed_vigenere(ctext[:L], key, alpha_key = alphakey, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += trigram.score(decrypted_ctext[j:j+3])
rec.add((score,''.join(i), decrypted_ctext))
next_rec = nbest(N)
# for the remaining KLEN-3 characters of the key,
for i in range(0,KLEN-3):
# go over the N best keys found so far...
for k in xrange(N):
# ...and determine the best next character of the key, while keeping best N keys so far
for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
key = rec[k][1] + c
fullkey = key + 'A'*(KLEN-len(key))
decrypted_ctext = keyed_vigenere(ctext[:L], fullkey, alpha_key = alphakey, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += qgram.score(decrypted_ctext[j:j+len(key)])
next_rec.add((score, key, decrypted_ctext))
rec = next_rec
next_rec = nbest(N)
# show the results
bestscore = rec[0][0]
bestkey = rec[0][1]
#decrypted_ctext = rec[0][2]
# always show entire decrypted ctext, even if the above analysis is done only on part of the ctext, e.g. ctext[0:100]
decrypted_ctext = keyed_vigenere(ctext, bestkey, alpha_key = alphakey, direction = -1)
print bestscore, 'klen', KLEN, ':"'+bestkey+'",', decrypted_ctext
# uncomment the following lines to see top-10 results
#pp.pprint(rec.store[0:10])
#print '\n'
In [51]:
kryptos_L1 = """
EMUFPHZLRFAXYUSDJKZLDKRNSHGNFIVJ
YQTQUXQBQVYUVLLTREVJYQTMKYRDMFD
VFPJUDEEHZWETZYVGWHKKQETGFQJNCE
GGWHKK?DQMCPFQZDQMMIAGPFXHQRLG
TIMVMZJANQLVKQEDAGDVFRPJUNGEUNA
QZGZLECGYUXUEENJTBJLBQCRTBJDFHRR
YIZETKZEMVDUFKSJHKFWHKUWQLSZFTI
HHDDDUVH?DWKBFUFPWNTDFIYCUQZERE
EVLDKFEZMOQQJLTTUGSYQPFEUNLAVIDX
FLGGTEZ?FKZBSFDQVGOGIPUFXHHDRKF
FHQNTGPUAECNUVPDJMQCLQUMUNEDFQ
ELZZVRRGKFFVOEEXBDMVPNFQXEZLGRE
DNQFMPNZGLFLPMRJQYALMGNUVPDXVKP
DQUMEBEDMHDAFMJGZNUPLGEWJLLAETG"""
kryptos_L2 = """
ENDYAHROHNLSRHEOCPTEOIBIDYSHNAIA
CHTNREYULDSLLSLLNOHSNOSMRWXMNE
TPRNGATIHNRARPESLNNELEBLPIIACAE
WMTWNDITEENRAHCTENEUDRETNHAEOE
TFOLSEDTIWENHAEIOYTEYQHEENCTAYCR
EIFTBRSPAMHHEWENATAMATEGYEERLB
TEEFOASFIOTUETUAEOTOARMAEERTNRTI
BSEDDNIAAHTTMSTEWPIEROAGRIEWFEB
AECTDDHILCEIHSITEGOEAOSDDRYDLORIT
RKLMLEHAGTDHARDPNEOHMGFMFEUHE
ECDMRIPFEIMEHNLSSTTRTVDOHW?OBKR
UOXOGHULBSOLIFBBWFLRVQQPRNGKSSO
TWTQSJQSSEKZZWATJKLUDIAWINFBNYP
VTTMZFPKWGDKZXTJCDIGKUHUAUEKCAR"""
In [52]:
ctext_kryptos = (kryptos_L1+kryptos_L2).replace(" ", "").strip().split("\n")
In [53]:
def rot(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", direction=1):
keyval = alphabet.find(key)
t = ""
for sc in s:
i = alphabet.find(sc)
t += alphabet[(i + keyval * direction) % len(alphabet)] if i > -1 else sc
return t
In [54]:
import re
def compute_keyed_alphabet(keyword, alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
# remove double keyword letters, keeping first occurence
keyword_unique = ""
for c in keyword:
if c not in keyword_unique:
keyword_unique += c
# compute cipher alphabet
keyed_alphabet = keyword_unique + re.sub("[%s]" % keyword, "", alphabet)
return keyed_alphabet
In [55]:
def keyed_vigenere(s, key, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", alpha_key="", direction=1):
# compute keyed alphabet
keyed_alphabet = compute_keyed_alphabet(alpha_key, alphabet)
t = []
key_i = 0
for i in range(len(s)):
if s[i] in alphabet:
t_i = rot(s[i], key[key_i % len(key)],keyed_alphabet, direction)
t.append(t_i)
key_i += 1
else:
t.append(s[i])
return "".join(t)
In [56]:
from __future__ import division
import numpy as np
# computes the IC for a given string
def IC(s, alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
s_filtered = filter(s, alphabet) # we only care about the letters in our alphabet
ic = 0
for c in alphabet:
c_count = s_filtered.count(c)
N = len(s_filtered)
if c_count > 1:
ic += (c_count * (c_count - 1)) / (N * (N - 1) / len(alphabet))
return ic
# helper function to filter out non-alphabet characters
def filter(s, alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
return "".join([c for c in s if c in alphabet])
# computes the avg IC of subsequences of each n'th character
def mean_IC(s, n, alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
s_filtered = filter(s, alphabet) # we only care about the letters in our alphabet
s_filtered_subseq = [s_filtered[i::n] for i in range(n)]
ic = []
for i in range(n):
ic.append(IC(s_filtered_subseq[i]))
return np.mean(ic)
In [57]:
# keep a list of the N best things we have seen, discard anything else
# this class can also be imported using 'import nbest' from a separate file nbest.py
class nbest(object):
def __init__(self,N=1000):
self.store = []
self.N = N
def add(self,item):
self.store.append(item)
self.store.sort(reverse=True)
self.store = self.store[:self.N]
def __getitem__(self,k):
return self.store[k]
def __len__(self):
return len(self.store)
In [58]:
ctext = "".join(ctext_kryptos[:2])
ctext
Out[58]:
In [59]:
IC(ctext)
Out[59]:
Based on the IC, it looks like this is not plaintext, or a transposition of plaintext.
In [60]:
from matplotlib import pyplot as plt
%matplotlib inline
mean_ic_n = []
for n in range(1,32):
mean_ic = mean_IC(ctext.upper(), n)
mean_ic_n.append(mean_ic)
print "%2d %02f" % (n, mean_ic)
plt.rcParams["figure.figsize"] = [12,4]
plt.bar(range(1,len(mean_ic_n)+1), mean_ic_n, align = 'center')
plt.xlabel("key length")
plt.ylabel("average IC")
Out[60]:
From the mean IC and peaks at key length 10 and 20, it looks like the key length might be 10.
In [61]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,16]
def rotate(l, n):
return l[n:] + l[:n]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key_length = 10
g_english = [0.0736, 0.0148, 0.0445, 0.0302, 0.102, 0.0227, 0.0122, 0.0277, 0.0855, 0.000557, 0.00237, 0.0342, 0.0206, 0.0717, 0.103, 0.0246, 0.00181, 0.0735, 0.0608, 0.0889, 0.0392, 0.0153, 0.0173, 0.000557, 0.032, 0.000278]
pfig = plt.subplot(key_length+1,1,1)
plt.bar(range(len(alphabet)), g_english, tick_label = rotate(list(alphabet), 0), align = 'center', color = 'b')
ctext_freq = []
for i in range(key_length):
ctext_i = ctext[i::key_length]
ctext_freq.append(np.array([ctext_i.count(c)/len(ctext_i) for c in alphabet]))
cfig = plt.subplot(key_length+1,1,i+2)
plt.bar(range(len(alphabet)), ctext_freq[i], tick_label = rotate(list(alphabet), 0), align = 'center', color = 'r')
Now, let's try that again, but use the cipher alphabet order.
In [62]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,16]
def rotate(l, n):
return l[n:] + l[:n]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
keyed_alphabet = compute_keyed_alphabet("KRYPTOS")
key_length = 10
g_english = [0.0736, 0.0148, 0.0445, 0.0302, 0.102, 0.0227, 0.0122, 0.0277, 0.0855, 0.000557, 0.00237, 0.0342, 0.0206, 0.0717, 0.103, 0.0246, 0.00181, 0.0735, 0.0608, 0.0889, 0.0392, 0.0153, 0.0173, 0.000557, 0.032, 0.000278]
g_english_keyed = [g_english[alphabet.find(c)] for c in keyed_alphabet]
pfig = plt.subplot(key_length+1,1,1)
plt.bar(range(len(alphabet)), g_english_keyed, tick_label = rotate(list(keyed_alphabet), 0), align = 'center', color = 'b')
ctext_freq = []
for i in range(key_length):
ctext_i = ctext[i::key_length]
ctext_freq.append(np.array([ctext_i.count(c)/len(ctext_i) for c in keyed_alphabet]))
cfig = plt.subplot(key_length+1,1,i+2)
plt.bar(range(len(keyed_alphabet)), ctext_freq[i], tick_label = rotate(list(keyed_alphabet), 0), align = 'center', color = 'r')
In [63]:
from itertools import permutations
from ngram_score import ngram_score
import re
import pprint as pp
qgram = ngram_score('ngrams/en_sherlock_4grams') # load our 4gram statistics
trigram = ngram_score('ngrams/en_sherlock_3grams') # load our 3gram statistics
# keep track of the 100 best keys
N=100
L=42 # only process L characters for speed
alphakey = "KRYPTOS"
for KLEN in range(3,16):
print "="*80
rec = nbest(N)
# exhaustively test all possible letters for first 3 entries of the key and keep track of the N best ones
# if KLEN=7, this will test e.g. FOOAAAA and BARAAAA
for i in permutations('ABCDEFGHIJKLMNOPQRSTUVWXYZ',3):
i = "".join(i)
key = ''.join(i) + 'A'*(KLEN-len(i))
decrypted_ctext = keyed_vigenere(ctext[:L], key, alpha_key = alphakey, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += trigram.score(decrypted_ctext[j:j+3])
rec.add((score,''.join(i), decrypted_ctext))
next_rec = nbest(N)
# for the remaining KLEN-3 characters of the key,
for i in range(0,KLEN-3):
# go over the N best keys found so far...
for k in xrange(N):
# ...and determine the best next character of the key, while keeping best N keys so far
for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
key = rec[k][1] + c
fullkey = key + 'A'*(KLEN-len(key))
decrypted_ctext = keyed_vigenere(ctext[:L], fullkey, alpha_key = alphakey, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += qgram.score(decrypted_ctext[j:j+len(key)])
next_rec.add((score, key, decrypted_ctext))
rec = next_rec
next_rec = nbest(N)
# show the results
bestscore = rec[0][0]
bestkey = rec[0][1]
#decrypted_ctext = rec[0][2]
# always show entire decrypted ctext, even if the above analysis is done only on part of the ctext, e.g. ctext[0:100]
decrypted_ctext = keyed_vigenere(ctext, bestkey, alpha_key = alphakey, direction = -1)
print bestscore, 'klen', KLEN, ':"'+bestkey+'",', decrypted_ctext
# uncomment the following lines to see top-10 results
#pp.pprint(rec.store[0:10])
#print '\n'
In [64]:
ctext = "".join(ctext_kryptos[:3])
key = "PALIMPSEST"
alphakey = "KRYPTOS"
print ctext
print keyed_vigenere(ctext, key, alpha_key = alphakey, direction = -1)
From this, it can be seen that only the first 2 lines make up K1.
In [65]:
ctext_k1 = "".join(ctext_kryptos[:2])
key = "PALIMPSEST"
alphakey = "KRYPTOS"
ptext_k1 = keyed_vigenere(ctext_k1, key, alpha_key = alphakey, direction = -1)
print ctext_k1
print ptext_k1
In [66]:
from word_score import word_score
fitness = word_score()
print fitness.score(ptext_k1)
BETWEEN SUBTLE SHADING AND THE ABSENCE OF LIGHT LIES THE NUANCE OF IQLUSION
Let's start with the rest of the lines of the first panel for K2:
In [67]:
ctext = "".join(ctext_kryptos[2:14])
ctext
Out[67]:
In [68]:
IC(ctext)
Out[68]:
Based on the IC value, it looks like we are not dealing with plaintext, or a transposition of plaintext.
In [69]:
from matplotlib import pyplot as plt
%matplotlib inline
mean_ic_n = []
for n in range(1,32):
mean_ic = mean_IC(ctext.upper(), n)
mean_ic_n.append(mean_ic)
print "%2d %02f" % (n, mean_ic)
plt.rcParams["figure.figsize"] = [12,4]
plt.bar(range(1,len(mean_ic_n)+1), mean_ic_n, align = 'center')
plt.xlabel("key length")
plt.ylabel("average IC")
Out[69]:
This suggests a key length of 8.
In [70]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,16]
def rotate(l, n):
return l[n:] + l[:n]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key_length = 8
g_english = [0.0736, 0.0148, 0.0445, 0.0302, 0.102, 0.0227, 0.0122, 0.0277, 0.0855, 0.000557, 0.00237, 0.0342, 0.0206, 0.0717, 0.103, 0.0246, 0.00181, 0.0735, 0.0608, 0.0889, 0.0392, 0.0153, 0.0173, 0.000557, 0.032, 0.000278]
pfig = plt.subplot(key_length+1,1,1)
plt.bar(range(len(alphabet)), g_english, tick_label = rotate(list(alphabet), 0), align = 'center', color = 'b')
ctext_freq = []
for i in range(key_length):
ctext_i = ctext[i::key_length]
ctext_freq.append(np.array([ctext_i.count(c)/len(ctext_i) for c in alphabet]))
cfig = plt.subplot(key_length+1,1,i+2)
plt.bar(range(len(alphabet)), ctext_freq[i], tick_label = rotate(list(alphabet), 0), align = 'center', color = 'r')
Now, let's try that again, but use the cipher alphabet order instead.
In [71]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [12,16]
def rotate(l, n):
return l[n:] + l[:n]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
keyed_alphabet = compute_keyed_alphabet("KRYPTOS")
key_length = 8
g_english = [0.0736, 0.0148, 0.0445, 0.0302, 0.102, 0.0227, 0.0122, 0.0277, 0.0855, 0.000557, 0.00237, 0.0342, 0.0206, 0.0717, 0.103, 0.0246, 0.00181, 0.0735, 0.0608, 0.0889, 0.0392, 0.0153, 0.0173, 0.000557, 0.032, 0.000278]
g_english_keyed = [g_english[alphabet.find(c)] for c in keyed_alphabet]
pfig = plt.subplot(key_length+1,1,1)
plt.bar(range(len(alphabet)), g_english_keyed, tick_label = rotate(list(keyed_alphabet), 0), align = 'center', color = 'b')
ctext_freq = []
for i in range(key_length):
ctext_i = ctext[i::key_length]
ctext_freq.append(np.array([ctext_i.count(c)/len(ctext_i) for c in keyed_alphabet]))
cfig = plt.subplot(key_length+1,1,i+2)
plt.bar(range(len(keyed_alphabet)), ctext_freq[i], tick_label = rotate(list(keyed_alphabet), 0), align = 'center', color = 'r')
For K2, it looks like it might be possible to align the frequency plots by hand to break the key. To make this easier, let's create plots with two concatenated alphabets. That way, we can print them and align them (or do this by yourself on graph paper of course).
In [73]:
# https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/
# http://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from ipywidgets import interact, interactive, fixed
#import ipywidgets as widgets
plt.rcParams["figure.figsize"] = [18,16]
def rotate(l, n):
return l[n:] + l[:n]
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
keyed_alphabet = compute_keyed_alphabet("KRYPTOS")
ncopies = 2
key_length = 8
g_english = [0.0736, 0.0148, 0.0445, 0.0302, 0.102, 0.0227, 0.0122, 0.0277, 0.0855, 0.000557, 0.00237, 0.0342, 0.0206, 0.0717, 0.103, 0.0246, 0.00181, 0.0735, 0.0608, 0.0889, 0.0392, 0.0153, 0.0173, 0.000557, 0.032, 0.000278]
g_english_keyed = [g_english[alphabet.find(c)] for c in keyed_alphabet]
pfig = plt.subplot(key_length+1,1,1)
plt.bar(range(len(alphabet)*ncopies), g_english_keyed + [0]*26, tick_label = rotate(list(keyed_alphabet)*ncopies, 0), align = 'center', color = 'b')
ctext_freq = []
for i in range(key_length):
ctext_i = ctext[i::key_length]
ctext_freq.append(np.array([ctext_i.count(c)/len(ctext_i) for c in keyed_alphabet]))
cfig = plt.subplot(key_length+1,1,i+2)
plt.bar(range(len(keyed_alphabet)*ncopies), list(ctext_freq[i])*ncopies, tick_label = rotate(list(keyed_alphabet)*ncopies, 0), align = 'center', color = 'r')
Some of these could be guessed based on the alignment: at least the A and the first S seem reasonable to guess, giving A????S?? for the key, after which the possible candidate keys can be guessed or looked up, e.g. on onelook.com.
In [74]:
from itertools import permutations
from ngram_score import ngram_score
import re
import pprint as pp
qgram = ngram_score('ngrams/en_sherlock_4grams') # load our 4gram statistics
trigram = ngram_score('ngrams/en_sherlock_3grams') # load our 3gram statistics
# keep track of the 100 best keys
N=100
L=42 # only process first L characters for speed, increase if needed for accuracy
alphakey = "KRYPTOS"
for KLEN in [8]: #range(3,16):
print "="*80
rec = nbest(N)
# exhaustively test all possible letters for first 3 entries of the key and keep track of the N best ones
# if KLEN=7, this will test e.g. FOOAAAA and BARAAAA
for i in permutations('ABCDEFGHIJKLMNOPQRSTUVWXYZ',3):
i = "".join(i)
key = ''.join(i) + 'A'*(KLEN-len(i))
decrypted_ctext = keyed_vigenere(ctext[:L], key, alpha_key = alphakey, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += trigram.score(decrypted_ctext[j:j+3])
rec.add((score,''.join(i), decrypted_ctext))
next_rec = nbest(N)
# for the remaining KLEN-3 characters of the key,
for i in range(0,KLEN-3):
# go over the N best keys found so far...
for k in xrange(N):
# ...and determine the best next character of the key, while keeping best N keys so far
for c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
key = rec[k][1] + c
fullkey = key + 'A'*(KLEN-len(key))
decrypted_ctext = keyed_vigenere(ctext[:L], fullkey, alpha_key = alphakey, direction = -1)
score = 0
for j in range(0,len(ctext),KLEN):
score += qgram.score(decrypted_ctext[j:j+len(key)])
next_rec.add((score, key, decrypted_ctext))
rec = next_rec
next_rec = nbest(N)
# show the results
bestscore = rec[0][0]
bestkey = rec[0][1]
#decrypted_ctext = rec[0][2]
# always show entire decrypted ctext, even if the above analysis is done only on part of the ctext, e.g. ctext[0:100]
decrypted_ctext = keyed_vigenere(ctext, bestkey, alpha_key = alphakey, direction = -1)
print bestscore, 'klen', KLEN, ':"'+bestkey+'",', decrypted_ctext
# uncomment the following lines to see top-10 results
#pp.pprint(rec.store[0:10])
#print '\n'
In [75]:
ctext = "".join(ctext_kryptos[2:])
key = "ABSCISSA"
alphakey = "KRYPTOS"
print ctext
print keyed_vigenere(ctext, key, alpha_key = alphakey, direction = -1)
So K2 indeed seems to be just the remainder of the first panel.
In [76]:
ctext_k2 = "".join(ctext_kryptos[2:14])
key = "ABSCISSA"
alphakey = "KRYPTOS"
ptext_k2 = keyed_vigenere(ctext_k2, key, alpha_key = alphakey, direction = -1)
print ctext_k2
print ptext_k2
In [77]:
from word_score import word_score
fitness = word_score()
print fitness.score(ptext_k2)
In [78]:
ctext_k2_corrected = ctext_k2.replace("WJLLAETG", "SWJLLAETG") # insert missing character near the end
key = "ABSCISSA"
alphakey = "KRYPTOS"
ptext_k2_corrected = keyed_vigenere(ctext_k2_corrected, key, alpha_key = alphakey, direction = -1)
print ctext_k2_corrected
print ptext_k2_corrected
In [79]:
from word_score import word_score
fitness = word_score()
print fitness.score(ptext_k2_corrected)
IT WAS TOTALLY INVISIBLE HOWS THAT POSSIBLE? THEY USED THE EARTHS MAGNETIC FIELD X
THE INFORMATION WAS GATHERED AND TRANSMITTED UNDERGRUUND TO AN UNKNOWN LOCATION X
DOES LANGLEY KNOW ABOUT THIS? THEY SHOULD ITS BURIED OUT THERE SOMEWHERE X
WHO KNOWS THE EXACT LOCATION? ONLY WW THIS WAS HIS LAST MESSAGE X
THIRTY EIGHT DEGREES FIFTY SEVEN MINUTES SIX POINT FIVE SECONDS NORTH SEVENTY SEVEN DEGREES EIGHT MINUTES FORTY FOUR SECONDS WEST X
LAYER TWO