In [1]:
%matplotlib inline
1. (25 points) A Caesar cipher is a very simple method of encoding and decoding data. The cipher simply replaces each character with the character offset by $k$ places. For example, if the offset is 3, we replace a with d, b with e, etc. The cipher wraps around, so we replace y with b, z with c, and so on. Punctuation, spaces and numbers are left unchanged. Note that we don't need a decode function - we can just use a negative offset to reverse the encoding.
Write a function encode(s, k), where s is the string to be encoded and k is the offset. Check that you can encode
If you think Python is hell, try writing this function in R!
with offset 10 as
Sp iye drsxu Zidryx sc rovv, dbi gbsdsxq drsc pexmdsyx sx B!
and make sure you can recover the original string with offset -10.
Hint: Use the following
chr
ord
string.ascii_uppercase
string.ascii_lowercase
str.maketrans
str.translate
dictionaries
In [3]:
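import string

def encode(s, k):
    """Caesar cipher encoding with offset k for string s"""
    # Map each uppercase letter to the letter k places on, wrapping modulo 26
    t = {c: chr(ord('A') + (ord(c) - ord('A') + k) % 26)
         for c in string.ascii_uppercase}
    # Same mapping for lowercase letters
    t1 = {c: chr(ord('a') + (ord(c) - ord('a') + k) % 26)
          for c in string.ascii_lowercase}
    t.update(t1)
    # Characters missing from the table (punctuation, spaces, digits) pass through unchanged
    table = str.maketrans(t)
    return s.translate(table)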
In [4]:
s1 = encode('If you think Python is hell, try writing this function in R!', 10)
s1
Out[4]:
'Sp iye drsxu Zidryx sc rovv, dbi gbsdsxq drsc pexmdsyx sx B!'
In [5]:
s2 = encode(s1, -10)
s2
Out[5]:
'If you think Python is hell, try writing this function in R!'
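As a cross-check on the dictionary-based solution, the same cipher can be sketched with two rotated alphabets passed to str.maketrans. The name encode_rot is hypothetical and not part of the assignment; this is one possible alternative, not the expected answer.

```python
import string

def encode_rot(s, k):
    """Caesar cipher via rotated alphabets (hypothetical alternative to encode)."""
    k %= 26  # Python's % maps negative offsets into the range 0..25
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    # Pair each alphabet with a copy of itself rotated left by k places
    table = str.maketrans(lower + upper,
                          lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return s.translate(table)

msg = 'If you think Python is hell, try writing this function in R!'
coded = encode_rot(msg, 10)
print(coded)
print(encode_rot(coded, -10) == msg)
```

Reducing k modulo 26 first means that a negative offset such as -10 is handled as a forward rotation by 16, so no separate decode path is needed.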
2. (50 points)
Read the file ecoli.fas into a string variable containing only the sequence data, with no header information or line breaks. The string should start with agcttttca and be 4639675 characters long. (5 points)
In [7]:
with open('ecoli.fas') as f:
    lines = f.readlines()
# Skip the header line and join the remaining lines without line breaks
seq = ''.join(line.strip() for line in lines[1:])
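The cell above reads every line into a list before joining. A minimal sketch of the same idea that streams the lines instead is shown here, assuming the file holds a single record with exactly one header line. The helper name read_sequence is hypothetical, and io.StringIO with made-up bases stands in for ecoli.fas so that the example runs without the data file.

```python
import io

def read_sequence(handle):
    """Return the sequence of a single-record file as one string."""
    next(handle)  # discard the single header line
    return ''.join(line.strip() for line in handle)

# Toy stand-in for ecoli.fas; the header and bases are made up for illustration
fake_file = io.StringIO('>toy header\nagctttt\ncattctg\n')
toy_seq = read_sequence(fake_file)
print(toy_seq, len(toy_seq))
```

With the real file, the call would be wrapped in with open('ecoli.fas') as f: and the result checked against the expected prefix and length given in the question.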
In [8]:
# GC content: fraction of bases that are c or g
(seq.count('c') + seq.count('g'))/len(seq)
Out[8]:
In [9]:
# Mean number of 'a' per window of 10 bases: zip over 10 shifted copies of seq
windows = (seq[i:] for i in range(10))
sum(s.count('a') for s in zip(*windows))/(len(seq)-9)
Out[9]:
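Sliding windows built with zip only work if each input is shifted one position further than the previous one. The sketch below checks this on a short made-up string by comparing the zip-based counts against explicit slicing; the names toy and width are illustrative only.

```python
toy = 'aacgtaaacgta'  # made-up sequence for illustration
width = 3

# zip over shifted copies yields one tuple of `width` characters per window
shifted = (toy[i:] for i in range(width))
zip_counts = [w.count('a') for w in zip(*shifted)]

# The same windows obtained by explicit slicing
slice_counts = [toy[i:i + width].count('a')
                for i in range(len(toy) - width + 1)]

print(zip_counts == slice_counts)
print(sum(zip_counts) / len(zip_counts))
```

Because zip stops at the shortest input, the number of windows is len(toy) - width + 1, which is why the full-sequence calculation divides by len(seq) - 9.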
In [10]:
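import numpy as np
# Indicator array: 1 where the base is 'a', 0 elsewhere
s = np.where(np.array(list(seq)) == 'a', 1, 0)
# Convolving with a kernel of ten ones gives the count of 'a' in each window of 10
kernel = np.ones(10)
counts = np.convolve(s, kernel, mode='valid')
counts.mean()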
Out[10]:
In [11]:
import re
# Non-greedy match: shortest stretch between 'gatt' and the next 'aca'
pattern = re.compile(r'gatt(.*?)aca')
for m in pattern.finditer(seq[:10000]):
    print(m.start(), m.group(1))
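The ? after .* makes the quantifier non-greedy, so each match stops at the first aca that follows gatt. A small sketch on a made-up string shows how this differs from the greedy form; the variable names are illustrative.

```python
import re

toy = 'ccgattccacaggattgacatt'  # made-up sequence for illustration

# Non-greedy: the captured group is as short as possible
lazy = [(m.start(), m.group(1))
        for m in re.finditer(r'gatt(.*?)aca', toy)]

# Greedy: the captured group runs to the last 'aca' in the string
greedy = [(m.start(), m.group(1))
          for m in re.finditer(r'gatt(.*)aca', toy)]

print(lazy)
print(greedy)
```

The non-greedy pattern finds two separate matches here, while the greedy pattern swallows both into a single match.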
3. (25 points) Read in the text of Ulysses by James Joyce from the file 'Ulysses.txt'.
Find the 10 most frequently used words that begin with the letter 'u' in the full text, using a generator to read in only one line at a time (this is essential when dealing with huge text files that may otherwise exhaust memory).
Note: punctuation is any character in string.punctuation from the string module.
In [1]:
import string
# Translation table that deletes every punctuation character (built once, outside the loop)
table = str.maketrans({c: None for c in string.punctuation})
counter = {}
with open('Ulysses.txt') as f:
    for line in f:  # the file object yields one line at a time
        line = line.strip().lower().translate(table)
        for word in line.split():
            counter[word] = counter.get(word, 0) + 1
In [2]:
n = 0
# Walk the words from most to least frequent and stop after ten that start with 'u'
for word, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
    if word.startswith('u'):
        print(word, count)
        n += 1
        if n == 10:
            break
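The same count can be sketched with an explicit generator function and collections.Counter, which makes the one-line-at-a-time requirement visible in the code. The helper name words is hypothetical, and the io.StringIO text is a made-up stand-in so the example runs without Ulysses.txt.

```python
import io
import string
from collections import Counter

def words(handle):
    """Yield lowercase, punctuation-free words, reading one line at a time."""
    table = str.maketrans('', '', string.punctuation)
    for line in handle:
        yield from line.lower().translate(table).split()

# Made-up stand-in text; use open('Ulysses.txt') for the real file
text = io.StringIO('Up, up and under!\nUnder the umbrella; up again.\n')
u_counts = Counter(w for w in words(text) if w.startswith('u'))
print(u_counts.most_common(10))
```

Filtering on the leading 'u' before counting keeps the Counter small, and most_common(10) replaces the manual sort and loop counter.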