In [1]:
%matplotlib inline

Working with text

One of the major reasons for using Python is its powerful built-in methods for working with text data. Hence Python is often the language of choice for data munging or wrangling. These exercises give you some familiarity with how to work with text data.

1. (25 points) A Caesar cipher is a very simple method of encoding and decoding data. The cipher simply replaces characters with the character offset by $k$ places. For example, if the offset is 3, we replace a with d, b with e etc. The cipher wraps around so we replace y with b, z with c and so on. Punctuation, spaces and numbers are left unchanged. Note that we don't need a decode function - we can just use a negative offset to reverse the encoding.

  • Write a function encode(s, k) where s is the string to be encoded and k is the offset. Check that you can encode
If you think Python is hell, try writing this function in R!

with offset 10 as

Sp iye drsxu Zidryx sc rovv, dbi gbsdsxq drsc pexmdsyx sx B!

and make sure you can recover the original string with offset -10.

Hint: Use the following

chr
ord
string.ascii_uppercase
string.ascii_lowercase
str.maketrans
str.translate
dictionaries

In [3]:
def encode(s, k):
    """Caesar cipher encoding with offset k for string s"""
    import string
    
    t = {c: chr(ord('A') + (ord(c) - ord('A') + k) % 26) 
         for c in string.ascii_uppercase}
    t1 = {c: chr(ord('a') + (ord(c) - ord('a') + k) % 26) 
          for c in string.ascii_lowercase}
    t.update(t1)

    table = str.maketrans(t)
    return s.translate(table)

In [4]:
s1 = encode('If you think Python is hell, try writing this function in R!', 10)
s1


Out[4]:
'Sp iye drsxu Zidryx sc rovv, dbi gbsdsxq drsc pexmdsyx sx B!'

In [5]:
s2 = encode(s1, -10)
s2


Out[5]:
'If you think Python is hell, try writing this function in R!'
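
The hint list also mentions the two-string form of str.maketrans. A minimal sketch of that alternative is given here, assuming a hypothetical function name encode_rot that is not part of the original solution. It builds the translation table by rotating each alphabet by k places instead of computing every character with chr and ord:

```python
import string

def encode_rot(s, k):
    """Caesar cipher via rotated alphabets (hypothetical alternative to encode)."""
    k %= 26  # Python's % maps negative offsets into the range 0..25
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    # Map each letter to the letter k places further on, wrapping around;
    # characters missing from the table (spaces, digits, punctuation) pass through.
    table = str.maketrans(lower + upper,
                          lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return s.translate(table)

msg = 'If you think Python is hell, try writing this function in R!'
secret = encode_rot(msg, 10)
print(secret)
print(encode_rot(secret, -10))
```

Reducing k modulo 26 first means a negative offset decodes without a separate function, in the same way as the dictionary-based version.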

2. (50 points)

  • Read the E coli genomic DNA from the file ecoli.fas into a string variable containing only the sequence data with no header information or line breaks. The string should start with agcttttca and be 4639675 characters long. (5 points)
  • Find the CG ratio, defined as (c+g)/(a+c+t+g). (10 points)
  • Find the average number of occurrences of the letter 'a' in shifting windows of length 10. The first 3 windows are ('agcttttcat', 'gcttttcatt', 'cttttcattc'). (15 points)
  • Use regular expressions to find all non-overlapping occurrences of the string 'gatt-aca' where the '-' means any number of letters - that is, each string found must begin with 'gatt' and end with 'aca' but it does not matter what is in the middle. For each such string found, print the middle don't-care sequence and the starting position of the string (i.e. position of the first letter g in the full sequence). Restrict the search to the first 10,000 bases in the DNA sequence. (20 points)

In [7]:
with open('ecoli.fas') as f:
    lines = f.readlines()
seq = ''.join([line.strip() for line in lines[1:]])
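
The cell above assumes the header is exactly the first line of the file. A slightly more defensive sketch, assuming the usual FASTA convention that header lines start with '>', filters on that marker instead. The function name read_fasta_sequence and the toy record are illustrative and are not taken from ecoli.fas:

```python
import io

def read_fasta_sequence(handle):
    """Join all non-header lines of a single-record FASTA stream."""
    # Iterating over the handle reads one line at a time;
    # strip() removes the trailing newline from each sequence line.
    return ''.join(line.strip() for line in handle
                   if not line.startswith('>'))

# Toy data standing in for the real file (illustrative only)
demo = io.StringIO('>demo record\nagctttt\ncattctg\n')
demo_seq = read_fasta_sequence(demo)
print(demo_seq, len(demo_seq))
```

With the real data, the same function could be called as read_fasta_sequence(f) inside the with open('ecoli.fas') as f: block.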

In [8]:
(seq.count('c') + seq.count('g'))/len(seq)


Out[8]:
0.5078969970957018
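
Dividing by len(seq) matches the definition (c+g)/(a+c+t+g) only if the sequence contains nothing but those four letters. A minimal sketch that follows the definition literally is shown below, assuming a hypothetical helper name cg_ratio and a sequence with at least one of the four bases; it is demonstrated on the first window from the problem statement rather than on the full genome:

```python
from collections import Counter

def cg_ratio(dna):
    """(c + g) / (a + c + t + g), ignoring any other characters."""
    counts = Counter(dna.lower())
    acgt = sum(counts[base] for base in 'acgt')
    return (counts['c'] + counts['g']) / acgt

print(cg_ratio('agcttttcat'))  # 2 c's and 1 g out of 10 bases -> 0.3
```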

Regular version


In [9]:
# Offset copies seq[0:], seq[1:], ..., seq[9:]; zipping them yields every length-10 window
windows = (seq[i:] for i in range(10))
sum(s.count('a') for s in zip(*windows))/(len(seq)-9)


Out[9]:
2.4618726865252802

A version using convolution


In [10]:
import numpy as np

s = np.where(np.array(list(seq)) == 'a', 1, 0)
kernel = np.ones(10)
counts = np.convolve(s, kernel, mode='valid')
counts.mean()


Out[10]:
2.4618726865252802
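
Both versions can be sanity-checked on a prefix short enough to count by hand. The sketch below is a third, pure-Python variant that keeps a running total as the window slides. The name window_counts is hypothetical, and the 12-letter prefix is the one implied by the three example windows in the problem statement:

```python
def window_counts(dna, base='a', width=10):
    """Count `base` in every length-`width` window using a running total."""
    if len(dna) < width:
        return []
    current = dna[:width].count(base)
    counts = [current]
    for i in range(width, len(dna)):
        # Add the character entering the window, drop the one leaving it
        current += (dna[i] == base) - (dna[i - width] == base)
        counts.append(current)
    return counts

prefix = 'agcttttcattc'
counts_demo = window_counts(prefix)
print(counts_demo)  # [2, 1, 1] for 'agcttttcat', 'gcttttcatt', 'cttttcattc'
print(sum(counts_demo) / len(counts_demo))
```

A sequence of length n has n - 9 windows of length 10, which is why the average divides by len(seq) - 9.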

In [11]:
import re

pattern = re.compile(r'gatt(.*?)aca')
for m in pattern.finditer(seq[:10000]):
    print(m.start(), m.groups(0))


42 ('aaaaaaagagtgtctgatagcagcttctgaactggttacctgccgtgagtaaattaaaattttattgacttaggtcactaaatactttaaccaatataggcatagcgc',)
485 ('gaaaaaaccattagcggccaggatgctttacccaatatcagcgatgccgaacgtatttttgccgaacttttgacgggactcgccgccgcccagccggggttcccgctggcgcaattgaaaactttcgtcgatcaggaatttgcccaaataaa',)
701 ('tgccgtggcgagaaaatgtcgatcgccattatggccggcgtattagaagcgcgcggtc',)
996 ('gttgcgagatttggacggacgttgacggggtctatacctgcgacccgcgtcaggtgcccgatgcgaggttgttgaagtcgatgtcctaccaggaagcgatggagctttcctacttcggcgctaaagttcttcacccccgcaccattacccccatcgcccagttccagatcccttgcctgattaaaaataccggaaatcctcaagcaccaggtacgctcattggtgccagccgtgatgaagacgaattaccggtcaagggcatttccaatctgaata',)
1379 ('acgcaatcatcttccgaat',)
1745 ('ggcgtcggtggcgttggcggtgcgctgctggagcaactgaagcgtcagcaaagctggctgaagaataa',)
2094 ('actaccatcagttgcgttatgcggcggaaaaatcgcggcgtaaattcctctatg',)
2170 ('accggttattgagaacctgcaaaatctgctcaatgcaggtgatgaattgatgaagttctccggcattctttctggttcgctttcttatatcttcggcaagttagacgaaggcatgagtttctccgaggcgaccacgctggcgcgggaaatgggttataccgaaccggacccgcgagatgatctttctggtatggatgtggcgcgtaaactattgattctcgctcgtgaaacgggacgtgaactggagctggcggatattgaaattgaacctgtgctgcccgcagagtttaacgccgagggtgatgttgccgcttttatggcgaatctgtc',)
2609 ('gccgaagtggatggtaatgatccgctgttcaaagtgaaaaatggcgaaaacgccctggccttctatagccactattatcagccgctgccgttggtactgcgcggatatggtgcgggcaatgacgtt',)
3336 ('aaagtctcgacggcagaagccagggctattttaccggcgcagtatcgccgccaggattgcattgcgcacgggcg',)
3880 ('ttgtcacccgcagtgcgaagatcctctcggcgtttattggtgatgaaatccc',)
4054 ('tcggcggtcgctttatggc',)
4285 ('tcgatgcctgtcaggcgctggtgaagcaggcgtttgatgatgaagaactgaaagtggcgctagggttaaactcggctaactcgatta',)
4398 ('tgctactactttgaagctgttgcgcagctgccgcaggagacgcgcaaccagctggttgtctcggtgccaagcggaaacttcggcgatttgacggcgggtctgctggcgaagtcactcggtctgccggtgaaacgttttattgctgcgaccaacgtgaacgataccgtgccacgtttcctgcacgacggtcagtggtcacccaaagcgactcaggcgacgttatccaacgcgatggacgtgagtcagccga',)
4896 ('ctcggtgaaacgttggatctgccaaaagagctggcagaacgtgctgatttacccttgctttc',)
4978 ('ttgctgcgttgcgtaaattgatgatgaatcatcagtaaaatctattcattatctcaatcaggccgggtttgcttttatgcagcccggcttttttatgaagaaattatggagaaaaatg',)
5143 ('aggattgcggagaata',)
5792 ('ttcaataatgaaacgactcatcagaccgcgtgctttcttagcgtagaagctgatgatcttaaatttgccgttcttctcatcgagga',)
5926 ('taaaatactcatctgacgccagattaatcacc',)
6526 ('actcctgcgaa',)
6836 ('gttttcggcataaatgtagttggcaacgatggagctgaaggcaa',)
7317 ('atcgccatcaacggg',)
7380 ('gccagcagagtaaag',)
8368 ('ccggaataccgtaagttgattgatgatgctgtcgcctgggcgaa',)
9670 ('cgcaa',)
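
The '?' in the pattern matters: '.*?' is non-greedy, so each match ends at the first 'aca' after 'gatt'. A small sketch on a made-up toy string (not taken from the genome) contrasts it with the greedy '.*', and uses m.group(1) to get the middle sequence as a plain string rather than a tuple:

```python
import re

# 'tt' + 'gatt' + 'cc' + 'aca' + 'g' + 'gatt' + 'aca'
toy = 'ttgattccacaggattaca'

lazy = re.compile(r'gatt(.*?)aca')
greedy = re.compile(r'gatt(.*)aca')

lazy_hits = [(m.start(), m.group(1)) for m in lazy.finditer(toy)]
greedy_hits = [(m.start(), m.group(1)) for m in greedy.finditer(toy)]

print(lazy_hits)    # [(2, 'cc'), (12, '')]
print(greedy_hits)  # [(2, 'ccacaggatt')]
```

The greedy pattern swallows the second 'gatt' into its middle group and reports a single long match, which is why the non-greedy form is the one used above.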

3. (25 points) Read in the text of Ulysses by James Joyce from the file 'Ulysses.txt'.

  • Find the 10 most frequently used words that begin with the letter 'u' in the full text using a generator to read in only one line at a time (this is essential when dealing with huge text files that may otherwise run out of memory).

    • A word cannot contain punctuation or the newline character '\n'
    • Ignore case - so ulysses and Ulysses are considered the same word

Note: punctuation is any character in string.punctuation from the string package


In [1]:
import string

# Build the punctuation-stripping table once instead of once per line
table = str.maketrans({c: None for c in string.punctuation})

counter = {}
with open('Ulysses.txt') as f:
    for line in f:  # the file object yields one line at a time
        line = line.strip()
        line = line.lower()
        line = line.translate(table)
        words = line.split()
        for word in words:
            counter[word] = counter.get(word, 0) + 1

In [2]:
n = 0
for word, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
    if word.startswith('u'):
        print(word, count)
        n += 1
    if n == 10:
        break


up 833
us 257
under 230
upon 133
used 79
use 49
understand 36
usual 32
umbrella 22
unless 21
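
The same task can also be phrased with an explicit generator function and collections.Counter. This is a sketch under assumptions: the names iter_words and top_words are hypothetical, and a short made-up sample stands in for Ulysses.txt so the block runs on its own:

```python
import io
import string
from collections import Counter

def iter_words(lines):
    """Yield lower-cased, punctuation-free words, one line at a time."""
    strip_punct = str.maketrans('', '', string.punctuation)
    for line in lines:
        yield from line.lower().translate(strip_punct).split()

def top_words(lines, prefix='u', n=10):
    """Return the n most common words that start with `prefix`."""
    counts = Counter(w for w in iter_words(lines) if w.startswith(prefix))
    return counts.most_common(n)

# Made-up sample text (illustrative only, not from the novel)
sample = io.StringIO("Up, up and away!\nUnder us: the umbrella.\nUs? Up!\n")
result = top_words(sample, n=3)
print(result)
```

Because iter_words only ever holds one line in memory, passing an open file object (for example top_words(f) inside a with open('Ulysses.txt') as f: block) would process a large file without loading it whole.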