In [1]:

    
%matplotlib inline

Working with text

One of the major reasons for using Python is its powerful built-in methods for working with text data, which is why Python is often the language of choice for data munging or wrangling. These exercises give you some familiarity with how to work with text data.

1. (25 points) A Caesar cipher is a very simple method of encoding and decoding data. The cipher replaces each letter with the letter $k$ places further along the alphabet. For example, if the offset is 3, we replace a with d, b with e, and so on. The cipher wraps around, so we replace y with b, z with c, etc. Punctuation, spaces and numbers are left unchanged. Note that we don't need a separate decode function: a negative offset reverses the encoding.

Write a function encode(s, k) where s is the string to be encoded and k is the offset. Check that you can encode

If you think Python is hell, try writing this function in R!

with offset 10 as

Sp iye drsxu Zidryx sc rovv, dbi gbsdsxq drsc pexmdsyx sx B!

and make sure you can recover the original string with offset -10.

Hint: Use the following

chr
ord
string.ascii_uppercase
string.ascii_lowercase
str.maketrans
str.translate
dictionaries
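
The hints above can be combined in several ways. The following is one minimal sketch, assuming that only the ASCII letters are shifted and that everything else passes through unchanged; it builds a translation table with str.maketrans and applies it with str.translate. It is not the only valid approach (a dictionary built with chr and ord works too).

```python
import string


def encode(s, k):
    """Shift each ASCII letter in s by k places, wrapping around the alphabet."""
    k %= 26  # a negative offset becomes the equivalent positive shift
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    # Map each alphabet onto a rotated copy of itself.
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    # Characters missing from the table (spaces, digits, punctuation)
    # are left unchanged by str.translate.
    return s.translate(table)


plain = "If you think Python is hell, try writing this function in R!"
secret = encode(plain, 10)
print(secret)
print(encode(secret, -10))
```

Reducing k modulo 26 first means the same slicing handles negative offsets and offsets larger than the alphabet.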



In [2]:

    
# Your solution here

2. (50 points)

Read the E. coli genomic DNA from the file ecoli.fas into a string variable containing only the sequence data, with no header information or line breaks. The string should start with agcttttca and be 4639675 characters long. (5 points)
Find the CG ratio, defined as (c+g)/(a+c+t+g). (10 points)
Find the average number of occurrences of the letter 'a' in shifting windows of length 10. The first 3 windows are ('agcttttcat', 'gcttttcatt', 'cttttcattc'). (15 points)
Use regular expressions to find all non-overlapping occurrences of the string 'gatt-aca' where the '-' means any number of letters - that is, each string found must begin with 'gatt' and end with 'aca' but it does not matter what is in the middle. For each such string found, print the middle don't-care sequence and the starting position of the string (i.e. position of the first letter g in the full sequence). Restrict the search to the first 10,000 bases in the DNA sequence. (20 points)
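
The four steps above could be approached along the following lines. This is a sketch under stated assumptions, demonstrated on short toy strings rather than the real genome: it assumes header lines in ecoli.fas start with '>' (standard FASTA), that the sequence is lower case, that reported positions are zero-based, and that the don't-care middle should be matched non-greedily (the exercise does not say which). The function names are illustrative.

```python
import re


def read_sequence(path):
    """Join all non-header lines of a FASTA file into one string."""
    with open(path) as handle:
        # Assumption: header lines start with '>'.
        return ''.join(line.strip() for line in handle
                       if not line.startswith('>'))


def cg_ratio(seq):
    """Return (c + g) / (a + c + t + g)."""
    counts = {base: seq.count(base) for base in 'acgt'}
    return (counts['c'] + counts['g']) / sum(counts.values())


def mean_a_per_window(seq, width=10):
    """Average count of 'a' over every window of the given width."""
    count = seq[:width].count('a')  # count in the first window
    total = count
    for i in range(width, len(seq)):
        # Slide one step: add the entering base, drop the leaving one.
        count += (seq[i] == 'a') - (seq[i - width] == 'a')
        total += count
    return total / (len(seq) - width + 1)


def find_gatt_aca(seq, limit=10_000):
    """Return (start, middle) for non-overlapping gatt...aca matches."""
    # Assumption: non-greedy middle, so each match ends at the first 'aca'.
    pattern = re.compile(r'gatt(.*?)aca')
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq[:limit])]


toy = 'agcttttcattc'  # first 12 bases implied by the three windows above
print(cg_ratio(toy))
print(mean_a_per_window(toy))
print(find_gatt_aca('ccgattcgacattgattaca'))
```

Updating a running count avoids recounting all 10 letters in each of the roughly 4.6 million windows. Switching the pattern to a greedy `(.*)` would instead return a single match running to the last 'aca' in the searched region.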



In [6]:

    
# Your solution here

3. (25 points) Read in the text of Ulysses by James Joyce from the file 'Ulysses.txt'.

Find the 10 most frequently used words that begin with the letter 'u' in the full text using a generator to read in only one line at a time (this is essential when dealing with huge text files that may otherwise run out of memory).
- A word cannot contain punctuation or the newline character '\n'
- Ignore case - so ulysses and Ulysses are considered the same word

Note: punctuation is any character in string.punctuation from the string module
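
One way to structure this is a generator that yields cleaned words one line at a time, feeding a collections.Counter. The sketch below is a possible approach, not a reference solution. It assumes that punctuation is deleted from within words (so "don't" becomes "dont") rather than used as a separator, which is one reading of the rule above; the names words and top_words are illustrative.

```python
import string
from collections import Counter


def words(lines):
    """Yield lower-cased, punctuation-free words, one line at a time."""
    # Assumption: punctuation characters are deleted, not treated as separators.
    strip_punct = str.maketrans('', '', string.punctuation)
    for line in lines:
        # split() with no arguments also discards the trailing '\n'.
        for word in line.translate(strip_punct).lower().split():
            yield word


def top_words(lines, letter='u', n=10):
    """Return the n most common words starting with the given letter."""
    counts = Counter(w for w in words(lines) if w.startswith(letter))
    return counts.most_common(n)


# A small in-memory stand-in for the file:
sample = ["Ulysses went up.\n", "Up, under; ulysses!\n"]
print(top_words(sample, n=2))

# With the real file, the open file object is itself a lazy line iterator:
# with open('Ulysses.txt') as f:
#     print(top_words(f))
```

Because an open file yields lines lazily, only one line of text is held in memory at a time; the Counter grows with the number of distinct words, not with the size of the file.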



In [12]:

    
# Your solution here