Day 3: opening, writing, and rearranging text files


Review exercises

  1. Store the string below as a variable and retrieve the last four characters from the string: 'this is a string'
  2. Overwrite the string stored in this variable with a different string: 'this is a modified string'
  3. Split the string 'this is a modified string' into a list (using spaces as your delimiter) and store it as a list under some variable name.
  4. Retrieve the second word (is) from your list using slice notation.
  5. Loop through every word in your list.
  6. Replace the third word in your list with the word 'my'.
  7. Loop through every word in your list again, and for each word, print the second character if it comes before 'm' in the alphabet.


Exercises 8, opening text files (30 minutes)

For the next several exercises, please download the following files and save them in the same folder as this notebook.

http://bioinfo.umassmed.edu/bootstrappers/bootstrappers-courses/python1/Python_I/yeast/Saccharomyces_cerevisiae.R64-1-1.78_transcripts.bed
http://bioinfo.umassmed.edu/bootstrappers/bootstrappers-courses/python1/Python_I/yeast/Saccharomyces_cerevisiae.R64-1-1.78_sample.gtf
http://bioinfo.umassmed.edu/bootstrappers/bootstrappers-courses/python1/Python_I/yeast/sacCer3.genome
http://bioinfo.umassmed.edu/bootstrappers/bootstrappers-courses/python1/Python_I/yeast/README.txt

  1. Try this sequence of commands to familiarize yourself with how Python file objects work:
  • x = open('sacCer3.genome') # x is now a 'file object' variable
  • print(x)
  • y=x.readline()
  • print(y)
  • print(x.readline())
  • print(x.readline())
  • print(x.readlines()) # note this is readlines and not readline
  • print(x.readline())
  • x.seek(0)
  • print(x.readline())
  • x.seek(8)
  • print(x.readline())
  • print(x.readline())




  2. Try experimenting with the seek method some more to figure out what it’s doing and how the readline method interacts with it. Can you figure out the numbers that rewind the file to the exact beginning of a line? (hint: look at the Python [documentation](https://docs.python.org/2.7/library/stdtypes.html#file-objects) on `file` to figure out what you can do with files)
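
One way to explore this is the `tell` method, which reports the file object's current position. The sketch below is a minimal illustration, assuming Python 3 and a small made-up three-line file (`demo.genome` and its contents are hypothetical stand-ins, not the real 'sacCer3.genome'). It records the position before each `readline` call and then uses `seek` to jump back to the start of the second line.

```python
import os
import tempfile

# Hypothetical three-line, tab-separated file standing in for 'sacCer3.genome'.
demo_lines = ["chrA\t100\n", "chrB\t2500\n", "chrC\t42\n"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.genome")
with open(demo_path, "w", newline="\n") as handle:
    handle.writelines(demo_lines)

line_starts = []
with open(demo_path) as x:
    while True:
        position = x.tell()       # where the next read will begin
        line = x.readline()
        if line == "":            # an empty string means end of file
            break
        line_starts.append(position)

    x.seek(line_starts[1])        # jump back to the start of the second line
    second_line = x.readline()

print(line_starts)
print(repr(second_line))
```

In Python 3 text mode, the values returned by `tell` are best treated as opaque numbers that are only meant to be handed back to `seek`; for a plain ASCII file like this one they typically match the character count from the start of the file.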


  3. Use the loop below to go through every line of the file 'sacCer3.genome':
{python}
x = open('sacCer3.genome')
for line in x:
    print('the current line of the file is: ' + line)

Notice that when you iterate over a file handle, Python essentially calls the file object's .readline() method again and again, storing each result in the loop variable until the file runs out of lines. Alternatively, try this implementation, which avoids creating a file object variable at all:

{python}
for line in open('sacCer3.genome'):
    print('the current line is: ' + line)
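
The equivalence described above can be checked directly. The following sketch, assuming Python 3 and a hypothetical two-line demo file written to a temporary folder, collects the lines once with a `for` loop and once with repeated `readline` calls, then compares the two results. The first version also uses a `with` block, which closes the file automatically when the block ends.

```python
import os
import tempfile

# Hypothetical demo file; the name and contents are made up for illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.genome")
with open(demo_path, "w") as handle:
    handle.write("chrA\t100\nchrB\t2500\n")

# Version 1: let the for loop fetch the lines.
looped = []
with open(demo_path) as x:
    for line in x:
        looped.append(line)

# Version 2: call readline() by hand until it returns an empty string.
manual = []
x = open(demo_path)
line = x.readline()
while line != "":
    manual.append(line)
    line = x.readline()
x.close()

print(looped == manual)
print(repr(looped[0]))
```

Each line keeps its trailing newline character, which is why printing the lines one by one leaves a blank line between them.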


  4. Modify the loop from above and use slice notation on the variable called `line` to print only the fifth, sixth, and seventh characters of each line from the text file.


  5. Modify the loop to split each line on the tab character (`'\t'`) and print only the second column (`column[1]`).
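
As a small illustration of splitting, the sketch below works on a single made-up line rather than the real file (the line content is an assumption about a two-column, tab-delimited layout). It shows that the last column keeps the newline character unless it is stripped first, and that every column is still a string.

```python
# A made-up line in a tab-delimited, two-column layout (hypothetical content).
line = "chrA\t100\n"

raw_columns = line.split("\t")             # the last item keeps the newline
columns = line.rstrip("\n").split("\t")    # strip the newline before splitting

print(raw_columns)
print(columns)

# Columns are strings, so convert them before doing arithmetic.
size = int(columns[1])
print(size + 1)
```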


Exercises 9, writing to files (30 minutes)

  1. The following code writes a few characters to a file. Run it, then open 'test_output.txt' in a text editor such as Notepad or TextEdit to verify that it worked.
{python}
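# open the file for writing ('w' creates it, or overwrites an existing file)
output_file = open('test_output.txt', 'w')
for letter in 'ACCGT':
    output_file.write(letter)  # write() does not add a newline
# close the file so that everything is written to disk
output_file.close()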
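
Note that `write` does not add a newline on its own, so the loop above produces the single line `ACCGT`. Below is a minimal sketch of writing one item per line, assuming Python 3 and a hypothetical output file (`letters.txt`) in a temporary folder. It reads the file back to confirm what was written.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "letters.txt")  # hypothetical file name

# 'with' closes the file automatically, which also flushes the written data.
with open(out_path, "w") as output_file:
    for letter in "ACCGT":
        output_file.write(letter + "\n")   # add the newline explicitly

# Read the file back to check its contents.
with open(out_path) as check:
    contents = check.read()

print(repr(contents))
```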


  2. Now let’s take the input from one file and output it to another like this:
{python}
x = open('Saccharomyces_cerevisiae.R64-1-1.78_sample.gtf')
y = open('test_output2.txt', 'w')
for line in x:
    y.write(line)  # each line already ends with its own newline
x.close()
y.close()

Afterward, open 'test_output2.txt' and check that it looks the same as 'Saccharomyces_cerevisiae.R64-1-1.78_sample.gtf'.



  3. The file 'Saccharomyces_cerevisiae.R64-1-1.78_transcripts.bed' contains transcript annotations for the yeast genome, ordered by genomic location. Write some code that reads the transcripts that lie within the genomic region chrIV:70640-1461829 and are on the Watson strand (the '+' strand). Split each line into its separate columns (the file has 6 tab-delimited columns) and store the resulting lists in one list. The result should have a structure like:
    `[[chromosome, start, end, ..., ..., ...], [chromosome, start, end, ..., ..., ...], ....]`
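
The filtering step can be sketched on a few made-up lines. The example below assumes Python 3, the usual six-column BED order (chromosome, start, end, name, score, strand), and that "within" means the whole transcript lies inside the region; the transcript names and coordinates are hypothetical and not taken from the real file.

```python
import io

# Made-up BED-like lines (chromosome, start, end, name, score, strand);
# the names and coordinates are hypothetical, not taken from the real file.
demo_bed = io.StringIO(
    "chrIV\t1000\t2000\ttxA\t0\t+\n"
    "chrIV\t80000\t81500\ttxB\t0\t+\n"
    "chrIV\t90000\t90600\ttxC\t0\t-\n"
    "chrV\t80000\t81000\ttxD\t0\t+\n"
)

region_chrom, region_start, region_end = "chrIV", 70640, 1461829

selected = []
for line in demo_bed:
    columns = line.rstrip("\n").split("\t")
    chrom, strand = columns[0], columns[5]
    start, end = int(columns[1]), int(columns[2])
    # Assumption: "within" means the whole transcript lies inside the region.
    inside = chrom == region_chrom and start >= region_start and end <= region_end
    if inside and strand == "+":          # '+' is the Watson strand
        selected.append(columns)

print(selected)
```

Only `txB` passes all three checks here: `txA` starts before the region, `txC` is on the other strand, and `txD` is on a different chromosome.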


  4. Calculate the average length of the transcripts in the list created in the previous question. Store the result in a variable.


  5. For the transcripts that are longer than average, write a file containing two tab-separated columns: the transcript name in the first column and its length in the second.
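
The averaging and writing steps could look roughly like the sketch below. It assumes Python 3, a hypothetical list in the `[[chromosome, start, end, name, score, strand], ...]` shape, and that a transcript's length is its end minus its start; the output file name is also made up.

```python
import os
import tempfile

# Hypothetical list in the [[chromosome, start, end, name, score, strand], ...] shape.
transcripts = [
    ["chrIV", "100", "400", "txA", "0", "+"],
    ["chrIV", "500", "1500", "txB", "0", "+"],
    ["chrIV", "2000", "2200", "txC", "0", "+"],
]

# Assumption: length is end minus start, as in BED-style coordinates.
lengths = [int(t[2]) - int(t[1]) for t in transcripts]
average_length = sum(lengths) / float(len(lengths))

out_path = os.path.join(tempfile.mkdtemp(), "long_transcripts.txt")  # hypothetical name
with open(out_path, "w") as out:
    for t, length in zip(transcripts, lengths):
        if length > average_length:
            # name, a tab, the length as text, and an explicit newline
            out.write(t[3] + "\t" + str(length) + "\n")

with open(out_path) as check:
    written = check.read()

print(average_length)
print(repr(written))
```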


  6. The file 'Saccharomyces_cerevisiae.R64-1-1.78_sample.gtf' contains an excerpt from the gene annotations for yeast, as downloaded from Ensembl. The format of the file is explained in 'README.txt'. Based on this file, try to create a file with five tab-separated columns: genomic location (chromosome, start, end), transcript name, and strand. Be sure to only output transcripts.
    Example output:
    `chrVI   194812   196314   YFR021W   +`
    `chrXVI  298571   299503   YPL134C   -`
    `...`
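
A possible starting point is sketched below on a few illustrative lines. It assumes Python 3 and the standard nine-column GTF layout (feature type in the third column, attributes in the ninth, with a `transcript_id` key), and it assumes the chromosome names need a "chr" prefix to match the example output; check 'README.txt' for the authoritative description. The helper `get_attribute` is a hypothetical name introduced here.

```python
import io

def get_attribute(attributes, key):
    """Return the value stored under `key` in a GTF attribute string, or None."""
    for field in attributes.split(";"):
        field = field.strip()
        if field.startswith(key + " "):
            return field[len(key):].strip().strip('"')
    return None

# Illustrative GTF-like lines; the coordinates and name follow the example
# output above, everything else is made up.
demo_gtf = io.StringIO(
    'VI\tensembl\tgene\t194812\t196314\t.\t+\t.\tgene_id "YFR021W";\n'
    'VI\tensembl\ttranscript\t194812\t196314\t.\t+\t.\t'
    'gene_id "YFR021W"; transcript_id "YFR021W";\n'
    'VI\tensembl\texon\t194812\t196314\t.\t+\t.\t'
    'gene_id "YFR021W"; transcript_id "YFR021W";\n'
)

rows = []
for line in demo_gtf:
    if line.startswith("#"):              # skip comment lines, if any
        continue
    columns = line.rstrip("\n").split("\t")
    if columns[2] != "transcript":        # keep transcript features only
        continue
    name = get_attribute(columns[8], "transcript_id")
    # Assumption: a "chr" prefix is needed to match the example output.
    rows.append("\t".join(["chr" + columns[0], columns[3], columns[4], name, columns[6]]))

print(rows)
```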




Exercises 10, import and genomic data (30 minutes)

  1. For this exercise we will be using the "ucscgenome" module, which most likely needs to be installed before it can be used. Open a __terminal__ (or a "__DOS-box__", or __cmd.exe__) and install "ucscgenome" with pip, using the following command:
    `pip install ucscgenome`
  2. The following code is an example of how 'ucscgenome' can be used once it is installed. Run it to ensure 'ucscgenome' was installed properly.
{python}
import ucscgenome
genome = ucscgenome.Genome("sacCer3")
sequence = genome["chrIV"]
print(sequence[100:110])

Output:

ACACCCACAC


  3. Use the file created in the previous exercise (9.6) to extract sequence information from the yeast genome. For each of the transcripts in the file that are on the Watson strand, obtain the sequence in a window of 50 bases around the transcription start site. Write the results to a file containing the name of the transcripts as well as the 50-base sequences.
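
The window arithmetic can be tried out on an ordinary string before involving the genome. The sketch below assumes Python 3, a made-up 200-base sequence in place of a chromosome, 1-based start coordinates, and that "around" means 25 bases upstream of the transcription start site plus 25 bases from the start site onward; all names and coordinates are hypothetical.

```python
# A made-up 200-base sequence standing in for a chromosome (hypothetical data).
sequence = "ACGT" * 50

# Hypothetical rows in the five-column layout: chromosome, start, end, name, strand.
transcripts = [
    ["chrDemo", "101", "180", "txA", "+"],
    ["chrDemo", "60", "150", "txB", "-"],
]

half_window = 25
results = []
for chrom, start, end, name, strand in transcripts:
    if strand != "+":
        continue                       # keep Watson-strand transcripts only
    # Assumption: 'start' is 1-based, so the start site is Python index start - 1.
    tss_index = int(start) - 1
    # Assumption: 25 bases upstream plus 25 bases from the start site onward.
    # max() stops the slice from wrapping around near the beginning of the
    # sequence; in that case the window comes out shorter than 50 bases.
    left = max(0, tss_index - half_window)
    window = sequence[left:tss_index + half_window]
    results.append(name + "\t" + window)

print(results)
```

With a real genome object, the same slice would be taken from the chromosome sequence instead of the demo string, provided it supports Python-style slicing as in the 'ucscgenome' example.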


The following code makes it possible to translate one set of characters into another set of characters in a string. How could you apply this to obtain a reverse complement sequence from a forward sequence?

{python}
import string
# string.maketrans exists in Python 2 only; in Python 3 use str.maketrans("aei", "qwe") instead
t = string.maketrans("aei", "qwe") # this creates a translation table `t`
print('this needs to be translated'.translate(t))

In [2]:
# example
import string
t = string.maketrans("aei", "qwe") # this create a 'translate string' `t`
print('this needs to be translated'.translate(t))


thes nwwds to bw trqnslqtwd
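
One possible answer is sketched below, assuming Python 3, where `maketrans` is a method of `str` rather than a function in the `string` module. The function name `reverse_complement` is a hypothetical choice: it swaps each base for its complement with a translation table and then reverses the string with a slice.

```python
# Assumption: Python 3, where maketrans is a method of str.
complement_table = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Complement every base, then reverse the string with a slice."""
    return sequence.translate(complement_table)[::-1]

print(reverse_complement("AAACCGT"))
```

Applying the function twice should return the original sequence, which makes a convenient sanity check.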
