An introduction to solving biological problems with Python

Session 2.4: Delimited files

Data formats

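Bioinformaticians love creating endless new file formats for their data, but there is one very common standard format that it is good to get used to parsing: the delimited text file, in which each line is a row and the columns are separated by a fixed character such as a space, a tab or a comma.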

Delimited file example:

X 169008682 1 111267453 1.0976
2 8265484 5 69763543 4.9825
MT 10924 MT 81934 7.2357
3 127 8 10908776 1.2509

Reading delimited files

We can use the various string manipulation techniques covered earlier to process delimited files in a fairly straightforward way. Here we loop through a file with columns delimited by spaces, reading the data for each row into a list, and storing each of these lists into a main results list.

To view an example of a delimited file, open a terminal window, go to the course directory, and print the content of the file using the cat command, or open it in your favourite editor:

cat data/mydata.txt
Index Organism Score
1 Human 1.076
2 Mouse 1.202
3 Frog 2.2362
4 Fly 0.9853

In [ ]:
results = []

with open("data/mydata.txt", "r") as data:
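    header = data.readline()  # read the header line first so it is not stored as data
    for line in data:
        # split() with no argument separates on whitespace and drops the newline
        results.append(line.split())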
        
print(results)
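
To see what split does to each row, it can help to try it on a single line. The sketch below uses hard-coded strings (an assumption, standing in for lines read from a file) and also shows one possible way to handle tab-delimited data, where an empty column should be kept rather than skipped.

```python
# A single line as it would be read from the file, including the trailing newline
line = "3 Frog 2.2362\n"

# With no argument, split() separates on any run of whitespace
# and discards the trailing newline
fields = line.split()
print(fields)

# For tab-delimited data, strip the newline first and then split on the tab
# explicitly, so that an empty column is kept as an empty string
tab_line = "3\tFrog\t\t2.2362\n"
tab_fields = tab_line.rstrip("\n").split("\t")
print(tab_fields)
```

Note that every field comes back as a string, so numeric columns still need converting with int or float.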

Here we show a slightly more complicated example that reads the results into a more convenient data structure: a list of dictionaries, where the keys correspond to the column headers and the values are taken from each line. We also convert each column to an appropriate type as we go.


In [ ]:
results = []

with open("data/mydata.txt", "r") as data:
    header = data.readline()
    for line in data:
        idx, org, score = line.split()
        row = {'Index': int(idx), 'Organism': org, 'Score': float(score)}
        results.append(row)
        
print(results)
print('Score of first row:', results[0]['Score'])
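
The example above types the column names by hand. As a possible variation, the keys could be taken from the header line itself. The sketch below is only an illustration: it uses io.StringIO as an in-memory stand-in for the open file (an assumption, so that it runs without the course data), and the name column_names is introduced here for the example.

```python
import io

# Stand-in for open("data/mydata.txt"): the same kind of content held in memory
data = io.StringIO("Index Organism Score\n1 Human 1.076\n2 Mouse 1.202\n")

results = []

# Use the header line to obtain the column names
column_names = data.readline().split()

for line in data:
    fields = line.split()
    # Pair each column name with the value in the same position
    row = dict(zip(column_names, fields))
    results.append(row)

print(results)
```

With this approach all the values remain strings, so the type conversion shown earlier would still have to be applied to the columns that need it.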

Writing delimited files

Writing out a delimited file is also straightforward using the join method. As an example, we will recreate our original file from above, but this time we will delimit the columns with commas.


In [ ]:
mydata = [{'Organism': 'Human', 'Index': 1, 'Score': 1.076}, 
          {'Organism': 'Mouse', 'Index': 2, 'Score': 1.202}, 
          {'Organism': 'Frog', 'Index': 3, 'Score': 2.2362}, 
          {'Organism': 'Fly', 'Index': 4, 'Score': 0.9853}]

with open('data/mydata.csv', 'w') as output:
    # write a header
    header = ",".join(['Index', 'Organism', 'Score'])
    output.write(header + "\n")
    for row in mydata:
        line = ",".join([str(row['Index']), row['Organism'], str(row['Score'])])
        output.write(line + "\n")

To view the output file, open a terminal window, go to the course directory, and print the content of the file using the cat command, or open it in your favourite editor:

cat data/mydata.csv
Index,Organism,Score
1,Human,1.076
2,Mouse,1.202
3,Frog,2.2362
4,Fly,0.9853
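
A quick way to check a file written like this is to read it back and compare it with the original data. The sketch below is a minimal round trip under some assumptions: it writes to a temporary directory (a hypothetical location, so that it does not depend on the course directory), uses a shortened version of mydata, and assumes that no field contains a comma.

```python
import os
import tempfile

mydata = [{'Organism': 'Human', 'Index': 1, 'Score': 1.076},
          {'Organism': 'Mouse', 'Index': 2, 'Score': 1.202}]

# Hypothetical temporary location for the output file
path = os.path.join(tempfile.mkdtemp(), "mydata.csv")

# Write the data as comma-delimited text, as in the example above
with open(path, "w") as output:
    output.write(",".join(['Index', 'Organism', 'Score']) + "\n")
    for row in mydata:
        line = ",".join([str(row['Index']), row['Organism'], str(row['Score'])])
        output.write(line + "\n")

# Read it back, splitting on the comma and converting the column types
readback = []
with open(path, "r") as data:
    header = data.readline()
    for line in data:
        idx, org, score = line.rstrip("\n").split(",")
        readback.append({'Index': int(idx), 'Organism': org, 'Score': float(score)})

print(readback == mydata)
```

Splitting on "," like this would break if a value itself contained a comma, so it is only suitable for simple data such as this.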

Exercises 2.4.1

Write a script that reads a tab-delimited file with 4 columns (gene, chromosome, start and end coordinates), computes each gene's length and stores it in a dictionary, and writes the results to a new tab-separated file. You can find a data file at data/genes.txt in the course materials.

Exercises 2.4.2

Read the lyrics of Imagine by John Lennon, 1971, from the file data/imagine.txt. Split the text into words. Print the total number of words and the number of distinct words. Calculate the frequency of each distinct word and store the result in a dictionary. Print each distinct word along with its frequency. Find the most frequent word longer than 3 characters in the song and print it with its frequency.

Exercises 2.4.3

Real life example

You have a tab-separated file, data/yeast_genes.txt, which contains information about all the yeast (S. cerevisiae) genes:

Systematic_name Standard_name Chromosome Start End
YBR127C VMA2 chrII 491269 492822
YBR128C ATG14 chrII 493081 494115
...

For every gene, its location and coordinates are recorded. You should read through the file and store the data into an appropriate structure. Then answer these questions:

  • How many genes are there in S.cerevisiae?
  • Which is the longest and which is the shortest gene?
  • How many genes per chromosome? Print the number of genes per chromosome.
  • For each chromosome, what is the longest and what is the shortest gene?
  • For each chromosome, how many genes on the Watson strand and how many genes on the Crick strand?

Bonus

  • Which chromosome has the highest gene density? You can calculate the length of each chromosome by assuming that it starts at 1 and ends at the end (if on the Watson strand) or at the start (if on the Crick strand) of its last gene. Then you can calculate the total length of all the genes on each chromosome and the ratio of coding to noncoding regions.

Congratulations! You have reached the end of day 2!