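Bioinformaticians love creating endless new file formats for their data, but there is one very common standard format that it is good to get used to parsing: the delimited text file, in which each line is one record and the columns are separated by a delimiter such as a space, a tab or a comma.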
Delimited file example:
X 169008682 1 111267453 1.0976
2 8265484 5 69763543 4.9825
MT 10924 MT 81934 7.2357
3 127 8 10908776 1.2509
We can use the various string manipulation techniques covered earlier to process delimited files in a fairly straightforward way. Here we loop through a file with columns delimited by spaces, reading the data for each row into a list, and storing each of these lists into a main results list.
To view an example of a delimited file, open a terminal window, go to the course directory, and print the content of the file using the cat command, or open it in your favourite editor:
cat data/mydata.txt
Index Organism Score
1 Human 1.076
2 Mouse 1.202
3 Frog 2.2362
4 Fly 0.9853
In [ ]:
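results = []
with open("data/mydata.txt", "r") as data:
    header = data.readline()  # read the header line so the loop starts at the first data row
    for line in data:
        # split() with no arguments splits on any run of whitespace
        results.append(line.split())
print(results)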
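The loop above relies on split() with no arguments, which works for space-delimited columns. As a minimal sketch of how the choice of delimiter changes the result, the lines below are typed in directly rather than read from a file; the tab-separated row reuses the yeast gene record shown later in this section, and the variable names are only illustrative.

```python
# Example lines typed in directly so the sketch runs without any data files.
space_line = "1 Human 1.076\n"
tab_line = "YBR127C\tVMA2\tchrII\t491269\t492822\n"

# With no argument, split() breaks on any run of whitespace and discards the newline.
space_fields = space_line.split()
print(space_fields)

# With an explicit delimiter, remove the trailing newline first, then split on tabs only.
tab_fields = tab_line.rstrip("\n").split("\t")
print(tab_fields)

# An explicit delimiter keeps empty columns; split() with no argument silently drops them.
with_empty = "a\t\tc".split("\t")
without_empty = "a\t\tc".split()
print(with_empty, without_empty)
```

Splitting on an explicit "\t" is usually the safer choice for tab-delimited files, because a missing value then stays in its column as an empty string instead of shifting the remaining fields to the left.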
Here we show a slightly more complicated example where we read the results into a more convenient data structure: a list of dictionaries. Each dictionary represents one line of the file, with the keys corresponding to the column headers and the values taken from that line. We also convert each column to an appropriate type as we go.
In [ ]:
results = []
with open("data/mydata.txt", "r") as data:
    header = data.readline()  # skip the header line
    for line in data:
        idx, org, score = line.split()
        # convert each column to an appropriate type
        row = {'Index': int(idx), 'Organism': org, 'Score': float(score)}
        results.append(row)
print(results)
print('Score of first row:', results[0]['Score'])
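The example above types the dictionary keys in by hand. As a possible variation, the keys can be taken from the header line itself. The sketch below is only one way to do it: it assumes the same columns as data/mydata.txt, uses an in-memory copy of the file content so that it runs without the course files, and introduces a converters dictionary that is purely illustrative.

```python
import io

# In-memory stand-in for data/mydata.txt so the sketch runs on its own.
text = (
    "Index Organism Score\n"
    "1 Human 1.076\n"
    "2 Mouse 1.202\n"
    "3 Frog 2.2362\n"
    "4 Fly 0.9853\n"
)

# Illustrative mapping from column name to the function that converts its values.
converters = {"Index": int, "Organism": str, "Score": float}

results = []
with io.StringIO(text) as data:
    # Use the header line to get the column names instead of discarding it.
    columns = data.readline().split()
    for line in data:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        # Pair each column name with its value and convert the value as we go.
        row = {name: converters[name](value) for name, value in zip(columns, fields)}
        results.append(row)

print(results[0])
print('Score of first row:', results[0]['Score'])
```

With a real file, the io.StringIO(text) call would be replaced by open("data/mydata.txt", "r"); the rest of the loop stays the same.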
Writing out a delimited file is also straightforward using the join method of strings. As an example, we will recreate our original file from above, but this time we will delimit the columns with commas.
In [ ]:
mydata = [{'Organism': 'Human', 'Index': 1, 'Score': 1.076},
          {'Organism': 'Mouse', 'Index': 2, 'Score': 1.202},
          {'Organism': 'Frog', 'Index': 3, 'Score': 2.2362},
          {'Organism': 'Fly', 'Index': 4, 'Score': 0.9853}]
with open('data/mydata.csv', 'w') as output:
    # write a header
    header = ",".join(['Index', 'Organism', 'Score'])
    output.write(header + "\n")
    for row in mydata:
        # join() only accepts strings, so convert the numbers first
        line = ",".join([str(row['Index']), row['Organism'], str(row['Score'])])
        output.write(line + "\n")
To view the output file, open a terminal window, go to the course directory, and print the content of the file using the cat command, or open it in your favourite editor:
cat data/mydata.csv
Index,Organism,Score
1,Human,1.076
2,Mouse,1.202
3,Frog,2.2362
4,Fly,0.9853
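As an alternative to joining strings by hand, Python's standard library includes a csv module that can write and read comma-delimited data. The sketch below is not part of the example above; it writes two of the same records to an in-memory buffer (standing in for the output file) and reads them back. The names buffer, csv_text and rows are only illustrative.

```python
import csv
import io

mydata = [{'Organism': 'Human', 'Index': 1, 'Score': 1.076},
          {'Organism': 'Mouse', 'Index': 2, 'Score': 1.202}]

# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
# fieldnames fixes the column order; lineterminator="\n" replaces the default "\r\n".
writer = csv.DictWriter(buffer, fieldnames=['Index', 'Organism', 'Score'],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(mydata)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back: DictReader takes the keys from the header line,
# and every value comes back as a string, so numbers still need converting.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])
print(float(rows[0]['Score']))
```

When writing to a real file with the csv module, the documentation recommends opening it with newline='' so that line endings are handled consistently across platforms.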
Write a script that reads a tab-delimited file with 4 columns (gene, chromosome, start and end coordinates), computes each gene's length and stores it in a dictionary, and writes the results to a new tab-separated file. You can find a data file at data/genes.txt in the course materials.
Read the lyrics of Imagine (John Lennon, 1971) from the file data/imagine.txt. Split the text into words. Print the total number of words and the number of distinct words. Calculate the frequency of each distinct word and store the result in a dictionary. Print each distinct word along with its frequency. Find the most frequent word longer than 3 characters in the song and print it with its frequency.
You have a tab-separated file, data/yeast_genes.txt, which contains information about all the yeast (S. cerevisiae) genes:
Systematic_name Standard_name Chromosome Start End
YBR127C VMA2 chrII 491269 492822
YBR128C ATG14 chrII 493081 494115
...
For every gene, its location and coordinates are recorded. You should read through the file and store the data into an appropriate structure. Then answer these questions:
bonus