1) Here is a function that counts the number of sequences in a FASTA file:
def count_seqs(input_file):
    count = 0
    for line in input_file:
        line = line.lstrip()  # strip leading spaces, if any
        if line.startswith('>'):
            count += 1
    return count
Put this function in a file seq_utils.py so that it can be imported as a module. For example:
import seq_utils
filename = 'python.fasta'
# the with block closes the file once counting is done
with open(filename) as input_file:
    seq_count = seq_utils.count_seqs(input_file)
print(seq_count, "in", filename)
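A quick way to check the function without a real FASTA file: count_seqs only iterates over its argument, so a list of strings can stand in for an open file. The sketch below repeats the function so that it runs on its own; the records in example_lines are made up for illustration.

```python
def count_seqs(input_file):
    # Same function as above, repeated so this snippet is self-contained.
    count = 0
    for line in input_file:
        line = line.lstrip()  # strip leading spaces, if any
        if line.startswith('>'):
            count += 1
    return count

# Hypothetical FASTA content: three header lines, one of them indented.
example_lines = [
    ">seq_a\n",
    "ACGT\n",
    ">seq_b\n",
    "GGCC\n",
    "TTAA\n",
    "  >seq_c\n",
    "ACGTACGT\n",
]

seq_count = count_seqs(example_lines)
print(seq_count)
```

Only lines that start with '>' (after leading whitespace is stripped) are counted, so the sequence lines themselves do not affect the result.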
2) Download the collection of sequences, sequences.tar.bz2, from the SANBI FTP site using:
cd ~/Documents/python
wget -c ftp://ftp.sanbi.ac.za/sequences.tar.bz2
This is a compressed tar file. Make a directory to store the data and unpack the file there:
mkdir data
cd data
tar xf ../sequences.tar.bz2
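The same unpacking step can also be done from Python with the standard tarfile module. This is only a sketch of an alternative to the tar command above; the function name unpack_archive is made up for illustration, and it should only be used on archives you trust.

```python
import os
import tarfile

def unpack_archive(archive_path, dest_dir):
    """Unpack a bzip2-compressed tar file into dest_dir.

    Returns the list of member names found in the archive.
    """
    # Create the destination directory if it does not exist yet.
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    # "r:bz2" opens the archive for reading with bzip2 decompression.
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

Assuming the download from the previous step, a call such as unpack_archive('sequences.tar.bz2', 'data'), run from ~/Documents/python, would correspond to the mkdir, cd and tar commands above.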
3) Write a script count_all_seqs.py that takes a directory as an argument and counts the number of sequences in each file in that directory whose name ends in .fasta. For example:
count_all_seqs.py ~/Documents/python/data
seq1.fasta 83
seq2.fasta 83
seq3.fasta 83
seq4.fasta 83
seq5.fasta 83
seq6.fasta 85
This script should use the seq_utils module created in problem 1.
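As a starting point for problem 3, here is a sketch of one piece of the script: finding the files in a directory whose names end in .fasta. The helper name fasta_files is an assumption made for illustration, and this is only one possible approach.

```python
import os

def fasta_files(directory):
    """Return the sorted names of the regular files in directory
    whose names end in .fasta."""
    names = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Skip subdirectories and files with other extensions.
        if name.endswith('.fasta') and os.path.isfile(path):
            names.append(name)
    return sorted(names)
```

In count_all_seqs.py the directory would come from sys.argv[1]. Each name returned can be joined to the directory with os.path.join, opened, and passed to seq_utils.count_seqs to produce one line of output per file.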
4) Follow the instructions at https://github.com/pvanheus/counting_sequences.