1) Here is a function that counts the number of sequences in a FASTA file:

def count_seqs(input_file):
    count = 0
    for line in input_file:
        line = line.lstrip() # strip leading spaces, if any
        if line.startswith('>'):
            count += 1
    return count
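
As a quick check before moving the function into a module, it can be run on a small in-memory FASTA record set. This is a minimal sketch: the sample data and the names sample and result are made up for illustration, and io.StringIO stands in for an open file. It works because count_seqs only iterates over lines, so any iterable of lines will do.

```python
import io


def count_seqs(input_file):
    count = 0
    for line in input_file:
        line = line.lstrip()  # strip leading spaces, if any
        if line.startswith('>'):
            count += 1
    return count


# Hypothetical sample: two records, the second with a two-line sequence.
sample = io.StringIO(u">seq1\nACGT\n>seq2\nGGCC\nTTAA\n")
result = count_seqs(sample)
print(result)  # one count per '>' header line
```

Only header lines (those starting with '>') are counted, so the number of sequence lines per record does not affect the result.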

Put this function in a file named seq_utils.py so that it can be imported as a module. For example:

import seq_utils

filename = 'python.fasta'
with open(filename) as input_file:  # the file is closed when the block ends
    seq_count = seq_utils.count_seqs(input_file)
print(seq_count, "in", filename)  # print() is a function in Python 3

2) Download the collection of sequences sequences.tar.bz2 from the SANBI FTP site using:

cd ~/Documents/python
wget -c ftp://ftp.sanbi.ac.za/sequences.tar.bz2

This is a compressed tar file. Make a directory to store this data and unpack the file there:

mkdir data
cd data
tar xf ../sequences.tar.bz2

3) Write a script count_all_seqs.py that takes a directory as an argument and prints the number of sequences in each file in that directory whose name ends in .fasta. For example:

count_all_seqs.py ~/Documents/python/data
seq1.fasta 83
seq2.fasta 83
seq3.fasta 83
seq4.fasta 83
seq5.fasta 83
seq6.fasta 85

This script should use the seq_utils module created in problem 1.
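
One possible outline for the directory-handling part is sketched below, assuming Python 3. The helper name fasta_counts and its count_func parameter are hypothetical, chosen only for illustration. The counting function is passed in so that the sketch stands alone; in the actual script it would be seq_utils.count_seqs.

```python
import os


def fasta_counts(directory, count_func):
    """Return (filename, count) pairs for each .fasta file in directory.

    count_func is any function that takes an open file and returns a
    number, e.g. seq_utils.count_seqs from problem 1.
    """
    results = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.fasta'):
            continue  # skip files with other extensions
        path = os.path.join(directory, name)
        with open(path) as input_file:
            results.append((name, count_func(input_file)))
    return results


# In count_all_seqs.py this could be wired up roughly as follows
# (an assumption about the script layout, not a required design):
#   import sys
#   import seq_utils
#   for name, count in fasta_counts(sys.argv[1], seq_utils.count_seqs):
#       print(name, count)
```

Sorting the directory listing keeps the output order stable, since os.listdir returns names in arbitrary order.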

4) Follow the instructions at https://github.com/pvanheus/counting_sequences.

