Reading high-throughput sequencing files

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGG
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIII
  • @ line: basic read information
  • sequence line
  • + metadata line
  • quality score line

Big files


In [ ]:
f = open("files/simple-file.txt")
for l in f.readlines():
    print(l,end="")
f.close()

Problem

f.readlines is powerful but you just loaded the whole contents of the file into memory.

If the file is 10GB, you may have just crashed your computer!

Solution: turn the reader into an iterator


In [ ]:
with open("files/simple-file.txt") as f:
    for l in f:
        print(l.strip())
with open("files/simple-file.txt") as f:
    for l in f:
        do_something(l)

This bizarro syntax goes through the file line-by-line, wiping out each old line with the new one.

What if you have a compressed input file?


In [ ]:
with open("files/simple-file.txt.gz") as f:
    for l in f:
        print(l.strip())

Use the gzip module


In [ ]:
import gzip

with gzip.open("files/simple-file.txt.gz") as f:
    for l in f:
        l_ascii = l.decode("ascii")
        print(l_ascii.strip())
import gzip

with gzip.open("files/simple-file.txt.gz") as f:
    for l in f:
        l_ascii = l.decode("ascii")
        print(l_ascii.strip())
  • gzip.open is just like open except it takes a .gz file
  • l.decode("ascii") species that the binary blob that just came out of the file is normal text
  • What sequence occurs most often the in the file files/example.fastq.gz?
  • Bonus: create a histogram of read quality. The last line gives the PHRED score--Probability of error--for each nucleotide. It ranges from 1.00000 (encoded as the letter 0) to 0.00006 (encoded by the letter K). Sum up the score for all bases in the alignment. Table is here: (http://www.drive5.com/usearch/manual/qscores.gif)
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGG
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIII

In [ ]:


In [ ]: