This notebook explores parsing and understanding the SAM (Sequence Alignment/Map) and related BAM format. SAM is an extremely common format for representing read alignments. Most widely-used read alignment tools output SAM alignments.
SAM is a text format. There is a closely related binary format called BAM. There are two types of BAM files: unsorted or sorted. Various tools, notably SAMtools and Picard, can convert back and forth between SAM and BAM, and can sort an existing BAM file. When we say a BAM file is sorted we almost always mean that the alignments are sorted left-to-right along the reference genome. (It's also possible to sort a BAM file by read name, though that's only occassionally useful.)
Once you have an interesting set of read alignments that you would like to keep for a while and perhaps analyze further, it's a good idea to keep them in sorted BAM files. This is because:
That said, most alignment tools output SAM (not BAM), and the alignments come out in an arbitrary order -- not sorted.
An authoritative and complete document describing the SAM and BAM formats is the SAM specification. This document is thorough, but, being a specification, it does not describe is the various "dialects" of legal SAM emitted by popular tools. I'll cover some of that here.
In [1]:
# Here's a string representing a three-line SAM file. I'm temporarily
# ignoring the fact that SAM files usually have several header lines at
# the beginning.
samStr = '''\
r1 0 gi|9626243|ref|NC_001416.1| 18401 42 122M * 0 0 TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG +"@6<:27(F&5)9)"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C> AS:i:-5 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:59G13G21G26 YT:Z:UU
r2 0 gi|9626243|ref|NC_001416.1| 8886 42 275M * 0 0 NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC (#!!'+!$""%+(+)'%)%!+!(&++)''"#"#&#"!'!("%'""("+&%$%*%%#$%#%#!)*'(#")(($&$'&%+&#%*)*#*%*')(%+!%%*"$%"#+)$&&+)&)*+!"*)!*!("&&"*#+"&"'(%)*("'!$*!!%$&&&$!!&&"(*"$&"#&!$%'%"#)$#+%*+)!&*)+(""#!)!%*#"*)*')&")($+*%%)!*)!('(%""+%"$##"#+(('!*(($*'!"*('"+)&%#&$+('**$$&+*&!#%)')'(+(!%+ AS:i:-14 XN:i:0 XM:i:8 XO:i:0 XG:i:0 NM:i:8 MD:Z:0A0C0G0A108C23G9T81T46 YT:Z:UU
r3 16 gi|9626243|ref|NC_001416.1| 11599 42 338M * 0 0 GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT 7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04C<E,*AD%G0.%$+A:'H;?8<72:88?E6((CF)6DF#.)=>B>D-="C'B080E'5BH"77':"@70#4%A5=6.2/1>;9"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!"C-#,6@%<+<D(AG-).?�.00'@)/F8?B!&"170,)>:?<A7#1(A@0E#&A.*DC.E")AH"+.,5,2>5"2?:G,F"D0B8D-6$65D<D!A/38860.*4;4B<*31?6 AS:i:-22 XN:i:0 XM:i:8 XO:i:0 XG:i:0 NM:i:8 MD:Z:80C4C16A52T23G30A8T76A41 YT:Z:UU'''
In [2]:
# I'll read this string in line-by-line as though it were a file.
# I'll (lightly) parse the alignment records as I go.
import string
from io import StringIO # reading from string rather than file
for ln in StringIO(samStr):
qname, flag, rname, pos, mapq, cigar, rnext, \
pnext, tlen, seq, qual, extras = str.split(ln, '\t', 11)
print(qname, len(seq)) # print read name, length of read sequence
As you can see from this last example, each SAM record is on a separate line, and it consists of several tab-delimited fields. The SAM specification is the authoritative source for information on what all the fields mean exactly, but here's a brief summary of each of the fields (in order):
qname
is the name of the readflags
is a bit field encoding some yes/no pieces of information about whether and how the read alignedrname
is the name of the reference sequence that the read aligned to (if applicable). E.g., might be "chr17" meaning "chromosome 17"pos
is the 1-based offset into the reference sequence where the read aligned.mapq
for an aligned read, this is a confidence value; high when we're very confident we've found the correct alignment, low when we're not confidentcigar
indicates where any gaps occur in the alignmentrnext
only relevant for paired-end reads; name of the reference sequence where other end alignedpnext
only relevant for paired-end reads; 1-based offset into the reference sequence where other end alignedtlen
only relevant for paired-end reads; fragment length inferred from alignmentseq
read sequence. If read aligned to reference genome's reverse-complement, this is the reverse complement of the read sequence.qual
quality sequence. If read aligned to reference genome's reverse-complement, this is the reverse of the quality sequence.extras
tab-separated "extra" fields, usually optional and aligner-specific but often very important!Field 1, qname
is the name of the read. Read names often contain information about:
Field 5, mapq
, encodes the probability p that the alignment reported is incorrect. The probability is encoded as an integer Q on the Phred scale:
Fields 7, 8 and 9 (rnext
, pnext
and tlen
) are only relevant if the read is part of a pair. By way of background, sequencers can be configured to report pairs of DNA snippets that appear close to each other in the genome. To accomplish this, the sequencer sequences both ends of a longer fragment of DNA. When this is the case rnext
and pnext
tell us where the other end of the pair aligned, and tlen
tells us the length of the fragment, as inferred from the alignments of the two ends.
Field 10, seq
, is the nucleotide sequence of the read. The nucleotide sequence is reverse complemented if the read aligned to the reverse complement of the reference genome. (This is equivalent to the reverse complement of the read aligning to the genome.) seq
can contain the character "N
". N
essentially means "no confidence." The sequencer knows there's a nucleotide there but doesn't know whether it's an A, C, G or T.
Field 11, qual
, is the quality sequence of the read. Each nucleotide in seq
has a corresponding quality value in this string. A nucleotide's quality value encodes the probability that the nucleotide was incorrectly called by the sequencing instrument and its software. For details on this encoding, see the FASTQ notebook.
The flags
field is a bitfield. Individual bits correspond to certain yes/no properties of the alignment. Here are the most relevant ones:
There are a few more that are used less often; see the [SAM specification] for details.
How do we know how good an alignment is, i.e., how well the read sequence matches the corresponding referene sequence. This information is spread across a few places:
AS:i
extra fieldcigar
fieldMD:Z
extra fieldWhile the AS:i
extra field is not required by the specification, all the most popular tools output it. The integer appearing in this field is an alignment score. The higher the score, the more similar the read sequence is to the reference sequence.
Different tools use differen scales for AS:i
. Sometimes (e.g. in Bowtie 2's --end-to-end
alignment mode)
We would like to know exactly how the read aligned to the reference, including where all the mismatches and gaps are, and which characters appear opposite the gaps. This is not possible just by looking at the read sequence together with the CIGAR string. That doesn't tell us what reference characters appear in the mismatched position, or in the positions involved in deletions. Instead, we combine information from the (a) read sequence, (b) CIGAR string, and (c) MD:Z
string. The CIGAR and MD:Z
strings are both described in the SAM specification.
First we construct a function to parse the CIGAR field:
In [3]:
def cigarToList(cigar):
''' Parse CIGAR string into a list of CIGAR operations. For more
info on CIGAR operations, see SAM spec:
http://samtools.sourceforge.net/SAMv1.pdf '''
ret, i = [], 0
op_map = {'M':0, # match or mismatch
'=':0, # match
'X':0, # mismatch
'I':1, # insertion in read w/r/t reference
'D':2, # deletion in read w/r/t reference
'N':3, # long gap due e.g. to splice junction
'S':4, # soft clipping due e.g. to local alignment
'H':5, # hard clipping
'P':6} # padding
# Seems like = and X together are strictly more expressive than M.
# Why not just have = and X and get rid of M? Space efficiency,
# mainly. The titans discuss: http://www.biostars.org/p/17043/
while i < len(cigar):
run = 0
while i < len(cigar) and cigar[i].isdigit():
# parse one more digit of run length
run *= 10
run += int(cigar[i])
i += 1
assert i < len(cigar)
# parse cigar operation
op = cigar[i]
i += 1
assert op in op_map
# append to result
ret.append([op_map[op], run])
return ret
In [4]:
cigarToList('10=1X10=')
Out[4]:
Next we construct a function to parse the MD:Z
extra field:
In [5]:
def mdzToList(md):
''' Parse MD:Z string into a list of operations, where 0=match,
1=read gap, 2=mismatch. '''
i = 0;
ret = [] # list of (op, run, str) tuples
while i < len(md):
if md[i].isdigit(): # stretch of matches
run = 0
while i < len(md) and md[i].isdigit():
run *= 10
run += int(md[i])
i += 1 # skip over digit
if run > 0:
ret.append([0, run, ""])
elif md[i].isalpha(): # stretch of mismatches
mmstr = ""
while i < len(md) and md[i].isalpha():
mmstr += md[i]
i += 1
assert len(mmstr) > 0
ret.append([1, len(mmstr), mmstr])
elif md[i] == "^": # read gap
i += 1 # skip over ^
refstr = ""
while i < len(md) and md[i].isalpha():
refstr += md[i]
i += 1 # skip over inserted character
assert len(refstr) > 0
ret.append([2, len(refstr), refstr])
else:
raise RuntimeError('Unexpected character in MD:Z: "%d"' % md[i])
return ret
In [6]:
# Each element in the list returned by this call is itself a list w/ 3
# elements. Element 1 is the MD:Z operation (0=match, 1=mismatch,
# 2=deletion). Element 2 is the length and element 3 is the relevant
# sequence of nucleotides from the reference.
mdzToList('10A5^AC6')
Out[6]:
Now we can write a fucntion that takes a read sequennce, a parsed CIGAR string, and a parse MD:Z
string and combines information from all three to make what I call a "stacked alignment."
In [7]:
def cigarMdzToStacked(seq, cgp, mdp_orig):
''' Takes parsed CIGAR and parsed MD:Z, generates a stacked alignment:
a pair of strings with gap characters inserted (possibly) and where
characters at at the same offsets are opposite each other in the
alignment. Only knows how to handle CIGAR ops M=XDINSH right now.
'''
mdp = mdp_orig[:]
rds, rfs = [], []
mdo, rdoff = 0, 0
for c in cgp:
op, run = c
skipping = (op == 4 or op == 5)
assert skipping or mdo < len(mdp)
if op == 0: # CIGAR op M, = or X
# Look for block matches and mismatches in MD:Z string
mdrun = 0
runleft = run
while runleft > 0 and mdo < len(mdp):
op_m, run_m, st_m = mdp[mdo]
run_comb = min(runleft, run_m)
runleft -= run_comb
assert op_m == 0 or op_m == 1
rds.append(seq[rdoff:rdoff + run_comb])
if op_m == 0: # match from MD:Z string
rfs.append(seq[rdoff:rdoff + run_comb])
else: # mismatch from MD:Z string
assert len(st_m) == run_comb
rfs.append(st_m)
mdrun += run_comb
rdoff += run_comb
# Stretch of matches in MD:Z could span M and I CIGAR ops
if run_comb < run_m:
assert op_m == 0
mdp[mdo][1] -= run_comb
else:
mdo += 1
elif op == 1: # CIGAR op I
rds.append(seq[rdoff:rdoff + run])
rfs.append("-" * run)
rdoff += run
elif op == 2: # D
op_m, run_m, st_m = mdp[mdo]
assert op_m == 2
assert run == run_m
assert len(st_m) == run
mdo += 1
rds.append("-" * run)
rfs.append(st_m)
elif op == 3: # N
rds.append("-" * run)
rfs.append("-" * run)
elif op == 4: # S
rds.append(seq[rdoff:rdoff + run].lower())
rfs.append(' ' * run)
rdoff += run
elif op == 5: # H
rds.append('!' * run)
rfs.append(' ' * run)
elif op == 6: # P
raise RuntimeError("Don't know how to handle P in CIGAR")
else:
raise RuntimeError('Unexpected CIGAR op: %d' % op)
assert mdo == len(mdp)
return ''.join(rds), ''.join(rfs)
In [8]:
# Following example includes gaps and mismatches
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGATAAACC', cigarToList('12M2D17M2I14M'), mdzToList('12^AT30G0'))
Out[8]:
In [9]:
# Following example also includes soft clipping (CIGAR: S)
# SAM spec: Soft clipping: "clipped sequences present in SEQ"
# We print them in lowercase to emphasize their clippedness
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S'), mdzToList('12^AT25'))
Out[9]:
In [10]:
# Following example also includes hard clipping (CIGAR: H)
# SAM spec: Hard clipping: "clipped sequences NOT present in SEQ"
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S3H'), mdzToList('12^AT25'))
# Note: don't see hard clipping in practice much
Out[10]:
In [11]:
# Following example also includes skipping (CIGAR: N), as seen in
# TopHat alignments
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',
cigarToList('12M2D10M10N7M2I8M6S3H'),
mdzToList('12^AT25'))
Out[11]:
From the stacked alignment, it's easy to do other things. E.g. we can turn a stacked alignment into a new CIGAR string that uses the =
and X
operations instead of the less specific M
operation:
In [12]:
def cigarize(rds, rfs):
off = 0
oplist = []
lastc, cnt = '', 0
for i in range(len(rds)):
c = None
if rfs[i] == ' ':
c = 'S'
elif rds[i] == '-' and rfs[i] == '-':
c = 'N'
elif rds[i] == '-':
c = 'D'
elif rfs[i] == '-':
c = 'I'
elif rds[i] != rfs[i]:
c = 'X'
else:
c = '='
if c == lastc:
cnt += 1
else:
if len(lastc) > 0:
oplist.append((lastc, cnt))
lastc, cnt = c, 1
if len(lastc) > 0:
oplist.append((lastc, cnt))
return ''.join(map(lambda x: str(x[1]) + x[0], oplist))
In [13]:
x, y = cigarMdzToStacked('ACGTACGT', cigarToList('8M'), mdzToList('4G3'))
In [14]:
cigarize(x, y)
Out[14]:
In [15]:
x, y = cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',
cigarToList('12M2D10M10N7M2I8M6S3H'),
mdzToList('12^AT25'))
In [16]:
cigarize(x, y)
Out[16]:
© Copyright Ben Langmead 2014