In [1]:
cat ../data/rose.fa
In [2]:
head ../data/contigs.fasta
In [3]:
zcat ../data/BJ-HSR1_R1.fastq.gz | head
$Q_{phred} = -10 \log_{10} e$

where:

- $e$ - probability of a base being called wrong

How to encode it to text? Take the ASCII character whose code is:

$Q_{phred} + 33$
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
0........................26...31.......40
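The encoding above can be sketched in a few lines of Python. This is a minimal illustration, assuming the Phred+33 offset described here; the function names and the example quality string are made up for this sketch.

```python
import math

PHRED_OFFSET = 33  # Phred+33 encoding, as described above


def char_to_phred(ch):
    """Convert one quality character to its Phred score."""
    return ord(ch) - PHRED_OFFSET


def phred_to_error(q):
    """Convert a Phred score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_to_char(e):
    """Encode an error probability as a single Phred+33 character."""
    q = round(-10 * math.log10(e))
    return chr(q + PHRED_OFFSET)


# Hypothetical quality string, not taken from the data files used here.
quality_string = "AAAAA#EE"
scores = [char_to_phred(c) for c in quality_string]
errors = [phred_to_error(q) for q in scores]
print(scores)
print(errors)
```

A score of 30 corresponds to an error probability of 1 in 1000, and a score of 40 to 1 in 10000.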
`@NS500159:12:H2FJ5AFXX:1:11101:12552:1058 1:N:0:1`:

- `@NS500159` - machine id
- `12` - run number
- `H2FJ5AFXX` - flowcell id
- `1` - lane
- `11101` - tile number
- `12552:1058` - x and y coordinates
- `1` - read 1 or 2 (for paired ends)
- `N` - filtered (Y) or not (N)
- `0` - always 0 for HiSeq and NextSeq
- `1` - sample number from the sample sheet

In a sam file, header lines start with `@` and contain metadata: reference sequence names, lengths, aligner, etc.
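As a rough sketch of how sam header metadata can be read, the snippet below splits tab-separated header lines into a record type and `TAG:VALUE` pairs. The example lines are invented for illustration, and the function name is hypothetical; free-text comment lines (`@CO`) would need separate handling.

```python
def parse_sam_header_line(line):
    """Split a header line into its record type and a dict of TAG:VALUE fields."""
    record_type, *fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields:
        # Split on the first colon only, since values may contain colons.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return record_type.lstrip("@"), tags


# Made-up header lines for illustration only.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:2000",
    "@PG\tID:aligner\tPN:aligner",
]

# Collect reference sequence names and lengths from the @SQ lines.
reference_lengths = {}
for line in header:
    kind, tags = parse_sam_header_line(line)
    if kind == "SQ":
        reference_lengths[tags["SN"]] = int(tags["LN"])

print(reference_lengths)
```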
Each alignment record contains 11 mandatory fields:
- `QNAME` - query template name (think header from fastq file)
- `FLAG` - bitwise flag (more on it in a moment)
- `RNAME` - reference sequence name (e.g. `chr1`)
- `POS` - 1-based left-most mapping position
- `MAPQ` - mapping quality (think uniqueness of the mapping)
- `CIGAR` - details of the mapping (match/mismatch/indel/clipping etc.)
- `RNEXT` - reference sequence name for the pair (mate)
- `PNEXT` - mapping position for the pair (mate)
- `TLEN` - template (query) length
- `SEQ` - (aligned) segment sequence (not necessarily entire query sequence)
- `QUAL` - quality, as in fastq
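The eleven mandatory fields map naturally onto a small record type. The following is a minimal sketch, assuming tab-separated input; the class and function names and the example record are invented for illustration, and optional fields after the eleventh are simply ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_record(line):
    """Parse the 11 mandatory fields of one alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# Made-up alignment line, with one optional field at the end.
example = "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_record(example)
print(record.qname, record.rname, record.pos, record.cigar)
```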
`FLAG` field

This is possibly the most important field in practical terms.
| Decimal | Hex | Meaning |
|---|---|---|
| 1 | 0x1 | template having multiple segments in sequencing |
| 2 | 0x2 | each segment properly aligned according to the aligner |
| 4 | 0x4 | segment unmapped |
| 8 | 0x8 | next segment in the template unmapped |
| 16 | 0x10 | SEQ being reverse complemented |
| 32 | 0x20 | SEQ of the next segment in the template being reverse complemented |
| 64 | 0x40 | the first segment in the template |
| 128 | 0x80 | the last segment in the template |
| 256 | 0x100 | secondary alignment |
| 512 | 0x200 | not passing filters, such as platform/vendor quality controls |
| 1024 | 0x400 | PCR or optical duplicate |
| 2048 | 0x800 | supplementary alignment |

The bam format is the same as sam, but compressed (binary) and therefore not directly human-readable. Because of the compression efficiency, it is the preferred way of storing alignment data.
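Returning to the `FLAG` field: because each bit has its own meaning, a flag value is decoded by testing the bits one at a time. A minimal sketch, using the bit values and descriptions listed above (the function name is made up):

```python
FLAG_BITS = {
    0x1: "template having multiple segments in sequencing",
    0x2: "each segment properly aligned according to the aligner",
    0x4: "segment unmapped",
    0x8: "next segment in the template unmapped",
    0x10: "SEQ being reverse complemented",
    0x20: "SEQ of the next segment in the template being reverse complemented",
    0x40: "the first segment in the template",
    0x80: "the last segment in the template",
    0x100: "secondary alignment",
    0x200: "not passing filters, such as platform/vendor quality controls",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits set in a FLAG value."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
for description in decode_flag(99):
    print(description)
```

Testing a single property works the same way, for example `flag & 0x4` is non-zero for an unmapped segment.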
You don't usually work with these directly; rather, they are produced as intermediate results that get processed further to yield biologically relevant insights.
These are the result of any alignment to a reference that you perform.
- `pileup` - tab delimited; records contain aggregate alignment data per reference position.
    - `.` match on the forward strand
    - `,` match on the reverse strand
    - `ACTGN` mismatch on forward strand
    - `actgn` mismatch on reverse strand
    - `+|-[0-9]ACTGNactgn` insertion | deletion
    - `^` start of the read segment
    - `$` end of the read segment
- `gff` (and its variant `gtf`) - general feature format; tab-delimited plain text
- `bed` - generic position format
- `vcf` - variant call format
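For the pileup symbols listed above, the read-bases column can be summarised with a small scanner. This is a sketch under assumptions that go beyond the list above: that `^` is followed by exactly one mapping-quality character, and that an indel is written as a sign, a length, and that many bases, as in samtools output. Any other characters are skipped, and the function name and example string are invented.

```python
import re

INDEL = re.compile(r"[+-](\d+)")  # sign followed by the indel length


def summarise_pileup_bases(bases):
    """Count the symbol classes in one pileup read-bases string."""
    counts = {"match_fwd": 0, "match_rev": 0, "mismatch_fwd": 0,
              "mismatch_rev": 0, "insertion": 0, "deletion": 0,
              "read_start": 0, "read_end": 0}
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == ".":
            counts["match_fwd"] += 1
        elif ch == ",":
            counts["match_rev"] += 1
        elif ch in "ACGTN":
            counts["mismatch_fwd"] += 1
        elif ch in "acgtn":
            counts["mismatch_rev"] += 1
        elif ch in "+-":
            m = INDEL.match(bases, i)
            if m is None:
                raise ValueError("indel sign without a length at position %d" % i)
            counts["insertion" if ch == "+" else "deletion"] += 1
            # Skip the length digits and the inserted or deleted bases.
            i = m.end() + int(m.group(1))
            continue
        elif ch == "^":
            counts["read_start"] += 1
            # Assumption: the next character encodes mapping quality, not a base.
            i += 2
            continue
        elif ch == "$":
            counts["read_end"] += 1
        i += 1
    return counts


# Made-up read-bases string for illustration.
print(summarise_pileup_bases("..,A^I.$+2AG,"))
```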