A line beginning with a dollar sign ($
) indicates the command line. It is meant to orient you that the command is to be performed in the terminal.
$
This is not to be confused with something like $HOME
, which means a variable named "HOME" that has the value (on my computer), shown by echo
(which prints the output of commands). The dollar sign here means to access that variable. The line after the command line means the output of the command.
$ echo $HOME
/Users/olga
We'll be using unix commands here and it's worth pausing and talking about what these commands are really doing. Let's pretend we have a program called libraryprep
. We can call the program using its name:
$ libraryprep
Let's say it has the option flag (aka "flag" aka "argument") --stranded
. You'd tell the program this flag by putting it after the command:
$ libraryprep --stranded
A lot of times there's a shortened version of the flag. For --stranded
let's say it's -s
. Then you could use the shorter command,
$ libraryprep -s # This is equal to "libraryprep --stranded"
Notice the hash "#
". In all the languages we'll be using (bash, Python), this is a "comment" meaning its only for human reading and the computer can stop reading as soon as it sees the hash.
If a flag requires an input, you would put it immediately after the flag. For example, say we have a flag called --input
and it requires the cell lysis file:
$ libraryprep -s --input lysis
Now Let's say it has the options -a
, -s
, -d
, -f
We can put all of them sequentially:
$ libraryprep -a -s -d -f
Or put them all in a row after a single dash. So the above command is equivalent to:
$ libraryprep -asdf
But if you have a multi-character flag like --stranded
then that has to be separate:
$ libraryprep -adf --stranded # correct
$ libraryprep -adfstranded # incorrect
Or if the flag requires an argument, then you have to keep it separate, unless the shortened version is at the end. For the purposes of this, let's say -i
is short for --input
.
$ libraryprep -asdf --input lysis # correct
$ libraryprep -asdfi lysis # correct
$ libraryprep -asdfinput lysis # incorrect
$ libraryprep -asdif lysis # incorrect
If you have a long command, you can separate it out onto multiple lines for readability (meaning just for you the human, not for the computer) by using the backslash "\
" for each line that's supposed to continue on to the next one. The indentation is purely for readability and isn't required - it just looks nicer.
Fun fact: The way I remember that "\
" backslash because the top is falling backwards, and "/
" is frontslash because the top is falling forwards.
$ # Correct
$ libraryprep -adf \
--input lysis \
--stranded
It is very important that there are no spaces or other characters after the slashes! Otherwise your program will throw a cryptic syntax error and you will be sad.
$ # Incorrect
$ libraryprep -adf \ # comments aren't allowed here
--input lysis \ --flags-arent-okay-either
--stranded
For most command-line programs, you can get help on how to actually use the program one of four ways. This is called finding the "usage" message. I've written them in priority order so, try the first one first, then the second, etc.
$ programname
$ programname -h
-h
is (generally) a universal short version flag for "help me!!! I don't know what's going on!!" Though sometimes that doesn't work either ...$ programname --help
-h
$ man programname
man
is for "manual" and will bring up the manual pages for that commandSome programs we'll use will have subcommands. This means that we're calling a specific subprogram of a single program. For example, let's pretend we have a program called microscopy
, and it has the subcommands brightfield
and fluorescence
. You wouldn't be able to call just microscopy
by itself, because you need to specify the exact subprogram (in this analogy, the type of light inputs) that you want to use:
$ microscopy
Error: Must specify either "brightfield" or "fluorescence"
There will be flags like --zoom
that would make sense for both subcommands:
$ microscopy brightfield --zoom 10x
But only for the fluorescence
subcommand does the flag --wavelength
make any sense, since brightfield
would show all the light.
$ microscopy fluorescence --wavelength 488 --zoom 10x
Some of these commands use the greater than (>
), which means that the output of the program should be saved a file. I like to think of it as the big input (the left side of the >
) being squeezed into the small corner of the arrow (right side of the >
).
For example, let's go back to our libraryprep
example. We tell the libraryprep
program we want a stranded library with our lysis
input, and to save it as the file called cdna
.
$ libraryprep --stranded --input lysis > cdna
.fasta
file formatThis is the simple, very commonly used file format in bioinformatics for storing sequences. An example of the p53 protein sequence is below.
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
.fastq
file formatThe current standard of sequencing files is fastq
, which is derived from the fasta
format. Check out an example fastq
format below which shows 3 entries:
$ head -n 12 projects/shalek2013/raw_data/S10.chr11.R1.fastq
@SRR578577.1 C0HLAACXX120407:2:1107:16553:42649 length=101
TCTCCTTAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGACCAAA
+SRR578577.1 C0HLAACXX120407:2:1107:16553:42649 length=101
@@@FFDDDHHFHGIG?EAFDHHGHHHFHIEG;FDHIHIIIGIIDGHGHHIGIGGEHGGGGIFHIIGGGGGIGHHHECEDFFFFFFDEDEEE@BCBBBCB##
@SRR578577.2 C0HLAACXX120407:2:1203:19118:72461 length=101
CCTCTCCTTAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGACCA
+SRR578577.2 C0HLAACXX120407:2:1203:19118:72461 length=101
?@@DFFFDHGFDHEGHFHII<EHHAHHIIIIIIIIIIGEHEIIIIGGG<B<3BFGIEIGIIIGGC.7=CDCHCE??ACEBEE@>CACEC>6;;==BBBBBB
@SRR578577.3 C0HLAACXX120407:2:1110:7566:40229 length=101
ACCCTCTCCTTAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGAC
+SRR578577.3 C0HLAACXX120407:2:1110:7566:40229 length=101
???;:A;;C:D?AG<EFBDG<3CF:F><CCDHI>GFFEGEFFEGE:D0?6?D*/.8-;>AC=FDD=@CAEE=>?CAE@B@@BD>ABB;;>5>;;>?#####
Notice that each read takes up 4 lines, where
@unique_identifier length=101
[genomic sequence]
+unique_identifier length=101
[ASCII-encoded PHRED quality score]
Let's break the quality down a bit.
The quality of base calling for each position in the read is encoded as a score $Q$ from 1 to 40. Defining $p$ as the probability of incorrect base calling, then
$ Q = -10 \log_{10}(p) $
Given any score, we can back-calculate the probability of an incorrect base by using our understanding of logarithms.
$$ \begin{align} 40 &= -10 \log_{10}(p)\\ \frac{40}{-10} &= \frac{-10}{-10} \log_{10}(p)\\ -4 &= \log_{10}(p)\\ 10^{-4} &= 10^{\log_{10}(p)}\\ 10^{-4} &= p \end{align} $$Thus the best score of 40 represents the lowest probability of incorrect base calling, $p=10^{-4}$.
What is the probabilty of incorrect base calling for a score of 39? For 31?
Because "40" takes up two spaces, instead of using numbers the score is encoded using letters in a system called "ASCII." You already know that computers store information in zeros and ones, which can be turned into numbers, well, the letter "!" (exclamation point) represents the number 64 and acts as the lowest value for the quality scores, while
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
.fastq
files.html
) and plots (.png
) of library qualityExample:
$ fastqc $HOME/projects/differentiation/raw_data/sample01.R1.fastq -o $HOME/projects/differentiation/processed_data
.fastq
filescutadapt
, trimgalore
.fastq
filesExample:
cutadapt \
# format of the input files \
-f fastq \
# Maximum number of times an adapter sequence can be removed \
--times 2 \
# Maximum error rate in adapter \
-e 0.0 \
# Minimum overlap length \
-O 5 \
# Minimum sequencing quality \
--quality-cutoff 6 \
# Minimum read length \
-m 18 \
# Adapters that could appear anywhere in the sequencing read (5' end or 3' end) \
-b TCGTATGCCGTCTTCTGCTTG \
-b ATCTCGTATGCCGTCTTCTGCTTG \
-b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
-b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
-b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
# Read 1 output \
-o $HOME/projects/shalek2013/processed_data/S10_R1.polyATrim.adapterTrim.fastq \
# Read 2 output \
-p $HOME/projects/shalek2013/processed_data/S10_R2.polyATrim.adapterTrim.fastq \
# Read 1 input \
$HOME/projects/shalek2013/raw_data/S10_R1.fastq.gz \
# Read 2 input \
$HOME/projects/shalek2013/raw_data/S10_R2.fastq.gz \
# Statistics about how many adapters were removed
> $HOME/projects/shalek2013/processed_data/S10_R2.polyATrim.adapterTrim.metrics
.sam
file format"Sequence alignment map"
.fastq
files and repetitive element "genome"STAR
.sam
and unaligned .fastq
Example:
STAR --genomeDir $HOME/genomes/RepBase18.05.fasta/STAR/ --readFilesIn \
$HOME/projects/differentiation/raw_data/sample01.R1.fastq \
$HOME/projects/differentiation/raw_data/sample01.R2.fastq \
--outFileNamePrefix $HOME/projects/differentiation/processed_data/sample01.repetitive. \
# Save unmapped files as fastq
--outReadsUnmapped Fastx
.fastq
of trimmed reads which didn't align to repetitive elementsSTAR
.sam
Example:
STAR --genomeDir $HOME/genomes/hg19/gencode/v20/star \
--readFilesIn \
$HOME/projects/differentiation/processed_data/sample01.repetitive.Unmapped.out.mate1 \
$HOME/projects/differentiation/processed_data/sample01.repetitive.Unmapped.out.mate2 \
--outFileNamePrefix \
$HOME/projects/differentiation/processed_data/sample01.rmRepetitive.
Run STAR with the following parameters:
$HOME/genomes/mm10/gencode/m8/star_chr11
$HOME/projects/shalek2013/raw_data/S10.chr11.R1.fastq
and $HOME/projects/shalek2013/raw_data/S10.chr11.R2.fastq
$HOME/projects/shalek2013/processed_data/S10.chr11.
.bam
file.bam
file formatThe .bam
file format is a compressed (aka "binary" - meaning not human readable) version of the .sam
format.
All programs downstream of mapping require a sorted, indexed .bam
file. You can also do these steps with your repetitive element-aligned .sam
file if
Think of creating a sorted, indexed bam file like making a dictionary.
A raw aligned .sam
file like a jumble of words in random order:
single
cell
bioinformatics
experiment
beyonce
sequencing
library
prep
Sorting puts the words into alphabet order:
beyonce
bioinformatics
cell
experiment
library
prep
sequencing
single
And indexing is like adding the alphabet tabs in a dictionary:
B:
beyonce
bioinformatics
C:
cell
E:
experiment
L:
library
P:
prep
S:
sequencing
single
Instead of letters, indexing adds "chromosome tabs" so each program will know where, say chromosome 7 (chr7
) starts and can jump to it right away.
.sam
to .bam
.sam
of aligned readssamtools view -b
-b
flag forces output to be .bam
.bam
fileExample (samtools v1.3.1
):
samtools view -b sample01.sam > sample01.sorted.bam
Convert your .sam
file to .bam
.sam
For deeply sequenced samples (>50 million reads), this will take quite some time
.sam
of aligned readssamtools sort
.bam
Example (samtools v1.3.1
):
samtools sort sample01.bam > sample01.sorted.bam
Sort your .bam
file
.sam
Indexing only works with sorted bam files
.bam
of sorted, aligned readssamtools sort
.bam
Example (samtools v1.3.1
):
samtools index sample01.sorted.bam
This creates the file sample01.sorted.bam.bai
Index your .bam
file
TPM, FPKM, what? Check out this blog post on What the FPKM? to learn about RNA-seq units in depth.
Gene annotation files are either .gtf
or .gff
format. We'll use the GENCODE GTF format for our purposes. A subset of the gtf format is shown below. Notice the third column specifies the type of feature.
chr11 HAVANA gene 3125904 3131004 . + . gene_id "ENSMUSG00000082286.10"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; level 2; havana_gene "OTTMUSG00000000759.2";
chr11 HAVANA transcript 3125904 3129396 . + . gene_id "ENSMUSG00000082286.10"; transcript_id "ENSMUST00000145164.7"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Pisd-ps1-004"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTMUSG00000000759.2"; havana_transcript "OTTMUST00000086908.1";
chr11 HAVANA exon 3125904 3126058 . + . gene_id "ENSMUSG00000082286.10"; transcript_id "ENSMUST00000145164.7"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Pisd-ps1-004"; exon_number 1; exon_id "ENSMUSE00000728833.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTMUSG00000000759.2"; havana_transcript "OTTMUST00000086908.1";
chr11 HAVANA exon 3127479 3127644 . + . gene_id "ENSMUSG00000082286.10"; transcript_id "ENSMUST00000145164.7"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Pisd-ps1-004"; exon_number 2; exon_id "ENSMUSE00000708691.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTMUSG00000000759.2"; havana_transcript "OTTMUST00000086908.1";
.bam
of sorted, aligned readsfeatureCounts
, Salmon, HTSeq, ...Example:
featureCounts -a annotation.gtf -o sample01.featurecounts.txt sample1.sorted.bam
Run featureCounts
on your sorted, indexed .bam
file
$HOME/genomes/mm10/gencode/m8/gencode.vM8.basic.annotation.chr11.gtf
awk
, cut
, paste
n_samples
$\times$ n_features
)