Command line basics

WHY?

  • when you work remotely on a server, that's all you have
  • most of the bioinformatics programs are command-line driven
  • easier to connect several software tools into a pipeline
  • great for a quick analysis/troubleshooting
  • when used properly it's so much faster than any GUI
  • DRY (don't repeat yourself!)

UNIX philosophy

  • write small programs that do one thing and do it well
  • write programs to communicate with each other so that output of one program is the input for another
  • write programs to communicate in plain text (because it's the only universal interface)

EVERYTHING is CaSE-SenSitivE

Common paradigm:

<verb> [modifiers] <subject>

for example:

  • ls - list directory (current by default)
  • ls dir - list directory dir
  • ls -lah - list the current directory with details, one item per line (-l), including hidden files (-a), with sizes in human-readable format (-h)

Default: the value a command uses when no explicit argument is given (where it makes sense, e.g. ls lists the current directory)

Up arrow and down arrow scroll through command history.

Very important!

  • <Ctrl-R> - reverse history search
  • <Tab> - completion: never have to type the whole thing
  • <Alt-.> - inserts the last argument of the previous command from the history

Aside:

Why whitespace in file/directory name is a bad idea:

  • arguments are whitespace (space, tab, etc.) delimited in bash
  • so the whitespace needs to be escaped with a backslash, or the entire argument needs to be quoted
  • sooner or later it will break your scripts in unexpected and weird ways.
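
To see the problem in action, here is a minimal sketch. The file name "my file.txt" and the scratch directory are made up for illustration; nothing outside the temporary directory is touched.

```shell
# Work in a throwaway directory so nothing real is touched.
tmp=$(mktemp -d)
cd "$tmp"

touch "my file.txt"          # one file whose name contains a space

# Unquoted: the shell splits the name into two arguments, "my" and "file.txt",
# and neither of them exists.
unquoted=$(ls my file.txt 2>/dev/null | wc -l)

# Quoted (or escaped as my\ file.txt): a single argument, found as expected.
quoted=$(ls "my file.txt" | wc -l)

echo "unquoted matches: $unquoted, quoted matches: $quoted"

cd / && rm -r "$tmp"         # clean up
```

The same splitting happens to unquoted variables inside scripts, which is why "$f" is usually safer than $f.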

Getting around

HOW TO GET HELP: man <command>

Path: /<dir1>/<dir2>/<dir3>/foo.bar for example: /home/ilya/src

Absolute path:

  • starts with /, aka root. Location relative to filesystem root.

Relative path:

  • doesn't start with /. Location relative to the current directory.

Aside

UNIX filesystem is divided into two realms:

  • userland (everything under /home, each user has access to her own data)
  • system (everything else, only admins or sudoers have access to)

Commands

  • pwd - print working directory
  • cd <dir> - change directory
  • mkdir [options] <dir> - make new directory
  • ls [options] [<dir>...] - list directory
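
A short sketch of these commands with absolute and relative paths. The directory names (project, data) are hypothetical and live in a scratch directory created by mktemp.

```shell
tmp=$(mktemp -d)             # scratch directory; mktemp prints its absolute path

mkdir "$tmp/project"         # absolute path: starts with /
cd "$tmp"
mkdir project/data           # relative path: resolved against the current directory
cd project/data

here=$(pwd)                  # where are we now?
echo "$here"

ls -a "$tmp/project"         # list a directory other than the current one

cd / && rm -r "$tmp"         # clean up
```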

Exercise:

  • find your Downloads (or Documents) directory
  • list the content of your Documents directory using different options
  • create a new directory (where is it created?)

Shortcuts

These are huge time savers. But they are nothing more than aliases.

  • ~ - current user's home directory (/home/<user>/ or /Users/<user>/ on Mac OS)
  • . - current directory
  • .. - parent directory
  • - - previous directory, as in cd - (although in most other contexts it means stdin)

Other useful things:

  • pushd <dir> - changes to directory <dir> and pushes it onto the stack, on top of the directory you came from
  • popd - pops the last pushed directory off the stack and takes you back to where you were

These two can be thought about as "remember for later" and "recall the last remembered" commands.
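
A minimal sketch of the shortcuts and the directory stack, assuming bash (pushd and popd are bash builtins). The directories a and b are made up.

```shell
# Requires bash: pushd/popd are not available in plain sh.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b"

cd "$tmp/a/b"
cd ..                        # parent directory: now in $tmp/a
parent=$(pwd)

cd "$tmp"
cd - >/dev/null              # back to the previous directory ($tmp/a)
previous=$(pwd)

pushd "$tmp/a/b" >/dev/null  # remember where we are ($tmp/a) and go to a/b
popd >/dev/null              # recall: return to the remembered directory
recalled=$(pwd)

echo "$parent | $previous | $recalled"

cd / && rm -r "$tmp"         # clean up
```

Both cd - and pushd/popd print the directory they switch to, hence the >/dev/null to keep the output quiet.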

Path and executables

Files that have the x bit set in their permissions are executable. These can be executed by typing their path (absolute or relative) at the prompt:

    $ /home/vasyapupkin/myprog1
    $ ./myprog1
    $ /bin/myprog1

or they can be executed by typing just their name at the prompt if their location is listed in the PATH variable:

$ echo $PATH
$ myprog1

If unsure, use which <program> to find the executable (if it exists!):

$ which python
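
The whole story can be tried out with a throwaway program. This is only a sketch: myprog1 here is a hypothetical one-line script, and command -v is used as a builtin stand-in for which.

```shell
tmp=$(mktemp -d)

# A hypothetical one-line program called myprog1.
printf '#!/bin/sh\necho "hello from myprog1"\n' > "$tmp/myprog1"
chmod +x "$tmp/myprog1"          # set the x bit so it can be executed

by_path=$("$tmp/myprog1")        # run it via its full path

PATH="$tmp:$PATH"                # add its directory to PATH (this session only)
by_name=$(myprog1)               # now the bare name is enough
found=$(command -v myprog1)      # like which: where does the name resolve to?

echo "$by_name (found at $found)"

rm -r "$tmp"                     # clean up
```

Changes to PATH made this way last only until the shell exits.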

Looking at things (well, files)

  • cat will output the contents of the files given as its arguments to stdout
  • less will do the same but in a humane way (pagination, search, scrolling, etc)
  • man displays a help page for a given command
  • head outputs the first n lines of a file (10 by default)
  • tail outputs the last n lines of a file (10 by default)
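
A quick sketch on a generated file (numbers.txt is a made-up name; seq just prints the numbers 1 to 100, one per line). less and man are interactive, so they are left out here.

```shell
tmp=$(mktemp -d)
seq 1 100 > "$tmp/numbers.txt"               # sample file with 100 lines

first=$(head -n 3 "$tmp/numbers.txt")        # first 3 lines
last=$(tail -n 2 "$tmp/numbers.txt")         # last 2 lines
default=$(head "$tmp/numbers.txt" | wc -l)   # no -n given: head prints 10 lines
total=$(cat "$tmp/numbers.txt" | wc -l)      # cat dumps the whole file

echo "$first"
echo "$last"

rm -r "$tmp"                                 # clean up
```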

Creating, copying and moving stuff

Create

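Create an empty file (touch actually just updates the file's timestamp and creates the file if it doesn't exist; > also creates the file, but empties it if it already exists!):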

touch <filename>
><filename>

Create a directory:

mkdir <dirname>

Usual path rules apply (see absolute vs relative paths). The fancy switch -p creates any missing parent directories along the way:

mkdir -p path/to/my/new/dir
mkdir -p path/to/{one,two,three}
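
A small sketch of what the two commands above actually do, assuming bash (brace expansion is done by the shell before mkdir ever runs). Everything happens in a scratch directory.

```shell
# Requires bash: brace expansion is not available in plain sh.
tmp=$(mktemp -d)
cd "$tmp"

# Dry run: echo shows the command after the shell has expanded the braces.
echo mkdir -p path/to/{one,two,three}

mkdir -p path/to/{one,two,three}     # creates path/, path/to/ and the three leaves
created=$(ls path/to | tr '\n' ' ')

mkdir -p path/to/one                 # with -p an existing directory is not an error

echo "$created"

cd / && rm -r "$tmp"                 # clean up
```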

Copy

Copying stuff:

cp <source> <destination>

By default, cp only copies regular files and skips directories. To copy directories use the -r (recursive) option:

cp -r <source_dir> <destination>

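but watch for that trailing slash:

cp -r <source>/ <destination>

may behave differently depending on your cp: BSD cp (Mac OS) copies the contents of <source> rather than the directory itself, while GNU cp (Linux) treats both forms the same.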

Globbing works as one would expect:

cp <source>/*.txt <destination>

will copy all files ending with .txt to <destination>
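
A sketch of the copy variants in a scratch directory (src, backup and the file names are made up). Note how cp -r behaves differently depending on whether the destination already exists.

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir src backup
touch src/a.txt src/b.txt src/notes.md

cp src/*.txt backup/                 # the shell expands *.txt: only .txt files go
txt_copied=$(ls backup | wc -l)

cp -r src src_copy                   # src_copy doesn't exist: it becomes a copy of src
all_copied=$(ls src_copy | wc -l)

cp -r src backup                     # backup exists: src is copied INTO it (backup/src)
nested=$(ls backup/src | wc -l)

echo "$txt_copied $all_copied $nested"

cd / && rm -r "$tmp"                 # clean up
```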

Move (aka rename)

How to move stuff?

mv <source> <destination>

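But what if we want to move a bunch of stuff? This works, as long as <destination> is an existing directory:

mv <source>/*.txt <destination>

but it doesn't work for renaming a bunch of files: something like mv *.txt *.bak fails, because the shell expands both globs before mv ever sees them. WTF?

Cheating way: install the rename program. Won't work if you don't have admin rights though.

Proper way: loop over the files one at a time

for f in *.txt; do mv "$f" <destination>; done

HINT: for a dry run replace mv with echo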
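
Here is one way a loop can do a batch rename, as a sketch. The file names and the .tsv target extension are made up; ${f%.txt} is standard shell parameter expansion that strips the .txt suffix.

```shell
tmp=$(mktemp -d)
cd "$tmp"
touch "sample one.txt" sample2.txt

# Dry run first: echo prints each command instead of running it.
for f in *.txt; do echo mv "$f" "${f%.txt}.tsv"; done

# Real run. "$f" is quoted so names with spaces survive;
# ${f%.txt} drops the old extension so a new one can be appended.
for f in *.txt; do mv "$f" "${f%.txt}.tsv"; done

remaining_txt=$(ls | grep '\.txt$' | wc -l)
renamed_tsv=$(ls | grep '\.tsv$' | wc -l)
echo "txt left: $remaining_txt, tsv now: $renamed_tsv"

cd / && rm -r "$tmp"                 # clean up
```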

Delete

CAUTION: There is no undelete. If you delete a file, it's gone forever!

Delete (remove) one or more files:

rm <file>

Delete a directory:

rm -r <directory>

Selecting what to show (filtering)

Globbing (aka wildcards)

  • ? matches one (any) character
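  • * matches any number of any characters except the path separator /; it also skips a leading . (hidden files)
  • ** matches any number of any characters, including / (so it descends into subdirectories; in bash this needs shopt -s globstar)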
  • {pattern1,pattern2,...} or {start..end} pattern expansion
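
echo is a handy way to see what a pattern expands to before using it for real. A sketch, assuming bash and a few made-up file names:

```shell
# Requires bash for the {..} expansions.
tmp=$(mktemp -d)
cd "$tmp"
touch a1.txt a2.txt b10.txt .hidden.txt notes.md

one=$(echo a?.txt)                   # ? = exactly one character
star=$(echo *.txt)                   # * = any run of characters; .hidden.txt is skipped
range=$(echo sample_{1..3}.fq)       # braces generate names, existing or not

echo "$one"
echo "$star"
echo "$range"

cd / && rm -r "$tmp"                 # clean up
```

Unlike ? and *, brace expansion does not look at the filesystem at all: it simply generates strings.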

grep

grep takes its name from the ed command g/re/p (Globally search for a Regular Expression and Print matching lines). Regular expressions (regex) are an advanced and powerful way to match patterns. grep can be thought of as a very versatile and efficient filter that can be configured to pass through only the results you want.
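
A few of the most common grep switches, sketched on a made-up four-line file (features.txt is hypothetical):

```shell
tmp=$(mktemp -d)
printf '%s\n' '# a comment' 'gene thrL' 'CDS thrL' 'gene thrA' > "$tmp/features.txt"

only_genes=$(grep '^gene' "$tmp/features.txt")              # lines starting with "gene"
no_comments=$(grep -v '^#' "$tmp/features.txt" | wc -l)     # -v inverts the match
count_thrL=$(grep -c 'thrL' "$tmp/features.txt")            # -c counts matching lines

echo "$only_genes"

rm -r "$tmp"                                                # clean up
```

Quoting the pattern keeps the shell from interpreting characters like #, * or $ before grep sees them.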

Some plumbing: pipes, redirects and tee

  • | (aka pipe) - sends the output of the left program to the input of the right program
  • tee - a program used in the middle of a pipe: it passes its input through to its output and at the same time saves a copy into a file
  • > - redirects the output of the program to a file (overwriting the file if it exists)
  • >> - same as > but appends to the file if it exists
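
All four pieces of plumbing in one small sketch (letters.txt and sorted.txt are made-up names in a scratch directory):

```shell
tmp=$(mktemp -d)
cd "$tmp"

printf 'b\na\nc\n' > letters.txt     # > creates (or overwrites) letters.txt
printf 'd\n' >> letters.txt          # >> appends one more line

# Pipe: sort reads what cat writes. tee saves a copy to sorted.txt
# and still passes the data on to wc.
n=$(cat letters.txt | sort | tee sorted.txt | wc -l)

first=$(head -n 1 sorted.txt)
echo "$n lines, first after sorting: $first"

cd / && rm -r "$tmp"                 # clean up
```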

Practical things

Downloading stuff from the Internet

wget - loads of options and protocols supported. Read manpages for all options.

Let's use it to download the E. coli .gff file from NCBI (http://www.ncbi.nlm.nih.gov/genome/167):

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz

and make sure it's where you expect it to be:

ls -lah *.gff.gz

and download some more stuff:

wget ngs.nudlerlab.info/master.zip
wget ngs.nudlerlab.info/BJ-HSR1.pe.fastq.gz

In [2]:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz


--2016-09-14 14:44:51--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz
           => ‘GCF_000005845.2_ASM584v2_genomic.gff.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 2607:f220:41e:250::10
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF_000005845.2_ASM584v2 ... done.
==> SIZE GCF_000005845.2_ASM584v2_genomic.gff.gz ... 449791
==> PASV ... done.    ==> RETR GCF_000005845.2_ASM584v2_genomic.gff.gz ... done.
Length: 449791 (439K) (unauthoritative)

GCF_000005845.2_ASM 100%[===================>] 439.25K  2.64MB/s    in 0.2s    

2016-09-14 14:45:23 (2.64 MB/s) - ‘GCF_000005845.2_ASM584v2_genomic.gff.gz’ saved [449791]


In [4]:
mv GCF_000005845.2_ASM584v2_genomic.gff.gz ../data
ls -lah ../data | grep gff


-rw-rw-r-- 1 ilya ilya 440K Sep 14 14:45 GCF_000005845.2_ASM584v2_genomic.gff.gz

In [5]:
zcat ../data/GCF_000005845.2_ASM584v2_genomic.gff.gz | head


##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM584v2
#!genome-build-accession NCBI_Assembly:GCF_000005845.2
##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
NC_000913.3	RefSeq	region	1	4641652	.	+	.	ID=id0;Dbxref=taxon:511145;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=K-12;substrain=MG1655
NC_000913.3	RefSeq	gene	190	255	.	+	.	ID=gene0;Dbxref=EcoGene:EG11277,GeneID:944742;Name=thrL;gbkey=Gene;gene=thrL;gene_biotype=protein_coding;gene_synonym=ECK0001,JW4367;locus_tag=b0001
NC_000913.3	RefSeq	CDS	190	255	.	+	0	ID=cds0;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB/Swiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;Name=NP_414542.1;gbkey=CDS;gene=thrL;product=thr operon leader peptide;protein_id=NP_414542.1;transl_table=11

gzip: stdout: Broken pipe

Working with compressed files

Most NGS data formats are text based and, therefore, highly compressible. For instance, a gzipped .fastq file can take 10-20% of the original space.

  • gzip - compresses the file
  • gunzip - uncompresses the file

By default both gzip and gunzip replace the original file with the result. To keep the original file use zcat, or the -c flag for gzip/gunzip (which writes to stdout, so redirect it to a file).
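
A sketch of both behaviours on made-up files. gzip -dc is used here in place of zcat since it does the same thing (decompress to stdout) and behaves the same across systems; on some systems zcat expects a .Z suffix.

```shell
tmp=$(mktemp -d)
cd "$tmp"
seq 1 1000 > reads.txt
cp reads.txt other.txt

gzip other.txt                            # default: other.txt is replaced by other.txt.gz
gzip -c reads.txt > reads.txt.gz          # -c writes to stdout, so reads.txt is kept

lines=$(gzip -dc reads.txt.gz | wc -l)    # decompress to stdout, leave the .gz alone
files=$(ls | tr '\n' ' ')

echo "$files"

cd / && rm -r "$tmp"                      # clean up
```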

It's a perfect use case for pipes, so let's dig right in. Let's have a look at what's inside the .gff file we've just downloaded:

zcat GCF_000005845.2_ASM584v2_genomic.gff.gz | less

How can we modify the above to show only the beginning of the file? The end of the file?

tar bundles files and directories into a single archive, which can be compressed at the same time (e.g. .tar.gz)
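
The three tar operations you'll need most often, sketched on a made-up results directory:

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir results unpacked
touch results/a.txt results/b.txt

tar -czf results.tar.gz results               # c = create, z = gzip, f = archive name
listing=$(tar -tzf results.tar.gz | wc -l)    # t = list contents (directory + 2 files)
tar -xzf results.tar.gz -C unpacked           # x = extract, -C = into this directory

extracted=$(ls unpacked/results | wc -l)
echo "$listing entries in archive, $extracted files extracted"

cd / && rm -r "$tmp"                          # clean up
```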

Some random useful things (text processing)

sort - self-explanatory. Sorts the input in a variety of ways. Really useful when chaining several programs using pipes.

uniq - outputs unique items from the input stream. It only collapses adjacent duplicates, so it's usually fed sorted input. Can count occurrences of each item (-c). Again, really shines when used with other programs.

tr - translates or deletes characters from the input stream. Doesn't sound like much but is a real time-saver when building pipelines and workflows.

wc - word count. Counts lines, words and bytes in its input. Handy for composing "compound" commands from simple programs.

Again, to get the full list of available options use man <program> command.
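
A sketch of how these four combine, on a made-up list of feature names. It also shows why uniq usually needs sort in front of it.

```shell
features=$(printf '%s\n' gene CDS gene tRNA gene CDS)

# uniq only collapses ADJACENT duplicates, so unsorted input stays as it is.
no_sort=$(printf '%s\n' "$features" | uniq | wc -l)
with_sort=$(printf '%s\n' "$features" | sort | uniq | wc -l)

# Count each item, then sort the counts numerically, biggest first.
top=$(printf '%s\n' "$features" | sort | uniq -c | sort -rn | head -n 1)

upper=$(printf '%s\n' "acgt" | tr 'a-z' 'A-Z')    # tr maps characters one-to-one

echo "$no_sort $with_sort $top $upper"
```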

Putting it all together

Coming back to the .gff file. Let's see how we can build a nice little summary of E. coli genomic features (genes, CDS, and so on).

For starters:

zcat GCF_000005845.2_ASM584v2_genomic.gff.gz | less

good, but what's up with all those lines starting with #? Those are comments and we want to get rid of them.

grep to the rescue:

zcat GCF_000005845.2_ASM584v2_genomic.gff.gz | grep -v '^#' | less

Better! So we are left with a tab-delimited file (a relative of .csv really). Say we're interested in the 3rd column (the feature type). Let's split the line on tabs (which is what cut does by default) and take the third field:

zcat GCF_000005845.2_ASM584v2_genomic.gff.gz | grep -v '^#' | cut -f 3 | less

Ugly! But what if we sort the values and count unique items?

zcat GCF_000005845.2_ASM584v2_genomic.gff.gz | grep -v '^#' | cut -f 3 | sort | uniq -c

And there you have it!
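
If you don't have the real file at hand, the same pipeline can be tried on a tiny stand-in. This is only a sketch: mini.gff and its three records are made up, gzip -dc plays the role of zcat, and the extra sort -rn at the end (not part of the pipeline above) puts the most frequent feature first.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# A made-up, GFF-like sample: one comment line plus three tab-separated records.
printf '##gff-version 3\n' > mini.gff
printf '%s\t%s\t%s\n' chr1 demo gene chr1 demo CDS chr1 demo gene >> mini.gff
gzip mini.gff

summary=$(gzip -dc mini.gff.gz | grep -v '^#' | cut -f 3 | sort | uniq -c | sort -rn)
echo "$summary"

cd / && rm -r "$tmp"                 # clean up
```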

Permissions

In UNIX, access to files is controlled via permissions.

There are three levels of permissions:

  • u user
  • g group
  • o others

Permissions:

  • r read permission
  • w write permission (also create or delete)
  • x eXecute permission (directories must have x permission set in order to be able to cd into them!!!)

The ls -l command outputs lines starting with the permissions part (e.g. -rw-r--r--).

By default, only the file's owner (and root) can modify it; what the group and others can do depends on the umask (on many systems they can still read it).

Relevant commands:

  • chown - change the owner (must have permission to do so!)
  • chgrp - change file's group
  • chmod - change permission(s)
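
A sketch of chmod in both its symbolic and octal forms, on a made-up file. The first ten characters of ls -l are the file type followed by the rwx triplets for user, group and others.

```shell
tmp=$(mktemp -d)
cd "$tmp"
touch script.sh

chmod u=rwx,g=rx,o= script.sh                # owner: rwx, group: r-x, others: nothing
symbolic=$(ls -l script.sh | cut -c 1-10)

chmod 644 script.sh                          # octal form: rw- for owner, r-- for the rest
octal=$(ls -l script.sh | cut -c 1-10)

echo "$symbolic -> $octal"

cd / && rm -r "$tmp"                         # clean up
```

In the octal form each digit is the sum of r=4, w=2 and x=1, so 6 means rw- and 4 means r--.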

Optional: editing files (nano and vim), symbolic links, processes.