UNIX Quick Reference Guide

Looking at files and moving them around

pwd # Tell me which directory I'm in
ls # What else is in this directory
ls .. # What is in the directory above me
ls foo/bar/ # What is inside the bar directory which is inside the foo/ directory
ls -lah foo/ # Give me the details (-l) of all files and folders (-a) using human
             # readable file sizes (-h)
cd ../.. # Move up two directories
cd ../foo/bar # Move up one directory and down into the foo/bar/ subdirectories
cp -r foo baz/ # Copy the foo/ directory into the (existing) baz/ directory
mv baz/foo .. # Move the foo directory into the parent directory
rm -r ../foo # remove the directory called foo/ from the parent directory
find foo/ -name "*.gff" # find all the files with a gff extension in the directory foo/

Looking in files

less bar.bed # scroll through bar.bed
grep chrom bar.bed | less -S # Only look at lines in bar.bed which have 'chrom' and
                             # don't wrap lines (-S)
head -20 bar.bed # show me the first 20 lines of bar.bed
tail -20 bar.bed # show me the last 20 lines
cat bar.bed # show me all of the lines (bad for big files)
wc -l bar.bed # how many lines are there
sort -k 2 -n bar.bed # sort by the second column in numerical order
awk '{print $1}' bar.bed | sort | uniq # show the unique entries in the first column

Grep

grep foo bar.bed # show me the lines in bar.bed with 'foo' in them
grep foo baz/* # show me all examples of foo in the files immediately within baz/
grep -r foo baz/ # show me all examples of foo in baz/ and every subdirectory within it
grep '^foo' bar.bed # show me all of the lines beginning with foo
grep 'foo$' bar.bed # show me all of the lines ending in foo
grep -iE '^[acgt]+$' bar.bed # show me all of the lines which only have the characters
                             # a,c,g and t (ignoring their case)
grep -v foo bar.bed # don't show me any lines with foo in them
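
To try the anchors out, here is a minimal sketch on a made-up file (seqs.txt and its contents are assumptions for illustration, they are not part of bar.bed). The -c flag makes grep print the number of matching lines instead of the lines themselves, and the + in the third pattern (with -E) lets the character class match one or more characters, so the whole line has to be made of a, c, g and t:

```shell
# Build a small hypothetical file to try the patterns on
tmpdir=$(mktemp -d)
printf 'ACGT\nacgtn\nfoo bar\nbar foo\n' > "$tmpdir/seqs.txt"

starts=$(grep -c '^foo' "$tmpdir/seqs.txt")       # lines beginning with foo
ends=$(grep -c 'foo$' "$tmpdir/seqs.txt")         # lines ending in foo
dna=$(grep -ciE '^[acgt]+$' "$tmpdir/seqs.txt")   # lines made only of a, c, g and t
without=$(grep -cv foo "$tmpdir/seqs.txt")        # lines without foo in them
echo "$starts $ends $dna $without"                # prints: 1 1 1 2
```

Note that acgtn is not counted by the third pattern because of the trailing n.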

Awk

awk '{print $1}' bar.bed # just the first column
awk '$4 ~ /^foo/' bar.bed # just rows where the 4th column starts with foo
awk '$4 == "foo" {print $1}' bar.bed # the first column of rows where the 4th column is foo
awk -F"\t" '{print $NF}' bar.bed # split on tabs only (not spaces) and print the last column
awk -F"\t" '{print $(NF-1)}' bar.bed # print the penultimate column
awk '{sum+=$2} END {print sum}' bar.bed # print the sum of the second column
awk '/^foo/ {sum+=$2; count+=1} END {print sum/count}' bar.bed # print the average of the
                                                               # second value of lines starting
                                                               # with foo
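
The average above divides by count, so awk will complain about a division by zero if no lines start with foo. A minimal sketch of one way to guard against that, assuming a hypothetical tab-separated file called scores.txt (the file name, its contents and the "no matching lines" message are made up for illustration):

```shell
# Build a small hypothetical file: a name in column 1, a number in column 2
tmpdir=$(mktemp -d)
printf 'foo\t10\nfoo\t20\nbar\t99\n' > "$tmpdir/scores.txt"

# Only divide if at least one line matched
average=$(awk '/^foo/ {sum+=$2; count+=1}
               END {if (count > 0) print sum/count; else print "no matching lines"}' "$tmpdir/scores.txt")
echo "$average"   # prints: 15

# The same command with a pattern that matches nothing
missing=$(awk '/^qux/ {sum+=$2; count+=1}
               END {if (count > 0) print sum/count; else print "no matching lines"}' "$tmpdir/scores.txt")
echo "$missing"   # prints: no matching lines
```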

Piping, redirection and more advanced queries

grep -hv '^#' bar/*.gff | awk -F"\t" '{print $1}' | sort -u
#  grep => -h: don't print file names
#          -v: don't give me matching lines
#          '^#': get rid of the header rows
#          'bar/*.gff': only look in the gff files in bar/
#  awk => print the first column
#  sort => -u: give me unique values

awk 'NR%10 == 0' bar.bed | head -20
# awk => NR: is the row number
#        NR%10: is the modulo (remainder) of dividing by 10
#        awk is therefore giving you every 10th line
# head => only show the first 20
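
A quick way to convince yourself that NR%10 picks every 10th line is to feed awk something where the line number and the line content are the same. This is a minimal sketch which assumes the seq command is available (it is on most Linux and macOS systems):

```shell
# seq 1 100 prints the numbers 1 to 100, one per line
every_tenth=$(seq 1 100 | awk 'NR%10 == 0' | head -3)
echo "$every_tenth"
# prints:
# 10
# 20
# 30
```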

awk '{l=($3-$2+1)}; (l<300 && $2>200000 && $3<250000)' exercises.bed
# Gives:
# contig-2  201156  201359  gene-67 24.7    -
# contig-4  245705  245932  gene-163    24.8    +
# Finds all of the lines with features less than 300 bases long which start
# after base 200,000 and end before base 250,000
# Note that this appears to have the action before the pattern.  This is
# because we need to calculate the length of each feature before we use it
# for filtering.  If they were the other way around, each line would be
# tested against the length left over from the last line that was printed,
# so you'd get the wrong lines:
awk '(l<300 && $2>200000 && $3<250000) {l=($3-$2+1); print $0}' exercises.bed
# Gives:
# contig-2  201156  201359  gene-67 24.7    -
# contig-2  242625  243449  gene-68 46.5    +
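
One way to sidestep the ordering problem is to do the length calculation inside the pattern itself, so that nothing is carried over from one line to the next. A minimal sketch, assuming a hypothetical three-line file (sample.bed) built only from the rows quoted in the outputs above; it prints just the gene names to keep the output short:

```shell
tmpdir=$(mktemp -d)
# Hypothetical sample file made from the three rows shown in the outputs above
printf 'contig-2\t201156\t201359\tgene-67\t24.7\t-\n'   > "$tmpdir/sample.bed"
printf 'contig-2\t242625\t243449\tgene-68\t46.5\t+\n'  >> "$tmpdir/sample.bed"
printf 'contig-4\t245705\t245932\tgene-163\t24.8\t+\n' >> "$tmpdir/sample.bed"

# The length ($3-$2+1) is worked out in the pattern for every line
short_genes=$(awk '($3-$2+1) < 300 && $2 > 200000 && $3 < 250000 {print $4}' "$tmpdir/sample.bed")
echo "$short_genes"
# prints:
# gene-67
# gene-163

# The pattern-before-action version uses a stale length instead
stale_genes=$(awk '(l<300 && $2>200000 && $3<250000) {l=($3-$2+1); print $4}' "$tmpdir/sample.bed")
echo "$stale_genes"
# prints:
# gene-67
# gene-68
```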

A script

#!/usr/bin/env bash

set -e # stop running the script if there are errors
set -u # stop running the script if it uses an unknown variable
set -x # print every line before you run it (useful for debugging but annoying)

if [ $# -ne 2 ]
then
  echo "You must provide two files"
  exit 1 # exit the programme (any number > 0 reports that this is a failure)
fi

file_one=$1
file_two=$2

if [ ! -f "$file_one" ] # quote the variable in case the file name contains spaces
then
  echo "The first file couldn't be found"
  exit 2
fi

if [ ! -f "$file_two" ]
then
  echo "The second file couldn't be found"
  exit 2
fi

# Get the lines which aren't headers,
# take the first column and return the unique values
number_of_contigs_in_one=$(awk '$1 !~ /^#/ {print $1}' "$file_one" | sort -u | wc -l)
number_of_contigs_in_two=$(awk '/^[^#]/ {print $1}' "$file_two" | sort -u | wc -l)

if [ $number_of_contigs_in_one -gt $number_of_contigs_in_two ]
then
  echo "The first file had more unique contigs than the second"
  exit
elif [ $number_of_contigs_in_one -lt $number_of_contigs_in_two ]
then
  echo "The second file had more unique contigs"
  exit
else
  echo "The two files had the same number of contigs"
  exit
fi
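
The script counts contigs twice with almost the same pipeline. If you find yourself repeating a pipeline like that, one option is to wrap it in a function. This is only a sketch: count_contigs, one.gff, two.gff and their contents are hypothetical names and data made up for illustration, not part of the script above:

```shell
# Hypothetical helper: count the unique values in the first column of a file,
# skipping header lines that start with '#'
count_contigs() {
  awk '$1 !~ /^#/ {print $1}' "$1" | sort -u | wc -l
}

# Two small made-up files to try it on
tmpdir=$(mktemp -d)
printf '#header\ncontig-1\t1\t10\ncontig-1\t20\t30\ncontig-2\t5\t9\n' > "$tmpdir/one.gff"
printf '#header\ncontig-1\t1\t10\n' > "$tmpdir/two.gff"

result="The first file did not have more unique contigs"
if [ "$(count_contigs "$tmpdir/one.gff")" -gt "$(count_contigs "$tmpdir/two.gff")" ]
then
  result="The first file had more unique contigs than the second"
fi
echo "$result"
```

Here one.gff has two unique contigs and two.gff has one, so the first message is printed.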

Pro tips

  • Use tab completion - it will save you time!
  • Always have a quick look at files with less or head to double check their format
  • Watch out for header lines and make sure you don't accidentally grep them if you don't want them
  • Watch out for spaces, especially if you're using awk; if in doubt, use -F"\t"
  • Regular expressions are weird, build them up slowly bit by bit
  • If you did something smart but can't remember what it was, try typing history and it might have a record
  • man the_name_of_a_command often gives you help
  • Google is normally better at giving examples (prioritise stackoverflow.com results, they're normally good)
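
To see why spaces matter to awk, here is a minimal sketch using a hypothetical tab-separated file (names.txt, made up for illustration) whose second column contains a space. By default awk splits on any run of spaces or tabs, so it sees three columns; with -F"\t" it sees the two you intended:

```shell
tmpdir=$(mktemp -d)
# Hypothetical two-column file: the second column has a space in it
printf 'gene-1\thypothetical protein\n' > "$tmpdir/names.txt"

default_columns=$(awk '{print NF}' "$tmpdir/names.txt")      # splits on spaces and tabs
tab_columns=$(awk -F"\t" '{print NF}' "$tmpdir/names.txt")   # splits on tabs only
echo "$default_columns $tab_columns"                         # prints: 3 2
```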

Build commands slowly

If you wanted me to calculate the sum of all of the scores for genes on contig-1 in a bed file, I'd probably run each of the following commands before moving on to the next:

head -20 bar.bed # check which column is which and if there are any headers
head -20 bar.bed | awk '{print $5}' # have a look at the scores
awk '{print $1}' bar.bed | sort -u | less # check the contigs don't look weird
awk '{print $4}' bar.bed | sort -u | less # check the genes don't look weird
awk '$4 ~ /gene-/' bar.bed | head -20 # check that I can spot genes
awk '($1 == "contig-1" && $4 ~ /gene-/)' bar.bed | head -20 # check I can find
                                                            # genes on contig-1
# check my algorithm works on a subset of the data
head -20 bar.bed | awk '($1 == "contig-1" && $4 ~ /gene-/) {sum+=$5}; END {print sum}'
# apply the algorithm to all of the data
awk '($1 == "contig-1" && $4 ~ /gene-/) {sum+=$5}; END {print sum}' bar.bed

Which tool should I use?

You should probably use awk if:

  • your data has columns
  • you need to do simple maths

You should probably use grep if:

  • you're looking for files which contain some specific text (e.g. grep -r foo bar/: look in all the files in bar/ for any with the word 'foo')

You should use find if:

  • you know something about a file (like its name or creation date) but not where it is
  • you want a list of all the files in a subdirectory and its subdirectories etc.
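
As a rough sketch of both uses, the commands below build a small made-up directory tree (the file names are assumptions for illustration) and then search it. Note that find's -mtime test looks at when a file was last modified, which is used here as a stand-in for the creation date:

```shell
# Build a hypothetical directory tree to search
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/foo/bar"
touch "$tmpdir/foo/a.gff" "$tmpdir/foo/bar/b.gff" "$tmpdir/foo/bar/notes.txt"

find "$tmpdir/foo" -type f                           # every file in foo/ and its subdirectories
find "$tmpdir/foo" -type f -name "*.gff"             # only the files with a gff extension
find "$tmpdir/foo" -type f -name "*.gff" -mtime -7   # gff files modified in the last 7 days

# Count the results of each search
all_count=$(find "$tmpdir/foo" -type f | wc -l)
gff_count=$(find "$tmpdir/foo" -type f -name "*.gff" | wc -l)
recent_count=$(find "$tmpdir/foo" -type f -name "*.gff" -mtime -7 | wc -l)
```

The -type f test restricts the results to regular files, so the directories themselves are not listed.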

You should write a script if:

  • your code doesn't fit on one line
  • it's doing something you might want to do again in 3 months
  • you want someone else to be able to do it without asking loads of questions
  • you're doing something sensitive (e.g. deleting loads of files)
  • you're doing something lots of times

You should probably use less or head:

  • always, you should always use less or head to check intermediary steps in your analysis