Purpose of this tutorial

Teach you some Bash!

You will need the following programs to use this code:

  • a command line with Bash
  • (optional) Git
  • (optional) IPython

A little bit about Bash

See Wikipedia's entry: http://en.wikipedia.org/wiki/Bash_(Unix_shell)

Getting more help

There is certainly no shortage of tutorials for Bash and for writing scripts. In fact, Google is going to be one of your best friends when it comes to debugging errors and other issues with Bash. Someone else has probably had the same problem before you and already posted the solution, so search first before you email your course instructor or TA. Below you will find a list of online Bash tutorials that you may find useful:

How to read this tutorial

We are going to assume that you are ssh’d into proteus and working out of the course's GitHub repo. We will keep this repo updated as the course goes on, as bugs are found, and as general improvements are made. If you have not already, clone the repo to your local folder. This can be done by running:

git clone https://github.com/gditzler/bio-course-materials.git

which will create a clone of the repo in the folder where the command was called. From time to time you may feel the need to update the repo with the staff's latest changes. To do this, run:

git reset --hard
git pull origin master

in the directory where you cloned the repo. Note that if you have been modifying the files in the repo, you may encounter merge conflicts when you pull. Using git reset --hard avoids this by erasing any uncommitted changes that you have made to the files in the repo. Therefore, it is recommended that you either: (i) copy the files you wish to experiment with to a new file before modifying them, or (ii) be aware that your changes will be erased whenever you reset the master branch. If you prefer to use IPython in our examples, run:

ipython notebook --pylab inline

Basic Bash commands and concepts

Below you will find a list of common commands used in Bash programming. While these commands are rather simple, we can combine them with slightly more powerful commands to build complex expressions with relatively few lines of code. In the list, the : character only separates each command from its description; leave it out when calling the command.


In [1]:
# cd <some path> : change directory  
# cp <file a> <file b> : copy `file a` to `file b`. note that `file a` still remains.
# mv <file a> <file b> : move `file a` to `file b`
# ls : list the contents of a directory 
# cat <file a> : print the contents of `file a`
# cat <file a> <file b> ... : concatenate the contents of files `file a` `file b` ...
# echo "Hello World!" : basic hello world program 
# head -M <file a> : print the first M lines of `file a`
# tail -M <file a> : print the last M lines of `file a`
# wget <web address> : download a file from a web address 
# mkdir <directory> : create a new directory 
# touch <file a> : create an empty file
# rm <file> : remove a file (cannot be undone)
# rm -Rf <folder> : remove a folder and all of its contents (cannot be undone)
# find <folder> : print the files in the directory and all of its sub-directories 
# This is a comment!

Be extremely cautious when you are using the rm command, as its action cannot be undone. This is not like placing an item in the trash bin on your desktop. Once you rm a file or folder you can never get it back. You have been warned!
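
One way to practice the commands above without risking real files is to work inside a throwaway directory. The sketch below assumes mktemp is installed (it is on most Linux and macOS systems); the names sandbox, project, and notes.txt are made up for illustration.

```bash
# create a scratch directory so that nothing important can be removed by mistake
# ($( ) captures the output of a command so it can be stored in a variable)
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir project                          # create a new directory
touch project/notes.txt                # create an empty file
cp project/notes.txt project/copy.txt  # notes.txt still remains
mv project/copy.txt project/moved.txt  # copy.txt is now called moved.txt
ls project                             # list the contents of project/
find project                           # project/ and everything under it

rm project/moved.txt                   # remove a single file (cannot be undone)
remaining=$(ls project)                # only notes.txt should be left
echo "$remaining"

cd /
rm -Rf "$sandbox"                      # remove the scratch directory and all of its contents
```

Because everything happens under the directory that mktemp created, the final rm -Rf can only delete files made by this example.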

It's important to note that you can always get help with a command by viewing its man page (man is short for manual). While the man page can be helpful, Google is perhaps even more helpful! As is the tradition of shell scripting, asking for help often seems to lead to RTFM!


In [2]:
%%bash 
man echo # bold and underlined text may print out a little weird in a notebook


ECHO(1)                   BSD General Commands Manual                  ECHO(1)

NAME
     echo -- write arguments to the standard output

SYNOPSIS
     echo [-n] [string ...]

DESCRIPTION
     The echo utility writes any specified operands, separated by single
     blank (` ') characters and followed by a newline (`\n') character, to
     the standard output.

     The following option is available:

     -n    Do not print the trailing newline character.  This may also be
           achieved by appending `\c' to the end of the string, as is done
           by iBCS2 compatible systems.  Note that this option as well as
           the effect of `\c' are implementation-defined in IEEE Std
           1003.1-2001 (``POSIX.1'') as amended by Cor. 1-2002.  Applica-
           tions aiming for maximum portability are strongly encouraged to
           use printf(1) to suppress the newline character.

     Some shells may provide a builtin echo command which is similar or
     identical to this utility.  Most notably, the builtin echo in sh(1)
     does not accept the -n option.  Consult the builtin(1) manual page.

EXIT STATUS
     The echo utility exits 0 on success, and >0 if an error occurs.

SEE ALSO
     builtin(1), csh(1), printf(1), sh(1)

STANDARDS
     The echo utility conforms to IEEE Std 1003.1-2001 (``POSIX.1'') as
     amended by Cor. 1-2002.

BSD                             April 12, 2003                             BSD
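
The man page above mentions that some shells provide their own builtin echo. Bash is one of them, so Bash's help builtin is another place to look for documentation. A small sketch follows; the variable name kind is only for illustration.

```bash
# ask Bash what kind of command `echo` is
# ($( ) captures the output of a command so it can be stored in a variable)
kind=$(type -t echo)
echo "echo is a $kind"

# `help` documents Bash builtins; show only the first few lines
help echo | head -3
```

Running type -t on a program that is not built in, such as wget, reports file instead, in which case man is the place to look.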

I have placed a very basic tab-delimited file in the data/ folder. Let's use some of the above Bash commands to pick the file apart. Since I am working in IPython, I need to add %%bash to the beginning of each cell. You can ignore these lines.


In [3]:
%%bash
ls -l ../data/ # list the ../data/ directory
# note `..` tells us to look back a directory. the -l flag tells ls to print one entry per line in long format (permissions, owner, size, date)

echo " " 
echo "Lets look at the files in this directory"
ls


total 2776
-rw-r--r--@ 1 gditzler  staff  1000571 Jul 29 12:37 american-gut-mf.txt
-rw-r--r--  1 gditzler  staff       82 Jul 29 12:55 eesi-names-mycopy.txt
-rw-r--r--  1 gditzler  staff       81 Sep 17 09:58 eesi-names.txt
-rw-r-----@ 1 gditzler  staff    76480 Sep  2 10:02 ls_orchid.fasta
-rw-r-----@ 1 gditzler  staff     2519 Sep  8 15:08 ls_orchid.gbk
-rw-r-----@ 1 gditzler  staff   235482 Sep  8 15:09 ls_orchid_full.gbk
-rw-r--r--  1 gditzler  staff    74389 Sep  8 20:28 my_example.fasta
-rw-r--r--  1 gditzler  staff     3167 Sep  8 20:28 my_seqs.fa
-rw-r--r--  1 gditzler  staff       10 Jul 29 12:55 second-to-last-user.txt
-rw-r--r--  1 gditzler  staff       25 Sep  8 14:22 simple.dnd
 
Lets look at the files in this directory
Bash-Tutorial.ipynb
BioPython-Tutorial.ipynb
README.md

Let's perform the following tasks now that we know the location of the eesi-names.txt file.

  • print the first line of eesi-names.txt
  • print the last two lines of eesi-names.txt
  • print the contents of eesi-names.txt three times
  • copy the eesi-names.txt to eesi-names-mycopy.txt

In [4]:
%%bash 
head -1 ../data/eesi-names.txt


First Last

In [5]:
%%bash
tail -2 ../data/eesi-names.txt


Yemin Lan
Steve Pastor

In [6]:
%%bash
cat ../data/eesi-names.txt ../data/eesi-names.txt ../data/eesi-names.txt


First Last
Gail  Rosen
Gregory Ditzler
Erin Reichenberger
Yemin Lan
Steve Pastor
First Last
Gail  Rosen
Gregory Ditzler
Erin Reichenberger
Yemin Lan
Steve Pastor
First Last
Gail  Rosen
Gregory Ditzler
Erin Reichenberger
Yemin Lan
Steve Pastor

In [7]:
%%bash 
cp ../data/eesi-names.txt ../data/eesi-names-mycopy.txt
# check to make sure it's there
ls -l ../data/


total 2776
-rw-r--r--@ 1 gditzler  staff  1000571 Jul 29 12:37 american-gut-mf.txt
-rw-r--r--  1 gditzler  staff       81 Sep 17 10:05 eesi-names-mycopy.txt
-rw-r--r--  1 gditzler  staff       81 Sep 17 09:58 eesi-names.txt
-rw-r-----@ 1 gditzler  staff    76480 Sep  2 10:02 ls_orchid.fasta
-rw-r-----@ 1 gditzler  staff     2519 Sep  8 15:08 ls_orchid.gbk
-rw-r-----@ 1 gditzler  staff   235482 Sep  8 15:09 ls_orchid_full.gbk
-rw-r--r--  1 gditzler  staff    74389 Sep  8 20:28 my_example.fasta
-rw-r--r--  1 gditzler  staff     3167 Sep  8 20:28 my_seqs.fa
-rw-r--r--  1 gditzler  staff       10 Jul 29 12:55 second-to-last-user.txt
-rw-r--r--  1 gditzler  staff       25 Sep  8 14:22 simple.dnd
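
The head -1 and tail -2 shorthand used above is widely accepted, but the spelled-out -n form is the one that POSIX specifies. The sketch below uses a scratch copy of the names so that it can be run anywhere; the temporary file is only a stand-in for ../data/eesi-names.txt, and the variable names are made up.

```bash
# scratch stand-in for ../data/eesi-names.txt
# (the > redirect is covered in the Pipes and Redirects section)
names_file=$(mktemp)
printf '%s\n' "First Last" "Gail Rosen" "Gregory Ditzler" \
  "Erin Reichenberger" "Yemin Lan" "Steve Pastor" > "$names_file"

# $( ) captures the output of a command so it can be stored in a variable
first_line=$(head -n 1 "$names_file")   # same as head -1
last_two=$(tail -n 2 "$names_file")     # same as tail -2
line_count=$(wc -l < "$names_file")     # wc -l counts the lines in the file

echo "$first_line"
echo "$last_two"
echo "lines: $line_count"

rm "$names_file"
```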

Variables & Programming Structures

Just as with any other programming language, Bash has variables. They can hold numbers, strings, or arrays. In this section we go over some basic types and how we can manipulate them. We define our variables much as we would in any other programming language (with no spaces around the = sign); however, when we access them we need to place $ in front of the name. For example,


In [8]:
%%bash 
my_var=Greg
echo "Hello $my_var"
n=1
echo $(($n+1))


Hello Greg
2

In this section, we are going to take care of a couple of topics (arrays, for loops and if statements). Something to keep in mind about bash is that it is very picky when it comes to whitespace. Sometimes it matters and sometimes it doesn’t! This section will bring up some of the times when it is going to matter.
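
Variable assignment is one place where whitespace matters. The sketch below shows what happens with and without spaces around the = sign; the variable names are made up, and it assumes there is no program called my_total on your system.

```bash
count=3                 # correct: no spaces around =
greeting="Hello there"  # quotes keep the space inside a single value
echo "$greeting, count is $count"

# with spaces, Bash tries to run a command called my_total with the arguments = and 5;
# 2>/dev/null hides the resulting "command not found" message
my_total = 5 2>/dev/null || echo "assignment with spaces failed"

total=$((count + 2))    # inside $(( )) spaces are fine and the $ is optional
echo "$total"
```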

First, let us define an array by using parentheses and separating each of the entries with a space. All of the objects in our array are of the same type in that there is nothing special about them. In general, curly brackets in Bash are used to group things together and square brackets are used to index something. Indexing starts at 0, so position 1 is the second entry. We can use the @ symbol as an index to list all of the entries, which we are going to need for the for loop. As shown below, the for loop is pretty boilerplate compared to other scripting languages; however, the if statement is a bit different. Notice that there is whitespace padded inside of the square brackets. Removing this space will produce an error. This is one of those times where whitespace makes a difference. Furthermore, the test for equality is performed using a single = symbol, which is different than most other programming languages. Refer to this website for many examples of using conditional statements with Bash.


In [9]:
%%bash
names=( Gail Yemin Greg Cricket Steve )
echo "The entry in position 1 is ${names[1]}"
for name in "${names[@]}"; do 
  echo "$name"
  if [ "$name" = "Greg" ]; then 
    echo "${name}ory"
  fi
done
echo "${names[@]}"


The entry in position 1 is Yemin
Gail
Yemin
Greg
Gregory
Cricket
Steve
Gail Yemin Greg Cricket Steve
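
Quoting "${names[@]}" starts to matter once an array entry contains a space, and the spaces around = inside the square brackets matter too. The sketch below illustrates both points; the array contents and the counter names are made up.

```bash
names=( Gail "Yemin Lan" Greg )          # the quoted entry contains a space

echo "number of entries: ${#names[@]}"   # ${#names[@]} is the length of the array
echo "first entry: ${names[0]}"          # indexing starts at 0

# quoted: one pass through the loop per array entry
quoted=0
for name in "${names[@]}"; do
  quoted=$((quoted + 1))
done

# unquoted: "Yemin Lan" is split into two separate words
unquoted=0
for name in ${names[@]}; do
  unquoted=$((unquoted + 1))
done
echo "quoted passes: $quoted, unquoted passes: $unquoted"

# without spaces around =, the test sees one non-empty string and always succeeds
if [ "Gail"="Greg" ]; then
  echo "missing spaces around = make this test always true"
fi
```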

Pipes and Redirects

Pipes allow us to take the output of one program and feed it into the input of another program. While this concept is very simple, it will allow us to build very complex expressions. Let us do an example to see how this works. A pipe is given by "|". Let's say that we want to print out the second-to-last name in eesi-names.txt. We know that head prints out the beginning of a file and tail prints out the end of a file. We can use tail to print out the last two names, then head to take the output from tail and keep only the first of them, which is the second-to-last name. In code this is given by


In [10]:
%%bash 
cat ../data/eesi-names.txt | tail -2 | head -1


Yemin Lan

Let us finish this example by using a redirect. Redirects allow us to send the standard output to a file. That is, what would be printed to the screen is written to a file instead. Let us redirect the second-to-last EESI name to a file. Note there are other, more clever ways to use redirects; however, we are only covering one usage.


In [11]:
%%bash 
cat ../data/eesi-names.txt | tail -2 | head -1 > ../data/second-to-last-user.txt
cat ../data/second-to-last-user.txt


Yemin Lan
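
For the curious, here is a short sketch of two other common redirects: >> to append to a file and 2> to capture error messages. The scratch files come from mktemp and the path /no/such/path is deliberately one that should not exist; all names are made up for illustration.

```bash
out_file=$(mktemp)   # scratch files standing in for real output files
err_file=$(mktemp)

echo "first line"  > "$out_file"    # > creates the file or overwrites it
echo "second line" >> "$out_file"   # >> appends to the end of the file
echo "third line"  >> "$out_file"

# 2> sends error messages (standard error) to a file instead of the screen
ls /no/such/path 2> "$err_file" || echo "ls failed, its message was saved to a file"

# pipes can read the file back: count its lines and pick out the second one
line_count=$(cat "$out_file" | wc -l)
second=$(cat "$out_file" | head -2 | tail -1)
echo "$line_count lines, the second one is: $second"

rm "$out_file" "$err_file"
```

Note that > silently replaces whatever the file held before, so >> is the safer choice when a file should accumulate output.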

Basics of regular expressions

In this section, we are going to look at a couple of commands that use regular expressions, or regex for short. The first command we want to look at is grep. The grep utility searches any given input files, selecting lines that match one or more patterns. grep can do much more; however, just printing out certain lines of a file is powerful enough on its own because it can lead to further manipulation. Let's come back to eesi-names.txt and print out only the lines that start with the pattern G (^ denotes the start of a line in the expression, and $ denotes the end of a line).


In [12]:
%%bash
cat ../data/eesi-names.txt | grep '^G'
echo " "
cat ../data/eesi-names.txt | grep 'Yemin'


Gail  Rosen
Gregory Ditzler
 
Yemin Lan
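
A few more grep patterns and flags are sketched below: the $ anchor, -c to count, -v to invert the match, and -i to ignore case. The scratch file is only a stand-in for ../data/eesi-names.txt, and the variable names are made up.

```bash
# scratch stand-in for ../data/eesi-names.txt
names_file=$(mktemp)
printf '%s\n' "First Last" "Gail Rosen" "Gregory Ditzler" \
  "Erin Reichenberger" "Yemin Lan" "Steve Pastor" > "$names_file"

ends_in_r=$(grep 'r$' "$names_file")           # $ anchors the pattern to the end of a line
g_count=$(grep -c '^G' "$names_file")          # -c counts matching lines instead of printing them
not_g=$(grep -v '^G' "$names_file" | wc -l)    # -v keeps the lines that do NOT match
any_case=$(grep -i 'yemin lan' "$names_file")  # -i ignores upper and lower case

echo "$ends_in_r"
echo "lines starting with G: $g_count"
echo "lines not starting with G: $not_g"
echo "$any_case"

rm "$names_file"
```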

The final regex tool we are going to look at is sed. This is a very powerful tool; however, we are only going to be interested in its find/replace functionality. We give sed an expression of the form s/pattern/replacement/g, telling it the pattern we want to search for and the text that we want to replace it with; the trailing g replaces every match on a line rather than only the first. For example, let's find any occurrence of Gregory and replace it with Greg in eesi-names.txt. Have a close look at how we are calling sed in the example below.


In [13]:
%%bash 
cat ../data/eesi-names.txt | sed -e 's/Gregory/Greg/g'

echo " "
# we can also group things and replace them. see if you can tell what's going on here
cat ../data/eesi-names.txt  | sed -e 's/\(^G[a-z]*\)/\1 MIDDLE /g'


First Last
Gail  Rosen
Greg Ditzler
Erin Reichenberger
Yemin Lan
Steve Pastor
 
First Last
Gail MIDDLE   Rosen
Gregory MIDDLE  Ditzler
Erin Reichenberger
Yemin Lan
Steve Pastor
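
To unpack the grouped expression: \( and \) capture whatever the enclosed pattern matches, and \1 in the replacement puts that captured text back. The sketch below repeats the idea on single strings and adds a second group; the input strings and variable names are made up for illustration.

```bash
# one capture group: \1 is whatever \( \) matched (here, the word Gregory)
with_middle=$(echo "Gregory Ditzler" | sed -e 's/\(^G[a-z]*\)/\1 MIDDLE/')
echo "$with_middle"

# two capture groups: \1 is the first word and \2 the second, so we can swap them
swapped=$(echo "Gregory Ditzler" | sed -e 's/^\([A-Za-z]*\) \([A-Za-z]*\)$/\2, \1/')
echo "$swapped"

# without the trailing g only the first match on each line is replaced
first_only=$(echo "a-b-c" | sed -e 's/-/+/')
every_one=$(echo "a-b-c" | sed -e 's/-/+/g')
echo "$first_only $every_one"
```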

Editing files

There are several flavors of text editors for the shell. Some are:

If you are interested in looking through a file but not editing it, I would recommend using less or more.

Example: Downloading photos from an HTML file

Let's say I need to download some photos from a website, and I am far too lazy to right click on every photo and save it to my computer. I am, however, stubborn enough to write some Bash code to parse an HTML file and download the photos without needing to right click a single image! In this example, we are going to download all of the photos from Dr. Rosen's EESI webpage. Surprisingly, given the few simple commands we have learned, this can be achieved with just a few lines of code.

This problem is actually very easy to solve with Bash. First of all, let's think about the logical steps that need to be accomplished and how we can use these basic commands to achieve this task.

  • Download the raw PHP file. This is really the first step and you have already been given the web address! The file can be downloaded with wget and, after looking through the file, we find that the image location is specified with src=<path to the file>. Bash tools: wget.
  • Find the links to the images. We already know where they are because each one appears after src= in a line of the PHP file. However, src is not used only for images; it could also point to JavaScript. Therefore, we only keep the lines with the jpg file extension. The links to the images should be saved into a variable that we can loop over. Bash tools: cat, grep, and sed.
  • Download the images. Self-explanatory! Bash tools: wget.

Three (actual) lines of code later!


In [14]:
%%bash 
# rm *.jpg
# rm people.php*
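web_home=http://www.ece.drexel.edu/gailr/EESI
# download the raw page; -q keeps wget quiet
wget -q ${web_home}/people.php
# keep the lines with a jpg in a src= attribute, then strip everything except the path
image_array=$(cat people.php | grep -E "src=.*jpg" | sed -e "s/.*src=\"\(.*\.jpg\)\".*/\1/g")

# image_array is a whitespace-separated string, so it is left unquoted to loop over each path
for image in ${image_array}; do 
  wget -q $web_home/$image
done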
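
To see what the grep and sed steps do without touching the network, the same pipeline can be tried on a small hand-written page. In the sketch below the HTML, the file names, and the www.example.com address are all made up, and the real people.php may look different. The paths are stored in a true Bash array this time, and echo stands in for the actual download.

```bash
# hypothetical page contents, written to a scratch file
page=$(mktemp)
cat > "$page" <<'EOF'
<script src="js/menu.js"></script>
<p>Lab members</p>
<img class="photo" src="images/alice.jpg" alt="Alice">
<img class="photo" src="images/bob.jpg" alt="Bob">
EOF

web_home=http://www.example.com/lab   # placeholder address, not a real site

# same grep | sed pipeline as above; the outer ( ) turns the result into an array
image_array=( $(cat "$page" | grep -E "src=.*jpg" | sed -e "s/.*src=\"\(.*\.jpg\)\".*/\1/g") )

echo "found ${#image_array[@]} images"
for image in "${image_array[@]}"; do
  echo "would run: wget -q $web_home/$image"   # echo instead of downloading
done

rm "$page"
```

The script line is skipped because it never matches jpg, which is exactly the filtering described in the second step of the list.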

Example: Need a bioinformatics example

Ideas

  • examine a qiime map file
  • clean up a blast output
  • ....

Writing simple Bash scripts

Writing a Bash script is relatively easy once we know the commands. In fact, it's much like writing a script in any other language, such as Matlab. There are a few subtle differences from Matlab though. First, we can pass in an arbitrary number of arguments and access them with $1, $2, ..., where $1 is the first argument, $2 is the second argument, etc. Second, we should stick with a convention for our file names. Therefore, we will use the .sh extension to denote that the script is a shell script. Finally, we are going to add #!/usr/bin/env bash to the top of every file. This line tells the system to run the script with Bash, as opposed to Python or Awk, when the file is executed directly. Here is an example of a very simple script:


In [15]:
%%bash 
cat ../examples/bash-script.sh


#!/usr/bin/env bash 

# this is a basic bash script
f_name=$1 # get the first argument 
l_name=$2 # get the second argument 
echo "Hello $f_name $l_name"

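To run the script, call sh followed by the name of the script and any arguments. Note that calling sh explicitly bypasses the #! line, so scripts that rely on Bash-only features such as arrays should be run with bash instead.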


In [16]:
%%bash 
sh ../examples/bash-script.sh Greg Ditzler


Hello Greg Ditzler
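
A script can also check how many arguments it was given before using them. The sketch below writes a hypothetical script, greet.sh, into a scratch directory and runs it twice; the script name, its messages, and the variable names are made up for illustration.

```bash
# write a hypothetical script, greet.sh, into a scratch directory
script_dir=$(mktemp -d)
cat > "$script_dir/greet.sh" <<'EOF'
#!/usr/bin/env bash
# $# holds the number of arguments that were passed in
if [ $# -lt 2 ]; then
  echo "usage: $0 <first name> <last name>"
  exit 1
fi
echo "Hello $1 $2 (you passed $# arguments)"
EOF

chmod +x "$script_dir/greet.sh"   # mark it executable so it can also be run as ./greet.sh

greeting=$(bash "$script_dir/greet.sh" Greg Ditzler)
echo "$greeting"

# with too few arguments the script prints its usage line and exits with status 1
status=0
bash "$script_dir/greet.sh" Greg || status=$?
echo "exit status: $status"

rm -Rf "$script_dir"
```

A non-zero exit status is the usual way for a script to tell its caller that something went wrong, which matches the EXIT STATUS section of the echo man page shown earlier.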