BASH scripts

So far, we have run single commands in a terminal. However, it is useful to be able to run multiple commands that process some data and produce output. These commands can be put into a separate file (ie a script), and then run on the input data. This has the advantage of reproducibility, so that the same analysis can be run on many input data sets.

First script

It is traditional when learning a new language (in this case BASH), to write a simple script that says "Hello World!". We will do this now.

First, open a terminal and make a new directory in your home called scripts, by typing

cd
mkdir ~/scripts

Next open a text editor, which you will use to write the script. What text editors are available will depend on your system. For example, gedit in Linux. Do not try to use a word processor, such as Word! If you don't already have a favorite, try gedit by running the following command:

gedit &

Type this into the text editor:

echo Hello World!

and save this to a file called hello.sh in your new scripts directory. This script will print Hello World! to the screen when we run it. First, in your terminal, check that the script is saved in the correct place.



In [ ]:

    
cd scripts
ls hello.sh

If everything is OK, then next try to run the script. For now, we need to tell Unix that this is a bash script, and where it is:



In [ ]:

    
bash hello.sh

Setting up a scripts directory

It would be nice if all our scripts could simply be run from anywhere in the filesystem, without having to tell Unix where the script is, or that it is a BASH script. This is how the built-in commands work, such as cd or ls.

To tell Unix that the script is a BASH script, make this the first line of the script:

#!/usr/bin/env bash

and remember to save the script again. This special line at the start of the file tells Unix that the file is a bash script, so that it expects bash commands throughout the file. There is one more change to be made to the file to tell Unix that it is a program to be run (it is "executable"). This is done with the command chmod. Type this into the terminal to make the file executable:



In [ ]:

    
chmod +x hello.sh

Now, the script can be run, but we must still tell Unix where the script is in the filesystem. In this case, it is in the current working directory, which is called "./".



In [ ]:

    
./hello.sh

The final thing to do is change our setup so that Unix can find the script without us having to explicitly say where it is. Whenever a command is typed into Unix, it has a list of directories that it searches through to look for the command. We need to add the new scripts directory to that list of directories. Try typing

echo $PATH

It returns a list of directories, which are all the places Unix will look for a command. Before we add the scripts directory to this list, check what happens if we try to run the script without telling Unix where it is:

hello.sh
bash: hello.sh: command not found

Unix did not find it! The command to run to add the scripts directory to $PATH is:

export PATH=$PATH:~/scripts/

If you want this change to be permanent, ie so that Unix finds your scripts after you restart or logout and login again, add that line to the end of a file called ~/.bashrc. If you are using a Mac, then the file should instead be ~/.bash_profile. If the file does not already exist, then create it and put that line into it.

The following command is only here so that this notebook finds scripts correctly and the remaining examples work. Do not type the next command into your terminal.



In [ ]:

    
PATH=$PATH:$PWD # do not type this into your terminal!

Now the script works, no matter where we are in the filesystem. Unix will check the scripts directory and find the file hello.sh. You can be anywhere in your filesystem, and simply running

hello.sh

will always work. Try it now.



In [ ]:

    
hello.sh

In general, when making a new script, you can now copy and edit an existing script, or make a new one like this:

cd ~/scripts
touch my_script.sh
chmod +x my_script.sh

and then open my_script.sh in a text editor.

Getting options from the terminal and printing a help message

Usually, we would like a script to read in options from the user, such as the name of an input file. This would mean a script can be run like this:

my_script.sh input_file

Inside the script, the parameters provided by the user are given the names $1, $2, $3 etc (do not confuse these with column names used by awk!). Here is a simple example that expects the user to provide a filename and a number. The script simply prints the filename to the screen, and then the first few lines of the file (the number of lines is determined by the number given by the user).



In [ ]:

    
cat options_example.sh



In [ ]:

    
options_example.sh test_file 2

The options have been used by the script, but the script itself is not very readable. It is better to use names instead of $1 and $2. Here is an improved version of the script that does exactly the same as the previous script, but is more readable.



In [ ]:

    
cat options_example.2.sh

Checking options from the user

The previous scripts will have strange behaviour if the input is not as expected by the script. Many things could go wrong. For example:

The wrong number of options are given by the user
The input file does not exist.

Try running the script with different options and see what happens.

A convention with scripts is that it should output a help message if it is not run correctly. This shows anyone how the script should be run (including you!) without having to look at the code inside the script.

A basic check for this script would be to verify that two options were supplied, and if not then print a help message. The code looks like this:

if [ $# -ne 2 ]
then
        echo "usage: options_example.3.sh filename number_of_lines"
        echo
        echo "Prints the filename, and the given first number of lines of the file"
        exit
fi

You can copy this code into the start of any of your scripts, and easily modify it to work for that script. A little explanation:

A special variable $# has been used, which is the number of options that were given by the user.
The whole block of code has the form "if [ $# -ne 2 ] then .... fi". This only runs the code between the then and fi, if $# (the number of options) is not 2.
The line exit simply makes the script end, so that no more code is run.



In [ ]:

    
options_example.3.sh

Another check is that the input file really does exist. If it does not exist, then there is no point in trying to run any more code. This can be checked with another if ... then ... fi block of code:

if [ ! -f $filename ]

then
    echo "File '$filename' not found! Cannot continue"
    exit
fi

Putting this all together, the script now looks like this:



In [ ]:

    
cat options_example.3.sh

Two new features have also been introduced in this file:

The second line is "set -eu". Without this line, if any line produces an error, the script will carry on regardless to the end of the script. Using the -e option, an error anywhere in the file will result in the script stopping at the line that produced the error, instead of continuing. In general, it is best that the script stops at any error. The -u creates an error if you try to use a variable which doesn't exist. This helps to stop typos doing bad things to your analysis.
There are several lines starting with a hash #. These lines are "comment lines" that are not run. They are used to document the code, containing explanations of what is happening. It is good practice to comment your scripts!

The above script provides a template for writing your own scripts. The general method is:

Tell Unix that this is a BASH script, and to stop at the first error.
Check if the user ran the script correctly. If not, output a message telling the user how to run the script.
Check the input looks OK (in this case, that the input file exists).
Process the input.

Using variables to store output from commands

It can be useful to run a command and put the results into a variable. Recall that we stored the input from the user in sensibly named variables:

filename=$1

The part after the equals sign could actually be any command that returns some output. For example, running this in Unix

wc -l filename | awk '{print $1}'

returns the number of lines. In case you are wondering why the command includes | awk '{print $1}', check what happens with and without the pipe to awk:



In [ ]:

    
wc -l options_example.3.sh



In [ ]:

    
wc -l options_example.3.sh | awk '{print $1}'

With a small change, this can be stored in a variable and then used later.



In [ ]:

    
filename=options_example.3.sh
line_count=$(wc -l $filename | awk '{print $1}')
echo There are $line_count lines in the file $filename

Repeating analysis with loops

It is common in Bioinformatics to run the same analysis on many files. Suppose we had a script that ran one type of analysis, and wanted to repeat the same analysis on 100 different files. It would be tedious, and error-prone, to write the same command 100 times. Instead we can use a loop. As an example, we will just run the Unix command wc on each file but instead, in reality this would be a script that runs in-depth analysis. We can run wc on each of the files in the directory loop_files/ with the following command.



In [ ]:

    
for filename in loop_files/*; do wc $filename; done

Exercises

Write a script that gets a filename from the user. If the file exists, it prints a nice human-readable message telling the user how many lines are in the file.
Use a loop to run the script from Exercise 1 on the files in the directory loop_files/.
Write a script that takes a GFF filename as input. Make the script produce a summary of various properties of the file. There is an example input file provided called bash_scripts/exercise_3.gff. Use your imagination! You could have a look back at the awk section of the course for inspiration. Here are some ideas you may wish to try:
- Does the file exist?
- How many records (ie lines) are in the file?
- How many genes are in the file?
- Is the file badly formatted in any way (eg wrong number of columns, do the coordinates look like numbers)?