So far, we have run single commands in a terminal. However, it is often useful to run multiple commands that process some data and produce output. These commands can be put into a separate file (i.e. a script) and then run on the input data. This has the advantage of reproducibility: the same analysis can be run on many input data sets.
It is traditional when learning a new language (in this case bash) to write a simple script that prints "Hello World!". We will do this now.
First, open a terminal and make a new directory in your home called scripts, by typing:
cd
mkdir ~/scripts
Next, open a text editor, which you will use to write the script. Which text editors are available will depend on your system (for example, gedit on Linux). Do not try to use a word processor, such as Word! If you don't already have a favorite, try gedit by running the following command:
gedit &
Type this into the text editor:
echo Hello World!
and save this to a file called hello.sh in your new scripts directory. This script will print Hello World! to the screen when we run it. First, in your terminal, check that the script is saved in the correct place.
In [ ]:
cd scripts
ls hello.sh
If everything is OK, then next try to run the script. For now, we need to tell Unix that this is a bash script, and where it is:
In [ ]:
bash hello.sh
It would be nice if all our scripts could simply be run from anywhere in the filesystem, without having to tell Unix where the script is, or that it is a bash script. This is how the built-in commands work, such as cd or ls.
To tell Unix that the script is a BASH script, make this the first line of the script:
#!/usr/bin/env bash
and remember to save the script again. This special line at the start of the file tells Unix that the file is a bash script, so that it expects bash commands throughout the file. There is one more change to be made to the file, to tell Unix that it is a program to be run (that it is "executable"). This is done with the command chmod. Type this into the terminal to make the file executable:
In [ ]:
chmod +x hello.sh
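To see what chmod actually changed, compare the file's permissions before and after with ls -l. This is a throwaway sketch in a temporary directory, so it does not touch your real ~/scripts:

```shell
# Throwaway demonstration in a temporary directory (the temporary
# directory is only for this demo; on your machine use ~/scripts)
tmpdir=$(mktemp -d)
echo 'echo Hello World!' > $tmpdir/hello.sh

ls -l $tmpdir/hello.sh    # typically -rw-r--r-- : no "x", not executable
chmod +x $tmpdir/hello.sh
ls -l $tmpdir/hello.sh    # now has "x" bits, e.g. -rwxr-xr-x : executable

rm -r $tmpdir             # clean up
```

The "x" characters in the permissions column are what tell Unix the file may be run as a program.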
Now the script can be run, but we must still tell Unix where the script is in the filesystem. In this case, it is in the current working directory, which Unix calls ".", so "./hello.sh" means "hello.sh in the current directory".
In [ ]:
./hello.sh
The final thing to do is change our setup so that Unix can find the script without us having to explicitly say where it is. Whenever a command is typed into Unix, it has a list of directories that it searches through to look for the command. We need to add the new scripts directory to that list of directories. Try typing
echo $PATH
It returns a list of directories, which are all the places Unix will look for a command. Before we add the scripts directory to this list, check what happens if we try to run the script without telling Unix where it is:
hello.sh
bash: hello.sh: command not found
Unix did not find it! The command to run to add the scripts directory to $PATH is:
export PATH=$PATH:~/scripts/
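You can check the effect by asking Unix where it now finds the script, using command -v. This sketch is self-contained (it builds a demo directory with mktemp instead of assuming your ~/scripts exists); on your machine you would add ~/scripts instead of $demo_dir:

```shell
# Self-contained demo: make an executable hello.sh in a fresh directory
demo_dir=$(mktemp -d)
printf '#!/usr/bin/env bash\necho Hello World!\n' > $demo_dir/hello.sh
chmod +x $demo_dir/hello.sh

export PATH=$PATH:$demo_dir   # add the directory to the search list
command -v hello.sh           # shows where Unix found the script
hello.sh                      # prints: Hello World!

rm -r $demo_dir               # clean up
```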
If you want this change to be permanent, i.e. so that Unix finds your scripts after you restart or log out and log in again, add that line to the end of a file called ~/.bashrc. If you are using a Mac, the file should instead be ~/.bash_profile. If the file does not already exist, create it and put that line into it.
The following command is only here so that this notebook finds scripts correctly and the remaining examples work. Do not type the next command into your terminal.
In [ ]:
PATH=$PATH:$PWD # do not type this into your terminal!
Now the script works, no matter where we are in the filesystem. Unix will check the scripts directory and find the file hello.sh. You can be anywhere in your filesystem, and simply running
hello.sh
will always work. Try it now.
In [ ]:
hello.sh
In general, when making a new script, you can now copy and edit an existing script, or make a new one like this:
cd ~/scripts
touch my_script.sh
chmod +x my_script.sh
and then open my_script.sh in a text editor.
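Equivalently, you can create the file, its shebang line, and its permissions in one go (my_script.sh is just a placeholder name; the demo runs in a temporary directory so it is self-contained, but on your machine you would cd ~/scripts):

```shell
# Self-contained demo of creating a new, ready-to-edit script
cd $(mktemp -d)                                 # stand-in for ~/scripts
printf '#!/usr/bin/env bash\n' > my_script.sh   # shebang as the first line
chmod +x my_script.sh                           # make it executable

head -n 1 my_script.sh    # prints: #!/usr/bin/env bash
```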
Usually, we would like a script to read in options from the user, such as the name of an input file. This would mean a script can be run like this:
my_script.sh input_file
Inside the script, the parameters provided by the user are given the names $1, $2, $3, etc. (do not confuse these with the column names used by awk!). Here is a simple example that expects the user to provide a filename and a number. The script simply prints the filename to the screen, and then the first few lines of the file (the number of lines is determined by the number given by the user).
In [ ]:
cat options_example.sh
In [ ]:
options_example.sh test_file 2
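The cat cell above shows the real contents of options_example.sh; as a hedged sketch (an assumption, not a copy of the file), a minimal version doing what the text describes could be:

```shell
#!/usr/bin/env bash
# Sketch of options_example.sh: print the filename given by the user,
# then the first few lines of the file.
# $1 = filename, $2 = number of lines to print
echo $1
head -n $2 $1
```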
The options have been used by the script, but the script itself is not very readable. It is better to use names instead of $1 and $2. Here is an improved version of the script that does exactly the same as the previous script, but is more readable.
In [ ]:
cat options_example.2.sh
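Again as a sketch rather than the exact contents of options_example.2.sh, the readable version simply copies the options into sensibly named variables first, then uses those names:

```shell
#!/usr/bin/env bash
# Sketch of options_example.2.sh: same behaviour as before, but the
# options are given meaningful names as soon as the script starts
filename=$1
number_of_lines=$2

echo $filename
head -n $number_of_lines $filename
```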
The previous scripts will have strange behaviour if the input is not as expected by the script. Many things could go wrong. For example, the user might forget one of the options, or the input file might not exist. Try running the script with different options and see what happens.
A convention with scripts is that it should output a help message if it is not run correctly. This shows anyone how the script should be run (including you!) without having to look at the code inside the script.
A basic check for this script would be to verify that two options were supplied, and if not then print a help message. The code looks like this:
if [ $# -ne 2 ]
then
echo "usage: options_example.3.sh filename number_of_lines"
echo
echo "Prints the filename, and the given first number of lines of the file"
exit
fi
You can copy this code into the start of any of your scripts, and easily modify it to work for that script. A little explanation:
- $# is the number of options that were given by the user.
- "if [ $# -ne 2 ] then ... fi" only runs the code between the then and the fi if $# (the number of options) is not 2.
- exit simply makes the script end, so that no more code is run.
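To see the check in context, here is a complete (hypothetical) script with the usage check at the top; with the wrong number of options it prints the help message and stops, otherwise it carries on as before:

```shell
#!/usr/bin/env bash
# Print the help message and stop unless exactly two options were given
if [ $# -ne 2 ]
then
    echo "usage: options_example.3.sh filename number_of_lines"
    echo
    echo "Prints the filename, and the given first number of lines of the file"
    exit
fi

echo $1
head -n $2 $1
```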
In [ ]:
options_example.3.sh
Another check is that the input file really does exist. If it does not exist, then there is no point in trying to run any more code. This can be checked with another if ... then ... fi
block of code:
if [ ! -f "$filename" ]
then
echo "File '$filename' not found! Cannot continue"
exit
fi
Putting this all together, the script now looks like this:
In [ ]:
cat options_example.3.sh
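The cat cell above shows the real file; as a hedged sketch, a script combining everything so far (the usage check, the file-existence check, named variables, set -eu, and comments) might look like this:

```shell
#!/usr/bin/env bash
set -eu  # stop at the first error; treat unset variables as errors

# Check that the user gave exactly two options, otherwise print help
if [ $# -ne 2 ]
then
    echo "usage: options_example.3.sh filename number_of_lines"
    echo
    echo "Prints the filename, and the given first number of lines of the file"
    exit
fi

# Copy the options into sensibly named variables
filename=$1
number_of_lines=$2

# Check the input file exists before doing anything else
if [ ! -f "$filename" ]
then
    echo "File '$filename' not found! Cannot continue"
    exit
fi

echo $filename
head -n $number_of_lines $filename
```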
Two new features have also been introduced in this file:
- "set -eu". Without this line, if any line produces an error, the script will carry on regardless to the end of the script. Using the -e option, an error anywhere in the file will result in the script stopping at the line that produced the error, instead of continuing. In general, it is best that the script stops at any error. The -u option creates an error if you try to use a variable which doesn't exist. This helps to stop typos doing bad things to your analysis.
- Lines starting with "#". These lines are "comment lines" that are not run. They are used to document the code, containing explanations of what is happening. It is good practice to comment your scripts!

The above script provides a template for writing your own scripts: check the options given by the user, check that the input files exist, and then run the analysis.
It can be useful to run a command and put the results into a variable. Recall that we stored the input from the user in sensibly named variables:
filename=$1
The part after the equals sign could actually be any command that returns some output. For example, running this in Unix
wc -l $filename | awk '{print $1}'
returns the number of lines in the file. In case you are wondering why the command includes | awk '{print $1}', check what happens with and without the pipe to awk:
In [ ]:
wc -l options_example.3.sh
In [ ]:
wc -l options_example.3.sh | awk '{print $1}'
With a small change, this can be stored in a variable and then used later.
In [ ]:
filename=options_example.3.sh
line_count=$(wc -l $filename | awk '{print $1}')
echo There are $line_count lines in the file $filename
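The same $( ) pattern captures the output of any command, not just wc. For example (the use of date and ls here is just an illustration, not part of the course files):

```shell
# Store the output of any command in a variable with $( )
today=$(date +%Y-%m-%d)      # e.g. 2024-01-31
file_count=$(ls | wc -l)     # number of files in the current directory
echo "Today is $today and there are $file_count files here"
```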
It is common in bioinformatics to run the same analysis on many files. Suppose we had a script that ran one type of analysis, and wanted to repeat the same analysis on 100 different files. It would be tedious, and error-prone, to write the same command 100 times. Instead, we can use a loop. As an example, we will just run the Unix command wc on each file, but in reality this would be a script that runs an in-depth analysis. We can run wc on each of the files in the directory loop_files/ with the following command.
In [ ]:
for filename in loop_files/*; do wc $filename; done
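The one-line loop above is convenient at the terminal; written out over several lines, as you would in a script, the same loop looks like this (the demo builds its own small loop_files/ directory in a temporary location, so it runs anywhere):

```shell
# Self-contained demo: make a small loop_files/ directory to loop over
cd $(mktemp -d)
mkdir loop_files
printf 'one\n' > loop_files/file1
printf 'one\ntwo\n' > loop_files/file2

# The same loop as before, spread over several lines for readability
for filename in loop_files/*
do
    wc $filename
done
```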
Exercise: write a script that takes a GFF filename as input. Make the script produce a summary of various properties of the file. There is an example input file provided called bash_scripts/exercise_3.gff. Use your imagination! You could have a look back at the awk section of the course for inspiration. Here are some ideas you may wish to try: