Unix for Bioinformatics

Introduction

Unix is the standard operating system on most large computer systems in scientific research, in the same way that Microsoft Windows is the dominant operating system on desktop PCs.

Unix and MS Windows both perform the important job of managing the computer's hardware (screen, keyboard, mouse, hard disks, network connections, etc.) on your behalf. They also provide you with tools to manage your files and to run application software, and both offer a graphical user interface (desktop). These desktop interfaces look different, use different names for things (e.g. directory versus folder) and use different icons, but they mostly offer the same functionality.

Unix is a powerful, secure, robust and stable operating system which allows dozens of people to run programs on the same computer at the same time. This is why it is the preferred operating system for large-scale scientific computing. Unix-like systems run on all kinds of machines, from mobile phones (Android is built on the Linux kernel) and desktop PCs to supercomputers.

Why Unix?

Increasingly, the output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suitable for working with such files and has several powerful and flexible commands that can be used to process and analyse this data. One advantage of learning Unix is that many of the commands can be combined in an almost unlimited fashion. So if you can learn just six Unix commands, you will be able to do a lot more than just six things.
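As a rough illustration of how commands combine, the sketch below chains a few standard commands with pipes (`|`), so that the output of one becomes the input of the next. The gene names are made-up example data, and the variable names (`genes`, `counts`, `n_unique`) are only for illustration; each command is introduced properly later in the tutorial.

```shell
# Hypothetical gene names, one per line, with duplicates (made-up data).
genes=$(printf 'geneB\ngeneA\ngeneB\ngeneC\ngeneA\ngeneB\n')

# sort puts identical lines next to each other;
# uniq -c collapses each run of identical lines and prefixes it with a count.
counts=$(printf '%s\n' "$genes" | sort | uniq -c)
printf '%s\n' "$counts"

# Swapping the last step answers a different question:
# wc -l counts lines, so this reports how many distinct names there are.
# tr -d ' ' strips the padding some versions of wc add.
n_unique=$(printf '%s\n' "$genes" | sort | uniq | wc -l | tr -d ' ')
echo "$n_unique"
```

The same three or four commands, rearranged, answer different questions about the same file, which is why a small vocabulary goes a long way.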

Unix contains hundreds of commands, but to conduct your analysis you will probably only need 10 or so to achieve most of what you want to do. In this tutorial we will introduce you to some basic Unix commands followed by some more advanced commands and provide examples of how they can be used in bioinformatics analyses.

Learning outcomes

This tutorial consists of two sections, Introduction to UNIX and Advanced UNIX for Bioinformatics. By the end of the first section you can expect to be able to:

  • Describe why UNIX is suitable for analysing next-generation sequencing (NGS) data
  • Know what the UNIX command line is
  • Understand the UNIX directory structure and navigate around this structure
  • Manipulate (move, copy and delete) files using the command line
  • Look at and sort the contents of a file
  • Find the unique items in a list
  • Use the man command to find out more information about UNIX commands

By the end of the second section you can expect to be able to:

  • Extract information from large files
  • Use regular expressions to search for particular patterns in a file
  • Use the AWK programming language to extract and filter information from a file
  • Create a bash script to perform several tasks at once
  • Use a bash loop to perform the same task several times

Sections of the Unix tutorial

Introduction to UNIX comprises the following sections:

  1. Basic unix
  2. Files

Advanced UNIX for Bioinformatics comprises the following sections:

  1. grep
  2. awk
  3. Bash scripts

Authors

This tutorial was created by Jacqui Keane and Martin Hunt.

Running the commands from this tutorial

You can run the commands in this tutorial either directly from the Jupyter notebook (if using Jupyter), or by typing the commands in your terminal window.

Running commands on Jupyter

If you are using Jupyter, command cells (like the ones below) can be run by selecting the cell and either clicking Cell -> Run in the menu above or pressing Ctrl+Enter. Let's give this a try by printing our working directory with the pwd command and then listing the files within it with ls. Run the commands in the two cells below.


In [ ]:
pwd

In [ ]:
ls -l

Running commands in the terminal

You can also follow this tutorial by typing all the commands you see into a terminal window. This is similar to the "Command Prompt" window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, select the cell below with the mouse and then either press Ctrl+Enter or choose Cell -> Run in the menu at the top of the page.


In [ ]:
echo cd $PWD

Now open a new terminal on your computer and type the command that was output by the previous cell followed by the enter key. The command will look similar to this:


In [ ]:
cd /home/manager/pathogen-informatics-training/Notebooks/Unix/

Now you can follow the instructions in the tutorial from here.
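If you want to confirm that the terminal ended up in the right place, a quick check might look like the sketch below. It uses a temporary directory as a stand-in for the notebook directory, and `notebook_dir` and `here` are hypothetical names chosen for this example; in practice you would use the path printed by the earlier cell. Quoting the path is a precaution for directory names that contain spaces.

```shell
# Stand-in for the notebook directory (hypothetical; substitute the path
# printed by the "echo cd $PWD" cell).
notebook_dir=$(mktemp -d)

# Quote the path so that a directory name containing spaces still works.
cd "$notebook_dir"

# pwd prints the directory the shell is now in; ls -l lists its contents.
here=$(pwd)
echo "$here"
ls -l
```

If the path printed by pwd matches the one shown in the notebook, the commands you type in the terminal will see the same files as the notebook cells.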

Cheat sheet

We've also included a cheat sheet. It probably won't make a lot of sense now, but it might be a useful reminder of this module later in the tutorial.

Let’s get started!

To get started with the tutorial, head to the first section: Basic unix