Use the following command to view the list of directories in your path. We do this to make sure we have properly modified the PATH variable.
(1) Enter the result of the echo $PATH command in the following text box:
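The PATH output is one long colon-separated string, which can be hard to scan. As a convenience, the sketch below prints one directory per line and counts them. It assumes a POSIX shell with tr and wc available; the variable names path_dirs and n_dirs are made up for this illustration.

```shell
# PATH entries are separated by ':'; put each one on its own line.
path_dirs=$(echo "$PATH" | tr ':' '\n')
echo "$path_dirs"

# Count the directories (tr -d strips the padding some wc versions add).
n_dirs=$(echo "$path_dirs" | wc -l | tr -d ' ')
echo "PATH contains $n_dirs directories"
```

If the directory you added does not appear in the list, the PATH change did not take effect in the current shell.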
PORT requires input files to be organized into a specific directory structure that looks like this. We will create this directory structure now.
STUDY
└── reads
    ├── Sample_1
    │   ├── Unaligned reads
    │   └── Aligned.sam/bam
    ├── Sample_2
    │   ├── Unaligned reads
    │   └── Aligned.sam/bam
    ├── Sample_3
    │   ├── Unaligned reads
    │   └── Aligned.sam/bam
    └── Sample_4
        ├── Unaligned reads
        └── Aligned.sam/bam
First, give the STUDY directory a unique name as follows:
Create sample directories and link them to the files of unaligned reads.
We are not actually copying the raw data into your folders. Instead, we are making what are called "symbolic links", which avoids making many copies of the same large files. It should look just as if the files were present in your folders.
The folders of raw data will live inside $HOME/RNASEQ/reads.
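To illustrate what a symbolic link does, here is a minimal sketch that runs entirely inside a temporary directory. The paths, the stand-in raw file, and its contents are all hypothetical and are not the course data; only the mkdir -p and ln -s pattern carries over.

```shell
# Hypothetical locations inside a scratch directory (not the real course paths).
work=$(mktemp -d)
shared="$work/shared_data"     # stands in for wherever the raw data really lives
study="$work/RNASEQ"           # stands in for $HOME/RNASEQ

# Make a tiny stand-in raw file so the sketch runs anywhere.
mkdir -p "$shared"
printf '@read1\nACGT\n+\nIIII\n' > "$shared/sample1_forward.fq"

# Create the sample directory, then link (not copy) the raw file into it.
mkdir -p "$study/reads/sample1"
ln -s "$shared/sample1_forward.fq" "$study/reads/sample1/sample1_forward.fq"

# ls -l shows the link and the file it points to.
ls -l "$study/reads/sample1"

# Reading through the link returns the contents of the original file.
first_line=$(head -1 "$study/reads/sample1/sample1_forward.fq")
echo "$first_line"
```

The link takes up almost no disk space, and deleting it leaves the original file untouched.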
Copy the following lines into your terminal. You should be able to copy/paste them all at once, so you don't have to do them one line at a time.
You now have all of the files of raw data in place. These files contain the short reads that come off the sequencing machine. Let's look at the forward read of the first read-pair in the first sample.
(2) Run the command below and paste the result in the box below.
head -4 $HOME/RNASEQ/reads/sample1/sample1_forward.fq
The first row is the name of the read, the second row is the read itself, the third row is a separator line beginning with "+", and the fourth row is the "quality string" for the read.
So far we have only raw reads, so we do not yet know where in the genome each read comes from. The first job is therefore to align the reads to the genome.
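Because every read occupies exactly four rows, the number of reads in a FASTQ file is its line count divided by four. The sketch below shows this on a made-up two-read file; the read names, sequences, and quality strings are invented for illustration.

```shell
# Build a tiny two-read FASTQ file (invented data).
fq=$(mktemp)
printf '@read1\nACGTAC\n+\nIIIIII\n@read2\nTTGACA\n+\nIIHHGG\n' > "$fq"

# Each read is four lines, so reads = lines / 4.
n_lines=$(wc -l < "$fq" | tr -d ' ')
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Print only the sequence row (row 2 of every 4-row record).
seqs=$(awk 'NR % 4 == 2' "$fq")
echo "$seqs"
```

The same two commands work on a real file such as the one above, although wc will take longer on a large file.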
(3) When you create a file, as we just did, you should always check that it worked.
Do that by running
cat $HOME/RNASEQ/sample_dirs.txt
and make sure it has the four lines it should have, no more and no less. Paste the result of the cat command in the box below.
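Counting lines by eye works for four lines, but the check can also be automated. The sketch below uses a temporary stand-in file rather than the real sample_dirs.txt, and the sample names written into it are assumed, not taken from the course data.

```shell
# Stand-in for $HOME/RNASEQ/sample_dirs.txt (contents are assumed).
f=$(mktemp)
printf 'sample1\nsample2\nsample3\nsample4\n' > "$f"

# Count the lines and compare against the expected number.
n=$(wc -l < "$f" | tr -d ' ')
if [ "$n" -eq 4 ]; then
    status="ok"
else
    status="unexpected line count: $n"
fi
echo "$status"
```

To check the real file, point f at $HOME/RNASEQ/sample_dirs.txt instead of creating the stand-in.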
Now we are ready to align the data with STAR.
Now we wait while STAR does the aligning. Depending on how many reads you have, this can take minutes to hours.
We can monitor the progress in several ways.
First, the bjobs command displays the current status of the pending, running, or suspended jobs that you own.
(4) Run that command now and paste what you see in the box below. You should see a header line and up to four lines showing active jobs. Some of the jobs may have already finished so you may not see all four.
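If some jobs have already finished, plain bjobs will no longer list them. As a hedged sketch, the -a option of bjobs also reports recently finished jobs. The guard around the calls and the scheduler variable are additions for illustration, so that the snippet is harmless on a machine without the LSF scheduler.

```shell
# bjobs belongs to the LSF scheduler; only call it if it is installed.
if command -v bjobs >/dev/null 2>&1; then
    bjobs        # pending, running, and suspended jobs
    bjobs -a     # also include recently finished jobs
    scheduler="lsf"
else
    echo "bjobs not found: this machine does not appear to run LSF"
    scheduler="none"
fi
```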
(5) Secondly, to monitor progress, you can check the log files in the $HOME/RNASEQ/logs directory.
Run the command below on that directory now and enter the result in the box below.
We ran a script that in turn ran all four STAR alignments for us so we didn't have to do them by hand one-by-one.
(6) But you could, if you wanted, run the four STAR jobs one-by-one directly at the command line. The Perl script wrote the command to run STAR into a shell script. You could execute the line in that shell script directly at the command prompt, if you wanted to. To see the command, run the following and paste the result in the box below:
So execute the following command and paste the result in the box below:
[Q2] You can see from the first few lines which chromosomes have reads aligned to them. Type those chromosome names into the box below, one per line (from the results of the head command):
[Q3] Now your next task is to count the number of alignments (rows) for which the read aligned perfectly without gaps.
There is a particular field in the SAM file called the "CIGAR String" that gives us this information. If the CIGAR string is "100M" then the read aligned perfectly with no gaps.
So your job is to count the number of rows that have 100M as their CIGAR string.
You will have to refer to the SAMv1.pdf document to find out where the CIGAR string is in each row.
Then you should write one UNIX command that will return the answer. You will have to pipe together a few basic UNIX commands to make this work.
a. Construct the UNIX command and paste it into the box below.
b. Now execute the command and paste the answer in the box below. Because it is a large file it may take several minutes for the command to finish.
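As a warm-up for piping commands together, the sketch below counts alignments per reference sequence on a tiny made-up SAM file. It deliberately works on the reference name (field 3), not on the CIGAR string, so it is not the answer to Q3. The toy records are invented for illustration.

```shell
# Build a toy SAM file: three header lines, then three alignment rows.
sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$sam"
printf '@SQ\tSN:chr1\tLN:1000\n' >> "$sam"
printf '@SQ\tSN:chr2\tLN:1000\n' >> "$sam"
printf 'r1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r2\t0\tchr1\t20\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r3\t16\tchr2\t30\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"

# Drop header lines (they start with '@'), keep field 3, tally each value.
grep -v '^@' "$sam" | cut -f3 | sort | uniq -c

# Count only the rows whose field 3 is exactly chr1.
n_chr1=$(grep -v '^@' "$sam" | cut -f3 | grep -c '^chr1$')
echo "chr1 alignments: $n_chr1"
```

Anchoring the pattern with ^ and $ matters: without the anchors, a pattern such as chr1 would also match chr10 or chr11.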