Tutorial 2. Tracking a RNA-Seq NGS workflow in full with Synapse

Note: The goal of this demo to show how to keep track of a typical steps in a RNA-Seq NGS workflow and not necessarily a recommended RNA-Seq or differential expression analysis method


In [18]:
import os
import sys
import synapseclient
import shutil
from synapseclient import File, Project, Wiki, Folder
import tempfile
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

%run 'synapse_demo_utils_functions.ipynb'

In [19]:
syn = synapseclient.login('synapsedemo','demo')


Welcome, synapsedemo!

1. Getting Raw Data

  • When you are starting a new project you would have the raw data locally. For demo purposes we are going to download a subsampled small RNA-Seq dataset for adrenal and brain tissue generated by the Illumina Body Map project

The subsampled raw dataset have been stored in a public synapse project. The following are the direct links to unique synapse id for the fastq data. This would serv


In [ ]:
brain_reads = 'syn2468554'
adrenal_reads = 'syn2468552'

#lets download the fastq data
brain_reads = syn.get(brain_reads)
adrenal_reads = syn.get(adrenal_reads)

2. Map the raw reads

  • For demo purposed we are using a small region of chr19 (300000-350000 Mb) of the hg19 reference genome. This could also be fetched from directly from synapse.

  • You can then use mapper of your choice to map these reads. We will use STAR to map the reads

2.1 Setting up the local environment


In [4]:
local_analysis_dir = tempfile.mkdtemp(prefix='temp_Demo_RNA-Seq_workflow')

#the following wud be replaced by the location of the STAR executable
STAR_executable = '/Users/abhishek/softwares/STAR_2.3.0e.OSX_x86_64/STAR'

#dir to store the STAR genome index for the reference genome
ref_genome_dir = create_dir(local_analysis_dir + '/ref_genome/')

2.2 Creating a genome index


In [ ]:
#the reference genome
ref_genome = 'syn2468557'
ref_genome = syn.get(ref_genome)

#create a STAR genome index
star_genome_index = STAR_executable + ' --runMode genomeGenerate --genomeDir {genome_dir} --genomeFastaFiles {genome_fasta}'
star_genome_index = star_genome_index.format(genome_dir=ref_genome_dir, genome_fasta = ref_genome.path)
! $star_genome_index

2.3 Map adrenal & brain tissue reads


In [7]:
#constructing a basic command to run the mapping
#this command could be replaced by any mapper of choice

basic_star_command_template = STAR_executable +' --runThreadN 1 --genomeDir {genome_dir} --outFileNamePrefix  {prefix} --outSAMunmapped Within --readFilesIn {read_1}'

In [ ]:
#run the mapping commands
adrenal_star_command = basic_star_command_template.format(genome_dir = ref_genome_dir,
                                                          prefix = local_analysis_dir + '/adrenal_',
                                                          read_1 = adrenal_reads.path
                                                          )
! $adrenal_star_command

brain_star_command = basic_star_command_template.format(genome_dir = ref_genome_dir,
                                                        prefix = local_analysis_dir + '/brain_',
                                                        read_1 = brain_reads.path
                                                        )
! $brain_star_command

Lets store the analysis provenance now in Synapse

Key things required

  • A synapse user account
  • A personal synapse project to store the data and/or analysis

For help refer to Synapse Getting Started Guide


In [11]:
#create a project
project_name = 'RNA_Seq_Project-%s'  % randomword(5) ##mainly to create a random unique project name
myProj = syn.store(Project(project_name))

#create a folder to store the bams
mapped_folder = syn.store(Folder('mapped_files', parent = myProj))
Push the mapped files upto Synapse along with the provenance that generated them

In [12]:
brain_mapped_file = get_FilesList(local_analysis_dir, 'brain.*sam')[0]
brain_mapped_file = File(brain_mapped_file, parent = mapped_folder.id, synapseStore=False)

executed = [pushToSynapse(syn,brain_star_command, parentId=mapped_folder.id)]
used = [brain_reads]
brain_mapped_file = syn.store(brain_mapped_file, used = used, executed= executed)


#################

adrenal_mapped_file = get_FilesList(local_analysis_dir, 'adrenal.*sam')[0]
adrenal_mapped_file = synapseclient.File(adrenal_mapped_file, parent = mapped_folder , synapseStore=False)

executed = [pushToSynapse(syn, adrenal_star_command, parentId=mapped_folder.id)]
used = [adrenal_reads]
adrenal_mapped_file = syn.store(adrenal_mapped_file, used = used, executed= executed)


[get_FilesList]: Found 1 files at /var/folders/1n/y92vmpms2llcp2y_36bf_2f40000gp/T/temp_Demo_RNA-Seq_workflowjcnyCH
..,!!
Upload completed in 3 seconds.
[get_FilesList]: Found 1 files at /var/folders/1n/y92vmpms2llcp2y_36bf_2f40000gp/T/temp_Demo_RNA-Seq_workflowjcnyCH
..,!!
Upload completed in 2 seconds.

3. Lets continue with counting the reads / gene

and this time lets push the provenance as we go to synapse

Key things needed

  • genes/features to count against
  • mapped files from adrenal and brain tissues
  • htseq: software to count the reads falling in each gene

3.1 fetch the annotation & create a folder to store the counts


In [14]:
annotation_file = 'syn2468559' #pre-stored in synaspe 
annotation_file = syn.get(annotation_file)

#lets create a folder to keep the counts
counts_folder = syn.store(synapseclient.Folder('counts', parentId=myProj.id))

In [15]:
counts_files = []

for sam in [adrenal_mapped_file, brain_mapped_file]:
    
    count_file = sam.path.replace('.sam','_htseq-counts.txt')
    
    command = "htseq-count -f sam -s no -q %s %s > %s" % (sam.path, annotation_file.path, count_file)
    ! $command
    
    #pushing the counts file to synapse and 
    used = [sam, annotation_file]
    executed = pushToSynapse(syn, command, parentId = counts_folder.id)
    count_file = File(count_file, parentId=counts_folder.id)
    count_file = syn.store(count_file, used=used, executed=executed)
    
    #keeping for later use
    counts_files.append(count_file)


..,!!!
Upload completed in 4 seconds.
..,!!
Upload completed in 3 seconds.
..,!!
Upload completed in 2 seconds.
..,!!
Upload completed in 3 seconds.

4. Merge the counts from all the samples

produce a single counts file filtering out genes with zero counts across all the samples


In [16]:
cleaned_counts  = merge_htseq_counts(counts_files)
cleaned_counts


Out[16]:
adrenal brain
gene
C2CD4C 83 29
FLJ45445 647 169
MADCAM1 41 22
MIER2 12 2
ODF3L2 1796 1067
OR4F17 11 9
PPAP2C 3 270
SHC2 309 266
THEG 56 16
WASH5P 4018 5676

10 rows × 2 columns


In [58]:
cleaned_counts.apply(lambda x: np.log2(x), axis=1).plot(rot=90, title="raw counts(log)", kind="bar")
plt.savefig('raw_read_counts.png')


4.1 Store the final merged counts with provenance


In [103]:
outFile = local_analysis_dir + '/merged_counts.tsv'
cleaned_counts.to_csv(outFile, sep="\t", header=True, index=True )
cleaned_counts_syn = syn.store(File(outFile,parentId = counts_folder.id), used=[x.id for x in counts_files])


..,!!
Upload completed in 3 seconds.

5. Lets also create a Wiki for our project

summarizing the analysis and show analysis provenance to our users


In [109]:
wiki = temp_create_wiki(syn, cleaned_counts, 'raw_read_counts.png', myProj, cleaned_counts_syn)

In [110]:
syn.onweb(myProj)

Cleanup

delete the demo project and local analysis space


In [17]:
syn.delete(myProj)
shutil.rmtree(local_analysis_dir)

In [ ]: