Meta Analysis of the Datasets for the Epi² pilot project

Subcellular cell localization datasets

PYTHON 3 / R Notebook

Adrien Leger / EMBL EBI

Starting date 02/07/2016

Import general package and definition of specific functions



In [2]:

    
# Local lib import
from pycl import *
from pyBioPlot import *

# Standard lib imports
import datetime
from glob import glob
from pprint import pprint as pp
from os.path import basename
from os import listdir, remove, rename
from os.path import abspath, basename, isdir
from collections import OrderedDict, Counter
import random
import statistics as stat
from IPython.core.display import display, HTML, Markdown, Image

# Third party import
import numpy as np
import scipy.stats as stats
import pylab as pl
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from Bio import SeqIO

# Pyplot tweaking
%matplotlib inline
pl.rcParams['figure.figsize'] = 30, 10  # that's default image size for this interactive session

# Jupyter display tweaking 
toogle_code()
larger_display(75)

# Simplify warning reporting to lighten the notebook style 
import warnings
warnings.formatwarning=(lambda message, *args: "{}\n".format(message))

# Allow to use R directly
%load_ext rpy2.ipython









    





    Toggle on/off the raw code






    











    



The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

Cabili = HT-FISH data

Cabili, M. N. et al. Localization and abundance analysis of human lncRNAs at single-cell and single-molecule resolution. Genome Biol 16, (2015)

Subcellular Location of lncRNA by RNA FISH
Nice dataset with Human foreskin fibloblast (hFFs), Human lung fibroblast (hLFs) and HeLA cells
Limited to a very number of candidate = 34 only
Could be a good control datasets if we stick to Hela cells

The only data available are a large multitab excel file containing a lot of messy information. The annotation are based on different database (lncipedia or Noncode, or refSeq and rarely ensembl)... sometime the lncRNA is only described in one database and not in ensembl. First I obtained the common name for all lncRNA from lncipedia. However it was impossible to get the Ensembl equivalent in most of the cases.

Fortunatly they provide all the probe they used to fish the lncRNA. So when the information is not available I have to blast the probe in Ensemble to get the proper gene corresponding to the probes... It is a long and painfull work, but this dataset is going to be my reference for future localization experient so it has to be done properly



In [2]:

    
d = OrderedDict()

for line in open("./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/All_probes.csv", 'r'):
    sequence, seq_name, gene_num = line.split("\t")
    d[gene_num]=seq_name.rpartition("_")[0]

for k,v in d.items():
    print ("{}\t{}".format(k.strip(),v))









    



1	XLOC_004198
2	XLOC_009233
3	NR_024412
4	NM_000272
5	XLOC_004456
6	NR_015370
7	XLOC_I2_010926
8	NR_024031
9	NR_038322
10	NR_027063
11	XLOC_009474
12	XLOC_004198
13	XLOC_009233
14	NR_024412
15	NM_000272
16	XLOC_004456
17	NR_015370
18	XLOC_I2_010926
19	NR_024031
20	NR_038322
21	NR_027063
22	XLOC_009474
23	XLOC_L2_008203
24	XLOC_012564
25	XLOC_012197
26	XLOC_012046_(exonsuniquetoNR_036444)
27	XLOC_011264
28	XLOC_010709
29	XLOC_010556
30	XLOC_009702
31	XLOC_009662
32	XLOC_009447
33	XLOC_008583
34	XLOC_008174
35	XLOC_005764
36	XLOC_005151
37	XLOC_004803
38	XLOC_004122
39	XLOC_003526
40	XLOC_002408
41	XLOC_002094
42	XLOC_000304
43	NR_033857exons0-2
44	NM_139078(XLOC_010202_neighbor)
45	NM_023000(NR_029435_neighbor)
46	XLOC_013841
47	XLOC_012046(exonsuniquetoNR_033925)
48	XLOC_010263
49	NM_005853(XLOC_011950_neighbor)
50	NM_000820_neighbor_of_XLOC_010514
51	ANCR
52	lincFOXF1
53	lincSFPQ
54	lincDR1
55	lincTUG1
56	lincGARS
57	lincMKLN1_A
58	ANRil
59	GAS5
60	Kcnq1ot1
61	Meg3
62	NR_033857(part not shared with XLOC_008531)
63	NRON
64	NEAT1
65	TERC
66	h19
67	Xist
68	XLOC_011226
69	XLOC_012187
70	XLOC_010853
71	XLOC_008531
72	XLOC_006922
73	XLOC_002746
74	XLOC_011950
75	XLOC_014160
76	XLOC_012599
77	XLOC_010017
78	XLOC_012192
79	XLOC_012980
80	XLOC_010202
81	XLOC_009474
82	XLOC_010514
83	XLOC_006198
84	XLOC_011185
85	NR_029435
86	XLOC_010853
87	XLOC_012599
88	XLOC_012980
89	XLOC_I2_010926
90	ZBTB37
91	RAB30_3UTR
92	MALAT1
93	hMorc2
94	hMKLN1
95	FOXF1_3UTR
96	CCNA2_exons
97	MIAT



In [7]:

    
# Extract the sequences of the probles for each gene in a separate file
def get_fasta_probes(infile, outfile, id_list=[]):
    
    seq_dict = OrderedDict()
    for line in open (infile, "r"):
        sequence, seq_name, gene_num = line.split("\t")
        gene_name, sep, seq_num = seq_name.rpartition("_")
        
        for id in id_list:
            if (type(id) == int and id == int(gene_num)) or (type(id) == str and id == gene_id):
                seq_dict[seq_name] = sequence
        
        with open (outfile, "w") as fp:
            for key, val in seq_dict.items():
                fp.write(">{}\n{}\n".format(key, val))



In [28]:

    
infile = "./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/All_probes.csv"

# Creating a file contaning the probe used for each 34 valid experimental designs
d ={"DANCR":[51, 8, 19],
    "CDKN2B-AS1":[58],
    "GAS5":[59],
    "KCNQ1OT1":[60],
    "FENDRR":[26, 47, 52, 95],
    "LINC-PINT":[57],
    "lnc-SFPQ-2":[53],
    "MALAT1":[92],
    "MEG3":[61],
    "NEAT1":[64],
    "LINC00998":[3, 14],
    "PSMA3-AS1":[85],
    "lnc-AC078802.1-2":[65],
    "TUG1":[55],
    "XIST":[67],
    "lnc-NRXN1-2":[41],
    "LINC01116":[40],
    "DUBR":[73],
    "lnc-GRXCR1-4":[39],
    "TMEM161B-AS1":[5, 16],
    "lnc-C5orf39-2":[37],
    "lnc-RREB1-4":[36],
    "LINC00472":[35],
    "PVT1":[72],
    "LINC01278":[34],
    "RAB30-AS1":[2, 13],
    "LINC00941":[30],
    "lnc-MAGOHB-1":[77],
    "GAS6-AS2":[82],
    "lnc-SNURF-1":[84],
    "OIP5-AS1":[68],
    "CRNDE":[74],
    "lnc-USP14-2":[76,87],
    "MIR4435-2HG":[23]}

for gene_name, id_list in d.items():
    outfile = "./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_{}.fa".format(gene_name)
    get_fasta_probes(infile, outfile, id_list)
    print ("Created " + outfile)









    



Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-RREB1-4.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_LINC00941.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_CDKN2B-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-SNURF-1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_MEG3.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-C5orf39-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_LINC00998.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-AC078802.1-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_OIP5-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_MALAT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-USP14-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_CRNDE.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_NEAT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_LINC00472.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_DUBR.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-SFPQ-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_LINC01116.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_TMEM161B-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_RAB30-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-MAGOHB-1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_KCNQ1OT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_FENDRR.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_PVT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_LINC-PINT.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_GAS6-AS2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_XIST.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_GAS5.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-GRXCR1-4.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_MIR4435-2HG.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_PSMA3-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_TUG1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_lnc-NRXN1-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_LINC01278.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/Cabili_DANCR.fa

Each file was blasted against ensembl human genomic sequence hgrh38. I manually checked the overlapping genes and the position of hits relative to exons. Finally I reported the corresponding ensembl ID. In 2 conditions I was nor able to find a gene in the ensembl build at the position indicated by the probes. Instead in fond genes only described in NCBI genes. From what I saw in the blast results, all of the genes identified are not lncRNA. I compared the ensembl gene ID found with the gencodev23 file containing all the genes. Results are summarized in the table thereafter



In [132]:

    
def parse_ensembl_annotation (annotation_file):

    annotation_dict = OrderedDict()
    # Ensembl gen decomposing template
    template=[0,"\t",1,"\t",2,"\t",3,"\t",4,"\t",5,"\t",6,"\t",7,"\tID=",8,";gene_id=",9,";gene_type=",10,
              ";gene_status=",11,";gene_name=",12,";level=",13,";havana_gene=",14]
    
    # Parsing annotation file in a dict
    with open (annotation_file, "r") as fp:
        for line in fp:
            if line[0] != "#":
                sl = _decompose_line(line, template)       
                if sl[2] == "gene":
                    id = sl[8].split(".")[0]
                    assert id not in annotation_dict, "Duplicated gene ID !! Fuck" 
                    annotation_dict[id] = {
                        "seqid":sl[0],"source":sl[1],"type":sl[2],"start":sl[3],"end":sl[4],"score":sl[5],
                        "strand":sl[6],"phase":sl[7],"ID":sl[8],"gene_id":sl[9],"gene_type":sl[10],
                        "gene_status":sl[11],"gene_name":sl[12],"level":sl[13],"havana_gene":sl[14]}
    
    return annotation_dict

def annotate_ensembl_ID (annotation_file, candidate_file, out_file):
    
    print ("Parsing annotation file ...")
    annotation_dict = parse_ensembl_annotation(annotation_file)
    
    print ("Searching for candidates ...")
    with open (candidate_file, "r") as cf, open (out_file, "w") as of:
        of.write("#Gen_screenID\tGene_name\tEnsemb_ID\tseqid\tsource\ttype\tstart\tend\tscore\tstrand\tphase\tID\tgene_id\tgene_type\tgene_status\tgene_name\tlevel\thavana_gene\n")                
        
        for line in cf:
            if line[0] != "#":
                line = line.strip()
                sl= line.split("\t")           
                if sl[2] in annotation_dict:
                    of.write ("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(
                            line,
                            annotation_dict[sl[2]]["seqid"].strip(),
                            annotation_dict[sl[2]]["source"].strip(),
                            annotation_dict[sl[2]]["type"].strip(),
                            annotation_dict[sl[2]]["start"].strip(),
                            annotation_dict[sl[2]]["end"].strip(),
                            annotation_dict[sl[2]]["score"].strip(),
                            annotation_dict[sl[2]]["strand"].strip(),
                            annotation_dict[sl[2]]["phase"].strip(),
                            annotation_dict[sl[2]]["ID"].strip(),
                            annotation_dict[sl[2]]["gene_id"].strip(),
                            annotation_dict[sl[2]]["gene_type"].strip(),
                            annotation_dict[sl[2]]["gene_status"].strip(),
                            annotation_dict[sl[2]]["gene_name"].strip(),
                            annotation_dict[sl[2]]["level"].strip(),
                            annotation_dict[sl[2]]["havana_gene"].strip()))
                else:
                    of.write ("{}\t{}\n".format(line, "NOT_FOUND"))



In [134]:

    
# Request annotation for all candidates in lncRNA Gencode database
annotation_file = "../Reference_Annotation/gencode_v24_lncRNAs_gene.gff3"
candidate_file = "./Localisation_Original_Datasets/Cabili_RNA_FISH/candidate_list.csv"
out_file = "./Localisation_Original_Datasets/Cabili_RNA_FISH/candidate_list_annotated.csv"
annotate_ensembl_ID (annotation_file, candidate_file, out_file)

head(out_file, 5)









    



Parsing annotation file ...
Searching for candidates ...
#Gen_screenID	Gene_name	Ensemb_ID	seqid	source	type	start	end	score	strand	phase	ID	gene_id	gene_type	gene_status	gene_name	level	havana_gene
ANCR	DANCR	ENSG00000226950	chr4	HAVANA	gene	52712404	52720351	.	+	.	ENSG00000226950.6	ENSG00000226950.6	processed_transcript	KNOWN	DANCR	2;tag=ncRNA_host	OTTHUMG00000154670.3
Anril	CDKN2B-AS1	ENSG00000240498	chr9	HAVANA	gene	21994778	22121097	.	+	.	ENSG00000240498.6	ENSG00000240498.6	antisense	KNOWN	CDKN2B-AS1	1;tag=ncRNA_host	OTTHUMG00000019689.3
GAS5	GAS5	ENSG00000234741	chr1	HAVANA	gene	173863900	173868882	.	-	.	ENSG00000234741.7	ENSG00000234741.7	processed_transcript	KNOWN	GAS5	2	OTTHUMG00000037216.5
Kcnq1ot1	KCNQ1OT1	ENSG00000269821	chr11	HAVANA	gene	2608328	2699994	.	-	.	ENSG00000269821.1	ENSG00000269821.1	antisense	KNOWN	KCNQ1OT1	2	OTTHUMG00000171022.2



In [138]:

    
# Request annotation for all candidates in all Gencode database
annotation_file = "../Reference_Annotation/gencode_v24_gene.gff3"
candidate_file = "./Localisation_Original_Datasets/Cabili_RNA_FISH/candidate_list.csv"
out_file = "./Localisation_Original_Datasets/Cabili_RNA_FISH/candidate_list_annotated.csv"
annotate_ensembl_ID (annotation_file, candidate_file, out_file)

head(out_file, 5)









    



Parsing annotation file ...
Searching for candidates ...
#Gen_screenID	Gene_name	Ensemb_ID	seqid	source	type	start	end	score	strand	phase	ID	gene_id	gene_type	gene_status	gene_name	level	havana_gene
ANCR	DANCR	ENSG00000226950	chr4	HAVANA	gene	52712404	52720351	.	+	.	ENSG00000226950.6	ENSG00000226950.6	processed_transcript	KNOWN	DANCR	2;tag=ncRNA_host	OTTHUMG00000154670.3
Anril	CDKN2B-AS1	ENSG00000240498	chr9	HAVANA	gene	21994778	22121097	.	+	.	ENSG00000240498.6	ENSG00000240498.6	antisense	KNOWN	CDKN2B-AS1	1;tag=ncRNA_host	OTTHUMG00000019689.3
GAS5	GAS5	ENSG00000234741	chr1	HAVANA	gene	173863900	173868882	.	-	.	ENSG00000234741.7	ENSG00000234741.7	processed_transcript	KNOWN	GAS5	2	OTTHUMG00000037216.5
Kcnq1ot1	KCNQ1OT1	ENSG00000269821	chr11	HAVANA	gene	2608328	2699994	.	-	.	ENSG00000269821.1	ENSG00000269821.1	antisense	KNOWN	KCNQ1OT1	2	OTTHUMG00000171022.2

The 2 non-Ensembl lncRNA were not matched. As for the other 29 belong to the lncRNA class (processed_transcript, antisense, lincRNA), 1 is protein_coding and 2 are transcribed_unprocessed_pseudogene that do no belong to the lncRNA class. I think I will still keep a track of the pseudogene, since thay could also have a particular interest later in the analysis. So I have a final number of 29 valid lncRNA + 3 pseudogenes in the final dataset

#Gen_screenID	Gene_name	Ensemb_ID	gene_type	lncRNA_class	seqid	start	end	strand
ANCR	DANCR	ENSG00000226950	processed_transcript	YES	chr4	52712404	52720351	+
Anril	CDKN2B-AS1	ENSG00000240498	antisense	YES	chr9	21994778	22121097	+
GAS5	GAS5	ENSG00000234741	processed_transcript	YES	chr1	173863900	173868882	-
Kcnq1ot1	KCNQ1OT1	ENSG00000269821	antisense	YES	chr11	2608328	2699994	-
XLOC_012046	FENDRR	ENSG00000268388	lincRNA	YES	chr16	86474529	86509099	-
lincMKLN1_A1	LINC-PINT	ENSG00000231721	antisense	YES	chr7	130941760	131110176	-
lincSFPQ	SFPQ	ENSG00000116560	protein_coding	NO	chr1	35176378	35193148	-
MALAT1	MALAT1	ENSG00000251562	lincRNA	YES	chr11	65497762	65506516	+
Meg3	MEG3	ENSG00000214548	lincRNA	YES	chr14	100779410	100861031	+
NEAT1	NEAT1	ENSG00000245532	lincRNA	YES	chr11	65422774	65445540	+
NR_024412	LINC00998	ENSG00000214194	lincRNA	YES	chr7	113116718	113118613	-
NR_029435	PSMA3-AS1	ENSG00000257621	antisense	YES	chr14	58265365	58298134	-
TERC	TERC	ENSG00000270141	lincRNA	YES	chr3	169764520	169765060	-
XLOC_014209	TUG1	ENSG00000253352	antisense	YES	chr22	30970677	30979395	+
XLOC_008185	XIST	ENSG00000229807	lincRNA	YES	chrX	73820651	73852753	-
XLOC_002094	AC007682.1	ENSG00000231918	lincRNA	YES	chr2	51032601	52407917	+
XLOC_002408	LINC01116	ENSG00000163364	lincRNA	YES	chr2	176629589	176637931	-
XLOC_002746	DUBR	ENSG00000243701	lincRNA	YES	chr3	107240692	107326964	+
XLOC_003526	RP11-1E6.1	ENSG00000250657	processed_transcript	YES	chr4	43340875	43345600	+
XLOC_004456	TMEM161B-AS1	ENSG00000247828	antisense	YES	chr5	88268895	88436685	+
XLOC_004803	REFSEQ-FLJ32255	LOC643977	NA	NA	NA	NA	NA	NA
XLOC_005151	NCBI-105374902	LOC105374902	NA	NA	NA	NA	NA	NA
XLOC_005764	LINC00472	ENSG00000233237	lincRNA	YES	chr6	71344344	71420769	-
XLOC_006922	PVT1	ENSG00000249859	lincRNA	YES	chr8	127794533	128101253	+
XLOC_008174	LINC01278	ENSG00000235437	processed_transcript	YES	chrX	63343227	63561071	-
XLOC_009233	RAB30-AS1	ENSG00000246067	lincRNA	YES	chr11	83072066	83106719	+
XLOC_009702	LINC00941	ENSG00000235884	lincRNA	YES	chr12	30795681	30802711	+
XLOC_010017	KLRAP1	ENSG00000256667	transcribed_unprocessed_pseudogene	NO	chr12	10588063	10599669	-
XLOC_010514	GAS6-AS2	ENSG00000272695	lincRNA	YES	chr13	113864168	113866833	+
XLOC_011185	SNHG14	ENSG00000224078	processed_transcript	YES	chr15	24978583	25419462	+
XLOC_011226	OIP5-AS1	ENSG00000247556	processed_transcript	YES	chr15	41283990	41309737	+
XLOC_011950	CRNDE	ENSG00000245694	lincRNA	YES	chr16	54918863	54929189	-
XLOC_012599	ROCK1P1	ENSG00000263006	transcribed_unprocessed_pseudogene	NO	chr18	109065	122219	+
XLOC_L2_008203	MIR4435-2HG	ENSG00000172965	lincRNA	YES	chr2	111196350	111495100	-

I removed the data from the protein coding and non ensembl gene and crosmatch them with the experimental data in a simple 3 column csv file (Ensembl-ID : Cell type : localization code). I will create a bedfile containing the coordinate of the gene with additional information fetched from the gencodev23 annotation file. The idea is to obtain a standardize file similar to the bed file created previously for PTM.



In [156]:

    
def Bed_ensembl_ID (annotation_file, localization_file, out_file):
        
    print ("Parsing annotation file ...")
    annotation_dict = parse_ensembl_annotation(annotation_file)
    
    h = "# Data cleaned, converted to BED6, standardized, hg38 coordinates\n"
    h+= "# Gene annotation scrapped from gencodev23\n"
    h+= "# Adrien Leger (aleg@ebi.ac.uk) {}\n".format(str (datetime.datetime.today()))
    h+= "# Localization type is a numeric code from the original publication\n"
    h+= "# 0 = Invalid detection\n"
    h+= "# 1 = One or 2 large nuclear foci\n"
    h+= "# 2 = Large nuclear foci and single molecules scattered through the nucleus\n"
    h+= "# 3 = Predominantly nuclear, without foci\n"
    h+= "# 4 = Cytoplasmic and nuclear\n"
    h+= "# 5 = Predominantly cytoplasmic\n"
    h+= "# chrom\tchromstart\tchromend\tLocalization_code|cell_type|method|PMID|ensembl_id|gene_type|gene_name\tscore\tstrand\n"
        
    print ("Searching for candidates ...")
    with open (localization_file, "r") as infile, open (out_file, "w") as outfile:
        outfile.write(h)
        total = invalid_loc_code = invalid_ensemble_id = 0
        
        for line in infile:
            if line[0] != "#":
                total+=1        
                sl= line.split("\t")
                ensemble_id = sl[0].strip()
                cell_type = sl[1].strip()
                loc_code = sl[2].strip()
                
                if loc_code == '0':
                    invalid_loc_code+=1
                    continue
                
                if ensemble_id not in annotation_dict:
                    invalid_ensemble_id+=1
                    continue
                    
                outfile.write ("{0}\t{1}\t{2}\t{3}|{4}|{5}|{6}|{7}|{8}|{9}\t{10}\t{11}\n".format(
                    annotation_dict[ensemble_id]["seqid"].strip(),
                    annotation_dict[ensemble_id]["start"].strip(),
                    annotation_dict[ensemble_id]["end"].strip(),
                    loc_code,
                    cell_type,
                    "RNA_FISH",
                    "25630241",
                    ensemble_id,
                    annotation_dict[ensemble_id]["gene_type"].strip(),
                    annotation_dict[ensemble_id]["gene_name"].strip(),
                    "-",
                    annotation_dict[ensemble_id]["strand"].strip()))
    
    print("Total: {}\tInvalid Localization code: {}\tInvalid Ensembl ID: {}".format(total,invalid_loc_code,invalid_ensemble_id))



In [158]:

    
# Request annotation for all candidates in all Gencode database (pseudogenes + lncRNA)
annotation_file = "../Reference_Annotation/gencode_v24_gene.gff3"
localization_file = "./Localisation_Original_Datasets/Cabili_RNA_FISH/Localization_cell_ensembl.csv"
out_file = "./Localisation_Clean_Datasets/Cabili_RNA_FISH_hg38.bed"

Bed_ensembl_ID (annotation_file, localization_file, out_file)

head (out_file, 15)









    



Parsing annotation file ...
Searching for candidates ...
Total: 93	Invalid Localization code: 24	Invalid Ensembl ID: 0
# Data cleaned, converted to BED6, standardized, hg38 coordinates
# Gene annotation scrapped from gencodev23
# Adrien Leger (aleg@ebi.ac.uk) 2016-06-15 17:58:10.405934
# Localization type is a numeric code from the original publication
# 0 = Invalid detection
# 1 = One or 2 large nuclear foci
# 2 = Large nuclear foci and single molecules scattered through the nucleus
# 3 = Predominantly nuclear, without foci
# 4 = Cytoplasmic and nuclear
# 5 = Predominantly cytoplasmic
# chrom	chromstart	chromend	Localization_code|cell_type|method|PMID|ensembl_id|gene_type|gene_name	score	strand
chr4	52712404	52720351	4|hela|RNA_FISH|25630241|ENSG00000226950|processed_transcript|DANCR	-	+
chr4	52712404	52720351	4|hLF|RNA_FISH|25630241|ENSG00000226950|processed_transcript|DANCR	-	+
chr4	52712404	52720351	4|hFF|RNA_FISH|25630241|ENSG00000226950|processed_transcript|DANCR	-	+
chr9	21994778	22121097	1|hela|RNA_FISH|25630241|ENSG00000240498|antisense|CDKN2B-AS1	-	+

This last step removed 24 lines out of 93 for genes that had a invalid localization code (0). The data are now stored in a Bed compatible dataset and ready to be used. I am done with this dataset.

Transcript based reanalysis + automatization of Blasting

It might be a good idea to do this analysis at transcript level. Indeed, sometime only a fraction of the transcripts were targeted by the probes... A solution would be to blast automatically all the probes, from the coordinates, extract the overlapping transcript by simple intersection with bedtools for example

Same thing than prevously but cleaner with an id



In [8]:

    
! rm ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/*









    



rm: cannot remove './Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/*': No such file or directory



In [9]:

    
infile = "./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq/All_probes.csv"

# Creating a file contaning the probe used for each 34 valid experimental designs
d ={"01_DANCR":[51, 8, 19],
    "02_CDKN2B-AS1":[58],
    "03_GAS5":[59],
    "04_KCNQ1OT1":[60],
    "05_FENDRR":[26, 47, 52, 95],
    "06_LINC-PINT":[57],
    "07_lnc-SFPQ-2":[53],
    "08_MALAT1":[92],
    "09_MEG3":[61],
    "10_NEAT1":[64],
    "11_LINC00998":[3, 14],
    "12_PSMA3-AS1":[85],
    "13_TERC":[65],
    "14_TUG1":[55],
    "15_XIST":[67],
    "16_lnc-NRXN1-2":[41],
    "17_LINC01116":[40],
    "18_DUBR":[73],
    "19_lnc-GRXCR1-4":[39],
    "20_TMEM161B-AS1":[5, 16],
    "21_lnc-C5orf39-2":[37],
    "22_lnc-RREB1-4":[36],
    "23_LINC00472":[35],
    "24_PVT1":[72],
    "25_LINC01278":[34],
    "26_RAB30-AS1":[2, 13],
    "27_LINC00941":[30],
    "28_KLRAP1":[77],
    "29_GAS6-AS2":[82],
    "30_SNHG14":[84],
    "31_OIP5-AS1":[68],
    "32_CRNDE":[74],
    "33_ROCK1P1":[76,87],
    "34_MIR4435-2HG":[23]} 

outdir = "./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2"
mkdir(outdir)
for gene_name, id_list in d.items():
    outfile = "{}/{}.fa".format(outdir, gene_name)
    get_fasta_probes(infile, outfile, id_list)
    print ("Created " + outfile)









    



Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/32_CRNDE.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/14_TUG1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/24_PVT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/21_lnc-C5orf39-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/33_ROCK1P1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/05_FENDRR.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/09_MEG3.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/22_lnc-RREB1-4.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/12_PSMA3-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/23_LINC00472.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/10_NEAT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/31_OIP5-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/01_DANCR.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/08_MALAT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/25_LINC01278.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/34_MIR4435-2HG.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/13_TERC.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/16_lnc-NRXN1-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/06_LINC-PINT.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/03_GAS5.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/07_lnc-SFPQ-2.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/15_XIST.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/26_RAB30-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/18_DUBR.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/20_TMEM161B-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/30_SNHG14.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/19_lnc-GRXCR1-4.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/04_KCNQ1OT1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/27_LINC00941.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/02_CDKN2B-AS1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/17_LINC01116.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/28_KLRAP1.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/11_LINC00998.fa
Created ./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/29_GAS6-AS2.fa



In [41]:

    
head("./Localisation_Original_Datasets/Cabili_RNA_FISH/localization_patern.csv")









    



id	Gen_screenID	Gene_name	cell_type	validInCell	localization
1	ANCR	DANCR	HeLa	1	4
1	ANCR	DANCR	IMR90	1	4
1	ANCR	DANCR	CRL2097	1	4
2	Anril	CDKN2B-AS1	HeLa	1	1
2	Anril	CDKN2B-AS1	IMR90	0	0
2	Anril	CDKN2B-AS1	CRL2097	1	1
3	GAS5	GAS5	HeLa	1	2
3	GAS5	GAS5	IMR90	1	4
3	GAS5	GAS5	CRL2097	1	4

Blast fasta files against the human lncRNA transcriptome



In [121]:

    
# Use my python wrapper for blast+
from Blastn import Blastn

def parse_loc_file (loc_file):
    ### Import the sample localization file in a panda dfataframe and parse it in a dict
    loc_dict = {}
    df = pd.read_table(loc_file)
    for n, v in df.iterrows():
        if v["id"] not in loc_dict:
            loc_dict[v["id"]]={}
        # Define the localization of the sample for the dataset according to the validity of the assay
        loc_dict[v["id"]][v["cell_type"]] = int(v["localization"]) if v["validInCell"] else np.nan
    print ("Found {} sample id".format(len(loc_dict.keys())))
    return (loc_dict)


def blast_probe (reference, subject_dir, min_hit_evalue, min_hit_len, min_count_valid, verbose=False):

    ### Blast the probe against the reference and find the transcript of origin
    # Dict to collect all the resuts
    G_dict = {}
    T_dict = {}

    print ("Create Blast database")
    with Blastn(ref_path=reference) as blastn:
        print ("Blasting subject sequences")
        for subject in glob(subject_dir+"*.fa"):
            # Get the name and id of the sample currently analysed
            sample_name = supersplit(subject, [".","/","_"])[-2]
            sample_id = supersplit(subject, [".","/","_"])[-3]
            if verbose:
                print("Blasting sample :", sample_id, sample_name)
            # Init dict for these samples
            g_dict = {}
            t_dict = {}
            # Blast the sequences of the probes against the reference and get a lis of hits
            hit_list = blastn(query_path=subject, task="blastn-short")
            # Parse the list of blast hits found
            for hit in hit_list:  
                # Select only if inverse orientation good evalue and len match longer than 18
                if hit.evalue <= min_hit_evalue and hit.length >=min_hit_len and hit.s_orient != hit.q_orient :               
                    s = hit.s_id.split("|")
                    t_id = s[0]
                    g_id = s[1]
                    t_name = s[4]
                    g_name = s[5]
                    t_len = s[6]
                    t_type = s[7] 

                    # Fill the transcript based dict
                    if t_id not in t_dict:
                        t_dict[t_id] = OrderedDict()
                        t_dict[t_id]["g_id"] = g_id
                        t_dict[t_id]["t_name"] = t_name
                        t_dict[t_id]["g_name"] = g_name
                        t_dict[t_id]["t_len"] = t_len
                        t_dict[t_id]["t_type"] = t_type
                        t_dict[t_id]["count"] = 0
                    t_dict[t_id]["count"]+=1

                    # Fill of the gene based dict
                    if g_id not in g_dict:
                        g_dict[g_id] = OrderedDict()
                        g_dict[g_id]["g_name"] = g_name
                        g_dict[g_id]["t_type"] = t_type
                        g_dict[g_id]["count"] = 0
                    g_dict[g_id]["count"]+=1

            # Filtered the resuts and write 
            filt_g_dict = {k:v for k, v in g_dict.items() if v["count"] > min_count_valid}
            if g_dict:
                G_dict[sample_id] = filt_g_dict
                if verbose:
                    print (sample_id, sample_name, "GENES")
                    print ("UNFILTERED")
                    for k,v in g_dict.items():
                        print (k, "\t".join(["{}:{}".format(i,j) for i,j in v.items()]))
                    print ("FILTERED")
                    for k,v in filt_g_dict.items():
                        print (k, "\t".join(["{}:{}".format(i,j) for i,j in v.items()]))

            filt_t_dict = {k:v for k, v in t_dict.items() if v["count"] > min_count_valid}
            if t_dict:
                T_dict[sample_id] = filt_t_dict
                if verbose:
                    print (sample_id, sample_name, "TRANSCRIPTS")
                    print ("UNFILTERED")
                    for k,v in t_dict.items():
                        print (k, "\t".join(["{:6}:{:<18}".format(i,j) for i,j in v.items()]))
                    print ("FILTERED")
                    for k,v in filt_t_dict.items():
                        print (k, "\t".join(["{:6}:{:<18}".format(i,j) for i,j in v.items()]))
    
    return (G_dict, T_dict)

def write_loc_file (d, loc_dict, outdir="./out/", prefix=""):
    
    # Create the output directory if needed
    mkdir(outdir)
    
    # Iterate over the 3 cell types in the dataset
    for cell in ["HeLa", "IMR90", "CRL2097"]:
        outfile = "{}{}_{}.csv".format(outdir, prefix, cell)
        first_line = True
        print (outfile)
        with open (outfile, "w") as fp:
            for sample_id, id_dict in d.items():
                # Retrieve the localization for this sample and this cell line
                localization = loc_dict[int(sample_id)][cell]
                for seq_id, seq_val in id_dict.items():
                    # Write the file header
                    if first_line:
                        fp.write("target_id\tsample_id\tlocalization\t{}\n".format("\t".join(seq_val.keys())))
                        first_line=False
                    # Write the value fields
                    fp.write("{}\t{}\t{}\t{}\n".format(seq_id, sample_id, localization,"\t".join([str(i) for i in seq_val.values()])))



In [123]:

    
# Analysis with a reference containing all the human RNA 
reference = "../../Reference_Genomes/gencode.v24.transcripts.fa"
subject_dir = "./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/"
loc_file =  "./Localisation_Original_Datasets/Cabili_RNA_FISH/localization_patern.csv"
outdir = "./Localisation_Original_Datasets/Cabili_RNA_FISH/results_allRNA/"

loc_dict = parse_loc_file (loc_file)
G_dict, T_dict = blast_probe (reference, subject_dir, min_hit_evalue=0.01, min_hit_len=18, min_count_valid=3, verbose=False)
write_loc_file(G_dict, loc_dict, outdir, prefix="gene")
write_loc_file(T_dict, loc_dict, outdir, prefix="transcript")









    



Found 34 sample id
Create Blast database
Blasting subject sequences
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_allRNA/gene_HeLa.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_allRNA/gene_IMR90.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_allRNA/gene_CRL2097.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_allRNA/transcript_HeLa.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_allRNA/transcript_IMR90.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_allRNA/transcript_CRL2097.csv



In [122]:

    
# Analysis with a reference containing the human lncRNA only 
reference = "../../Reference_Genomes/gencode.v24.lncRNA_transcripts.fa"
subject_dir = "./Localisation_Original_Datasets/Cabili_RNA_FISH/probes_seq2/"
loc_file =  "./Localisation_Original_Datasets/Cabili_RNA_FISH/localization_patern.csv"
outdir = "./Localisation_Original_Datasets/Cabili_RNA_FISH/results_lncRNA/"

loc_dict = parse_loc_file (loc_file)
G_dict, T_dict = blast_probe (reference, subject_dir, min_hit_evalue=0.01, min_hit_len=18, min_count_valid=3, verbose=False)
write_loc_file(G_dict, loc_dict, outdir, prefix="gene")
write_loc_file(T_dict, loc_dict, outdir, prefix="transcript")









    



Found 34 sample id
Create Blast database
Blasting subject sequences
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_lncRNA/gene_HeLa.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_lncRNA/gene_IMR90.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_lncRNA/gene_CRL2097.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_lncRNA/transcript_HeLa.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_lncRNA/transcript_IMR90.csv
./Localisation_Original_Datasets/Cabili_RNA_FISH/results_lncRNA/transcript_CRL2097.csv

I obtained several list containing the localization of RNAs at gene level and transcript level based on a conservative base pairing between the probes used for the FISH experiments and the human transcripts. The available datasets are summarized in the table below

cell_line	RNA set	RNA level
HeLa	lncRNA	transcript
HeLa	lncRNA	gene
HeLa	allRNA	transcript
HeLa	allRNA	gene
IMR90	lncRNA	transcript
IMR90	lncRNA	gene
IMR90	allRNA	transcript
IMR90	allRNA	gene
CRL2097	lncRNA	transcript
CRL2097	lncRNA	gene
CRL2097	allRNA	transcript
CRL2097	allRNA	gene

Djebali = ENCODE RNA-Seq nuclear/cytoplasmic fractions

Exploration of the data space

I just found this new extensive dataset from the ENCODE v2 consortium published in 2012 with RNA-Seq data from various human cell lines in the nucleus and in the cytoplasm. The data were analysed with the previous version of the human genome and would probably have to be remapped completly on the new version of the genome hg38 and using the last GENCODE annotations. I for each cell line I will compare the nuclear vs the cytoplasmic fractions to get the transcript significantly enriched in the nucleus vs the cytoplasm.

The content of the dataset is listed bellow

Cell types (include immortalized cell lines, primary cells and embryonic cells)

A549
GM12878
H1-hESC
HeLa-S3
HepG2
HUVEC
IMR90
K562
MCF-7
NHEK
SK-N-SH

Localization

Nucleus / Cytoplasm > Separated using Qiagen RLN buffer
Cytosol
Chromatin (K562 only)
Nucleolus (K562 only)
Nucleoplasm (K562 only)

Additional informations

RNAs of >200 nt length are selected using the RNeasy MiniElute Cleanup kit
RNA were split in 2 fractions by polyA + and the polyA - using Oligotex
All fractions were pre‐treated with the Ribominus Eukaryotic Kit for RNA‐seq
PhiX was added at 1% in
There are 2 biological replicates for all the samples (except the H1-hESC embyonic cells)
fastq Paired end reads are in 2 separates files
bed, Bigwig, sam .. are also available but I only need the fastq since I will remap the data

Summary of the dataset I will focus on

Cell Line	Localization	RNA Extract	ENCODE Accesion
A549	cytosol	Long PolyA+ RNA	EH002624
A549	nucleus	Long PolyA+ RNA	EH002625
GM12878	cytosol	Long PolyA+ RNA	EH000147
GM12878	nucleus	Long PolyA+ RNA	EH000170
H1-hESC	cytosol	Long PolyA+ RNA	EH000151
H1-hESC	nucleus	Long PolyA+ RNA	EH000152
HeLa-S3	cytosol	Long PolyA+ RNA	EH000171
HeLa-S3	nucleus	Long PolyA+ RNA	EH000172
HepG2	cytosol	Long PolyA+ RNA	EH000161
HepG2	nucleus	Long PolyA+ RNA	EH000158
HUVEC	cytosol	Long PolyA+ RNA	EH000156
HUVEC	nucleus	Long PolyA+ RNA	EH000157
IMR90	cytosol	Long PolyA+ RNA	EH002626
IMR90	nucleus	Long PolyA+ RNA	EH002628
K562	cytosol	Long PolyA+ RNA	EH000140
K562	nucleus	Long PolyA+ RNA	EH000174
MCF-7	cytosol	Long PolyA+ RNA	EH002633
MCF-7	nucleus	Long PolyA+ RNA	EH002627
NHEK	cytosol	Long PolyA+ RNA	EH000166
NHEK	nucleus	Long PolyA+ RNA	EH000165
SK-N-SH	cytosol	Long PolyA+ RNA	EH002630
SK-N-SH	nucleus	Long PolyA+ RNA	EH002631

Reference cleaning and Index generation with Kallisto

It has been a while that I want to use Kalisto, the software designed by Lior Patcher's team to analyse RNA data. I will perform this analysis with the conventional pipeline (STAR) in parallel to compare the results. The data are quite massive, so I will just prototype on my laptop but run the final analysis on the cluster

I dl the last reference fasta transcriptome for all RNA and for lncRNA only from the gencode server (v23). I need to clean the fasta Reference sequence from Gencode since they contain a lot of information from the transcript names of each sequences that will hamper the readability of the results. In order to be able to use the gene names and information later; I also need to save all the information contained in the names in a separate file.



In [6]:

    
from gzip import open as gopen

def Gencode_fastaclean_listinfo (gencode_fasta, clean_fasta="gencode_clean.fa.gz", gencode_info="gencode_info.tsv"):
    with gopen (gencode_fasta, "rt") as fasta_in, gopen (clean_fasta, "wt") as fasta_out, open (gencode_info, "wt") as info:
        
        # write header in the info file
        info.write ("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(
            "GENCODE_transcript_id", "GENCODE_gene_id", "HAVANA_gene_id", "HAVANA_transcript_id", "transcript_name",
            "gene_name", "length", "RNA_type"))
        
        nline = nseq = 0
        for line in fasta_in:
            nline += 1
            
            if line[0] == ">":
                nseq += 1
                ls = [i.strip() for i in line[1:].split("|")]
                fasta_out.write (">{}\n".format(ls[0]))
                
                # in lncRNA file not RNA type at the end...
                if not ls[7]:
                    ls[7]="NA"
                info.write("\t".join(ls)+"\n")
            
            else:
                fasta_out.write(line)
    
    print ("Parsed {} lines\t Found {} Sequences".format(nline, nseq))



In [7]:

    
gencode_fasta = "../../Reference_Genomes/gencode.v24.lncRNA_transcripts.fa.gz"
clean_fasta = "../../Reference_Genomes/gencode_v24_lncRNA_transcripts_clean.fa.gz"
gencode_info = "../../Reference_Genomes/gencode_v24_lncRNA_transcripts_info.tsv"

print ("Clean the file and generate the list from IDs\n") 
Gencode_fastaclean_listinfo (gencode_fasta, clean_fasta, gencode_info)

print ("Visualize the result\n")
head(gencode_fasta, 2)
head(clean_fasta, 2)
head(gencode_info, 2)









    



Clean the file and generate the list from IDs

Parsed 56062 lines	 Found 28031 Sequences
Visualize the result

>ENST00000473358.1|ENSG00000243485.3|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCCATCGCTGCCTTCCCCTCAAGCTCAGGGCCAAGCTGTCCGCCAACCTCGGCTCCTCCGGGCAGCCCTCGCCCGGGGTGCGCCCCGGGGCAGGACCCCCAGCCCACGCCCAGGGCCCGCCCCTGCCCTCCAGCCCTACGCCTTGACCCGCTTTCCTGCGTCTCTCAGCCTACCTGACCTTGTCTTTACCTCTGTGGGCAGCTCCCTTGTGATCTGCTTAGTTCCCACCCCCCTTTAAGAATTCAATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCAGTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGTTTTGTCCGCCTTCCCTGCCTCCTCTTCTGGGGGAGTTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCTTACCAGAGTGAGGATGCGAAGAGAAGGTGGCTGTCTGCAAACCAGGAAGAGAGCCCTCACCGGGAACCCGTCCAGCTGCCACCTTGAACTTGGACTTCCAAGCCTCCAGAACTGTGAGGGATAAATGTAT

>ENST00000473358.1
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCCATCGCTGCCTTCCCCTCAAGCTCAGGGCCAAGCTGTCCGCCAACCTCGGCTCCTCCGGGCAGCCCTCGCCCGGGGTGCGCCCCGGGGCAGGACCCCCAGCCCACGCCCAGGGCCCGCCCCTGCCCTCCAGCCCTACGCCTTGACCCGCTTTCCTGCGTCTCTCAGCCTACCTGACCTTGTCTTTACCTCTGTGGGCAGCTCCCTTGTGATCTGCTTAGTTCCCACCCCCCTTTAAGAATTCAATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCAGTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGTTTTGTCCGCCTTCCCTGCCTCCTCTTCTGGGGGAGTTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCTTACCAGAGTGAGGATGCGAAGAGAAGGTGGCTGTCTGCAAACCAGGAAGAGAGCCCTCACCGGGAACCCGTCCAGCTGCCACCTTGAACTTGGACTTCCAAGCCTCCAGAACTGTGAGGGATAAATGTAT

GENCODE_transcript_id	GENCODE_gene_id	HAVANA_gene_id	HAVANA_transcript_id	transcript_name	gene_name	length	RNA_type
ENST00000473358.1	ENSG00000243485.3	OTTHUMG00000000959.2	OTTHUMT00000002840.1	RP11-34P13.3-001	RP11-34P13.3	712	NA



In [6]:

    
gencode_fasta = "../../Reference_Genomes/gencode.v24.transcripts.fa.gz"
clean_fasta = "../../Reference_Genomes/gencode_v24_transcripts_clean.fa.gz"
gencode_info = "../../Reference_Genomes/gencode_v24_transcripts_info.tsv"

print ("Clean the file and generate the list from IDs\n") 
Gencode_fastaclean_listinfo (gencode_fasta, clean_fasta, gencode_info)

print ("Visualize the result\n")
head(gencode_fasta, 2)
head(clean_fasta, 2)
head(gencode_info, 2)









    



Clean the file and generate the list from IDs

Parsed 398338 lines	 Found 199169 Sequences
Visualize the result

>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTGCTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAAGAAATGGTGGGTCCTGGCCATCCGTGAGATCTTCCCAGGGCAGCTCCCCTCTGTGGAATCCAATCTGTCTTCCATCCTGCGTGGCCGAGGGCCAGGCTTCTCACTGGGCCTCTGCAGGAGGCTGCCATTTGTCCTGCCCACCTTCTTAGAAGCGAGACGGAGCAGACCCATCTGCTACTGCCCTTTCTATAATAACTAAAGTTAGCTGCCCTGGACTATTCACCCCCTAGTCTCAATTTAAGAAGATCCCCATGGCCACAGGGCCCCTGCCTGGGGGCTTGTCACCTCCCCCACCTTCTTCCTGAGTCATTCCTGCAGCCTTGCTCCCTAACCTGCCCCACAGCCTTGCCTGGATTTCTATCTCCCTGGCTTGGTGCCAGTTCCTCCAAGTCGATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTACCCCAACACCAGCAATTGTGCCAAGGGCCATTAGGCTCTCAGCATGACTATTTTTAGAGACCCCGTGTCTGTCACTGAAACCTTTTTTGTGGGAGACTATTCCTCCCATCTGCAACAGCTGCCCCTGCTGACTGCCCTTCTCTCCTCCCTCTCATCCCAGAGAAACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG

>ENST00000456328.2
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTGCTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACCACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAAGAAATGGTGGGTCCTGGCCATCCGTGAGATCTTCCCAGGGCAGCTCCCCTCTGTGGAATCCAATCTGTCTTCCATCCTGCGTGGCCGAGGGCCAGGCTTCTCACTGGGCCTCTGCAGGAGGCTGCCATTTGTCCTGCCCACCTTCTTAGAAGCGAGACGGAGCAGACCCATCTGCTACTGCCCTTTCTATAATAACTAAAGTTAGCTGCCCTGGACTATTCACCCCCTAGTCTCAATTTAAGAAGATCCCCATGGCCACAGGGCCCCTGCCTGGGGGCTTGTCACCTCCCCCACCTTCTTCCTGAGTCATTCCTGCAGCCTTGCTCCCTAACCTGCCCCACAGCCTTGCCTGGATTTCTATCTCCCTGGCTTGGTGCCAGTTCCTCCAAGTCGATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTACCCCAACACCAGCAATTGTGCCAAGGGCCATTAGGCTCTCAGCATGACTATTTTTAGAGACCCCGTGTCTGTCACTGAAACCTTTTTTGTGGGAGACTATTCCTCCCATCTGCAACAGCTGCCCCTGCTGACTGCCCTTCTCTCCTCCCTCTCATCCCAGAGAAACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG

GENCODE_transcript_id	GENCODE_gene_id	HAVANA_gene_id	HAVANA_transcript_id	transcript_name	gene_name	length	RNA_type
ENST00000456328.2	ENSG00000223972.5	OTTHUMG00000000961.2	OTTHUMT00000362751.1	DDX11L1-002	DDX11L1	1657	processed_transcript

To compare the different conditions it might be interesting to to use the internal spike in used in all ENCODE samples. The added the ERCC (External RNA Controls Consortium) RNA Spike-In Mix that is also sometime called NIST spike. This is a pool of 96 synthetic RNAs with various lengths, and GC content covering a 2^20 concentration range as spike-in controls to measure sensitivity, accuracy, and biases in RNA-seq experiments as well as to derive standard curves for quantifying the abundance of transcripts. In all the sample I have selected, they used the ERCC pool 14 and the provide the concentration for all RNAs in a tsv file. I also need to have a fasta file, a gtf annotation file and an info list to further use. I found the fasta and the gtf easily on the web but I need to generate the info file from the GTF



In [2]:

    
# define variables and reformat the gff to obtain a simple tsv table
input_file = gunzip_file("../../Reference_Annotation/ERCC.gtf.gz")
output_file = "../../Reference_Genomes/ERCC_info.tsv"
init_template = [0,"\t",1,"\t",2,"\t",3,"\t",4,"\t",5,"\t",6,"\t",7,"\tgene_id \"",8,"\"; transcript_id \"",9,"\";"]
final_template = [0,"\t",8,"\t",9,"\t",4]
header="transcript_name\tgene_id\ttranscript_id\tlength\n"
reformat_table(input_file, output_file, init_template, final_template, header)

linerange (input_file)
linerange (output_file)









    



Uncompressing ../../Reference_Annotation/ERCC.gtf.gz
96 Lines processed	96 Lines pass	0 Lines filtered out	0 Lines fail
0	ERCC-00002	ERCC	exon	1	1061	0	+	.	gene_id "ERCC-00002"; transcript_id "KC702164";
1	ERCC-00003	ERCC	exon	1	1023	0	+	.	gene_id "ERCC-00003"; transcript_id "KC702165";
2	ERCC-00004	ERCC	exon	1	523	0	+	.	gene_id "ERCC-00004"; transcript_id "KC702166";

93	ERCC-00168	ERCC	exon	1	1024	0	+	.	gene_id "ERCC-00168"; transcript_id "KC702236";
94	ERCC-00170	ERCC	exon	1	1024	0	+	.	gene_id "ERCC-00170"; transcript_id "KC702237";
95	ERCC-00171	ERCC	exon	1	505	0	+	.	gene_id "ERCC-00171"; transcript_id "KC702238";

0	transcript_name	gene_id	transcript_id	length
1	ERCC-00002	ERCC-00002	KC702164	1061
2	ERCC-00003	ERCC-00003	KC702165	1023

94	ERCC-00168	ERCC-00168	KC702236	1024
95	ERCC-00170	ERCC-00170	KC702237	1024
96	ERCC-00171	ERCC-00171	KC702238	505

To enable there analysis one should include the sequences of the oligo transcripts in the alignment step. I will generate a new reference mixing the human genome transcripts as well as the ERCC transcript directil on the computation farm

Create indexes for Kalisto from the human lncRNA and all RNA transcriptome gencode v23 including the ERCC spike-in transcripts that might be used for normalization later



In [8]:

    
program = "kallisto-0.43 index"
index_dir = "../../Index/kallisto/"
index_name = "gencode_v24_lncRNA_transcripts_ERCC.idx"
index_report = "gencode_v24_lncRNA_transcripts_ERCC.txt"
gencode_fasta = "../../Reference_Genomes/gencode.v24.lncRNA_transcripts.fa.gz"
gencode_ERCC = "../../Reference_Genomes/ERCC.fa.gz"

mkdir(index_dir)

cmd = "{} -i {}{} {} {}".format(program, index_dir, index_name, gencode_fasta, gencode_ERCC)
print (cmd)

stderr = bash (cmd, ret_stderr=True, ret_stdout=False)
print (stderr)

# write the report
with open (index_dir+index_report, "w") as fp:
    fp.write(stderr)









    



kallisto-0.43 index -i ../../Index/kallisto/gencode_v24_lncRNA_transcripts_ERCC.idx ../../Reference_Genomes/gencode.v24.lncRNA_transcripts.fa.gz ../../Reference_Genomes/ERCC.fa.gz

[build] loading fasta file ../../Reference_Genomes/gencode.v24.lncRNA_transcripts.fa.gz
[build] loading fasta file ../../Reference_Genomes/ERCC.fa.gz
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 305 target sequences
[build] warning: replaced 4 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 154016 contigs and contains 21010527 k-mers



In [9]:

    
program = "kallisto-0.43 index"
index_dir = "../../Index/kallisto/"
index_name = "gencode_v24_transcripts_ERCC.idx"
index_report = "gencode_v24_transcripts_ERCC.txt"
gencode_fasta = "../../Reference_Genomes/gencode.v24.transcripts.fa.gz"
gencode_ERCC = "../../Reference_Genomes/ERCC.fa.gz"

mkdir(index_dir)

cmd = "{} -i {}{} {} {}".format(program, index_dir, index_name, gencode_fasta, gencode_ERCC)
print (cmd)

stderr = bash (cmd, ret_stderr=True, ret_stdout=False)
print (stderr)

# write the report
with open (index_dir+index_report, "w") as fp:
    fp.write(stderr)









    



kallisto-0.43 index -i ../../Index/kallisto/gencode_v24_transcripts_ERCC.idx ../../Reference_Genomes/gencode.v24.transcripts.fa.gz ../../Reference_Genomes/ERCC.fa.gz

[build] loading fasta file ../../Reference_Genomes/gencode.v24.transcripts.fa.gz
[build] loading fasta file ../../Reference_Genomes/ERCC.fa.gz
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 1629 target sequences
[build] warning: replaced 87 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 1261145 contigs and contains 123979963 k-mers

Fastq files download

To analyse the data I will have to get all the fastq files and I will use the Cluster because of the amount of data to be analysed. All the data are available on the ENCODE website https://www.encodeproject.org/ and I can acces them thanks to the ENCODE Accesion number per experimental conditions. For each conditions I retrieved manually a file containing download links for all the files they generated. see bellow



In [10]:

    
!cat ./Localisation_Original_Datasets/Djebali-ENCODE/ENCODE_sample_files/EH000140.txt









    



https://www.encodeproject.org/metadata/type=Experiment&y.limit=&searchTerm=EH000140/metadata.tsv
https://www.encodeproject.org/files/ENCFF000HIL/@@download/ENCFF000HIL.bigBed
https://www.encodeproject.org/files/ENCFF000HIM/@@download/ENCFF000HIM.gtf.gz
https://www.encodeproject.org/files/ENCFF000HIO/@@download/ENCFF000HIO.gtf.gz
https://www.encodeproject.org/files/ENCFF000HIS/@@download/ENCFF000HIS.bam
https://www.encodeproject.org/files/ENCFF000HIT/@@download/ENCFF000HIT.bam
https://www.encodeproject.org/files/ENCFF000HIU/@@download/ENCFF000HIU.gtf.gz
https://www.encodeproject.org/files/ENCFF000HIW/@@download/ENCFF000HIW.gtf.gz
https://www.encodeproject.org/files/ENCFF000HIY/@@download/ENCFF000HIY.gtf.gz
https://www.encodeproject.org/files/ENCFF000HJA/@@download/ENCFF000HJA.bigBed
https://www.encodeproject.org/files/ENCFF000HJB/@@download/ENCFF000HJB.bigWig
https://www.encodeproject.org/files/ENCFF000HJC/@@download/ENCFF000HJC.bigWig
https://www.encodeproject.org/files/ENCFF000HJD/@@download/ENCFF000HJD.bigWig
https://www.encodeproject.org/files/ENCFF000HJE/@@download/ENCFF000HJE.bigWig
https://www.encodeproject.org/files/ENCFF000HJF/@@download/ENCFF000HJF.fastq.gz
https://www.encodeproject.org/files/ENCFF000HJK/@@download/ENCFF000HJK.gtf.gz
https://www.encodeproject.org/files/ENCFF000HJM/@@download/ENCFF000HJM.gtf.gz
https://www.encodeproject.org/files/ENCFF000HJO/@@download/ENCFF000HJO.gtf.gz
https://www.encodeproject.org/files/ENCFF000HJP/@@download/ENCFF000HJP.fastq.gz
https://www.encodeproject.org/files/ENCFF000HJW/@@download/ENCFF000HJW.fastq.gz
https://www.encodeproject.org/files/ENCFF000HJX/@@download/ENCFF000HJX.fastq.gz
https://www.encodeproject.org/files/ENCFF631VVQ/@@download/ENCFF631VVQ.bed.gz
https://www.encodeproject.org/files/ENCFF983WUC/@@download/ENCFF983WUC.bed.gz

The first file contains a large table with detailled informations for all the files contained in the conditions including there download links I will parse each of the tables and extract only the information related to fastq files



In [16]:

    
from urllib.request import urlopen

def parse_ENCODE_info_files(file_dir, tsv_outfile):

    # Handler to write the collect information
    with open (tsv_outfile, "w") as outfp:
        # Loop over the information files to extract the url of the tsv file 
        for f in sorted(glob(file_dir+"*.txt")):
            print("Parsing {}".format(file_basename(f)))
            with open(f) as infp:
                URL = next(infp).strip()

                # Parse the tsv url and extract the lines correponding to the fadtq files   
                with urlopen(URL) as tsv:
                    for line in tsv:
                        dl = line.decode('utf-8')
                        sl = dl.split("\t")

                        # Write the lines in the collecting file
                        if sl[1] == "fastq":
                            outfp.write(dl)
    return tsv_outfile



In [17]:

    
tsv_outfile = "./Localisation_Original_Datasets/Djebali-ENCODE/fastq/fastq_details.tsv"
file_dir = "./Localisation_Original_Datasets/Djebali-ENCODE/ENCODE_sample_files/"

parse_ENCODE_info_files(file_dir, tsv_outfile)
print(fastcount(outfile))
head (outfile, 2)









    



Parsing EH000140
Parsing EH000147
Parsing EH000151
Parsing EH000152
Parsing EH000156
Parsing EH000157
Parsing EH000158
Parsing EH000161
Parsing EH000165
Parsing EH000166
Parsing EH000170
Parsing EH000171
Parsing EH000172
Parsing EH000174
Parsing EH002624
Parsing EH002625
Parsing EH002626
Parsing EH002627
Parsing EH002628
Parsing EH002630
Parsing EH002631
Parsing EH002633
84
ENCFF000HJF	fastq	reads	ENCSR000COK	RNA-seq	EFO:0002067	K562	immortalized cell line	adult	female	Homo sapiens		cytosol					polyadenylated mRNA	rRNA	see document	see document		2011-10-17	ENCODE		see document	>200	53 year	2	1	76	paired-ended	1	ENCFF000HJX		5848000346	Thomas Gingeras, CSHL	7d6fb0723fb5c2b48e40c6312fc9134e	https://www.encodeproject.org/files/ENCFF000HJF/@@download/ENCFF000HJF.fastq.gz		Illumina Genome Analyzer IIx
ENCFF000HJP	fastq	reads	ENCSR000COK	RNA-seq	EFO:0002067	K562	immortalized cell line	adult	female	Homo sapiens		cytosol					polyadenylated mRNA	rRNA	see document	see document		2011-10-17	ENCODE		see document	>200	53 year	1	1	76	paired-ended	1	ENCFF000HJW		8866919688	Thomas Gingeras, CSHL	521ed7d0b3b6e8b8d5a0a177d13be8ec	https://www.encodeproject.org/files/ENCFF000HJP/@@download/ENCFF000HJP.fastq.gz		Illumina Genome Analyzer IIx

From this file containing all the information I will write a small python script to download the data on the cluster, similar to the following. https://github.com/a-slide/pyScripts/blob/master/EncodeParser.py. I launched the jobs on the cluster farm to download all the fastq files corresponding to the ENCODE number of Nucleus and cytosolic fraction polyA+ RNA. All the data are now available in the ebi server. There are 42 paired ended datafile. I renamed some of the datasets because HUVEC cells were called endothelial-cell-of-umbilical-vein and NHEK were called keratinocyte

Alignment with Kallisto quant

Test alignement with 1M reads extracted from one of the ENCODE datasets with 4 threads, 5 bootstraps and in rf stranded mode



In [5]:

    
program = "kallisto-0.43 quant"
index="../../Index/kallisto/gencode_v24_lncRNA_transcripts_ERCC.idx"
R1="./Localisation_Original_Datasets/Djebali-ENCODE/fastq/1M_test_R1.fastq.gz"
R2="./Localisation_Original_Datasets/Djebali-ENCODE/fastq/1M_test_R2.fastq.gz"
output_dir = "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/test/"
align_report = output_dir+"kallisto_report.txt"
n_thread = 4
n_bootstraps = 10

mkdir(output_dir)

cmd = "{} -t {} -b {} --fr-stranded -i {} -o {} {} {}".format(program, n_thread, n_bootstraps, index, output_dir, R1, R2)
print (cmd)

stderr = bash (cmd, ret_stderr=True, ret_stdout=False)
print (stderr)

# write the report
with open (align_report, "w") as fp:
    fp.write(stderr)









    



kallisto-0.43 quant -t 4 -b 10 --fr-stranded -i ../../Index/kallisto/gencode_v24_lncRNA_transcripts_ERCC.idx -o ./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/test/ ./Localisation_Original_Datasets/Djebali-ENCODE/fastq/1M_test_R1.fastq.gz ./Localisation_Original_Datasets/Djebali-ENCODE/fastq/1M_test_R2.fastq.gz

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 28,127
[index] number of k-mers: 21,010,527
[index] number of equivalence classes: 91,580
[quant] running in paired-end mode
[quant] will process pair 1: ./Localisation_Original_Datasets/Djebali-ENCODE/fastq/1M_test_R1.fastq.gz
                             ./Localisation_Original_Datasets/Djebali-ENCODE/fastq/1M_test_R2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 1,000,000 reads, 58,177 reads pseudoaligned
[quant] estimated average fragment length: 303.622
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 611 rounds
[bstrp] number of EM bootstraps complete: 10

Kalisto quant is working correctly and we obtain a file with counts directly in a very short time. However due to space and parralelisation concerns I will perform the dataset wide analyses on the cluster

The alignments against the modified reference containing both human transcripts were performed on the cluster. I performed 4 different aligments:

Against all gencode transcripts + ERCC transcripts with strand specific alignment
Against lncRNA gencode gencode transcripts + ERCC transcripts with strand specific alignment
Against all gencode transcripts only with strand specific alignment
Against lncRNA gencode gencode transcripts only with strand specific alignment

According to the Sleuth paper it might be beneficial perform 100 boostrap although they say that they don't really knows which is the minimun to obtain reliable results

kallisto quant --bias -b 100 --rf-stranded -t 4 -i INDEX -o OUTPUT_FOLDER READ1 READ2

I then copied the results localy for further analysis with Sleuth, since the size is rather small.



In [8]:

    
# Parse the stderr files containing the number of read aligned with kallisto
sname_list=[]
tae=[] # total all RNA ERCC
mae=[] # mapped all RNA ERCC
tle=[] # total lncRNA ERCC
mle=[] # mapped lncRNA ERCC
ma=[] # mapped all RNA no ERCC
ml=[] # mapped lncRNA no ERCC
flen_list=[]

# results obtained with basic alignment using Kallisto quant
for file in glob("./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/*/stderr.txt"):
    sname_list.append(file.split("/")[-2])   
    lines = !cat {file}
    flen_list.append(float(lines[12].split()[-1].replace(",","")))
    tae.append(int(lines[11].split()[2].replace(",","")))
    mae.append(int(lines[11].split()[4].replace(",","")))

# results obtained with stranded alignment using Kallisto quant
for file in glob("./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/*/stderr.txt"):
    lines = !cat {file}
    tle.append(int(lines[11].split()[2].replace(",","")))
    mle.append(int(lines[11].split()[4].replace(",","")))

# results obtained with stranded alignment using Kallisto quant
for file in glob("./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA/*/stderr.txt"):
    lines = !cat {file}
    ma.append(int(lines[11].split()[4].replace(",","")))
    
# results obtained with stranded alignment using Kallisto quant
for file in glob("./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA/*/stderr.txt"):
    lines = !cat {file}
    ml.append(int(lines[11].split()[4].replace(",","")))



In [24]:

    
# Plot the results
index = np.arange(len(sname_list))
bar_width = 0.35

p0 = pl.bar(index, tae, bar_width, color="LightGreen", alpha=0.7)
p1 = pl.bar(index, mae, bar_width, color="ForestGreen", alpha=0.7)
p2 = pl.bar(index, ma, bar_width, color="DarkGreen", alpha=0.7)
p3 = pl.bar(index+bar_width, tle, bar_width, color="SkyBlue", alpha=0.7)
p4 = pl.bar(index+bar_width, mle, bar_width, color="Blue", alpha=0.7)
p5 = pl.bar(index+bar_width, ml, bar_width, color="DarkBlue", alpha=0.7)

pl.xticks(index+bar_width, sname_list, rotation=45, horizontalalignment ="right")
pl.legend((p0[0], p1[0], p2[0], p3[0], p4[0], p5[0]), ('Total Reads', 'Mapped Reads','Mapped Reads no ERCC','Total Reads', 'Mapped Reads lncRNA','Mapped Reads lncRNA no ERCC'))
pl.title("Number of reads analysed by Kallisto")
pl.show()



In [23]:

    
# Plot the results
index = np.arange(len(sname_list))
bar_width = 0.35

percent_unstrand = [mapped/total*100.0 for mapped, total in zip(mapped_list, total_list)]
percent_strand = [mapped/total*100.0 for mapped, total in zip(mapped_lnc_list, total_lnc_list)]
    
p1 = pl.bar(index, percent_unstrand, bar_width, color="green", alpha=0.7)
p2 = pl.bar(index+bar_width, percent_strand, bar_width, color="blue", alpha=0.7)

pl.xticks(index+bar_width, sname_list, rotation=45, horizontalalignment ="right")
pl.legend((p1[0], p2[0]), ('% Mapped Reads stranded', '% Mapped Reads stranded lnc'))
pl.title("% of reads aligned by Kallisto")
pl.show()

I obtained between 4E7 and 2.3E8 reads aligned (ie 60% and 90%). The stranded alignment results in less aligned reads but it is surely more accurate, so I will use only these results. There are much fewer read aligned on lncRNA (< 10%) but this is consistant with previous knowledge

Sleuth differencial expression analysis

For the analysis with Sleuth I will use the basic pipeline recommanded by the documentation https://rawgit.com/pachterlab/sleuth/master/inst/doc/intro.html. Apparently it is not possible to use the ERCC to normalize the tpm and est, since the bootstrap would also have to be normalized. Instead I will perform the analysis with the ERCC transcripts and will verify that they do not appear significantly differentially expressed. If so, the Sleuth normalization is certainly enough, else I would have to find another way to analyse kallisto data

Normalize the name of the replicates in the folder containing the samples which are sometime called Rep1 Re2p Rep3 or Rep4



In [28]:

    
def normalize_repname(basedir):
    subst_dict = {"Rep1": "1", "Rep2": "2","Rep3": "1","Rep4": "2"}

    for subdir in [path for path in glob(basedir+"/*") if isdir(path)]:
        ss = subdir.split("_")
        if ss[-1] in subst_dict:
            ss[-1] = subst_dict[ss[-1]]
        newname = "_".join(ss)
        rename (subdir, newname)



In [37]:

    
for basedir in [
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA/",
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA/",
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/",
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/"
    ]:
    
    normalize_repname(basedir)
    !ls {basedir} | head -2









    



ENCSR000COK_K562_cytosol_1
ENCSR000COK_K562_cytosol_2
ENCSR000COK_K562_cytosol_1
ENCSR000COK_K562_cytosol_2
ENCSR000COK_K562_cytosol_1
ENCSR000COK_K562_cytosol_2
ENCSR000COK_K562_cytosol_1
ENCSR000COK_K562_cytosol_2

Create a file containing the name of samples, the cell type, the localization and the path to the dir containing kallisto results



In [6]:

    
# Create a file with sample names, descr and path of the result data
def sleuth_sample_list (basedir):
    
    sample_list = []
    list_file = basedir+"/list_file.tsv"
    
    # Iterate throught the result folder 
    for subdir in [path for path in glob(basedir+"/*") if isdir(path)]:
        
        # Extract info
        ss = basename(subdir).split("_")       
        cell_type = ss[1].strip().replace("-", "")
        localization = ss[2].strip().replace("-", "")
        replicate = ss[3].strip().replace("-", "")
        
        sample_list.append ({
                "sample" : "{}_{}_{}".format(cell_type, localization, replicate),
                "cell_type" : cell_type,
                "localization" : localization,
                "replicate" : replicate,
                "path" : abspath(subdir)
                })
        
    # Sort the list to work with grouped data in the next steps of the analysis
    sample_list.sort(key = lambda x: (x['cell_type'], x['localization'], x['replicate']))
    
    # Find the number of replicates of each sample
    rep_dict={}
    for d in sample_list:
        uniq_sample =  "{}_{}".format(d["cell_type"], d["localization"])
        if uniq_sample not in rep_dict:
            rep_dict[uniq_sample] = 0
        rep_dict[uniq_sample] += 1
    
    # Filter out samples with less than 2 replicates
    filter_list = []
    for d in sample_list:
        uniq_sample =  "{}_{}".format(d["cell_type"], d["localization"])
        if rep_dict[uniq_sample] >= 2:
            filter_list.append(d)
        else:
            print ("{} was filtered out\n".format(d["sample"]))
            
    # Write the results in a file
    with open (list_file, "w") as outfp:
        outfp.write("sample\tcell_type\tlocalization\treplicate\tpath\n")
        for d in filter_list:
            outfp.write("{}\t{}\t{}\t{}\t{}\n".format (d["sample"], d["cell_type"], d["localization"], d["replicate"], d["path"]))
            
    return (list_file)



In [8]:

    
for basedir in [
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA",
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA",
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC",
    "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC"
    ]:
    
    list_file = sleuth_sample_list(basedir)

head (list_file, 20)









    



H1hESC_cytosol_2 was filtered out

H1hESC_nucleus_2 was filtered out

H1hESC_cytosol_2 was filtered out

H1hESC_nucleus_2 was filtered out

H1hESC_cytosol_2 was filtered out

H1hESC_nucleus_2 was filtered out

H1hESC_cytosol_2 was filtered out

H1hESC_nucleus_2 was filtered out

sample	cell_type	localization	replicate	path
A549_cytosol_1	A549	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CTL_A549_cytosol_1
A549_cytosol_2	A549	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CTL_A549_cytosol_2
A549_nucleus_1	A549	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CTM_A549_nucleus_1
A549_nucleus_2	A549	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CTM_A549_nucleus_2
GM12878_cytosol_1	GM12878	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000COR_GM12878_cytosol_1
GM12878_cytosol_2	GM12878	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000COR_GM12878_cytosol_2
GM12878_nucleus_1	GM12878	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPO_GM12878_nucleus_1
GM12878_nucleus_2	GM12878	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPO_GM12878_nucleus_2
HUVEC_cytosol_1	HUVEC	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPA_HUVEC_cytosol_1
HUVEC_cytosol_2	HUVEC	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPA_HUVEC_cytosol_2
HUVEC_nucleus_1	HUVEC	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPB_HUVEC_nucleus_1
HUVEC_nucleus_2	HUVEC	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPB_HUVEC_nucleus_2
HeLaS3_cytosol_1	HeLaS3	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPP_HeLa-S3_cytosol_1
HeLaS3_cytosol_2	HeLaS3	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPP_HeLa-S3_cytosol_2
HeLaS3_nucleus_1	HeLaS3	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPQ_HeLa-S3_nucleus_1
HeLaS3_nucleus_2	HeLaS3	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPQ_HeLa-S3_nucleus_2
HepG2_cytosol_1	HepG2	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPF_HepG2_cytosol_1
HepG2_cytosol_2	HepG2	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPF_HepG2_cytosol_2
HepG2_nucleus_1	HepG2	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/allRNA_ERCC/ENCSR000CPC_HepG2_nucleus_1

H1-hESC was not included in the results because there are not replicates

Define an R function to analyse data with Sleuth



In [1]:

    
################## R KERNEL ##################

sleuth_analysis <- function(sample_list, transcript_info, select_col, select_val, de_col, beta_factor_list, outdir=".", ncore=4, max_bootstrap=10) {
    
    # Imports and setups
    library("sleuth")
    options(mc.cores = ncore)
 
    # Create the output directory
    outdir=file.path(outdir, select_val)
    dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
    
    cat("IMPORT AND SELECT THE SAMPLES\n")   
    # Import the file in a table and select sample according to select val and col 
    s2c = read.table(sample_list, header = TRUE, stringsAsFactors=FALSE)
    s2c = s2c[s2c[,select_col] == select_val,]   
    print (s2c$sample)
    
    # Save the table in file for tracability
    fname = file.path(outdir,"selected_datasets.tsv")
    write.table(s2c, file = fname, sep="\t", col.names = TRUE)
    
    cat("IMPORT THE GENE INFORMATIONS\n")   
    # Import the gene info in table and rename the fields for Sleuth compatibility
    t2g = read.table(transcript_info, header = TRUE, stringsAsFactors=FALSE)
    t2g = dplyr::rename(t2g, target_id = GENCODE_transcript_id, ens_gene = GENCODE_gene_id)
    
    cat("PREPARE THE DATA FOR SLEUTH\n")   
    # Extract the data from the kallisto results, LIMIT to 3 because for 100 the kernel get killed !!
    de_formula = as.formula(paste("~", de_col))
    so = sleuth_prep(s2c, de_formula, max_bootstrap=max_bootstrap, target_mapping = t2g)

    cat("FIT TO FULL MODEL\n")  
    # Estimate parameters for the sleuth response error measurement (full) model
    so <- sleuth_fit(so)
    
    cat("FIT TO REDUCED MODEL\n")
    # Estimate parameters for the sleuth reduced model (shrinkage)
    so <- sleuth_fit(so, ~1, 'reduced')

    cat("PERFORM WALD TESTING\n")
    # Performing Wald test
    for (beta_factor in beta_factor_list) { so <- sleuth_wt(so, which_beta = beta_factor, 'full')}

    cat("PERFORM LIKELYHOOD RATION TESTING\n")
    # Perform likelihood ratio test
    so <- sleuth_lrt(so, 'reduced', 'full')
    
    cat("EXPORT RESULTS")
    # Save all the raw abundance results in a table
    fname = file.path(outdir, "abundance_raw.tsv")
    write.table(so$obs_raw, file = fname, row.names=FALSE, na="",col.names=TRUE, sep="\t", quote=FALSE)
    # Save all the normalised abundance results in a table
    fname = file.path(outdir, "abundance_normed.tsv")
    write.table(so$obs_norm, file = fname, row.names=FALSE, na="",col.names=TRUE, sep="\t", quote=FALSE)
    # Save all the normalised and filtered abundance results in a table
    fname = file.path(outdir, "abundance_normed_filtered.tsv")
    write.table(so$obs_norm_filt, file = fname, row.names=FALSE, na="",col.names=TRUE, sep="\t", quote=FALSE)
    # Save all the results in a table
    results_table <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
    fname = file.path(outdir, "DE_Sleuth.tsv")
    write.table(results_table, file = fname, row.names=TRUE, na="",col.names=TRUE, sep="\t", quote=FALSE)

    # Save the Sleuth object for further analysis
    fname = file.path(outdir, "DE_Sleuth.rds")
    saveRDS(so, file = fname)
    
    return (so)
}

Test with a sample



In [3]:

    
so = sleuth_analysis(
    sample_list = "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA/list_file.tsv",
    transcript_info = "../../Reference_Genomes/gencode_v24_lncRNA_transcripts_info.tsv",
    select_col = "cell_type",
    select_val = "HeLaS3",
    de_col = "localization + replicate",
    beta_factor_list = c('localizationnucleus', 'replicate'),
    outdir = "./Localisation_Original_Datasets/Djebali-ENCODE/test",
    max_bootstrap=2
)









    



IMPORT AND SELECT THE SAMPLES
[1] "HeLaS3_cytosol_1" "HeLaS3_cytosol_2" "HeLaS3_nucleus_1" "HeLaS3_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11629 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

I launched the analysis with all the datasets



In [2]:

    
transcript_info = "../../Reference_Genomes/gencode_v24_lncRNA_transcripts_info.tsv"
kallisto_dir = "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto"
sleuth_dir = "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth"
subdir_list = c("lncRNA_ERCC") # not with all RNA for the moment.
select_col = "cell_type"
de_col = "localization + replicate"
beta_factor_list = c('localizationnucleus', 'replicate')
select_val_list = c("IMR90","A549","GM12878","HeLaS3","HepG2","HUVEC","K562","MCF7","NHEK","SKNSH")

for (subdir in subdir_list) {
    sample_list = file.path(kallisto_dir, subdir, "list_file.tsv") # With the modified ordered file list
    outdir = file.path(sleuth_dir, subdir)
    cat (paste(c("\n########## ANALYSING", subdir, "DATAGROUP ##########\n"), collapse=" "))
    for (select_val in select_val_list){
        cat (paste(c("\n### ANALYSING", select_val, "SAMPLE ###\n"), collapse=" "))
        sleuth_analysis (sample_list, transcript_info, select_col, select_val, de_col, beta_factor_list, outdir, max_bootstrap=100)
    }
}









    



########## ANALYSING lncRNA_ERCC DATAGROUP ##########

### ANALYSING IMR90 SAMPLE ###






    



Loading required package: ggplot2
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union







    



IMPORT AND SELECT THE SAMPLES
[1] "IMR90_cytosol_1" "IMR90_cytosol_2" "IMR90_nucleus_1" "IMR90_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11667 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING A549 SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "A549_cytosol_1" "A549_cytosol_2" "A549_nucleus_1" "A549_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
12007 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING GM12878 SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "GM12878_cytosol_1" "GM12878_cytosol_2" "GM12878_nucleus_1"
[4] "GM12878_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
13025 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING HeLaS3 SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "HeLaS3_cytosol_1" "HeLaS3_cytosol_2" "HeLaS3_nucleus_1" "HeLaS3_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11689 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING HepG2 SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "HepG2_cytosol_1" "HepG2_cytosol_2" "HepG2_nucleus_1" "HepG2_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11896 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING HUVEC SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "HUVEC_cytosol_1" "HUVEC_cytosol_2" "HUVEC_nucleus_1" "HUVEC_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
12266 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING K562 SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "K562_cytosol_1" "K562_cytosol_2" "K562_nucleus_1" "K562_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
12002 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING MCF7 SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "MCF7_cytosol_1" "MCF7_cytosol_2" "MCF7_nucleus_1" "MCF7_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
13004 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING NHEK SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "NHEK_cytosol_1" "NHEK_cytosol_2" "NHEK_nucleus_1" "NHEK_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11398 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS
### ANALYSING SKNSH SAMPLE ###
IMPORT AND SELECT THE SAMPLES
[1] "SKNSH_cytosol_1" "SKNSH_cytosol_2" "SKNSH_nucleus_1" "SKNSH_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
14174 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



FIT TO REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS



In [ ]:

    
# start a shinny web server to explore the data
library("sleuth")
so = readRDS("./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/lncRNA_ERCC/SK-N-SH/DE_Sleuth.rds")
sleuth_live(so)









    



Loading required package: shiny

Listening on http://127.0.0.1:42427
Warning message:
: Removed 3 rows containing missing values (geom_point).Warning message:
: Removed 3 rows containing missing values (geom_point).Warning message:
: Removed 3 rows containing missing values (geom_point).Warning message:
: Removed 3 rows containing missing values (geom_point).Warning message:
: Removed 3 rows containing missing values (geom_point).Warning message:
: Removed 3 rows containing missing values (geom_point).Warning message:
: Removed 3 rows containing missing values (geom_point).

This is working OK but I will need to find a way to normalize between sample using a method more advanced that the one offered by Sleuth. In addition the transcript for the ERCC spike need to be used for this normalisation but removed from the final analysis.Finally I have to include the gene level analysis if I want to compare with the other methods at some point. All the samples were analysed on my laptop within one day. All the rds files are available to further data exploration with sleuth_live shiny interface

Analysis of the ERCC data

I will first have a look to the correspondance between the concentration of the ERCC spike and the abundance of the ERCC before and after normalization to see if there is any biais during the alignment or the normalization with sleuth

To analyse this difference I extracted the real concentrations of the 96 ERCC spikes (as reported by ENCODE) as well as the raw, normalized and filtered abundance files from Kallisto/Sleuth



In [10]:

    
################## PYTHON3 KERNEL ##################

# Parse the file with the concentration of ERCC 
def ERCC_concentration_parsing (ERCC_conc):

    # Read the concentration data from the ERCC
    ERCC_dict = {}
    with open (ERCC_conc, "r") as fp:
        for line in fp:
            sl = line.split()
            if sl[0] != "#":
                ERCC_dict[sl[0]] = float(sl[1])
    return ERCC_dict

def ERCC_kallisto_parsing (ERCC_dict, kallisto_basedir):
    
    # Parse the kallisto tsv files containing the tpm results and extract the tpm and the est values
    # Build a dict table containing all the data indexed by sample and gene name 
    ERCC_abundance_dict = OrderedDict ()
    abundance_results = {
        "raw":"abundance_raw.tsv",
        "norm":"abundance_normed.tsv",
        "filter":"abundance_normed_filtered.tsv"}

    # Iterate throught the result folder 
    for subdir in sorted([path for path in glob(kallisto_basedir+"/*") if isdir(path)]):

        # Iterate throught the abundance result files 
        for res_type, abundance_file_suffix in abundance_results.items():
            abundance_file = "{}/{}".format(subdir, abundance_file_suffix)
            with open (abundance_file, "r") as fp:

                # read the first line and determine the table structure that is not always the same...
                header_line = next(fp)
                hd =  {name:i for i, name in enumerate(header_line.split())}

                # now iterate through the other lines and find the lines with ERCC transcripts
                for line in fp:
                    sl = line.split()
                    transcript_id = sl[hd["target_id"]]
                    sample_name = sl[hd["sample"]]
                    if transcript_id in ERCC_dict:
                        key = "{}_{}".format(transcript_id, sample_name)

                        # When found for the first time = extract generic informations for this sample/gen combination
                        if key not in ERCC_abundance_dict:
                            cell, localization, replicate = sample_name.split("_")
                            ERCC_abundance_dict[key] = OrderedDict ()
                            ERCC_abundance_dict[key]["cell_loc"] = "{}_{}".format(cell, localization)
                            ERCC_abundance_dict[key]["cell"] = cell
                            ERCC_abundance_dict[key]["localization"] = localization
                            ERCC_abundance_dict[key]["replicate"] = replicate
                            ERCC_abundance_dict[key]["transcript_id"] = transcript_id
                            ERCC_abundance_dict[key]["log_concentration"] = np.log(ERCC_dict[transcript_id])
                            ERCC_abundance_dict[key]["eff_length"] = float(sl[hd["eff_len"]])
                            ERCC_abundance_dict[key]["length"] = float(sl[hd["len"]])

                        # Then extract the tpm and est_count
                        ERCC_abundance_dict[key]["log_est_"+res_type] = np.log(float(sl[hd["est_counts"]]))
                        ERCC_abundance_dict[key]["log_tpm_"+res_type] = np.log(float(sl[hd["tpm"]]))

    # Transform into a panda dataframe
    ERCC_abundance_df = pd.DataFrame.from_dict(ERCC_abundance_dict, orient='index')
    return (ERCC_abundance_dict, ERCC_abundance_df)


def ERCC_table_parsing (ERCC_abundance_dict, countfile, colname="log_col"):
    
    # import the tximport file into a dataframe
    df = pd.read_table(countfile)
     
    # Iterate through the dataframe to extract tximport normalized aggregated counts
    for gene_id, val_list in df.iterrows():
        for sample_name, val in val_list.iteritems():
            key = "{}_{}".format(gene_id, sample_name)
            if key in ERCC_abundance_dict:
                ERCC_abundance_dict[key][colname] = np.log(float(val))
    
    # Transform into a panda dataframe
    ERCC_abundance_df = pd.DataFrame.from_dict(ERCC_abundance_dict, orient='index')
    return (ERCC_abundance_dict, ERCC_abundance_df)



In [7]:

    
# Parse the concentration file
ERCC_conc = "./Localisation_Original_Datasets/Djebali-ENCODE/Protocol_info/NistPool14Concentrations.tsv"
ERCC_dict = ERCC_concentration_parsing (ERCC_conc)

# Create a dataframe by combining the kallisto results and the ERCC concentrations
kallisto_basedir = "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/lncRNA_ERCC"
ERCC_abundance_dict, ERCC_abundance_df = ERCC_kallisto_parsing (ERCC_dict, kallisto_basedir)
ERCC_abundance_df.head(5)









    Out[7]:






  
    
      
      cell_loc
      cell
      localization
      replicate
      transcript_id
      log_concentration
      eff_length
      length
      log_est_filter
      log_tpm_filter
      log_est_raw
      log_tpm_raw
      log_est_norm
      log_tpm_norm
    
  
  
    
      ERCC-00002_A549_cytosol_1
      A549_cytosol
      A549
      cytosol
      1
      ERCC-00002
      -9.866192
      1011.634547
      1061.0
      12.879195
      10.330360
      12.436224
      9.670236
      12.879195
      10.330360
    
    
      ERCC-00002_A549_cytosol_2
      A549_cytosol
      A549
      cytosol
      2
      ERCC-00002
      -9.866192
      958.263200
      1061.0
      12.708273
      10.183487
      12.409797
      9.735956
      12.708273
      10.183487
    
    
      ERCC-00002_A549_nucleus_1
      A549_nucleus
      A549
      nucleus
      1
      ERCC-00002
      -9.866192
      800.665212
      1061.0
      12.159725
      9.739943
      12.628106
      10.240396
      12.159725
      9.739943
    
    
      ERCC-00002_A549_nucleus_2
      A549_nucleus
      A549
      nucleus
      2
      ERCC-00002
      -9.866192
      781.193314
      1061.0
      12.200813
      9.804348
      12.473880
      10.411550
      12.200813
      9.804348
    
    
      ERCC-00002_GM12878_cytosol_1
      GM12878_cytosol
      GM12878
      cytosol
      1
      ERCC-00002
      -9.866192
      1158.248810
      1061.0
      11.467636
      8.527796
      11.230430
      8.360036
      11.467636
      8.527796

Comparison of the nuclear and cytoplasmic fractions tpm and est before and after normalization relative to the initial spike concentrations



In [8]:

    
for X, Y in [["log_concentration", "log_tpm_raw"], ["log_concentration", "log_est_raw"], ["log_concentration", "log_tpm_norm"], ["log_concentration", "log_est_norm"], ["log_tpm_raw", "log_tpm_norm"]] :
    g = sns.lmplot(x=X, y=Y, hue="localization", col="cell", data=ERCC_abundance_df, lowess=True)
    plt.subplots_adjust(top=0.8)
    g.fig.suptitle("{} /{}".format(Y.upper(), X.upper()))

Apparently the normalization algorithm of Sleuth is not doing a good job with the datasets. Indeed, instead of improving the variability between the replicates and the proximity of both nucleus and cytoplasm it increases the difference

I will compare with the sesults obtained with other mappers including STAR and HISAT

STAR

Generate an index with lncRNA and ERCC only

STAR2.5.2a --runThreadN 8 --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ../../Reference_Genomes/GRCh38_primary_assembly_ERCC.fa --sjdbGTFfile ../../Reference_Annotation/gencode_v24_lncRNAs_ERCC.gtf

Align all the samples with basic options recommanded by ENCODE using a simple bash script

RNAseq_STAR_pipeline.sh ../../../../../Index/STAR/GRCh38_ERCC_gencode_v24_lncRNAs/ R1 R2 8 output_dir 0

2 samples failed because of improper fastq formating "EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length"*

ENCSR000CTO_MCF-7_nucleus_2
ENCSR000CPK_NHEK_cytosol_1

List the STAR Samples similar to kallisto



In [14]:

    
# Create a file with sample names, descr and path of the result data
def STAR_sample_list (basedir):
    
    sample_list = []
    list_file = basedir+"/list_file.tsv"
    
    # Iterate throught the result folder 
    for subdir in [path for path in glob(basedir+"/*") if isdir(path)]:
        
        # Extract info
        ss = basename(subdir).split("_") 
        cell_type = ss[1].strip().replace("-", "")
        localization = ss[2].strip().replace("-", "")
        replicate = ss[3].strip().replace("-", "")
        
        sample_list.append ({
                "sample" : "{}_{}_{}".format(cell_type, localization, replicate),
                "cell_type" : cell_type,
                "localization" : localization,
                "replicate" : replicate,
                "path" : abspath(subdir)+"/ReadsPerGene.out.tab"
                })
        
    # Sort the list to work with grouped data in the next steps of the analysis
    sample_list.sort(key = lambda x: (x['cell_type'], x['localization'], x['replicate']))
    
    # Find the number of replicates of each sample
    rep_dict={}
    for d in sample_list:
        uniq_sample =  "{}_{}".format(d["cell_type"], d["localization"])
        if uniq_sample not in rep_dict:
            rep_dict[uniq_sample] = 0
        rep_dict[uniq_sample] += 1
    
    # Filter out samples with less than 2 replicates
    filter_list = []
    for d in sample_list:
        uniq_sample =  "{}_{}".format(d["cell_type"], d["localization"])
        if rep_dict[uniq_sample] >= 2:
            filter_list.append(d)
        else:
            print ("{} was filtered out\n".format(d["sample"]))
            
    # Write the results in a file
    with open (list_file, "w") as outfp:
        outfp.write("sample\tcell_type\tlocalization\treplicate\tpath\n")
        for d in filter_list:
            outfp.write("{}\t{}\t{}\t{}\t{}\n".format (d["sample"], d["cell_type"], d["localization"], d["replicate"], d["path"]))
            
    return (list_file)



In [16]:

    
fp = STAR_sample_list("./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC")
head(fp, 20)









    



H1hESC_cytosol_2 was filtered out

H1hESC_nucleus_2 was filtered out

MCF7_nucleus_1 was filtered out

NHEK_cytosol_2 was filtered out

sample	cell_type	localization	replicate	path
A549_cytosol_1	A549	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CTL_A549_cytosol_1/ReadsPerGene.out.tab
A549_cytosol_2	A549	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CTL_A549_cytosol_2/ReadsPerGene.out.tab
A549_nucleus_1	A549	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CTM_A549_nucleus_1/ReadsPerGene.out.tab
A549_nucleus_2	A549	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CTM_A549_nucleus_2/ReadsPerGene.out.tab
GM12878_cytosol_1	GM12878	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000COR_GM12878_cytosol_1/ReadsPerGene.out.tab
GM12878_cytosol_2	GM12878	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000COR_GM12878_cytosol_2/ReadsPerGene.out.tab
GM12878_nucleus_1	GM12878	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPO_GM12878_nucleus_1/ReadsPerGene.out.tab
GM12878_nucleus_2	GM12878	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPO_GM12878_nucleus_2/ReadsPerGene.out.tab
HUVEC_cytosol_1	HUVEC	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPA_HUVEC_cytosol_1/ReadsPerGene.out.tab
HUVEC_cytosol_2	HUVEC	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPA_HUVEC_cytosol_2/ReadsPerGene.out.tab
HUVEC_nucleus_1	HUVEC	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPB_HUVEC_nucleus_1/ReadsPerGene.out.tab
HUVEC_nucleus_2	HUVEC	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPB_HUVEC_nucleus_2/ReadsPerGene.out.tab
HeLaS3_cytosol_1	HeLaS3	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPP_HeLa-S3_cytosol_1/ReadsPerGene.out.tab
HeLaS3_cytosol_2	HeLaS3	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPP_HeLa-S3_cytosol_2/ReadsPerGene.out.tab
HeLaS3_nucleus_1	HeLaS3	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPQ_HeLa-S3_nucleus_1/ReadsPerGene.out.tab
HeLaS3_nucleus_2	HeLaS3	nucleus	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPQ_HeLa-S3_nucleus_2/ReadsPerGene.out.tab
HepG2_cytosol_1	HepG2	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPF_HepG2_cytosol_1/ReadsPerGene.out.tab
HepG2_cytosol_2	HepG2	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPF_HepG2_cytosol_2/ReadsPerGene.out.tab
HepG2_nucleus_1	HepG2	nucleus	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CPC_HepG2_nucleus_1/ReadsPerGene.out.tab

Aggregating all the results in a panda dataframe and export in a table



In [17]:

    
head("/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/ENCSR000CTL_A549_cytosol_1/ReadsPerGene.out.tab")









    



N_unmapped	3824591	3824591	3824591
N_multimapping	6397833	6397833	6397833
N_noFeature	117989065	121889754	124358255
N_ambiguous	344853	238963	56752
ENSG00000243485.3	1	1	0
ENSG00000237613.2	0	0	0
ENSG00000238009.6	14	0	14
ENSG00000239945.1	0	0	0
ENSG00000239906.1	0	0	0
ENSG00000241860.6	92	0	92



In [18]:

    
def STAR_aggregate (list_file, outdir):
    
    data_dict = OrderedDict()
    
    # Read the STAR sample_list
    with open (list_file, "r") as sample_list:
        header = next(sample_list)
        
        for sample_line in sample_list:
            sample, cell_type, localization, replicate, path = sample_line.strip().split("\t")
            
            # Read the STAR count files and collect the data
            with open (path, "r") as count_file:
                for count_line in count_file:
                    gene_id, unstrand_count, pos_strand_count, neg_strand_count = count_line.strip().split("\t")
                    if gene_id not in data_dict:
                        data_dict[gene_id] = OrderedDict()
                    if sample not in data_dict[gene_id]:
                        data_dict[gene_id][sample] = OrderedDict()
                    data_dict[gene_id][sample]["unstrand_count"] = unstrand_count
                    data_dict[gene_id][sample]["pos_strand_count"] = pos_strand_count
                    data_dict[gene_id][sample]["neg_strand_count"] = neg_strand_count
    
    # Write the collected data in a dict separatly for unstranded, neg_strand and pos_strand
    with \
        open ( outdir+"/STAR_unstrand_count.tsv", "w") as unstrand_fp, \
        open ( outdir+"/STAR_pos_strand_count.tsv", "w") as pos_strand_fp, \
        open ( outdir+"/STAR_neg_strand_count.tsv", "w") as neg_strand_fp:
    
        first_row = True
        for gene_id, gene_dict in data_dict.items():
            if first_row:
                unstrand_fp.write("\t".join(gene_dict.keys())+"\n")
                pos_strand_fp.write("\t".join(gene_dict.keys())+"\n")
                neg_strand_fp.write("\t".join(gene_dict.keys())+"\n")
                first_row = False

            unstrand_fp.write(gene_id)
            pos_strand_fp.write(gene_id)
            neg_strand_fp.write(gene_id
                               )
            for sample_dict in gene_dict.values():
                unstrand_fp.write("\t{}".format(sample_dict["unstrand_count"])) 
                pos_strand_fp.write("\t{}".format(sample_dict["pos_strand_count"]))
                neg_strand_fp.write("\t{}".format(sample_dict["neg_strand_count"]))

            unstrand_fp.write("\n")
            pos_strand_fp.write("\n")
            neg_strand_fp.write("\n")



In [19]:

    
STAR_aggregate ("./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/list_file.tsv", "./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC")

head("./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/STAR_unstrand_count.tsv")
head("./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/STAR_pos_strand_count.tsv")
head("./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/STAR_neg_strand_count.tsv")









    



A549_cytosol_1	A549_cytosol_2	A549_nucleus_1	A549_nucleus_2	GM12878_cytosol_1	GM12878_cytosol_2	GM12878_nucleus_1	GM12878_nucleus_2	HUVEC_cytosol_1	HUVEC_cytosol_2	HUVEC_nucleus_1	HUVEC_nucleus_2	HeLaS3_cytosol_1	HeLaS3_cytosol_2	HeLaS3_nucleus_1	HeLaS3_nucleus_2	HepG2_cytosol_1	HepG2_cytosol_2	HepG2_nucleus_1	HepG2_nucleus_2	IMR90_cytosol_1	IMR90_cytosol_2	IMR90_nucleus_1	IMR90_nucleus_2	K562_cytosol_1	K562_cytosol_2	K562_nucleus_1	K562_nucleus_2	MCF7_cytosol_1	MCF7_cytosol_2	NHEK_nucleus_1	NHEK_nucleus_2	SKNSH_cytosol_1	SKNSH_cytosol_2	SKNSH_nucleus_1	SKNSH_nucleus_2
N_unmapped	3824591	3740160	4566615	3280032	6199518	4187671	8887206	8315464	6253580	6469580	6372600	5425745	7345167	5765453	6429413	8834347	5091946	4252099	6960792	7254641	3692279	4065183	4390657	4726810	5872600	5239631	6049613	7629890	12259267	9300101	8148345	8123898	7850536	7988309	7781320	5865987
N_multimapping	6397833	6976778	6496860	4220819	10831550	7081567	6253766	5806594	11052163	12214875	5333181	4639090	9691402	7276837	4618609	4405025	7571564	7300392	5949550	5636364	4609222	5250836	4148113	4841763	10460119	7188633	7514697	5562820	7525648	6943013	6271676	6026177	11964685	13226119	5433130	8411056
N_noFeature	117989065	132077378	160857369	86886998	103561960	78470003	106525618	96367577	92222855	89314264	100387374	96105957	97784951	85323916	60961603	85988990	94784797	94363927	84540211	69225012	96824215	124156055	147933478	149042834	101090408	70600493	97299755	86839174	133871756	129249875	85136241	100710872	202798719	193066611	125024216	209774845
N_ambiguous	344853	307593	314082	168600	180531	134497	451816	455925	221813	228846	148878	136872	182137	193316	100826	147827	148072	168125	180267	154926	231581	322597	412134	714403	179015	133298	215975	161463	319881	297341	140472	177401	583875	523471	435095	873537
ENSG00000243485.3	1	2	0	1	1	0	1	2	0	1	2	5	1	0	0	3	0	0	3	2	0	2	14	9	7	14	24	13	1	1	0	0	1	3	16	5
ENSG00000237613.2	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	2	1	2	9	3	6	6	0	0	0	0	0	0	0	0
ENSG00000238009.6	14	9	7	7	5	9	10	4	8	16	17	17	5	19	4	33	25	21	18	12	0	8	14	19	139	52	139	123	29	29	16	20	27	22	7	37
ENSG00000239945.1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSG00000239906.1	0	0	0	0	0	0	0	0	0	1	2	0	0	0	0	0	0	0	0	0	0	0	3	4	4	6	13	20	0	0	3	0	0	0	0	0

A549_cytosol_1	A549_cytosol_2	A549_nucleus_1	A549_nucleus_2	GM12878_cytosol_1	GM12878_cytosol_2	GM12878_nucleus_1	GM12878_nucleus_2	HUVEC_cytosol_1	HUVEC_cytosol_2	HUVEC_nucleus_1	HUVEC_nucleus_2	HeLaS3_cytosol_1	HeLaS3_cytosol_2	HeLaS3_nucleus_1	HeLaS3_nucleus_2	HepG2_cytosol_1	HepG2_cytosol_2	HepG2_nucleus_1	HepG2_nucleus_2	IMR90_cytosol_1	IMR90_cytosol_2	IMR90_nucleus_1	IMR90_nucleus_2	K562_cytosol_1	K562_cytosol_2	K562_nucleus_1	K562_nucleus_2	MCF7_cytosol_1	MCF7_cytosol_2	NHEK_nucleus_1	NHEK_nucleus_2	SKNSH_cytosol_1	SKNSH_cytosol_2	SKNSH_nucleus_1	SKNSH_nucleus_2
N_unmapped	3824591	3740160	4566615	3280032	6199518	4187671	8887206	8315464	6253580	6469580	6372600	5425745	7345167	5765453	6429413	8834347	5091946	4252099	6960792	7254641	3692279	4065183	4390657	4726810	5872600	5239631	6049613	7629890	12259267	9300101	8148345	8123898	7850536	7988309	7781320	5865987
N_multimapping	6397833	6976778	6496860	4220819	10831550	7081567	6253766	5806594	11052163	12214875	5333181	4639090	9691402	7276837	4618609	4405025	7571564	7300392	5949550	5636364	4609222	5250836	4148113	4841763	10460119	7188633	7514697	5562820	7525648	6943013	6271676	6026177	11964685	13226119	5433130	8411056
N_noFeature	121889754	136269230	165618158	90661412	106168163	80630076	110417651	99786769	94422409	91127572	103665219	99134464	100198635	87539532	63486477	89793692	96640516	96357239	87319703	71227744	99981178	126408003	152793342	156299437	104297280	72479515	101291957	90509757	139738611	136078998	87544321	103671982	213213692	202940284	131535840	222021062
N_ambiguous	238963	191379	165391	85992	96995	70963	38764	33524	150899	163835	42957	41468	93711	93834	26435	29372	87929	101891	45024	38717	170834	236772	125515	111568	74807	54458	23209	21508	176885	163046	27307	31765	448439	412627	107171	162179
ENSG00000243485.3	1	1	0	0	0	0	0	1	0	1	1	4	0	0	0	3	0	0	3	2	0	1	8	7	1	13	15	11	0	0	0	0	0	2	10	2
ENSG00000237613.2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
ENSG00000238009.6	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSG00000239945.1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSG00000239906.1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	1	1	2	0	0	0	0	0	0	0	0

A549_cytosol_1	A549_cytosol_2	A549_nucleus_1	A549_nucleus_2	GM12878_cytosol_1	GM12878_cytosol_2	GM12878_nucleus_1	GM12878_nucleus_2	HUVEC_cytosol_1	HUVEC_cytosol_2	HUVEC_nucleus_1	HUVEC_nucleus_2	HeLaS3_cytosol_1	HeLaS3_cytosol_2	HeLaS3_nucleus_1	HeLaS3_nucleus_2	HepG2_cytosol_1	HepG2_cytosol_2	HepG2_nucleus_1	HepG2_nucleus_2	IMR90_cytosol_1	IMR90_cytosol_2	IMR90_nucleus_1	IMR90_nucleus_2	K562_cytosol_1	K562_cytosol_2	K562_nucleus_1	K562_nucleus_2	MCF7_cytosol_1	MCF7_cytosol_2	NHEK_nucleus_1	NHEK_nucleus_2	SKNSH_cytosol_1	SKNSH_cytosol_2	SKNSH_nucleus_1	SKNSH_nucleus_2
N_unmapped	3824591	3740160	4566615	3280032	6199518	4187671	8887206	8315464	6253580	6469580	6372600	5425745	7345167	5765453	6429413	8834347	5091946	4252099	6960792	7254641	3692279	4065183	4390657	4726810	5872600	5239631	6049613	7629890	12259267	9300101	8148345	8123898	7850536	7988309	7781320	5865987
N_multimapping	6397833	6976778	6496860	4220819	10831550	7081567	6253766	5806594	11052163	12214875	5333181	4639090	9691402	7276837	4618609	4405025	7571564	7300392	5949550	5636364	4609222	5250836	4148113	4841763	10460119	7188633	7514697	5562820	7525648	6943013	6271676	6026177	11964685	13226119	5433130	8411056
N_noFeature	124358255	139056439	168599642	90542412	108011616	82054180	108997372	98586725	96382691	93551028	103063260	98495146	101624453	89158134	62781757	88103192	97938501	98273544	86624046	71015936	102505766	132670252	154128737	153638044	105211331	74079578	99399502	88800709	139113642	134267859	86997807	102811611	210814233	199813609	129659259	216307540
N_ambiguous	56752	58918	74231	37985	38671	29257	44435	35817	26792	25485	39080	33420	28893	31607	30514	47014	19506	23321	42936	39229	23096	46534	71273	63701	33432	25050	37788	37111	68312	61821	29305	31488	61527	43238	59588	96922
ENSG00000243485.3	0	1	0	1	1	0	1	1	0	0	1	1	1	0	0	0	0	0	0	0	0	1	6	2	6	1	9	2	1	1	0	0	1	1	6	3
ENSG00000237613.2	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	2	1	1	9	3	6	6	0	0	0	0	0	0	0	0
ENSG00000238009.6	14	9	7	7	4	9	10	4	8	16	17	17	5	19	4	33	25	21	18	12	0	8	14	19	139	52	139	123	29	29	16	20	27	22	7	37
ENSG00000239945.1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSG00000239906.1	0	0	0	0	0	0	0	0	0	1	2	0	0	0	0	0	0	0	0	0	0	0	3	4	2	5	12	18	0	0	3	0	0	0	0	0

Adding the STAR information to the dict containing kallisto results



In [21]:

    
# Define the vasedir containing STAR count results
STAR_countfile = "./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/STAR_neg_strand_count.tsv"
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, STAR_countfile, "log_STAR_count")
ERCC_abundance_df.head(20)









    Out[21]:






  
    
      
      cell_loc
      cell
      localization
      replicate
      transcript_id
      log_concentration
      eff_length
      length
      log_est_filter
      log_tpm_filter
      log_est_raw
      log_tpm_raw
      log_est_norm
      log_tpm_norm
      log_STAR_count
    
  
  
    
      ERCC-00002_A549_cytosol_1
      A549_cytosol
      A549
      cytosol
      1
      ERCC-00002
      -9.866192
      1011.634547
      1061.0
      12.879195
      10.330360
      12.436224
      9.670236
      12.879195
      10.330360
      12.423580
    
    
      ERCC-00002_A549_cytosol_2
      A549_cytosol
      A549
      cytosol
      2
      ERCC-00002
      -9.866192
      958.263200
      1061.0
      12.708273
      10.183487
      12.409797
      9.735956
      12.708273
      10.183487
      12.398963
    
    
      ERCC-00002_A549_nucleus_1
      A549_nucleus
      A549
      nucleus
      1
      ERCC-00002
      -9.866192
      800.665212
      1061.0
      12.159725
      9.739943
      12.628106
      10.240396
      12.159725
      9.739943
      12.616804
    
    
      ERCC-00002_A549_nucleus_2
      A549_nucleus
      A549
      nucleus
      2
      ERCC-00002
      -9.866192
      781.193314
      1061.0
      12.200813
      9.804348
      12.473880
      10.411550
      12.200813
      9.804348
      12.445239
    
    
      ERCC-00002_GM12878_cytosol_1
      GM12878_cytosol
      GM12878
      cytosol
      1
      ERCC-00002
      -9.866192
      1158.248810
      1061.0
      11.467636
      8.527796
      11.230430
      8.360036
      11.467636
      8.527796
      11.188220
    
    
      ERCC-00002_GM12878_cytosol_2
      GM12878_cytosol
      GM12878
      cytosol
      2
      ERCC-00002
      -9.866192
      1329.923311
      1061.0
      11.383686
      8.489214
      10.809061
      7.652244
      11.383686
      8.489214
      10.779310
    
    
      ERCC-00002_GM12878_nucleus_1
      GM12878_nucleus
      GM12878
      nucleus
      1
      ERCC-00002
      -9.866192
      1071.046659
      1061.0
      10.560210
      7.860315
      10.985259
      8.318615
      10.560210
      7.860315
      10.940260
    
    
      ERCC-00002_GM12878_nucleus_2
      GM12878_nucleus
      GM12878
      nucleus
      2
      ERCC-00002
      -9.866192
      1092.073958
      1061.0
      10.472601
      7.740907
      10.859384
      8.287338
      10.472601
      7.740907
      10.809041
    
    
      ERCC-00002_HUVEC_cytosol_1
      HUVEC_cytosol
      HUVEC
      cytosol
      1
      ERCC-00002
      -9.866192
      1208.543817
      1061.0
      12.034857
      9.418134
      11.340725
      8.503493
      12.034857
      9.418134
      11.296050
    
    
      ERCC-00002_HUVEC_cytosol_2
      HUVEC_cytosol
      HUVEC
      cytosol
      2
      ERCC-00002
      -9.866192
      1178.095228
      1061.0
      11.788333
      9.129867
      10.993816
      8.605956
      11.788333
      9.129867
      10.942987
    
    
      ERCC-00002_HUVEC_nucleus_1
      HUVEC_nucleus
      HUVEC
      nucleus
      1
      ERCC-00002
      -9.866192
      999.057973
      1061.0
      10.633840
      8.253097
      11.366338
      8.785413
      10.633840
      8.253097
      11.323301
    
    
      ERCC-00002_HUVEC_nucleus_2
      HUVEC_nucleus
      HUVEC
      nucleus
      2
      ERCC-00002
      -9.866192
      1023.137222
      1061.0
      10.307372
      7.861786
      11.063524
      8.768024
      10.307372
      7.861786
      11.008132
    
    
      ERCC-00002_HeLaS3_cytosol_1
      HeLaS3_cytosol
      HeLaS3
      cytosol
      1
      ERCC-00002
      -9.866192
      1142.966389
      1061.0
      11.553087
      8.978558
      11.218756
      8.433860
      11.553087
      8.978558
      11.162971
    
    
      ERCC-00002_HeLaS3_cytosol_2
      HeLaS3_cytosol
      HeLaS3
      cytosol
      2
      ERCC-00002
      -9.866192
      1053.303920
      1061.0
      11.383709
      8.808822
      10.866795
      8.301908
      11.383709
      8.808822
      10.817576
    
    
      ERCC-00002_HeLaS3_nucleus_1
      HeLaS3_nucleus
      HeLaS3
      nucleus
      1
      ERCC-00002
      -9.866192
      1088.432757
      1061.0
      10.304682
      7.908169
      10.460414
      8.261499
      10.304682
      7.908169
      10.418285
    
    
      ERCC-00002_HeLaS3_nucleus_2
      HeLaS3_nucleus
      HeLaS3
      nucleus
      2
      ERCC-00002
      -9.866192
      1086.144908
      1061.0
      10.000265
      7.613103
      10.695778
      8.311384
      10.000265
      7.613103
      10.649488
    
    
      ERCC-00002_HepG2_cytosol_1
      HepG2_cytosol
      HepG2
      cytosol
      1
      ERCC-00002
      -9.866192
      1071.885012
      1061.0
      11.249952
      8.603433
      10.754514
      8.608259
      11.249952
      8.603433
      10.692740
    
    
      ERCC-00002_HepG2_cytosol_2
      HepG2_cytosol
      HepG2
      cytosol
      2
      ERCC-00002
      -9.866192
      1155.010408
      1061.0
      11.488466
      8.809816
      10.910752
      8.183842
      11.488466
      8.809816
      10.876518
    
    
      ERCC-00002_HepG2_nucleus_1
      HepG2_nucleus
      HepG2
      nucleus
      1
      ERCC-00002
      -9.866192
      1019.937798
      1061.0
      10.419430
      7.967528
      11.081896
      8.204966
      10.419430
      7.967528
      11.031675
    
    
      ERCC-00002_HepG2_nucleus_2
      HepG2_nucleus
      HepG2
      nucleus
      2
      ERCC-00002
      -9.866192
      1023.247994
      1061.0
      10.310023
      7.844775
      10.720709
      8.228484
      10.310023
      7.844775
      10.671093



In [22]:

    
for X, Y in [["log_concentration", "log_STAR_count"], ["log_STAR_count", "log_est_raw"], ["log_STAR_count", "log_est_norm"], ["log_STAR_count", "log_tpm_raw"], ["log_STAR_count", "log_tpm_norm"]] :
    g = sns.lmplot(x=X, y=Y, hue="localization", col="cell", data=ERCC_abundance_df, lowess=True)
    plt.subplots_adjust(top=0.8)
    g.fig.suptitle("{} /{}".format(Y.upper(), X.upper()))









    



/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

As seen on the graphs, STAR counts correlate very well the initial concentration of the ERCC spike. They also correlate nicelly with raw kallisto est counts. Again, the normalisation is not really doing a great job again since the 2 fractions are clearly separated after norm ...

New orientations

I need to investigate alternative normalization methods such as RUVSeq or the method used by Tom in the CaptureSeq Paper..
I asked the Kallisto/Sleuth team for more information and wait for their answer. Apparently it is not possible to use the existing tools for normalization from the raw Kallisto results
For the moment I will continue the analysis at the gene level with both Kallisto and STAR results
Star results does need to be modified but I need to aggregate kallisto results to gene level.
There is an R package that was developed just for this, called tximport. According to the authors this transcript level aggregation is more precise that the conventional gene level count.
From these results I will normalise the results with RUV-Seq and perform the DE expression analysis with DESeq2 or Voom.

Gene level aggregation and count with tximport



In [1]:

    
################## R KERNEL ##################

# Test of tximport with kallisto counts
vignette("tximport")









    



starting httpd help server ... done



In [1]:

    
# Import libraries
library(dplyr)
library(tximport)
library(readr)

# Define a function to prepare data and analyse them with 
tximport_analysis <- function(sample_list, transcript_info_gencode, transcript_info_ERCC, outname) {

    # Import the sample name and path from the sample list file
    s2p = read.table(sample_list, header = TRUE)
    abundance_list = file.path(s2p$path, "abundance.tsv")
    names(abundance_list) = s2p$sample

    # Import the gene info from gencode genes in a dataframe and rename the fields for tximport compatibility
    t2g_encode = read.table(transcript_info_gencode, header = TRUE, stringsAsFactors=FALSE)
    t2g_encode = rename(t2g_encode, TXNAME = GENCODE_transcript_id, GENEID = GENCODE_gene_id)
    t2g_encode = select(t2g_encode, c(TXNAME, GENEID))

    # Same thing for ERCC gene info list
    t2g_ERCC = read.table(transcript_info_ERCC, header = TRUE, stringsAsFactors=FALSE)
    t2g_ERCC = rename(t2g_ERCC, TXNAME = transcript_name, GENEID = gene_id)
    t2g_ERCC = select(t2g_ERCC, c(TXNAME, GENEID))

    # concatenate the 2 dataframes
    t2g = rbind(t2g_encode, t2g_ERCC)

    # Now aggregate the results at the gene level with tximport with a tpm and transcript length normalization
    txi = tximport(abundance_list, type="kallisto", tx2gene=t2g, reader=read_tsv, countsFromAbundance="lengthScaled") # Scaled by transcript lenght only

    # Write the normalized aggregated counts in a tsv table file
    write.table(txi$counts, file = outname, row.names=TRUE, na="",col.names=TRUE, sep="\t", quote=FALSE)
    return(txi$counts)
    }









    



Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [2]:

    
# Define files
sample_list = "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/list_file.tsv"
transcript_info_gencode = "../../Reference_Genomes/gencode_v24_lncRNA_transcripts_info.tsv"
transcript_info_ERCC = "../../Reference_Genomes/ERCC_info.tsv"
outname = "./Localisation_Original_Datasets/Djebali-ENCODE/tximport/lncRNA_ERCC_gene_level_counts.tsv"

counts = tximport_analysis(sample_list, transcript_info_gencode, transcript_info_ERCC, outname)
head (counts)









    



reading in files
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
summarizing abundance
summarizing counts
summarizing length






    





A549_cytosol_1 A549_cytosol_2 A549_nucleus_1 A549_nucleus_2 GM12878_cytosol_1 GM12878_cytosol_2 GM12878_nucleus_1 GM12878_nucleus_2 HUVEC_cytosol_1 HUVEC_cytosol_2 ⋯ MCF7_nucleus_1 MCF7_nucleus_2 NHEK_cytosol_1 NHEK_cytosol_2 NHEK_nucleus_1 NHEK_nucleus_2 SKNSH_cytosol_1 SKNSH_cytosol_2 SKNSH_nucleus_1 SKNSH_nucleus_2

	ENSG00000082929.8 1.16045695852945 2.18115350251946 7.64620614504533 0                0.859748074437148 0                0.565356414989429 1.28832509124729 0                0                ⋯                26.9438076412688 5.34587427488773 0                0                0                0                1.09192984274639 1.58821771138767 0                1.12687657464808 
	ENSG00000083622.8 0                0                0                2.56990394384941 0                0                0.722165705975895 3.98581170792923 0                0                ⋯                0                0                0                0                0                0                0                0                0                0                
	ENSG00000093100.13 4206.58525776049 5245.44532067576 42246.6671145073 17345.4798764029 2786.64983749646 2091.24126013715 8034.338939453  10705.3661938123 2311.82183919593 1934.56595273566 ⋯               5363.38805564951 1687.06573490952 1365.36952354241 1963.37356687556 10331.6192259408 10477.5619662555 8092.1303367047 5620.28566102706 59144.8086332296 85625.1230322502
	ENSG00000099869.7 3.89274133185661 2.02083880944801 7.46284691847996 3.41995085820615 0                0                0                0                4.62901334548625 0                ⋯                1.85148217135638 0                0                0                0                0                0                0                0.824219713139594 0                
	ENSG00000100181.21 610.343019447761 673.730123311848 1148.63205572315 1068.16303362129 1158.04680082864 774.353319596   1964.63945083306 2574.65942742537 525.661551196438 548.651766402981 ⋯               1842.48523037281 648.403047303998 1418.10083385372 2965.4447669461 5139.14025251317 5744.68069123436 753.666074861697 648.840275878875 2032.41967960662 2844.09733862694
	ENSG00000115934.11 26.5928725960095 31.1329158208316 92.1987541709246 129.292130967264 97.7291836672975 49.9085326069612 193.771626511939 212.18697749753 27.4608245128903 26.7699108852439 ⋯               206.744401508904 55.5545985527226 34.2299176175002 63.3576946002565 169.262475081999 174.126501003718 68.4717592117736 60.661218890785 173.126273860766 267.769557336931



In [4]:

    
################## PYTHON3 KERNEL ##################

# Rerun all the data collection pipeline and add the last data from tximport
ERCC_dict = ERCC_concentration_parsing ("./Localisation_Original_Datasets/Djebali-ENCODE/Protocol_info/NistPool14Concentrations.tsv")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_kallisto_parsing (ERCC_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/lncRNA_ERCC")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/STAR_neg_strand_count.tsv", "log_STAR_count")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/tximport/lncRNA_ERCC_gene_level_counts.tsv", "log_tximport_count")
ERCC_abundance_df.head()









    Out[4]:






  
    
      
      cell_loc
      cell
      localization
      replicate
      transcript_id
      log_concentration
      eff_length
      length
      log_est_filter
      log_tpm_filter
      log_est_norm
      log_tpm_norm
      log_est_raw
      log_tpm_raw
      log_STAR_count
      log_tximport_count
    
  
  
    
      ERCC-00002_A549_cytosol_1
      A549_cytosol
      A549
      cytosol
      1
      ERCC-00002
      -9.866192
      1011.634547
      1061.0
      12.879195
      10.330360
      12.879195
      10.330360
      12.436224
      9.670236
      12.423580
      12.424200
    
    
      ERCC-00002_A549_cytosol_2
      A549_cytosol
      A549
      cytosol
      2
      ERCC-00002
      -9.866192
      958.263200
      1061.0
      12.708273
      10.183487
      12.708273
      10.183487
      12.409797
      9.735956
      12.398963
      12.423404
    
    
      ERCC-00002_A549_nucleus_1
      A549_nucleus
      A549
      nucleus
      1
      ERCC-00002
      -9.866192
      800.665212
      1061.0
      12.159725
      9.739943
      12.159725
      9.739943
      12.628106
      10.240396
      12.616804
      12.688344
    
    
      ERCC-00002_A549_nucleus_2
      A549_nucleus
      A549
      nucleus
      2
      ERCC-00002
      -9.866192
      781.193314
      1061.0
      12.200813
      9.804348
      12.200813
      9.804348
      12.473880
      10.411550
      12.445239
      12.536824
    
    
      ERCC-00002_GM12878_cytosol_1
      GM12878_cytosol
      GM12878
      cytosol
      1
      ERCC-00002
      -9.866192
      1158.248810
      1061.0
      11.467636
      8.527796
      11.467636
      8.527796
      11.230430
      8.360036
      11.188220
      10.905533



In [5]:

    
for X, Y in [["log_concentration", "log_tximport_count"], ["log_concentration", "log_STAR_count"], ["log_concentration", "log_est_raw"], ["log_tximport_count", "log_STAR_count"], ["log_tximport_count", "log_est_raw"]] :
    g = sns.lmplot(x=X, y=Y, hue="localization", col="cell", data=ERCC_abundance_df, lowess=True)
    plt.subplots_adjust(top=0.8)
    g.fig.suptitle("{} /{}".format(Y.upper(), X.upper()))









    



/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

Althought the purpose was only to aggregate the kallisto results at gene level, we can notice a slight improvement after gene level aggreggation, sequencing depth normalization and transcript lenght normalization using tximport. The concentration vs count normalizaton seems a little better for low concentrations and the correlation with Star is always good

Removing unwanted variation with RUVseq

RUVSeq can be run in 3 different modes, but I will try only 2 of them and compare their performances:

RUVg => Use the ERCC spikes to normalize
RUVr => Use the deviance residuals of a first pass general linear model regression



In [5]:

    
%%R
vignette('RUVSeq')



In [6]:

    
%%R
# Import libraries
library(dplyr)
library(RUVSeq)
library(RColorBrewer)
library(repr)

# Function to run the RUV-Seq normalization 
RUVSeq_analysis <- function(count_file, outdir, prefix, unwanted_variation_factor=1) {

    cat("IMPORT THE DATA\n")
    # Import the data in a dataframe
    allcount = read.table(count_file, header = TRUE)

    genes = rownames(allcount)[grep("^ENS", rownames(allcount), perl = TRUE)]
    spikes  = rownames(allcount)[grep("^ERCC", rownames(allcount), perl = TRUE)]
    cat(paste("\tGenes initial ", length(genes), "\n"))
    cat(paste("\tSpikes initial ", length(spikes), "\n"))

    cat("FILTERING NON AND LOW EXPRESSED GENES\n")
    # Filter out non-expressed genes, by requiring more than 5 reads in at least two samples for each gene.
    filter = apply(allcount, 1, function(x) length(x[x>5])>=2)
    filtered = allcount[filter,]
    genes = rownames(filtered)[grep("^ENS", rownames(filtered), perl = TRUE)]
    spikes = rownames(filtered)[grep("^ERCC", rownames(filtered), perl = TRUE)]
    cat(paste("\tGenes after filtering ", length(genes), "\n"))
    cat(paste("\tSpikes after filtering ", length(spikes), "\n"))

    cat("PLOTING DATA BEFORE NORMALIZATION\n")
    # Create a list of samples by grouping sample without the replicate number at the end of the sample name (_#)
    cond = as.factor(sub('(.)_.$', '\\1', colnames(filtered)))
    #cond = as.factor(sapply(strsplit(colnames(filtered),'_'), "[", 1-2))
    cell = as.factor(sapply(strsplit(colnames(filtered),'_'), "[", 1))
    localization = as.factor(sapply(strsplit(colnames(filtered),'_'), "[", 2))
    replicate = as.factor(sapply(strsplit(colnames(filtered),'_'), "[", 3))
    df = data.frame(cell, localization, replicate, row.names=colnames(filtered))
    set1 = newSeqExpressionSet(as.matrix(filtered), phenoData = df)       
    
    cat("PERFORM UPPER QUARTILE NORMALIZATION\n")
    # Normalize between lanes using EDASeq upper quartile method
    set2 = betweenLaneNormalization(set1, which="upper")

    cat("PERFORM RUV G NORMALIZATION\n")
    # Perform RUV g normalization based on the ERCC spike in controls
    set3 = RUVg(set2, spikes, k=unwanted_variation_factor)
    
    cat("PERFORM RUV R NORMALIZATION\n")
    # Perform RUVr normalization based on residuals of a first GLM pass
    # compute the residuals from the GLM fit
    design = model.matrix(~localization, data=pData(set2))
    y = DGEList(counts=counts(set2), group=localization)
    y = calcNormFactors(y, method="upperquartile")
    y = estimateGLMCommonDisp(y, design)
    y = estimateGLMTagwiseDisp(y, design)
    fit = glmFit(y, design)
    res = residuals(fit, type="deviance")
    # estimate the factors of unwanted variation for all genes
    set4 = RUVr(set2, genes, k=unwanted_variation_factor, res)
    
    cat("PLOTING DATA\n")
    # Define a color vector to differentiate groups of replicates
    n = length(levels(cond))
    getPalette = colorRampPalette(brewer.pal(12, "Set3"))
    colors = getPalette(n)
    
    # Plot the Relative log ratio of read count to median read count accross sample = RLE 
    par(xpd=T, mar=c(0.2,4,2,10))
    options(repr.plot.width=10, repr.plot.height=4)
    plotRLE(set1, outline=F, las=2, cex.axis=0.5, ylim=c(-4, 4), col=colors[cond], ylab="RLE", main = "RLE plot before normalization")
    legend("right",inset=c(-0.4,0), legend=levels(cond), fill=getPalette(n), cex = 0.7, y.intersp=1.5, bty="n")
    plotRLE(set2, outline=F, las = 2, cex.axis = 0.5, ylim=c(-4, 4), col=colors[cond], ylab="RLE", main = "RLE plot after upper-quartile normalization")
    legend("right",inset=c(-0.4,0), legend=levels(cond), fill=getPalette(n), cex = 0.7, y.intersp=1.5, bty="n")
    plotRLE(set3, outline=F, las = 2, cex.axis = 0.5, ylim=c(-4, 4), col=colors[cond], ylab="RLE", main = "RLE plot after RUVg normalization")
    legend("right",inset=c(-0.4,0), legend=levels(cond), fill=getPalette(n), cex = 0.7, y.intersp=1.5, bty="n")
    plotRLE(set4, outline=F, las = 2, cex.axis = 0.5, ylim=c(-4, 4), col=colors[cond], ylab="RLE", main = "RLE plot after RUVr normalization")
    legend("right",inset=c(-0.4,0), legend=levels(cond), fill=getPalette(n), cex = 0.7, y.intersp=1.5, bty="n")  
        
    # Plot 3D PCA of the datasets
    par(mar=c(4,4,4,4))
    plotPCA(set1, k=2, col=colors[cond], cex=0.7, main = "PCA plot before normalization")
    plotPCA(set2, k=2, col=colors[cond], cex=0.7, main = "PCA plot after upper-quartile normalization")        
    plotPCA(set3, k=2, col=colors[cond], cex=0.7, main = "PCA plot after RUVg normalization")
    plotPCA(set4, k=2, col=colors[cond], cex=0.7, main = "PCA plot after RUVr normalization")
        
    cat("EXPORTING THE RESULTS\n")
    fname = paste(c(outdir,"/",prefix,"RUVg_counts.tsv"), sep = "", collapse="")
    write.table(normCounts(set3), file = fname, row.names=TRUE, na="",col.names=TRUE, sep="\t", quote=FALSE)
    fname = paste(c(outdir,"/",prefix,"RUVr_counts.tsv"), sep = "", collapse="")
    write.table(normCounts(set4), file = fname, row.names=TRUE, na="",col.names=TRUE, sep="\t", quote=FALSE)
    return (set3)
    }









    



Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
    duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
    Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: EDASeq
Loading required package: ShortRead
Loading required package: BiocParallel
Loading required package: Biostrings
Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:dplyr’:

    first, rename

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:dplyr’:

    collapse, desc, regroup, slice

Loading required package: XVector
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment

Attaching package: ‘GenomicAlignments’

The following object is masked from ‘package:dplyr’:

    last


Attaching package: ‘ShortRead’

The following object is masked from ‘package:dplyr’:

    id

Loading required package: edgeR
Loading required package: limma

Attaching package: ‘limma’

The following object is masked from ‘package:BiocGenerics’:

    plotMA



In [7]:

    
%%R
# Normalising Kallisto_tximport data with k = 1
count_file = "./Localisation_Original_Datasets/Djebali-ENCODE/tximport/lncRNA_ERCC_gene_level_counts.tsv"
unwanted_variation_factor = 1
outname = "./Localisation_Original_Datasets/Djebali-ENCODE/RUV-seq/lncRNA_ERCC"
prefix = "kallisto_k1_"

set = RUVSeq_analysis(count_file, outname, prefix, unwanted_variation_factor)









    





IMPORT THE DATA
	Genes initial  15941 
	Spikes initial  96 
FILTERING NON AND LOW EXPRESSED GENES
	Genes after filtering  11922 
	Spikes after filtering  77 
PLOTING DATA BEFORE NORMALIZATION
PERFORM UPPER QUARTILE NORMALIZATION
PERFORM RUV G NORMALIZATION
PERFORM RUV R NORMALIZATION
PLOTING DATA
EXPORTING THE RESULTS



In [9]:

    
%%R
# Normalising STAR data
count_file = "./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/STAR_neg_strand_count.tsv"
unwanted_variation_factor = 1
outname = "./Localisation_Original_Datasets/Djebali-ENCODE/RUV-seq/lncRNA_ERCC"
prefix = "STAR_k1_"

set = RUVSeq_analysis(count_file, outname, prefix, unwanted_variation_factor)









    





IMPORT THE DATA
	Genes initial  15941 
	Spikes initial  96 
FILTERING NON AND LOW EXPRESSED GENES
	Genes after filtering  9740 
	Spikes after filtering  76 
PLOTING DATA BEFORE NORMALIZATION
PERFORM UPPER QUARTILE NORMALIZATION
PERFORM RUV G NORMALIZATION
PERFORM RUV R NORMALIZATION
PLOTING DATA
EXPORTING THE RESULTS

The normalisation reduced significantly the difference of RLE between sample (Upper quartile norm) and the intra sample variability for both STAR and kallisto datasets. The PCA also shows that the replicates generally clustered together. After the UQ norm we can still see a clear nucleus / cytoplasm difference, but everything seems more mixed up after the RUVSeq normalization



In [11]:

    
# Rerun all the data collection pipeline and add the last data from tximport
ERCC_dict = ERCC_concentration_parsing ("./Localisation_Original_Datasets/Djebali-ENCODE/Protocol_info/NistPool14Concentrations.tsv")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_kallisto_parsing (ERCC_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/lncRNA_ERCC/")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/STAR/lncRNA_ERCC/STAR_neg_strand_count.tsv", "log_STAR_count")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/tximport/lncRNA_ERCC_gene_level_counts.tsv", "log_kallisto_tximport_count")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/RUV-seq/lncRNA_ERCC/kallisto_k1_RUVg_counts.tsv", "log_kallisto_RUVg_count")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/RUV-seq/lncRNA_ERCC/kallisto_k1_RUVr_counts.tsv", "log_kallisto_RUVr_count")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/RUV-seq/lncRNA_ERCC/STAR_k1_RUVg_counts.tsv", "log_STAR_RUVg_count")
ERCC_abundance_dict, ERCC_abundance_df = ERCC_table_parsing (ERCC_abundance_dict, "./Localisation_Original_Datasets/Djebali-ENCODE/RUV-seq/lncRNA_ERCC/STAR_k1_RUVr_counts.tsv", "log_STAR_RUVr_count")
ERCC_abundance_df.head()









    Out[11]:






  
    
      
      cell_loc
      cell
      localization
      replicate
      transcript_id
      log_concentration
      eff_length
      length
      log_est_norm
      log_tpm_norm
      log_est_filter
      log_tpm_filter
      log_est_raw
      log_tpm_raw
      log_STAR_count
      log_kallisto_tximport_count
      log_kallisto_RUVg_count
      log_kallisto_RUVr_count
      log_STAR_RUVg_count
      log_STAR_RUVr_count
    
  
  
    
      ERCC-00002_A549_cytosol_1
      A549_cytosol
      A549
      cytosol
      1
      ERCC-00002
      -9.866192
      1011.634547
      1061.0
      12.879195
      10.330360
      12.879195
      10.330360
      12.436224
      9.670236
      12.423580
      12.424200
      11.638289
      12.377565
      11.706714
      12.451975
    
    
      ERCC-00002_A549_cytosol_2
      A549_cytosol
      A549
      cytosol
      2
      ERCC-00002
      -9.866192
      958.263200
      1061.0
      12.708273
      10.183487
      12.708273
      10.183487
      12.409797
      9.735956
      12.398963
      12.423404
      11.712710
      12.227355
      11.735669
      12.251562
    
    
      ERCC-00002_A549_nucleus_1
      A549_nucleus
      A549
      nucleus
      1
      ERCC-00002
      -9.866192
      800.665212
      1061.0
      12.159725
      9.739943
      12.159725
      9.739943
      12.628106
      10.240396
      12.616804
      12.688344
      12.438743
      11.640157
      12.325627
      12.018381
    
    
      ERCC-00002_A549_nucleus_2
      A549_nucleus
      A549
      nucleus
      2
      ERCC-00002
      -9.866192
      781.193314
      1061.0
      12.200813
      9.804348
      12.200813
      9.804348
      12.473880
      10.411550
      12.445239
      12.536824
      12.150417
      12.002315
      11.995728
      12.177184
    
    
      ERCC-00002_GM12878_cytosol_1
      GM12878_cytosol
      GM12878
      cytosol
      1
      ERCC-00002
      -9.866192
      1158.248810
      1061.0
      11.467636
      8.527796
      11.467636
      8.527796
      11.230430
      8.360036
      11.188220
      10.905533
      11.539830
      12.714651
      11.670117
      12.127933



In [12]:

    
for X, Y in [["log_concentration", "log_est_raw"], ["log_concentration", "log_kallisto_RUVg_count"], ["log_concentration", "log_kallisto_RUVr_count"], ["log_concentration", "log_STAR_count"], ["log_concentration", "log_STAR_RUVg_count"], ["log_concentration", "log_STAR_RUVr_count"]] :
    g = sns.lmplot(x=X, y=Y, hue="localization", col="cell", data=ERCC_abundance_df, lowess=True)
    plt.subplots_adjust(top=0.8)
    g.fig.suptitle("{} /{}".format(Y.upper(), X.upper()))









    



Mean of empty slice.

The RUVseq normalization is considerably reducing the already little difference between the 2 cellular fractions for the ERCC. This was expected since the normalization is based on the spike-in controls.

RUVr normalization doesn't seem to improve the results. I need to see with Catalina, Mat or Tom

The normalized results from STAR and Kallisto are still highly correlated, although maybe a little bit more variable for low values

Discusion of the results with Mat and Tom

The general assupmtion of differential expression methods such as DESeq2 is that the majority of genes don't change their expression patern between 2 conditions
In this case where 2 different cellular fractions are compared it is likely that many genes will have a different expression pattern
Plot the count for the nucleus vs the cytoplasmic fraction
It will break the initial assumption so we are unsure if DESeq2 or EdgeR can be used in this case, at least not for the p-Value calculation
Send an email to DESeq developpers, Sleuth developpers and talk with Catalina from the Marioni's group

Plot the count for the nucleus vs the cytoplasmic fraction from Kallisto TPM



In [60]:

    
head("./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/ENCSR000COK_K562_cytosol_2/abundance.tsv", 3)
head("./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/list_file.tsv", 3)









    



target_id	length	eff_length	est_counts	tpm
ENST00000473358.1	712	529.791	44.5908	6.15485
ENST00000469289.1	535	316.492	18.9677	4.38258

sample	cell_type	localization	replicate	path
A549_cytosol_1	A549	cytosol	1	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/ENCSR000CTL_A549_cytosol_1
A549_cytosol_2	A549	cytosol	2	/home/aleg/Data/Datasets/Epi2_pilot/Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/ENCSR000CTL_A549_cytosol_2



In [13]:

    
# File path    
sample_file = "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/list_file.tsv"
gene_info_file = "../../Reference_Genomes/gencode_v24_lncRNA_transcripts_info.tsv"
fish_loc_file = "./Localisation_Original_Datasets/Cabili_RNA_FISH/Localization_cell_ensembl_valid_HeLa.csv"

# Read the list of sample
sample_df = pd.read_table(sample_file, index_col=0)
count_dict = OrderedDict()

for sample, sval in sample_df.iterrows():
    
    # Read the kallisto count file
    count_df = pd.read_table(sval["path"]+"/abundance.tsv", index_col=0)
    print ("Parsing sample: {}".format(sample))
    
    for target_id, cval in count_df.iterrows():
        
        # key = transcript + cell type to agregate results from the same cell type together
        key = "{}_{}".format(target_id, sval["cell_type"])
        
        if key not in count_dict :
            count_dict[key] = OrderedDict()
            count_dict[key]["target_id"] = target_id
            count_dict[key]["cell_type"] = sval["cell_type"]
            count_dict[key]["tpm_nucleus"] = []
            count_dict[key]["tpm_cytosol"] = []
            count_dict[key]["est_nucleus"] = []
            count_dict[key]["est_cytosol"] = []
            
        count_dict[key]["tpm_"+sval["localization"]].append(float(cval["tpm"]))
        count_dict[key]["est_"+sval["localization"]].append(float(cval["est_counts"]))

# Calculate the log of the mean for each tpm and est list
for key in count_dict.keys():
    mean_tpm_nucleus = stat.mean(count_dict[key]["tpm_nucleus"])
    mean_tpm_cytosol = stat.mean(count_dict[key]["tpm_cytosol"])
    if mean_tpm_nucleus or mean_tpm_cytosol:           
        count_dict[key]["mean_log2_tpm_nucleus"] = np.log2(0.5+mean_tpm_nucleus)
        count_dict[key]["mean_log2_tpm_cytosol"] = np.log2(0.5+mean_tpm_cytosol)
        
    mean_est_nucleus = stat.mean(count_dict[key]["est_nucleus"])
    mean_est_cytosol = stat.mean(count_dict[key]["est_cytosol"])
    if mean_est_nucleus or mean_est_cytosol:           
        count_dict[key]["mean_log2_est_nucleus"] = np.log2(0.5+mean_est_nucleus)
        count_dict[key]["mean_log2_est_cytosol"] = np.log2(0.5+mean_est_cytosol)

# Add an additional column based on the gene localization by FISH for Hela cells only
# Import the gene info file and select only the columns with gene ID and transcript ID 
t2g_df = pd.read_table(gene_info_file)[[0,1]]
# Remove the number at the end of the info file to be compatible with the numberless IDs in the loc fish file
t2g_df.GENCODE_gene_id = t2g_df.GENCODE_gene_id.str.rsplit('.').str[0]

# Import the columns geneID and localization number of the fish loc file
loc_df = pd.read_table(fish_loc_file)[[0,2]]

# Merge the 2 dataframe to retain only the intersection, ie association of transcript with a localization index 
t2l_df = pd.merge(loc_df, t2g_df, how='inner', left_on='gene_ID', right_on='GENCODE_gene_id')
# Select the important columns and save in a dictionnary  
t2l_dict = {val["GENCODE_transcript_id"]:int(val["localization pattern"]) for i, val in t2l_df.iterrows()}

for key in count_dict.keys():
    if count_dict[key]["cell_type"] == "HeLaS3" and count_dict[key]["target_id"] in t2l_dict:
        count_dict[key]["real_localization"] = t2l_dict[count_dict[key]["target_id"]]
    else:
        count_dict[key]["real_localization"] = 0

count_df = pd.DataFrame.from_dict(count_dict, orient='index')
count_df.head()









    



Parsing sample: A549_cytosol_1
Parsing sample: A549_cytosol_2
Parsing sample: A549_nucleus_1
Parsing sample: A549_nucleus_2
Parsing sample: GM12878_cytosol_1
Parsing sample: GM12878_cytosol_2
Parsing sample: GM12878_nucleus_1
Parsing sample: GM12878_nucleus_2
Parsing sample: HUVEC_cytosol_1
Parsing sample: HUVEC_cytosol_2
Parsing sample: HUVEC_nucleus_1
Parsing sample: HUVEC_nucleus_2
Parsing sample: HeLaS3_cytosol_1
Parsing sample: HeLaS3_cytosol_2
Parsing sample: HeLaS3_nucleus_1
Parsing sample: HeLaS3_nucleus_2
Parsing sample: HepG2_cytosol_1
Parsing sample: HepG2_cytosol_2
Parsing sample: HepG2_nucleus_1
Parsing sample: HepG2_nucleus_2
Parsing sample: IMR90_cytosol_1
Parsing sample: IMR90_cytosol_2
Parsing sample: IMR90_nucleus_1
Parsing sample: IMR90_nucleus_2
Parsing sample: K562_cytosol_1
Parsing sample: K562_cytosol_2
Parsing sample: K562_nucleus_1
Parsing sample: K562_nucleus_2
Parsing sample: MCF7_cytosol_1
Parsing sample: MCF7_cytosol_2
Parsing sample: MCF7_nucleus_1
Parsing sample: MCF7_nucleus_2
Parsing sample: NHEK_cytosol_1
Parsing sample: NHEK_cytosol_2
Parsing sample: NHEK_nucleus_1
Parsing sample: NHEK_nucleus_2
Parsing sample: SKNSH_cytosol_1
Parsing sample: SKNSH_cytosol_2
Parsing sample: SKNSH_nucleus_1
Parsing sample: SKNSH_nucleus_2






    Out[13]:






  
    
      
      target_id
      cell_type
      tpm_nucleus
      tpm_cytosol
      est_nucleus
      est_cytosol
      mean_log2_tpm_nucleus
      mean_log2_tpm_cytosol
      mean_log2_est_nucleus
      mean_log2_est_cytosol
      real_localization
    
  
  
    
      ENST00000222396.9_A549
      ENST00000222396.9
      A549
      [2.06925, 0.7255149999999999]
      [0.384623, 2.1539]
      [9.280439999999999, 2.27547]
      [2.36717, 12.1537]
      0.924011
      0.823147
      2.650295
      2.956138
      0
    
    
      ENST00000222396.9_GM12878
      ENST00000222396.9
      GM12878
      [1.16744, 0.381981]
      [0.0, 0.0]
      [6.43525, 1.8635400000000002]
      [0.0, 0.0]
      0.350170
      -1.000000
      2.217043
      -1.000000
      0
    
    
      ENST00000222396.9_HUVEC
      ENST00000222396.9
      HUVEC
      [3.05652, 2.28606]
      [0.0, 0.0]
      [15.8865, 8.62248]
      [0.0, 0.0]
      1.665070
      -1.000000
      3.672933
      -1.000000
      0
    
    
      ENST00000222396.9_HeLaS3
      ENST00000222396.9
      HeLaS3
      [6.32489, 3.5624599999999997]
      [0.602563, 0.0]
      [24.3011, 16.2414]
      [2.9820599999999997, 0.0]
      2.444581
      -0.319619
      4.376516
      0.993515
      0
    
    
      ENST00000222396.9_HepG2
      ENST00000222396.9
      HepG2
      [1.43084, 1.39398]
      [0.0, 0.0]
      [8.92423, 6.06273]
      [0.0, 0.0]
      0.935392
      -1.000000
      2.998824
      -1.000000
      0



In [16]:

    
# Extensive data ploting for HeLaS3 cell line

sns.set(style="white", color_codes=True)
df = count_df[(count_df.cell_type == "HeLaS3")]

# Plot the results for Hela Cells EST counts 
g=sns.jointplot("mean_log2_est_nucleus", "mean_log2_est_cytosol", data=df, kind="reg", size=7, xlim=[-2,20], ylim=[-2,20], color="0.8", line_kws={"color":"green", "linewidth":1})
# Highlight the data from the FISH dataset
for loc, color in [[1,"darkred"], [2,"red"], [3,"orange"], [4,"gold"], [5,"white"]]:
    sub_df = df[(df.real_localization == loc)]
    plt.scatter(sub_df.mean_log2_est_nucleus, sub_df.mean_log2_est_cytosol, c=color)
# Add the ERCC genes to see where they align 
sns.regplot("mean_log2_est_nucleus", "mean_log2_est_cytosol", data=df[count_df['target_id'].str.contains("ERCC")], color="blue", line_kws={"color":"darkblue", "linewidth":1})
# Add a diagonal line
plt.plot([-2,20],[-2,20], ls="--", c="black", linewidth =1)
# Add title and legend
g.fig.suptitle("HELA S3 - EST RESULTS")
plt.legend(bbox_to_anchor=(1.3, 1), loc=2,
           labels=["Regression line all RNA","Regression line ERCC","Diagonal","All RNA","Distribution All RNA", "1 or 2 large nuclear foci", "Large nuclear foci + scattered in nucleus", "Predominantly nuclear no foci", "Cytoplasmic and nuclear", "Predominantly cytoplasmic", "ERCC"])

# Plot the results for Hela Cells TPM counts 
g=sns.jointplot("mean_log2_tpm_nucleus", "mean_log2_tpm_cytosol", data=df, kind="reg", size=7, xlim=[-2,20], ylim=[-2,20], color="0.8", line_kws={"color":"green", "linewidth":1})
# Highlight the data from the FISH dataset
for loc, color in [[1,"darkred"], [2,"red"], [3,"orange"], [4,"gold"], [5,"white"]]:
    sub_df = df[(df.real_localization == loc)]
    plt.scatter(sub_df.mean_log2_tpm_nucleus, sub_df.mean_log2_tpm_cytosol, c=color)
# Add the ERCC genes to see where they align 
sns.regplot("mean_log2_tpm_nucleus", "mean_log2_tpm_cytosol", data=df[count_df['target_id'].str.contains("ERCC")], color="blue", line_kws={"color":"darkblue", "linewidth":1})
# Add a diagonal line
plt.plot([-2,20],[-2,20], ls="--", c="black", linewidth =1)
# Add title and legend
g.fig.suptitle("HELA S3 - TPM RESULTS")
plt.legend(bbox_to_anchor=(1.3, 1), loc=2,
           labels=["Regression line all RNA","Regression line ERCC","Diagonal","All RNA","Distribution All RNA", "1 or 2 large nuclear foci", "Large nuclear foci + scattered in nucleus", "Predominantly nuclear no foci", "Cytoplasmic and nuclear", "Predominantly cytoplasmic", "ERCC"])









    



using a non-integer number instead of an integer will result in an error in the future
Boolean Series key will be reindexed to match DataFrame index.
Boolean Series key will be reindexed to match DataFrame index.






    Out[16]:





<matplotlib.legend.Legend at 0x7f776bed1a20>



In [17]:

    
# Plot the results for the other cell lines 
sns.set(style="white")

for cell_type in ["IMR90","A549","GM12878","HepG2","HUVEC","K562","MCF7","NHEK","SKNSH"]:
    df = count_df[(count_df.cell_type == cell_type)]
    g=sns.jointplot("mean_log2_tpm_nucleus", "mean_log2_tpm_cytosol", data=df, kind="reg", size=5, xlim=[-2,20], ylim=[-2,20], color="0.8", line_kws={"color":"green", "linewidth":1})
    # Add the ERCC genes to see where they align 
    sns.regplot("mean_log2_tpm_nucleus", "mean_log2_tpm_cytosol", data=df[count_df['target_id'].str.contains("ERCC")], color="blue", line_kws={"color":"darkblue", "linewidth":1})
    # Add a diagonal line
    plt.plot([-2,20],[-2,20], ls="--", c="black", linewidth =1)
    # Add title and legend
    g.fig.suptitle(cell_type+" - TPM RESULTS")
    plt.legend(bbox_to_anchor=(1.3, 1), loc=2, labels=["Regression line all RNA","Regression line ERCC","Diagonal","All RNA","Distribution All RNA", "ERCC"])









    



using a non-integer number instead of an integer will result in an error in the future
Boolean Series key will be reindexed to match DataFrame index.

The distribution is actually more balanced than expected between the 2 fractions, althougt there are clearly a lot of difference in the enrichment of the genes. A significant subset of the transcript are found only in the nucleus or only in the cytoplasm. They might be the more interesting candidates but one should take into account that they might be undetected because of the depth of sequencing of the relative abundance of the other genes

The ERCC reg line is quite perfectly fitted on the diagonal (ie balance between the 2 fractions) after tpm normalization.

Given these results I doesn't seems to be neccessary to perform any additional normalization step, it might not break the assumptions of DESeq2 or Sleuth

RUVseq normalization might sligthly improve the data but probably not very significantly

=> Analyse at gene level with DESeq2 including TPM normalization by tximport +- RUV seq normalization but without the geometric mean normalization

=> Analyse at transcript and gene level with Sleuth but I need to tweak the normalization function to deactivate it

Sleuth without geo-mean normalization + optimizations



In [21]:

    
%%R
# Imports and setups
library("sleuth")
library("dplyr")

sleuth_analysis = function(
    sample_list,
    transcript_info=NULL,
    select_col=NULL,
    select_val=NULL,
    full_model=~1,
    outdir="./out/",
    ncore=1,
    max_bootstrap=NULL,
    norm_fun=norm_factors,
    filter_fun=basic_filter,
    aggregation_column=NULL) {
    
    # Multicore option
    options(mc.cores = ncore)

    # Create the output directory
    dir.create(outdir, showWarnings=F, recursive=T)

    cat("IMPORT AND SELECT THE SAMPLES\n")   
    # Import the file in a table and select sample according to select val and col 
    s2c = read.table(sample_list, header=T, stringsAsFactors=F)
    if (!is.null(select_col)) {s2c = s2c[s2c[,select_col] == select_val,]}
    print (s2c$sample)
    
    cat("IMPORT THE GENE INFORMATIONS\n")   
    # Import the gene info in table and rename the fields for Sleuth compatibility
    t2g = read.table(transcript_info, header=T, stringsAsFactors=F)
    #t2g = rename(t2g, target_id = GENCODE_transcript_id, ens_gene = GENCODE_gene_id)
    t2g = select(t2g, target_id, ens_gene, transcript_name, gene_name)

    cat("PREPARE THE DATA FOR SLEUTH\n")
    # Extract the data from the kallisto results.
    so = sleuth_prep(s2c, full_model,filter_fun, t2g, max_bootstrap, norm_fun, norm_fun, aggregation_column)
    
    cat("FIT TO FULL AND REDUCED MODEL\n")
    # Estimate parameters for the sleuth response error measurement (full) model
    so <- sleuth_fit(so)
    # Estimate parameters for the sleuth reduced model (shrinkage)
    so <- sleuth_fit(so, ~1, 'reduced')

    cat("PERFORM WALD TESTING\n")
    # Performing Wald test for all possible cofactors
    for (beta_factor in colnames(so$design_matrix)) {so <- sleuth_wt(so, which_beta = beta_factor, 'full')}

    cat("PERFORM LIKELYHOOD RATION TESTING\n")
    # Perform likelihood ratio test
    so <- sleuth_lrt(so, 'reduced', 'full')
    
    cat("EXPORT RESULTS")
    # Save all the results in a table
    results_table <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
    fname = paste0(outdir, "DE_Sleuth.tsv")
    write.table(results_table, file = fname, row.names=TRUE, na="",col.names=TRUE, sep="\t", quote=FALSE)
    # Save the Sleuth object for further analysis
    fname = paste0(outdir, "DE_Sleuth.rds")
    saveRDS(so, file = fname)
    return(so)
    }



In [23]:

    
%%R
# test with one dataset
so = sleuth_analysis (
        sample_list="./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/list_file.tsv",
        transcript_info="./Localisation_Original_Datasets/Djebali-ENCODE/Protocol_info/gencode_v24_transcripts_info_sleuth.tsv",
        ncore=4,
        max_bootstrap=3,
        select_col="cell_type",
        select_val="HeLaS3",
        full_model=~localization+replicate,
        outdir="./Localisation_Original_Datasets/Djebali-ENCODE/test/",
        norm_fun = function(mat) {setNames(rep(1.0,ncol(mat)),colnames(mat))},
        filter_fun = basic_filter)









    



reading in kallisto results

.


normalizing est_counts

11689 targets passed the filter

normalizing tpm

merging in metadata

normalizing bootstrap samples

summarizing bootstraps

fitting measurement error models

shrinkage estimation

Adding missing grouping variables: `x_group`

computing variance of betas







    





IMPORT AND SELECT THE SAMPLES
[1] "HeLaS3_cytosol_1" "HeLaS3_cytosol_2" "HeLaS3_nucleus_1" "HeLaS3_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH
FIT TO FULL AND REDUCED MODEL
PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS



In [ ]:

    
%%R
sleuth_live(so)

Analyse of samples at transcript level



In [4]:

    
# Define generic arguments 
sample_list = "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/lncRNA_ERCC/list_file.tsv"
transcript_info = "../../Reference_Genomes/gencode_v24_lncRNA_transcripts_info.tsv"
outdir = "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/lncRNA_ERCC/"
ncore=4
max_bootstrap=100
norm_fun = function(mat) {setNames(rep(1.0,ncol(mat)),colnames(mat))}
filter_fun = function(row,min_reads=10,min_prop=0.47) {mean(row>=min_reads)>=min_prop}
        
# Analysis for all samples together at transcript level
cat ("\n\n### ANALYSING ALL SAMPLES ###\n")
so = sleuth_analysis (sample_list=sample_list, transcript_info=transcript_info, ncore=ncore, max_bootstrap=max_bootstrap,
    full_model=~localization+replicate+cell_type,
    outdir = paste0(outdir, "transcript_all","/"))

# Analysis cell line per cell line at transcript level
for (cell in c("IMR90","A549","GM12878","HeLaS3","HepG2","HUVEC","K562","MCF7","NHEK","SKNSH")) {
    cat (paste0("\n\n### ANALYSING SAMPLE ", cell, "###\n"))
    so = sleuth_analysis (sample_list=sample_list, transcript_info=transcript_info, ncore=ncore, max_bootstrap=max_bootstrap,
        full_model=~localization+replicate,
        outdir = paste0(outdir, "transcript_", cell,"/"),
        select_col="cell_type",
        select_val=cell)
}









    




### ANALYSING ALL SAMPLES ###
IMPORT AND SELECT THE SAMPLES
 [1] "A549_cytosol_1"    "A549_cytosol_2"    "A549_nucleus_1"   
 [4] "A549_nucleus_2"    "GM12878_cytosol_1" "GM12878_cytosol_2"
 [7] "GM12878_nucleus_1" "GM12878_nucleus_2" "HUVEC_cytosol_1"  
[10] "HUVEC_cytosol_2"   "HUVEC_nucleus_1"   "HUVEC_nucleus_2"  
[13] "HeLaS3_cytosol_1"  "HeLaS3_cytosol_2"  "HeLaS3_nucleus_1" 
[16] "HeLaS3_nucleus_2"  "HepG2_cytosol_1"   "HepG2_cytosol_2"  
[19] "HepG2_nucleus_1"   "HepG2_nucleus_2"   "IMR90_cytosol_1"  
[22] "IMR90_cytosol_2"   "IMR90_nucleus_1"   "IMR90_nucleus_2"  
[25] "K562_cytosol_1"    "K562_cytosol_2"    "K562_nucleus_1"   
[28] "K562_nucleus_2"    "MCF7_cytosol_1"    "MCF7_cytosol_2"   
[31] "MCF7_nucleus_1"    "MCF7_nucleus_2"    "NHEK_cytosol_1"   
[34] "NHEK_cytosol_2"    "NHEK_nucleus_1"    "NHEK_nucleus_2"   
[37] "SKNSH_cytosol_1"   "SKNSH_cytosol_2"   "SKNSH_nucleus_1"  
[40] "SKNSH_nucleus_2"  
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
........................................
normalizing est_counts
11068 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE IMR90###
IMPORT AND SELECT THE SAMPLES
[1] "IMR90_cytosol_1" "IMR90_cytosol_2" "IMR90_nucleus_1" "IMR90_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11667 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE A549###
IMPORT AND SELECT THE SAMPLES
[1] "A549_cytosol_1" "A549_cytosol_2" "A549_nucleus_1" "A549_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
12007 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE GM12878###
IMPORT AND SELECT THE SAMPLES
[1] "GM12878_cytosol_1" "GM12878_cytosol_2" "GM12878_nucleus_1"
[4] "GM12878_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
13025 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE HeLaS3###
IMPORT AND SELECT THE SAMPLES
[1] "HeLaS3_cytosol_1" "HeLaS3_cytosol_2" "HeLaS3_nucleus_1" "HeLaS3_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11689 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE HepG2###
IMPORT AND SELECT THE SAMPLES
[1] "HepG2_cytosol_1" "HepG2_cytosol_2" "HepG2_nucleus_1" "HepG2_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11896 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE HUVEC###
IMPORT AND SELECT THE SAMPLES
[1] "HUVEC_cytosol_1" "HUVEC_cytosol_2" "HUVEC_nucleus_1" "HUVEC_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
12266 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE K562###
IMPORT AND SELECT THE SAMPLES
[1] "K562_cytosol_1" "K562_cytosol_2" "K562_nucleus_1" "K562_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
12002 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE MCF7###
IMPORT AND SELECT THE SAMPLES
[1] "MCF7_cytosol_1" "MCF7_cytosol_2" "MCF7_nucleus_1" "MCF7_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
13004 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE NHEK###
IMPORT AND SELECT THE SAMPLES
[1] "NHEK_cytosol_1" "NHEK_cytosol_2" "NHEK_nucleus_1" "NHEK_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
11398 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

### ANALYSING SAMPLE SKNSH###
IMPORT AND SELECT THE SAMPLES
[1] "SKNSH_cytosol_1" "SKNSH_cytosol_2" "SKNSH_nucleus_1" "SKNSH_nucleus_2"
IMPORT THE GENE INFORMATIONS
PREPARE THE DATA FOR SLEUTH






    



reading in kallisto results
....
normalizing est_counts
14174 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps






    



FIT TO FULL AND REDUCED MODEL






    



fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas






    



PERFORM WALD TESTING
PERFORM LIKELYHOOD RATION TESTING
EXPORT RESULTS

Using an Rscript to analyse the data with Sleuth and export graphs and table in a file

This will also allow to use the cluster for computationally intensive analysis at gene level and with all RNA (not only lncRNA)

The program also ouput more complex graphs than the one proposed by sleuth_live.



In [2]:

    
!SleuthDEA.R









    



### INITIALIZE ###
Parse arguments and import files
Usage: /home/aleg/.local/bin/SleuthDEA.R [options]


Options:
	--s2c=CHARACTER
		Sample list file containing the sample name (colname = sample), grouping covariate column(s) and the path (colname = path) to the kallisto dir containing results [mandatory argument]

	--t2g=CHARACTER
		File containing the transcript (colname = target_id) to gene (ens_gene) information [default "NULL"]

	-o CHARACTER, --outdir=CHARACTER
		Directory where to output the result files [default "./out/"]

	-c CHARACTER, --select_col=CHARACTER
		if needeed will select row with *select_val* from this column [default "NULL"]

	-v CHARACTER, --select_val=CHARACTER
		if needeed will select row with this value from *select_col* column [default "NULL"]

	-m CHARACTER, --full_model=CHARACTER
		Define the model to fit the data based on the covariates. Example: cell+localization [default "1"]

	-b NUMBER, --max_bootstrap=NUMBER
		Max number of bootstrap to analyse [default All, Minimun 2]

	-n, --no_norm
		If given, will deactivate the default geometric mean normalisation [default "FALSE"]

	-e NUMBER, --min_est=NUMBER
		Minimal number of estimated counts per transcript for the filtering step [default "5"]

	-p NUMBER, --min_prop=NUMBER
		Minimal proportion of sample with *min_est* counts for the filtering step [default "0.47"]

	-a, --aggreg_col
		 [If given, will aggregate genes based on the *ens_gene* column of the file indicated by the t2g arg [default "FALSE"]

	-t NUMBER, --threads=NUMBER
		Number of core to be used (for gene level aggregation only) [default "1"]

	-s NUMBER, --sig_level=NUMBER
		FDR threshold fo significance in the plots [default "0.1"]

	-r, --rds
		If given, output an rds file for further analysis with sleuth_live (takes time) [default "FALSE"]

	-h, --help
		Show this help message and exit


Error: Please provide a sample to covariate tabulated file with kallisto directory result path(s)
Execution halted

Analysis at transcript level for individual cell lines



In [4]:

    
# Generic arguments
t2g = "./Localisation_Original_Datasets/Djebali-ENCODE/Protocol_info/gencode_v24_transcripts_info_sleuth.tsv"
select_col = "cell_type"
full_model = "localization+replicate"
min_est = 10
sig_level = 0.01

for gene_set in ["lncRNA_ERCC", "lncRNA"]:
    for cell_line in ["IMR90","A549","GM12878","HeLaS3","HepG2","HUVEC","K562","MCF7","NHEK","SKNSH"]:
        # Specific arguments
        s2c = "./Localisation_Original_Datasets/Djebali-ENCODE/kallisto/{}/list_file.tsv".format(gene_set)
        outdir  = "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/transcript_{}/{}/".format(gene_set, cell_line)
        # Build command line and launch it
        cl = "SleuthDEA.R --s2c {} --t2g {} --outdir {} --select_col {} --select_val {} --full_model {} --min_est {} --sig_level {} --no_norm".format(s2c, t2g, outdir, select_col, cell_line, full_model, min_est, sig_level)
        !{cl}









    



### INITIALIZE ###
Parse arguments and import files
Select sample:
	 IMR90_cytosol_1 
 	 IMR90_cytosol_2 
 	 IMR90_nucleus_1 
 	 IMR90_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10032 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 26 rows containing missing values (geom_point). 
2: Removed 35 rows containing missing values (geom_point). 
3: Removed 26 rows containing missing values (geom_point). 
4: Removed 35 rows containing missing values (geom_point). 
5: Removed 26 rows containing missing values (geom_point). 
6: Removed 35 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 26 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 A549_cytosol_1 
 	 A549_cytosol_2 
 	 A549_nucleus_1 
 	 A549_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10386 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 20 rows containing missing values (geom_point). 
2: Removed 33 rows containing missing values (geom_point). 
3: Removed 20 rows containing missing values (geom_point). 
4: Removed 33 rows containing missing values (geom_point). 
5: Removed 20 rows containing missing values (geom_point). 
6: Removed 33 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 20 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 GM12878_cytosol_1 
 	 GM12878_cytosol_2 
 	 GM12878_nucleus_1 
 	 GM12878_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
11485 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 32 rows containing missing values (geom_point). 
2: Removed 38 rows containing missing values (geom_point). 
3: Removed 32 rows containing missing values (geom_point). 
4: Removed 38 rows containing missing values (geom_point). 
5: Removed 32 rows containing missing values (geom_point). 
6: Removed 38 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 32 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HeLaS3_cytosol_1 
 	 HeLaS3_cytosol_2 
 	 HeLaS3_nucleus_1 
 	 HeLaS3_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10198 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 24 rows containing missing values (geom_point). 
2: Removed 38 rows containing missing values (geom_point). 
3: Removed 24 rows containing missing values (geom_point). 
4: Removed 38 rows containing missing values (geom_point). 
5: Removed 24 rows containing missing values (geom_point). 
6: Removed 38 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 24 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HepG2_cytosol_1 
 	 HepG2_cytosol_2 
 	 HepG2_nucleus_1 
 	 HepG2_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10323 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 48 rows containing missing values (geom_point). 
2: Removed 38 rows containing missing values (geom_point). 
3: Removed 48 rows containing missing values (geom_point). 
4: Removed 38 rows containing missing values (geom_point). 
5: Removed 48 rows containing missing values (geom_point). 
6: Removed 38 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 48 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HUVEC_cytosol_1 
 	 HUVEC_cytosol_2 
 	 HUVEC_nucleus_1 
 	 HUVEC_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10745 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 27 rows containing missing values (geom_point). 
2: Removed 37 rows containing missing values (geom_point). 
3: Removed 27 rows containing missing values (geom_point). 
4: Removed 37 rows containing missing values (geom_point). 
5: Removed 27 rows containing missing values (geom_point). 
6: Removed 37 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 27 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 K562_cytosol_1 
 	 K562_cytosol_2 
 	 K562_nucleus_1 
 	 K562_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10600 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 20 rows containing missing values (geom_point). 
2: Removed 38 rows containing missing values (geom_point). 
3: Removed 20 rows containing missing values (geom_point). 
4: Removed 38 rows containing missing values (geom_point). 
5: Removed 20 rows containing missing values (geom_point). 
6: Removed 38 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 20 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 MCF7_cytosol_1 
 	 MCF7_cytosol_2 
 	 MCF7_nucleus_1 
 	 MCF7_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
11145 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 22 rows containing missing values (geom_point). 
2: Removed 26 rows containing missing values (geom_point). 
3: Removed 22 rows containing missing values (geom_point). 
4: Removed 26 rows containing missing values (geom_point). 
5: Removed 22 rows containing missing values (geom_point). 
6: Removed 26 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 22 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 NHEK_cytosol_1 
 	 NHEK_cytosol_2 
 	 NHEK_nucleus_1 
 	 NHEK_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
9807 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 45 rows containing missing values (geom_point). 
2: Removed 50 rows containing missing values (geom_point). 
3: Removed 45 rows containing missing values (geom_point). 
4: Removed 50 rows containing missing values (geom_point). 
5: Removed 45 rows containing missing values (geom_point). 
6: Removed 50 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 45 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 SKNSH_cytosol_1 
 	 SKNSH_cytosol_2 
 	 SKNSH_nucleus_1 
 	 SKNSH_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
12573 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 31 rows containing missing values (geom_point). 
2: Removed 26 rows containing missing values (geom_point). 
3: Removed 31 rows containing missing values (geom_point). 
4: Removed 26 rows containing missing values (geom_point). 
5: Removed 31 rows containing missing values (geom_point). 
6: Removed 26 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 31 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 IMR90_cytosol_1 
 	 IMR90_cytosol_2 
 	 IMR90_nucleus_1 
 	 IMR90_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
9978 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 27 rows containing missing values (geom_point). 
5: Removed 27 rows containing missing values (geom_point). 
6: Removed 27 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 27 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 A549_cytosol_1 
 	 A549_cytosol_2 
 	 A549_nucleus_1 
 	 A549_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10321 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 33 rows containing missing values (geom_point). 
5: Removed 33 rows containing missing values (geom_point). 
6: Removed 33 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 33 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 GM12878_cytosol_1 
 	 GM12878_cytosol_2 
 	 GM12878_nucleus_1 
 	 GM12878_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
11426 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 32 rows containing missing values (geom_point). 
5: Removed 32 rows containing missing values (geom_point). 
6: Removed 32 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 32 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HeLaS3_cytosol_1 
 	 HeLaS3_cytosol_2 
 	 HeLaS3_nucleus_1 
 	 HeLaS3_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10142 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 25 rows containing missing values (geom_point). 
5: Removed 25 rows containing missing values (geom_point). 
6: Removed 25 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 25 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HepG2_cytosol_1 
 	 HepG2_cytosol_2 
 	 HepG2_nucleus_1 
 	 HepG2_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10273 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 46 rows containing missing values (geom_point). 
5: Removed 46 rows containing missing values (geom_point). 
6: Removed 46 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 46 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HUVEC_cytosol_1 
 	 HUVEC_cytosol_2 
 	 HUVEC_nucleus_1 
 	 HUVEC_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10677 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 55 rows containing missing values (geom_point). 
5: Removed 55 rows containing missing values (geom_point). 
6: Removed 55 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 55 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 K562_cytosol_1 
 	 K562_cytosol_2 
 	 K562_nucleus_1 
 	 K562_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
10535 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 18 rows containing missing values (geom_point). 
5: Removed 18 rows containing missing values (geom_point). 
6: Removed 18 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 18 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 MCF7_cytosol_1 
 	 MCF7_cytosol_2 
 	 MCF7_nucleus_1 
 	 MCF7_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
11074 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 30 rows containing missing values (geom_point). 
5: Removed 30 rows containing missing values (geom_point). 
6: Removed 30 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 30 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 NHEK_cytosol_1 
 	 NHEK_cytosol_2 
 	 NHEK_nucleus_1 
 	 NHEK_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
9771 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 45 rows containing missing values (geom_point). 
5: Removed 45 rows containing missing values (geom_point). 
6: Removed 45 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 45 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 SKNSH_cytosol_1 
 	 SKNSH_cytosol_2 
 	 SKNSH_nucleus_1 
 	 SKNSH_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
12484 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 36 rows containing missing values (geom_point). 
5: Removed 36 rows containing missing values (geom_point). 
6: Removed 36 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 36 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 IMR90_cytosol_1 
 	 IMR90_cytosol_2 
 	 IMR90_nucleus_1 
 	 IMR90_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
84927 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 5 rows containing missing values (geom_point). 
2: Removed 35 rows containing missing values (geom_point). 
3: Removed 5 rows containing missing values (geom_point). 
4: Removed 35 rows containing missing values (geom_point). 
5: Removed 5 rows containing missing values (geom_point). 
6: Removed 35 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 5 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 A549_cytosol_1 
 	 A549_cytosol_2 
 	 A549_nucleus_1 
 	 A549_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
90437 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 1 rows containing missing values (geom_point). 
2: Removed 31 rows containing missing values (geom_point). 
3: Removed 1 rows containing missing values (geom_point). 
4: Removed 31 rows containing missing values (geom_point). 
5: Removed 1 rows containing missing values (geom_point). 
6: Removed 31 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 1 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 GM12878_cytosol_1 
 	 GM12878_cytosol_2 
 	 GM12878_nucleus_1 
 	 GM12878_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
95450 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 9 rows containing missing values (geom_point). 
2: Removed 37 rows containing missing values (geom_point). 
3: Removed 9 rows containing missing values (geom_point). 
4: Removed 37 rows containing missing values (geom_point). 
5: Removed 9 rows containing missing values (geom_point). 
6: Removed 37 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 9 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HeLaS3_cytosol_1 
 	 HeLaS3_cytosol_2 
 	 HeLaS3_nucleus_1 
 	 HeLaS3_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
86052 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 10 rows containing missing values (geom_point). 
2: Removed 37 rows containing missing values (geom_point). 
3: Removed 10 rows containing missing values (geom_point). 
4: Removed 37 rows containing missing values (geom_point). 
5: Removed 10 rows containing missing values (geom_point). 
6: Removed 37 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 10 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HepG2_cytosol_1 
 	 HepG2_cytosol_2 
 	 HepG2_nucleus_1 
 	 HepG2_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
87337 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 32 rows containing missing values (geom_point). 
2: Removed 38 rows containing missing values (geom_point). 
3: Removed 32 rows containing missing values (geom_point). 
4: Removed 38 rows containing missing values (geom_point). 
5: Removed 32 rows containing missing values (geom_point). 
6: Removed 38 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 32 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HUVEC_cytosol_1 
 	 HUVEC_cytosol_2 
 	 HUVEC_nucleus_1 
 	 HUVEC_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
96078 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 354 rows containing missing values (geom_point). 
2: Removed 37 rows containing missing values (geom_point). 
3: Removed 354 rows containing missing values (geom_point). 
4: Removed 37 rows containing missing values (geom_point). 
5: Removed 354 rows containing missing values (geom_point). 
6: Removed 37 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 354 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 K562_cytosol_1 
 	 K562_cytosol_2 
 	 K562_nucleus_1 
 	 K562_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
89329 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 11 rows containing missing values (geom_point). 
2: Removed 39 rows containing missing values (geom_point). 
3: Removed 11 rows containing missing values (geom_point). 
4: Removed 39 rows containing missing values (geom_point). 
5: Removed 11 rows containing missing values (geom_point). 
6: Removed 39 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 11 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 MCF7_cytosol_1 
 	 MCF7_cytosol_2 
 	 MCF7_nucleus_1 
 	 MCF7_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
93052 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 267 rows containing missing values (geom_point). 
2: Removed 29 rows containing missing values (geom_point). 
3: Removed 267 rows containing missing values (geom_point). 
4: Removed 29 rows containing missing values (geom_point). 
5: Removed 267 rows containing missing values (geom_point). 
6: Removed 29 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 267 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 NHEK_cytosol_1 
 	 NHEK_cytosol_2 
 	 NHEK_nucleus_1 
 	 NHEK_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
85707 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 21 rows containing missing values (geom_point). 
2: Removed 51 rows containing missing values (geom_point). 
3: Removed 21 rows containing missing values (geom_point). 
4: Removed 51 rows containing missing values (geom_point). 
5: Removed 21 rows containing missing values (geom_point). 
6: Removed 51 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 21 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 SKNSH_cytosol_1 
 	 SKNSH_cytosol_2 
 	 SKNSH_nucleus_1 
 	 SKNSH_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
104262 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: Removed 233 rows containing missing values (geom_point). 
2: Removed 26 rows containing missing values (geom_point). 
3: Removed 233 rows containing missing values (geom_point). 
4: Removed 26 rows containing missing values (geom_point). 
5: Removed 233 rows containing missing values (geom_point). 
6: Removed 26 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 233 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 IMR90_cytosol_1 
 	 IMR90_cytosol_2 
 	 IMR90_nucleus_1 
 	 IMR90_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
84905 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 5 rows containing missing values (geom_point). 
5: Removed 5 rows containing missing values (geom_point). 
6: Removed 5 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 5 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 A549_cytosol_1 
 	 A549_cytosol_2 
 	 A549_nucleus_1 
 	 A549_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
90381 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 1 rows containing missing values (geom_point). 
5: Removed 1 rows containing missing values (geom_point). 
6: Removed 1 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 1 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 GM12878_cytosol_1 
 	 GM12878_cytosol_2 
 	 GM12878_nucleus_1 
 	 GM12878_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
95416 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 9 rows containing missing values (geom_point). 
5: Removed 9 rows containing missing values (geom_point). 
6: Removed 9 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 9 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HeLaS3_cytosol_1 
 	 HeLaS3_cytosol_2 
 	 HeLaS3_nucleus_1 
 	 HeLaS3_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
86000 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 10 rows containing missing values (geom_point). 
5: Removed 10 rows containing missing values (geom_point). 
6: Removed 10 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 10 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HepG2_cytosol_1 
 	 HepG2_cytosol_2 
 	 HepG2_nucleus_1 
 	 HepG2_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
87278 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 32 rows containing missing values (geom_point). 
5: Removed 32 rows containing missing values (geom_point). 
6: Removed 32 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 32 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 HUVEC_cytosol_1 
 	 HUVEC_cytosol_2 
 	 HUVEC_nucleus_1 
 	 HUVEC_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
96006 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 336 rows containing missing values (geom_point). 
5: Removed 336 rows containing missing values (geom_point). 
6: Removed 336 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 336 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 K562_cytosol_1 
 	 K562_cytosol_2 
 	 K562_nucleus_1 
 	 K562_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
89277 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 10 rows containing missing values (geom_point). 
5: Removed 10 rows containing missing values (geom_point). 
6: Removed 10 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 10 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 MCF7_cytosol_1 
 	 MCF7_cytosol_2 
 	 MCF7_nucleus_1 
 	 MCF7_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
92970 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 264 rows containing missing values (geom_point). 
5: Removed 264 rows containing missing values (geom_point). 
6: Removed 264 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 264 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 NHEK_cytosol_1 
 	 NHEK_cytosol_2 
 	 NHEK_nucleus_1 
 	 NHEK_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
85675 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 22 rows containing missing values (geom_point). 
5: Removed 22 rows containing missing values (geom_point). 
6: Removed 22 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 22 rows containing missing values (geom_path). 

### DONE ###

### INITIALIZE ###
Parse arguments and import files
Select sample:
	 SKNSH_cytosol_1 
 	 SKNSH_cytosol_2 
 	 SKNSH_nucleus_1 
 	 SKNSH_nucleus_2 

### SLEUTH ANALYSIS ###
Prepare the data for sleuth
reading in kallisto results
....
normalizing est_counts
104167 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps
Fit to full and reduced model
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas
Perform wald testing
	Beta factor: (Intercept) 
	Beta factor: localizationnucleus 
	Beta factor: replicate 
Perform likelyhood ratio testing
plot graphs
Warning messages:
1: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
2: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
3: In plot_ma(so, beta_factor, sig_level = sig_level, highlight = ERCC_id,  :
  Couldn't find any transcripts from highlight set in this test. They were probably filtered out.
4: Removed 233 rows containing missing values (geom_point). 
5: Removed 233 rows containing missing values (geom_point). 
6: Removed 233 rows containing missing values (geom_point). 
7: Removed 1 rows containing missing values (position_stack). 
8: Removed 233 rows containing missing values (geom_path). 

### DONE ###

Other samples

My laptop doesn't have enought RAM to analyse the larger datasets and the gene level aggregation mode. I used a similar python script than the one above to run all the conditions:

lncRNA +- ERCC for each cell independently at transcript level (same as on my laptop)
lncRNA +- ERCC for all cell lines together at transcript level
lncRNA +- ERCC for all cell lines together at gene level
lncRNA +- ERCC for all cell lines together at gene level
allRNA +- ERCC for each cell independently at transcript level
allRNA +- ERCC for all cell lines together at transcript level
allRNA +- ERCC for all cell lines together at gene level
allRNA +- ERCC for all cell lines together at gene level

Coding genes vs non coding genes localization

Extract the gene info from gencode datasets and identify with type of RNA is present in lncRNA and allRNA at GENE level



In [6]:

    
# Transform the gff3 format in a simple tsv format for simple parsing
# Common arguments 
final_template=["{seqid}","\t","{type}","\t","{start}","\t","{end}","\t","{strand}","\t","{ID}","\t","{gene_type}","\t","{gene_name}"]
standard_template="gff3_ens_gene"
file_list = [
    ["../../Reference_Annotation/gencode_v24_lncRNAs.gff3", "../../Reference_Annotation/gencode_v24_lncRNAs_gene.tsv"],
    ["../../Reference_Annotation/gencode_v24.gff3", "../../Reference_Annotation/gencode_v24_gene.tsv"]]

for input_file, output_file in file_list:
    # Parse and simplify the gff3 tables
    reformat_table ( input_file=input_file, output_file=output_file, standard_template=standard_template,
                    final_template=final_template, keep_original_header=False, header_from_final_template= True)
    # Analyse column 7 containing the type of the genes
    print(colsum(output_file, colrange=[6], ret_type='report', max_items=0, header=True))









    



Using gff3 ensembl gene template. Non-gene features will be filtered out
Initial template values
seqid	source	type	start	end	score	strand	phase	ID	gene_id	gene_type	gene_status	gene_name	level	havana_gene
Final template values
seqid	type	start	end	strand	ID	gene_type	gene_name
128183 Lines processed	15941 Lines pass	112242 Lines filtered out	0 Lines fail
6
	lincRNA	7674
	antisense	5564
	TEC	1053
	sense_intronic	919
	processed_transcript	502
	sense_overlapping	193
	3prime_overlapping_ncrna	29
	non_coding	3
	bidirectional_promoter_lncrna	3
	macro_lncRNA	1

Using gff3 ensembl gene template. Non-gene features will be filtered out
Initial template values
seqid	source	type	start	end	score	strand	phase	ID	gene_id	gene_type	gene_status	gene_name	level	havana_gene
Final template values
seqid	type	start	end	strand	ID	gene_type	gene_name
2570242 Lines processed	60554 Lines pass	2509688 Lines filtered out	0 Lines fail
6
	protein_coding	19815
	processed_pseudogene	10283
	lincRNA	7674
	antisense	5564
	miRNA	4093
	unprocessed_pseudogene	2612
	misc_RNA	2298
	snRNA	1896
	TEC	1053
	snoRNA	949
	sense_intronic	919
	transcribed_unprocessed_pseudogene	682
	rRNA	544
	processed_transcript	502
	transcribed_processed_pseudogene	445
	sense_overlapping	193
	IG_V_pseudogene	183
	unitary_pseudogene	171
	IG_V_gene	145
	TR_V_gene	108
	TR_J_gene	79
	polymorphic_pseudogene	58
	scaRNA	49
	IG_D_gene	37
	TR_V_pseudogene	30
	3prime_overlapping_ncrna	29
	Mt_tRNA	22
	pseudogene	21
	sRNA	20
	IG_J_gene	18
	IG_C_gene	14
	IG_C_pseudogene	9
	ribozyme	8
	TR_C_gene	6
	TR_J_pseudogene	4
	TR_D_gene	4
	IG_J_pseudogene	3
	non_coding	3
	bidirectional_promoter_lncrna	3
	transcribed_unitary_pseudogene	3
	Mt_rRNA	2
	macro_lncRNA	1
	translated_unprocessed_pseudogene	1
	vaultRNA	1

Extract the gene info from gencode datasets and identify with type of RNA is present in lncRNA and allRNA at TRANSCRIPT level



In [7]:

    
# Transform the gff3 format in a simple tsv format for simple parsing
# Common arguments 
final_template=["{seqid}","\t","{type}","\t","{start}","\t","{end}","\t","{strand}","\t","{ID}","\t","{transcript_type}","\t","{transcript_name}","\t","{gene_id}","\t","{gene_type}","\t","{gene_name}"]
standard_template="gff3_ens_transcript"
file_list = [
    ["../../Reference_Annotation/gencode_v24_lncRNAs.gff3", "../../Reference_Annotation/gencode_v24_lncRNAs_transcript.tsv"],
    ["../../Reference_Annotation/gencode_v24.gff3", "../../Reference_Annotation/gencode_v24_transcript.tsv"]]

for input_file, output_file in file_list:
    # Parse and simplify the gff3 tables
    reformat_table ( input_file=input_file, output_file=output_file, standard_template=standard_template,
                    final_template=final_template, keep_original_header=False, header_from_final_template= True)
    # Analyse column 7 containing the type of the genes
    print(colsum(output_file, colrange=[6,9], ret_type='report', max_items=0, header=True))









    



Using gff3 ensembl transcript template. Non-transcript features will be filtered out
Initial template values
seqid	source	type	start	end	score	strand	phase	ID	Parent	gene_id	transcript_id	gene_type	gene_status	gene_name	transcript_type	transcript_status	transcript_name	level	transcript_support_level	tag	havana_gene	havana_transcript
Final template values
seqid	type	start	end	strand	ID	transcript_type	transcript_name	gene_id	gene_type	gene_name
128183 Lines processed	28031 Lines pass	100152 Lines filtered out	0 Lines fail
6
	lincRNA	13477
	antisense	11191
	TEC	1081
	sense_intronic	978
	retained_intron	559
	sense_overlapping	343
	processed_transcript	325
	3prime_overlapping_ncrna	33
	pseudogene	23
	misc_RNA	12
	bidirectional_promoter_lncrna	5
	non_coding	3
	macro_lncRNA	1
9
	lincRNA	12656
	antisense	10191
	processed_transcript	2761
	TEC	1069
	sense_intronic	979
	sense_overlapping	334
	3prime_overlapping_ncrna	32
	bidirectional_promoter_lncrna	5
	non_coding	3
	macro_lncRNA	1

Using gff3 ensembl transcript template. Non-transcript features will be filtered out
Initial template values
seqid	source	type	start	end	score	strand	phase	ID	Parent	gene_id	transcript_id	gene_type	gene_status	gene_name	transcript_type	transcript_status	transcript_name	level	transcript_support_level	tag	havana_gene	havana_transcript
Final template values
seqid	type	start	end	strand	ID	transcript_type	transcript_name	gene_id	gene_type	gene_name
2570242 Lines processed	199169 Lines pass	2371073 Lines filtered out	0 Lines fail
6
	protein_coding	79930
	processed_transcript	26977
	retained_intron	26704
	lincRNA	13481
	nonsense_mediated_decay	13409
	antisense	11194
	processed_pseudogene	10285
	miRNA	4093
	unprocessed_pseudogene	2613
	misc_RNA	2312
	snRNA	1896
	TEC	1140
	sense_intronic	978
	snoRNA	961
	transcribed_unprocessed_pseudogene	682
	rRNA	544
	transcribed_processed_pseudogene	445
	sense_overlapping	343
	IG_V_pseudogene	183
	unitary_pseudogene	171
	IG_V_gene	158
	TR_V_gene	110
	non_stop_decay	81
	TR_J_gene	79
	polymorphic_pseudogene	75
	scaRNA	49
	pseudogene	44
	IG_D_gene	37
	3prime_overlapping_ncrna	33
	TR_V_pseudogene	30
	Mt_tRNA	22
	sRNA	20
	IG_C_gene	19
	IG_J_gene	18
	IG_C_pseudogene	9
	TR_C_gene	9
	ribozyme	8
	bidirectional_promoter_lncrna	5
	TR_J_pseudogene	4
	TR_D_gene	4
	IG_J_pseudogene	3
	non_coding	3
	transcribed_unitary_pseudogene	3
	Mt_rRNA	2
	macro_lncRNA	1
	translated_unprocessed_pseudogene	1
	vaultRNA	1
9
	protein_coding	143947
	lincRNA	12656
	processed_pseudogene	10285
	antisense	10191
	miRNA	4093
	processed_transcript	2761
	unprocessed_pseudogene	2613
	misc_RNA	2298
	transcribed_unprocessed_pseudogene	2211
	snRNA	1896
	TEC	1069
	sense_intronic	979
	snoRNA	961
	transcribed_processed_pseudogene	922
	rRNA	544
	unitary_pseudogene	418
	sense_overlapping	334
	IG_V_pseudogene	183
	IG_V_gene	158
	polymorphic_pseudogene	152
	TR_V_gene	110
	TR_J_gene	79
	scaRNA	49
	IG_D_gene	37
	3prime_overlapping_ncrna	32
	TR_V_pseudogene	30
	Mt_tRNA	22
	pseudogene	21
	sRNA	20
	IG_C_gene	19
	IG_J_gene	18
	transcribed_unitary_pseudogene	11
	IG_C_pseudogene	9
	TR_C_gene	9
	ribozyme	8
	bidirectional_promoter_lncrna	5
	TR_J_pseudogene	4
	TR_D_gene	4
	IG_J_pseudogene	3
	non_coding	3
	Mt_rRNA	2
	macro_lncRNA	1
	translated_unprocessed_pseudogene	1
	vaultRNA	1

Define categories based on the definition of the gene types



In [21]:

    
# Find the count of the diffreent gene types in the localization data
df1 = pd.read_table("./Localisation/Djebali-ENCODE/sleuth/gene_allRNA/ALL/localizationnucleus_wald_test.tsv")
df2 = pd.read_table("../../Reference_Annotation/gencode_v24_gene.tsv")
df3 = pd.merge(df1,df2, how='inner', left_on="target_id", right_on="ID")
d = {gene_type:len(df) for gene_type, df in df3.groupby("gene_type")}
print(dict_to_report(d))









    



protein_coding:15302
lincRNA:2551
antisense:2202
processed_pseudogene:791
TEC:506
sense_intronic:466
transcribed_unprocessed_pseudogene:356
misc_RNA:335
processed_transcript:312
miRNA:234
unprocessed_pseudogene:209
transcribed_processed_pseudogene:156
sense_overlapping:114
snoRNA:94
snRNA:60
unitary_pseudogene:41
3prime_overlapping_ncrna:10
Mt_tRNA:9
scaRNA:9
rRNA:6
pseudogene:5
polymorphic_pseudogene:5
non_coding:3
IG_V_gene:2
transcribed_unitary_pseudogene:2
Mt_rRNA:2
TR_C_gene:1
macro_lncRNA:1
TR_V_gene:1
TR_V_pseudogene:1
ribozyme:1
IG_C_pseudogene:1



In [10]:

    
gene_file = "../../Reference_Annotation/gencode_v24_transcript.tsv"
DE_file_dir = "./Localisation/Djebali-ENCODE/sleuth/transcript_allRNA"



In [2]:

    
def lnc_coding_DE_plots (DE_file_dir, gene_file, FDR=1):
    print_arg()
    
    lncRNA_types = ["lincRNA", "antisense", "TEC", "sense_intronic", "processed_transcript", "sense_overlapping", "3prime_overlapping_ncrna", "non_coding", "bidirectional_promoter_lncrna", "macro_lncRNA"]
    coding_RNA_types = ["protein_coding", "transcribed_processed_pseudogene", "transcribed_unitary_pseudogene", "translated_unprocessed_pseudogene"]
    
    # Import the df with all gene IDs and create the lists of target id to highlight 
    gene_df = pd.read_table(gene_file)
    hl = [
        {"target_id": gene_df.ID[(gene_df.gene_type.isin(coding_RNA_types))], "label": "Coding RNA", "alpha":0.2, "color":"lightskyblue"},
        {"target_id": gene_df.ID[(gene_df.gene_type.isin(lncRNA_types))], "label": "lnc RNA", "alpha":0.2, "color":"coral"}]

    #for condition in condition_list:
    for condition in ["ALL","IMR90","A549","GM12878","HeLaS3","HepG2","HUVEC","K562","MCF7","NHEK","SKNSH"]:

        # read the appropriate file
        DE_file = "{}/{}/localizationnucleus_wald_test.tsv".format(DE_file_dir, condition)
        df= pd.read_table(DE_file)
        df = df.dropna()

        # Plot the graphs
        plot_text(condition, align="left", fontsize=20, fontweight="bold")
        volcano_plot (df=df, highlight_list=hl, X="b", Y="qval", FDR=FDR, X_cutoff=0, figsize=[15,7], alpha=0.5, xlim=[-8,8], ylim=[0,150],
                      sig_neg_color="0.8", sig_pos_color="0.8", non_sig_color="0.9",
                      xlabel = "Cytosol <<< beta factor >>> Nucleus", ylabel = "-log10(qval)", highlight_FDR=FDR)
        MA_plot (df=df, highlight_list=hl, X="mean_obs", Y="b", FDR=FDR, FDR_col="qval", figsize=[15,7], alpha=0.5, xlim=[0,14], ylim=[-10,10] ,
                 sig_neg_color="0.8", sig_pos_color="0.8", non_sig_color="0.9",
                 xlabel="Mean expression", ylabel="Cytosol <<< beta factor >>> Nucleus", highlight_FDR=FDR)

Plot each types of RNA on the volcano and MA plot at Gene Level



In [3]:

    
lnc_coding_DE_plots(
    gene_file = "../../Reference_Annotation/gencode_v24_gene.tsv",
    DE_file_dir = "./Localisation/Djebali-ENCODE/sleuth/gene_allRNA",
    FDR=0.05
    )









    



Enumerated named argument list:
	gene_file: ../../Reference_Annotation/gencode_v24_gene.tsv
	FDR: 0.05
	DE_file_dir: ./Localisation/Djebali-ENCODE/sleuth/gene_allRNA






    



More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).

Plot each types of RNA on the volcano and MA plot at Transcript Level



In [4]:

    
lnc_coding_DE_plots(
    gene_file = "../../Reference_Annotation/gencode_v24_transcript.tsv",
    DE_file_dir = "./Localisation/Djebali-ENCODE/sleuth/transcript_allRNA",
    FDR=0.05
    )









    



Enumerated named argument list:
	gene_file: ../../Reference_Annotation/gencode_v24_transcript.tsv
	FDR: 0.05
	DE_file_dir: ./Localisation/Djebali-ENCODE/sleuth/transcript_allRNA






    



More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).

Further analyses at transcript level for All cells together



In [6]:

    
# Importing data and plotting all RNA
df = pd.read_table( "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/transcript_allRNA/ALL/localizationnucleus_wald_test.tsv")
df = df.dropna()
sns.kdeplot(df.mean_obs, df.b, cmap="Greys", shade=True, shade_lowest=False, alpha=0.5)

# Defining RNA categories and importing the list of genes in gencode v24 
lncRNA_types = ["lincRNA", "antisense", "TEC", "sense_intronic", "processed_transcript", "sense_overlapping", "3prime_overlapping_ncrna", "non_coding", "bidirectional_promoter_lncrna", "macro_lncRNA"]
coding_RNA_types = ["protein_coding", "transcribed_processed_pseudogene", "transcribed_unitary_pseudogene", "translated_unprocessed_pseudogene"]
gene_df = pd.read_table( "../../Reference_Annotation/gencode_v24_transcript.tsv")

# Subsetting coding genes and ploting 
codingRNA_id = gene_df.ID[(gene_df.gene_type.isin(coding_RNA_types))]
codingRNA_df = df[df.target_id.isin(codingRNA_id)]
sns.kdeplot(codingRNA_df.mean_obs, codingRNA_df.b, cmap="Blues", shade=True, shade_lowest=False, alpha=0.5)

# Subsetting lncRNA genes and ploting
lncRNA_id = gene_df.ID[(gene_df.gene_type.isin(lncRNA_types))]
lncRNA_df = df[df.target_id.isin(lncRNA_id)]
sns.kdeplot(lncRNA_df.mean_obs, lncRNA_df.b, cmap="Reds", shade=True, shade_lowest=False, alpha=0.5)

# Draw a horizontal separation line
pl.hlines(0, 0,12, linestyle=":", color="0.75")









    Out[6]:





<matplotlib.collections.LineCollection at 0x7f4dc156ae48>

all RNA and all cell line at transcript level depending of the transcript type



In [10]:

    
def transcript_level_DE_plots  (DE_file, gene_file, min_transcript_per_class=20, max_RNA_classes=8, FDR=1):
    print_arg()
    # Import the data for all RNA and all cell line at transcript level 
    DE_df = pd.read_table(DE_file , skipinitialspace=True)
    # Import transcript information
    gene_df = pd.read_table(gene_file)
    # Merge the 2 dataframes together
    unique_col = DE_df.columns.difference(gene_df.columns)
    df = pd.merge(gene_df, DE_df[unique_col], how="inner", left_on="ID", right_on="target_id")
    # Index by gene name and transcript name and clean
    df.drop(["ID", "type", "ens_gene"], axis=1, inplace=True)
    df.dropna(inplace=True)

    # Create list of RNA id per classes
    hl=[]
    for transcript_type, transcript_df in df.groupby("transcript_type"):
        # filter by min_transcript_per_class
        if len(transcript_df) >= min_transcript_per_class:            
            hl.append({"target_id": transcript_df.target_id, "label": transcript_type, "alpha":0.75, "linestyle":"--"})      
    # Filter by max_RNA_classes
    hl.sort(key = lambda x: len(x["target_id"]), reverse=True)
    hl = hl[:max_RNA_classes]

    # Plot graphs
    plot_text("Volcano plot Cytosol/ nuclear localization for each transcript biotypes", fontsize=15, align="center", fontweight="bold")
    volcano_plot (df=df, highlight_list=hl, X="b", Y="qval", FDR=FDR, X_cutoff=0, figsize=[30,10], ylim=[0,100], alpha=0.5,
                  xlabel = "Cytosol <<< beta factor >>> Nucleus", ylabel = "-log10(qval)", highlight_FDR=FDR)
    
    plot_text("MA plot Cytosol/ nuclear localization for each transcript biotypes", fontsize=15, align="center", fontweight="bold")
    MA_plot (df=df, highlight_list=hl, X="mean_obs", Y="b", FDR=FDR, FDR_col="qval", figsize=[30,10], alpha=0.5, ylim=[-10,10],
                 xlabel="Mean expression", ylabel="Cytosol <<< beta factor >>> Nucleus", highlight_FDR=FDR)
    
    plot_text("Cumulative distribution Cytosol/ nuclear localization for each transcript biotypes", fontsize=15, align="center", fontweight="bold")
    density_plot(df[(df.qval <= FDR)], "b", figsize=[30,10], ylabel="Cumulative distribution", highlight_list= hl, cumulative = True)
    
    plot_text("Mean expression distribution for each transcript biotypes", fontsize=15, align="center", fontweight="bold")
    density_plot(df[(df.qval <= FDR)], "mean_obs", figsize=[30,10], ylabel="Mean expression distribution", highlight_list= hl, cumulative = False)



In [11]:

    
DE_file = "./Localisation/Djebali-ENCODE/sleuth/transcript_allRNA/ALL/localizationnucleus_wald_test.tsv"
gene_file = "../../Reference_Annotation/gencode_v24_transcript.tsv"
min_transcript_per_class = 20
max_RNA_classes = 10
FDR=1

transcript_level_DE_plots(DE_file, gene_file, min_transcript_per_class, max_RNA_classes, FDR)









    



Enumerated named argument list:
	max_RNA_classes: 10
	min_transcript_per_class: 20
	gene_file: ../../Reference_Annotation/gencode_v24_transcript.tsv
	FDR: 1
	DE_file: ./Localisation/Djebali-ENCODE/sleuth/transcript_allRNA/ALL/localizationnucleus_wald_test.tsv






    



using a non-integer number instead of an integer will result in an error in the future

Same analysis but with FDR thresholding to remove transcripts with uncertain localization



In [12]:

    
DE_file = "./Localisation/Djebali-ENCODE/sleuth/transcript_allRNA/ALL/localizationnucleus_wald_test.tsv"
gene_file = "../../Reference_Annotation/gencode_v24_transcript.tsv"
min_transcript_per_class = 20
max_RNA_classes = 8
FDR = 0.05

transcript_level_DE_plots(DE_file, gene_file, min_transcript_per_class, max_RNA_classes, FDR)









    



Enumerated named argument list:
	max_RNA_classes: 8
	min_transcript_per_class: 20
	gene_file: ../../Reference_Annotation/gencode_v24_transcript.tsv
	FDR: 0.05
	DE_file: ./Localisation/Djebali-ENCODE/sleuth/transcript_allRNA/ALL/localizationnucleus_wald_test.tsv






    



using a non-integer number instead of an integer will result in an error in the future

Most of the lncRNA classes (linc, processed transcripts, antisense pseudogene...) are clearly enriched in the nuclear fraction and their mean expression is rather low.

The "intron retained" class is the most nuclear class with an intermediate expression pattern. There nuclear localization was already reported in several studies. => good control of the analysis

The non-sense mediated decay class is equally distributed in cytosol and nucleus, also with an intermediate expression pattern

The protein coding transcripts are enriched apparently the only class enriched in the cytoplasm, which is also a good control because they are rrxpected to be in this compartment

Correlation localization Cabili/HT-FISH vs Djebali/ENCODE RNA-Seq

HeLa cell and IMR90 are common between the 2 datasets. For the raw counts the correlation was not really good (see above). In addition I plotted the transcripts whose genes were reported in Cabili paper. This was maybe not a really good way to do it.

I reanalysed the Cabili dataset more extensively and I now have more precise results at both gene and transcript level.

The DE results from the Sleuth analysis are more representative of the nuclear or cytosol enrichment than the raw counts

Comparison of the matching cell lines at gene and transcript levels for lncRNA

No FISH data for coding RNAs



In [9]:

    
condition_list = [[
        "HeLa Gene Level",
        "./Localisation/Djebali-ENCODE/sleuth/gene_lncRNA/HeLaS3/localizationnucleus_wald_test.tsv",
        "./Localisation/Cabili_RNA_FISH/results_lncRNA/gene_HeLa.csv"],
                  [
        "IMR90 Gene Level",
        "./Localisation/Djebali-ENCODE/sleuth/gene_lncRNA/IMR90/localizationnucleus_wald_test.tsv",
        "./Localisation/Cabili_RNA_FISH/results_lncRNA/gene_IMR90.csv"],
                  [
        "HeLa Transcript Level",
        "./Localisation/Djebali-ENCODE/sleuth/transcript_lncRNA/HeLaS3/localizationnucleus_wald_test.tsv",
        "./Localisation/Cabili_RNA_FISH/results_lncRNA/transcript_HeLa.csv"],
                  [
        "IMR90 Transcript Level",
        "./Localisation/Djebali-ENCODE/sleuth/transcript_lncRNA/IMR90/localizationnucleus_wald_test.tsv",
        "./Localisation/Cabili_RNA_FISH/results_lncRNA/transcript_IMR90.csv"]] 

for condition, DE_file, FISH_file in condition_list:
    # Load the Diff expression file from Sleuth
    df= pd.read_table(DE_file)
    df = df.dropna()
    
    # Load the FISH localization results  
    loc_df = pd.read_table(FISH_file)
    hl =[]
    
    # Find the localization for each category of pattern from Cabili paper
    for loc in [1,2,3,4,5]:
        hl.append(
            {"target_id": loc_df.target_id[(loc_df.localization == loc)], "label":"Localization pattern {}".format(loc)})
    
    # Plot the graphs
    plot_text(condition, align="left", fontsize=30, fontweight="bold")
    volcano_plot (df=df, highlight_list=hl, X="b", Y="qval", FDR=0.05, X_cutoff=1, figsize=[15,7], alpha=0.5, ylim=[-1,30], 
                  xlabel = "beta factor Nucleus localization", ylabel = "-log10(qval)")
    MA_plot (df=df, highlight_list=hl, X="mean_obs", Y="b", FDR=0.05, FDR_col="qval", figsize=[15,7], alpha=0.5,
             xlabel="Mean expression", ylabel="Beta factor Nucleus localization")









    



Series #4 empty. Skipping to the next one
Series #4 empty. Skipping to the next one
Series #4 empty. Skipping to the next one
Series #4 empty. Skipping to the next one

Extend of dispersion of the transcripts from lncRNA genes in nuclear and cytoplasmic fractions

Analyse the distribution of genes with several transcripts



In [31]:

    
head("../../Reference_Annotation/gencode_v24_transcript.tsv")









    



seqid	type	start	end	strand	ID	transcript_type	transcript_name	gene_id	gene_type	gene_name
chr1	transcript	11869	14409	+	ENST00000456328.2	processed_transcript	DDX11L1-002	ENSG00000223972.5	transcribed_unprocessed_pseudogene	DDX11L1
chr1	transcript	12010	13670	+	ENST00000450305.2	transcribed_unprocessed_pseudogene	DDX11L1-001	ENSG00000223972.5	transcribed_unprocessed_pseudogene	DDX11L1
chr1	transcript	14404	29570	-	ENST00000488147.1	unprocessed_pseudogene	WASH7P-001	ENSG00000227232.5	unprocessed_pseudogene	WASH7P
chr1	transcript	17369	17436	-	ENST00000619216.1	miRNA	MIR6859-1-201	ENSG00000278267.1	miRNA	MIR6859-1
chr1	transcript	29554	31097	+	ENST00000473358.1	lincRNA	RP11-34P13.3-001	ENSG00000243485.3	lincRNA	RP11-34P13.3
chr1	transcript	30267	31109	+	ENST00000469289.1	lincRNA	RP11-34P13.3-002	ENSG00000243485.3	lincRNA	RP11-34P13.3
chr1	transcript	30366	30503	+	ENST00000607096.1	miRNA	MIR1302-2-201	ENSG00000274890.1	miRNA	MIR1302-2
chr1	transcript	34554	36081	-	ENST00000417324.1	lincRNA	FAM138A-001	ENSG00000237613.2	lincRNA	FAM138A
chr1	transcript	35245	36073	-	ENST00000461467.1	lincRNA	FAM138A-002	ENSG00000237613.2	lincRNA	FAM138A

Import the Sleuth DE data for transcript and merge with the transcript information from gencode



In [3]:

    
# Import the data for all RNA and all cell line at transcript level 
DE_df = pd.read_table( "./Localisation_Original_Datasets/Djebali-ENCODE/sleuth/transcript_allRNA/ALL/localizationnucleus_wald_test.tsv", skipinitialspace=True)
# Import transcript information
gene_df = pd.read_table( "../../Reference_Annotation/gencode_v24_transcript.tsv")
# Merge the 2 dataframes together
unique_col = DE_df.columns.difference(gene_df.columns)
df = pd.merge(gene_df, DE_df[unique_col], how="inner", left_on="ID", right_on="target_id")
# Index by gene name and transcript name and clean
df.drop(["target_id", "type", "ens_gene"], axis=1, inplace=True)
df.rename(columns={'ID': 'transcript_id'}, inplace=True)
df.sort_values(by=['gene_id', 'transcript_id'], ascending=[True, True], inplace=True)
df.dropna(inplace=True)
df.head()

# Save to file for latter
df.to_csv("./Localisation_Original_Datasets/Djebali-ENCODE/All-RNA_All-cells_Sleuth-DE-Nuclear-loc_Gencodev24.tsv", sep="\t", index=False)

Plot the number of transcripts per gene depending of the types of RNAs (lncRNA or coding)



In [82]:

    
# Init counters
count_all=Counter()
count_codingRNA=Counter()
count_lncRNA=Counter()
w=0.25

# Define types of RNA
coding_RNA_types = ["protein_coding", "transcribed_processed_pseudogene", "transcribed_unitary_pseudogene", "translated_unprocessed_pseudogene"]
lncRNA_types = ["lincRNA", "antisense", "TEC", "sense_intronic", "processed_transcript", "sense_overlapping", "3prime_overlapping_ncrna", "non_coding", "bidirectional_promoter_lncrna", "macro_lncRNA"]

# Loop and count the number of transcript per genes
for gene_id, transcript_df in df.groupby("gene_id"):
    count_all[len(transcript_df)]+=1
    if transcript_df["gene_type"].all() in coding_RNA_types:
        count_codingRNA[len(transcript_df)]+=1
    elif transcript_df["gene_type"].all() in lncRNA_types:
        count_lncRNA[len(transcript_df)]+=1

# Ploting the distributions
c= get_color_list(4, "Reds_r")
pl.figure(figsize=(20,7))
sns.set(style="whitegrid")
p1 = pl.bar(np.array([float(i) for i in count_all.keys()]), count_all.values(), label="ALL RNA transcript per gene distribution", color=next(c), width=w)
p2 = pl.bar(np.array([float(i) for i in count_codingRNA.keys()])+w, count_codingRNA.values(), label="Coding RNA transcript per gene distribution", color=next(c), width=w)
p3 = pl.bar(np.array([float(i) for i in count_lncRNA.keys()])+w*2, count_lncRNA.values(), label="LncRNA transcript per gene distribution", color=next(c), width=w)
pl.xlabel("Number of expressed isoforms per gene")
pl.ylabel("Count")
pl.title("")
pl.legend(bbox_to_anchor=(1, 1), loc=2,frameon=False)









    Out[82]:





<matplotlib.legend.Legend at 0x7f8959a4f6a0>

The majority of the genes only have 1 isoform detected expressed. It is particularly true for lncRNAs.

Select only the genes with transcripts highly nuclear and transcripts highly cytoplasmic



In [69]:

    
min_beta_range = 2
min_beta = 10
max_beta = -10
min_n_transcript = 3
max_genes_per_type = 100

# General collectors
count_dict = OrderedDict()
count_dict["all"]=[]
genetype_df_list=[]
for genetype, genetype_df in df.groupby("gene_type"):
    # geneid level collectors
    geneid_df_list = []
    count_dict[genetype]=[]
    for geneid, geneid_df in genetype_df.groupby("gene_id"):
        
        # Select gene with at least 2 transcripts
        if len(geneid_df)>=2:
            # Store the range of beta nuclear loc variation for this gene ID
            count_dict["all"].append(geneid_df.b.max()-geneid_df.b.min())
            count_dict[genetype].append(geneid_df.b.max()-geneid_df.b.min())
            # Store the geneid df in a list 
            geneid_df_list.append(geneid_df)
    
    # For this type of gene select the most variable genes
    geneid_df_list.sort(key = lambda df: df.b.max()-df.b.min(), reverse=True)
    geneid_df_list = [df for df in geneid_df_list if len(df)>=min_n_transcript and df.b.max()-df.b.min()>=min_beta_range and df.b.min()<=min_beta and df.b.max() >= max_beta]
    geneid_df_list = geneid_df_list[0:max_genes_per_type]
    
    # If there are still genes after filtering concat the res in a single df and add it to the df list
    if geneid_df_list:
        geneid_df = pd.concat(geneid_df_list)
        genetype_df_list.append(geneid_df)
        n_genes = len(geneid_df.gene_id.unique())
        print("Gene type {}: {} genes selected".format(genetype, n_genes))

# Concat everything in a single dataframe
df2 = pd.concat(genetype_df_list)
n_genes = len(df2.gene_id.unique())
print("Total: {} genes selected".format(n_genes))









    



Gene type antisense: 64 genes selected
Gene type lincRNA: 56 genes selected
Gene type polymorphic_pseudogene: 1 genes selected
Gene type processed_transcript: 51 genes selected
Gene type protein_coding: 100 genes selected
Gene type transcribed_processed_pseudogene: 6 genes selected
Gene type transcribed_unprocessed_pseudogene: 53 genes selected
Gene type unitary_pseudogene: 7 genes selected
Total: 338 genes selected

Plot the results per genes



In [62]:

    
# Graph distribution of beta value delta per genes with at least 2 transcripts
pl.figure(figsize=(20, 10))
for genetype, val_list in count_dict.items():
    if len(val_list)>=20:
        sns.kdeplot(pd.Series(val_list), label=genetype)
pl.xlabel("Delta beta value between transcripts")
pl.ylabel("Count")
pl.title("Delta between beta values of nuclear/cytoplasmic localization of transcript per gene")
pl.legend(bbox_to_anchor=(1, 1), loc=2,frameon=False)
pl.xlim(-2,10)









    



using a non-integer number instead of an integer will result in an error in the future






    Out[62]:





(-2, 10)



In [70]:

    
sns.set(style="whitegrid")

# Graph plotting and tweaking
for genetype, genetype_df in df2.groupby("gene_type"):
    plot_text("{} - Distribution of preferential Nuclear/cytoplasmic localization of transcript per gene".format(genetype), fontsize=15, fontweight="bold", align="center", plot_len=30)
    
    n_genes = len(genetype_df.gene_id.unique())
    pl.figure(figsize=(30, n_genes/6), frameon=False)
    pl.axes(frameon=False)
    g= sns.boxplot(x="b", y="gene_id", data=genetype_df, whis=np.inf, width=0)
    g= sns.stripplot(x="b", y="gene_id", hue="transcript_type", data=genetype_df, palette="Set1", size=(df2.mean_obs*2))
    pl.grid(b=False, which="both", axis="y")
    pl.grid(b=True, which="both", axis="x", linestyle=":", color="grey", linewidth=2)
    pl.legend(bbox_to_anchor=(1, 1), loc=2,frameon=False)
    pl.xlim(-10,10)

Pattern enrichment of lncRNA transcript in each fractions



In [ ]:

	cell_loc	cell	localization	replicate	transcript_id	log_concentration	eff_length	length	log_est_filter	log_tpm_filter	log_est_raw	log_tpm_raw	log_est_norm	log_tpm_norm
ERCC-00002_A549_cytosol_1	A549_cytosol	A549	cytosol	1	ERCC-00002	-9.866192	1011.634547	1061.0	12.879195	10.330360	12.436224	9.670236	12.879195	10.330360
ERCC-00002_A549_cytosol_2	A549_cytosol	A549	cytosol	2	ERCC-00002	-9.866192	958.263200	1061.0	12.708273	10.183487	12.409797	9.735956	12.708273	10.183487
ERCC-00002_A549_nucleus_1	A549_nucleus	A549	nucleus	1	ERCC-00002	-9.866192	800.665212	1061.0	12.159725	9.739943	12.628106	10.240396	12.159725	9.739943
ERCC-00002_A549_nucleus_2	A549_nucleus	A549	nucleus	2	ERCC-00002	-9.866192	781.193314	1061.0	12.200813	9.804348	12.473880	10.411550	12.200813	9.804348
ERCC-00002_GM12878_cytosol_1	GM12878_cytosol	GM12878	cytosol	1	ERCC-00002	-9.866192	1158.248810	1061.0	11.467636	8.527796	11.230430	8.360036	11.467636	8.527796

	A549_cytosol_1	A549_cytosol_2	A549_nucleus_1	A549_nucleus_2	GM12878_cytosol_1	GM12878_cytosol_2	GM12878_nucleus_1	GM12878_nucleus_2	HUVEC_cytosol_1	HUVEC_cytosol_2	⋯	MCF7_nucleus_1	MCF7_nucleus_2	NHEK_cytosol_1	NHEK_cytosol_2	NHEK_nucleus_1	NHEK_nucleus_2	SKNSH_cytosol_1	SKNSH_cytosol_2	SKNSH_nucleus_1	SKNSH_nucleus_2
ENSG00000082929.8	1.16045695852945	2.18115350251946	7.64620614504533	0	0.859748074437148	0	0.565356414989429	1.28832509124729	0	0	⋯	26.9438076412688	5.34587427488773	0	0	0	0	1.09192984274639	1.58821771138767	0	1.12687657464808
ENSG00000083622.8	0	0	0	2.56990394384941	0	0	0.722165705975895	3.98581170792923	0	0	⋯	0	0	0	0	0	0	0	0	0	0
ENSG00000093100.13	4206.58525776049	5245.44532067576	42246.6671145073	17345.4798764029	2786.64983749646	2091.24126013715	8034.338939453	10705.3661938123	2311.82183919593	1934.56595273566	⋯	5363.38805564951	1687.06573490952	1365.36952354241	1963.37356687556	10331.6192259408	10477.5619662555	8092.1303367047	5620.28566102706	59144.8086332296	85625.1230322502
ENSG00000099869.7	3.89274133185661	2.02083880944801	7.46284691847996	3.41995085820615	0	0	0	0	4.62901334548625	0	⋯	1.85148217135638	0	0	0	0	0	0	0	0.824219713139594	0
ENSG00000100181.21	610.343019447761	673.730123311848	1148.63205572315	1068.16303362129	1158.04680082864	774.353319596	1964.63945083306	2574.65942742537	525.661551196438	548.651766402981	⋯	1842.48523037281	648.403047303998	1418.10083385372	2965.4447669461	5139.14025251317	5744.68069123436	753.666074861697	648.840275878875	2032.41967960662	2844.09733862694
ENSG00000115934.11	26.5928725960095	31.1329158208316	92.1987541709246	129.292130967264	97.7291836672975	49.9085326069612	193.771626511939	212.18697749753	27.4608245128903	26.7699108852439	⋯	206.744401508904	55.5545985527226	34.2299176175002	63.3576946002565	169.262475081999	174.126501003718	68.4717592117736	60.661218890785	173.126273860766	267.769557336931

	target_id	cell_type	tpm_nucleus	tpm_cytosol	est_nucleus	est_cytosol	mean_log2_tpm_nucleus	mean_log2_tpm_cytosol	mean_log2_est_nucleus	mean_log2_est_cytosol
ENST00000222396.9_A549	ENST00000222396.9	A549	[2.06925, 0.7255149999999999]	[0.384623, 2.1539]	[9.280439999999999, 2.27547]	[2.36717, 12.1537]	0.924011	0.823147	2.650295	2.956138
ENST00000222396.9_GM12878	ENST00000222396.9	GM12878	[1.16744, 0.381981]	[0.0, 0.0]	[6.43525, 1.8635400000000002]	[0.0, 0.0]	0.350170	-1.000000	2.217043	-1.000000
ENST00000222396.9_HUVEC	ENST00000222396.9	HUVEC	[3.05652, 2.28606]	[0.0, 0.0]	[15.8865, 8.62248]	[0.0, 0.0]	1.665070	-1.000000	3.672933	-1.000000
ENST00000222396.9_HeLaS3	ENST00000222396.9	HeLaS3	[6.32489, 3.5624599999999997]	[0.602563, 0.0]	[24.3011, 16.2414]	[2.9820599999999997, 0.0]	2.444581	-0.319619	4.376516	0.993515
ENST00000222396.9_HepG2	ENST00000222396.9	HepG2	[1.43084, 1.39398]	[0.0, 0.0]	[8.92423, 6.06273]	[0.0, 0.0]	0.935392	-1.000000	2.998824	-1.000000