Gavin Gray 10th June 2014

Looking into misc high-throughput paper

Downloaded the supplementary data from a promising-looking experiment found on the GEO site: SnapShot-Seq: a method for extracting genome-wide, in vivo mRNA dynamics from a single total RNA sample. Took the first sample entry for this paper and am looking at its first supplementary file. At this point I'm unsure what distinguishes the different samples.

File GSM1212180_HiSeq_IsoG-CPT_D6-26_c10_BINS-READS-Densities_r50_N10M_ALL-GENES_INT.EVERY_u100.tsv.gz downloaded from here.

Opening up the .tsv and having a look at it:



In [1]:

    
cd /home/gavin/Documents/MRes/misc/









    



/home/gavin/Documents/MRes/misc



In [6]:

    
import pandas as pd
df = pd.read_csv("GSM1212180_HiSeq_IsoG-CPT_D6-26_c10_BINS-READS-Densities_r50_N10M_ALL-GENES_INT.EVERY_u100.tsv", delimiter="\t", header=0)
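As an aside, pandas can read the gzipped download directly, so the manual decompression step is optional. A minimal sketch, assuming Python 3 and using a tiny stand-in file (the file name `toy_densities.tsv.gz` and its three columns are made up for illustration; only the two gene names and IDs are taken from the real table):

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny stand-in file so the sketch runs anywhere; the real input
# would be the GSM1212180 .tsv.gz file downloaded from GEO.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "toy_densities.tsv.gz")  # hypothetical name
with gzip.open(path, "wt") as handle:
    handle.write("Gene\tGeneID\tBin5p_1/100\n")
    handle.write("FAM138A\t645520\t0.0\n")
    handle.write("LOC643837\t643837\t0.0107\n")

# pandas would infer gzip from the .gz suffix; naming it is just explicit.
toy = pd.read_csv(path, sep="\t", header=0, compression="gzip")
print(toy.shape)
```

The same `read_csv` call should work on the real file by swapping in its path.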



In [11]:

    
df.head()









    Out[11]:






  
    
      
	FeatureSORT	Gene	GeneID	Chrom	Strand	FeatureStart	FeatureStop	GeneLength	zStart	zStop	...	Bin5p_91/100	Bin5p_92/100	Bin5p_93/100	Bin5p_94/100	Bin5p_95/100	Bin5p_96/100	Bin5p_97/100	Bin5p_98/100	Bin5p_99/100	Bin5p_100/100
0	1	FAM138A	645520	1	-	35482	35720	1471	34611	36130	...	0	0	0	0	0	0.0000	0.0000	0	0.0000	0.0000
1	2	FAM138A	645520	1	-	35175	35276	1471	34611	36130	...	0	0	0	0	0	0.0000	0.0000	0	0.0000	0.0000
2	3	LOC643837	643837	1	+	763156	764382	26677	763064	789789	...	0	0	0	0	0	0.0000	0.0000	0	0.0000	0.0000
3	4	LOC643837	643837	1	+	764485	783033	26677	763064	789789	...	0	0	0	0	0	0.0107	0.0654	0	0.1523	0.0305
4	5	LOC643837	643837	1	+	783187	787306	26677	763064	789789	...	0	0	0	0	0	0.0000	0.0000	0	0.0000	0.0000
    
  

5 rows × 121 columns



In [37]:

    
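# plot the 100 bin densities for two introns (bin columns start at index 21)
import matplotlib.pyplot as plt  # plot() is otherwise only defined under pylab
plt.plot(df.iloc[1,21:])
plt.plot(df.iloc[2,21:])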









    Out[37]:





[<matplotlib.lines.Line2D at 0x7f6194b15c90>]

Just got around to reading about this file:

The "BINS-READS-Densities" .tsv files contain tiled Densities per every individual intron in the annotated genome of interest. Each intron is divided into 100 bins of equal size (fractional bp); the contributions of reads that only partially overlap a bin are pro rated to it (a single read's average Density in a bin equals the fraction of the bin it covers). Average Densities per bin are renormalized as in MAPtoFeatures, but only Sense reads are included. About 20 columns of genic and locus information for the ~200k introns are followed by 100 columns listing each intron's 100 normalized Densities ordered from its 5'-most to its 3'-most bin. These Densities are used to quantify the slopes of the differential abundance of nascent transcripts across introns.

Supplementary_files_format_and_content: The bigWig .bw files are in standard format for compressed variable-step wiggle tracks. Most samples listed here have two bigWig files, intended for separate tracks for POS and NEG strands. (There are only POS tracks for the 6 strand-nonspecific LCL samples, except additionally both strands are shown for each end for the 2 samples with paired ends. See "Exceptions" below.)
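To make the binning in the quoted description concrete, here is a rough sketch of how a single read could be pro-rated across 100 equal-width bins. This is my reading of the text, not the SnapShot-Seq code: the function name `prorate_read`, the half-open coordinates, and the toy positions are all assumptions, and the MAPtoFeatures renormalization is left out.

```python
import numpy as np

def prorate_read(intron_start, intron_stop, read_start, read_stop, n_bins=100):
    """Fraction of each bin covered by one read (half-open coordinates)."""
    # Bin edges can fall on fractional base positions.
    edges = np.linspace(intron_start, intron_stop, n_bins + 1)
    lo = np.maximum(edges[:-1], read_start)
    hi = np.minimum(edges[1:], read_stop)
    overlap = np.clip(hi - lo, 0.0, None)  # zero where the read misses the bin
    return overlap / (edges[1:] - edges[:-1])

# Toy intron of 250 bp, so each bin is 2.5 bp wide, with one 50 bp read
# starting 11 bp into the intron. Positions are invented for illustration.
cov = prorate_read(1000, 1250, 1011, 1061)
print(cov[3:6], cov[23:26])
```

Bins fully inside the read get 1.0, bins the read only clips get the covered fraction, and everything else stays at 0.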

So these appear to be positional read densities along each intron rather than time-dependent abundances, if that's what we're looking for.
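Even without a time axis, the GEO description says the densities are meant for quantifying slopes across introns. A minimal sketch of one way to summarise a 100-bin profile, using an ordinary least-squares line; the function name `bin_slope` and the synthetic profile are my own, and this is not necessarily how the paper computes its slopes.

```python
import numpy as np

def bin_slope(densities):
    """Least-squares slope of density against fractional position (0 to 1)."""
    y = np.asarray(densities, dtype=float)
    x = (np.arange(y.size) + 0.5) / y.size  # bin centres, 5' end first
    slope, intercept = np.polyfit(x, y, 1)
    return slope

# Synthetic profile falling linearly from 5' to 3'; not real data.
toy_profile = np.linspace(1.0, 0.0, 100)
print(bin_slope(toy_profile))
```

On the table loaded above, something like `bin_slope(df.iloc[3, 21:])` should give the slope for a single intron, assuming the bin columns start at index 21 as in the plotting cell.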

To check though, how many of the genes in here are represented in our dataset?



In [17]:

    
# retrieve just the gene IDs
EntIDs = df.GeneID.values
# remove duplicate entries, keeping first-seen order
from collections import OrderedDict
EntIDs = list(OrderedDict.fromkeys(EntIDs))
print("%i unique genes in %i rows" % (len(EntIDs), len(df.GeneID)))









    



18226 unique genes in 184057 rows



In [40]:

    
cd ~/Documents/MRes/forGAVIN/pulldown_data/BAITS/









    



/home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS



In [42]:

    
import csv
from matplotlib.cbook import flatten  # otherwise only available via pylab
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"))))



In [43]:

    
cd ~/Documents/MRes/forGAVIN/pulldown_data/PREYS/









    



/home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS



In [44]:

    
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"))))



In [45]:

    
cd /home/gavin/Documents/MRes/DIP/human/









    



/home/gavin/Documents/MRes/DIP/human



In [46]:

    
dipIDs = list(flatten(csv.reader(open("total.Entrez.txt"))))



In [47]:

    
EntID_set = set(str(i) for i in EntIDs)  # csv gives strings, GeneID is parsed as int
crossover = [x for x in baitids+preyids+dipIDs if x in EntID_set]
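One thing to watch in a membership test like the one above: IDs read with `csv.reader` are strings, while pandas will normally parse a numeric `GeneID` column as integers, so the two only match once they share a type. A small sketch of the overlap count using sets and toy stand-ins (the names `toy_df`, `toy_bait_ids`, `toy_prey_ids` and the IDs "1" and "2" are made up; the other two IDs come from the table head):

```python
import pandas as pd

# Toy stand-ins: GeneID as integers (as pandas parses it), our IDs as text.
toy_df = pd.DataFrame({"GeneID": [645520, 645520, 643837, 643837, 643837]})
toy_bait_ids = ["643837", "1"]            # "1" is a made-up ID
toy_prey_ids = ["645520", "2", "643837"]  # "2" is a made-up ID

geo_ids = set(toy_df["GeneID"].astype(str))      # unique IDs in the GEO table
our_ids = set(toy_bait_ids) | set(toy_prey_ids)  # unique IDs in our lists
shared = our_ids & geo_ids
print("%i of %i IDs found in the GEO table" % (len(shared), len(our_ids)))
```

Sets also remove the repeats that concatenating the bait, prey and DIP lists would otherwise count more than once.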

Different example paper

Trying a new paper to see if I can find some more applicable data. New example paper: Simultaneous transcriptional profiling of bacteria and their host cells by heterogeneous RNA-Seq (hRNA-Seq).

Looking at this record.

Looks like there's only one supplementary file, without much explanation. Following the ftp link, the data appears to be in SRA (Sequence Read Archive) format. This might be too raw for me to use, and there's no indication of whether it is a time series.


