Downloaded the supplementary data from a promising-looking experiment found on the GEO site: SnapShot-Seq: a method for extracting genome-wide, in vivo mRNA dynamics from a single total RNA sample. Took the first sample entry for this paper and am looking at its first supplementary file. Unsure at this point what the difference is between the samples.
File GSM1212180_HiSeq_IsoG-CPT_D6-26_c10_BINS-READS-Densities_r50_N10M_ALL-GENES_INT.EVERY_u100.tsv.gz downloaded from here.
Opening up the .tsv and having a look at it:
In [1]:
cd /home/gavin/Documents/MRes/misc/
In [6]:
import pandas as pd
df = pd.read_csv("GSM1212180_HiSeq_IsoG-CPT_D6-26_c10_BINS-READS-Densities_r50_N10M_ALL-GENES_INT.EVERY_u100.tsv", delimiter="\t", header=0)
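The download is a .tsv.gz but the cell above reads an already-decompressed .tsv. A minimal sketch of peeking at the gzipped file directly, using only the standard library; `peek_gzipped_tsv` is a hypothetical helper name, and the code is written for Python 3 (the text-mode `"rt"` flag is not available in Python 2's gzip module). pandas can also read a .gz path directly, since `read_csv` infers compression from the file extension by default.

```python
import csv
import gzip


def peek_gzipped_tsv(path, n_rows=5):
    """Return the header and the first n_rows data rows of a gzipped TSV."""
    # "rt" decompresses and decodes to text in one step
    with gzip.open(path, "rt") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = []
        for row in reader:
            if len(rows) >= n_rows:
                break
            rows.append(row)
    return header, rows
```

All values come back as strings, so this is only good for a quick look at the column names before loading the whole thing into a DataFrame.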
In [11]:
df.head()
Out[11]:
In [37]:
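# plot the bin values for two introns (rows 1 and 2), taking columns from index 21 onwards
from matplotlib.pyplot import plot  # only needed if the notebook wasn't started with pylab
plot(df.iloc[1,21:])
plot(df.iloc[2,21:])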
Out[37]:
Just got around to reading about this file:
The "BINS-READS-Densities" .tsv files contain tiled Densities per every individual intron in the annotated genome of interest. Each intron is divided into 100 bins of equal size (fractional bp); the contributions of reads that only partially overlap a bin are pro rated to it (a single read's average Density in a bin equals the fraction of the bin it covers). Average Densities per bin are renormalized as in MAPtoFeatures, but only Sense reads are included. About 20 columns of genic and locus information for the ~200k introns are followed by 100 columns listing each intron's 100 normalized Densities ordered from its 5'-most to its 3'-most bin. These Densities are used to quantify the slopes of the differential abundance of nascent transcripts across introns. Supplementary_files_format_and_content: The bigWig .bw files are in standard format for compressed variable-step wiggle tracks. Most samples listed here have two bigWig files, intended for separate tracks for POS and NEG strands. (There are only POS tracks for the 6 strand-nonspecific LCL samples, except additionally both strands are shown for each end for the 2 samples with paired ends. See "Exceptions" below.)
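To make the description above concrete for myself, here is a rough sketch of the pro-rated binning and of one way to get a slope out of the 100 bins. This is not the authors' code: the function names are made up, the renormalization "as in MAPtoFeatures" is not reproduced, and the ordinary least-squares fit is my assumption, since the description doesn't say how the slopes are actually computed.

```python
def prorated_bin_densities(intron_start, intron_end, reads, n_bins=100):
    """Tile an intron into n_bins equal (fractional-bp) bins and add, for each
    read, the fraction of every bin that the read covers.

    reads is a list of (start, end) half-open intervals in intron coordinates.
    No normalization is applied (assumption: raw pro-rated coverage only).
    """
    bin_width = (intron_end - intron_start) / float(n_bins)
    densities = [0.0] * n_bins
    for read_start, read_end in reads:
        # clip the read to the intron
        lo = max(read_start, intron_start)
        hi = min(read_end, intron_end)
        if hi <= lo:
            continue
        first = min(int((lo - intron_start) / bin_width), n_bins - 1)
        last = min(int((hi - intron_start) / bin_width), n_bins - 1)
        for b in range(first, last + 1):
            bin_lo = intron_start + b * bin_width
            bin_hi = bin_lo + bin_width
            overlap = min(hi, bin_hi) - max(lo, bin_lo)
            if overlap > 0:
                # a read's contribution is the fraction of the bin it covers
                densities[b] += overlap / bin_width
    return densities


def bin_slope(densities):
    """Least-squares slope of density against bin index (0 = 5'-most bin)."""
    n = len(densities)
    mean_x = (n - 1) / 2.0
    mean_y = sum(densities) / float(n)
    sxy = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(densities))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx
```

For a row of the table, something like `bin_slope(list(df.iloc[1, -100:]))` would give the slope for that intron, assuming the last 100 columns really are the bins.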
So it appears these aren't time-dependent abundances, which is what we're looking for: they are read densities along the length of each intron.
To check though, how many of the genes in here are represented in our dataset?
In [17]:
# retrieve just the gene IDs
EntIDs = df.GeneID.values
#remove duplicate entries
from collections import OrderedDict
EntIDs = list(OrderedDict.fromkeys(EntIDs))
print("%i unique genes in %i rows" % (len(EntIDs), len(df.GeneID)))
In [40]:
cd ~/Documents/MRes/forGAVIN/pulldown_data/BAITS/
In [42]:
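import csv
from itertools import chain
flatten = chain.from_iterable  # one-level flatten of the CSV rows; flatten isn't defined anywhere in this notebook
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"))))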
In [43]:
cd ~/Documents/MRes/forGAVIN/pulldown_data/PREYS/
In [44]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"))))
In [45]:
cd /home/gavin/Documents/MRes/DIP/human/
In [46]:
dipIDs = list(flatten(csv.reader(open("total.Entrez.txt"))))
In [47]:
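entid_set = set(str(x) for x in EntIDs)  # csv gives strings, GeneID values are probably numeric, so compare as strings
crossover = [x for x in baitids+preyids+dipIDs if x in entid_set]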
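The list above lumps all three sources together and keeps duplicates, so it doesn't say how well each set is covered. A small sketch for a per-source count; `overlap_summary` is a hypothetical helper, and comparing everything as strings is an assumption about how the IDs are stored.

```python
def overlap_summary(reference_ids, id_lists):
    """For each named ID list, return (number found in reference, number unique).

    IDs are compared as strings so that ints from a DataFrame column
    match the strings read from the CSV files.
    """
    reference = set(str(x) for x in reference_ids)
    summary = {}
    for name, ids in id_lists.items():
        unique = set(str(x) for x in ids)
        summary[name] = (len(unique & reference), len(unique))
    return summary
```

Usage would be something like `overlap_summary(EntIDs, {"baits": baitids, "preys": preyids, "DIP": dipIDs})`. One thing to check first: if the GeneID column has missing values, pandas will have stored it as floats, and the string form would then be "1234.0" rather than "1234".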
Trying a new paper to see if I can find more applicable data. New example paper: Simultaneous transcriptional profiling of bacteria and their host cells by heterogeneous RNA-Seq (hRNA-Seq).
Looking at this record.
Looks like there's only one supplementary file, without a whole lot of explanation. Looking at the ftp link, the data appears to be in SRA (Sequence Read Archive) format. This might be too raw for me to use, and there's no indication of whether it is a time series.
In [ ]: