Gavin Gray 10th June 2014

Looking into misc high-throughput paper

Downloaded the supplementary data from a promising-looking experiment found on the GEO site: SnapShot-Seq: a method for extracting genome-wide, in vivo mRNA dynamics from a single total RNA sample. Took the first sample entry for this paper and am looking at its first supplementary file. At this point I'm unsure what the differences between the samples are.

File GSM1212180_HiSeq_IsoG-CPT_D6-26_c10_BINS-READS-Densities_r50_N10M_ALL-GENES_INT.EVERY_u100.tsv.gz downloaded from here.

Opening up the .tsv and having a look at it:


In [1]:
cd /home/gavin/Documents/MRes/misc/


/home/gavin/Documents/MRes/misc

In [6]:
import pandas as pd
df = pd.read_csv("GSM1212180_HiSeq_IsoG-CPT_D6-26_c10_BINS-READS-Densities_r50_N10M_ALL-GENES_INT.EVERY_u100.tsv", delimiter="\t", header=0)

In [11]:
df.head()


Out[11]:
FeatureSORT Gene GeneID Chrom Strand FeatureStart FeatureStop GeneLength zStart zStop ... Bin5p_91/100 Bin5p_92/100 Bin5p_93/100 Bin5p_94/100 Bin5p_95/100 Bin5p_96/100 Bin5p_97/100 Bin5p_98/100 Bin5p_99/100 Bin5p_100/100
0 1 FAM138A 645520 1 - 35482 35720 1471 34611 36130 ... 0 0 0 0 0 0.0000 0.0000 0 0.0000 0.0000
1 2 FAM138A 645520 1 - 35175 35276 1471 34611 36130 ... 0 0 0 0 0 0.0000 0.0000 0 0.0000 0.0000
2 3 LOC643837 643837 1 + 763156 764382 26677 763064 789789 ... 0 0 0 0 0 0.0000 0.0000 0 0.0000 0.0000
3 4 LOC643837 643837 1 + 764485 783033 26677 763064 789789 ... 0 0 0 0 0 0.0107 0.0654 0 0.1523 0.0305
4 5 LOC643837 643837 1 + 783187 787306 26677 763064 789789 ... 0 0 0 0 0 0.0000 0.0000 0 0.0000 0.0000

5 rows × 121 columns


In [37]:
#plot a graph of some of these bin values
plot(df.iloc[1,21:])
plot(df.iloc[2,21:])


Out[37]:
[<matplotlib.lines.Line2D at 0x7f6194b15c90>]
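The plot above picks out the bin values by position (`21:`), which depends on there being exactly 21 leading annotation columns. A minimal sketch of selecting them by name instead, assuming every bin column follows the `Bin5p_<n>/100` pattern visible in the header (only bins 91 to 100 are shown in the output, so the rest is an assumption); `bin_columns` is a hypothetical helper, not part of pandas:

```python
def bin_columns(columns, prefix="Bin5p_"):
    """Return the per-bin column names, ordered from 5'-most to 3'-most bin."""
    bins = [c for c in columns if c.startswith(prefix)]
    # Names look like "Bin5p_91/100"; sort on the number before the slash.
    return sorted(bins, key=lambda c: int(c[len(prefix):].split("/")[0]))


# Hypothetical miniature header, just to show the behaviour.
example = ["GeneID", "zStop", "Bin5p_2/3", "Bin5p_1/3", "Bin5p_3/3"]
print(bin_columns(example))
```

With the real frame this could be used as `df.loc[1, bin_columns(df.columns)]` in place of `df.iloc[1, 21:]`.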

Just got around to reading about this file:

The "BINS-READS-Densities" .tsv files contain tiled Densities per every individual intron in the annotated genome of interest. Each intron is divided into 100 bins of equal size (fractional bp); the contributions of reads that only partially overlap a bin are pro rated to it (a single read's average Density in a bin equals the fraction of the bin it covers). Average Densities per bin are renormalized as in MAPtoFeatures, but only Sense reads are included.

About 20 columns of genic and locus information for the ~200k introns are followed by 100 columns listing each intron's 100 normalized Densities ordered from its 5'-most to its 3'-most bin. These Densities are used to quantify the slopes of the differential abundance of nascent transcripts across introns.

Supplementary_files_format_and_content: The bigWig .bw files are in standard format for compressed variable-step wiggle tracks. Most samples listed here have two bigWig files, intended for separate tracks for POS and NEG strands. (There are only POS tracks for the 6 strand-nonspecific LCL samples, except additionally both strands are shown for each end for the 2 samples with paired ends. See "Exceptions" below.)
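As a rough illustration of the binning described in the quoted text, here is a minimal sketch of how one read's coverage could be pro-rated across equal-width bins, and how a slope across an intron's densities might be estimated. The function names, the half-open coordinate convention and the use of an ordinary least-squares line are my assumptions, not taken from the paper's pipeline, and the renormalization step is left out.

```python
def prorated_coverage(feature_start, feature_stop, read_start, read_stop, n_bins=100):
    """Fraction of each equal-width bin covered by one read.

    Assumes half-open [start, stop) coordinates (an assumption; the
    GEO description does not state the convention).
    """
    width = (feature_stop - feature_start) / float(n_bins)  # fractional bp allowed
    fractions = []
    for i in range(n_bins):
        bin_lo = feature_start + i * width
        bin_hi = bin_lo + width
        overlap = min(bin_hi, read_stop) - max(bin_lo, read_start)
        fractions.append(max(0.0, overlap) / width)
    return fractions


def density_slope(densities):
    """Least-squares slope of density against bin index (0 = 5'-most bin).

    Needs at least two values.
    """
    n = len(densities)
    mean_x = (n - 1) / 2.0
    mean_y = sum(densities) / float(n)
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(densities))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


# A 1000 bp feature gives 10 bp bins; a read spanning 15..40 half-covers
# bin 1 and fully covers bins 2 and 3.
cov = prorated_coverage(0, 1000, 15, 40)
print(cov[:5])
```

A negative value from `density_slope` would mean density falls off towards the 3' end of the intron.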

So it appears these aren't time-dependent abundances, if that's what we're looking for.

To check though, how many of the genes in here are represented in our dataset?


In [17]:
#retrieve just gene IDs
EntIDs = df.GeneID.values
#remove duplicate entries
from collections import OrderedDict
EntIDs = list(OrderedDict.fromkeys(EntIDs))
print("%i unique genes in %i rows" % (len(EntIDs), len(df.GeneID)))


18226 unique genes in 184057 rows

In [40]:
cd ~/Documents/MRes/forGAVIN/pulldown_data/BAITS/


/home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS

In [42]:
import csv
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"))))

In [43]:
cd ~/Documents/MRes/forGAVIN/pulldown_data/PREYS/


/home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS

In [44]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"))))

In [45]:
cd /home/gavin/Documents/MRes/DIP/human/


/home/gavin/Documents/MRes/DIP/human

In [46]:
dipIDs = list(flatten(csv.reader(open("total.Entrez.txt"))))

In [47]:
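entrez_set = set(str(x) for x in EntIDs)  # csv.reader yields strings; GeneID was most likely parsed as integers
crossover = [x for x in baitids+preyids+dipIDs if x in entrez_set]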
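To actually answer the "how many" question per source, something like the following sketch could be used. `overlap_by_source` is a hypothetical helper and the IDs in the example are made up; everything is compared as stripped strings because IDs read with `csv.reader` are strings, while the `GeneID` column was most likely parsed as integers.

```python
def overlap_by_source(reference_ids, **id_lists):
    """Count how many unique IDs from each named list appear in reference_ids.

    Returns {name: (number found, number of unique IDs)}.
    """
    reference = set(str(x).strip() for x in reference_ids)
    counts = {}
    for name, ids in id_lists.items():
        unique = set(str(x).strip() for x in ids)
        counts[name] = (len(unique & reference), len(unique))
    return counts


# Made-up IDs purely for illustration.
counts = overlap_by_source([645520, 643837, 7157],
                           baits=["7157", "1234"],
                           preys=["643837", " 645520", "643837"])
for name in sorted(counts):
    print("%s: %d of %d unique IDs found" % ((name,) + counts[name]))
```

With the notebook's variables this would be called as `overlap_by_source(EntIDs, baits=baitids, preys=preyids, dip=dipIDs)`.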

Different example paper

Trying a new paper to see if I can find more applicable data. New example paper: Simultaneous transcriptional profiling of bacteria and their host cells by heterogeneous RNA-Seq (hRNA-Seq).

Looking at this record.

Looks like there's only one supplementary file, without much explanation. Looking at the FTP link, the data appears to be in SRA (Sequence Read Archive) format. This might be too raw for me to use, and there's no indication of whether it is a time series.

