Testing ortho

  • First, see if Likit has putative ortholog assignments between the chick RNAseq and human proteins. If not, the eel-pond protocol has some instructions on how to calculate this. Might take a few days. Then, we can recalculate the Venn diagram with a better curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that are in A, C, and D.

    • Likit: I only have homologs of chick RNASeq and human proteins of differential-expressed genes, which total ~1000 genes as of now.

    • Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do/do not match in the genome? Nic has done exactly this analysis for fungus.


In [64]:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles

Preparing all RNA data


In [30]:
!cd .. && make outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq


make: `outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq' is up to date.

Preparing 'only in RNA' data


In [34]:
!cd .. && make outputs/rna/only_rna.fa.renamed.fasta.nsq


make: `outputs/rna/only_rna.fa.renamed.fasta.nsq' is up to date.

Preparing Uniprot


In [36]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta.nsq


cd outputs/uniprot && \
	formatdb -i uniprot_sprot.fasta -o T -p T

In [40]:
!cd .. && make outputs/uniprot/uniprot_sprot.namedb


cd outputs/uniprot && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/make-namedb.py uniprot_sprot.fasta uniprot_sprot.namedb

In [42]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta_screed


make: `outputs/uniprot/uniprot_sprot.fasta_screed' is up to date.

Uniprot ortho


In [46]:
!cd .. && make workdir/results/uniprot.x.chick.pbs


mkdir -p workdir/results
JOBID=`echo make workdir/results/uniprot.x.chick | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N pbcr.uniprot.x.chick.pbs -o workdir/results/uniprot.x.chick.pbs -e workdir/results/uniprot.x.chick.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [workdir/results/uniprot.x.chick.pbs] Interrupt


In [47]:
!cd .. && make workdir/results/chick.x.uniprot.pbs
!cd .. && make workdir/results/uniprot.x.chick.pbs


mkdir -p workdir/results
JOBID=`echo make workdir/results/chick.x.uniprot | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N blast.chick.x.uniprot.pbs -o workdir/results/chick.x.uniprot.pbs -e workdir/results/chick.x.uniprot.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [workdir/results/chick.x.uniprot.pbs] Interrupt

In [8]:
#!cp ../workdir/results/uniprot.x.chick ../outputs/rna/uniprot.x.chick
#!cp ../workdir/results/chick.x.uniprot ../outputs/rna/chick.x.uniprot

In [56]:
!cd .. && make outputs/rna/global_merged.fa.clean.nr.renamed.fasta.annot


cd outputs/rna && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/annotate-seqs.py /mnt/research/ged/irberlui/biodata/galGal/outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta chick.x.uniprot.ortho chick.x.uniprot.homol
Scanning sequences -- first pass to gather info
... 0
... 25000
... 50000
... 75000
... 100000
... 125000
... 150000
... 175000
... 200000
... 225000
... 250000
... 275000
... 300000
... 325000
... 350000
... 375000
... 400000
second pass: annotating
... x2 0
... x2 25000
... x2 50000
... x2 75000
... x2 100000
... x2 125000
... x2 150000
... x2 175000
... x2 200000
... x2 225000
... x2 250000
... x2 275000
... x2 300000
... x2 325000
... x2 350000
... x2 375000
... x2 400000
----
419276 sequences total
30421 annotated / ortho
182213 annotated / homol
0 annotated / tr
212634 total annotated

annotated sequences in FASTA format: global_merged.fa.clean.nr.renamed.fasta.annot
annotation spreadsheet in: global_merged.fa.clean.nr.renamed.fasta.annot.csv
annotation spreadsheet with sequences (warning: LARGE): global_merged.fa.clean.nr.renamed.fasta.annot.large.csv

Recalculating Venn diagram with ortho


In [58]:
!cd .. && make outputs/rna/ortho.fa


python /mnt/research/ged/irberlui/biodata/galGal/scripts/extract.py outputs/rna/global_merged.fa.clean.nr.renamed.fasta.annot outputs/rna/ortho.fa ortho

In [9]:
!cd .. && make outputs/reference/galGal4.fa.nsq


make: `outputs/reference/galGal4.fa.nsq' is up to date.

In [8]:
!cd .. && make outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta.00.nsq


make: `outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta.00.nsq' is up to date.

In [10]:
!cd .. && make workdir/results/Chick_RNA_BLAST.txt.pbs


blastn -query outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta -out workdir/results/Chick_RNA_BLAST.txt -db outputs/reference/galGal4.fa -outfmt 6 -evalue 1e-3 -num_threads 8

In [11]:
!cd .. && make workdir/results/Moleculo_RNA_BLAST.txt.pbs


blastn -query outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta -out workdir/results/Moleculo_RNA_BLAST.txt -db outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta -outfmt 6 -evalue 1e-3 -num_threads 16

In [54]:
!cd .. && make outputs/rna/match_ortho.rna.ref.txt


cd outputs/rna && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa Chick_RNA_BLAST.txt ortho.rna.ref
Number of matches: 26766
Number of no matches: 3655

In [55]:
!cd .. && make outputs/rna/match_ortho.rna.moleculo.txt


cd outputs/rna && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa Moleculo_RNA_BLAST.txt ortho.rna.moleculo
Number of matches: 23123
Number of no matches: 7298

In [59]:
with open("../outputs/rna/match_ortho.rna.moleculo.txt") as rna_mol_match:
    rna_mol_set = set([line.split(' ')[0] for line in rna_mol_match if line.strip()])

with open("../outputs/rna/match_ortho.rna.ref.txt") as rna_ref_match:
    rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])

In [60]:
B_set = rna_mol_set & rna_ref_set
len(B_set)


Out[60]:
23120

In [61]:
B = len(rna_mol_set & rna_ref_set)
A = len(rna_ref_set) - B
C = len(rna_mol_set) - B
E = 30421
D = E - A - C - B

print A, B, C, D, E


3646 23120 3 3652 30421

In [65]:
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))

v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'Moleculo', 'RNA'))

v.get_label_by_id('100').set_text('')
v.get_label_by_id('010').set_text('')
v.get_label_by_id('001').set_text('D\n%d' % D)

v.get_label_by_id('011').set_text('C\n%d' % C)
v.get_label_by_id('101').set_text('A\n%d' % A)
v.get_label_by_id('110').set_text('')

v.get_label_by_id('111').set_text('B\n%d' % B)
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')


Out[65]:
<matplotlib.text.Text at 0x2b5617b25a90>