Testing ortho

First, see if Likit has putative ortholog assignments between the chick RNAseq and human proteins. If not, the eel-pond protocol has some instructions on how to calculate this. Might take a few days. Then, we can recalculate the Venn diagram with a better curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that are in A, C, and D.
- Likit: I only have homologs of chick RNASeq and human proteins of differential-expressed genes, which total ~1000 genes as of now.
- Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do/do not match in the genome? Nic has done exactly this analysis for fungus.



In [64]:

    
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles

Preparing all RNA data



In [30]:

    
!cd .. && make outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq









    



make: `outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq' is up to date.

Preparing 'only in RNA' data



In [34]:

    
!cd .. && make outputs/rna/only_rna.fa.renamed.fasta.nsq









    



make: `outputs/rna/only_rna.fa.renamed.fasta.nsq' is up to date.

Preparing Uniprot



In [36]:

    
!cd .. && make outputs/uniprot/uniprot_sprot.fasta.nsq









    



cd outputs/uniprot && \
	formatdb -i uniprot_sprot.fasta -o T -p T



In [40]:

    
!cd .. && make outputs/uniprot/uniprot_sprot.namedb









    



cd outputs/uniprot && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/make-namedb.py uniprot_sprot.fasta uniprot_sprot.namedb



In [42]:

    
!cd .. && make outputs/uniprot/uniprot_sprot.fasta_screed









    



make: `outputs/uniprot/uniprot_sprot.fasta_screed' is up to date.

Uniprot ortho



In [46]:

    
!cd .. && make workdir/results/uniprot.x.chick.pbs









    



mkdir -p workdir/results
JOBID=`echo make workdir/results/uniprot.x.chick | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N pbcr.uniprot.x.chick.pbs -o workdir/results/uniprot.x.chick.pbs -e workdir/results/uniprot.x.chick.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [workdir/results/uniprot.x.chick.pbs] Interrupt



In [47]:

    
!cd .. && make workdir/results/chick.x.uniprot.pbs
!cd .. && make workdir/results/uniprot.x.chick.pbs









    



mkdir -p workdir/results
JOBID=`echo make workdir/results/chick.x.uniprot | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N blast.chick.x.uniprot.pbs -o workdir/results/chick.x.uniprot.pbs -e workdir/results/chick.x.uniprot.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [workdir/results/chick.x.uniprot.pbs] Interrupt



In [8]:

    
#!cp ../workdir/results/uniprot.x.chick ../outputs/rna/uniprot.x.chick
#!cp ../workdir/results/chick.x.uniprot ../outputs/rna/chick.x.uniprot



In [56]:

    
!cd .. && make outputs/rna/global_merged.fa.clean.nr.renamed.fasta.annot









    



cd outputs/rna && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/annotate-seqs.py /mnt/research/ged/irberlui/biodata/galGal/outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta chick.x.uniprot.ortho chick.x.uniprot.homol
Scanning sequences -- first pass to gather info
... 0
... 25000
... 50000
... 75000
... 100000
... 125000
... 150000
... 175000
... 200000
... 225000
... 250000
... 275000
... 300000
... 325000
... 350000
... 375000
... 400000
second pass: annotating
... x2 0
... x2 25000
... x2 50000
... x2 75000
... x2 100000
... x2 125000
... x2 150000
... x2 175000
... x2 200000
... x2 225000
... x2 250000
... x2 275000
... x2 300000
... x2 325000
... x2 350000
... x2 375000
... x2 400000
----
419276 sequences total
30421 annotated / ortho
182213 annotated / homol
0 annotated / tr
212634 total annotated

annotated sequences in FASTA format: global_merged.fa.clean.nr.renamed.fasta.annot
annotation spreadsheet in: global_merged.fa.clean.nr.renamed.fasta.annot.csv
annotation spreadsheet with sequences (warning: LARGE): global_merged.fa.clean.nr.renamed.fasta.annot.large.csv

Recalculating Venn diagram with ortho



In [58]:

    
!cd .. && make outputs/rna/ortho.fa









    



python /mnt/research/ged/irberlui/biodata/galGal/scripts/extract.py outputs/rna/global_merged.fa.clean.nr.renamed.fasta.annot outputs/rna/ortho.fa ortho



In [9]:

    
!cd .. && make outputs/reference/galGal4.fa.nsq









    



make: `outputs/reference/galGal4.fa.nsq' is up to date.



In [8]:

    
!cd .. && make outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta.00.nsq









    



make: `outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta.00.nsq' is up to date.



In [10]:

    
!cd .. && make workdir/results/Chick_RNA_BLAST.txt.pbs









    



blastn -query outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta -out workdir/results/Chick_RNA_BLAST.txt -db outputs/reference/galGal4.fa -outfmt 6 -evalue 1e-3 -num_threads 8



In [11]:

    
!cd .. && make workdir/results/Moleculo_RNA_BLAST.txt.pbs









    



blastn -query outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta -out workdir/results/Moleculo_RNA_BLAST.txt -db outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta -outfmt 6 -evalue 1e-3 -num_threads 16



In [54]:

    
!cd .. && make outputs/rna/match_ortho.rna.ref.txt









    



cd outputs/rna && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa Chick_RNA_BLAST.txt ortho.rna.ref
Number of matches: 26766
Number of no matches: 3655



In [55]:

    
!cd .. && make outputs/rna/match_ortho.rna.moleculo.txt









    



cd outputs/rna && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa Moleculo_RNA_BLAST.txt ortho.rna.moleculo
Number of matches: 23123
Number of no matches: 7298



In [59]:

    
with open("../outputs/rna/match_ortho.rna.moleculo.txt") as rna_mol_match:
    rna_mol_set = set([line.split(' ')[0] for line in rna_mol_match if line.strip()])

with open("../outputs/rna/match_ortho.rna.ref.txt") as rna_ref_match:
    rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])



In [60]:

    
B_set = rna_mol_set & rna_ref_set
len(B_set)









    Out[60]:





23120



In [61]:

    
B = len(rna_mol_set & rna_ref_set)
A = len(rna_ref_set) - B
C = len(rna_mol_set) - B
E = 30421
D = E - A - C - B

print A, B, C, D, E









    



3646 23120 3 3652 30421



In [65]:

    
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))

v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'Moleculo', 'RNA'))

v.get_label_by_id('100').set_text('')
v.get_label_by_id('010').set_text('')
v.get_label_by_id('001').set_text('D\n%d' % D)

v.get_label_by_id('011').set_text('C\n%d' % C)
v.get_label_by_id('101').set_text('A\n%d' % A)
v.get_label_by_id('110').set_text('')

v.get_label_by_id('111').set_text('B\n%d' % B)
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')









    Out[65]:





<matplotlib.text.Text at 0x2b5617b25a90>