First, see if Likit has putative ortholog assignments between the chick RNAseq and human proteins. If not, the eel-pond protocol has some instructions on how to calculate this. Might take a few days. Then, we can recalculate the Venn diagram with a better curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that are in A, C, and D.
Likit: I only have homologs of chick RNASeq and human proteins of differential-expressed genes, which total ~1000 genes as of now.
Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do/do not match in the genome? Nic has done exactly this analysis for fungus.
In [64]:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles
In [30]:
!cd .. && make outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq
In [34]:
!cd .. && make outputs/rna/only_rna.fa.renamed.fasta.nsq
In [36]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta.nsq
In [40]:
!cd .. && make outputs/uniprot/uniprot_sprot.namedb
In [42]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta_screed
In [46]:
!cd .. && make workdir/results/uniprot.x.chick.pbs
In [47]:
!cd .. && make workdir/results/chick.x.uniprot.pbs
!cd .. && make workdir/results/uniprot.x.chick.pbs
In [8]:
#!cp ../workdir/results/uniprot.x.chick ../outputs/rna/uniprot.x.chick
#!cp ../workdir/results/chick.x.uniprot ../outputs/rna/chick.x.uniprot
In [56]:
!cd .. && make outputs/rna/global_merged.fa.clean.nr.renamed.fasta.annot
In [58]:
!cd .. && make outputs/rna/ortho.fa
In [9]:
!cd .. && make outputs/reference/galGal4.fa.nsq
In [8]:
!cd .. && make outputs/moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fasta.00.nsq
In [10]:
!cd .. && make workdir/results/Chick_RNA_BLAST.txt.pbs
In [11]:
!cd .. && make workdir/results/Moleculo_RNA_BLAST.txt.pbs
In [54]:
!cd .. && make outputs/rna/match_ortho.rna.ref.txt
In [55]:
!cd .. && make outputs/rna/match_ortho.rna.moleculo.txt
In [59]:
with open("../outputs/rna/match_ortho.rna.moleculo.txt") as rna_mol_match:
rna_mol_set = set([line.split(' ')[0] for line in rna_mol_match if line.strip()])
with open("../outputs/rna/match_ortho.rna.ref.txt") as rna_ref_match:
rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])
In [60]:
B_set = rna_mol_set & rna_ref_set
len(B_set)
Out[60]:
In [61]:
B = len(rna_mol_set & rna_ref_set)
A = len(rna_ref_set) - B
C = len(rna_mol_set) - B
E = 30421
D = E - A - C - B
print A, B, C, D, E
In [65]:
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))
v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'Moleculo', 'RNA'))
v.get_label_by_id('100').set_text('')
v.get_label_by_id('010').set_text('')
v.get_label_by_id('001').set_text('D\n%d' % D)
v.get_label_by_id('011').set_text('C\n%d' % C)
v.get_label_by_id('101').set_text('A\n%d' % A)
v.get_label_by_id('110').set_text('')
v.get_label_by_id('111').set_text('B\n%d' % B)
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')
Out[65]: