First, see if Likit has putative ortholog assignments between the chick RNAseq and human proteins. If not, the eel-pond protocol has some instructions on how to calculate this. Might take a few days. Then, we can recalculate the Venn diagram with a better curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that are in A, C, and D.
Likit: I only have homologs of chick RNASeq and human proteins of differential-expressed genes, which total ~1000 genes as of now.
Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do/do not match in the genome? Nic has done exactly this analysis for fungus.
In [98]:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles
In [1]:
!cd .. && make outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq
In [21]:
!cd .. && make outputs/rna/pacbio/galGal4/only_rna.fa.renamed.fasta.nsq
In [23]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta.nsq
In [24]:
!cd .. && make outputs/uniprot/uniprot_sprot.namedb
In [25]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta_screed
In [30]:
!cd .. && make -j2 outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs
In [86]:
!cd .. && make outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot
In [87]:
!cd .. && make outputs/rna/pacbio/galGal4/ortho.fa
In [38]:
!cd .. && make outputs/reference/galGal4.fa.nsq
In [41]:
!cd .. && make outputs/pacbio_unmapped/chicken_unmapped.fasta.nsq
In [63]:
!cd .. && make -j2 outputs/rna/pacbio/galGal4/Chick_RNA_BLAST.txt.pbs outputs/rna/pacbio/galGal4/seq_RNA_BLAST.txt.pbs
In [88]:
!cd .. && make outputs/rna/pacbio/galGal4/match_ortho.rna.ref.txt
In [89]:
!cd .. && make outputs/rna/pacbio/galGal4/match_ortho.rna.seq.txt
In [94]:
with open("../outputs/rna/pacbio/galGal4/match_ortho.rna.seq.txt") as rna_seq_match:
rna_seq_set = set([line.split(' ')[0] for line in rna_seq_match if line.strip()])
with open("../outputs/rna/pacbio/galGal4/match_ortho.rna.ref.txt") as rna_ref_match:
rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])
In [95]:
B_set = rna_seq_set & rna_ref_set
len(B_set)
Out[95]:
In [96]:
B = len(rna_seq_set & rna_ref_set)
A = len(rna_ref_set) - B
C = len(rna_seq_set) - B
E = 30421
D = E - A - C - B
print A, B, C, D, E
In [99]:
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))
v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'PacBio', 'RNA'))
v.get_label_by_id('100').set_text('')
v.get_label_by_id('010').set_text('')
v.get_label_by_id('001').set_text('D\n%d' % D)
v.get_label_by_id('011').set_text('C\n%d' % C)
v.get_label_by_id('101').set_text('A\n%d' % A)
v.get_label_by_id('110').set_text('')
v.get_label_by_id('111').set_text('B\n%d' % B)
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')
Out[99]:
In [104]:
from screed.fasta import fasta_iter
missing_genes = rna_seq_set - B_set
with open('../outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot') as rna_seqs:
with open('../outputs/rna/pacbio/galGal4/missing.fa', 'w') as f:
for seq in fasta_iter(rna_seqs):
if seq['name'] in missing_genes:
f.write(">%s %s\n%s\n" % (seq['name'], seq['description'], seq['sequence']))