Testing ortho

First, see if Likit has putative ortholog assignments between the chick RNAseq and human proteins. If not, the eel-pond protocol has some instructions on how to calculate this. Might take a few days. Then, we can recalculate the Venn diagram with a better curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that are in A, C, and D.
- Likit: I only have homologs of chick RNASeq and human proteins of differential-expressed genes, which total ~1000 genes as of now.
- Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do/do not match in the genome? Nic has done exactly this analysis for fungus.



In [98]:

    
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles

Preparing all RNA data



In [1]:

    
!cd .. && make outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq









    



make: `outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq' is up to date.

Preparing 'only in RNA' data



In [21]:

    
!cd .. && make outputs/rna/pacbio/galGal4/only_rna.fa.renamed.fasta.nsq









    



make: `outputs/rna/pacbio/galGal4/only_rna.fa.renamed.fasta.nsq' is up to date.

Preparing Uniprot



In [23]:

    
!cd .. && make outputs/uniprot/uniprot_sprot.fasta.nsq









    



cd outputs/uniprot && \
	formatdb -i uniprot_sprot.fasta -o T -p T



In [24]:

    
!cd .. && make outputs/uniprot/uniprot_sprot.namedb









    



make: `outputs/uniprot/uniprot_sprot.namedb' is up to date.



In [25]:

    
!cd .. && make outputs/uniprot/uniprot_sprot.fasta_screed









    



make: `outputs/uniprot/uniprot_sprot.fasta_screed' is up to date.

Uniprot ortho



In [30]:

    
!cd .. && make -j2 outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs









    



mkdir -p outputs/rna/pacbio/galGal4
JOBID=`echo make outputs/rna/pacbio/galGal4/chick.x.uniprot | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N blast.chick.x.uniprot.pbs -o outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs -e outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs] Interrupt
mkdir -p outputs/rna/pacbio/galGal4
JOBID=`echo make outputs/rna/pacbio/galGal4/uniprot.x.chick | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N blast.uniprot.x.chick.pbs -o outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs -e outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs] Interrupt



In [86]:

    
!cd .. && make outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot









    



cd outputs/rna/pacbio/galGal4 && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/annotate-seqs.py /mnt/research/ged/irberlui/biodata/galGal/outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta chick.x.uniprot.ortho chick.x.uniprot.homol
Scanning sequences -- first pass to gather info
... 0
... 25000
... 50000
... 75000
... 100000
... 125000
... 150000
... 175000
... 200000
... 225000
... 250000
... 275000
... 300000
... 325000
... 350000
... 375000
... 400000
second pass: annotating
... x2 0
... x2 25000
... x2 50000
... x2 75000
... x2 100000
... x2 125000
... x2 150000
... x2 175000
... x2 200000
... x2 225000
... x2 250000
... x2 275000
... x2 300000
... x2 325000
... x2 350000
... x2 375000
... x2 400000
----
419276 sequences total
30421 annotated / ortho
182213 annotated / homol
0 annotated / tr
212634 total annotated

annotated sequences in FASTA format: global_merged.fa.clean.nr.renamed.fasta.annot
annotation spreadsheet in: global_merged.fa.clean.nr.renamed.fasta.annot.csv
annotation spreadsheet with sequences (warning: LARGE): global_merged.fa.clean.nr.renamed.fasta.annot.large.csv

Recalculating Venn diagram with ortho



In [87]:

    
!cd .. && make outputs/rna/pacbio/galGal4/ortho.fa









    



python /mnt/research/ged/irberlui/biodata/galGal/scripts/extract.py outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot outputs/rna/pacbio/galGal4/ortho.fa ortho



In [38]:

    
!cd .. && make outputs/reference/galGal4.fa.nsq









    



make: `outputs/reference/galGal4.fa.nsq' is up to date.



In [41]:

    
!cd .. && make outputs/pacbio_unmapped/chicken_unmapped.fasta.nsq









    



makeblastdb -in outputs/pacbio_unmapped/chicken_unmapped.fasta -out outputs/pacbio_unmapped/chicken_unmapped.fasta -dbtype nucl


Building a new DB, current time: 03/29/2015 18:30:33
New DB name:   outputs/pacbio_unmapped/chicken_unmapped.fasta
New DB title:  outputs/pacbio_unmapped/chicken_unmapped.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 42508 sequences in 2.63234 seconds.



In [63]:

    
!cd .. && make -j2 outputs/rna/pacbio/galGal4/Chick_RNA_BLAST.txt.pbs outputs/rna/pacbio/galGal4/seq_RNA_BLAST.txt.pbs









    



make: `outputs/rna/pacbio/galGal4/Chick_RNA_BLAST.txt.pbs' is up to date.
make: `outputs/rna/pacbio/galGal4/seq_RNA_BLAST.txt.pbs' is up to date.



In [88]:

    
!cd .. && make outputs/rna/pacbio/galGal4/match_ortho.rna.ref.txt









    



cd outputs/rna/pacbio/galGal4 && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa Chick_RNA_BLAST.txt ortho.rna.ref
Number of matches: 26766
Number of no matches: 3655



In [89]:

    
!cd .. && make outputs/rna/pacbio/galGal4/match_ortho.rna.seq.txt









    



cd outputs/rna/pacbio/galGal4 && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa seq_RNA_BLAST.txt ortho.rna.seq
Number of matches: 825
Number of no matches: 29596



In [94]:

    
with open("../outputs/rna/pacbio/galGal4/match_ortho.rna.seq.txt") as rna_seq_match:
    rna_seq_set = set([line.split(' ')[0] for line in rna_seq_match if line.strip()])

with open("../outputs/rna/pacbio/galGal4/match_ortho.rna.ref.txt") as rna_ref_match:
    rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])



In [95]:

    
B_set = rna_seq_set & rna_ref_set
len(B_set)









    Out[95]:





337



In [96]:

    
B = len(rna_seq_set & rna_ref_set)
A = len(rna_ref_set) - B
C = len(rna_seq_set) - B
E = 30421
D = E - A - C - B

print A, B, C, D, E









    



26429 337 488 3167 30421

Diagram



In [99]:

    
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))

v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'PacBio', 'RNA'))

v.get_label_by_id('100').set_text('')
v.get_label_by_id('010').set_text('')
v.get_label_by_id('001').set_text('D\n%d' % D)

v.get_label_by_id('011').set_text('C\n%d' % C)
v.get_label_by_id('101').set_text('A\n%d' % A)
v.get_label_by_id('110').set_text('')

v.get_label_by_id('111').set_text('B\n%d' % B)
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')









    Out[99]:





<matplotlib.text.Text at 0x2acbab7bf990>



In [104]:

    
from screed.fasta import fasta_iter

missing_genes = rna_seq_set - B_set
with open('../outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot') as rna_seqs:
    with open('../outputs/rna/pacbio/galGal4/missing.fa', 'w') as f:
        for seq in fasta_iter(rna_seqs):
            if seq['name'] in missing_genes:
                f.write(">%s %s\n%s\n" % (seq['name'], seq['description'], seq['sequence']))