Testing ortho

  • First, see if Likit has putative ortholog assignments between the chick RNAseq and human proteins. If not, the eel-pond protocol has some instructions on how to calculate this. Might take a few days. Then, we can recalculate the Venn diagram with a better curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that are in A, C, and D.

    • Likit: I only have homologs of chick RNASeq and human proteins of differential-expressed genes, which total ~1000 genes as of now.

    • Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do/do not match in the genome? Nic has done exactly this analysis for fungus.


In [98]:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles

Preparing all RNA data


In [1]:
!cd .. && make outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq


make: `outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta.nsq' is up to date.

Preparing 'only in RNA' data


In [21]:
!cd .. && make outputs/rna/pacbio/galGal4/only_rna.fa.renamed.fasta.nsq


make: `outputs/rna/pacbio/galGal4/only_rna.fa.renamed.fasta.nsq' is up to date.

Preparing Uniprot


In [23]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta.nsq


cd outputs/uniprot && \
	formatdb -i uniprot_sprot.fasta -o T -p T

In [24]:
!cd .. && make outputs/uniprot/uniprot_sprot.namedb


make: `outputs/uniprot/uniprot_sprot.namedb' is up to date.

In [25]:
!cd .. && make outputs/uniprot/uniprot_sprot.fasta_screed


make: `outputs/uniprot/uniprot_sprot.fasta_screed' is up to date.

Uniprot ortho


In [30]:
!cd .. && make -j2 outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs


mkdir -p outputs/rna/pacbio/galGal4
JOBID=`echo make outputs/rna/pacbio/galGal4/chick.x.uniprot | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N blast.chick.x.uniprot.pbs -o outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs -e outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [outputs/rna/pacbio/galGal4/chick.x.uniprot.pbs] Interrupt
mkdir -p outputs/rna/pacbio/galGal4
JOBID=`echo make outputs/rna/pacbio/galGal4/uniprot.x.chick | cat pbs/header.sub - pbs/footer.sub | \
	  qsub -l walltime=20:00:00,nodes=1:ppn=8,mem=400mb -A ged -N blast.uniprot.x.chick.pbs -o outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs -e outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs.err | cut -d"." -f1` ; \
	while [ -n "$(qstat -a |grep ${JOBID})" ]; do sleep 600; done
^Cmake: *** [outputs/rna/pacbio/galGal4/uniprot.x.chick.pbs] Interrupt

In [86]:
!cd .. && make outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot


cd outputs/rna/pacbio/galGal4 && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/annotate-seqs.py /mnt/research/ged/irberlui/biodata/galGal/outputs/chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta chick.x.uniprot.ortho chick.x.uniprot.homol
Scanning sequences -- first pass to gather info
... 0
... 25000
... 50000
... 75000
... 100000
... 125000
... 150000
... 175000
... 200000
... 225000
... 250000
... 275000
... 300000
... 325000
... 350000
... 375000
... 400000
second pass: annotating
... x2 0
... x2 25000
... x2 50000
... x2 75000
... x2 100000
... x2 125000
... x2 150000
... x2 175000
... x2 200000
... x2 225000
... x2 250000
... x2 275000
... x2 300000
... x2 325000
... x2 350000
... x2 375000
... x2 400000
----
419276 sequences total
30421 annotated / ortho
182213 annotated / homol
0 annotated / tr
212634 total annotated

annotated sequences in FASTA format: global_merged.fa.clean.nr.renamed.fasta.annot
annotation spreadsheet in: global_merged.fa.clean.nr.renamed.fasta.annot.csv
annotation spreadsheet with sequences (warning: LARGE): global_merged.fa.clean.nr.renamed.fasta.annot.large.csv

Recalculating Venn diagram with ortho


In [87]:
!cd .. && make outputs/rna/pacbio/galGal4/ortho.fa


python /mnt/research/ged/irberlui/biodata/galGal/scripts/extract.py outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot outputs/rna/pacbio/galGal4/ortho.fa ortho

In [38]:
!cd .. && make outputs/reference/galGal4.fa.nsq


make: `outputs/reference/galGal4.fa.nsq' is up to date.

In [41]:
!cd .. && make outputs/pacbio_unmapped/chicken_unmapped.fasta.nsq


makeblastdb -in outputs/pacbio_unmapped/chicken_unmapped.fasta -out outputs/pacbio_unmapped/chicken_unmapped.fasta -dbtype nucl


Building a new DB, current time: 03/29/2015 18:30:33
New DB name:   outputs/pacbio_unmapped/chicken_unmapped.fasta
New DB title:  outputs/pacbio_unmapped/chicken_unmapped.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 42508 sequences in 2.63234 seconds.

In [63]:
!cd .. && make -j2 outputs/rna/pacbio/galGal4/Chick_RNA_BLAST.txt.pbs outputs/rna/pacbio/galGal4/seq_RNA_BLAST.txt.pbs


make: `outputs/rna/pacbio/galGal4/Chick_RNA_BLAST.txt.pbs' is up to date.
make: `outputs/rna/pacbio/galGal4/seq_RNA_BLAST.txt.pbs' is up to date.

In [88]:
!cd .. && make outputs/rna/pacbio/galGal4/match_ortho.rna.ref.txt


cd outputs/rna/pacbio/galGal4 && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa Chick_RNA_BLAST.txt ortho.rna.ref
Number of matches: 26766
Number of no matches: 3655

In [89]:
!cd .. && make outputs/rna/pacbio/galGal4/match_ortho.rna.seq.txt


cd outputs/rna/pacbio/galGal4 && \
	python /mnt/research/ged/irberlui/biodata/galGal/scripts/find_match_2.py ortho.fa seq_RNA_BLAST.txt ortho.rna.seq
Number of matches: 825
Number of no matches: 29596

In [94]:
with open("../outputs/rna/pacbio/galGal4/match_ortho.rna.seq.txt") as rna_seq_match:
    rna_seq_set = set([line.split(' ')[0] for line in rna_seq_match if line.strip()])

with open("../outputs/rna/pacbio/galGal4/match_ortho.rna.ref.txt") as rna_ref_match:
    rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])

In [95]:
B_set = rna_seq_set & rna_ref_set
len(B_set)


Out[95]:
337

In [96]:
B = len(rna_seq_set & rna_ref_set)
A = len(rna_ref_set) - B
C = len(rna_seq_set) - B
E = 30421
D = E - A - C - B

print A, B, C, D, E


26429 337 488 3167 30421

Diagram


In [99]:
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))

v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'PacBio', 'RNA'))

v.get_label_by_id('100').set_text('')
v.get_label_by_id('010').set_text('')
v.get_label_by_id('001').set_text('D\n%d' % D)

v.get_label_by_id('011').set_text('C\n%d' % C)
v.get_label_by_id('101').set_text('A\n%d' % A)
v.get_label_by_id('110').set_text('')

v.get_label_by_id('111').set_text('B\n%d' % B)
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')


Out[99]:
<matplotlib.text.Text at 0x2acbab7bf990>

In [104]:
from screed.fasta import fasta_iter

missing_genes = rna_seq_set - B_set
with open('../outputs/rna/pacbio/galGal4/global_merged.fa.clean.nr.renamed.fasta.annot') as rna_seqs:
    with open('../outputs/rna/pacbio/galGal4/missing.fa', 'w') as f:
        for seq in fasta_iter(rna_seqs):
            if seq['name'] in missing_genes:
                f.write(">%s %s\n%s\n" % (seq['name'], seq['description'], seq['sequence']))