First, see if Likit has putative ortholog assignments between the chick RNAseq and human proteins. If not, the eel-pond protocol has some instructions on how to calculate this. Might take a few days. Then, we can recalculate the Venn diagram with a better curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that are in A, C, and D.
Likit: I only have homologs of chick RNASeq and human proteins of differential-expressed genes, which total ~1000 genes as of now.
Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do/do not match in the genome? Nic has done exactly this analysis for fungus.
In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles
In [2]:
!ls ../inputs/chicken_transcripts/
In [3]:
!mkdir -p ../outputs/chicken_transcripts
!cp ../{inputs,outputs}/chicken_transcripts/global_merged.fa.clean.nr
In [4]:
!cd ../outputs/chicken_transcripts && do-partition.py -x 1e10 -N 4 --threads 4 chick_rna global_merged.fa.clean.nr
In [5]:
!cd ../outputs/chicken_transcripts && python ../../scripts/rename-with-partitions.py chick_rna global_merged.fa.clean.nr
In [6]:
!cd ../outputs/chicken_transcripts && gunzip -f global_merged.fa.clean.nr.renamed.fasta.gz
In [2]:
!cd ../outputs/chicken_transcripts && formatdb -i global_merged.fa.clean.nr.renamed.fasta -o T -p F
In [1]:
!mkdir -p ../outputs/rna
In [13]:
!cd ../outputs/rna/ && do-partition.py -x 1e10 -N 4 --threads 4 chick_rna only_rna.fa
In [25]:
!cd ../outputs/rna && python ../../scripts/rename-with-partitions.py chick_rna only_rna.fa
In [27]:
!cd ../outputs/rna && gunzip -f only_rna.fa.renamed.fasta.gz
In [1]:
!cd ../outputs/rna && formatdb -i only_rna.fa.renamed.fasta -o T -p F
In [5]:
!mkdir -p ../{inputs,outputs}/uniprot
!cd ../inputs/uniprot && wget -c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
!cp ../inputs/uniprot/uniprot_sprot.fasta.gz ../outputs/uniprot/
!gunzip -f ../outputs/uniprot/uniprot_sprot.fasta.gz
In [1]:
!cd ../outputs/uniprot && formatdb -i uniprot_sprot.fasta -o T -p T
In [2]:
!cd ../outputs/uniprot && python ../../scripts/make-namedb.py uniprot_sprot.fasta uniprot.namedb
In [3]:
!cd ../outputs/uniprot && python -m screed.fadbm uniprot_sprot.fasta
In [4]:
#!cd ../scripts && ./submit_blast_uniprot_chick.pbs
#!cd ../scripts && ./submit_blast_chick_uniprot.pbs
In [8]:
!cp ../workdir/results/uniprot.x.chick ../outputs/rna/uniprot.x.chick
!cp ../workdir/results/chick.x.uniprot ../outputs/rna/chick.x.uniprot
In [9]:
!cd ../outputs/rna && python ../../scripts/make-uni-best-hits.py chick.x.uniprot chick.x.uniprot.homol
In [10]:
!cd ../outputs/rna && python ../../scripts/make-reciprocal-best-hits.py chick.x.uniprot uniprot.x.chick chick.x.uniprot.ortho
In [11]:
!cd ../outputs/rna && python ../../scripts/annotate-seqs.py ../chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta chick.x.uniprot.ortho chick.x.uniprot.homol
In [24]:
!cd ../outputs/rna && python ../../scripts/extract.py global_merged.fa.clean.nr.renamed.fasta.annot ortho.fa ortho
In [38]:
%%bash
module load BLAST+
#cd ../outputs/galGal4 && makeblastdb -in galGal4.fa -out galGal4_db -dbtype nucl
cd ../outputs/moleculo && makeblastdb -in LR6000017-DNA_A01-LRAAA-AllReads.fasta -out moleculo_db -dbtype nucl
In [39]:
#!cd ../scripts && ./submit_blast_ortho_ref.pbs
#!cp ../workdir/results/Chick_RNA_BLAST.txt ../outputs/rna/Chick_RNA_BLAST.txt
#!mv ../workdir/ ../workdirs/ortho_blast_ref
In [42]:
!cd ../outputs/rna/ && python ../../scripts/find_match_2.py ortho.fa Chick_RNA_BLAST.txt ortho.rna.ref
In [48]:
#!cd ../scripts && ./submit_blast_ortho_moleculo.pbs
#!cp ../workdir/results/Moleculo_RNA_BLAST.txt ../outputs/rna/Moleculo_RNA_BLAST.txt
#!mv ../workdir/ ../workdirs/ortho_blast_moleculo
In [49]:
!cd ../outputs/rna/ && python ../../scripts/find_match_2.py ortho.fa Moleculo_RNA_BLAST.txt ortho.rna.moleculo
In [1]:
with open("../outputs/rna/match_ortho.rna.moleculo.txt") as rna_mol_match:
rna_mol_set = set([line.split(' ')[0] for line in rna_mol_match if line.strip()])
with open("../outputs/rna/match_ortho.rna.ref.txt") as rna_ref_match:
rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])
In [2]:
B = len(rna_mol_set & rna_ref_set)
A = 26830 - B
C = 23183 - B
E = 30515
D = E - A - C - B
print A, B, C, D, E
In [4]:
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))
v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'Moleculo', 'RNA'))
v.get_label_by_id('100').set_text('')
v.get_label_by_id('010').set_text('')
v.get_label_by_id('001').set_text('D\n%d' % D)
v.get_label_by_id('011').set_text('C\n%d' % C)
v.get_label_by_id('101').set_text('A\n%d' % A)
v.get_label_by_id('110').set_text('')
v.get_label_by_id('111').set_text('B\n%d' % B)
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')
Out[4]:
In [1]:
!mkdir -p ../{inputs,outputs}/human
!cd ../inputs/human && wget -c ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa.gz
!cp ../inputs/human/human.protein.faa.gz ../outputs/human/
!gunzip -f ../outputs/human/human.protein.faa.gz
In [80]:
!cd ../outputs/human && formatdb -i human.protein.faa -o T -p T
In [7]:
!cd ../outputs/human && python ../../scripts/make-namedb.py human.protein.faa human.namedb
In [8]:
!cd ../outputs/human && python -m screed.fadbm human.protein.faa
In [ ]:
#!cd ../scripts && ./submit_blast_human_chick.pbs
#!cd ../scripts && ./submit_blast_chick_human.pbs
In [9]:
#!cp ../workdir/results/human.x.chick ../outputs/rna/human.x.chick
#!cp ../workdir/results/chick.x.human ../outputs/rna/chick.x.human
In [10]:
!cd ../outputs/rna && python ../../scripts/make-uni-best-hits.py chick.x.human chick.x.human.homol
In [11]:
!cd ../outputs/rna && python ../../scripts/make-reciprocal-best-hits.py chick.x.human human.x.chick chick.x.human.ortho
In [12]:
!cd ../outputs/rna && python ../../scripts/annotate-seqs.py only_rna.fa.renamed.fasta chick.x.human.ortho chick.x.human.homol
In [ ]: