Testing ortho

  • First, see if Likit has putative ortholog assignments between the chick RNAseq transcripts and human proteins. If not, the eel-pond protocol has instructions on how to calculate this (a reciprocal-best-hit sketch follows below); it might take a few days. Then we can recalculate the Venn diagram with a better-curated set of mRNAseq transcripts, and also give people (Jerry in particular) the gene names that fall in regions A, C, and D.

    • Likit: I only have homologs between the chick RNAseq transcripts and human proteins for the differentially expressed genes, which total ~1,000 genes as of now.

    • Main goal right now: how many "real" mRNAseq genes, i.e. genes with orthology to human, do or do not match the genome? Nic has done exactly this analysis for fungus.

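Below is a minimal sketch of the reciprocal-best-hit (RBH) approach used for the orthology assignments in this notebook (the make-reciprocal-best-hits.py step further down). The BLAST tabular (-outfmt 6) column layout and the best-hit-by-bitscore criterion are assumptions for illustration; the actual script may differ in details.

# Sketch: reciprocal best hits from two BLAST tabular (-outfmt 6) files.
# Assumes column 12 is the bitscore and "best" means highest bitscore.
def best_hits(blast_tab):
    best = {}  # query -> (subject, bitscore)
    for line in open(blast_tab):
        fields = line.rstrip().split('\t')
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

def reciprocal_best_hits(a_vs_b, b_vs_a):
    ab = best_hits(a_vs_b)
    ba = best_hits(b_vs_a)
    # keep only pairs where each sequence is the other's best hit
    return [(q, s) for q, (s, _) in ab.items() if ba.get(s, (None,))[0] == q]

# e.g. pairs = reciprocal_best_hits('chick.x.uniprot', 'uniprot.x.chick')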

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib_venn import venn3, venn3_circles

Preparing all RNA data


In [2]:
!ls ../inputs/chicken_transcripts/


global_merged.fa  global_merged.fa.clean.nr

In [3]:
!mkdir -p ../outputs/chicken_transcripts
!cp ../{inputs,outputs}/chicken_transcripts/global_merged.fa.clean.nr

In [4]:
!cd ../outputs/chicken_transcripts && do-partition.py -x 1e10 -N 4 --threads 4 chick_rna global_merged.fa.clean.nr


|| This is the script 'do-partition.py' in khmer.
|| You are running khmer version 1.0-dirty
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * J Pell et al., PNAS, 2014 (PMID 22847406)
||
|| Please see the CITATION file for details.


PARAMETERS:
 - kmer size =    32 		(-k)
 - n tables =     4 		(-N)
 - min tablesize = 1e+10 	(-x)

Estimated memory usage is 5e+09 bytes (n_tables x min_tablesize / 8)
--------
Saving k-mer presence table to chick_rna
Loading kmers from sequences in ['global_merged.fa.clean.nr']
--
SUBSET SIZE 100000
N THREADS 4
--
making k-mer presence table
consuming input global_merged.fa.clean.nr
fp rate estimated to be 0.000
** Traverse all the things: stop_big_traversals is false.
enqueued 24 subset tasks
starting 24 threads
---
starting: chick_rna 0
starting: chick_rna 1
starting: chick_rna 2
starting: chick_rna 3
starting: chick_rna 4
starting: chick_rna 5
starting: chick_rna 6
starting: chick_rna 7
starting: chick_rna 8
starting: chick_rna 9
starting: chick_rna 10
starting: chick_rna 11
starting: chick_rna 12
starting: chick_rna 13
starting: chick_rna 14
starting: chick_rna 15
starting: chick_rna 16
starting: chick_rna 17
starting: chick_rna 18
starting: chick_rna 19
starting: chick_rna 20
starting: chick_rna 21
starting: chick_rna 22
done starting threads
starting: chick_rna 23
saving: chick_rna 23
exiting
...subset-part 11387458923063776822-12365781638151502618: 100000 <- 82162
saving: chick_rna 21
exiting
...subset-part 12365781638151502618-14952903455444444558: 100000 <- 82308
saving: chick_rna 22
exiting
...subset-part 9623407754336171674-10346220413045743106: 100000 <- 85598
saving: chick_rna 19
...subset-part 7469112752205571742-8293951506219918730: 100000 <- 85826
saving: chick_rna 16
exiting
exiting
...subset-part 1658844840242791984-2144438498467839238: 100000 <- 85688
saving: chick_rna 5
exiting
...subset-part 10346220413045743106-11387458923063776822: 100000 <- 86825
saving: chick_rna 20
...subset-part 4100357578492272937-4568666954935214814: 100000 <- 85498
exiting
 saving: chick_rna 10
exiting
...subset-part 3651522090524221891-4100357578492272937: 100000 <- 87085
saving: chick_rna 9
exiting
...subset-part 8838167936561326947-9623407754336171674: 100000 <- 85888
saving: chick_rna 18
exiting
...subset-part 2540677669633971824-2998780665272668775: 100000 <- 87064
saving: chick_rna 7
exiting
...subset-part 8293951506219918730-8838167936561326947: 100000 <- 86258
saving: chick_rna 17
exiting
...subset-part 6924533334413435836-7469112752205571742: 100000 <- 86253
saving: chick_rna 15
exiting
...subset-part 6372967790940774418-6924533334413435836: 100000 <- 86498
saving: chick_rna 14
...subset-part 1217101929279536231-1658844840242791984: 100000 <- 85102
saving: chick_rna 4
exiting
 exiting
...subset-part 2998780665272668775-3651522090524221891: 100000 <- 86140
saving: chick_rna 8
exiting
...subset-part 2144438498467839238-2540677669633971824: 100000 <- 86171
saving: chick_rna 6
exiting
...subset-part 5188519471860631576-5980847930645956818: 100000 <- 86462
saving: chick_rna 12
exiting
...subset-part 870696019245508114-1217101929279536231: 100000 <- 86146
saving: chick_rna 3
exiting
...subset-part 4568666954935214814-5188519471860631576: 100000 <- 85670
saving: chick_rna 11
exiting
...subset-part 471252599398885518-870696019245508114: 100000 <- 86147
saving: chick_rna 2
exiting
...subset-part 161003717381596016-471252599398885518: 100000 <- 85471
saving: chick_rna 1
exiting
...subset-part 5980847930645956818-6372967790940774418: 100000 <- 84400
saving: chick_rna 13
exiting
...subset-part 0-161003717381596016: 100000 <- 78749
saving: chick_rna 0
exiting
---
done making subsets! see chick_rna.subset.*.pmap
loading 24 pmap files (first one: chick_rna.subset.9.pmap)
merging chick_rna.subset.9.pmap
merging chick_rna.subset.21.pmap
merging chick_rna.subset.10.pmap
merging chick_rna.subset.0.pmap
merging chick_rna.subset.7.pmap
merging chick_rna.subset.2.pmap
merging chick_rna.subset.18.pmap
merging chick_rna.subset.1.pmap
merging chick_rna.subset.6.pmap
merging chick_rna.subset.5.pmap
merging chick_rna.subset.14.pmap
merging chick_rna.subset.12.pmap
merging chick_rna.subset.4.pmap
merging chick_rna.subset.3.pmap
merging chick_rna.subset.16.pmap
merging chick_rna.subset.22.pmap
merging chick_rna.subset.11.pmap
merging chick_rna.subset.15.pmap
merging chick_rna.subset.23.pmap
merging chick_rna.subset.8.pmap
merging chick_rna.subset.17.pmap
merging chick_rna.subset.13.pmap
merging chick_rna.subset.20.pmap
merging chick_rna.subset.19.pmap
removing pmap files
outputting partitions for global_merged.fa.clean.nr
... output_partitions 100000 0
... output_partitions 200000 0
... output_partitions 300000 0
output 79967 partitions for global_merged.fa.clean.nr
partitions are in global_merged.fa.clean.nr.part
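
A quick way to sanity-check the partitioning is to tally partition sizes directly from the .part file. This is a rough sketch, assuming do-partition.py appends the partition ID to each FASTA header after a tab; adjust if your khmer version writes headers differently.

# Sketch: count partition sizes from the partitioned FASTA (.part) output.
# Assumes headers look like ">name<TAB>partition_id".
from collections import Counter

sizes = Counter()
for line in open('../outputs/chicken_transcripts/global_merged.fa.clean.nr.part'):
    if line.startswith('>'):
        partition_id = line.rstrip().rsplit('\t', 1)[-1]
        sizes[partition_id] += 1

print('partitions: %d' % len(sizes))
print('largest partition: %d sequences' % max(sizes.values()))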

In [5]:
!cd ../outputs/chicken_transcripts && python ../../scripts/rename-with-partitions.py chick_rna global_merged.fa.clean.nr


... 0
... 10000
... 20000
... 30000
... 40000
... 50000
... 60000
... 70000
... 80000
... 90000
... 100000
... 110000
... 120000
... 130000
... 140000
... 150000
... 160000
... 170000
... 180000
... 190000
... 200000
... 210000
... 220000
... 230000
... 240000
... 250000
... 260000
... 270000
... 280000
... 290000
... 300000
... 310000
... 320000
... 330000
... 340000
... 350000
... 360000
... 370000
... 380000
... 390000
... 400000
... 410000
---------------
partition, size
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
---------------
creating global_merged.fa.clean.nr.renamed.fasta.gz
...writing 0
...writing 10000
...writing 20000
...writing 30000
...writing 40000
...writing 50000
...writing 60000
...writing 70000
...writing 80000
...writing 90000
...writing 100000
...writing 110000
...writing 120000
...writing 130000
...writing 140000
...writing 150000
...writing 160000
...writing 170000
...writing 180000
...writing 190000
...writing 200000
...writing 210000
...writing 220000
...writing 230000
...writing 240000
...writing 250000
...writing 260000
...writing 270000
...writing 280000
...writing 290000
...writing 300000
...writing 310000
...writing 320000
...writing 330000
...writing 340000
...writing 350000
...writing 360000
...writing 370000
...writing 380000
...writing 390000
...writing 400000
...writing 410000
total sequences: 419277
total transcript families: 419277

In [6]:
!cd ../outputs/chicken_transcripts && gunzip -f global_merged.fa.clean.nr.renamed.fasta.gz

In [2]:
!cd ../outputs/chicken_transcripts && formatdb -i global_merged.fa.clean.nr.renamed.fasta -o T -p F

Preparing 'only in RNA' data


In [1]:
!mkdir -p ../outputs/rna

In [13]:
!cd ../outputs/rna/ && do-partition.py -x 1e10 -N 4 --threads 4 chick_rna only_rna.fa


|| This is the script 'do-partition.py' in khmer.
|| You are running khmer version 1.0-dirty
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * J Pell et al., PNAS, 2014 (PMID 22847406)
||
|| Please see the CITATION file for details.


PARAMETERS:
 - kmer size =    32 		(-k)
 - n tables =     4 		(-N)
 - min tablesize = 1e+10 	(-x)

Estimated memory usage is 5e+09 bytes (n_tables x min_tablesize / 8)
--------
Saving k-mer presence table to chick_rna
Loading kmers from sequences in ['only_rna.fa']
--
SUBSET SIZE 100000
N THREADS 4
--
making k-mer presence table
consuming input only_rna.fa
fp rate estimated to be 0.000
** Traverse all the things: stop_big_traversals is false.
enqueued 5 subset tasks
starting 5 threads
---
starting: chick_rna 0
starting: chick_rna 1
starting: chick_rna 2
starting: chick_rna 3
done starting threads
starting: chick_rna 4
saving: chick_rna 4
exiting
...subset-part 0-2158756723823735773: 100000 <- 61912
saving: chick_rna 0
exiting
...subset-part 4689650774654918483-8188399981275226890: 100000 <- 64474
saving: chick_rna 2
exiting
...subset-part 2158756723823735773-4689650774654918483: 100000 <- 64124
saving: chick_rna 1
exiting
...subset-part 8188399981275226890-11857480345320340719: 100000 <- 61415
saving: chick_rna 3
exiting
---
done making subsets! see chick_rna.subset.*.pmap
loading 5 pmap files (first one: chick_rna.subset.0.pmap)
merging chick_rna.subset.0.pmap
merging chick_rna.subset.2.pmap
merging chick_rna.subset.1.pmap
merging chick_rna.subset.4.pmap
merging chick_rna.subset.3.pmap
removing pmap files
outputting partitions for only_rna.fa
... output_partitions 100000 0
output 58617 partitions for only_rna.fa
partitions are in only_rna.fa.part

In [25]:
!cd ../outputs/rna && python ../../scripts/rename-with-partitions.py chick_rna only_rna.fa


... 0
... 10000
... 20000
... 30000
... 40000
... 50000
... 60000
... 70000
... 80000
... 90000
... 100000
... 110000
... 120000
---------------
partition, size
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
---------------
creating only_rna.fa.renamed.fasta.gz
...writing 0
...writing 10000
...writing 20000
...writing 30000
...writing 40000
...writing 50000
...writing 60000
...writing 70000
...writing 80000
...writing 90000
...writing 100000
...writing 110000
...writing 120000
total sequences: 121301
total transcript families: 121301

In [27]:
!cd ../outputs/rna && gunzip -f only_rna.fa.renamed.fasta.gz

In [1]:
!cd ../outputs/rna && formatdb -i only_rna.fa.renamed.fasta -o T -p F

Preparing UniProt


In [5]:
!mkdir -p ../{inputs,outputs}/uniprot
!cd ../inputs/uniprot && wget -c ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
!cp ../inputs/uniprot/uniprot_sprot.fasta.gz ../outputs/uniprot/
!gunzip -f ../outputs/uniprot/uniprot_sprot.fasta.gz


--2014-04-15 15:29:40--  ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
           => “uniprot_sprot.fasta.gz”
Resolving ftp.uniprot.org... 141.161.180.197
Connecting to ftp.uniprot.org|141.161.180.197|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/uniprot/current_release/knowledgebase/complete ... done.
==> SIZE uniprot_sprot.fasta.gz ... 81579509
==> PASV ... done.    ==> REST 81579509 ... done.    
==> RETR uniprot_sprot.fasta.gz ... done.
Length: 81579509 (78M), 0 remaining (unauthoritative)

100%[+++++++++++++++++++++++++++++++++++++++] 81,579,509  --.-K/s   in 0s      

2014-04-15 15:29:40 (0.00 B/s) - “uniprot_sprot.fasta.gz” saved [81579509]


In [1]:
!cd ../outputs/uniprot && formatdb -i uniprot_sprot.fasta -o T -p T

In [2]:
!cd ../outputs/uniprot && python ../../scripts/make-namedb.py uniprot_sprot.fasta uniprot.namedb

In [3]:
!cd ../outputs/uniprot && python -m screed.fadbm uniprot_sprot.fasta


Database saved in uniprot_sprot.fasta_screed
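
With the screed database built, individual UniProt records can be pulled up by name without re-reading the whole FASTA. A minimal sketch; the record ID shown is a placeholder, and the exact record keys/fields depend on the screed version.

# Sketch: random-access lookup against the _screed index built above.
import screed

db = screed.ScreedDB('../outputs/uniprot/uniprot_sprot.fasta')
record = db['sp|P12345|PLACEHOLDER']  # hypothetical ID -- substitute a real header name
print(len(record['sequence']))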

UniProt ortho


In [4]:
#!cd ../scripts && ./submit_blast_uniprot_chick.pbs
#!cd ../scripts && ./submit_blast_chick_uniprot.pbs

In [8]:
!cp ../workdir/results/uniprot.x.chick ../outputs/rna/uniprot.x.chick
!cp ../workdir/results/chick.x.uniprot ../outputs/rna/chick.x.uniprot

In [9]:
!cd ../outputs/rna && python ../../scripts/make-uni-best-hits.py chick.x.uniprot chick.x.uniprot.homol


collecting best hits
... 0
... 10000
... 20000
... 30000
... 40000
... 50000
... 60000
... 70000
... 80000
... 90000
... 100000
... 110000
... 120000
... 130000
... 140000
... 150000
... 160000
... 170000
... 180000
... 190000
... 200000
... 210000

In [10]:
!cd ../outputs/rna && python ../../scripts/make-reciprocal-best-hits.py chick.x.uniprot uniprot.x.chick chick.x.uniprot.ortho


collecting best hits from: chick.x.uniprot
... chick.x.uniprot 0
... chick.x.uniprot 25000
... chick.x.uniprot 50000
... chick.x.uniprot 75000
... chick.x.uniprot 100000
... chick.x.uniprot 125000
... chick.x.uniprot 150000
... chick.x.uniprot 175000
... chick.x.uniprot 200000
collecting best hits from: uniprot.x.chick
... uniprot.x.chick 0
... uniprot.x.chick 25000
... uniprot.x.chick 50000
... uniprot.x.chick 75000
... uniprot.x.chick 100000
... uniprot.x.chick 125000
... uniprot.x.chick 150000
... uniprot.x.chick 175000
... uniprot.x.chick 200000
... uniprot.x.chick 225000
... uniprot.x.chick 250000
... uniprot.x.chick 275000
... uniprot.x.chick 300000
... uniprot.x.chick 325000
... uniprot.x.chick 350000
... uniprot.x.chick 375000
saving best hits result to chick.x.uniprot.ortho.cache
calculating reciprocal best hits
saving reciprocal best hits to chick.x.uniprot.ortho
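
A quick way to see how many reciprocal-best-hit pairs were recovered (a sketch, assuming the .ortho output holds one pair per line):

num_pairs = sum(1 for line in open('../outputs/rna/chick.x.uniprot.ortho') if line.strip())
print('reciprocal best-hit pairs: %d' % num_pairs)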

In [11]:
!cd ../outputs/rna && python ../../scripts/annotate-seqs.py ../chicken_transcripts/global_merged.fa.clean.nr.renamed.fasta chick.x.uniprot.ortho chick.x.uniprot.homol


Scanning sequences -- first pass to gather info
... 0
... 25000
... 50000
... 75000
... 100000
... 125000
... 150000
... 175000
... 200000
... 225000
... 250000
... 275000
... 300000
... 325000
... 350000
... 375000
... 400000
second pass: annotating
... x2 0
... x2 25000
... x2 50000
... x2 75000
... x2 100000
... x2 125000
... x2 150000
... x2 175000
... x2 200000
... x2 225000
... x2 250000
... x2 275000
... x2 300000
... x2 325000
... x2 350000
... x2 375000
... x2 400000
----
419276 sequences total
30515 annotated / ortho
182561 annotated / homol
0 annotated / tr
213076 total annotated

annotated sequences in FASTA format: global_merged.fa.clean.nr.renamed.fasta.annot
annotation spreadsheet in: global_merged.fa.clean.nr.renamed.fasta.annot.csv
annotation spreadsheet with sequences (warning: LARGE): global_merged.fa.clean.nr.renamed.fasta.annot.large.csv

Recalculating Venn diagram with ortho


In [24]:
!cd ../outputs/rna && python ../../scripts/extract.py global_merged.fa.clean.nr.renamed.fasta.annot ortho.fa ortho
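
For reference, the extraction above amounts to keeping annotated records whose headers carry the 'ortho' tag. A rough sketch of that idea, assuming annotate-seqs.py marks ortholog-annotated sequences with 'ortho' in the header; 'ortho_sketch.fa' is a hypothetical output name, and the real extract.py may differ.

# Sketch: pull sequences whose annotated header mentions 'ortho'.
import screed

with open('../outputs/rna/ortho_sketch.fa', 'w') as out:
    for record in screed.open('../outputs/rna/global_merged.fa.clean.nr.renamed.fasta.annot'):
        if 'ortho' in record['name']:
            out.write('>%s\n%s\n' % (record['name'], record['sequence']))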

In [38]:
%%bash
module load BLAST+
#cd ../outputs/galGal4 && makeblastdb -in galGal4.fa -out galGal4_db -dbtype nucl
cd ../outputs/moleculo && makeblastdb -in LR6000017-DNA_A01-LRAAA-AllReads.fasta -out moleculo_db -dbtype nucl



Building a new DB, current time: 04/29/2014 09:14:28
New DB name:   moleculo_db
New DB title:  LR6000017-DNA_A01-LRAAA-AllReads.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1579060 sequences in 480.157 seconds.
application-specific initialization failed: Can't find a usable init.tcl in the following directories: 
    /opt/anaconda1anaconda2anaconda3/lib/tcl8.5 /usr/lib/tcl8.5 /lib/tcl8.5 /usr/library /library /tcl8.5.13/library /tcl8.5.13/library



This probably means that Tcl wasn't installed properly.


In [39]:
#!cd ../scripts && ./submit_blast_ortho_ref.pbs
#!cp ../workdir/results/Chick_RNA_BLAST.txt ../outputs/rna/Chick_RNA_BLAST.txt
#!mv ../workdir/ ../workdirs/ortho_blast_ref


17729447.mgr-04.i

In [42]:
!cd ../outputs/rna/ && python ../../scripts/find_match_2.py ortho.fa Chick_RNA_BLAST.txt ortho.rna.ref


Number of matches: 26830
Number of no matches: 3685

In [48]:
#!cd ../scripts && ./submit_blast_ortho_moleculo.pbs
#!cp ../workdir/results/Moleculo_RNA_BLAST.txt ../outputs/rna/Moleculo_RNA_BLAST.txt
#!mv ../workdir/ ../workdirs/ortho_blast_moleculo

In [49]:
!cd ../outputs/rna/ && python ../../scripts/find_match_2.py ortho.fa Moleculo_RNA_BLAST.txt ortho.rna.moleculo


Number of matches: 23183
Number of no matches: 7332

In [1]:
with open("../outputs/rna/match_ortho.rna.moleculo.txt") as rna_mol_match:
    rna_mol_set = set([line.split(' ')[0] for line in rna_mol_match if line.strip()])

with open("../outputs/rna/match_ortho.rna.ref.txt") as rna_ref_match:
    rna_ref_set = set([line.split(' ')[0] for line in rna_ref_match if line.strip()])

In [2]:
B = len(rna_mol_set & rna_ref_set)  # ortho transcripts matching both the reference and Moleculo
A = 26830 - B                       # matched the reference only (26830 matches vs. the reference, above)
C = 23183 - B                       # matched Moleculo only (23183 matches vs. Moleculo, above)
E = 30515                           # all ortholog-annotated transcripts
D = E - A - C - B                   # matched neither assembly

print A, B, C, D, E


3650 23180 3 3682 30515

In [4]:
#scale = float(max([A, B, C, D]))
#v = venn3((1, 1, 1, D / scale, A / scale, C / scale, B / scale), set_labels=('Ref', 'Moleculo', 'RNA'))

# Equal-area Venn; the actual region sizes are written in as labels below.
v = venn3((1, 1, 1, 1, 1, 1, 1), set_labels=('Ref', 'Moleculo', 'RNA'))

# Region IDs are (Ref, Moleculo, RNA) membership flags.
v.get_label_by_id('100').set_text('')           # Ref only -- not tracked here
v.get_label_by_id('010').set_text('')           # Moleculo only -- not tracked here
v.get_label_by_id('001').set_text('D\n%d' % D)  # RNA only: matched neither assembly

v.get_label_by_id('011').set_text('C\n%d' % C)  # RNA & Moleculo only
v.get_label_by_id('101').set_text('A\n%d' % A)  # RNA & Ref only
v.get_label_by_id('110').set_text('')           # Ref & Moleculo only -- not tracked here

v.get_label_by_id('111').set_text('B\n%d' % B)  # in all three
plt.title('Filtered mRNAseq\n(only genes with orthology to UniProt genes)')


Out[4]:
<matplotlib.text.Text at 0x2b3868d5fb50>

Preparing Human


In [1]:
!mkdir -p ../{inputs,outputs}/human
!cd ../inputs/human && wget -c ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa.gz
!cp ../inputs/human/human.protein.faa.gz ../outputs/human/
!gunzip -f ../outputs/human/human.protein.faa.gz


--2014-04-17 15:45:15--  ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa.gz
           => “human.protein.faa.gz”
Resolving ftp.ncbi.nih.gov... 130.14.250.10, 2607:f220:41e:250::11
Connecting to ftp.ncbi.nih.gov|130.14.250.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /refseq/H_sapiens/mRNA_Prot ... done.
==> SIZE human.protein.faa.gz ... 18881310
==> PASV ... done.    ==> RETR human.protein.faa.gz ... done.
Length: 18881310 (18M) (unauthoritative)

100%[======================================>] 18,881,310  9.18M/s   in 2.0s    

2014-04-17 15:45:17 (9.18 MB/s) - “human.protein.faa.gz” saved [18881310]


In [80]:
!cd ../outputs/human && formatdb -i human.protein.faa -o T -p T

In [7]:
!cd ../outputs/human && python ../../scripts/make-namedb.py human.protein.faa human.namedb

In [8]:
!cd ../outputs/human && python -m screed.fadbm human.protein.faa


Database saved in human.protein.faa_screed

Human ortho


In [ ]:
#!cd ../scripts && ./submit_blast_human_chick.pbs
#!cd ../scripts && ./submit_blast_chick_human.pbs

In [9]:
#!cp ../workdir/results/human.x.chick ../outputs/rna/human.x.chick
#!cp ../workdir/results/chick.x.human ../outputs/rna/chick.x.human

In [10]:
!cd ../outputs/rna && python ../../scripts/make-uni-best-hits.py chick.x.human chick.x.human.homol


collecting best hits
... 0
... 10000
... 20000

In [11]:
!cd ../outputs/rna && python ../../scripts/make-reciprocal-best-hits.py chick.x.human human.x.chick chick.x.human.ortho


collecting best hits from: chick.x.human
... chick.x.human 0
... chick.x.human 25000
collecting best hits from: human.x.chick
... human.x.chick 0
... human.x.chick 25000
saving best hits result to chick.x.human.ortho.cache
calculating reciprocal best hits
saving reciprocal best hits to chick.x.human.ortho

In [12]:
!cd ../outputs/rna && python ../../scripts/annotate-seqs.py only_rna.fa.renamed.fasta chick.x.human.ortho chick.x.human.homol


Scanning sequences -- first pass to gather info
... 0
... 25000
... 50000
... 75000
... 100000
second pass: annotating
... x2 0
... x2 25000
... x2 50000
... x2 75000
... x2 100000
----
121300 sequences total
0 annotated / ortho
0 annotated / homol
0 annotated / tr
0 total annotated

annotated sequences in FASTA format: only_rna.fa.renamed.fasta.annot
annotation spreadsheet in: only_rna.fa.renamed.fasta.annot.csv
annotation spreadsheet with sequences (warning: LARGE): only_rna.fa.renamed.fasta.annot.large.csv
