Objective: To identify genotype-phenotype trait associations in rice

Develop a workflow to identify genes indirectly associated with rice traits (grain size, grain number, etc.) using the Euretos Knowledge Platform (EKP), and visualize them in an interactive knowledge graph.

Load necessary libraries


In [22]:
library(dplyr)
library(tidyr)
library(sqldf)
library(splitstackshape)
library(stringr)
library(compare)

Set the working directory and load the EKP API functions


In [23]:
setwd("~/ODEX4all-UseCases/Bayer/data")

source("../src/EuretosInfrastructure.R")
options(warn=-1)



Retrieving page 0
Retrieving page 1
Retrieving page 2
Retrieving page 3
Retrieving page 4
Retrieving page 5
Retrieving page 6
Retrieving page 7
Retrieving page 8
Retrieving page 9
Retrieving page 10
Retrieving page 11

Load selected genes from the Qtaro database, available at qtaro.abr.affrc.go.jp/qtab/table


In [24]:
rice_genes <-read.csv("GeneInformationTable_Qtaro_Selected.csv",header=TRUE)

Here we consider only the following morphological traits, as specified in the input provided:

"grain size" (EKP concept ID: 5899980)

"grain thickness" (EKP concept ID: 5900661)

"grain number" (EKP concept ID, rice-specific: 4343608)

"kernel number" (EKP concept ID: 5900190)

"GRNB" (EKP concept ID: 5900394)

"fruit number" (EKP concept ID: 5900077)

"grain number per plant" (EKP concept ID, exact match: 5900828)

"GN" (no single EKP concept ID: the term is vague and has many hits within EKP)

Preview the loaded rice genes


In [25]:
head(rice_genes)


locus_id
loc4325145
loc4336431
os02g0630300
loc4324691
loc4335790
loc4338448

Step 1a : Get the starting concept identifiers for genes


In [26]:
start<-getConceptID(rice_genes[,"locus_id"])
start<-start[,"EKP_Concept_Id"]

In [27]:
head(start)


  1. '3939406'
  2. '3943638'
  3. '7191380'
  4. '3940353'
  5. '3942413'
  6. '3941570'

Step 1b: Get the ending concept identifiers for traits


In [28]:
traits<-c("TO:0000590","TO:0000382","TO:0000396","TO:0000397","TO:0000734","TO:0000402","TO:0002759","TO:0000447")

Get the EKP IDs of the traits used as ending concepts


In [29]:
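# Look up each Trait Ontology (TO) ID in EKP and stack the results row-wise
end<-NULL
for (i in seq_along(traits)){
  tmp <- getTraitEKPID(traits[i])
  tmpContent<-cbind(traits[i],tmp)
  end<-rbind(end,tmpContent)
}
# Keep columns 2 to 4 (dropping the TO ID prepended by cbind) and name them
end<-end[,c(2,3,4)]
colnames(end)<-c("TOid","TOEKPid","TOContentName")

head(end)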

TOid        TOEKPid  TOContentName
TO:0000590  5899973  dehulled grain weight
TO:0000382  5900098  1000-seed weight
TO:0000396  5900965  grain yield trait
TO:0000397  5899980  grain size
TO:0000734  5900194  grain length
TO:0000402  5899965  grain width
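
The lookup loop grows `end` one row at a time. A minimal sketch of the same pattern using `lapply` and a single `do.call(rbind, ...)` is shown here. It assumes a hypothetical stand-in, `lookup_trait()`, in place of `getTraitEKPID()`, whose real return shape is not shown in this notebook; the IDs and names are taken from the table above.

```r
# Hypothetical stand-in for getTraitEKPID(): returns one row per TO ID.
# The real function queries EKP; the return shape here is an assumption.
known <- data.frame(
  TOid          = c("TO:0000590", "TO:0000382", "TO:0000397"),
  TOEKPid       = c("5899973", "5900098", "5899980"),
  TOContentName = c("dehulled grain weight", "1000-seed weight", "grain size"),
  stringsAsFactors = FALSE
)
lookup_trait <- function(to_id) known[known$TOid == to_id, , drop = FALSE]

trait_ids <- c("TO:0000590", "TO:0000397")

# Collect one data frame per trait, then bind them in a single step
rows <- lapply(trait_ids, lookup_trait)
end_sketch <- do.call(rbind, rows)
rownames(end_sketch) <- NULL
print(end_sketch)
```

Binding once at the end avoids re-copying the accumulated table on every iteration, which matters little for eight traits but more for long ID lists.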

Step 2a: Get indirect relationships between genes and the traits that exist within EKP, and save intermediate results


In [30]:
genes2Trait<-getIndirectRelation(start,end[c(3,7,8),"TOEKPid"])
save(genes2Trait, file = "genes2Trait.rda")
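
To illustrate what an indirect relationship means here, the sketch below finds two-hop paths (gene, intermediate concept, trait) in a small edge list with base R `merge`. It is not how `getIndirectRelation` works internally (that function queries EKP); the helper `two_hop_paths()` is hypothetical, and the toy edges reuse concept IDs that appear in this notebook's outputs.

```r
# Toy edge list (Subject -> Object); IDs reused from this notebook's outputs
edges <- data.frame(
  Subject = c("3940353", "3942413", "5900965", "5900394"),
  Object  = c("5900965", "5900965", "5900394", "5900594"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all paths start -> mid -> end of length two
two_hop_paths <- function(edges, starts, ends) {
  first  <- edges[edges$Subject %in% starts, , drop = FALSE]
  second <- edges[edges$Object %in% ends, , drop = FALSE]
  names(first)  <- c("start", "mid")
  names(second) <- c("mid", "end")
  # Join the two hops on the shared intermediate concept
  paths <- merge(first, second, by = "mid")
  paths[, c("start", "mid", "end")]
}

paths <- two_hop_paths(edges,
                       starts = c("3940353", "3942413"),
                       ends   = "5900394")
print(paths)
```

Both genes reach the end concept 5900394 through the same intermediate concept 5900965, so neither has a direct edge to the trait but both are indirectly related to it.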

Step 2b: Get indirect relationships between genes and the "Trait Neighbours" (used as end concepts) and save intermediate results


In [31]:
neig<-read.csv("NeighbouringTraitEKPid.csv",stringsAsFactors = FALSE,header=TRUE)
# End concepts are the neighbouring traits (assumed to be column 2 of neig, as used in Step 2c)
genes2TraitNeighbours<-getIndirectRelation(start,unique(neig[,2]))
save(genes2TraitNeighbours, file = "genes2TraitNeighbours.rda")

Step 2c: Now get the relationship between Traits and their Neighbours and save intermediate results


In [32]:
Trait2TraitNeighbours<-getIndirectRelation(unique(neig[,1]),unique(neig[,2]))
save(Trait2TraitNeighbours, file = "Trait2TraitNeighbours.rda")

Step 2d: Get Direct relationship between genes and traits and save intermediate results


In [33]:
genes2TraitsDirect<-getIndirectRelation(start,end[,"TOEKPid"])
save(genes2TraitsDirect, file = "genes2TraitsDirect.rda")

Step 3: Combine the results together


In [39]:
# Reload the intermediate results saved in Steps 2a to 2d
load("genes2Trait.rda")
load("genes2TraitNeighbours.rda")
load("Trait2TraitNeighbours.rda")
load("genes2TraitsDirect.rda")

# Convert each JSON result into a matrix of triples
genes2Trait<-as.matrix(getTableFromJson(genes2Trait))
genes2TraitNeighbours<-as.matrix(getTableFromJson(genes2TraitNeighbours))
Trait2TraitNeighbours<-as.matrix(getTableFromJson(Trait2TraitNeighbours))
genes2TraitsDirect <- as.matrix(getTableFromJson(genes2TraitsDirect))

# Stack all triples and drop exact duplicate rows
dfs<-data.frame(unique(rbind(genes2Trait,genes2TraitNeighbours,Trait2TraitNeighbours,genes2TraitsDirect)))

In [44]:
head(dfs)


Subject  Predicate  Object   Publications  Score
3940353  10773543   5900965  182653814     7.0512
5900965  10773540   5900394  232227983     7.0512
3942413  10773543   5900965  182655606     6.9721
5900965  10773540   5900394  232227983     6.9721
7190948  10773543   5900394  232226800     4.9542
5900394  10773540   5900594  214510142     4.9542
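
The combine step relies on `rbind` followed by `unique`, which drops only rows that are identical in every column. A small sketch with toy matrices (values copied from the output above; the names `m1`, `m2` and `combined` are illustrative) shows the behaviour:

```r
# Toy triple matrices with the same five columns as dfs
cols <- c("Subject", "Predicate", "Object", "Publications", "Score")
m1 <- matrix(c("3940353", "10773543", "5900965", "182653814", "7.0512",
               "5900965", "10773540", "5900394", "232227983", "7.0512"),
             ncol = 5, byrow = TRUE, dimnames = list(NULL, cols))
m2 <- matrix(c("5900965", "10773540", "5900394", "232227983", "7.0512",
               "5900965", "10773540", "5900394", "232227983", "6.9721"),
             ncol = 5, byrow = TRUE, dimnames = list(NULL, cols))

# unique() on a matrix removes rows that match in all columns
combined <- data.frame(unique(rbind(m1, m2)), stringsAsFactors = FALSE)
combined$Score <- as.numeric(combined$Score)
print(combined)
```

The exact duplicate is removed, but the triple that differs only in Score is kept as a separate row, which is consistent with the same triple appearing at both 7.0512 and 6.9721 in the output above.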

Step 4: Map the triples to human-readable names using the reference database

The reference predicate list is collected from EKP.


In [36]:
pred<-read.csv("Reference_Predicate_List.csv",header=TRUE)
pred<-pred[,c(2,3)]
colnames(pred)<-c("pred","names")


subject_name<-getConceptName(dfs[,"Subject"])
dfs<-cbind(dfs,subject_name[,1])

object_name<-getConceptName(dfs[,"Object"])
dfs<-cbind(dfs,object_name[,1])

predicate_name<-sqldf('select * from dfs left join pred on pred.pred=dfs.Predicate')

pbs<-getPubMedId(dfs$Publications)

tripleName<-cbind(subject_name,as.character(predicate_name[,"names"]),object_name,pbs,as.character(dfs[,"Score"]))
colnames(tripleName)<-c("Subject","Predicate","Object","Provenance","Score")

write.table(tripleName,file="~/ODEX4all-UseCases/Bayer/data/Results_Genes_Traits.csv",sep=",",row.names = FALSE)


Loading required package: tcltk

In [45]:
head(tripleName)


Subject                               Predicate           Object               Provenance                                                       Score
loc4324691 (oryza sativa japonica)    is associated with  grain yield trait    NA                                                               7.0512
grain yield trait                     is a                grain number         http://tools.gramene.org/ontology/term/to:0002759                7.0512
loc4335790 (oryza sativa japonica)    is associated with  grain yield trait    NA                                                               6.9721
grain yield trait                     is a                grain number         http://tools.gramene.org/ontology/term/to:0002759                6.9721
os07g0153600 (oryza sativa japonica)  is associated with  grain number         http://browser.planteome.org/amigo/search/ontology?q=TO:0000357  4.9542
grain number                          is a                filled grain number  http://www.ncbi.nlm.nih.gov/pubmed/18820699                      4.9542
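
The predicate lookup in Step 4 is a left join of the triples against the reference list. As a rough base R illustration, the sketch below does the same mapping with `match`. The two predicate IDs and their names are read off the outputs in this notebook; the third ID is made up to show what happens when a predicate is missing from the reference list, and the names `pred_sketch` and `triples` are illustrative.

```r
# Reference list: predicate ID -> readable name (two entries seen in the outputs)
pred_sketch <- data.frame(
  pred  = c("10773543", "10773540"),
  names = c("is associated with", "is a"),
  stringsAsFactors = FALSE
)

# Toy triples; the last predicate ID is made up and has no reference entry
triples <- data.frame(
  Subject   = c("3940353", "5900965", "5900394"),
  Predicate = c("10773543", "10773540", "99999999"),
  Object    = c("5900965", "5900394", "5900594"),
  stringsAsFactors = FALSE
)

# match() returns, for each triple, the row of the reference list to use
idx <- match(triples$Predicate, pred_sketch$pred)
triples$PredicateName <- pred_sketch$names[idx]
print(triples)
```

Unmatched predicates come back as NA, as they would in a left join, and `match` keeps the triples in their original row order, which matters when the result is later combined column-wise with `cbind`.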