by Kristian Garza
In this exploration the first question we asked was how big is our dataset? We are looking at events that represents links between journal article and scholarly resources. Therefore our dataset looks like this: there are around 50K links from journal articles to scholarly resources and around 900K links from scholarly resources to journal articles.
In [ ]:
In [130]:
library(ggplot2)
library(plyr)
library(scales)
library(dplyr)
library(stringr)
library(RColorBrewer)
library(httr)
library(tidyr)
library(psych)
library(reshape)
source("../functions/graph_functions.r")
subject,year,count,percentage,sum copper,2006,32,79,5255 silver,2006,4176,79,5255
In [131]:
load("../data/2018-10-08_source_crossref_meta.Rda",verbose=TRUE)
registrants <- meta$registrants
citation_types <- meta$`citation-types`
relation_types <- meta$`relation-types`
pairings <- meta$pairings
In [132]:
citation_types<-citation_types %>% unnest(`year-months`)
In [133]:
flat_year<-function(years){
x <- filter(years[[1]], title == "2017")
return(x$sum)
}
flat_year_8<-function(years){
x <- filter(years[[1]], title == "2018")
return(x$sum)
}
for (row in 1:nrow(registrants)) {
first <- flat_year(registrants$years[row])
second <- flat_year_8(registrants$years[row])
if(length(first) == 0){
first<-0
}
if(length(second) == 0){
second<-0
}
registrants$`2017`[row] <- first
registrants$`2018`[row] <- second
}
registrants<-registrants %>%
mutate(m=((`2018`-`2017`)/(10000)), client=title,
`2018-p`=100*(`2018`/count),
`2017-p`=100*(`2017`/count) ,
`2018`=`2018`,
`2017`=`2017`
) %>%
filter(startsWith(title, "crossref")) %>% arrange(desc(`2018`))
In [134]:
load("../data/2018-10-10_crossref_registrants.Rda",verbose=TRUE)
registrants <- registrants %>% rowwise() %>% left_join(crossref_reg)
In [135]:
plot_slopegraph(y_label="Article to Datasets Links", slope_df=head(registrants,8))
Another interesting thing we can look is relationships of citations between Publishers and Datacenters.
FIG Parallel set graph for data citations between particular Publishers and a particular Data Center. Publishers as the top category and Data Centers as the bottom category. The width of the bar denotes the absolute number of citations for that Publisher-Data center match. The dataset corresponds to links collected as of September 2018.
In [136]:
pairings<-pairings %>% unnest(registrants)
In [137]:
pairings<-pairings %>% filter(startsWith(title, "datacite")) %>% mutate(datacenter=as.factor(title),publisher=as.factor(id1)) %>%
arrange(desc(sum))
# head(pairings,10)
# summary(pairings$count)
In [138]:
with(pairings, parallelset(publisher,datacenter, freq=sum, col="#008888", alpha=0.4))
In [139]:
pairings <- meta$pairings
# head(pairings,2)
pairings<-pairings %>% unnest(registrants) %>% filter(startsWith(title, "crossref")) %>% mutate(datacenter=as.factor(title1),publisher=as.factor(title)) %>%
arrange(desc(sum))
pairings <- pairings %>% rowwise() %>% left_join(crossref_reg)
# head(pairings,10)
# summary(pairings$count)
In [140]:
pairings_h <- pairings %>%
mutate(highlighted = ifelse(publisher=="crossref.4950","Yes","No"))
myt <- within(pairings_h, {
highlighted <- factor(highlighted, levels=c("Yes","No"))
color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(publisher, datacenter, freq=sum, col=color, alpha=0.4))
In [141]:
pairings_h <- pairings %>%
mutate(highlighted = ifelse(publisher=="crossref.297","Yes","No"))
myt <- within(pairings_h, {
highlighted <- factor(highlighted, levels=c("Yes","No"))
color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(publisher, datacenter, freq=sum, col=color, alpha=0.4))
In [142]:
relation_types <- meta$`relation-types`
types <- relation_types %>%
mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
arrange(desc(total))
hundred_plot(types, "Links by type",FALSE)
In [143]:
load("../data/2018-10-09_source_crossref_all_citations_types_meta.Rda",verbose=TRUE)
In [144]:
citation_types <- meta$`citation-types`
types <- citation_types %>%
mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
arrange(desc(percentage))
hundred_plot(head(types,7),"Types of Resources Linked (%)",TRUE)
# head(types)
This is a distribution of between schilarly resources and article publications over the time. One might say that looks very small. But I think a better way to see it is by asking how has it change. Data citation is relatively new. Joint Declaration of Data Citation Principles was just release in 2014. We would like to know how things have change over the last couple years.
This is a distribution of data citations over the time. When we look at that growth we can see that since the Joint Declaration of Data Citation Principles the growth has been interesting, reaching around 40% increase every year with respect to the last year.
In [145]:
load("../data/2018-10-08_source_crossref_meta.Rda",verbose=TRUE)
print((meta$registrants$years[1]))
registrants <- meta$registrants
In [146]:
registrants<-registrants %>% unnest(`years`) %>% filter(id1>"2008") %>% mutate(year=as.factor(id1),sum=as.integer(sum))
x<-group_by(registrants, year) %>% summarise(total = sum(sum))
In [147]:
p<-ggplot(x, aes(x=year,y=total)) + geom_bar(stat="identity") + scale_y_continuous(label=comma) +
labs(x="Years", y="Article to Dataset Links") # Axis labels
p + theme(
axis.text.x = element_text(angle = 90, hjust = 1))
In [ ]:
In [ ]: