Exploratory Analysis Data Center Angle

by Kristian Garza

In this exploration the first question we asked was how big is our dataset? We are looking at events that represents links between journal article and scholarly resources. Therefore our dataset looks like this: there are around 50K links from journal articles to scholarly resources and around 900K links from scholarly resources to journal articles.


In [63]:
library(ggplot2)
library(jsonlite)
library(plyr)
library(scales)
library(dplyr)
library(stringr)
library(RColorBrewer)
library(httr)
library(tidyr)
library(psych)
source("../functions/graph_functions.r")

subject,year,count,percentage,sum copper,2006,32,79,5255 silver,2006,4176,79,5255


In [64]:
# load("../data/2018-10-03_source_datacite-crossref_meta.Rda",verbose=TRUE)
load("../data/2018-10-28_source_datacite-crossref_meta.Rda",verbose=TRUE)
print((meta$registrants$years[1]))

registrants <- meta$registrants
citation_types <- meta$`citation-types`
relation_types <- meta$`relation-types`
pairings <- meta$pairings


Loading objects:
  meta
[[1]]
    id title    sum
1 2016  2016  95308
2 2017  2017  10793
3 2018  2018 108948


In [65]:
flat_year<-function(years){
    x <- filter(years[[1]], title == "2017")
    return(x$sum)
}
flat_year_8<-function(years){
    x <- filter(years[[1]], title == "2018")
    return(x$sum)
}

# registrants %>%  mutate(`2017` = "",`2018` = "" ) 

for (row in 1:nrow(registrants)) {
    first <- flat_year(registrants$years[row])
    second <- flat_year_8(registrants$years[row])
    if(length(first) == 0){
        first<-0
    }
    if(length(second) == 0){
        second<-0
    }        
        
    registrants$`2017`[row] <- first
    registrants$`2018`[row] <- second
}

registrants<-registrants %>% 
    mutate(m=((`2018`-`2017`)/(10000)), 
           client=title,
           `2018`=`2018`,
           `2017`=`2017`,
           `2018-p`=100*(`2018`/count),
           `2017-p`=100*(`2017`/count)      
          
          ) %>% 
    filter(startsWith(title, "datacite"))
# head(registrants,5)

How Data citation has changed in the last 24 months?

FIG Slopegraph comparing data citations changes over time for a list of Publishers. In this graph we filtered to the top 8 contributors of data citations. The dataset corresponds to data citations collected as of September 2018.


In [66]:
load("../data/2018-10-11_datacite_registrants.Rda",verbose=TRUE)
registrants <- registrants %>% rowwise() %>% left_join(datacite_reg)


Loading objects:
  datacite_reg
Joining, by = "id"

In [67]:
plot_slopegraph(y_label="Datasets to Article Links", slope_df=head(registrants,15))


   vars n mean      sd median trimmed  mad min   max range skew kurtosis
X1    1 9 3516 7179.44      2    3516 2.97   0 20166 20166 1.42     0.37
        se
X1 2393.15
   vars n     mean      sd median  trimmed     mad  min    max  range skew
X1    1 9 48725.89 68212.6   6421 48725.89 7582.02 1307 183574 182267 0.85
   kurtosis       se
X1    -1.03 22737.53

In [68]:
types <- relation_types %>%   
        mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
        arrange(desc(total))


hundred_plot(head(types,7),"Links by relationship type (%)",TRUE)


   vars n     mean       sd median  trimmed      mad min    max  range skew
X1    1 7 89185.14 107557.8   9682 89185.14 13734.81 418 241222 240804 0.31
   kurtosis       se
X1    -2.04 40653.04

In [21]:
citation <- citation_types %>%   
        mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
        arrange(desc(total))


hundred_plot(head(citation,7),"Links by type (%)",TRUE)


   vars n   mean sd median trimmed mad    min    max range skew kurtosis se
X1    1 1 624723 NA 624723  624723   0 624723 624723     0   NA       NA NA

Links from Scholarly Resources to Article Publications

FIG This is a distribution of links Scholarly Resources to Article Publications by type of citations. Links from dataset resources to schorlaly articles make the biggest contribution to this dataset.


In [22]:
load("../data/2018-10-28_source_datacite_all_citations_types_meta.Rda",verbose=TRUE)
citation_types <- meta$`citation-types`

types <- citation_types %>%   
        mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
        arrange(desc(percentage))
hundred_plot(head(types,7),"Types of Citation (%)",TRUE)


Loading objects:
  meta
   vars n     mean       sd median  trimmed      mad  min    max  range skew
X1    1 7 117212.6 225054.3  42153 117212.6 32589.03 1771 624723 622952 1.58
   kurtosis       se
X1     0.72 85062.53

In [23]:
pairings<-pairings %>% unnest(registrants)

In [24]:
pairings<-pairings %>% filter(startsWith(title, "datacite")) %>% mutate(datacenter=as.factor(title),publisher=as.factor(id1))  %>%   
arrange(desc(sum))
head(pairings,10)
summary(pairings$count)


idtitlecountid1title1sumdatacenterpublisher
datacite.dk.gbif datacite.dk.gbif 121258 crossref.4913 crossref.4913 58866 datacite.dk.gbif crossref.4913
datacite.bl.ccdc datacite.bl.ccdc 215049 crossref.316 crossref.316 58463 datacite.bl.ccdc crossref.316
datacite.bl.ccdc datacite.bl.ccdc 215049 crossref.292 crossref.292 45719 datacite.bl.ccdc crossref.292
datacite.dk.gbif datacite.dk.gbif 121258 crossref.2258 crossref.2258 42066 datacite.dk.gbif crossref.2258
datacite.bl.ccdc datacite.bl.ccdc 215049 crossref.311 crossref.311 38299 datacite.bl.ccdc crossref.311
datacite.bl.ccdc datacite.bl.ccdc 215049 crossref.78 crossref.78 37508 datacite.bl.ccdc crossref.78
datacite.gesis.icpsrdatacite.gesis.icpsr183574 crossref.78 crossref.78 37264 datacite.gesis.icpsrcrossref.78
datacite.gesis.icpsrdatacite.gesis.icpsr183574 crossref.179 crossref.179 28642 datacite.gesis.icpsrcrossref.179
datacite.gesis.icpsrdatacite.gesis.icpsr183574 crossref.311 crossref.311 24913 datacite.gesis.icpsrcrossref.311
datacite.gesis.icpsrdatacite.gesis.icpsr183574 crossref.297 crossref.297 20670 datacite.gesis.icpsrcrossref.297
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1330    1901   59178   81446  183574  215049 

Relationships between Data Centers and Publishers

Another interesting thing we can look is relationships of citations between Publishers and Datacenters.

FIG Parallel set graph for data citations between particular Publishers and a particular Data Center. Publishers as the top category and Data Centers as the bottom category. The width of the bar denotes the absolute number of citations for that Publisher-Data center match. The dataset corresponds to links collected as of September 2018.


In [25]:
with(pairings, parallelset(datacenter, publisher,  freq=sum, col="#008888", alpha=0.4))


Highlighting examples of Relationship between Datacenters and repositories

We highlight four examples:

  • The Cambridge Crystallographic Data Centre
  • PANGAEA
  • Global Biodiversity Information Facility
  • Inter-university Consortium for Political and Social Research

In [26]:
pairings_h <- pairings %>%
    mutate(highlighted = ifelse(datacenter=="datacite.bl.ccdc","Yes","No")) 
    

myt <- within(pairings_h, {
  highlighted <- factor(highlighted, levels=c("Yes","No"))
  color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(datacenter, publisher,  freq=sum, col=color, alpha=0.4))



In [27]:
pairings_h <- pairings %>%
    mutate(highlighted = ifelse(datacenter=="datacite.tib.pangaea","Yes","No")) 
    

myt <- within(pairings_h, {
  highlighted <- factor(highlighted, levels=c("Yes","No"))
  color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(datacenter, publisher,  freq=sum, col=color, alpha=0.4))



In [28]:
pairings_h <- pairings %>%
    mutate(highlighted = ifelse(datacenter=="datacite.dk.gbif","Yes","No")) 
    

myt <- within(pairings_h, {
  highlighted <- factor(highlighted, levels=c("Yes","No"))
  color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(datacenter, publisher,  freq=sum, col=color, alpha=0.4))



In [29]:
pairings_h <- pairings %>%
    mutate(highlighted = ifelse(datacenter=="datacite.gesis.icpsr","Yes","No")) 
    

myt <- within(pairings_h, {
  highlighted <- factor(highlighted, levels=c("Yes","No"))
  color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(datacenter, publisher,  freq=sum, col=color, alpha=0.4))


Links growth over the years

This is a distribution of between schilarly resources and article publications over the time.


In [30]:
load("../data/2018-10-28_source_datacite-crossref_meta.Rda",verbose=TRUE)
print((meta$registrants$years[1]))

registrants <- meta$registrants
citation_types <- meta$`citation-types`
relation_types <- meta$`relation-types`
pairings <- meta$pairings


Loading objects:
  meta
[[1]]
    id title    sum
1 2016  2016  95308
2 2017  2017  10793
3 2018  2018 108948


In [31]:
registrants<-registrants %>% unnest(`years`) %>% filter(id1>"2008") %>% mutate(year=as.factor(id1),sum=as.integer(sum))

x<-group_by(registrants, year) %>% summarise(total = sum(sum))

In [32]:
p<-ggplot(x, aes(x=year,y=total)) + geom_bar(stat="identity") + scale_y_continuous(label=comma) +
                      labs(x="Years", y="Dataset to Article Links")  # Axis labels


p + theme( 
              axis.text.x = element_text(angle = 90, hjust = 1))