# Exploratory Analysis: Data Center Angle

by Kristian Garza

In this exploration, the first question we asked was: how big is our dataset? We are looking at events that represent links between journal articles and scholarly resources. The dataset contains around 50K links from journal articles to scholarly resources and around 900K links from scholarly resources to journal articles.

``````

In [63]:

library(ggplot2)
library(jsonlite)
library(plyr)
library(scales)
library(dplyr)
library(stringr)
library(RColorBrewer)
library(httr)
library(tidyr)
library(psych)
source("../functions/graph_functions.r")

``````

subject,year,count,percentage,sum copper,2006,32,79,5255 silver,2006,4176,79,5255

``````

In [64]:

print(meta$registrants$years[1])

registrants <- meta$registrants
citation_types <- meta$`citation-types`
relation_types <- meta$`relation-types`
pairings <- meta$pairings

``````
``````

meta
[[1]]
id title    sum
1 2016  2016  95308
2 2017  2017  10793
3 2018  2018 108948

``````
``````

In [65]:

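# Return the 2017 total from a one-element list holding a registrant's yearly sums.
flat_year <- function(years) {
  x <- filter(years[[1]], title == "2017")
  return(x$sum)
}

# Same lookup for 2018.
flat_year_8 <- function(years) {
  x <- filter(years[[1]], title == "2018")
  return(x$sum)
}

# registrants %>%  mutate(`2017` = "",`2018` = "" )

# Copy the yearly totals into flat columns, using 0 when a year is missing.
for (row in 1:nrow(registrants)) {
  first <- flat_year(registrants$years[row])
  second <- flat_year_8(registrants$years[row])
  if (length(first) == 0) {
    first <- 0
  }
  if (length(second) == 0) {
    second <- 0
  }

  registrants$`2017`[row] <- first
  registrants$`2018`[row] <- second
}

# Derive the change between years (in units of 10,000) and each year's share of `count`,
# then keep only rows whose title starts with "datacite".
registrants <- registrants %>%
  mutate(m = (`2018` - `2017`) / 10000,
         client = title,
         `2018` = `2018`,
         `2017` = `2017`,
         `2018-p` = 100 * (`2018` / count),
         `2017-p` = 100 * (`2017` / count)) %>%
  filter(startsWith(title, "datacite"))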

``````
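The loop above fills the `2017` and `2018` columns one row at a time. As a minimal sketch, the same extraction can be written without an explicit loop using base R's `vapply`. It assumes each element of `years` is a data frame with `title` and `sum` columns, as the printed output suggests. The toy `registrants_demo` data and the helper name `year_sum` are illustrative and not part of the original notebook.

```r
# Toy stand-in for meta$registrants: each row carries a nested data frame of yearly sums.
registrants_demo <- data.frame(
  id = c("datacite.a", "datacite.b"),
  count = c(300, 50),
  stringsAsFactors = FALSE
)
registrants_demo$years <- list(
  data.frame(title = c("2017", "2018"), sum = c(100, 200), stringsAsFactors = FALSE),
  data.frame(title = "2018", sum = 50, stringsAsFactors = FALSE)
)

# Hypothetical helper: return the sum for one year, or 0 when that year is absent.
year_sum <- function(years_df, year) {
  value <- years_df$sum[years_df$title == year]
  if (length(value) == 0) 0 else value[1]
}

# Apply the helper to every nested data frame, once per year of interest.
registrants_demo$`2017` <- vapply(registrants_demo$years, year_sum, numeric(1), year = "2017")
registrants_demo$`2018` <- vapply(registrants_demo$years, year_sum, numeric(1), year = "2018")
```

Passing the year as an argument avoids keeping one near-identical function per year, which may help if more years are added later.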

# How has data citation changed in the last 24 months?

FIG Slopegraph comparing changes in data citations over time for a list of Publishers. In this graph we kept only the top 8 contributors of data citations. The dataset corresponds to data citations collected as of September 2018.
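A slopegraph of this kind needs one row per client and year. The following is a rough sketch of that preparation on made-up numbers, not the notebook's actual plotting code (which lives in the sourced `graph_functions.r`). It assumes that "top 8" means the eight clients with the largest 2018 totals; `clients_demo`, `top8`, and `slope_data` are hypothetical names.

```r
# Toy data standing in for the per-client totals; the values are made up.
clients_demo <- data.frame(
  client = paste0("datacite.demo", 1:10),
  `2017` = c(5, 40, 12, 0, 33, 8, 21, 3, 17, 9),
  `2018` = c(9, 55, 10, 2, 60, 30, 25, 1, 44, 12),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumption: "top 8" means the eight clients with the largest 2018 totals.
top8 <- head(clients_demo[order(-clients_demo$`2018`), ], 8)

# Long format: one row per client and year, the shape a slopegraph needs.
slope_data <- data.frame(
  client = rep(top8$client, times = 2),
  year = rep(c("2017", "2018"), each = nrow(top8)),
  citations = c(top8$`2017`, top8$`2018`),
  stringsAsFactors = FALSE
)

# Optional plot, built only if ggplot2 is installed: one line per client between the two years.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  slope_plot <- ggplot2::ggplot(
    slope_data,
    ggplot2::aes(x = year, y = citations, group = client)
  ) +
    ggplot2::geom_line() +
    ggplot2::geom_point()
}
```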

``````

In [66]:

registrants <- registrants %>% rowwise() %>% left_join(datacite_reg)

``````
``````

datacite_reg

Joining, by = "id"

``````
``````

In [67]:

``````
``````

vars n mean      sd median trimmed  mad min   max range skew kurtosis
X1    1 9 3516 7179.44      2    3516 2.97   0 20166 20166 1.42     0.37
se
X1 2393.15
vars n     mean      sd median  trimmed     mad  min    max  range skew
X1    1 9 48725.89 68212.6   6421 48725.89 7582.02 1307 183574 182267 0.85
kurtosis       se
X1    -1.03 22737.53

``````
``````

In [68]:

types <- relation_types %>%
  mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
  arrange(desc(total))

``````
``````

vars n     mean       sd median  trimmed      mad min    max  range skew
X1    1 7 89185.14 107557.8   9682 89185.14 13734.81 418 241222 240804 0.31
kurtosis       se
X1    -2.04 40653.04

``````
``````

In [21]:

citation <- citation_types %>%
  mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
  arrange(desc(total))

``````
``````

vars n   mean sd median trimmed mad    min    max range skew kurtosis se
X1    1 1 624723 NA 624723  624723   0 624723 624723     0   NA       NA NA

``````

# Links from Scholarly Resources to Article Publications

FIG Distribution of links from Scholarly Resources to Article Publications by type of citation. Links from dataset resources to scholarly articles make the biggest contribution to this dataset.

``````

In [22]:

citation_types <- meta$`citation-types`

types <- citation_types %>%
  mutate(total = sum(count), percentage = (count/total)*100, type=title, column="Type") %>%
  arrange(desc(percentage))
hundred_plot(head(types,7),"Types of Citation (%)",TRUE)

``````
``````

meta
vars n     mean       sd median  trimmed      mad  min    max  range skew
X1    1 7 117212.6 225054.3  42153 117212.6 32589.03 1771 624723 622952 1.58
kurtosis       se
X1     0.72 85062.53

``````
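The `percentage` column used for the plot above is each type's share of the overall count. Here is a small worked example on invented counts, written in base R. The data frame `citation_demo` and its values are assumptions for illustration only. It also shows why ranking by `percentage` is the useful ordering: `total` is identical in every row, so sorting on it does not rank the types.

```r
# Invented counts per citation type, for illustration only.
citation_demo <- data.frame(
  title = c("Text", "Dataset", "Software"),
  count = c(300, 600, 100),
  stringsAsFactors = FALSE
)

# Share of each type: count divided by the grand total, expressed in percent.
citation_demo$total <- sum(citation_demo$count)
citation_demo$percentage <- 100 * citation_demo$count / citation_demo$total

# Sorting on the constant total keeps the input order; sorting on percentage ranks the types.
by_total <- citation_demo[order(-citation_demo$total), ]
by_share <- citation_demo[order(-citation_demo$percentage), ]
```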
``````

In [23]:

pairings<-pairings %>% unnest(registrants)

``````
``````

In [24]:

pairings <- pairings %>%
  filter(startsWith(title, "datacite")) %>%
  mutate(datacenter=as.factor(title), publisher=as.factor(id1)) %>%
  arrange(desc(sum))
summary(pairings$count)

``````
``````

id                    title                 count   id1            title1         sum    datacenter            publisher

datacite.dk.gbif      datacite.dk.gbif      121258  crossref.4913  crossref.4913  58866  datacite.dk.gbif      crossref.4913
datacite.bl.ccdc      datacite.bl.ccdc      215049  crossref.316   crossref.316   58463  datacite.bl.ccdc      crossref.316
datacite.bl.ccdc      datacite.bl.ccdc      215049  crossref.292   crossref.292   45719  datacite.bl.ccdc      crossref.292
datacite.dk.gbif      datacite.dk.gbif      121258  crossref.2258  crossref.2258  42066  datacite.dk.gbif      crossref.2258
datacite.bl.ccdc      datacite.bl.ccdc      215049  crossref.311   crossref.311   38299  datacite.bl.ccdc      crossref.311
datacite.bl.ccdc      datacite.bl.ccdc      215049  crossref.78    crossref.78    37508  datacite.bl.ccdc      crossref.78
datacite.gesis.icpsr  datacite.gesis.icpsr  183574  crossref.78    crossref.78    37264  datacite.gesis.icpsr  crossref.78
datacite.gesis.icpsr  datacite.gesis.icpsr  183574  crossref.179   crossref.179   28642  datacite.gesis.icpsr  crossref.179
datacite.gesis.icpsr  datacite.gesis.icpsr  183574  crossref.311   crossref.311   24913  datacite.gesis.icpsr  crossref.311
datacite.gesis.icpsr  datacite.gesis.icpsr  183574  crossref.297   crossref.297   20670  datacite.gesis.icpsr  crossref.297

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1330    1901   59178   81446  183574  215049

``````

# Relationships between Data Centers and Publishers

Another interesting thing we can look at is the citation relationships between Publishers and Data Centers.

FIG Parallel set graph of data citations between particular Publishers and particular Data Centers. Publishers form the top category and Data Centers the bottom category. The width of each bar denotes the absolute number of citations for that Publisher-Data Center pair. The dataset corresponds to links collected as of September 2018.

``````

In [25]:

with(pairings, parallelset(datacenter, publisher,  freq=sum, col="#008888", alpha=0.4))

``````
``````

``````
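The bar widths in the parallel set graph above correspond to the number of citations per Data Center and Publisher pair. A minimal sketch of that aggregation on invented numbers, using base R's `xtabs`, follows. The data frame `pairings_demo` and its identifiers are hypothetical, and this is not how `parallelset` computes widths internally, which is not shown in this notebook.

```r
# Invented pairings: one row per Data Center and Publisher combination.
pairings_demo <- data.frame(
  datacenter = c("dc.a", "dc.a", "dc.b"),
  publisher = c("pub.1", "pub.2", "pub.1"),
  sum = c(30, 10, 60),
  stringsAsFactors = FALSE
)

# Cross-tabulate citation counts: each cell is the absolute width of one band.
flows <- xtabs(sum ~ datacenter + publisher, data = pairings_demo)

# Relative widths: each band as a fraction of all citations in the table.
flow_share <- prop.table(flows)
```

Pairs that never occur, such as `dc.b` with `pub.2` here, get a count of zero and would draw no band.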

## Highlighting examples of relationships between Data Centers and repositories

We highlight four examples:

• The Cambridge Crystallographic Data Centre
• PANGAEA
• Global Biodiversity Information Facility
• Inter-university Consortium for Political and Social Research
``````

In [26]:

pairings_h <- pairings %>%
  mutate(highlighted = ifelse(datacenter=="datacite.bl.ccdc","Yes","No"))

myt <- within(pairings_h, {
  highlighted <- factor(highlighted, levels=c("Yes","No"))
  color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(datacenter, publisher,  freq=sum, col=color, alpha=0.4))

``````
``````

``````
``````

In [27]:

pairings_h <- pairings %>%
  mutate(highlighted = ifelse(datacenter=="datacite.tib.pangaea","Yes","No"))

myt <- within(pairings_h, {
  highlighted <- factor(highlighted, levels=c("Yes","No"))
  color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(datacenter, publisher,  freq=sum, col=color, alpha=0.4))

``````
``````

``````
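The two highlighting cells above differ only in the Data Center identifier. As a sketch, the colour assignment could be factored into a small helper so that each highlighted example reuses it. The function name `highlight_colors` and its arguments are hypothetical; the two colours are the ones used in the cells above.

```r
# Hypothetical helper: one colour per row, teal for the target Data Center and grey otherwise.
highlight_colors <- function(datacenter, target, on = "#008888", off = "#9e99a3") {
  # as.character() lets the comparison work for both factors and character vectors.
  ifelse(as.character(datacenter) == target, on, off)
}

# Small demonstration on a factor, as `datacenter` is a factor in the notebook.
demo_centers <- factor(c("datacite.bl.ccdc", "datacite.tib.pangaea", "datacite.bl.ccdc"))
demo_colors <- highlight_colors(demo_centers, "datacite.tib.pangaea")

# In the notebook this could be used roughly as follows (not run here, since
# parallelset comes from the sourced graph_functions.r):
# with(pairings, parallelset(datacenter, publisher, freq = sum,
#                            col = highlight_colors(datacenter, "datacite.bl.ccdc"),
#                            alpha = 0.4))
```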