Exploratory Analysis Publisher Angle

by Kristian Garza

In this exploration, the first question we asked was: how big is our dataset? We are looking at events that represent links between journal articles and scholarly resources. The dataset contains around 50K links from journal articles to scholarly resources and around 900K links from scholarly resources to journal articles.


In [130]:
library(ggplot2)
library(plyr)
library(scales)
library(dplyr)
library(stringr)
library(RColorBrewer)
library(httr)
library(tidyr)
library(psych)
library(reshape)
source("../functions/graph_functions.r")

subject,year,count,percentage,sum
copper,2006,32,79,5255
silver,2006,4176,79,5255


In [131]:
load("../data/2018-10-08_source_crossref_meta.Rda",verbose=TRUE)

registrants <- meta$registrants
citation_types <- meta$`citation-types`
relation_types <- meta$`relation-types`
pairings <- meta$pairings


Loading objects:
  meta

In [132]:
citation_types<-citation_types %>% unnest(`year-months`)

In [133]:
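# Return the 2017 link total from a registrant's nested `years` entry.
flat_year<-function(years){
    x <- filter(years[[1]], title == "2017")
    return(x$sum)
}
# Same lookup for 2018.
flat_year_8<-function(years){
    x <- filter(years[[1]], title == "2018")
    return(x$sum)
}


# Copy each registrant's yearly totals into flat `2017` and `2018` columns,
# using 0 when a year has no entry. seq_len() avoids iterating over 1:0
# if `registrants` happens to be empty.
for (row in seq_len(nrow(registrants))) {
    first <- flat_year(registrants$years[row])
    second <- flat_year_8(registrants$years[row])
    if(length(first) == 0){
        first<-0
    }
    if(length(second) == 0){
        second<-0
    }

    registrants$`2017`[row] <- first
    registrants$`2018`[row] <- second
}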

registrants<-registrants %>% 
    mutate(m=((`2018`-`2017`)/(10000)), client=title, 
           `2018-p`=100*(`2018`/count),
           `2017-p`=100*(`2017`/count) ,
           `2018`=`2018`,
           `2017`=`2017` 
          ) %>% 
    filter(startsWith(title, "crossref")) %>% arrange(desc(`2018`))
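
The loop above pulls one yearly total per registrant out of the nested `years` column and falls back to 0 when a year is absent. As a minimal, dependency-free sketch of that same lookup on made-up data (the `toy` data frame, its values, and the `year_sum` helper are hypothetical and not part of the notebook's data), the extraction could also be written with `vapply`:

```r
# Hypothetical toy data mimicking the nested structure assumed above:
# each registrant has a `years` data frame with `title` (year) and `sum`.
toy <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
toy$years <- list(
  data.frame(title = c("2017", "2018"), sum = c(10, 25), stringsAsFactors = FALSE),
  data.frame(title = "2018", sum = 7, stringsAsFactors = FALSE)
)

# Return the total for one year, or 0 when that year has no entry.
year_sum <- function(years_df, year) {
  s <- years_df$sum[years_df$title == year]
  if (length(s) == 0) 0 else s[1]
}

# One numeric value per registrant and year.
toy$`2017` <- vapply(toy$years, year_sum, numeric(1), year = "2017")
toy$`2018` <- vapply(toy$years, year_sum, numeric(1), year = "2018")
```

Here registrant "b" has no 2017 entry, so its `2017` value is filled with 0, which matches the fallback used in the loop.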

In [134]:
load("../data/2018-10-10_crossref_registrants.Rda",verbose=TRUE)
registrants <- registrants %>% rowwise() %>% left_join(crossref_reg)


Loading objects:
  crossref_reg
Joining, by = "id"

In [135]:
plot_slopegraph(y_label="Article to Datasets Links", slope_df=head(registrants,8))


   vars n  mean     sd median trimmed   mad min max range skew kurtosis    se
X1    1 8 211.5 219.99    101   211.5 97.11  17 560   543 0.75     -1.4 77.78
   vars n mean     sd median trimmed   mad min max range skew kurtosis    se
X1    1 8  185 209.89   83.5     185 66.72  33 609   576 1.05    -0.67 74.21

Relationships between Data Centers and Publishers

Another interesting aspect to examine is the citation relationships between publishers and data centers.

FIG Parallel set graph of data citations between publishers and data centers. Publishers form the top category and data centers the bottom category. The width of each bar denotes the absolute number of citations for that publisher/data center pair. The dataset corresponds to links collected as of September 2018.
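
The bar widths in such a plot come from summing links per publisher/data center pair. A minimal sketch of that aggregation in base R, using invented identifiers and counts (the `links` data frame and everything in it are hypothetical, not values from the dataset):

```r
# Hypothetical link records: one row per batch of links between a pair.
links <- data.frame(
  publisher  = c("crossref.p1", "crossref.p1", "crossref.p2", "crossref.p2"),
  datacenter = c("datacite.d1", "datacite.d1", "datacite.d1", "datacite.d2"),
  n          = c(3, 2, 4, 1),
  stringsAsFactors = FALSE
)

# Sum links for every publisher/data center combination; each total is the
# frequency that a parallel set plot would encode as the width of one bar.
pair_totals <- aggregate(n ~ publisher + datacenter, data = links, FUN = sum)

# Order from the most to the least linked pair.
pair_totals <- pair_totals[order(-pair_totals$n), ]
```

The resulting `pair_totals` has one row per pair, which is the shape the `freq` argument in the cells below appears to expect.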


In [136]:
pairings<-pairings %>% unnest(registrants)

In [137]:
pairings<-pairings %>% filter(startsWith(title, "datacite")) %>% mutate(datacenter=as.factor(title),publisher=as.factor(id1))  %>%   
arrange(desc(sum))
# head(pairings,10)
# summary(pairings$count)

In [138]:
with(pairings, parallelset(publisher,datacenter,  freq=sum, col="#008888", alpha=0.4))



In [139]:
pairings <- meta$pairings
# head(pairings,2)
pairings<-pairings %>% unnest(registrants)  %>% filter(startsWith(title, "crossref")) %>% mutate(datacenter=as.factor(title1),publisher=as.factor(title)) %>% 
    arrange(desc(sum))
pairings <- pairings %>% rowwise() %>% left_join(crossref_reg)
# head(pairings,10)

# summary(pairings$count)


Joining, by = "id"

Highlighting Examples of Relationships between Data Centers and Publishers

We highlight the following examples:

  • Springer Nature
  • F1000Research
  • Dryad

In [140]:
pairings_h <- pairings %>%
    mutate(highlighted = ifelse(publisher=="crossref.4950","Yes","No")) 
    

myt <- within(pairings_h, {
  highlighted <- factor(highlighted, levels=c("Yes","No"))
  color <- ifelse(highlighted=="Yes","#008888","#9e99a3")
})
with(myt, parallelset(publisher, datacenter,  freq=sum, col=color, alpha=0.4))