In [1]:
from datetime import datetime
start = datetime.utcnow() # For measuring the total processing time
In [2]:
import json
from urllib.request import urlopen
import pandas as pd
import numpy as np
In [3]:
AMC_URL = "http://articlemeta.scielo.org/api/v1/collection/identifiers/"
amc_data = pd.DataFrame(json.load(urlopen(AMC_URL)))
In [4]:
amc_data.head(6)
Out[4]:
Some collections won't be analyzed, mainly to avoid duplicates (there are articles in more than one collection). Part of the spa (Public Health) collection should be kept in the result, but it's not a collection whose journals/articles are assigned to a single country. The collections kept below are each linked to a single country:
In [5]:
dont_evaluate = ["bio", "cci", "cic", "ecu", "psi", "pry", "rve", "rvo", "rvt", "sss", "spa", "wid"]
amc_names_map = {
    "code": "collection",
    "acron2": "origin",
}
amc_pairs = amc_data \
    [(amc_data["acron2"].str.len() == 2) &
     ~amc_data["code"].isin(dont_evaluate)] \
    [list(amc_names_map.keys())] \
    .rename(columns=amc_names_map) \
    .assign(origin=lambda df: df["origin"].str.upper())
amc_pairs
Out[5]:
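As a quick sanity check (a sketch, assuming amc_pairs from the cell above), every collection kept should map to exactly one origin country:
# Expected to be 1: no kept collection maps to more than one origin country
amc_pairs.groupby("collection")["origin"].nunique().max()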
The journals in the spa collection have the following origin countries:
In [6]:
spa_issn_country = pd.DataFrame([
    ("0021-2571", "IT"),
    ("0042-9686", "CH"),
    ("1020-4989", "US"),
    ("1555-7960", "US"),
], columns=["issn", "origin"])
spa_issn_country # For collection = "spa", only!
Out[6]:
This dataset is the Network spreadsheet/CSV pack, which can be found on the SciELO Analytics report web page. Its first three rows are:
In [7]:
import zipfile
# Use the Zip file in jcatalog/data/scielo
# with zipfile.ZipFile('../../data/scielo/tabs_network_181203.zip', 'r') as zip_ref:
#     zip_ref.extract('documents_affiliations.csv', 'csv_files')
with zipfile.ZipFile('../../data/scielo/tabs_network_190210.zip', 'r') as zip_ref:
    zip_ref.extract('documents_affiliations.csv', 'csv_files')
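If the archive layout is in doubt, its contents can be listed before (or after) extracting (a sketch, using the same local zip path as the cell above):
import zipfile
with zipfile.ZipFile('../../data/scielo/tabs_network_190210.zip', 'r') as zip_ref:
    print(zip_ref.namelist())  # documents_affiliations.csv should be among the entries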
In [8]:
dataset = pd.read_csv("csv_files/documents_affiliations.csv", keep_default_na=False)
dataset.head(3).T
Out[8]:
In [9]:
dataset.shape
Out[9]:
We won't need all the information, and we can simplify the column names for the columns we need:
In [10]:
names_map = {
    "document publishing ID (PID SciELO)": "pid",
    "document affiliation country ISO 3166": "country",
    "document is citable": "is_citable",
    "ISSN SciELO": "issn",
    "collection": "collection",
    "document publishing year": "year",
}
cdf = dataset[list(names_map.keys())].rename(columns=names_map)
cdf[610_000::80_000] # cdf stands for "Country/Collection Data Frame"
Out[10]:
The country column in the last dataframe is the affiliation country, not the journal/article origin country. Let's add the latter as a new origin column, grabbing it from the collection, or from the ISSN when the collection is spa:
In [11]:
cdfwof = pd.concat([
    pd.merge(cdf[cdf["collection"] != "spa"], amc_pairs, how="inner", on="collection"),
    pd.merge(cdf[cdf["collection"] == "spa"], spa_issn_country, how="inner", on="issn"),
])
cdfwof[610_000::80_000] # wof stands for "With Origin, Filtered"
Out[11]:
The rows without an assignable origin have been removed:
In [12]:
set(cdfwof.collection)
Out[12]:
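A quick sketch (assuming cdf and cdfwof from the cells above) of how many affiliation rows were dropped, and which collections disappeared entirely:
print(cdf.shape[0] - cdfwof.shape[0])  # affiliation rows without an assignable origin
sorted(set(cdf["collection"]) - set(cdfwof["collection"]))  # collections removed from the analysis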
In [13]:
spa = cdfwof[cdfwof['collection'].str.contains('spa')]
In [14]:
set(spa.issn)
Out[14]:
In [15]:
cdfwof["years"] = np.where(cdfwof['year'] <= 1996, 'ate_1996', cdfwof["year"])
In [16]:
# Compare the row counts before and after assigning an origin
cdf.shape
Out[16]:
In [17]:
cdfwof.shape
Out[17]:
In [18]:
# Spot check: all affiliation rows for a single article (PID)
cdfwof[(cdfwof["pid"] == "S0004-27302009000900010")]
Out[18]:
Are the affiliation countries and the journal/origin country always the same? The goal now is to create a summary of the affiliation countries by comparing them to the journal/origin country.
In [19]:
origin_country = cdfwof["country"] == cdfwof["origin"]
In [20]:
result = cdfwof.assign(
    origin_country=origin_country,
    other_country=~(origin_country | (cdfwof["country"] == "")),
    no_country=cdfwof["country"] == "",
).groupby("pid").sum().assign(
    has_origin=lambda df: df["origin_country"].apply(bool),
    has_other=lambda df: df["other_country"].apply(bool),
    has_no=lambda df: df["no_country"].apply(bool),
).assign(
    has_both=lambda df: df["has_origin"] & df["has_other"],
    all_no=lambda df: ~(df["has_origin"] | df["has_other"]),
).applymap(int)
In [21]:
result[:20_000:2_500]
Out[21]:
Each row has an affiliation summary for a single article, identified by its PID. A brief explanation of the columns:
- origin_country: Number of affiliations whose country is the origin country;
- other_country: Number of affiliations whose country isn't the origin country;
- no_country: Number of affiliations whose country is unknown;
- has_origin: This article has at least one affiliation whose country is the origin country;
- has_other: This article has at least one affiliation whose country isn't the origin country;
- has_no: This article has at least one affiliation whose country is unknown;
- has_both: This article has affiliations from both the origin country and another country;
- all_no: All affiliations are from unknown countries.

The trailing columns are represented by the integers 1 (meaning True) and 0 (meaning False).
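As a minimal toy illustration of how these counts arise (hypothetical PIDs and countries, not taken from the dataset), consider one article from a BR journal with one BR and one US affiliation, and another with a single unidentified affiliation:
toy = pd.DataFrame({
    "pid": ["A", "A", "B"],
    "country": ["BR", "US", ""],
    "origin": ["BR", "BR", "BR"],
})
toy_origin = toy["country"] == toy["origin"]
toy.assign(
    origin_country=toy_origin,
    other_country=~(toy_origin | (toy["country"] == "")),
    no_country=toy["country"] == "",
)[["pid", "origin_country", "other_country", "no_country"]].groupby("pid").sum()
# Article "A" gets origin_country=1 and other_country=1 (so has_both would be 1);
# article "B" gets only no_country=1 (so all_no would be 1).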
Let's join the ISSN, collection and origin information to our analysis:
In [22]:
full_result = \
    pd.merge(result.reset_index(),
             cdfwof[["pid", "issn", "collection", "origin", "is_citable", "years"]].drop_duplicates(),
             how="left", on="pid") \
    .set_index("pid") \
    .sort_index()
full_result[7_500::30_000]
Out[22]:
In [23]:
full_result[153234:154000].head(70)
Out[23]:
There should be no more affiliations than we had when we started... nor fewer...
In [24]:
full_result[["origin_country", "other_country", "no_country"]].values.sum() == cdfwof.shape[0]
Out[24]:
In [25]:
full_result.shape
Out[25]:
In [26]:
print(f"Notebook processing duration: {datetime.utcnow() - start}")
In [27]:
# Keep only citable documents. is_citable_y is the is_citable column contributed
# by cdfwof in the merge above (the _y suffix comes from a column-name collision).
filter_citables = full_result.loc[(full_result['is_citable_y'] == 1)]
filter_citables.shape
Out[27]:
In [28]:
values_list = ["has_origin", "has_other", "has_no", "has_both", "all_no"]
td = filter_citables.pivot_table(
    index=["issn"],
    values=values_list,
    columns=["years"],
    aggfunc=np.count_nonzero,
    fill_value=0)
In [29]:
td.T
Out[29]:
In [30]:
# r maps each flag column to its Portuguese output prefix:
# pais_ = origin country, estrang_ = foreign, nao_ident_ = not identified,
# pais_estrang_ = both origin and foreign, nao_ident_todos_ = none identified
r = {"has_origin": "pais_",
     "has_other": "estrang_",
     "has_no": "nao_ident_",
     "has_both": "pais_estrang_",
     "all_no": "nao_ident_todos_",
}
# Flatten the (flag, year) MultiIndex columns into single "prefix + year" labels
newlabel = []
for k in td.keys():
    newlabel.append(r[k[0]] + k[1])
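Equivalently, the flattening can be written as a comprehension over the (flag, year) column pairs (a sketch, assuming the year labels are strings as produced above):
newlabel = [r[flag] + year for flag, year in td.columns]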
In [31]:
newlabel
Out[31]:
In [32]:
td.columns = newlabel
In [33]:
td.head(9).T
Out[33]:
In [34]:
td.to_csv("output/td_documents_affiliations_network.csv")
# td.to_csv("output/td_affi_bra_190123.csv")
In [35]:
print(f"Notebook processing duration: {datetime.utcnow() - start}")