Affiliations pivot table


In [1]:
from datetime import datetime
start = datetime.utcnow() # For measuring the total processing time

In [2]:
import json
from urllib.request import urlopen
import pandas as pd
import numpy as np



Get collection information from ArticleMeta


In [3]:
AMC_URL = "http://articlemeta.scielo.org/api/v1/collection/identifiers/"
amc_data = pd.DataFrame(json.load(urlopen(AMC_URL)))

In [4]:
amc_data.head(6)


Out[4]:
acron acron2 code document_count domain has_analytics is_active journal_count name original_name status type
0 arg ar arg 39006.0 www.scielo.org.ar True True {'deceased': 22, 'current': 125} {'en': 'Argentina', 'pt': 'Argentina', 'es': '... Argentina certified journals
1 chl cl chl 63467.0 www.scielo.cl True True {'deceased': 13, 'suspended': 1, 'current': 105} {'en': 'Chile', 'pt': 'Chile', 'es': 'Chile'} Chile certified journals
2 col co col 70213.0 www.scielo.org.co True True {'suspended': 7, 'current': 226} {'en': 'Colombia', 'pt': 'Colombia', 'es': 'Co... Colombia certified journals
3 cub cu cub 33492.0 scielo.sld.cu True True {'deceased': 2, 'suspended': 4, 'current': 61} {'en': 'Cuba', 'pt': 'Cuba', 'es': 'Cuba'} Cuba certified journals
4 esp es esp 37946.0 scielo.isciii.es True True {'deceased': 6, 'suspended': 11, 'current': 43} {'en': 'Spain', 'pt': 'Espanha', 'es': 'España'} España certified journals
5 mex mx mex 64406.0 www.scielo.org.mx True True {'deceased': 12, 'suspended': 44, 'current': 159} {'en': 'Mexico', 'pt': 'Mexico', 'es': 'Mexico'} Mexico certified journals

Some collections won't be analyzed, mainly to avoid duplicates (some articles appear in more than one collection). Part of the spa (Public Health) collection should be kept in the result, but it isn't a collection whose journals/articles are assigned to a single country. The collections below are each linked to a single country:


In [5]:
dont_evaluate = ["bio", "cci", "cic", "ecu", "psi", "pry", "rve", "rvo", "rvt", "sss", "spa", "wid"]
amc_names_map = {
    "code": "collection",
    "acron2": "origin",
}
amc_pairs = amc_data \
    [(amc_data["acron2"].str.len() == 2) &
     ~amc_data["code"].isin(dont_evaluate)] \
    [list(amc_names_map.keys())] \
    .rename(columns=amc_names_map) \
    .assign(origin=lambda df: df["origin"].str.upper())
amc_pairs


Out[5]:
collection origin
0 arg AR
1 chl CL
2 col CO
3 cub CU
4 esp ES
5 mex MX
6 prt PT
8 scl BR
11 sza ZA
12 ven VE
14 bol BO
15 cri CR
16 per PE
19 ury UY

ISSN selection from spa

The journals we keep from the spa collection are assigned the following origin countries:


In [6]:
spa_issn_country = pd.DataFrame([
    ("0021-2571", "IT"),
    ("0042-9686", "CH"),
    ("1020-4989", "US"),
    ("1555-7960", "US"),
], columns=["issn", "origin"])
spa_issn_country # For collection = "spa", only!


Out[6]:
issn origin
0 0021-2571 IT
1 0042-9686 CH
2 1020-4989 US
3 1555-7960 US

Affiliations dataset

This dataset is the Network spreadsheet/CSV pack, which can be downloaded from the SciELO Analytics report web page.

Unzip the CSV file


In [7]:
import zipfile
# Use the Zip file in jcatalog/data/scielo
# with zipfile.ZipFile('../../data/scielo/tabs_network_181203.zip', 'r') as zip_ref:
#     zip_ref.extract('documents_affiliations.csv', 'csv_files')
with zipfile.ZipFile('../../data/scielo/tabs_network_190210.zip', 'r') as zip_ref:
    zip_ref.extract('documents_affiliations.csv', 'csv_files')

In [8]:
dataset = pd.read_csv("csv_files/documents_affiliations.csv", keep_default_na=False)
dataset.head(3).T


Out[8]:
0 1 2
extraction date 2019-02-10 2019-02-10 2019-02-10
study unit document document document
collection scl scl scl
ISSN SciELO 0100-879X 0100-879X 0100-879X
ISSN's 0100-879X;1414-431X 0100-879X;1414-431X 0100-879X;1414-431X
title at SciELO Brazilian Journal of Medical and Biological Re... Brazilian Journal of Medical and Biological Re... Brazilian Journal of Medical and Biological Re...
title thematic areas Biological Sciences;Health Sciences Biological Sciences;Health Sciences Biological Sciences;Health Sciences
title is agricultural sciences 0 0 0
title is applied social sciences 0 0 0
title is biological sciences 1 1 1
title is engineering 0 0 0
title is exact and earth sciences 0 0 0
title is health sciences 1 1 1
title is human sciences 0 0 0
title is linguistics, letters and arts 0 0 0
title is multidisciplinary 0 0 0
title current status current current current
document publishing ID (PID SciELO) S0100-879X1998000800006 S0100-879X1998000800011 S0100-879X1998000800005
document publishing year 1998 1998 1998
document type research-article rapid-communication research-article
document is citable 1 1 1
document affiliation instituition University of Gorakhpur Universidade Estadual de Londrina Southern Sea Biology Institute
document affiliation country
document affiliation country ISO 3166
document affiliation state
document affiliation city

In [9]:
dataset.shape


Out[9]:
(1772877, 26)

We won't need all the information, and we can simplify the column names for the columns we need:


In [10]:
names_map = {
    "document publishing ID (PID SciELO)": "pid",
    "document affiliation country ISO 3166": "country",
    "document is citable": "is_citable",
    "ISSN SciELO": "issn",
    "collection": "collection",
    "document publishing year": "year",
}
cdf = dataset[list(names_map.keys())].rename(columns=names_map)
cdf[610_000::80_000] # cdf stands for "Country/Collection Data Frame"


Out[10]:
pid country is_citable issn collection year
610000 S0104-07072015000200424 BR 1 0104-0707 scl 2015
690000 S0080-62342016000500792 BR 1 0080-6234 scl 2016
770000 S0100-736X2018000100129 BR 1 0100-736X scl 2018
850000 S0325-00752010000300006 1 0325-0075 arg 2010
930000 S0370-41061977000400001 1 0370-4106 chl 1977
1010000 S0250-71612018000100261 AR 1 0250-7161 chl 2018
1090000 S1657-95342006000500005 CO 1 1657-9534 col 2006
1170000 S0012-73532015000600001 ES 1 0012-7353 col 2015
1250000 S1699-65852000000200001 0 1699-6585 esp 2000
1330000 S0035-001X2008000900002 FR 1 0035-001X mex 2008
1410000 S1405-33222016000200173 MX 1 1405-3322 mex 2016
1490000 S1516-37172002000100006 1 1516-3717 psi 2002
1570000 S1983-14472015000300086 BR 1 1983-1447 rve 2015
1650000 S0042-96862014000200086 KH 1 0042-9686 spa 2013
1730000 S2225-62532018000900007 1 2225-6253 sza 2018

Adding journal country as origin

The country column in the last dataframe is the affiliation country, not the journal/article origin country. Let's add the latter as a new origin column, grabbing it from the collection or, when the collection is spa, from the ISSN:


In [11]:
cdfwof = pd.concat([
    pd.merge(cdf[cdf["collection"] != "spa"], amc_pairs,        how="inner", on="collection"),
    pd.merge(cdf[cdf["collection"] == "spa"], spa_issn_country, how="inner", on="issn"),
])
cdfwof[610_000::80_000] # wof stands for "With Origin, Filtered"


Out[11]:
pid country is_citable issn collection year origin
610000 S0104-07072015000200424 BR 1 0104-0707 scl 2015 BR
690000 S0080-62342016000500792 BR 1 0080-6234 scl 2016 BR
770000 S0100-736X2018000100129 BR 1 0100-736X scl 2018 BR
850000 S0325-00752010000300006 1 0325-0075 arg 2010 AR
930000 S0370-41061977000400001 1 0370-4106 chl 1977 CL
1010000 S0250-71612018000100261 AR 1 0250-7161 chl 2018 CL
1090000 S1657-95342006000500005 CO 1 1657-9534 col 2006 CO
1170000 S0012-73532015000600001 ES 1 0012-7353 col 2015 CO
1250000 S0213-91112006000400011 ES 1 0213-9111 esp 2006 ES
1330000 S0188-77422007000200005 ES 1 0188-7742 mex 2007 MX
1410000 S0185-38802004000100004 1 0185-3880 mex 2004 MX
1490000 S0375-15892009000400002 ZA 1 0375-1589 sza 2009 ZA
1570000 S0048-77322013000100007 1 0048-7732 ven 2013 VE

The rows without an assignable origin have been removed:
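The removal happens because both merges use `how="inner"`: rows whose `collection` (or, for spa, `issn`) has no match in the lookup table are discarded. A minimal sketch with made-up data:

```python
import pandas as pd

# Toy data: one article from a collection we keep ("arg")
# and one from a collection absent from the lookup ("ecu").
articles = pd.DataFrame({
    "pid": ["A1", "A2"],
    "collection": ["arg", "ecu"],
})
pairs = pd.DataFrame({
    "collection": ["arg"],
    "origin": ["AR"],
})

# An inner merge keeps only rows whose "collection" has a match,
# so the unassignable "ecu" row is dropped.
merged = pd.merge(articles, pairs, how="inner", on="collection")
```

The same mechanism adds the `origin` column and filters in a single step.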

Check the remaining collections and the spa ISSNs


In [12]:
set(cdfwof.collection)


Out[12]:
{'arg',
 'bol',
 'chl',
 'col',
 'cri',
 'cub',
 'esp',
 'mex',
 'per',
 'prt',
 'scl',
 'spa',
 'sza',
 'ury',
 'ven'}

In [13]:
spa = cdfwof[cdfwof['collection'].str.contains('spa')]

In [14]:
set(spa.issn)


Out[14]:
{'0021-2571', '0042-9686', '1020-4989', '1555-7960'}

Add years


In [15]:
cdfwof["years"] = np.where(cdfwof['year'] <= 1996, 'ate_1996', cdfwof["year"])  # "ate_1996" means "up to 1996"
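This collapses every year up to 1996 into a single `ate_1996` bucket ("ate" is Portuguese "até", i.e. "up to"). A minimal sketch with made-up years, noting that `np.where` upcasts the remaining years to strings as well:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1980, 1996, 1997, 2018]})
# Years <= 1996 become the "ate_1996" bucket; because one branch is a
# string, np.where returns a string array, so 1997 becomes "1997".
df["years"] = np.where(df["year"] <= 1996, "ate_1996", df["year"])
```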

In [16]:
# compare
cdf.shape


Out[16]:
(1772877, 6)

In [17]:
cdfwof.shape


Out[17]:
(1595232, 8)

In [18]:
cdfwof[(cdfwof["pid"] == "S0004-27302009000900010")]


Out[18]:
pid country is_citable issn collection year origin years
286009 S0004-27302009000900010 BR 1 0004-2730 scl 2009 BR 2009
286010 S0004-27302009000900010 BR 1 0004-2730 scl 2009 BR 2009
286011 S0004-27302009000900010 BR 1 0004-2730 scl 2009 BR 2009

Country summary

Are the affiliation countries and the journal/origin country always the same? The goal now is to create a per-article summary of the affiliation countries by comparing them to the journal/origin country.


In [19]:
origin_country = cdfwof["country"] == cdfwof["origin"]

In [20]:
result = cdfwof.assign(
    origin_country=origin_country,
    other_country=~(origin_country | (cdfwof["country"] == "")),
    no_country=cdfwof["country"] == "",
).groupby("pid").sum().assign(  # sum() also adds up "year" and "is_citable" across an article's affiliation rows
    has_origin=lambda df: df["origin_country"].apply(bool),
    has_other=lambda df: df["other_country"].apply(bool),
    has_no=lambda df: df["no_country"].apply(bool),
).assign(
    has_both=lambda df: df["has_origin"] & df["has_other"],
    all_no=lambda df: ~(df["has_origin"] | df["has_other"]),
).applymap(int)

In [21]:
result[:20_000:2_500]


Out[21]:
is_citable year origin_country other_country no_country has_origin has_other has_no has_both all_no
pid
S0001-37141998000300001 1 1998 0 0 1 0 0 1 0 1
S0001-60022000000300009 1 2000 0 0 1 0 0 1 0 1
S0003-25732005000200015 1 2005 0 0 1 0 0 1 0 1
S0004-06222001000400001 1 2001 0 0 1 0 0 1 0 1
S0004-27302009000400004 1 2009 1 0 0 1 0 0 0 0
S0004-27492007000100009 4 8028 4 0 0 1 0 0 0 0
S0004-28032016000100010 3 6048 3 0 0 1 0 0 0 0
S0004-282X1990000300002 3 5970 0 3 0 0 1 0 0 0

Each row has an affiliation summary for a single article, identified by its PID. A brief explanation of the columns:

  • origin_country: Number of affiliations whose country is the origin country;
  • other_country: Number of affiliations whose country isn't the origin country;
  • no_country: Number of affiliations whose country is unknown;
  • has_origin: This article has at least one affiliation whose country is the origin country;
  • has_other: This article has at least one affiliation whose country isn't the origin country;
  • has_no: This article has at least one affiliation whose country is unknown;
  • has_both: This article has affiliations from both the origin country and another country;
  • all_no: All affiliations are from unknown countries.

The trailing (has_*/all_no) columns are booleans stored as the integers 1 (True) and 0 (False). Note that year and is_citable were summed across each article's affiliation rows, which is why they can exceed a single year or 1.
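The counting logic can be checked on a toy example: one hypothetical article (PID `X`) with one affiliation from its origin country and one foreign affiliation:

```python
import pandas as pd

# Two affiliation rows of the same article: one matches the origin
# country, the other does not.
aff = pd.DataFrame({
    "pid": ["X", "X"],
    "country": ["BR", "US"],
    "origin": ["BR", "BR"],
})
origin_country = aff["country"] == aff["origin"]
summary = aff.assign(
    origin_country=origin_country,
    other_country=~(origin_country | (aff["country"] == "")),
    no_country=aff["country"] == "",
).groupby("pid")[["origin_country", "other_country", "no_country"]].sum()
```

The boolean flags sum to per-article counts: one origin affiliation, one foreign, none unknown.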

Final result

Let's join the ISSN, collection and origin information to our analysis:


In [22]:
full_result = \
    pd.merge(result.reset_index(),
             cdfwof[["pid", "issn", "collection", "origin", "is_citable", "years"]].drop_duplicates(),
             how="left", on="pid") \
      .set_index("pid") \
      .sort_index()
full_result[7_500::30_000]


Out[22]:
is_citable_x year origin_country other_country no_country has_origin has_other has_no has_both all_no issn collection origin is_citable_y years
pid
S0004-06222001000400001 1 2001 0 0 1 0 0 1 0 1 0004-0622 ven VE 1 2001
S0025-76802017000500017 0 2017 1 0 0 1 0 0 0 0 0025-7680 arg AR 0 2017
S0034-89102008000100005 3 6024 1 2 0 1 1 0 1 0 0034-8910 scl BR 1 2008
S0048-77322001000100002 1 2001 0 0 1 0 0 1 0 1 0048-7732 ven VE 1 2001
S0100-204X2005001000001 4 8020 3 0 1 1 0 1 0 0 0100-204X scl BR 1 2005
S0100-84042001000500007 1 2001 1 0 0 1 0 0 0 0 0100-8404 scl BR 1 2001
S0102-311X2001000300019 3 6003 2 0 1 1 0 1 0 0 0102-311X scl BR 1 2001
S0103-18132017000100213 1 2017 1 0 0 1 0 0 0 0 0103-1813 scl BR 1 2017
S0103-97332001000300002 3 6003 0 1 2 0 1 1 0 0 0103-9733 scl BR 1 2001
S0120-04882016000100015 1 2016 1 0 0 1 0 0 0 0 0120-0488 col CO 1 2016
S0122-44092009000200003 1 2009 1 0 0 1 0 0 0 0 0122-4409 col CO 1 2009
S0187-73722007000100003 1 2007 0 0 1 0 0 1 0 1 0187-7372 mex MX 1 2007
S0258-65762007000200003 2 4014 1 0 1 1 0 1 0 0 0258-6576 ven VE 1 2007
S0378-48352006000300007 2 4012 0 0 2 0 0 1 0 1 0378-4835 esp ES 1 2006
S0718-07642005000600010 2 4010 0 2 0 0 1 0 0 0 0718-0764 chl CL 1 2005
S0864-02891999000200008 1 1999 1 0 0 1 0 0 0 0 0864-0289 cub CU 1 1999
S1025-02551997000200007 1 1997 1 0 0 1 0 0 0 0 1025-0255 cub CU 1 1997
S1405-74252012000300010 1 2012 0 0 1 0 0 1 0 1 1405-7425 mex MX 1 2012
S1414-81452017000300213 1 2017 1 0 0 1 0 0 0 0 1414-8145 scl BR 1 2017
S1516-73132001000200006 1 2001 0 0 1 0 0 1 0 1 1516-7313 scl BR 1 2001
S1645-37942012000100002 1 2012 1 0 0 1 0 0 0 0 1645-3794 prt PT 1 2012
S1678-58782006000100001 3 6018 2 0 1 1 0 1 0 0 1678-5878 scl BR 1 2006
S1794-24892016000100004 1 2016 0 0 1 0 0 1 0 1 1794-2489 col CO 1 2016
S1851-17322008000200001 1 2008 1 0 0 1 0 0 0 0 1851-1732 arg AR 1 2008
S1980-57642016000300227 3 6048 3 0 0 1 0 0 0 0 1980-5764 scl BR 1 2016
S2074-77052018000100006 1 2018 1 0 0 1 0 0 0 0 2074-7705 sza ZA 1 2018
S2317-17822017000400306 1 2017 1 0 0 1 0 0 0 0 2317-1782 scl BR 1 2017

Check


In [23]:
full_result[153234:154000].head(70)


Out[23]:
is_citable_x year origin_country other_country no_country has_origin has_other has_no has_both all_no issn collection origin is_citable_y years
pid
S0100-736X2011000600002 3 6033 2 0 1 1 0 1 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600003 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600004 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600005 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600006 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600007 5 10055 5 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600008 7 14077 7 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600009 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600010 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600011 4 8044 4 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600012 6 12066 6 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600013 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600014 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000600015 3 6033 1 2 0 1 1 0 1 0 0100-736X scl BR 1 2011
S0100-736X2011000700001 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700002 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700003 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700004 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700005 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700006 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700007 5 10055 5 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700008 4 8044 4 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700009 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700010 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700011 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700012 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700013 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700014 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000700015 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000800001 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
S0100-736X2011000800012 1 2011 0 0 1 0 0 1 0 1 0100-736X scl BR 1 2011
S0100-736X2011000800013 1 2011 0 0 1 0 0 1 0 1 0100-736X scl BR 1 2011
S0100-736X2011000800014 1 2011 0 0 1 0 0 1 0 1 0100-736X scl BR 1 2011
S0100-736X2011000800015 1 2011 0 0 1 0 0 1 0 1 0100-736X scl BR 1 2011
S0100-736X2011000900001 5 10055 4 1 0 1 1 0 1 0 0100-736X scl BR 1 2011
S0100-736X2011000900002 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900003 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900004 4 8044 2 0 2 1 0 1 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900005 5 10055 5 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900006 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900007 2 4022 1 1 0 1 1 0 1 0 0100-736X scl BR 1 2011
S0100-736X2011000900008 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900009 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900010 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900011 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900012 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900013 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900014 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011000900015 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000001 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000002 3 6033 2 0 1 1 0 1 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000003 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000004 3 6033 3 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000005 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000006 4 8044 4 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000007 1 2011 1 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000008 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000009 4 8044 4 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000010 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011
S0100-736X2011001000011 2 4022 2 0 0 1 0 0 0 0 0100-736X scl BR 1 2011

70 rows × 15 columns

Checking the result

The total affiliation count in the result should match exactly what we started with, no more and no less:


In [24]:
full_result[["origin_country", "other_country", "no_country"]].values.sum() == cdfwof.shape[0]


Out[24]:
True

In [25]:
full_result.shape


Out[25]:
(793648, 15)

In [26]:
print(f"Notebook processing duration: {datetime.utcnow() - start}")


Notebook processing duration: 0:00:20.383975

Pivot Table

Filter by is_citable

In [27]:
filter_citables = full_result.loc[(full_result['is_citable_y'] == 1)]
filter_citables.shape


Out[27]:
(700756, 15)

In [28]:
values_list = ["has_origin", "has_other", "has_no", "has_both", "all_no"]

td = filter_citables.pivot_table(
     index=["issn"],
     values=values_list,
     columns=["years"],
     aggfunc=np.count_nonzero,
     fill_value=0)
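What `pivot_table` with `aggfunc=np.count_nonzero` does here is count, per (`issn`, `years`) cell, how many articles have a nonzero flag. A toy sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up per-article summaries: two journals, two year buckets.
toy = pd.DataFrame({
    "issn": ["1111-1111", "1111-1111", "2222-2222"],
    "years": ["2017", "2018", "2017"],
    "has_origin": [1, 1, 0],
    "has_other": [0, 1, 1],
})
pt = toy.pivot_table(
    index=["issn"],
    values=["has_origin", "has_other"],
    columns=["years"],
    aggfunc=np.count_nonzero,  # count articles with the flag set
    fill_value=0,              # missing (issn, years) cells become 0
)
```

The result has a two-level column index of (flag, years) pairs, which is why the labels are flattened in the next step.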

In [29]:
td.T


Out[29]:
issn 0001-3714 0001-3765 0001-6002 0001-6365 0002-0591 0002-192X 0002-7014 0003-2573 0004-0592 0004-0614 ... 2504-3145 2518-4431 2520-9868 2526-8910 2531-0488 2531-1379 2545-7756 2594-1321 2595-3192 2619-6573
years
all_no 1997 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1998 16 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1999 0 0 0 40 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2000 0 12 20 45 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2001 0 20 22 45 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2002 0 1 3 48 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2003 0 1 4 50 0 14 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2004 0 1 17 40 0 7 0 25 0 0 ... 0 0 0 0 0 0 0 0 0 0
2005 0 1 26 36 0 9 2 39 0 24 ... 0 0 0 0 0 0 0 0 0 0
2006 0 0 12 54 0 10 0 35 0 14 ... 0 0 0 0 0 0 0 0 0 0
2007 0 0 26 81 0 3 4 40 0 17 ... 0 0 0 0 0 0 0 0 0 0
2008 0 1 38 42 0 18 1 20 0 11 ... 0 0 0 0 0 0 0 0 0 0
2009 0 4 10 42 0 11 6 5 9 9 ... 0 0 0 0 0 0 0 0 0 0
2010 0 0 16 18 0 18 0 18 6 12 ... 0 0 0 0 0 0 0 0 0 0
2011 0 0 10 0 0 3 0 7 11 0 ... 0 0 0 0 0 0 0 0 0 0
2012 0 1 30 0 17 1 0 1 5 0 ... 0 0 0 0 0 0 0 0 0 0
2013 0 0 34 0 20 1 0 1 6 0 ... 0 0 0 0 0 0 0 0 0 0
2014 0 0 32 0 9 2 0 5 12 0 ... 0 15 0 0 0 0 0 0 0 0
2015 0 0 30 0 0 0 0 4 0 0 ... 0 10 0 0 0 0 6 0 0 0
2016 0 1 19 0 0 0 0 2 0 0 ... 0 13 0 0 0 0 6 0 0 0
2017 0 2 12 0 0 0 0 0 0 0 ... 0 6 24 0 0 0 7 0 0 0
2018 0 1 5 0 0 0 0 2 0 0 ... 0 0 11 0 0 0 5 1 0 0
2019 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ate_1996 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
has_both 1997 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1998 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1999 8 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2000 0 6 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2001 0 8 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2002 0 6 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
has_origin 2015 0 167 6 0 0 0 0 21 0 0 ... 0 2 0 0 0 0 0 0 0 0
2016 0 172 9 0 0 0 0 17 0 0 ... 0 1 0 0 0 0 0 0 0 0
2017 0 265 11 0 0 0 0 18 0 0 ... 18 1 2 0 0 0 1 0 0 7
2018 0 244 11 0 0 0 0 17 0 0 ... 0 0 2 54 39 51 0 7 66 4
2019 0 1 4 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ate_1996 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
has_other 1997 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1998 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1999 11 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2000 0 11 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2001 0 17 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2002 0 11 2 2 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2003 0 10 3 0 0 2 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2004 0 42 2 4 0 4 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
2005 0 17 4 6 0 0 11 11 0 17 ... 0 0 0 0 0 0 0 0 0 0
2006 0 24 4 10 0 1 17 11 0 33 ... 0 0 0 0 0 0 0 0 0 0
2007 0 17 7 24 0 3 12 11 0 38 ... 0 0 0 0 0 0 0 0 0 0
2008 0 18 9 26 0 9 17 9 0 49 ... 0 0 0 0 0 0 0 0 0 0
2009 0 10 5 16 0 7 13 10 64 31 ... 0 0 0 0 0 0 0 0 0 0
2010 0 40 7 2 0 1 13 14 61 31 ... 0 0 0 0 0 0 0 0 0 0
2011 0 30 5 0 0 3 0 10 114 0 ... 0 0 0 0 0 0 0 0 0 0
2012 0 16 0 0 1 1 0 16 57 0 ... 0 0 0 0 0 0 0 0 0 0
2013 0 29 0 0 0 1 0 11 60 0 ... 0 0 0 0 0 0 0 0 0 0
2014 0 24 0 0 0 0 0 16 61 0 ... 0 0 0 0 0 0 0 0 0 0
2015 0 46 0 0 0 0 0 6 0 0 ... 0 0 0 0 0 0 0 0 0 0
2016 0 48 1 0 0 0 0 10 0 0 ... 0 1 0 0 0 0 0 0 0 0
2017 0 48 3 0 0 0 0 8 0 0 ... 5 1 0 0 0 0 0 0 0 10
2018 0 68 3 0 0 0 0 7 0 0 ... 0 0 0 7 8 16 0 0 4 7
2019 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ate_1996 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

120 rows × 1540 columns

Rename the labels for the CSV


In [30]:
# r maps each summary flag to its Portuguese label prefix
r = {"has_origin": "pais_",           # origin country
     "has_other": "estrang_",         # foreign country
     "has_no": "nao_ident_",          # country not identified
     "has_both": "pais_estrang_",     # both origin and foreign
     "all_no": "nao_ident_todos_",    # no country identified at all
    }
newlabel = []
for k in td.keys():
    newlabel.append(r[k[0]]+k[1])
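For reference, the loop above can also be written as a list comprehension over the pivot table's two-level columns; the `MultiIndex` below is a made-up two-column stand-in:

```python
import pandas as pd

# Hypothetical (flag, years) column pairs like the pivot table's.
cols = pd.MultiIndex.from_tuples([
    ("has_origin", "1997"),
    ("all_no", "ate_1996"),
])
r = {"has_origin": "pais_", "all_no": "nao_ident_todos_"}
# Same flattening as the loop: prefix per flag, year appended.
flat = [r[flag] + year for flag, year in cols]
```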

In [31]:
newlabel


Out[31]:
['nao_ident_todos_1997',
 'nao_ident_todos_1998',
 'nao_ident_todos_1999',
 'nao_ident_todos_2000',
 'nao_ident_todos_2001',
 'nao_ident_todos_2002',
 'nao_ident_todos_2003',
 'nao_ident_todos_2004',
 'nao_ident_todos_2005',
 'nao_ident_todos_2006',
 'nao_ident_todos_2007',
 'nao_ident_todos_2008',
 'nao_ident_todos_2009',
 'nao_ident_todos_2010',
 'nao_ident_todos_2011',
 'nao_ident_todos_2012',
 'nao_ident_todos_2013',
 'nao_ident_todos_2014',
 'nao_ident_todos_2015',
 'nao_ident_todos_2016',
 'nao_ident_todos_2017',
 'nao_ident_todos_2018',
 'nao_ident_todos_2019',
 'nao_ident_todos_ate_1996',
 'pais_estrang_1997',
 'pais_estrang_1998',
 'pais_estrang_1999',
 'pais_estrang_2000',
 'pais_estrang_2001',
 'pais_estrang_2002',
 'pais_estrang_2003',
 'pais_estrang_2004',
 'pais_estrang_2005',
 'pais_estrang_2006',
 'pais_estrang_2007',
 'pais_estrang_2008',
 'pais_estrang_2009',
 'pais_estrang_2010',
 'pais_estrang_2011',
 'pais_estrang_2012',
 'pais_estrang_2013',
 'pais_estrang_2014',
 'pais_estrang_2015',
 'pais_estrang_2016',
 'pais_estrang_2017',
 'pais_estrang_2018',
 'pais_estrang_2019',
 'pais_estrang_ate_1996',
 'nao_ident_1997',
 'nao_ident_1998',
 'nao_ident_1999',
 'nao_ident_2000',
 'nao_ident_2001',
 'nao_ident_2002',
 'nao_ident_2003',
 'nao_ident_2004',
 'nao_ident_2005',
 'nao_ident_2006',
 'nao_ident_2007',
 'nao_ident_2008',
 'nao_ident_2009',
 'nao_ident_2010',
 'nao_ident_2011',
 'nao_ident_2012',
 'nao_ident_2013',
 'nao_ident_2014',
 'nao_ident_2015',
 'nao_ident_2016',
 'nao_ident_2017',
 'nao_ident_2018',
 'nao_ident_2019',
 'nao_ident_ate_1996',
 'pais_1997',
 'pais_1998',
 'pais_1999',
 'pais_2000',
 'pais_2001',
 'pais_2002',
 'pais_2003',
 'pais_2004',
 'pais_2005',
 'pais_2006',
 'pais_2007',
 'pais_2008',
 'pais_2009',
 'pais_2010',
 'pais_2011',
 'pais_2012',
 'pais_2013',
 'pais_2014',
 'pais_2015',
 'pais_2016',
 'pais_2017',
 'pais_2018',
 'pais_2019',
 'pais_ate_1996',
 'estrang_1997',
 'estrang_1998',
 'estrang_1999',
 'estrang_2000',
 'estrang_2001',
 'estrang_2002',
 'estrang_2003',
 'estrang_2004',
 'estrang_2005',
 'estrang_2006',
 'estrang_2007',
 'estrang_2008',
 'estrang_2009',
 'estrang_2010',
 'estrang_2011',
 'estrang_2012',
 'estrang_2013',
 'estrang_2014',
 'estrang_2015',
 'estrang_2016',
 'estrang_2017',
 'estrang_2018',
 'estrang_2019',
 'estrang_ate_1996']

In [32]:
td.columns = newlabel

In [33]:
td.head(9).T


Out[33]:
issn 0001-3714 0001-3765 0001-6002 0001-6365 0002-0591 0002-192X 0002-7014 0003-2573 0004-0592
nao_ident_todos_1997 0 0 0 0 0 0 0 0 0
nao_ident_todos_1998 16 0 0 0 0 0 0 0 0
nao_ident_todos_1999 0 0 0 40 0 0 0 0 0
nao_ident_todos_2000 0 12 20 45 0 0 0 0 0
nao_ident_todos_2001 0 20 22 45 0 0 0 0 0
nao_ident_todos_2002 0 1 3 48 0 1 0 0 0
nao_ident_todos_2003 0 1 4 50 0 14 0 0 0
nao_ident_todos_2004 0 1 17 40 0 7 0 25 0
nao_ident_todos_2005 0 1 26 36 0 9 2 39 0
nao_ident_todos_2006 0 0 12 54 0 10 0 35 0
nao_ident_todos_2007 0 0 26 81 0 3 4 40 0
nao_ident_todos_2008 0 1 38 42 0 18 1 20 0
nao_ident_todos_2009 0 4 10 42 0 11 6 5 9
nao_ident_todos_2010 0 0 16 18 0 18 0 18 6
nao_ident_todos_2011 0 0 10 0 0 3 0 7 11
nao_ident_todos_2012 0 1 30 0 17 1 0 1 5
nao_ident_todos_2013 0 0 34 0 20 1 0 1 6
nao_ident_todos_2014 0 0 32 0 9 2 0 5 12
nao_ident_todos_2015 0 0 30 0 0 0 0 4 0
nao_ident_todos_2016 0 1 19 0 0 0 0 2 0
nao_ident_todos_2017 0 2 12 0 0 0 0 0 0
nao_ident_todos_2018 0 1 5 0 0 0 0 2 0
nao_ident_todos_2019 0 0 0 0 0 0 0 0 0
nao_ident_todos_ate_1996 0 0 0 0 0 0 0 0 0
pais_estrang_1997 0 0 0 0 0 0 0 0 0
pais_estrang_1998 0 0 0 0 0 0 0 0 0
pais_estrang_1999 8 0 0 0 0 0 0 0 0
pais_estrang_2000 0 6 0 1 0 0 0 0 0
pais_estrang_2001 0 8 0 0 0 0 0 0 0
pais_estrang_2002 0 6 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
pais_2015 0 167 6 0 0 0 0 21 0
pais_2016 0 172 9 0 0 0 0 17 0
pais_2017 0 265 11 0 0 0 0 18 0
pais_2018 0 244 11 0 0 0 0 17 0
pais_2019 0 1 4 0 0 0 0 0 0
pais_ate_1996 0 0 0 0 0 0 0 0 0
estrang_1997 0 0 0 0 0 0 0 0 0
estrang_1998 3 0 0 0 0 0 0 0 0
estrang_1999 11 0 0 0 0 0 0 0 0
estrang_2000 0 11 0 1 0 0 0 0 0
estrang_2001 0 17 0 0 0 0 0 0 0
estrang_2002 0 11 2 2 0 0 0 0 0
estrang_2003 0 10 3 0 0 2 0 0 0
estrang_2004 0 42 2 4 0 4 0 1 0
estrang_2005 0 17 4 6 0 0 11 11 0
estrang_2006 0 24 4 10 0 1 17 11 0
estrang_2007 0 17 7 24 0 3 12 11 0
estrang_2008 0 18 9 26 0 9 17 9 0
estrang_2009 0 10 5 16 0 7 13 10 64
estrang_2010 0 40 7 2 0 1 13 14 61
estrang_2011 0 30 5 0 0 3 0 10 114
estrang_2012 0 16 0 0 1 1 0 16 57
estrang_2013 0 29 0 0 0 1 0 11 60
estrang_2014 0 24 0 0 0 0 0 16 61
estrang_2015 0 46 0 0 0 0 0 6 0
estrang_2016 0 48 1 0 0 0 0 10 0
estrang_2017 0 48 3 0 0 0 0 8 0
estrang_2018 0 68 3 0 0 0 0 7 0
estrang_2019 0 0 0 0 0 0 0 0 0
estrang_ate_1996 0 0 0 0 0 0 0 0 0

120 rows × 9 columns


In [34]:
td.to_csv("output/td_documents_affiliations_network.csv")
# td.to_csv("output/td_affi_bra_190123.csv")

In [35]:
print(f"Notebook processing duration: {datetime.utcnow() - start}")


Notebook processing duration: 0:00:22.133814