PyScopus: An example of author disambiguation
Scopus sometimes mixes up people with similar names. I recently came up with a fairly simple method to clean author publication profiles, though it requires some manual work.
If you can think of a better way, please do let me know!
In [3]:
import pyscopus
pyscopus.__version__
Out[3]:
In [4]:
from pyscopus import Scopus
key = 'YOUR_OWN_APIKEY'
scopus = Scopus(key)
Wrapper functions for disambiguation
In [5]:
import requests, time
import pandas as pd
from bs4 import BeautifulSoup as Soup
BASE_URL = "https://api.elsevier.com/content/abstract/scopus_id/"
In [6]:
def _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
    r = requests.get(BASE_URL + sid, params={'apikey': apikey})
    soup = Soup(r.content, 'lxml')
    author_list = soup.find('authors').find_all('author')
    ## go through the author list to find the author first, by matching author id
    for au in author_list:
        if au['auid'] == author_id:
            ## found the author; check the affiliation ids
            ## note that an author may have a list of affiliations
            this_affil_id_list = [affil_tag['id'] for affil_tag in au.find_all('affiliation')]
            ## keep the paper if any of them overlaps with the known affiliation ids
            if len(set(author_affil_id_list).intersection(this_affil_id_list)) > 0:
                return True
            break
    return False
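To make the matching logic concrete, here is how the check behaves on a simplified, made-up response snippet that mimics only the tags the parser above reads (real Scopus abstract responses contain many more fields):

```python
from bs4 import BeautifulSoup as Soup

## simplified, invented XML mimicking the tags used above
xml = """
<authors>
  <author auid="111"><affiliation id="60030623"></affiliation></author>
  <author auid="222"><affiliation id="12345678"></affiliation></author>
</authors>
"""

soup = Soup(xml, 'lxml')
author_affil_id_list = ['60030623', '60022195']  # known affiliation ids

for au in soup.find('authors').find_all('author'):
    if au['auid'] == '111':  # the author id we are matching
        this_affil_id_list = [a['id'] for a in au.find_all('affiliation')]

## a non-empty overlap means the paper is kept
print(set(author_affil_id_list).intersection(this_affil_id_list))
```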
In [7]:
import random

def check_pub_validity(scopus_obj, author_id, author_affil_id_list, apikey):
    ## first find out all publications by this author id
    pub_df = scopus_obj.search_author_publication(author_id)
    ## keep only the rows with non-null scopus ids
    pub_df = pub_df[~pub_df.scopus_id.isnull()]
    ## list to save all eligible scopus ids
    eligible_scopus_id_list = list()
    for i, sid in enumerate(pub_df.scopus_id.values):
        ## pause briefly every five requests to avoid hammering the API
        if (i + 1) % 5 == 0:
            time.sleep(random.random() + 0.3)
        if _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
            ## if valid, save it
            eligible_scopus_id_list.append(sid)
    ## finally, get a subset of the original pub_df
    filtered_pub_df = pub_df.query("scopus_id in @eligible_scopus_id_list")
    return filtered_pub_df
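The last line of `check_pub_validity` uses pandas' `query` with an `@`-prefixed variable to subset the table by the collected ids; a minimal toy example (ids made up):

```python
import pandas as pd

## toy publication table with made-up scopus ids
pub_df = pd.DataFrame({'scopus_id': ['1', '2', '3', '4'],
                       'title': ['a', 'b', 'c', 'd']})
eligible_scopus_id_list = ['2', '4']

## '@' lets query() reference a local Python variable
filtered_pub_df = pub_df.query("scopus_id in @eligible_scopus_id_list")
print(filtered_pub_df.title.tolist())
```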
When I was collecting data for my own research, I found that Dr. Vivek K. Singh has a very noisy profile in Scopus. Let's use this as an example.
The basic idea is to match author-affiliation pairs:
For Dr. Singh, I manually obtained his affiliation ids through the Scopus affiliation search. With those in hand, create a dictionary containing the author's name (first/last), affiliation name, and a list of affiliation ids. The author and affiliation names are used to search for the author; the list of affiliation ids is used for cleaning papers:
60007278
60022195
60030623
In [8]:
d = {'authfirst': 'Vivek', 'authlastname': 'Singh', 'affiliation': 'Rutgers',
'affil_id_list': ['60030623', '60022195', '60007278']
}
d
Out[8]:
In [9]:
query = "AUTHLASTNAME({}) and AUTHFIRST({}) and AFFIL({})".format(d['authlastname'], d['authfirst'], d['affiliation'])
author_search_df = scopus.search_author(query)
author_search_df
Out[9]:
Sometimes such a search returns a list of candidate author profiles. In this case, we have only one, and it is clear that the profile is highly noisy.
In the following step, I use the helper functions in utils to screen each paper by this author_id.
In [10]:
author_id = '7404651152'
author_id, d['affil_id_list']
Out[10]:
The filtering process may take a while, depending on how many documents are mixed up.
In [12]:
filterd_pub_df = check_pub_validity(scopus, author_id, d['affil_id_list'], key)
filterd_pub_df.shape[0], filterd_pub_df.scopus_id.unique().size, filterd_pub_df.scopus_id.isnull().sum()
Out[12]:
Obviously, the number of papers is greatly reduced. We can now check a random subset to see whether the filtered papers make sense for this author.
In [13]:
filterd_pub_df[['title', 'publication_name']].sample(20)
Out[13]:
However, there may still be noise in it (e.g., papers published in optics/photonics venues). We can manually exclude those as well:
In [14]:
exclude_terms = ['optic', 'photonic', 'nano', 'quantum', 'sensor',
                 'cleo', 'materials', 'physics', 'chip']
## drop papers whose venue name contains any of the excluded terms
mask = filterd_pub_df.publication_name.str.lower().str.contains('|'.join(exclude_terms))
filterd_pub_df = filterd_pub_df[~mask]
filterd_pub_df.shape
Out[14]:
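Hard-coding one exclusion per keyword gets tedious; the same idea can be wrapped in a small reusable helper (a sketch on a made-up toy table; `exclude_venues` is my own name, not part of pyscopus):

```python
import pandas as pd

def exclude_venues(df, terms):
    """Drop rows whose publication_name contains any of the given terms."""
    mask = df.publication_name.str.lower().str.contains('|'.join(terms))
    return df[~mask]

## toy example with made-up venue names
toy = pd.DataFrame({'title': ['t1', 't2', 't3'],
                    'publication_name': ['Optics Letters', 'ACM CHI', 'Nano Today']})
print(exclude_venues(toy, ['optic', 'nano']).publication_name.tolist())
```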
And let's check again:
In [15]:
filterd_pub_df[['title', 'publication_name']].sample(20)
Out[15]:
Now it looks much better, and we can use this cleaned paper list for the focal author.