PyScopus: An example of author disambiguation

Scopus sometimes merges the publications of different people with similar names into a single author profile. I recently came up with a fairly simple method for cleaning such author publication profiles, although it still requires some manual work.

If you can think of a better way, please do let me know!



In [3]:

    
import pyscopus
pyscopus.__version__









    Out[3]:





'1.0.1'



In [4]:

    
from pyscopus import Scopus
key = 'YOUR_OWN_APIKEY'
scopus = Scopus(key)

Wrapper functions for disambiguation



In [5]:

    
import requests, time
import pandas as pd
from bs4 import BeautifulSoup as Soup
BASE_URL = "https://api.elsevier.com/content/abstract/scopus_id/"



In [6]:

    
def _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
    ## retrieve the abstract record for this scopus id
    r = requests.get(BASE_URL+sid, params={'apikey': apikey})
    soup = Soup(r.content, 'lxml')
    authors_tag = soup.find('authors')
    if authors_tag is None:
        ## no author list in the response (e.g., a failed request): discard
        return False

    ## go through the author list to find the focal author, by matching author id
    for au in authors_tag.find_all('author'):
        if au.get('auid') == author_id:
            ## found: stop searching
            break
    else:
        ## the focal author is not listed on this paper: discard
        return False

    ## check the affiliation id: note that an author may have a list of affiliations
    this_affil_id_list = [affil_tag.get('id') for affil_tag in au.find_all('affiliation')]
    ## keep the paper only if the two affiliation id lists overlap
    return len(set(author_affil_id_list).intersection(this_affil_id_list)) > 0
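To see what the helper does without calling the API, here is a minimal offline sketch of the same lookup. The XML fragment is a hypothetical, heavily simplified stand-in for a real abstract response (which carries namespaces and many more fields), the co-author id `111` and affiliation id `60000000` are made up, and the standard-library `xml.etree.ElementTree` is used in place of BeautifulSoup so that the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; not the real Scopus response format.
SAMPLE_XML = """
<abstract>
  <authors>
    <author auid="111"><affiliation id="60000000"/></author>
    <author auid="7404651152">
      <affiliation id="60030623"/>
      <affiliation id="60022195"/>
    </author>
  </authors>
</abstract>
""".strip()


def affiliation_ids_for(xml_text, author_id):
    """Return the affiliation ids listed for author_id, or None if the author is absent."""
    root = ET.fromstring(xml_text)
    for au in root.iter('author'):
        if au.get('auid') == author_id:
            return [aff.get('id') for aff in au.iter('affiliation')]
    return None


found = affiliation_ids_for(SAMPLE_XML, '7404651152')
missing = affiliation_ids_for(SAMPLE_XML, '999')
print(found, missing)
```

Returning `None` for an absent author makes the "not on this paper" case explicit, so it cannot be confused with an author who is listed without any affiliation.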



In [7]:

    
import random

def check_pub_validity(scopus_obj, author_id, author_affil_id_list, apikey):
    ## first find all publications listed under this author id
    pub_df = scopus_obj.search_author_publication(author_id)
    ## do this for all non-null scopus ids
    pub_df = pub_df[~pub_df.scopus_id.isnull()]
    ## list to save all eligible scopus ids
    eligible_scopus_id_list = list()
    for i, sid in enumerate(pub_df.scopus_id.values):
        if (i+1)%5==0:
            ## pause briefly before every fifth request to go easy on the API
            ## (pd.np is not available in pandas 2.0+, so use random instead)
            time.sleep(random.random()+.3)
        if _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
            ## if true, save it
            eligible_scopus_id_list.append(sid)
    ## finally, get a subset of the original pub_df
    filtered_pub_df = pub_df.query("scopus_id in @eligible_scopus_id_list")
    return filtered_pub_df
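The pause-every-fifth-request pattern in the loop above can also be factored out into a small generator. This is only a sketch of one possible arrangement: the name `throttled` and its parameters are hypothetical, and the `sleep` argument exists purely so the behavior can be checked without actually waiting.

```python
import random
import time


def throttled(items, every=5, min_pause=0.3, max_pause=1.3, sleep=time.sleep):
    """Yield items, pausing for a random interval before every `every`-th item."""
    for i, item in enumerate(items, start=1):
        if i % every == 0:
            sleep(random.uniform(min_pause, max_pause))
        yield item


# Record the pauses instead of sleeping, to inspect the behavior quickly.
pauses = []
seen = list(throttled(range(12), sleep=pauses.append))
print(len(seen), len(pauses))  # 12 2
```

With such a helper, the main loop would iterate over `throttled(pub_df.scopus_id.values)` and contain only the validity check.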

When I was collecting data for my own research, I found that Dr. Vivek K. Singh has a very noisy profile in Scopus. Let's use this as an example.

The basic idea is to match author-affiliation pairs:

For each paper found in the mixed profile:
- Find the focal author (in this case, Dr. Singh)
- Look at his/her affiliation on that paper
  - Keep the paper if the affiliation is one where he/she has actually worked
  - If not, discard the paper
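The rule above can be sketched on toy data without touching the API. The `(author_id, [affiliation_id, ...])` tuples are a hypothetical simplified representation of a paper's author list; the focal author id and the three known affiliation ids are the ones used in this notebook, while the co-author id `111` and affiliation id `60000000` are made up.

```python
def keep_paper(paper_authors, focal_author_id, known_affil_ids):
    """Return True if the focal author is on the paper with a known affiliation.

    paper_authors: list of (author_id, [affiliation_id, ...]) tuples.
    """
    known = set(known_affil_ids)
    for author_id, affil_ids in paper_authors:
        if author_id == focal_author_id:
            # Keep the paper only if at least one affiliation id overlaps.
            return bool(known.intersection(affil_ids))
    # The focal author is not listed on this paper at all.
    return False


known_affil_ids = ['60030623', '60022195', '60007278']
paper_a = [('111', ['60000000']), ('7404651152', ['60030623'])]
paper_b = [('7404651152', ['60017187'])]

decisions = [keep_paper(p, '7404651152', known_affil_ids) for p in (paper_a, paper_b)]
print(decisions)  # [True, False]
```

Paper A is kept because the focal author appears with one of the known affiliations; paper B is discarded because the affiliation on that paper is not in the list.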

For Dr. Singh, I obtained his affiliation ids manually through the Scopus affiliation search. With those in hand, create a dictionary containing the author's first and last name, an affiliation name, and a list of affiliation ids. The author and affiliation names are used to search for the author profile; the list of affiliation ids is used to filter the papers:

UC Irvine 60007278
MIT 60022195
Rutgers 60030623



In [8]:

    
d = {'authfirst': 'Vivek', 'authlastname': 'Singh', 'affiliation': 'Rutgers',
     'affil_id_list': ['60030623', '60022195', '60007278']
    }
d









    Out[8]:





{'authfirst': 'Vivek',
 'authlastname': 'Singh',
 'affiliation': 'Rutgers',
 'affil_id_list': ['60030623', '60022195', '60007278']}



In [9]:

    
query = "AUTHLASTNAME({}) and AUTHFIRST({}) and AFFIL({})".format(d['authlastname'], d['authfirst'], d['affiliation'])
author_search_df = scopus.search_author(query)
author_search_df









    









    Out[9]:







  
    
      
	author_id	name	document_count	affiliation	affiliation_id
0	7404651152	Vivek Kumar N. Singh	491	Shri Mata Vaishno Devi University	60017187

Sometimes the search returns several author profiles for one author. In this case there is only one, and its document count together with its listed affiliation suggests that the profile is highly noisy.

In the following step, I use the helper functions defined above to screen each paper listed under this author_id.



In [10]:

    
author_id = '7404651152'
author_id, d['affil_id_list']









    Out[10]:





('7404651152', ['60030623', '60022195', '60007278'])

The filtering process may take a while, depending on how many documents are mixed up.



In [12]:

    
filterd_pub_df = check_pub_validity(scopus, author_id, d['affil_id_list'], key)
filterd_pub_df.shape[0], filterd_pub_df.scopus_id.unique().size, filterd_pub_df.scopus_id.isnull().sum()









    Out[12]:





(134, 134, 0)

The filter reduces the list substantially: the profile reports 491 documents, and only 134 remain. We can now check a random subset to see whether the remaining papers make sense for this author.



In [13]:

    
## draw 20 random rows (with replacement, so a row may appear more than once)
filterd_pub_df.sample(n=20, replace=True)[['title', 'publication_name']]









    Out[13]:







  
    
      
	title	publication_name
118	Effects of high-energy irradiation on silicon ...	Optics InfoBase Conference Papers
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
141	Low-stress silicon nitride for mid-infrared mi...	Optics InfoBase Conference Papers
194	Mid-infrared silicon waveguide resonators with...	Materials Research Society Symposium Proceedings
67	Effects of high-energy irradiation on silicon ...	Optics InfoBase Conference Papers
194	Mid-infrared silicon waveguide resonators with...	Materials Research Society Symposium Proceedings
89	Preface	Geo-Intelligence and Visualization through Big...
179	Demonstration of high-Q mid-infrared chalcogen...	Optics Letters
147	Low-Stress silicon nitride platform for broadb...	Optics InfoBase Conference Papers
309	Situation based control for cyber-physical env...	Proceedings - IEEE Military Communications Con...
42	Towards measuring fine-grained diversity using...	Proceedings of the 11th International Conferen...
306	Motivating contributors in social media networks	1st ACM SIGMM International Workshop on Social...
146	Mid-infrared opto-nanofluidics for label-free ...	Optics InfoBase Conference Papers
54	Gradient Polymer Nanofoams for Encrypted Recor...	ACS Nano
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
336	Towards environment-to-environment (E2E) multi...	MM'08 - Proceedings of the 2008 ACM Internatio...
144	Low-stress silicon nitride platform for broadb...	Conference on Lasers and Electro-Optics Europe...
209	Anisotropic photoluminescence from Er-TeO<inf>...	CLEO: Science and Innovations, CLEO_SI 2012
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
71	On-chip mid-infrared gas detection using chalc...	Applied Physics Letters

However, there may still be noise in it (e.g., papers published in optics/photonics venues). We can manually exclude those as well:



In [14]:

    
## venue keywords that indicate papers from an unrelated field
noise_keywords = ['optic', 'photonic', 'nano', 'quantum', 'sensor',
                  'cleo', 'materials', 'physics', 'chip']
## case-insensitive match of any keyword in the venue name
noise_mask = filterd_pub_df.publication_name.str.contains('|'.join(noise_keywords), case=False, na=False)
filterd_pub_df = filterd_pub_df[~noise_mask]
filterd_pub_df.shape









    Out[14]:





(59, 16)
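Substring matching on venue names is blunt, so it may be worth checking which keyword flags which venue before dropping rows. The sketch below is a plain-Python illustration on a handful of venue names taken from the sample output above; the helper name `matched_keywords` is hypothetical.

```python
NOISE_KEYWORDS = ['optic', 'photonic', 'nano', 'quantum', 'sensor',
                  'cleo', 'materials', 'physics', 'chip']


def matched_keywords(venue, keywords=NOISE_KEYWORDS):
    """Return the noise keywords that occur in the lower-cased venue name."""
    name = venue.lower()
    return [kw for kw in keywords if kw in name]


venues = [
    'Optics Letters',
    'IEEE Internet Computing',
    'ACS Nano',
    'Applied Physics Letters',
    'Materials Research Society Symposium Proceedings',
]
report = {venue: matched_keywords(venue) for venue in venues}
for venue, hits in report.items():
    print(venue, '->', hits if hits else 'kept')
```

A venue with an empty list is kept. Reviewing the flagged venues by hand helps catch keywords that are too broad for a given author.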

Let's check a random subset again:



In [15]:

    
## draw 20 random rows (with replacement, so a row may appear more than once)
filterd_pub_df.sample(n=20, replace=True)[['title', 'publication_name']]









    Out[15]:







  
    
      
	title	publication_name
103	Assessing personality using demographic inform...	ACM International Conference Proceeding Series
44	Examining information search behaviors in smal...	Proceedings of the Association for Information...
44	Examining information search behaviors in smal...	Proceedings of the Association for Information...
18	Social bridges in urban purchase behavior	ACM Transactions on Intelligent Systems and Te...
62	If it looks like a spammer and behaves like a ...	International Journal of Information Security
152	EventShop: Recognizing situations in web data ...	WWW 2013 Companion - Proceedings of the 22nd I...
5	Are you altruistic? Your mobile phone could tell	2017 IEEE SmartWorld Ubiquitous Intelligence a...
103	Assessing personality using demographic inform...	ACM International Conference Proceeding Series
247	EventShop: From heterogeneous web streams to p...	Proceedings of the 4th Annual ACM Web Science ...
61	LTA 2016 - The first workshop on lifelogging t...	MM 2016 - Proceedings of the 2016 ACM Multimed...
34	Effect of gamma exposure on chalcogenide glass...	IEEE Radiation Effects Data Workshop
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
298	Structural analysis of the emerging event-web	Proceedings of the 19th International Conferen...
17	New Signals in Multimedia Systems and Applicat...	IEEE Multimedia
152	EventShop: Recognizing situations in web data ...	WWW 2013 Companion - Proceedings of the 22nd I...
317	Adversary aware surveillance systems	IEEE Transactions on Information Forensics and...
30	Toward harmonizing self-reported and logged so...	Conference on Human Factors in Computing Syste...
74	Predicting privacy attitudes using phone metadata	Lecture Notes in Computer Science (including s...
89	Preface	Geo-Intelligence and Visualization through Big...
64	Probing the interconnections between geo-explo...	UbiComp 2016 - Proceedings of the 2016 ACM Int...

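The list is now much cleaner, and we can use it as the paper list for this focal author. A few off-topic papers may still slip through the keyword filter, so a final manual check is worthwhile.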
