Each author has her own research interests. By analysing her previous works (papers, in this case), we can estimate her position in an interest space.
In the LDA model implementation, an interest space is an n-dimensional space spanned by n topics. Given a bag of words (BOW), a vector in that space can be generated: each coordinate of the vector is the confidence with which the LDA model believes the BOW belongs to a particular topic.
To get a paper's vector, we preprocess the paper's abstract, convert it to a BOW, and feed it to a trained LDA model. By summing the vectors of all of an author's previous works, we obtain the author's position in the interest space.
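Concretely, the pipeline can be sketched as follows. This is only a minimal illustration of the idea; the model/dictionary paths and token lists are placeholders, and the real model is loaded further down in this notebook.

from gensim.corpora import Dictionary
from gensim.models import LdaModel
import numpy as np

lda = LdaModel.load("some.ldamodel")                 # placeholder path
vocab = Dictionary.load("some.ldamodel.dictionary")  # placeholder path

def paper_vector(tokens):
    # BOW -> per-topic confidence; topics the model leaves out stay at 0
    vec = np.zeros(lda.num_topics)
    for topic_id, confidence in lda[vocab.doc2bow(tokens)]:
        vec[topic_id] = confidence
    return vec

# The author's position is the sum of the vectors of her papers
papers_of_author = [["data", "warehouse"], ["null", "value", "database"]]  # placeholder token lists
author_position = sum(paper_vector(tokens) for tokens in papers_of_author)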
In [5]:
import pandas as pd
import sqlite3
import gensim
import nltk
import json
from gensim.corpora import BleiCorpus
from gensim import corpora
from nltk.corpus import stopwords
from textblob import TextBlob
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import numpy as np
import pickle
import glob
## Helpers
def save_pkl(target_object, filename):
    with open(filename, "wb") as file:
        pickle.dump(target_object, file)

def load_pkl(filename):
    with open(filename, "rb") as file:
        return pickle.load(file)

def save_json(target_object, filename):
    with open(filename, 'w') as file:
        json.dump(target_object, file)

def load_json(filename):
    with open(filename, 'r') as file:
        data = json.load(file)
    return data
In [2]:
con = sqlite3.connect("F:/FMR/data.sqlite")
db_documents = pd.read_sql_query("SELECT * from documents", con, index_col="id")
db_authors = pd.read_sql_query("SELECT * from authors", con)
len(db_documents)
Out[2]:
In [3]:
db_documents.head()
Out[3]:
In [21]:
tokenised = load_json("lemmatized.json")
In [7]:
non_en = load_pkl("non_en.list.pkl")
In [27]:
len(tokenised) == len(db_documents)
Out[27]:
In [57]:
model = LdaModel.load("aisnet_600_cleaned.ldamodel")
dictionary = Dictionary.load("aisnet_300_cleaned.ldamodel.dictionary")
In [30]:
def text2vec(text):
    # Raw text -> BOW of noun phrases, using the LDA dictionary
    if text:
        return dictionary.doc2bow(TextBlob(text.lower()).noun_phrases)
    else:
        return []

def tokenised2vec(tokenised):
    # Pre-tokenised document -> BOW
    if tokenised:
        return dictionary.doc2bow(tokenised)
    else:
        return []

def predict(sometext):
    # Topic distribution of a raw text, sorted by confidence
    vec = text2vec(sometext)
    dtype = [('topic_id', int), ('confidence', float)]
    topics = np.array(model[vec], dtype=dtype)
    topics.sort(order="confidence")
    # for topic in topics[::-1]:
    #     print("--------")
    #     print(topic[1], topic[0])
    #     print(model.print_topic(topic[0]))
    return pd.DataFrame(topics)

def predict_vec(vec):
    # Topic distribution of a pre-tokenised document, sorted by confidence
    dtype = [('topic_id', int), ('confidence', float)]
    topics = np.array(model[tokenised2vec(vec)], dtype=dtype)
    topics.sort(order="confidence")
    return pd.DataFrame(topics)
In [33]:
predict("null values are interpreted as unknown value or inapplicable value. This paper proposes a new approach for solving the unknown value problems with Implicit Predicate (IP). The IP serves as a descriptor corresponding to a set of the unknown values, thereby expressing the semantics of them. In this paper, we demonstrate that the IP is capable of (1) enhancing the semantic expressiveness of the unknown values, (2) entering incomplete information into database and (3) exploiting the information and a variety of inference rules in database to reduce the uncertainties of the unknown values.")
Out[33]:
In [34]:
model.print_topic(167)
Out[34]:
In [40]:
def update_author_vector(vec, doc_vec):
    # Add a paper's per-topic confidences onto the author's running vector
    for topic_id, confidence in zip(doc_vec['topic_id'], doc_vec['confidence']):
        vec[topic_id] += confidence
    return vec

def get_topic_in_list(model, topic_id):
    # Parse model.print_topic() output (weight*term pairs joined by "+") into [weight, term] pairs
    return [term.strip().split('*') for term in model.print_topic(topic_id).split("+")]

def get_author_top_topics(author_id, top=10):
    author = authors_lib[author_id]
    top_topics = []
    for topic_id, confidence in enumerate(author):
        # Coordinates were initialized to 1, so > 1 means the author has work in this topic
        if confidence > 1:
            top_topics.append([topic_id, (confidence - 1) * 100])
    top_topics.sort(key=lambda tup: tup[1], reverse=True)
    return top_topics[:top]

def get_topic_in_string(model, topic_id, top=5):
    # Human-readable label: the topic's top terms joined by " / "
    topic_list = get_topic_in_list(model, topic_id)
    topic_string = " / ".join([i[1] for i in topic_list][:top])
    return topic_string

def get_topics_in_string(model, topics, confidence=False):
    # Turn a list of topic ids (or [topic_id, confidence] pairs) into labelled dicts
    topics_list = []
    if confidence:
        for topic in topics:
            topic_map = {
                "topic_id": topic[0],
                "string": get_topic_in_string(model, topic[0]),
                "confidence": topic[1]
            }
            topics_list.append(topic_map)
    else:
        for topic_id in topics:
            topic_map = {
                "topic_id": topic_id,
                "string": get_topic_in_string(model, topic_id),
            }
            topics_list.append(topic_map)
    return topics_list
For an author, we first get all of her previous papers in our database. For each paper, we generate the paper's vector. Finally, the sum of these vectors gives the author's vector, i.e. her position, in the interest space. (In the implementation below the vector is initialized with 1 in every coordinate, so a coordinate greater than 1 means the author has published in that topic; get_author_top_topics subtracts this baseline again.)
In [90]:
def profile_author(author_id, model_topics_num=None):
    if not model_topics_num:
        model_topics_num = model.num_topics
    # Initialize every coordinate with 1 (the baseline used by get_author_top_topics)
    author_vec = np.array([1.0 for i in range(model_topics_num)])
    paper_list = pd.read_sql_query(
        "SELECT * FROM documents_authors WHERE authors_id=" + str(author_id), con)['documents_id']
    # Skip papers whose abstracts are not in English
    paper_list = [i for i in paper_list if i not in non_en]
    for paper_id in paper_list:
        try:
            abstract = db_documents.loc[paper_id]["abstract"]
            # Document ids are 1-based; the tokenised list is 0-based
            vec = predict_vec(tokenised[paper_id - 1])
        except:
            print("Error occurred on paper id " + str(paper_id))
            raise
        author_vec = update_author_vector(author_vec, vec)
    return list(author_vec)  # to make it serializable by JSON
In [79]:
profile_author(1)
Out[79]:
In [84]:
def profile_all_authors():
    authors = {}
    for author_id in db_authors['id']:
        result = profile_author(author_id)
        if len(result):
            authors[str(author_id)] = result  # JSON does not allow int to be the key
        # print("Done: ", author_id)
        # uncomment the above line to track the progress
    return authors
In [94]:
authors_lib = profile_all_authors()
In [99]:
len(db_authors) == len(authors_lib)
Out[99]:
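As a quick sanity check (not part of the original pipeline), the strongest interests of a single author can be read back out with the helpers defined above; author id 1 is just an arbitrary example, and the keys of authors_lib are strings.

top_topics = get_author_top_topics("1", top=5)  # keys of authors_lib are strings
pd.DataFrame(get_topics_in_string(model, top_topics, confidence=True))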
In [109]:
save_json(authors_lib, "aisnet_600_cleaned.authors.json")
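The saved profiles can later be reloaded with load_json; since the keys were stored as strings, a consumer would presumably index by str(author_id), along these lines:

authors_lib = load_json("aisnet_600_cleaned.authors.json")
author_vec = np.array(authors_lib["1"])  # position of author 1 in the interest space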