Introduction

This notebook assumes you have already tokenized all the documents and stored them on disk in spaCy's Doc format.

This notebook will:

  1. Load the required spaCy Docs.
  2. Train an LDA model for each journal.
  3. Save each LDA model to disk.

In [1]:
import pandas as pd
import sqlite3
import gensim
import nltk
import glob
import json
import pickle
from tqdm import tqdm_notebook as tn

## Helpers

def save_pkl(target_object, filename):
    with open(filename, "wb") as file:
        pickle.dump(target_object, file)
        
def load_pkl(filename):
    with open(filename, "rb") as file:
        return pickle.load(file)

def save_json(target_object, filename):
    with open(filename, 'w') as file:
        json.dump(target_object, file)
        
def load_json(filename):
    with open(filename, 'r') as file:
        data = json.load(file)
    return data


C:\Anaconda3\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

Preparing Data

In this step, we load the data from disk into memory and format it properly so that we can process it in the next "preprocessing" stage.


In [2]:
# Load metadata from the training database
con = sqlite3.connect("F:/FMR/data.sqlite")
db_documents = pd.read_sql_query("SELECT * from documents", con)
db_authors = pd.read_sql_query("SELECT * from authors", con)
data = db_documents # just a handy alias
data.head()


Out[2]:
id title abstract publication_date submission_date cover_url full_url first_page last_page pages document_type type article_id context_key label publication_title submission_path journal_id
0 1 Role-play and Use Case Cards for Requirements ... <p>This paper presents a technique that uses r... 2006-01-01T00:00:00-08:00 2009-02-26T07:42:10-08:00 http://aisel.aisnet.org/acis2001/1 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1001 742028 1 ACIS 2001 Proceedings acis2001/1 1
1 2 Flexible Learning and Academic Performance in ... <p>This research investigates the effectivenes... 2001-01-01T00:00:00-08:00 2009-02-26T22:04:53-08:00 http://aisel.aisnet.org/acis2001/10 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1006 744077 10 ACIS 2001 Proceedings acis2001/10 2
2 3 Proactive Metrics: A Framework for Managing IS... <p>Managers of information systems development... 2001-01-01T00:00:00-08:00 2009-02-26T22:03:31-08:00 http://aisel.aisnet.org/acis2001/11 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1005 744076 11 ACIS 2001 Proceedings acis2001/11 3
3 4 Reuse in Information Systems Development: Clas... <p>There has been a trend in recent years towa... 2001-01-01T00:00:00-08:00 2009-02-26T22:02:29-08:00 http://aisel.aisnet.org/acis2001/12 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1004 744075 12 ACIS 2001 Proceedings acis2001/12 4
4 5 Improving Software Development: The Prescripti... <p>We describe the Prescriptive Simplified Met... 2001-01-01T00:00:00-08:00 2009-02-26T22:01:24-08:00 http://aisel.aisnet.org/acis2001/13 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1003 744074 13 ACIS 2001 Proceedings acis2001/13 5

Loading SpaCy


In [3]:
import spacy
nlp = spacy.load('en')

Determining Journals

We want to build a dedicated LDA model for each journal, so here we extract the journal prefix from each document's submission_path.


In [7]:
def get_name(s):
    # Return the journal prefix of a submission path: everything before the
    # first digit. Returns '' if the path contains no digit; such entries are
    # filtered out before training.
    end = 0
    for i in range(len(s.split('/')[0])):
        try:
            int(s[i])
            end = i
            break
        except ValueError:
            continue
    return s[:end]

journals = []
for i in db_documents['submission_path']:
    journals.append(get_name(i))
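
For example, applied to the submission_path values shown in the metadata above (the second path is hypothetical but follows the same pattern):

get_name('acis2001/1')    # -> 'acis'
get_name('pacis2010/42')  # -> 'pacis'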

In [8]:
journals = set(journals)

In [9]:
from gensim.models.phrases import Phraser, Phrases

In [10]:
from itertools import tee
import multiprocessing

# Use tn(iter, desc="Some text") to track progress
def gen_tokenized_dict_beta(untokenized_dict):
    gen1, gen2 = tee(untokenized_dict.items())
    ids = (id_ for (id_, text) in gen1)
    texts = (text for (id_, text) in gen2)
    docs = nlp.pipe(tn(texts, desc="Tokenization", total=len(untokenized_dict)), n_threads=9)
    tokenized = {id_: doc for id_, doc in zip(ids, docs)}
    return tokenized

def gen_tokenized_dict(untokenized_dict):
    return {k: nlp(v) for k, v in tn(untokenized_dict.items(), desc="Tokenization")}

def gen_tokenized_dict_parallel(untokenized_dict):  # Uses TextBlob instead of spaCy
    from textblob import TextBlob
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as executor:
        return {id_: blob for id_, blob in tn(zip(untokenized_dict.keys(),
                                                  executor.map(TextBlob, untokenized_dict.values())),
                                              desc="Tokenization")}

def keep_journal(dict_, journal):
    kept = {k: v for k, v in tn(dict_.items(), desc="Journal Filter") if k.startswith(journal)}
    print("Original: ", len(dict_), ", Kept ", len(kept), " items.")
    return kept
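
A hedged sketch of how these helpers could fit together, assuming the raw texts come from the abstract column of db_documents loaded above, keyed by submission_path (the abstracts still contain HTML tags such as <p>, which this sketch does not strip):

untokenized = dict(zip(db_documents['submission_path'], db_documents['abstract']))
tokenized = gen_tokenized_dict(untokenized)      # {submission_path: spaCy Doc}
pacis_only = keep_journal(tokenized, 'pacis')    # keep only keys starting with 'pacis'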

In [11]:
import os
from spacy.tokens.doc import Doc
def save_doc_dict(d, folder_name):
    os.mkdir(folder_name)
    nlp.vocab.dump_vectors(os.path.join(folder_name, 'vocab.bin'))
    for k, v in tn(d.items(), desc="Saving doc"):
        k = k.replace('/', '-') + '.doc'
        with open(os.path.join(folder_name, k), 'wb') as f:
            f.write(v.to_bytes())
            
def load_doc_dict(folder_name):
    nlp = spacy.load('en')  # Load a fresh pipeline; its vocab is populated from vocab.bin below before the Docs are deserialized
    file_list = glob.glob(os.path.join(folder_name, "*.doc"))
    d = {}
    nlp.vocab.load_vectors_from_bin_loc(os.path.join(folder_name, 'vocab.bin'))
    for k in tn(file_list, desc="Loading doc"):
        with open(os.path.join(k), 'rb') as f:
            k_ = k.split('\\')[-1].replace('-', '/').replace('.doc', '')
            for bs in Doc.read_bytes(f):
                d[k_] = Doc(nlp.vocab).from_bytes(bs)
    return d

In [28]:
def pos_filter(l, pos="NOUN"):
    return [str(i.lemma_).lower() for i in l if i.pos_ == 'NOUN' and i.is_alpha]
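
For instance (the exact lemmas depend on the spaCy model version, so treat the output as illustrative):

pos_filter(nlp("Information systems improve business processes."))
# roughly ['information', 'system', 'business', 'process']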

In [13]:
def bigram(corpus):
    phrases = Phrases(corpus)
    make_bigram = Phraser(phrases)
    return [make_bigram[i] for i in tn(corpus, desc='Bigram')]
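
This is the bigram step used inside train_journal below. A brief sketch of the call shape (whether a given pair is actually merged depends on its counts relative to gensim's default min_count and threshold):

tokens_per_doc = [pos_filter(doc) for doc in corpus.values()]  # hypothetical dict of spaCy Docs
tokens_per_doc = bigram(tokens_per_doc)
# pairs that clear the threshold are joined with '_', e.g. 'focus_group' in the
# training output further below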

In [32]:
# Set training parameters.
num_topics = 150
chunksize = 2000
passes = 1
iterations = 150
eval_every = None  # Don't evaluate model perplexity, takes too much time.

In [33]:
models = {}
vis_dict = {}

In [34]:
import gensim.corpora
import pyLDAvis.gensim
import warnings
from imp import reload
warnings.filterwarnings("ignore")
def train_journal(j):
    corpus = load_doc_dict(j)
    corpus = {k: pos_filter(v) for k, v in tn(corpus.items())}
    
    # Make it bigram
    
    tokenised_list = bigram([i for i in corpus.values()])
    # Create a dictionary for all the documents. This might take a while.
    reload(gensim.corpora)
    print(tokenised_list[0][:10])
    dictionary = gensim.corpora.Dictionary(tokenised_list)
    # dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=None)
    if len(dictionary) < 10:
        print("Warning: dictionary only has " + str(len(dictionary)) + " items. Passing.")
        return None, None
    corpus = [dictionary.doc2bow(l) for l in tokenised_list]
    # Save it for future usage
    from gensim.corpora.mmcorpus import MmCorpus
    MmCorpus.serialize(os.path.join(j, "noun_bigram.mm"), corpus)
    # Also save the dictionary
    dictionary.save(os.path.join(j, "_noun_bigram.ldamodel.dictionary"))
    # Train LDA model.
    from gensim.models import LdaModel
    
    # Train LDA model
    print(len(dictionary))
    # Make a index to word dictionary.
    print("Dictionary test: " + dictionary[0])  # This is only to "load" the dictionary.
    id2word = dictionary.id2token
    model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                           alpha='auto', eta='auto', \
                           iterations=iterations, num_topics=num_topics, \
                           passes=passes, eval_every=eval_every)
    model.save(os.path.join(j, "_noun_bigram_" + str(num_topics) + ".ldamodel"))
    vis = pyLDAvis.gensim.prepare(model, corpus, dictionary)
    del dictionary
    return model, vis

journals = set(i for i in journals if i)  # drop empty prefixes from paths without a digit
for j in tn(journals, desc="Journal"):
    try:
        if j in models:
            print(j, 'already exists. Skipping.')
            continue
        model, vis = train_journal(j)
        if model and vis:
            models[j] = model
            vis_dict[j] = vis
            save_pkl(filename=os.path.join(j, '_bigram_vis.pkl'), target_object=vis)
    except Exception as e:
        print(e)


['exploration', 'service', 'quality', 'differentiator', 'firm', 'contributor', 'economy', 'uniqueness', 'service', 'consensus']
18366
Dictionary test: focus_group
['work', 'progress', 'culture', 'consequence', 'knowledge', 'behavior', 'cross', 'theory', 'model', 'culture']
5472
Dictionary test: psychometric
['ict', 'information_communication', 'technology', 'initiative', 'development', 'community', 'country', 'component', 'community', 'informatic']
8005
Dictionary test: user
['sustainability', 'practice', 'company', 'strategy', 'company', 'method', 'tool', 'development', 'beginning', 'discussion']
1435
Dictionary test: degree_fulfilment
['pda_experience', 'proceeding_proceeding', 'pda_experience', 'work_citation', 'sell', 'pda_experience', 'proceeding_material', 'proceeding_inclusion', 'proceeding_administrator', 'information_eintegration']
18695
Dictionary test: focus_group
Could not open binary file b'icis\\vocab.bin'
['beat', 'world', 'purpose', 'paper', 'background', 'description', 'basis', 'journalism', 'world', 'present']
3041
Dictionary test: lingo
['framework', 'development', 'communication', 'device', 'opportunity', 'group', 'coordination', 'level', 'article', 'framework']
5062
Dictionary test: yuan
['icon', 'aesthetic_interaction', 'satisfaction', 'user', 'role', 'icon', 'computer_interaction', 'research', 'user', 'importance']
5901
Dictionary test: user
['business', 'performance', 'management', 'view', 'business_intelligence', 'aspect', 'organization', 'paper', 'information', 'view']
11314
Dictionary test: help
['agility', 'spanning_knowledge', 'brokering', 'approach', 'support', 'agility', 'software_development', 'company', 'case_unit', 'member']
7613
Dictionary test: setting
['progress', 'author', 'issue', 'creativity', 'support', 'information', 'system', 'decision_making', 'process', 'business']
2495
Dictionary test: user
['situation', 'program', 'representative', 'program', 'university', 'author', 'type', 'training', 'objective', 'education']
6027
Dictionary test: tool
['development', 'government', 'concern', 'infrastructure', 'host', 'attack', 'people', 'line_defense', 'research', 'security_awareness']
7355
Dictionary test: development
['customer', 'who', 'paper', 'experience', 'lesson', 'principle', 'user', 'requirement', 'service', 'website']
37407
Dictionary test: help
Could not open binary file b'ecis\\vocab.bin'
['work_citation', 'paper_material', 'inclusion_administrator', 'information_eorganisation', 'isbn_isbn', 'set', 'für', 'design', 'ad', 'ist']
54741
Dictionary test: leistet
['cycle', 'paper', 'wave', 'discourse', 'set', 'innovation', 'part', 'kondratieff', 'cycle', 'nanotechnology']
15813
Dictionary test: project
['datum', 'disease', 'privacy', 'disclosure', 'individual', 'data', 'individual', 'disease_datum', 'subject', 'disclosure_risk']
3565
Dictionary test: format
Could not open binary file b'amcis\\vocab.bin'
['interaction', 'community_practice', 'practice', 'mechanism', 'knowledge_sharing', 'project', 'manager', 'organization', 'capital_theory', 'motivation']
5331
Dictionary test: capitalism
['effect', 'technology', 'material', 'practice', 'paper', 'observation', 'discussion', 'technology', 'support', 'service']
3069
Dictionary test: gym
['architecture', 'content', 'network', 'development', 'manet', 'application', 'traffic', 'communication', 'emergency', 'situation']
15724
Dictionary test: scalable
['innovation', 'history', 'example', 'it', 'innovation', 'issue', 'case', 'environment', 'regulation', 'industry']
1420
Dictionary test: entrepreneur
['thesis', 'demand', 'gap', 'enterprise', 'knowledge', 'definition', 'feature', 'system', 'structure', 'enterprise']
17824
Dictionary test: modelmodelmodelmodel_figure
['ciso', 'leader', 'information', 'security', 'issue', 'challenge', 'research', 'role', 'information', 'security']
270
Dictionary test: title
['challenge', 'guideline', 'date', 'researcher', 'use', 'multi', 'level', 'modeling', 'study', 'process']
10105
Dictionary test: antecedent
['produto', 'sistema', 'ser', 'pelo', 'eficaz', 'eficiente', 'e', 'conceito', 'chave', 'campo']
23259
Dictionary test: ordem


In [36]:
len(models)


Out[36]:
25

In [37]:
import pyLDAvis

In [38]:
pyLDAvis.display(vis_dict['pacis'])


Out[38]:

In [22]:
journals


Out[22]:
{'acis',
 'amcis',
 'bled',
 'confirm',
 'digit',
 'ecis',
 'eis',
 'globdev',
 'icdss',
 'icis',
 'icmb',
 'iris',
 'irwitpm',
 'isd',
 'mcis',
 'mg',
 'mwais',
 'pacis',
 'sais',
 'sbis',
 'siged',
 'sighci',
 'siglead',
 'sprouts_proceedings_siggreen_',
 'ukais',
 'whiceb',
 'wi',
 'wisp'}
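
The per-journal artifacts written by train_journal (LDA model, dictionary, MmCorpus and the pickled pyLDAvis data) can be reloaded later without retraining. A minimal sketch, assuming the same folder layout and that num_topics is set as above:

from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.corpora.mmcorpus import MmCorpus

j = 'pacis'
lda = LdaModel.load(os.path.join(j, "_noun_bigram_" + str(num_topics) + ".ldamodel"))
dictionary = Dictionary.load(os.path.join(j, "_noun_bigram.ldamodel.dictionary"))
mm_corpus = MmCorpus(os.path.join(j, "noun_bigram.mm"))
vis = load_pkl(os.path.join(j, '_bigram_vis.pkl'))
lda.print_topics(5)  # inspect a few topics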