Introduction

This notebook assumes you have already tokenized all the documents and stored them on disk in spaCy's Doc format.

This notebook will:

  1. Load the required spaCy Docs.
  2. Train an LDA model for each journal.
  3. Save each LDA model to disk.

In [1]:
import pandas as pd
import sqlite3
import gensim
import nltk
import glob
import json
import pickle
from tqdm import tqdm_notebook as tn

## Helpers

def save_pkl(target_object, filename):
    with open(filename, "wb") as file:
        pickle.dump(target_object, file)
        
def load_pkl(filename):
    with open(filename, "rb") as file:
        return pickle.load(file)

def save_json(target_object, filename):
    with open(filename, 'w') as file:
        json.dump(target_object, file)
        
def load_json(filename):
    with open(filename, 'r') as file:
        data = json.load(file)
    return data


C:\Anaconda3\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

Preparing Data

In this step, we load the data from disk into memory and format it properly so that we can process it in the next "preprocessing" stage.


In [2]:
# Load metadata from the training database
con = sqlite3.connect("F:/FMR/data.sqlite")
db_documents = pd.read_sql_query("SELECT * from documents", con)
db_authors = pd.read_sql_query("SELECT * from authors", con)
data = db_documents # just a handy alias
data.head()


Out[2]:
id title abstract publication_date submission_date cover_url full_url first_page last_page pages document_type type article_id context_key label publication_title submission_path journal_id
0 1 Role-play and Use Case Cards for Requirements ... <p>This paper presents a technique that uses r... 2006-01-01T00:00:00-08:00 2009-02-26T07:42:10-08:00 http://aisel.aisnet.org/acis2001/1 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1001 742028 1 ACIS 2001 Proceedings acis2001/1 1
1 2 Flexible Learning and Academic Performance in ... <p>This research investigates the effectivenes... 2001-01-01T00:00:00-08:00 2009-02-26T22:04:53-08:00 http://aisel.aisnet.org/acis2001/10 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1006 744077 10 ACIS 2001 Proceedings acis2001/10 2
2 3 Proactive Metrics: A Framework for Managing IS... <p>Managers of information systems development... 2001-01-01T00:00:00-08:00 2009-02-26T22:03:31-08:00 http://aisel.aisnet.org/acis2001/11 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1005 744076 11 ACIS 2001 Proceedings acis2001/11 3
3 4 Reuse in Information Systems Development: Clas... <p>There has been a trend in recent years towa... 2001-01-01T00:00:00-08:00 2009-02-26T22:02:29-08:00 http://aisel.aisnet.org/acis2001/12 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1004 744075 12 ACIS 2001 Proceedings acis2001/12 4
4 5 Improving Software Development: The Prescripti... <p>We describe the Prescriptive Simplified Met... 2001-01-01T00:00:00-08:00 2009-02-26T22:01:24-08:00 http://aisel.aisnet.org/acis2001/13 http://aisel.aisnet.org/cgi/viewcontent.cgi?ar... article article 1003 744074 13 ACIS 2001 Proceedings acis2001/13 5

Loading SpaCy


In [3]:
import spacy
nlp = spacy.load('en')

Determining Journals

We want to build a dedicated LDA model for each journal, so here we extract the journal prefix from each document's submission_path.


In [7]:
def get_name(s):
    # Return the journal prefix of a submission path: everything before the
    # first digit. Returns '' if the path contains no digit; such entries are
    # filtered out before training.
    end = 0
    for i in range(len(s.split('/')[0])):
        try:
            int(s[i])
            end = i
            break
        except ValueError:
            continue
    return s[:end]

journals = []
for i in db_documents['submission_path']:
    journals.append(get_name(i))
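
For example, applied to the submission_path values shown in the metadata above (the second path is hypothetical but follows the same pattern):

get_name('acis2001/1')    # -> 'acis'
get_name('pacis2010/42')  # -> 'pacis'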

In [8]:
journals = set(journals)

In [9]:
from gensim.models.phrases import Phraser, Phrases

In [10]:
from itertools import tee
import multiprocessing

# Use tn(iter, desc="Some text") to track progress
def gen_tokenized_dict_beta(untokenized_dict):
    gen1, gen2 = tee(untokenized_dict.items())
    ids = (id_ for (id_, text) in gen1)
    texts = (text for (id_, text) in gen2)
    docs = nlp.pipe(tn(texts, desc="Tokenization", total=len(untokenized_dict)), n_threads=9)
    tokenized = {id_: doc for id_, doc in zip(ids, docs)}
    return tokenized

def gen_tokenized_dict(untokenized_dict):
    return {k: nlp(v) for k, v in tn(untokenized_dict.items(), desc="Tokenization")}

def gen_tokenized_dict_parallel(untokenized_dict):  # Uses TextBlob instead of spaCy
    from textblob import TextBlob
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as executor:
        return {id_: blob for id_, blob in tn(zip(untokenized_dict.keys(),
                                                  executor.map(TextBlob, untokenized_dict.values())),
                                              desc="Tokenization")}

def keep_journal(dict_, journal):
    kept = {k: v for k, v in tn(dict_.items(), desc="Journal Filter") if k.startswith(journal)}
    print("Original: ", len(dict_), ", Kept ", len(kept), " items.")
    return kept
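
A hedged sketch of how these helpers could fit together, assuming the raw texts come from the abstract column of db_documents loaded above, keyed by submission_path (the abstracts still contain HTML tags such as <p>, which this sketch does not strip):

untokenized = dict(zip(db_documents['submission_path'], db_documents['abstract']))
tokenized = gen_tokenized_dict(untokenized)      # {submission_path: spaCy Doc}
pacis_only = keep_journal(tokenized, 'pacis')    # keep only keys starting with 'pacis'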

In [11]:
import os
from spacy.tokens.doc import Doc
def save_doc_dict(d, folder_name):
    os.mkdir(folder_name)
    nlp.vocab.dump_vectors(os.path.join(folder_name, 'vocab.bin'))
    for k, v in tn(d.items(), desc="Saving doc"):
        k = k.replace('/', '-') + '.doc'
        with open(os.path.join(folder_name, k), 'wb') as f:
            f.write(v.to_bytes())
            
def load_doc_dict(folder_name):
    nlp = spacy.load('en')  # Load a fresh pipeline; its vocab is populated from vocab.bin below before the Docs are deserialized
    file_list = glob.glob(os.path.join(folder_name, "*.doc"))
    d = {}
    nlp.vocab.load_vectors_from_bin_loc(os.path.join(folder_name, 'vocab.bin'))
    for k in tn(file_list, desc="Loading doc"):
        with open(os.path.join(k), 'rb') as f:
            k_ = k.split('\\')[-1].replace('-', '/').replace('.doc', '')
            for bs in Doc.read_bytes(f):
                d[k_] = Doc(nlp.vocab).from_bytes(bs)
    return d

In [28]:
def pos_filter(l, pos="NOUN"):
    return [str(i.lemma_).lower() for i in l if i.pos_ == 'NOUN' and i.is_alpha]
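
For instance (the exact lemmas depend on the spaCy model version, so treat the output as illustrative):

pos_filter(nlp("Information systems improve business processes."))
# roughly ['information', 'system', 'business', 'process']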

In [13]:
def bigram(corpus):
    phrases = Phrases(corpus)
    make_bigram = Phraser(phrases)
    return [make_bigram[i] for i in tn(corpus, desc='Bigram')]
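
This is the bigram step used inside train_journal below. A brief sketch of the call shape (whether a given pair is actually merged depends on its counts relative to gensim's default min_count and threshold):

tokens_per_doc = [pos_filter(doc) for doc in corpus.values()]  # hypothetical dict of spaCy Docs
tokens_per_doc = bigram(tokens_per_doc)
# pairs that clear the threshold are joined with '_', e.g. 'focus_group' in the
# training output further below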

In [32]:
# Set training parameters.
num_topics = 150
chunksize = 2000
passes = 1
iterations = 150
eval_every = None  # Don't evaluate model perplexity, takes too much time.

In [33]:
models = {}
vis_dict = {}

In [34]:
import gensim.corpora
import pyLDAvis.gensim
import warnings
from imp import reload
warnings.filterwarnings("ignore")
def train_journal(j):
    corpus = load_doc_dict(j)
    corpus = {k: pos_filter(v) for k, v in tn(corpus.items())}
    
    # Make it bigram
    
    tokenised_list = bigram([i for i in corpus.values()])
    # Create a dictionary for all the documents. This might take a while.
    reload(gensim.corpora)
    print(tokenised_list[0][:10])
    dictionary = gensim.corpora.Dictionary(tokenised_list)
    # dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=None)
    if len(dictionary) < 10:
        print("Warning: dictionary only has " + str(len(dictionary)) + " items. Passing.")
        return None, None
    corpus = [dictionary.doc2bow(l) for l in tokenised_list]
    # Save it for future usage
    from gensim.corpora.mmcorpus import MmCorpus
    MmCorpus.serialize(os.path.join(j, "noun_bigram.mm"), corpus)
    # Also save the dictionary
    dictionary.save(os.path.join(j, "_noun_bigram.ldamodel.dictionary"))
    # Train LDA model.
    from gensim.models import LdaModel
    
    # Train LDA model
    print(len(dictionary))
    # Make a index to word dictionary.
    print("Dictionary test: " + dictionary[0])  # This is only to "load" the dictionary.
    id2word = dictionary.id2token
    model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                           alpha='auto', eta='auto', \
                           iterations=iterations, num_topics=num_topics, \
                           passes=passes, eval_every=eval_every)
    model.save(os.path.join(j, "_noun_bigram_" + str(num_topics) + ".ldamodel"))
    vis = pyLDAvis.gensim.prepare(model, corpus, dictionary)
    del dictionary
    return model, vis

journals = set(i for i in journals if i)  # drop empty prefixes from paths without a digit
for j in tn(journals, desc="Journal"):
    try:
        if j in models:
            print(j, 'already exists. Skipping.')
            continue
        model, vis = train_journal(j)
        if model and vis:
            models[j] = model
            vis_dict[j] = vis
            save_pkl(filename=os.path.join(j, '_bigram_vis.pkl'), target_object=vis)
    except Exception as e:
        print(e)


['exploration', 'service', 'quality', 'differentiator', 'firm', 'contributor', 'economy', 'uniqueness', 'service', 'consensus']
18366
Dictionary test: focus_group
['work', 'progress', 'culture', 'consequence', 'knowledge', 'behavior', 'cross', 'theory', 'model', 'culture']
5472
Dictionary test: psychometric
['ict', 'information_communication', 'technology', 'initiative', 'development', 'community', 'country', 'component', 'community', 'informatic']
8005
Dictionary test: user
['sustainability', 'practice', 'company', 'strategy', 'company', 'method', 'tool', 'development', 'beginning', 'discussion']
1435
Dictionary test: degree_fulfilment
['pda_experience', 'proceeding_proceeding', 'pda_experience', 'work_citation', 'sell', 'pda_experience', 'proceeding_material', 'proceeding_inclusion', 'proceeding_administrator', 'information_eintegration']
18695
Dictionary test: focus_group
Could not open binary file b'icis\\vocab.bin'
['beat', 'world', 'purpose', 'paper', 'background', 'description', 'basis', 'journalism', 'world', 'present']
3041
Dictionary test: lingo
['framework', 'development', 'communication', 'device', 'opportunity', 'group', 'coordination', 'level', 'article', 'framework']
5062
Dictionary test: yuan
['icon', 'aesthetic_interaction', 'satisfaction', 'user', 'role', 'icon', 'computer_interaction', 'research', 'user', 'importance']
5901
Dictionary test: user
['business', 'performance', 'management', 'view', 'business_intelligence', 'aspect', 'organization', 'paper', 'information', 'view']
11314
Dictionary test: help
['agility', 'spanning_knowledge', 'brokering', 'approach', 'support', 'agility', 'software_development', 'company', 'case_unit', 'member']
7613
Dictionary test: setting
['progress', 'author', 'issue', 'creativity', 'support', 'information', 'system', 'decision_making', 'process', 'business']
2495
Dictionary test: user
['situation', 'program', 'representative', 'program', 'university', 'author', 'type', 'training', 'objective', 'education']
6027
Dictionary test: tool
['development', 'government', 'concern', 'infrastructure', 'host', 'attack', 'people', 'line_defense', 'research', 'security_awareness']
7355
Dictionary test: development
['customer', 'who', 'paper', 'experience', 'lesson', 'principle', 'user', 'requirement', 'service', 'website']
37407
Dictionary test: help
Could not open binary file b'ecis\\vocab.bin'
['work_citation', 'paper_material', 'inclusion_administrator', 'information_eorganisation', 'isbn_isbn', 'set', 'für', 'design', 'ad', 'ist']
54741
Dictionary test: leistet
['cycle', 'paper', 'wave', 'discourse', 'set', 'innovation', 'part', 'kondratieff', 'cycle', 'nanotechnology']
15813
Dictionary test: project
['datum', 'disease', 'privacy', 'disclosure', 'individual', 'data', 'individual', 'disease_datum', 'subject', 'disclosure_risk']
3565
Dictionary test: format
Could not open binary file b'amcis\\vocab.bin'
['interaction', 'community_practice', 'practice', 'mechanism', 'knowledge_sharing', 'project', 'manager', 'organization', 'capital_theory', 'motivation']
5331
Dictionary test: capitalism
['effect', 'technology', 'material', 'practice', 'paper', 'observation', 'discussion', 'technology', 'support', 'service']
3069
Dictionary test: gym
['architecture', 'content', 'network', 'development', 'manet', 'application', 'traffic', 'communication', 'emergency', 'situation']
15724
Dictionary test: scalable
['innovation', 'history', 'example', 'it', 'innovation', 'issue', 'case', 'environment', 'regulation', 'industry']
1420
Dictionary test: entrepreneur
['thesis', 'demand', 'gap', 'enterprise', 'knowledge', 'definition', 'feature', 'system', 'structure', 'enterprise']
17824
Dictionary test: modelmodelmodelmodel_figure
['ciso', 'leader', 'information', 'security', 'issue', 'challenge', 'research', 'role', 'information', 'security']
270
Dictionary test: title
['challenge', 'guideline', 'date', 'researcher', 'use', 'multi', 'level', 'modeling', 'study', 'process']
10105
Dictionary test: antecedent
['produto', 'sistema', 'ser', 'pelo', 'eficaz', 'eficiente', 'e', 'conceito', 'chave', 'campo']
23259
Dictionary test: ordem


In [36]:
len(models)


Out[36]:
25

In [37]:
import pyLDAvis

In [38]:
pyLDAvis.display(vis_dict['pacis'])


Out[38]:

In [22]:
journals


Out[22]:
{'acis',
 'amcis',
 'bled',
 'confirm',
 'digit',
 'ecis',
 'eis',
 'globdev',
 'icdss',
 'icis',
 'icmb',
 'iris',
 'irwitpm',
 'isd',
 'mcis',
 'mg',
 'mwais',
 'pacis',
 'sais',
 'sbis',
 'siged',
 'sighci',
 'siglead',
 'sprouts_proceedings_siggreen_',
 'ukais',
 'whiceb',
 'wi',
 'wisp'}
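
The per-journal artifacts written by train_journal (LDA model, dictionary, MmCorpus and the pickled pyLDAvis data) can be reloaded later without retraining. A minimal sketch, assuming the same folder layout and that num_topics is set as above:

from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.corpora.mmcorpus import MmCorpus

j = 'pacis'
lda = LdaModel.load(os.path.join(j, "_noun_bigram_" + str(num_topics) + ".ldamodel"))
dictionary = Dictionary.load(os.path.join(j, "_noun_bigram.ldamodel.dictionary"))
mm_corpus = MmCorpus(os.path.join(j, "noun_bigram.mm"))
vis = load_pkl(os.path.join(j, '_bigram_vis.pkl'))
lda.print_topics(5)  # inspect a few topics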