Illustrating common_terms usage with Wikinews in English

Getting the data

We download the CirrusSearch dump of English Wikinews (a dump meant for Elasticsearch indexing).


In [1]:
LANG="english"

In [2]:
%%bash

# note: old CirrusSearch dumps are eventually removed from the server;
# if the download 404s, pick a recent date from
# https://dumps.wikimedia.org/other/cirrussearch/
fdate=20170327
fname=enwikinews-$fdate-cirrussearch-content.json.gz
if [ ! -e $fname ]
then
    wget "https://dumps.wikimedia.org/other/cirrussearch/$fdate/$fname"
fi



In [ ]:
# iterator over the titles and texts in the dump;
# the dump is in Elasticsearch bulk format, so document lines
# (which have a "title" field) alternate with index metadata lines
import gzip
import json

FDATE = 20170327
FNAME = "enwikinews-%s-cirrussearch-content.json.gz" % FDATE

def iter_texts(fpath=FNAME):
    with gzip.open(fpath, "rt") as f:
        for line in f:
            data = json.loads(line)
            if "title" in data:
                yield data["title"]
                yield data["text"]

In [ ]:
# also download the nltk resources we need (sentence tokenizer models and stop word lists)
import nltk
nltk.download("punkt")
nltk.download("stopwords")

Preparing the data

We arrange the corpus as required by gensim: a list of sentences, each a list of tokens.


In [ ]:
# make a custom tokenizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w[\w-]*|\d[\d,]*')
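
To illustrate the pattern on a made-up phrase (this sample is not from the data): hyphenated words are kept as single tokens, and punctuation is dropped.

In [ ]:
# made-up sample showing the tokenizer's behavior
tokenizer.tokenize("the vice-president visited new york")
# -> ['the', 'vice-president', 'visited', 'new', 'york']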

In [ ]:
# prepare a text: lower-case it, split it into sentences, tokenize each sentence
def prepare(txt):
    txt = txt.lower()
    return [tokenizer.tokenize(sent) 
            for sent in sent_tokenize(txt, language=LANG)]
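
A quick illustration on a made-up snippet, showing the lower-casing, sentence splitting, and tokenization together:

In [ ]:
# made-up snippet to illustrate prepare()
prepare("The president is in America. He gave a speech.")
# -> [['the', 'president', 'is', 'in', 'america'], ['he', 'gave', 'a', 'speech']]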

In [ ]:
# we keep the whole corpus in RAM; it is small enough
corpus = []
for txt in iter_texts():
    corpus.extend(prepare(txt))

In [ ]:
# how many sentences and words?
words_count = sum(len(s) for s in corpus)
print("Corpus has %d words in %d sentences" % (words_count, len(corpus)))

Testing bigrams with and without common terms

The Phrases model gives us the possibility of handling common terms, that is, words that appear frequently in a text and are there only to link other words together. While you could simply remove them, you may lose information, since "the president is in america" is not the same as "the president of america".

The common_terms parameter of Phrases can help you deal with them in a smarter way, keeping them around while preventing them from skewing the frequency statistics.


In [ ]:
from gensim.models.phrases import Phrases
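
As a minimal sketch of what common_terms changes, here is a toy corpus (made up, not the Wikinews data): without common terms, "of" breaks "president of america" into two unrelated bigrams, while with common_terms it can be kept inside the detected phrase. We would expect something like ['the_president', 'of_america'] from the first model and ['the', 'president_of_america'] from the second.

In [ ]:
# toy corpus: one sentence repeated, just to give Phrases some counts
toy = [["the", "president", "of", "america", "spoke"]] * 30
sent = ["the", "president", "of", "america"]
# standard model: "of" splits the phrase in two
print(Phrases(toy, min_count=1, threshold=0.1)[sent])
# with common_terms, "of" can sit inside a detected phrase
print(Phrases(toy, min_count=1, threshold=0.1, common_terms=["of", "the"])[sent])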

In [ ]:
# the stop words we will use
from nltk.corpus import stopwords
" ".join(stopwords.words(LANG))

In [ ]:
# a version of the corpus without stop words
stop_words = frozenset(stopwords.words(LANG))
def stopwords_filter(txt):
    return [w for w in txt if w not in stop_words]
st_corpus = [stopwords_filter(txt) for txt in corpus]
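
As a quick sanity check (not essential to the pipeline), we can compare the first sentence of the corpus before and after stop-word filtering:

In [ ]:
# first sentence of the corpus, with and without stop words
print(corpus[0])
print(st_corpus[0])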

In [ ]:
# standard bigram model, trained on the corpus without stop words
%time bigram = Phrases(st_corpus)
# bigram model with common terms, trained on the full corpus
%time bigram_ct = Phrases(corpus, common_terms=stopwords.words(LANG))
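
To see how the two models behave differently, we can run the same made-up sentence through each of them; bigram was trained on the stop-word-filtered corpus, so it gets the filtered version. Whether these exact phrases were detected depends on the actual corpus statistics:

In [ ]:
# compare the two models on the same made-up sentence
sent = "the president of the united states spoke".split()
print(bigram[stopwords_filter(sent)])
print(bigram_ct[sent])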

Bigrams with common terms inside

What are (some of) the bigrams found thanks to common terms?


In [ ]:
# grams that have more than 2 terms are those with common terms
ct_ngrams = set((g[1], g[0].decode("utf-8"))
                     for g in bigram_ct.export_phrases(corpus) 
                     if len(g[0].split()) > 2)
ct_ngrams = sorted(list(ct_ngrams))
print(len(ct_ngrams), "grams with common terms found")
# highest scores
ct_ngrams[-20:]

In [ ]:
# did we find any bigrams with the same content words but different stop words?
import collections
by_terms = collections.defaultdict(set)
for ngram, score in bigram_ct.export_phrases(corpus):
    grams = ngram.split()
    by_terms[(grams[0], grams[-1])].add(ngram)
for k, v in by_terms.items():
    if len(v) > 1:
        print(b"-".join(k).decode("utf-8"), " : ", [w.decode("utf-8") for w in v])
