SE Economics - Data Exploration & Preprocessing Prototyping

Author: Matt Fortier matt.fortier@openmailbox.org

Creation Date: 2017-02-21 | Last Updated: 2017-08-01

Python 3.6

The goal here is to get a good feel for what kind of data is available to us, and in what general form the text at our disposal comes. We should experiment with different preprocessing steps and find a combination that gives us satisfactory results. The objective is always to extract as much of the useful/valuable information as possible while ignoring noise.


In [2]:
import re

import bs4
import numpy as np
import pandas as pd

In [3]:
fname_posts = '../data/processed/posts_processed.json'

In [4]:
df_posts = pd.read_json(fname_posts, lines=True)

In [5]:
df_posts.index = df_posts.id
df_posts.head(3)


Out[5]:
answers body comments id owner score tags title ts_closed ts_created ts_last_act type views
id
1 5.0 <p>If a currency can be worth too little (e.g.... 9 1 8.0 2 [currency, value, inflation, deflation] Why isn't there an "ideal value" for a given c... 2014-11-19T09:17:22.677 2014-11-18T20:59:30.327 2014-12-16T01:14:00.107 1 296.0
3 NaN <p>Brand new currencies do typically aim to st... 0 3 18.0 2 [] None None 2014-11-18T21:23:18.733 2014-11-18T22:33:03.290 2 NaN
4 2.0 <p>I'm having troubles to understand the diffe... 0 4 20.0 12 [currency, purchasing-power-partity, exchange-... Real Exchange Rate vs PPP rate None 2014-11-18T21:24:09.537 2014-11-19T05:14:12.427 1 6413.0

Body - NLP Analysis


In [6]:
df_body = df_posts.body
df_body.iloc[0]


Out[6]:
'<p>If a currency can be worth too little (e.g. needing \\$10,000,000,000 to buy a loaf of bread) and worth too much (e.g. being able to buy a loaf of bread for $0.00001), why isn\'t there an "ideal value" (a point, range, or a shifting set of priorities based on market conditions) that a currency could be fixed or drawn toward?  </p>\n\n<p>Many everyday transactions are still done in cash, requiring mental calculations from the transacting parties. Too large numbers and too small numbers create transaction costs that, at least for the extremes, cannot be ignored.  So conceivably, some "middle" point could be considered "ideal" as regards the <em>efficiency of money as medium of transactions</em>.</p>\n\n<p>What forces may prevent such an "ideal value" for a currency from being established?</p>\n'

Okay, so at a minimum we need to filter out the HTML tags and remove the backslashes. Let's prepare to do that by writing a simple function that leverages the excellent BeautifulSoup4 library.


In [7]:
def cleanup_text(text):
    soup = bs4.BeautifulSoup(text, "html5lib")
    return [p.get_text().translate({ord('\\'): None}) for p in soup.find_all('p')]

In [8]:
cleanup_text(df_body.iloc[0])


Out[8]:
['If a currency can be worth too little (e.g. needing $10,000,000,000 to buy a loaf of bread) and worth too much (e.g. being able to buy a loaf of bread for $0.00001), why isn\'t there an "ideal value" (a point, range, or a shifting set of priorities based on market conditions) that a currency could be fixed or drawn toward?  ',
 'Many everyday transactions are still done in cash, requiring mental calculations from the transacting parties. Too large numbers and too small numbers create transaction costs that, at least for the extremes, cannot be ignored.  So conceivably, some "middle" point could be considered "ideal" as regards the efficiency of money as medium of transactions.',
 'What forces may prevent such an "ideal value" for a currency from being established?']

We got workable text here - let's fire up spaCy and run this through a full NLP pipeline.

spaCy is an NLP library that bundles a whole set of models and algorithms for the most popular NLP preprocessing tasks. It's very robust, and most of its results are very close to the current state of the art.


In [13]:
import spacy

nlp = spacy.load('en_core_web_md')

In [36]:
test_output = nlp(' '.join(cleanup_text(df_body.iloc[0])))
test_output


Out[36]:
If a currency can be worth too little (e.g. needing $10,000,000,000 to buy a loaf of bread) and worth too much (e.g. being able to buy a loaf of bread for $0.00001), why isn't there an "ideal value" (a point, range, or a shifting set of priorities based on market conditions) that a currency could be fixed or drawn toward?   Many everyday transactions are still done in cash, requiring mental calculations from the transacting parties. Too large numbers and too small numbers create transaction costs that, at least for the extremes, cannot be ignored.  So conceivably, some "middle" point could be considered "ideal" as regards the efficiency of money as medium of transactions. What forces may prevent such an "ideal value" for a currency from being established?

Let's print out the result of sentence segmentation...


In [37]:
for num, sent in enumerate(test_output.sents):
    print('{}> {}\n'.format(num, sent))


0> If a currency can be worth too little (e.g. needing $10,000,000,000 to buy a loaf of bread) and worth too much (e.g. being able to buy a loaf of bread for $0.00001), why isn't there an "ideal value" (a point, range, or a shifting set of priorities based on market conditions) that a currency could be fixed or drawn toward?   

1> Many everyday transactions are still done in cash, requiring mental calculations from the transacting parties.

2> Too large numbers and too small numbers create transaction costs that, at least for the extremes, cannot be ignored.  

3> So conceivably, some "middle" point could be considered "ideal" as regards the efficiency of money as medium of transactions.

4> What forces may prevent such an "ideal value" for a currency from being established?

What about Named Entities?


In [39]:
for num, ent in enumerate(test_output.ents):
    print('{}) {} - {}\n'.format(num+1, ent, ent.label_))


1) 10,000,000,000 - MONEY

2) 0.00001 - MONEY

Part-of-Speech tagging identifies the syntactic function of each token/word.

Stemming and lemmatization are "normalizing" functions: they try to identify a common root for words of the same family.
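To see the difference concretely, here is a toy suffix-stripping stemmer (purely illustrative - not what spaCy does; spaCy only lemmatizes):

```python
def crude_stem(word):
    """Naive suffix stripping - an illustration of stemming only."""
    for suffix in ("ations", "ation", "ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization maps words to dictionary forms; a stemmer just chops
# suffixes, so its output is not always a real word:
print(crude_stem("transactions"))  # transaction
print(crude_stem("calculations"))  # calcul - not a word, unlike the lemma "calculation"
```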


In [42]:
for token, lemma, pos in ((t.orth_, t.lemma_, t.pos_) for t in test_output):
    print('{}|{} - {}\n'.format(token, lemma, pos))


If|if - ADP

a|a - DET

currency|currency - NOUN

can|can - VERB

be|be - VERB

worth|worth - ADJ

too|too - ADV

little|little - ADJ

(|( - PUNCT

e.g.|e.g. - ADV

needing|need - VERB

$|$ - SYM

10,000,000,000|10,000,000,000 - NUM

to|to - PART

buy|buy - VERB

a|a - DET

loaf|loaf - NOUN

of|of - ADP

bread|bread - NOUN

)|) - PUNCT

and|and - CCONJ

worth|worth - ADJ

too|too - ADV

much|much - ADJ

(|( - PUNCT

e.g.|e.g. - NOUN

being|be - VERB

able|able - ADJ

to|to - PART

buy|buy - VERB

a|a - DET

loaf|loaf - NOUN

of|of - ADP

bread|bread - NOUN

for|for - ADP

$|$ - SYM

0.00001|0.00001 - NUM

)|) - PUNCT

,|, - PUNCT

why|why - ADV

is|be - VERB

n't|not - ADV

there|there - ADV

an|an - DET

"|" - PUNCT

ideal|ideal - ADJ

value|value - NOUN

"|" - PUNCT

(|( - PUNCT

a|a - DET

point|point - NOUN

,|, - PUNCT

range|range - VERB

,|, - PUNCT

or|or - CCONJ

a|a - DET

shifting|shift - VERB

set|set - NOUN

of|of - ADP

priorities|priority - NOUN

based|base - VERB

on|on - ADP

market|market - NOUN

conditions|condition - NOUN

)|) - PUNCT

that|that - ADP

a|a - DET

currency|currency - NOUN

could|could - VERB

be|be - VERB

fixed|fix - VERB

or|or - CCONJ

drawn|draw - VERB

toward|toward - ADP

?|? - PUNCT

  |   - SPACE

Many|many - ADJ

everyday|everyday - ADJ

transactions|transaction - NOUN

are|be - VERB

still|still - ADV

done|do - VERB

in|in - ADP

cash|cash - NOUN

,|, - PUNCT

requiring|require - VERB

mental|mental - ADJ

calculations|calculation - NOUN

from|from - ADP

the|the - DET

transacting|transacting - NOUN

parties|party - NOUN

.|. - PUNCT

Too|too - ADV

large|large - ADJ

numbers|number - NOUN

and|and - CCONJ

too|too - ADV

small|small - ADJ

numbers|number - NOUN

create|create - VERB

transaction|transaction - NOUN

costs|cost - NOUN

that|that - ADJ

,|, - PUNCT

at|at - ADP

least|least - ADJ

for|for - ADP

the|the - DET

extremes|extreme - NOUN

,|, - PUNCT

can|can - VERB

not|not - ADV

be|be - VERB

ignored|ignore - VERB

.|. - PUNCT

 |  - SPACE

So|so - ADV

conceivably|conceivably - ADV

,|, - PUNCT

some|some - DET

"|" - PUNCT

middle|middle - ADJ

"|" - PUNCT

point|point - NOUN

could|could - VERB

be|be - VERB

considered|consider - VERB

"|" - PUNCT

ideal|ideal - ADJ

"|" - PUNCT

as|as - ADP

regards|regard - VERB

the|the - DET

efficiency|efficiency - NOUN

of|of - ADP

money|money - NOUN

as|as - ADP

medium|medium - NOUN

of|of - ADP

transactions|transaction - NOUN

.|. - PUNCT

What|what - ADJ

forces|force - NOUN

may|may - VERB

prevent|prevent - VERB

such|such - ADJ

an|an - DET

"|" - PUNCT

ideal|ideal - ADJ

value|value - NOUN

"|" - PUNCT

for|for - ADP

a|a - DET

currency|currency - NOUN

from|from - ADP

being|be - VERB

established|establish - VERB

?|? - PUNCT

Phrase Modelling - Bigrams and Trigrams


In [45]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

In [142]:
def is_invalid_token(token):
    return token.is_punct or token.is_space or token.lemma_ == '-PRON-'

def cleanup_html(corpus):
    for doc in corpus:
        soup = bs4.BeautifulSoup(doc, "html5lib")
        yield ' '.join([p.get_text().translate({ord('\\'): None}) for p in soup.find_all('p')])

In [143]:
unigram_fname = '../data/processed/unigram_processed_posts.txt'

In [144]:
%%time

with open(unigram_fname, 'w') as fout:
    for parsed_corpus in nlp.pipe(cleanup_html((i[1] for i in df_body.iteritems())), n_threads=8):
        for sent in parsed_corpus.sents:
            fout.write('{}\n'.format(' '.join([t.lemma_ for t in sent if not is_invalid_token(t)])))


CPU times: user 2min 41s, sys: 1.38 s, total: 2min 42s
Wall time: 58.6 s

In [145]:
unigram_sentences = LineSentence(unigram_fname)

In [151]:
bigram_model_fname = '../data/processed/bigram.model'

In [152]:
%%time
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_fname)


CPU times: user 4.21 s, sys: 244 ms, total: 4.45 s
Wall time: 4.28 s

In [153]:
bigram_fname = '../data/processed/bigram_processed_posts.txt'

In [154]:
with open(bigram_fname, 'w') as fout:
    for sent in unigram_sentences:
        bigram_sent = ' '.join(bigram_model[sent])
        fout.write(bigram_sent + '\n')


/home/iceman/.pyenv/versions/3.6.2rc1/envs/ml362/lib/python3.6/site-packages/gensim/models/phrases.py:316: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")

In [155]:
bigram_sentences = LineSentence(bigram_fname)

In [156]:
trigram_model_fname = '../data/processed/trigram.model'

In [157]:
%%time
trigram_model = Phrases(bigram_sentences)
trigram_model.save(trigram_model_fname)


CPU times: user 4 s, sys: 139 ms, total: 4.14 s
Wall time: 3.99 s

In [158]:
trigram_fname = '../data/processed/trigram_processed_posts.txt'

In [159]:
with open(trigram_fname, 'w') as fout:
    for sent in bigram_sentences:
        trigram_sent = ' '.join(trigram_model[sent])
        fout.write(trigram_sent + '\n')


/home/iceman/.pyenv/versions/3.6.2rc1/envs/ml362/lib/python3.6/site-packages/gensim/models/phrases.py:316: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")

In [160]:
final_fname = '../data/processed/final_processed_trigram_posts.txt'

In [162]:
with open(final_fname, 'w') as fout:
    for parsed_post in nlp.pipe(cleanup_html((i[1] for i in df_body.iteritems())), n_threads=10):
        unigram_post = [t.lemma_ for t in parsed_post if not is_invalid_token(t)]
        bigram_post = bigram_model[unigram_post]
        trigram_post = trigram_model[bigram_post]
        
        trigram_post = [term for term in trigram_post if term not in spacy.en.STOP_WORDS]
        
        fout.write('{}\n'.format(' '.join(trigram_post)))


/home/iceman/.pyenv/versions/3.6.2rc1/envs/ml362/lib/python3.6/site-packages/gensim/models/phrases.py:316: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")

In [ ]: