Deep learning for Natural Language Processing

  • Simple text representations, bag of words
  • Word embedding and... not just another word2vec this time
  • 1-dimensional convolutions for text
  • Aggregating several data sources "the hard way"
  • Solving a ~somewhat~ real ML problem with ~almost~ end-to-end deep learning

Special thanks to Irina Golzmann for help with the technical part.

NLTK

You will need nltk v3.2 to solve this assignment.

It is really important that the version is 3.2; otherwise the Russian tokenizer might not work.

Install/update

  • sudo pip install --upgrade nltk==3.2
  • If you don't remember when you last upgraded pip: sudo pip install --upgrade pip

If for some reason you can't or won't switch to nltk v3.2, just make sure that Russian words are tokenized properly with RegexpTokenizer.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Dataset

A former Kaggle competition on job salary prediction.

Original contest - https://www.kaggle.com/c/job-salary-prediction

Download

Go here and download as usual

CSC cloud: data should already be here somewhere, just poke the nearest instructor.

What's inside

Different kinds of features:

  • 2 text fields - title and description
  • Categorical fields - contract type, location

One continuous target - SalaryNormalized, i.e. how much the advertised job pays per year.

  • this is the value we will be predicting (a regression task)
  • diving into the data may result in prolonged sleep disorders

In [7]:
df = pd.read_csv("./Train_rev1.csv",sep=',')

In [16]:
print df.shape, df.SalaryNormalized.mean()
df[:5]


 (244768, 12) 34122.5775755
Out[16]:
Id Title FullDescription LocationRaw LocationNormalized ContractType ContractTime Company Category SalaryRaw SalaryNormalized SourceName
0 12612628 Engineering Systems Analyst Engineering Systems Analyst Dorking Surrey Sal... Dorking, Surrey, Surrey Dorking NaN permanent Gregory Martin International Engineering Jobs 20000 - 30000/annum 20-30K 25000 cv-library.co.uk
1 12612830 Stress Engineer Glasgow Stress Engineer Glasgow Salary **** to **** We... Glasgow, Scotland, Scotland Glasgow NaN permanent Gregory Martin International Engineering Jobs 25000 - 35000/annum 25-35K 30000 cv-library.co.uk
2 12612844 Modelling and simulation analyst Mathematical Modeller / Simulation Analyst / O... Hampshire, South East, South East Hampshire NaN permanent Gregory Martin International Engineering Jobs 20000 - 40000/annum 20-40K 30000 cv-library.co.uk
3 12613049 Engineering Systems Analyst / Mathematical Mod... Engineering Systems Analyst / Mathematical Mod... Surrey, South East, South East Surrey NaN permanent Gregory Martin International Engineering Jobs 25000 - 30000/annum 25K-30K negotiable 27500 cv-library.co.uk
4 12613647 Pioneer, Miser Engineering Systems Analyst Pioneer, Miser Engineering Systems Analyst Do... Surrey, South East, South East Surrey NaN permanent Gregory Martin International Engineering Jobs 20000 - 30000/annum 20-30K 25000 cv-library.co.uk

Tokenizing

First, we create a dictionary of all existing words. Assign each word a number - its ID.


In [18]:
from nltk.tokenize import RegexpTokenizer
from collections import Counter,defaultdict
tokenizer = RegexpTokenizer(r"\w+")

#Dictionary of tokens
token_counts = Counter()

#All texts
all_texts = np.hstack([df.FullDescription.values,df.Title.values])


#Compute token frequencies
for s in all_texts:
    if type(s) is not str:
        continue
    s = s.decode('utf8').lower()
    tokens = tokenizer.tokenize(s)
    for token in tokens:
        token_counts[token] +=1

Remove rare tokens

We are unlikely to make use of words that are seen only a few times throughout the corpus.

Again, if you want to beat Kaggle competition metrics, consider doing something better.


In [19]:
#Word frequency distribution, just for kicks
_=plt.hist(token_counts.values(),range=[0,50],bins=50)



In [23]:
#Select only the tokens that occur at least min_count times in the corpus.
#Use token_counts.

min_count = 5
tokens = <tokens from token_counts keys that occur at least min_count times throughout the dataset>
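A minimal sketch of one way to fill in the blank above, using the token_counts Counter built earlier:

tokens = [token for token, count in token_counts.items() if count >= min_count]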

In [24]:
token_to_id = {t:i+1 for i,t in enumerate(tokens)}
null_token = "NULL"
token_to_id[null_token] = 0

In [25]:
print "# Tokens:",len(token_to_id)
if len(token_to_id) < 10000:
    print "Alarm! It seems like there are too few tokens. Make sure you updated NLTK and applied correct thresholds -- unless you now what you're doing, ofc"
if len(token_to_id) > 100000:
    print "Alarm! Too many tokens. You might have messed up when pruning rare ones -- unless you know what you're doin' ofc"


# Tokens: 44867

Replace words with IDs

Set a maximum length for titles and descriptions.

  • If a string is longer than that limit - crop it; if shorter - pad with zeros.
  • Thus we obtain a matrix of size [n_samples]x[max_length]
  • The element at (i, j) is the identifier of word j within sample i

In [34]:
def vectorize(strings, token_to_id, max_len=150):
    token_matrix = []
    for s in strings:
        if type(s) is not str:
            token_matrix.append([0]*max_len)
            continue
        s = s.decode('utf8').lower()
        tokens = tokenizer.tokenize(s)
        token_ids = map(lambda token: token_to_id.get(token,0), tokens)[:max_len]
        token_ids += [0]*(max_len - len(token_ids))
        token_matrix.append(token_ids)

    return np.array(token_matrix)

In [35]:
desc_tokens = vectorize(df.FullDescription.values,token_to_id,max_len = 500)
title_tokens = vectorize(df.Title.values,token_to_id,max_len = 15)

Data format examples


In [36]:
print "Matrix size:",title_tokens.shape
for title, tokens in zip(df.Title.values[:3],title_tokens[:3]):
    print title,'->', tokens[:10],'...'


Matrix size: (244768, 15)
Engineering Systems Analyst -> [38462 12311  1632     0     0     0     0     0     0     0] ...
Stress Engineer Glasgow -> [19749 41620  5861     0     0     0     0     0     0     0] ...
Modelling and simulation analyst -> [23387 16330 32144  1632     0     0     0     0     0     0] ...

As you can see, our preprocessing is somewhat crude. Let us see if that is enough for our network.

Non-sequences

Some features are categorical, e.g. location, contract type, company.

They require a separate preprocessing step.


In [ ]:
#One-hot-encoded category and subcategory

from sklearn.feature_extraction import DictVectorizer

categories = []
data_cat = df[["Category","LocationNormalized","ContractType","ContractTime"]]


categories = <a list of dictionaries {feature_name: value} for each data sample, built from data_cat>

    

vectorizer = DictVectorizer(sparse=False)
df_non_text = vectorizer.fit_transform(categories)
df_non_text = pd.DataFrame(df_non_text,columns=vectorizer.feature_names_)
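One possible way to build the categories list above - a sketch that turns each row of data_cat into a {column: value} dict and replaces missing values with a placeholder string so DictVectorizer can one-hot them:

categories = data_cat.fillna("NaN").to_dict(orient='records')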

Split data into training and test


In [ ]:
#Target variable - the normalized salary we want to predict
target = df.SalaryNormalized.values.astype('float32')
#Preprocessed titles
title_tokens = title_tokens.astype('int32')
#Preprocessed descriptions
desc_tokens = desc_tokens.astype('int32')
#Non-sequences
df_non_text = df_non_text.astype('float32')

In [ ]:
#Split into training and test set.


#Difficulty selector:
#Easy: split randomly
#Medium: split by companies, make sure no company is in both train and test set
#Hard: do whatever you want, but score yourself using kaggle private leaderboard

title_tr,title_ts,desc_tr,desc_ts,nontext_tr,nontext_ts,target_tr,target_ts = <define_these_variables>
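A minimal sketch of the "easy" option, assuming a random split with scikit-learn (train_test_split lives in sklearn.model_selection in recent versions and in sklearn.cross_validation in older ones); the 25% test share and the random seed are arbitrary:

from sklearn.model_selection import train_test_split

title_tr,title_ts,desc_tr,desc_ts,nontext_tr,nontext_ts,target_tr,target_ts = train_test_split(
    title_tokens, desc_tokens, df_non_text.values, target,
    test_size=0.25, random_state=42)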

Save preprocessed data [optional]

  • The next cell can be used to stash all the essential data matrices and get rid of the rest of the data.
    • Highly recommended if you have less than 1.5GB RAM left
  • To do that, first run it with save_prepared_data=True, then restart the notebook and only run this cell with read_prepared_data=True.

In [ ]:
save_prepared_data = True #save
read_prepared_data = False #load

#but not both at once
assert not (save_prepared_data and read_prepared_data)


if save_prepared_data:
    print "Saving preprocessed data (may take up to 3 minutes)"

    import pickle
    data_tuple = title_tr,title_ts,desc_tr,desc_ts,nontext_tr,nontext_ts,target_tr,target_ts
    with open("preprocessed_data.pcl",'wb') as fout:
        pickle.dump(data_tuple,fout)
    with open("token_to_id.pcl",'wb') as fout:
        pickle.dump(token_to_id,fout)

    print "done"
    
elif read_prepared_data:
    print "Reading saved data..."
    
    import pickle
    
    with open("preprocessed_data.pcl",'r') as fin:
        data_tuple = pickle.load(fin)
    title_tr,title_ts,desc_tr,desc_ts,nontext_tr,nontext_ts,target_tr,target_ts = data_tuple
    with open("token_to_id.pcl",'r') as fin:
        token_to_id = pickle.load(fin)


        
    #Re-importing libraries to allow starting the notebook from here
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline

        
    print "done"

Train the monster

Since we have several data sources, our neural network may differ from what you are used to working with.

  • Separate input for titles
    • cnn+global max or RNN
  • Separate input for description
    • cnn+global max or RNN
  • Separate input for categorical features
    • A few dense layers + some black magic if you want

These three inputs must be blended somehow - concatenated or added.

  • Output: a simple regression task

In [1]:
#libraries
import lasagne
from theano import tensor as T
import theano


/usr/local/lib/python2.7/dist-packages/Theano-0.8.0rc1-py2.7.egg/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
  warnings.warn("downsample module has been moved to the pool module.")

In [ ]:
#3 inputs and one target output
title_token_ids = T.matrix("title_token_ids",dtype='int32')
desc_token_ids = T.matrix("desc_token_ids",dtype='int32')
categories = T.matrix("categories",dtype='float32')
target_y = T.vector("target",dtype='float32')

NN architecture


In [ ]:
title_inp = lasagne.layers.InputLayer((None,title_tr.shape[1]),input_var=title_token_ids)
descr_inp = lasagne.layers.InputLayer((None,desc_tr.shape[1]),input_var=desc_token_ids)
cat_inp = lasagne.layers.InputLayer((None,nontext_tr.shape[1]), input_var=categories)

In [ ]:
# Descriptions

#word-wise embedding. We recommend starting with something like 64 and increasing it once you are certain it works.

descr_nn = lasagne.layers.EmbeddingLayer(descr_inp,
                                         input_size=len(token_to_id)+1,
                                         output_size=?)


#reshape from [batch, time, unit] to [batch,unit,time] to allow 1d convolution over time
descr_nn = lasagne.layers.DimshuffleLayer(descr_nn, [0,2,1])

descr_nn = <1D convolution over the embedding, maybe several of them in a stack>

#pool over time
descr_nn = lasagne.layers.GlobalPoolLayer(descr_nn,T.max)

#Possible improvements here are adding several parallel convs with different filter sizes or stacking them the usual way
#1dconv -> 1d max pool ->1dconv and finally global pool 


# Titles
title_nn = <Process titles somehow (title_inp)>

# Non-sequences
cat_nn = <Process non-sequences(cat_inp)>
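One possible way to fill in the blanks above - a minimal sketch rather than a tuned architecture; the embedding size of 64, a single ReLU convolution per text input and the 64-unit dense layer for the categorical features are arbitrary starting points:

# Descriptions: embedding -> 1d convolution over time -> global max pooling
descr_nn = lasagne.layers.EmbeddingLayer(descr_inp, input_size=len(token_to_id)+1, output_size=64)
descr_nn = lasagne.layers.DimshuffleLayer(descr_nn, [0,2,1])
descr_nn = lasagne.layers.Conv1DLayer(descr_nn, num_filters=64, filter_size=3,
                                      nonlinearity=lasagne.nonlinearities.rectify)
descr_nn = lasagne.layers.GlobalPoolLayer(descr_nn, T.max)

# Titles: same pattern, smaller filter since titles are short
title_nn = lasagne.layers.EmbeddingLayer(title_inp, input_size=len(token_to_id)+1, output_size=64)
title_nn = lasagne.layers.DimshuffleLayer(title_nn, [0,2,1])
title_nn = lasagne.layers.Conv1DLayer(title_nn, num_filters=64, filter_size=2,
                                      nonlinearity=lasagne.nonlinearities.rectify)
title_nn = lasagne.layers.GlobalPoolLayer(title_nn, T.max)

# Non-sequences: a single dense layer over the one-hot categorical features
cat_nn = lasagne.layers.DenseLayer(cat_inp, 64, nonlinearity=lasagne.nonlinearities.rectify)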

In [ ]:
nn = <merge three layers into one (e.g. lasagne.layers.concat) >                                  

nn = lasagne.layers.DenseLayer(nn,your_lucky_number)
nn = lasagne.layers.DropoutLayer(nn,p=maybe_use_me)
nn = lasagne.layers.DenseLayer(nn,1,nonlinearity=lasagne.nonlinearities.linear)
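Again, a sketch of one way to fill in the blanks (256 hidden units and 50% dropout are just reasonable defaults, not tuned values):

nn = lasagne.layers.concat([descr_nn, title_nn, cat_nn])

nn = lasagne.layers.DenseLayer(nn, 256, nonlinearity=lasagne.nonlinearities.rectify)
nn = lasagne.layers.DropoutLayer(nn, p=0.5)
nn = lasagne.layers.DenseLayer(nn, 1, nonlinearity=lasagne.nonlinearities.linear)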

Loss function

  • The standard way:
    • prediction
    • loss
    • updates
    • training and evaluation functions

In [ ]:
#All trainable params
weights = lasagne.layers.get_all_params(nn,trainable=True)

In [ ]:
#Simple NN prediction
prediction = lasagne.layers.get_output(nn)[:,0]

#loss function
loss = lasagne.objectives.squared_error(prediction,target_y).mean()

In [ ]:
#Weight optimization step
updates = <your favorite optimizer>
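For instance, a sketch using Adam with the library's default learning rate (any optimizer from lasagne.updates would do):

updates = lasagne.updates.adam(loss, weights)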

Deterministic prediction

  • In case we use stochastic elements, e.g. dropout or noise
  • Compile a separate set of functions with deterministic prediction (deterministic = True)
  • Unless you think there's no need for dropout there, ofc. Btw, is there?

In [ ]:
#deterministic version
det_prediction = lasagne.layers.get_output(nn,deterministic=True)[:,0]

#equivalent loss function
det_loss = <an exercise in copy-pasting and editing>
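The copy-paste in question, as a sketch - the same squared error, only computed on the deterministic prediction:

det_loss = lasagne.objectives.squared_error(det_prediction,target_y).mean()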

Coffee-lation


In [ ]:
train_fun = theano.function([desc_token_ids,title_token_ids,categories,target_y],[loss,prediction],updates = updates)
eval_fun = theano.function([desc_token_ids,title_token_ids,categories,target_y],[det_loss,det_prediction])

Training loop

  • The regular way with loops over minibatches
  • Since the dataset is huge, we define an epoch as some fixed number of samples instead of the whole dataset

In [ ]:
# Our good old minibatch iterator now supports an arbitrary number of arrays (X,y,z)

def iterate_minibatches(*arrays,**kwargs):
    
    batchsize=kwargs.get("batchsize",100)
    shuffle = kwargs.get("shuffle",True)
    
    if shuffle:
        indices = np.arange(len(arrays[0]))
        np.random.shuffle(indices)
    for start_idx in range(0, len(arrays[0]) - batchsize + 1, batchsize):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield [arr[excerpt] for arr in arrays]

Tweaking guide

  • batch_size - how many samples are processed per function call
    • optimization gets slower, but more stable, as you increase it.
    • You may consider increasing it halfway through training
  • minibatches_per_epoch - the maximum number of minibatches per epoch
    • Does not affect training itself. A smaller value means more frequent but less stable printing
    • Setting it below 10 is only meaningful if you want to make sure your NN does not break down after one epoch
  • n_epochs - the total number of epochs to train for
    • n_epochs = 10**10 together with manual interruption is still an option

Tips:

  • With a small minibatches_per_epoch, network quality may jump up and down for several epochs

  • Plotting metrics over training time may be a good way to analyze which architectures work better.

  • Once you are sure your network ain't gonna crash, it's worth letting it train for a few hours of an average laptop's time to see its true potential


In [ ]:
from sklearn.metrics import mean_squared_error,mean_absolute_error


n_epochs = 100
batch_size = 100
minibatches_per_epoch = 100


for i in range(n_epochs):
    
    #training
    epoch_y_true = []
    epoch_y_pred = []
    
    b_c = b_loss = 0
    for j, (b_desc,b_title,b_cat, b_y) in enumerate(
        iterate_minibatches(desc_tr,title_tr,nontext_tr,target_tr,batchsize=batch_size,shuffle=True)):
        if j > minibatches_per_epoch:break
            
        loss,pred_probas = train_fun(b_desc,b_title,b_cat,b_y)
        
        b_loss += loss
        b_c +=1
        
        epoch_y_true.append(b_y)
        epoch_y_pred.append(pred_probas)

    
    epoch_y_true = np.concatenate(epoch_y_true)
    epoch_y_pred = np.concatenate(epoch_y_pred)
    
    print "Train:"
    print '\tloss:',b_loss/b_c
    print '\trmse:',mean_squared_error(epoch_y_true,epoch_y_pred)**.5
    print '\tmae:',mean_absolute_error(epoch_y_true,epoch_y_pred)
    
    
    #evaluation
    epoch_y_true = []
    epoch_y_pred = []
    b_c = b_loss = 0
    for j, (b_desc,b_title,b_cat, b_y) in enumerate(
        iterate_minibatches(desc_ts,title_ts,nontext_ts,target_ts,batchsize=batch_size,shuffle=True)):
        if j > minibatches_per_epoch: break
        loss,pred_probas = eval_fun(b_desc,b_title,b_cat,b_y)
        
        b_loss += loss
        b_c +=1
        
        epoch_y_true.append(b_y)
        epoch_y_pred.append(pred_probas)

    
    epoch_y_true = np.concatenate(epoch_y_true)
    epoch_y_pred = np.concatenate(epoch_y_pred)
    
    print "Val:"
    print '\tloss:',b_loss/b_c
    print '\trmse:',mean_squared_error(epoch_y_true,epoch_y_pred)**.5
    print '\tmae:',mean_absolute_error(epoch_y_true,epoch_y_pred)

In [2]:
print "If you are seeing this, it's time to backup your notebook. No, really, 'tis too easy to mess up everything without noticing. "


If you are seeing this, it's time to backup your notebook. No, really, 'tis too easy to mess up everything without noticing. 

Final evaluation

Evaluate network over the entire test set


In [ ]:
#evaluation
epoch_y_true = []
epoch_y_pred = []

b_c = b_loss = 0
for j, (b_desc,b_title,b_cat, b_y) in enumerate(
    iterate_minibatches(desc_ts,title_ts,nontext_ts,target_ts,batchsize=batch_size,shuffle=True)):
    loss,pred_probas = eval_fun(b_desc,b_title,b_cat,b_y)

    b_loss += loss
    b_c +=1

    epoch_y_true.append(b_y)
    epoch_y_pred.append(pred_probas)


epoch_y_true = np.concatenate(epoch_y_true)
epoch_y_pred = np.concatenate(epoch_y_pred)

print "Scores:"
print '\tloss:',b_loss/b_c
print '\trmse:',mean_squared_error(epoch_y_true,epoch_y_pred)**.5
print '\tmae:',mean_absolute_error(epoch_y_true,epoch_y_pred)

Now tune the monster for the lowest MSE you can get!

Next time in our show

  • Recurrent neural networks
    • How to apply them to practical problems?
    • What else can they do?
    • Why so much hype around LSTM?
  • Stay tuned!

In [ ]: