Personality prediction from tweet

by Angelo Basile


In [1]:
import numpy as np
np.random.seed(113) #set seed before any keras import
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from collections import defaultdict
from keras.preprocessing import sequence
from collections import Counter
import pydot


Using TensorFlow backend.

Dataset


In [2]:
seed=0
corpus = pd.read_csv('twistytest.csv', 
                     index_col=0, 
                     header=1, 
                     names=['user_id', 'lang', 'text', 'mbti'])
corpus.sample(5)


Out[2]:
user_id lang text mbti
436 206814204 it ['prostituzione intellettuale (cit.futura) htt... ENFJ
121 1520373739 it ['@nanawani_ haha grazie a Dio la lau ci fa la... INTJ
214 883135740 it ["Certo, poi devo prendere ripetizioni di anal... INFJ
885 13264792 de ['@GoldeCarlsson @theRosenblatts Ich denk die ... INFP
104 169452362 it ['Un ora e mezza di attesa. Bello schifo', 'Da... INFP

In [13]:
#here we limit the corpus size. The SVM with all the text can learn somethign
corpus.text = corpus.text.apply(lambda x: x[:1000]) 
corpus.mbti = corpus.mbti.apply(lambda x: x[0])

#corpus = tmp.sample(frac=1, random_state=seed)
e = corpus[corpus.mbti.apply(lambda x: x == 'E')]
i = corpus[corpus.mbti.apply(lambda x: x == 'I')].sample(226)
corpus = pd.concat([e,i]).sample(frac=0.3, random_state=seed)
print(corpus.shape)

## set max length of doc per author
sentences = corpus.text#.apply(lambda x: x[:100000])
## trim labels: convert problem to binary classification I vs E
labels = corpus.mbti

## make sure we have a label for every data instance
assert(len(sentences)==len(labels))
data={}
np.random.seed(113) #seed
data['target']= np.random.permutation(labels)
np.random.seed(113) # use same seed!
data['data'] = np.random.permutation(sentences)


(136, 4)

In [14]:
# preview the dataset
print(corpus.shape)
corpus.head()


(136, 4)
Out[14]:
user_id lang text mbti
610 430124293 de ['@Schmidtlepp Frage vor Wahl zum BuPräs:gesch... E
836 621638341 de ["damn. i'm here again.", "@chaospiral hey, i'... E
527 181297248 de ['Der @eigensinn83 schwafelt meine Timeline zu... I
170 271514739 it ['"Tu con le fave dovevi intendertene! E poi i... I
855 1083774439 de ['Trailer Salzburger Festspiele 2013: http://t... I

In [15]:
# plot the distribution of labels

import matplotlib.pyplot as plt

l, v = zip(*Counter(y_train).items())
indexes = np.arange(len(l))
width = 1
plt.bar(indexes, v, width, color=['r', 'b'])
plt.xticks(indexes + width * 0.5, l)
plt.show()



In [16]:
#split the data into train, dev, test

X_rest, X_test, y_rest, y_test = train_test_split(data['data'], data['target'], test_size=0.2)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.2)
del X_rest, y_rest
print("#train instances: {} #dev: {} #test: {}".format(len(X_train),len(X_dev),len(X_test)))


#train instances: 86 #dev: 22 #test: 28

Baseline

For the baseline we use an SVM with a sparse feature representation.

We use both character- and word-ngrams.


In [ ]:
from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

pipeline = Pipeline([('features', FeatureUnion([('wngram', TfidfVectorizer(ngram_range=(1,2))),
                                                ('cngram', TfidfVectorizer(analyzer='char'))])),
                     ('cls', LinearSVC())])
pipeline.fit(X_train, y_train)

Results

The SVM works quite well already: we outperform the random baseline by a signficant margin.


In [12]:
testpred = pipeline.predict(X_test)
print(accuracy_score(testpred, y_test))
print(classification_report(testpred, y_test))


0.725274725275
             precision    recall  f1-score   support

          E       0.65      0.74      0.69        38
          I       0.79      0.72      0.75        53

avg / total       0.73      0.73      0.73        91

Neural network

First we have to encode the labels in the one-hot format. Since this is a binary classification format, we don't convert them to a categorical format.


In [18]:
from keras.utils import np_utils
y2i = defaultdict(lambda: len(y2i))
y_train_num = [y2i[mbti] for mbti in y_train]
y_dev_num = [y2i[mbti] for mbti in y_dev]
y_test_num = [y2i[mbti] for mbti in y_test]
num_classes = len(np.unique(y_train_num))
print(num_classes)


2

Text representation

For the baseline we used a one-hot encoding. For our neural model we are going to represent the text using a dense representation. We will be building it from characters.


In [19]:
from collections import defaultdict

# convert words to indices, taking care of UNKs
def get_characters(sentence, c2i):
    out = []
    for word in sentence.split(" "):
        chars = []
        for c in word:
            chars.append(c2i[c])
        out.append(chars)
    return out

c2i = defaultdict(lambda: len(c2i))

PAD = c2i["<pad>"] # index 0 is padding
UNK = c2i["<unk>"] # index 1 is for UNK
X_train_num = [get_characters(sentence, c2i) for sentence in X_train]
c2i = defaultdict(lambda: UNK, c2i) # freeze - cute trick!
X_dev_num = [get_characters(sentence, c2i) for sentence in X_dev]
X_test_num = [get_characters(sentence, c2i) for sentence in X_test]

max_sentence_length=max([len(s.split(" ")) for s in X_train] 
                        + [len(s.split(" ")) for s in X_dev] 
                        + [len(s.split(" ")) for s in X_test] )
max_word_length = max([len(word)  for sentence in X_train_num for word in sentence])

In [26]:
### we need both max sent and word length
print(max_sentence_length)
print(max_word_length)
print(X_train[0:2])
print(X_train_num[0][:100]) # example how the first two sentences are encoded


182
92
[ '[\'Herzlich Willkommen, kleiner Mats! &lt;3\', \'Nein. Selbst wenn man krank ist sollte man nicht ,,Familien im Brennpunkt" schauen. Neinneinnein.\', \'Irgendwer hat heute was dagegen, dass ich zur Uni komme. Meine Schusseligkeit kennt keine Grenzen.\', \'Geschichten aus dem Waschsalon Kapitel I: ,,als mich ein alleinerziehender 40-Jähriger fragte, ob ich mal einen Kaffee mit ihm trinken gehe"\', \'He, #SiZolli! Morgen hättest Du 90 Minuten durchspielen können, weißte selbst, ne?! #fck #effzeh #traditionsverein @Rote_Teufel_RT @fckoeln\', \'@FabianM_Mueller diese Geschichte wird mit Sicherheit kein zweites Kapitel bekommen ;)\', \'Es gibt da diese eine Person, der ich dafür danken möchte, mir die Musik von #Olson gezeigt zu haben. #thisgoesoutto #dreiTageGlück\', \',,generiert" ist übrigens ein scheiß Wort und sollte nicht im Zusammenhang mit meinen Artikeln verwendet werden.\', \'@gelsen Rheinland-Pfälzischer Rosé hilft. Gegen alles.\', \'@gelsen hatte ich die letzten drei Tage auch. Es wird besser.\', \''
 '[\'"@rihanna: Mueller is always in the right position!" Zitta e a 90!\', \'@BrindaThakore Ma va a cagare!\', "@BrindaThakore Yes, but I won\'t", \'@BrindaThakore lol\', \'@BrindaThakore Hahaha, no.\', \'@brn_killers Ma tu hai capito che botta che ha preso? Quello è già fortunato se si regge ancora in piedi.\', \'"@FINDINGZAY: "@smolecolatevi: Vi comunico con enorme dispiacere che mio padre tifa l\\\'Argentina" lui sì che ragiona" ecco!\', \'@g_beyle non sei all\\\'altezza di quel "burzum".\', \'"@xlucyshug: "Grazie alla Disney" pt. 20\\n\\nCrediamo nelle fate. http://t.co/J0rIE6DyQc" @FedericaPirollo, ce ne sono molti altri\', \'Minchia, vogliamo parlare della ragazza in costume tra i tifosi tedeschi?!\', \'@animevuote ma chi vogliono darla a bere?\', \'@Wolfayn Non se tu avessi imparato i congiuntivi.\', \'@Wolfayn «Cristo» va in maiuscolo. Dopo «vuoi» ci vuole una virgola. Si dice «stammi bene» semmai.\', \'Germany is a 4 STAR #WorldChampion!\\n\\n#GERvsARG #GERvARG #GERARG #GER #ARG #GermaniaArgentina #Mondiali2014 ']
[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 9, 8, 8, 13, 14, 15, 15, 5, 16, 17], [13, 8, 5, 9, 16, 5, 6], [18, 19, 20, 21, 22], [23, 8, 20, 24, 25, 3, 17], [3, 26, 5, 9, 16, 27], [28, 5, 8, 29, 21, 20], [30, 5, 16, 16], [15, 19, 16], [13, 6, 19, 16, 13], [9, 21, 20], [21, 14, 8, 8, 20, 5], [15, 19, 16], [16, 9, 10, 11, 20], [17, 17, 31, 19, 15, 9, 8, 9, 5, 16], [9, 15], [32, 6, 5, 16, 16, 33, 34, 16, 13, 20, 35], [21, 10, 11, 19, 34, 5, 16, 27], [26, 5, 9, 16, 16, 5, 9, 16, 16, 5, 9, 16, 27, 3, 17], [3, 36, 6, 37, 5, 16, 38, 30, 5, 6], [11, 19, 20], [11, 5, 34, 20, 5], [30, 19, 21], [38, 19, 37, 5, 37, 5, 16, 17], [38, 19, 21, 21], [9, 10, 11], [7, 34, 6], [39, 16, 9], [13, 14, 15, 15, 5, 27], [18, 5, 9, 16, 5], [28, 10, 11, 34, 21, 21, 5, 8, 9, 37, 13, 5, 9, 20], [13, 5, 16, 16, 20], [13, 5, 9, 16, 5], [40, 6, 5, 16, 7, 5, 16, 27, 3, 17], [3, 40, 5, 21, 10, 11, 9, 10, 11, 20, 5, 16], [19, 34, 21], [38, 5, 15], [12, 19, 21, 10, 11, 21, 19, 8, 14, 16], [41, 19, 33, 9, 20, 5, 8], [36, 42], [17, 17, 19, 8, 21], [15, 9, 10, 11], [5, 9, 16], [19, 8, 8, 5, 9, 16, 5, 6, 7, 9, 5, 11, 5, 16, 38, 5, 6], [43, 44, 45, 46, 47, 11, 6, 9, 37, 5, 6], [48, 6, 19, 37, 20, 5, 17], [14, 29], [9, 10, 11], [15, 19, 8], [5, 9, 16, 5, 16], [41, 19, 48, 48, 5, 5], [15, 9, 20], [9, 11, 15], [20, 6, 9, 16, 13, 5, 16], [37, 5, 11, 5, 35, 3, 17], [3, 4, 5, 17], [49, 28, 9, 50, 14, 8, 8, 9, 22], [18, 14, 6, 37, 5, 16], [11, 47, 20, 20, 5, 21, 20], [51, 34], [52, 44], [18, 9, 16, 34, 20, 5, 16], [38, 34, 6, 10, 11, 21, 33, 9, 5, 8, 5, 16], [13, 53, 16, 16, 5, 16, 17], [30, 5, 9, 54, 20, 5], [21, 5, 8, 29, 21, 20, 17], [16, 5, 55, 22], [49, 48, 10, 13], [49, 5, 48, 48, 7, 5, 11], [49, 20, 6, 19, 38, 9, 20, 9, 14, 16, 21, 56, 5, 6, 5, 9, 16], [57, 58, 14, 20, 5, 59, 60, 5, 34, 48, 5, 8, 59, 58, 60], [57, 48, 10, 13, 14, 5, 8, 16, 3, 17], [3, 57, 31, 19, 29, 9, 19, 16, 18, 59, 18, 34, 5, 8, 8, 5, 6], [38, 9, 5, 21, 5], [40, 5, 21, 10, 11, 9, 10, 11, 20, 5], [30, 9, 6, 38], [15, 9, 20], [28, 9, 10, 11, 5, 6, 11, 5, 9, 20], [13, 5, 9, 16], [7, 30, 5, 9, 20, 5, 21], [41, 19, 33, 9, 20, 5, 8], [29, 5, 13, 14, 15, 15, 5, 16], [24, 61, 3, 17], [3, 62, 21], [37, 9, 29, 20], [38, 19], [38, 9, 5, 21, 5], [5, 9, 16, 5], [63, 5, 6, 21, 14, 16, 17], [38, 5, 6], [9, 10, 11], [38, 19, 48, 64, 6], [38, 19, 16, 13, 5, 16], [15, 53, 10, 11, 20, 5, 17], [15, 9, 6], [38, 9, 5], [18, 34, 21, 9, 13], [56, 14, 16], [49, 65, 8, 21, 14, 16], [37, 5, 7, 5, 9, 37, 20]]

In [27]:
def pad_words(tensor_words, max_word_len, pad_symbol_id, max_sent_len=None):
    """
    pad character list all to same word length
    """
    padded = []
    for words in tensor_words:
        if max_sent_len: #pad all to same sentence length (insert empty word list)
            words = [[[0]]*(max_sent_len-len(words))+ words][0] #prepending empty words
        padded.append(sequence.pad_sequences(words, maxlen=max_word_len, value=pad_symbol_id))
    return np.array(padded)

In [28]:
X_train_pad_char = pad_words(X_train_num, max_word_length, 0, max_sent_len=max_sentence_length)
X_dev_pad_char = pad_words(X_dev_num, max_word_length, 0, max_sent_len=max_sentence_length)
X_test_pad_char = pad_words(X_test_num, max_word_length, 0, max_sent_len=max_sentence_length)

In [29]:
X_train_pad_char.shape


Out[29]:
(86, 182, 92)

In [30]:
from keras.models import Model, Sequential
from keras.layers import Dense, Input, GRU, TimeDistributed, Embedding, Bidirectional
import keras

My model: WENP (WE Need more Power)

Instead of using a separate word embedding matrix, compose words through characters (see https://aclweb.org/anthology/W/W16/W16-4303.pdf)


In [31]:
batch_size=8
max_chars = len(c2i)
c_dim=50
c_h_dim=32
w_h_dim=32
char_vocab_size = len(c2i)

## lower-level character LSTM
input_chars = Input(shape=(max_sentence_length, max_word_length), name='main_input')

embedded_chars = TimeDistributed(Embedding(char_vocab_size, c_dim,
                                         input_length=max_word_length), name='char_embedding')(input_chars)
char_lstm = TimeDistributed(Bidirectional(GRU(c_h_dim)), name='GRU_on_char')(embedded_chars)

word_lstm_from_char = Bidirectional(GRU(w_h_dim), name='GRU_on_words')(char_lstm)

# And add a prediction node on top
predictions = Dense(1, activation='relu', name='output_layer')(word_lstm_from_char)

In [32]:
model = Model(inputs=input_chars, outputs=predictions)


model.compile(loss='binary_crossentropy', optimizer='adam',
                      metrics=['accuracy'])

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
main_input (InputLayer)      (None, 182, 92)           0         
_________________________________________________________________
char_embedding (TimeDistribu (None, 182, 92, 50)       20500     
_________________________________________________________________
GRU_on_char (TimeDistributed (None, 182, 64)           15936     
_________________________________________________________________
GRU_on_words (Bidirectional) (None, 64)                18624     
_________________________________________________________________
output_layer (Dense)         (None, 1)                 65        
=================================================================
Total params: 55,125
Trainable params: 55,125
Non-trainable params: 0
_________________________________________________________________

In [33]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(model).create(prog='dot', format='svg'))


Out[33]:
G 140511421196216 main_input: InputLayer 140511421432944 char_embedding(embedding_1): TimeDistributed(Embedding) 140511421196216->140511421432944 140511573940320 GRU_on_char(bidirectional_1): TimeDistributed(Bidirectional) 140511421432944->140511573940320 140511518174456 GRU_on_words(gru_2): Bidirectional(GRU) 140511573940320->140511518174456 140511518146456 output_layer: Dense 140511518174456->140511518146456

In [34]:
model.fit(X_train_pad_char, y_train_num, epochs=10, batch_size=8)


Epoch 1/10
86/86 [==============================] - 29s - loss: 0.8218 - acc: 0.5000    
Epoch 2/10
86/86 [==============================] - 31s - loss: 0.7154 - acc: 0.4651    
Epoch 3/10
86/86 [==============================] - 32s - loss: 0.6856 - acc: 0.5465    
Epoch 4/10
86/86 [==============================] - 31s - loss: 0.6935 - acc: 0.5465    
Epoch 5/10
86/86 [==============================] - 31s - loss: 0.6806 - acc: 0.5349    
Epoch 6/10
86/86 [==============================] - 31s - loss: 0.6781 - acc: 0.5465    
Epoch 7/10
86/86 [==============================] - 30s - loss: 0.6465 - acc: 0.6047    
Epoch 8/10
86/86 [==============================] - 29s - loss: 0.6210 - acc: 0.7442    
Epoch 9/10
86/86 [==============================] - 30s - loss: 0.5810 - acc: 0.7209    
Epoch 10/10
86/86 [==============================] - 32s - loss: 0.5137 - acc: 0.8140    
Out[34]:
<keras.callbacks.History at 0x7fcb58e602e8>

In [35]:
loss, accuracy = model.evaluate(X_test_pad_char, y_test_num)


28/28 [==============================] - 3s

In [36]:
print(accuracy)


0.5

Conclusions

:(