In [1]:
import pandas as pd

I. Basics

The whole toolbox is organized as a package tree.

You need to import the __init__.py file at the root of the package tree: it adds the toolbox top level to your path.


In [2]:
import __init__

Database

The voc database is instantiated from a given vocabulary stored in a file.

Load database


In [3]:
import cpLib.conceptDB as db

There are 2 saved database formats.

  • Text file

In [4]:
d = db.DB('../data/voc/txt/googleNews_mini.txt', verbose=False)

This format is quite slow to load for a large vocabulary. For performance reasons, we use the numpy format for large vocabularies.

  • Npy matrix and an association dictionary

The vocabulary is then split into 2 files: one containing the matrix (in npy format), the other containing the word association (a dict in json format).

NB: for both loading approaches, you can enable or disable verbose output; this is useful when dealing with files created by stdout redirection.


In [5]:
d = db.DB('../data/voc/npy/googleNews_mini.npy')


15 loaded from googleNews_mini
mem usage 17.6KiB
loaded time 0.017294883728 s
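
As a rough sketch, such a two-file database could be produced with numpy and json as below. The key names and layout here are assumptions for illustration, not necessarily the toolbox's actual convention; the convertTxtDbToNpy.py script presented in part III handles this for real.

    import json
    import numpy as np

    # hypothetical layout: one (vocSize, dim) matrix holding all vectors,
    # plus a word -> row index mapping saved as a json dict
    words = ['king', 'queen']
    matrix = np.random.rand(len(words), 300).astype(np.float32)

    np.save('myVoc.npy', matrix)
    with open('myVoc.json', 'w') as f:
        json.dump(dict((w, i) for i, w in enumerate(words)), f)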

Extract concept

We will use the term concept for a word and its associated vector.

  • Load an existing concept

In [6]:
v1 = d.get('king')
print v1, type(v1.vect), len(v1.vect)


king <type 'numpy.ndarray'> 300
  • Load a missing concept

In [7]:
v2 = d.get('toto')
print v2


None
  • Check if a concept is in the database

In [8]:
print d.has('king')
print d.has('toto')


True
False
  • Extract a random sample

In [9]:
conceptList = d.getSample(5)
print len(conceptList)
print conceptList[0]


5
woman

Find concept

A common operation is to find the closest word for a given concept.

You can do this according to several metrics:

  • Cosine similarity

In [10]:
king = d.get('king')
print d.find_cosSim(king)


[(0.6510957479476929, 'queen'), (0.22942674160003662, 'man'), (0.14300569891929626, 'tiger'), (0.1284797489643097, 'woman'), (0.12812507152557373, 'dog'), (0.1216159388422966, 'cat'), (0.11390406638383865, 'feline'), (0.07803817838430405, 'bird'), (0.06225873529911041, 'truck'), (0.06189548596739769, 'car')]
  • Euclidean distance

In [11]:
print d.find_euclDist(king)


[(2.479691982269287, 'queen'), (2.9016165733337402, '</s>'), (3.2687888145446777, 'man'), (3.673551321029663, 'woman'), (3.8052330017089844, 'car'), (3.8849875926971436, 'dog'), (3.9377858638763428, 'cat'), (3.98100209236145, 'truck'), (4.0085577964782715, 'feline'), (4.019833087921143, 'vehicle')]
  • Manhattan distance

In [12]:
print d.find_manaDist(king)


[(34.99564743041992, 'queen'), (40.426326751708984, '</s>'), (44.956417083740234, 'man'), (49.59609603881836, 'woman'), (52.44883728027344, 'car'), (54.61458206176758, 'dog'), (54.72962188720703, 'feline'), (54.86988067626953, 'cat'), (54.87595748901367, 'truck'), (55.824737548828125, 'vehicle')]
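
For reference, here are the three metrics written in plain numpy (these are the standard definitions; the toolbox presumably applies vectorized versions over the whole vocabulary):

    import numpy as np

    def cosSim(u, v):
        # higher is closer, so find_cosSim ranks descending
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def euclDist(u, v):
        # lower is closer, so find_euclDist ranks ascending
        return np.linalg.norm(u - v)

    def manaDist(u, v):
        # lower is closer, so find_manaDist ranks ascending
        return np.abs(u - v).sum()
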

Operations

You can apply several operations between concepts to build new ones.

Created concept names are in reverse Polish notation.


In [13]:
import cpLib.concept as cp

a. Add and subtract


In [14]:
v1 = cp.add(d.get('king'), d.get('man'))
v1 = cp.sub(v1, d.get('queen'))

v2 = cp.addSub([d.get('king'), d.get('man')], [d.get('queen')], normalized=True)

print v1, ' ~ ', d.find_cosSim(v1)[0][1]
print v2, ' ~ ', d.find_cosSim(v2)[0][1]


__-____+__king__man__queen  ~  woman
__-____+____n__king____n__man____n__queen  ~  woman

b. Transform

  • Normalized concept

In [15]:
k = d.get('king')
print k.normalized()


__n__king
  • Polar coordinate

Transform the Cartesian coordinates into hyperspherical ones.

The first value is the norm; the other values are angles in radians.


In [16]:
k = d.get('king')
print k.polarized()
print 'norm =', k.polarized().vect[0]
print '1st angle =', k.polarized().vect[1]


__p__king
norm = 2.90226
1st angle = 1.52738
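
For intuition, here is a minimal numpy sketch of one standard hyperspherical conversion (conventions vary, notably for the last angle, so the toolbox's exact implementation may differ):

    import numpy as np

    def toHyperspherical(v):
        # first value: the norm; then n-1 angles in radians,
        # theta_i = arctan2(||(v_{i+1}, ..., v_n)||, v_i)
        v = np.asarray(v, dtype=float)
        angles = [np.arctan2(np.linalg.norm(v[i + 1:]), v[i]) for i in range(len(v) - 1)]
        return np.concatenate(([np.linalg.norm(v)], angles))
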
  • Angular coordinates

Polar transformation without the norm.


In [17]:
k = d.get('king')
print k.angularized()
print 'vector dimension =', len(k.angularized().vect)
print '1st angle =', k.angularized().vect[0]


__a__king
vector dimension = 299
1st angle = 1.52738

Concept feature

Since this toolbox is designed first and foremost for machine learning activities, we provide some feature extraction functions:


In [18]:
import mlLib.conceptFeature as cpf
  • Identity vector

The identity vector is the raw vector in the vector space, in Cartesian coordinates.


In [19]:
k = d.get('king')
print len(cpf.identity(k))


300
  • Polar vector

We can also transform these Cartesian coordinates into hyperspherical ones.


In [20]:
k = d.get('king')
print len(cpf.polar(k))


300
  • Angular vector

And remove the norm to keep only the angles.


In [21]:
k = d.get('king')
print len(cpf.angular(k))


299

In practice, we discovered that the semantic meaning of the norm tends to be the 'specialisation' of the concept, and the angles its field of application.

Thus:

  • Angular and polar features will be more adapted to classifying domains
  • Cartesian features are useful when we need to access the 'deepness' of the concept

You can check the notebooks in the dataExploration folder for more details.

Concept pair feature

A common use case for supervised learning is to detect the relation between 2 concepts.

We also provide some comparison features for this.


In [22]:
import mlLib.conceptPairFeature as cppf

To keep a trace of the feature transformation used and allow high-level manipulation, we'll adopt the following representation for a concept pair:


In [23]:
conceptPair = (d.get('king'), 'relation', d.get('queen'))
conceptPair


Out[23]:
(king, 'relation', queen)

a. Classic feature

These are simple operations: subtraction and concatenation of the 2 concept features presented in the previous part.


In [24]:
conceptPair = (d.get('king'), 'relation', d.get('queen'))

featureDimDf = pd.DataFrame(index=['subtraction', 'concatenation'])
featureDimDf['cartesian'] = [len(feature(conceptPair)) for feature in [cppf.subCarth, cppf.concatCarth]]
featureDimDf['polar'] = [len(feature(conceptPair)) for feature in [cppf.subPolar, cppf.concatPolar]]
featureDimDf['angular'] = [len(feature(conceptPair)) for feature in [cppf.subAngular, cppf.concatAngular]]

print 'feature dimension depending on the used function'
featureDimDf


feature dimension depending on the used function
Out[24]:
               cartesian  polar  angular
subtraction          300    300      299
concatenation        600    600      598

b. Projection features

We also introduced another type of concept relation.

Based on the idea that it would be useful to compare the similarity of 2 concepts along each dimension, we introduced some 'projection metrics' features.

  • Advantage: a feature comparing the similarity of 2 concepts according to each dimension.
  • Drawback: commutative, so not useful for 'ordered' pairs

So far, we provide the projection features for the following metrics:

  • Cosine similarity
  • Euclidean distance
  • Manhattan distance

In [25]:
cppf.pCosSim(conceptPair)
cppf.pEuclDist(conceptPair)
print 'feature dimension:', len(cppf.pManaDist(conceptPair))


feature dimension: 300

In [26]:
cppf.pdCosSim(conceptPair)
cppf.pdEuclDist(conceptPair)
print 'feature dimension:', len(cppf.pdManaDist(conceptPair))


feature dimension: 300

Projection similarity

The idea is to apply a metric to the projected vectors, for each dimension of the vector. We could introduce it as:

"Setting aside dimension $i$, how similar are A and B?"

Formal approach:

  • $E$: the word vector space, $E = \mathbb{R}^{n}$
  • $a, b \in E$
  • $m$: a metric, $m: E \times E \mapsto \mathbb{R}$

Given the projection operator that removes dimension $i$:

$P_{i}(a) = (a_{j})_{j \neq i}$

We define the projection similarity for metric $m$:

$P_{m, i}(a, b) = m(P_{i}(a), P_{i}(b))$

We apply it to each dimension and get the feature vector:

$\vec{P_{m}}(a, b) = \sum \limits_{i=1}^n P_{m, i}(a, b) \vec{e_{i}}$
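
Here is a direct, unoptimized Python transcription of this definition (the toolbox's pCosSim / pEuclDist / pManaDist presumably compute the same thing in a vectorized way):

    import numpy as np

    def projSim(a, b, metric):
        # for each dimension i, drop component i from both vectors
        # and apply the metric to the two projections
        n = len(a)
        feature = np.empty(n)
        for i in range(n):
            mask = np.arange(n) != i
            feature[i] = metric(a[mask], b[mask])
        return feature

    # e.g. projSim(d.get('king').vect, d.get('queen').vect, cosSim) -> 300 values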

Projection dissimilarity

We introduced the projection dissimilarity as the difference between the metric applied to the full vectors and the metric applied to their projections, for each dimension.

We could translate it as:

"How important is this dimension $i$ important to mesure the similarity between A and B ?"

Formal approach:

We use the same notations as in the previous section to define the projection dissimilarity:

$Pd_{m, i}(a, b) = m(a, b) - m(P_{i}(a), P_{i}(b))$

Same same but different =). We again apply it to each dimension to get the feature vector:

$\vec{Pd_{m}}(a, b) = \sum \limits_{i=1}^n Pd_{m, i}(a, b) \vec{e_{i}}$
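
Reusing the projSim sketch above, this is a one-liner (again an illustration of the formula, not the toolbox's actual code):

    def projDissim(a, b, metric):
        # metric on the full vectors minus the metric on each projection P_i
        return metric(a, b) - projSim(a, b, metric)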


II. Classification

Build learning sample

For supervised learning tasks, this toolbox proposes a high-level solution:

Provide the dataset and the feature extraction function to an overlay classifier.

  • The dataset is either a list of concepts or a list of concept pairs, as described above.
  • The overlay classifier is built with a model.

Concept


In [27]:
import cpLib.conceptExtraction as cpe

conceptStrList = ['king', 'queen', 'cat', 'bird', 'king bird']

cpe.buildConceptList(d, conceptStrList, True)


Out[27]:
[king, queen, cat, bird]

The last boolean argument controls how words missing from the vocabulary are handled: here 'king bird' is either dropped (True) or composed from the vectors of known words as __m__king__bird (False).


In [28]:
cpe.buildConceptList(d, conceptStrList, False)


Out[28]:
[king, queen, cat, bird, __m__king__bird]

Concept pair

Almost the same functions are exposed for building concept pairs.


In [29]:
conceptPairStrList = [('king', 'relation', 'queen'),
                      ('man', 'relation', 'woman'),
                      ('bird', 'relation', 'cat')]


conceptPairList = cpe.buildConceptPairList(d, conceptPairStrList, True)
conceptPairList[0]


Out[29]:
(king, 'relation', queen)

To build negative samples of concept pairs, you can shuffle an existing pair list:


In [30]:
cpe.shuffledConceptPairList(conceptPairList)


Out[30]:
[(cat, 'relation', king), (queen, 'relation', man), (woman, 'relation', bird)]
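
With positive pairs and shuffled negatives in hand, a model can be fit on them. The toolbox's overlay classifier itself is not demonstrated in this notebook, so here is a rough sketch of the idea using scikit-learn directly with one of the pair features from part I (an illustration, not the toolbox's actual API):

    from sklearn.ensemble import RandomForestClassifier

    posPairList = cpe.buildConceptPairList(d, conceptPairStrList, True)
    negPairList = cpe.shuffledConceptPairList(posPairList)

    # one feature vector per pair; label 1 for real pairs, 0 for shuffled ones
    X = [cppf.subAngular(pair) for pair in posPairList + negPairList]
    y = [1] * len(posPairList) + [0] * len(negPairList)

    clf = RandomForestClassifier()
    clf.fit(X, y)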

Use a classifier

We use dill to serialize trained models. Here are a few scripts that make use of them.

You need to make sure the vocabulary file you use for prediction is the same one that was used to fit the classifier.
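
Under the hood, these scripts deserialize the model with dill. A minimal sketch of loading one by hand (assuming the .dill file holds the classifier object directly, which is an assumption):

    import dill

    with open('data/learnedModel/anto/simple__RandomForestClassifier_pCosSim_noPost.dill', 'rb') as f:
        clf = dill.load(f)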

a. Predict

These scripts use a trained classifier to predict the class of every element of a dataset.

Single Concept

This script predicts the classes of a single-concept dataset with a trained model. Here is an example:

python toolbox/script/predictConceptClass.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/domain/animal-plant-vehicle-other_strict_RandomForestClassifier_angular_noPost.dill data/domain/luu_animal.txt

ConceptPair

This script predicts the classes of a concept pair dataset with a trained model. Here is an example:

python toolbox/script/predictConceptPairClass.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/anto/simple__RandomForestClassifier_pCosSim_noPost.dill data/wordPair/wordnetAnto.txt

b. Find most likely pair

This script uses a trained concept pair classifier to find, over the whole vocabulary, the most likely match for a given single concept. Here is an example:

python toolbox/script/findMostLikelyPair.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/taxo/animal__RandomForestClassifier_subAngular_postNormalize.dill cat isParent --domain data/domain/luu_animal.txt

NB: you can restrict the search sample with the --domain argument; this avoids going through the whole vocabulary.

NB 2: RandomForest models are not well adapted to this application.

c. Evaluate a classifier

These scripts retrain a classifier model (without k-fold), evaluate it and print the report.

Single Concept

This script evaluates a model on a single-concept dataset input. Here is an example:

python toolbox/script/detailConceptClfError.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/domain/animal-other__RandomForestClassifier_angular_postNormalize.dill data/domain/luu_animal.txt animal data/domain/all_1400.txt other

ConceptPair

This script evaluates a model on a concept pair dataset input. Here is an example:

python toolbox/script/detailConceptPairClfError.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/anto/bidi__RandomForestClassifier_pCosSim_postNormalize.dill data/wordPair/wordnetAnto.txt anto data/wordPair/wordnetAnto_fake.txt notAnto

III. Build database

You may want to train your own word2vec vector space.

This toolbox comes with some existing third-party projects to train a vocabulary, and can convert the result to a Python-friendly format.

From a binary file

Here is the workflow to build a database from a bin file.

  • If you don't have a .bin vector file yet, use word2vec to train one on your corpus:

    thirdparty/word2vec/bin/word2vec -train pathToCorpus.txt -output data/voc/bin/text8.bin -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0

This is where you choose all the parameters for training your vector space.

  • Use convertvec to convert .bin to .txt vector file:

    thirdparty/convertvec bin2txt data/voc/bin/text8.bin data/voc/txt/text8.txt
  • Use the script convertTxtDbToNpy.py to create and save the npy matrix and the association dictionary:

    python toolbox/bin/convertTxtDbToNpy.py data/voc/txt/text8.txt data/voc/npy/text8.npy

Convert a database to polar / angular coordinates

Since we use transformations from Cartesian to polar / angular coordinates, we created a script to convert a whole database into these spaces.

  • Convert to polar coordinates

    python toolbox/bin/convertCarthDbToPolar.py data/voc/npy/text8.npy data/voc/npy
  • Convert to angular coordinates

    python toolbox/bin/convertCarthDbToPolar.py data/voc/npy/text8.npy data/voc/npy --angular

NB: you can convert from a .txt voc file, but for performance reasons we strongly advise using the .npy format.


IV. Run the experiment

You may want to reproduce the experiments. Each one is detailed in its dedicated notebook, but all the scripts are in:

toolbox/experiment

To reproduce the log extraction you will need to redirect the output to a log file like this:

python toolbox/experiment/trainAll_antoClf.py > data/learnedModel/anto/log.txt

Split it:

python toolbox/bin/splitLogFile.py data/learnedModel/anto/log.txt

And finally use:

bash toolbox/bin/summarizeAllSkReport.sh

to extract a proper summary.


HAVE FUN !!

and don't be evil :)