In [1]:
import pandas as pd
The whole toolbox is a package tree.
You need to import the __init__.py file at the root of the toolbox; it adds the toolbox top level to your path.
In [2]:
import __init__
In [3]:
import cpLib.conceptDB as db
There are 2 saved database formats.
In [4]:
d = db.DB('../data/voc/txt/googleNews_mini.txt', verbose=False)
This format is quite slow to load for a large vocabulary; for performance reasons, we use the numpy format for large vocabularies.
The vocabulary is then split into 2 files: one containing the matrix (in npy format), the other containing the word association (a dict in json format).
NB: for both loading approaches, you can enable verbose output or not; this is useful when dealing with files created by stdout redirection.
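For illustration, here is a minimal sketch of how such a file pair could be produced; the file names and the dict layout (word to row index) are our assumptions, not necessarily the toolbox's exact convention:

import json
import numpy as np

matrix = np.random.rand(3, 200).astype(np.float32)  # one row per word vector
wordIndexDict = {'king': 0, 'queen': 1, 'cat': 2}   # assumed layout: word -> row index

np.save('mini.npy', matrix)        # the matrix file
with open('mini.json', 'w') as f:  # the word association file
    json.dump(wordIndexDict, f)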
In [5]:
d = db.DB('../data/voc/npy/googleNews_mini.npy')
In [6]:
v1 = d.get('king')
print v1, type(v1.vect), len(v1.vect)
In [7]:
v2 = d.get('toto')
print v2
In [8]:
print d.has('king')
print d.has('toto')
In [9]:
conceptList = d.getSample(5)
print len(conceptList)
print conceptList[0]
A common operation is to find the closest words to a given concept.
You can do this according to several metrics: cosine similarity, Euclidean distance, and Manhattan distance.
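For reference, these metrics have their usual definitions, manaDist denoting the Manhattan distance:
$cosSim(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}, \quad euclDist(a, b) = \|a - b\|_{2}, \quad manaDist(a, b) = \|a - b\|_{1}$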
In [10]:
king = d.get('king')
print d.find_cosSim(king)
In [11]:
print d.find_euclDist(king)
In [12]:
print d.find_manaDist(king)
You can apply several operations between concepts to build new ones.
Created concept names are in reverse polish notation.
In [13]:
import cpLib.concept as cp
In [14]:
v1 = cp.add(d.get('king'), d.get('man'))
v1 = cp.sub(v1, d.get('queen'))
v2 = cp.addSub([d.get('king'), d.get('man')], [d.get('queen')], normalized=True)
print v1, ' ~ ', d.find_cosSim(v1)[0][1]
print v2, ' ~ ', d.find_cosSim(v2)[0][1]
In [15]:
k = d.get('king')
print k.normalized()
Transforms the Cartesian coordinates into hyperspherical ones.
The first value is the norm, the other values are angles in radians.
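For reference, the standard Cartesian-to-hyperspherical conversion for $x \in \mathbb{R}^{n}$ is the following (the toolbox's exact angle convention may differ):
$r = \|x\|, \quad \varphi_{i} = \arccos \frac{x_{i}}{\sqrt{x_{i}^{2} + \dots + x_{n}^{2}}} \ (1 \leq i \leq n - 2), \quad \varphi_{n-1} = \operatorname{atan2}(x_{n}, x_{n-1})$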
In [16]:
k = d.get('king')
print k.polarized()
print 'norm =', k.polarized().vect[0]
print '1st angle =', k.polarized().vect[1]
Polar transformation without the norm.
In [17]:
k = d.get('king')
print k.angularized()
print 'vector dimension =', len(k.angularized().vect)
print '1st angle =', k.angularized().vect[0]
Since this toolbox is designed in the first place for machine learning activities, we provide some feature extraction functions:
In [18]:
import mlLib.conceptFeature as cpf
The identity feature is the raw vector of the vector space, in Cartesian coordinates.
In [19]:
k = d.get('king')
print len(cpf.identity(k))
We can also transform these Cartesian coordinates into hyperspherical ones.
In [20]:
k = d.get('king')
print len(cpf.polar(k))
And remove the norm to keep only the angles.
In [21]:
k = d.get('king')
print len(cpf.angular(k))
In practice, we discovered that the semantic meaning of the norm tends to be the 'specialisation' of the concept, and the angles its field of application.
You can check the dataExploration folder notebooks for more details.
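As a quick informal check of this claim, you can compare the norm of a generic concept with that of a more specific one (the chosen words are ours, and the actual values depend on the loaded vocabulary):

for word in ['animal', 'cat']:
    if d.has(word):
        print word, '->', d.get(word).polarized().vect[0]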
In [22]:
import mlLib.conceptPairFeature as cppf
To keep a trace of the feature transformation used while keeping manipulation at a high level, we adopt the following representation for a conceptPair:
In [23]:
conceptPair = (d.get('king'), 'relation', d.get('queen'))
conceptPair
Out[23]:
These are simple operations: subtraction and concatenation of the 2 concept features presented in the previous part.
In [24]:
conceptPair = (d.get('king'), 'relation', d.get('queen'))
featureDimDf = pd.DataFrame(index=['subtraction', 'concatenation'])
featureDimDf['cartesian'] = [len(feature(conceptPair)) for feature in [cppf.subCarth, cppf.concatCarth]]
featureDimDf['polar'] = [len(feature(conceptPair)) for feature in [cppf.subPolar, cppf.concatPolar]]
featureDimDf['angular'] = [len(feature(conceptPair)) for feature in [cppf.subAngular, cppf.concatAngular]]
print 'feature dimension depending on the used function'
featureDimDf
Out[24]:
We also introduced another type of concept pair feature.
Based on the idea that it would be useful to compare the similarity between 2 concepts on each dimension, we introduced some 'projection metric' features.
So far, we provide the projection features for the following metrics: cosine similarity, Euclidean distance, and Manhattan distance.
In [25]:
cppf.pCosSim(conceptPair)
cppf.pEuclDist(conceptPair)
print 'feature dimension:', len(cppf.pManaDist(conceptPair))
In [26]:
cppf.pdCosSim(conceptPair)
cppf.pdEuclDist(conceptPair)
print 'feature dimension:', len(cppf.pdManaDist(conceptPair))
Formal approach:
Given the projection operator that removes dimension $i$:
$P_{i}(a) = (a_{j})_{j \neq i}$
We define the projection similarity for metric $m$:
$P_{m, i}(a, b) = m(P_{i}(a), P_{i}(b))$
We apply it to each dimension and get the feature vector:
$\vec{P_{m}}(a, b) = \sum \limits_{i=1}^n P_{m, i}(a, b) \vec{e_{i}}$
Formal approach:
We use the same notation as in the previous section to define the projection dissimilarity:
$Pd_{m, i}(a, b) = m(a, b) - m(P_{i}(a), P_{i}(b))$
Same, same but different =) — we again apply it to each dimension to get the feature vector:
$\vec{Pd_{m}}(a, b) = \sum \limits_{i=1}^n Pd_{m, i}(a, b) \vec{e_{i}}$
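To make these two definitions concrete, here is a minimal numpy sketch using cosine similarity as the metric $m$ (the helper names are ours, not the toolbox's):

import numpy as np

def cosSim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def projectionSim(a, b, m=cosSim):
    # P_i drops dimension i, then m is applied to the two projected vectors
    return np.array([m(np.delete(a, i), np.delete(b, i)) for i in range(len(a))])

def projectionDissim(a, b, m=cosSim):
    # difference between the global metric and each projected metric
    return m(a, b) - projectionSim(a, b, m)

a, b = np.random.rand(5), np.random.rand(5)
print projectionSim(a, b)     # 5-dimensional feature vector, like pCosSim
print projectionDissim(a, b)  # like pdCosSim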
For supervised learning tasks, this toolbox proposes a high-level solution:
provide the dataset and the feature extraction function to an overlay classifier.
In [27]:
import cpLib.conceptExtraction as cpe
conceptStrList = ['king', 'queen', 'cat', 'bird', 'king bird']
cpe.buildConceptList(d, conceptStrList, True)
Out[27]:
The last boolean argument allows trying to compose a concept for unknown words based on the existing vocabulary.
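We do not detail the composition rule here; a plausible minimal sketch, assuming composition simply sums the vectors of the known words with cp.add (a hypothetical helper, not the toolbox's actual implementation):

def composePhrase(d, phraseStr):
    # hypothetical: combine the known words of a phrase into one concept
    words = phraseStr.split()
    concept = d.get(words[0])
    for word in words[1:]:
        concept = cp.add(concept, d.get(word))
    return concept

print composePhrase(d, 'king bird')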
In [28]:
cpe.buildConceptList(d, conceptStrList, False)
Out[28]:
In [29]:
conceptPairStrList = [('king', 'relation', 'queen'),
('man', 'relation', 'woman'),
('bird', 'relation', 'cat')]
conceptPairList = cpe.buildConceptPairList(d, conceptPairStrList, True)
conceptPairList[0]
Out[29]:
To build negative samples of concept pairs, you can shuffle an existing pair list:
In [30]:
cpe.shuffledConceptPairList(conceptPairList)
Out[30]:
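To make the supervised workflow concrete, here is a minimal sketch that wires the pieces above directly into a plain scikit-learn classifier. The overlay classifier mentioned above wraps this kind of pattern; the sklearn calls below are our illustration, not the toolbox's API:

from sklearn.ensemble import RandomForestClassifier

posPairList = conceptPairList                               # positive samples
negPairList = cpe.shuffledConceptPairList(conceptPairList)  # negative samples

# one feature vector per pair, using a feature function presented earlier
X = [cppf.subAngular(pair) for pair in posPairList + negPairList]
y = [1] * len(posPairList) + [0] * len(negPairList)

clf = RandomForestClassifier()
clf.fit(X, y)
print clf.predict([cppf.subAngular(posPairList[0])])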
These scripts use a trained classifier to predict the class of every element of a dataset.
This script predicts the classes of a single-concept dataset with a trained model; here is an example:
python toolbox/script/predictConceptClass.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/domain/animal-plant-vehicle-other_strict_RandomForestClassifier_angular_noPost.dill data/domain/luu_animal.txt
This script predicts the classes of a concept pair dataset with a trained model; here is an example:
python toolbox/script/predictConceptPairClass.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/anto/simple__RandomForestClassifier_pCosSim_noPost.dill data/wordPair/wordnetAnto.txt
This script uses a trained concept pair classifier to find the most likely match in the whole vocabulary for a given single concept; here is an example:
python toolbox/script/findMostLikelyPair.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/taxo/animal__RandomForestClassifier_subAngular_postNormalize.dill cat isParent --domain data/domain/luu_animal.txt
NB: you can restrict the search sample with the --domain argument; this avoids going through the whole vocabulary.
NB 2: RandomForest models are not well suited to this application.
These scripts evaluate a classifier model: they retrain it (without k-fold) and print the report.
This script evaluates a model on a single-concept dataset input; here is an example:
python toolbox/script/detailConceptClfError.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/domain/animal-other__RandomForestClassifier_angular_postNormalize.dill data/domain/luu_animal.txt animal data/domain/all_1400.txt other
This script evaluates a model on a concept pair dataset input; here is an example:
python toolbox/script/detailConceptPairClfError.py data/voc/npy/wikiEn-skipgram.npy data/learnedModel/anto/bidi__RandomForestClassifier_pCosSim_postNormalize.dill data/wordPair/wordnetAnto.txt anto data/wordPair/wordnetAnto_fake.txt notAnto
Here is the workflow to build a database from a bin file.
thirdparty/word2vec/bin/word2vec -train pathToCorpus.txt -output data/voc/bin/text8.bin -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0
Here you choose all the parameters for training your vector space.
Use convertvec to convert .bin to .txt vector file:
thirdparty/convertvec bin2txt data/voc/bin/text8.bin data/voc/txt/text8.txt
Use the script convertTxtDbToNpy.py
to create and save the npy matrix and the association dictionary:
python toolbox/bin/convertTxtDbToNpy.py data/voc/txt/text8.txt data/voc/npy/text8.npy
Since we use transformations from Cartesian to polar / angular coordinates, we created a script to convert a database into these spaces.
Convert to polar coordinates
python toolbox/bin/convertCarthDbToPolar.py data/voc/npy/text8.npy data/voc/npy
Convert to angular coordinates
python toolbox/bin/convertCarthDbToPolar.py data/voc/npy/text8.npy data/voc/npy --angular
NB: you can convert from a .txt voc file, but for performance reasons we strongly advise using the .npy format.
You may want to reproduce the experiments. Each one is detailed in its dedicated notebook, but all the scripts are in:
toolbox/experiment
To reproduce the log extraction you will need to redirect the output to a log file like this:
python toolbox/experiment/trainAll_antoClf.py > data/learnedModel/anto/log.txt
Split it:
python toolbox/bin/splitLogFile.py data/learnedModel/anto/log.txt
And finally use:
bash toolbox/bin/summarizeAllSkReport.sh
to extract a proper summary.