In [7]:
%matplotlib inline

Query Expansion using Distributional Semantics

You can find this notebook at bit.ly/ir15project.

This presentation will guide you through a short introduction to query expansion with distributional semantic models (DSM) using the Python package PyDSM.

To successfully install the package, you first need the following installed locally:

  • Python 3.4
  • Cython
  • SciPy

I recommend using the scientific Python distribution Anaconda to ease the process.

Installation

When you have installed the requirements, this should do the trick:

git clone https://github.com/jimmycallin/pydsm
cd pydsm
python setup.py install

Make sure everything is installed correctly by opening your favorite REPL (which should be IPython) and importing the library.


In [8]:
import pydsm

Outline

  • Query Expansion
  • Distributional Semantic Models
  • Case study

Query Expansion

Why?

  • Language is tricky
  • People are lazy

What about thesauri?

  • Only finds the most obvious synonyms
  • Not context-sensitive

Query Analysis?

  • Absolutely possible!
  • In fact, it's the state of the art (SOTA)

But …

  • Collecting large user query databases is expensive
  • We might still miss things that are linguistically intuitive
  • How do we extract relevant keywords from text?

Distributional Semantics

Let’s think about words as vectors

  • Distributional = a characterization of the contexts in which something occurs
  • The Distributional Hypothesis: terms with similar distributional properties have similar meanings (ok, but which distributional properties?)
  • "A word is characterized by the company it keeps" - J.R. Firth

What is PyDSM?

PyDSM is a small Python library made for exploratory analysis of distributional semantic models. So far, two common models are available: CooccurrenceDSM and RandomIndexing. RandomIndexing is an implementation of the model we use at Gavagai. For a detailed explanation, I recommend reading this introduction.


In [10]:
dsm = pydsm.build(model=pydsm.CooccurrenceDSM,
                  corpus='ukwac.100k.clean.txt',
                  window_size=(2, 2),   # two context words on each side
                  min_frequency=3)      # drop words occurring fewer than 3 times


Building matrix from corpus with config: {'min_frequency': 3, 'window_size': (2, 2)} ... Total time of build: 484.33 sec


In [24]:
dsm[['never', 'gonna', 'give', 'you', 'up'], ['never', 'gonna', 'let', 'you', 'down']]


Out[24]:
[5, 5]  never    gonna   let      you       down
never   446.00   34.00   268.00   2750.00   81.00
gonna   34.00    0.00    9.00     206.00    4.00
give    147.00   22.00   196.00   5669.00   25.00
you     2750.00  206.00  1997.00  17250.00  1420.00
up      325.00   15.00   170.00   4016.00   1656.00

In [12]:
dsm.nearest_neighbors('fire')


Total time of nearest_neighbors: 8.04 sec
Out[12]:
[584764, 1]  fire
fire         1.00
stern        0.96
mountain     0.95
plant        0.95
shooting     0.95
hill         0.95
japanese     0.95
building     0.95
sound        0.95
running      0.95
...          ...
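
For intuition: nearest_neighbors ranks every word in the vocabulary by the similarity of its context vector to the target's. Assuming cosine similarity (the conventional choice), a minimal sketch, using the 'never' and 'up' rows from the matrix above as toy vectors:

import numpy as np

def cosine(u, v):
    # Cosine similarity: compares the direction of two context vectors,
    # ignoring their absolute magnitude (i.e. raw word frequency).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([446., 34., 268., 2750., 81.])     # 'never'
v = np.array([325., 15., 170., 4016., 1656.])   # 'up'
print(cosine(u, v))   # ~0.93

Note how high the similarity is even for raw counts: frequent contexts like 'you' dominate, which is why nearly everything above scores ~0.95 against 'fire'. The weighting below addresses exactly this.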

Weighting

  • We can improve the quality of the DSM by applying a weighting function.
  • Intuitively: Words that co-occur with many words should influence the meaning less than words that only co-occur with a few.

Pointwise Mutual Information

  • $\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)}$
  • Let's remove all negative values: Positive PMI (PPMI).
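
pydsm.weighting.ppmi does this for us; as a rough numpy equivalent, for intuition only (a sketch, not the library's code):

import numpy as np

def ppmi(counts):
    # counts: raw co-occurrence matrix (rows = words, cols = context words)
    total = counts.sum()
    p_xy = counts / total                             # joint p(x, y)
    p_x = counts.sum(axis=1, keepdims=True) / total   # marginal p(x)
    p_y = counts.sum(axis=0, keepdims=True) / total   # marginal p(y)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts: log(0), treat as 0
    return np.maximum(pmi, 0.0)    # clip negatives: Positive PMI

counts = np.array([[10., 0., 2.],
                   [8., 1., 0.],
                   [0., 4., 6.]])
print(ppmi(counts))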

Non-weighted


In [25]:
dsm[['never', 'gonna', 'give', 'you', 'up'], ['never', 'gonna', 'let', 'you', 'down']]


Out[25]:
[5, 5]  never    gonna   let      you       down
never   446.00   34.00   268.00   2750.00   81.00
gonna   34.00    0.00    9.00     206.00    4.00
give    147.00   22.00   196.00   5669.00   25.00
you     2750.00  206.00  1997.00  17250.00  1420.00
up      325.00   15.00   170.00   4016.00   1656.00

Weighted


In [13]:
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)
ppmi[['never', 'gonna', 'give', 'you', 'up'],['never', 'gonna', 'let', 'you', 'down']]


Total time of apply_weighting: 20.56 sec
Out[13]:
[5, 5]  never  gonna  let   you   down
never   2.08   3.01   1.96  1.28  0.00
gonna   3.01   0.00   2.07  2.19  0.46
give    0.98   2.59   1.66  2.01  0.00
you     1.28   2.19   1.34  0.49  0.20
up      0.23   0.66   0.00  0.12  1.44

Non-weighted vs. PPMI-weighted


In [14]:
dsm.visualize(vis_func=pydsm.visualization.heatmap)
ppmi.visualize(vis_func=pydsm.visualization.heatmap)



In [21]:
dsm.nearest_neighbors('fire')


Total time of nearest_neighbors: 7.50 sec
Out[21]:
[584764, 1]  fire
fire         1.00
stern        0.96
mountain     0.95
plant        0.95
shooting     0.95
hill         0.95
japanese     0.95
building     0.95
sound        0.95
running      0.95
...          ...

In [15]:
ppmi.nearest_neighbors('fire')


Total time of nearest_neighbors: 7.50 sec
Out[15]:
[584764, 1]  fire
fire         1.00
fires        0.14
artillery    0.11
firing       0.10
fired        0.09
enemy        0.09
smoke        0.08
missile      0.08
air          0.08
gun          0.08
...          ...

In [30]:
ppmi.nearest_neighbors('apple')


Total time of nearest_neighbors: 5.73 sec
Out[30]:
[584764, 1]  apple
apple        1.00
apples       0.14
juice        0.13
mac          0.12
fruit        0.12
macintosh    0.11
ipod         0.11
macs         0.11
pear         0.11
banana       0.10
...          ...

In [31]:
ppmi.nearest_neighbors('film')


Total time of nearest_neighbors: 5.81 sec
Out[31]:
[584764, 1]  film
film         1.00
films        0.16
movie        0.14
movies       0.11
starring     0.10
television   0.09
documentary  0.09
music        0.09
cinema       0.09
drama        0.09
...          ...

In [32]:
ppmi.nearest_neighbors('tomorrow')


Total time of nearest_neighbors: 5.77 sec
Out[32]:
[584764, 1]  tomorrow
tomorrow     1.00
gelecegiz    0.09
tonight      0.08
7pm          0.08
yesterday    0.08
6pm          0.07
saturday     0.07
4pm          0.07
8am          0.07
2pm          0.07
...          ...

In [37]:
ppmi.nearest_neighbors('weird')


Total time of nearest_neighbors: 5.79 sec
Out[37]:
[584764, 1]  weird
weird        1.00
bizarre      0.10
strange      0.10
odd          0.08
acidy        0.08
funny        0.08
sounds       0.08
eerie        0.08
scary        0.08
silly        0.08
...          ...

But ...

Not particularly robust for infrequent words


In [16]:
ppmi.nearest_neighbors('diana')


Total time of nearest_neighbors: 9.20 sec
Out[16]:
[584764, 1]  diana
diana        1.00
coupland     0.08
jane         0.08
leonard      0.07
mrs          0.07
mitchell     0.07
cooper       0.07
sheila       0.07
synopsis     0.07
cast         0.07
...          ...

In [17]:
ppmi.nearest_neighbors('glue')


Total time of nearest_neighbors: 11.35 sec
Out[17]:
[584764, 1]            glue
glue                   1.00
shinki                 0.10
pliobond               0.10
adhesive               0.10
furane                 0.09
herma                  0.09
anasthesiology         0.09
carboxymethylcysteine  0.09
interdental            0.08
japrock                0.08
...                    ...

Experimental setup

  • Train a DSM on a million sentences (~128MB) from English Wikipedia.
  • Train using window sizes 2+2 and 20+20 (paradigmatic and associative); see the training sketch after this list.
  • Set up a baseline query set.
  • Expand the query words in the query set with the DSM (see the expansion sketch after the baseline queries).
  • Evaluate with a test collection.
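
A sketch of the training step, reusing the pydsm.build call from earlier. The corpus filename and min_frequency below are placeholders, since the exact settings aren't listed here:

# Placeholders: 'wikipedia.1m.txt' and min_frequency=3 are assumptions,
# not necessarily the experiment's actual settings.
paradigmatic = pydsm.build(model=pydsm.CooccurrenceDSM,
                           corpus='wikipedia.1m.txt',
                           window_size=(2, 2),
                           min_frequency=3)
associative = pydsm.build(model=pydsm.CooccurrenceDSM,
                          corpus='wikipedia.1m.txt',
                          window_size=(20, 20),
                          min_frequency=3)

# PPMI weighting, as motivated earlier
paradigmatic = paradigmatic.apply_weighting(pydsm.weighting.ppmi)
associative = associative.apply_weighting(pydsm.weighting.ppmi)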

Chosen Topics

Topic ID  Description
C201      Domestic fires
C202      Nick Leeson’s arrest
C207      Firework injuries
C216      Glue sniffing youngsters
C230      Atlantis-Mir Docking
C238      Lady Diana

Baseline Queries

Topic ID  Query
C201      #combine(domestic fire)
C202      #combine(leeson)
C207      #combine(fireworks)
C216      #combine(glue sniffing)
C230      #combine(atlantis docking)
C238      #combine(diana)
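
Expansion then amounts to splicing each term's top neighbors into the #combine query. A hypothetical sketch using the paradigmatic model from above; note that row2word, my way of pulling the neighbor words out of the result matrix, is an assumption about PyDSM's return type, and the top-5 cutoff matches the expansion tables below:

def expand_query(terms, model, k=5):
    # Expand each term with its k nearest neighbors and build an
    # Indri-style #combine query. The first neighbor is the word itself
    # (similarity 1.00 above), so we skip it.
    expanded = list(terms)
    for term in terms:
        neighbors = model.nearest_neighbors(term)
        expanded += list(neighbors.row2word[1:k + 1])   # assumed accessor
    return '#combine({})'.format(' '.join(expanded))

print(expand_query(['glue', 'sniffing'], paradigmatic))   # topic C216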

Results

A disappointment.

2+2

Word       Expansions
domestic   foreign, agricultural, consumer, commercial, gross
fire       artillery, firing, fires, anti-aircraft, fired
fireworks  bvp, framemaker, ticker-tape, bow-and-arrow, 5153
glue       nikawa, model-airplane, glues, cocaine, subrating
sniffing   carbona, semi-zombie, gelatin-resorcinol-formaldehyde, ol, pawing
atlantis   docetist, dietician, dgar, rs600, azale
docking    so-1, poisk, redocking, tm-3, tm-31
diana      galeras, iuno, cinpolis, blood-eating, pining
leeson     rudman, krentz, goodby, lillback, tutt

20+20

Word       Expansions
domestic   gross, gdp, product, prices, flights
fire       artillery, guns, fires, firing, enemy
fireworks  flashpaper, freehand, homesite, jrun, macromedia
glue       sniffing, gelatin-resorcinol-formaldehyde, ethanedial, pentanedial, less-toxic
sniffing   glue, heiferman, marvi, kismaric, gea/1
atlantis   critias, steppe, stepnaja, ,
docking    nauka, zarya, module, zvezda, poisk
diana      krall, rigg, hallman, laymann, gunilla
leeson     barings, boettke, k-9, camden/george, horwitz

NDCG Query Results

  • Great potential
  • Difficult to guarantee expansion quality
  • Hard to decide which query terms to expand
  • Ambiguity remains a problem
  • Still wouldn't recommend this

Thanks

You can find this notebook at bit.ly/ir15project

Build your own DSM with PyDSM: github.com/jimmycallin/pydsm