In [7]:
%matplotlib inline

Query Expansion using Distributional Semantics

You can find this notebook at bit.ly/ir15project.

This presentation will guide you through a short introduction to query expansion with distributional semantic models (DSM) using the Python package PyDSM.

To successfully install the package, you first need the following installed locally:

  • Python 3.4
  • Cython
  • SciPy

I recommend using the scientific Python distribution Anaconda to ease the process.

Installation

When you have installed the requirements, this should do the trick:

git clone https://github.com/jimmycallin/pydsm
cd pydsm
python setup.py install

Make sure everything is installed correctly by opening your favorite REPL (which should be IPython) and importing the library.


In [8]:
import pydsm

Outline

  • Query Expansion
  • Distributional Semantic Models
  • Case study

Query Expansion

Why?

  • Language is tricky
  • People are lazy

What about thesauri?

  • Only finds the most obvious synonyms
  • Not context-sensitive

Query Analysis?

  • Absolutely possible!
  • In fact, it's the state of the art (SOTA)

But …

  • Collecting large user query databases is expensive
  • We might still miss things that are linguistically intuitive
  • How do we extract relevant keywords from text?

Distributional Semantics

Let’s think about words as vectors

  • Distributional = a characterization of the contexts in which something occurs
  • The Distributional Hypothesis: terms with similar distributional properties have similar meanings (ok, but which distributional properties?)
  • "A word is characterized by the company it keeps" - J.R. Firth

What is PyDSM?

PyDSM is a small Python library made for exploratory analysis of distributional semantic models. So far, two common models are available: CooccurrenceDSM and RandomIndexing. RandomIndexing is an implementation of the model we use at Gavagai. For a detailed explanation, I recommend reading this introduction.


In [10]:
dsm = pydsm.build(model=pydsm.CooccurrenceDSM,
                  corpus='ukwac.100k.clean.txt',
                  window_size=(2, 2),   # two context words on each side
                  min_frequency=3)      # drop words occurring fewer than 3 times


Building matrix from corpus with config: {'min_frequency': 3, 'window_size': (2, 2)} ... Total time of build: 484.33 sec


In [24]:
dsm[['never', 'gonna', 'give', 'you', 'up'], ['never', 'gonna', 'let', 'you', 'down']]


Out[24]:
[5, 5]  never    gonna   let      you       down
never   446.00   34.00   268.00   2750.00   81.00
gonna   34.00    0.00    9.00     206.00    4.00
give    147.00   22.00   196.00   5669.00   25.00
you     2750.00  206.00  1997.00  17250.00  1420.00
up      325.00   15.00   170.00   4016.00   1656.00

In [12]:
dsm.nearest_neighbors('fire')


Total time of nearest_neighbors: 8.04 sec
Out[12]:
[584764, 1]  fire
fire         1.00
stern        0.96
mountain     0.95
plant        0.95
shooting     0.95
hill         0.95
japanese     0.95
building     0.95
sound        0.95
running      0.95
...          ...
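
For intuition: nearest_neighbors ranks every word in the vocabulary by the similarity of its context vector to the target's. Assuming cosine similarity (the conventional choice), a minimal sketch, using the 'never' and 'up' rows from the matrix above as toy vectors:

import numpy as np

def cosine(u, v):
    # Cosine similarity: compares the direction of two context vectors,
    # ignoring their absolute magnitude (i.e. raw word frequency).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([446., 34., 268., 2750., 81.])     # 'never'
v = np.array([325., 15., 170., 4016., 1656.])   # 'up'
print(cosine(u, v))   # ~0.93

Note how high the similarity is even for raw counts: frequent contexts like 'you' dominate, which is why nearly everything above scores ~0.95 against 'fire'. The weighting below addresses exactly this.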

Weighting

  • We can improve the quality of the DSM by applying a weighting function.
  • Intuitively: Words that co-occur with many words should influence the meaning less than words that only co-occur with a few.

Pointwise Mutual Information

  • $\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)}$
  • Let's remove all negative values: Positive PMI (PPMI).
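
pydsm.weighting.ppmi does this for us; as a rough numpy equivalent, for intuition only (a sketch, not the library's code):

import numpy as np

def ppmi(counts):
    # counts: raw co-occurrence matrix (rows = words, cols = context words)
    total = counts.sum()
    p_xy = counts / total                             # joint p(x, y)
    p_x = counts.sum(axis=1, keepdims=True) / total   # marginal p(x)
    p_y = counts.sum(axis=0, keepdims=True) / total   # marginal p(y)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts: log(0), treat as 0
    return np.maximum(pmi, 0.0)    # clip negatives: Positive PMI

counts = np.array([[10., 0., 2.],
                   [8., 1., 0.],
                   [0., 4., 6.]])
print(ppmi(counts))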

Non-weighted


In [25]:
dsm[['never', 'gonna', 'give', 'you', 'up'], ['never', 'gonna', 'let', 'you', 'down']]


Out[25]:
[5, 5]  never    gonna   let      you       down
never   446.00   34.00   268.00   2750.00   81.00
gonna   34.00    0.00    9.00     206.00    4.00
give    147.00   22.00   196.00   5669.00   25.00
you     2750.00  206.00  1997.00  17250.00  1420.00
up      325.00   15.00   170.00   4016.00   1656.00

Weighted


In [13]:
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)
ppmi[['never', 'gonna', 'give', 'you', 'up'],['never', 'gonna', 'let', 'you', 'down']]


Total time of apply_weighting: 20.56 sec
Out[13]:
[5, 5]  never  gonna  let   you   down
never   2.08   3.01   1.96  1.28  0.00
gonna   3.01   0.00   2.07  2.19  0.46
give    0.98   2.59   1.66  2.01  0.00
you     1.28   2.19   1.34  0.49  0.20
up      0.23   0.66   0.00  0.12  1.44

Non-weighted vs. PPMI-weighted


In [14]:
dsm.visualize(vis_func=pydsm.visualization.heatmap)
ppmi.visualize(vis_func=pydsm.visualization.heatmap)



In [21]:
dsm.nearest_neighbors('fire')


Total time of nearest_neighbors: 7.50 sec
Out[21]:
[584764, 1]  fire
fire         1.00
stern        0.96
mountain     0.95
plant        0.95
shooting     0.95
hill         0.95
japanese     0.95
building     0.95
sound        0.95
running      0.95
...          ...

In [15]:
ppmi.nearest_neighbors('fire')


Total time of nearest_neighbors: 7.50 sec
Out[15]:
[584764, 1]  fire
fire         1.00
fires        0.14
artillery    0.11
firing       0.10
fired        0.09
enemy        0.09
smoke        0.08
missile      0.08
air          0.08
gun          0.08
...          ...

In [30]:
ppmi.nearest_neighbors('apple')


Total time of nearest_neighbors: 5.73 sec
Out[30]:
[584764, 1]  apple
apple        1.00
apples       0.14
juice        0.13
mac          0.12
fruit        0.12
macintosh    0.11
ipod         0.11
macs         0.11
pear         0.11
banana       0.10
...          ...

In [31]:
ppmi.nearest_neighbors('film')


Total time of nearest_neighbors: 5.81 sec
Out[31]:
[584764, 1]  film
film         1.00
films        0.16
movie        0.14
movies       0.11
starring     0.10
television   0.09
documentary  0.09
music        0.09
cinema       0.09
drama        0.09
...          ...

In [32]:
ppmi.nearest_neighbors('tomorrow')


Total time of nearest_neighbors: 5.77 sec
Out[32]:
[584764, 1]  tomorrow
tomorrow     1.00
gelecegiz    0.09
tonight      0.08
7pm          0.08
yesterday    0.08
6pm          0.07
saturday     0.07
4pm          0.07
8am          0.07
2pm          0.07
...          ...

In [37]:
ppmi.nearest_neighbors('weird')


Total time of nearest_neighbors: 5.79 sec
Out[37]:
[584764, 1]  weird
weird        1.00
bizarre      0.10
strange      0.10
odd          0.08
acidy        0.08
funny        0.08
sounds       0.08
eerie        0.08
scary        0.08
silly        0.08
...          ...

But ...

Not particularly robust for infrequent words


In [16]:
ppmi.nearest_neighbors('diana')


Total time of nearest_neighbors: 9.20 sec
Out[16]:
[584764, 1]  diana
diana        1.00
coupland     0.08
jane         0.08
leonard      0.07
mrs          0.07
mitchell     0.07
cooper       0.07
sheila       0.07
synopsis     0.07
cast         0.07
...          ...

In [17]:
ppmi.nearest_neighbors('glue')


Total time of nearest_neighbors: 11.35 sec
Out[17]:
[584764, 1]            glue
glue                   1.00
shinki                 0.10
pliobond               0.10
adhesive               0.10
furane                 0.09
herma                  0.09
anasthesiology         0.09
carboxymethylcysteine  0.09
interdental            0.08
japrock                0.08
...                    ...

Experimental setup

  • Train a DSM on a million sentences (~128MB) from English Wikipedia.
  • Train using window sizes 2+2 and 20+20 (paradigmatic and associative); see the training sketch after this list.
  • Set up a baseline query set.
  • Expand the query words in the query set with the DSM (see the expansion sketch after the baseline queries).
  • Evaluate with a test collection.
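
A sketch of the training step, reusing the pydsm.build call from earlier. The corpus filename and min_frequency below are placeholders, since the exact settings aren't listed here:

# Placeholders: 'wikipedia.1m.txt' and min_frequency=3 are assumptions,
# not necessarily the experiment's actual settings.
paradigmatic = pydsm.build(model=pydsm.CooccurrenceDSM,
                           corpus='wikipedia.1m.txt',
                           window_size=(2, 2),
                           min_frequency=3)
associative = pydsm.build(model=pydsm.CooccurrenceDSM,
                          corpus='wikipedia.1m.txt',
                          window_size=(20, 20),
                          min_frequency=3)

# PPMI weighting, as motivated earlier
paradigmatic = paradigmatic.apply_weighting(pydsm.weighting.ppmi)
associative = associative.apply_weighting(pydsm.weighting.ppmi)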

Chosen Topics

Topic ID  Description
C201      Domestic fires
C202      Nick Leeson’s arrest
C207      Firework injuries
C216      Glue sniffing youngsters
C230      Atlantis-Mir Docking
C238      Lady Diana

Baseline Queries

Topic ID  Query
C201      #combine(domestic fire)
C202      #combine(leeson)
C207      #combine(fireworks)
C216      #combine(glue sniffing)
C230      #combine(atlantis docking)
C238      #combine(diana)
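
Expansion then amounts to splicing each term's top neighbors into the #combine query. A hypothetical sketch using the paradigmatic model from above; note that row2word, my way of pulling the neighbor words out of the result matrix, is an assumption about PyDSM's return type, and the top-5 cutoff matches the expansion tables below:

def expand_query(terms, model, k=5):
    # Expand each term with its k nearest neighbors and build an
    # Indri-style #combine query. The first neighbor is the word itself
    # (similarity 1.00 above), so we skip it.
    expanded = list(terms)
    for term in terms:
        neighbors = model.nearest_neighbors(term)
        expanded += list(neighbors.row2word[1:k + 1])   # assumed accessor
    return '#combine({})'.format(' '.join(expanded))

print(expand_query(['glue', 'sniffing'], paradigmatic))   # topic C216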

Results

A disappointment.

2+2

Word       Expansions
domestic   foreign, agricultural, consumer, commercial, gross
fire       artillery, firing, fires, anti-aircraft, fired
fireworks  bvp, framemaker, ticker-tape, bow-and-arrow, 5153
glue       nikawa, model-airplane, glues, cocaine, subrating
sniffing   carbona, semi-zombie, gelatin-resorcinol-formaldehyde, ol, pawing
atlantis   docetist, dietician, dgar, rs600, azale
docking    so-1, poisk, redocking, tm-3, tm-31
diana      galeras, iuno, cinpolis, blood-eating, pining
leeson     rudman, krentz, goodby, lillback, tutt

20+20

Word       Expansions
domestic   gross, gdp, product, prices, flights
fire       artillery, guns, fires, firing, enemy
fireworks  flashpaper, freehand, homesite, jrun, macromedia
glue       sniffing, gelatin-resorcinol-formaldehyde, ethanedial, pentanedial, less-toxic
sniffing   glue, heiferman, marvi, kismaric, gea/1
atlantis   critias, steppe, stepnaja, ,
docking    nauka, zarya, module, zvezda, poisk
diana      krall, rigg, hallman, laymann, gunilla
leeson     barings, boettke, k-9, camden/george, horwitz

NDCG Query Results

  • Great potential
  • Difficult to guarantee expansion quality
  • Hard to decide which query terms to expand
  • Ambiguity remains a problem
  • Still wouldn't recommend this

Thanks

You can find this notebook at bit.ly/ir15project

Build your own DSM with PyDSM: github.com/jimmycallin/pydsm