In [7]:
%matplotlib inline
You can find this notebook at bit.ly/ir15project.
This presentation is a short introduction to query expansion with distributional semantic models (DSMs) using the Python package PyDSM.
To successfully install the package, you first need the following installed locally:
I recommend using the scientific Python distribution Anaconda to ease the process.
When you have installed the requirements, this should do the trick:
git clone https://github.com/jimmycallin/pydsm
cd pydsm
python setup.py install
Make sure everything is installed correctly by opening your favorite REPL (which should be IPython) and importing the library.
In [8]:
import pydsm
PyDSM is a small Python library for exploratory analysis of distributional semantic models. So far, two common models are available: CooccurrenceDSM and RandomIndexing. RandomIndexing is an implementation of the model we use at Gavagai. For a detailed explanation, I recommend reading this introduction.
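To make the idea behind a co-occurrence model concrete, here is a minimal sketch of window-based counting in plain Python. It is not PyDSM's implementation: the function name `cooccurrence_counts` is hypothetical, and the sketch assumes a pre-tokenized corpus and leaves out frequency filtering.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window_size=(2, 2)):
    """Count context words within (left, right) positions of each focus word.

    Hypothetical helper for illustration only, not part of PyDSM.
    """
    left, right = window_size
    counts = defaultdict(Counter)
    for i, focus in enumerate(tokens):
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[focus][tokens[j]] += 1
    return counts

tokens = "never gonna give you up never gonna let you down".split()
counts = cooccurrence_counts(tokens, window_size=(2, 2))
print(counts["gonna"])
```

Each row of the resulting table is the distributional vector of one word: words that occur in similar contexts end up with similar rows.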
In [10]:
dsm = pydsm.build(model=pydsm.CooccurrenceDSM,
corpus='ukwac.100k.clean.txt',
window_size=(2,2),
min_frequency=3)
In [24]:
dsm[['never', 'gonna', 'give', 'you', 'up'],['never', 'gonna', 'let', 'you', 'down']]
Out[24]:
In [12]:
dsm.nearest_neighbors('fire')
Out[12]:
In [25]:
dsm[['never', 'gonna', 'give', 'you', 'up'],['never', 'gonna', 'let', 'you', 'down']]
Out[25]:
In [13]:
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)
ppmi[['never', 'gonna', 'give', 'you', 'up'],['never', 'gonna', 'let', 'you', 'down']]
Out[13]:
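As a rough illustration of what positive pointwise mutual information (PPMI) weighting does, the sketch below computes it on a nested dictionary of counts. It assumes the common definition max(0, log2(P(w,c) / (P(w)P(c)))); whether `pydsm.weighting.ppmi` uses the same log base or any smoothing is not confirmed here, and the toy counts are made up for the example.

```python
import math

def ppmi(counts):
    """Return PPMI weights for a nested dict {word: {context: count}}.

    Illustrative sketch only, not PyDSM's implementation.
    """
    total = sum(sum(row.values()) for row in counts.values())
    word_totals = {w: sum(row.values()) for w, row in counts.items()}
    context_totals = {}
    for row in counts.values():
        for c, n in row.items():
            context_totals[c] = context_totals.get(c, 0) + n

    weighted = {}
    for w, row in counts.items():
        weighted[w] = {}
        for c, n in row.items():
            # PMI = log2( P(w, c) / (P(w) * P(c)) ), written with raw counts
            pmi = math.log2(n * total / (word_totals[w] * context_totals[c]))
            weighted[w][c] = max(0.0, pmi)  # clip negative associations to zero
    return weighted

# Toy counts, invented for this example
toy_counts = {
    "fire": {"smoke": 8, "the": 2},
    "water": {"smoke": 2, "the": 8},
}
weights = ppmi(toy_counts)
print(weights["fire"])
```

Pairs that co-occur less often than chance would predict are clipped to zero, while pairs that co-occur more often than expected keep a positive weight.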
In [14]:
dsm.visualize(vis_func=pydsm.visualization.heatmap)
ppmi.visualize(vis_func=pydsm.visualization.heatmap)
In [21]:
dsm.nearest_neighbors('fire')
Out[21]:
In [15]:
ppmi.nearest_neighbors('fire')
Out[15]:
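A nearest-neighbor lookup like the ones above can be thought of as ranking every other word by vector similarity. The sketch below uses cosine similarity over sparse dictionaries; it is an assumption that this matches what `nearest_neighbors` does internally, and the vectors are invented toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(vectors, word, k=5):
    """Rank all other words by cosine similarity to `word` (hypothetical helper)."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for this example
vectors = {
    "fire": {"smoke": 3.0, "burn": 2.0},
    "fires": {"smoke": 2.0, "burn": 2.0},
    "glue": {"sticky": 4.0},
}
neighbors = nearest_neighbors(vectors, "fire")
print(neighbors)
```

Because the ranking depends only on the vectors, changing the weighting scheme changes the neighbors, which is what the comparison between `dsm` and `ppmi` shows.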
In [30]:
ppmi.nearest_neighbors('apple')
Out[30]:
In [31]:
ppmi.nearest_neighbors('film')
Out[31]:
In [32]:
ppmi.nearest_neighbors('tomorrow')
Out[32]:
In [37]:
ppmi.nearest_neighbors('weird')
Out[37]:
In [16]:
ppmi.nearest_neighbors('diana')
Out[16]:
In [17]:
ppmi.nearest_neighbors('glue')
Out[17]:
Topic ID | Description |
---|---|
C201 | Domestic fires |
C202 | Nick Leeson’s arrest |
C207 | Firework injuries |
C216 | Glue sniffing youngsters |
C230 | Atlantis-Mir Docking |
C238 | Lady Diana |
Topic ID | Query |
---|---|
C201 | #combine(domestic fire) |
C202 | #combine(leeson) |
C207 | #combine(fireworks) |
C216 | #combine(glue sniffing) |
C230 | #combine(atlantis docking) |
C238 | #combine(diana) |
Word | Expansions |
---|---|
domestic | foreign, agricultural, consumer, commercial, gross |
fire | artillery, firing, fires, anti-aircraft, fired |
fireworks | bvp, framemaker, ticker-tape, bow-and-arrow, 5153 |
glue | nikawa, model-airplane, glues, cocaine, subrating |
sniffing | carbona, semi-zombie, gelatin-resorcinol-formaldehyde, ol, pawing |
atlantis | docetist, dietician, dgar, rs600, azale |
docking | so-1, poisk, redocking, tm-3, tm-31 |
diana | galeras, iuno, cinpolis, blood-eating, pining |
leeson | rudman, krentz, goodby, lillback, tutt |
Word | Expansions |
---|---|
domestic | gross, gdp, product, prices, flights |
fire | artillery, guns, fires, firing, enemy |
fireworks | flashpaper, freehand, homesite, jrun, macromedia |
glue | sniffing, gelatin-resorcinol-formaldehyde, ethanedial, pentanedial, less-toxic |
sniffing | glue, heiferman, marvi, kismaric, gea/1 |
atlantis | critias, steppe, stepnaja, , |
docking | nauka, zarya, module, zvezda, poisk |
diana | krall, rigg, hallman, laymann, gunilla |
leeson | barings, boettke, k-9, camden/george, horwitz |
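One simple way to turn expansions like these into a query is to append the top few neighbors of each term inside the same `#combine(...)` operator. The sketch below does that; the function `expand_query` and the `per_term` cutoff are hypothetical, and this flat, unweighted combination is only an assumption about how expanded queries could be formed, not necessarily what was used in the experiments.

```python
def expand_query(terms, expansions, per_term=2):
    """Build a #combine query from the original terms plus their top expansions.

    Hypothetical helper for illustration; all terms are weighted equally.
    """
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(expansions.get(term, [])[:per_term])
    # Remove duplicates while preserving order
    unique = list(dict.fromkeys(expanded))
    return "#combine(" + " ".join(unique) + ")"

# Expansions copied from the table above
expansions = {
    "domestic": ["gross", "gdp", "product", "prices", "flights"],
    "fire": ["artillery", "guns", "fires", "firing", "enemy"],
}
query = expand_query(["domestic", "fire"], expansions, per_term=2)
print(query)
```

With these expansions, the query for topic C201 drifts toward economics and warfare rather than house fires, which illustrates how expanding each term on its own can hurt multi-word queries.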
Build your own DSM with PyDSM: github.com/jimmycallin/pydsm