In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
fetch_subset = lambda subset: fetch_20newsgroups(
    subset=subset, categories=categories,
    shuffle=True, random_state=42,
    remove=('headers', 'footers', 'quotes'))
train = fetch_subset('train')
test = fetch_subset('test')
In [2]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
vec = HashingVectorizer(n_features=10000)
clf = SGDClassifier()
pipeline = Pipeline([('vec', vec), ('clf', clf)])
pipeline.fit(train['data'], train['target'])
Out[2]:
Pipeline(steps=[('vec', HashingVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
lowercase=True, n_features=10000, ngram_range=(1, 1),
non_negative=False, norm='l2', preprocessor=None, stop_words=None,... penalty='l2', power_t=0.5, random_state=None, shuffle=True,
verbose=0, warm_start=False))])
In [10]:
from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec)
ivec.fit(train['data'])
Out[10]:
InvertableHashingVectorizer(unkn_template='FEATURE[%d]',
vec=HashingVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
lowercase=True, n_features=10000, ngram_range=(1, 1),
non_negative=False, norm='l2', preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None))
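A hashing vectorizer keeps no vocabulary, so readable feature names have to be recovered by hashing the terms seen in some corpus and grouping them by column. The sketch below illustrates that idea only; it is not eli5's implementation. The helper names (`signed_hash`, `build_inverse_map`, `column_name`) are hypothetical, and md5 stands in for the MurmurHash3 function that scikit-learn's HashingVectorizer uses, so the column numbers will not match the real vectorizer.

```python
import hashlib
from collections import defaultdict

N_FEATURES = 16  # deliberately tiny so that collisions are easy to see


def signed_hash(term, n_features=N_FEATURES):
    """Map a term to (column, sign).

    Assumption: md5 is used here purely for illustration; scikit-learn's
    HashingVectorizer uses MurmurHash3 instead.
    """
    digest = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)
    column = digest % n_features
    sign = 1 if (digest >> 64) & 1 == 0 else -1
    return column, sign


def build_inverse_map(docs, n_features=N_FEATURES):
    """Hash every term seen in docs and group the terms by column."""
    columns = defaultdict(set)
    for doc in docs:
        for term in doc.lower().split():
            column, sign = signed_hash(term, n_features)
            columns[column].add((term, sign))
    return columns


def column_name(column, columns, unkn_template="FEATURE[%d]"):
    """Render a column as 'a | (-)b', or a placeholder if no term is known."""
    terms = sorted(columns.get(column, ()))
    if not terms:
        return unkn_template % column
    return " | ".join(t if s > 0 else "(-)" + t for t, s in terms)


docs = ["the moon orbits the earth", "nasa launched a shuttle to the moon"]
inverse = build_inverse_map(docs)
for col in range(N_FEATURES):
    print(col, column_name(col, inverse))
```

Columns that no known term hashes to fall back to the `FEATURE[%d]` placeholder, and a column that several terms hash to is shown as a `|`-separated list.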
In [4]:
from eli5 import explain_weights, explain_prediction
from eli5 import format_as_html, format_as_text, format_html_styles
print(format_as_text(explain_weights(clf, ivec, target_names=train['target_names'])))
Explained as: linear model
Features with largest coefficients per class.
Caveats:
1. Be careful with features which are not
independent - weights don't show their importance.
2. If scale of input features is different then scale of coefficients
will also be different, making direct comparison between coefficient values
incorrect.
3. Depending on regularization, rare features sometimes may have high
coefficients; this doesn't mean they contribute much to the
classification result for most examples.
Feature names are restored from their hashes; this is not 100% precise
because collisions are possible. For known collisions possible feature names
are separated by | sign. Keep in mind the collision list is not exhaustive.
Features marked with (-) should be read as inverted: if they have positive
coefficient, the result is negative, if they have negative coefficient,
the result is positive.
y='alt.atheism' top features
Weight Feature
------ -----------------------------------------------------------------------------------
+5.383 atheism | homos | (-)dyson
+4.889 atheists | (-)simulators | (-)degrading | coprocessor | (-)imsl | justifying | 3261
+4.482 bobby | (-)counterexamples
+4.360 religion | followers | hunts | 536
+3.562 words | 24bit | _nightflyers_ | tsv | (-)recommened
+3.448 posting | libelous | rude | (-)agreeable | (-)elaine | umd
+3.417 post
+3.219 atheist | boyce | (-)62618e
+3.213 define | mixture | (-)cx5 | periphery | cmd | (-)bibtex
+3.189 isn | (-)david42
+3.164 islam | (-)code3 | descends | (-)2etc | witrh | xgif
+2.997 our | newlan | (-)prenatal
+2.979 example | (-)tolerate | (-)336549999999999955e
+2.945 punishment | jahn | (-)extremists
+2.902 islamic | (-)angeles | pressures | (-)affordably | (-)snazzy
+2.836 being | kernel
… 3441 more positive …
… 4622 more negative …
-3.041 christ | (-)silloo | ineed | (-)_religion_
-3.085 interested | galileo | (-)geneologies
-3.429 order | consensus | maker | vm_pray | wonders | zorastrian | (-)457
-4.664 space | (-)revell
y='comp.graphics' top features
Weight Feature
------ ----------------------------------------------------------------------------------
+6.976 graphics | lemur | (-)installations
+5.089 file
+4.918 computer | (-)morality | (-)priest | eredoctoraat
+4.675 image | (-)fallacies | envelope | (-)turing | (-)topographic
+4.481 3d | wti
+4.058 points | credibility
+3.846 screen | techno | (-)clo | clicking
+3.613 using | contention
+3.515 42 | (-)auxiliary
+3.428 files | cell | dualism | (-)bibliographic
+3.040 virtual | shaking | (-)catechism
+2.988 site | cheif
+2.983 video | dither | sometime | (-)kit | menlo | (-)yourselfers
+2.949 package | (-)intelligence
+2.944 animation | satisfies | nome | intrigued | heightfields | attendent
+2.835 hi | feasability | (-)seri | teens
+2.834 tiff | distress
+2.818 version | ________________________________________________________________________
… 3240 more positive …
… 4190 more negative …
-2.859 orbit | (-)stroked
-5.100 space | (-)revell
y='sci.space' top features
Weight Feature
------ ---------------------------------------------------------------------------------------
+9.732 space | (-)revell
+5.425 orbit | (-)stroked
+4.504 nasa | (-)mocking | (-)jmd
+4.193 launch | (-)spring
+3.992 spacecraft | (-)revenues | (-)_______ | serbian | (-)detained | externel | (-)tormentor
+3.802 mars | reston | (-)risen | nowadays
+3.676 moon | faiths | quantized | pet | (-)enriched | brightnesses
+3.525 shuttle | (-)recording | vlt
+3.436 earth | (-)vdp | pixutils | coined | khvaetvadatha | (-)omits
+3.158 flight | (-)kerwin | (-)interentested
+3.115 solar | oscillator | slogan
+3.094 sci | (-)calculater | inconceivable
+2.973 satellite | advocated | (-)telesoft | (-)microcontroller
+2.943 test | (-)japan
… 3841 more positive …
… 4556 more negative …
-3.102 god | (-)casual | socrates | (-)aborted | pivotal
-3.170 religion | followers | hunts | 536
-3.425 wrong | (-)tired | (-)bestowed
-3.493 file
-3.619 3d | wti
-4.582 graphics | lemur | (-)installations
y='talk.religion.misc' top features
Weight Feature
------ -----------------------------------------------------------------------------------
+5.561 christian | integer | (-)pd1 | trench
+5.224 christians | 320x200 | subtilty
+5.079 order | consensus | maker | vm_pray | wonders | zorastrian | (-)457
+4.556 jesus | (-)systems | butter | (-)aztecs | (-)geotail | optimized
+4.442 fbi | awfully | (-)antwerp
+3.673 blood | (-)reduces
+3.498 objective | (-)venera | 82 | (-)fl | (-)834
+3.203 children | (-)inflatable | (-)cutest
+3.111 koresh | (-)gotta | (-)fixtures | (-)hussien | (-)joined
+2.918 dead | (-)phillips | (-)les
+2.864 values | undefinable | (-)chubb
+2.853 christ | (-)silloo | ineed | (-)_religion_
+2.808 may | aws | (-)umpire | (-)gaat
+2.808 see | (-)sert
… 3528 more positive …
… 4713 more negative …
-3.048 thanks | adequate | royalty | intelligibly
-3.184 atheists | (-)simulators | (-)degrading | coprocessor | (-)imsl | justifying | 3261
-3.353 need | (-)concede | beeld | noss
-3.464 system | (-)bylaws | (-)724x600
-3.543 could | diagrams | (-)videoscan | lous | (-)64x64
-4.728 space | (-)revell
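The caveat about `(-)` features can be made concrete with a small helper. The sketch below parses one row of the text table above and applies the inversion rule to each candidate name. The function name `effective_term_weights` is hypothetical and not part of eli5; it only restates the rule from the caveat. Terms that collide share a single learned coefficient, so the per-term numbers say how a term would move the score, not which term the model actually learned from.

```python
def effective_term_weights(weight, feature):
    """Split an 'a | (-)b' feature label and apply the (-) inversion.

    Hypothetical helper: names marked with (-) get the opposite sign of the
    column coefficient, all other names keep it.
    """
    result = {}
    for name in feature.split(" | "):
        if name.startswith("(-)"):
            result[name[3:]] = -weight
        else:
            result[name] = weight
    return result


# First row of the alt.atheism table in the text output
row = effective_term_weights(5.383, "atheism | homos | (-)dyson")
print(row)  # {'atheism': 5.383, 'homos': 5.383, 'dyson': -5.383}
```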
In [5]:
from IPython.display import display, HTML
show_html = lambda html: display(HTML(html))
show_html_expl = lambda expl, **kwargs: show_html(format_as_html(expl, include_styles=False, **kwargs))
show_html(format_html_styles())
In [6]:
show_html_expl(explain_weights(clf, ivec, target_names=train['target_names']))
Explained as: linear model
Features with largest coefficients per class.
Caveats:
1. Be careful with features which are not
independent - weights don't show their importance.
2. If scale of input features is different then scale of coefficients
will also be different, making direct comparison between coefficient values
incorrect.
3. Depending on regularization, rare features sometimes may have high
coefficients; this doesn't mean they contribute much to the
classification result for most examples.
Feature names are restored from their hashes; this is not 100% precise
because collisions are possible. For known collisions possible feature names
are separated by | sign. Keep in mind the collision list is not exhaustive.
Features marked with (-) should be read as inverted: if they have positive
coefficient, the result is negative, if they have negative coefficient,
the result is positive.
y=alt.atheism top features
Weight? Feature
------- ------------
+5.383  atheism …
+4.889  atheists …
+4.482  bobby …
+4.360  religion …
+3.562  words …
+3.448  posting …
+3.417  post
+3.219  atheist …
+3.213  define …
+3.189  isn …
+3.164  islam …
+2.997  our …
+2.979  example …
+2.945  punishment …
+2.902  islamic …
+2.836  being …
… 3441 more positive …
… 4622 more negative …
-3.041  christ …
-3.085  interested …
-3.429  order …
-4.664  space …
y=comp.graphics top features
Weight? Feature
------- ------------
+6.976  graphics …
+5.089  file
+4.918  computer …
+4.675  image …
+4.481  3d …
+4.058  points …
+3.846  screen …
+3.613  using …
+3.515  42 …
+3.428  files …
+3.040  virtual …
+2.988  site …
+2.983  video …
+2.949  package …
+2.944  animation …
+2.835  hi …
+2.834  tiff …
+2.818  version …
… 3240 more positive …
… 4190 more negative …
-2.859  orbit …
-5.100  space …
y=sci.space top features
Weight? Feature
------- ------------
+9.732  space …
+5.425  orbit …
+4.504  nasa …
+4.193  launch …
+3.992  spacecraft …
+3.802  mars …
+3.676  moon …
+3.525  shuttle …
+3.436  earth …
+3.158  flight …
+3.115  solar …
+3.094  sci …
+2.973  satellite …
+2.943  test …
… 3841 more positive …
… 4556 more negative …
-3.102  god …
-3.170  religion …
-3.425  wrong …
-3.493  file
-3.619  3d …
-4.582  graphics …
y=talk.religion.misc top features
Weight? Feature
------- ------------
+5.561  christian …
+5.224  christians …
+5.079  order …
+4.556  jesus …
+4.442  fbi …
+3.673  blood …
+3.498  objective …
+3.203  children …
+3.111  koresh …
+2.918  dead …
+2.864  values …
+2.853  christ …
+2.808  may …
+2.808  see …
… 3528 more positive …
… 4713 more negative …
-3.048  thanks …
-3.184  atheists …
-3.353  need …
-3.464  system …
-3.543  could …
-4.728  space …
In [7]:
show_html_expl(explain_prediction(clf, test['data'][2], vec, target_names=train['target_names']), force_weights=True)
Explained as: linear model
y=alt.atheism (score -4.351) top features
Contribution? Feature
------------- -------------
+0.232        some
+0.156        much
+0.061        it
+0.061        which
+0.055        is
+0.042        has
+0.036        trying
+0.032        my
+0.031        sophisticated
+0.014        likes
+0.008        he
+0.004        pc
-0.000        designer
-0.008        suggestion
-0.024        for
-0.035        most
-0.039        decor
-0.046        better
-0.050        any
-0.054        am
-0.056        there
-0.059        and
-0.065        features
-0.067        here
-0.081        interior
-0.095        friend
-0.108        where
-0.109        hi
-0.110        from
-0.116        thailand
-0.119        more
-0.121        how
-0.134        find
-0.167        costs
-0.173        looking
-0.174        help
-0.177        on
-0.241        graphics
-0.256        the
-0.284        to
-0.329        buy
-0.731        software
-1.055        <BIAS>
y=comp.graphics (score 2.166) top features
Contribution? Feature
------------- -------------
+0.720        graphics
+0.549        software
+0.370        is
+0.292        hi
+0.234        looking
+0.223        help
+0.217        on
+0.179        any
+0.154        there
+0.153        pc
+0.142        for
+0.123        features
+0.118        find
+0.104        from
+0.103        my
+0.100        has
+0.084        which
+0.081        it
+0.079        am
+0.078        where
+0.038        and
+0.035        here
+0.029        thailand
+0.026        costs
+0.019        friend
+0.019        trying
+0.007        how
+0.004        some
-0.008        designer
-0.010        buy
-0.012        likes
-0.021        sophisticated
-0.021        interior
-0.023        decor
-0.032        better
-0.063        much
-0.078        most
-0.079        more
-0.106        the
-0.106        to
-0.107        suggestion
-0.514        he
-0.933        <BIAS>
y=sci.space (score -0.890) top features
Contribution? Feature
------------- -------------
+0.297        the
+0.232        buy
+0.161        costs
+0.158        how
+0.149        most
+0.103        it
+0.100        more
+0.097        to
+0.096        on
+0.095        software
+0.085        some
+0.075        from
+0.072        much
+0.070        friend
+0.069        here
+0.057        where
+0.048        likes
+0.048        there
+0.035        decor
+0.029        sophisticated
+0.012        has
+0.009        for
+0.006        designer
-0.004        suggestion
-0.017        better
-0.018        any
-0.026        interior
-0.056        and
-0.060        thailand
-0.064        pc
-0.066        trying
-0.071        am
-0.073        features
-0.094        help
-0.107        my
-0.111        hi
-0.112        which
-0.123        find
-0.145        is
-0.152        looking
-0.265        he
-0.473        graphics
-0.956        <BIAS>
y=talk.religion.misc (score -2.004) top features
Contribution? Feature
------------- -------------
+0.478        he
+0.225        is
+0.133        more
+0.116        interior
+0.088        my
+0.085        thailand
+0.076        and
+0.070        find
+0.070        looking
+0.053        suggestion
+0.051        am
+0.047        buy
+0.043        friend
+0.040        trying
+0.037        where
+0.026        costs
+0.025        decor
+0.009        which
+0.007        to
+0.004        designer
-0.011        how
-0.012        here
-0.013        help
-0.016        from
-0.021        the
-0.035        features
-0.044        sophisticated
-0.055        there
-0.064        likes
-0.069        better
-0.095        pc
-0.106        most
-0.131        any
-0.135        for
-0.176        has
-0.186        on
-0.189        much
-0.193        hi
-0.193        some
-0.282        graphics
-0.287        it
-0.402        software
-0.973        <BIAS>
y=alt.atheism (score -4.351) top features
Contribution? Feature
-1.055        <BIAS>
-3.296        Highlighted in text (sum)
hi there,
i am here looking for some help.
my friend is a interior decor designer. he is from thailand. he is
trying to find some graphics software on pc. any suggestion on which
software to buy,where to buy and how much it costs ? he likes the most
sophisticated
software(the more features it has,the better)
y=comp.graphics (score 2.166) top features
Contribution? Feature
+3.098        Highlighted in text (sum)
-0.933        <BIAS>
hi there,
i am here looking for some help.
my friend is a interior decor designer. he is from thailand. he is
trying to find some graphics software on pc. any suggestion on which
software to buy,where to buy and how much it costs ? he likes the most
sophisticated
software(the more features it has,the better)
y=sci.space (score -0.890) top features
Contribution? Feature
+0.066        Highlighted in text (sum)
-0.956        <BIAS>
hi there,
i am here looking for some help.
my friend is a interior decor designer. he is from thailand. he is
trying to find some graphics software on pc. any suggestion on which
software to buy,where to buy and how much it costs ? he likes the most
sophisticated
software(the more features it has,the better)
y=talk.religion.misc (score -2.004) top features
Contribution? Feature
-0.973        <BIAS>
-1.031        Highlighted in text (sum)
hi there,
i am here looking for some help.
my friend is a interior decor designer. he is from thailand. he is
trying to find some graphics software on pc. any suggestion on which
software to buy,where to buy and how much it costs ? he likes the most
sophisticated
software(the more features it has,the better)
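For a linear model, each per-class score is the sum of the per-feature contributions (coefficient times feature value) plus the `<BIAS>` term, which is why the summary rows above add up to the reported scores. The sketch below shows that arithmetic on made-up numbers; `explain_linear_score` is a hypothetical helper, not an eli5 function, and the weights and feature values are invented for illustration. The final loop checks the summary rows reported in this notebook, allowing for rounding to three decimals.

```python
def explain_linear_score(weights, bias, features):
    """Return (contributions, score) for one class of a linear model."""
    contributions = {
        name: weights.get(name, 0.0) * value
        for name, value in features.items()
    }
    score = bias + sum(contributions.values())
    return contributions, score


# Hypothetical weights and feature values, chosen only for illustration.
weights = {"graphics": 4.0, "software": 3.0, "he": -2.5}
bias = -0.9
features = {"graphics": 0.2, "software": 0.3, "he": 0.1}
contributions, score = explain_linear_score(weights, bias, features)
print(contributions, round(score, 3))

# Summary rows reported in this notebook for test['data'][2]:
# (highlighted-in-text sum, <BIAS>, reported score)
reported = {
    "alt.atheism": (-3.296, -1.055, -4.351),
    "comp.graphics": (3.098, -0.933, 2.166),
    "sci.space": (0.066, -0.956, -0.890),
    "talk.religion.misc": (-1.031, -0.973, -2.004),
}
for target, (text_sum, bias_term, total) in reported.items():
    # The output is rounded to 3 decimals, so allow a small gap.
    assert abs(text_sum + bias_term - total) < 0.002, target
```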
In [8]:
show_html_expl(explain_prediction(clf, test['data'][4], vec, target_names=train['target_names']), force_weights=False)
Explained as: linear model
y=alt.atheism (score -2.171) top features
Contribution? Feature
-1.055        <BIAS>
-1.116        Highlighted in text (sum)
i am interested in finding 3d animation programs for the mac.
i am especially interested in any programs that don't exist
in a pc port and are so good that they would make me go buy
a mac. do any such exist?
y=comp.graphics (score 1.082) top features
Contribution? Feature
+2.015        Highlighted in text (sum)
-0.933        <BIAS>
i am interested in finding 3d animation programs for the mac.
i am especially interested in any programs that don't exist
in a pc port and are so good that they would make me go buy
a mac. do any such exist?
y=sci.space (score -2.049) top features
Contribution? Feature
-0.956        <BIAS>
-1.093        Highlighted in text (sum)
i am interested in finding 3d animation programs for the mac.
i am especially interested in any programs that don't exist
in a pc port and are so good that they would make me go buy
a mac. do any such exist?
y=talk.religion.misc (score -1.993) top features
Contribution? Feature
-0.973        <BIAS>
-1.019        Highlighted in text (sum)
i am interested in finding 3d animation programs for the mac.
i am especially interested in any programs that don't exist
in a pc port and are so good that they would make me go buy
a mac. do any such exist?
In [9]:
import numpy as np
for doc in test['data'][:10]:
    expl = explain_prediction(clf, doc, vec, target_names=train['target_names'], top_targets=1)
    show_html_expl(expl, force_weights=False)
Explained as: linear model
y=sci.space (score 0.184) top features
Contribution? Feature
+1.140        Highlighted in text (sum)
-0.956        <BIAS>
trry the skywatch project in arizona.
Explained as: linear model
y=comp.graphics (score 2.062) top features
Contribution? Feature
+2.994        Highlighted in text (sum)
-0.933        <BIAS>
the vatican library recently made a tour of the us.
can anyone help me in finding a ftp site where this collection is
available.
Explained as: linear model
y=comp.graphics (score 2.166) top features
Contribution? Feature
+3.098        Highlighted in text (sum)
-0.933        <BIAS>
hi there,
i am here looking for some help.
my friend is a interior decor designer. he is from thailand. he is
trying to find some graphics software on pc. any suggestion on which
software to buy,where to buy and how much it costs ? he likes the most
sophisticated
software(the more features it has,the better)
Explained as: linear model
y=comp.graphics (score 0.362) top features
Contribution? Feature
+1.295        Highlighted in text (sum)
-0.933        <BIAS>
rfd
request for discussion
for the
open telematic group
otg
i have proposed the forming of a consortium/task force for the
promotion of naplps/jpeg, fif to openly discuss ways, method,
procedures,algorythms, applications, implementation, extensions of
naplps/jpeg standards. these standards should facilitate the creation
of real_time online applications that make use of voice, video,
telecommuting, hires graphics, conferencing, distant learning, online
order entry, fax,in addition these dicussion would assist all to
better understand how sgml, cals, oda, mime, oodbms, jpeg, mpeg,
fractals, sql, cdrom, cdromxa, kodak photocd, tcl, v.fast, and
eia/tia562, can best be incorporated and implemented to develop
telematic/multimedia applications.
we want to be able to support dos, unix, mac, windows, nt, os/2
platforms. it is our hope that individuals, developers, corporations,
universities, r & d labs would join in in supporting such an endeavor.
this would be a not_for_profit group with bylaws and charter. already
many corporations have decided to support otg (open telematic group) so
do not delay joining if you are a developer
an rfd has been posted to form a usenet newsgroup and a faq will soon
be be composed to start promulgating what is known on the subject. if
you would like to be added to the maillist send email or mail to the
address below.
this group would publish an electronic quarterly naplps/jpeg
newsletter as well as a hardcopy version. we urge all who wants to
see cmcs hires based applications & the naplps/jpeg g r o w, decide to
join and mutually benefit from this not-for_profit endeavor.
note: telematic has been defined by mr. james martin as the marriage
of voice, video, hi-res graphics, fax, ivr, music over telephone
lines/lan.
if you would like to get involve write to me at:
img inter-multimedia group| internet: epimntl@world.std.com
p.o. box 95901 | ed.pimentel@gisatl.fidonet.org
atlanta, georgia, us | cis : 70611,3703
| fidonet : 1:133/407
| bbs : +1-404-985-1198 zyxel 14.4k
Explained as: linear model
y=comp.graphics (score 1.082) top features
Contribution? Feature
+2.015        Highlighted in text (sum)
-0.933        <BIAS>
i am interested in finding 3d animation programs for the mac.
i am especially interested in any programs that don't exist
in a pc port and are so good that they would make me go buy
a mac. do any such exist?
Explained as: linear model
y=sci.space (score -0.028) top features
Contribution? Feature
+0.928        Highlighted in text (sum)
-0.956        <BIAS>
i'm also interested in such a program. but most of all i'd like to know
wich program is able to convert gif or pcx to dxf !!! when i have this
program, i can scan pictures and frase (or something like that !) them.
this will be beyond the limit !!!
Explained as: linear model
y=comp.graphics (score -0.414) top features
Contribution? Feature
+0.519        Highlighted in text (sum)
-0.933        <BIAS>
or how about:
"end light pollution now!!"
your banner would have no effect on its subject, but my banner would.
Explained as: linear model
y=sci.space (score 2.117) top features
Contribution? Feature
+3.073        Highlighted in text (sum)
-0.956        <BIAS>
: while i'm sure sagan considers it sacrilegious, that wouldn't be
: because of his doubtfull credibility as an astronomer. modern,
: ground-based, visible light astronomy (what these proposed
: orbiting billboards would upset) is already a dying field: the
: opacity and distortions caused by the atmosphere itself have
: driven most of the field to use radio, far infrared or space-based
: telescopes.
hardly. the keck telescope in hawaii has taken its first pictures; they're
nearly as good as hubble for a tiny fraction of the cost.
: in any case, a bright point of light passing through
: the field doesn't ruin observations. if that were the case, the
: thousands of existing satellites would have already done so (satelliets
: might not seem so bright to the eyes, but as far as astronomy is concerned,
: they are extremely bright.)
i believe that this orbiting space junk will be far brighter still;
more like the full moon. the moon upsets deep-sky observation all
over the sky (and not just looking at it) because of scattered light.
this is a known problem, but of course two weeks out of every four are
ok. what happens when this billboard circles every 90 minutes? what
would be a good time then?
: frank crary
: cu boulder
Explained as: linear model
y=alt.atheism (score 1.537) top features
Contribution? Feature
+2.591        Highlighted in text (sum)
-1.055        <BIAS>
not if you show that these hypothetical atheists are gullible, excitable
and easily led from some concrete cause. in that case we would also
have to discuss if that concrete cause, rather than atheism, was the
factor that caused their subsequent behaviour.
Explained as: linear model
y=sci.space (score 0.045) top features
Contribution? Feature
+1.001        Highlighted in text (sum)
-0.956        <BIAS>
picture our universe floating like a log
in a river. as the log floats down the
river, it occasionally strikes rocks, the
bank, the bottom, other logs. when this collission
occurs, kinetic energy is translated into heat, the
log degrades, gets scraped up, and other energy
translaions occur. the distribution of damage to
the log depends on the shape of the log.
however, to a very small virus in a mite on the head of a
termite in the center of the log, the shock waves from the
collissions would appear uniformly random in direction.
this is my theory for grb. they are evidence of our universe
interacting with other universes! why not! makes
just as much sense as the grb coming from the oort cloud!
the log theory of universes can't be ruled out!
of course, i'm a layman in the physics world. you
physicists out there, tell me about this !!!!
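With `top_targets=1`, each explanation above shows a single class per document. A plausible reading, sketched here as an assumption and not as eli5's actual code, is that targets are ranked by score and only the highest-scoring ones are kept. The hypothetical helper `top_targets` below applies that ranking to the four scores reported earlier for `test['data'][2]`.

```python
def top_targets(scores, n=1):
    """Return the n target names with the highest scores (hypothetical helper)."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Per-class scores reported earlier in this notebook for test['data'][2]
scores = {
    "alt.atheism": -4.351,
    "comp.graphics": 2.166,
    "sci.space": -0.890,
    "talk.religion.misc": -2.004,
}
print(top_targets(scores))       # ['comp.graphics']
print(top_targets(scores, n=2))  # ['comp.graphics', 'sci.space']
```

Ranking by score means the selected class can still have a negative score, as in the outputs above where the shown class has a score of -0.028 or -0.414.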
Content source: TeamHG-Memex/eli5