In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
fetch_subset = lambda subset: fetch_20newsgroups(
    subset=subset, categories=categories,
    shuffle=True, random_state=42,
    remove=('headers', 'footers', 'quotes'))
train = fetch_subset('train')
test = fetch_subset('test')

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer stores no vocabulary: tokens are hashed straight into 10000 columns.
vec = HashingVectorizer(n_features=10000)
clf = SGDClassifier()
pipeline = Pipeline([('vec', vec), ('clf', clf)])
pipeline.fit(train['data'], train['target'])


Out[2]:
Pipeline(steps=[('vec', HashingVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
         lowercase=True, n_features=10000, ngram_range=(1, 1),
         non_negative=False, norm='l2', preprocessor=None, stop_words=None,...   penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False))])
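
The hashing step above can be illustrated with a small pure-Python sketch. This is not scikit-learn's implementation: `hash_token` and `hash_vectorize` are hypothetical helpers, MD5 stands in for the MurmurHash3 function that scikit-learn uses, and the L2 normalization is left out. It only shows the idea that each token is mapped to a column index and a sign, and that nothing about the token itself is stored.

```python
import hashlib
import re

# scikit-learn's default token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def hash_token(token, n_features=10000):
    """Map a token to (column index, sign).

    Assumption: MD5 is a stand-in hash used for illustration only;
    scikit-learn uses a signed 32-bit MurmurHash3 instead.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big", signed=True)
    sign = 1 if value >= 0 else -1
    return abs(value) % n_features, sign


def hash_vectorize(text, n_features=10000):
    """Return one document as a sparse {column: signed count} dict."""
    row = {}
    for token in TOKEN_PATTERN.findall(text.lower()):
        column, sign = hash_token(token, n_features)
        row[column] = row.get(column, 0) + sign
    return row


# Only column indices and signed counts remain; the words are gone.
print(hash_vectorize("Space probes orbit space"))
```

Because only column indices survive, the classifier's coefficients cannot be labelled with words unless the mapping is rebuilt from a corpus.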

In [10]:
from eli5.sklearn import InvertableHashingVectorizer
# Fit on the training text so hashed columns can be mapped back to the terms seen there.
ivec = InvertableHashingVectorizer(vec)
ivec.fit(train['data'])


Out[10]:
InvertableHashingVectorizer(unkn_template='FEATURE[%d]',
              vec=HashingVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
         lowercase=True, n_features=10000, ngram_range=(1, 1),
         non_negative=False, norm='l2', preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None))
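
A rough sketch of how column names could be recovered from a corpus follows. It is an illustration under assumptions, not eli5's code: `hash_token` and `build_column_names` are hypothetical helpers, MD5 again stands in for the real hash, and the `(-)` convention (mark terms whose sign differs from the most frequent term in the column) is assumed from the shape of the printed output.

```python
import hashlib
import re
from collections import Counter, defaultdict

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def hash_token(token, n_features):
    """Stand-in hash (MD5, an assumption): token -> (column index, sign)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big", signed=True)
    return abs(value) % n_features, (1 if value >= 0 else -1)


def build_column_names(docs, n_features):
    """Map each hashed column to the corpus terms that land in it."""
    counts = Counter(
        token for doc in docs for token in TOKEN_PATTERN.findall(doc.lower())
    )
    columns = defaultdict(list)
    # Most frequent terms first, so the likeliest name leads each column.
    for token, _ in counts.most_common():
        column, sign = hash_token(token, n_features)
        columns[column].append((token, sign))
    names = {}
    for column, terms in columns.items():
        base_sign = terms[0][1]
        names[column] = " | ".join(
            token if sign == base_sign else "(-)" + token for token, sign in terms
        )
    return names


# A tiny table (3 columns) forces collisions among six distinct terms.
docs = ["space orbit nasa space", "graphics image file"]
for column, name in sorted(build_column_names(docs, n_features=3).items()):
    print(column, name)
```

Terms that never occur in the fitted corpus cannot be named this way, so any collision list built like this is incomplete.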

In [4]:
from eli5 import explain_weights, explain_prediction
from eli5 import format_as_html, format_as_text, format_html_styles

print(format_as_text(explain_weights(clf, ivec, target_names=train['target_names'])))


Explained as: linear model

Features with largest coefficients per class.
Caveats:
1. Be careful with features which are not
   independent - weights don't show their importance.
2. If scale of input features is different then scale of coefficients
   will also be different, making direct comparison between coefficient values
   incorrect.
3. Depending on regularization, rare features sometimes may have high
   coefficients; this doesn't mean they contribute much to the
   classification result for most examples.

Feature names are restored from their hashes; this is not 100% precise
because collisions are possible. For known collisions possible feature names
are separated by | sign. Keep in mind the collision list is not exhaustive.
Features marked with (-) should be read as inverted: if they have positive
coefficient, the result is negative, if they have negative coefficient,
the result is positive.

y='alt.atheism' top features
Weight  Feature                                                                            
------  -----------------------------------------------------------------------------------
+5.383  atheism | homos | (-)dyson                                                         
+4.889  atheists | (-)simulators | (-)degrading | coprocessor | (-)imsl | justifying | 3261
+4.482  bobby | (-)counterexamples                                                         
+4.360  religion | followers | hunts | 536                                                 
+3.562  words | 24bit | _nightflyers_ | tsv | (-)recommened                                
+3.448  posting | libelous | rude | (-)agreeable | (-)elaine | umd                         
+3.417  post                                                                               
+3.219  atheist | boyce | (-)62618e                                                        
+3.213  define | mixture | (-)cx5 | periphery | cmd | (-)bibtex                            
+3.189  isn | (-)david42                                                                   
+3.164  islam | (-)code3 | descends | (-)2etc | witrh | xgif                               
+2.997  our | newlan | (-)prenatal                                                         
+2.979  example | (-)tolerate | (-)336549999999999955e                                     
+2.945  punishment | jahn | (-)extremists                                                  
+2.902  islamic | (-)angeles | pressures | (-)affordably | (-)snazzy                       
+2.836  being | kernel                                                                     
                                  … 3441 more positive …                                   
                                  … 4622 more negative …                                   
-3.041  christ | (-)silloo | ineed | (-)_religion_                                         
-3.085  interested | galileo | (-)geneologies                                              
-3.429  order | consensus | maker | vm_pray | wonders | zorastrian | (-)457                
-4.664  space | (-)revell                                                                  

y='comp.graphics' top features
Weight  Feature                                                                           
------  ----------------------------------------------------------------------------------
+6.976  graphics | lemur | (-)installations                                               
+5.089  file                                                                              
+4.918  computer | (-)morality | (-)priest | eredoctoraat                                 
+4.675  image | (-)fallacies | envelope | (-)turing | (-)topographic                      
+4.481  3d | wti                                                                          
+4.058  points | credibility                                                              
+3.846  screen | techno | (-)clo | clicking                                               
+3.613  using | contention                                                                
+3.515  42 | (-)auxiliary                                                                 
+3.428  files | cell | dualism | (-)bibliographic                                         
+3.040  virtual | shaking | (-)catechism                                                  
+2.988  site | cheif                                                                      
+2.983  video | dither | sometime | (-)kit | menlo | (-)yourselfers                       
+2.949  package | (-)intelligence                                                         
+2.944  animation | satisfies | nome | intrigued | heightfields | attendent               
+2.835  hi | feasability | (-)seri | teens                                                
+2.834  tiff | distress                                                                   
+2.818  version | ________________________________________________________________________
                                  … 3240 more positive …                                  
                                  … 4190 more negative …                                  
-2.859  orbit | (-)stroked                                                                
-5.100  space | (-)revell                                                                 

y='sci.space' top features
Weight  Feature                                                                                
------  ---------------------------------------------------------------------------------------
+9.732  space | (-)revell                                                                      
+5.425  orbit | (-)stroked                                                                     
+4.504  nasa | (-)mocking | (-)jmd                                                             
+4.193  launch | (-)spring                                                                     
+3.992  spacecraft | (-)revenues | (-)_______ | serbian | (-)detained | externel | (-)tormentor
+3.802  mars | reston | (-)risen | nowadays                                                    
+3.676  moon | faiths | quantized | pet | (-)enriched | brightnesses                           
+3.525  shuttle | (-)recording | vlt                                                           
+3.436  earth | (-)vdp | pixutils | coined | khvaetvadatha | (-)omits                          
+3.158  flight | (-)kerwin | (-)interentested                                                  
+3.115  solar | oscillator | slogan                                                            
+3.094  sci | (-)calculater | inconceivable                                                    
+2.973  satellite | advocated | (-)telesoft | (-)microcontroller                               
+2.943  test | (-)japan                                                                        
                                    … 3841 more positive …                                     
                                    … 4556 more negative …                                     
-3.102  god | (-)casual | socrates | (-)aborted | pivotal                                      
-3.170  religion | followers | hunts | 536                                                     
-3.425  wrong | (-)tired | (-)bestowed                                                         
-3.493  file                                                                                   
-3.619  3d | wti                                                                               
-4.582  graphics | lemur | (-)installations                                                    

y='talk.religion.misc' top features
Weight  Feature                                                                            
------  -----------------------------------------------------------------------------------
+5.561  christian | integer | (-)pd1 | trench                                              
+5.224  christians | 320x200 | subtilty                                                    
+5.079  order | consensus | maker | vm_pray | wonders | zorastrian | (-)457                
+4.556  jesus | (-)systems | butter | (-)aztecs | (-)geotail | optimized                   
+4.442  fbi | awfully | (-)antwerp                                                         
+3.673  blood | (-)reduces                                                                 
+3.498  objective | (-)venera | 82 | (-)fl | (-)834                                        
+3.203  children | (-)inflatable | (-)cutest                                               
+3.111  koresh | (-)gotta | (-)fixtures | (-)hussien | (-)joined                           
+2.918  dead | (-)phillips | (-)les                                                        
+2.864  values | undefinable | (-)chubb                                                    
+2.853  christ | (-)silloo | ineed | (-)_religion_                                         
+2.808  may | aws | (-)umpire | (-)gaat                                                    
+2.808  see | (-)sert                                                                      
                                  … 3528 more positive …                                   
                                  … 4713 more negative …                                   
-3.048  thanks | adequate | royalty | intelligibly                                         
-3.184  atheists | (-)simulators | (-)degrading | coprocessor | (-)imsl | justifying | 3261
-3.353  need | (-)concede | beeld | noss                                                   
-3.464  system | (-)bylaws | (-)724x600                                                    
-3.543  could | diagrams | (-)videoscan | lous | (-)64x64                                  
-4.728  space | (-)revell                                                                  


In [5]:
from IPython.display import display, HTML
show_html = lambda html: display(HTML(html))
show_html_expl = lambda expl, **kwargs: show_html(format_as_html(expl, include_styles=False, **kwargs))
show_html(format_html_styles())



In [6]:
show_html_expl(explain_weights(clf, ivec, target_names=train['target_names']))


Explained as: linear model

Features with largest coefficients per class.
Caveats:
1. Be careful with features which are not
   independent - weights don't show their importance.
2. If scale of input features is different then scale of coefficients
   will also be different, making direct comparison between coefficient values
   incorrect.
3. Depending on regularization, rare features sometimes may have high
   coefficients; this doesn't mean they contribute much to the
   classification result for most examples.

Feature names are restored from their hashes; this is not 100% precise
because collisions are possible. For known collisions possible feature names
are separated by | sign. Keep in mind the collision list is not exhaustive.
Features marked with (-) should be read as inverted: if they have positive
coefficient, the result is negative, if they have negative coefficient,
the result is positive.

y=alt.atheism top features
Weight? Feature
+5.383 atheism
+4.889 atheists
+4.482 bobby
+4.360 religion
+3.562 words
+3.448 posting
+3.417 post
+3.219 atheist
+3.213 define
+3.189 isn
+3.164 islam
+2.997 our
+2.979 example
+2.945 punishment
+2.902 islamic
+2.836 being
… 3441 more positive …
… 4622 more negative …
-3.041 christ
-3.085 interested
-3.429 order
-4.664 space

y=comp.graphics top features
Weight? Feature
+6.976 graphics
+5.089 file
+4.918 computer
+4.675 image
+4.481 3d
+4.058 points
+3.846 screen
+3.613 using
+3.515 42
+3.428 files
+3.040 virtual
+2.988 site
+2.983 video
+2.949 package
+2.944 animation
+2.835 hi
+2.834 tiff
+2.818 version
… 3240 more positive …
… 4190 more negative …
-2.859 orbit
-5.100 space

y=sci.space top features
Weight? Feature
+9.732 space
+5.425 orbit
+4.504 nasa
+4.193 launch
+3.992 spacecraft
+3.802 mars
+3.676 moon
+3.525 shuttle
+3.436 earth
+3.158 flight
+3.115 solar
+3.094 sci
+2.973 satellite
+2.943 test
… 3841 more positive …
… 4556 more negative …
-3.102 god
-3.170 religion
-3.425 wrong
-3.493 file
-3.619 3d
-4.582 graphics

y=talk.religion.misc top features
Weight? Feature
+5.561 christian
+5.224 christians
+5.079 order
+4.556 jesus
+4.442 fbi
+3.673 blood
+3.498 objective
+3.203 children
+3.111 koresh
+2.918 dead
+2.864 values
+2.853 christ
+2.808 may
+2.808 see
… 3528 more positive …
… 4713 more negative …
-3.048 thanks
-3.184 atheists
-3.353 need
-3.464 system
-3.543 could
-4.728 space

In [7]:
show_html_expl(explain_prediction(clf, test['data'][2], vec, target_names=train['target_names']), force_weights=True)


Explained as: linear model

y=alt.atheism (score -4.351) top features
Contribution? Feature
+0.232 some
+0.156 much
+0.061 it
+0.061 which
+0.055 is
+0.042 has
+0.036 trying
+0.032 my
+0.031 sophisticated
+0.014 likes
+0.008 he
+0.004 pc
-0.000 designer
-0.008 suggestion
-0.024 for
-0.035 most
-0.039 decor
-0.046 better
-0.050 any
-0.054 am
-0.056 there
-0.059 and
-0.065 features
-0.067 here
-0.081 interior
-0.095 friend
-0.108 where
-0.109 hi
-0.110 from
-0.116 thailand
-0.119 more
-0.121 how
-0.134 find
-0.167 costs
-0.173 looking
-0.174 help
-0.177 on
-0.241 graphics
-0.256 the
-0.284 to
-0.329 buy
-0.731 software
-1.055 <BIAS>

y=comp.graphics (score 2.166) top features
Contribution? Feature
+0.720 graphics
+0.549 software
+0.370 is
+0.292 hi
+0.234 looking
+0.223 help
+0.217 on
+0.179 any
+0.154 there
+0.153 pc
+0.142 for
+0.123 features
+0.118 find
+0.104 from
+0.103 my
+0.100 has
+0.084 which
+0.081 it
+0.079 am
+0.078 where
+0.038 and
+0.035 here
+0.029 thailand
+0.026 costs
+0.019 friend
+0.019 trying
+0.007 how
+0.004 some
-0.008 designer
-0.010 buy
-0.012 likes
-0.021 sophisticated
-0.021 interior
-0.023 decor
-0.032 better
-0.063 much
-0.078 most
-0.079 more
-0.106 the
-0.106 to
-0.107 suggestion
-0.514 he
-0.933 <BIAS>

y=sci.space (score -0.890) top features
Contribution? Feature
+0.297 the
+0.232 buy
+0.161 costs
+0.158 how
+0.149 most
+0.103 it
+0.100 more
+0.097 to
+0.096 on
+0.095 software
+0.085 some
+0.075 from
+0.072 much
+0.070 friend
+0.069 here
+0.057 where
+0.048 likes
+0.048 there
+0.035 decor
+0.029 sophisticated
+0.012 has
+0.009 for
+0.006 designer
-0.004 suggestion
-0.017 better
-0.018 any
-0.026 interior
-0.056 and
-0.060 thailand
-0.064 pc
-0.066 trying
-0.071 am
-0.073 features
-0.094 help
-0.107 my
-0.111 hi
-0.112 which
-0.123 find
-0.145 is
-0.152 looking
-0.265 he
-0.473 graphics
-0.956 <BIAS>

y=talk.religion.misc (score -2.004) top features
Contribution? Feature
+0.478 he
+0.225 is
+0.133 more
+0.116 interior
+0.088 my
+0.085 thailand
+0.076 and
+0.070 find
+0.070 looking
+0.053 suggestion
+0.051 am
+0.047 buy
+0.043 friend
+0.040 trying
+0.037 where
+0.026 costs
+0.025 decor
+0.009 which
+0.007 to
+0.004 designer
-0.011 how
-0.012 here
-0.013 help
-0.016 from
-0.021 the
-0.035 features
-0.044 sophisticated
-0.055 there
-0.064 likes
-0.069 better
-0.095 pc
-0.106 most
-0.131 any
-0.135 for
-0.176 has
-0.186 on
-0.189 much
-0.193 hi
-0.193 some
-0.282 graphics
-0.287 it
-0.402 software
-0.973 <BIAS>

y=alt.atheism (score -4.351) top features

Contribution? Feature
-1.055 <BIAS>
-3.296 Highlighted in text (sum)

hi there, i am here looking for some help. my friend is a interior decor designer. he is from thailand. he is trying to find some graphics software on pc. any suggestion on which software to buy,where to buy and how much it costs ? he likes the most sophisticated software(the more features it has,the better)

y=comp.graphics (score 2.166) top features

Contribution? Feature
+3.098 Highlighted in text (sum)
-0.933 <BIAS>

hi there, i am here looking for some help. my friend is a interior decor designer. he is from thailand. he is trying to find some graphics software on pc. any suggestion on which software to buy,where to buy and how much it costs ? he likes the most sophisticated software(the more features it has,the better)

y=sci.space (score -0.890) top features

Contribution? Feature
+0.066 Highlighted in text (sum)
-0.956 <BIAS>

hi there, i am here looking for some help. my friend is a interior decor designer. he is from thailand. he is trying to find some graphics software on pc. any suggestion on which software to buy,where to buy and how much it costs ? he likes the most sophisticated software(the more features it has,the better)

y=talk.religion.misc (score -2.004) top features

Contribution? Feature
-0.973 <BIAS>
-1.031 Highlighted in text (sum)

hi there, i am here looking for some help. my friend is a interior decor designer. he is from thailand. he is trying to find some graphics software on pc. any suggestion on which software to buy,where to buy and how much it costs ? he likes the most sophisticated software(the more features it has,the better)
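
The per-class summaries above add up as expected for a linear model: each score is the `<BIAS>` term plus the summed contribution of the highlighted words. The check below simply re-adds the numbers printed in the output (the `reported` dict is copied from it by hand) and allows for rounding to three decimals. It is a sanity check on the displayed values, not a call into eli5.

```python
# (bias, "Highlighted in text (sum)", reported score) copied from the output above.
reported = {
    "alt.atheism": (-1.055, -3.296, -4.351),
    "comp.graphics": (-0.933, 3.098, 2.166),
    "sci.space": (-0.956, 0.066, -0.890),
    "talk.religion.misc": (-0.973, -1.031, -2.004),
}

for name, (bias, text_sum, score) in reported.items():
    recomputed = bias + text_sum
    # Displayed values are rounded, so allow a small tolerance.
    assert abs(recomputed - score) < 0.002, name
    print(f"{name:20s} {bias:+.3f} + {text_sum:+.3f} = {recomputed:+.3f}")

# The predicted class is the one with the highest score.
predicted = max(reported, key=lambda name: reported[name][2])
print("predicted:", predicted)
```

Only comp.graphics ends up with a positive score for this document, so it is the predicted class.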


In [8]:
show_html_expl(explain_prediction(clf, test['data'][4], vec, target_names=train['target_names']), force_weights=False)


Explained as: linear model

y=alt.atheism (score -2.171) top features

Contribution? Feature
-1.055 <BIAS>
-1.116 Highlighted in text (sum)

i am interested in finding 3d animation programs for the mac. i am especially interested in any programs that don't exist in a pc port and are so good that they would make me go buy a mac. do any such exist?

y=comp.graphics (score 1.082) top features

Contribution? Feature
+2.015 Highlighted in text (sum)
-0.933 <BIAS>

i am interested in finding 3d animation programs for the mac. i am especially interested in any programs that don't exist in a pc port and are so good that they would make me go buy a mac. do any such exist?

y=sci.space (score -2.049) top features

Contribution? Feature
-0.956 <BIAS>
-1.093 Highlighted in text (sum)

i am interested in finding 3d animation programs for the mac. i am especially interested in any programs that don't exist in a pc port and are so good that they would make me go buy a mac. do any such exist?

y=talk.religion.misc (score -1.993) top features

Contribution? Feature
-0.973 <BIAS>
-1.019 Highlighted in text (sum)

i am interested in finding 3d animation programs for the mac. i am especially interested in any programs that don't exist in a pc port and are so good that they would make me go buy a mac. do any such exist?
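
For a linear classifier, the contribution of a single feature is its coefficient multiplied by the feature's value in the document. The sketch below shows that calculation with made-up numbers: `explain_linear`, the weights, the bias and the feature values are all hypothetical and are not taken from the trained model.

```python
def explain_linear(weights, bias, features):
    """Return (score, contributions sorted from most positive to most negative).

    Hypothetical helper: contribution = coefficient * feature value,
    and the score is the bias plus the sum of all contributions.
    """
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    score = bias + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return score, ranked


# Made-up coefficients and feature values, for illustration only.
weights = {"graphics": 0.5, "software": 0.25, "space": -0.5}
bias = -1.0
features = {"graphics": 2.0, "software": 1.0, "space": 1.0}

score, ranked = explain_linear(weights, bias, features)
for name, contribution in ranked:
    print(f"{contribution:+.3f} {name}")
print(f"{bias:+.3f} <BIAS>")
print(f"score {score:+.3f}")
```

With the L2 normalization used by the vectorizer in this notebook, feature values shrink as documents get longer, so the same word can contribute differently in different documents.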


In [9]:
for doc in test['data'][:10]:
    expl = explain_prediction(clf, doc, vec, target_names=train['target_names'], top_targets=1)
    show_html_expl(expl, force_weights=False)


Explained as: linear model

y=sci.space (score 0.184) top features

Contribution? Feature
+1.140 Highlighted in text (sum)
-0.956 <BIAS>

trry the skywatch project in arizona.

Explained as: linear model

y=comp.graphics (score 2.062) top features

Contribution? Feature
+2.994 Highlighted in text (sum)
-0.933 <BIAS>

the vatican library recently made a tour of the us. can anyone help me in finding a ftp site where this collection is available.

Explained as: linear model

y=comp.graphics (score 2.166) top features

Contribution? Feature
+3.098 Highlighted in text (sum)
-0.933 <BIAS>

hi there, i am here looking for some help. my friend is a interior decor designer. he is from thailand. he is trying to find some graphics software on pc. any suggestion on which software to buy,where to buy and how much it costs ? he likes the most sophisticated software(the more features it has,the better)

Explained as: linear model

y=comp.graphics (score 0.362) top features

Contribution? Feature
+1.295 Highlighted in text (sum)
-0.933 <BIAS>

rfd request for discussion for the open telematic group otg i have proposed the forming of a consortium/task force for the promotion of naplps/jpeg, fif to openly discuss ways, method, procedures,algorythms, applications, implementation, extensions of naplps/jpeg standards. these standards should facilitate the creation of real_time online applications that make use of voice, video, telecommuting, hires graphics, conferencing, distant learning, online order entry, fax,in addition these dicussion would assist all to better understand how sgml, cals, oda, mime, oodbms, jpeg, mpeg, fractals, sql, cdrom, cdromxa, kodak photocd, tcl, v.fast, and eia/tia562, can best be incorporated and implemented to develop telematic/multimedia applications. we want to be able to support dos, unix, mac, windows, nt, os/2 platforms. it is our hope that individuals, developers, corporations, universities, r & d labs would join in in supporting such an endeavor. this would be a not_for_profit group with bylaws and charter. already many corporations have decided to support otg (open telematic group) so do not delay joining if you are a developer an rfd has been posted to form a usenet newsgroup and a faq will soon be be composed to start promulgating what is known on the subject. if you would like to be added to the maillist send email or mail to the address below. this group would publish an electronic quarterly naplps/jpeg newsletter as well as a hardcopy version. we urge all who wants to see cmcs hires based applications & the naplps/jpeg g r o w, decide to join and mutually benefit from this not-for_profit endeavor. note: telematic has been defined by mr. james martin as the marriage of voice, video, hi-res graphics, fax, ivr, music over telephone lines/lan. if you would like to get involve write to me at: img inter-multimedia group| internet: epimntl@world.std.com p.o. 
box 95901 | ed.pimentel@gisatl.fidonet.org atlanta, georgia, us | cis : 70611,3703 | fidonet : 1:133/407 | bbs : +1-404-985-1198 zyxel 14.4k

Explained as: linear model

y=comp.graphics (score 1.082) top features

Contribution? Feature
+2.015 Highlighted in text (sum)
-0.933 <BIAS>

i am interested in finding 3d animation programs for the mac. i am especially interested in any programs that don't exist in a pc port and are so good that they would make me go buy a mac. do any such exist?

Explained as: linear model

y=sci.space (score -0.028) top features

Contribution? Feature
+0.928 Highlighted in text (sum)
-0.956 <BIAS>

i'm also interested in such a program. but most of all i'd like to know wich program is able to convert gif or pcx to dxf !!! when i have this program, i can scan pictures and frase (or something like that !) them. this will be beyond the limit !!!

Explained as: linear model

y=comp.graphics (score -0.414) top features

Contribution? Feature
+0.519 Highlighted in text (sum)
-0.933 <BIAS>

or how about: "end light pollution now!!" your banner would have no effect on its subject, but my banner would.

Explained as: linear model

y=sci.space (score 2.117) top features

Contribution? Feature
+3.073 Highlighted in text (sum)
-0.956 <BIAS>

: while i'm sure sagan considers it sacrilegious, that wouldn't be : because of his doubtfull credibility as an astronomer. modern, : ground-based, visible light astronomy (what these proposed : orbiting billboards would upset) is already a dying field: the : opacity and distortions caused by the atmosphere itself have : driven most of the field to use radio, far infrared or space-based : telescopes. hardly. the keck telescope in hawaii has taken its first pictures; they're nearly as good as hubble for a tiny fraction of the cost. : in any case, a bright point of light passing through : the field doesn't ruin observations. if that were the case, the : thousands of existing satellites would have already done so (satelliets : might not seem so bright to the eyes, but as far as astronomy is concerned, : they are extremely bright.) i believe that this orbiting space junk will be far brighter still; more like the full moon. the moon upsets deep-sky observation all over the sky (and not just looking at it) because of scattered light. this is a known problem, but of course two weeks out of every four are ok. what happens when this billboard circles every 90 minutes? what would be a good time then? : frank crary : cu boulder

Explained as: linear model

y=alt.atheism (score 1.537) top features

Contribution? Feature
+2.591 Highlighted in text (sum)
-1.055 <BIAS>

not if you show that these hypothetical atheists are gullible, excitable and easily led from some concrete cause. in that case we would also have to discuss if that concrete cause, rather than atheism, was the factor that caused their subsequent behaviour.

Explained as: linear model

y=sci.space (score 0.045) top features

Contribution? Feature
+1.001 Highlighted in text (sum)
-0.956 <BIAS>

picture our universe floating like a log in a river. as the log floats down the river, it occasionally strikes rocks, the bank, the bottom, other logs. when this collission occurs, kinetic energy is translated into heat, the log degrades, gets scraped up, and other energy translaions occur. the distribution of damage to the log depends on the shape of the log. however, to a very small virus in a mite on the head of a termite in the center of the log, the shock waves from the collissions would appear uniformly random in direction. this is my theory for grb. they are evidence of our universe interacting with other universes! why not! makes just as much sense as the grb coming from the oort cloud! the log theory of universes can't be ruled out! of course, i'm a layman in the physics world. you physicists out there, tell me about this !!!!