TOPIC MODELING II

      This code is based on an example by Grisel & Buitinck from the scikit-learn package.


In [1]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.datasets import fetch_20newsgroups

# Only 10 documents are loaded here to keep the printed corpus short;
# the NMF output shown further below reports n_samples=2000.
n_samples = 10
n_features = 1000
n_topics = 10
n_top_words = 20

t0 = time()
print("Loading the corpus")

# Strip headers, footers and quoted replies so that topics reflect the message bodies
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))

corpus = dataset.data[:n_samples]
for doc in corpus:
    print('\n#################################')
    print(doc)
    print('###################################')


Loading the corpus

#################################
Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortunate.

###################################

#################################







Yeah, do you expect people to read the FAQ, etc. and actually accept hard
atheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out
of steam!







Jim,

Sorry I can't pity you, Jim.  And I'm sorry that you have these feelings of
denial about the faith you need to get by.  Oh well, just pretend that it will
all end happily ever after anyway.  Maybe if you start a new newsgroup,
alt.atheist.hard, you won't be bummin' so much?






Bye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) 
--
Bake Timmons, III
###################################

#################################
Although I realize that principle is not one of your strongest
points, I would still like to know why do do not ask any question
of this sort about the Arab countries.

   If you want to continue this think tank charade of yours, your
fixation on Israel must stop.  You might have to start asking the
same sort of questions of Arab countries as well.  You realize it
would not work, as the Arab countries' treatment of Jews over the
last several decades is so bad that your fixation on Israel would
begin to look like the biased attack that it is.

   Everyone in this group recognizes that your stupid 'Center for
Policy Research' is nothing more than a fancy name for some bigot
who hates Israel.
###################################

#################################
Notwithstanding all the legitimate fuss about this proposal, how much
of a change is it?  ATT's last product in this area (a) was priced over
$1000, as I suspect 'clipper' phones will be; (b) came to the customer 
with the key automatically preregistered with government authorities. Thus,
aside from attempting to further legitimize and solidify the fed's posture,
Clipper seems to be "more of the same", rather than a new direction.
   Yes, technology will eventually drive the cost down and thereby promote
more widespread use- but at present, the man on the street is not going
to purchase a $1000 crypto telephone, especially when the guy on the other
end probably doesn't have one anyway.  Am I missing something?
   The real question is what the gov will do in a year or two when air-
tight voice privacy on a phone line is as close as your nearest pc.  That
has got to a problematic scenario for them, even if the extent of usage
never surpasses the 'underground' stature of PGP.
###################################

#################################
Well, I will have to change the scoring on my playoff pool.  Unfortunately
I don't have time right now, but I will certainly post the new scoring
rules by tomorrow.  Does it matter?  No, you'll enter anyway!!!  Good!

--
    Keith Keller				LET'S GO RANGERS!!!!!
						LET'S GO QUAKERS!!!!!
	kkeller@mail.sas.upenn.edu		IVY LEAGUE CHAMPS!!!!
###################################

#################################
 
 
I read somewhere, I think in Morton Smith's _Jesus the Magician_, that
old Lazarus wasn't dead, but going in the tomb was part of an initiation
rite for a magi-cult, of which Jesus was also a part.   It appears that
a 3-day stay was normal.   I wonder .... ?
###################################

#################################

Ok.  I have a record that shows a IIsi with and without a 64KB cache.
It's small enough that I will attach it.

I have also measured some real programs with and without the 64 KB
cache.  The speedup varies a lot from app to app, ranging from 0% to
40%.  I think an average of 20%-25% is about right.  The subjective
difference is not great, but is sometimes noticable.  A simple cache
card certainly does not transform a IIsi into something enormously
better.  I do not have an FPU.

The conventional wisdom says that cache cards from all of the makers
offer about the same speedup and that there is not much difference
between 32K and 64K caches.  I bought mine from Third Wave for well
under $150.  I have had absolutely no problems at all with it.

If you get *complete* speedometer runs for a 32K cache, I'd like to
see them.  Let's check the conventional wisdom!  The so called
"Performance Rating" numbers by themselves are of no interest. 

Cheers.

(This file must be converted with BinHex 4.0)
[BinHex-encoded attachment omitted]
-- 
###################################

#################################



Sounds like wishful guessing.




'So-called' ? What do you mean ? How would you see the peace process?

So you say palestineans do not negociate because of 'well-founded' predictions ?
How do you know that they are 'well founded' if you do not test them at the 
table ? 18 months did not prove anything, but it's always the other side at 
fault, right ?

Why ? I do not know why, but if, let's say, the Palestineans (some of them) want
ALL ISRAEL, and these are known not to be accepted terms by israelis.

Or, maybe they (palestinenans) are not yet ready for statehood ?

Or, maybe there is too much politics within the palestinean leadership, too many
fractions aso ?

I am not saying that one of these reasons is indeed the real one, but any of
these could make arabs stall the negotiations.

 
I like California oranges. And the feelings may get sharper at the table.



Regards,
###################################

#################################
 Nobody is saying that you shouldn't be allowed to use msg.  Just
don't force it on others. If you have food that you want to 
enhance with msg just put the MSG on the table like salt.  It is
then the option of the eater to use it.  If you make a commerical
product, just leave it out. You can include a packet (like some
salt packets) if you desire.

Salt, pepper, mustard, ketchup, pickles ..... are table options.
Treat MSG the same way.  I wouldn't shove my condiments down your
throat, don't shove yours down mine.

WFL

###################################

#################################

  I was wondering if anyone can shed any light on just how it is that these
electronic odometers remember the total elapsed mileage?  What kind of
memory is stable/reliable enough, non-volatile enough and independent enough
(of outside battery power) to last say, 10 years or more, in the life of a
vehicle?  I'm amazed that anything like this could be expected to work for
this length of time (especially in light of all the gizmos I work with that
are doing good to work for 2 months without breaking down somehow).

Side question:  how about the legal ramifications of selling a used car with
a replaced odometer that starts over at 0 miles, after say 100/200/300K
actual miles.  Looks like fraud would be fairly easy - for the price of a
new odometer, you can say it has however many miles you want to tell the
buyer it has.

Thanks for any insight.

###################################

In [50]:
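print("Extracting TF-IDF features")
# Ignore terms that appear in more than 95% of the documents or in fewer than 2,
# drop English stop words, and keep at most n_features terms
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')

tfidf = vectorizer.fit_transform(corpus)

# t0 was set when the corpus load began, so this is cumulative time
print("done in %0.3fs." % (time() - t0))


Extracting TF-IDF features
done in 2.592s.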
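
To make the weighting step concrete, here is a minimal pure-Python sketch of what a TF-IDF vectorizer computes with `min_df=2` and `max_df=0.95`. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which are the documented scikit-learn defaults. The four toy documents and the helper name `tfidf_row` are hypothetical, and tokenization is simplified to whitespace splitting with no stop-word list.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from 20 Newsgroups
docs = [
    "the game was a good game",
    "the team won the game",
    "the team lost the disk",
    "the disk drive failed",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# Keep terms found in at least min_df documents and in at most max_df (a fraction) of them
min_df, max_df = 2, 0.95
vocab = sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one document over vocab."""
    counts = Counter(tokens)
    # Raw term count times smoothed inverse document frequency
    weights = [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)
for row in matrix:
    print(["%.3f" % w for w in row])
```

In this toy corpus "the" occurs in every document and is dropped by `max_df`, while words that occur in a single document are dropped by `min_df`, so only terms shared by some but not all documents remain.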

In [55]:
# Fit the NMF model
print("Fitting the NMF model with n_samples=%d and n_features=%d ...\n" % (n_samples, n_features))

nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf)
# t0 is never reset, so this reports time elapsed since the first cell, not the fit alone
print("done in %0.3fs." % (time() - t0))

print('TF IDF shape', tfidf.shape)
# components_ has one row per topic and one column per vocabulary term
print('components shape:', nmf.components_.shape)
print(nmf.components_)


Fitting the NMF model with n_samples=2000 and n_features=1000 ...

done in 4427.459s.
TF IDF shape (2000, 1000)
components shape: (10, 1000)
[[ 0.          0.04842267  0.         ...,  0.12304761  0.01875739
   0.07667154]
 [ 0.          0.          0.         ...,  0.          0.00737335  0.        ]
 [ 0.          0.01412313  0.         ...,  0.01802372  0.00906201
   0.01317645]
 ..., 
 [ 0.          0.09879147  0.         ...,  0.00490208  0.          0.        ]
 [ 0.          0.          0.         ...,  0.07978946  0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]
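
The matrix printed above is the topic-by-term factor of a non-negative factorization X ≈ W · H, where X is the TF-IDF matrix, W holds one row of topic weights per document, and H (`components_`) holds one row of term weights per topic. The sketch below illustrates the idea on a tiny matrix using the classic multiplicative update rules for the Frobenius norm. This is a swapped-in technique chosen for brevity, not the solver scikit-learn uses by default (coordinate descent); the function name `nmf_multiplicative` and the toy data are hypothetical.

```python
import numpy as np


def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix X into W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Random non-negative starting point
    W = rng.random((n_samples, n_components))
    H = rng.random((n_components, n_features))
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy "document x term" matrix with two obvious groups of terms
X = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.8, 1.0],
])

W, H = nmf_multiplicative(X, n_components=2)
error = np.linalg.norm(X - W @ H)

print("W shape:", W.shape)  # documents x topics
print("H shape:", H.shape)  # topics x terms, the analogue of nmf.components_
print("reconstruction error: %.4f" % error)
```

Each row of H plays the role of one topic: the columns with the largest weights in that row are the terms that characterize it.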

In [54]:
# get_feature_names() was removed in scikit-learn 1.2; releases before 1.0
# only provide get_feature_names()
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(nmf.components_):
    # Indices of the n_top_words largest weights, in descending order
    sorted_topic_words = topic.argsort()[:-n_top_words - 1:-1]

    print("\nTopic #%d:" % topic_idx)
    print(" ".join(feature_names[s] for s in sorted_topic_words))



Topic #0:
people just don think like know say did really make time way ve right going sure said got wrong didn

Topic #1:
windows file dos use using program window files problem help pc running application version drivers ftp screen available work ms

Topic #2:
game team year games win play season players runs ll good toronto defense division teams won better player goal best

Topic #3:
thanks know does mail advance hi info interested anybody email like looking help card need appreciated hello information list send

Topic #4:
god jesus bible does faith christian christ christians church believe life lord true religion love human man belief people good

Topic #5:
drive drives hard disk card software mac power apple pc problem computer memory external speed board internal problems work monitor

Topic #6:
10 00 space new 11 12 15 16 20 18 25 17 sale earth 13 93 22 24 23 years

Topic #7:
car bike good cars engine new power speed miles buy price used like year want driving area bought just models

Topic #8:
key chip government clipper encryption keys use public law enforcement secure phone data used security communications secret standard going clinton

Topic #9:
edu com banks soon internet send ftp university mail information article pub mac student contact cc server address email list
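
The slice `argsort()[:-n_top_words - 1:-1]` used to pick the top words is terse, so here is a small sketch of what it does. `argsort()` returns the indices that would sort the weights in ascending order, and the negative-step slice walks that result backwards from the end, yielding the indices of the largest weights first. The five-term vocabulary and the weights are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary and one topic's term weights
feature_names = ["apple", "bike", "car", "disk", "engine"]
topic = np.array([0.0, 0.3, 0.9, 0.1, 0.5])
n_top_words = 3

# Ascending order of weights, expressed as indices into topic
ascending = topic.argsort()

# Take the last n_top_words indices, reversed, so the largest weight comes first
top_indices = ascending[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]

print(top_indices)
print(top_words)
```

An equivalent and arguably more readable form is `topic.argsort()[::-1][:n_top_words]`, which reverses the full ordering first and then keeps the leading entries.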

