Unsupervised Learning

Davis SML: Lecture 10 Part 3

Prof. James Sharpnack


In [80]:
from lxml import html, etree
import numpy as np
from sklearn import cluster, feature_extraction, metrics, preprocessing, decomposition
import collections
import nltk
import pandas as pd
import plotnine as p9
# nltk.download()
# Download Corpora -> stopwords, Models -> punkt

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

TFIDF vectorization

  • document vectorization counts the proportion of words in document
  • $X_{i,j}$ is the "proportion" of word j in document i
  • tfidf indicates term-frequency (proportion of words in document i which is word j), and inverse document frequency (log of inverse frequency of documents containing word j)

In [4]:
reu = html.parse("reuters/reut2-000.sgm") #You will have to do this for all sgm files here

In [47]:
def parse_reu(reu):
    """Parses the etree object and returns a list of dictionary of reuters attr
    Output: {'topics': the topic of the article, 'places': where it is located, 
        'split': training/test split, 'body':the text of the article as a set of words with stopwords removed}
    """
    root= reu.getroot()
    articles = root.body.getchildren()
    stop_words = set(stopwords.words('english'))
    reu_pl = []
    for a in articles:
        reu_parse = {}
        if a.attrib['topics'] != 'YES':
            next
        topics = a.find('topics').findall('d')
        if topics:
            reu_parse['topics'] = [t.text for t in topics]
        else:
            reu_parse['topics'] = []
        places = a.find('places').findall('d')
        if places:
            reu_parse['places'] = [t.text for t in places]
        reu_parse['split'] = a.attrib['lewissplit']
        rtxt = a.find('text')
        word_tokens = word_tokenize(rtxt.text_content())
        filtered_sentence = " ".join([w.lower() for w in word_tokens if not w in stop_words])
        reu_parse['body'] = filtered_sentence
        reu_pl.append(reu_parse)
    return reu_pl

In [48]:
reu_pl = parse_reu(reu)

In [49]:
print(reu_pl[0]['topics'])
reu_pl[0]['body']


['cocoa']
Out[49]:
'bahia cocoa review salvador , feb 26 - showers continued throughout week bahia cocoa zone , alleviating drought since early january improving prospects coming temporao , although normal humidity levels restored , comissaria smith said weekly review . the dry period means temporao late year . arrivals week ended february 22 155,221 bags 60 kilos making cumulative total season 5.93 mln 5.81 stage last year . again seems cocoa delivered earlier consignment included arrivals figures . comissaria smith said still doubt much old crop cocoa still available harvesting practically come end . with total bahia crop estimates around 6.4 mln bags sales standing almost 6.2 mln hundred thousand bags still hands farmers , middlemen , exporters processors . there doubts much cocoa would fit export shippers experiencing dificulties obtaining +bahia superior+ certificates . in view lower quality recent weeks farmers sold good part cocoa held consignment . comissaria smith said spot bean prices rose 340 350 cruzados per arroba 15 kilos . bean shippers reluctant offer nearby shipment limited sales booked march shipment 1,750 1,780 dlrs per tonne ports named . new crop sales also light open ports june/july going 1,850 1,880 dlrs 35 45 dlrs new york july , aug/sept 1,870 , 1,875 1,880 dlrs per tonne fob . routine sales butter made . march/april sold 4,340 , 4,345 4,350 dlrs . april/may butter went 2.27 times new york may , june/july 4,400 4,415 dlrs , aug/sept 4,351 4,450 dlrs 2.27 2.28 times new york sept oct/dec 4,480 dlrs 2.27 times new york dec , comissaria smith said . destinations u.s. , covertible currency areas , uruguay open ports . cake sales registered 785 995 dlrs march/april , 785 dlrs may , 753 dlrs aug 0.39 times new york dec oct/dec . buyers u.s. , argentina , uruguay convertible currency areas . liquor sales limited march/april selling 2,325 2,380 dlrs , june/july 2,375 dlrs 1.25 times new york july , aug/sept 2,400 dlrs 1.25 times new york sept oct/dec 1.25 times new york dec , comissaria smith said . total bahia sales currently estimated 6.13 mln bags 1986/87 crop 1.06 mln bags 1987/88 crop . final figures period february 28 expected published brazilian cocoa trade commission carnival ends midday february 27 . reuter'

In [88]:
vec = feature_extraction.text.TfidfVectorizer()
X = vec.fit_transform(doc['body'] for doc in reu_pl)
X.shape


Out[88]:
(1000, 10749)

Document clustering

  • rows are documents, columns are words
  • clustering with sklearn KMeans
  • selected 10 clusters arbitrarily

In [82]:
doc_clust = cluster.KMeans(n_clusters=10)
doc_clust.fit(X)


Out[82]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [84]:
doc_clust.cluster_centers_.shape


Out[84]:
(10, 10749)

In [17]:
vocab_lookup = {b:a for a,b in vec.vocabulary_.items()}

In [29]:
ccargsort = doc_clust.cluster_centers_.argsort(axis=1)

In [31]:
center_vocab = [[vocab_lookup[row[-i]] for i in range(1,21)] for row in ccargsort]

In [35]:
print("\n\n".join([" ".join(voc) for voc in center_vocab]))


shr vs net qtr revs cts 4th 31 mln note profit jan loss year 16 reuter dlrs march shrs mths

organization quotas quota coffee meeting delegates talks agreement ico prices council export international london producers world accord said last opec

issue bond priced lead manager payment 100 issues issuing coupon the date pct eurobond mln co due franc said bank

000 shr net vs qtr cts revs 4th note year mln loss inc dlrs oper 12 1986 mths reuter avg

div record pay prior qtly vs cts payout mateo franklin san sets dividend mthly calif april fund march income 31

year said pct reuter 1986 the rose march in billion dlrs february mln 000 last january total 26 1987 10

said corp company inc reuter march the unit new dlrs co to 26 mln feb york it contract subsidiary 000

would said government the told march year bank reuter official he last foreign minister one president to officials reuters billion

blah to says in mln fed dlrs revised for money stg pct week from billion bank england sears up dlr

common shares stock share shareholders said payable inc outstanding holders company record board reuter corp 26 split march sets offer

In [37]:
clust_counts = collections.Counter(doc_clust.labels_)

In [38]:
clust_counts


Out[38]:
Counter({5: 205,
         6: 237,
         7: 149,
         2: 56,
         9: 74,
         3: 73,
         0: 61,
         8: 61,
         4: 47,
         1: 37})

In [43]:
proto_inds = metrics.pairwise_distances_argmin(doc_clust.cluster_centers_,X)

In [53]:
print("\n\n".join([reu_pl[i]['body'] for i in proto_inds]))


dillard department stores inc < dds > 4th qtr net little rock , ark. , march 2 - qtr ended jan 31 shr 1.16 dlrs vs 1.15 dlrs net 32.4 mln vs 33.5 mln revs 629.0 mln vs 538.6 mln avg shrs 32.1 mln vs 29.2 mln 12 mths shr 2.35 dlrs vs 2.29 dlrs net 74.5 mln vs 66.9 mln revs 1.85 billion vs 1.60 billion avg shrs 31.7 mln vs 29.2 mln note : shr/avg shrs data show 2-for-1 split nov. 1985 . reuter

coffee quota talks continue , no accord seen likely london , march 2 - the international coffee organization ( ico ) council talks reintroducing export quotas continued extended session lasting late sunday night , delegates said prospects accord producers consumers diminishing minute . the special meeting , called stop prolonged slide coffee prices , likely adjourn sometime tonight without agreement , delegates said . the council expected agree reconvene either within next six weeks september , said . the talks foundered sunday afternoon became apparent consumers producers could compromise formula calculating future quota system , delegates said . coffee export quotas suspended year ago prices soared response drought cut brazil 's crop nearly two-thirds . brazil world 's largest coffee producer exporter . reuter

nederlandse gasunie issues 100 mln dlr eurobond london , march 3 - nv nederlandse gasunie issuing 100 mln dlr eurobond due april 15 , 1992 paying 7-1/4 pct priced 101-1/8 pct , lead manager citicorp investment bank ltd said . the non-callable bond available denominations 5,000 dlrs listed luxembourg . the selling concession 1-1/4 pct , management underwriting combined pays 5/8 pct . the payment date april 15 . reuter

warwick insurance managers inc < wimi > 4th qtr morristown , n.j. , march 2 - oper shr 17 cts vs 19 cts oper net 636,000 vs 358,000 revs 10.6 mln vs 7,024,000 avg shrs 3,808,000 vs 1,924,000 year oper shr 73 cts vs 65 cts oper net 2,467,000 vs 1,199,000 revs 31.5 mln vs 22.9 mln avg shrs 3,372,000 vs 1,785,000 note : net excludes investment gains 20,000 dlrs vs 86,000 dlrs quarter 586,000 dlrs vs 195,000 dlrs year . 1985 year net excludes 304,000 dlr tax credit . share adjusted one-for-two reverse split november 1985 . reuter

franklin insured tax-free sets payout san mateo , calif. , march 2 - mthly div 7.1 cts vs 7.1 cts prior pay march 31 record march 16 note : franklin insured tax-free income fund . reuter

n.z . trading bank deposit growth rises slightly wellington , feb 27 - new zealand 's trading bank seasonally adjusted deposit growth rose 2.6 pct january compared rise 9.4 pct december , reserve bank said . year-on-year total deposits rose 30.6 pct compared 26.3 pct increase december year 34.5 pct rise year ago period , said weekly statistical release . total deposits rose 17.18 billion n.z . dlrs january compared 16.74 billion december 13.16 billion january 1986 . reuter

purolator < pcc > in buyout with hutton < efh > by patti domm new york , march 2 - new jersey-based overnight messenger purolator courier corp said agreed acquired 265 mln dlrs company formed e.f. hutton lbo inc certain managers purolator 's u.s. courier business . analysts said purolator sale time . purolator announced earlier mulling takeover bid , analysts wrongly predicted offer another courier company . hutton lbo , wholly owned subsidiary e.f. hutton group inc , majority owner company . hutton said acquiring company , pc acquisition inc , paying 35 dlrs cash per share 83 pct purolator 's stock tender offer begin thursday . the rest shares purchased securities warrants buy stock subsidiary pc acquisition , containing purolator 's u.s. courier operations . if shares purolator tendered , shareholders would receive share 29 dlrs cash , six dlrs debentures , warrant buy shares subsidiary pc acquisition containing u.s. courier operations . hutton said merger shareholders would get 46 mln dlrs aggregate amount guaranteed debentures due 2002 pc acquisition warrants buy 15 pct common stock pc courier subsidiary . hutton said company valued warrants two three dlrs per share . purolator 's stock price closed 35.125 dlrs friday . while analysts estimated company worth mid 30s , least one said would worth 38 42 dlrs . this follows sales two purolator units . it agreed recently sell canadian courier unit onex capital 170 mln dlrs , previously sold auto filters business . purolator retains stant division , makes closure caps radiators gas tanks . a hutton spokesman said firm reviewing options stant . purolator 's courier business lagging u.s. rivals high price paid past several years add air delivery ground fleet . e.f. hutton provide 279 mln dlrs funds complete transaction . this so-called `` bridge '' financing replaced later long-term debt likely form bank loans , hutton said . hutton lbo committed keeping courier business , president warren idsal said . `` purolator lost 120 mln dlrs last two years largely due u.s. courier operations , believe management turning around . we belive serious competitor future , '' said idsal . william taggart , chief executive officer u.s . courier division , chief executive officer new company . the tender offer conditioned minimum two thirds common stock tendered withdrawn expiration offer well certain conditions . the offer begin thursday , subject clearances staff interstate commerce commission expire 20 business days commencement unless extended . reuter

brazil criticises advisory committee structure by sandy critchley , reuters london , march 2 - brazil happy existing structure 14-bank advisory committee coordinates commercial bank debt , finance minister dilson funaro said . u.s. banks 50 pct representation committee holding 35 pct brazil 's debt banks , said , adding `` this fair european japanese banks . '' the committee played useful role 1982 1983 , however . noting often different reactions u.s. , japanese european banks , funaro told journalists brazil might adopt approach involving separate discussions regions . since debtor nations ' problems normally treated case-by-case basis , `` perhaps principle apply creditors , '' central bank president francisco gros said . brazil february 20 suspended indefinitely interest payments 68 billion dlrs owed commercial banks , followed last week freeze bank trade credit lines deposited foreign banks institutions , worth 15 billion dlrs . funaro gros spent two days end last week washington talking government officials international agencies week visit britain , france , west germany , switzerland italy discussions governments . funaro gros today meeting british chancellor exchequer nigel lawson , foreign secretary geoffrey howe governor bank england robin leigh-pemberton . bankers estimated brazil owes u.k. banks around 8.5 billion dlrs long medium term loans , giving u.k . the third largest exposure u.s. and japan . the crisis began brazil 's trade surplus , chief means servicing foreign debt , started decline sharply problem compounded renewed surge country'sate inflation . reserves reported dropped four billion dlrs . funaro envisaged eventual solution problems brazil 's 108 billion dlr foreign debt would involve partial servicing debt . `` what propose arrive mechanism refinance part service , service , '' said . `` i really think change old rules . '' asked brazil first approaching governments , rather commercial banks search solution crisis , funaro said `` we must first talk governments talk banks , banks limits . '' `` it political discussion point view , '' said . funaro said hoped next week travel talk japanese canadian government officials . he would talk commercial banks `` if i 've got solution governments . i ca n't take burden banks . '' he sure long would take reach solution .  in discussions governments brazil would review mechanisms whereby finance made available nations need . finance official lending agencies virtually closed since 1982 . `` you must open mechanisms , '' said . he said u.s. officials disturbed brazil 's suspension interest payments , understood brazil choice , protect reserves . also financing mechanisms discussed `` ca n't stay last years . '' `` i 'm trying put problem table ... . all us would like kind equilibrium . '' said . although brazil rejected substantive role international monetary fund ( imf ) managing economy , funaro paid call washington imf managing director michel camdessus world bank president barber conable . funaro noted inflation february started decline expected brazil achieve minimum eight billion dlr trade surplus 1987 . banking sources noted brazil 's monthly surplus declined 150 mln dlrs final three months last year , monthly one billion first nine months . brazil third largest trade surplus world , funaro said , although share international trade one pct . `` the solution linked growth , recession , '' said , noting imf program would involve promoting exports inducing internal recession order service debt . banking sources said brazil 's debts foreign governments , opposed commercial banks , benefit sounder structure following last month 's rescheduling paris club creditor nations 4.12 billion dlrs official debt . reuter

******u.s. m-1 money supply rises 2.1 billion dlrs in feb 16 week , fed says blah blah blah .

cincinnati bell < csn > sets stock split cincinnati , march 2 - cincinnati bell inc said board declared two-for-one stock split , subject two thirds approval annual meeting april 20 increase authorized common shares 120 mln 60 mln . it said split would payable may 20 holders record may five . reuter

Word clustering

  • take the transpose of X
  • rows are words and columns are documents
  • clusters of words based on document co-occurrence

In [61]:
word_clust = cluster.KMeans(n_clusters=10)
#W = preprocessing.StandardScaler(with_mean=False).fit_transform(X.transpose())
word_clust.fit(X.transpose())


Out[61]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [62]:
clust_counts = collections.Counter(word_clust.labels_)

In [63]:
clust_counts


Out[63]:
Counter({0: 10487, 7: 1, 2: 215, 3: 24, 6: 3, 5: 5, 9: 1, 8: 2, 4: 4, 1: 7})

In [69]:
[" ".join([vocab_lookup[i] for i in np.where(word_clust.labels_ == i)[0]]) for i in range(1,10)]


Out[69]:
['loss net oper profit qtr revs shr',
 '10 100 12 15 1985 1986 1987 20 25 26 27 30 50 500 added agreed agreement also american analysts annual around bank bankers banking banks board bond bonds brazil budget business but calendar canada capital cash chairman china co coffee commercial commission committee common company compared contract contracts corp could country coupon credit current cut day debt december demand department dlr dollar domestic due earlier economic economy end ended exchange expected export exports fall feb february fed federal fell finance financial first five forecast foreign four funaro futures government group growth he high imports in inc including increase industries industry interest international investment issue issues it january japan japanese last likely loan loans london ltd made major management manager market markets may meeting minister ministry money month months national new next notes offer offering official officials oil on one opec output paper part payments per period plan plans president previous price prices production products program proposed quarter rate rates report reserve reserves reuter reuters revised rise rose said sale sales says sector securities senior september service set share since six sources spokesman state statement stg surplus system taiwan talks term the three time to today tokyo told tonnes total trade trading treasury trust two union unit washington week wheat world would years yen york',
 '13 16 31 april calif div dividend franklin free fund income insured march mateo mthly note pay payout prior qtly record san sets tax',
 'for shares split stock',
 'billion dlrs mln pct year',
 '400 747 qantas',
 '000',
 'cts vs',
 'blah']

In [70]:
proto_inds = metrics.pairwise_distances_argmin(word_clust.cluster_centers_,X.transpose())

In [71]:
print("\n".join([vocab_lookup[i] for i in proto_inds]))


said
vs
said
cts
stock
mln
747
000
vs
blah

Principle Components for documents

  • PCA reduces dimensions
  • sklearn PCA for dense matrices
  • sklearn TruncatedSVD for sparse (does not center)

In [76]:
doc_SVD = decomposition.TruncatedSVD(n_components=2)
X_pca = doc_SVD.fit_transform(X)

In [85]:
X_df = pd.DataFrame(X_pca,columns=['pca_1','pca_2'])
X_df['clust'] = doc_clust.labels_

In [87]:
p9.ggplot(X_df, p9.aes(x='pca_1',y='pca_2',color='clust')) + p9.geom_point()


Out[87]:
<ggplot: (8793053045569)>