Computing In Context

Social Sciences Track

Lecture 4--topics, trends, and dimensional scaling

Matthew L. Jones

like, with code and stuff


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import textmining_blackboxes as tm

IMPORTANT: tm is our temporary helper, not a standard Python package!

Download it from my GitHub: https://github.com/matthewljones/computingincontext


In [3]:
#see if package imported correctly
tm.icantbelieve("butter")


I can't believe it's not butter

Let's keep using the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)

Put the slave narrative texts in a data directory in the same place as this IPython notebook; the code below assumes that layout.


In [92]:
title_info=pd.read_csv('data/na-slave-narratives/data/toc.csv')
#this is the "metadata" for these files--we'll use it today
#why does data appear twice in the path?

In [73]:
title_info


Out[73]:
Filename Author Title Date URL URL(text-only)
0 neh-johnstone-johnstone.xml Abraham Johnstone The Address of Abraham Johnstone, a Black Man,... 1797 http://docsouth.unc.edu/neh/johnstone/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
1 neh-meachum-meachum.xml John B. Meachum An Address to All the Colored Citizens of the ... 1846 http://docsouth.unc.edu/neh/meachum/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
2 neh-johnsontl-johnsontl.xml Thomas L. Johnson Africa for Christ. Twenty-Eight Years a Slave 1892 http://docsouth.unc.edu/neh/johnsontl/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
3 neh-white-white.xml William S. White The African Preacher. An Authentic Narrative [c1849] http://docsouth.unc.edu/neh/white/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
4 neh-brown55-brown55.xml William Wells Brown The American Fugitive in Europe. Sketches of P... 1855 http://docsouth.unc.edu/neh/brown55/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
5 neh-weld-weld.xml Theodore Dwight Weld American Slavery As It Is: Testimony of a Thou... 1839 http://docsouth.unc.edu/neh/weld/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
6 neh-andersrob-andersrob.xml Robert Anderson The Anderson Surpriser. Written After He Was S... 1895 http://docsouth.unc.edu/neh/andersrob/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
7 neh-boen-boen.xml No Author Anecdotes and Memoirs of William Boen, a Colou... 1834 http://docsouth.unc.edu/neh/boen/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
8 neh-stevens-stevens.xml Charles Emery Stevens Anthony Burns: A History 1856 http://docsouth.unc.edu/neh/stevens/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
9 neh-robinsonn-robinson.xml Nina Hill Robinson Aunt Dice: The Story of a Faithful Slave 1897 http://docsouth.unc.edu/neh/robinsonn/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
10 neh-auntjudy-auntjudy.xml Matilda G. Thompson Aunt Judy's Story: A Tale From Real Life. Writ... 1855 http://docsouth.unc.edu/neh/auntjudy/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
11 neh-sally-sally.xml Isaac Williams Aunt Sally: or, The Cross the Way of Freedom. ... 1858 http://docsouth.unc.edu/neh/sally/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
12 neh-jamison-jamison.xml M. F. Jamison Autobiography and Work of Bishop M. F. Jamison... 1912 http://docsouth.unc.edu/neh/jamison/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
13 neh-browne-browne.xml Martha Griffith Browne Autobiography of a Female Slave 1857 http://docsouth.unc.edu/neh/browne/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
14 neh-wards-ward.xml Samuel Ringgold Ward Autobiography of a Fugitive Negro: His Anti-Sl... 1855 http://docsouth.unc.edu/neh/wards/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
15 fpn-lane-lane.xml Isaac Lane Autobiography of Bishop Isaac Lane, LL.D. with... 1916 http://docsouth.unc.edu/fpn/lane/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
16 neh-parkerh-parkerh.xml Henry Parker Autobiography of Henry Parker 186? http://docsouth.unc.edu/neh/parkerh/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
17 neh-smithj-smithj.xml James Lindsay Smith Autobiography of James L. Smith, Including, Al... 1881 http://docsouth.unc.edu/neh/smithj/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
18 neh-said-said.xml Nicholas Said The Autobiography of Nicholas Said, A Native o... 1873 http://docsouth.unc.edu/neh/said/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
19 nc-omarsaid-omarsaid.xml Omar ibn Said Autobiography of Omar ibn Said, Slave in North... 1925 http://docsouth.unc.edu/nc/omarsaid/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
20 neh-frederick-frederick.xml Francis Frederick Autobiography of Rev. Francis Frederick, of Vi... 1869 http://docsouth.unc.edu/neh/frederick/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
21 neh-henry-henry.xml Thomas W. Henry Autobiography of Rev. Thomas W. Henry, of the ... [1872] http://docsouth.unc.edu/neh/henry/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
22 neh-henson81-henson81.xml Josiah Henson An Autobiography of the Rev. Josiah Henson ("U... 1881 http://docsouth.unc.edu/neh/henson81/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
23 neh-holsey-holsey.xml Lucius Henry Holsey Autobiography, Sermons, Addresses, and Essays ... 1898 http://docsouth.unc.edu/neh/holsey/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
24 neh-washstory-washin.xml Booker T. Washington An Autobiography: The Story of My Life and Work c1901 http://docsouth.unc.edu/neh/washstory/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
25 neh-smitham-smith.xml Amanda Smith An Autobiography: The Story of the Lord's Deal... 1893 http://docsouth.unc.edu/neh/smitham/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
26 neh-campbell-campbell.xml Israel Campbell An Autobiography. Bond and Free: Or, Yearning... 1861 http://docsouth.unc.edu/neh/campbell/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
27 neh-alexander-alexander.xml Charles Alexander Battles and Victories of Allen Allensworth, A.... 1914 http://docsouth.unc.edu/neh/alexander/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
28 neh-aleckson-aleckson.xml Sam Aleckson Before the War, and After the Union. An Autob... 1929 http://docsouth.unc.edu/neh/aleckson/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
29 neh-keckley-keckley.xml Elizabeth Keckley Behind the Scenes, or, Thirty years a Slave, a... 1868 http://docsouth.unc.edu/neh/keckley/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
... ... ... ... ... ... ...
264 neh-ballslavery-ball.xml Charles Ball Slavery in the United States: A Narrative of t... 1837 http://docsouth.unc.edu/neh/ballslavery/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
265 neh-bluett-bluett.xml Thomas Bluett Some Memoirs of the Life of Job, the Son of So... 1734 http://docsouth.unc.edu/neh/bluett/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
266 neh-gallaudet-gallaudet.xml T. H. Gallaudet A Statement with Regard to the Moorish Prince,... 1828 http://docsouth.unc.edu/neh/gallaudet/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
267 neh-story-story.xml No Author The Story of a Slave. A Realistic Revelation o... 1894 http://docsouth.unc.edu/neh/story/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
268 neh-eliot-eliot.xml William Greenleaf Eliot The Story of Archer Alexander: From Slavery to... 1885 http://docsouth.unc.edu/neh/eliot/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
269 neh-jacksonm-jackson.xml Mattie J. Jackson The Story of Mattie J. Jackson: Her Parentage,... 1866 http://docsouth.unc.edu/neh/jacksonm/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
270 neh-twelvetr-twelvetr.xml Harper Twelvetrees The Story of the Life of John Anderson, the Fu... 1863 http://docsouth.unc.edu/neh/twelvetr/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
271 neh-watkins-watkins.xml James Watkins Struggles for Freedom; or The Life of James Wa... 1860 http://docsouth.unc.edu/neh/watkins/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
272 neh-iwilliams-iwilliams.xml Isaac D. Williams Sunshine and Shadow of Slave Life. Reminiscenc... 1885 http://docsouth.unc.edu/neh/iwilliams/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
273 fpn-hughes-hughes.xml Louis Hughes Thirty Years a Slave: From Bondage to Freedom:... 1897 http://docsouth.unc.edu/fpn/hughes/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
274 neh-brown52-brown52.xml William Wells Brown Three Years in Europe: Or, Places I Have Seen ... 1852 http://docsouth.unc.edu/neh/brown52/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
275 neh-detroit-detroit.xml No Author A Thrilling Narrative from the Lips of the Suf... 1863 http://docsouth.unc.edu/neh/detroit/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
276 neh-tubbee1848-tubbee1848.xml Okah Tubbee A Thrilling Sketch of the Life of the Distingu... 1848 http://docsouth.unc.edu/neh/tubbee1848/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
277 neh-beard63-beard63.xml J. R. Beard Toussaint L'Ouverture: A Biography and Autobio... 1863 http://docsouth.unc.edu/neh/beard63/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
278 neh-henderson-henderson.xml Madison Henderson Trials and Confessions of Madison Henderson, A... 1841 http://docsouth.unc.edu/neh/henderson/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
279 neh-armistead-armistead.xml Wilson Armistead A Tribute for the Negro: Being a Vindication o... 1848 http://docsouth.unc.edu/neh/armistead/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
280 neh-twain-twain.xml Mark Twain A True Story, Repeated Word for Word As I Hear... November 1874 http://docsouth.unc.edu/neh/twain/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
281 neh-jjacobs-jjacobs.xml John S. Jacobs A True Tale of Slavery. From The Leisure Hour:... 1861 http://docsouth.unc.edu/neh/jjacobs/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
282 neh-henson58-henson58.xml Josiah Henson Truth Stranger Than Fiction. Father Henson's S... 1858 http://docsouth.unc.edu/neh/henson58/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
283 fpn-northup-northup.xml Solomon Northup Twelve Years a Slave: Narrative of Solomon Nor... 1853 http://docsouth.unc.edu/fpn/northup/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
284 neh-johnson1-johnson.xml Thomas L. Johnson Twenty-Eight Years a Slave, or The Story of My... 1909 http://docsouth.unc.edu/neh/johnson1/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
285 fpn-steward-steward.xml Austin Steward Twenty-Two Years a Slave, and Forty Years a Fr... 1857 http://docsouth.unc.edu/fpn/steward/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
286 neh-rayemma-rayemma.xml Emma J. Ray Twice Sold, Twice Ransomed: Autobiography of M... c1926 http://docsouth.unc.edu/neh/rayemma/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
287 neh-foster-foster.xml Gustavus L. Foster Uncle Johnson, the Pilgrim of Six Score Years 186-? http://docsouth.unc.edu/neh/foster/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
288 neh-edwardsj-edwardsj.xml John Passmore Edwards Uncle Tom's Companions: Or, Facts Stranger Tha... 1852 http://docsouth.unc.edu/neh/edwardsj/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
289 neh-henson-henson.xml Josiah Henson Uncle Tom's Story of His Life. An Autobiograph... 1876 http://docsouth.unc.edu/neh/henson/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
290 fpn-washington-washing.xml Booker T. Washington Up from Slavery: An Autobiography c1901 http://docsouth.unc.edu/fpn/washington/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
291 fpn-burtont-burton.xml Thomas William Burton What Experience Has Taught Me: An Autobiograph... c1910 http://docsouth.unc.edu/fpn/burtont/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
292 neh-hildreth-hildreth.xml Richard Hildreth The White Slave; or, Memoirs of a Fugitive 1852 http://docsouth.unc.edu/neh/hildreth/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
293 neh-wilkerson-wilkerson.xml Major James Wilkerson Wilkerson's History of His Travels & Labor... 1861 http://docsouth.unc.edu/neh/wilkerson/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...

294 rows × 6 columns


In [74]:
#guess 5 for an unknown final digit, e.g. "186-?" becomes "1865"
title_info["Date"].str.replace(r"\-\?", "5", regex=True)


Out[74]:
0         1797
1         1846
2         1892
3      [c1849]
4         1855
5         1839
6         1895
7         1834
8         1856
9         1897
10        1855
11        1858
12        1912
13        1857
14        1855
...
279             1848
280    November 1874
281             1861
282             1858
283             1853
284             1909
285             1857
286            c1926
287             1865
288             1852
289             1876
290            c1901
291            c1910
292             1852
293             1861
Name: Date, Length: 294, dtype: object

In [75]:
title_info["Date"].str.replace(r"[^0-9]", "", regex=True) #use regular expressions to strip everything that is not a digit


Out[75]:
0     1797
1     1846
2     1892
3     1849
4     1855
5     1839
6     1895
7     1834
8     1856
9     1897
10    1855
11    1858
12    1912
13    1857
14    1855
...
279    1848
280    1874
281    1861
282    1858
283    1853
284    1909
285    1857
286    1926
287     186
288    1852
289    1876
290    1901
291    1910
292    1852
293    1861
Name: Date, Length: 294, dtype: object

In [97]:
title_info["Date"]=title_info["Date"].str.replace(r"\-\?", "5", regex=True)
title_info["Date"]=title_info["Date"].str.replace(r"[^0-9]", "", regex=True) # what assumptions have I made about the data?
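
To see what those two replacements do, here is a minimal sketch on a handful of date strings of the kinds that show up in the table above. The names toy_raw and toy_clean are made up for illustration; nothing here touches title_info.

```python
import pandas as pd

# a few date formats like those in the Date column above
toy_raw = pd.Series(["1797", "[c1849]", "186-?", "186?", "November 1874", "c1926"])

# step 1: treat "-?" as an unknown final digit and guess 5
toy_step1 = toy_raw.str.replace(r"\-\?", "5", regex=True)

# step 2: throw away every character that is not a digit
toy_clean = toy_step1.str.replace(r"[^0-9]", "", regex=True)

print(toy_clean.tolist())
# ['1797', '1849', '1865', '186', '1874', '1926']
```

Notice the assumptions baked in: "186-?" silently becomes 1865, "c1926" loses its "circa", and "186?" is left as the three-digit string "186", which is not a usable year.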

In [99]:
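#read each cleaned string as a four-digit year; anything that won't parse becomes NaT
title_info["Date"]=pd.to_datetime(title_info["Date"], format="%Y", errors="coerce")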
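
A small sketch of what coercing does, assuming a reasonably recent pandas. The toy values are made up: the empty string stands in for a hypothetical entry that had no digits left after cleaning.

```python
import pandas as pd

# two clean years plus an empty string (hypothetical leftover after cleaning)
toy_years = pd.Series(["1797", "1849", ""])

# errors="coerce" turns anything unparseable into NaT instead of raising an error
toy_dates = pd.to_datetime(toy_years, format="%Y", errors="coerce")

print(toy_dates[0])                 # 1797-01-01 00:00:00
print(toy_dates.isna().tolist())    # [False, False, True]
```

A bare year turns into January 1 of that year, which is why the dates in the table further down all end in -01-01. It is worth counting the NaT values in the real column to see how many dates were lost in cleaning.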

In [122]:
title_info["Date"]<pd.Timestamp(1800,1,1)


Out[122]:
0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
...
279    False
280    False
281    False
282    False
283    False
284    False
285    False
286    False
287    False
288    False
289    False
290    False
291    False
292    False
293    False
Name: Date, Length: 294, dtype: bool

In [127]:
title_info[title_info["Date"]<pd.Timestamp(1800,1,1)]


Out[127]:
Filename Author Title Date URL URL(text-only)
0 neh-johnstone-johnstone.xml Abraham Johnstone The Address of Abraham Johnstone, a Black Man,... 1797-01-01 http://docsouth.unc.edu/neh/johnstone/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
55 neh-pomp-pomp.xml Pomp Dying Confession of Pomp, A Negro Man, Who Was... 1795-01-01 http://docsouth.unc.edu/neh/pomp/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
97 neh-equiano1-equiano1.xml Olaudah Equiano The Interesting Narrative of the Life of Olaud... 1789-01-01 http://docsouth.unc.edu/neh/equiano1/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
98 neh-equiano2-equiano2.xml Olaudah Equiano The Interesting Narrative of the Life of Olaud... 1789-01-01 http://docsouth.unc.edu/neh/equiano2/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
105 neh-fortis-fortis.xml Edmund Fortis The Last Words and Dying Speech of Edmund Fort... 1795-01-01 http://docsouth.unc.edu/neh/fortis/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
106 neh-sancho1-sancho1.xml Ignatius Sancho Letters of the Late Ignatius Sancho, An Africa... 1782-01-01 http://docsouth.unc.edu/neh/sancho1/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
107 neh-sancho2-sancho2.xml Ignatius Sancho Letters of the Late Ignatius Sancho, An Africa... 1782-01-01 http://docsouth.unc.edu/neh/sancho2/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
140 neh-arthur-arthur.xml Arthur The Life, and Dying Speech of Arthur, a Negro ... 1768-01-01 http://docsouth.unc.edu/neh/arthur/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
145 neh-smithste-smithste.xml Stephen Smith Life, Last Words and Dying Speech of Stephen S... 1797-01-01 http://docsouth.unc.edu/neh/smithste/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
161 neh-norris-norris.xml Robert Norris Memoirs of the Reign of Bossa Ahadee, King of ... 1789-01-01 http://docsouth.unc.edu/neh/norris/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
197 neh-venture-venture.xml Venture Smith A Narrative of the Life and Adventures of Vent... 1798-01-01 http://docsouth.unc.edu/neh/venture/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
207 neh-gronniosaw-gronnios.xml James Albert Ukawsaw Gronniosaw A Narrative of the Most Remarkable Particulars... 1770-01-01 http://docsouth.unc.edu/neh/gronniosaw/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
210 neh-hammon-hammon.xml Briton Hammon A Narrative of the Uncommon Sufferings, and Su... 1760-01-01 http://docsouth.unc.edu/neh/hammon/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
240 neh-royal-royal.xml No Author The Royal African: or, Memoirs of the Young Pr... 1750-01-01 http://docsouth.unc.edu/neh/royal/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
255 neh-mountain-mountain.xml Joseph Mountain Sketches of the Life of Joseph Mountain, a Neg... 1790-01-01 http://docsouth.unc.edu/neh/mountain/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...
265 neh-bluett-bluett.xml Thomas Bluett Some Memoirs of the Life of Job, the Son of So... 1734-01-01 http://docsouth.unc.edu/neh/bluett/menu.html http://docsouth.unc.edu/full-text/na-slave-nar...

back to boolean indexing!
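
As a reminder of how boolean indexing works, here is a minimal sketch on a three-row toy frame. The authors and years come from the table above; the names toy, mask and early are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Author": ["Briton Hammon", "Olaudah Equiano", "Solomon Northup"],
    "Date": pd.to_datetime(["1760", "1789", "1853"], format="%Y"),
})

# the comparison gives one True/False per row ...
mask = toy["Date"] < pd.Timestamp(1800, 1, 1)
print(mask.tolist())             # [True, True, False]

# ... and indexing with that mask keeps only the True rows
early = toy[mask]
print(early["Author"].tolist())  # ['Briton Hammon', 'Olaudah Equiano']
```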


In [124]:
#Let's use a brittle helper for reading in a directory of plain .txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a standard python package
#returns a simple list of the documents, each as one very long string

#note: if you want, the rest of this notebook will work on any directory of text files.
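
If you'd rather not depend on tm for this step, something like the following would do a similar job. This is only a sketch: read_text_files is a made-up name, and I am assuming, not showing, what tm.readtextfiles does internally (in particular, the file order and encoding handling here are my guesses).

```python
import tempfile
from pathlib import Path

def read_text_files(directory):
    """Return the contents of every .txt file in directory, sorted by filename."""
    paths = sorted(Path(directory).glob("*.txt"))
    # errors="replace" swaps undecodable bytes for a placeholder instead of crashing
    return [p.read_text(encoding="utf-8", errors="replace") for p in paths]

# try it on a throwaway directory holding two tiny files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first", encoding="utf-8")
    demo_texts = read_text_files(tmp)

print(demo_texts)  # ['first', 'second']
```

Sorting by filename matters if you later want to line the texts up with rows of metadata; check that the order really matches before trusting any such pairing.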

For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.

Our zeroth tool: cleaning up the text

I've included a little utility function in tm that takes a list of strings and cleans it up a bit.

Check out the code on your own time later.


In [131]:
our_texts=tm.data_cleanse(our_texts)

#more necessary when you have messy text
#eliminates escaped characters

back to the vectorizer from scikit-learn


In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [133]:
#min_df=0.5 keeps only words that appear in at least half of the documents
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)

In [134]:
document_term_matrix=vectorizer.fit_transform(our_texts)

In [135]:
# now let's get our vocabulary--the names corresponding to the columns
# (get_feature_names_out replaces get_feature_names in recent scikit-learn)
vocab=list(vectorizer.get_feature_names_out())

In [137]:
len(vocab)


Out[137]:
1650

In [138]:
document_term_matrix.shape


Out[138]:
(294, 1650)

so document_term_matrix is a matrix with 294 rows--the documents--and 1650 columns--the vocabulary or terms or features
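
Here is the same rows-by-columns idea on a toy scale. The four mini "documents" are made up, and I read the vocabulary off the fitted vectorizer's vocabulary_ attribute so that the sketch does not depend on which scikit-learn version you have.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "freedom plantation river",
    "freedom plantation",
    "freedom church",
    "plantation master",
]

# same settings as above: a word must appear in at least half of the documents
toy_vectorizer = TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)
toy_dtm = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept word to its column number
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_dtm.shape)  # (4, 2)
print(toy_terms)      # ['freedom', 'plantation']
```

Four documents give four rows. Only "freedom" and "plantation" appear in at least two of the four documents, so only they survive as columns; "river", "church" and "master" are dropped by min_df.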


In [139]:
vocab[1000:1100]


Out[139]:
['page',
 'pages',
 'paid',
 'pain',
 'painful',
 'pains',
 'pair',
 'paper',
 'papers',
 'parents',
 'particular',
 'particularly',
 'parties',
 'parting',
 'parts',
 'party',
 'pass',
 'passage',
 'passed',
 'passing',
 'past',
 'path',
 'pay',
 'paying',
 'peace',
 'peculiar',
 'pen',
 'people',
 'perfect',
 'perfectly',
 'perform',
 'performed',
 'period',
 'permission',
 'permit',
 'permitted',
 'person',
 'personal',
 'persons',
 'peter',
 'philadelphia',
 'picture',
 'piece',
 'pieces',
 'pity',
 'place',
 'placed',
 'places',
 'plain',
 'plan',
 'plans',
 'plantation',
 'play',
 'pleasant',
 'pleased',
 'pleasure',
 'plenty',
 'pocket',
 'point',
 'points',
 'poor',
 'portion',
 'position',
 'possess',
 'possessed',
 'possession',
 'possible',
 'possibly',
 'post',
 'pounds',
 'power',
 'powerful',
 'powers',
 'practice',
 'praise',
 'pray',
 'prayed',
 'prayer',
 'prayers',
 'praying',
 'preach',
 'preached',
 'preacher',
 'preaching',
 'precious',
 'preface',
 'prejudice',
 'prepare',
 'prepared',
 'preparing',
 'presence',
 'present',
 'presented',
 'president',
 'press',
 'pretty',
 'prevent',
 'prevented',
 'previous',
 'price']

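right now it is stored efficiently as a sparse matrix, which records only the non-zero entries

document-term matrices are usually almost all zeros, so this is good for our computers' limited memory (less so here, since min_df=0.5 keeps only words found in at least half of the texts)

easier for us to see as a dense matrix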
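
A minimal sketch of the sparse versus dense trade-off, using scipy directly. The toy matrix and its values are made up; scikit-learn hands back the same kind of scipy sparse object.

```python
import numpy as np
from scipy import sparse

# a 4 x 1000 matrix with only three non-zero cells
toy_dense = np.zeros((4, 1000))
toy_dense[0, 3] = 0.5
toy_dense[2, 10] = 0.2
toy_dense[3, 999] = 0.1

toy_sparse = sparse.csr_matrix(toy_dense)

print(toy_dense.size)  # 4000 cells stored in the dense version
print(toy_sparse.nnz)  # 3 values stored in the sparse version

# toarray() goes back to the dense form, zeros and all
toy_back = toy_sparse.toarray()
```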


In [23]:
document_term_matrix_dense=document_term_matrix.toarray()

In [24]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)

In [25]:
dtmdf


Out[25]:
10 ability able abroad absence absent accept accepted accompanied accomplish ... wrote yard ye year years yes york young younger youth
0 0.000000 0.002778 0.001837 0.008431 0.000000 0.000000 0.002855 0.007959 0.002445 0.000000 ... 0.002333 0.002427 0.009639 0.001776 0.050294 0.014698 0.006708 0.016257 0.000000 0.000000
1 0.000000 0.001035 0.047255 0.002095 0.008032 0.007569 0.001064 0.004944 0.005469 0.002129 ... 0.002609 0.032574 0.000000 0.028466 0.046546 0.000000 0.003334 0.089552 0.008616 0.005845
2 0.007761 0.002577 0.022157 0.005214 0.007496 0.005382 0.010594 0.000000 0.006805 0.000000 ... 0.000000 0.011259 0.004471 0.028007 0.062745 0.011688 0.004149 0.035189 0.002680 0.000000
3 0.002040 0.028447 0.017471 0.000000 0.013792 0.000000 0.006265 0.009703 0.000000 0.004177 ... 0.001707 0.008878 0.000000 0.064955 0.076117 0.001536 0.003271 0.050210 0.019022 0.001912
4 0.005407 0.005386 0.021373 0.000000 0.000000 0.011249 0.016607 0.005144 0.000000 0.000000 ... 0.004524 0.023532 0.004672 0.037876 0.084063 0.020356 0.021677 0.045529 0.000000 0.000000
5 0.028843 0.005746 0.015202 0.005814 0.011144 0.000000 0.000000 0.005488 0.005058 0.011812 ... 0.004826 0.005021 0.009970 0.091842 0.100450 0.000000 0.009251 0.100886 0.000000 0.016217
6 0.012275 0.012228 0.024262 0.000000 0.000000 0.000000 0.012568 0.000000 0.010763 0.000000 ... 0.010271 0.010685 0.000000 0.078175 0.061073 0.009243 0.049214 0.063610 0.012718 0.000000
7 0.000000 0.000000 0.012353 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.005440 0.000000 0.063682 0.073848 0.000000 0.025056 0.036434 0.000000 0.000000
8 0.000000 0.025964 0.034346 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.043617 0.000000 0.067576 0.016600 0.032421 0.000000 0.000000 0.033768 0.027007 0.073277
9 0.000000 0.006766 0.008950 0.006845 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.011366 0.005912 0.082176 0.008651 0.025345 0.005114 0.000000 0.013199 0.007037 0.050919
10 0.000000 0.004677 0.004640 0.000000 0.009071 0.002442 0.000000 0.004467 0.006176 0.002404 ... 0.013751 0.047005 0.002029 0.044856 0.037963 0.021214 0.001883 0.044103 0.002433 0.000000
11 0.000000 0.000000 0.015878 0.000000 0.004656 0.005014 0.000000 0.000000 0.000000 0.004935 ... 0.012099 0.016783 0.029158 0.009209 0.032974 0.036295 0.000000 0.031222 0.000000 0.009034
12 0.000000 0.001307 0.010370 0.003966 0.000000 0.004093 0.012086 0.009983 0.004600 0.004029 ... 0.021949 0.004567 0.014736 0.025895 0.077496 0.018766 0.063104 0.057775 0.002718 0.009833
13 0.004397 0.000000 0.014484 0.004431 0.000000 0.000000 0.000000 0.000000 0.000000 0.004502 ... 0.040466 0.000000 0.015199 0.042002 0.065626 0.000000 0.056408 0.011392 0.000000 0.000000
14 0.004143 0.020634 0.017742 0.002088 0.006002 0.000000 0.004242 0.003941 0.000000 0.000000 ... 0.003466 0.000000 0.001790 0.055406 0.119808 0.003120 0.029897 0.029519 0.000000 0.000000
15 0.000000 0.000000 0.003120 0.004772 0.000000 0.000000 0.000000 0.018019 0.004152 0.004848 ... 0.003962 0.024731 0.000000 0.033172 0.058898 0.014262 0.015188 0.021471 0.000000 0.000000
16 0.000000 0.001598 0.017964 0.000000 0.013943 0.018353 0.001642 0.004577 0.018282 0.016421 ... 0.008052 0.065618 0.024949 0.054136 0.103737 0.019323 0.100312 0.033245 0.013294 0.016533
17 0.013873 0.004606 0.013710 0.002330 0.000000 0.000000 0.004734 0.006599 0.000000 0.002367 ... 0.030952 0.014088 0.019981 0.038284 0.090589 0.015669 0.009270 0.017972 0.002396 0.000000
18 0.000000 0.000000 0.058039 0.000000 0.004478 0.004823 0.000000 0.004411 0.000000 0.009494 ... 0.007759 0.012108 0.004007 0.038385 0.057669 0.003491 0.033459 0.021023 0.000000 0.004345
19 0.001608 0.006409 0.016955 0.009726 0.006214 0.005019 0.011527 0.016831 0.007051 0.004940 ... 0.012112 0.005600 0.009730 0.030729 0.051015 0.008478 0.063194 0.033339 0.013332 0.006029
20 0.000000 0.000000 0.032121 0.000000 0.007849 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.007022 0.015525 0.090963 0.036712 0.000000 0.026317 0.000000 0.000000
21 0.001218 0.015776 0.063410 0.006139 0.001177 0.000000 0.008731 0.005795 0.002136 0.007484 ... 0.004077 0.003181 0.000000 0.027931 0.101528 0.002752 0.017584 0.029198 0.005049 0.001142
22 0.002860 0.022793 0.036746 0.004324 0.009669 0.004463 0.004393 0.014965 0.002508 0.007321 ... 0.011965 0.011204 0.000000 0.057377 0.089828 0.001077 0.025227 0.048169 0.001482 0.021442
23 0.008847 0.000000 0.040799 0.000000 0.012817 0.004601 0.000000 0.000000 0.003878 0.000000 ... 0.003701 0.003850 0.003823 0.036621 0.071524 0.006661 0.021280 0.057304 0.004583 0.000000
24 0.005152 0.000000 0.010183 0.002596 0.004976 0.002680 0.005275 0.002451 0.004517 0.002637 ... 0.028019 0.004485 0.015583 0.029529 0.081703 0.027155 0.033048 0.023360 0.000000 0.002414
25 0.000000 0.018121 0.035956 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.030441 0.015835 0.141488 0.092682 0.169702 0.027396 0.102106 0.011783 0.000000 0.000000
26 0.003842 0.000000 0.005062 0.000000 0.000000 0.000000 0.003933 0.000000 0.006737 0.003933 ... 0.009643 0.000000 0.059761 0.029360 0.076456 0.005786 0.018483 0.017420 0.000000 0.010800
27 0.000000 0.007591 0.010041 0.000000 0.000000 0.000000 0.015603 0.007249 0.000000 0.007802 ... 0.019127 0.000000 0.000000 0.004853 0.104259 0.005738 0.006110 0.054295 0.000000 0.000000
28 0.004956 0.024683 0.018502 0.006660 0.001596 0.005155 0.001691 0.007858 0.008691 0.000000 ... 0.011057 0.002876 0.018559 0.036821 0.110955 0.009951 0.037088 0.044942 0.005135 0.026316
29 0.000000 0.000000 0.000000 0.000000 0.000000 0.011208 0.000000 0.000000 0.018893 0.000000 ... 0.000000 0.009378 0.009311 0.013723 0.033502 0.008113 0.000000 0.041872 0.011163 0.040384
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
264 0.000000 0.003486 0.036891 0.007054 0.010141 0.000000 0.014332 0.006659 0.000000 0.000000 ... 0.002928 0.015232 0.003024 0.013372 0.084881 0.010541 0.005612 0.020402 0.003626 0.000000
265 0.000000 0.002966 0.021579 0.003001 0.002876 0.012389 0.000000 0.011330 0.002611 0.000000 ... 0.012456 0.000000 0.010292 0.030339 0.061107 0.058293 0.057299 0.021215 0.000000 0.002790
266 0.007076 0.001410 0.024244 0.001426 0.001367 0.005889 0.010143 0.009425 0.012410 0.007245 ... 0.011842 0.000000 0.045254 0.020731 0.126748 0.039431 0.062416 0.018335 0.005866 0.006631
267 0.007754 0.001287 0.022989 0.001302 0.001248 0.005377 0.009262 0.008606 0.012465 0.006616 ... 0.011894 0.000000 0.040206 0.020576 0.136632 0.048656 0.065283 0.019253 0.006695 0.006055
268 0.005221 0.005200 0.015478 0.005262 0.010085 0.008146 0.008017 0.007450 0.009155 0.002672 ... 0.002184 0.000000 0.020302 0.018286 0.064935 0.037345 0.012558 0.069323 0.002705 0.004892
269 0.005592 0.000000 0.003684 0.005636 0.005401 0.011634 0.000000 0.015960 0.004903 0.000000 ... 0.000000 0.000000 0.000000 0.007123 0.038256 0.033687 0.017936 0.097801 0.005794 0.005240
270 0.011798 0.000000 0.007773 0.000000 0.000000 0.000000 0.000000 0.000000 0.020690 0.012079 ... 0.000000 0.030810 0.020392 0.007514 0.036687 0.008884 0.000000 0.007642 0.000000 0.011056
271 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.041778 0.101994 0.024698 0.000000 0.084985 0.000000 0.000000
272 0.000000 0.004670 0.027801 0.002363 0.009057 0.009755 0.002400 0.004460 0.004111 0.000000 ... 0.003923 0.002041 0.002026 0.022394 0.032075 0.007061 0.001880 0.009111 0.000000 0.006590
273 0.000000 0.000000 0.016820 0.025730 0.000000 0.026557 0.000000 0.000000 0.022385 0.000000 ... 0.000000 0.000000 0.000000 0.016259 0.095264 0.076895 0.000000 0.016537 0.000000 0.000000
274 0.000000 0.004016 0.006375 0.000813 0.012461 0.006710 0.002476 0.000767 0.006363 0.000000 ... 0.006746 0.000000 0.001394 0.004108 0.016046 0.004857 0.047841 0.010968 0.002506 0.000756
275 0.000000 0.000000 0.022407 0.008569 0.000000 0.008845 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.108294 0.126904 0.000000 0.013635 0.022029 0.008809 0.015935
276 0.000000 0.007089 0.018754 0.014345 0.000000 0.007403 0.000000 0.000000 0.000000 0.000000 ... 0.005954 0.000000 0.006150 0.104239 0.141627 0.000000 0.017119 0.041487 0.014747 0.013338
277 0.006928 0.018978 0.023964 0.010474 0.001673 0.003603 0.015960 0.011534 0.000000 0.012413 ... 0.005797 0.003015 0.005987 0.036400 0.099100 0.005217 0.059718 0.063949 0.000000 0.019477
278 0.001310 0.003915 0.015537 0.007922 0.006327 0.002726 0.010731 0.009971 0.009190 0.001341 ... 0.008769 0.001140 0.004529 0.020025 0.056220 0.018744 0.079838 0.028854 0.001357 0.004911
279 0.000000 0.000000 0.030560 0.003596 0.017232 0.007423 0.003653 0.000000 0.003129 0.003653 ... 0.005971 0.021742 0.003084 0.038630 0.064352 0.008060 0.000000 0.025423 0.000000 0.006687
280 0.018909 0.010595 0.021801 0.004764 0.005708 0.000000 0.013310 0.010119 0.006217 0.009680 ... 0.016810 0.004115 0.000000 0.086553 0.088933 0.000890 0.043590 0.031386 0.002449 0.014397
281 0.000000 0.006831 0.022589 0.000000 0.019870 0.000000 0.000000 0.006523 0.006013 0.021062 ... 0.005737 0.005969 0.023704 0.048037 0.059704 0.005163 0.038488 0.017767 0.007105 0.012852
282 0.000000 0.008524 0.020672 0.002875 0.011020 0.000000 0.000000 0.005427 0.005002 0.011681 ... 0.004773 0.007449 0.014790 0.025431 0.074505 0.019330 0.082336 0.027714 0.000000 0.005346
283 0.000000 0.000000 0.043636 0.000000 0.000000 0.007655 0.000000 0.007001 0.019358 0.000000 ... 0.000000 0.012812 0.019079 0.023433 0.041190 0.016624 0.005901 0.004767 0.007625 0.000000
284 0.000000 0.004530 0.022471 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006984 ... 0.001902 0.011876 0.000000 0.018825 0.025454 0.013697 0.001823 0.001473 0.000000 0.004262
285 0.000000 0.006589 0.010894 0.003333 0.009583 0.000000 0.003386 0.006292 0.005800 0.003386 ... 0.000000 0.014394 0.002858 0.044229 0.102838 0.004981 0.007955 0.042844 0.000000 0.003099
286 0.021025 0.001611 0.009590 0.003803 0.005728 0.004486 0.001104 0.002051 0.003309 0.002208 ... 0.004059 0.012670 0.001398 0.036049 0.126060 0.006089 0.027665 0.033523 0.008378 0.004547
287 0.004362 0.000000 0.002874 0.017584 0.012639 0.000000 0.000000 0.004149 0.003824 0.000000 ... 0.014598 0.000000 0.131929 0.022222 0.059678 0.000000 0.000000 0.048031 0.004519 0.040874
288 0.010687 0.005323 0.014082 0.010771 0.005161 0.000000 0.005471 0.000000 0.004685 0.000000 ... 0.000000 0.000000 0.032324 0.023821 0.096372 0.016094 0.004285 0.024228 0.000000 0.015022
289 0.005468 0.005447 0.061242 0.005511 0.010563 0.011376 0.000000 0.000000 0.023972 0.016795 ... 0.004575 0.000000 0.099231 0.041787 0.023804 0.012352 0.035075 0.017709 0.000000 0.005124
290 0.000000 0.000000 0.007489 0.000000 0.010980 0.000000 0.000000 0.000000 0.009967 0.000000 ... 0.009511 0.039579 0.000000 0.021718 0.233290 0.025678 0.018229 0.029452 0.000000 0.010652
291 0.004098 0.000000 0.005400 0.008260 0.011875 0.004263 0.004196 0.000000 0.000000 0.008391 ... 0.000000 0.000000 0.092079 0.028708 0.043327 0.000000 0.009858 0.007963 0.000000 0.007680
292 0.010646 0.000000 0.022795 0.002682 0.000000 0.000000 0.008175 0.002532 0.002334 0.000000 ... 0.002227 0.016217 0.016100 0.035594 0.051310 0.064129 0.032010 0.065509 0.000000 0.002494
293 0.000000 0.000000 0.135860 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.023933 0.000000 0.026265 0.021374 0.036231 0.005512 0.053429 0.000000 0.000000

294 rows × 1658 columns

While this data frame is lovely to look at and useful to think with, it's tough on your computer's memory, because it stores every cell explicitly, zeros included.
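A rough sketch of why the dense table costs more memory than the sparse matrix that `TfidfVectorizer` produces. The matrix here is a made-up stand-in: it has the same shape as the data frame above, but the 5% share of nonzero cells is an assumption, not a measurement of our texts.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the notebook's matrix: same shape (294 x 1658),
# with an assumed 5% of entries nonzero.
toy_sparse = sparse.random(294, 1658, density=0.05, format="csr", random_state=0)

# Dense storage keeps every cell, zeros included.
dense_bytes = toy_sparse.toarray().nbytes

# CSR storage keeps only the nonzero values plus their positions.
sparse_bytes = (toy_sparse.data.nbytes
                + toy_sparse.indices.nbytes
                + toy_sparse.indptr.nbytes)

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The gap grows with the vocabulary: the more columns that are mostly zero, the more the dense version wastes.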


In [140]:
#cosine similarity is easy to program yourself, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity

In [141]:
similarity=cosine_similarity(document_term_matrix)

#Note here that `cosine_similarity` can take 
#an entire matrix as its argument

In [28]:
#what'd we get?

similarity


Out[28]:
array([[ 1.        ,  0.4800956 ,  0.47789776, ...,  0.41147251,
         0.64107903,  0.49087961],
       [ 0.4800956 ,  1.        ,  0.65544451, ...,  0.31196723,
         0.64506243,  0.47103725],
       [ 0.47789776,  0.65544451,  1.        , ...,  0.35760039,
         0.70650975,  0.45047448],
       ..., 
       [ 0.41147251,  0.31196723,  0.35760039, ...,  1.        ,
         0.54765566,  0.29438818],
       [ 0.64107903,  0.64506243,  0.70650975, ...,  0.54765566,
         1.        ,  0.56796543],
       [ 0.49087961,  0.47103725,  0.45047448, ...,  0.29438818,
         0.56796543,  1.        ]])

In [29]:
similarity.shape


Out[29]:
(294, 294)

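That is a symmetric matrix relating each text (row) to every other text (column): entry [i, j] is the similarity between text i and text j, and the diagonal is 1 because every text is identical to itself.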


In [30]:
similarity[100]
#this gives the similarity of row 100 to every row, itself included (entry 100 is 1)


Out[30]:
array([ 0.51351908,  0.37559544,  0.43518157,  0.55551997,  0.46036488,
        0.5228255 ,  0.56056587,  0.40698762,  0.29528043,  0.22663447,
        0.43146373,  0.404392  ,  0.44087255,  0.39765911,  0.75170812,
        0.4563751 ,  0.42273969,  0.53307405,  0.43241179,  0.52863713,
        0.43245394,  0.50163583,  0.55255707,  0.36087616,  0.52151398,
        0.44422755,  0.44204217,  0.44389512,  0.47398092,  0.31779759,
        0.49221207,  0.46843142,  0.56495427,  0.63240387,  0.33069644,
        0.39599524,  0.62049587,  0.51382474,  0.54573105,  0.46795562,
        0.54572092,  0.27907631,  0.31736056,  0.38786122,  0.45626854,
        0.33554468,  0.39141669,  0.39123794,  0.39268033,  0.39480131,
        0.18210545,  0.42944428,  0.4641688 ,  0.38645964,  0.44868397,
        0.36583276,  0.30356115,  0.4207236 ,  0.47623871,  0.74338913,
        0.70385898,  0.456456  ,  0.45573056,  0.41167837,  0.48595775,
        0.4969808 ,  0.51340512,  0.48723885,  0.39048843,  0.55268064,
        0.44370596,  0.46620405,  0.59864036,  0.5453561 ,  0.19284076,
        0.41974821,  0.46478893,  0.28167759,  0.4344191 ,  0.35451293,
        0.42340573,  0.47227169,  0.43494646,  0.45913277,  0.43349771,
        0.38247543,  0.39892623,  0.27673982,  0.34213901,  0.64612329,
        0.3626567 ,  0.42551363,  0.51860678,  0.56070145,  0.45950563,
        0.37292824,  0.47305083,  0.55650062,  0.51663525,  0.47743723,
        1.        ,  0.74504107,  0.5418364 ,  0.446839  ,  0.42183918,
        0.46757646,  0.40592573,  0.43250842,  0.42077958,  0.45397224,
        0.31944606,  0.54158949,  0.41347637,  0.37601824,  0.310596  ,
        0.4538872 ,  0.73505269,  0.25404726,  0.31252659,  0.36519023,
        0.38497747,  0.38380178,  0.29363724,  0.41179583,  0.54176381,
        0.73378491,  0.37935869,  0.40331248,  0.53966992,  0.37759265,
        0.38172116,  0.4010848 ,  0.36563491,  0.40777916,  0.44576063,
        0.30141366,  0.21701385,  0.46957471,  0.49328697,  0.55739909,
        0.42608044,  0.61105579,  0.41359103,  0.75236073,  0.56035999,
        0.52365363,  0.43469125,  0.4805973 ,  0.55483122,  0.30195174,
        0.46054438,  0.55915787,  0.4253061 ,  0.40968898,  0.34119052,
        0.47001332,  0.47038965,  0.54626945,  0.37483056,  0.57350522,
        0.80329149,  0.7243073 ,  0.41505792,  0.35195044,  0.39476144,
        0.72086426,  0.36628347,  0.32496974,  0.32097834,  0.63363847,
        0.57753911,  0.36209657,  0.38444127,  0.35573989,  0.41862441,
        0.43178962,  0.443835  ,  0.5608355 ,  0.444154  ,  0.48375123,
        0.28522753,  0.37559622,  0.47222523,  0.51180251,  0.41265869,
        0.54065297,  0.57687612,  0.64723645,  0.47474714,  0.38660376,
        0.68458167,  0.39413766,  0.38010527,  0.59400378,  0.57957167,
        0.37098426,  0.43403555,  0.32639742,  0.36205656,  0.51786785,
        0.41269349,  0.38620109,  0.50374064,  0.47638095,  0.33258194,
        0.37493858,  0.443436  ,  0.57791602,  0.65232685,  0.30007518,
        0.3492306 ,  0.32146507,  0.37118884,  0.45313594,  0.28738624,
        0.58066833,  0.37931137,  0.44479053,  0.66170862,  0.39806686,
        0.44628435,  0.413084  ,  0.34530214,  0.32644008,  0.39825262,
        0.35834356,  0.66729768,  0.52277553,  0.31125011,  0.32153004,
        0.41817371,  0.34185216,  0.41832183,  0.3425447 ,  0.45959116,
        0.33285397,  0.45256228,  0.37477928,  0.31155022,  0.30429663,
        0.52773434,  0.74161916,  0.3494954 ,  0.47205325,  0.38833223,
        0.51329957,  0.51852037,  0.38344333,  0.58984198,  0.26422658,
        0.33376944,  0.42060176,  0.27627799,  0.35126916,  0.37932132,
        0.3915702 ,  0.45151486,  0.5374713 ,  0.45565082,  0.37399019,
        0.4328711 ,  0.45998098,  0.40747782,  0.63226527,  0.51432353,
        0.49040904,  0.5749578 ,  0.57524544,  0.43183206,  0.40378449,
        0.37574633,  0.21411994,  0.4464823 ,  0.36864902,  0.32271726,
        0.34755168,  0.38877693,  0.7058656 ,  0.53344296,  0.40329998,
        0.5777419 ,  0.50958067,  0.54379815,  0.36165463,  0.48269786,
        0.46860293,  0.42628022,  0.2602712 ,  0.50098935,  0.41036843,
        0.4474139 ,  0.32264786,  0.52424594,  0.35982701])

HOMEWORK EXERCISE:

For a given document, find the most similar documents and give their titles from the csv file. You'll see!
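One possible starting point, sketched on made-up data rather than on our texts. `toy_similarity`, `toy_titles`, and the helper `most_similar` are all invented here; swapping in the real `similarity` array and the titles from the csv is the exercise.

```python
import numpy as np

# Made-up 4 x 4 similarity matrix and titles, standing in for
# `similarity` and the Title column of the csv.
toy_similarity = np.array([[1.0, 0.2, 0.7, 0.4],
                           [0.2, 1.0, 0.3, 0.9],
                           [0.7, 0.3, 1.0, 0.5],
                           [0.4, 0.9, 0.5, 1.0]])
toy_titles = ["Text A", "Text B", "Text C", "Text D"]

def most_similar(sim_matrix, doc_index, how_many=2):
    """Indices of the documents most similar to doc_index, best first."""
    # argsort sorts ascending, so reverse to put the largest similarities first.
    ranked = np.argsort(sim_matrix[doc_index])[::-1]
    # Drop the document itself (its similarity to itself is 1).
    ranked = [i for i in ranked if i != doc_index]
    return ranked[:how_many]

for i in most_similar(toy_similarity, 0):
    print(toy_titles[i], toy_similarity[0, i])
```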

supervised vs. unsupervised learning

slides from class omitted

first example of unsupervised learning

hierarchical clustering

This time we're interested in relations among the words, not the texts.

In other words, we're interested in the similarities between one column and another--one term and another term

So we'll work with the transposed matrix--the term-document matrix, rather than the document-term matrix.

For a description of hierarchical clustering, look at the example at https://en.wikipedia.org/wiki/Hierarchical_clustering
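To get a feel for what the algorithm produces before we run it on words, here is a tiny sketch on five made-up points. The data and variable names are invented; `ward` and `fcluster` are the scipy functions, used here on raw coordinates.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Five made-up points on a line: two tight groups far apart.
toy_points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])

# Each row of the linkage matrix records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster].
toy_linkage = ward(toy_points)
print(toy_linkage)

# Cut the tree into two flat clusters to see which points ended up together.
toy_labels = fcluster(toy_linkage, t=2, criterion="maxclust")
print(toy_labels)
```

With n points there are always n-1 merges, from the closest pair up to the final merge that joins everything. A dendrogram is just a drawing of that merge history.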


In [31]:
term_document_matrix=document_term_matrix.T
# .T is the easy transposition attribute for a
# matrix in numpy and scipy.sparse.

In [32]:
# import a bunch of packages we need
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
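A note on what scipy's clustering functions expect as input. As I read the scipy documentation, a 2-D array passed to `ward` or `linkage` is treated as a set of observation vectors, while a 1-D "condensed" array is treated as pairwise distances. Passing a square matrix of distances therefore clusters the rows of that matrix as if they were coordinates, which often gives a usable picture but is not the same as clustering on the distances themselves. The sketch below shows the condensed route on a made-up matrix; the choice of average linkage is an assumption made for the example, not the method used in this notebook.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Made-up term-document matrix: 4 terms (rows) across 3 documents (columns).
toy_tdm = np.array([[1.0, 0.0, 2.0],
                    [2.0, 0.0, 4.0],
                    [0.0, 3.0, 0.0],
                    [0.0, 1.0, 0.1]])

# pdist returns the "condensed" form: one cosine distance per pair of rows,
# i.e. 4 * 3 / 2 = 6 numbers instead of a 4 x 4 square.
condensed = pdist(toy_tdm, metric="cosine")

# squareform expands it back to the square matrix, which matches
# 1 - cosine similarity.
square = squareform(condensed)

# Given a 1-D condensed array, linkage treats the values as distances.
# Average linkage is an assumption here, chosen because it accepts any distance.
toy_linkage = linkage(condensed, method="average")
print(square.round(3))
print(toy_linkage)
```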

In [33]:
#distance is 1-similarity, so:

dist=1-cosine_similarity(term_document_matrix)

# ward is an algorithm for hierarchical clustering

linkage_matrix=ward(dist)

#plot dendrogram

f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()


OMG U...G...L...Y!

WHAT THE? This is nonsense

what's the problem?

we just tried to plot a bunch o' features!

we need only the most significant words!

one way to do this: raise the min_df parameter in the vectorizer, which drops every term that appears in fewer than that share of the documents

vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)

more an art than a science
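To see the effect of min_df before trying it on our texts, here is a small sketch on four made-up sentences. The toy corpus and the names `toy_texts`, `toy_vectorizer`, and `vocab_sizes` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up documents, just to watch the vocabulary shrink.
toy_texts = ["freedom came slowly north",
             "freedom came with the river",
             "freedom and the church",
             "the master sold the farm"]

vocab_sizes = {}
for min_df in (1, 0.5, 0.75):
    # A float min_df is a proportion of documents; an int is a document count.
    toy_vectorizer = TfidfVectorizer(min_df=min_df, stop_words='english', use_idf=True)
    toy_vectorizer.fit(toy_texts)
    # vocabulary_ maps each kept term to its column index.
    vocab_sizes[min_df] = len(toy_vectorizer.vocabulary_)
    print(min_df, sorted(toy_vectorizer.vocabulary_))
```

The higher the threshold, the fewer words survive, and the survivors are the ones shared by most documents. Whether those are the "most significant" words depends on the question you are asking, which is where the art comes in.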


In [34]:
vectorizer=TfidfVectorizer(min_df=.96, stop_words='english', use_idf=True)
#try a very high min_df: keep only words that appear in at least 96% of the texts

In [35]:
#rerun the model
document_term_matrix=vectorizer.fit_transform(our_texts)
vocab=vectorizer.get_feature_names()
#on scikit-learn 1.2 or later this method is gone; use vectorizer.get_feature_names_out() instead

In [36]:
#check the length of the vocab
len(vocab)


Out[36]:
52

In [37]:
#switch again to the term_document_matrix
term_document_matrix=document_term_matrix.T

In [38]:
dist=1-cosine_similarity(term_document_matrix)
linkage_matrix=ward(dist)

#plot dendrogram

f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()


Is this significant? Are there interesting patterns to seek out?

here's what we're up to:

Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.

. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.

  • Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

need to elicit patterns and avoid bad magical thinking!
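One simple way to "check whether the structure could have arisen by chance" is a shuffle test, sketched here on made-up numbers. The two term vectors, the helper `cosine`, and the choice of 1000 shuffles are all assumptions for illustration; this is one possible check, not a method from the lecture.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up weights for two terms across eight documents.
term_a = np.array([5.0, 4.0, 0.0, 0.0, 3.0, 0.0, 4.0, 0.0])
term_b = np.array([4.0, 5.0, 0.0, 1.0, 3.0, 0.0, 5.0, 0.0])

observed = cosine(term_a, term_b)

# Shuffle term_b across documents many times. This breaks any real link
# between the two terms while keeping each term's own values.
rng = np.random.default_rng(0)
shuffled = np.array([cosine(term_a, rng.permutation(term_b))
                     for _ in range(1000)])

# Share of shuffles that look at least as similar as the real data.
share_as_extreme = float(np.mean(shuffled >= observed))
print(observed, shuffled.mean(), share_as_extreme)
```

If shuffled data rarely looks as similar as the real data, the pattern is less likely to be an accident. If it often does, be suspicious of the pretty picture.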

Key assignment: BRAINSTORM texts you wish to mine!