In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
import textmining_blackboxes as tm
`tm` is our temporary helper, not a standard Python package! Download it from my GitHub: https://github.com/matthewljones/computingincontext
In [3]:
#see if package imported correctly
tm.icantbelieve("butter")
I can't believe it's not butter
Let's keep using the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)
This notebook assumes that you store your data in a directory in the same place as your IPython notebook: put the slave narrative texts within a `data` directory next to this notebook.
In [92]:
title_info=pd.read_csv('data/na-slave-narratives/data/toc.csv')
#this is the "metadata" for these files--we'll use it today
#why does data appear twice in the filename?
In [73]:
title_info
Out[73]:
| | Filename | Author | Title | Date | URL | URL(text-only) |
|---|---|---|---|---|---|---|
| 0 | neh-johnstone-johnstone.xml | Abraham Johnstone | The Address of Abraham Johnstone, a Black Man,... | 1797 | http://docsouth.unc.edu/neh/johnstone/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 1 | neh-meachum-meachum.xml | John B. Meachum | An Address to All the Colored Citizens of the ... | 1846 | http://docsouth.unc.edu/neh/meachum/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 2 | neh-johnsontl-johnsontl.xml | Thomas L. Johnson | Africa for Christ. Twenty-Eight Years a Slave | 1892 | http://docsouth.unc.edu/neh/johnsontl/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 3 | neh-white-white.xml | William S. White | The African Preacher. An Authentic Narrative | [c1849] | http://docsouth.unc.edu/neh/white/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 4 | neh-brown55-brown55.xml | William Wells Brown | The American Fugitive in Europe. Sketches of P... | 1855 | http://docsouth.unc.edu/neh/brown55/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 5 | neh-weld-weld.xml | Theodore Dwight Weld | American Slavery As It Is: Testimony of a Thou... | 1839 | http://docsouth.unc.edu/neh/weld/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 6 | neh-andersrob-andersrob.xml | Robert Anderson | The Anderson Surpriser. Written After He Was S... | 1895 | http://docsouth.unc.edu/neh/andersrob/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 7 | neh-boen-boen.xml | No Author | Anecdotes and Memoirs of William Boen, a Colou... | 1834 | http://docsouth.unc.edu/neh/boen/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 8 | neh-stevens-stevens.xml | Charles Emery Stevens | Anthony Burns: A History | 1856 | http://docsouth.unc.edu/neh/stevens/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 9 | neh-robinsonn-robinson.xml | Nina Hill Robinson | Aunt Dice: The Story of a Faithful Slave | 1897 | http://docsouth.unc.edu/neh/robinsonn/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 10 | neh-auntjudy-auntjudy.xml | Matilda G. Thompson | Aunt Judy's Story: A Tale From Real Life. Writ... | 1855 | http://docsouth.unc.edu/neh/auntjudy/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 11 | neh-sally-sally.xml | Isaac Williams | Aunt Sally: or, The Cross the Way of Freedom. ... | 1858 | http://docsouth.unc.edu/neh/sally/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 12 | neh-jamison-jamison.xml | M. F. Jamison | Autobiography and Work of Bishop M. F. Jamison... | 1912 | http://docsouth.unc.edu/neh/jamison/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 13 | neh-browne-browne.xml | Martha Griffith Browne | Autobiography of a Female Slave | 1857 | http://docsouth.unc.edu/neh/browne/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 14 | neh-wards-ward.xml | Samuel Ringgold Ward | Autobiography of a Fugitive Negro: His Anti-Sl... | 1855 | http://docsouth.unc.edu/neh/wards/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 15 | fpn-lane-lane.xml | Isaac Lane | Autobiography of Bishop Isaac Lane, LL.D. with... | 1916 | http://docsouth.unc.edu/fpn/lane/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 16 | neh-parkerh-parkerh.xml | Henry Parker | Autobiography of Henry Parker | 186? | http://docsouth.unc.edu/neh/parkerh/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 17 | neh-smithj-smithj.xml | James Lindsay Smith | Autobiography of James L. Smith, Including, Al... | 1881 | http://docsouth.unc.edu/neh/smithj/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 18 | neh-said-said.xml | Nicholas Said | The Autobiography of Nicholas Said, A Native o... | 1873 | http://docsouth.unc.edu/neh/said/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 19 | nc-omarsaid-omarsaid.xml | Omar ibn Said | Autobiography of Omar ibn Said, Slave in North... | 1925 | http://docsouth.unc.edu/nc/omarsaid/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 20 | neh-frederick-frederick.xml | Francis Frederick | Autobiography of Rev. Francis Frederick, of Vi... | 1869 | http://docsouth.unc.edu/neh/frederick/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 21 | neh-henry-henry.xml | Thomas W. Henry | Autobiography of Rev. Thomas W. Henry, of the ... | [1872] | http://docsouth.unc.edu/neh/henry/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 22 | neh-henson81-henson81.xml | Josiah Henson | An Autobiography of the Rev. Josiah Henson ("U... | 1881 | http://docsouth.unc.edu/neh/henson81/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 23 | neh-holsey-holsey.xml | Lucius Henry Holsey | Autobiography, Sermons, Addresses, and Essays ... | 1898 | http://docsouth.unc.edu/neh/holsey/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 24 | neh-washstory-washin.xml | Booker T. Washington | An Autobiography: The Story of My Life and Work | c1901 | http://docsouth.unc.edu/neh/washstory/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 25 | neh-smitham-smith.xml | Amanda Smith | An Autobiography: The Story of the Lord's Deal... | 1893 | http://docsouth.unc.edu/neh/smitham/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 26 | neh-campbell-campbell.xml | Israel Campbell | An Autobiography. Bond and Free: Or, Yearning... | 1861 | http://docsouth.unc.edu/neh/campbell/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 27 | neh-alexander-alexander.xml | Charles Alexander | Battles and Victories of Allen Allensworth, A.... | 1914 | http://docsouth.unc.edu/neh/alexander/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 28 | neh-aleckson-aleckson.xml | Sam Aleckson | Before the War, and After the Union. An Autob... | 1929 | http://docsouth.unc.edu/neh/aleckson/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 29 | neh-keckley-keckley.xml | Elizabeth Keckley | Behind the Scenes, or, Thirty years a Slave, a... | 1868 | http://docsouth.unc.edu/neh/keckley/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| ... | ... | ... | ... | ... | ... | ... |
| 264 | neh-ballslavery-ball.xml | Charles Ball | Slavery in the United States: A Narrative of t... | 1837 | http://docsouth.unc.edu/neh/ballslavery/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 265 | neh-bluett-bluett.xml | Thomas Bluett | Some Memoirs of the Life of Job, the Son of So... | 1734 | http://docsouth.unc.edu/neh/bluett/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 266 | neh-gallaudet-gallaudet.xml | T. H. Gallaudet | A Statement with Regard to the Moorish Prince,... | 1828 | http://docsouth.unc.edu/neh/gallaudet/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 267 | neh-story-story.xml | No Author | The Story of a Slave. A Realistic Revelation o... | 1894 | http://docsouth.unc.edu/neh/story/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 268 | neh-eliot-eliot.xml | William Greenleaf Eliot | The Story of Archer Alexander: From Slavery to... | 1885 | http://docsouth.unc.edu/neh/eliot/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 269 | neh-jacksonm-jackson.xml | Mattie J. Jackson | The Story of Mattie J. Jackson: Her Parentage,... | 1866 | http://docsouth.unc.edu/neh/jacksonm/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 270 | neh-twelvetr-twelvetr.xml | Harper Twelvetrees | The Story of the Life of John Anderson, the Fu... | 1863 | http://docsouth.unc.edu/neh/twelvetr/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 271 | neh-watkins-watkins.xml | James Watkins | Struggles for Freedom; or The Life of James Wa... | 1860 | http://docsouth.unc.edu/neh/watkins/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 272 | neh-iwilliams-iwilliams.xml | Isaac D. Williams | Sunshine and Shadow of Slave Life. Reminiscenc... | 1885 | http://docsouth.unc.edu/neh/iwilliams/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 273 | fpn-hughes-hughes.xml | Louis Hughes | Thirty Years a Slave: From Bondage to Freedom:... | 1897 | http://docsouth.unc.edu/fpn/hughes/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 274 | neh-brown52-brown52.xml | William Wells Brown | Three Years in Europe: Or, Places I Have Seen ... | 1852 | http://docsouth.unc.edu/neh/brown52/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 275 | neh-detroit-detroit.xml | No Author | A Thrilling Narrative from the Lips of the Suf... | 1863 | http://docsouth.unc.edu/neh/detroit/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 276 | neh-tubbee1848-tubbee1848.xml | Okah Tubbee | A Thrilling Sketch of the Life of the Distingu... | 1848 | http://docsouth.unc.edu/neh/tubbee1848/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 277 | neh-beard63-beard63.xml | J. R. Beard | Toussaint L'Ouverture: A Biography and Autobio... | 1863 | http://docsouth.unc.edu/neh/beard63/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 278 | neh-henderson-henderson.xml | Madison Henderson | Trials and Confessions of Madison Henderson, A... | 1841 | http://docsouth.unc.edu/neh/henderson/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 279 | neh-armistead-armistead.xml | Wilson Armistead | A Tribute for the Negro: Being a Vindication o... | 1848 | http://docsouth.unc.edu/neh/armistead/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 280 | neh-twain-twain.xml | Mark Twain | A True Story, Repeated Word for Word As I Hear... | November 1874 | http://docsouth.unc.edu/neh/twain/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 281 | neh-jjacobs-jjacobs.xml | John S. Jacobs | A True Tale of Slavery. From The Leisure Hour:... | 1861 | http://docsouth.unc.edu/neh/jjacobs/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 282 | neh-henson58-henson58.xml | Josiah Henson | Truth Stranger Than Fiction. Father Henson's S... | 1858 | http://docsouth.unc.edu/neh/henson58/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 283 | fpn-northup-northup.xml | Solomon Northup | Twelve Years a Slave: Narrative of Solomon Nor... | 1853 | http://docsouth.unc.edu/fpn/northup/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 284 | neh-johnson1-johnson.xml | Thomas L. Johnson | Twenty-Eight Years a Slave, or The Story of My... | 1909 | http://docsouth.unc.edu/neh/johnson1/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 285 | fpn-steward-steward.xml | Austin Steward | Twenty-Two Years a Slave, and Forty Years a Fr... | 1857 | http://docsouth.unc.edu/fpn/steward/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 286 | neh-rayemma-rayemma.xml | Emma J. Ray | Twice Sold, Twice Ransomed: Autobiography of M... | c1926 | http://docsouth.unc.edu/neh/rayemma/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 287 | neh-foster-foster.xml | Gustavus L. Foster | Uncle Johnson, the Pilgrim of Six Score Years | 186-? | http://docsouth.unc.edu/neh/foster/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 288 | neh-edwardsj-edwardsj.xml | John Passmore Edwards | Uncle Tom's Companions: Or, Facts Stranger Tha... | 1852 | http://docsouth.unc.edu/neh/edwardsj/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 289 | neh-henson-henson.xml | Josiah Henson | Uncle Tom's Story of His Life. An Autobiograph... | 1876 | http://docsouth.unc.edu/neh/henson/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 290 | fpn-washington-washing.xml | Booker T. Washington | Up from Slavery: An Autobiography | c1901 | http://docsouth.unc.edu/fpn/washington/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 291 | fpn-burtont-burton.xml | Thomas William Burton | What Experience Has Taught Me: An Autobiograph... | c1910 | http://docsouth.unc.edu/fpn/burtont/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 292 | neh-hildreth-hildreth.xml | Richard Hildreth | The White Slave; or, Memoirs of a Fugitive | 1852 | http://docsouth.unc.edu/neh/hildreth/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 293 | neh-wilkerson-wilkerson.xml | Major James Wilkerson | Wilkerson's History of His Travels & Labor... | 1861 | http://docsouth.unc.edu/neh/wilkerson/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
294 rows × 6 columns
In [74]:
title_info["Date"].str.replace(r"-\?", "5", regex=True) #regex=True is needed in pandas 2.0+, where str.replace matches literally by default
Out[74]:
0 1797
1 1846
2 1892
3 [c1849]
4 1855
5 1839
6 1895
7 1834
8 1856
9 1897
10 1855
11 1858
12 1912
13 1857
14 1855
...
279 1848
280 November 1874
281 1861
282 1858
283 1853
284 1909
285 1857
286 c1926
287 1865
288 1852
289 1876
290 c1901
291 c1910
292 1852
293 1861
Name: Date, Length: 294, dtype: object
In [75]:
title_info["Date"].str.replace(r"[^0-9]", "", regex=True) #use regular expressions to clean up
Out[75]:
0 1797
1 1846
2 1892
3 1849
4 1855
5 1839
6 1895
7 1834
8 1856
9 1897
10 1855
11 1858
12 1912
13 1857
14 1855
...
279 1848
280 1874
281 1861
282 1858
283 1853
284 1909
285 1857
286 1926
287 186
288 1852
289 1876
290 1901
291 1910
292 1852
293 1861
Name: Date, Length: 294, dtype: object
In [97]:
title_info["Date"]=title_info["Date"].str.replace(r"-\?", "5", regex=True)
title_info["Date"]=title_info["Date"].str.replace(r"[^0-9]", "", regex=True) # what assumptions have I made about the data?
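The two substitutions above quietly encode assumptions about how dates are written in the catalogue. Here is a minimal sketch using only the standard-library `re` module; `clean_date` is a hypothetical helper, not part of the notebook, and the sample strings are values visible in the Date column earlier. It shows what each pattern does to a few representative entries:

```python
import re

def clean_date(raw):
    """Apply the same two substitutions as the cell above to a single string."""
    # "186-?" means "some year in the 1860s"; guess the middle of the decade.
    guessed = re.sub(r"-\?", "5", raw)
    # Drop every remaining non-digit: brackets, "c" for circa, month names.
    return re.sub(r"[^0-9]", "", guessed)

samples = ["1797", "[c1849]", "November 1874", "186-?", "186?"]
for raw in samples:
    print(repr(raw), "->", repr(clean_date(raw)))
```

Note that `186?` (no hyphen) is not caught by the first pattern and comes out as the three-digit string `186`, which is no longer a four-digit year.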
In [99]:
title_info["Date"]=pd.to_datetime(title_info["Date"], errors="coerce") #unparseable dates become NaT; older pandas spelled this coerce=True
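To see what coercion does to values that are not clean years, here is a small sketch on a toy Series rather than the real column. The explicit `format="%Y"` is an assumption added for predictability, since the notebook lets pandas infer the format. In recent pandas versions the keyword is `errors="coerce"`.

```python
import pandas as pd

# Toy stand-in for the cleaned Date column; "186" and "" are deliberately bad.
dates = pd.Series(["1797", "1846", "186", "", "1734"])

# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(dates, format="%Y", errors="coerce")

# Comparing with a Timestamp yields a boolean Series; NaT compares as False.
before_1800 = parsed < pd.Timestamp(1800, 1, 1)

print(parsed)
print(dates[before_1800].tolist())
```

Rows that were coerced to NaT silently drop out of any date comparison, so it is worth counting them with `parsed.isna().sum()` before filtering.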
In [122]:
title_info["Date"]<pd.Timestamp(1800,1,1) #pd.datetime was removed from pandas; pd.Timestamp is the replacement
Out[122]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
...
279 False
280 False
281 False
282 False
283 False
284 False
285 False
286 False
287 False
288 False
289 False
290 False
291 False
292 False
293 False
Name: Date, Length: 294, dtype: bool
In [127]:
title_info[title_info["Date"]<pd.Timestamp(1800,1,1)]
Out[127]:
| | Filename | Author | Title | Date | URL | URL(text-only) |
|---|---|---|---|---|---|---|
| 0 | neh-johnstone-johnstone.xml | Abraham Johnstone | The Address of Abraham Johnstone, a Black Man,... | 1797-01-01 | http://docsouth.unc.edu/neh/johnstone/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 55 | neh-pomp-pomp.xml | Pomp | Dying Confession of Pomp, A Negro Man, Who Was... | 1795-01-01 | http://docsouth.unc.edu/neh/pomp/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 97 | neh-equiano1-equiano1.xml | Olaudah Equiano | The Interesting Narrative of the Life of Olaud... | 1789-01-01 | http://docsouth.unc.edu/neh/equiano1/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 98 | neh-equiano2-equiano2.xml | Olaudah Equiano | The Interesting Narrative of the Life of Olaud... | 1789-01-01 | http://docsouth.unc.edu/neh/equiano2/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 105 | neh-fortis-fortis.xml | Edmund Fortis | The Last Words and Dying Speech of Edmund Fort... | 1795-01-01 | http://docsouth.unc.edu/neh/fortis/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 106 | neh-sancho1-sancho1.xml | Ignatius Sancho | Letters of the Late Ignatius Sancho, An Africa... | 1782-01-01 | http://docsouth.unc.edu/neh/sancho1/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 107 | neh-sancho2-sancho2.xml | Ignatius Sancho | Letters of the Late Ignatius Sancho, An Africa... | 1782-01-01 | http://docsouth.unc.edu/neh/sancho2/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 140 | neh-arthur-arthur.xml | Arthur | The Life, and Dying Speech of Arthur, a Negro ... | 1768-01-01 | http://docsouth.unc.edu/neh/arthur/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 145 | neh-smithste-smithste.xml | Stephen Smith | Life, Last Words and Dying Speech of Stephen S... | 1797-01-01 | http://docsouth.unc.edu/neh/smithste/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 161 | neh-norris-norris.xml | Robert Norris | Memoirs of the Reign of Bossa Ahadee, King of ... | 1789-01-01 | http://docsouth.unc.edu/neh/norris/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 197 | neh-venture-venture.xml | Venture Smith | A Narrative of the Life and Adventures of Vent... | 1798-01-01 | http://docsouth.unc.edu/neh/venture/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 207 | neh-gronniosaw-gronnios.xml | James Albert Ukawsaw Gronniosaw | A Narrative of the Most Remarkable Particulars... | 1770-01-01 | http://docsouth.unc.edu/neh/gronniosaw/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 210 | neh-hammon-hammon.xml | Briton Hammon | A Narrative of the Uncommon Sufferings, and Su... | 1760-01-01 | http://docsouth.unc.edu/neh/hammon/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 240 | neh-royal-royal.xml | No Author | The Royal African: or, Memoirs of the Young Pr... | 1750-01-01 | http://docsouth.unc.edu/neh/royal/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 255 | neh-mountain-mountain.xml | Joseph Mountain | Sketches of the Life of Joseph Mountain, a Neg... | 1790-01-01 | http://docsouth.unc.edu/neh/mountain/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
| 265 | neh-bluett-bluett.xml | Thomas Bluett | Some Memoirs of the Life of Job, the Son of So... | 1734-01-01 | http://docsouth.unc.edu/neh/bluett/menu.html | http://docsouth.unc.edu/full-text/na-slave-nar... |
In [124]:
#Let's use a brittle helper for reading in a directory of plain .txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a standard Python package
#it returns a simple list of the documents as very long strings
#note: if you want, the rest of this notebook will work on any directory of text files.
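What `tm.readtextfiles` does internally is not shown here. As a rough idea of what such a helper might look like, here is a hypothetical stand-in (`read_text_files` is an invented name) that uses only the standard library and assumes UTF-8 files with a `.txt` extension:

```python
from pathlib import Path

def read_text_files(directory):
    """Return the contents of every .txt file in `directory`, sorted by filename."""
    texts = []
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" keeps going when a file has bytes that are not valid UTF-8.
        texts.append(path.read_text(encoding="utf-8", errors="replace"))
    return texts
```

Calling `read_text_files('data/na-slave-narratives/data/texts')` would give one long string per file. Whether that filename order lines up with the rows of `toc.csv` is an assumption to check before joining texts to metadata.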
For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.
In [131]:
our_texts=tm.data_cleanse(our_texts)
#this matters more when the text is messy
#it eliminates escaped characters
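The internals of `tm.data_cleanse` are not shown; the comment says only that it eliminates escaped characters. Under that assumption, a hypothetical stand-in (`cleanse` is an invented name, and the real helper may do more or less than this) could look like the following sketch:

```python
import re

def cleanse(texts):
    """Replace literal escape sequences and control whitespace with single spaces."""
    cleaned = []
    for text in texts:
        # Literal backslash sequences such as "\n" or "\t" left over from extraction.
        text = re.sub(r"\\[nrt]", " ", text)
        # Real newlines, tabs and runs of spaces collapse to one space.
        text = re.sub(r"\s+", " ", text)
        cleaned.append(text.strip())
    return cleaned

print(cleanse(["first line\\nsecond\t\tline  here\n"]))
```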
In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [133]:
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)
In [134]:
document_term_matrix=vectorizer.fit_transform(our_texts)
In [135]:
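# now let's get our vocabulary--the names corresponding to the columns
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab=vectorizer.get_feature_names_out().tolist()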
In [137]:
len(vocab)
Out[137]:
1650
In [138]:
document_term_matrix.shape
Out[138]:
(294, 1650)
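To make the matrix less of a black box, here is a simplified pure-Python sketch of the weighting. It tokenises on whitespace and skips stop-word removal, both simplifications relative to `TfidfVectorizer`. It does use the smoothed idf formula that scikit-learn documents as its default, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs, min_df=0.5):
    """Return (vocab, rows): one L2-normalised tf-idf row per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))

    # A float min_df is a proportion: keep terms found in at least that share of docs.
    vocab = sorted(t for t, count in df.items() if count / n_docs >= min_df)

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows

docs = ["freedom came late", "freedom and prayer", "prayer and work", "work came first"]
vocab, rows = tfidf(docs)
print(vocab)
```

With `min_df=0.5`, words that occur in fewer than half of the documents never make it into the vocabulary, which is why a corpus of 294 long texts ends up with only a couple of thousand columns.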
In [139]:
vocab[1000:1100]
Out[139]:
['page',
'pages',
'paid',
'pain',
'painful',
'pains',
'pair',
'paper',
'papers',
'parents',
'particular',
'particularly',
'parties',
'parting',
'parts',
'party',
'pass',
'passage',
'passed',
'passing',
'past',
'path',
'pay',
'paying',
'peace',
'peculiar',
'pen',
'people',
'perfect',
'perfectly',
'perform',
'performed',
'period',
'permission',
'permit',
'permitted',
'person',
'personal',
'persons',
'peter',
'philadelphia',
'picture',
'piece',
'pieces',
'pity',
'place',
'placed',
'places',
'plain',
'plan',
'plans',
'plantation',
'play',
'pleasant',
'pleased',
'pleasure',
'plenty',
'pocket',
'point',
'points',
'poor',
'portion',
'position',
'possess',
'possessed',
'possession',
'possible',
'possibly',
'post',
'pounds',
'power',
'powerful',
'powers',
'practice',
'praise',
'pray',
'prayed',
'prayer',
'prayers',
'praying',
'preach',
'preached',
'preacher',
'preaching',
'precious',
'preface',
'prejudice',
'prepare',
'prepared',
'preparing',
'presence',
'present',
'presented',
'president',
'press',
'pretty',
'prevent',
'prevented',
'previous',
'price']
In [23]:
document_term_matrix_dense=document_term_matrix.toarray()
In [24]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)
In [25]:
dtmdf
Out[25]:
| | 10 | ability | able | abroad | absence | absent | accept | accepted | accompanied | accomplish | ... | wrote | yard | ye | year | years | yes | york | young | younger | youth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.002778 | 0.001837 | 0.008431 | 0.000000 | 0.000000 | 0.002855 | 0.007959 | 0.002445 | 0.000000 | ... | 0.002333 | 0.002427 | 0.009639 | 0.001776 | 0.050294 | 0.014698 | 0.006708 | 0.016257 | 0.000000 | 0.000000 |
| 1 | 0.000000 | 0.001035 | 0.047255 | 0.002095 | 0.008032 | 0.007569 | 0.001064 | 0.004944 | 0.005469 | 0.002129 | ... | 0.002609 | 0.032574 | 0.000000 | 0.028466 | 0.046546 | 0.000000 | 0.003334 | 0.089552 | 0.008616 | 0.005845 |
| 2 | 0.007761 | 0.002577 | 0.022157 | 0.005214 | 0.007496 | 0.005382 | 0.010594 | 0.000000 | 0.006805 | 0.000000 | ... | 0.000000 | 0.011259 | 0.004471 | 0.028007 | 0.062745 | 0.011688 | 0.004149 | 0.035189 | 0.002680 | 0.000000 |
| 3 | 0.002040 | 0.028447 | 0.017471 | 0.000000 | 0.013792 | 0.000000 | 0.006265 | 0.009703 | 0.000000 | 0.004177 | ... | 0.001707 | 0.008878 | 0.000000 | 0.064955 | 0.076117 | 0.001536 | 0.003271 | 0.050210 | 0.019022 | 0.001912 |
| 4 | 0.005407 | 0.005386 | 0.021373 | 0.000000 | 0.000000 | 0.011249 | 0.016607 | 0.005144 | 0.000000 | 0.000000 | ... | 0.004524 | 0.023532 | 0.004672 | 0.037876 | 0.084063 | 0.020356 | 0.021677 | 0.045529 | 0.000000 | 0.000000 |
| 5 | 0.028843 | 0.005746 | 0.015202 | 0.005814 | 0.011144 | 0.000000 | 0.000000 | 0.005488 | 0.005058 | 0.011812 | ... | 0.004826 | 0.005021 | 0.009970 | 0.091842 | 0.100450 | 0.000000 | 0.009251 | 0.100886 | 0.000000 | 0.016217 |
| 6 | 0.012275 | 0.012228 | 0.024262 | 0.000000 | 0.000000 | 0.000000 | 0.012568 | 0.000000 | 0.010763 | 0.000000 | ... | 0.010271 | 0.010685 | 0.000000 | 0.078175 | 0.061073 | 0.009243 | 0.049214 | 0.063610 | 0.012718 | 0.000000 |
| 7 | 0.000000 | 0.000000 | 0.012353 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.005440 | 0.000000 | 0.063682 | 0.073848 | 0.000000 | 0.025056 | 0.036434 | 0.000000 | 0.000000 |
| 8 | 0.000000 | 0.025964 | 0.034346 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.043617 | 0.000000 | 0.067576 | 0.016600 | 0.032421 | 0.000000 | 0.000000 | 0.033768 | 0.027007 | 0.073277 |
| 9 | 0.000000 | 0.006766 | 0.008950 | 0.006845 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.011366 | 0.005912 | 0.082176 | 0.008651 | 0.025345 | 0.005114 | 0.000000 | 0.013199 | 0.007037 | 0.050919 |
| 10 | 0.000000 | 0.004677 | 0.004640 | 0.000000 | 0.009071 | 0.002442 | 0.000000 | 0.004467 | 0.006176 | 0.002404 | ... | 0.013751 | 0.047005 | 0.002029 | 0.044856 | 0.037963 | 0.021214 | 0.001883 | 0.044103 | 0.002433 | 0.000000 |
| 11 | 0.000000 | 0.000000 | 0.015878 | 0.000000 | 0.004656 | 0.005014 | 0.000000 | 0.000000 | 0.000000 | 0.004935 | ... | 0.012099 | 0.016783 | 0.029158 | 0.009209 | 0.032974 | 0.036295 | 0.000000 | 0.031222 | 0.000000 | 0.009034 |
| 12 | 0.000000 | 0.001307 | 0.010370 | 0.003966 | 0.000000 | 0.004093 | 0.012086 | 0.009983 | 0.004600 | 0.004029 | ... | 0.021949 | 0.004567 | 0.014736 | 0.025895 | 0.077496 | 0.018766 | 0.063104 | 0.057775 | 0.002718 | 0.009833 |
| 13 | 0.004397 | 0.000000 | 0.014484 | 0.004431 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.004502 | ... | 0.040466 | 0.000000 | 0.015199 | 0.042002 | 0.065626 | 0.000000 | 0.056408 | 0.011392 | 0.000000 | 0.000000 |
| 14 | 0.004143 | 0.020634 | 0.017742 | 0.002088 | 0.006002 | 0.000000 | 0.004242 | 0.003941 | 0.000000 | 0.000000 | ... | 0.003466 | 0.000000 | 0.001790 | 0.055406 | 0.119808 | 0.003120 | 0.029897 | 0.029519 | 0.000000 | 0.000000 |
| 15 | 0.000000 | 0.000000 | 0.003120 | 0.004772 | 0.000000 | 0.000000 | 0.000000 | 0.018019 | 0.004152 | 0.004848 | ... | 0.003962 | 0.024731 | 0.000000 | 0.033172 | 0.058898 | 0.014262 | 0.015188 | 0.021471 | 0.000000 | 0.000000 |
| 16 | 0.000000 | 0.001598 | 0.017964 | 0.000000 | 0.013943 | 0.018353 | 0.001642 | 0.004577 | 0.018282 | 0.016421 | ... | 0.008052 | 0.065618 | 0.024949 | 0.054136 | 0.103737 | 0.019323 | 0.100312 | 0.033245 | 0.013294 | 0.016533 |
| 17 | 0.013873 | 0.004606 | 0.013710 | 0.002330 | 0.000000 | 0.000000 | 0.004734 | 0.006599 | 0.000000 | 0.002367 | ... | 0.030952 | 0.014088 | 0.019981 | 0.038284 | 0.090589 | 0.015669 | 0.009270 | 0.017972 | 0.002396 | 0.000000 |
| 18 | 0.000000 | 0.000000 | 0.058039 | 0.000000 | 0.004478 | 0.004823 | 0.000000 | 0.004411 | 0.000000 | 0.009494 | ... | 0.007759 | 0.012108 | 0.004007 | 0.038385 | 0.057669 | 0.003491 | 0.033459 | 0.021023 | 0.000000 | 0.004345 |
| 19 | 0.001608 | 0.006409 | 0.016955 | 0.009726 | 0.006214 | 0.005019 | 0.011527 | 0.016831 | 0.007051 | 0.004940 | ... | 0.012112 | 0.005600 | 0.009730 | 0.030729 | 0.051015 | 0.008478 | 0.063194 | 0.033339 | 0.013332 | 0.006029 |
| 20 | 0.000000 | 0.000000 | 0.032121 | 0.000000 | 0.007849 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.007022 | 0.015525 | 0.090963 | 0.036712 | 0.000000 | 0.026317 | 0.000000 | 0.000000 |
| 21 | 0.001218 | 0.015776 | 0.063410 | 0.006139 | 0.001177 | 0.000000 | 0.008731 | 0.005795 | 0.002136 | 0.007484 | ... | 0.004077 | 0.003181 | 0.000000 | 0.027931 | 0.101528 | 0.002752 | 0.017584 | 0.029198 | 0.005049 | 0.001142 |
| 22 | 0.002860 | 0.022793 | 0.036746 | 0.004324 | 0.009669 | 0.004463 | 0.004393 | 0.014965 | 0.002508 | 0.007321 | ... | 0.011965 | 0.011204 | 0.000000 | 0.057377 | 0.089828 | 0.001077 | 0.025227 | 0.048169 | 0.001482 | 0.021442 |
| 23 | 0.008847 | 0.000000 | 0.040799 | 0.000000 | 0.012817 | 0.004601 | 0.000000 | 0.000000 | 0.003878 | 0.000000 | ... | 0.003701 | 0.003850 | 0.003823 | 0.036621 | 0.071524 | 0.006661 | 0.021280 | 0.057304 | 0.004583 | 0.000000 |
| 24 | 0.005152 | 0.000000 | 0.010183 | 0.002596 | 0.004976 | 0.002680 | 0.005275 | 0.002451 | 0.004517 | 0.002637 | ... | 0.028019 | 0.004485 | 0.015583 | 0.029529 | 0.081703 | 0.027155 | 0.033048 | 0.023360 | 0.000000 | 0.002414 |
| 25 | 0.000000 | 0.018121 | 0.035956 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.030441 | 0.015835 | 0.141488 | 0.092682 | 0.169702 | 0.027396 | 0.102106 | 0.011783 | 0.000000 | 0.000000 |
| 26 | 0.003842 | 0.000000 | 0.005062 | 0.000000 | 0.000000 | 0.000000 | 0.003933 | 0.000000 | 0.006737 | 0.003933 | ... | 0.009643 | 0.000000 | 0.059761 | 0.029360 | 0.076456 | 0.005786 | 0.018483 | 0.017420 | 0.000000 | 0.010800 |
| 27 | 0.000000 | 0.007591 | 0.010041 | 0.000000 | 0.000000 | 0.000000 | 0.015603 | 0.007249 | 0.000000 | 0.007802 | ... | 0.019127 | 0.000000 | 0.000000 | 0.004853 | 0.104259 | 0.005738 | 0.006110 | 0.054295 | 0.000000 | 0.000000 |
| 28 | 0.004956 | 0.024683 | 0.018502 | 0.006660 | 0.001596 | 0.005155 | 0.001691 | 0.007858 | 0.008691 | 0.000000 | ... | 0.011057 | 0.002876 | 0.018559 | 0.036821 | 0.110955 | 0.009951 | 0.037088 | 0.044942 | 0.005135 | 0.026316 |
| 29 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.011208 | 0.000000 | 0.000000 | 0.018893 | 0.000000 | ... | 0.000000 | 0.009378 | 0.009311 | 0.013723 | 0.033502 | 0.008113 | 0.000000 | 0.041872 | 0.011163 | 0.040384 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 264 | 0.000000 | 0.003486 | 0.036891 | 0.007054 | 0.010141 | 0.000000 | 0.014332 | 0.006659 | 0.000000 | 0.000000 | ... | 0.002928 | 0.015232 | 0.003024 | 0.013372 | 0.084881 | 0.010541 | 0.005612 | 0.020402 | 0.003626 | 0.000000 |
| 265 | 0.000000 | 0.002966 | 0.021579 | 0.003001 | 0.002876 | 0.012389 | 0.000000 | 0.011330 | 0.002611 | 0.000000 | ... | 0.012456 | 0.000000 | 0.010292 | 0.030339 | 0.061107 | 0.058293 | 0.057299 | 0.021215 | 0.000000 | 0.002790 |
| 266 | 0.007076 | 0.001410 | 0.024244 | 0.001426 | 0.001367 | 0.005889 | 0.010143 | 0.009425 | 0.012410 | 0.007245 | ... | 0.011842 | 0.000000 | 0.045254 | 0.020731 | 0.126748 | 0.039431 | 0.062416 | 0.018335 | 0.005866 | 0.006631 |
| 267 | 0.007754 | 0.001287 | 0.022989 | 0.001302 | 0.001248 | 0.005377 | 0.009262 | 0.008606 | 0.012465 | 0.006616 | ... | 0.011894 | 0.000000 | 0.040206 | 0.020576 | 0.136632 | 0.048656 | 0.065283 | 0.019253 | 0.006695 | 0.006055 |
| 268 | 0.005221 | 0.005200 | 0.015478 | 0.005262 | 0.010085 | 0.008146 | 0.008017 | 0.007450 | 0.009155 | 0.002672 | ... | 0.002184 | 0.000000 | 0.020302 | 0.018286 | 0.064935 | 0.037345 | 0.012558 | 0.069323 | 0.002705 | 0.004892 |
| 269 | 0.005592 | 0.000000 | 0.003684 | 0.005636 | 0.005401 | 0.011634 | 0.000000 | 0.015960 | 0.004903 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.007123 | 0.038256 | 0.033687 | 0.017936 | 0.097801 | 0.005794 | 0.005240 |
| 270 | 0.011798 | 0.000000 | 0.007773 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.020690 | 0.012079 | ... | 0.000000 | 0.030810 | 0.020392 | 0.007514 | 0.036687 | 0.008884 | 0.000000 | 0.007642 | 0.000000 | 0.011056 |
| 271 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.041778 | 0.101994 | 0.024698 | 0.000000 | 0.084985 | 0.000000 | 0.000000 |
| 272 | 0.000000 | 0.004670 | 0.027801 | 0.002363 | 0.009057 | 0.009755 | 0.002400 | 0.004460 | 0.004111 | 0.000000 | ... | 0.003923 | 0.002041 | 0.002026 | 0.022394 | 0.032075 | 0.007061 | 0.001880 | 0.009111 | 0.000000 | 0.006590 |
| 273 | 0.000000 | 0.000000 | 0.016820 | 0.025730 | 0.000000 | 0.026557 | 0.000000 | 0.000000 | 0.022385 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.016259 | 0.095264 | 0.076895 | 0.000000 | 0.016537 | 0.000000 | 0.000000 |
| 274 | 0.000000 | 0.004016 | 0.006375 | 0.000813 | 0.012461 | 0.006710 | 0.002476 | 0.000767 | 0.006363 | 0.000000 | ... | 0.006746 | 0.000000 | 0.001394 | 0.004108 | 0.016046 | 0.004857 | 0.047841 | 0.010968 | 0.002506 | 0.000756 |
| 275 | 0.000000 | 0.000000 | 0.022407 | 0.008569 | 0.000000 | 0.008845 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.108294 | 0.126904 | 0.000000 | 0.013635 | 0.022029 | 0.008809 | 0.015935 |
| 276 | 0.000000 | 0.007089 | 0.018754 | 0.014345 | 0.000000 | 0.007403 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.005954 | 0.000000 | 0.006150 | 0.104239 | 0.141627 | 0.000000 | 0.017119 | 0.041487 | 0.014747 | 0.013338 |
| 277 | 0.006928 | 0.018978 | 0.023964 | 0.010474 | 0.001673 | 0.003603 | 0.015960 | 0.011534 | 0.000000 | 0.012413 | ... | 0.005797 | 0.003015 | 0.005987 | 0.036400 | 0.099100 | 0.005217 | 0.059718 | 0.063949 | 0.000000 | 0.019477 |
| 278 | 0.001310 | 0.003915 | 0.015537 | 0.007922 | 0.006327 | 0.002726 | 0.010731 | 0.009971 | 0.009190 | 0.001341 | ... | 0.008769 | 0.001140 | 0.004529 | 0.020025 | 0.056220 | 0.018744 | 0.079838 | 0.028854 | 0.001357 | 0.004911 |
| 279 | 0.000000 | 0.000000 | 0.030560 | 0.003596 | 0.017232 | 0.007423 | 0.003653 | 0.000000 | 0.003129 | 0.003653 | ... | 0.005971 | 0.021742 | 0.003084 | 0.038630 | 0.064352 | 0.008060 | 0.000000 | 0.025423 | 0.000000 | 0.006687 |
| 280 | 0.018909 | 0.010595 | 0.021801 | 0.004764 | 0.005708 | 0.000000 | 0.013310 | 0.010119 | 0.006217 | 0.009680 | ... | 0.016810 | 0.004115 | 0.000000 | 0.086553 | 0.088933 | 0.000890 | 0.043590 | 0.031386 | 0.002449 | 0.014397 |
| 281 | 0.000000 | 0.006831 | 0.022589 | 0.000000 | 0.019870 | 0.000000 | 0.000000 | 0.006523 | 0.006013 | 0.021062 | ... | 0.005737 | 0.005969 | 0.023704 | 0.048037 | 0.059704 | 0.005163 | 0.038488 | 0.017767 | 0.007105 | 0.012852 |
| 282 | 0.000000 | 0.008524 | 0.020672 | 0.002875 | 0.011020 | 0.000000 | 0.000000 | 0.005427 | 0.005002 | 0.011681 | ... | 0.004773 | 0.007449 | 0.014790 | 0.025431 | 0.074505 | 0.019330 | 0.082336 | 0.027714 | 0.000000 | 0.005346 |
| 283 | 0.000000 | 0.000000 | 0.043636 | 0.000000 | 0.000000 | 0.007655 | 0.000000 | 0.007001 | 0.019358 | 0.000000 | ... | 0.000000 | 0.012812 | 0.019079 | 0.023433 | 0.041190 | 0.016624 | 0.005901 | 0.004767 | 0.007625 | 0.000000 |
| 284 | 0.000000 | 0.004530 | 0.022471 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006984 | ... | 0.001902 | 0.011876 | 0.000000 | 0.018825 | 0.025454 | 0.013697 | 0.001823 | 0.001473 | 0.000000 | 0.004262 |
| 285 | 0.000000 | 0.006589 | 0.010894 | 0.003333 | 0.009583 | 0.000000 | 0.003386 | 0.006292 | 0.005800 | 0.003386 | ... | 0.000000 | 0.014394 | 0.002858 | 0.044229 | 0.102838 | 0.004981 | 0.007955 | 0.042844 | 0.000000 | 0.003099 |
| 286 | 0.021025 | 0.001611 | 0.009590 | 0.003803 | 0.005728 | 0.004486 | 0.001104 | 0.002051 | 0.003309 | 0.002208 | ... | 0.004059 | 0.012670 | 0.001398 | 0.036049 | 0.126060 | 0.006089 | 0.027665 | 0.033523 | 0.008378 | 0.004547 |
| 287 | 0.004362 | 0.000000 | 0.002874 | 0.017584 | 0.012639 | 0.000000 | 0.000000 | 0.004149 | 0.003824 | 0.000000 | ... | 0.014598 | 0.000000 | 0.131929 | 0.022222 | 0.059678 | 0.000000 | 0.000000 | 0.048031 | 0.004519 | 0.040874 |
| 288 | 0.010687 | 0.005323 | 0.014082 | 0.010771 | 0.005161 | 0.000000 | 0.005471 | 0.000000 | 0.004685 | 0.000000 | ... | 0.000000 | 0.000000 | 0.032324 | 0.023821 | 0.096372 | 0.016094 | 0.004285 | 0.024228 | 0.000000 | 0.015022 |
| 289 | 0.005468 | 0.005447 | 0.061242 | 0.005511 | 0.010563 | 0.011376 | 0.000000 | 0.000000 | 0.023972 | 0.016795 | ... | 0.004575 | 0.000000 | 0.099231 | 0.041787 | 0.023804 | 0.012352 | 0.035075 | 0.017709 | 0.000000 | 0.005124 |
| 290 | 0.000000 | 0.000000 | 0.007489 | 0.000000 | 0.010980 | 0.000000 | 0.000000 | 0.000000 | 0.009967 | 0.000000 | ... | 0.009511 | 0.039579 | 0.000000 | 0.021718 | 0.233290 | 0.025678 | 0.018229 | 0.029452 | 0.000000 | 0.010652 |
| 291 | 0.004098 | 0.000000 | 0.005400 | 0.008260 | 0.011875 | 0.004263 | 0.004196 | 0.000000 | 0.000000 | 0.008391 | ... | 0.000000 | 0.000000 | 0.092079 | 0.028708 | 0.043327 | 0.000000 | 0.009858 | 0.007963 | 0.000000 | 0.007680 |
| 292 | 0.010646 | 0.000000 | 0.022795 | 0.002682 | 0.000000 | 0.000000 | 0.008175 | 0.002532 | 0.002334 | 0.000000 | ... | 0.002227 | 0.016217 | 0.016100 | 0.035594 | 0.051310 | 0.064129 | 0.032010 | 0.065509 | 0.000000 | 0.002494 |
| 293 | 0.000000 | 0.000000 | 0.135860 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.023933 | 0.000000 | 0.026265 | 0.021374 | 0.036231 | 0.005512 | 0.053429 | 0.000000 | 0.000000 |
294 rows × 1658 columns
In [140]:
#easy to program, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity
In [141]:
similarity=cosine_similarity(document_term_matrix)
#Note here that `cosine_similarity` can take
#an entire matrix as its argument
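Since the measure is described as easy to program, here is what a plain-Python version might look like. This is a sketch for intuition on toy vectors, not a replacement for the sklearn routine; `cosine` and `pairwise_cosine` are names invented for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def pairwise_cosine(rows):
    """Square matrix whose [i][j] entry compares row i with row j."""
    return [[cosine(u, v) for v in rows] for u in rows]

toy = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
for row in pairwise_cosine(toy):
    print([round(x, 3) for x in row])
```

The result is symmetric with ones on the diagonal, because every non-zero vector has cosine 1 with itself.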
In [28]:
#what'd we get?
similarity
Out[28]:
array([[ 1. , 0.4800956 , 0.47789776, ..., 0.41147251,
0.64107903, 0.49087961],
[ 0.4800956 , 1. , 0.65544451, ..., 0.31196723,
0.64506243, 0.47103725],
[ 0.47789776, 0.65544451, 1. , ..., 0.35760039,
0.70650975, 0.45047448],
...,
[ 0.41147251, 0.31196723, 0.35760039, ..., 1. ,
0.54765566, 0.29438818],
[ 0.64107903, 0.64506243, 0.70650975, ..., 0.54765566,
1. , 0.56796543],
[ 0.49087961, 0.47103725, 0.45047448, ..., 0.29438818,
0.56796543, 1. ]])
In [29]:
similarity.shape
Out[29]:
(294, 294)
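Each row of this square matrix holds one document's similarity to every document, itself included. The following sketch ranks the nearest neighbours in such a row; `most_similar` is a hypothetical helper and the scores are toy values, not taken from the corpus.

```python
def most_similar(similarity_row, own_index, top_n=5):
    """Indices of the top_n highest scores in a row, skipping the document itself."""
    ranked = sorted(
        (i for i in range(len(similarity_row)) if i != own_index),
        key=lambda i: similarity_row[i],
        reverse=True,
    )
    return ranked[:top_n]

# Toy row: document 2 compared with five documents (itself included, score 1.0).
row = [0.48, 0.66, 1.0, 0.31, 0.71]
print(most_similar(row, own_index=2, top_n=3))
```

With the notebook's names, `most_similar(similarity[100], 100)` would return row positions that could be looked up in `title_info`, assuming the text files were read in the same order as the rows of `toc.csv`, which is not verified here.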
In [30]:
similarity[100]
#this gives the similarity of document 100 to every document, itself included (the 1.0 at position 100)
Out[30]:
array([ 0.51351908, 0.37559544, 0.43518157, 0.55551997, 0.46036488,
0.5228255 , 0.56056587, 0.40698762, 0.29528043, 0.22663447,
0.43146373, 0.404392 , 0.44087255, 0.39765911, 0.75170812,
0.4563751 , 0.42273969, 0.53307405, 0.43241179, 0.52863713,
0.43245394, 0.50163583, 0.55255707, 0.36087616, 0.52151398,
0.44422755, 0.44204217, 0.44389512, 0.47398092, 0.31779759,
0.49221207, 0.46843142, 0.56495427, 0.63240387, 0.33069644,
0.39599524, 0.62049587, 0.51382474, 0.54573105, 0.46795562,
0.54572092, 0.27907631, 0.31736056, 0.38786122, 0.45626854,
0.33554468, 0.39141669, 0.39123794, 0.39268033, 0.39480131,
0.18210545, 0.42944428, 0.4641688 , 0.38645964, 0.44868397,
0.36583276, 0.30356115, 0.4207236 , 0.47623871, 0.74338913,
0.70385898, 0.456456 , 0.45573056, 0.41167837, 0.48595775,
0.4969808 , 0.51340512, 0.48723885, 0.39048843, 0.55268064,
0.44370596, 0.46620405, 0.59864036, 0.5453561 , 0.19284076,
0.41974821, 0.46478893, 0.28167759, 0.4344191 , 0.35451293,
0.42340573, 0.47227169, 0.43494646, 0.45913277, 0.43349771,
0.38247543, 0.39892623, 0.27673982, 0.34213901, 0.64612329,
0.3626567 , 0.42551363, 0.51860678, 0.56070145, 0.45950563,
0.37292824, 0.47305083, 0.55650062, 0.51663525, 0.47743723,
1. , 0.74504107, 0.5418364 , 0.446839 , 0.42183918,
0.46757646, 0.40592573, 0.43250842, 0.42077958, 0.45397224,
0.31944606, 0.54158949, 0.41347637, 0.37601824, 0.310596 ,
0.4538872 , 0.73505269, 0.25404726, 0.31252659, 0.36519023,
0.38497747, 0.38380178, 0.29363724, 0.41179583, 0.54176381,
0.73378491, 0.37935869, 0.40331248, 0.53966992, 0.37759265,
0.38172116, 0.4010848 , 0.36563491, 0.40777916, 0.44576063,
0.30141366, 0.21701385, 0.46957471, 0.49328697, 0.55739909,
0.42608044, 0.61105579, 0.41359103, 0.75236073, 0.56035999,
0.52365363, 0.43469125, 0.4805973 , 0.55483122, 0.30195174,
0.46054438, 0.55915787, 0.4253061 , 0.40968898, 0.34119052,
0.47001332, 0.47038965, 0.54626945, 0.37483056, 0.57350522,
0.80329149, 0.7243073 , 0.41505792, 0.35195044, 0.39476144,
0.72086426, 0.36628347, 0.32496974, 0.32097834, 0.63363847,
0.57753911, 0.36209657, 0.38444127, 0.35573989, 0.41862441,
0.43178962, 0.443835 , 0.5608355 , 0.444154 , 0.48375123,
0.28522753, 0.37559622, 0.47222523, 0.51180251, 0.41265869,
0.54065297, 0.57687612, 0.64723645, 0.47474714, 0.38660376,
0.68458167, 0.39413766, 0.38010527, 0.59400378, 0.57957167,
0.37098426, 0.43403555, 0.32639742, 0.36205656, 0.51786785,
0.41269349, 0.38620109, 0.50374064, 0.47638095, 0.33258194,
0.37493858, 0.443436 , 0.57791602, 0.65232685, 0.30007518,
0.3492306 , 0.32146507, 0.37118884, 0.45313594, 0.28738624,
0.58066833, 0.37931137, 0.44479053, 0.66170862, 0.39806686,
0.44628435, 0.413084 , 0.34530214, 0.32644008, 0.39825262,
0.35834356, 0.66729768, 0.52277553, 0.31125011, 0.32153004,
0.41817371, 0.34185216, 0.41832183, 0.3425447 , 0.45959116,
0.33285397, 0.45256228, 0.37477928, 0.31155022, 0.30429663,
0.52773434, 0.74161916, 0.3494954 , 0.47205325, 0.38833223,
0.51329957, 0.51852037, 0.38344333, 0.58984198, 0.26422658,
0.33376944, 0.42060176, 0.27627799, 0.35126916, 0.37932132,
0.3915702 , 0.45151486, 0.5374713 , 0.45565082, 0.37399019,
0.4328711 , 0.45998098, 0.40747782, 0.63226527, 0.51432353,
0.49040904, 0.5749578 , 0.57524544, 0.43183206, 0.40378449,
0.37574633, 0.21411994, 0.4464823 , 0.36864902, 0.32271726,
0.34755168, 0.38877693, 0.7058656 , 0.53344296, 0.40329998,
0.5777419 , 0.50958067, 0.54379815, 0.36165463, 0.48269786,
0.46860293, 0.42628022, 0.2602712 , 0.50098935, 0.41036843,
0.4474139 , 0.32264786, 0.52424594, 0.35982701])
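A long row of numbers like this is hard to read by eye. One way to make it usable is to sort it and keep the highest scores. The helper `most_similar` below is a hypothetical name, not part of the notebook or of `textmining_blackboxes`, and it is shown on a small made-up similarity matrix.

```python
import numpy as np

def most_similar(similarity_matrix, row, top_n=5):
    """Return (index, score) pairs for the rows most similar to `row`, excluding itself."""
    scores = np.asarray(similarity_matrix)[row]
    order = np.argsort(scores)[::-1]       # highest score first
    order = order[order != row][:top_n]    # drop the row's own entry (score 1.0)
    return [(int(i), float(scores[i])) for i in order]

# made-up 4 x 4 similarity matrix standing in for the real (294, 294) one
toy_similarity = np.array([[1.0, 0.2, 0.9, 0.5],
                           [0.2, 1.0, 0.1, 0.3],
                           [0.9, 0.1, 1.0, 0.4],
                           [0.5, 0.3, 0.4, 1.0]])

print(most_similar(toy_similarity, row=0, top_n=2))
# → [(2, 0.9), (3, 0.5)]
```

With the notebook's own variables the call would be `most_similar(similarity, 100)`. The returned indices can then be looked up in `title_info`, assuming the texts were loaded in the same order as the rows of `toc.csv`.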
This time we're interested in relations among the words, not the texts.
In other words, we're interested in the similarities between one column and another: one term and another term.
So we'll work with the transposed matrix--the term-document matrix rather than the document-term matrix--where each row is a term.
For a description of hierarchical clustering, look at the example at https://en.wikipedia.org/wiki/Hierarchical_clustering
In [31]:
term_document_matrix=document_term_matrix.T
# .T is the easy way to transpose a matrix
# in python's matrix packages (numpy, scipy).
In [32]:
# import a bunch of packages we need
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
In [33]:
#distance is 1-similarity, so:
dist=1-cosine_similarity(term_document_matrix)
# ward is an algorithm for hierarchical clustering
# (note: given a square 2-D array, scipy's ward treats each row as an observation vector)
linkage_matrix=ward(dist)
#plot dendrogram
f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()
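One subtlety in the cell above: scipy's linkage functions expect either a condensed distance vector or an array of observation vectors, so a square distance matrix passed directly is read as observations. The sketch below shows an alternative under stated assumptions, not the notebook's method. It converts the square matrix with `squareform` and uses average linkage in place of Ward, since Ward's method formally assumes Euclidean distances. The term names and numbers are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_similarity

# toy term-document matrix: 4 hypothetical terms (rows) across 3 documents (columns)
toy_terms = ["river", "boat", "church", "prayer"]
toy_term_document = np.array([[0.9, 0.1, 0.8],
                              [0.8, 0.2, 0.9],
                              [0.1, 0.9, 0.2],
                              [0.2, 0.8, 0.1]])

toy_dist = 1 - cosine_similarity(toy_term_document)
toy_dist = (toy_dist + toy_dist.T) / 2   # remove tiny floating-point asymmetry
np.fill_diagonal(toy_dist, 0.0)          # self-distance must be exactly 0
toy_dist = np.clip(toy_dist, 0.0, None)  # guard against tiny negative values

condensed = squareform(toy_dist)         # square (4, 4) -> condensed vector of length 6
toy_linkage = linkage(condensed, method="average")

# no_plot=True returns the leaf ordering without drawing anything
tree = dendrogram(toy_linkage, labels=toy_terms, no_plot=True)
print(tree["ivl"])
```

In the toy data the first two terms occur in the same documents, as do the last two, so the first two merges pair them up before the two groups are joined. Dropping `no_plot=True` and adding `orientation="right"` would draw the tree as in the cell above.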
In [34]:
vectorizer=TfidfVectorizer(min_df=.96, stop_words='english', use_idf=True)
#try a very high min_df: keep only the terms that appear in at least 96% of the texts
In [35]:
#rerun the model
document_term_matrix=vectorizer.fit_transform(our_texts)
vocab=vectorizer.get_feature_names()
#newer scikit-learn (1.2+) removed this method; use vectorizer.get_feature_names_out() there
In [36]:
#check the length of the vocab
len(vocab)
Out[36]:
52
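The vocabulary shrinks to 52 because a float `min_df` is read as a proportion of documents: with 294 texts and `min_df=.96`, a term must appear in at least 283 of them (0.96 × 294 = 282.24) to be kept. Here is a small sketch of the same effect on four made-up sentences; the sentences and the variable names `toy_texts` and `kept` are illustrative assumptions, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_texts = [
    "the river ran past the church",
    "the church bell rang at the river",
    "a river boat left the landing",
    "the river froze that winter",
]

kept = {}
for threshold in (0.25, 0.5, 1.0):
    # a float min_df is a proportion of documents, not a raw count
    toy_vectorizer = TfidfVectorizer(min_df=threshold, stop_words="english", use_idf=True)
    toy_vectorizer.fit(toy_texts)
    kept[threshold] = sorted(toy_vectorizer.vocabulary_)
    print(threshold, len(kept[threshold]), kept[threshold])
```

At a threshold of 0.5 only "church" and "river" survive, since each appears in at least two of the four sentences. At 1.0 only "river" remains, because it is the only non-stop word found in every sentence.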
In [37]:
#switch again to the term_document_matrix
term_document_matrix=document_term_matrix.T
In [38]:
dist=1-cosine_similarity(term_document_matrix)
linkage_matrix=ward(dist)
#plot dendrogram
f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()
Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.
. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. . . . a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.
Content source: matthewljones/computingincontext