In [1]:
import pandas as pd
wkext = pd.read_csv("extracts.csv")
In [2]:
wkext
Out[2]:
| | entitles | enextracts | detitles | deextracts |
|---|---|---|---|---|
| 0 | PEARL (programming language) | PEARL, or Process and experiment automation re... | PEARL | PEARL [pɜːɹl] ist eine Echtzeit- und Multitask... |
| 1 | Aachen Cathedral Treasury | The Aachen Cathedral Treasury (German: Aachene... | Aachener Domschatzkammer | Die Aachener Domschatzkammer präsentiert den K... |
| 2 | Bauhaus | Staatliches Bauhaus , commonly known simply as... | Bauhaus | Das Staatliche Bauhaus wurde 1919 von Walter G... |
| 3 | Boo (programming language) | Boo is an object-oriented, statically typed, g... | Boo (Programmiersprache) | Boo ist eine seit 2003 von Rodrigo Barreto de ... |
| 4 | Upper Harz Water Regale | The Upper Harz Water Regale (German: Oberharze... | Oberharzer Wasserregal | Das Oberharzer Wasserregal ist ein hauptsächli... |
| 5 | Aachen Cathedral | Aachen Cathedral, frequently referred to as th... | Aachener Dom | Der Aachener Dom, auch Aachener Münster oder A... |
| 6 | Synchronized Multimedia Integration Language | Synchronized Multimedia Integration Language (... | Synchronized Multimedia Integration Language | Vorlage:Infobox Dateiformat/Wartung/MagischeZa... |
| 7 | Scala (programming language) | Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio... | Scala (Programmiersprache) | Scala ist eine funktionale und objektorientier... |
| 8 | Gofer (programming language) | Gofer ("Good For Equational Reasoning") is an ... | Gofer | Gofer ist eine funktionale Programmiersprache,... |
| 9 | Lübeck | The Hanseatic City of Lübeck (pronounced [ˈlyː... | Lübeck | Die Hansestadt Lübeck (niederdeutsch: Lübęk, L... |
| 10 | Perl | Perl is a family of high-level, general-purpos... | Perl (Programmiersprache) | Perl [pɝːl] ist eine freie, plattformunabhängi... |
| 11 | Tcllib | Tcllib is a collection of packages available f... | Tcllib | Tcllib ist eine der populärsten Bibliotheken z... |
| 12 | Go (programming language) | Go, also commonly referred to as golang, is a ... | Go (Programmiersprache) | Go ist eine kompilierbare Programmiersprache, ... |
| 13 | COBOL | COBOL (/ˈkoʊbɒl/, an acronym for common busine... | COBOL | COBOL ist eine Programmiersprache, die in der ... |
| 14 | ML (programming language) | ML is a general-purpose functional programming... | ML (Programmiersprache) | Meta Language (ML) beschreibt eine Familie fun... |
| 15 | Haus am Horn | The Haus am Horn was built for the Weimar Bauh... | Musterhaus Am Horn | Das Musterhaus „Am Horn“ ist ein in Weimar err... |
| 16 | Rüdesheim am Rhein | Rüdesheim am Rhein is a winemaking town in the... | Rüdesheim am Rhein | Rüdesheim am Rhein ist eine Weinstadt im Rhein... |
| 17 | Hack (programming language) | Hack is a programming language for the HipHop ... | Hack (Programmiersprache) | Hack ist eine Neuimplementierung der Skriptspr... |
| 18 | Objective-C | Objective-C is a general-purpose, object-orien... | Objective-C | Objective-C, auch kurz ObjC genannt, erweitert... |
| 19 | Martin Luther's Birth House | Martin Luther's Birth House (German: Martin Lu... | Martin Luthers Geburtshaus | Bei dem sogenannten Luther-Geburtshaus handelt... |
| 20 | XProc | XProc is a W3C Recommendation to define an XML... | XProc | XProc (von englisch XML Processing) ist eine v... |
| 21 | Julia (programming language) | Julia is a high-level dynamic programming lang... | Julia (Programmiersprache) | Julia ist eine höhere High-Performance-Program... |
| 22 | XSLT | XSLT (Extensible Stylesheet Language Transform... | XSL Transformation | Vorlage:Infobox Dateiformat/Wartung/MagischeZa... |
| 23 | MetaPost | MetaPost refers to both a programming language... | MetaPost | MetaPost ist zum einen eine Programmiersprache... |
| 24 | Aula Palatina | The Basilica of Constantine (German: Konstanti... | Konstantinbasilika | Die Evangelische Kirche zum Erlöser (Konstanti... |
| 25 | Wadden Sea | The Wadden Sea (Dutch: Waddenzee, German: Watt... | Wattenmeer (Nordsee) | Das Wattenmeer der Nordsee ist eine im Wirkung... |
| 26 | Modelica | Modelica is an object-oriented, declarative, m... | Modelica | Modelica ist eine objektorientierte Beschreibu... |
| 27 | APL (programming language) | APL (named after the book A Programming Langua... | APL (Programmiersprache) | APL, abgekürzt für A Programming Language, ist... |
| 28 | Fortran | Fortran (previously FORTRAN, derived from Form... | Fortran | Fortran ist eine prozedurale und in ihrer neue... |
| 29 | Ruby (programming language) | Ruby is a dynamic, reflective, object-oriented... | Ruby (Programmiersprache) | Ruby (englisch für Rubin) ist eine höhere Prog... |
| ... | ... | ... | ... | ... |
| 141 | Trier Amphitheater | The Trier Amphitheater is a Roman Amphitheater... | Amphitheater (Trier) | Das Amphitheater in Trier (Augusta Treverorum)... |
| 142 | Bremen City Hall | The Bremen City Hall is the seat of the Presid... | Bremer Rathaus | Das Bremer Rathaus ist eines der bedeutendsten... |
| 143 | Euler (programming language) | Euler is a programming language created by Nik... | Euler (Programmiersprache) | Euler ist eine von Niklaus Wirth und Helmut We... |
| 144 | StepTalk | StepTalk is the official GNUstep scripting fra... | StepTalk | StepTalk ist das offizielle GNUstep Scripting-... |
| 145 | Standard ML | Standard ML (SML) is a general-purpose, modula... | Standard ML | Standard ML (SML) ist eine von ML abstammende ... |
| 146 | Lua (programming language) | Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu.... | Lua | Lua (portugiesisch für Mond) ist eine imperati... |
| 147 | Mercury (programming language) | Mercury is a functional logic programming lang... | Mercury (Programmiersprache) | Mercury ist eine stark an Prolog angelehnte Pr... |
| 148 | Opal (programming language) | OPAL (OPtimized Applicative Language) is a fun... | Opal (Programmiersprache) | OPAL (OPtimized Applicative Language) ist eine... |
| 149 | Pfaueninsel | Pfaueninsel ("Peacock Island") is an island in... | Pfaueninsel | Vorlage:Infobox Insel/Wartung/Höhe fehlt\nDie ... |
| 150 | Holstentor | The Holsten Gate ("Holstein Tor", later "Holst... | Holstentor | Das Holstentor („Holstein-Tor“) ist ein Stadtt... |
| 151 | Martin Luther's Death House | Martin Luther's Death House (German: Martin Lu... | Martin Luthers Sterbehaus | Martin Luthers Sterbehaus ist das Gebäude in d... |
| 152 | Imperial Palace Ingelheim | The Imperial Palace Ingelheim (German: Ingelhe... | Ingelheimer Kaiserpfalz | Die Ingelheimer Kaiserpfalz ist eine bedeutend... |
| 153 | D (programming language) | The D programming language is an object-orient... | D (Programmiersprache) | D ist eine Programmiersprache, die sich äußerl... |
| 154 | Lower Saxon Wadden Sea National Park | The Lower Saxon Wadden Sea National Park (Germ... | Nationalpark Niedersächsisches Wattenmeer | Der Nationalpark Niedersächsisches Wattenmeer ... |
| 155 | Lorsch Abbey | The Abbey of Lorsch (German: Reichsabtei Lorsc... | Kloster Lorsch | Das Kloster Lorsch war eine Benediktinerabtei ... |
| 156 | Stralsund | Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]... | Stralsund | Stralsund [ˈʃtʁaːlzʊnt] ist eine Stadt im Nord... |
| 157 | PHP | PHP is a server-side scripting language design... | PHP | PHP (rekursives Akronym und Backronym für „PHP... |
| 158 | C (programming language) | C (/ˈsiː/, as in the letter c) is a general-pu... | C (Programmiersprache) | C ist eine imperative Programmiersprache, die ... |
| 159 | Völklingen Ironworks | The Völklingen Ironworks (German: Völklinger H... | Völklinger Hütte | Die Völklinger Hütte ist ein 1873 gegründetes ... |
| 160 | Wieskirche | The Pilgrimage Church of Wies (German: Wieskir... | Wieskirche | Die Wieskirche ist eine bemerkenswert prächtig... |
| 161 | SuperCollider | SuperCollider is an environment and programmin... | SuperCollider | SuperCollider (SC) ist eine Programmierumgebun... |
| 162 | Sanssouci | The Sanssouci Palace (German: Schloss Sanssouc... | Sanssouci | Schloss Sanssouci (französisch sans souci ‚ohn... |
| 163 | Zollverein Coal Mine Industrial Complex | The Zollverein Coal Mine Industrial Complex (G... | Zeche Zollverein | Die Zeche Zollverein war ein von 1851 bis 1986... |
| 164 | Smalltalk | Smalltalk is an object-oriented, dynamically t... | Smalltalk (Programmiersprache) | Smalltalk ist ein Sammelbegriff einerseits für... |
| 165 | Tcl | Tcl (originally from Tool Command Language, bu... | Tcl | Tcl (Aussprache engl. tickle oder auch als Abk... |
| 166 | Strongtalk | Strongtalk is a Smalltalk environment with opt... | Strongtalk | Strongtalk ist eine Variante der Programmiersp... |
| 167 | Datalog | Datalog is a truly declarative logic programmi... | Datalog | Datalog ist eine Datenbank-Programmiersprache ... |
| 168 | Racket (programming language) | Racket (formerly named PLT Scheme) is a genera... | DrRacket | DrRacket (früher DrScheme) ist eine integriert... |
| 169 | Igel Column | The Igel Column is a multi-storeyed Roman sand... | Igeler Säule | Die Igeler Säule im Dorf Igel an der Mosel ist... |
| 170 | Cathedral of Trier | The High Cathedral of Saint Peter in Trier (Ge... | Trierer Dom | Die Hohe Domkirche St. Peter zu Trier ist die ... |
171 rows × 4 columns
In [3]:
wkext['enextracts']
Out[3]:
0 PEARL, or Process and experiment automation re...
1 The Aachen Cathedral Treasury (German: Aachene...
2 Staatliches Bauhaus , commonly known simply as...
3 Boo is an object-oriented, statically typed, g...
4 The Upper Harz Water Regale (German: Oberharze...
5 Aachen Cathedral, frequently referred to as th...
6 Synchronized Multimedia Integration Language (...
7 Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio...
8 Gofer ("Good For Equational Reasoning") is an ...
9 The Hanseatic City of Lübeck (pronounced [ˈlyː...
10 Perl is a family of high-level, general-purpos...
11 Tcllib is a collection of packages available f...
12 Go, also commonly referred to as golang, is a ...
13 COBOL (/ˈkoʊbɒl/, an acronym for common busine...
14 ML is a general-purpose functional programming...
15 The Haus am Horn was built for the Weimar Bauh...
16 Rüdesheim am Rhein is a winemaking town in the...
17 Hack is a programming language for the HipHop ...
18 Objective-C is a general-purpose, object-orien...
19 Martin Luther's Birth House (German: Martin Lu...
20 XProc is a W3C Recommendation to define an XML...
21 Julia is a high-level dynamic programming lang...
22 XSLT (Extensible Stylesheet Language Transform...
23 MetaPost refers to both a programming language...
24 The Basilica of Constantine (German: Konstanti...
25 The Wadden Sea (Dutch: Waddenzee, German: Watt...
26 Modelica is an object-oriented, declarative, m...
27 APL (named after the book A Programming Langua...
28 Fortran (previously FORTRAN, derived from Form...
29 Ruby is a dynamic, reflective, object-oriented...
...
141 The Trier Amphitheater is a Roman Amphitheater...
142 The Bremen City Hall is the seat of the Presid...
143 Euler is a programming language created by Nik...
144 StepTalk is the official GNUstep scripting fra...
145 Standard ML (SML) is a general-purpose, modula...
146 Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu....
147 Mercury is a functional logic programming lang...
148 OPAL (OPtimized Applicative Language) is a fun...
149 Pfaueninsel ("Peacock Island") is an island in...
150 The Holsten Gate ("Holstein Tor", later "Holst...
151 Martin Luther's Death House (German: Martin Lu...
152 The Imperial Palace Ingelheim (German: Ingelhe...
153 The D programming language is an object-orient...
154 The Lower Saxon Wadden Sea National Park (Germ...
155 The Abbey of Lorsch (German: Reichsabtei Lorsc...
156 Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]...
157 PHP is a server-side scripting language design...
158 C (/ˈsiː/, as in the letter c) is a general-pu...
159 The Völklingen Ironworks (German: Völklinger H...
160 The Pilgrimage Church of Wies (German: Wieskir...
161 SuperCollider is an environment and programmin...
162 The Sanssouci Palace (German: Schloss Sanssouc...
163 The Zollverein Coal Mine Industrial Complex (G...
164 Smalltalk is an object-oriented, dynamically t...
165 Tcl (originally from Tool Command Language, bu...
166 Strongtalk is a Smalltalk environment with opt...
167 Datalog is a truly declarative logic programmi...
168 Racket (formerly named PLT Scheme) is a genera...
169 The Igel Column is a multi-storeyed Roman sand...
170 The High Cathedral of Saint Peter in Trier (Ge...
Name: enextracts, dtype: object
In [4]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(wkext['enextracts'])
X
Out[4]:
<171x4930 sparse matrix of type '<type 'numpy.int64'>'
with 15380 stored elements in Compressed Sparse Row format>
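The 171x4930 matrix above holds one row per extract and one column per distinct token. As a rough, dependency-free sketch of that idea (not scikit-learn's actual implementation), the snippet below lowercases each text, tokenizes it with the same default pattern that `CountVectorizer` uses (two or more word characters), and counts tokens per document. The helper names and the two toy documents are hypothetical and are not taken from `extracts.csv`; the rows are dense lists here, whereas scikit-learn returns a sparse matrix.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens of two or more word characters."""
    return TOKEN_RE.findall(text.lower())


def count_matrix(docs):
    """Return (vocabulary, rows): a sorted vocabulary and one count row per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted(set(tok for toks in tokenized for tok in toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Toy documents (hypothetical, for illustration only).
docs = [
    "Perl is a family of languages, first released in 1987.",
    "Go is a language. Go was announced in 2009.",
]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
print(rows)
```

Note that numbers such as `1987` become vocabulary entries and sort ahead of the words, while the single-character token `a` is dropped by the pattern.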
In [5]:
X
Out[5]:
<171x4930 sparse matrix of type '<type 'numpy.int64'>'
with 15380 stored elements in Compressed Sparse Row format>
In [6]:
vectorizer.get_feature_names()
Out[6]:
[u'000',
u'01',
u'03',
u'05',
u'064',
u'083',
u'10',
u'100',
u'1010',
u'102',
u'1020',
u'10206',
u'1030',
u'105',
u'106',
u'109',
u'1090',
u'10th',
u'11',
u'1103',
u'111',
u'112',
u'115',
u'1165',
u'1170s',
u'11th',
u'12',
u'1209',
u'1234',
u'1248',
u'1280',
u'12th',
u'13',
u'1360',
u'1366',
u'14',
u'1404',
u'142',
u'144',
u'1464',
u'1473',
u'1483',
u'14882',
u'14th',
u'15',
u'150',
u'1504',
u'1531',
u'1533',
u'1546',
u'157',
u'15th',
u'16',
u'1689',
u'1693',
u'1696',
u'16th',
u'17',
u'170',
u'1720',
u'1729',
u'1730',
u'1738',
u'1740',
u'1740s',
u'1744',
u'1745',
u'1746',
u'1747',
u'1748',
u'1754',
u'1787',
u'1792',
u'1797',
u'17th',
u'18',
u'180',
u'1815',
u'1817',
u'1826',
u'1830',
u'1835',
u'1838',
u'1844',
u'1847',
u'1848',
u'1849',
u'185',
u'1850',
u'1851',
u'1852',
u'1853',
u'1856',
u'1859',
u'1862',
u'1876',
u'1880',
u'18th',
u'19',
u'1903',
u'1904',
u'1910',
u'1911',
u'1913',
u'1916',
u'1918',
u'1919',
u'192',
u'1920s',
u'1923',
u'1924',
u'1925',
u'1928',
u'1930',
u'1932',
u'1933',
u'1939',
u'1941',
u'1944',
u'1945',
u'1948',
u'1950',
u'1950s',
u'1954',
u'1957',
u'1958',
u'1959',
u'1960',
u'1960s',
u'1961',
u'1964',
u'1966',
u'1967',
u'1968',
u'1969',
u'1970',
u'1970s',
u'1972',
u'1973',
u'1976',
u'1977',
u'1978',
u'1979',
u'1980',
u'1980s',
u'1981',
u'1983',
u'1984',
u'1985',
u'1986',
u'1987',
u'1988',
u'1989',
u'1990',
u'1990s',
u'1991',
u'1992',
u'1993',
u'1994',
u'1995',
u'1996',
u'1997',
u'1998',
u'1999',
u'19th',
u'1st',
u'20',
u'200',
u'2000',
u'2001',
u'2002',
u'2003',
u'2004',
u'2005',
u'2006',
u'2007',
u'2008',
u'2009',
u'2010',
u'2011',
u'2012',
u'2013',
u'2014',
u'2015',
u'20th',
u'21',
u'213',
u'219',
u'22',
u'226',
u'23',
u'23270',
u'24',
u'240',
u'25',
u'250',
u'26',
u'260',
u'27',
u'278',
u'28',
u'283',
u'29',
u'298',
u'30',
u'300',
u'306',
u'31',
u'310',
u'32',
u'33',
u'334',
u'335',
u'337',
u'340',
u'345',
u'35',
u'350',
u'353',
u'360',
u'37',
u'39',
u'391',
u'3rd',
u'40',
u'41',
u'410',
u'42',
u'43',
u'438',
u'4410',
u'45',
u'47',
u'474',
u'4gl',
u'4th',
u'50',
u'500',
u'5000',
u'515',
u'526m',
u'55',
u'552',
u'56',
u'568',
u'590',
u'595',
u'60',
u'62',
u'635',
u'65',
u'66253',
u'67',
u'672',
u'68',
u'6th',
u'70',
u'700',
u'7185',
u'72',
u'73',
u'77',
u'80',
u'800',
u'80th',
u'83',
u'86',
u'8652',
u'87',
u'8th',
u'90',
u'900',
u'936',
u'95',
u'971',
u'983',
u'aachen',
u'aachener',
u'abandons',
u'abbey',
u'abbeys',
u'abbot',
u'abbreviated',
u'abilities',
u'ability',
u'able',
u'about',
u'above',
u'abstentions',
u'abstract',
u'abtei',
u'abundance',
u'academia',
u'academic',
u'academy',
u'accept',
u'acceptance',
u'accepted',
u'access',
u'accessed',
u'accessible',
u'accessing',
u'accompanying',
u'accordance',
u'according',
u'acid',
u'acm',
u'acoustic',
u'acquired',
u'acres',
u'acronym',
u'across',
u'act',
u'act1',
u'acted',
u'actions',
u'actionscript',
u'active',
u'actively',
u'activities',
u'actors',
u'actual',
u'actually',
u'ad',
u'ada',
u'adapt',
u'adaptation',
u'adapted',
u'add',
u'added',
u'adding',
u'addition',
u'additional',
u'additionally',
u'additions',
u'address',
u'adds',
u'adele',
u'adenine',
u'adhering',
u'adjoining',
u'adjustments',
u'administers',
u'administration',
u'administrative',
u'administrators',
u'adobe',
u'adolf',
u'adopted',
u'adopting',
u'adoption',
u'adult',
u'advanced',
u'advantage',
u'advantages',
u'advocated',
u'aesthetic',
u'affiliated',
u'affluent',
u'afforded',
u'afield',
u'aforementioned',
u'afsluitdijk',
u'after',
u'again',
u'against',
u'age',
u'ages',
u'agrarian',
u'agriculture',
u'ai',
u'aim',
u'aimed',
u'aims',
u'air',
u'aircraft',
u'aisled',
u'alain',
u'alan',
u'albeit',
u'albert',
u'albrechtsberg',
u'alexandrescu',
u'alfeld',
u'algebra',
u'algebraic',
u'algol',
u'algorithm',
u'algorithmic',
u'algorithms',
u'all',
u'allegedly',
u'allied',
u'allow',
u'allowing',
u'allows',
u'almost',
u'alone',
u'along',
u'alonzo',
u'alpine',
u'alps',
u'already',
u'also',
u'altar',
u'alte',
u'altenau',
u'alter',
u'alternative',
u'alternatively',
u'altes',
u'although',
u'altstadt',
u'always',
u'am',
u'amalia',
u'amalienburg',
u'amd64',
u'amended',
u'american',
u'ammianus',
u'among',
u'amongst',
u'amphitheater',
u'amphitheatre',
u'an',
u'anachronistic',
u'analysis',
u'anchor',
u'ancient',
u'and',
u'anders',
u'andreasberg',
u'andrei',
u'anhalt',
u'animals',
u'animation',
u'animations',
u'animorphic',
u'anna',
u'annihilated',
u'anniversary',
u'annotation',
u'annotations',
u'announced',
u'announcements',
u'annual',
u'annually',
u'anonymous',
u'another',
u'ansi',
u'answering',
u'antique',
u'antiquity',
u'any',
u'anything',
u'anywhere',
u'aot',
u'apache',
u'api',
u'apis',
u'apl',
u'app',
u'appear',
u'appearance',
u'appeared',
u'appears',
u'apple',
u'applet',
u'applets',
u'applicable',
u'application',
u'applications',
u'applicative',
u'applied',
u'approach',
u'approaches',
u'approved',
u'approximate',
u'approximately',
u'apps',
u'april',
u'apse',
u'apses',
u'arabia',
u'arabicus',
u'arbitrary',
u'archaeological',
u'archbishop',
u'archbishopric',
u'archdiocese',
u'arched',
u'architect',
u'architects',
u'architectural',
u'architecture',
u'architectures',
u'architekten',
u'architektur',
u'architrave',
u'archive',
u'are',
u'area',
u'areas',
u'aren',
u'arg1',
u'arg2',
u'arg3',
u'arguments',
u'arithmetic',
u'arm',
u'armstrong',
u'army',
u'arnim',
u'around',
u'arranged',
u'array',
u'arrays',
u'art',
u'arthritis',
u'article',
u'articulation',
u'artificial',
u'artistic',
u'artists',
u'arts',
u'artworks',
u'as',
u'ascii',
u'asked',
u'asm',
u'aspect',
u'aspects',
u'assembly',
u'assertions',
u'assignment',
u'assisted',
u'associated',
u'association',
u'associative',
u'assumption',
u'asymptote',
u'asynchronously',
u'at',
u'atomic',
u'atrium',
u'ats',
u'attached',
u'attains',
u'attempt',
u'attempts',
u'attend',
u'attics',
u'attracted',
u'attracting',
u'attraction',
u'attractions',
u'attributed',
u'attributes',
u'auckland',
u'audio',
u'aufsichts',
u'augmented',
u'august',
u'augusta',
u'augusteum',
u'augustinian',
u'augustus',
u'augustusburg',
u'aula',
u'aureus',
u'australia',
u'austria',
u'austrian',
u'author',
u'authoring',
u'authority',
u'authors',
u'autobahn',
u'automated',
u'automatic',
u'automatically',
u'automation',
u'availability',
u'available',
u'avalon',
u'average',
u'avoid',
u'award',
u'awarded',
u'awards',
u'aware',
u'away',
u'awk',
u'axis',
u'azelin',
u'b1',
u'b2b',
u'b2c',
u'babelsberg',
u'babylon',
u'bachtrompetengala',
u'back',
u'backend',
u'background',
u'backgrounds',
u'backronym',
u'backronyms',
u'backus',
u'bad',
u'baden',
u'bailiff',
u'balanced',
u'baldachin',
u'balk',
u'balthasar',
u'baltic',
u'bamberg',
u'banker',
u'banks',
u'banquet',
u'baptised',
u'barbara',
u'barbarathermen',
u'barock',
u'baroque',
u'barras',
u'bas',
u'base',
u'based',
u'basic',
u'basics',
u'basilica',
u'basin',
u'basis',
u'basket',
u'basser',
u'batch',
u'bath',
u'baths',
u'battista',
u'battle',
u'bauhaus',
u'bavaria',
u'bavarian',
u'bay',
u'bayreuth',
u'bc',
u'be',
u'bearing',
u'bears',
u'beautiful',
u'became',
u'because',
u'become',
u'becoming',
u'beech',
u'been',
u'before',
u'began',
u'begin',
u'beginning',
u'begun',
u'behavioral',
u'behest',
u'being',
u'belief',
u'bell',
u'belong',
u'belongs',
u'below',
u'benchmark',
u'bend',
u'benedictine',
u'benefit',
u'benscheidt',
u'bergpark',
u'berkeley',
u'berlin',
u'berliner',
u'berners',
u'bernward',
u'bertholdstein',
u'bertot',
u'besides',
u'best',
u'between',
u'beuronese',
u'bias',
u'bible',
u'bid',
u'big',
u'billeter',
u'billion',
u'binaries',
u'binary',
u'binding',
u'bingen',
u'bioinformatics',
u'biological',
u'biosphere',
u'biotechnology',
u'birds',
u'birth',
u'bishop',
u'bismarck',
u'bit',
u'bitmap',
u'bituminous',
u'bjarne',
u'black',
u'blasewitz',
u'bld',
u'blend',
u'block',
u'blocks',
u'blue',
u'board',
u'boats',
u'bobrow',
u'bode',
u'body',
u'boffrand',
u'bonn',
u'boo',
u'book',
u'books',
u'border',
u'bordering',
u'born',
u'borough',
u'borrow',
u'borrowed',
u'both',
u'bouman',
u'boundaries',
u'boundary',
u'bounded',
u'bounds',
u'box',
u'brace',
u'branches',
u'brandenburg',
u'bread',
u'break',
u'bremen',
u'bremer',
u'bretten',
u'breuer',
u'brian',
u'brick',
u'bridge',
u'bridges',
u'bright',
u'bring',
u'bringing',
u'brings',
u'britannicus',
u'broad',
u'bronze',
u'bronzeworks',
u'brook',
u'brother',
u'brought',
u'brow',
u'browser',
u'browsers',
u'bruckgraben',
u'bruno',
u'br\xfchl',
u'bsd',
u'buffer',
u'bugenhagen',
u'bugs',
u'build',
u'builders',
u'building',
u'buildings',
u'builds',
u'built',
u'bukovsk\xe9',
u'bull',
u'bulwark',
u'bundesland',
u'buntenbock',
u'burgtor',
u'burial',
u'buried',
u'burned',
u'burnt',
u'business',
u'but',
u'buttons',
u'by',
u'bytecode',
u'byzantine',
u'b\xe9zier',
u'cabinetry',
u'cadastral',
u'calculus',
u'call',
u'called',
u'calling',
u'calls',
u'came',
u'caml',
u'campanile',
u'can',
u'canada',
u'canal',
u'cannot',
u'canonical',
u'canton',
u'capabilities',
u'capability',
u'capital',
u'capitalist',
u'carboniferous',
u'care',
u'carefree',
u'carl',
u'carolingian',
u'carpathian',
u'carpathians',
u'carried',
u'carved',
u'case',
u'casting',
u'castle',
u'castles',
u'castra',
u'cast\xe9ran',
u'catalog',
u'categorical',
u'cathedral',
u'cathedralis',
u'cathedrals',
u'catholic',
u'catholicism',
u'catholique',
u'cause',
u'causeway',
u'causeways',
u'celebration',
u'center',
u'centers',
u'central',
u'centralized',
u'centre',
u'centric',
u'centuries',
u'century',
u'ceremony',
u'certain',
u'certified',
u'cgi',
u'chainsaw',
u'chamber',
u'chambers',
u'chandelier',
u'chandeliers',
u'changed',
u'changes',
u'changing',
u'channel',
u'chapel',
u'chapter',
u'character',
u'characterised',
u'characteristic',
u'characteristics',
u'characterized',
u'characters',
u'charge',
u'charged',
u'charlemagne',
u'charles',
u'charlottenburg',
u'charts',
u'checked',
u'checking',
u'checks',
u'chemistry',
u'childers',
u'chipperfield',
u'chivalric',
u'choir',
u'chornohora',
u'christ',
u'christian',
u'christianity',
u'chronicle',
u'chronicler',
u'church',
u'churches',
u'ch\xe2teau',
u'cii',
u'cim',
u'circle',
u'cistercian',
u'citadel',
u'cites',
u'cities',
u'city',
u'civic',
u'civilization',
u'cl',
u'claimed',
u'claims',
u'clang',
u'class',
u'classes',
u'classical',
u'classicism',
u'classification',
u'classified',
u'classpath',
u'clause',
u'clausthal',
u'clean',
u'clear',
u'clearly',
u'clemens',
u'clerestory',
u'cli',
u'client',
u'cloisters',
u'clojure',
u'clos',
u'close',
u'closed',
u'closely',
u'closer',
u'closure',
u'closures',
u'cloth',
u'cloud',
u'clt',
u'cluny',
u'cm',
u'cmdlet',
u'cmdlets',
u'co',
u'coal',
u'coast',
u'coastal',
u'coastline',
u'cobol',
u'cocoa',
u'codasyl',
u'codd',
u'code',
u'coded',
u'codex',
u'coining',
u'coking',
u'collaboration',
u'collected',
u'collection',
u'collections',
u'collective',
u'collegiate',
u'colmerauer',
u'cologne',
u'colon',
u'coloni',
u'colour',
u'colours',
u'column',
u'com',
u'combination',
u'combinations',
u'combinator',
u'combine',
u'combined',
u'combines',
u'combining',
u'come',
u'comes',
u'comfort',
u'coming',
u'command',
u'commandline',
u'commands',
u'commenced',
u'commerce',
u'commercial',
u'commercialize',
u'commercially',
u'commission',
u'commissioned',
u'committee',
u'common',
u'commonly',
u'communicate',
u'communications',
u'communist',
u'communities',
u'community',
u'compact',
...]
In [7]:
### Train an LDA model
from sklearn.decomposition import LatentDirichletAllocation
# Note: scikit-learn 0.19 and later rename n_topics to n_components.
hx_lda = LatentDirichletAllocation(n_topics=2)
In [8]:
hx_lda.fit(X)
/usr/local/lib/python2.7/site-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
DeprecationWarning)
Out[8]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.7, learning_method=None,
learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
mean_change_tol=0.001, n_jobs=1, n_topics=2, perp_tol=0.1,
random_state=None, topic_word_prior=None,
total_samples=1000000.0, verbose=0)
In [9]:
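def display_topics(model, feature_names, no_top_words):
    # Print the highest-weighted words of each fitted topic.
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort() is ascending, so the reversed slice yields the top words first.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))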
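The slice `[:-no_top_words - 1:-1]` in `display_topics` can be hard to read. The sketch below reproduces the same idea in plain Python, without NumPy: sort the indices by weight in ascending order, then walk backwards from the end for `n` steps. The function name `top_indices`, the five-word vocabulary, and the weights are all hypothetical.

```python
def top_indices(weights, n):
    """Return the indices of the n largest weights, largest first."""
    # Ascending order of indices by weight, analogous to numpy's argsort().
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Step backwards from the last element and stop after n items.
    return order[:-n - 1:-1]


# Hypothetical topic-word weights for a five-word vocabulary.
feature_names = ["abbey", "language", "markup", "the", "unesco"]
topic = [0.2, 3.5, 0.1, 9.0, 1.4]

top = top_indices(topic, 3)
print(" ".join(feature_names[i] for i in top))
```

With these weights the three highest-weighted words are `the`, `language`, and `unesco`, in that order.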
In [10]:
display_topics(hx_lda, vectorizer.get_feature_names(), 30)
Topic 0:
smil markup the and media presentations multimedia in to is it as language for transitions timing of synchronized items things images ˈsmaɪl recommended presenting layout video defines animations links embedding
Topic 1:
the of and in is to as it was language for by programming with on its world from that an has german heritage are used languages be site unesco germany
In [11]:
hx_lda.transform(X)
Out[11]:
array([[ 0.00736282, 0.99263718],
[ 0.0062294 , 0.9937706 ],
[ 0.00178884, 0.99821116],
[ 0.00661109, 0.99338891],
[ 0.00254491, 0.99745509],
[ 0.00712769, 0.99287231],
[ 0.50088338, 0.49911662],
[ 0.00286921, 0.99713079],
[ 0.00679048, 0.99320952],
[ 0.00389056, 0.99610944],
[ 0.002443 , 0.997557 ],
[ 0.01343562, 0.98656438],
[ 0.0056724 , 0.9943276 ],
[ 0.00170446, 0.99829554],
[ 0.00488133, 0.99511867],
[ 0.00173419, 0.99826581],
[ 0.01119253, 0.98880747],
[ 0.00670091, 0.99329909],
[ 0.00387866, 0.99612134],
[ 0.00287638, 0.99712362],
[ 0.00558484, 0.99441516],
[ 0.0062557 , 0.9937443 ],
[ 0.00497069, 0.99502931],
[ 0.0022877 , 0.9977123 ],
[ 0.00183597, 0.99816403],
[ 0.00220405, 0.99779595],
[ 0.00868614, 0.99131386],
[ 0.00757761, 0.99242239],
[ 0.00350382, 0.99649618],
[ 0.00924188, 0.99075812],
[ 0.00342568, 0.99657432],
[ 0.00948617, 0.99051383],
[ 0.00759745, 0.99240255],
[ 0.00222355, 0.99777645],
[ 0.00485542, 0.99514458],
[ 0.00201404, 0.99798596],
[ 0.00139094, 0.99860906],
[ 0.00755494, 0.99244506],
[ 0.00750094, 0.99249906],
[ 0.00668679, 0.99331321],
[ 0.02217576, 0.97782424],
[ 0.00256051, 0.99743949],
[ 0.01959135, 0.98040865],
[ 0.00353821, 0.99646179],
[ 0.00255031, 0.99744969],
[ 0.00330127, 0.99669873],
[ 0.00676441, 0.99323559],
[ 0.00414477, 0.99585523],
[ 0.0041971 , 0.9958029 ],
[ 0.00907336, 0.99092664],
[ 0.00221637, 0.99778363],
[ 0.00454837, 0.99545163],
[ 0.01391744, 0.98608256],
[ 0.00330768, 0.99669232],
[ 0.00323928, 0.99676072],
[ 0.01305206, 0.98694794],
[ 0.01197292, 0.98802708],
[ 0.00302438, 0.99697562],
[ 0.00185556, 0.99814444],
[ 0.00443083, 0.99556917],
[ 0.00323742, 0.99676258],
[ 0.00288704, 0.99711296],
[ 0.01264626, 0.98735374],
[ 0.0028242 , 0.9971758 ],
[ 0.00203629, 0.99796371],
[ 0.00255017, 0.99744983],
[ 0.00238098, 0.99761902],
[ 0.00298037, 0.99701963],
[ 0.01165519, 0.98834481],
[ 0.00384773, 0.99615227],
[ 0.00280909, 0.99719091],
[ 0.00361159, 0.99638841],
[ 0.00353192, 0.99646808],
[ 0.00206065, 0.99793935],
[ 0.00228759, 0.99771241],
[ 0.00708247, 0.99291753],
[ 0.00403342, 0.99596658],
[ 0.00237374, 0.99762626],
[ 0.00216516, 0.99783484],
[ 0.00736179, 0.99263821],
[ 0.00796453, 0.99203547],
[ 0.02778742, 0.97221258],
[ 0.00622771, 0.99377229],
[ 0.00450593, 0.99549407],
[ 0.00180621, 0.99819379],
[ 0.00351143, 0.99648857],
[ 0.00694048, 0.99305952],
[ 0.00255055, 0.99744945],
[ 0.0031681 , 0.9968319 ],
[ 0.00753067, 0.99246933],
[ 0.01343455, 0.98656545],
[ 0.00284663, 0.99715337],
[ 0.00498255, 0.99501745],
[ 0.00296434, 0.99703566],
[ 0.00797048, 0.99202952],
[ 0.00613253, 0.99386747],
[ 0.00442208, 0.99557792],
[ 0.02064132, 0.97935868],
[ 0.00435202, 0.99564798],
[ 0.00520979, 0.99479021],
[ 0.00213321, 0.99786679],
[ 0.0022229 , 0.9977771 ],
[ 0.01560154, 0.98439846],
[ 0.00430782, 0.99569218],
[ 0.00185556, 0.99814444],
[ 0.00715146, 0.99284854],
[ 0.00573681, 0.99426319],
[ 0.00661191, 0.99338809],
[ 0.00315292, 0.99684708],
[ 0.00268605, 0.99731395],
[ 0.00308211, 0.99691789],
[ 0.00458618, 0.99541382],
[ 0.00628956, 0.99371044],
[ 0.00927926, 0.99072074],
[ 0.0033559 , 0.9966441 ],
[ 0.02020525, 0.97979475],
[ 0.00246757, 0.99753243],
[ 0.00770874, 0.99229126],
[ 0.005154 , 0.994846 ],
[ 0.00940577, 0.99059423],
[ 0.00592464, 0.99407536],
[ 0.00295892, 0.99704108],
[ 0.00601872, 0.99398128],
[ 0.0049157 , 0.9950843 ],
[ 0.04160171, 0.95839829],
[ 0.00209306, 0.99790694],
[ 0.00372214, 0.99627786],
[ 0.00450776, 0.99549224],
[ 0.00509582, 0.99490418],
[ 0.00559192, 0.99440808],
[ 0.01564736, 0.98435264],
[ 0.00983779, 0.99016221],
[ 0.00292202, 0.99707798],
[ 0.0056018 , 0.9943982 ],
[ 0.00984645, 0.99015355],
[ 0.00587958, 0.99412042],
[ 0.0017517 , 0.9982483 ],
[ 0.00453035, 0.99546965],
[ 0.00365579, 0.99634421],
[ 0.00732188, 0.99267812],
[ 0.00367946, 0.99632054],
[ 0.01498933, 0.98501067],
[ 0.00347706, 0.99652294],
[ 0.01012064, 0.98987936],
[ 0.01119046, 0.98880954],
[ 0.00541566, 0.99458434],
[ 0.01235293, 0.98764707],
[ 0.00526143, 0.99473857],
[ 0.03103803, 0.96896197],
[ 0.00796582, 0.99203418],
[ 0.00540461, 0.99459539],
[ 0.00820147, 0.99179853],
[ 0.0046746 , 0.9953254 ],
[ 0.00452466, 0.99547534],
[ 0.00515599, 0.99484401],
[ 0.00488367, 0.99511633],
[ 0.00294553, 0.99705447],
[ 0.00198141, 0.99801859],
[ 0.0023011 , 0.9976989 ],
[ 0.01319537, 0.98680463],
[ 0.00186743, 0.99813257],
[ 0.00603358, 0.99396642],
[ 0.00131658, 0.99868342],
[ 0.00374901, 0.99625099],
[ 0.00500544, 0.99499456],
[ 0.00532612, 0.99467388],
[ 0.01010946, 0.98989054],
[ 0.00649398, 0.99350602],
[ 0.00294414, 0.99705586],
[ 0.00380921, 0.99619079],
[ 0.00417723, 0.99582277]])
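Each row of this array is one document's topic mixture, and the two weights sum to 1. One way to read it is to pick the dominant topic per document. The sketch below does this for four rows copied from the output above (rows 0, 5, 6, and 7), with titles taken from the `entitles` column; the helper name `dominant_topic` is hypothetical.

```python
# Rows copied from the hx_lda.transform(X) output, keyed by row index.
doc_topic = {
    0: [0.00736282, 0.99263718],
    5: [0.00712769, 0.99287231],
    6: [0.50088338, 0.49911662],
    7: [0.00286921, 0.99713079],
}
titles = {
    0: "PEARL (programming language)",
    5: "Aachen Cathedral",
    6: "Synchronized Multimedia Integration Language",
    7: "Scala (programming language)",
}


def dominant_topic(weights):
    """Index of the topic with the largest weight for one document."""
    return max(range(len(weights)), key=lambda k: weights[k])


for row, weights in sorted(doc_topic.items()):
    print(row, titles[row], dominant_topic(weights))

# Documents for which Topic 0 outweighs Topic 1.
topic0_docs = [row for row, w in sorted(doc_topic.items()) if dominant_topic(w) == 0]
print(topic0_docs)
```

Among these rows only the SMIL article leans toward Topic 0, and only narrowly; a programming language and a cathedral both land in Topic 1.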
In [ ]:
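## The result is poor: Topic 1 takes more than 0.95 of the weight for every document except row 6 (the SMIL article), and its top words are mostly stop words.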
In [ ]:
## Next steps: apply stemming, remove numbers, and remove sparse terms.
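A minimal sketch of those three clean-up steps in plain Python follows. It is an illustration under stated assumptions, not the notebook author's method: the suffix stripper `naive_stem` is a crude stand-in for a real stemmer such as Porter's, and the helper names, the `min_df=2` threshold, and the toy documents are hypothetical. In scikit-learn itself, `CountVectorizer` exposes `token_pattern`, `min_df`, and `stop_words` parameters that cover similar ground.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")
SUFFIXES = ("ing", "s")  # crude illustration only, far weaker than Porter stemming


def naive_stem(token):
    """Strip one common English suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop tokens containing digits, then stem."""
    tokens = TOKEN_RE.findall(text.lower())
    return [naive_stem(tok) for tok in tokens if not any(ch.isdigit() for ch in tok)]


def drop_sparse_terms(tokenized_docs, min_df=2):
    """Keep only terms that occur in at least min_df documents."""
    doc_freq = Counter(term for toks in tokenized_docs for term in set(toks))
    keep = set(term for term, df in doc_freq.items() if df >= min_df)
    return [[tok for tok in toks if tok in keep] for toks in tokenized_docs]


# Toy documents (hypothetical, for illustration only).
docs = [
    "Perl is a family of programming languages released in 1987.",
    "Go is a programming language announced in 2009.",
    "The abbey was founded in the 12th century.",
]
tokenized = [preprocess(doc) for doc in docs]
filtered = drop_sparse_terms(tokenized, min_df=2)
print(tokenized)
print(filtered)
```

After stemming, `languages` and `language` collapse into one term, the numeric tokens disappear, and terms that occur in only one document are dropped. Stop words such as `is` and `in` survive these steps, so they would need separate handling.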
Content source: chainsawriot/pycon2016hk_sklearn