In [1]:
import pandas as pd
wkext = pd.read_csv("extracts.csv")
In [2]:
wkext
Out[2]:
    | entitles | enextracts | detitles | deextracts
0   | PEARL (programming language) | PEARL, or Process and experiment automation re... | PEARL | PEARL [pɜːɹl] ist eine Echtzeit- und Multitask...
1   | Aachen Cathedral Treasury | The Aachen Cathedral Treasury (German: Aachene... | Aachener Domschatzkammer | Die Aachener Domschatzkammer präsentiert den K...
2   | Bauhaus | Staatliches Bauhaus , commonly known simply as... | Bauhaus | Das Staatliche Bauhaus wurde 1919 von Walter G...
3   | Boo (programming language) | Boo is an object-oriented, statically typed, g... | Boo (Programmiersprache) | Boo ist eine seit 2003 von Rodrigo Barreto de ...
4   | Upper Harz Water Regale | The Upper Harz Water Regale (German: Oberharze... | Oberharzer Wasserregal | Das Oberharzer Wasserregal ist ein hauptsächli...
5   | Aachen Cathedral | Aachen Cathedral, frequently referred to as th... | Aachener Dom | Der Aachener Dom, auch Aachener Münster oder A...
6   | Synchronized Multimedia Integration Language | Synchronized Multimedia Integration Language (... | Synchronized Multimedia Integration Language | Vorlage:Infobox Dateiformat/Wartung/MagischeZa...
7   | Scala (programming language) | Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio... | Scala (Programmiersprache) | Scala ist eine funktionale und objektorientier...
8   | Gofer (programming language) | Gofer ("Good For Equational Reasoning") is an ... | Gofer | Gofer ist eine funktionale Programmiersprache,...
9   | Lübeck | The Hanseatic City of Lübeck (pronounced [ˈlyː... | Lübeck | Die Hansestadt Lübeck (niederdeutsch: Lübęk, L...
10  | Perl | Perl is a family of high-level, general-purpos... | Perl (Programmiersprache) | Perl [pɝːl] ist eine freie, plattformunabhängi...
11  | Tcllib | Tcllib is a collection of packages available f... | Tcllib | Tcllib ist eine der populärsten Bibliotheken z...
12  | Go (programming language) | Go, also commonly referred to as golang, is a ... | Go (Programmiersprache) | Go ist eine kompilierbare Programmiersprache, ...
13  | COBOL | COBOL (/ˈkoʊbɒl/, an acronym for common busine... | COBOL | COBOL ist eine Programmiersprache, die in der ...
14  | ML (programming language) | ML is a general-purpose functional programming... | ML (Programmiersprache) | Meta Language (ML) beschreibt eine Familie fun...
15  | Haus am Horn | The Haus am Horn was built for the Weimar Bauh... | Musterhaus Am Horn | Das Musterhaus „Am Horn“ ist ein in Weimar err...
16  | Rüdesheim am Rhein | Rüdesheim am Rhein is a winemaking town in the... | Rüdesheim am Rhein | Rüdesheim am Rhein ist eine Weinstadt im Rhein...
17  | Hack (programming language) | Hack is a programming language for the HipHop ... | Hack (Programmiersprache) | Hack ist eine Neuimplementierung der Skriptspr...
18  | Objective-C | Objective-C is a general-purpose, object-orien... | Objective-C | Objective-C, auch kurz ObjC genannt, erweitert...
19  | Martin Luther's Birth House | Martin Luther's Birth House (German: Martin Lu... | Martin Luthers Geburtshaus | Bei dem sogenannten Luther-Geburtshaus handelt...
20  | XProc | XProc is a W3C Recommendation to define an XML... | XProc | XProc (von englisch XML Processing) ist eine v...
21  | Julia (programming language) | Julia is a high-level dynamic programming lang... | Julia (Programmiersprache) | Julia ist eine höhere High-Performance-Program...
22  | XSLT | XSLT (Extensible Stylesheet Language Transform... | XSL Transformation | Vorlage:Infobox Dateiformat/Wartung/MagischeZa...
23  | MetaPost | MetaPost refers to both a programming language... | MetaPost | MetaPost ist zum einen eine Programmiersprache...
24  | Aula Palatina | The Basilica of Constantine (German: Konstanti... | Konstantinbasilika | Die Evangelische Kirche zum Erlöser (Konstanti...
25  | Wadden Sea | The Wadden Sea (Dutch: Waddenzee, German: Watt... | Wattenmeer (Nordsee) | Das Wattenmeer der Nordsee ist eine im Wirkung...
26  | Modelica | Modelica is an object-oriented, declarative, m... | Modelica | Modelica ist eine objektorientierte Beschreibu...
27  | APL (programming language) | APL (named after the book A Programming Langua... | APL (Programmiersprache) | APL, abgekürzt für A Programming Language, ist...
28  | Fortran | Fortran (previously FORTRAN, derived from Form... | Fortran | Fortran ist eine prozedurale und in ihrer neue...
29  | Ruby (programming language) | Ruby is a dynamic, reflective, object-oriented... | Ruby (Programmiersprache) | Ruby (englisch für Rubin) ist eine höhere Prog...
... | ... | ... | ... | ...
141 | Trier Amphitheater | The Trier Amphitheater is a Roman Amphitheater... | Amphitheater (Trier) | Das Amphitheater in Trier (Augusta Treverorum)...
142 | Bremen City Hall | The Bremen City Hall is the seat of the Presid... | Bremer Rathaus | Das Bremer Rathaus ist eines der bedeutendsten...
143 | Euler (programming language) | Euler is a programming language created by Nik... | Euler (Programmiersprache) | Euler ist eine von Niklaus Wirth und Helmut We...
144 | StepTalk | StepTalk is the official GNUstep scripting fra... | StepTalk | StepTalk ist das offizielle GNUstep Scripting-...
145 | Standard ML | Standard ML (SML) is a general-purpose, modula... | Standard ML | Standard ML (SML) ist eine von ML abstammende ...
146 | Lua (programming language) | Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu.... | Lua | Lua (portugiesisch für Mond) ist eine imperati...
147 | Mercury (programming language) | Mercury is a functional logic programming lang... | Mercury (Programmiersprache) | Mercury ist eine stark an Prolog angelehnte Pr...
148 | Opal (programming language) | OPAL (OPtimized Applicative Language) is a fun... | Opal (Programmiersprache) | OPAL (OPtimized Applicative Language) ist eine...
149 | Pfaueninsel | Pfaueninsel ("Peacock Island") is an island in... | Pfaueninsel | Vorlage:Infobox Insel/Wartung/Höhe fehlt\nDie ...
150 | Holstentor | The Holsten Gate ("Holstein Tor", later "Holst... | Holstentor | Das Holstentor („Holstein-Tor“) ist ein Stadtt...
151 | Martin Luther's Death House | Martin Luther's Death House (German: Martin Lu... | Martin Luthers Sterbehaus | Martin Luthers Sterbehaus ist das Gebäude in d...
152 | Imperial Palace Ingelheim | The Imperial Palace Ingelheim (German: Ingelhe... | Ingelheimer Kaiserpfalz | Die Ingelheimer Kaiserpfalz ist eine bedeutend...
153 | D (programming language) | The D programming language is an object-orient... | D (Programmiersprache) | D ist eine Programmiersprache, die sich äußerl...
154 | Lower Saxon Wadden Sea National Park | The Lower Saxon Wadden Sea National Park (Germ... | Nationalpark Niedersächsisches Wattenmeer | Der Nationalpark Niedersächsisches Wattenmeer ...
155 | Lorsch Abbey | The Abbey of Lorsch (German: Reichsabtei Lorsc... | Kloster Lorsch | Das Kloster Lorsch war eine Benediktinerabtei ...
156 | Stralsund | Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]... | Stralsund | Stralsund [ˈʃtʁaːlzʊnt] ist eine Stadt im Nord...
157 | PHP | PHP is a server-side scripting language design... | PHP | PHP (rekursives Akronym und Backronym für „PHP...
158 | C (programming language) | C (/ˈsiː/, as in the letter c) is a general-pu... | C (Programmiersprache) | C ist eine imperative Programmiersprache, die ...
159 | Völklingen Ironworks | The Völklingen Ironworks (German: Völklinger H... | Völklinger Hütte | Die Völklinger Hütte ist ein 1873 gegründetes ...
160 | Wieskirche | The Pilgrimage Church of Wies (German: Wieskir... | Wieskirche | Die Wieskirche ist eine bemerkenswert prächtig...
161 | SuperCollider | SuperCollider is an environment and programmin... | SuperCollider | SuperCollider (SC) ist eine Programmierumgebun...
162 | Sanssouci | The Sanssouci Palace (German: Schloss Sanssouc... | Sanssouci | Schloss Sanssouci (französisch sans souci ‚ohn...
163 | Zollverein Coal Mine Industrial Complex | The Zollverein Coal Mine Industrial Complex (G... | Zeche Zollverein | Die Zeche Zollverein war ein von 1851 bis 1986...
164 | Smalltalk | Smalltalk is an object-oriented, dynamically t... | Smalltalk (Programmiersprache) | Smalltalk ist ein Sammelbegriff einerseits für...
165 | Tcl | Tcl (originally from Tool Command Language, bu... | Tcl | Tcl (Aussprache engl. tickle oder auch als Abk...
166 | Strongtalk | Strongtalk is a Smalltalk environment with opt... | Strongtalk | Strongtalk ist eine Variante der Programmiersp...
167 | Datalog | Datalog is a truly declarative logic programmi... | Datalog | Datalog ist eine Datenbank-Programmiersprache ...
168 | Racket (programming language) | Racket (formerly named PLT Scheme) is a genera... | DrRacket | DrRacket (früher DrScheme) ist eine integriert...
169 | Igel Column | The Igel Column is a multi-storeyed Roman sand... | Igeler Säule | Die Igeler Säule im Dorf Igel an der Mosel ist...
170 | Cathedral of Trier | The High Cathedral of Saint Peter in Trier (Ge... | Trierer Dom | Die Hohe Domkirche St. Peter zu Trier ist die ...
171 rows × 4 columns
In [3]:
wkext['enextracts']
Out[3]:
0 PEARL, or Process and experiment automation re...
1 The Aachen Cathedral Treasury (German: Aachene...
2 Staatliches Bauhaus , commonly known simply as...
3 Boo is an object-oriented, statically typed, g...
4 The Upper Harz Water Regale (German: Oberharze...
5 Aachen Cathedral, frequently referred to as th...
6 Synchronized Multimedia Integration Language (...
7 Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio...
8 Gofer ("Good For Equational Reasoning") is an ...
9 The Hanseatic City of Lübeck (pronounced [ˈlyː...
10 Perl is a family of high-level, general-purpos...
11 Tcllib is a collection of packages available f...
12 Go, also commonly referred to as golang, is a ...
13 COBOL (/ˈkoʊbɒl/, an acronym for common busine...
14 ML is a general-purpose functional programming...
15 The Haus am Horn was built for the Weimar Bauh...
16 Rüdesheim am Rhein is a winemaking town in the...
17 Hack is a programming language for the HipHop ...
18 Objective-C is a general-purpose, object-orien...
19 Martin Luther's Birth House (German: Martin Lu...
20 XProc is a W3C Recommendation to define an XML...
21 Julia is a high-level dynamic programming lang...
22 XSLT (Extensible Stylesheet Language Transform...
23 MetaPost refers to both a programming language...
24 The Basilica of Constantine (German: Konstanti...
25 The Wadden Sea (Dutch: Waddenzee, German: Watt...
26 Modelica is an object-oriented, declarative, m...
27 APL (named after the book A Programming Langua...
28 Fortran (previously FORTRAN, derived from Form...
29 Ruby is a dynamic, reflective, object-oriented...
...
141 The Trier Amphitheater is a Roman Amphitheater...
142 The Bremen City Hall is the seat of the Presid...
143 Euler is a programming language created by Nik...
144 StepTalk is the official GNUstep scripting fra...
145 Standard ML (SML) is a general-purpose, modula...
146 Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu....
147 Mercury is a functional logic programming lang...
148 OPAL (OPtimized Applicative Language) is a fun...
149 Pfaueninsel ("Peacock Island") is an island in...
150 The Holsten Gate ("Holstein Tor", later "Holst...
151 Martin Luther's Death House (German: Martin Lu...
152 The Imperial Palace Ingelheim (German: Ingelhe...
153 The D programming language is an object-orient...
154 The Lower Saxon Wadden Sea National Park (Germ...
155 The Abbey of Lorsch (German: Reichsabtei Lorsc...
156 Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]...
157 PHP is a server-side scripting language design...
158 C (/ˈsiː/, as in the letter c) is a general-pu...
159 The Völklingen Ironworks (German: Völklinger H...
160 The Pilgrimage Church of Wies (German: Wieskir...
161 SuperCollider is an environment and programmin...
162 The Sanssouci Palace (German: Schloss Sanssouc...
163 The Zollverein Coal Mine Industrial Complex (G...
164 Smalltalk is an object-oriented, dynamically t...
165 Tcl (originally from Tool Command Language, bu...
166 Strongtalk is a Smalltalk environment with opt...
167 Datalog is a truly declarative logic programmi...
168 Racket (formerly named PLT Scheme) is a genera...
169 The Igel Column is a multi-storeyed Roman sand...
170 The High Cathedral of Saint Peter in Trier (Ge...
Name: enextracts, dtype: object
In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([ent for ent in wkext['enextracts']])
X
Out[3]:
<171x4930 sparse matrix of type '<type 'numpy.int64'>'
with 15380 stored elements in Compressed Sparse Row format>
In [5]:
vectorizer.get_feature_names()
Out[5]:
[u'000',
u'01',
u'03',
u'05',
u'064',
u'083',
u'10',
u'100',
u'1010',
u'102',
u'1020',
u'10206',
u'1030',
u'105',
u'106',
u'109',
u'1090',
u'10th',
u'11',
u'1103',
u'111',
u'112',
u'115',
u'1165',
u'1170s',
u'11th',
u'12',
u'1209',
u'1234',
u'1248',
u'1280',
u'12th',
u'13',
u'1360',
u'1366',
u'14',
u'1404',
u'142',
u'144',
u'1464',
u'1473',
u'1483',
u'14882',
u'14th',
u'15',
u'150',
u'1504',
u'1531',
u'1533',
u'1546',
u'157',
u'15th',
u'16',
u'1689',
u'1693',
u'1696',
u'16th',
u'17',
u'170',
u'1720',
u'1729',
u'1730',
u'1738',
u'1740',
u'1740s',
u'1744',
u'1745',
u'1746',
u'1747',
u'1748',
u'1754',
u'1787',
u'1792',
u'1797',
u'17th',
u'18',
u'180',
u'1815',
u'1817',
u'1826',
u'1830',
u'1835',
u'1838',
u'1844',
u'1847',
u'1848',
u'1849',
u'185',
u'1850',
u'1851',
u'1852',
u'1853',
u'1856',
u'1859',
u'1862',
u'1876',
u'1880',
u'18th',
u'19',
u'1903',
u'1904',
u'1910',
u'1911',
u'1913',
u'1916',
u'1918',
u'1919',
u'192',
u'1920s',
u'1923',
u'1924',
u'1925',
u'1928',
u'1930',
u'1932',
u'1933',
u'1939',
u'1941',
u'1944',
u'1945',
u'1948',
u'1950',
u'1950s',
u'1954',
u'1957',
u'1958',
u'1959',
u'1960',
u'1960s',
u'1961',
u'1964',
u'1966',
u'1967',
u'1968',
u'1969',
u'1970',
u'1970s',
u'1972',
u'1973',
u'1976',
u'1977',
u'1978',
u'1979',
u'1980',
u'1980s',
u'1981',
u'1983',
u'1984',
u'1985',
u'1986',
u'1987',
u'1988',
u'1989',
u'1990',
u'1990s',
u'1991',
u'1992',
u'1993',
u'1994',
u'1995',
u'1996',
u'1997',
u'1998',
u'1999',
u'19th',
u'1st',
u'20',
u'200',
u'2000',
u'2001',
u'2002',
u'2003',
u'2004',
u'2005',
u'2006',
u'2007',
u'2008',
u'2009',
u'2010',
u'2011',
u'2012',
u'2013',
u'2014',
u'2015',
u'20th',
u'21',
u'213',
u'219',
u'22',
u'226',
u'23',
u'23270',
u'24',
u'240',
u'25',
u'250',
u'26',
u'260',
u'27',
u'278',
u'28',
u'283',
u'29',
u'298',
u'30',
u'300',
u'306',
u'31',
u'310',
u'32',
u'33',
u'334',
u'335',
u'337',
u'340',
u'345',
u'35',
u'350',
u'353',
u'360',
u'37',
u'39',
u'391',
u'3rd',
u'40',
u'41',
u'410',
u'42',
u'43',
u'438',
u'4410',
u'45',
u'47',
u'474',
u'4gl',
u'4th',
u'50',
u'500',
u'5000',
u'515',
u'526m',
u'55',
u'552',
u'56',
u'568',
u'590',
u'595',
u'60',
u'62',
u'635',
u'65',
u'66253',
u'67',
u'672',
u'68',
u'6th',
u'70',
u'700',
u'7185',
u'72',
u'73',
u'77',
u'80',
u'800',
u'80th',
u'83',
u'86',
u'8652',
u'87',
u'8th',
u'90',
u'900',
u'936',
u'95',
u'971',
u'983',
u'aachen',
u'aachener',
u'abandons',
u'abbey',
u'abbeys',
u'abbot',
u'abbreviated',
u'abilities',
u'ability',
u'able',
u'about',
u'above',
u'abstentions',
u'abstract',
u'abtei',
u'abundance',
u'academia',
u'academic',
u'academy',
u'accept',
u'acceptance',
u'accepted',
u'access',
u'accessed',
u'accessible',
u'accessing',
u'accompanying',
u'accordance',
u'according',
u'acid',
u'acm',
u'acoustic',
u'acquired',
u'acres',
u'acronym',
u'across',
u'act',
u'act1',
u'acted',
u'actions',
u'actionscript',
u'active',
u'actively',
u'activities',
u'actors',
u'actual',
u'actually',
u'ad',
u'ada',
u'adapt',
u'adaptation',
u'adapted',
u'add',
u'added',
u'adding',
u'addition',
u'additional',
u'additionally',
u'additions',
u'address',
u'adds',
u'adele',
u'adenine',
u'adhering',
u'adjoining',
u'adjustments',
u'administers',
u'administration',
u'administrative',
u'administrators',
u'adobe',
u'adolf',
u'adopted',
u'adopting',
u'adoption',
u'adult',
u'advanced',
u'advantage',
u'advantages',
u'advocated',
u'aesthetic',
u'affiliated',
u'affluent',
u'afforded',
u'afield',
u'aforementioned',
u'afsluitdijk',
u'after',
u'again',
u'against',
u'age',
u'ages',
u'agrarian',
u'agriculture',
u'ai',
u'aim',
u'aimed',
u'aims',
u'air',
u'aircraft',
u'aisled',
u'alain',
u'alan',
u'albeit',
u'albert',
u'albrechtsberg',
u'alexandrescu',
u'alfeld',
u'algebra',
u'algebraic',
u'algol',
u'algorithm',
u'algorithmic',
u'algorithms',
u'all',
u'allegedly',
u'allied',
u'allow',
u'allowing',
u'allows',
u'almost',
u'alone',
u'along',
u'alonzo',
u'alpine',
u'alps',
u'already',
u'also',
u'altar',
u'alte',
u'altenau',
u'alter',
u'alternative',
u'alternatively',
u'altes',
u'although',
u'altstadt',
u'always',
u'am',
u'amalia',
u'amalienburg',
u'amd64',
u'amended',
u'american',
u'ammianus',
u'among',
u'amongst',
u'amphitheater',
u'amphitheatre',
u'an',
u'anachronistic',
u'analysis',
u'anchor',
u'ancient',
u'and',
u'anders',
u'andreasberg',
u'andrei',
u'anhalt',
u'animals',
u'animation',
u'animations',
u'animorphic',
u'anna',
u'annihilated',
u'anniversary',
u'annotation',
u'annotations',
u'announced',
u'announcements',
u'annual',
u'annually',
u'anonymous',
u'another',
u'ansi',
u'answering',
u'antique',
u'antiquity',
u'any',
u'anything',
u'anywhere',
u'aot',
u'apache',
u'api',
u'apis',
u'apl',
u'app',
u'appear',
u'appearance',
u'appeared',
u'appears',
u'apple',
u'applet',
u'applets',
u'applicable',
u'application',
u'applications',
u'applicative',
u'applied',
u'approach',
u'approaches',
u'approved',
u'approximate',
u'approximately',
u'apps',
u'april',
u'apse',
u'apses',
u'arabia',
u'arabicus',
u'arbitrary',
u'archaeological',
u'archbishop',
u'archbishopric',
u'archdiocese',
u'arched',
u'architect',
u'architects',
u'architectural',
u'architecture',
u'architectures',
u'architekten',
u'architektur',
u'architrave',
u'archive',
u'are',
u'area',
u'areas',
u'aren',
u'arg1',
u'arg2',
u'arg3',
u'arguments',
u'arithmetic',
u'arm',
u'armstrong',
u'army',
u'arnim',
u'around',
u'arranged',
u'array',
u'arrays',
u'art',
u'arthritis',
u'article',
u'articulation',
u'artificial',
u'artistic',
u'artists',
u'arts',
u'artworks',
u'as',
u'ascii',
u'asked',
u'asm',
u'aspect',
u'aspects',
u'assembly',
u'assertions',
u'assignment',
u'assisted',
u'associated',
u'association',
u'associative',
u'assumption',
u'asymptote',
u'asynchronously',
u'at',
u'atomic',
u'atrium',
u'ats',
u'attached',
u'attains',
u'attempt',
u'attempts',
u'attend',
u'attics',
u'attracted',
u'attracting',
u'attraction',
u'attractions',
u'attributed',
u'attributes',
u'auckland',
u'audio',
u'aufsichts',
u'augmented',
u'august',
u'augusta',
u'augusteum',
u'augustinian',
u'augustus',
u'augustusburg',
u'aula',
u'aureus',
u'australia',
u'austria',
u'austrian',
u'author',
u'authoring',
u'authority',
u'authors',
u'autobahn',
u'automated',
u'automatic',
u'automatically',
u'automation',
u'availability',
u'available',
u'avalon',
u'average',
u'avoid',
u'award',
u'awarded',
u'awards',
u'aware',
u'away',
u'awk',
u'axis',
u'azelin',
u'b1',
u'b2b',
u'b2c',
u'babelsberg',
u'babylon',
u'bachtrompetengala',
u'back',
u'backend',
u'background',
u'backgrounds',
u'backronym',
u'backronyms',
u'backus',
u'bad',
u'baden',
u'bailiff',
u'balanced',
u'baldachin',
u'balk',
u'balthasar',
u'baltic',
u'bamberg',
u'banker',
u'banks',
u'banquet',
u'baptised',
u'barbara',
u'barbarathermen',
u'barock',
u'baroque',
u'barras',
u'bas',
u'base',
u'based',
u'basic',
u'basics',
u'basilica',
u'basin',
u'basis',
u'basket',
u'basser',
u'batch',
u'bath',
u'baths',
u'battista',
u'battle',
u'bauhaus',
u'bavaria',
u'bavarian',
u'bay',
u'bayreuth',
u'bc',
u'be',
u'bearing',
u'bears',
u'beautiful',
u'became',
u'because',
u'become',
u'becoming',
u'beech',
u'been',
u'before',
u'began',
u'begin',
u'beginning',
u'begun',
u'behavioral',
u'behest',
u'being',
u'belief',
u'bell',
u'belong',
u'belongs',
u'below',
u'benchmark',
u'bend',
u'benedictine',
u'benefit',
u'benscheidt',
u'bergpark',
u'berkeley',
u'berlin',
u'berliner',
u'berners',
u'bernward',
u'bertholdstein',
u'bertot',
u'besides',
u'best',
u'between',
u'beuronese',
u'bias',
u'bible',
u'bid',
u'big',
u'billeter',
u'billion',
u'binaries',
u'binary',
u'binding',
u'bingen',
u'bioinformatics',
u'biological',
u'biosphere',
u'biotechnology',
u'birds',
u'birth',
u'bishop',
u'bismarck',
u'bit',
u'bitmap',
u'bituminous',
u'bjarne',
u'black',
u'blasewitz',
u'bld',
u'blend',
u'block',
u'blocks',
u'blue',
u'board',
u'boats',
u'bobrow',
u'bode',
u'body',
u'boffrand',
u'bonn',
u'boo',
u'book',
u'books',
u'border',
u'bordering',
u'born',
u'borough',
u'borrow',
u'borrowed',
u'both',
u'bouman',
u'boundaries',
u'boundary',
u'bounded',
u'bounds',
u'box',
u'brace',
u'branches',
u'brandenburg',
u'bread',
u'break',
u'bremen',
u'bremer',
u'bretten',
u'breuer',
u'brian',
u'brick',
u'bridge',
u'bridges',
u'bright',
u'bring',
u'bringing',
u'brings',
u'britannicus',
u'broad',
u'bronze',
u'bronzeworks',
u'brook',
u'brother',
u'brought',
u'brow',
u'browser',
u'browsers',
u'bruckgraben',
u'bruno',
u'br\xfchl',
u'bsd',
u'buffer',
u'bugenhagen',
u'bugs',
u'build',
u'builders',
u'building',
u'buildings',
u'builds',
u'built',
u'bukovsk\xe9',
u'bull',
u'bulwark',
u'bundesland',
u'buntenbock',
u'burgtor',
u'burial',
u'buried',
u'burned',
u'burnt',
u'business',
u'but',
u'buttons',
u'by',
u'bytecode',
u'byzantine',
u'b\xe9zier',
u'cabinetry',
u'cadastral',
u'calculus',
u'call',
u'called',
u'calling',
u'calls',
u'came',
u'caml',
u'campanile',
u'can',
u'canada',
u'canal',
u'cannot',
u'canonical',
u'canton',
u'capabilities',
u'capability',
u'capital',
u'capitalist',
u'carboniferous',
u'care',
u'carefree',
u'carl',
u'carolingian',
u'carpathian',
u'carpathians',
u'carried',
u'carved',
u'case',
u'casting',
u'castle',
u'castles',
u'castra',
u'cast\xe9ran',
u'catalog',
u'categorical',
u'cathedral',
u'cathedralis',
u'cathedrals',
u'catholic',
u'catholicism',
u'catholique',
u'cause',
u'causeway',
u'causeways',
u'celebration',
u'center',
u'centers',
u'central',
u'centralized',
u'centre',
u'centric',
u'centuries',
u'century',
u'ceremony',
u'certain',
u'certified',
u'cgi',
u'chainsaw',
u'chamber',
u'chambers',
u'chandelier',
u'chandeliers',
u'changed',
u'changes',
u'changing',
u'channel',
u'chapel',
u'chapter',
u'character',
u'characterised',
u'characteristic',
u'characteristics',
u'characterized',
u'characters',
u'charge',
u'charged',
u'charlemagne',
u'charles',
u'charlottenburg',
u'charts',
u'checked',
u'checking',
u'checks',
u'chemistry',
u'childers',
u'chipperfield',
u'chivalric',
u'choir',
u'chornohora',
u'christ',
u'christian',
u'christianity',
u'chronicle',
u'chronicler',
u'church',
u'churches',
u'ch\xe2teau',
u'cii',
u'cim',
u'circle',
u'cistercian',
u'citadel',
u'cites',
u'cities',
u'city',
u'civic',
u'civilization',
u'cl',
u'claimed',
u'claims',
u'clang',
u'class',
u'classes',
u'classical',
u'classicism',
u'classification',
u'classified',
u'classpath',
u'clause',
u'clausthal',
u'clean',
u'clear',
u'clearly',
u'clemens',
u'clerestory',
u'cli',
u'client',
u'cloisters',
u'clojure',
u'clos',
u'close',
u'closed',
u'closely',
u'closer',
u'closure',
u'closures',
u'cloth',
u'cloud',
u'clt',
u'cluny',
u'cm',
u'cmdlet',
u'cmdlets',
u'co',
u'coal',
u'coast',
u'coastal',
u'coastline',
u'cobol',
u'cocoa',
u'codasyl',
u'codd',
u'code',
u'coded',
u'codex',
u'coining',
u'coking',
u'collaboration',
u'collected',
u'collection',
u'collections',
u'collective',
u'collegiate',
u'colmerauer',
u'cologne',
u'colon',
u'coloni',
u'colour',
u'colours',
u'column',
u'com',
u'combination',
u'combinations',
u'combinator',
u'combine',
u'combined',
u'combines',
u'combining',
u'come',
u'comes',
u'comfort',
u'coming',
u'command',
u'commandline',
u'commands',
u'commenced',
u'commerce',
u'commercial',
u'commercialize',
u'commercially',
u'commission',
u'commissioned',
u'committee',
u'common',
u'commonly',
u'communicate',
u'communications',
u'communist',
u'communities',
u'community',
u'compact',
...]
In [3]:
## Train an LDA model with two topics
## (n_topics is the scikit-learn 0.18 name; later releases call it n_components)
from sklearn.decomposition import LatentDirichletAllocation
hx_lda = LatentDirichletAllocation(n_topics=2)
In [4]:
hx_lda.fit(X)
/usr/local/lib/python2.7/site-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
DeprecationWarning)
Out[4]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.7, learning_method=None,
learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
mean_change_tol=0.001, n_jobs=1, n_topics=2, perp_tol=0.1,
random_state=None, topic_word_prior=None,
total_samples=1000000.0, verbose=0)
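The repr above shows random_state=None, so every fit starts from a different random initialisation, and the topic numbering and the weights can change from run to run. A minimal sketch of pinning the seed follows. It assumes a recent scikit-learn, where the argument is called n_components rather than n_topics, and it uses a small made-up count matrix instead of the Wikipedia extracts.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 6 documents x 4 terms.
counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [0, 0, 4, 1],
])


def fit_lda(seed):
    # n_components is the newer name for the n_topics argument used in this notebook.
    lda = LatentDirichletAllocation(n_components=2, random_state=seed)
    # fit_transform returns one row of topic weights per document.
    return lda.fit_transform(counts)


first = fit_lda(0)
second = fit_lda(0)

# Same seed and same data give identical document-topic weights.
print(np.allclose(first, second))
```

With the seed fixed, differences between two runs can be attributed to preprocessing changes rather than to the initialisation.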
In [5]:
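def display_topics(model, feature_names, no_top_words):
    # Print the no_top_words highest-weighted terms of each topic.
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort() is ascending, so slice backwards from the end.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))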
In [6]:
display_topics(hx_lda, vectorizer.get_feature_names(), 30)
Topic 0:
world german heritage palace city unesco germany church cathedral php site language web berlin javascript built town list building design located century used designed programming imperial park house architect smalltalk
Topic 1:
language programming languages used world developed heritage site trier computer german data standard germany code unesco systems logic general functional source object applications roman originally type purpose based software church
In [7]:
hx_lda.transform(X)
Out[7]:
array([[ 0.01125855, 0.98874145],
[ 0.98791055, 0.01208945],
[ 0.99650064, 0.00349936],
[ 0.00903497, 0.99096503],
[ 0.07740058, 0.92259942],
[ 0.98572206, 0.01427794],
[ 0.01114987, 0.98885013],
[ 0.00438829, 0.99561171],
[ 0.01223388, 0.98776612],
[ 0.99275179, 0.00724821],
[ 0.00422039, 0.99577961],
[ 0.02075849, 0.97924151],
[ 0.18076583, 0.81923417],
[ 0.00299732, 0.99700268],
[ 0.00765773, 0.99234227],
[ 0.99577153, 0.00422847],
[ 0.98015819, 0.01984181],
[ 0.01207095, 0.98792905],
[ 0.00660619, 0.99339381],
[ 0.0061292 , 0.9938708 ],
[ 0.00882477, 0.99117523],
[ 0.01191257, 0.98808743],
[ 0.01031458, 0.98968542],
[ 0.00336333, 0.99663667],
[ 0.99458708, 0.00541292],
[ 0.99486728, 0.00513272],
[ 0.01206013, 0.98793987],
[ 0.01459863, 0.98540137],
[ 0.0060873 , 0.9939127 ],
[ 0.12741868, 0.87258132],
[ 0.08738664, 0.91261336],
[ 0.95940517, 0.04059483],
[ 0.47433355, 0.52566645],
[ 0.99652861, 0.00347139],
[ 0.53136689, 0.46863311],
[ 0.00318828, 0.99681172],
[ 0.99719312, 0.00280688],
[ 0.98217177, 0.01782823],
[ 0.00848919, 0.99151081],
[ 0.98823806, 0.01176194],
[ 0.02684326, 0.97315674],
[ 0.99582709, 0.00417291],
[ 0.96863175, 0.03136825],
[ 0.99304325, 0.00695675],
[ 0.99388409, 0.00611591],
[ 0.00498225, 0.99501775],
[ 0.28810682, 0.71189318],
[ 0.99126946, 0.00873054],
[ 0.0058442 , 0.9941558 ],
[ 0.03766056, 0.96233944],
[ 0.00781794, 0.99218206],
[ 0.99173922, 0.00826078],
[ 0.91476606, 0.08523394],
[ 0.00571632, 0.99428368],
[ 0.00535841, 0.99464159],
[ 0.34694643, 0.65305357],
[ 0.83442168, 0.16557832],
[ 0.99526834, 0.00473166],
[ 0.00318849, 0.99681151],
[ 0.99385188, 0.00614812],
[ 0.42313535, 0.57686465],
[ 0.00470892, 0.99529108],
[ 0.97883952, 0.02116048],
[ 0.00462441, 0.99537559],
[ 0.83366438, 0.16633562],
[ 0.99542749, 0.00457251],
[ 0.00373739, 0.99626261],
[ 0.00630176, 0.99369824],
[ 0.97754406, 0.02245594],
[ 0.99377248, 0.00622752],
[ 0.99546507, 0.00453493],
[ 0.12605243, 0.87394757],
[ 0.251339 , 0.748661 ],
[ 0.75423514, 0.24576486],
[ 0.00383012, 0.99616988],
[ 0.01487717, 0.98512283],
[ 0.00599928, 0.99400072],
[ 0.0040383 , 0.9959617 ],
[ 0.02709003, 0.97290997],
[ 0.01238374, 0.98761626],
[ 0.98179454, 0.01820546],
[ 0.0515394 , 0.9484606 ],
[ 0.86179583, 0.13820417],
[ 0.44351645, 0.55648355],
[ 0.9965576 , 0.0034424 ],
[ 0.86288866, 0.13711134],
[ 0.64310184, 0.35689816],
[ 0.00409433, 0.99590567],
[ 0.99440788, 0.00559212],
[ 0.98878593, 0.01121407],
[ 0.09025289, 0.90974711],
[ 0.00454572, 0.99545428],
[ 0.99278828, 0.00721172],
[ 0.00490575, 0.99509425],
[ 0.98159639, 0.01840361],
[ 0.9913047 , 0.0086953 ],
[ 0.00775757, 0.99224243],
[ 0.03001479, 0.96998521],
[ 0.99145946, 0.00854054],
[ 0.41847653, 0.58152347],
[ 0.05919752, 0.94080248],
[ 0.00342762, 0.99657238],
[ 0.97488701, 0.02511299],
[ 0.46473553, 0.53526447],
[ 0.99674931, 0.00325069],
[ 0.58609383, 0.41390617],
[ 0.00975739, 0.99024261],
[ 0.23614264, 0.76385736],
[ 0.0066611 , 0.9933389 ],
[ 0.99530109, 0.00469891],
[ 0.00630965, 0.99369035],
[ 0.70196435, 0.29803565],
[ 0.98842729, 0.01157271],
[ 0.01531818, 0.98468182],
[ 0.00573751, 0.99426249],
[ 0.95546499, 0.04453501],
[ 0.00524703, 0.99475297],
[ 0.01203878, 0.98796122],
[ 0.27482821, 0.72517179],
[ 0.16758573, 0.83241427],
[ 0.01012464, 0.98987536],
[ 0.00473515, 0.99526485],
[ 0.98708983, 0.01291017],
[ 0.01061668, 0.98938332],
[ 0.04748299, 0.95251701],
[ 0.30220803, 0.69779197],
[ 0.52569715, 0.47430285],
[ 0.22375396, 0.77624604],
[ 0.99183667, 0.00816333],
[ 0.98988585, 0.01011415],
[ 0.96549059, 0.03450941],
[ 0.9823263 , 0.0176737 ],
[ 0.99478883, 0.00521117],
[ 0.01085557, 0.98914443],
[ 0.01560711, 0.98439289],
[ 0.0103397 , 0.9896603 ],
[ 0.8662738 , 0.1337262 ],
[ 0.00779734, 0.99220266],
[ 0.00640309, 0.99359691],
[ 0.01426252, 0.98573748],
[ 0.00580132, 0.99419868],
[ 0.04209608, 0.95790392],
[ 0.99296301, 0.00703699],
[ 0.0171586 , 0.9828414 ],
[ 0.02088636, 0.97911364],
[ 0.00899225, 0.99100775],
[ 0.01756862, 0.98243138],
[ 0.00890969, 0.99109031],
[ 0.05243268, 0.94756732],
[ 0.21753078, 0.78246922],
[ 0.98993268, 0.01006732],
[ 0.11359364, 0.88640636],
[ 0.9913634 , 0.0086366 ],
[ 0.73383947, 0.26616053],
[ 0.01109692, 0.98890308],
[ 0.01439353, 0.98560647],
[ 0.9947999 , 0.0052001 ],
[ 0.78905321, 0.21094679],
[ 0.00426158, 0.99573842],
[ 0.06894713, 0.93105287],
[ 0.99610928, 0.00389072],
[ 0.01060341, 0.98939659],
[ 0.99739211, 0.00260789],
[ 0.04575364, 0.95424636],
[ 0.8666028 , 0.1333972 ],
[ 0.00947924, 0.99052076],
[ 0.01784239, 0.98215761],
[ 0.01035174, 0.98964826],
[ 0.00518595, 0.99481405],
[ 0.00664504, 0.99335496],
[ 0.01328946, 0.98671054]])
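One way to read the array above is to take the largest weight in each row as that document's dominant topic and set it next to the article title. The sketch below does this for the first five rows only, copied from Out[7], with the matching titles from Out[2]. It assumes the rows of the array are in the same order as the rows of wkext, which holds because X was built from wkext['enextracts'] in order. On the full array, hx_lda.transform(X).argmax(axis=1) would give the same assignment.

```python
# First five titles from Out[2] and the matching rows from Out[7].
titles = [
    "PEARL (programming language)",
    "Aachen Cathedral Treasury",
    "Bauhaus",
    "Boo (programming language)",
    "Upper Harz Water Regale",
]
doc_topics = [
    [0.01125855, 0.98874145],
    [0.98791055, 0.01208945],
    [0.99650064, 0.00349936],
    [0.00903497, 0.99096503],
    [0.07740058, 0.92259942],
]

# Index of the largest weight in each row is the dominant topic.
dominant_topics = [weights.index(max(weights)) for weights in doc_topics]

for title, weights, topic in zip(titles, doc_topics, dominant_topics):
    print("%-30s topic %d (weight %.2f)" % (title, topic, weights[topic]))
```

In this sample the two programming languages land in topic 1, but so does Upper Harz Water Regale, which is not a programming language article.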
In [ ]:
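## The result is poor: both topics mix programming-language terms (php, javascript, smalltalk in Topic 0)
## with heritage-site terms (heritage, unesco, church in Topic 1).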
In [ ]:
## Possible fixes: stemming, removing numbers, and removing sparse terms.
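Two of these steps can be tried with CountVectorizer options alone. The sketch below is an illustration under assumptions, not the notebook's own preprocessing: the four sentences are made up, the token_pattern is one possible letters-only pattern (it drops every token that contains a digit, so "10th" or "amd64" would go as well as "1248"), and min_df=2 is an arbitrary threshold for sparse terms. Stemming is left out because scikit-learn has no built-in stemmer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for wkext['enextracts'].
docs = [
    "PHP 5 was released in 2004 as a scripting language",
    "The cathedral was built in 1248 and is a heritage site",
    "Perl is a scripting language released in 1987",
    "The palace built in 1745 is a heritage site",
]

# Baseline: stop words removed, numbers still kept as terms.
baseline = CountVectorizer(stop_words='english').fit(docs)

# Letters-only tokens (Unicode-aware, so words like "lübeck" survive).
letters_only = r"(?u)\b[^\W\d_]{2,}\b"
no_numbers = CountVectorizer(stop_words='english',
                             token_pattern=letters_only).fit(docs)

# Additionally drop terms that occur in fewer than two documents.
pruned = CountVectorizer(stop_words='english',
                         token_pattern=letters_only,
                         min_df=2).fit(docs)

# vocabulary_ maps each kept term to its column index.
print(sorted(baseline.vocabulary_))
print(sorted(no_numbers.vocabulary_))
print(sorted(pruned.vocabulary_))
```

On the real corpus, min_df would need tuning: a high value also removes rare but distinctive terms.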
In [8]:
?CountVectorizer
In [9]:
## Remove English stop words
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words = 'english')
X = vectorizer.fit_transform([ent for ent in wkext['enextracts']])
X ### compare with the 171x4930 matrix obtained without stop-word removal
Out[9]:
<171x4715 sparse matrix of type '<type 'numpy.int64'>'
with 11275 stored elements in Compressed Sparse Row format>
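Compared with the first matrix, the vocabulary shrinks by only 215 terms (4930 to 4715), while the stored elements drop by 4105 (15380 to 11275): the stop list is short, but its words occur in almost every extract. The sketch below shows the same effect on two made-up sentences by comparing the vocabularies of a plain and a stop-word-filtered CountVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences, not taken from extracts.csv.
docs = [
    "Perl is a family of programming languages",
    "The cathedral is the seat of the bishop",
]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words='english').fit(docs)

# Terms present without filtering but dropped by the built-in English stop list.
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print(removed)
print(len(plain.vocabulary_), len(filtered.vocabulary_))
```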
In [11]:
## Removing stop words alone already improves the topics considerably.
hx_lda = LatentDirichletAllocation(n_topics = 2)
hx_lda.fit(X)
display_topics(hx_lda, vectorizer.get_feature_names(), 30)
/usr/local/lib/python2.7/site-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
DeprecationWarning)
Topic 0:
language programming world heritage german used languages site unesco germany city palace church code design php web developed applications standard object based berlin computer designed built javascript systems source logic
Topic 1:
cathedral world sea german wadden language xml national query park smalltalk group church germany data heritage st museum column w3c xquery new site unesco island islands web area roman lower
In [12]:
hx_lda.transform(X)
Out[12]:
array([[ 0.98800541, 0.01199459],
[ 0.64879927, 0.35120073],
[ 0.99656975, 0.00343025],
[ 0.99089643, 0.00910357],
[ 0.99320196, 0.00679804],
[ 0.97600225, 0.02399775],
[ 0.0177741 , 0.9822259 ],
[ 0.99553819, 0.00446181],
[ 0.9859716 , 0.0140284 ],
[ 0.90944155, 0.09055845],
[ 0.99536302, 0.00463698],
[ 0.96938282, 0.03061718],
[ 0.98644158, 0.01355842],
[ 0.99680692, 0.00319308],
[ 0.99213987, 0.00786013],
[ 0.99657195, 0.00342805],
[ 0.97183947, 0.02816053],
[ 0.96143614, 0.03856386],
[ 0.99265291, 0.00734709],
[ 0.07600145, 0.92399855],
[ 0.01029555, 0.98970445],
[ 0.99191858, 0.00808142],
[ 0.27100162, 0.72899838],
[ 0.99622245, 0.00377755],
[ 0.99545342, 0.00454658],
[ 0.00419417, 0.99580583],
[ 0.98919686, 0.01080314],
[ 0.98699526, 0.01300474],
[ 0.99458416, 0.00541584],
[ 0.98517667, 0.01482333],
[ 0.00657912, 0.99342088],
[ 0.98284057, 0.01715943],
[ 0.98606636, 0.01393364],
[ 0.00360207, 0.99639793],
[ 0.64198841, 0.35801159],
[ 0.99629105, 0.00370895],
[ 0.99692623, 0.00307377],
[ 0.76887519, 0.23112481],
[ 0.99099367, 0.00900633],
[ 0.98635084, 0.01364916],
[ 0.97286381, 0.02713619],
[ 0.0050606 , 0.9949394 ],
[ 0.84716094, 0.15283906],
[ 0.99279975, 0.00720025],
[ 0.93192481, 0.06807519],
[ 0.40611027, 0.59388973],
[ 0.02709482, 0.97290518],
[ 0.98972564, 0.01027436],
[ 0.99364099, 0.00635901],
[ 0.97797457, 0.02202543],
[ 0.00427271, 0.99572729],
[ 0.98582521, 0.01417479],
[ 0.96855893, 0.03144107],
[ 0.9940625 , 0.0059375 ],
[ 0.23384698, 0.76615302],
[ 0.97593446, 0.02406554],
[ 0.98314721, 0.01685279],
[ 0.99405164, 0.00594836],
[ 0.99675335, 0.00324665],
[ 0.01489679, 0.98510321],
[ 0.99295362, 0.00704638],
[ 0.99482364, 0.00517636],
[ 0.97888039, 0.02111961],
[ 0.9955226 , 0.0044774 ],
[ 0.99692362, 0.00307638],
[ 0.99326799, 0.00673201],
[ 0.99649083, 0.00350917],
[ 0.99452047, 0.00547953],
[ 0.97157956, 0.02842044],
[ 0.99128258, 0.00871742],
[ 0.9911164 , 0.0088836 ],
[ 0.13963727, 0.86036273],
[ 0.99164313, 0.00835687],
[ 0.25509197, 0.74490803],
[ 0.99614288, 0.00385712],
[ 0.98492887, 0.01507113],
[ 0.99426226, 0.00573774],
[ 0.99599481, 0.00400519],
[ 0.99639637, 0.00360363],
[ 0.91262657, 0.08737343],
[ 0.16240765, 0.83759235],
[ 0.95700086, 0.04299914],
[ 0.98680831, 0.01319169],
[ 0.01162982, 0.98837018],
[ 0.99578993, 0.00421007],
[ 0.99485082, 0.00514918],
[ 0.98799995, 0.01200005],
[ 0.1780949 , 0.8219051 ],
[ 0.99273291, 0.00726709],
[ 0.98693473, 0.01306527],
[ 0.97103359, 0.02896641],
[ 0.99388829, 0.00611171],
[ 0.99211602, 0.00788398],
[ 0.99522821, 0.00477179],
[ 0.98436118, 0.01563882],
[ 0.98946858, 0.01053142],
[ 0.99274091, 0.00725909],
[ 0.96832052, 0.03167948],
[ 0.9909947 , 0.0090053 ],
[ 0.98971643, 0.01028357],
[ 0.00497063, 0.99502937],
[ 0.99567794, 0.00432206],
[ 0.97537681, 0.02462319],
[ 0.99253937, 0.00746063],
[ 0.99645301, 0.00354699],
[ 0.97687574, 0.02312426],
[ 0.99131206, 0.00868794],
[ 0.97787199, 0.02212801],
[ 0.99194665, 0.00805335],
[ 0.01418841, 0.98581159],
[ 0.99392123, 0.00607877],
[ 0.99128737, 0.00871263],
[ 0.98300365, 0.01699635],
[ 0.25143587, 0.74856413],
[ 0.01117574, 0.98882426],
[ 0.8458456 , 0.1541544 ],
[ 0.99540594, 0.00459406],
[ 0.98738292, 0.01261708],
[ 0.27770198, 0.72229802],
[ 0.97130698, 0.02869302],
[ 0.99093563, 0.00906437],
[ 0.99512929, 0.00487071],
[ 0.98571828, 0.01428172],
[ 0.40656324, 0.59343676],
[ 0.94998267, 0.05001733],
[ 0.36427944, 0.63572056],
[ 0.94313575, 0.05686425],
[ 0.98562799, 0.01437201],
[ 0.99179497, 0.00820503],
[ 0.98964292, 0.01035708],
[ 0.83879833, 0.16120167],
[ 0.98160005, 0.01839995],
[ 0.00563979, 0.99436021],
[ 0.98929476, 0.01070524],
[ 0.98438628, 0.01561372],
[ 0.98948603, 0.01051397],
[ 0.99698039, 0.00301961],
[ 0.9920895 , 0.0079105 ],
[ 0.00854652, 0.99145348],
[ 0.98655117, 0.01344883],
[ 0.99422719, 0.00577281],
[ 0.96224053, 0.03775947],
[ 0.99286517, 0.00713483],
[ 0.98242483, 0.01757517],
[ 0.97818156, 0.02181844],
[ 0.99140541, 0.00859459],
[ 0.98216889, 0.01783111],
[ 0.99105801, 0.00894199],
[ 0.95349473, 0.04650527],
[ 0.98411915, 0.01588085],
[ 0.98887191, 0.01112809],
[ 0.97459687, 0.02540313],
[ 0.99103332, 0.00896668],
[ 0.99329034, 0.00670966],
[ 0.00895418, 0.99104582],
[ 0.9903513 , 0.0096487 ],
[ 0.99430512, 0.00569488],
[ 0.99654122, 0.00345878],
[ 0.9959769 , 0.0040231 ],
[ 0.97697153, 0.02302847],
[ 0.99599364, 0.00400636],
[ 0.98882636, 0.01117364],
[ 0.99725518, 0.00274482],
[ 0.99297775, 0.00702225],
[ 0.10438408, 0.89561592],
[ 0.99087636, 0.00912364],
[ 0.98355432, 0.01644568],
[ 0.98849258, 0.01150742],
[ 0.99504155, 0.00495845],
[ 0.00672953, 0.99327047],
[ 0.99062756, 0.00937244]])
In [18]:
import numpy as np
np.argmax(hx_lda.transform(X), axis = 1)
Out[18]:
array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
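Taking the argmax throws away how confident each assignment is. As a rough sketch, the winning probability could be kept next to the label so that ambiguous documents stand out. The rows below are copied from Out[12] (documents 0, 1, 6, 22 and 45). The 0.8 cutoff and the names `doc_ids`, `doc_topic`, `confidence` and `ambiguous` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

# A few rows copied from the hx_lda.transform(X) output above
# (documents 0, 1, 6, 22 and 45); each row sums to 1.
doc_ids = np.array([0, 1, 6, 22, 45])
doc_topic = np.array([
    [0.98800541, 0.01199459],
    [0.64879927, 0.35120073],
    [0.0177741, 0.9822259],
    [0.27100162, 0.72899838],
    [0.40611027, 0.59388973],
])

labels = np.argmax(doc_topic, axis=1)    # most probable topic per document
confidence = np.max(doc_topic, axis=1)   # probability of that topic

# Hypothetical cutoff: treat anything below 0.8 as an ambiguous assignment.
CUTOFF = 0.8
ambiguous = doc_ids[confidence < CUTOFF]

print(labels)
print(ambiguous)
```

Documents flagged this way are the ones worth reading by hand before trusting the hard label.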
In [20]:
pd.set_option('display.max_rows', 1000)
pd.DataFrame({'wk_title': wkext['enextracts'], 'LDA_topic': np.argmax(hx_lda.transform(X), axis = 1)})
### Still very bad: most documents, cathedrals and towns included, land in topic 0.
Out[20]:
LDA_topic
wk_title
0
0
PEARL, or Process and experiment automation re...
1
0
The Aachen Cathedral Treasury (German: Aachene...
2
0
Staatliches Bauhaus , commonly known simply as...
3
0
Boo is an object-oriented, statically typed, g...
4
0
The Upper Harz Water Regale (German: Oberharze...
5
0
Aachen Cathedral, frequently referred to as th...
6
1
Synchronized Multimedia Integration Language (...
7
0
Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio...
8
0
Gofer ("Good For Equational Reasoning") is an ...
9
0
The Hanseatic City of Lübeck (pronounced [ˈlyː...
10
0
Perl is a family of high-level, general-purpos...
11
0
Tcllib is a collection of packages available f...
12
0
Go, also commonly referred to as golang, is a ...
13
0
COBOL (/ˈkoʊbɒl/, an acronym for common busine...
14
0
ML is a general-purpose functional programming...
15
0
The Haus am Horn was built for the Weimar Bauh...
16
0
Rüdesheim am Rhein is a winemaking town in the...
17
0
Hack is a programming language for the HipHop ...
18
0
Objective-C is a general-purpose, object-orien...
19
1
Martin Luther's Birth House (German: Martin Lu...
20
1
XProc is a W3C Recommendation to define an XML...
21
0
Julia is a high-level dynamic programming lang...
22
1
XSLT (Extensible Stylesheet Language Transform...
23
0
MetaPost refers to both a programming language...
24
0
The Basilica of Constantine (German: Konstanti...
25
1
The Wadden Sea (Dutch: Waddenzee, German: Watt...
26
0
Modelica is an object-oriented, declarative, m...
27
0
APL (named after the book A Programming Langua...
28
0
Fortran (previously FORTRAN, derived from Form...
29
0
Ruby is a dynamic, reflective, object-oriented...
30
1
Eibingen Abbey (in German Abtei St. Hildegard,...
31
0
Classical Weimar is a UNESCO World Heritage Si...
32
0
The Bauhaus Dessau Foundation is a Foundation ...
33
1
The Limes Germanicus (Latin for Germanic front...
34
0
MATLAB (matrix laboratory) is a multi-paradigm...
35
0
Extensible Application Markup Language (XAML, ...
36
0
The Dresden Elbe Valley is a former World Heri...
37
0
The Rammelsberg is a mountain, 635 metres (2,0...
38
0
In computer science, Coq is an interactive the...
39
0
The New Garden (German: Neuer Garten) in Potsd...
40
0
Haskell /ˈhæskəl/ is a standardized, general-p...
41
1
Cologne Cathedral (German: Kölner Dom) (Latin:...
42
0
Lorch am Rhein is a small town in the Rheingau...
43
0
The Imperial Palace of Goslar (German: Kaiserp...
44
0
The Speyer Cathedral, officially the Imperial ...
45
1
Oz is a multiparadigm programming language, de...
46
1
Eisleben is a town in Saxony-Anhalt, Germany. ...
47
0
The Wartburg is a castle originally built in t...
48
0
Erlang (/ˈɜrlæŋ/ ER-lang) is a general-purpose...
49
0
The Barbara Baths (German: Barbarathermen) are...
50
1
The Schleswig-Holstein Wadden Sea National Par...
51
0
Wismar (German pronunciation: [ˈvɪsmaʁ]) is a ...
52
0
There are 39 official UNESCO World Heritage Si...
53
0
Prolog is a general purpose logic programming ...
54
1
Logo is an educational programming language, d...
55
0
Maulbronn Monastery (German: Kloster Maulbronn...
56
0
Paul Graham (born 13 November 1964) is an Engl...
57
0
Babelsberg Palace (German: Schloss Babelsberg)...
58
0
F-logic (frame logic) is a knowledge represent...
59
1
The Würzburg Residence (German: Würzburger Res...
60
0
Self is an object-oriented programming languag...
61
0
Lout is a batch document formatter invented by...
62
0
The Siemensstadt Housing Estate (German: Großs...
63
0
Vala is an object-oriented programming languag...
64
0
Windows PowerShell is a task automation and co...
65
0
The Augustusburg and Falkenlust palaces is a h...
66
0
Common Lisp (CL) is a dialect of the Lisp prog...
67
0
Rebol (/ˈrɛbəl/ REB-əl; historically REBOL) is...
68
0
Goslar is a historic town in Lower Saxony, Ger...
69
0
The Fagus Factory (German: Fagus Fabrik or Fag...
70
0
The Bremen Roland is a statue of Roland, erect...
71
1
SPARQL (pronounced "sparkle", a recursive acro...
72
0
The Hercules monument is an important landmark...
73
1
The Web Ontology Language (OWL) is a family of...
74
0
The J programming language, developed in the e...
75
0
Metafont is a description language used to def...
76
0
OCaml (/oʊˈkæməl/ oh-KAM-əl), originally known...
77
0
C++ (pronounced as see plus plus, /ˈsiː plʌs p...
78
0
Lisp (historically, LISP) is a family of compu...
79
0
Clojure (pronounced like "closure") is a diale...
80
1
Not to be confused with the Melanchthonhaus (B...
81
0
Tcl/Java is a project to bridge Tcl and Java. ...
82
0
The Porta Nigra (Latin for black gate) is a la...
83
1
The Dessau-Wörlitz Garden Realm, also known as...
84
0
The Rhine Gorge is a popular name for the Uppe...
85
0
QML (Qt Meta Language or Qt Modeling Language)...
86
0
Io is a pure object-oriented programming langu...
87
1
MUMPS (Massachusetts General Hospital Utility ...
88
0
Prehistoric pile dwellings around the Alps is ...
89
0
The Messel Pit (German: Grube Messel) is a dis...
90
0
The Trier Imperial Baths (German: Kaisertherme...
91
0
SQL (/ˈɛs kjuː ˈɛl/, or /ˈsiːkwəl/; Structured...
92
0
Berlin Modernism Housing Estates (German: Sied...
93
0
Python is a widely used general-purpose, high-...
94
0
Babelsberg is the largest district of the Bran...
95
0
Bergpark Wilhelmshöhe is a unique landscape pa...
96
0
ATS (Applied Type System) is a programming lan...
97
0
newLISP is an open source scripting language i...
98
0
Glienicke Palace (German: Schloss Glienicke) i...
99
0
The Roman Monuments, Cathedral of St. Peter an...
100
1
Reichenau Island is an island in Lake Constanc...
101
0
Primeval Beech Forests of the Carpathians and ...
102
0
Bamberg (German pronunciation: [ˈbambɛɐ̯k]) is...
103
0
Curl is a reflective object-oriented programmi...
104
0
Weimar (German pronunciation: [ˈvaɪmaɐ]) is a ...
105
0
The Liebfrauenkirche (German for Church of Our...
106
0
F# (pronounced eff sharp) is a strongly typed,...
107
0
The Lutherhaus in Lutherstadt Wittenberg, begu...
108
0
A limes (/ˈlaɪmiːz/; Latin pl. limites) was a ...
109
1
Museum Island (German: Museumsinsel) is the na...
110
0
R is a programming language and software envir...
111
0
Palaces and Parks of Potsdam and Berlin refers...
112
0
Regensburg (German pronunciation: [ˈʁeɡənsbʊɐ̯...
113
1
Maple is a commercial computer algebra system ...
114
1
GNU Pascal (GPC) is a Pascal compiler composed...
115
0
The Muskau Park (German: Muskauer Park, offici...
116
0
Java is a general-purpose computer programming...
117
0
Lustre is a formally defined, declarative, and...
118
1
The Stadt- und Pfarrkirche St. Marien zu Witte...
119
0
The Imperial Abbey of Corvey (German: Stift Co...
120
0
C# (pronounced as see sharp) is a multi-paradi...
121
0
Embedded SQL is a method of combining the comp...
122
0
Quedlinburg (German pronunciation: [ˈkveːdlɪnb...
123
1
The Squeak programming language is a dialect o...
124
0
In computer science, Clean is a general-purpos...
125
1
Trier (German pronunciation: [ˈtʀiːɐ̯]; Luxemb...
126
0
Adenine, named after the nucleobase adenine, i...
127
0
Wittenberg, officially Lutherstadt Wittenberg,...
128
0
The Hufeisensiedlung ("Horseshoe Estate") is a...
129
0
The Protestant Church of the Redeemer (German:...
130
0
The Church of St. Michael (German: Michaeliski...
131
0
The Margravial Opera House (German: Markgräfli...
132
1
Hildesheim Cathedral (German: Hildesheimer Dom...
133
0
ISWIM is an abstract computer programming lang...
134
0
Pharo is an open source implementation of the ...
135
0
Miranda is a lazy, purely functional programmi...
136
0
JavaScript (/ˈdʒɑːvɑːˌskrɪpt/; JS) is a dynami...
137
0
CycL in computer science and artificial intell...
138
1
XQuery is a query and functional programming l...
139
0
The Joy programming language in computer scien...
140
0
Ada is a structured, statically typed, imperat...
141
0
The Trier Amphitheater is a Roman Amphitheater...
142
0
The Bremen City Hall is the seat of the Presid...
143
0
Euler is a programming language created by Nik...
144
0
StepTalk is the official GNUstep scripting fra...
145
0
Standard ML (SML) is a general-purpose, modula...
146
0
Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu....
147
0
Mercury is a functional logic programming lang...
148
0
OPAL (OPtimized Applicative Language) is a fun...
149
0
Pfaueninsel ("Peacock Island") is an island in...
150
0
The Holsten Gate ("Holstein Tor", later "Holst...
151
0
Martin Luther's Death House (German: Martin Lu...
152
0
The Imperial Palace Ingelheim (German: Ingelhe...
153
0
The D programming language is an object-orient...
154
1
The Lower Saxon Wadden Sea National Park (Germ...
155
0
The Abbey of Lorsch (German: Reichsabtei Lorsc...
156
0
Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]...
157
0
PHP is a server-side scripting language design...
158
0
C (/ˈsiː/, as in the letter c) is a general-pu...
159
0
The Völklingen Ironworks (German: Völklinger H...
160
0
The Pilgrimage Church of Wies (German: Wieskir...
161
0
SuperCollider is an environment and programmin...
162
0
The Sanssouci Palace (German: Schloss Sanssouc...
163
0
The Zollverein Coal Mine Industrial Complex (G...
164
1
Smalltalk is an object-oriented, dynamically t...
165
0
Tcl (originally from Tool Command Language, bu...
166
0
Strongtalk is a Smalltalk environment with opt...
167
0
Datalog is a truly declarative logic programmi...
168
0
Racket (formerly named PLT Scheme) is a genera...
169
1
The Igel Column is a multi-storeyed Roman sand...
170
0
The High Cathedral of Saint Peter in Trier (Ge...
In [29]:
### Remove sparse terms:
### drop terms that appear in fewer than 3 documents (min_df counts documents, not occurrences).
print(len(wkext['enextracts']))
threshold = 3.0 / len(wkext['enextracts'])
print(threshold)
171
0.0175438596491
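When `min_df` is a float, `CountVectorizer` reads it as a proportion of documents, which is why the threshold is written as 3 divided by the corpus size. The sketch below mimics that document-frequency filter in plain Python on a made-up corpus. The helper name `prune_by_doc_freq` and the toy documents are assumptions for illustration only, not the library's implementation.

```python
from collections import Counter

def prune_by_doc_freq(tokenized_docs, min_df):
    """Keep terms that occur in at least min_df (a fraction) of the documents."""
    n_docs = len(tokenized_docs)
    # set(doc) counts each term once per document: document frequency,
    # not the total number of occurrences.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    min_count = min_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df >= min_count)

# Toy corpus, invented for this example.
docs = [
    ["language", "programming", "language"],
    ["language", "cathedral"],
    ["cathedral", "unesco"],
    ["language", "cathedral", "lisp"],
]

# Same construction as the notebook's threshold: 3 documents out of n.
kept = prune_by_doc_freq(docs, min_df=3.0 / len(docs))
print(kept)
```

Note that "language" appears twice in the first document but still counts only once toward its document frequency.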
In [30]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words = 'english', min_df = threshold)
X = vectorizer.fit_transform(wkext['enextracts'])
X ### compare with 171x4930
Out[30]:
<171x995 sparse matrix of type '<type 'numpy.int64'>'
with 6835 stored elements in Compressed Sparse Row format>
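To put the two matrix summaries side by side, the density (stored elements divided by rows times columns) can be computed directly from the printed figures. This is only arithmetic on the numbers shown in Out[9] and Out[30]; the helper name `density` is an assumption.

```python
def density(n_rows, n_cols, n_stored):
    """Fraction of matrix cells that hold a non-zero count."""
    return float(n_stored) / (n_rows * n_cols)

# Figures copied from the two sparse-matrix summaries in this notebook.
before = density(171, 4715, 11275)   # stopwords removed only (Out[9])
after = density(171, 995, 6835)      # stopwords removed and min_df applied (Out[30])

vocab_kept = 995.0 / 4715            # share of the vocabulary that survives
counts_kept = 6835.0 / 11275         # share of the stored entries that survives

print(before, after)
print(vocab_kept, counts_kept)
```

The pruned matrix keeps roughly a fifth of the vocabulary but well over half of the stored entries, so most of what was dropped was rare terms that carried little co-occurrence signal.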
In [33]:
hx_lda = LatentDirichletAllocation(n_topics = 2)
hx_lda.fit(X)
display_topics(hx_lda, vectorizer.get_feature_names(), 30)
Topic 0:
language programming languages used web code php developed standard computer data object applications based source javascript systems logic general functional oriented type development originally features design purpose lisp smalltalk designed
Topic 1:
world german heritage unesco germany site church city palace cathedral trier roman berlin park st located town built list sea house building century area imperial wadden island martin national important
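`display_topics` is defined earlier in the notebook and is not shown in this section. A common way to write such a helper, sketched here on a made-up topic-word matrix rather than the fitted model, is to sort each row of weights and map the largest entries back to the vocabulary. The function name `top_words` and the toy weights are assumptions for illustration. With a fitted model, the matrix would presumably come from its `components_` attribute.

```python
import numpy as np

def top_words(topic_word, feature_names, n_top):
    """Return the n_top highest-weighted words for each topic row."""
    result = []
    for weights in topic_word:
        # argsort is ascending, so take the last n_top indices and reverse them.
        best = np.argsort(weights)[-n_top:][::-1]
        result.append([feature_names[i] for i in best])
    return result

# Toy stand-in for a topic-word matrix (2 topics x 5 terms), invented here.
feature_names = ["language", "cathedral", "programming", "unesco", "code"]
topic_word = np.array([
    [9.0, 0.2, 7.5, 0.1, 3.0],
    [0.3, 8.0, 0.2, 6.0, 0.1],
])

topics = top_words(topic_word, feature_names, 3)
for topic_idx, words in enumerate(topics):
    print("Topic %d: %s" % (topic_idx, " ".join(words)))
```

A word can rank highly in more than one topic, as "language" does in this toy example and as "language" and "german" did in the unpruned model above.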
In [35]:
pd.DataFrame({'wk_title': wkext['enextracts'], 'LDA_topic': np.argmax(hx_lda.transform(X), axis = 1)})
### The result is already very good: topic 0 collects the programming languages, topic 1 the heritage sites.
Out[35]:
LDA_topic
wk_title
0
0
PEARL, or Process and experiment automation re...
1
1
The Aachen Cathedral Treasury (German: Aachene...
2
1
Staatliches Bauhaus , commonly known simply as...
3
0
Boo is an object-oriented, statically typed, g...
4
1
The Upper Harz Water Regale (German: Oberharze...
5
1
Aachen Cathedral, frequently referred to as th...
6
0
Synchronized Multimedia Integration Language (...
7
0
Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio...
8
0
Gofer ("Good For Equational Reasoning") is an ...
9
1
The Hanseatic City of Lübeck (pronounced [ˈlyː...
10
0
Perl is a family of high-level, general-purpos...
11
0
Tcllib is a collection of packages available f...
12
0
Go, also commonly referred to as golang, is a ...
13
0
COBOL (/ˈkoʊbɒl/, an acronym for common busine...
14
0
ML is a general-purpose functional programming...
15
1
The Haus am Horn was built for the Weimar Bauh...
16
1
Rüdesheim am Rhein is a winemaking town in the...
17
0
Hack is a programming language for the HipHop ...
18
0
Objective-C is a general-purpose, object-orien...
19
1
Martin Luther's Birth House (German: Martin Lu...
20
0
XProc is a W3C Recommendation to define an XML...
21
0
Julia is a high-level dynamic programming lang...
22
0
XSLT (Extensible Stylesheet Language Transform...
23
0
MetaPost refers to both a programming language...
24
1
The Basilica of Constantine (German: Konstanti...
25
1
The Wadden Sea (Dutch: Waddenzee, German: Watt...
26
0
Modelica is an object-oriented, declarative, m...
27
0
APL (named after the book A Programming Langua...
28
0
Fortran (previously FORTRAN, derived from Form...
29
0
Ruby is a dynamic, reflective, object-oriented...
30
1
Eibingen Abbey (in German Abtei St. Hildegard,...
31
1
Classical Weimar is a UNESCO World Heritage Si...
32
1
The Bauhaus Dessau Foundation is a Foundation ...
33
1
The Limes Germanicus (Latin for Germanic front...
34
0
MATLAB (matrix laboratory) is a multi-paradigm...
35
0
Extensible Application Markup Language (XAML, ...
36
1
The Dresden Elbe Valley is a former World Heri...
37
1
The Rammelsberg is a mountain, 635 metres (2,0...
38
0
In computer science, Coq is an interactive the...
39
1
The New Garden (German: Neuer Garten) in Potsd...
40
0
Haskell /ˈhæskəl/ is a standardized, general-p...
41
1
Cologne Cathedral (German: Kölner Dom) (Latin:...
42
1
Lorch am Rhein is a small town in the Rheingau...
43
1
The Imperial Palace of Goslar (German: Kaiserp...
44
1
The Speyer Cathedral, officially the Imperial ...
45
0
Oz is a multiparadigm programming language, de...
46
1
Eisleben is a town in Saxony-Anhalt, Germany. ...
47
1
The Wartburg is a castle originally built in t...
48
0
Erlang (/ˈɜrlæŋ/ ER-lang) is a general-purpose...
49
1
The Barbara Baths (German: Barbarathermen) are...
50
1
The Schleswig-Holstein Wadden Sea National Par...
51
1
Wismar (German pronunciation: [ˈvɪsmaʁ]) is a ...
52
1
There are 39 official UNESCO World Heritage Si...
53
0
Prolog is a general purpose logic programming ...
54
0
Logo is an educational programming language, d...
55
1
Maulbronn Monastery (German: Kloster Maulbronn...
56
0
Paul Graham (born 13 November 1964) is an Engl...
57
1
Babelsberg Palace (German: Schloss Babelsberg)...
58
0
F-logic (frame logic) is a knowledge represent...
59
1
The Würzburg Residence (German: Würzburger Res...
60
0
Self is an object-oriented programming languag...
61
0
Lout is a batch document formatter invented by...
62
1
The Siemensstadt Housing Estate (German: Großs...
63
0
Vala is an object-oriented programming languag...
64
0
Windows PowerShell is a task automation and co...
65
1
The Augustusburg and Falkenlust palaces is a h...
66
0
Common Lisp (CL) is a dialect of the Lisp prog...
67
0
Rebol (/ˈrɛbəl/ REB-əl; historically REBOL) is...
68
1
Goslar is a historic town in Lower Saxony, Ger...
69
1
The Fagus Factory (German: Fagus Fabrik or Fag...
70
1
The Bremen Roland is a statue of Roland, erect...
71
0
SPARQL (pronounced "sparkle", a recursive acro...
72
1
The Hercules monument is an important landmark...
73
0
The Web Ontology Language (OWL) is a family of...
74
0
The J programming language, developed in the e...
75
0
Metafont is a description language used to def...
76
0
OCaml (/oʊˈkæməl/ oh-KAM-əl), originally known...
77
0
C++ (pronounced as see plus plus, /ˈsiː plʌs p...
78
0
Lisp (historically, LISP) is a family of compu...
79
0
Clojure (pronounced like "closure") is a diale...
80
1
Not to be confused with the Melanchthonhaus (B...
81
0
Tcl/Java is a project to bridge Tcl and Java. ...
82
1
The Porta Nigra (Latin for black gate) is a la...
83
1
The Dessau-Wörlitz Garden Realm, also known as...
84
1
The Rhine Gorge is a popular name for the Uppe...
85
0
QML (Qt Meta Language or Qt Modeling Language)...
86
0
Io is a pure object-oriented programming langu...
87
0
MUMPS (Massachusetts General Hospital Utility ...
88
1
Prehistoric pile dwellings around the Alps is ...
89
1
The Messel Pit (German: Grube Messel) is a dis...
90
1
The Trier Imperial Baths (German: Kaisertherme...
91
0
SQL (/ˈɛs kjuː ˈɛl/, or /ˈsiːkwəl/; Structured...
92
1
Berlin Modernism Housing Estates (German: Sied...
93
0
Python is a widely used general-purpose, high-...
94
1
Babelsberg is the largest district of the Bran...
95
1
Bergpark Wilhelmshöhe is a unique landscape pa...
96
0
ATS (Applied Type System) is a programming lan...
97
0
newLISP is an open source scripting language i...
98
1
Glienicke Palace (German: Schloss Glienicke) i...
99
1
The Roman Monuments, Cathedral of St. Peter an...
100
1
Reichenau Island is an island in Lake Constanc...
101
1
Primeval Beech Forests of the Carpathians and ...
102
1
Bamberg (German pronunciation: [ˈbambɛɐ̯k]) is...
103
0
Curl is a reflective object-oriented programmi...
104
1
Weimar (German pronunciation: [ˈvaɪmaɐ]) is a ...
105
1
The Liebfrauenkirche (German for Church of Our...
106
0
F# (pronounced eff sharp) is a strongly typed,...
107
1
The Lutherhaus in Lutherstadt Wittenberg, begu...
108
1
A limes (/ˈlaɪmiːz/; Latin pl. limites) was a ...
109
1
Museum Island (German: Museumsinsel) is the na...
110
0
R is a programming language and software envir...
111
1
Palaces and Parks of Potsdam and Berlin refers...
112
1
Regensburg (German pronunciation: [ˈʁeɡənsbʊɐ̯...
113
0
Maple is a commercial computer algebra system ...
114
0
GNU Pascal (GPC) is a Pascal compiler composed...
115
1
The Muskau Park (German: Muskauer Park, offici...
116
0
Java is a general-purpose computer programming...
117
0
Lustre is a formally defined, declarative, and...
118
1
The Stadt- und Pfarrkirche St. Marien zu Witte...
119
1
The Imperial Abbey of Corvey (German: Stift Co...
120
0
C# (pronounced as see sharp) is a multi-paradi...
121
0
Embedded SQL is a method of combining the comp...
122
1
Quedlinburg (German pronunciation: [ˈkveːdlɪnb...
123
0
The Squeak programming language is a dialect o...
124
0
In computer science, Clean is a general-purpos...
125
1
Trier (German pronunciation: [ˈtʀiːɐ̯]; Luxemb...
126
0
Adenine, named after the nucleobase adenine, i...
127
1
Wittenberg, officially Lutherstadt Wittenberg,...
128
1
The Hufeisensiedlung ("Horseshoe Estate") is a...
129
1
The Protestant Church of the Redeemer (German:...
130
1
The Church of St. Michael (German: Michaeliski...
131
1
The Margravial Opera House (German: Markgräfli...
132
1
Hildesheim Cathedral (German: Hildesheimer Dom...
133
0
ISWIM is an abstract computer programming lang...
134
0
Pharo is an open source implementation of the ...
135
0
Miranda is a lazy, purely functional programmi...
136
0
JavaScript (/ˈdʒɑːvɑːˌskrɪpt/; JS) is a dynami...
137
0
CycL in computer science and artificial intell...
138
0
XQuery is a query and functional programming l...
139
0
The Joy programming language in computer scien...
140
0
Ada is a structured, statically typed, imperat...
141
1
The Trier Amphitheater is a Roman Amphitheater...
142
1
The Bremen City Hall is the seat of the Presid...
143
0
Euler is a programming language created by Nik...
144
0
StepTalk is the official GNUstep scripting fra...
145
0
Standard ML (SML) is a general-purpose, modula...
146
0
Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu....
147
0
Mercury is a functional logic programming lang...
148
0
OPAL (OPtimized Applicative Language) is a fun...
149
1
Pfaueninsel ("Peacock Island") is an island in...
150
1
The Holsten Gate ("Holstein Tor", later "Holst...
151
1
Martin Luther's Death House (German: Martin Lu...
152
1
The Imperial Palace Ingelheim (German: Ingelhe...
153
0
The D programming language is an object-orient...
154
1
The Lower Saxon Wadden Sea National Park (Germ...
155
1
The Abbey of Lorsch (German: Reichsabtei Lorsc...
156
1
Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]...
157
0
PHP is a server-side scripting language design...
158
0
C (/ˈsiː/, as in the letter c) is a general-pu...
159
1
The Völklingen Ironworks (German: Völklinger H...
160
1
The Pilgrimage Church of Wies (German: Wieskir...
161
0
SuperCollider is an environment and programmin...
162
1
The Sanssouci Palace (German: Schloss Sanssouc...
163
1
The Zollverein Coal Mine Industrial Complex (G...
164
0
Smalltalk is an object-oriented, dynamically t...
165
0
Tcl (originally from Tool Command Language, bu...
166
0
Strongtalk is a Smalltalk environment with opt...
167
0
Datalog is a truly declarative logic programmi...
168
0
Racket (formerly named PLT Scheme) is a genera...
169
1
The Igel Column is a multi-storeyed Roman sand...
170
1
The High Cathedral of Saint Peter in Trier (Ge...
In [36]:
### One further improvement: removing numbers
vectorizer.get_feature_names()
Out[36]:
[u'000',
u'10',
u'11th',
u'12',
u'12th',
u'14',
u'14th',
u'15',
u'1740',
u'1744',
u'18',
u'1925',
u'1932',
u'1933',
u'1945',
u'1970s',
u'1973',
u'1977',
u'1980s',
u'1985',
u'1986',
u'1987',
u'1988',
u'1989',
u'1990',
u'1990s',
u'1991',
u'1992',
u'1993',
u'1994',
u'1995',
u'1996',
u'1997',
u'1998',
u'1999',
u'19th',
u'200',
u'2002',
u'2003',
u'2004',
u'2005',
u'2006',
u'2007',
u'2008',
u'2009',
u'2010',
u'2011',
u'2012',
u'2013',
u'2014',
u'22',
u'23',
u'30',
u'300',
u'50',
u'500',
u'60',
u'77',
u'80',
u'95',
u'abbey',
u'able',
u'academic',
u'access',
u'according',
u'acm',
u'acronym',
u'act',
u'ad',
u'ada',
u'added',
u'addition',
u'additionally',
u'administration',
u'administrative',
u'advanced',
u'ages',
u'air',
u'algebra',
u'allow',
u'allowing',
u'allows',
u'alps',
u'analysis',
u'ancient',
u'anhalt',
u'ansi',
u'antique',
u'application',
u'applications',
u'approximately',
u'april',
u'archbishop',
u'architect',
u'architects',
u'architectural',
u'architecture',
u'area',
u'areas',
u'array',
u'art',
u'artificial',
u'artists',
u'artworks',
u'assembly',
u'assisted',
u'associated',
u'assumption',
u'attraction',
u'august',
u'automatic',
u'available',
u'baltic',
u'banks',
u'baroque',
u'based',
u'basic',
u'bauhaus',
u'bavaria',
u'began',
u'beginning',
u'begun',
u'belief',
u'bell',
u'berlin',
u'best',
u'binary',
u'biosphere',
u'bishop',
u'body',
u'border',
u'born',
u'boundary',
u'brandenburg',
u'brick',
u'bridge',
u'browser',
u'browsers',
u'bsd',
u'build',
u'building',
u'buildings',
u'built',
u'bytecode',
u'calculus',
u'called',
u'came',
u'capabilities',
u'capital',
u'carl',
u'castle',
u'cathedral',
u'catholic',
u'center',
u'central',
u'centre',
u'centuries',
u'century',
u'changed',
u'changes',
u'chapel',
u'character',
u'characters',
u'checking',
u'church',
u'cities',
u'city',
u'class',
u'classes',
u'classical',
u'client',
u'close',
u'closed',
u'coast',
u'code',
u'collection',
u'collections',
u'column',
u'combination',
u'combined',
u'command',
u'commercial',
u'commissioned',
u'common',
u'commonly',
u'community',
u'company',
u'compilation',
u'compile',
u'compiled',
u'compiler',
u'compilers',
u'complete',
u'completed',
u'completely',
u'complex',
u'component',
u'components',
u'computation',
u'computer',
u'computing',
u'conceived',
u'concept',
u'concepts',
u'concurrency',
u'concurrent',
u'confused',
u'congregation',
u'connected',
u'considered',
u'consisting',
u'consists',
u'consortium',
u'constructed',
u'construction',
u'constructs',
u'containing',
u'contains',
u'content',
u'continued',
u'contract',
u'control',
u'core',
u'court',
u'covers',
u'create',
u'created',
u'creating',
u'creation',
u'critical',
u'cross',
u'cultural',
u'culturally',
u'current',
u'currently',
u'darmstadt',
u'data',
u'database',
u'databases',
u'david',
u'day',
u'death',
u'december',
u'decided',
u'declarative',
u'declared',
u'dedicated',
u'define',
u'defined',
u'definition',
u'department',
u'deployed',
u'der',
u'derived',
u'described',
u'description',
u'design',
u'designated',
u'designed',
u'desktop',
u'despite',
u'dessau',
u'developed',
u'developers',
u'development',
u'dialect',
u'dialects',
u'differences',
u'different',
u'dimensions',
u'directly',
u'discussion',
u'displayed',
u'distinctive',
u'distributed',
u'distribution',
u'district',
u'document',
u'documents',
u'does',
u'dom',
u'domain',
u'dutch',
u'dynamic',
u'earliest',
u'early',
u'east',
u'education',
u'educational',
u'efficient',
u'eisleben',
u'elbe',
u'element',
u'elements',
u'embedded',
u'emperor',
u'emperors',
u'empire',
u'encompasses',
u'end',
u'engine',
u'engineering',
u'england',
u'english',
u'ensemble',
u'entered',
u'environment',
u'environments',
u'erected',
u'especially',
u'established',
u'estates',
u'europe',
u'european',
u'evaluation',
u'eventually',
u'evolved',
u'example',
u'exceptional',
u'executable',
u'exist',
u'existed',
u'existence',
u'existing',
u'experimental',
u'explicit',
u'express',
u'expression',
u'expressive',
u'extended',
u'extensible',
u'extension',
u'extensions',
u'extensive',
u'extremely',
u'facilities',
u'facing',
u'fact',
u'family',
u'famous',
u'far',
u'fast',
u'feature',
u'features',
u'fewer',
u'file',
u'files',
u'finance',
u'finished',
u'flexible',
u'focus',
u'following',
u'form',
u'formal',
u'format',
u'formats',
u'formatting',
u'forms',
u'fortifications',
u'fortran',
u'foundation',
u'founded',
u'fourth',
u'frame',
u'framework',
u'frameworks',
u'france',
u'frederick',
u'free',
u'freely',
u'french',
u'friedrich',
u'frisian',
u'ft',
u'fully',
u'function',
u'functional',
u'functions',
u'garbage',
u'garden',
u'gardens',
u'gate',
u'gcc',
u'general',
u'generally',
u'generate',
u'generating',
u'generation',
u'generic',
u'georg',
u'german',
u'germany',
u'given',
u'giving',
u'gnu',
u'goal',
u'good',
u'gorge',
u'goslar',
u'gothic',
u'government',
u'grand',
u'graphical',
u'graphics',
u'great',
u'greatest',
u'gropius',
u'grounds',
u'group',
u'half',
u'hall',
u'hamburg',
u'hanseatic',
u'hardware',
u'harz',
u'haskell',
u'having',
u'heavy',
u'hectares',
u'height',
u'held',
u'heritage',
u'hesse',
u'high',
u'higher',
u'hildesheim',
u'hill',
u'historic',
u'historical',
u'historically',
u'history',
u'holstein',
u'holy',
u'home',
u'hosting',
u'house',
u'houses',
u'housing',
u'html',
u'ideas',
u'iec',
u'ii',
u'image',
u'imperative',
u'imperial',
u'implementation',
u'implementations',
u'implemented',
u'importance',
u'important',
u'include',
u'included',
u'includes',
u'including',
u'independent',
u'industrial',
u'industry',
u'inference',
u'influence',
u'influenced',
u'influential',
u'information',
u'initially',
u'input',
u'inscribed',
u'inside',
u'inspired',
u'instance',
u'institute',
u'integration',
u'intelligence',
u'intended',
u'interactive',
u'interface',
u'interfaces',
u'interior',
u'interiors',
u'international',
u'internet',
u'interpreted',
u'interpreter',
u'island',
u'islands',
u'iso',
u'italian',
u'iv',
u'january',
u'java',
u'javascript',
u'jit',
u'johann',
u'john',
u'july',
u'june',
u'just',
u'karl',
u'kassel',
u'key',
u'kilometres',
u'king',
u'kings',
u'km',
u'knowledge',
u'known',
u'kreis',
u'labs',
u'lady',
u'laid',
u'lambda',
u'land',
u'landscape',
u'language',
u'languages',
u'large',
u'larger',
u'largest',
u'late',
u'later',
u'latin',
u'leading',
u'league',
u'led',
u'left',
u'legendary',
u'length',
u'level',
u'libraries',
u'library',
u'license',
u'lies',
u'life',
u'like',
u'line',
u'lines',
u'linux',
u'lisp',
u'list',
u'listed',
u'living',
u'local',
u'located',
u'location',
u'logic',
u'long',
u'longer',
u'low',
u'lower',
u'ludwig',
u'luther',
u'lutherstadt',
u'l\xfcbeck',
u'mac',
u'machine',
u'macro',
u'main',
u'mainland',
u'mainly',
u'mainz',
u'major',
u'make',
u'makes',
u'making',
u'managed',
u'management',
u'managing',
u'manipulation',
u'march',
u'markup',
u'martin',
u'mary',
u'masterpiece',
u'mathematical',
u'meaning',
u'mecklenburg',
u'medieval',
u'memory',
u'metres',
u'metropolitan',
u'meyer',
u'mi',
u'michael',
u'microsoft',
u'middle',
u'miles',
u'million',
u'miranda',
u'ml',
u'model',
u'modeling',
u'modern',
u'modular',
u'module',
u'monastery',
u'monument',
u'monuments',
u'mountain',
u'mountains',
u'movement',
u'multi',
u'multiple',
u'municipality',
u'museum',
u'named',
u'national',
u'native',
u'natural',
u'nature',
u'near',
u'nearby',
u'need',
u'net',
u'netherlands',
u'network',
u'neumann',
u'new',
u'non',
u'north',
u'northern',
u'notable',
u'notably',
u'number',
u'numbers',
u'numerical',
u'object',
u'objective',
u'objects',
u'october',
u'official',
u'officially',
u'old',
u'oldest',
u'ontology',
u'open',
u'opened',
u'operating',
u'operations',
u'optimized',
u'optional',
u'order',
u'organization',
u'oriented',
u'original',
u'originally',
u'originated',
u'os',
u'ottonian',
u'output',
u'package',
u'packages',
u'palace',
u'palaces',
u'palatinate',
u'paradigm',
u'paradigms',
u'park',
u'parks',
u'particular',
u'particularly',
u'parts',
u'pascal',
u'passing',
u'paul',
u'pdf',
u'people',
u'perform',
u'performance',
u'period',
u'perl',
u'persius',
u'peter',
u'php',
u'place',
u'placed',
u'places',
u'plain',
u'plan',
u'planned',
u'plans',
u'platform',
u'platforms',
u'point',
u'points',
u'popular',
u'popularity',
u'population',
u'port',
u'portable',
u'ported',
u'possible',
u'postscript',
u'potsdam',
u'power',
u'practical',
u'pre',
u'present',
u'preservation',
u'preserved',
u'president',
u'primarily',
u'primary',
u'prince',
u'private',
u'procedural',
u'process',
u'processing',
u'produced',
u'produces',
u'products',
u'program',
u'programmer',
u'programmers',
u'programming',
u'programs',
u'project',
u'projects',
u'prolog',
u'pronounced',
u'pronunciation',
u'properties',
u'proprietary',
u'protected',
u'protection',
u'protestant',
u'prototype',
u'provide',
u'provided',
u'provides',
u'providing',
u'proving',
u'prussia',
u'prussian',
u'public',
u'published',
u'purely',
u'purpose',
u'python',
u'quality',
u'queries',
u'query',
u'rammelsberg',
u'range',
u'ranked',
u'ratified',
u'rdf',
u'real',
u'reasoning',
u'rebuilt',
u'recent',
u'recognized',
u'recommendation',
u'recursion',
u'reference',
u'referred',
u'refers',
u'reflective',
u'reformation',
u'region',
u'regions',
u'regular',
u'related',
u'relational',
u'released',
u'remains',
u'replaced',
u'represent',
u'representation',
u'represents',
u'requiring',
u'research',
u'reserve',
u'residence',
u'resource',
u'restoration',
u'restored',
u'resulted',
u'resulting',
u'revised',
u'rheingau',
u'rhine',
u'rhineland',
u'rich',
u'river',
u'road',
u'robert',
u'roman',
u'romanesque',
u'route',
u'run',
u'runs',
u'runtime',
u'saarland',
u'safety',
u'saint',
u'sandstone',
u'saxon',
u'saxony',
u'scala',
u'scale',
u'schema',
u'scheme',
u'schinkel',
u'schleswig',
u'schloss',
u'school',
u'science',
u'scientific',
u'scientists',
u'scripting',
u'scripts',
u'sculpture',
u'sea',
u'seat',
u'second',
u'section',
u'seen',
u'self',
u'semantic',
u'semantics',
u'separate',
u'september',
u'serve',
u'server',
u'servers',
u'set',
u'setting',
u'shell',
u'significant',
u'similar',
u'similarities',
u'similarly',
u'simple',
u'simply',
u'single',
u'site',
u'sites',
u'situated',
u'size',
u'slopes',
u'small',
u'smaller',
u'smalltalk',
u'software',
u'source',
u'sources',
u'south',
u'southwest',
u'space',
u'special',
u'specific',
u'specification',
u'specified',
u'sq',
u'sql',
u'square',
u'st',
u'standard',
u'standardization',
u'standardized',
u'standards',
u'stands',
u'starting',
u'state',
u'statements',
u'static',
u'statically',
u'statue',
u'status',
u'steps',
u'stood',
u'store',
u'strong',
u'strongly',
u'structure',
u'structured',
u'structures',
u'studio',
u'style',
u'subsequent',
u'subsequently',
u'subset',
u'substantial',
u'suited',
u'summer',
u'support',
u'supported',
u'supporting',
u'supports',
u'switzerland',
u'symbol',
u'symbolic',
u'syntactic',
u'syntax',
u'systems',
u'tasks',
u'taunus',
u'tcl',
u'team',
u'technical',
u'techniques',
u'technologies',
u'technology',
u'term',
u'terms',
u'tex',
u'text',
u'theorem',
u'time',
u'times',
u'today',
u'took',
u'tools',
u'total',
u'tourism',
u'towers',
u'town',
u'traditional',
u'transformations',
u'tree',
u'trees',
u'trier',
u'turned',
u'type',
u'typed',
u'types',
u'typical',
u'typically',
u'typing',
u'und',
u'unesco',
u'unique',
u'university',
u'unix',
u'unlike',
u'update',
u'upper',
u'urban',
u'use',
u'used',
u'useful',
u'user',
u'users',
u'uses',
u'using',
u'usually',
u'valley',
u'value',
u'values',
u'van',
u'variables',
u'variety',
u'various',
u'vector',
u'version',
u'versions',
u'view',
u'village',
u'virtual',
u'visited',
u'visitors',
u'visual',
u'von',
u'vorpommern',
u'w3c',
u'wadden',
u'wall',
u'walter',
u'wanted',
u'war',
u'water',
u'wattenmeer',
u'way',
u'web',
u'weimar',
u'west',
u'western',
u'westphalia',
u'wide',
u'widely',
u'widespread',
u'width',
u'william',
u'windows',
u'wine',
u'wittenberg',
u'work',
u'working',
u'works',
u'world',
u'writers',
u'written',
u'xml',
u'xquery',
u'year',
u'years',
u'zu']
In [72]:
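from sklearn.feature_extraction.text import CountVectorizer
# Keep only runs of ASCII letters as tokens, drop English stop words, and apply the min_df cutoff `threshold`.
# Note: the ASCII-only pattern splits words at digits, apostrophes and umlauts (e.g. u'l\xfcbeck' becomes 'l' and 'beck').
vectorizer = CountVectorizer(stop_words='english', min_df=threshold, token_pattern=r'[A-Za-z]+')
X = vectorizer.fit_transform(wkext['enextracts'])
X  # compare with 171x4930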
Out[72]:
<171x954 sparse matrix of type '<type 'numpy.int64'>'
with 6815 stored elements in Compressed Sparse Row format>
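The custom `token_pattern` changes how words are split, which may explain one- and two-letter entries such as 's', 'th', 'l' and 'beck' in the resulting vocabulary. The following is a minimal sketch, assuming Python 3 and assuming the vectorizer lowercases the text and then applies the pattern with a regex `findall`. The helper `tokenize` and the sample sentence are hypothetical illustrations and are not part of the notebook.

```python
import re

# Default CountVectorizer pattern: Unicode-aware words of two or more characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern used in the cell above: runs of ASCII letters only
ASCII_PATTERN = r"[A-Za-z]+"

def tokenize(text, pattern):
    """Lowercase the text and return every non-overlapping match of the pattern."""
    return re.findall(pattern, text.lower())

sample = "L\u00fcbeck in the 12th century, Luther's house"

default_tokens = tokenize(sample, DEFAULT_PATTERN)
ascii_tokens = tokenize(sample, ASCII_PATTERN)

print(default_tokens)
print(ascii_tokens)
```

With the ASCII-only pattern, the umlaut, the digits and the apostrophe all act as token boundaries, so fragments of the original words enter the vocabulary as separate terms.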
In [73]:
vectorizer.get_feature_names()
Out[73]:
[u'abbey',
u'able',
u'academic',
u'access',
u'according',
u'acm',
u'acronym',
u'act',
u'ad',
u'ada',
u'added',
u'addition',
u'additionally',
u'administration',
u'administrative',
u'advanced',
u'ages',
u'air',
u'algebra',
u'allow',
u'allowing',
u'allows',
u'alps',
u'analysis',
u'ancient',
u'anhalt',
u'ansi',
u'antique',
u'application',
u'applications',
u'approximately',
u'april',
u'archbishop',
u'architect',
u'architects',
u'architectural',
u'architecture',
u'area',
u'areas',
u'array',
u'art',
u'artificial',
u'artists',
u'artworks',
u'assembly',
u'assisted',
u'associated',
u'assumption',
u'attraction',
u'august',
u'automatic',
u'available',
u'b',
u'baltic',
u'banks',
u'baroque',
u'based',
u'basic',
u'bauhaus',
u'bavaria',
u'beck',
u'began',
u'beginning',
u'begun',
u'belief',
u'bell',
u'berlin',
u'best',
u'binary',
u'biosphere',
u'bishop',
u'body',
u'border',
u'born',
u'boundary',
u'brandenburg',
u'brick',
u'bridge',
u'browser',
u'browsers',
u'bsd',
u'build',
u'building',
u'buildings',
u'built',
u'bytecode',
u'c',
u'calculus',
u'called',
u'came',
u'capabilities',
u'capital',
u'carl',
u'castle',
u'cathedral',
u'catholic',
u'center',
u'central',
u'centre',
u'centuries',
u'century',
u'changed',
u'changes',
u'chapel',
u'character',
u'characters',
u'checking',
u'church',
u'cities',
u'city',
u'class',
u'classes',
u'classical',
u'client',
u'close',
u'closed',
u'coast',
u'code',
u'collection',
u'collections',
u'column',
u'combination',
u'combined',
u'command',
u'commercial',
u'commissioned',
u'common',
u'commonly',
u'community',
u'company',
u'compilation',
u'compile',
u'compiled',
u'compiler',
u'compilers',
u'complete',
u'completed',
u'completely',
u'complex',
u'component',
u'components',
u'computation',
u'computer',
u'computing',
u'conceived',
u'concept',
u'concepts',
u'concurrency',
u'concurrent',
u'confused',
u'congregation',
u'connected',
u'considered',
u'consisting',
u'consists',
u'consortium',
u'constructed',
u'construction',
u'constructs',
u'containing',
u'contains',
u'content',
u'continued',
u'contract',
u'control',
u'core',
u'court',
u'covers',
u'create',
u'created',
u'creating',
u'creation',
u'critical',
u'cross',
u'cultural',
u'culturally',
u'current',
u'currently',
u'd',
u'darmstadt',
u'data',
u'database',
u'databases',
u'david',
u'day',
u'death',
u'december',
u'decided',
u'declarative',
u'declared',
u'dedicated',
u'define',
u'defined',
u'definition',
u'department',
u'deployed',
u'der',
u'derived',
u'described',
u'description',
u'design',
u'designated',
u'designed',
u'desktop',
u'despite',
u'dessau',
u'developed',
u'developers',
u'development',
u'dialect',
u'dialects',
u'differences',
u'different',
u'dimensions',
u'directly',
u'discussion',
u'displayed',
u'distinctive',
u'distributed',
u'distribution',
u'district',
u'document',
u'documents',
u'does',
u'dom',
u'domain',
u'dutch',
u'dynamic',
u'e',
u'earliest',
u'early',
u'east',
u'education',
u'educational',
u'efficient',
u'eisleben',
u'elbe',
u'element',
u'elements',
u'embedded',
u'emperor',
u'emperors',
u'empire',
u'encompasses',
u'end',
u'engine',
u'engineering',
u'england',
u'english',
u'ensemble',
u'entered',
u'environment',
u'environments',
u'erected',
u'especially',
u'established',
u'estates',
u'europe',
u'european',
u'evaluation',
u'eventually',
u'evolved',
u'example',
u'exceptional',
u'executable',
u'exist',
u'existed',
u'existence',
u'existing',
u'experimental',
u'explicit',
u'express',
u'expression',
u'expressive',
u'extended',
u'extensible',
u'extension',
u'extensions',
u'extensive',
u'extremely',
u'f',
u'facilities',
u'facing',
u'fact',
u'family',
u'famous',
u'far',
u'fast',
u'feature',
u'features',
u'fewer',
u'file',
u'files',
u'finance',
u'finished',
u'flexible',
u'focus',
u'following',
u'form',
u'formal',
u'format',
u'formats',
u'formatting',
u'forms',
u'fortifications',
u'fortran',
u'foundation',
u'founded',
u'fourth',
u'frame',
u'framework',
u'frameworks',
u'france',
u'frederick',
u'free',
u'freely',
u'french',
u'friedrich',
u'frisian',
u'ft',
u'fully',
u'function',
u'functional',
u'functions',
u'g',
u'garbage',
u'garden',
u'gardens',
u'gate',
u'gcc',
u'general',
u'generally',
u'generate',
u'generating',
u'generation',
u'generic',
u'georg',
u'german',
u'germany',
u'given',
u'giving',
u'gnu',
u'goal',
u'good',
u'gorge',
u'goslar',
u'gothic',
u'government',
u'grand',
u'graphical',
u'graphics',
u'great',
u'greatest',
u'gropius',
u'grounds',
u'group',
u'h',
u'half',
u'hall',
u'hamburg',
u'hanseatic',
u'hardware',
u'harz',
u'haskell',
u'having',
u'heavy',
u'hectares',
u'height',
u'held',
u'heritage',
u'hesse',
u'high',
u'higher',
u'hildesheim',
u'hill',
u'historic',
u'historical',
u'historically',
u'history',
u'holstein',
u'holy',
u'home',
u'hosting',
u'house',
u'houses',
u'housing',
u'html',
u'ideas',
u'iec',
u'ii',
u'image',
u'imperative',
u'imperial',
u'implementation',
u'implementations',
u'implemented',
u'importance',
u'important',
u'include',
u'included',
u'includes',
u'including',
u'independent',
u'industrial',
u'industry',
u'inference',
u'influence',
u'influenced',
u'influential',
u'information',
u'initially',
u'input',
u'inscribed',
u'inside',
u'inspired',
u'instance',
u'institute',
u'integration',
u'intelligence',
u'intended',
u'interactive',
u'interface',
u'interfaces',
u'interior',
u'interiors',
u'international',
u'internet',
u'interpreted',
u'interpreter',
u'island',
u'islands',
u'iso',
u'italian',
u'iv',
u'j',
u'january',
u'java',
u'javascript',
u'jit',
u'johann',
u'john',
u'july',
u'june',
u'just',
u'k',
u'karl',
u'kassel',
u'key',
u'kilometres',
u'king',
u'kings',
u'km',
u'knowledge',
u'known',
u'kreis',
u'l',
u'labs',
u'lady',
u'laid',
u'lambda',
u'land',
u'landscape',
u'language',
u'languages',
u'large',
u'larger',
u'largest',
u'late',
u'later',
u'latin',
u'leading',
u'league',
u'led',
u'left',
u'legendary',
u'length',
u'level',
u'libraries',
u'library',
u'license',
u'lies',
u'life',
u'like',
u'line',
u'lines',
u'linux',
u'lisp',
u'list',
u'listed',
u'living',
u'local',
u'located',
u'location',
u'logic',
u'long',
u'longer',
u'low',
u'lower',
u'ludwig',
u'luther',
u'lutherstadt',
u'm',
u'mac',
u'machine',
u'macro',
u'main',
u'mainland',
u'mainly',
u'mainz',
u'major',
u'make',
u'makes',
u'making',
u'managed',
u'management',
u'managing',
u'manipulation',
u'march',
u'markup',
u'martin',
u'mary',
u'masterpiece',
u'mathematical',
u'meaning',
u'mecklenburg',
u'medieval',
u'memory',
u'metres',
u'metropolitan',
u'meyer',
u'mi',
u'michael',
u'microsoft',
u'middle',
u'miles',
u'million',
u'miranda',
u'ml',
u'model',
u'modeling',
u'modern',
u'modular',
u'module',
u'monastery',
u'monument',
u'monuments',
u'mountain',
u'mountains',
u'movement',
u'multi',
u'multiple',
u'municipality',
u'museum',
u'named',
u'national',
u'native',
u'natural',
u'nature',
u'near',
u'nearby',
u'need',
u'net',
u'netherlands',
u'network',
u'neumann',
u'new',
u'non',
u'north',
u'northern',
u'notable',
u'notably',
u'number',
u'numbers',
u'numerical',
u'object',
u'objective',
u'objects',
u'october',
u'official',
u'officially',
u'old',
u'oldest',
u'ontology',
u'open',
u'opened',
u'operating',
u'operations',
u'optimized',
u'optional',
u'order',
u'organization',
u'oriented',
u'original',
u'originally',
u'originated',
u'os',
u'ottonian',
u'output',
u'package',
u'packages',
u'palace',
u'palaces',
u'palatinate',
u'paradigm',
u'paradigms',
u'park',
u'parks',
u'particular',
u'particularly',
u'parts',
u'pascal',
u'passing',
u'paul',
u'pdf',
u'people',
u'perform',
u'performance',
u'period',
u'perl',
u'persius',
u'peter',
u'php',
u'pl',
u'place',
u'placed',
u'places',
u'plain',
u'plan',
u'planned',
u'plans',
u'platform',
u'platforms',
u'point',
u'points',
u'popular',
u'popularity',
u'population',
u'port',
u'portable',
u'ported',
u'possible',
u'postscript',
u'potsdam',
u'power',
u'practical',
u'pre',
u'present',
u'preservation',
u'preserved',
u'president',
u'primarily',
u'primary',
u'prince',
u'private',
u'procedural',
u'process',
u'processing',
u'produced',
u'produces',
u'products',
u'program',
u'programmer',
u'programmers',
u'programming',
u'programs',
u'project',
u'projects',
u'prolog',
u'pronounced',
u'pronunciation',
u'properties',
u'proprietary',
u'protected',
u'protection',
u'protestant',
u'prototype',
u'provide',
u'provided',
u'provides',
u'providing',
u'proving',
u'prussia',
u'prussian',
u'public',
u'published',
u'purely',
u'purpose',
u'python',
u'quality',
u'queries',
u'query',
u'r',
u'rammelsberg',
u'range',
u'ranked',
u'ratified',
u'rdf',
u'real',
u'reasoning',
u'rebuilt',
u'recent',
u'recognized',
u'recommendation',
u'recursion',
u'reference',
u'referred',
u'refers',
u'reflective',
u'reformation',
u'region',
u'regions',
u'regular',
u'related',
u'relational',
u'released',
u'remains',
u'replaced',
u'represent',
u'representation',
u'represents',
u'requiring',
u'research',
u'reserve',
u'residence',
u'resource',
u'restoration',
u'restored',
u'resulted',
u'resulting',
u'revised',
u'rheingau',
u'rhine',
u'rhineland',
u'rich',
u'river',
u'road',
u'robert',
u'roman',
u'romanesque',
u'route',
u'run',
u'runs',
u'runtime',
u's',
u'saarland',
u'safety',
u'saint',
u'sandstone',
u'saxon',
u'saxony',
u'scala',
u'scale',
u'schema',
u'scheme',
u'schinkel',
u'schleswig',
u'schloss',
u'school',
u'science',
u'scientific',
u'scientists',
u'scripting',
u'scripts',
u'sculpture',
u'sea',
u'seat',
u'second',
u'section',
u'seen',
u'self',
u'semantic',
u'semantics',
u'separate',
u'september',
u'serve',
u'server',
u'servers',
u'set',
u'setting',
u'shell',
u'si',
u'significant',
u'similar',
u'similarities',
u'similarly',
u'simple',
u'simply',
u'single',
u'site',
u'sites',
u'situated',
u'size',
u'slopes',
u'small',
u'smaller',
u'smalltalk',
u'software',
u'source',
u'sources',
u'south',
u'southwest',
u'space',
u'special',
u'specific',
u'specification',
u'specified',
u'sq',
u'sql',
u'square',
u'st',
u'standard',
u'standardization',
u'standardized',
u'standards',
u'stands',
u'starting',
u'state',
u'statements',
u'static',
u'statically',
u'statue',
u'status',
u'steps',
u'stood',
u'store',
u'strong',
u'strongly',
u'structure',
u'structured',
u'structures',
u'studio',
u'style',
u'subsequent',
u'subsequently',
u'subset',
u'substantial',
u'suited',
u'summer',
u'support',
u'supported',
u'supporting',
u'supports',
u'switzerland',
u'symbol',
u'symbolic',
u'syntactic',
u'syntax',
u'systems',
u't',
u'tasks',
u'taunus',
u'tcl',
u'team',
u'technical',
u'techniques',
u'technologies',
u'technology',
u'term',
u'terms',
u'tex',
u'text',
u'th',
u'theorem',
u'time',
u'times',
u'today',
u'took',
u'tools',
u'total',
u'tourism',
u'towers',
u'town',
u'traditional',
u'transformations',
u'tree',
u'trees',
u'trier',
u'turned',
u'type',
u'typed',
u'types',
u'typical',
u'typically',
u'typing',
u'und',
u'unesco',
u'unique',
u'university',
u'unix',
u'unlike',
u'update',
u'upper',
u'urban',
u'use',
u'used',
u'useful',
u'user',
u'users',
u'uses',
u'using',
u'usually',
u'v',
u'valley',
u'value',
u'values',
u'van',
u'variables',
u'variety',
u'various',
u'vector',
u'version',
u'versions',
u'view',
u'village',
u'virtual',
u'visited',
u'visitors',
u'visual',
u'von',
u'vorpommern',
u'w',
u'wadden',
u'wall',
u'walter',
u'wanted',
u'war',
u'water',
u'wattenmeer',
u'way',
u'web',
u'weimar',
u'west',
u'western',
u'westphalia',
u'wide',
u'widely',
u'widespread',
u'width',
u'william',
u'windows',
u'wine',
u'wittenberg',
u'work',
u'working',
u'works',
u'world',
u'writers',
u'written',
u'x',
u'xml',
u'xquery',
u'year',
u'years',
u'zu']
In [74]:
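# Fit a two-topic LDA model on the term counts and print the 30 highest-weighted terms per topic.
# No random_state is set, so the topics (and their numbering) can differ between runs.
hx_lda = LatentDirichletAllocation(n_topics = 2)
hx_lda.fit(X)
display_topics(hx_lda, vectorizer.get_feature_names(), 30)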
/usr/local/lib/python2.7/site-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
DeprecationWarning)
Topic 0:
world german heritage unesco germany site s church city palace cathedral trier roman berlin th park st located town built list sea house building century area imperial wadden island martin
Topic 1:
language programming c languages used s web code php developed standard computer data object applications based source javascript systems logic general functional oriented type development originally features design purpose lisp
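The helper `display_topics` is not shown in this section. A minimal sketch of how such a helper could work, assuming it ranks the terms of each topic by their weight in the fitted model's `components_` matrix, is given here. The function name `top_terms` and the toy weights are hypothetical and only illustrate the idea.

```python
def top_terms(components, feature_names, n_top):
    """Return, for each topic row, the n_top terms with the largest weights."""
    topics = []
    for weights in components:
        # Indices of the terms, sorted from the heaviest to the lightest weight
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([feature_names[i] for i in order[:n_top]])
    return topics

# Toy topic-term weights (hypothetical; a fitted model exposes these as `components_`)
toy_components = [
    [5.0, 0.1, 3.0, 0.2],
    [0.3, 4.0, 0.1, 2.5],
]
toy_features = ["church", "language", "palace", "code"]

toy_topics = top_terms(toy_components, toy_features, 2)
for topic_idx, terms in enumerate(toy_topics):
    print("Topic %d:" % topic_idx)
    print(" ".join(terms))
```

Under the same assumption, calling `top_terms(hx_lda.components_, vectorizer.get_feature_names(), 30)` would return the term lists instead of printing them.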
In [75]:
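# Label each extract with its highest-weighted topic; note that 'wk_title' holds the extract text, not the title.
pd.DataFrame({'wk_title': wkext['enextracts'], 'LDA_topic': np.argmax(hx_lda.transform(X), axis = 1)})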
Out[75]:
     LDA_topic  wk_title
0    1          PEARL, or Process and experiment automation re...
1    0          The Aachen Cathedral Treasury (German: Aachene...
2    0          Staatliches Bauhaus , commonly known simply as...
3    1          Boo is an object-oriented, statically typed, g...
4    0          The Upper Harz Water Regale (German: Oberharze...
5    0          Aachen Cathedral, frequently referred to as th...
6    1          Synchronized Multimedia Integration Language (...
7    1          Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio...
8    1          Gofer ("Good For Equational Reasoning") is an ...
9    0          The Hanseatic City of Lübeck (pronounced [ˈlyː...
10   1          Perl is a family of high-level, general-purpos...
11   1          Tcllib is a collection of packages available f...
12   1          Go, also commonly referred to as golang, is a ...
13   1          COBOL (/ˈkoʊbɒl/, an acronym for common busine...
14   1          ML is a general-purpose functional programming...
15   0          The Haus am Horn was built for the Weimar Bauh...
16   0          Rüdesheim am Rhein is a winemaking town in the...
17   1          Hack is a programming language for the HipHop ...
18   1          Objective-C is a general-purpose, object-orien...
19   0          Martin Luther's Birth House (German: Martin Lu...
20   1          XProc is a W3C Recommendation to define an XML...
21   1          Julia is a high-level dynamic programming lang...
22   1          XSLT (Extensible Stylesheet Language Transform...
23   1          MetaPost refers to both a programming language...
24   0          The Basilica of Constantine (German: Konstanti...
25   0          The Wadden Sea (Dutch: Waddenzee, German: Watt...
26   1          Modelica is an object-oriented, declarative, m...
27   1          APL (named after the book A Programming Langua...
28   1          Fortran (previously FORTRAN, derived from Form...
29   1          Ruby is a dynamic, reflective, object-oriented...
30   0          Eibingen Abbey (in German Abtei St. Hildegard,...
31   0          Classical Weimar is a UNESCO World Heritage Si...
32   0          The Bauhaus Dessau Foundation is a Foundation ...
33   0          The Limes Germanicus (Latin for Germanic front...
34   1          MATLAB (matrix laboratory) is a multi-paradigm...
35   1          Extensible Application Markup Language (XAML, ...
36   0          The Dresden Elbe Valley is a former World Heri...
37   0          The Rammelsberg is a mountain, 635 metres (2,0...
38   1          In computer science, Coq is an interactive the...
39   0          The New Garden (German: Neuer Garten) in Potsd...
40   1          Haskell /ˈhæskəl/ is a standardized, general-p...
41   0          Cologne Cathedral (German: Kölner Dom) (Latin:...
42   0          Lorch am Rhein is a small town in the Rheingau...
43   0          The Imperial Palace of Goslar (German: Kaiserp...
44   0          The Speyer Cathedral, officially the Imperial ...
45   1          Oz is a multiparadigm programming language, de...
46   0          Eisleben is a town in Saxony-Anhalt, Germany. ...
47   0          The Wartburg is a castle originally built in t...
48   1          Erlang (/ˈɜrlæŋ/ ER-lang) is a general-purpose...
49   0          The Barbara Baths (German: Barbarathermen) are...
50   0          The Schleswig-Holstein Wadden Sea National Par...
51   0          Wismar (German pronunciation: [ˈvɪsmaʁ]) is a ...
52   0          There are 39 official UNESCO World Heritage Si...
53   1          Prolog is a general purpose logic programming ...
54   1          Logo is an educational programming language, d...
55   0          Maulbronn Monastery (German: Kloster Maulbronn...
56   1          Paul Graham (born 13 November 1964) is an Engl...
57   0          Babelsberg Palace (German: Schloss Babelsberg)...
58   1          F-logic (frame logic) is a knowledge represent...
59   0          The Würzburg Residence (German: Würzburger Res...
60   1          Self is an object-oriented programming languag...
61   1          Lout is a batch document formatter invented by...
62   0          The Siemensstadt Housing Estate (German: Großs...
63   1          Vala is an object-oriented programming languag...
64   1          Windows PowerShell is a task automation and co...
65   0          The Augustusburg and Falkenlust palaces is a h...
66   1          Common Lisp (CL) is a dialect of the Lisp prog...
67   1          Rebol (/ˈrɛbəl/ REB-əl; historically REBOL) is...
68   0          Goslar is a historic town in Lower Saxony, Ger...
69   0          The Fagus Factory (German: Fagus Fabrik or Fag...
70   0          The Bremen Roland is a statue of Roland, erect...
71   1          SPARQL (pronounced "sparkle", a recursive acro...
72   0          The Hercules monument is an important landmark...
73   1          The Web Ontology Language (OWL) is a family of...
74   1          The J programming language, developed in the e...
75   1          Metafont is a description language used to def...
76   1          OCaml (/oʊˈkæməl/ oh-KAM-əl), originally known...
77   1          C++ (pronounced as see plus plus, /ˈsiː plʌs p...
78   1          Lisp (historically, LISP) is a family of compu...
79   1          Clojure (pronounced like "closure") is a diale...
80   0          Not to be confused with the Melanchthonhaus (B...
81   1          Tcl/Java is a project to bridge Tcl and Java. ...
82   0          The Porta Nigra (Latin for black gate) is a la...
83   0          The Dessau-Wörlitz Garden Realm, also known as...
84   0          The Rhine Gorge is a popular name for the Uppe...
85   1          QML (Qt Meta Language or Qt Modeling Language)...
86   1          Io is a pure object-oriented programming langu...
87   1          MUMPS (Massachusetts General Hospital Utility ...
88   0          Prehistoric pile dwellings around the Alps is ...
89   0          The Messel Pit (German: Grube Messel) is a dis...
90   0          The Trier Imperial Baths (German: Kaisertherme...
91   1          SQL (/ˈɛs kjuː ˈɛl/, or /ˈsiːkwəl/; Structured...
92   0          Berlin Modernism Housing Estates (German: Sied...
93   1          Python is a widely used general-purpose, high-...
94   0          Babelsberg is the largest district of the Bran...
95   0          Bergpark Wilhelmshöhe is a unique landscape pa...
96   1          ATS (Applied Type System) is a programming lan...
97   1          newLISP is an open source scripting language i...
98   0          Glienicke Palace (German: Schloss Glienicke) i...
99   0          The Roman Monuments, Cathedral of St. Peter an...
100  0          Reichenau Island is an island in Lake Constanc...
101  0          Primeval Beech Forests of the Carpathians and ...
102  0          Bamberg (German pronunciation: [ˈbambɛɐ̯k]) is...
103  1          Curl is a reflective object-oriented programmi...
104  0          Weimar (German pronunciation: [ˈvaɪmaɐ]) is a ...
105  0          The Liebfrauenkirche (German for Church of Our...
106  1          F# (pronounced eff sharp) is a strongly typed,...
107  0          The Lutherhaus in Lutherstadt Wittenberg, begu...
108  0          A limes (/ˈlaɪmiːz/; Latin pl. limites) was a ...
109  0          Museum Island (German: Museumsinsel) is the na...
110  1          R is a programming language and software envir...
111  0          Palaces and Parks of Potsdam and Berlin refers...
112  0          Regensburg (German pronunciation: [ˈʁeɡənsbʊɐ̯...
113  1          Maple is a commercial computer algebra system ...
114  1          GNU Pascal (GPC) is a Pascal compiler composed...
115  0          The Muskau Park (German: Muskauer Park, offici...
116  1          Java is a general-purpose computer programming...
117  1          Lustre is a formally defined, declarative, and...
118  0          The Stadt- und Pfarrkirche St. Marien zu Witte...
119  0          The Imperial Abbey of Corvey (German: Stift Co...
120  1          C# (pronounced as see sharp) is a multi-paradi...
121  1          Embedded SQL is a method of combining the comp...
122  0          Quedlinburg (German pronunciation: [ˈkveːdlɪnb...
123  1          The Squeak programming language is a dialect o...
124  1          In computer science, Clean is a general-purpos...
125  0          Trier (German pronunciation: [ˈtʀiːɐ̯]; Luxemb...
126  1          Adenine, named after the nucleobase adenine, i...
127  0          Wittenberg, officially Lutherstadt Wittenberg,...
128  0          The Hufeisensiedlung ("Horseshoe Estate") is a...
129  0          The Protestant Church of the Redeemer (German:...
130  0          The Church of St. Michael (German: Michaeliski...
131  0          The Margravial Opera House (German: Markgräfli...
132  0          Hildesheim Cathedral (German: Hildesheimer Dom...
133  1          ISWIM is an abstract computer programming lang...
134  1          Pharo is an open source implementation of the ...
135  1          Miranda is a lazy, purely functional programmi...
136  1          JavaScript (/ˈdʒɑːvɑːˌskrɪpt/; JS) is a dynami...
137  1          CycL in computer science and artificial intell...
138  1          XQuery is a query and functional programming l...
139  1          The Joy programming language in computer scien...
140  1          Ada is a structured, statically typed, imperat...
141  0          The Trier Amphitheater is a Roman Amphitheater...
142  0          The Bremen City Hall is the seat of the Presid...
143  1          Euler is a programming language created by Nik...
144  1          StepTalk is the official GNUstep scripting fra...
145  1          Standard ML (SML) is a general-purpose, modula...
146  1          Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu....
147  1          Mercury is a functional logic programming lang...
148  1          OPAL (OPtimized Applicative Language) is a fun...
149  0          Pfaueninsel ("Peacock Island") is an island in...
150  0          The Holsten Gate ("Holstein Tor", later "Holst...
151  0          Martin Luther's Death House (German: Martin Lu...
152  0          The Imperial Palace Ingelheim (German: Ingelhe...
153  1          The D programming language is an object-orient...
154  0          The Lower Saxon Wadden Sea National Park (Germ...
155  0          The Abbey of Lorsch (German: Reichsabtei Lorsc...
156  0          Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]...
157  1          PHP is a server-side scripting language design...
158  1          C (/ˈsiː/, as in the letter c) is a general-pu...
159  0          The Völklingen Ironworks (German: Völklinger H...
160  0          The Pilgrimage Church of Wies (German: Wieskir...
161  1          SuperCollider is an environment and programmin...
162  0          The Sanssouci Palace (German: Schloss Sanssouc...
163  0          The Zollverein Coal Mine Industrial Complex (G...
164  1          Smalltalk is an object-oriented, dynamically t...
165  1          Tcl (originally from Tool Command Language, bu...
166  1          Strongtalk is a Smalltalk environment with opt...
167  1          Datalog is a truly declarative logic programmi...
168  1          Racket (formerly named PLT Scheme) is a genera...
169  0          The Igel Column is a multi-storeyed Roman sand...
170  0          The High Cathedral of Saint Peter in Trier (Ge...
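The `LDA_topic` column comes from taking, for each document, the index of the largest entry in its row of the document-topic matrix. A minimal pure-Python sketch of that hard assignment, followed by a count of documents per topic, is shown here. The function `hard_assign` and the toy weights are hypothetical; in the notebook the matrix would be `hx_lda.transform(X)` and the argmax is done with `np.argmax(..., axis = 1)`.

```python
from collections import Counter

def hard_assign(doc_topic):
    """Return, for each row, the index of its largest weight (first index on ties)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topic]

# Toy document-topic weights, one row per document (hypothetical values)
toy_doc_topic = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]

labels = hard_assign(toy_doc_topic)
topic_counts = Counter(labels)

print(labels)
print(sorted(topic_counts.items()))
```

Counting the labels in this way gives a quick check of how evenly the documents split between the two topics, although a hard assignment discards how close each decision was.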
Content source: chainsawriot/pycon2016hk_sklearn