In [1]:
import pandas as pd
wkext = pd.read_csv("extracts.csv")

In [2]:
wkext


Out[2]:
entitles enextracts detitles deextracts
0 PEARL (programming language) PEARL, or Process and experiment automation re... PEARL PEARL [pɜːɹl] ist eine Echtzeit- und Multitask...
1 Aachen Cathedral Treasury The Aachen Cathedral Treasury (German: Aachene... Aachener Domschatzkammer Die Aachener Domschatzkammer präsentiert den K...
2 Bauhaus Staatliches Bauhaus , commonly known simply as... Bauhaus Das Staatliche Bauhaus wurde 1919 von Walter G...
3 Boo (programming language) Boo is an object-oriented, statically typed, g... Boo (Programmiersprache) Boo ist eine seit 2003 von Rodrigo Barreto de ...
4 Upper Harz Water Regale The Upper Harz Water Regale (German: Oberharze... Oberharzer Wasserregal Das Oberharzer Wasserregal ist ein hauptsächli...
5 Aachen Cathedral Aachen Cathedral, frequently referred to as th... Aachener Dom Der Aachener Dom, auch Aachener Münster oder A...
6 Synchronized Multimedia Integration Language Synchronized Multimedia Integration Language (... Synchronized Multimedia Integration Language Vorlage:Infobox Dateiformat/Wartung/MagischeZa...
7 Scala (programming language) Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio... Scala (Programmiersprache) Scala ist eine funktionale und objektorientier...
8 Gofer (programming language) Gofer ("Good For Equational Reasoning") is an ... Gofer Gofer ist eine funktionale Programmiersprache,...
9 Lübeck The Hanseatic City of Lübeck (pronounced [ˈlyː... Lübeck Die Hansestadt Lübeck (niederdeutsch: Lübęk, L...
10 Perl Perl is a family of high-level, general-purpos... Perl (Programmiersprache) Perl [pɝːl] ist eine freie, plattformunabhängi...
11 Tcllib Tcllib is a collection of packages available f... Tcllib Tcllib ist eine der populärsten Bibliotheken z...
12 Go (programming language) Go, also commonly referred to as golang, is a ... Go (Programmiersprache) Go ist eine kompilierbare Programmiersprache, ...
13 COBOL COBOL (/ˈkoʊbɒl/, an acronym for common busine... COBOL COBOL ist eine Programmiersprache, die in der ...
14 ML (programming language) ML is a general-purpose functional programming... ML (Programmiersprache) Meta Language (ML) beschreibt eine Familie fun...
15 Haus am Horn The Haus am Horn was built for the Weimar Bauh... Musterhaus Am Horn Das Musterhaus „Am Horn“ ist ein in Weimar err...
16 Rüdesheim am Rhein Rüdesheim am Rhein is a winemaking town in the... Rüdesheim am Rhein Rüdesheim am Rhein ist eine Weinstadt im Rhein...
17 Hack (programming language) Hack is a programming language for the HipHop ... Hack (Programmiersprache) Hack ist eine Neuimplementierung der Skriptspr...
18 Objective-C Objective-C is a general-purpose, object-orien... Objective-C Objective-C, auch kurz ObjC genannt, erweitert...
19 Martin Luther's Birth House Martin Luther's Birth House (German: Martin Lu... Martin Luthers Geburtshaus Bei dem sogenannten Luther-Geburtshaus handelt...
20 XProc XProc is a W3C Recommendation to define an XML... XProc XProc (von englisch XML Processing) ist eine v...
21 Julia (programming language) Julia is a high-level dynamic programming lang... Julia (Programmiersprache) Julia ist eine höhere High-Performance-Program...
22 XSLT XSLT (Extensible Stylesheet Language Transform... XSL Transformation Vorlage:Infobox Dateiformat/Wartung/MagischeZa...
23 MetaPost MetaPost refers to both a programming language... MetaPost MetaPost ist zum einen eine Programmiersprache...
24 Aula Palatina The Basilica of Constantine (German: Konstanti... Konstantinbasilika Die Evangelische Kirche zum Erlöser (Konstanti...
25 Wadden Sea The Wadden Sea (Dutch: Waddenzee, German: Watt... Wattenmeer (Nordsee) Das Wattenmeer der Nordsee ist eine im Wirkung...
26 Modelica Modelica is an object-oriented, declarative, m... Modelica Modelica ist eine objektorientierte Beschreibu...
27 APL (programming language) APL (named after the book A Programming Langua... APL (Programmiersprache) APL, abgekürzt für A Programming Language, ist...
28 Fortran Fortran (previously FORTRAN, derived from Form... Fortran Fortran ist eine prozedurale und in ihrer neue...
29 Ruby (programming language) Ruby is a dynamic, reflective, object-oriented... Ruby (Programmiersprache) Ruby (englisch für Rubin) ist eine höhere Prog...
... ... ... ... ...
141 Trier Amphitheater The Trier Amphitheater is a Roman Amphitheater... Amphitheater (Trier) Das Amphitheater in Trier (Augusta Treverorum)...
142 Bremen City Hall The Bremen City Hall is the seat of the Presid... Bremer Rathaus Das Bremer Rathaus ist eines der bedeutendsten...
143 Euler (programming language) Euler is a programming language created by Nik... Euler (Programmiersprache) Euler ist eine von Niklaus Wirth und Helmut We...
144 StepTalk StepTalk is the official GNUstep scripting fra... StepTalk StepTalk ist das offizielle GNUstep Scripting-...
145 Standard ML Standard ML (SML) is a general-purpose, modula... Standard ML Standard ML (SML) ist eine von ML abstammende ...
146 Lua (programming language) Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu.... Lua Lua (portugiesisch für Mond) ist eine imperati...
147 Mercury (programming language) Mercury is a functional logic programming lang... Mercury (Programmiersprache) Mercury ist eine stark an Prolog angelehnte Pr...
148 Opal (programming language) OPAL (OPtimized Applicative Language) is a fun... Opal (Programmiersprache) OPAL (OPtimized Applicative Language) ist eine...
149 Pfaueninsel Pfaueninsel ("Peacock Island") is an island in... Pfaueninsel Vorlage:Infobox Insel/Wartung/Höhe fehlt\nDie ...
150 Holstentor The Holsten Gate ("Holstein Tor", later "Holst... Holstentor Das Holstentor („Holstein-Tor“) ist ein Stadtt...
151 Martin Luther's Death House Martin Luther's Death House (German: Martin Lu... Martin Luthers Sterbehaus Martin Luthers Sterbehaus ist das Gebäude in d...
152 Imperial Palace Ingelheim The Imperial Palace Ingelheim (German: Ingelhe... Ingelheimer Kaiserpfalz Die Ingelheimer Kaiserpfalz ist eine bedeutend...
153 D (programming language) The D programming language is an object-orient... D (Programmiersprache) D ist eine Programmiersprache, die sich äußerl...
154 Lower Saxon Wadden Sea National Park The Lower Saxon Wadden Sea National Park (Germ... Nationalpark Niedersächsisches Wattenmeer Der Nationalpark Niedersächsisches Wattenmeer ...
155 Lorsch Abbey The Abbey of Lorsch (German: Reichsabtei Lorsc... Kloster Lorsch Das Kloster Lorsch war eine Benediktinerabtei ...
156 Stralsund Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]... Stralsund Stralsund [ˈʃtʁaːlzʊnt] ist eine Stadt im Nord...
157 PHP PHP is a server-side scripting language design... PHP PHP (rekursives Akronym und Backronym für „PHP...
158 C (programming language) C (/ˈsiː/, as in the letter c) is a general-pu... C (Programmiersprache) C ist eine imperative Programmiersprache, die ...
159 Völklingen Ironworks The Völklingen Ironworks (German: Völklinger H... Völklinger Hütte Die Völklinger Hütte ist ein 1873 gegründetes ...
160 Wieskirche The Pilgrimage Church of Wies (German: Wieskir... Wieskirche Die Wieskirche ist eine bemerkenswert prächtig...
161 SuperCollider SuperCollider is an environment and programmin... SuperCollider SuperCollider (SC) ist eine Programmierumgebun...
162 Sanssouci The Sanssouci Palace (German: Schloss Sanssouc... Sanssouci Schloss Sanssouci (französisch sans souci ‚ohn...
163 Zollverein Coal Mine Industrial Complex The Zollverein Coal Mine Industrial Complex (G... Zeche Zollverein Die Zeche Zollverein war ein von 1851 bis 1986...
164 Smalltalk Smalltalk is an object-oriented, dynamically t... Smalltalk (Programmiersprache) Smalltalk ist ein Sammelbegriff einerseits für...
165 Tcl Tcl (originally from Tool Command Language, bu... Tcl Tcl (Aussprache engl. tickle oder auch als Abk...
166 Strongtalk Strongtalk is a Smalltalk environment with opt... Strongtalk Strongtalk ist eine Variante der Programmiersp...
167 Datalog Datalog is a truly declarative logic programmi... Datalog Datalog ist eine Datenbank-Programmiersprache ...
168 Racket (programming language) Racket (formerly named PLT Scheme) is a genera... DrRacket DrRacket (früher DrScheme) ist eine integriert...
169 Igel Column The Igel Column is a multi-storeyed Roman sand... Igeler Säule Die Igeler Säule im Dorf Igel an der Mosel ist...
170 Cathedral of Trier The High Cathedral of Saint Peter in Trier (Ge... Trierer Dom Die Hohe Domkirche St. Peter zu Trier ist die ...

171 rows × 4 columns


In [3]:
wkext['enextracts']


Out[3]:
0      PEARL, or Process and experiment automation re...
1      The Aachen Cathedral Treasury (German: Aachene...
2      Staatliches Bauhaus , commonly known simply as...
3      Boo is an object-oriented, statically typed, g...
4      The Upper Harz Water Regale (German: Oberharze...
5      Aachen Cathedral, frequently referred to as th...
6      Synchronized Multimedia Integration Language (...
7      Scala (/ˈskɑːlə/ SKAH-lə) is an object-functio...
8      Gofer ("Good For Equational Reasoning") is an ...
9      The Hanseatic City of Lübeck (pronounced [ˈlyː...
10     Perl is a family of high-level, general-purpos...
11     Tcllib is a collection of packages available f...
12     Go, also commonly referred to as golang, is a ...
13     COBOL (/ˈkoʊbɒl/, an acronym for common busine...
14     ML is a general-purpose functional programming...
15     The Haus am Horn was built for the Weimar Bauh...
16     Rüdesheim am Rhein is a winemaking town in the...
17     Hack is a programming language for the HipHop ...
18     Objective-C is a general-purpose, object-orien...
19     Martin Luther's Birth House (German: Martin Lu...
20     XProc is a W3C Recommendation to define an XML...
21     Julia is a high-level dynamic programming lang...
22     XSLT (Extensible Stylesheet Language Transform...
23     MetaPost refers to both a programming language...
24     The Basilica of Constantine (German: Konstanti...
25     The Wadden Sea (Dutch: Waddenzee, German: Watt...
26     Modelica is an object-oriented, declarative, m...
27     APL (named after the book A Programming Langua...
28     Fortran (previously FORTRAN, derived from Form...
29     Ruby is a dynamic, reflective, object-oriented...
                             ...                        
141    The Trier Amphitheater is a Roman Amphitheater...
142    The Bremen City Hall is the seat of the Presid...
143    Euler is a programming language created by Nik...
144    StepTalk is the official GNUstep scripting fra...
145    Standard ML (SML) is a general-purpose, modula...
146    Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu....
147    Mercury is a functional logic programming lang...
148    OPAL (OPtimized Applicative Language) is a fun...
149    Pfaueninsel ("Peacock Island") is an island in...
150    The Holsten Gate ("Holstein Tor", later "Holst...
151    Martin Luther's Death House (German: Martin Lu...
152    The Imperial Palace Ingelheim (German: Ingelhe...
153    The D programming language is an object-orient...
154    The Lower Saxon Wadden Sea National Park (Germ...
155    The Abbey of Lorsch (German: Reichsabtei Lorsc...
156    Stralsund (German pronunciation: [ˈʃtʁaːlzʊnt]...
157    PHP is a server-side scripting language design...
158    C (/ˈsiː/, as in the letter c) is a general-pu...
159    The Völklingen Ironworks (German: Völklinger H...
160    The Pilgrimage Church of Wies (German: Wieskir...
161    SuperCollider is an environment and programmin...
162    The Sanssouci Palace (German: Schloss Sanssouc...
163    The Zollverein Coal Mine Industrial Complex (G...
164    Smalltalk is an object-oriented, dynamically t...
165    Tcl (originally from Tool Command Language, bu...
166    Strongtalk is a Smalltalk environment with opt...
167    Datalog is a truly declarative logic programmi...
168    Racket (formerly named PLT Scheme) is a genera...
169    The Igel Column is a multi-storeyed Roman sand...
170    The High Cathedral of Saint Peter in Trier (Ge...
Name: enextracts, dtype: object

Your move: use the Twitter SA code to create a unigram tokenizer and generate a feature matrix from the English extracts.


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
# Default settings: lowercased unigram counts
vectorizer = CountVectorizer()
# Rows are documents, columns are vocabulary terms
X = vectorizer.fit_transform(wkext['enextracts'])
X


Out[4]:
<171x4930 sparse matrix of type '<type 'numpy.int64'>'
	with 15380 stored elements in Compressed Sparse Row format>

In [6]:
vectorizer.get_feature_names()


Out[6]:
[u'000',
 u'01',
 u'03',
 u'05',
 u'064',
 u'083',
 u'10',
 u'100',
 u'1010',
 u'102',
 u'1020',
 u'10206',
 u'1030',
 u'105',
 u'106',
 u'109',
 u'1090',
 u'10th',
 u'11',
 u'1103',
 u'111',
 u'112',
 u'115',
 u'1165',
 u'1170s',
 u'11th',
 u'12',
 u'1209',
 u'1234',
 u'1248',
 u'1280',
 u'12th',
 u'13',
 u'1360',
 u'1366',
 u'14',
 u'1404',
 u'142',
 u'144',
 u'1464',
 u'1473',
 u'1483',
 u'14882',
 u'14th',
 u'15',
 u'150',
 u'1504',
 u'1531',
 u'1533',
 u'1546',
 u'157',
 u'15th',
 u'16',
 u'1689',
 u'1693',
 u'1696',
 u'16th',
 u'17',
 u'170',
 u'1720',
 u'1729',
 u'1730',
 u'1738',
 u'1740',
 u'1740s',
 u'1744',
 u'1745',
 u'1746',
 u'1747',
 u'1748',
 u'1754',
 u'1787',
 u'1792',
 u'1797',
 u'17th',
 u'18',
 u'180',
 u'1815',
 u'1817',
 u'1826',
 u'1830',
 u'1835',
 u'1838',
 u'1844',
 u'1847',
 u'1848',
 u'1849',
 u'185',
 u'1850',
 u'1851',
 u'1852',
 u'1853',
 u'1856',
 u'1859',
 u'1862',
 u'1876',
 u'1880',
 u'18th',
 u'19',
 u'1903',
 u'1904',
 u'1910',
 u'1911',
 u'1913',
 u'1916',
 u'1918',
 u'1919',
 u'192',
 u'1920s',
 u'1923',
 u'1924',
 u'1925',
 u'1928',
 u'1930',
 u'1932',
 u'1933',
 u'1939',
 u'1941',
 u'1944',
 u'1945',
 u'1948',
 u'1950',
 u'1950s',
 u'1954',
 u'1957',
 u'1958',
 u'1959',
 u'1960',
 u'1960s',
 u'1961',
 u'1964',
 u'1966',
 u'1967',
 u'1968',
 u'1969',
 u'1970',
 u'1970s',
 u'1972',
 u'1973',
 u'1976',
 u'1977',
 u'1978',
 u'1979',
 u'1980',
 u'1980s',
 u'1981',
 u'1983',
 u'1984',
 u'1985',
 u'1986',
 u'1987',
 u'1988',
 u'1989',
 u'1990',
 u'1990s',
 u'1991',
 u'1992',
 u'1993',
 u'1994',
 u'1995',
 u'1996',
 u'1997',
 u'1998',
 u'1999',
 u'19th',
 u'1st',
 u'20',
 u'200',
 u'2000',
 u'2001',
 u'2002',
 u'2003',
 u'2004',
 u'2005',
 u'2006',
 u'2007',
 u'2008',
 u'2009',
 u'2010',
 u'2011',
 u'2012',
 u'2013',
 u'2014',
 u'2015',
 u'20th',
 u'21',
 u'213',
 u'219',
 u'22',
 u'226',
 u'23',
 u'23270',
 u'24',
 u'240',
 u'25',
 u'250',
 u'26',
 u'260',
 u'27',
 u'278',
 u'28',
 u'283',
 u'29',
 u'298',
 u'30',
 u'300',
 u'306',
 u'31',
 u'310',
 u'32',
 u'33',
 u'334',
 u'335',
 u'337',
 u'340',
 u'345',
 u'35',
 u'350',
 u'353',
 u'360',
 u'37',
 u'39',
 u'391',
 u'3rd',
 u'40',
 u'41',
 u'410',
 u'42',
 u'43',
 u'438',
 u'4410',
 u'45',
 u'47',
 u'474',
 u'4gl',
 u'4th',
 u'50',
 u'500',
 u'5000',
 u'515',
 u'526m',
 u'55',
 u'552',
 u'56',
 u'568',
 u'590',
 u'595',
 u'60',
 u'62',
 u'635',
 u'65',
 u'66253',
 u'67',
 u'672',
 u'68',
 u'6th',
 u'70',
 u'700',
 u'7185',
 u'72',
 u'73',
 u'77',
 u'80',
 u'800',
 u'80th',
 u'83',
 u'86',
 u'8652',
 u'87',
 u'8th',
 u'90',
 u'900',
 u'936',
 u'95',
 u'971',
 u'983',
 u'aachen',
 u'aachener',
 u'abandons',
 u'abbey',
 u'abbeys',
 u'abbot',
 u'abbreviated',
 u'abilities',
 u'ability',
 u'able',
 u'about',
 u'above',
 u'abstentions',
 u'abstract',
 u'abtei',
 u'abundance',
 u'academia',
 u'academic',
 u'academy',
 u'accept',
 u'acceptance',
 u'accepted',
 u'access',
 u'accessed',
 u'accessible',
 u'accessing',
 u'accompanying',
 u'accordance',
 u'according',
 u'acid',
 u'acm',
 u'acoustic',
 u'acquired',
 u'acres',
 u'acronym',
 u'across',
 u'act',
 u'act1',
 u'acted',
 u'actions',
 u'actionscript',
 u'active',
 u'actively',
 u'activities',
 u'actors',
 u'actual',
 u'actually',
 u'ad',
 u'ada',
 u'adapt',
 u'adaptation',
 u'adapted',
 u'add',
 u'added',
 u'adding',
 u'addition',
 u'additional',
 u'additionally',
 u'additions',
 u'address',
 u'adds',
 u'adele',
 u'adenine',
 u'adhering',
 u'adjoining',
 u'adjustments',
 u'administers',
 u'administration',
 u'administrative',
 u'administrators',
 u'adobe',
 u'adolf',
 u'adopted',
 u'adopting',
 u'adoption',
 u'adult',
 u'advanced',
 u'advantage',
 u'advantages',
 u'advocated',
 u'aesthetic',
 u'affiliated',
 u'affluent',
 u'afforded',
 u'afield',
 u'aforementioned',
 u'afsluitdijk',
 u'after',
 u'again',
 u'against',
 u'age',
 u'ages',
 u'agrarian',
 u'agriculture',
 u'ai',
 u'aim',
 u'aimed',
 u'aims',
 u'air',
 u'aircraft',
 u'aisled',
 u'alain',
 u'alan',
 u'albeit',
 u'albert',
 u'albrechtsberg',
 u'alexandrescu',
 u'alfeld',
 u'algebra',
 u'algebraic',
 u'algol',
 u'algorithm',
 u'algorithmic',
 u'algorithms',
 u'all',
 u'allegedly',
 u'allied',
 u'allow',
 u'allowing',
 u'allows',
 u'almost',
 u'alone',
 u'along',
 u'alonzo',
 u'alpine',
 u'alps',
 u'already',
 u'also',
 u'altar',
 u'alte',
 u'altenau',
 u'alter',
 u'alternative',
 u'alternatively',
 u'altes',
 u'although',
 u'altstadt',
 u'always',
 u'am',
 u'amalia',
 u'amalienburg',
 u'amd64',
 u'amended',
 u'american',
 u'ammianus',
 u'among',
 u'amongst',
 u'amphitheater',
 u'amphitheatre',
 u'an',
 u'anachronistic',
 u'analysis',
 u'anchor',
 u'ancient',
 u'and',
 u'anders',
 u'andreasberg',
 u'andrei',
 u'anhalt',
 u'animals',
 u'animation',
 u'animations',
 u'animorphic',
 u'anna',
 u'annihilated',
 u'anniversary',
 u'annotation',
 u'annotations',
 u'announced',
 u'announcements',
 u'annual',
 u'annually',
 u'anonymous',
 u'another',
 u'ansi',
 u'answering',
 u'antique',
 u'antiquity',
 u'any',
 u'anything',
 u'anywhere',
 u'aot',
 u'apache',
 u'api',
 u'apis',
 u'apl',
 u'app',
 u'appear',
 u'appearance',
 u'appeared',
 u'appears',
 u'apple',
 u'applet',
 u'applets',
 u'applicable',
 u'application',
 u'applications',
 u'applicative',
 u'applied',
 u'approach',
 u'approaches',
 u'approved',
 u'approximate',
 u'approximately',
 u'apps',
 u'april',
 u'apse',
 u'apses',
 u'arabia',
 u'arabicus',
 u'arbitrary',
 u'archaeological',
 u'archbishop',
 u'archbishopric',
 u'archdiocese',
 u'arched',
 u'architect',
 u'architects',
 u'architectural',
 u'architecture',
 u'architectures',
 u'architekten',
 u'architektur',
 u'architrave',
 u'archive',
 u'are',
 u'area',
 u'areas',
 u'aren',
 u'arg1',
 u'arg2',
 u'arg3',
 u'arguments',
 u'arithmetic',
 u'arm',
 u'armstrong',
 u'army',
 u'arnim',
 u'around',
 u'arranged',
 u'array',
 u'arrays',
 u'art',
 u'arthritis',
 u'article',
 u'articulation',
 u'artificial',
 u'artistic',
 u'artists',
 u'arts',
 u'artworks',
 u'as',
 u'ascii',
 u'asked',
 u'asm',
 u'aspect',
 u'aspects',
 u'assembly',
 u'assertions',
 u'assignment',
 u'assisted',
 u'associated',
 u'association',
 u'associative',
 u'assumption',
 u'asymptote',
 u'asynchronously',
 u'at',
 u'atomic',
 u'atrium',
 u'ats',
 u'attached',
 u'attains',
 u'attempt',
 u'attempts',
 u'attend',
 u'attics',
 u'attracted',
 u'attracting',
 u'attraction',
 u'attractions',
 u'attributed',
 u'attributes',
 u'auckland',
 u'audio',
 u'aufsichts',
 u'augmented',
 u'august',
 u'augusta',
 u'augusteum',
 u'augustinian',
 u'augustus',
 u'augustusburg',
 u'aula',
 u'aureus',
 u'australia',
 u'austria',
 u'austrian',
 u'author',
 u'authoring',
 u'authority',
 u'authors',
 u'autobahn',
 u'automated',
 u'automatic',
 u'automatically',
 u'automation',
 u'availability',
 u'available',
 u'avalon',
 u'average',
 u'avoid',
 u'award',
 u'awarded',
 u'awards',
 u'aware',
 u'away',
 u'awk',
 u'axis',
 u'azelin',
 u'b1',
 u'b2b',
 u'b2c',
 u'babelsberg',
 u'babylon',
 u'bachtrompetengala',
 u'back',
 u'backend',
 u'background',
 u'backgrounds',
 u'backronym',
 u'backronyms',
 u'backus',
 u'bad',
 u'baden',
 u'bailiff',
 u'balanced',
 u'baldachin',
 u'balk',
 u'balthasar',
 u'baltic',
 u'bamberg',
 u'banker',
 u'banks',
 u'banquet',
 u'baptised',
 u'barbara',
 u'barbarathermen',
 u'barock',
 u'baroque',
 u'barras',
 u'bas',
 u'base',
 u'based',
 u'basic',
 u'basics',
 u'basilica',
 u'basin',
 u'basis',
 u'basket',
 u'basser',
 u'batch',
 u'bath',
 u'baths',
 u'battista',
 u'battle',
 u'bauhaus',
 u'bavaria',
 u'bavarian',
 u'bay',
 u'bayreuth',
 u'bc',
 u'be',
 u'bearing',
 u'bears',
 u'beautiful',
 u'became',
 u'because',
 u'become',
 u'becoming',
 u'beech',
 u'been',
 u'before',
 u'began',
 u'begin',
 u'beginning',
 u'begun',
 u'behavioral',
 u'behest',
 u'being',
 u'belief',
 u'bell',
 u'belong',
 u'belongs',
 u'below',
 u'benchmark',
 u'bend',
 u'benedictine',
 u'benefit',
 u'benscheidt',
 u'bergpark',
 u'berkeley',
 u'berlin',
 u'berliner',
 u'berners',
 u'bernward',
 u'bertholdstein',
 u'bertot',
 u'besides',
 u'best',
 u'between',
 u'beuronese',
 u'bias',
 u'bible',
 u'bid',
 u'big',
 u'billeter',
 u'billion',
 u'binaries',
 u'binary',
 u'binding',
 u'bingen',
 u'bioinformatics',
 u'biological',
 u'biosphere',
 u'biotechnology',
 u'birds',
 u'birth',
 u'bishop',
 u'bismarck',
 u'bit',
 u'bitmap',
 u'bituminous',
 u'bjarne',
 u'black',
 u'blasewitz',
 u'bld',
 u'blend',
 u'block',
 u'blocks',
 u'blue',
 u'board',
 u'boats',
 u'bobrow',
 u'bode',
 u'body',
 u'boffrand',
 u'bonn',
 u'boo',
 u'book',
 u'books',
 u'border',
 u'bordering',
 u'born',
 u'borough',
 u'borrow',
 u'borrowed',
 u'both',
 u'bouman',
 u'boundaries',
 u'boundary',
 u'bounded',
 u'bounds',
 u'box',
 u'brace',
 u'branches',
 u'brandenburg',
 u'bread',
 u'break',
 u'bremen',
 u'bremer',
 u'bretten',
 u'breuer',
 u'brian',
 u'brick',
 u'bridge',
 u'bridges',
 u'bright',
 u'bring',
 u'bringing',
 u'brings',
 u'britannicus',
 u'broad',
 u'bronze',
 u'bronzeworks',
 u'brook',
 u'brother',
 u'brought',
 u'brow',
 u'browser',
 u'browsers',
 u'bruckgraben',
 u'bruno',
 u'br\xfchl',
 u'bsd',
 u'buffer',
 u'bugenhagen',
 u'bugs',
 u'build',
 u'builders',
 u'building',
 u'buildings',
 u'builds',
 u'built',
 u'bukovsk\xe9',
 u'bull',
 u'bulwark',
 u'bundesland',
 u'buntenbock',
 u'burgtor',
 u'burial',
 u'buried',
 u'burned',
 u'burnt',
 u'business',
 u'but',
 u'buttons',
 u'by',
 u'bytecode',
 u'byzantine',
 u'b\xe9zier',
 u'cabinetry',
 u'cadastral',
 u'calculus',
 u'call',
 u'called',
 u'calling',
 u'calls',
 u'came',
 u'caml',
 u'campanile',
 u'can',
 u'canada',
 u'canal',
 u'cannot',
 u'canonical',
 u'canton',
 u'capabilities',
 u'capability',
 u'capital',
 u'capitalist',
 u'carboniferous',
 u'care',
 u'carefree',
 u'carl',
 u'carolingian',
 u'carpathian',
 u'carpathians',
 u'carried',
 u'carved',
 u'case',
 u'casting',
 u'castle',
 u'castles',
 u'castra',
 u'cast\xe9ran',
 u'catalog',
 u'categorical',
 u'cathedral',
 u'cathedralis',
 u'cathedrals',
 u'catholic',
 u'catholicism',
 u'catholique',
 u'cause',
 u'causeway',
 u'causeways',
 u'celebration',
 u'center',
 u'centers',
 u'central',
 u'centralized',
 u'centre',
 u'centric',
 u'centuries',
 u'century',
 u'ceremony',
 u'certain',
 u'certified',
 u'cgi',
 u'chainsaw',
 u'chamber',
 u'chambers',
 u'chandelier',
 u'chandeliers',
 u'changed',
 u'changes',
 u'changing',
 u'channel',
 u'chapel',
 u'chapter',
 u'character',
 u'characterised',
 u'characteristic',
 u'characteristics',
 u'characterized',
 u'characters',
 u'charge',
 u'charged',
 u'charlemagne',
 u'charles',
 u'charlottenburg',
 u'charts',
 u'checked',
 u'checking',
 u'checks',
 u'chemistry',
 u'childers',
 u'chipperfield',
 u'chivalric',
 u'choir',
 u'chornohora',
 u'christ',
 u'christian',
 u'christianity',
 u'chronicle',
 u'chronicler',
 u'church',
 u'churches',
 u'ch\xe2teau',
 u'cii',
 u'cim',
 u'circle',
 u'cistercian',
 u'citadel',
 u'cites',
 u'cities',
 u'city',
 u'civic',
 u'civilization',
 u'cl',
 u'claimed',
 u'claims',
 u'clang',
 u'class',
 u'classes',
 u'classical',
 u'classicism',
 u'classification',
 u'classified',
 u'classpath',
 u'clause',
 u'clausthal',
 u'clean',
 u'clear',
 u'clearly',
 u'clemens',
 u'clerestory',
 u'cli',
 u'client',
 u'cloisters',
 u'clojure',
 u'clos',
 u'close',
 u'closed',
 u'closely',
 u'closer',
 u'closure',
 u'closures',
 u'cloth',
 u'cloud',
 u'clt',
 u'cluny',
 u'cm',
 u'cmdlet',
 u'cmdlets',
 u'co',
 u'coal',
 u'coast',
 u'coastal',
 u'coastline',
 u'cobol',
 u'cocoa',
 u'codasyl',
 u'codd',
 u'code',
 u'coded',
 u'codex',
 u'coining',
 u'coking',
 u'collaboration',
 u'collected',
 u'collection',
 u'collections',
 u'collective',
 u'collegiate',
 u'colmerauer',
 u'cologne',
 u'colon',
 u'coloni',
 u'colour',
 u'colours',
 u'column',
 u'com',
 u'combination',
 u'combinations',
 u'combinator',
 u'combine',
 u'combined',
 u'combines',
 u'combining',
 u'come',
 u'comes',
 u'comfort',
 u'coming',
 u'command',
 u'commandline',
 u'commands',
 u'commenced',
 u'commerce',
 u'commercial',
 u'commercialize',
 u'commercially',
 u'commission',
 u'commissioned',
 u'committee',
 u'common',
 u'commonly',
 u'communicate',
 u'communications',
 u'communist',
 u'communities',
 u'community',
 u'compact',
 ...]
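The start of the vocabulary is mostly numeric tokens, which is what motivates removing numbers later on. A minimal sketch of how one might measure that, assuming a small hand-copied sample of the feature names from Out[6] (the `sample_names` list is illustrative, not the full vocabulary):

```python
# A handful of feature names copied from Out[6]; the real list has 4930 entries.
sample_names = ["000", "10th", "1919", "4gl", "aachen", "abbey", "amd64"]

# Tokens made up of digits only, e.g. years and quantities.
purely_numeric = [name for name in sample_names if name.isdigit()]

# Tokens that contain at least one digit anywhere, e.g. "10th" or "amd64".
with_digit = [name for name in sample_names
              if any(ch.isdigit() for ch in name)]

print("%d purely numeric, %d containing a digit, %d total"
      % (len(purely_numeric), len(with_digit), len(sample_names)))
```

Run against the full output of `vectorizer.get_feature_names()`, the same two comprehensions would show how much of the vocabulary a digit filter would remove.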

In [7]:
### Train an LDA model with two topics

from sklearn.decomposition import LatentDirichletAllocation
hx_lda = LatentDirichletAllocation(n_topics=2)

In [8]:
hx_lda.fit(X)


/usr/local/lib/python2.7/site-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
  DeprecationWarning)
Out[8]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_jobs=1, n_topics=2, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
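The cell above relies on the `n_topics` argument and on the default `learning_method`, which is why the deprecation warning appears. On later scikit-learn releases the topic count is passed as `n_components`. The sketch below assumes such a newer release and uses a made-up 4x4 count matrix (`toy_counts`) instead of the real feature matrix. Setting `learning_method` and `random_state` explicitly is a suggestion for reproducibility, not something the original notebook does.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up document-term counts: 4 documents, 4 terms (illustrative only).
toy_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 1, 0],
    [0, 0, 4, 2],
    [0, 1, 2, 3],
])

# n_components replaces n_topics on newer scikit-learn releases.
# learning_method is set explicitly so no default-change warning is raised,
# and random_state is fixed so repeated runs give the same topics.
toy_lda = LatentDirichletAllocation(
    n_components=2,
    learning_method="batch",
    random_state=0,
)

# Each row of doc_topic is a probability distribution over the two topics.
doc_topic = toy_lda.fit_transform(toy_counts)
print(doc_topic.shape)
```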

In [9]:
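def display_topics(model, feature_names, no_top_words):
    # Print the no_top_words highest-weighted terms of each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort is ascending, so the reversed slice yields the top terms first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))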
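The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is compact but easy to misread. Here is a small worked example of what it selects, assuming a made-up topic weight vector and a five-word vocabulary (both are illustrative, not taken from the fitted model):

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights for a single topic.
feature_names = ["and", "language", "markup", "media", "the"]
topic = np.array([3.0, 7.5, 9.1, 8.2, 0.4])
no_top_words = 3

# argsort returns indices ordered from smallest to largest weight.
ascending = topic.argsort()

# Walking backwards from the end and stopping before position
# -(no_top_words + 1) keeps the no_top_words largest, biggest first.
top_idx = ascending[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

print(top_words)  # ['markup', 'media', 'language']
```

An equivalent and arguably more readable form is `topic.argsort()[::-1][:no_top_words]`.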

In [10]:
display_topics(hx_lda, vectorizer.get_feature_names(), 30)


Topic 0:
smil markup the and media presentations multimedia in to is it as language for transitions timing of synchronized items things images ˈsmaɪl recommended presenting layout video defines animations links embedding
Topic 1:
the of and in is to as it was language for by programming with on its world from that an has german heritage are used languages be site unesco germany

In [11]:
hx_lda.transform(X)


Out[11]:
array([[ 0.00736282,  0.99263718],
       [ 0.0062294 ,  0.9937706 ],
       [ 0.00178884,  0.99821116],
       [ 0.00661109,  0.99338891],
       [ 0.00254491,  0.99745509],
       [ 0.00712769,  0.99287231],
       [ 0.50088338,  0.49911662],
       [ 0.00286921,  0.99713079],
       [ 0.00679048,  0.99320952],
       [ 0.00389056,  0.99610944],
       [ 0.002443  ,  0.997557  ],
       [ 0.01343562,  0.98656438],
       [ 0.0056724 ,  0.9943276 ],
       [ 0.00170446,  0.99829554],
       [ 0.00488133,  0.99511867],
       [ 0.00173419,  0.99826581],
       [ 0.01119253,  0.98880747],
       [ 0.00670091,  0.99329909],
       [ 0.00387866,  0.99612134],
       [ 0.00287638,  0.99712362],
       [ 0.00558484,  0.99441516],
       [ 0.0062557 ,  0.9937443 ],
       [ 0.00497069,  0.99502931],
       [ 0.0022877 ,  0.9977123 ],
       [ 0.00183597,  0.99816403],
       [ 0.00220405,  0.99779595],
       [ 0.00868614,  0.99131386],
       [ 0.00757761,  0.99242239],
       [ 0.00350382,  0.99649618],
       [ 0.00924188,  0.99075812],
       [ 0.00342568,  0.99657432],
       [ 0.00948617,  0.99051383],
       [ 0.00759745,  0.99240255],
       [ 0.00222355,  0.99777645],
       [ 0.00485542,  0.99514458],
       [ 0.00201404,  0.99798596],
       [ 0.00139094,  0.99860906],
       [ 0.00755494,  0.99244506],
       [ 0.00750094,  0.99249906],
       [ 0.00668679,  0.99331321],
       [ 0.02217576,  0.97782424],
       [ 0.00256051,  0.99743949],
       [ 0.01959135,  0.98040865],
       [ 0.00353821,  0.99646179],
       [ 0.00255031,  0.99744969],
       [ 0.00330127,  0.99669873],
       [ 0.00676441,  0.99323559],
       [ 0.00414477,  0.99585523],
       [ 0.0041971 ,  0.9958029 ],
       [ 0.00907336,  0.99092664],
       [ 0.00221637,  0.99778363],
       [ 0.00454837,  0.99545163],
       [ 0.01391744,  0.98608256],
       [ 0.00330768,  0.99669232],
       [ 0.00323928,  0.99676072],
       [ 0.01305206,  0.98694794],
       [ 0.01197292,  0.98802708],
       [ 0.00302438,  0.99697562],
       [ 0.00185556,  0.99814444],
       [ 0.00443083,  0.99556917],
       [ 0.00323742,  0.99676258],
       [ 0.00288704,  0.99711296],
       [ 0.01264626,  0.98735374],
       [ 0.0028242 ,  0.9971758 ],
       [ 0.00203629,  0.99796371],
       [ 0.00255017,  0.99744983],
       [ 0.00238098,  0.99761902],
       [ 0.00298037,  0.99701963],
       [ 0.01165519,  0.98834481],
       [ 0.00384773,  0.99615227],
       [ 0.00280909,  0.99719091],
       [ 0.00361159,  0.99638841],
       [ 0.00353192,  0.99646808],
       [ 0.00206065,  0.99793935],
       [ 0.00228759,  0.99771241],
       [ 0.00708247,  0.99291753],
       [ 0.00403342,  0.99596658],
       [ 0.00237374,  0.99762626],
       [ 0.00216516,  0.99783484],
       [ 0.00736179,  0.99263821],
       [ 0.00796453,  0.99203547],
       [ 0.02778742,  0.97221258],
       [ 0.00622771,  0.99377229],
       [ 0.00450593,  0.99549407],
       [ 0.00180621,  0.99819379],
       [ 0.00351143,  0.99648857],
       [ 0.00694048,  0.99305952],
       [ 0.00255055,  0.99744945],
       [ 0.0031681 ,  0.9968319 ],
       [ 0.00753067,  0.99246933],
       [ 0.01343455,  0.98656545],
       [ 0.00284663,  0.99715337],
       [ 0.00498255,  0.99501745],
       [ 0.00296434,  0.99703566],
       [ 0.00797048,  0.99202952],
       [ 0.00613253,  0.99386747],
       [ 0.00442208,  0.99557792],
       [ 0.02064132,  0.97935868],
       [ 0.00435202,  0.99564798],
       [ 0.00520979,  0.99479021],
       [ 0.00213321,  0.99786679],
       [ 0.0022229 ,  0.9977771 ],
       [ 0.01560154,  0.98439846],
       [ 0.00430782,  0.99569218],
       [ 0.00185556,  0.99814444],
       [ 0.00715146,  0.99284854],
       [ 0.00573681,  0.99426319],
       [ 0.00661191,  0.99338809],
       [ 0.00315292,  0.99684708],
       [ 0.00268605,  0.99731395],
       [ 0.00308211,  0.99691789],
       [ 0.00458618,  0.99541382],
       [ 0.00628956,  0.99371044],
       [ 0.00927926,  0.99072074],
       [ 0.0033559 ,  0.9966441 ],
       [ 0.02020525,  0.97979475],
       [ 0.00246757,  0.99753243],
       [ 0.00770874,  0.99229126],
       [ 0.005154  ,  0.994846  ],
       [ 0.00940577,  0.99059423],
       [ 0.00592464,  0.99407536],
       [ 0.00295892,  0.99704108],
       [ 0.00601872,  0.99398128],
       [ 0.0049157 ,  0.9950843 ],
       [ 0.04160171,  0.95839829],
       [ 0.00209306,  0.99790694],
       [ 0.00372214,  0.99627786],
       [ 0.00450776,  0.99549224],
       [ 0.00509582,  0.99490418],
       [ 0.00559192,  0.99440808],
       [ 0.01564736,  0.98435264],
       [ 0.00983779,  0.99016221],
       [ 0.00292202,  0.99707798],
       [ 0.0056018 ,  0.9943982 ],
       [ 0.00984645,  0.99015355],
       [ 0.00587958,  0.99412042],
       [ 0.0017517 ,  0.9982483 ],
       [ 0.00453035,  0.99546965],
       [ 0.00365579,  0.99634421],
       [ 0.00732188,  0.99267812],
       [ 0.00367946,  0.99632054],
       [ 0.01498933,  0.98501067],
       [ 0.00347706,  0.99652294],
       [ 0.01012064,  0.98987936],
       [ 0.01119046,  0.98880954],
       [ 0.00541566,  0.99458434],
       [ 0.01235293,  0.98764707],
       [ 0.00526143,  0.99473857],
       [ 0.03103803,  0.96896197],
       [ 0.00796582,  0.99203418],
       [ 0.00540461,  0.99459539],
       [ 0.00820147,  0.99179853],
       [ 0.0046746 ,  0.9953254 ],
       [ 0.00452466,  0.99547534],
       [ 0.00515599,  0.99484401],
       [ 0.00488367,  0.99511633],
       [ 0.00294553,  0.99705447],
       [ 0.00198141,  0.99801859],
       [ 0.0023011 ,  0.9976989 ],
       [ 0.01319537,  0.98680463],
       [ 0.00186743,  0.99813257],
       [ 0.00603358,  0.99396642],
       [ 0.00131658,  0.99868342],
       [ 0.00374901,  0.99625099],
       [ 0.00500544,  0.99499456],
       [ 0.00532612,  0.99467388],
       [ 0.01010946,  0.98989054],
       [ 0.00649398,  0.99350602],
       [ 0.00294414,  0.99705586],
       [ 0.00380921,  0.99619079],
       [ 0.00417723,  0.99582277]])
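A 171-row array is hard to judge by eye. One way to summarise it is to take the most probable topic for each document and count how many documents land in each topic. The sketch below uses only rows 5 to 7 copied from Out[11]. The variable names are illustrative, and on the real data `doc_topic` would be `hx_lda.transform(X)`.

```python
import numpy as np

# Rows 5, 6 and 7 of Out[11] (Aachen Cathedral, SMIL, Scala).
doc_topic = np.array([
    [0.00712769, 0.99287231],
    [0.50088338, 0.49911662],
    [0.00286921, 0.99713079],
])

# Index of the highest-probability topic for each document.
dominant = doc_topic.argmax(axis=1)

# Number of documents assigned to each topic.
per_topic = np.bincount(dominant, minlength=doc_topic.shape[1])

print(dominant)   # [1 0 1]
print(per_topic)  # [1 2]
```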

In [ ]:
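## The result is really bad: every document except row 6 (the SMIL article)
## is assigned to Topic 1 with probability above 0.95, and Topic 1 is dominated
## by function words such as "the", "of" and "and".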

In [ ]:
## Next steps: stemming, removing numbers and removing sparse terms.
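A minimal sketch of how two of these steps could be expressed through `CountVectorizer` options, assuming four made-up sentences in place of the real extracts. The letters-only `token_pattern` and the `min_df=2` threshold are illustrative choices. Stop-word removal is an addition suggested by the Topic 1 output, not one of the steps listed above. Stemming is left out because it needs an external stemmer (for example one from NLTK) plugged in through a custom analyzer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the Wikipedia extracts.
docs = [
    "Perl is a family of programming languages from 1987.",
    "Ruby is a dynamic programming language from 1995.",
    "The cathedral in Trier is a heritage site.",
    "Aachen cathedral is a UNESCO heritage site since 1978.",
]

clean_vectorizer = CountVectorizer(
    stop_words="english",                   # drop function words (extra step)
    token_pattern=r"(?u)\b[^\W\d_]{2,}\b",  # letters only, so no numeric tokens
    min_df=2,                               # drop terms seen in a single document
)
X_clean = clean_vectorizer.fit_transform(docs)

vocab = sorted(clean_vectorizer.vocabulary_)
print(vocab)  # ['cathedral', 'heritage', 'programming', 'site']
```

Note that "language" and "languages" each occur in only one document and are both dropped by `min_df=2`. A stemmer would merge them into one term that survives the threshold, which is the kind of gain stemming is meant to bring.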

In [ ]: