Import GraphLab



In [3]:

    
import graphlab

Load the dataset



In [4]:

    
people = graphlab.SFrame('people_wiki.gl/')

The data contains Wikipedia articles about different people.



In [5]:

    
people.head()









    Out[5]:

+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| URI                                                   | name                | text                                                   |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| <http://dbpedia.org/resource/Digby_Morrell> ...       | Digby Morrell       | digby morrell born 10 october 1979 is a former ...     |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ...      | Alfred J. Lewy      | alfred j lewy aka sandy lewy graduated from ...        |
| <http://dbpedia.org/resource/Harpdog_Brown> ...       | Harpdog Brown       | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ...    |
| <http://dbpedia.org/resource/G-Enka> ...              | G-Enka              | henry krvits born 30 december 1974 in tallinn ...      |
| <http://dbpedia.org/resource/Sam_Henderson> ...       | Sam Henderson       | sam henderson born october 18 1969 is an ...           |
| <http://dbpedia.org/resource/Aaron_LaCrate> ...       | Aaron LaCrate       | aaron lacrate is an american music producer ...        |
| <http://dbpedia.org/resource/Trevor_Ferguson> ...     | Trevor Ferguson     | trevor ferguson aka john farrow born 11 november ...   |
| <http://dbpedia.org/resource/Grant_Nelson> ...        | Grant Nelson        | grant nelson born 27 april 1971 in london ...          |
| <http://dbpedia.org/resource/Cathy_Caruth> ...        | Cathy Caruth        | cathy caruth born 1955 is frank h t rhodes ...         |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+

[10 rows x 3 columns]



In [6]:

    
len(people)









    Out[6]:





59071

We will look up former president Barack Obama.



In [11]:

    
obama = people[people['name'] == 'Barack Obama']



In [12]:

    
obama









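    Out[12]:

+------------------------------------------------+--------------+------------------------------------------------------+
| URI                                            | name         | text                                                 |
+------------------------------------------------+--------------+------------------------------------------------------+
| <http://dbpedia.org/resource/Barack_Obama> ... | Barack Obama | barack hussein obama ii brk husen bm born august ... |
+------------------------------------------------+--------------+------------------------------------------------------+

[? rows x 3 columns]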
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.



In [13]:

    
obama['text']









    Out[13]:





dtype: str
Rows: ?
['barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campaign in 2007 and after a close primary campaign against hillary rodham clinton in 2008 he won sufficient delegates in the democratic party primaries to receive the presidential nomination he then defeated republican nominee john mccain in the general election and was inaugurated as president on january 20 2009 nine months after his election obama was named the 2009 nobel peace prize laureateduring his first two years in office obama signed into law economic stimulus legislation in response to the great recession in the form of the american recovery and reinvestment act of 2009 and the tax relief unemployment insurance reauthorization and job creation act of 2010 other major domestic initiatives in his first term included the patient protection and affordable care act often referred to as obamacare the doddfrank wall street reform and consumer protection act and the dont ask dont tell repeal act of 2010 in foreign policy obama ended us military involvement in the iraq war increased us troop levels in 
afghanistan signed the new start arms control treaty with russia ordered us military involvement in libya and ordered the military operation that resulted in the death of osama bin laden in january 2011 the republicans regained control of the house of representatives as the democratic party lost a total of 63 seats and after a lengthy debate over federal spending and whether or not to raise the nations debt limit obama signed the budget control act of 2011 and the american taxpayer relief act of 2012obama was reelected president in november 2012 defeating republican nominee mitt romney and was sworn in for a second term on january 20 2013 during his second term obama has promoted domestic policies related to gun control in response to the sandy hook elementary school shooting and has called for full equality for lgbt americans while his administration has filed briefs which urged the supreme court to strike down the defense of marriage act of 1996 and californias proposition 8 as unconstitutional in foreign policy obama ordered us military involvement in iraq in response to gains made by the islamic state in iraq after the 2011 withdrawal from iraq continued the process of ending us combat operations in afghanistan and has sought to normalize us relations with cuba', ... ]

Count the words in Obama's article



In [14]:

    
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])



In [15]:

    
print obama['word_count']









    



[{'operations': 1L, 'represent': 1L, 'office': 2L, 'unemployment': 1L, 'doddfrank': 1L, 'over': 1L, 'unconstitutional': 1L, 'domestic': 2L, 'major': 1L, 'years': 1L, 'against': 1L, 'proposition': 1L, 'seats': 1L, 'graduate': 1L, 'debate': 1L, 'before': 1L, 'death': 1L, '20': 2L, 'taxpayer': 1L, 'representing': 1L, 'obamacare': 1L, 'barack': 1L, 'to': 14L, '4': 1L, 'policy': 2L, '8': 1L, 'he': 7L, '2011': 3L, '2010': 2L, '2013': 1L, '2012': 1L, 'bin': 1L, 'then': 1L, 'his': 11L, 'march': 1L, 'gains': 1L, 'cuba': 1L, 'school': 3L, '1992': 1L, 'new': 1L, 'not': 1L, 'during': 2L, 'ending': 1L, 'continued': 1L, 'presidential': 2L, 'states': 3L, 'husen': 1L, 'osama': 1L, 'californias': 1L, 'equality': 1L, 'prize': 1L, 'lost': 1L, 'made': 1L, 'inaugurated': 1L, 'january': 3L, 'university': 2L, 'rights': 1L, 'july': 1L, 'gun': 1L, 'stimulus': 1L, 'rodham': 1L, 'troop': 1L, 'withdrawal': 1L, 'brk': 1L, 'nine': 1L, 'where': 1L, 'referred': 1L, 'affordable': 1L, 'attorney': 1L, 'on': 2L, 'often': 1L, 'senate': 3L, 'regained': 1L, 'national': 2L, 'creation': 1L, 'related': 1L, 'hawaii': 1L, 'born': 2L, 'second': 2L, 'defense': 1L, 'election': 3L, 'close': 1L, 'operation': 1L, 'insurance': 1L, 'sandy': 1L, 'afghanistan': 2L, 'initiatives': 1L, 'for': 4L, 'reform': 1L, 'house': 2L, 'review': 1L, 'representatives': 2L, 'current': 1L, 'state': 1L, 'won': 1L, 'limit': 1L, 'victory': 1L, 'unsuccessfully': 1L, 'reauthorization': 1L, 'keynote': 1L, 'full': 1L, 'patient': 1L, 'august': 1L, 'degree': 1L, '44th': 1L, 'bm': 1L, 'mitt': 1L, 'attention': 1L, 'delegates': 1L, 'lgbt': 1L, 'job': 1L, 'harvard': 2L, 'term': 3L, 'served': 2L, 'ask': 1L, 'november': 2L, 'debt': 1L, 'by': 1L, 'wall': 1L, 'care': 1L, 'received': 1L, 'great': 1L, 'signed': 3L, 'libya': 1L, 'receive': 1L, 'of': 18L, 'months': 1L, 'urged': 1L, 'foreign': 2L, 'american': 3L, 'protection': 2L, 'economic': 1L, 'act': 8L, 'military': 4L, 'hussein': 1L, 'or': 1L, 'first': 3L, 'control': 4L, 'named': 1L, 'clinton': 1L, 
'dont': 2L, 'campaign': 3L, 'russia': 1L, 'civil': 1L, 'reinvestment': 1L, 'into': 1L, 'address': 1L, 'primary': 2L, 'community': 1L, 'mccain': 1L, 'down': 1L, 'hook': 1L, '63': 1L, 'americans': 1L, 'elementary': 1L, 'total': 1L, 'earning': 1L, 'repeal': 1L, 'from': 3L, 'raise': 1L, 'district': 1L, 'spending': 1L, 'republican': 2L, 'legislation': 1L, 'three': 1L, 'relations': 1L, 'nobel': 1L, 'start': 1L, 'tell': 1L, 'iraq': 4L, 'convention': 1L, 'resulted': 1L, 'john': 1L, 'was': 5L, '2012obama': 1L, 'form': 1L, 'that': 1L, 'tax': 1L, 'sufficient': 1L, 'republicans': 1L, 'strike': 1L, 'hillary': 1L, 'ended': 1L, 'arms': 1L, 'honolulu': 1L, 'filed': 1L, 'worked': 1L, 'hold': 1L, 'with': 3L, 'obama': 9L, 'street': 1L, 'ii': 1L, 'has': 4L, '1997': 1L, '1996': 1L, 'whether': 1L, 'reelected': 1L, 'budget': 1L, 'us': 6L, 'nations': 1L, 'recession': 1L, 'while': 1L, 'taught': 1L, 'marriage': 1L, 'policies': 1L, 'promoted': 1L, 'called': 1L, 'and': 21L, 'supreme': 1L, 'ordered': 3L, 'nominee': 2L, 'process': 1L, '2000in': 1L, 'is': 2L, 'romney': 1L, 'briefs': 1L, 'defeated': 1L, 'general': 1L, '13th': 1L, 'as': 6L, 'at': 2L, 'in': 30L, 'sought': 1L, 'organizer': 1L, 'shooting': 1L, 'increased': 1L, 'normalize': 1L, 'lengthy': 1L, 'united': 3L, 'court': 1L, 'recovery': 1L, 'laden': 1L, 'laureateduring': 1L, 'peace': 1L, 'administration': 1L, '1961': 1L, 'illinois': 2L, 'other': 1L, 'which': 1L, 'party': 3L, 'primaries': 1L, 'sworn': 1L, 'relief': 2L, 'war': 1L, 'columbia': 1L, 'combat': 1L, 'after': 4L, 'islamic': 1L, 'running': 1L, 'levels': 1L, 'two': 1L, 'involvement': 3L, 'response': 3L, 'included': 1L, 'president': 4L, 'law': 6L, 'nomination': 1L, '2008': 1L, 'a': 7L, '2009': 3L, 'chicago': 2L, 'constitutional': 1L, 'defeating': 1L, 'treaty': 1L, 'federal': 1L, '2007': 1L, '2004': 3L, 'african': 1L, 'the': 40L, 'democratic': 4L, 'consumer': 1L, 'began': 1L, 'terms': 1L}]
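
For readers without GraphLab, the same kind of per-document word count can be sketched in plain Python. This is a minimal illustration, not GraphLab's implementation: the helper name `count_words` and the whitespace-only tokenization are assumptions, chosen because the article text is already lowercase with punctuation removed.

```python
from collections import Counter

def count_words(text):
    """Return a dict mapping each whitespace-separated token to its count."""
    # Hypothetical helper: splits on whitespace only, which suits the
    # already-normalized (lowercase, no punctuation) article text.
    return dict(Counter(text.split()))

sample = "barack hussein obama ii is the 44th and current president of the united states"
counts = count_words(sample)
print(counts["the"])  # number of times "the" appears in the sample
```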

Convert the dictionary into a table



In [16]:

    
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

Sort to find the most frequent words.



In [17]:

    
obama_word_count_table.head()









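    Out[17]:

+------------------+-------+
| word             | count |
+------------------+-------+
| normalize        | 1     |
| sought           | 1     |
| combat           | 1     |
| continued        | 1     |
| unconstitutional | 1     |
| 8                | 1     |
| californias      | 1     |
| 1996             | 1     |
| marriage         | 1     |
| defense          | 1     |
+------------------+-------+

[10 rows x 2 columns]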



In [18]:

    
obama_word_count_table.sort('count',ascending=False)









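    Out[18]:

+-------+-------+
| word  | count |
+-------+-------+
| the   | 40    |
| in    | 30    |
| and   | 21    |
| of    | 18    |
| to    | 14    |
| his   | 11    |
| obama | 9     |
| act   | 8     |
| a     | 7     |
| he    | 7     |
+-------+-------+

[273 rows x 2 columns]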
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
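
The stack-then-sort step can be mimicked with plain Python tuples. The sketch below is only an analogy for `stack` and `sort`, using a few of the counts printed earlier; the variable names are hypothetical.

```python
# A few (word, count) pairs taken from the printed word_count dictionary.
word_count = {"the": 40, "in": 30, "and": 21, "obama": 9, "act": 8, "iraq": 4}

# "Stack": turn the dictionary into one (word, count) row per entry.
rows = list(word_count.items())

# "Sort": order the rows by count, largest first (like ascending=False).
rows_sorted = sorted(rows, key=lambda row: row[1], reverse=True)

for word, count in rows_sorted[:3]:
    print(word, count)
```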

The most common words (the, in, and, of, ...) tell us little about the article.

We will use TF-IDF to address this problem. First, we compute the word counts for every article and store them as a new column.



In [19]:

    
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()









    Out[19]:

+-------------------------------------------------------+---------------------+--------------------------------------------------------+--------------------------------------------------------+
| URI                                                   | name                | text                                                   | word_count                                             |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+--------------------------------------------------------+
| <http://dbpedia.org/resource/Digby_Morrell> ...       | Digby Morrell       | digby morrell born 10 october 1979 is a former ...     | {'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ... |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ...      | Alfred J. Lewy      | alfred j lewy aka sandy lewy graduated from ...        | {'precise': 1L, 'thomas': 1L, 'closely': 1L, ...       |
| <http://dbpedia.org/resource/Harpdog_Brown> ...       | Harpdog Brown       | harpdog brown is a singer and harmonica player who ... | {'just': 1L, 'issued': 1L, 'mainly': 1L, ...           |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ...    | {'all': 1L, 'bauforschung': 1L, ...                    |
| <http://dbpedia.org/resource/G-Enka> ...              | G-Enka              | henry krvits born 30 december 1974 in tallinn ...      | {'legendary': 1L, 'gangstergenka': 1L, ...             |
| <http://dbpedia.org/resource/Sam_Henderson> ...       | Sam Henderson       | sam henderson born october 18 1969 is an ...           | {'now': 1L, 'currently': 1L, 'less': 1L, 'being': ...  |
| <http://dbpedia.org/resource/Aaron_LaCrate> ...       | Aaron LaCrate       | aaron lacrate is an american music producer ...        | {'exclusive': 2L, 'producer': 1L, 'tribe': ...         |
| <http://dbpedia.org/resource/Trevor_Ferguson> ...     | Trevor Ferguson     | trevor ferguson aka john farrow born 11 november ...   | {'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ...    |
| <http://dbpedia.org/resource/Grant_Nelson> ...        | Grant Nelson        | grant nelson born 27 april 1971 in london ...          | {'houston': 1L, 'frankie': 1L, 'labels': ...           |
| <http://dbpedia.org/resource/Cathy_Caruth> ...        | Cathy Caruth        | cathy caruth born 1955 is frank h t rhodes ...         | {'phenomenon': 1L, 'deborash': 1L, ...                 |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+--------------------------------------------------------+

[10 rows x 4 columns]



In [20]:

    
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])
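
To make the weighting concrete, here is a minimal pure-Python sketch of one common TF-IDF variant, tf(w, d) * log(N / df(w)), where tf is the raw count of word w in document d, N is the number of documents, and df(w) is the number of documents containing w. Whether `graphlab.text_analytics.tf_idf` uses exactly this formula is an assumption here; the function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(word_counts):
    """Compute tf * log(N / df) for a list of per-document word-count dicts."""
    n_docs = len(word_counts)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in counts.items()}
        for counts in word_counts
    ]

# Toy corpus (hypothetical): three tiny "articles" as word-count dicts.
docs = [
    {"the": 3, "president": 2, "senate": 1},
    {"the": 2, "singer": 2, "album": 1},
    {"the": 4, "senate": 1, "governor": 2},
]
weights = tf_idf(docs)
print(weights[0])
```

Under this variant, a word that occurs in every document, such as "the", gets weight zero, which is why very common words drop out of the top of the ranking.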

Examine the TF-IDF of Obama's article



In [21]:

    
obama = people[people['name'] == 'Barack Obama']



In [22]:

    
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)









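    Out[22]:

+-------------+---------------+
| word        | tfidf         |
+-------------+---------------+
| obama       | 43.2956530721 |
| act         | 27.678222623  |
| iraq        | 17.747378588  |
| control     | 14.8870608452 |
| law         | 14.7229357618 |
| ordered     | 14.5333739509 |
| military    | 13.1159327785 |
| involvement | 12.7843852412 |
| response    | 12.7843852412 |
| democratic  | 12.4106886973 |
+-------------+---------------+

[273 rows x 2 columns]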
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

TF-IDF is more informative: the top-weighted words (obama, act, iraq, ...) are specific to the article.

Build a nearest neighbors model.



In [23]:

    
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')









    




Starting brute force nearest neighbors model training.
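
A brute-force nearest neighbors search compares the query against every reference row and keeps the closest ones. The sketch below illustrates the idea on sparse dictionary vectors using cosine distance. The choice of metric is an assumption: GraphLab selects its own default for dictionary features, so the distances in this notebook should not be expected to match. The names `cosine_distance` and `query` are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def query(reference, target, k=5):
    """Return the k (label, distance) pairs closest to `target`, nearest first."""
    scored = [(label, cosine_distance(target, vec)) for label, vec in reference.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy TF-IDF vectors (made-up numbers, for illustration only).
reference = {
    "politician A": {"senate": 2.0, "election": 1.5, "law": 1.0},
    "politician B": {"senate": 1.0, "election": 2.0},
    "singer C": {"album": 3.0, "tour": 1.0},
}
neighbors = query(reference, reference["politician A"], k=2)
print(neighbors)
```

When the query row is itself part of the reference set, its nearest match is that same row at a distance of (approximately) zero, so the second-ranked result is the first genuinely different neighbor.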

Which person is most closely related to Obama?



In [18]:

    
knn_model.query(obama)









    




Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 62.4ms       |
| Done         |         | 100         | 468.001ms    |
+--------------+---------+-------------+--------------+

    Out[18]:

+-------------+-----------------+----------------+------+
| query_label | reference_label | distance       | rank |
+-------------+-----------------+----------------+------+
| 0           | Barack Obama    | 0.0            | 1    |
| 0           | Joe Biden       | 0.794117647059 | 2    |
| 0           | Joe Lieberman   | 0.794685990338 | 3    |
| 0           | Kelly Ayotte    | 0.811989100817 | 4    |
| 0           | Bill Clinton    | 0.813852813853 | 5    |
+-------------+-----------------+----------------+------+

[5 rows x 4 columns]

Other examples



In [19]:

    
swift = people[people['name'] == 'Taylor Swift']



In [20]:

    
knn_model.query(swift)









    




Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 15.6ms       |
| Done         |         | 100         | 343.2ms      |
+--------------+---------+-------------+--------------+

    Out[20]:

+-------------+------------------+----------------+------+
| query_label | reference_label  | distance       | rank |
+-------------+------------------+----------------+------+
| 0           | Taylor Swift     | 0.0            | 1    |
| 0           | Carrie Underwood | 0.76231884058  | 2    |
| 0           | Alicia Keys      | 0.764705882353 | 3    |
| 0           | Jordin Sparks    | 0.769633507853 | 4    |
| 0           | Leona Lewis      | 0.776119402985 | 5    |
+-------------+------------------+----------------+------+

[5 rows x 4 columns]



In [21]:

    
jolie = people[people['name'] == 'Angelina Jolie']



In [22]:

    
knn_model.query(jolie)









    




Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 15.6ms       |
| Done         |         | 100         | 374.401ms    |
+--------------+---------+-------------+--------------+

    Out[22]:

+-------------+--------------------+----------------+------+
| query_label | reference_label    | distance       | rank |
+-------------+--------------------+----------------+------+
| 0           | Angelina Jolie     | 0.0            | 1    |
| 0           | Brad Pitt          | 0.784023668639 | 2    |
| 0           | Julianne Moore     | 0.795857988166 | 3    |
| 0           | Billy Bob Thornton | 0.803069053708 | 4    |
| 0           | George Clooney     | 0.8046875      | 5    |
+-------------+--------------------+----------------+------+

[5 rows x 4 columns]



In [23]:

    
arnold = people[people['name'] == 'Arnold Schwarzenegger']



In [24]:

    
knn_model.query(arnold)









    




Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 15.6ms       |
| Done         |         | 100         | 358.801ms    |
+--------------+---------+-------------+--------------+

    Out[24]:

+-------------+-----------------------+----------------+------+
| query_label | reference_label       | distance       | rank |
+-------------+-----------------------+----------------+------+
| 0           | Arnold Schwarzenegger | 0.0            | 1    |
| 0           | Jesse Ventura         | 0.818918918919 | 2    |
| 0           | John Kitzhaber        | 0.824615384615 | 3    |
| 0           | Lincoln Chafee        | 0.833876221498 | 4    |
| 0           | Anthony Foxx          | 0.833910034602 | 5    |
+-------------+-----------------------+----------------+------+

[5 rows x 4 columns]


