textblob: another module for NLP tasks (NLTK + pattern)

textblob is a text-processing library for Python that lets you carry out Natural Language Processing tasks such as morphological analysis, entity extraction, sentiment analysis, machine translation, and so on.

It is built on top of two other well-known Python libraries: NLTK and pattern. The main advantage of textblob is that it combines both tools behind a simpler interface.

We will follow this tutorial to learn how to use some of its most interesting features.

The first step is to import the TextBlob object, which gives us access to all the tools the library includes.


In [1]:
from textblob import TextBlob

Let's create our first textblob through the TextBlob object. Think of these textblobs as a kind of Python string, already analyzed and enriched with some extra features.


In [2]:
texto = '''In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy. The consumers who 
participate deserve a  very clear picture of the risks they're taking'''
t = TextBlob(texto)
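
As a quick sketch of that string-like behaviour, a TextBlob supports many of the usual string operations (upper-casing, find, slicing) directly on the `t` object we just created:

In [ ]:
# TextBlob objects behave much like Python strings
print(t.upper())            # upper-cased copy of the whole text
print(t.find("sharing"))    # index of the first match, as with str.find
print(t[0:11])              # slicing returns another TextBlob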

In [3]:
print(t.sentences)


[Sentence("In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy."), Sentence("The consumers who 
participate deserve a  very clear picture of the risks they're taking")]

In [4]:
print('We have', len(t.sentences), 'sentences.\n')

for sentence in t.sentences:
    print(sentence)
    print('-' * 75)


We have 2 sentences.

In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy.
---------------------------------------------------------------------------
The consumers who 
participate deserve a  very clear picture of the risks they're taking
---------------------------------------------------------------------------

Processing sentences, words, and entities

We can split our sample text into sentences and words simply by accessing the .sentences and .words properties. Let's print them out:


In [5]:
# print the sentences
for sentence in t.sentences:
    print(sentence)
    print('-' * 75)
    
# and the words
print(t.words)
print(texto.split())


In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy.
---------------------------------------------------------------------------
The consumers who 
participate deserve a  very clear picture of the risks they're taking
---------------------------------------------------------------------------
['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 'economy', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', 'they', "'re", 'taking']
['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft,', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 'economy.', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', "they're", 'taking']
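
Besides .sentences and .words, word frequencies are available through the .word_counts dictionary (keys are lower-cased words), and the WordList returned by .words offers an equivalent count method. A minimal sketch reusing the `t` object above:

In [ ]:
# word frequencies: .word_counts is a dictionary of lower-cased word counts
print(t.word_counts['the'])   # occurrences of "the" (case-insensitive)
print(t.words.count('the'))   # same figure via the WordList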

The .noun_phrases property gives us the list of entities (strictly speaking, noun phrases) contained in our textblob. This is how it works.


In [6]:
print("el texto de ejemplo contiene", len(t.noun_phrases), "entidades")
for element in t.noun_phrases:
    print("-", element)


the sample text contains 8 entities
- new lawsuits
- uber
- lyft
- top prosecutors
- los angeles
- san francisco
- important point
- clear picture

In [7]:
# playing with lemmas, singulars, and plurals
for word in t.words:
    if word.endswith("s"):
        print(word.lemmatize(), word, word.singularize())
    else:
        print(word.lemmatize(), word, word.pluralize())


In In Ins
new new news
lawsuit lawsuits lawsuit
brought brought broughts
against against againsts
the the thes
ride-sharing ride-sharing ride-sharings
company companies company
Uber Uber Ubers
and and ands
Lyft Lyft Lyfts
the the thes
top top tops
prosecutor prosecutors prosecutor
in in ins
Los Los Lo
Angeles Angeles Angele
and and ands
San San Sans
Francisco Francisco Franciscoes
county counties county
make make makes
an an some
important important importants
point point points
about about abouts
the the thes
lightly lightly lightlies
regulated regulated regulateds
sharing sharing sharings
economy economy economies
The The Thes
consumer consumers consumer
who who whoes
participate participate participates
deserve deserve deserves
a a some
very very veries
clear clear clears
picture picture pictures
of of ofs
the the thes
risk risks risk
they they they
're 're 'res
taking taking takings

In [8]:
# how can we make lemmatization smarter?
for item in t.tags:
    if item[1] == 'NN':
        print(item[0], '-->', item[0].pluralize())
    elif item[1] == 'NNS':
        print(item[0], '-->', item[0].singularize())
    else:
        print(item[0], item[0].lemmatize())


In In
new new
lawsuits --> lawsuit
brought brought
against against
the the
ride-sharing ride-sharing
companies --> company
Uber Uber
and and
Lyft Lyft
the the
top top
prosecutors --> prosecutor
in in
Los Los
Angeles Angeles
and and
San San
Francisco Francisco
counties --> county
make make
an an
important important
point --> points
about about
the the
lightly lightly
regulated regulated
sharing sharing
economy --> economies
The The
consumers --> consumer
who who
participate participate
deserve deserve
a a
very very
clear clear
picture --> pictures
of of
the the
risks --> risk
they they
're 're
taking taking
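
A complementary trick: Word.lemmatize accepts a WordNet part-of-speech tag ('n', 'v', 'a', 'r'), which often gives better lemmas than relying on the default noun reading. A minimal sketch:

In [ ]:
from textblob import Word

# Word.lemmatize accepts a WordNet POS tag: 'n', 'v', 'a', 'r'
print(Word("went").lemmatize("v"))       # -> go
print(Word("lawsuits").lemmatize())      # defaults to noun -> lawsuit
print(Word("regulated").lemmatize("v"))  # verb reading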

Syntactic parsing

Although other analyzers can be plugged in, by default the .parse() method invokes the morphosyntactic parser from the pattern.en module that you already know (a sketch of how to swap in different components follows the example below).


In [9]:
# syntactic parsing
print(t.parse())


In/IN/B-PP/B-PNP new/JJ/B-NP/I-PNP lawsuits/NNS/I-NP/I-PNP brought/VBN/B-VP/I-PNP against/IN/B-PP/B-PNP the/DT/B-NP/I-PNP ride-sharing/JJ/I-NP/I-PNP companies/NNS/I-NP/I-PNP Uber/NNP/I-NP/I-PNP and/CC/O/O Lyft/NNP/B-NP/O ,/,/O/O the/DT/B-NP/O top/JJ/I-NP/O prosecutors/NNS/I-NP/O in/IN/B-PP/B-PNP Los/NNP/B-NP/I-PNP Angeles/NNP/I-NP/I-PNP and/CC/I-NP/I-PNP San/NNP/I-NP/I-PNP Francisco/NNP/I-NP/I-PNP counties/NNS/I-NP/I-PNP make/VB/B-VP/O an/DT/B-NP/O important/JJ/I-NP/O point/NN/I-NP/O about/IN/B-PP/O the/DT/O/O lightly/RB/B-VP/O regulated/VBN/I-VP/O sharing/VBG/I-VP/O economy/NN/B-NP/O ././O/O
The/DT/B-NP/O consumers/NNS/I-NP/O who/WP/O/O participate/VB/B-VP/O deserve/VBP/I-VP/O a/DT/B-NP/O very/RB/I-NP/O clear/JJ/I-NP/O picture/NN/I-NP/O of/IN/B-PP/B-PNP the/DT/B-NP/I-PNP risks/NNS/I-NP/I-PNP they/PRP/I-NP/I-PNP '/POS/O/O re/NN/B-NP/O taking/VBG/B-VP/O
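
As mentioned above, the default components can be swapped out. A minimal sketch assuming the ConllExtractor and NLTKTagger classes shipped with textblob (the corresponding NLTK corpora may need to be downloaded first):

In [ ]:
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor
from textblob.taggers import NLTKTagger

# override the default noun-phrase extractor and POS tagger
blob = TextBlob(texto, np_extractor=ConllExtractor(), pos_tagger=NLTKTagger())
print(blob.noun_phrases)
print(blob.tags)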

Machine translation

Starting from any text processed with TextBlob, we can access a fairly good machine translator through the .translate method. Note how it is used: specifying the target language is mandatory, while the source language can be detected from the input text (see the sketch after these examples).


In [10]:
# from Chinese into English and Spanish
oracion_zh = "中国探月工程 亦稱嫦娥工程,是中国启动的第一个探月工程,于2003年3月1日正式启动"
t_zh = TextBlob(oracion_zh)
print(t_zh.translate(from_lang="zh-CN", to="en"))
print(t_zh.translate(from_lang="zh-CN", to="es"))

print("--------------")

t_es = TextBlob(u"La deuda pública ha marcado nuevos récords en España en el tercer trimestre")
print(t_es.translate(to="el"))
print(t_es.translate(to="ru"))
print(t_es.translate(to="eu"))
print(t_es.translate(to="fi"))
print(t_es.translate(to="fr"))
print(t_es.translate(to="nl"))
print(t_es.translate(to="gl"))
print(t_es.translate(to="ca"))
print(t_es.translate(to="zh"))
print(t_es.translate(to="la"))
print(t_es.translate(to="cs"))

# it does not work as well with slang
print("--------------")
t_ita = TextBlob(u"Sono andato a Milano e mi sono divertito un bordello.")
print(t_ita.translate(to="en"))
print(t_ita.translate(to="es"))


China lunar exploration project, also known as Chang'e project, is the first lunar exploration project launched in China, officially launched on March 1, 2003
Programa de Exploración Lunar chino, también conocido como proyecto Chang E es comenzar el primer programa de exploración lunar de China, el 1 de marzo de 2003 lanzó oficialmente
--------------
Το δημόσιο χρέος έχει θέσει νέο ρεκόρ στην Ισπανία κατά το τρίτο τρίμηνο
Госдолг установил новые рекорды в Испании в третьем квартале
zor publikoa Espainian erregistro berriak ezarri du hirugarren hiruhilekoan
Julkinen velka on asettanut uusia ennätyksiä Espanjassa kolmannella neljänneksellä
La dette publique a établi de nouveaux records en Espagne au troisième trimestre
De overheidsschuld heeft nieuwe records in Spanje in het derde kwartaal
A débeda pública comezou novas marcas en España no terceiro trimestre
El deute públic ha marcat nous rècords a Espanya en el tercer trimestre
公共债务在第三季度创下的西班牙新纪录
Debitum palam novum records in Hispaniam profectus est, in tertia quartam
Veřejný dluh vytváří nové rekordy ve Španělsku se ve třetím čtvrtletí
--------------
I went to Milan and enjoyed a brothel.
Fui a Milán y me gustó mucho un burdel.
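
The source language can also be queried explicitly with .detect_language(); like .translate, it relies on an online translation service, so the exact codes returned may vary. A minimal sketch reusing the blobs above:

In [ ]:
# ask the translator which language it thinks the text is written in
print(t_zh.detect_language())   # should report a Chinese language code
print(t_es.detect_language())   # should report 'es'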

WordNet

textblob, or more precisely any object of the Word class, gives us access to the information in WordNet.


In [11]:
# WordNet
from textblob import Word
from textblob.wordnet import VERB

# how many synsets does "car" have?
word = Word("car")
print(word.synsets)

# get the synsets of the word "hack" as a verb
print(Word("hack").get_synsets(pos=VERB))

# print the list of definitions of "car"
print(Word("car").definitions)

# walk the hypernym hierarchy
for s in word.synsets:
    print(s.hypernym_paths())


[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]
['a motor vehicle with four wheels; usually propelled by an internal combustion engine', 'a wheeled vehicle adapted to the rails of railroad', 'the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant', 'where passengers ride up and down', 'a conveyance for passengers or freight on a cable railway']
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('container.n.01'), Synset('wheeled_vehicle.n.01'), Synset('self-propelled_vehicle.n.01'), Synset('motor_vehicle.n.01'), Synset('car.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('conveyance.n.03'), Synset('vehicle.n.01'), Synset('wheeled_vehicle.n.01'), Synset('self-propelled_vehicle.n.01'), Synset('motor_vehicle.n.01'), Synset('car.n.01')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('container.n.01'), Synset('wheeled_vehicle.n.01'), Synset('car.n.02')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('conveyance.n.03'), Synset('vehicle.n.01'), Synset('wheeled_vehicle.n.01'), Synset('car.n.02')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('car.n.03')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('car.n.04')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('cable_car.n.01')]]
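
The Synset objects exposed through textblob.wordnet also support the usual WordNet similarity measures. A minimal sketch comparing two vehicle senses:

In [ ]:
from textblob.wordnet import Synset

# path_similarity returns a value between 0 and 1
car = Synset('car.n.01')
truck = Synset('truck.n.01')
print(car.path_similarity(truck))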

Sentiment analysis


In [12]:
# sentiment analysis
opinion1 = TextBlob("This new restaurant is great. I had so much fun!! :-P")
print(opinion1.sentiment)

opinion2 = TextBlob("Google News to close in Spain.")
print(opinion2.sentiment)

# subjectivity ranges from 0 to 1
# polarity ranges from -1 to 1

print(opinion1.sentiment.polarity)

if opinion1.sentiment.subjectivity > 0.5:
    print("Hey, esto es una opinion")


Sentiment(polarity=0.5387784090909091, subjectivity=0.6011363636363636)
Sentiment(polarity=0.0, subjectivity=0.0)
0.5387784090909091
Hey, this is an opinion

In [13]:
t = TextBlob("I like this restaurant")
print(t.sentiment)

t = TextBlob("I love this restaurant")
print(t.sentiment)

t = TextBlob("I fucking love this restaurant ")
print(t.sentiment)

t = TextBlob("I fucking love this restaurant :-) ")
print(t.sentiment)

t = TextBlob("I love this FUCKING restaurant :-( Grrr!! ")
print(t.sentiment)


Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.5, subjectivity=0.6)
Sentiment(polarity=0.5, subjectivity=0.6)
Sentiment(polarity=0.5, subjectivity=0.8)
Sentiment(polarity=-0.4625, subjectivity=0.8)
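
By default the sentiment property uses the pattern-based analyzer shown above. textblob also ships a NaiveBayesAnalyzer trained on movie reviews, which returns a classification plus class probabilities instead of polarity and subjectivity. A minimal sketch (the first call may need to download an NLTK corpus and can be slow):

In [ ]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# alternative sentiment backend trained on movie reviews
blob = TextBlob("This new restaurant is great. I had so much fun!!",
                analyzer=NaiveBayesAnalyzer())
print(blob.sentiment)   # Sentiment(classification=..., p_pos=..., p_neg=...)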

Other goodies


In [14]:
# spell checking
b1 = TextBlob("I havv goood speling!")
print(b1.correct())

b2 = TextBlob("Miy naem iz Jonh!")
print(b2.correct())

b3 = TextBlob("Boyz dont cri")
print(b3.correct())

b4 = TextBlob("psicological posesion achifmen comitment")
print(b4.correct())


I have good spelling!
In name in On!
Boy dont cry
psychological position achifmen commitment
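
Related to .correct(), individual Word objects expose a spellcheck() method that returns candidate corrections together with a confidence score. A minimal sketch:

In [ ]:
from textblob import Word

# spellcheck() returns (candidate, confidence) pairs
print(Word("speling").spellcheck())
print(Word("posesion").spellcheck())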

To infinity, and beyond

In this short overview we only covered what TextBlob offers out of the box. If you need to customize the tools, take a look at the advanced documentation.