TextBlob: Text Analytics for Humans

Ken Hu

@whosbacon

Who Am I?

7+ years DevOps
5+ years Python
3+ years text analytics
Digital nomad



In [1]:

    
# I keep this as a cell in my title slide so I can rerun 
# it easily if I make changes, but it's low enough it won't
# be visible in presentation mode.
%run talktools

Itinerary

A blob of text
Basics
Tokenization
Parsing & tagging
Spell correction
Sentiment analysis
Classification
Blobber

A Blob of Text

Popularity of text analytics / NLP
nltk complexity
TextBlob : nltk :: requests : httplib2
Object-oriented
https://textblob.readthedocs.org/en/latest/
https://github.com/sloria/textblob



In [ ]:

    
!sudo pip install -U textblob
!sudo pip install -U nltk textblob-aptagger
!sudo python -m textblob.download_corpora

import nltk
nltk.download('wordnet')

Basics



In [1]:

    
import textblob

blob = textblob.TextBlob(u'   Welcome to textblob: text analytics for humans   ')



In [2]:

    
print type(blob)
print blob[5:10]
print blob.title()
print blob.find('text')
print blob.endswith(' ')
print blob.stripped

<class 'textblob.blob.TextBlob'>
lcome
   Welcome To Textblob: Text Analytics For Humans   
14
True
welcome to textblob text analytics for humans



In [3]:

    
print textblob.Word
print textblob.Word(u'sweet').pluralize()
print textblob.Word(u'sweets').singularize()
print textblob.Word(u'running').lemmatize('v')

<class 'textblob.blob.Word'>
sweets
sweet
run



In [4]:

    
print type(blob.words)
print blob.words.singularize()
print blob.words.count('text')

<class 'textblob.blob.WordList'>
[u'Welcome', u'to', u'textblob', u'text', u'analytic', u'for', u'human']
1

Tokenization



In [5]:

    
blob = textblob.TextBlob(u'why is it so hot in here?!\nthis is not hot at all')



In [6]:

    
print blob.words
print type(blob.words), type(blob.words[0])
print blob.sentences
print blob.ngrams()
print blob.word_counts['hot']

[u'why', u'is', u'it', u'so', u'hot', u'in', u'here', u'this', u'is', u'not', u'hot', u'at', u'all']
<class 'textblob.blob.WordList'> <class 'textblob.blob.Word'>
[Sentence("why is it so hot in here?!"), Sentence("this is not hot at all")]
[WordList([u'why', u'is', u'it']), WordList([u'is', u'it', u'so']), WordList([u'it', u'so', u'hot']), WordList([u'so', u'hot', u'in']), WordList([u'hot', u'in', u'here']), WordList([u'in', u'here', u'this']), WordList([u'here', u'this', u'is']), WordList([u'this', u'is', u'not']), WordList([u'is', u'not', u'hot']), WordList([u'not', u'hot', u'at']), WordList([u'hot', u'at', u'all'])]
2
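
The trigram listing above is what `blob.ngrams()` returns with its default window of three words. A rough, library-free sketch of the same sliding-window idea follows. The helper name `ngrams` and the hand-copied word list are assumptions for illustration, not TextBlob's implementation; the code runs on Python 2.7 or 3.

```python
from collections import Counter


def ngrams(words, n=3):
    """Return every run of n consecutive words as a list."""
    if n <= 0:
        return []
    return [words[i:i + n] for i in range(len(words) - n + 1)]


# Words as TextBlob reported them above (punctuation already dropped).
words = ['why', 'is', 'it', 'so', 'hot', 'in', 'here',
         'this', 'is', 'not', 'hot', 'at', 'all']

trigrams = ngrams(words)
counts = Counter(words)  # plays the role of blob.word_counts in this sketch
```

With 13 words and a window of 3, this yields 11 trigrams, the same number as in the output above.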



In [7]:

    
print blob.tokenizer

import nltk.tokenize

blob.tokenizer = nltk.tokenize.LineTokenizer()
print blob.tokenize()
# PunktWordTokenizer ships with older NLTK releases; later versions removed it.
print blob.tokenize(nltk.tokenize.PunktWordTokenizer())

<textblob.tokenizers.WordTokenizer object at 0x36b3e70>
[u'why is it so hot in here?!', u'this is not hot at all']
[u'why', u'is', u'it', u'so', u'hot', u'in', u'here', u'?', u'!', u'this', u'is', u'not', u'hot', u'at', u'all']

Parsing & Tagging



In [8]:

    
blob = textblob.TextBlob(u'Ray Charles has got Georgia on his mind')



In [9]:

    
print blob.noun_phrases
print blob.pos_tags
print blob.parse()

[u'ray charles', u'georgia']
[(u'Ray', u'NNP'), (u'Charles', u'NNP'), (u'has', u'VBZ'), (u'got', u'VBD'), (u'Georgia', u'NNP'), (u'on', u'IN'), (u'his', u'PRP$'), (u'mind', u'NN')]
Ray/NNP/B-NP/O Charles/NNP/I-NP/O has/VBZ/B-VP/O got/VBD/I-VP/O Georgia/NNP/B-NP/O on/IN/B-PP/B-PNP his/PRP$/B-NP/I-PNP mind/NN/I-NP/I-PNP
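
The `parse()` string packs four slash-separated fields into each token. Reading them as word, part-of-speech tag, chunk tag, and prepositional-noun-phrase tag (the Pattern parser's convention), a small sketch can split the string back into records. The `Token` field names are labels chosen for this sketch, and it assumes no token contains a literal slash.

```python
from collections import namedtuple

# Field names are labels chosen for this sketch, not a TextBlob type.
Token = namedtuple('Token', 'word pos chunk pnp')

# The parse() output from the cell above, copied verbatim.
parsed = ('Ray/NNP/B-NP/O Charles/NNP/I-NP/O has/VBZ/B-VP/O got/VBD/I-VP/O '
          'Georgia/NNP/B-NP/O on/IN/B-PP/B-PNP his/PRP$/B-NP/I-PNP '
          'mind/NN/I-NP/I-PNP')

# One record per whitespace-separated token.
tokens = [Token(*item.split('/')) for item in parsed.split()]

# Words that open a noun-phrase chunk (chunk tag B-NP).
np_starts = [t.word for t in tokens if t.chunk == 'B-NP']
```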



In [10]:

    
import textblob.np_extractors

print blob.np_extractor
blob = textblob.TextBlob(u'Ray Charles has got Georgia on his mind',
                         np_extractor=textblob.np_extractors.ConllExtractor())
print blob.noun_phrases

<textblob.en.np_extractors.FastNPExtractor object at 0x36b3fd0>
[u'ray charles', u'georgia']



In [11]:

    
import textblob.taggers

print blob.pos_tagger
blob = textblob.TextBlob(u'Ray Charles has got Georgia on his mind',
                         pos_tagger=textblob.taggers.NLTKTagger())
print blob.pos_tags

<textblob.en.taggers.PatternTagger object at 0x36b3bb0>
[(u'Ray', u'NNP'), (u'Charles', u'NNP'), (u'has', u'VBZ'), (u'got', u'VBN'), (u'Georgia', u'NNP'), (u'on', u'IN'), (u'his', u'PRP$'), (u'mind', u'NN')]

Spell Correction



In [12]:

    
blob = textblob.TextBlob(u'I am runing across the boarder')



In [13]:

    
print blob.correct() # based on Pattern
print blob.words[2].spellcheck()

I am running across the border
[(u'running', 0.8974358974358975), (u'ruling', 0.07692307692307693), (u'ruining', 0.019230769230769232), (u'tuning', 0.00641025641025641)]
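
`spellcheck()` returns `(candidate, confidence)` pairs, and the corrected sentence above uses the top-ranked candidate for "runing". A short sketch of reading that list, using the values copied from the output (the variable names are illustrative):

```python
# Candidates copied from the spellcheck() output above.
candidates = [
    (u'running', 0.8974358974358975),
    (u'ruling', 0.07692307692307693),
    (u'ruining', 0.019230769230769232),
    (u'tuning', 0.00641025641025641),
]

# Pick the candidate with the highest confidence.
best_word, best_conf = max(candidates, key=lambda pair: pair[1])

# The confidences in this output behave like a probability distribution.
total = sum(conf for _, conf in candidates)
```

Here `best_word` is `running`, and the four confidences add up to 1 within floating-point error.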

Sentiment Analysis



In [14]:

    
blob = textblob.TextBlob(u'the weather is fantastic!')



In [15]:

    
print blob.sentiment

import textblob.sentiments

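print blob.analyzer
# Note: `sentiment` is cached on the blob after its first access, so swapping
# the analyzer here does not change what the next line prints. Pass
# analyzer=... to the TextBlob constructor to actually use NaiveBayesAnalyzer.
blob.analyzer = textblob.sentiments.NaiveBayesAnalyzer()
print blob.sentiment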

Sentiment(polarity=0.5, subjectivity=0.9)
<textblob.en.sentiments.PatternAnalyzer object at 0x36caed0>
Sentiment(polarity=0.5, subjectivity=0.9)

Classification

Opposite Day Classifier



In [16]:

    
training = [
            (u'tobey maguire is a terrible spiderman.','pos'),
            (u'a terrible Javert (Russell Crowe) ruined Les Miserables for me...','pos'),
            (u'The Dark Knight is the greatest superhero movie ever!','neg'),
            (u'Fantastic Four should have never been made.','pos'),
            (u'Wes Anderson is my favorite director!','neg'),
            (u'Captain America 2 is pretty awesome.','neg'),
            (u'Let\'s pretend "Batman and Robin" never happened..','pos'),
            ]
testing = [
           (u'Superman was never an interesting character.','pos'),
           (u'Fantastic Mr Fox is an awesome film!','neg'),
           (u'Dragonball Evolution is simply terrible!!','pos')
           ]



In [17]:

    
import textblob.classifiers

classifier = textblob.classifiers.NaiveBayesClassifier(training)
print classifier.accuracy(testing)
classifier.show_informative_features(3)

1.0
Most Informative Features
            contains(is) = True              neg : pos    =      2.9 : 1.0
      contains(terrible) = False             neg : pos    =      1.7 : 1.0
         contains(never) = False             neg : pos    =      1.7 : 1.0
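
The `contains(word)` entries above are boolean features: one per word seen in training, set to whether the document holds that word. A simplified sketch of that bag-of-words extraction follows. The regex tokenizer and the function names are stand-ins for illustration; TextBlob's own extractor tokenizes differently.

```python
import re


def tokenize(text):
    """Crude stand-in tokenizer: runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def contains_features(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    tokens = set(tokenize(document))
    return {'contains({0})'.format(word): (word in tokens)
            for word in vocabulary}


# Two of the training sentences from the cell above.
training_texts = [
    'tobey maguire is a terrible spiderman.',
    'Captain America 2 is pretty awesome.',
]

# The vocabulary is every distinct word seen in training.
vocabulary = set(word for text in training_texts for word in tokenize(text))

features = contains_features('the weather is terrible!', vocabulary)
```

A Naive Bayes classifier then weighs how often each feature value co-occurs with each label, which is what the ratios in the listing report.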



In [18]:

    
blob = textblob.TextBlob(u'the weather is terrible!', classifier=classifier)
print blob.classify()

pos

Blobber

Factory Method



In [19]:

    
np_extractor = textblob.np_extractors.ConllExtractor()
pos_tagger = textblob.taggers.NLTKTagger()
tokenizer = nltk.tokenize.PunktWordTokenizer()
analyzer = textblob.sentiments.NaiveBayesAnalyzer()



In [20]:

    
blob = textblob.TextBlob(u'Dog goes woof. Cat goes meow. Bird goes tweet. And mouse goes squeek.',
                         np_extractor=np_extractor,
                         pos_tagger=pos_tagger,
                         tokenizer=tokenizer,
                         analyzer=analyzer)
# do something with blob

blob2 = textblob.TextBlob(u'Cow goes moo. Frog goes croak. And the elephant goes toot.',
                          np_extractor=np_extractor,
                          pos_tagger=pos_tagger,
                          tokenizer=tokenizer,
                          analyzer=analyzer)
# do something with blob2

blob3 = textblob.TextBlob(u'Ducks say quack. And fish go blub. And the seal goes ow ow ow ow ow.',
                          np_extractor=np_extractor,
                          pos_tagger=pos_tagger,
                          tokenizer=tokenizer,
                          analyzer=analyzer)
# do something with blob3



In [21]:

    
blobber = textblob.Blobber(
                           np_extractor=np_extractor,
                           pos_tagger=pos_tagger,
                           tokenizer=tokenizer,
                           analyzer=analyzer)

blob = blobber(u'But there\'s one sound that no one knows: What does the fox say?')

print blob
print blob.np_extractor
print blob.pos_tagger
print blob.tokenizer
print blob.analyzer

But there's one sound that no one knows: What does the fox say?
<textblob.en.np_extractors.ConllExtractor object at 0x86a38f0>
<textblob.en.taggers.NLTKTagger object at 0x86a3970>
<nltk.tokenize.punkt.PunktWordTokenizer object at 0x86a3870>
<textblob.en.sentiments.NaiveBayesAnalyzer object at 0x86a3850>



In [22]:

    
print map(blobber, ['Ring-ding-ding-ding-dingeringeding!',
                    'Wa-pa-pa-pa-pa-pa-pow!',
                    'Hatee-hatee-hatee-ho!',
                    'Joff-tchoff-tchoffo-tchoffo-tchoff!'])

[TextBlob("Ring-ding-ding-ding-dingeringeding!"), TextBlob("Wa-pa-pa-pa-pa-pa-pow!"), TextBlob("Hatee-hatee-hatee-ho!"), TextBlob("Joff-tchoff-tchoffo-tchoffo-tchoff!")]
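
One way to picture the Blobber is as a factory that remembers shared settings and applies them to every blob it creates. The sketch below mimics that idea with `functools.partial` and a stand-in class. `FakeBlob` and the placeholder components are hypothetical, and this is not how TextBlob implements `Blobber`.

```python
from functools import partial


class FakeBlob(object):
    """Hypothetical stand-in for TextBlob that just stores its settings."""

    def __init__(self, text, tokenizer=None, analyzer=None):
        self.text = text
        self.tokenizer = tokenizer
        self.analyzer = analyzer


# Placeholder objects standing in for expensive-to-build components.
shared_tokenizer = object()
shared_analyzer = object()

# Bind the shared components once; only the text varies per call.
make_blob = partial(FakeBlob,
                    tokenizer=shared_tokenizer,
                    analyzer=shared_analyzer)

blobs = [make_blob(text) for text in ['Wa-pa-pa-pa-pa-pa-pow!',
                                      'Hatee-hatee-hatee-ho!']]
```

Every blob built this way points at the same tokenizer and analyzer instances, so costly setup (such as training an analyzer) happens once rather than per blob.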

Classifier Comparison



In [23]:

    
training = [
            (u'tobey maguire is a terrible spiderman.','pos'),
            (u'a terrible Javert (Russell Crowe) ruined Les Miserables for me...','pos'),
            (u'The Dark Knight is the greatest superhero movie ever!','neg'),
            (u'Fantastic Four should have never been made.','pos'),
            (u'Wes Anderson is my favorite director!','neg'),
            (u'Captain America 2 is pretty awesome.','neg'),
            (u'Let\'s pretend "Batman and Robin" never happened..','pos'),
            ]
texts = [
         u'Superman was never an interesting character.',
         u'Fantastic Mr Fox is an awesome film!',
         u'Dragonball Evolution is simply terrible!!'
         ]



In [24]:

    
import textblob.classifiers

nb_classifier = textblob.classifiers.NaiveBayesClassifier(training)
dt_classifier = textblob.classifiers.DecisionTreeClassifier(training)



In [25]:

    
for text in texts:
    nb_class = textblob.TextBlob(text, classifier=nb_classifier).classify()
    dt_class = textblob.TextBlob(text, classifier=dt_classifier).classify()
    print nb_class == dt_class

True
True
False



In [26]:

    
nb_blobber = textblob.Blobber(classifier=nb_classifier)
dt_blobber = textblob.Blobber(classifier=dt_classifier)

for text in texts:
    print nb_blobber(text).classify() == dt_blobber(text).classify()

True
True
False

Future Blob

Looking for contributors and maintainers
http://textblob.readthedocs.org/en/dev/contributing.html#extension-development

Fin