Lesson 10 - Text Learning

When learning from text, the biggest problem is that different texts have different lengths. A short email would produce fewer features and a long email more, but most learning algorithms expect the same number of features for every input.

Bag of words

Make a dictionary of counts of all the words that we care about.
- word order does not matter
- longer text gives a different vector (repeating a phrase changes the counts)
- complex phrases like "chicago bulls" cannot be handled, because each word is counted separately
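
The three properties above can be checked with a minimal sketch that uses only the standard library. The helper `bag_of_words` is a hypothetical name for this illustration, and it splits on whitespace only, which is simpler than what a real vectorizer does.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens; positions are discarded."""
    return Counter(text.lower().split())

# Word order does not matter: both orderings give the same counts.
a = bag_of_words("chicago bulls win")
b = bag_of_words("win bulls chicago")
print(a == b)

# Longer text gives a different vector: repeating the phrase doubles each count.
doubled = bag_of_words("chicago bulls win chicago bulls win")
print(doubled["chicago"])

# A multi-word name is split into unrelated single words.
print(sorted(bag_of_words("Chicago Bulls")))
```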



In [1]:

    
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

string1 = "hi aseem the car will be late regards company"
string2 = "hi company why will it be late I paid in advance regards aseem"
string3 = "hi aseem we don't know why will it be late regards company of company"

email_list = [string1, string2, string3]

# fit() learns the vocabulary; transform() counts each vocabulary word per email.
vectorizer.fit(email_list)
bag_of_words = vectorizer.transform(email_list)

# vocabulary_ maps each word to its column index in the count matrix.
print(vectorizer.vocabulary_)
# The sparse matrix prints as (email index, word index) followed by the count.
print(bag_of_words)

{u'be': 2, u'we': 15, u'company': 4, u'of': 11, u'it': 8, u'paid': 12, u'regards': 13, u'know': 9, u'in': 7, u'why': 16, u'advance': 0, u'don': 5, u'aseem': 1, u'car': 3, u'will': 17, u'hi': 6, u'late': 10, u'the': 14}
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 6)	1
  (0, 10)	1
  (0, 13)	1
  (0, 14)	1
  (0, 17)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (1, 4)	1
  (1, 6)	1
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 12)	1
  (1, 13)	1
  (1, 16)	1
  (1, 17)	1
  (2, 1)	1
  (2, 2)	1
  (2, 4)	2
  (2, 5)	1
  (2, 6)	1
  (2, 8)	1
  (2, 9)	1
  (2, 10)	1
  (2, 11)	1
  (2, 13)	1
  (2, 15)	1
  (2, 16)	1
  (2, 17)	1

In this output, the entry (2, 4) with value 2 means that the word with index 4 ("company") appears twice in the third email.

Not all words are equal

Some words, like "the" and "hi", carry little information.

Stopwords

Stopwords occur very frequently and carry little information, so they should be removed before learning.
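
As a rough illustration of stopword removal, the sketch below filters tokens against a tiny hand-picked set. Both `STOPWORDS` and `remove_stopwords` are hypothetical names for this example, and the set is an assumption chosen to fit the sample email; NLTK's English list is much longer.

```python
# A tiny hand-picked stopword set, for illustration only (an assumption, not NLTK's list).
STOPWORDS = {"hi", "the", "will", "be", "it", "in", "of", "we", "why"}

def remove_stopwords(text):
    """Return the lowercase tokens of text that are not in STOPWORDS."""
    return [word for word in text.lower().split() if word not in STOPWORDS]

email = "hi aseem the car will be late regards company"
print(remove_stopwords(email))
```

Only the words that distinguish this email from others are left, which is the point of dropping stopwords before counting.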



In [5]:

    
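import nltk

# With no arguments this opens the interactive NLTK downloader; install the "stopwords" corpus from it.
nltk.download()
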
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

    Out[5]:

True



In [7]:

    
from nltk.corpus import stopwords

sw = stopwords.words("english")

# Number of English stopwords in the installed NLTK corpus (this varies by NLTK version).
len(sw)

    Out[7]:

153

Not all unique words are different

unresponsive
response
responsivity
responsiveness
respond

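All of them can be passed through a stemmer to get a common root/stem, "respon" (a real stemmer may return a slightly different stem, as the output below shows).

We don't need to keep all of them as separate features: their meanings differ only slightly, so the distinction adds little information.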



In [12]:

    
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
# Each call returns the stem of a single word.
print(stemmer.stem("responsiveness"))
print(stemmer.stem("responsivity"))
print(stemmer.stem("unresponsive"))

respons
respons
unrespons

Order of operations in text processing

Stemming should be done before building the bag of words, so that all variants of a word are counted under one stem.
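
The ordering can be sketched as follows. To keep the example free of external dependencies, `toy_stem` is a hypothetical lookup table standing in for a real stemmer; its entries are the stems that SnowballStemmer printed in the cell above. `bag_of_stems` is also a name invented for this illustration.

```python
from collections import Counter

# Hand-written lookup standing in for a real stemmer (an assumption for this sketch).
# The stems are the ones SnowballStemmer printed in the notebook cell above.
TOY_STEMS = {
    "responsiveness": "respons",
    "responsivity": "respons",
    "unresponsive": "unrespons",
}

def toy_stem(word):
    """Return the stem from TOY_STEMS, or the word unchanged if it is not listed."""
    return TOY_STEMS.get(word, word)

def bag_of_stems(text, stem=toy_stem):
    """Tokenise, stem each token, then count: stemming happens before counting."""
    return Counter(stem(word) for word in text.lower().split())

text = "responsiveness matters and responsivity matters"
# Stemmed first: both variants are counted under one stem.
print(bag_of_stems(text)["respons"])
# No stemming: the variants stay separate and the stem never appears.
print(bag_of_stems(text, stem=lambda w: w)["respons"])
```

With NLTK installed, passing `stem=SnowballStemmer("english").stem` should give the same pipeline with a real stemmer.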

Weighting by term frequency

TfIdf representation
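- Tf - term frequency: how often a word occurs in a document
- Idf - inverse document frequency: weights a word by how rare it is across the whole collection of documents, so rare words get a higher weight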
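
The two factors can be combined by hand in a short sketch. It uses one common textbook variant, raw count times log(N / document frequency), which is an assumption here; scikit-learn's TfidfVectorizer uses a smoothed idf and normalises each row, so its numbers will differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

docs = [
    "hi aseem the car will be late regards company",
    "hi company why will it be late i paid in advance regards aseem",
]
tokenised = [doc.split() for doc in docs]

def tf_idf(word, tokens, all_docs):
    """Raw term count times log(N / document frequency); one textbook variant."""
    tf = Counter(tokens)[word]
    df = sum(1 for d in all_docs if word in d)
    if df == 0:
        return 0.0
    return tf * math.log(float(len(all_docs)) / df)

# "hi" is in every document, so its weight is zero.
print(tf_idf("hi", tokenised[0], tokenised))
# "car" is only in the first document, so its weight is positive.
print(tf_idf("car", tokenised[0], tokenised))
```

A word that appears in every document gets no weight, while a word unique to one document is emphasised.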