Lesson 10 - Text Learning

When learning from text, the biggest problem is that different texts have different lengths. A short email would produce fewer input features, while a longer email would produce more, so the raw text cannot be used directly as a fixed-size input.

Bag of words

  • Make a dictionary of counts of all the words that we care about.
  • Word order does not matter.
  • Long phrases give different vectors (repeating the same text changes the counts).
  • Compound phrases such as "chicago bulls" cannot be handled; they are counted as two unrelated words.
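
The properties listed above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name count_words and the whitespace tokenization are assumptions made for illustration, not how CountVectorizer works internally.

```python
from collections import Counter

def count_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

# Word order does not matter: both sentences give the same counts.
a = count_words("the car will be late")
b = count_words("late the car will be")
print(a == b)

# Repeating a phrase changes the counts, so the vector changes too.
doubled = count_words("the car will be late the car will be late")
print(a == doubled)

# A compound phrase is split into two unrelated words.
print(count_words("chicago bulls"))
```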

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

string1 = "hi aseem the car will be late regards company"
string2 = "hi company why will it be late I paid in advance regards aseem"
string3 = "hi aseem we don't know why will it be late regards company of company"

email_list = [string1, string2, string3]

# Learn the vocabulary, then turn each email into a sparse vector of word counts.
vectorizer.fit(email_list)
bag_of_words = vectorizer.transform(email_list)

# vocabulary_ maps each word to its column index.
print(vectorizer.vocabulary_)
# Each printed entry is (email index, word index) followed by the count.
print(bag_of_words)


{u'be': 2, u'we': 15, u'company': 4, u'of': 11, u'it': 8, u'paid': 12, u'regards': 13, u'know': 9, u'in': 7, u'why': 16, u'advance': 0, u'don': 5, u'aseem': 1, u'car': 3, u'will': 17, u'hi': 6, u'late': 10, u'the': 14}
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 6)	1
  (0, 10)	1
  (0, 13)	1
  (0, 14)	1
  (0, 17)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (1, 4)	1
  (1, 6)	1
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 12)	1
  (1, 13)	1
  (1, 16)	1
  (1, 17)	1
  (2, 1)	1
  (2, 2)	1
  (2, 4)	2
  (2, 5)	1
  (2, 6)	1
  (2, 8)	1
  (2, 9)	1
  (2, 10)	1
  (2, 11)	1
  (2, 13)	1
  (2, 15)	1
  (2, 16)	1
  (2, 17)	1

Not all words are equal

Some words, such as "the" and "hi", carry very little information.

Stopwords

  • Occur very frequently, carry little information, and should be removed.

In [5]:
import nltk

nltk.download()


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[5]:
True

In [7]:
from nltk.corpus import stopwords

sw = stopwords.words("english")

len(sw)


Out[7]:
153
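
Once a stopword list is available, removing stopwords is a simple filter. The sketch below uses a small hand-picked set (sample_stopwords, a hypothetical stand-in) so that it runs without the NLTK download; the full list in sw could be passed in the same way.

```python
# Hypothetical hand-picked subset; the NLTK list `sw` is much longer.
sample_stopwords = {"the", "will", "be", "it", "in", "of", "we", "why"}

def remove_stopwords(text, stopword_set):
    """Return the words of `text` that are not in `stopword_set`."""
    return [w for w in text.lower().split() if w not in stopword_set]

kept = remove_stopwords(
    "hi aseem the car will be late regards company", sample_stopwords
)
print(kept)
```

Converting the list to a set first, for example set(sw), keeps each membership check fast.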

Not all unique words are different

  • unresponsive
  • response
  • responsivity
  • responsiveness
  • respond

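All of them can be passed through a stemmer to reduce them to a common root/stem such as "respon" (the exact stem depends on the stemmer).

We don't need to keep all of them as separate features: their meanings differ only slightly, so the variants add little extra information.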


In [12]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print(stemmer.stem("responsiveness"))
print(stemmer.stem("responsivity"))
print(stemmer.stem("unresponsive"))


respons
respons
unrespons

Order of operations in text processing

Stem the words first, then build the bag of words from the stems. That way all variants of a word are counted as a single feature.
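
A minimal sketch of this ordering follows. The function crude_stem is a hypothetical stand-in that strips a few suffixes; it is not the Snowball algorithm. The stem parameter is there so that a real stemmer, for example the stem method of a SnowballStemmer, could be passed in instead.

```python
from collections import Counter

def crude_stem(word):
    """Stand-in stemmer (NOT Snowball): strip a few common suffixes."""
    for suffix in ("iveness", "ivity", "ive", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stemmed_bag_of_words(text, stem=crude_stem):
    """Stem every word first, then count the stems."""
    return Counter(stem(w) for w in text.lower().split())

text = "responsive response responsiveness responsivity"
unstemmed = Counter(text.split())      # four separate features, one count each
stemmed = stemmed_bag_of_words(text)   # a single feature with count 4
print(unstemmed)
print(stemmed)
```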

Weighting by term frequency

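  • TfIdf representation
    • Tf - term frequency: how often a word occurs in a document (like bag of words)
    • Idf - inverse document frequency: weights a word by how rare it is across all documents, so rare words count for more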
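
To make the weighting concrete, here is a small worked sketch on the three emails from the bag-of-words cell. It assumes one common textbook variant, weight = tf * log(N / df), with plain whitespace tokenization. sklearn's TfidfVectorizer uses a smoothed and normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "hi aseem the car will be late regards company",
    "hi company why will it be late I paid in advance regards aseem",
    "hi aseem we don't know why will it be late regards company of company",
]
tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many emails does each word appear?
df = Counter(word for doc in tokenized for word in set(doc))

def tf_idf(doc):
    """Weight each word by its count times log(N / document frequency)."""
    counts = Counter(doc)
    return {w: c * math.log(float(n_docs) / df[w]) for w, c in counts.items()}

weights = tf_idf(tokenized[0])
print(weights["hi"])   # "hi" is in every email, so its idf is log(1)
print(weights["car"])  # "car" is only in the first email, so it gets the highest idf
```

Words that appear in every email end up with weight zero, while words unique to one email stand out.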