Feature Extraction and Preprocessing



In [32]:

    
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import preprocessing

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk import pos_tag

import numpy as np

DictVectorizer



In [9]:

    
onehot_encoder = DictVectorizer()
instances = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'} ]

print (onehot_encoder.fit_transform(instances).toarray())









    



[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]

CountVectorizer



In [10]:

    
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]


vectorizer = CountVectorizer()
print (vectorizer.fit_transform(corpus).todense())
print (vectorizer.vocabulary_)









    



[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'played': 5, 'game': 2, 'basketball': 0, 'in': 3, 'duke': 1, 'lost': 4, 'the': 6, 'unc': 7}



In [11]:

    
# adding one more sentence in corpus

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'This is Atul Singh'
]

vectorizer = CountVectorizer()
print (vectorizer.fit_transform(corpus).todense())
print (vectorizer.vocabulary_)









    



[[0 1 1 0 1 0 0 1 0 0 0 1]
 [0 1 1 1 0 0 1 0 0 1 0 0]
 [1 0 0 0 0 1 0 0 1 0 1 0]]
{'played': 7, 'is': 5, 'game': 3, 'basketball': 1, 'singh': 8, 'in': 4, 'atul': 0, 'duke': 2, 'lost': 6, 'the': 9, 'unc': 11, 'this': 10}



In [12]:

    
# checking the euclidean distance 

# converting sentence into CountVectorizer
counts = vectorizer.fit_transform(corpus).todense()

print("1 & 2", euclidean_distances(counts[0], counts[1]))
print("2 & 3", euclidean_distances(counts[1], counts[2]))
print("1 & 3", euclidean_distances(counts[0], counts[2]))









    



1 & 2 [[ 2.44948974]]
2 & 3 [[ 3.]]
1 & 3 [[ 3.]]

Stop Word Filtering



In [13]:

    
vectorizer = CountVectorizer(stop_words='english')  # added one option which remove the grammer words from corpus
print (vectorizer.fit_transform(corpus).todense())
print (vectorizer.vocabulary_)

print("1 & 2", euclidean_distances(counts[0], counts[1]))
print("2 & 3", euclidean_distances(counts[1], counts[2]))
print("1 & 3", euclidean_distances(counts[0], counts[2]))









    



[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'played': 5, 'game': 3, 'basketball': 1, 'singh': 6, 'atul': 0, 'duke': 2, 'lost': 4, 'unc': 7}
1 & 2 [[ 2.44948974]]
2 & 3 [[ 3.]]
1 & 3 [[ 3.]]

Stemming and Lemmatization

Lemmatization is the process of determining the lemma, or the morphological root, of an inflected word based on its context. Lemmas are the base forms of words that are used to key the word in a dictionary.

Stemming has a similar goal to lemmatization, but it does not attempt to produce the morphological roots of words. Instead, stemming removes all patterns of characters that appear to be affixes, resulting in a token that is not necessarily a valid word.

Lemmatization frequently requires a lexical resource, like WordNet, and the word's part of speech. Stemming algorithms frequently use rules instead of lexical resources to produce stems and can operate on any token, even without its context.



In [14]:

    
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]

vectorizer = CountVectorizer(stop_words='english')  # added one option which remove the grammer words from corpus
print (vectorizer.fit_transform(corpus).todense())
print (vectorizer.vocabulary_)









    



[[1 0 0 1]
 [0 1 1 0]]
{'sandwich': 2, 'ate': 0, 'sandwiches': 3, 'eaten': 1}

As we can see both sentences are having same meaning but their feature vectors have no elements in common. Let's use the lexical analysis on the data



In [15]:

    
lemmatizer = WordNetLemmatizer()
print (lemmatizer.lemmatize('gathering', 'v'))
print (lemmatizer.lemmatize('gathering', 'n'))









    



gather
gathering

The Porter stemmer cannot consider the inflected form's part of speech and returns gather for both documents:



In [16]:

    
stemmer = PorterStemmer()
print (stemmer.stem('gathering'))









    



gather



In [17]:

    
wordnet_tags = ['n', 'v']
corpus = [
'He ate the sandwiches',
'Every sandwich was eaten by him'
]
stemmer = PorterStemmer()
print ('Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])









    



Stemmed: [['He', 'ate', 'the', 'sandwich'], ['Everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]



In [18]:

    
def lemmatize(token, tag):
	if tag[0].lower() in ['n', 'v']:
		return lemmatizer.lemmatize(token, tag[0].lower())
	return token
lemmatizer = WordNetLemmatizer()

tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print ('Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])









    



Lemmatized: [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]

Extending bag-of-words with TF-IDF weights

It is intuitive that the frequency with which a word appears in a document could indicate the extent to which a document pertains to that word. A long document that contains one occurrence of a word may discuss an entirely different topic than a document that contains many occurrences of the same word. In this section, we will create feature vectors that encode the frequencies of words, and discuss strategies to mitigate two problems caused by encoding term frequencies. Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times that the words appeared in the document.



In [19]:

    
corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
vectorizer = CountVectorizer(stop_words='english')
print (vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)









    



[[2 1 3 1 1]]
{'transfigured': 3, 'wizard': 4, 'dog': 1, 'ate': 0, 'sandwich': 2}



In [23]:

    
corpus = ['The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich']
vectorizer = TfidfVectorizer(stop_words='english')
print (vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)









    



[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]
{'transfigured': 3, 'wizard': 4, 'dog': 1, 'ate': 0, 'sandwich': 2}



In [26]:

    
corpus = ['The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich']
vectorizer = HashingVectorizer(n_features=6)

print (vectorizer.fit_transform(corpus).todense())









    



[[-0.37796447  0.          0.75592895  0.37796447  0.         -0.37796447]
 [-0.5         0.          0.5         0.5         0.          0.5       ]]

Data Standardization



In [34]:

    
X = [[1,2,3],
     [4,5,1],
     [3,6,2]
    ]

print(preprocessing.scale(X))









    



[[-1.33630621 -1.37281295  1.22474487]
 [ 1.06904497  0.39223227 -1.22474487]
 [ 0.26726124  0.98058068  0.        ]]



In [42]:

    
x1 = preprocessing.StandardScaler()
print(x1)
print(x1.fit_transform(X))









    



StandardScaler(copy=True, with_mean=True, with_std=True)
[[-1.33630621 -1.37281295  1.22474487]
 [ 1.06904497  0.39223227 -1.22474487]
 [ 0.26726124  0.98058068  0.        ]]



In [ ]: