Feature Extraction

Goals

  1. Introduction to Feature Extraction
  2. Do Feature Extraction on Text - Introduction to Bag of Words

Introduction

Most machine learning algorithms take the same basic form of input: a fixed-length list of numbers. Very few real-world problems come as fixed-length lists of numbers, so a crucial step in machine learning is converting the data into this format. The individual numbers are often called "features", so this process is called "feature extraction". A fixed-length list of numbers is also known as a vector, and a list of vectors that all have the same length is known as a matrix.
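To make the vocabulary concrete, here is a minimal sketch in plain Python. The numbers and the helper name matrix_shape are made up for illustration; the point is only that a matrix is a list of vectors that all share one length.

```python
# Three hypothetical examples, each described by the same four features.
vectors = [
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [3, 0, 0, 1],
]

def matrix_shape(rows):
    """Return (n_rows, n_columns), raising if the rows differ in length."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError("rows have different lengths, so this is not a matrix")
    return len(rows), lengths.pop()

shape = matrix_shape(vectors)
print(shape)  # (number of vectors, number of features per vector)
```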

Right now our input is a list of tweets that looks like this:

We need to convert it into a list of feature vectors that look like this:

Before reading on, stop and think about how you might do this.

Bag of Words

One way to convert our input into vectors is to make each column correspond to a different word and each row correspond to a tweet, so that each cell holds the number of times that word occurred in that tweet.

This creates a lot of columns: one for every distinct word in the data set. It is one of the most basic feature extraction methods used on text in natural language processing.
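The idea can be sketched from scratch in a few lines of plain Python. The two example tweets, the tokenize and to_vector helpers, and the choice to keep only tokens of two or more characters are all assumptions made for this illustration, not part of the original data set.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the tweets.
tweets = [
    "I love my iphone",
    "my ipad is great great great",
]

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: give each distinct word its own column (sorted for a stable order).
all_words = sorted({word for tweet in tweets for word in tokenize(tweet)})
vocabulary = {word: col for col, word in enumerate(all_words)}

# Step 2: build one row per tweet, counting how often each word occurs.
def to_vector(text):
    row = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:  # words never seen in step 1 are ignored
            row[vocabulary[word]] = count
    return row

matrix = [to_vector(tweet) for tweet in tweets]

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, no matter how long the tweet was, which is exactly the fixed-length property we need.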

scikit-learn has tools that make this transformation really easy. In sklearn.feature_extraction.text there is a class called CountVectorizer that we will use.

CountVectorizer has two important methods:

  1. fit sets things up, associating each word with a column
  2. transform converts a list of strings into feature vectors
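The two methods above can be tried on a tiny corpus before applying them to the tweets. The sentences below are made up for illustration; vocabulary_ is the attribute where a fitted CountVectorizer stores its word-to-column mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
docs = ["the cat sat", "the dog sat on the cat"]

vectorizer = CountVectorizer()
vectorizer.fit(docs)               # learn the word -> column mapping
vocab = vectorizer.vocabulary_     # dict such as {'cat': 0, ...}

# transform counts only the words seen during fit; "and" and "bird"
# were never seen, so they are dropped.
features = vectorizer.transform(["the cat and the bird"])
dense = features.toarray()         # dense copy, fine for a toy example

print(vocab)
print(dense)
```

Note that transform can be called on text that was not part of the fitting data, as long as fit has been called first.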

In [13]:
import pandas as pd
import numpy as np

df = pd.read_csv('../scikit/tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

# We need to remove the empty rows from the text before we pass into CountVectorizer
fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

# Do the feature extraction
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()  # initialize the count vectorizer
count_vect.fit(fixed_text)      # set up the columns


Out[13]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Now our count_vect object is able to transform text into feature vectors.

We can try it out:


In [14]:
count_vect.transform(["My iphone is awesome"])


Out[14]:
<1x9706 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

A sparse matrix is a matrix with mostly zeros, and we are definitely dealing with a sparse matrix since most of the counts here are zero. Instead of storing all 9706 cells of the row, the sparse format stores only the 4 non-zero ones.


In [15]:
print(count_vect.transform(["My iphone is awesome"]))


  (0, 876)	1
  (0, 4573)	1
  (0, 4596)	1
  (0, 5699)	1

Each line has the form (row, column) followed by the value stored in that cell. This notation says that the cells in columns 876, 4573, 4596 and 5699 of row 0 are one, and all other cells are zero. We have one row here because we passed in a list of length one: just the tweet "My iphone is awesome".
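The same (row, column) value notation can be reproduced on a small matrix. This is a sketch using scipy.sparse.csr_matrix, the Compressed Sparse Row type named in the output above; the 1x10 row of counts is made up for illustration.

```python
from scipy.sparse import csr_matrix

# A hypothetical 1x10 row of word counts with three non-zero cells.
dense_row = [[0, 0, 1, 0, 0, 2, 0, 0, 0, 1]]
sparse_row = csr_matrix(dense_row)

# nonzero() returns the row indices and column indices of the stored cells.
rows, cols = sparse_row.nonzero()
triples = [(int(r), int(c), int(sparse_row[r, c])) for r, c in zip(rows, cols)]

print(sparse_row.shape)  # full logical size of the matrix
print(sparse_row.nnz)    # number of stored (non-zero) elements
print(triples)           # (row, column, value) for each stored cell
```

Calling toarray() on a sparse matrix gives back the dense version, which is handy for small examples but can use a lot of memory on a matrix with thousands of columns.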

Some questions to ask yourself now:

  • which words correspond to which columns?
  • is our transformation case sensitive?
  • how many columns do we have?
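One way to investigate questions like these is to poke at a vectorizer fitted on a tiny corpus. The sketch below uses two made-up sentences rather than the tweets, so the numbers will differ from our real count_vect, but the same checks apply to it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, fitted with default settings.
vectorizer = CountVectorizer().fit(["My iPhone is awesome", "my ipad is slow"])

# Which words correspond to which columns? Invert the word -> column mapping.
column_to_word = {col: word for word, col in vectorizer.vocabulary_.items()}

# Is the transformation case sensitive? Compare two spellings of the same text.
upper = vectorizer.transform(["MY IPHONE"]).toarray()
lower = vectorizer.transform(["my iphone"]).toarray()
same = bool((upper == lower).all())

# How many columns do we have? One per word in the vocabulary.
n_columns = len(vectorizer.vocabulary_)

print(column_to_word)
print(same)
print(n_columns)
```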

Let's do the transformation on all of our tweets to build our big feature matrix (you can think of a matrix as a list of fixed size vectors).


In [17]:
counts = count_vect.transform(fixed_text)
print(counts.shape)


(9092, 9706)

Great! Now we have a feature matrix that we can feed into our machine learning algorithm. It has 9092 rows corresponding to 9092 tweets and 9706 columns corresponding to 9706 distinct words.

Takeaways

  1. Most machine learning algorithms share the same input format: a list of fixed-length vectors of numbers, also known as a feature matrix.
  2. Data almost never comes as a list of fixed-length vectors, so this transformation is critical and highly application dependent.
  3. When dealing with text data, "bag of words" is a common way to do feature extraction.

Questions

  1. What would be another way to transform text?
  2. What information is lost in the "bag-of-words" transformation?
