Machine learning algorithms all take the same basic form of input: a fixed-length list of numbers. Very few real-world problems come as a fixed-length list of numbers, so a crucial step in machine learning is converting the data into this format. The individual numbers are often called "features", so this process is called "feature extraction". A fixed-length list of numbers is also known as a vector, and a list of vectors that all have the same length is known as a matrix.
Right now our input is a list of tweets that looks like this:
We need to convert it into a list of feature vectors that look like this:
You might want to stop and think about how you might do this.
One way to convert our input into vectors is to make each row correspond to a tweet, each column correspond to a different word, and each cell hold the number of times that word occurred in that tweet.
This creates a lot of columns! This is the most basic feature extraction method used on text in natural language processing.
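To make the idea concrete, here is a minimal sketch of word counting done by hand in plain Python. The three tweets are made up for illustration, and splitting on whitespace is a simplifying assumption; it is not how the library tokenizes text.

```python
from collections import Counter

# Toy tweets, made up for illustration (not taken from the dataset)
tweets = ["my iphone is awesome", "my ipad is slow", "awesome awesome app"]

# One column per distinct word, in sorted order
vocabulary = sorted({word for tweet in tweets for word in tweet.split()})

def count_vector(tweet):
    """Return the word counts of one tweet as a fixed-length list."""
    counts = Counter(tweet.split())
    return [counts[word] for word in vocabulary]

# One row per tweet, one column per word
matrix = [count_vector(tweet) for tweet in tweets]

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length no matter how long the tweet is, which is exactly the fixed-length format a learning algorithm needs.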
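Scikit-learn has methods that make this transformation really easy. In sklearn.feature_extraction.text there is a class called CountVectorizer that we will use.
CountVectorizer has two important methods: fit, which sets up the columns, and transform, which converts text into feature vectors.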
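Before running it on the real tweets, it may help to see CountVectorizer on a tiny corpus. This is a sketch on two made-up tweets; the names toy_tweets, toy_vect and toy_counts are placeholders that do not appear elsewhere in this tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up tweets, used only for this illustration
toy_tweets = ["my iphone is awesome", "my ipad is slow"]

toy_vect = CountVectorizer()
toy_vect.fit(toy_tweets)  # learn the vocabulary: one column per distinct word

# vocabulary_ maps each word to the index of its column
print(toy_vect.vocabulary_)

# transform counts words using the columns learned by fit
toy_counts = toy_vect.transform(["my iphone is my iphone"])
print(toy_counts.toarray())
```

Here fit only decides which word belongs to which column, and transform does the counting. Words that were not seen during fit have no column, so transform ignores them.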
In [13]:
import pandas as pd
import numpy as np
df = pd.read_csv('../scikit/tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']
# We need to remove the empty rows from the text before we pass into CountVectorizer
fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]
# Do the feature extraction
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer() # initialize the count vectorizer
count_vect.fit(fixed_text) # set up the columns
Now our count_vect object is able to transform text into feature vectors.
We can try it out:
In [14]:
count_vect.transform(["My iphone is awesome"])
The result is a sparse matrix. A sparse matrix is a matrix made up mostly of zeros, and we are definitely dealing with one here, since most of the counts are zero.
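As a rough illustration of what "sparse" means in practice, the sketch below builds a small matrix with SciPy and checks how many of its cells are actually nonzero. The numbers are made up, and csr_matrix is just one of several sparse formats SciPy offers.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small made-up matrix in which most cells are zero
dense = np.array([[0, 0, 1, 0, 0, 0, 2, 0],
                  [0, 0, 0, 0, 0, 0, 0, 0],
                  [3, 0, 0, 0, 0, 0, 0, 0]])

# The sparse version stores only the nonzero cells and their positions
sparse = csr_matrix(dense)

print(sparse.shape)  # same shape as the dense matrix
print(sparse.nnz)    # number of nonzero cells that are stored

# Fraction of cells that are nonzero
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(density)

# Convert back to a regular array when the matrix is small enough to inspect
print(sparse.toarray())
```

With thousands of word columns and only a handful of words per tweet, storing just the nonzero cells saves a great deal of memory.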
In [15]:
print(count_vect.transform(["My iphone is awesome"]))
This notation says that the cells in columns 876, 4573, 4596, and 5699 are one and all other cells are zero. We have one row here because we passed in a list of length one: just the tweet "My iphone is awesome".
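One way to check which word each nonzero column stands for is to invert the vectorizer's vocabulary_ mapping. The sketch below does this on a made-up two-tweet corpus, so its column numbers will differ from the ones printed for the real tweets; toy_vect and index_to_word are names introduced only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on two made-up tweets so the example runs without the dataset
toy_vect = CountVectorizer().fit(["my iphone is awesome", "my ipad is slow"])
row = toy_vect.transform(["My iphone is awesome"])

# vocabulary_ maps word -> column index; invert it to get column index -> word
index_to_word = {index: word for word, index in toy_vect.vocabulary_.items()}

# nonzero() returns the row indices and column indices of the nonzero cells
_, columns = row.nonzero()
words = sorted(index_to_word[int(column)] for column in columns)
print(words)
```

Note that "My" is matched to the column for "my" because CountVectorizer lowercases text by default.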
Some questions to ask yourself now:
Let's do the transformation on all of our tweets to build our big feature matrix (you can think of a matrix as a list of fixed size vectors).
In [17]:
counts = count_vect.transform(fixed_text)
print(counts.shape)
Great! Now we have a feature matrix that we can feed into our machine learning algorithm. It has 9092 rows corresponding to 9092 tweets and 9706 columns corresponding to 9706 distinct words.
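The relationship between the shape of the matrix and the data can be checked on a small scale. The following sketch uses three made-up tweets and hypothetical names (toy_tweets, toy_vect, toy_counts) to confirm that the row count matches the number of tweets and the column count matches the size of the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up tweets, used only for this illustration
toy_tweets = ["my iphone is awesome", "my ipad is slow", "awesome awesome app"]

toy_vect = CountVectorizer()
toy_vect.fit(toy_tweets)
toy_counts = toy_vect.transform(toy_tweets)

n_rows, n_columns = toy_counts.shape
print(n_rows, n_columns)

# One row per tweet, one column per word in the vocabulary
print(n_rows == len(toy_tweets))
print(n_columns == len(toy_vect.vocabulary_))
```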