In [37]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
In [8]:
from nltk.util import ngrams

sentence = 'A black-dog and a spotted dog are fighting.'
n = 2
bigrams = ngrams(sentence.split(), n)  # generator of n-grams over whitespace tokens
for grams in bigrams:
    print(grams)
Some of these bigrams are obviously not informative. So we tokenize the sentence properly and exclude stop words (and punctuation) to get more relevant bigrams.
In [17]:
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '-'])  # remove this line if you need punctuation
list_of_words = [i.lower() for i in wordpunct_tokenize(sentence) if i.lower() not in stop_words]
bigrams = ngrams(list_of_words, 2)
for grams in bigrams:
    print(grams)
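On a single toy sentence every bigram occurs exactly once, but on a larger corpus counting bigram frequencies is a simple way to surface the most relevant ones. A minimal sketch using NLTK's FreqDist (not part of the original analysis):
In [ ]:
# Sketch: count bigram frequencies; on a real corpus the most frequent
# stopword-free bigrams tend to be the informative ones.
from nltk import FreqDist
bigram_freq = FreqDist(ngrams(list_of_words, 2))
print(bigram_freq.most_common(3))  # every bigram occurs once in this toy sentence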
The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. This dataset is often used for text classification and text clustering. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). From: http://qwone.com/~jason/20Newsgroups/
In [18]:
%run fetch_data.py twenty_newsgroups
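If the fetch_data.py helper script is not available, scikit-learn's built-in fetch_20newsgroups loader can download the same data (an alternative path, shown only as a sketch; the rest of this notebook keeps loading from ./datasets/):
In [ ]:
# Alternative download sketch (assumes network access); the notebook below
# continues to use the files placed under ./datasets/ by fetch_data.py.
from sklearn.datasets import fetch_20newsgroups
sample = fetch_20newsgroups(subset='train', categories=['sci.space'])
print(len(sample.data), sample.target_names)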
In [22]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer # Tf IDF feature extraction
from sklearn.feature_extraction.text import CountVectorizer # Count and vectorize text feature
# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('./datasets/20news-bydate-train/',
                                categories=categories, encoding='latin-1')
twenty_test_small = load_files('./datasets/20news-bydate-test/',
                               categories=categories, encoding='latin-1')
# Let's display some of the data
def display_sample(i, dataset):
    target_id = dataset.target[i]
    print("Class id: %d" % target_id)
    print("Class name: " + dataset.target_names[target_id])
    print("Text content:\n")
    print(dataset.data[i])

display_sample(0, twenty_train_small)
Let's extract word counts to convert each text document into a vector of token counts.
In [41]:
count_vect = CountVectorizer(min_df=2)
X_train_counts = count_vect.fit_transform(twenty_train_small.data)
print(X_train_counts.shape)
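To see what these counts mean, we can look up a token's column index in the vectorizer's vocabulary and read off its count in one document (the token 'space' below is just an illustrative choice, not a term guaranteed to be present):
In [ ]:
# Sketch: inspect the raw count of one token in the first document.
# 'space' is an assumed example token; vocabulary_.get returns None if it is absent.
col = count_vect.vocabulary_.get('space')
if col is not None:
    print("count of 'space' in document 0:", X_train_counts[0, col])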
Let's extract TF-IDF features from the text data. The min_df option sets a lower bound on document frequency, so terms that occur in fewer documents than the cutoff are ignored.
In [26]:
# Extract features
# Turn the text documents into vectors of word frequencies with tf-idf weighting
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target
print(type(X_train))
print(X_train.shape)
As observed, X_train is a SciPy sparse matrix with 2034 rows (one per text document) and 17566 features (unique terms).
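To make the min_df cutoff concrete, we can refit the vectorizer without it and compare vocabulary sizes, and also check how sparse X_train actually is (a quick sketch; the exact numbers depend on the data on disk):
In [ ]:
# Sketch: effect of min_df and sparsity of the TF-IDF matrix (numbers are data-dependent).
vectorizer_all = TfidfVectorizer(min_df=1)  # keep every term, however rare
X_all = vectorizer_all.fit_transform(twenty_train_small.data)
print("vocabulary with min_df=1:", len(vectorizer_all.vocabulary_))
print("vocabulary with min_df=2:", len(vectorizer.vocabulary_))
print("fraction of non-zero entries in X_train: %.4f"
      % (X_train.nnz / float(X_train.shape[0] * X_train.shape[1])))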
In [32]:
print(type(vectorizer.vocabulary_))          # Type of the vocabulary mapping
print(len(vectorizer.vocabulary_))           # Size of the vocabulary
print(vectorizer.get_feature_names()[:10])   # First 10 feature names
print(vectorizer.get_feature_names()[-10:])  # Last 10 feature names
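Individual TF-IDF weights are easier to interpret per document: the highest-weighted terms in a row are the ones that characterize that document. A small sketch for the first training document (showing 10 terms is an arbitrary choice):
In [ ]:
# Sketch: the 10 highest-weighted terms of document 0.
feature_names = np.array(vectorizer.get_feature_names())
row = X_train[0].toarray().ravel()
top = row.argsort()[::-1][:10]
print(list(zip(feature_names[top], row[top])))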
Obviously, it's hard to make sense of such a high-dimensional feature space directly. A good technique for visualizing such data is to project it to lower dimensions with PCA (here TruncatedSVD, which works directly on sparse matrices) and then plot the low-dimensional space.
In [39]:
from sklearn.decomposition import TruncatedSVD

# Project the sparse TF-IDF matrix down to 2 components for plotting
X_train_pca = TruncatedSVD(n_components=2).fit_transform(X_train)

from itertools import cycle

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    plt.scatter(X_train_pca[y_train == i, 0],
                X_train_pca[y_train == i, 1],
                c=c, label=twenty_train_small.target_names[i], alpha=0.8)
_ = plt.legend(loc='best')
Obviously, the data is no longer linearly separable after this projection, but some interesting patterns can still be observed: for example, alt.atheism and talk.religion.misc overlap.
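Part of that overlap simply reflects how little variance two components retain; checking the explained variance ratio of the projection gives a rough sense of this (a quick sketch, refitting the SVD because the instance above was used inline and not stored):
In [ ]:
# Sketch: how much variance the 2-D projection keeps.
svd = TruncatedSVD(n_components=2).fit(X_train)
print("explained variance ratio:", svd.explained_variance_ratio_)
print("total: %.3f" % svd.explained_variance_ratio_.sum())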