In some previous posts, I used data scraped from the New York Post's astrological horoscopes to demonstrate that the text of a horoscope is a better indicator of the month in which it was written than of the astrological sign it was written for. Here are the results from attempting to classify the horoscope texts into their astrological signs:
In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn import svm
from sklearn import metrics

# Load the scraped horoscopes and drop the leftover index column
df = pd.read_csv('../../../data/astrosign.csv', sep='|')
df = df.drop('Unnamed: 0', 1)
df = df.dropna()

# The first two characters of pub_date are the month of publication
df['month'] = df['pub_date'].map(lambda x: str(x)[0:2])

# Turn each horoscope into a bag-of-words count vector
cv = CountVectorizer()
wordcounts = cv.fit_transform(df['horoscope'])

# Hold out 30% of the horoscopes for testing
scope_train, scope_test, sign_train, sign_true = \
    train_test_split(wordcounts,
                     df['zodiac'],
                     test_size=.3,
                     random_state=42)

clf = svm.LinearSVC()
clf.fit(scope_train, sign_train)
predicted = clf.predict(scope_test)
scores = metrics.classification_report(sign_true, predicted)
print scores
As we can see, performance here is not good. Taken as a whole, the classifier is performing essentially at chance.
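If you want to make that chance baseline explicit, scikit-learn has a DummyClassifier that just guesses labels in proportion to how often they show up in the training set. A quick sketch, reusing the split from above:

from sklearn.dummy import DummyClassifier

# Guess signs in proportion to their frequency in the training set;
# this is the "chance" performance we're comparing against
dummy = DummyClassifier(strategy='stratified', random_state=42)
dummy.fit(scope_train, sign_train)
print metrics.classification_report(sign_true, dummy.predict(scope_test))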
However, when we look at classifying on month, performance is much better:
In [5]:
scope_train, scope_test, month_train, month_true = \
    train_test_split(wordcounts,
                     df['month'],
                     test_size=.3,
                     random_state=42)
clf = svm.LinearSVC()
clf.fit(scope_train, month_train)
predicted = clf.predict(scope_test)
scores = metrics.classification_report(month_true, predicted)
print scores
HA! We know more about the month of the year than we do about the astrological sign being discussed. Man, my job is cool.
Just in case you don't remember (or you never looked), here's what this classification would look like if there were no real relationship between a horoscope and the month it was published. We can establish this baseline by shuffling the labels so that each horoscope is paired with a random month rather than the one it truly belongs with.
In [6]:
import numpy as np

# Randomly reassign the month labels, destroying any real
# horoscope-month relationship
df['shuffled_month'] = np.random.permutation(df.month)
df.head()
In [8]:
# Same model as before, but trained on the shuffled labels
scope_train, scope_test, month_train, month_true = \
    train_test_split(wordcounts,
                     df['shuffled_month'],
                     test_size=.3,
                     random_state=42)
clf = svm.LinearSVC()
clf.fit(scope_train, month_train)
predicted = clf.predict(scope_test)
scores = metrics.classification_report(month_true, predicted)
print scores
I would say this pretty convincingly shows that the horoscopes carry more information about the month in which they were published than about the astrological sign they were written for.
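If you wanted to firm that claim up beyond a single train/test split, you could cross-validate both comparisons. A minimal sketch with the same LinearSVC (this uses plain accuracy rather than the full classification report):

from sklearn.cross_validation import cross_val_score

# Mean accuracy over 5 folds for each target variable
month_scores = cross_val_score(svm.LinearSVC(), wordcounts, df['month'], cv=5)
sign_scores = cross_val_score(svm.LinearSVC(), wordcounts, df['zodiac'], cv=5)
print 'month: %.3f  sign: %.3f' % (month_scores.mean(), sign_scores.mean())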
To be complete, let's try a random forest as well, like we did in the last post.
In [10]:
from sklearn.ensemble import RandomForestClassifier

# The RF classifier doesn't take the sparse matrix we used before,
# so we have to turn it into a dense numpy array. This doesn't change
# the values at all, just the internal representation.
wcarray = wordcounts.toarray()

scope_train, scope_test, month_train, month_true = \
    train_test_split(wcarray,
                     df.month,
                     test_size=.3,
                     random_state=42)

clf = RandomForestClassifier()
clf.fit(scope_train, month_train)
predicted = clf.predict(scope_test)
scores = metrics.classification_report(month_true, predicted)
print scores
A random forest seems to give us a bit better precision in this case, but the F1 score is the same. There's a problem here, however. Unlike the astrological signs, the month classes are not roughly balanced in their number of instances: there are fewer horoscopes for the months of June through November. This could be (and almost certainly is) biasing our learner, and it's an important factor to consider when fitting these kinds of models.
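You can check the imbalance directly with a value count, and one standard mitigation is to weight each class inversely to its frequency when fitting. A sketch, reusing the split from the last cell (note that the class_weight value is spelled 'balanced' in newer scikit-learn and 'auto' in older versions):

# How many horoscopes do we have for each month?
print df['month'].value_counts().sort_index()

# Reweight classes inversely to their frequency so the rarer
# months aren't drowned out ('auto' on older scikit-learn versions)
clf = svm.LinearSVC(class_weight='balanced')
clf.fit(scope_train, month_train)
print metrics.classification_report(month_true, clf.predict(scope_test))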