We will use logistic regression to predict the topic of a math question from the Math Education Resources (MER). For simplicity we will only consider two topics; using multiclass classification this can be extended to more than two topics (at the time of writing, April 2015, we have about 1500 questions with 150 topics on MER).
In [1]:
import os
import json
import numpy as np
In [2]:
FOLDER = 'json_data_2_topics'
file_locations = [file_name for file_name in os.walk(FOLDER)][0][-1]
questions = [json.loads(open(os.path.join(FOLDER, f), 'r').read()) for f in file_locations]
The variable questions
is now a list of MER questions. Let's have a look at an example.
In [3]:
questions[1]
Out[3]:
In [4]:
from IPython.display import HTML, Javascript, display
def display_MER_question(q, right=True):
    ''' A helper function to display questions in the notebook. '''
    def base_html(title, body):
        top = "<div style='background: #AAFFAA; width: 40%;"
        left = "position:absolute; left: 50%;"
        middle = "</div><div style='display: inline-block; width: 40%;"
        end = "</div>"
        if right:
            return top + "'>" + title + middle + "'>" + body + end
        else:
            return top + left + "'>" + title + middle + left + "'>" + body + end
    topic = base_html('Topic', q['topics'][0])
    statement = base_html('Statement', q['statement_html'])
    hint = base_html('Hint', q['hints_html'][0])
    sol = base_html('Solution', q['sols_html'][0])
    display(HTML(data=base_html(q['topics'][0], ("<h3>STATEMENT</h3>" + q['statement_html'] +
                                                 "<h3>HINT</h3>" + q['hints_html'][0] +
                                                 "<h3>SOLUTION</h3>" + q['sols_html'][0]))))
Let's take a look at rendered versions of two of the questions we are working with.
In [5]:
display_MER_question(questions[3], right=False)
display_MER_question(questions[1])
Our dataset consists of questions of two classes of topics:
In [6]:
TOPIC0 = 'Probability_density_function'
TOPIC1 = 'Eigenvalues_and_eigenvectors'
print('Total number of questions: %d' % len(questions))
print('Number of questions on %s: %d' %(TOPIC0, sum([1 for q in questions if TOPIC0 in q['topics']])))
print('Number of questions on %s: %d' %(TOPIC1, sum([1 for q in questions if TOPIC1 in q['topics']])))
Logistic regression models the probability that an observation belongs to the positive class with the sigmoid (logistic) function

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}},$$

where $\theta^T = (\theta_0, \theta_1, \dots, \theta_n)$ are the weights (or parameters) of the features $x_i$, $i=0,\dots,n$. The sigmoid function $h_\theta(x)$ maps into $[0, 1]$ and is interpreted as the probability (or confidence) that the observation $x$ is positive. We also call $h_\theta$ the hypothesis.
The weights $\theta$ are trained (or learned) by minimizing the cost function $J(\theta)$ over the training set of size $m$:
$$\min_\theta J(\theta) = \min_\theta \frac{-1}{m} \left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$

Note that for each training example $x^{(i)}$ with label $y^{(i)}$ one of the two summands is zero. The non-zero summand is unbounded and grows the more confident the logistic regression is about an incorrect prediction. Note further that $J$ is convex, which means that the optimal parameter $\theta$ is unique and can be computed efficiently.
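For concreteness, here is a minimal NumPy sketch of the hypothesis and the cost function (not from the original notebook; scikit-learn computes and minimizes this internally, and the names below are chosen just for illustration):

import numpy as np

def hypothesis(theta, X):
    # sigmoid of the linear score theta^T x, one value per row of X
    return 1.0 / (1.0 + np.exp(-X.dot(theta)))

def cost(theta, X, y):
    # average cross-entropy over the m training examples
    h = hypothesis(theta, X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# toy example: 3 observations with 2 features each
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, 0.0, 1.0])
print(cost(np.array([2.0, -2.0]), X_toy, y_toy))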
In [7]:
np.random.seed(23) # for reproducibility we set the seed of the random number generator
m = 20
test_indices = np.random.choice(range(len(questions)), m, replace=False)
questions_train = [q for i, q in enumerate(questions) if not i in test_indices]
questions_test = [q for i, q in enumerate(questions) if i in test_indices]
print('%s questions in test set: %d' % (TOPIC0, sum([1 for q in questions_test if TOPIC0 in q['topics']])))
print('%s questions in test set: %d' % (TOPIC1, sum([1 for q in questions_test if TOPIC1 in q['topics']])))
In the data preparation we have to transform the written content to feature vectors $x \in \mathbb{R}^n$, and the topics to a binary class label $y \in \{0, 1\}$. Starting with the latter, we set
0 = Probability_density_function
1 = Eigenvalues_and_eigenvectors
In [8]:
def topic_from_question(q):
return TOPIC1 in q['topics']
Transforming the text to a vector is trickier. Our strategy is the following: strip the HTML markup and split the text into words, remove stop words and stem the remaining words, build a vocabulary from all words that occur in the training questions, and finally represent each question as a binary vector indicating which vocabulary words it contains.
Before we do any of that, we split off a training and a test set so that we can check our performance later. More precisely, we set aside 20 randomly chosen questions and train the algorithm on the remaining 61. This way we can compare the predictions for the test set with the real topics.
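As an aside, scikit-learn can produce an equivalent split; a sketch, assuming a recent version where the helper lives in sklearn.model_selection (older versions expose it as sklearn.cross_validation):

from sklearn.model_selection import train_test_split
# test_size=20 sets aside exactly 20 questions; random_state fixes the shuffle for reproducibility
questions_train_alt, questions_test_alt = train_test_split(questions, test_size=20, random_state=23)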
In [9]:
import helpers
from nltk import PorterStemmer
from nltk.corpus import stopwords
NLTK provides a collection of 127 common stop words in the English language. They usually don't carry much meaning, so we ignore them here. This step is optional in our example. A common alternative that keeps the classifier from being distracted by low-information words is tf-idf weighting.
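For reference, a minimal tf-idf sketch with scikit-learn's TfidfVectorizer on a toy corpus (hypothetical; the rest of this notebook sticks with binary word features):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the eigenvalues of the matrix', 'the probability density function']
vectorizer = TfidfVectorizer(stop_words='english')  # drops English stop words, weights the rest by tf-idf
X_tfidf = vectorizer.fit_transform(docs)            # sparse matrix, one row per document
print(X_tfidf.toarray())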
In [10]:
stopwords.words('english')[:10]
Out[10]:
In [11]:
[PorterStemmer().stem_word(w) for w in ['sensation', 'sensational', 'seasonal', 'flying', 'flies', 'fly', 'swing', 'swine']]
Out[11]:
In [12]:
def words_from_question(q):
    all_text = q['statement_html'] + q['hints_html'][0] + q['sols_html'][0]
    return helpers.strip_text(all_text)

def words_stemmed_no_stop(words):
    stop = stopwords.words('english')
    res = []
    for word in words:
        stemmed = PorterStemmer().stem_word(word)
        if stemmed not in stop and len(stemmed) > 1:
            res.append(stemmed)
    return res
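The helpers module is a small local utility that is not included in this post. A rough, purely hypothetical sketch of what its strip_text function might look like (strip HTML tags, keep only letters, lowercase, and split into words):

import re

def strip_text(html):
    # hypothetical stand-in for helpers.strip_text
    text = re.sub(r'<[^>]+>', ' ', html)      # drop HTML tags
    text = re.sub(r'[^a-zA-Z]+', ' ', text)   # keep only letters
    return text.lower().split()               # list of lowercase words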
In [13]:
display_MER_question(questions_train[-1])
In [14]:
q_text = words_from_question(questions_train[-1])
print(q_text)
print('-' * 100)
print('Words stemmed and stopwords removed:')
print('-' * 100)
print(words_stemmed_no_stop(q_text))
In [15]:
vocabulary = []
for q in questions_train:
    vocabulary += words_stemmed_no_stop(words_from_question(q))
vocabulary_sorted = sorted(set(vocabulary))
print('Number of distinct words:', len(vocabulary_sorted))
print(vocabulary_sorted[:15])
Now that we know the size $n$ of the vector space of MER questions, we can finally transform each question into a vector $x \in \mathbb{R}^n$.
In [16]:
def question_to_vector(q, voc):
    x_vec = np.zeros(len(voc))
    words = words_stemmed_no_stop(words_from_question(q))
    for word in words:
        if word in voc:
            x_vec[voc.index(word)] = 1
    return x_vec
In [17]:
print(question_to_vector(questions_train[-1], vocabulary_sorted))
The go-to machine learning library in Python is scikit-learn. It expects the features in matrix form $X \in \mathbb{R}^{m \times n}$ (one row per observation), together with a vector $y \in \mathbb{R}^m$ of class labels (one entry per observation).
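As an aside, scikit-learn's CountVectorizer with binary=True could produce essentially the same binary bag-of-words matrix as our hand-rolled question_to_vector (a sketch, not used in the rest of this notebook):

from sklearn.feature_extraction.text import CountVectorizer
# the analyzer receives a question dict and returns its list of stemmed, stop-word-free tokens
binary_vectorizer = CountVectorizer(binary=True,
                                    analyzer=lambda q: words_stemmed_no_stop(words_from_question(q)))
X_train_alt = binary_vectorizer.fit_transform(questions_train)  # sparse matrix, one row per question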
In [18]:
from sklearn.linear_model import LogisticRegression
In [19]:
def questions_to_X_y(qs, voc):
    X = np.zeros(shape=(len(qs), len(voc)))
    y = np.zeros(shape=(len(qs)))
    for i, q in enumerate(qs):
        X[i, :] = question_to_vector(q, voc)
        y[i] = topic_from_question(q)
    return X, y
In [20]:
X_train, y_train = questions_to_X_y(questions_train, vocabulary_sorted)
X_test, y_test = questions_to_X_y(questions_test, vocabulary_sorted)
With this setup all we need to do is create a LogisticRegression classifier from scikit-learn, fit the parameter vector $\theta$ on the training set, and predict the classes of the test set.
In [21]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test))
Each row is the classification prediction for one question in the test set. The left column is the confidence that the question is on TOPIC0, the right column the confidence that it is on TOPIC1. In general the classifier is very confident in its predictions. So, how well did we do?
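One way to answer that is to compare hard predictions against the true labels; a quick sketch (not in the original notebook) before condensing everything into a single accuracy score:

y_pred = clf.predict(X_test)                                # hard 0/1 predictions (threshold 0.5)
for prob, pred, true in zip(clf.predict_proba(X_test)[:, 1], y_pred, y_test):
    print('P(%s) = %.3f -> predicted %d, actual %d' % (TOPIC1, prob, pred, true))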
In [22]:
clf.score(X_test, y_test)
Out[22]:
That... is surprisingly high! What patterns did the classifier pick up to become so good? Recall

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}.$$

The trained weights $\theta_i$ tell us the importance of feature $i$ for the classification problem. But feature $i$ is just the presence or absence of the $i$-th word. Hence we can check which words carry the most meaning to distinguish between the topics.
In [23]:
most_important_features = np.abs(clf.coef_).argsort()[0][-1:-11:-1]
least_important_features = np.abs(clf.coef_).argsort()[0][:10]
print('Most important words:')
for ind in most_important_features:
print('%6.3f, %s' % (clf.coef_[0][ind], vocabulary_sorted[ind]))
print('-' * 30)
print('Least important words:')
for ind in least_important_features:
print('%6.3f, %s' % (clf.coef_[0][ind], vocabulary_sorted[ind]))
Pretty sweet: the logistic regression learned that matrix, eigenvalu and eigenvector are really good indicators for an Eigenvalues_and_eigenvectors question. On the other hand, probabl, densiti and function indicate a question on Probability_density_function.
Another way to visualize this result is to use word clouds.
In [24]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
In [25]:
wc = WordCloud().generate(' '.join([vocabulary_sorted[ind] for ind in clf.coef_.argsort()[0][:12]]))
plt.imshow(wc)
plt.axis("off")
plt.show()
In [26]:
wc = WordCloud().generate(' '.join([vocabulary_sorted[ind] for ind in clf.coef_.argsort()[0][-1:-13:-1]]))
plt.imshow(wc)
plt.axis("off")
plt.show()
Logistic regression has a linear decision boundary, that is, the parameter vector $\theta$ defines a hyperplane in $\mathbb{R}^n$, and points get classified depending on which side of the hyperplane they are on.
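Concretely, the linear score $\theta^T x$ (plus intercept) is exposed as decision_function, and applying the sigmoid to it recovers the probabilities from above; a small sketch, assuming clf and X_test are still in scope:

scores = clf.decision_function(X_test)                      # theta^T x + theta_0 for each test question
probs = 1.0 / (1.0 + np.exp(-scores))                       # sigmoid of the linear score
print(np.allclose(probs, clf.predict_proba(X_test)[:, 1]))  # True: matches the second column above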
In our case we cannot visualize all dimensions. However, we can use Principal Component Analysis to project the dataset down to two or three dimensions for plotting. Ideally we see that the different classes can be separated by a hyperplane, or at least cluster in some way.
In [27]:
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
In [28]:
pca = PCA(n_components=3)
pca.fit(X_train)
pca_X_train = pca.transform(X_train)
pca_X_test = pca.transform(X_test)
print('The first 3 principal components explain %.2f of the variance in the dataset.' % sum(pca.explained_variance_ratio_))
In [29]:
fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=25, azim=70)
for c, i, label in zip('rgb', [0, 1], [TOPIC0, TOPIC1]):
    ax.scatter(pca_X_train[y_train == i, 0],
               pca_X_train[y_train == i, 1],
               pca_X_train[y_train == i, 2],
               c=c, label=label)
for c, i, label in zip('rgb', [0, 1], [TOPIC0 + ' (test)', TOPIC1 + ' (test)']):
    ax.scatter(pca_X_test[y_test == i, 0],
               pca_X_test[y_test == i, 1],
               pca_X_test[y_test == i, 2],
               c=c, label=label, marker='x')
plt.legend()
plt.show()
Finally, what is up with the question for which the classifier had a hard time choosing a topic?
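One way to find it (a sketch, not in the original notebook) is to pick the test question whose predicted probability is closest to 0.5:

probs = clf.predict_proba(X_test)[:, 1]
hardest = np.argmin(np.abs(probs - 0.5))   # index of the least confident prediction
print(hardest, probs[hardest])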
In [30]:
display_MER_question(questions_test[5])
Ah, that makes sense. This question has almost no content, making it harder to pick up on the two key words median and distribution.