Predict the topic of a Math Question on Math Education Resources

We will use logistic regression to predict the topic of a math question from the Math Education Resources (MER). For simplicity we only consider two topics; using multiclass classification this can be extended to more than two (at the time of writing, April 2015, we have about 1500 questions covering 150 topics on MER).

Data inspection


In [1]:
import os
import json
import numpy as np

In [2]:
FOLDER = 'json_data_2_topics'
# os.walk yields (dirpath, dirnames, filenames) tuples; grab the filenames in FOLDER
file_names = next(os.walk(FOLDER))[2]
questions = [json.load(open(os.path.join(FOLDER, f), 'r')) for f in file_names]

The variable questions is now a list of MER questions. Let's have a look at an example.


In [3]:
questions[1]


Out[3]:
{u'ID': u'UBC+MATH105+April_2010+01_(l)',
 u'answer_html': u'<p>hence we choose <em>k=5/62</em>.</p>\n',
 u'answer_latex': u'hence we choose \\emph{k=5/62}.',
 u'contributors': [u'DavidKohler'],
 u'course': u'MATH105',
 u'flags': [u'QGQ', u'QGH', u'QGS', u'RT'],
 u'hints_html': [u'<p>For <span class="math">\\(f(x)\\)</span> to be a probability density function on <span>[</span>a,b<span>]</span> it needs to satisfy</p>\n<p><span class="math">\\[\\begin{aligned}\n(i) \\quad f(x) \\geq 0\\end{aligned}\\]</span></p>\n<p>for all <span class="math">\\(x \\in [a,b]\\)</span>, and</p>\n<p><span class="math">\\[\\begin{aligned}\n(ii) \\quad \\int_a^b f(x)\\,dx = 1.\\end{aligned}\\]</span></p>\n'],
 u'hints_latex': [u'For $f(x)$ to be a probability density function on {[}a,b{]} it needs to\nsatisfy\n\n\\begin{align*}\n(i) \\quad f(x) \\geq 0\n\\end{align*}\n\nfor all $x \\in [a,b]$, and\n\n\\begin{align*}\n(ii) \\quad \\int_a^b f(x)\\,dx = 1.\n\\end{align*}'],
 u'hints_raw': [u'For &lt;math>f(x)&lt;/math> to be a probability density function on [a,b] it needs to satisfy\n\n:&lt;math>(i) \\quad f(x) \\geq 0&lt;/math>\n\nfor all &lt;math>x \\in [a,b]&lt;/math>, and\n\n:&lt;math> (ii) \\quad \\int_a^b f(x)\\,dx = 1. &lt;/math>'],
 u'num_votes': 0,
 u'question': u'1 (l)',
 u'rating': -1,
 u'sols_html': [u'<p>The constant <em>k</em> needs to be chosen such that <em>f(x)</em> integrates to 1:</p>\n<p><span class="math">\\[\\begin{aligned}\n\\int_1^4 kx^{3/2}\\,dx &amp;= k\\int_1^4x^{3/2}\\,dx =k\\frac25x^{5/2}\\bigg|_1^4 \\\\\n&amp;=k\\frac25(4^{5/2}-1^{5/2}) = k\\frac{62}5,\\end{aligned}\\]</span></p>\n<p>hence we choose <em>k=5/62</em>.</p>\n'],
 u'sols_latex': [u'The constant \\emph{k} needs to be chosen such that \\emph{f(x)}\nintegrates to 1:\n\n\\begin{align*}\n\\int_1^4 kx^{3/2}\\,dx &= k\\int_1^4x^{3/2}\\,dx =k\\frac25x^{5/2}\\bigg|_1^4 \\\\\n&=k\\frac25(4^{5/2}-1^{5/2}) = k\\frac{62}5,\n\\end{align*}\n\nhence we choose \\emph{k=5/62}.'],
 u'sols_raw': [u"The constant ''k'' needs to be chosen such that ''f(x)'' integrates to 1:\n\n:&lt;math>\\begin{align}\n\\int_1^4 kx^{3/2}\\,dx &amp;= k\\int_1^4x^{3/2}\\,dx =k\\frac25x^{5/2}\\bigg|_1^4 \\\\\n&amp;=k\\frac25(4^{5/2}-1^{5/2}) = k\\frac{62}5,\n\\end{align}\n&lt;/math>\n\nhence we choose ''k=5/62''."],
 u'solvers': [u'Konradbe'],
 u'statement_html': u'<p>Let <em>k</em> be a constant. Find the value of <em>k</em> such that</p>\n<p><span class="math">\\[\\begin{aligned}\n\\displaystyle f(x) = k x^{3/2}\\end{aligned}\\]</span></p>\n<p>is a probability density function on 1 <span class="math">\\(\\leq\\)</span> <em>x</em> <span class="math">\\(\\leq\\)</span> 4.</p>\n',
 u'statement_latex': u'Let \\emph{k} be a constant. Find the value of \\emph{k} such that\n\n\\begin{align*}\n\\displaystyle f(x) = k x^{3/2}\n\\end{align*}\n\nis a probability density function on 1 $\\leq$ \\emph{x} $\\leq$ 4.',
 u'statement_raw': u"Let ''k'' be a constant. Find the value of ''k'' such that\n\n:&lt;math> \\displaystyle f(x) = k x^{3/2} &lt;/math>\n\nis a probability density function on 1 \u2264 ''x'' \u2264 4.",
 u'term': u'April',
 u'topic_suggest': [u'Separation_of_variables'],
 u'topics': [u'Probability_density_function'],
 u'url': u'http://wiki.ubc.ca/Science:Math_Exam_Resources/Courses/MATH105/April_2010/Question_01_(l)',
 u'year': 2010}

In [4]:
from IPython.display import HTML, display

def display_MER_question(q, right=True):
    ''' A helper function to display a question's topic, statement, hint and solution in the notebook. '''
    def base_html(title, body):
        top = "<div style='background: #AAFFAA; width: 40%;"
        left = "position:absolute; left: 50%;"
        middle = "</div><div style='display: inline-block; width: 40%;"
        end = "</div>"
        if right:
            return top + "'>" + title + middle + "'>" + body + end
        else:
            return top + left + "'>" + title + middle + left + "'>" + body + end
    # the title box shows the topic, the body box shows statement, hint and solution
    display(HTML(data=base_html(q['topics'][0], ("<h3>STATEMENT</h3>" + q['statement_html'] +
                                                 "<h3>HINT</h3>" + q['hints_html'][0] +
                                                 "<h3>SOLUTION</h3>" + q['sols_html'][0]))))

Let's take a look at rendered versions of two of the questions we are working with.


In [5]:
display_MER_question(questions[3], right=False)
display_MER_question(questions[1])


Eigenvalues_and_eigenvectors

STATEMENT

What are the eigenvalues and eigenvectors of the matrix A below?

\(A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\)

HINT

Eigenvalues are the roots of the characteristic polynomial given by \(\det(A-\lambda I)\). For the eigenvectors, observe what happens when you take a vector v and multiply it by your matrix.

SOLUTION

Notice that

\(\det(A-I\lambda) = \det\left( \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} -\begin{bmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & \lambda \end{bmatrix} \right) = -\lambda^{3}\)

and the roots of this polynomial are \(\lambda =0\). This is a triple root and so this is the only eigenvalue. Next, for any vector v, we have

\(Av = 0 = 0v\)

and hence every nonzero vector v is an eigenvector.

Probability_density_function

STATEMENT

Let k be a constant. Find the value of k such that

\[\begin{aligned} \displaystyle f(x) = k x^{3/2}\end{aligned}\]

is a probability density function on 1 \(\leq\) x \(\leq\) 4.

HINT

For \(f(x)\) to be a probability density function on [a,b] it needs to satisfy

\[\begin{aligned} (i) \quad f(x) \geq 0\end{aligned}\]

for all \(x \in [a,b]\), and

\[\begin{aligned} (ii) \quad \int_a^b f(x)\,dx = 1.\end{aligned}\]

SOLUTION

The constant k needs to be chosen such that f(x) integrates to 1:

\[\begin{aligned} \int_1^4 kx^{3/2}\,dx &= k\int_1^4x^{3/2}\,dx =k\frac25x^{5/2}\bigg|_1^4 \\ &=k\frac25(4^{5/2}-1^{5/2}) = k\frac{62}5,\end{aligned}\]

hence we choose k=5/62.

Our dataset consists of questions on two topics:


In [6]:
TOPIC0 = 'Probability_density_function'
TOPIC1 = 'Eigenvalues_and_eigenvectors'
print('Total number of questions: %d' % len(questions))
print('Number of questions on %s: %d' %(TOPIC0, sum([1 for q in questions if TOPIC0 in q['topics']])))
print('Number of questions on %s: %d' %(TOPIC1, sum([1 for q in questions if TOPIC1 in q['topics']])))


Total number of questions: 81
Number of questions on Probability_density_function: 37
Number of questions on Eigenvalues_and_eigenvectors: 44

Data Preparation

Recall logistic regression

$$h_\theta(x) = \frac1{1+e^{-\theta^Tx}}$$

where $\theta^T = (\theta_0, \theta_1, \dots, \theta_n)$ are the weights (or parameters) of the features $x_i$, $i=0,\dots,n$. The sigmoid function $h_\theta(x)$ maps into $[0, 1]$ and is interpreted as the probability (or confidence) that the observation $x$ is positive. We also call $h_\theta$ the hypothesis.

The weights $\theta$ are trained (or learned) by minimizing the cost function $J(\theta)$ over the training set of size $m$:

$$\min_\theta J(\theta) = \min_\theta \frac{-1}{m} \left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$

Note that for each training example $x^{(i)}$ with label $y^{(i)}$, exactly one of the two summands is zero. The non-zero summand is unbounded and grows the more confident the logistic regression is about an incorrect prediction. Note further that $J$ is convex, so it has no spurious local minima and the optimal parameters $\theta$ can be computed efficiently.
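To make the formulas concrete, here is a minimal NumPy sketch of the hypothesis $h_\theta$ and the cost $J(\theta)$. It is for illustration only; scikit-learn does the actual fitting for us below (and additionally adds a regularization term that we ignore here).

In [ ]:
def h_theta(theta, X):
    ''' Sigmoid hypothesis: estimated probability that each row of X belongs to class 1.
        (An intercept theta_0 corresponds to a column of ones appended to X.) '''
    return 1.0 / (1.0 + np.exp(-X.dot(theta)))

def J(theta, X, y):
    ''' Average cross-entropy cost over a training set X (m x n) with labels y in {0, 1}. '''
    h = h_theta(theta, X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))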

Split off training and test set

In order to test the performance of the algorithm, we set aside a test set that is not touched until the final verification.


In [7]:
np.random.seed(23)  # for reproducibility we set the seed of the random number generator
m = 20
test_indices = np.random.choice(range(len(questions)), m, replace=False)
questions_train = [q for i, q in enumerate(questions) if not i in test_indices]
questions_test = [q for i, q in enumerate(questions) if i in test_indices]
print('%s questions in test set: %d' % (TOPIC0, sum([1 for q in questions_test if TOPIC0 in q['topics']])))
print('%s questions in test set: %d' % (TOPIC1, sum([1 for q in questions_test if TOPIC1 in q['topics']])))


Probability_density_function questions in test set: 9
Eigenvalues_and_eigenvectors questions in test set: 11

Transforming text data to vectors

Manufacturing the class label

In the data preparation we have to transform the written content to feature vectors $x \in \mathbb{R}^n$, and the topics to a binary class label $y \in \{0, 1\}$. Starting with the latter, we set

0 = Probability_density_function

1 = Eigenvalues_and_eigenvectors


In [8]:
def topic_from_question(q):
    # class label: True (1) if the question is on TOPIC1, False (0) if it is on TOPIC0
    return TOPIC1 in q['topics']

Manufacturing the feature vector

Transforming the text to a vector is trickier. Our strategy is the following:

  1. Find all words in all questions.
  2. Collapse similar words (stemming).
  3. Remove unwanted words (stopwords) to obtain a dictionary of $n$ words in total.
  4. Each remaining word is a basis vector in $\mathbb{R}^n$.
  5. question vector $= \sum_{i=1}^n e_i \, \mathbf{1}_{\{\text{word } i \text{ in question}\}}$ (see the toy sketch below).
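As a toy illustration of step 5 (with a hypothetical four-word vocabulary, not the one we build below):

In [ ]:
toy_voc = ['densiti', 'eigenvalu', 'matrix', 'probabl']   # hypothetical mini-vocabulary
toy_question = ['matrix', 'eigenvalu', 'matrix']          # words of a made-up question

x_toy = np.zeros(len(toy_voc))
for word in toy_question:
    if word in toy_voc:
        x_toy[toy_voc.index(word)] = 1   # presence only; repeated words still give a single 1
print(x_toy)   # -> [ 0.  1.  1.  0.]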

Recall that before doing any of that, we already split off a training set and a test set above so that we can check our performance later: 20 randomly chosen questions were set aside, and the algorithm will be trained on the remaining 61. This way we can compare the predictions for the test set with the real topics.

Aside: Basic natural language processing


In [9]:
import helpers
from nltk import PorterStemmer
from nltk.corpus import stopwords

Stopwords

A collection of 127 common stop words in the English language. They usually don't carry much meaning, hence we ignore them here. This step is optional in our example; a common alternative for keeping the classifier from being distracted by low-information words is tf-idf weighting.
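For reference, a minimal sketch of that alternative using scikit-learn's TfidfVectorizer (not used in the rest of this notebook; it vectorizes the raw statement_html strings, HTML tags and all, just to show the API):

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf down-weights words that appear in many documents, so very common words
# contribute little even without an explicit stopword list
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform([q['statement_html'] for q in questions_train])
print(X_tfidf.shape)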


In [10]:
stopwords.words('english')[:10]


Out[10]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

Stemming

A tool from linguistics that reduces each word to its base or root form. Among other things, this lets us treat plural and singular forms as the same word.


In [11]:
[PorterStemmer().stem_word(w) for w in ['sensation', 'sensational', 'seasonal', 'flying', 'flies', 'fly', 'swing', 'swine']]


Out[11]:
['sensat', 'sensat', 'season', 'fli', 'fli', 'fli', 'swing', 'swine']

In [12]:
def words_from_question(q):
    # combine the statement, first hint and first solution, then tokenize with the helper module
    all_text = q['statement_html'] + q['hints_html'][0] + q['sols_html'][0]
    return helpers.strip_text(all_text)

def words_stemmed_no_stop(words):
    # stem every word and drop stopwords as well as single-character tokens
    stop = stopwords.words('english')
    stemmer = PorterStemmer()
    res = []
    for word in words:
        stemmed = stemmer.stem_word(word)
        if stemmed not in stop and len(stemmed) > 1:
            res.append(stemmed)
    return res

Pre-processing in action


In [13]:
display_MER_question(questions_train[-1])


Probability_density_function

STATEMENT

Consider the function

\[\begin{aligned} F(x) = \begin{cases} a & \text{if } x < 0, \\ k \arctan x & \text{if } 0 \leq x \leq 1, \\ b & \text{if } x \geq 1. \end{cases}\end{aligned}\]

(b) Let X be the continuous random variable with cummulative distribution function F(x) as given in part (a). Find the probability density function of X.

HINT

The cumulative distribution function and the probability density function are related by a simple formula that involves an integral. Use the fundamental theorem of calculus to change this integral relationship to one using a derivative instead.

SOLUTION

The cumulative distribution is obtained by integrating the probability density function, hence the probability density function (pdf) is the derivative of the cumulative distribution function.

From part (a),

\[\begin{aligned} F(x) = \begin{cases} 0 & \text{if } x < 0, \\ \frac{4}{\pi} \arctan x & \text{if } 0 \leq x \leq 1, \\ 1 & \text{if } x \geq 1. \end{cases}\end{aligned}\]

hence the pdf is the derivative:

\[\begin{aligned} f(x) = F'(x) = \begin{cases} 0 & \text{if } x < 0, \\ \frac{4}{\pi} \frac{1}{1+x^2} & \text{if } 0 < x < 1, \\ 0 & \text{if } x > 1. \end{cases}\end{aligned}\]

If you are really picky, you might be concerned about the endpoints. Technically the derivative is not defined at the points 0 and 1 (due to “corners”); however, for continuous random variables, the probabilities will be unaffected by isolated points where the derivative of the cumulative distribution function is undefined. How we defined the pdf \(f(x)\) above is completely fine.


In [14]:
q_text = words_from_question(questions_train[-1])
print(q_text)
print('-' * 100)
print('Words stemmed and stopwords removed:')
print('-' * 100)
print(words_stemmed_no_stop(q_text))


[u'', u'consider', u'the', u'function', u'strong', u'b', u'strong', u'let', u'be', u'the', u'continuous', u'random', u'variable', u'with', u'cummulative', u'distribution', u'function', u'as', u'given', u'in', u'part', u'span', u'a', u'find', u'the', u'probability', u'density', u'function', u'of', u'the', u'cumulative', u'distribution', u'function', u'and', u'the', u'probability', u'density', u'function', u'are', u'related', u'by', u'a', u'simple', u'formula', u'that', u'involves', u'an', u'integral', u'use', u'the', u'fundamental', u'theorem', u'of', u'calculus', u'to', u'change', u'this', u'integral', u'relationship', u'to', u'one', u'using', u'a', u'derivative', u'instead', u'the', u'cumulative', u'distribution', u'is', u'obtained', u'by', u'integrating', u'the', u'probability', u'density', u'function', u'hence', u'the', u'probability', u'density', u'function', u'pdf', u'is', u'the', u'derivative', u'of', u'the', u'cumulative', u'distribution', u'function', u'from', u'part', u'a', u'hence', u'the', u'pdf', u'is', u'the', u'derivative', u'if', u'you', u'are', u'really', u'picky', u'you', u'might', u'be', u'concerned', u'about', u'the', u'endpoints', u'technically', u'the', u'derivative', u'is', u'not', u'defined', u'at', u'the', u'points', u'and', u'due', u'to', u'corners', u'however', u'for', u'continuous', u'random', u'variables', u'the', u'probabilities', u'will', u'be', u'unaffected', u'by', u'isolated', u'points', u'where', u'the', u'derivative', u'of', u'the', u'cumulative', u'distribution', u'function', u'is', u'undefined', u'how', u'we', u'defined', u'the', u'pdf', u'above', u'is', u'completely', u'fine', u'fraction', u'arctangent', u'pi', u'greater_than']
----------------------------------------------------------------------------------------------------
Words stemmed and stopwords removed:
----------------------------------------------------------------------------------------------------
[u'consid', u'function', u'strong', u'strong', u'let', u'continu', u'random', u'variabl', u'cummul', u'distribut', u'function', u'given', u'part', u'span', u'find', u'probabl', u'densiti', u'function', u'cumul', u'distribut', u'function', u'probabl', u'densiti', u'function', u'relat', u'simpl', u'formula', u'involv', u'integr', u'use', u'fundament', u'theorem', u'calculu', u'chang', u'thi', u'integr', u'relationship', u'one', u'use', u'deriv', u'instead', u'cumul', u'distribut', u'obtain', u'integr', u'probabl', u'densiti', u'function', u'henc', u'probabl', u'densiti', u'function', u'pdf', u'deriv', u'cumul', u'distribut', u'function', u'part', u'henc', u'pdf', u'deriv', u'realli', u'picki', u'might', u'concern', u'endpoint', u'technic', u'deriv', u'defin', u'point', u'due', u'corner', u'howev', u'continu', u'random', u'variabl', u'probabl', u'unaffect', u'isol', u'point', u'deriv', u'cumul', u'distribut', u'function', u'undefin', u'defin', u'pdf', u'abov', u'complet', u'fine', u'fraction', u'arctang', u'pi', u'greater_than']

Continued: Manufacturing the feature vector

With this, we can now create our dictionary of words, which defines $n$, the size of the feature vectors $x$.


In [15]:
vocabulary = []
for q in questions_train:
    vocabulary += words_stemmed_no_stop(words_from_question(q))
vocabulary_sorted = sorted(set(vocabulary))
print('Number of distinct words: %d' % len(vocabulary_sorted))
print(vocabulary_sorted[:15])


Number of distinct words: 580
[u'abl', u'abov', u'absolut', u'accept', u'accord', u'across', u'act', u'ad', u'addit', u'adjoin', u'ahead', u'algebra', u'allow', u'along', u'alreadi']

Now that we know the size $n$ of the vector space of MER questions, we can finally transform each question into a vector $x \in \mathbb{R}^n$.


In [16]:
def question_to_vector(q, voc):
    ''' Binary bag-of-words vector: entry i is 1 if the i-th vocabulary word occurs in the question. '''
    x_vec = np.zeros(len(voc))
    words = words_stemmed_no_stop(words_from_question(q))
    for word in words:
        if word in voc:
            x_vec[voc.index(word)] = 1
    return x_vec

In [17]:
print(question_to_vector(questions_train[-1], vocabulary_sorted))


[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  1.
  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.
  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.  0.  1.  1.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.
  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.
  0.  0.  1.  1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  1.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  0.  1.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.]

Applying logistic regression

The go-to machine learning library in Python is scikit-learn. It expects the features in matrix form $X \in \mathbb{R}^{m \times n}$ (one row per observation), together with a vector $y \in \mathbb{R}^m$ of class labels (one entry per observation).


In [18]:
from sklearn.linear_model import LogisticRegression

In [19]:
def questions_to_X_y(qs, voc):
    X = np.zeros(shape=(len(qs), len(voc)))
    y = np.zeros(shape=(len(qs)))

    for i, q in enumerate(qs):
        X[i, :] = question_to_vector(q, voc)
        y[i] = topic_from_question(q)
    return X, y

In [20]:
X_train, y_train = questions_to_X_y(questions_train, vocabulary_sorted)
X_test, y_test = questions_to_X_y(questions_test, vocabulary_sorted)

With this setup, all we need to do is instantiate a LogisticRegression classifier from scikit-learn, fit the parameter vector $\theta$ on the training set, and predict the classes of the test set.


In [21]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test))


[[ 0.01918951  0.98081049]
 [ 0.98621408  0.01378592]
 [ 0.02675994  0.97324006]
 [ 0.95785771  0.04214229]
 [ 0.96679451  0.03320549]
 [ 0.57601222  0.42398778]
 [ 0.93129252  0.06870748]
 [ 0.10557706  0.89442294]
 [ 0.9666781   0.0333219 ]
 [ 0.0509448   0.9490552 ]
 [ 0.96717695  0.03282305]
 [ 0.04483096  0.95516904]
 [ 0.01033399  0.98966601]
 [ 0.12491478  0.87508522]
 [ 0.01048887  0.98951113]
 [ 0.01594232  0.98405768]
 [ 0.97963112  0.02036888]
 [ 0.13639215  0.86360785]
 [ 0.03771143  0.96228857]
 [ 0.91540732  0.08459268]]

Each row is the classification prediction for one question in the test set. The left column is the confidence that the question is on TOPIC0, the right column the confidence that it is on TOPIC1. In general the classifier is very confident in its predictions. So, how well did we do?


In [22]:
clf.score(X_test, y_test)


Out[22]:
1.0

That...is surprisingly high! What patterns did the classifier pick up to become so good? Recall

$$h_\theta(x) = \frac1{1+e^{-\theta^Tx}}$$

The trained weights $\theta_i$ tell us how important feature $i$ is for the classification. But feature $i$ is just the presence or absence of the $i$-th word, hence we can check which words carry the most weight in distinguishing the two topics.


In [23]:
# feature indices sorted by |coefficient|: the largest are the most informative words, the smallest the least informative
most_important_features = np.abs(clf.coef_).argsort()[0][-1:-10:-1]
least_important_features = np.abs(clf.coef_).argsort()[0][:10]
print('Most important words:')
for ind in most_important_features:
    print('%6.3f, %s' % (clf.coef_[0][ind], vocabulary_sorted[ind]))
print('-' * 30)
print('Least important words:')
for ind in least_important_features:
    print('%6.3f, %s' % (clf.coef_[0][ind], vocabulary_sorted[ind]))


Most important words:
 0.868, matrix
-0.753, probabl
 0.727, eigenvalu
-0.622, densiti
-0.578, function
 0.545, eigenvector
-0.418, integr
-0.394, valu
 0.347, vector
------------------------------
Least important words:
 0.001, scale
 0.001, complet
 0.002, howev
-0.002, mind
-0.002, veri
 0.002, rememb
 0.002, approach
 0.003, hold
 0.003, onli
-0.003, proof

Pretty sweet: the logistic regression learned that matrix, eigenvalu and eigenvector are really good indicators for an Eigenvalues_and_eigenvectors question. On the other hand, probabl, densiti and function indicate a question on Probability_density_function.

Another way to visualize this result is to use word clouds.


In [24]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

Words that signal a question on Probability_density_function


In [25]:
# the 12 most negative coefficients point towards class 0 (Probability_density_function)
wc = WordCloud().generate(' '.join([vocabulary_sorted[ind] for ind in clf.coef_.argsort()[0][:12]]))
plt.imshow(wc)
plt.axis("off")
plt.show()


Words that signal a question on Eigenvalues_and_eigenvectors


In [26]:
# the 11 most positive coefficients point towards class 1 (Eigenvalues_and_eigenvectors)
wc = WordCloud().generate(' '.join([vocabulary_sorted[ind] for ind in clf.coef_.argsort()[0][-1:-12:-1]]))
plt.imshow(wc)
plt.axis("off")
plt.show()


Another way to visualize: PCA

Logistic regression has a linear decision boundary, that is, the parameter vector $\theta$ defines a hyperplane in $\mathbb{R}^n$, and points are classified depending on which side of the hyperplane they lie on.

In our case we cannot visualize all $n$ dimensions. However, we can use Principal Component Analysis (PCA) to project the dataset down to two or three dimensions for plotting. Ideally we see that the two classes can be separated by a hyperplane, or at least cluster in some way.


In [27]:
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

In [28]:
pca = PCA(n_components=3)
pca.fit(X_train)
pca_X_train = pca.transform(X_train)
pca_X_test = pca.transform(X_test)
print('The first 3 principal components explain %.2f of the variance in the dataset.' % sum(pca.explained_variance_ratio_))


The first 3 principal components explain 0.22 of the variance in the dataset.

In [29]:
fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=25, azim=70)
# class 0 (TOPIC0) in red, class 1 (TOPIC1) in green; training points as dots, test points as crosses
for c, i, label in zip('rg', [0, 1], [TOPIC0, TOPIC1]):
    ax.scatter(pca_X_train[y_train == i, 0],
               pca_X_train[y_train == i, 1],
               pca_X_train[y_train == i, 2],
               c=c, label=label)
for c, i, label in zip('rg', [0, 1], [TOPIC0 + ' (test)', TOPIC1 + ' (test)']):
    ax.scatter(pca_X_test[y_test == i, 0],
               pca_X_test[y_test == i, 1],
               pca_X_test[y_test == i, 2],
               c=c, label=label, marker='x')
plt.legend()
plt.show()


Finally, what is up with the question for which the classifier had a hard time choosing the topic?


In [30]:
display_MER_question(questions_test[5])


Probability_density_function

STATEMENT

On both figures from part (a), indicate the median of this distribution.

HINT

No content found.

SOLUTION

No content found.

Ah, that makes sense. This question has almost no content, making it harder for the classifier to pick up on the two key words, median and distribution.
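As mentioned at the start, this approach extends to more than two topics: scikit-learn's LogisticRegression accepts integer class labels directly and handles the multiclass case for us (one-vs-rest by default). Here is a minimal sketch of how that could look, where ALL_TOPICS and multi_label_from_question are hypothetical helpers; on our two-topic dataset it reduces to the same binary problem as above.

In [ ]:
# map each topic name to an integer label and let scikit-learn handle the rest
ALL_TOPICS = sorted(set(t for q in questions for t in q['topics']))

def multi_label_from_question(q):
    return ALL_TOPICS.index(q['topics'][0])   # use the first listed topic as the label

X_multi = np.array([question_to_vector(q, vocabulary_sorted) for q in questions_train])
y_multi = np.array([multi_label_from_question(q) for q in questions_train])
clf_multi = LogisticRegression().fit(X_multi, y_multi)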

