The tasks are to build systems that classify the level of writing samples by English language learners, using a data set gathered from users. Each level comprises multiple units. Learners progress in linear order from one level to the next, although within a level they may jump around from one unit to another at will. Typically, one of the last units in a level is the "written task", in which learners write freely on a given topic.
In this step, I converted the raw XML (https://www.dropbox.com/s/rizbq4co1hlfpft/EFWritingData.xml?dl=0) into a pandas DataFrame. Each row in the DataFrame is a flat representation of one <writing> element, its sub-elements, and its attributes. I excluded the article date in the conversion since I wasn't sure about using it in the classification step.
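The conversion was along these lines (a minimal sketch; only the <writing> element is described above, so the exact sub-element and attribute names are assumptions):

import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for writing in ET.parse('EFWritingData.xml').getroot().iter('writing'):
    row = dict(writing.attrib)                     # attributes of <writing> (e.g. level, unit)
    for child in writing:                          # flatten sub-elements (e.g. topic, grade, text)
        row.update({'{}_{}'.format(child.tag, k): v for k, v in child.attrib.items()})
        if child.text and child.text.strip():
            row[child.tag] = child.text.strip()
    rows.append(row)                               # the article date was dropped in the real conversion

input_df = pd.DataFrame(rows)
input_df.to_pickle('input.pkl')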
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
import numpy as np
import seaborn as sns
In [2]:
raw_input = pd.read_pickle('input.pkl')
raw_input.head()
Out[2]:
I first investigated the relationships between the metadata fields and the level attribute. The metadata fields, excluding date, are: topic_id, topic_text, unit, and grade.
The figure below shows the boxplot of topic_id for each distinct value of level.
In [3]:
_, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.boxplot(x='level', y='topic_id', data=raw_input, ax=ax)
Each level belongs to a mutually exclusive set of topic IDs, so we could deterministically determine the level from topic_id without any machine learning. However, since this homework is about text classification, I chose to build a text classifier without using topic_id or topic_text in the model.
The figure below shows the boxplots per level for the other two metadata features, unit and grade. Visually, they didn't seem to be good predictors of level.
In [4]:
_, ax = plt.subplots(1, 2, figsize=(16, 4))
_ = sns.boxplot(x='level', y='grade', data=raw_input, ax=ax[0])
_ = sns.boxplot(x='level', y='unit', data=raw_input, ax=ax[1])
I used the experience from the metadata analysis to build my working hypothesis for the text-based classifier: each level is associated with its own set of topics, so the text of a writing sample should be indicative of its topic and therefore of its level.

As instructed in the homework, I set aside 20% of the total data (with the same proportion of levels) as a test set. I used the other 80% for cross-validation and training of the model(s) under consideration. I also constructed a smaller training set by taking a random sample of 1,000 records per level from the training set. This allowed me to run different experiments quickly (the full training set has ~900K records).
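A sketch of the split and per-level sampling, assuming the DataFrame has a 'level' column (the text column name is not shown above):

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    raw_input, test_size=0.2, stratify=raw_input['level'], random_state=4321)

# smaller training set: 1000 random records per level for quick experiments
small_train = train_df.groupby('level', group_keys=False).apply(
    lambda g: g.sample(n=1000, random_state=4321))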
I converted the smaller training set of 16,000 records into term frequency (tf) / inverse document frequency (idf) based features and then used a maximum entropy classifier (logistic regression) that maximizes the log-likelihood of the class labels given these features on the training data.
I then extracted out-of-sample predictions using 10-fold cross-validation and generated the precision, recall, f1-score, and confusion matrix for the true and predicted labels.
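The evaluation was roughly as follows (a sketch, using a pipeline like the one defined further below and assuming X_text / y hold the texts and levels of the 16,000-record sample):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

y_pred = cross_val_predict(pipeline, X_text, y, cv=10)   # out-of-fold predictions
print(classification_report(y, y_pred))
print(confusion_matrix(y, y_pred))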
The main decision points with tf-idf features are:
- whether to remove stop words,
- the n-gram range,
- the thresholds for dropping very frequent (max_df) and very infrequent (min_df) terms,
- the maximum number of features to keep, and
- whether to apply sublinear tf scaling.
I ran the classification pipeline on the sample dataset a few times with and without stop words, and with different values for the frequent/infrequent word thresholds and the maximum number of features to keep. The main driver of the f1-score seemed to be the maximum number of features, which also increased the running time. For example, the parameters below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# en_stopwords is a list of English stop words (e.g. from NLTK or sklearn)
counter = TfidfVectorizer(
    ngram_range=(1, 2),        # unigrams and bigrams
    stop_words=en_stopwords,
    max_df=0.4,                # drop terms appearing in more than 40% of documents
    min_df=25,                 # drop terms appearing in fewer than 25 documents
    max_features=3000,
    sublinear_tf=True          # use 1 + log(tf) instead of raw term frequency
)
scaler = StandardScaler(with_mean=False)   # with_mean=False keeps the matrix sparse
model = LogisticRegression(penalty='l2', max_iter=200, random_state=4321)
pipeline = make_pipeline(counter, scaler, model)
produces the classification result from cross-validation:
precision recall f1-score support
1 0.90 0.89 0.90 1007
2 0.89 0.88 0.88 1019
3 0.85 0.88 0.87 972
4 0.86 0.85 0.85 1010
5 0.86 0.87 0.87 993
6 0.88 0.91 0.90 966
7 0.81 0.81 0.81 1000
8 0.80 0.78 0.79 1025
9 0.82 0.82 0.82 999
10 0.85 0.87 0.86 978
11 0.77 0.76 0.76 1023
12 0.86 0.87 0.87 996
13 0.91 0.90 0.90 1013
14 0.86 0.86 0.86 1002
15 0.81 0.80 0.80 1010
16 0.85 0.86 0.86 987
avg / total 0.85 0.85 0.85 16000
In order to improve the model, I then looked at the misclassified records in my sample dataset. The biggest confusion occurred between levels 7 and 8. The distribution of topics for the 7 <---> 8 misclassifications is:
topic_id topic_text level level_predicted number of misclassifications
50 Planning for the future 7 8 28
59 Making a 'to do' list of your dreams 8 7 21
62 Responding to written invitations 8 7 6
60 Describing a business trip 8 7 5
63 Congratulating a friend on an award 8 7 4
52 Writing about a memorable experience 7 8 4
Next I extracted the word count matrices for a subset of these articles, compared them, and concluded that binary features indicating the presence or absence of rare words could be a better indicator of level. This approach improved the composite f1-score from 0.85 to 0.87.
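A sketch of the binary-feature variant, assuming the other vectorizer parameters carry over unchanged; the essential switch is binary=True (presence/absence instead of tf-idf weights):

from sklearn.feature_extraction.text import CountVectorizer

binary_counter = CountVectorizer(
    ngram_range=(1, 2),
    stop_words=en_stopwords,
    max_df=0.4,            # drop very common terms, keeping the rarer ones
    min_df=25,
    max_features=3000,
    binary=True            # 1 if the term occurs in the document, 0 otherwise
)
binary_pipeline = make_pipeline(binary_counter, scaler, model)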
Next I executed 10-fold cross-validation of the binary feature + maximum entropy classifier model on the full training set. This produced a composite f1-score of 0.95, but the precision on levels 15 and 16 (the levels with the fewest samples) was only around 65%, most likely due to label imbalance. Changing the maximum entropy classifier to account for label imbalance improved the precision on levels 15 and 16 to 75% and 81%, respectively.
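One way to make the classifier account for label imbalance in scikit-learn (not necessarily the exact change used here, but the standard option) is to re-weight classes inversely to their frequency:

model = LogisticRegression(penalty='l2', max_iter=200, random_state=4321,
                           class_weight='balanced')   # weight each class by its inverse frequency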
Finally, I re-trained the model using the full training set and evaluated against the test set.
Final Classification Result (on Test Set):
precision recall f1-score support
1 0.97 0.98 0.98 69973
2 0.96 0.96 0.96 32966
3 0.94 0.93 0.94 22245
4 0.95 0.97 0.96 33412
5 0.95 0.94 0.94 17268
6 0.95 0.93 0.94 10754
7 0.94 0.96 0.95 19290
8 0.89 0.87 0.88 8550
9 0.91 0.90 0.91 5800
10 0.95 0.94 0.95 7324
11 0.88 0.85 0.86 3232
12 0.89 0.87 0.88 1896
13 0.89 0.89 0.89 1769
14 0.83 0.80 0.81 750
15 0.76 0.77 0.77 442
16 0.80 0.80 0.80 391
avg / total 0.95 0.95 0.95 236062
The group classification task is essentially the same problem as the level classification but with the 16 levels collapsed into six groups (A1 through C2), so my intuition was that the same feature/classifier combination would work well and produce slightly better performance. This turned out to be the case.
Final Classification Result (on Test Set):
precision recall f1-score support
A1 0.98 0.99 0.98 124702
A2 0.95 0.95 0.95 61427
B1 0.94 0.94 0.94 33760
B2 0.92 0.90 0.91 12650
C1 0.88 0.82 0.85 3135
C2 0.78 0.78 0.78 388
avg / total 0.96 0.96 0.96 236062
The third task in the assignment is to build a text clustering system on the same set of text documents. To solve this problem, I chose the following approach:
Tf-Idf based Feature Extraction: This is very similar to the feature extraction method using TfidfVectorizer described above.
Dimension Reduction of the sparse feature matrix: I experimented with two methods of dimension reduction (a sketch of both follows their descriptions below):
Singular Value Decomposition (SVD): This transforms the (sparse) n_samples * vocab_size tf-idf feature matrix into an n_samples * k matrix, where k < vocab_size and the k dimensions correspond to the top k singular vectors (orthonormal bases) of the feature matrix. I rejected this approach since it was much slower than NMF, and even after increasing the number of components to 5000, the percentage of variance explained remained below 60%.
Non-negative Matrix Factorization (NMF): This transforms the tf-idf feature matrix into an n_samples * k matrix, where each of the k axes (k << vocab_size) represents a topic. Given that we already know the documents belong to specific sets of topics, this seemed a natural fit for this problem.
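A sketch of both experiments, assuming X_tfidf is the sparse tf-idf matrix produced by the vectorizer; nmf.reconstruction_err_ is the quantity plotted in the next cell:

from sklearn.decomposition import TruncatedSVD, NMF

# SVD: keep the top k singular directions and check the variance retained
svd = TruncatedSVD(n_components=5000, random_state=4321)
X_svd = svd.fit_transform(X_tfidf)
print('variance explained:', svd.explained_variance_ratio_.sum())   # stayed below 0.6

# NMF: factorize into document-topic (W) and topic-term (H) matrices
nmf = NMF(n_components=20, random_state=4321)
W = nmf.fit_transform(X_tfidf)           # n_samples x k document-topic weights
H = nmf.components_                      # k x vocab_size topic-term weights
print('reconstruction error:', nmf.reconstruction_err_)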
In [5]:
nmf_error = pd.read_csv('nmf_rec_errors.csv', index_col=0, squeeze=True)
ax = nmf_error.plot(kind='bar', title='NMF reconstruction errors vs. Component Size', rot=0)
_ = ax.set(xlabel='NMF Component Size', ylabel='Reconstruction Error')
Based on this, I chose the number of components as 20.
In [6]:
sse = pd.read_csv('kmeans_sse.csv', index_col=0, squeeze=True)
fig, ax = plt.subplots(1, 1, figsize=(10, 4))
ax = sse.plot(kind='bar', title='Cluster SSE vs Cluster Size (16K features)', ax=ax, rot=0)
_ = ax.set(xlabel='number of clusters', ylabel='SSE')
This is commonly known as the "Elbow Plot" and the optimal cluster size is the point where the SSE curve starts to flatten out. In this case, I chose the optimal cluster size as 20.
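The SSE values plotted above were produced roughly as follows (a sketch; the candidate range of cluster sizes is an assumption, and W is the NMF-reduced feature matrix from the sketch above):

from sklearn.cluster import KMeans

sse = {}
for k in range(5, 45, 5):                            # candidate cluster sizes (assumed range)
    km = KMeans(n_clusters=k, random_state=4321).fit(W)
    sse[k] = km.inertia_                             # within-cluster sum of squared distances
pd.Series(sse).to_csv('kmeans_sse.csv')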
Next I executed the feature construction, dimension reduction, and clustering steps on the full dataset. The figures below show the value counts per cluster and the value counts per level.
In [7]:
cluster_counts = pd.read_csv('label_counts_all.csv', index_col=0, squeeze=True)
level_counts = raw_input.level.value_counts()
_, ax = plt.subplots(1, 2, figsize=(14, 6))
ax_0 = cluster_counts.plot(ax=ax[0], kind='bar', title='Value Counts per Cluster', rot=0)
_ = ax_0.set(xlabel='Cluster ID', ylabel='Samples')
ax_1 = level_counts.plot(ax=ax[1], kind='bar', title='Value Counts per Level', rot=0)
_ = ax_1.set(xlabel='Level', ylabel='Samples')
Visually, the cluster assignments produced by this approach don't look similar to the levels. To get a more quantifiable measure of agreement between the two, I calculated the Adjusted Rand Index (which is close to 0 for random assignments and 1.0 for identical partitions). The Adjusted Rand Index between this cluster assignment and the level attribute is 0.089, indicating that they are quite dissimilar.
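The index was computed with scikit-learn (cluster_labels being the KMeans assignment for the full dataset, a name assumed here):

from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(raw_input['level'], cluster_labels)   # ~0.089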
The figure below shows a cross-tabulation of cluster IDs vs. the level attribute:
In [8]:
ct_full = pd.read_csv('cluster_assignment_vs_labels_all.csv', index_col=0)
fig, ax = plt.subplots(1, 1, figsize=(16, 8))
ax = sns.heatmap(ct_full, annot=True, fmt='d', ax=ax)
_ = ax.set(title='Cluster assignments vs Level (Full Dataset)')
_ = ax.set(xlabel='Cluster ID', ylabel='Level')
The crosstab plot shows that the lower levels (1-4) are spread over most of the clusters. One possible interpretation is that the NMF is capturing common themes across the mutually exclusive sets of topics assigned to different levels. Another possibility, given that the NMF reconstruction error does not change much for component sizes between 10 and 30, is that the transformed feature matrix isn't a good approximation of the full term-document matrix; using a much larger component size could produce a better approximation.
Several improvements are possible over the current approach for the level and group classifiers.
Dimension Reduction: Use NMF or SVD to reduce the dimensions of the tf-idf feature matrix before applying maximum entropy classification.
Improve Classification of Higher Levels: On the test set, the last two levels (15 and 16) have the worst f1-scores in the current model. Based on manual inspection, a reasonable assumption to improve the classification of these levels could be:
Writing samples at higher levels have greater document length, higher average/minimum/maximum sentence lengths, and more complex sentence structure (more frequent use of conjunctions).
I briefly experimented with this idea by extracting these features from the small training set and training a maximum entropy classifier on them. On their own, these features perform much worse (approx. 0.35 composite f1-score) than the current model. However, it may be possible to improve the current model either by combining these features with the tf-idf features, or by combining two classifiers trained separately on the two feature sets.
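A rough sketch of such features (the sentence splitting and the conjunction list here are simplifications, not necessarily what was used):

import re

CONJUNCTIONS = {'and', 'but', 'or', 'because', 'although', 'while', 'however'}

def complexity_features(text):
    # crude sentence split on terminal punctuation
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    sent_lens = [len(s.split()) for s in sentences] or [0]
    words = text.lower().split()
    return {
        'doc_len': len(words),
        'avg_sent_len': sum(sent_lens) / len(sent_lens),
        'min_sent_len': min(sent_lens),
        'max_sent_len': max(sent_lens),
        'conjunction_ratio': sum(w in CONJUNCTIONS for w in words) / max(len(words), 1),
    }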
I didn't use this approach mainly because of time and computation resource constraints. Also, some of the writing samples contain a significant number of misspellings (especially at the lower levels) which would be unknown words for the pre-trained word embedding models. One way to handle the unknown words is to construct a Lexicographical Search Tree (Trie) on the vocabulary of the pre-trained model and then return the vector corresponding to the closest word in the tree for each unknown word.
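A minimal sketch of that trie idea, interpreting "closest word" as the vocabulary word sharing the longest prefix with the unknown word (one of several possible notions of closeness):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None            # set to the full word when a vocabulary word ends here

def build_trie(vocab):
    root = TrieNode()
    for word in vocab:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root

def closest_word(root, unknown):
    # follow the longest matching prefix, remembering any vocabulary word seen on the way
    node, best = root, None
    for ch in unknown:
        if node.word is not None:
            best = node.word
        if ch not in node.children:
            break
        node = node.children[ch]
    else:
        if node.word is not None:
            best = node.word
    # no vocabulary word was a prefix: descend to any completion of the matched prefix
    while best is None and node.children:
        node = next(iter(node.children.values()))
        best = node.word
    return best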
One possible enhancement is to extend the tf-idf feature set with the document-length / sentence-length and part-of-speech-tag based features discussed above. Another approach is to represent documents by their average word-vector embeddings.
An interesting alternative to the KMeans clustering algorithm is Affinity Propagation, a "message passing in cluster graph" approach that is also commonly used for text summarization. I briefly tried this approach on the tf-idf feature matrix for the small training set, but even for 16,000 records it was very slow to converge.
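The experiment was along these lines (a sketch; X_tfidf_small stands for the tf-idf matrix of the 16,000-record sample, and feeding a precomputed cosine-similarity matrix is one common way to apply affinity propagation to text features):

from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(X_tfidf_small)        # n x n, quadratic in the number of documents
ap = AffinityPropagation(affinity='precomputed', damping=0.9)
cluster_ids = ap.fit_predict(similarity)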