The tasks are to build systems that classify the level of writing samples by English language learners, using a data set gathered from users. Each level comprises multiple units. Learners progress in linear order from one level to the next, although within a level they may jump around from one unit to another at will. Typically, one of the last units in a level is the "written task", in which learners write freely on a given topic.
In this step, I converted the raw XML (https://www.dropbox.com/s/rizbq4co1hlfpft/EFWritingData.xml?dl=0) into a pandas DataFrame. Each row in the DataFrame is a flat representation of one <writing> element, its sub-elements, and its attributes. I excluded the article date in the conversion since I wasn't sure about using it in the classification step.
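The conversion was along these lines (a minimal sketch; only the <writing> element is described above, so the exact sub-element and attribute names are assumptions):

import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for writing in ET.parse('EFWritingData.xml').getroot().iter('writing'):
    row = dict(writing.attrib)                     # attributes of <writing> (e.g. level, unit)
    for child in writing:                          # flatten sub-elements (e.g. topic, grade, text)
        row.update({'{}_{}'.format(child.tag, k): v for k, v in child.attrib.items()})
        if child.text and child.text.strip():
            row[child.tag] = child.text.strip()
    rows.append(row)                               # the article date was dropped in the real conversion

input_df = pd.DataFrame(rows)
input_df.to_pickle('input.pkl')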
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
import numpy as np
import seaborn as sns
In [2]:
raw_input = pd.read_pickle('input.pkl')
raw_input.head()
Out[2]:
I first investigated the relationships between the metadata fields and the level attribute. The metadata fields, excluding date, are: topic_id, topic_text, unit, and grade.
The figure below shows the boxplot of topic_id for each distinct value of level.
In [3]:
_, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.boxplot(x='level', y='topic_id', data=raw_input, ax=ax)
Each level belongs to a mutually exclusive set of topic IDs, so we could deterministically determine the level from topic_id without any machine learning. However, since this homework is about text classification, I chose to build a text classifier without using topic_id or topic_text in the model.
The figure below shows the boxplots per level for the other two metadata features, unit and grade. Visually, they didn't seem to be good predictors of level.
In [4]:
_, ax = plt.subplots(1, 2, figsize=(16, 4))
_ = sns.boxplot(x='level', y='grade', data=raw_input, ax=ax[0])
_ = sns.boxplot(x='level', y='unit', data=raw_input, ax=ax[1])
I used the experience from the metadata analysis to build my working hypothesis for the text-based classifier: each level is associated with its own set of topics, so the text of a writing sample should be indicative of its topic and therefore of its level.

As instructed in the homework, I set aside 20% of the total data (with the same proportion of levels) as a test set. I used the other 80% for cross-validation and training of the model(s) under consideration. I also constructed a smaller training set by taking a random sample of 1,000 records per level from the training set. This allowed me to run different experiments quickly (the full training set has ~900K records).
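A sketch of the split and per-level sampling, assuming the DataFrame has a 'level' column (the text column name is not shown above):

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    raw_input, test_size=0.2, stratify=raw_input['level'], random_state=4321)

# smaller training set: 1000 random records per level for quick experiments
small_train = train_df.groupby('level', group_keys=False).apply(
    lambda g: g.sample(n=1000, random_state=4321))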
I converted the smaller training set of 16,000 records into term frequency (tf) / inverse document frequency (idf) based features and then used a maximum entropy classifier (logistic regression) that maximizes the log-likelihood of the class labels given these features on the training data.
I then extracted out-of-sample predictions using 10-fold cross-validation and generated the precision, recall, f1-score, and confusion matrix for the true and predicted labels.
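The evaluation was roughly as follows (a sketch, using a pipeline like the one defined further below and assuming X_text / y hold the texts and levels of the 16,000-record sample):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

y_pred = cross_val_predict(pipeline, X_text, y, cv=10)   # out-of-fold predictions
print(classification_report(y, y_pred))
print(confusion_matrix(y, y_pred))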
The main decision points with tf-idf features are:
- whether to remove stop words,
- the n-gram range,
- the thresholds for dropping very frequent (max_df) and very infrequent (min_df) terms,
- the maximum number of features to keep, and
- whether to apply sublinear tf scaling.
I ran the classification pipeline on the sample dataset a few times with and without stop words, and with different values for the frequent/infrequent word thresholds and the maximum number of features to keep. The main driver of the f1-score seemed to be the maximum number of features, which also increased the running time. For example, the parameters below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# en_stopwords is a list of English stop words (e.g. from NLTK or sklearn)
counter = TfidfVectorizer(
    ngram_range=(1, 2),        # unigrams and bigrams
    stop_words=en_stopwords,
    max_df=0.4,                # drop terms appearing in more than 40% of documents
    min_df=25,                 # drop terms appearing in fewer than 25 documents
    max_features=3000,
    sublinear_tf=True          # use 1 + log(tf) instead of raw term frequency
)
scaler = StandardScaler(with_mean=False)   # with_mean=False keeps the matrix sparse
model = LogisticRegression(penalty='l2', max_iter=200, random_state=4321)
pipeline = make_pipeline(counter, scaler, model)
produces the classification result from cross-validation:
precision recall f1-score support
1 0.90 0.89 0.90 1007
2 0.89 0.88 0.88 1019
3 0.85 0.88 0.87 972
4 0.86 0.85 0.85 1010
5 0.86 0.87 0.87 993
6 0.88 0.91 0.90 966
7 0.81 0.81 0.81 1000
8 0.80 0.78 0.79 1025
9 0.82 0.82 0.82 999
10 0.85 0.87 0.86 978
11 0.77 0.76 0.76 1023
12 0.86 0.87 0.87 996
13 0.91 0.90 0.90 1013
14 0.86 0.86 0.86 1002
15 0.81 0.80 0.80 1010
16 0.85 0.86 0.86 987
avg / total 0.85 0.85 0.85 16000
In order to improve the model, I then looked at the misclassified records in my sample dataset. The biggest confusion occurred between levels 7 and 8. The distribution of topics for the 7 <---> 8 misclassifications is:
topic_id topic_text level level_predicted number of misclassifications
50 Planning for the future 7 8 28
59 Making a 'to do' list of your dreams 8 7 21
62 Responding to written invitations 8 7 6
60 Describing a business trip 8 7 5
63 Congratulating a friend on an award 8 7 4
52 Writing about a memorable experience 7 8 4
Next I extracted the word count matrices for a subset of these articles, compared them, and concluded that binary features indicating the presence or absence of rare words could be a better indicator of level. This approach improved the composite f1-score from 0.85 to 0.87.
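A sketch of the binary-feature variant, assuming the other vectorizer parameters carry over unchanged; the essential switch is binary=True (presence/absence instead of tf-idf weights):

from sklearn.feature_extraction.text import CountVectorizer

binary_counter = CountVectorizer(
    ngram_range=(1, 2),
    stop_words=en_stopwords,
    max_df=0.4,            # drop very common terms, keeping the rarer ones
    min_df=25,
    max_features=3000,
    binary=True            # 1 if the term occurs in the document, 0 otherwise
)
binary_pipeline = make_pipeline(binary_counter, scaler, model)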
Next I executed 10-fold cross-validation of the binary feature + maximum entropy classifier model on the full training set. This produced a composite f1-score of 0.95, but the precision on levels 15 and 16 (the levels with the fewest samples) was only around 65%, most likely due to label imbalance. Changing the maximum entropy classifier to account for label imbalance improved the precision on levels 15 and 16 to 75% and 81%, respectively.
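One way to make the classifier account for label imbalance in scikit-learn (not necessarily the exact change used here, but the standard option) is to re-weight classes inversely to their frequency:

model = LogisticRegression(penalty='l2', max_iter=200, random_state=4321,
                           class_weight='balanced')   # weight each class by its inverse frequency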
Finally, I re-trained the model using the full training set and evaluated against the test set.
Final Classification Result (on Test Set):
precision recall f1-score support
1 0.97 0.98 0.98 69973
2 0.96 0.96 0.96 32966
3 0.94 0.93 0.94 22245
4 0.95 0.97 0.96 33412
5 0.95 0.94 0.94 17268
6 0.95 0.93 0.94 10754
7 0.94 0.96 0.95 19290
8 0.89 0.87 0.88 8550
9 0.91 0.90 0.91 5800
10 0.95 0.94 0.95 7324
11 0.88 0.85 0.86 3232
12 0.89 0.87 0.88 1896
13 0.89 0.89 0.89 1769
14 0.83 0.80 0.81 750
15 0.76 0.77 0.77 442
16 0.80 0.80 0.80 391
avg / total 0.95 0.95 0.95 236062
The group classification task is essentially the same problem as the level classification but with the 16 levels collapsed into six groups (A1 through C2), so my intuition was that the same feature/classifier combination would work well and produce slightly better performance. This turned out to be the case.
Final Classification Result (on Test Set):
precision recall f1-score support
A1 0.98 0.99 0.98 124702
A2 0.95 0.95 0.95 61427
B1 0.94 0.94 0.94 33760
B2 0.92 0.90 0.91 12650
C1 0.88 0.82 0.85 3135
C2 0.78 0.78 0.78 388
avg / total 0.96 0.96 0.96 236062
The third task in the assignment is to build a text clustering system on the same set of text documents. To solve this problem, I chose the following approach:
Tf-Idf based Feature Extraction: This is very similar to the feature extraction method using TfidfVectorizer described above.
Dimension Reduction of the sparse feature matrix: I experimented with two methods of dimension reduction (a sketch of both follows their descriptions below):
Singular Value Decomposition (SVD): This transforms the (sparse) n_samples * vocab_size tf-idf feature matrix into an n_samples * k matrix, where k < vocab_size and the k dimensions correspond to the top k singular vectors (orthonormal bases) of the feature matrix. I rejected this approach since it was much slower than NMF, and even after increasing the number of components to 5000, the percentage of variance explained remained below 60%.
Non-negative Matrix Factorization (NMF): This transforms the tf-idf feature matrix into an n_samples * k matrix, where each of the k axes (k << vocab_size) represents a topic. Given that we already know the documents belong to specific sets of topics, this seemed a natural fit for this problem.
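A sketch of both experiments, assuming X_tfidf is the sparse tf-idf matrix produced by the vectorizer; nmf.reconstruction_err_ is the quantity plotted in the next cell:

from sklearn.decomposition import TruncatedSVD, NMF

# SVD: keep the top k singular directions and check the variance retained
svd = TruncatedSVD(n_components=5000, random_state=4321)
X_svd = svd.fit_transform(X_tfidf)
print('variance explained:', svd.explained_variance_ratio_.sum())   # stayed below 0.6

# NMF: factorize into document-topic (W) and topic-term (H) matrices
nmf = NMF(n_components=20, random_state=4321)
W = nmf.fit_transform(X_tfidf)           # n_samples x k document-topic weights
H = nmf.components_                      # k x vocab_size topic-term weights
print('reconstruction error:', nmf.reconstruction_err_)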
In [5]:
nmf_error = pd.read_csv('nmf_rec_errors.csv', index_col=0, squeeze=True)
ax = nmf_error.plot(kind='bar', title='NMF reconstruction errors vs. Component Size', rot=0)
_ = ax.set(xlabel='NMF Component Size', ylabel='Reconstruction Error')
Based on this, I chose the number of components as 20.
In [6]:
sse = pd.read_csv('kmeans_sse.csv', index_col=0, squeeze=True)
fig, ax = plt.subplots(1, 1, figsize=(10, 4))
ax = sse.plot(kind='bar', title='Cluster SSE vs Cluster Size (16K features)', ax=ax, rot=0)
_ = ax.set(xlabel='number of clusters', ylabel='SSE')
This is commonly known as the "Elbow Plot" and the optimal cluster size is the point where the SSE curve starts to flatten out. In this case, I chose the optimal cluster size as 20.
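The SSE values plotted above were produced roughly as follows (a sketch; the candidate range of cluster sizes is an assumption, and W is the NMF-reduced feature matrix from the sketch above):

from sklearn.cluster import KMeans

sse = {}
for k in range(5, 45, 5):                            # candidate cluster sizes (assumed range)
    km = KMeans(n_clusters=k, random_state=4321).fit(W)
    sse[k] = km.inertia_                             # within-cluster sum of squared distances
pd.Series(sse).to_csv('kmeans_sse.csv')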
Next I executed the feature construction, dimension reduction, and clustering steps on the full dataset. The figures below show the value counts per cluster and the value counts per level.
In [7]:
cluster_counts = pd.read_csv('label_counts_all.csv', index_col=0, squeeze=True)
level_counts = raw_input.level.value_counts()
_, ax = plt.subplots(1, 2, figsize=(14, 6))
ax_0 = cluster_counts.plot(ax=ax[0], kind='bar', title='Value Counts per Cluster', rot=0)
_ = ax_0.set(xlabel='Cluster ID', ylabel='Samples')
ax_1 = level_counts.plot(ax=ax[1], kind='bar', title='Value Counts per Level', rot=0)
_ = ax_1.set(xlabel='Level', ylabel='Samples')
Visually, the cluster assignments produced by this approach don't look similar to the levels. To get a more quantifiable measure of agreement between the two, I calculated the Adjusted Rand Index (which is close to 0 for random assignments and 1.0 for identical partitions). The Adjusted Rand Index between this cluster assignment and the level attribute is 0.089, indicating that they are quite dissimilar.
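The index was computed with scikit-learn (cluster_labels being the KMeans assignment for the full dataset, a name assumed here):

from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(raw_input['level'], cluster_labels)   # ~0.089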
The figure below shows a cross-tabulation of cluster IDs vs. the level attribute:
In [8]:
ct_full = pd.read_csv('cluster_assignment_vs_labels_all.csv', index_col=0)
fig, ax = plt.subplots(1, 1, figsize=(16, 8))
ax = sns.heatmap(ct_full, annot=True, fmt='d', ax=ax)
_ = ax.set(title='Cluster assignments vs Level (Full Dataset)')
_ = ax.set(xlabel='Cluster ID', ylabel='Level')
The crosstab plot shows that the lower levels (1-4) are spread over most of the clusters. One possible interpretation is that the NMF is capturing common themes across the mutually exclusive sets of topics assigned to different levels. Another possibility, given that the NMF reconstruction error does not change much for component sizes between 10 and 30, is that the transformed feature matrix isn't a good approximation of the full term-document matrix; using a much larger component size could produce a better approximation.
Several improvements are possible over the current approach for the level and group classifiers.
Dimension Reduction: Use NMF or SVD to reduce the dimensions of the tf-idf feature matrix before applying maximum entropy classification.
Improve Classification of Higher Levels: On the test set, the last two levels (15 and 16) have the worst f1-scores in the current model. Based on manual inspection, a reasonable assumption to improve the classification of these levels could be:
Writing samples at higher levels have greater document length, higher average/minimum/maximum sentence lengths, and more complex sentence structure (more frequent use of conjunctions).
I briefly experimented with this idea by extracting these features from the small training set and training a maximum entropy classifier on them. On their own, these features perform much worse (approx. 0.35 composite f1-score) than the current model. However, it may be possible to improve the current model either by combining these features with the tf-idf features, or by combining two classifiers trained separately on the two feature sets.
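A rough sketch of such features (the sentence splitting and the conjunction list here are simplifications, not necessarily what was used):

import re

CONJUNCTIONS = {'and', 'but', 'or', 'because', 'although', 'while', 'however'}

def complexity_features(text):
    # crude sentence split on terminal punctuation
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    sent_lens = [len(s.split()) for s in sentences] or [0]
    words = text.lower().split()
    return {
        'doc_len': len(words),
        'avg_sent_len': sum(sent_lens) / len(sent_lens),
        'min_sent_len': min(sent_lens),
        'max_sent_len': max(sent_lens),
        'conjunction_ratio': sum(w in CONJUNCTIONS for w in words) / max(len(words), 1),
    }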
I didn't use this approach mainly because of time and computation resource constraints. Also, some of the writing samples contain a significant number of misspellings (especially at the lower levels) which would be unknown words for the pre-trained word embedding models. One way to handle the unknown words is to construct a Lexicographical Search Tree (Trie) on the vocabulary of the pre-trained model and then return the vector corresponding to the closest word in the tree for each unknown word.
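A minimal sketch of that trie idea, interpreting "closest word" as the vocabulary word sharing the longest prefix with the unknown word (one of several possible notions of closeness):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None            # set to the full word when a vocabulary word ends here

def build_trie(vocab):
    root = TrieNode()
    for word in vocab:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root

def closest_word(root, unknown):
    # follow the longest matching prefix, remembering any vocabulary word seen on the way
    node, best = root, None
    for ch in unknown:
        if node.word is not None:
            best = node.word
        if ch not in node.children:
            break
        node = node.children[ch]
    else:
        if node.word is not None:
            best = node.word
    # no vocabulary word was a prefix: descend to any completion of the matched prefix
    while best is None and node.children:
        node = next(iter(node.children.values()))
        best = node.word
    return best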
One possible enhancement is to extend the tf-idf feature set with the document-length / sentence-length and part-of-speech-tag based features discussed above. Another approach is to represent documents by their average word-vector embeddings.
An interesting alternative to the KMeans clustering algorithm is Affinity Propagation, a "message passing in cluster graph" approach that is also commonly used for text summarization. I briefly tried this approach on the tf-idf feature matrix for the small training set, but even for 16,000 records it was very slow to converge.
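The experiment was along these lines (a sketch; X_tfidf_small stands for the tf-idf matrix of the 16,000-record sample, and feeding a precomputed cosine-similarity matrix is one common way to apply affinity propagation to text features):

from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(X_tfidf_small)        # n x n, quadratic in the number of documents
ap = AffinityPropagation(affinity='precomputed', damping=0.9)
cluster_ids = ap.fit_predict(similarity)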