In [192]:
import pandas as pd
from textblob import TextBlob
from textblob import Word
import language_check

In [14]:
data = pd.read_csv("training_set_rel3.tsv", encoding='iso-8859-1', delimiter='\t')  # utf-8 raises a decode error on this file, so fall back to iso-8859-1

Data Exploration


In [15]:
data.head()


Out[15]:
essay_id essay_set essay rater1_domain1 rater2_domain1 rater3_domain1 domain1_score rater1_domain2 rater2_domain2 domain2_score ... rater2_trait3 rater2_trait4 rater2_trait5 rater2_trait6 rater3_trait1 rater3_trait2 rater3_trait3 rater3_trait4 rater3_trait5 rater3_trait6
0 1 1 Dear local newspaper, I think effects computer... 4 4 NaN 8 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 1 Dear @CAPS1 @CAPS2, I believe that using compu... 5 4 NaN 9 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 1 Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl... 4 3 NaN 7 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 1 Dear Local Newspaper, @CAPS1 I have found that... 5 5 NaN 10 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 Dear @LOCATION1, I know having computers has a... 4 4 NaN 8 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 28 columns

A look at the content of a single essay:


In [147]:
data["essay"][0]


Out[147]:
"Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of troble! Thing about! Dont you think so? How would you feel if your teenager is always on the phone with friends! Do you ever time to chat with your friends or buisness partner about things. Well now - there's a new way to chat the computer, theirs plenty of sites on the internet to do so: @ORGANIZATION1, @ORGANIZATION2, @CAPS1, facebook, myspace ect. Just think now while your setting up meeting with your boss on the computer, your teenager is having fun on the phone not rushing to get off cause you want to use it. How did you learn about other countrys/states outside of yours? Well I have by computer/internet, it's a new way to learn about what going on in our time! You might think your child spends a lot of time on the computer, but ask them so question about the economy, sea floor spreading or even about the @DATE1's you'll be surprise at how much he/she knows. Believe it or not the computer is much interesting then in class all day reading out of books. If your child is home on your computer or at a local library, it's better than being out with friends being fresh, or being perpressured to doing something they know isnt right. You might not know where your child is, @CAPS2 forbidde in a hospital bed because of a drive-by. Rather than your child on the computer learning, chatting or just playing games, safe and sound in your home or community place. Now I hope you have reached a point to understand and agree with me, because computers can have great effects on you or child because it gives us time to chat with friends/new people, helps us learn about the globe and believe or not keeps us out of troble. Thank you for listening."

It can be observed that there are a few anonymized placeholders such as "@CAPS2"; the rest of the text is fairly clean.

Meta Data

I analyze each essay with the TextBlob library (along with pyenchant and language_check) to understand which features are useful for the model:

 1) Grammar Mistakes
 2) Essay length
 3) Average Word length
 4) Average Sentence Length
 5) Essay Sentiment
 6) Spelling Mistakes
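
The loop below calls a character_filter helper that is not defined anywhere in the notebook. A minimal sketch of what it might do (an assumption: keep only purely alphabetic tokens, which also drops anonymized placeholders like "@CAPS1"):

In [ ]:
# hypothetical helper, not part of the original notebook
def character_filter(words):
    # keep only purely alphabetic tokens before spell-checking
    return [word for word in words if word.isalpha()]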

In [156]:
import enchant

# build the expensive objects once, outside the loop
dictionary = enchant.Dict('en_US')
tool = language_check.LanguageTool('en-US')

for i in range(data.shape[0]):

    message = TextBlob(data["essay"][i])

    # number of words (set_value is deprecated; .at is its replacement)
    data.at[i, 'Essay_Length'] = len(message.words)

    # spelling mistakes
    try:
        vocab_words = character_filter(message.words)
        checks = [dictionary.check(word) for word in vocab_words]
        data.at[i, 'Spelling_Mistakes'] = checks.count(False)
    except Exception:
        pass

    # number of grammar mistakes
    matches = tool.check(data["essay"][i])
    data.at[i, 'Grammar_Mistakes'] = len(matches)

    # average word length
    len_word = [len(word) for word in message.words]
    data.at[i, 'Average_Word_Length'] = sum(len_word) / len(len_word)

    # average sentence length (in words)
    length = [len(sentence.split(' ')) for sentence in message.sentences]
    data.at[i, 'Sentence_Length'] = sum(length) / len(length)

    # sentiment polarity of each essay
    data.at[i, 'Sentiment'] = message.sentiment.polarity

In [158]:
data["Essay_Length"].head()


Out[158]:
0    343.0
1    422.0
2    283.0
3    527.0
4    470.0
Name: Essay_Length, dtype: float64

Data Cleaning and Vectorizing

Parts of the Speech Tagging


In [165]:
# this is not required, just for demonstration purposes
import nltk
from nltk import word_tokenize

text = word_tokenize(data["essay"][0])
tags = nltk.pos_tag(text)

In this step, the essays are converted to a bag-of-words representation: remove stop words, build a vocabulary, and vectorize.

In [166]:
import re
import nltk
# a widely used stemmer
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non-letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem the tokens
    stems = stem_tokens(tokens, stemmer)
    return stems

Using a separate CountVectorizer for each essay set:


In [243]:
from sklearn.feature_extraction.text import CountVectorizer

# split the data by essay set (sets are numbered 1-8)
df = [data[data['essay_set'] == i + 1] for i in range(8)]
# turn each essay set into a matrix of token counts
vectorizers = [CountVectorizer(analyzer='word',
                               tokenizer=tokenize,
                               lowercase=True,
                               stop_words='english') for i in range(8)]
corpuses = [df[i]['essay'].values for i in range(8)]
word_mats = [vectorizers[i].fit_transform(corpuses[i]) for i in range(8)]

In [244]:
word_mats[0].shape


Out[244]:
(1783, 10929)

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus. This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.

Source: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
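
As a quick toy illustration of the difference between plain tf (use_idf=False) and tf-idf (a standalone example, not part of the grading pipeline; the three-document corpus is made up):

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ["the cat sat", "the cat sat on the mat", "dogs bark"]
counts = CountVectorizer().fit_transform(toy)

tf = TfidfTransformer(use_idf=False).fit_transform(counts)     # term frequencies only
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)   # downweights common words like "the"
print(tf.toarray().round(2))
print(tfidf.toarray().round(2))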


In [245]:
from sklearn.feature_extraction.text import TfidfTransformer
# note: use_idf=False keeps plain term frequencies (tf) only;
# set use_idf=True to apply the full tf-idf weighting described above
tfidf_transformer = TfidfTransformer(use_idf=False)
transformed_data = [tfidf_transformer.fit_transform(word_mats[i]) for i in range(8)]

In [246]:
transformed_data[1]


Out[246]:
<1800x9324 sparse matrix of type '<class 'numpy.float64'>'
	with 165093 stored elements in Compressed Sparse Row format>

Preparing the target variable (domain1_score) for each essay set before training the model


In [247]:
dfs = [data[data['essay_set'] == i + 1] for i in range(8)]
scores = [dfs[i]['domain1_score'] for i in range(8)]

Model Training and Evaluation

Because this problem deals with estimating a score for each essay, supervised regression is a more apt approach than multi-class classification. The float values produced by the regression can be rounded off.
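
For example, rounding and clipping to the valid score range could map the regression outputs back to integer scores (a toy sketch; the 2-12 bounds are an assumption based on essay set 1):

In [ ]:
import numpy as np

preds = np.array([7.4, 8.6, 2.1])      # hypothetical regression outputs
min_score, max_score = 2, 12           # assumed score range for essay set 1
rounded = np.clip(np.rint(preds), min_score, max_score).astype(int)
print(rounded)                         # [7 9 2]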

I first run baseline models to gauge initial performance. In addition to mean squared error (MSE) and R², I use Spearman's rank correlation coefficient, which measures the strength and direction of the monotonic association between the predicted and true scores: in other words, how well the ranking of the predictions corresponds to the ranking of the true scores. This is a useful measure for grading essays because we are interested in how well the model predicts the relative score of an essay (how one essay compares to another) rather than the exact score it was given; relative accuracy can matter more here than absolute accuracy.

Spearman's coefficient ranges from -1 to 1: the closer its absolute value is to 1, the stronger the monotonic association (positive values imply a positive monotonic association, negative values a negative one), and the closer it is to 0, the weaker the association. A common rule of thumb for interpreting the strength of a Spearman correlation is:

.00-.19  “very weak”
.20-.39  “weak”
.40-.59  “moderate”
.60-.79  “strong”
.80-1.0  “very strong”

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
https://github.com/kevinloughlin/Automated-Essay-Grading/tree/master/Readings
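
As a toy check of why rank correlation is attractive here: Spearman's rho only cares about the ordering of the scores, not their magnitudes (a standalone example with made-up numbers):

In [ ]:
from scipy.stats import spearmanr

y_true = [2, 4, 6, 8, 10]
y_pred = [1.1, 1.9, 3.0, 3.2, 5.0]   # wrong magnitudes, but a perfect ranking
rho, p = spearmanr(y_true, y_pred)
print(rho, p)                        # rho == 1.0: perfect monotonic association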



In [252]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import spearmanr

# Same process will be repeated for each essay set
for i in range(8):
    #85:15 ratio - train test split
    X_train, X_test, y_train, y_test  = train_test_split(
        transformed_data[i], 
        scores[i],
        train_size=0.85, 
        random_state=1234)


    # Fit the Regressors using the training dataset
    Regressors =  {"GB": {"f": GradientBoostingRegressor()},
                   "RF": {"f": RandomForestRegressor()},
                   "LR": {"f": LinearRegression()}}

    for model in Regressors.keys():
        # Fit
        Regressors[model]["f"].fit(X_train, y_train)
        # Predict
        Regressors[model]["c"] = Regressors[model]["f"].predict(X_test.toarray())
    
    #Evaluate
    measures = {"mse": mean_squared_error, "r2": r2_score,"spear":spearmanr}

    results = pd.DataFrame(columns=measures.keys())

    # Evaluate each model in Regressors
    for model in Regressors.keys():
        results.loc[model] = [measures[measure](y_test, Regressors[model]["c"]) for measure in measures.keys()]
    
    print ("Results for essay_id {}".format(i))
    print (results)


Results for essay_id 0
         mse        r2                                spear
GB  1.143185  0.556585   (0.754679511523, 1.3274507625e-50)
LR  3.194124 -0.238928  (0.491222659269, 1.10205105729e-17)
RF  1.442985  0.440299  (0.660707115005, 5.40047594696e-35)
Results for essay_id 1
         mse        r2                                spear
GB  0.336200  0.460642  (0.684770458602, 1.05337804183e-38)
LR  0.895703 -0.436957  (0.343596936548, 6.74589589157e-09)
RF  0.382704  0.386037  (0.604202326807, 2.94896482983e-28)
Results for essay_id 2
         mse        r2                                spear
GB  0.373325  0.422866  (0.659416612237, 1.07207734296e-33)
LR  1.629326 -1.518825   (0.186755896542, 0.00254779015039)
RF  0.391390  0.394938  (0.640274399504, 2.79829117192e-31)
Results for essay_id 3
         mse        r2                                spear
GB  0.353174  0.605532    (0.79404853459, 5.1642349841e-59)
LR  4.243786 -3.739985  (0.365834302503, 7.59886678367e-10)
RF  0.452049  0.495096  (0.719441382371, 1.10274000447e-43)
Results for essay_id 4
         mse        r2                                spear
GB  0.279543  0.697359  (0.852363154242, 1.05025413847e-77)
LR  5.278449 -4.714585  (0.251402721353, 2.82763850258e-05)
RF  0.421181  0.544019  (0.759843705791, 3.14395601025e-52)
Results for essay_id 5
         mse        r2                                spear
GB  0.264065  0.720689  (0.820031668816, 6.60254701027e-67)
LR  2.684850 -1.839854  (0.246051224049, 4.36446013485e-05)
RF  0.334889  0.645777  (0.765960015601, 2.41234667207e-53)
Results for essay_id 6
          mse        r2                                spear
GB  10.093559  0.514236  (0.735318514988, 2.06474480976e-41)
LR  24.233834 -0.166282  (0.508791760291, 6.06621706965e-17)
RF  13.514068  0.349619   (0.597935873114, 2.8573503462e-24)
Results for essay_id 7
          mse        r2                               spear
GB  13.617520  0.425507   (0.6211606993, 5.75357994218e-13)
LR  43.272084 -0.825552   (0.199290964105, 0.0377499081958)
RF  18.625138  0.214247  (0.46352587776, 3.85850257837e-07)

Initial Interpretation:

Spearman's values: each tuple contains the Spearman correlation coefficient followed by its p-value.


As the results show, the GradientBoostingRegressor (which has the strongest Spearman rank correlations) outperforms LinearRegression and RandomForestRegressor on every essay set. A grid search over hyperparameter combinations could be used to find better parameters, and stacking or ensembling the models may improve the results further; a deep neural network could also be tried to check whether performance improves. A sketch of the grid-search step follows below.

Further research is required to find better features for essay_id 6 and 7, where the MSE is considerably higher.
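
A minimal sketch of that grid-search step (the parameter grid below is an assumption, and X_train/y_train would come from a single essay set's split):

In [ ]:
from sklearn.model_selection import GridSearchCV

# hypothetical parameter grid for GradientBoostingRegressor
param_grid = {"n_estimators": [100, 300],
              "max_depth": [2, 3, 4],
              "learning_rate": [0.05, 0.1]}

grid = GridSearchCV(GradientBoostingRegressor(), param_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)   # X_train/y_train from one essay set's split
print(grid.best_params_, grid.best_score_)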
