In [192]:
import pandas as pd
from textblob import TextBlob
from textblob import Word
import language_check
In [14]:
data = pd.read_csv("training_set_rel3.tsv", encoding='iso-8859-1', delimiter='\t') ## error with utf-8 encoding
In [15]:
data.head()
Out[15]:
Let's have a look at what an essay's content looks like.
In [147]:
data["essay"][0]
Out[147]:
It can be observed that there are a few external references like "@CAPS2", and the rest of the text looks almost clean.
I am analyzing each essay to understand what features are required for the model. Using the TextBlob library, the following features are extracted:
1) Grammar Mistakes
2) Essay length
3) Average Word length
4) Average Sentence Length
5) Essay Sentiment
6) Spelling Mistakes
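The loop below calls a character_filter helper that is not defined in this section. A minimal sketch of what such a helper might do, assuming it simply keeps alphabetic tokens and drops the "@CAPS1"-style anonymization placeholders and numbers (this definition is my assumption, not part of the original notebook):
In [ ]:
import re

def character_filter(words):
    # Hypothetical helper (assumption): keep only purely alphabetic tokens,
    # dropping "@CAPS1"-style placeholders, numbers and punctuation
    return [w for w in words if re.fullmatch("[A-Za-z]+", w)]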
In [156]:
import enchant  # needed for the spelling check below

dictionary = enchant.Dict('en_US')
tool = language_check.LanguageTool('en-US')

for i in range(data.shape[0]):
    message = TextBlob(data["essay"][i])
    # number of words
    data.set_value(i, 'Essay_Length', len(message.words))
    # spelling mistakes
    try:
        vocab_words = character_filter(message.words)
        checks = [dictionary.check(word) for word in vocab_words]
        data.set_value(i, 'Spelling_Mistakes', checks.count(False))
    except Exception:
        pass
    # number of grammar mistakes
    matches = tool.check(data["essay"][i])
    data.set_value(i, 'Grammar_Mistakes', len(matches))
    # average word length
    len_word = [len(word) for word in message.words]
    data.set_value(i, 'Average_Word_Length', sum(len_word) / len(len_word))
    # average sentence length (in words)
    length = [len(sentence.split(' ')) for sentence in message.sentences]
    data.set_value(i, 'Sentence_Length', sum(length) / len(length))
    # sentiment of each essay
    data.set_value(i, 'Sentiment', message.sentiment.polarity)
In [158]:
data["Essay_Length"].head()
Out[158]:
In [165]:
# this is not required; just for demonstration purposes
import nltk
from nltk import word_tokenize
text = word_tokenize(data["essay"][0])
#tags = nltk.pos_tag(text)
In this step:
1) Convert the essays to a bag-of-words model (remove stop words, create the bag of words, and vectorize).
In [166]:
# a widely used stemmer
import re
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non-letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems
Using a separate CountVectorizer for each essay set
In [243]:
from sklearn.feature_extraction.text import CountVectorizer

df = [data[data['essay_set'] == i + 1] for i in range(8)]
# Turn each essay set into a matrix of token counts
vectorizers = [CountVectorizer(analyzer='word',
                               tokenizer=tokenize,
                               lowercase=True,
                               stop_words='english') for i in range(8)]
corpuses = [df[i]['essay'].values for i in range(8)]
word_mats = [vectorizers[i].fit_transform(corpuses[i]) for i in range(8)]
In [244]:
word_mats[0].shape
Out[244]:
Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.
To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.
Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus. This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
Source: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In [245]:
from sklearn.feature_extraction.text import TfidfTransformer
# use_idf=False keeps plain term frequencies (tf) without the idf downscaling
tfidf_transformer = TfidfTransformer(use_idf=False)
transformed_data = [tfidf_transformer.fit_transform(word_mats[i]) for i in range(8)]
In [246]:
transformed_data[1]
Out[246]:
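The transformer above is configured with use_idf=False, so it only produces the tf weighting. For comparison, a minimal sketch of the full tf-idf weighting described earlier (an alternative run, not part of the original notebook) would simply enable idf:
In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer

# Enable idf downscaling on top of the term frequencies
tfidf_full = TfidfTransformer(use_idf=True)
transformed_tfidf = [tfidf_full.fit_transform(word_mats[i]) for i in range(8)]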
Preparing the target variable (score) for each essay set before training the model
In [247]:
dfs = [data[data['essay_set'] == i + 1] for i in range(8)]
scores = [dfs[i]['domain1_score'] for i in range(8)]
Because this problem deals with estimating the scores for each essay, supervised regression is a more apt approach than multi-class classification. The float values obtained from the regression can be rounded off.
I am running baseline models to understand the initial performance. In addition to mean squared error (MSE) and R2, I am using Spearman's rank correlation coefficient, which measures how closely the predicted scores correspond to the true scores; it captures the strength and direction of the monotonic association between the predictions and the scores. In other words, it determines how well the ranking of the predictions corresponds with the ranking of the true scores. This is a useful measure for grading essays, since we are interested in how well the model predicts the relative score of an essay (i.e., how one essay compares to another) rather than the exact score given to the essay. Ultimately, it is a better measure than accuracy alone, since it gives direct insight into how the predictions track the scores, and because relative accuracy may be more important than exact accuracy.
Spearman results in a score ranging from -1 to 1, where the closer the score is to an absolute value of 1, the stronger the monotonic association (and where positive values imply a positive monotonic association, versus negative values implying a negative one). The closer the value to 0, the weaker the monotonic association. The general consensus of Spearman correlation strength interpretation is as follows:
.00-.19 "very weak"
.20-.39 "weak"
.40-.59 "moderate"
.60-.79 "strong"
.80-1.0 "very strong"
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html https://github.com/kevinloughlin/Automated-Essay-Grading/tree/master/Readings
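As a quick standalone illustration of the metric (not part of the grading pipeline; the scores below are made up), scipy.stats.spearmanr returns the correlation together with its p-value:
In [ ]:
from scipy.stats import spearmanr

# Hypothetical true and predicted scores for five essays
y_true = [2, 4, 6, 8, 10]
y_pred = [2.1, 3.9, 6.5, 7.2, 9.8]   # preserves the ranking of y_true

rho, p_value = spearmanr(y_true, y_pred)
print(rho, p_value)   # rho == 1.0: the predicted ranking matches perfectly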
In [252]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import spearmanr

# The same process is repeated for each essay set
for i in range(8):
    # 85:15 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        transformed_data[i],
        scores[i],
        train_size=0.85,
        random_state=1234)
    # Fit the regressors using the training dataset
    Regressors = {"GB": {"f": GradientBoostingRegressor()},
                  "RF": {"f": RandomForestRegressor()},
                  "LR": {"f": LinearRegression()}}
    for model in Regressors.keys():
        # Fit
        Regressors[model]["f"].fit(X_train, y_train)
        # Predict
        Regressors[model]["c"] = Regressors[model]["f"].predict(X_test.toarray())
    # Evaluate each model in Regressors
    measures = {"mse": mean_squared_error, "r2": r2_score, "spear": spearmanr}
    results = pd.DataFrame(columns=list(measures.keys()))
    for model in Regressors.keys():
        results.loc[model] = [measures[measure](y_test, Regressors[model]["c"])
                              for measure in measures.keys()]
    print("Results for essay set {}".format(i + 1))
    print(results)
Initial interpretation:
Spearman's values:
Each tuple represents the Spearman correlation followed by its p-value.
As we can see from the results, the GradientBoostingRegressor (which has the strongest Spearman's rank correlation) outperforms LinearRegression and RandomForestRegressor.
We can further use a grid search over parameter combinations to find the best hyperparameters (a sketch follows below). Stacking and ensembling of algorithms may improve the results, and a deep neural network could be applied to check whether performance improves.
Further investigation is required to find the best features for essay sets 6 and 7.
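A minimal sketch of such a grid search for the GradientBoostingRegressor (the parameter grid values are illustrative assumptions, not tuned choices, and X_train / y_train refer to one essay set's split from the loop above):
In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative hyperparameter grid (assumed values, not recommendations)
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

grid = GridSearchCV(GradientBoostingRegressor(random_state=1234),
                    param_grid,
                    scoring="neg_mean_squared_error",
                    cv=3)
grid.fit(X_train.toarray(), y_train)
print(grid.best_params_, grid.best_score_)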