This notebook provides very basic information about the Kaggle ASAP dataset: https://www.kaggle.com/c/asap-aes
In [14]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
In [15]:
# Load data
dataset_essay_1 = pd.read_csv("/data/data/automated_scoring_public_dataset.csv")
dataset_essay_1.shape
Out[15]:
Note: the Kaggle dataset organizers replaced sensitive information such as person names and phone numbers with entity-type placeholders, so you may see tokens like @PERSON in the student text.
In [16]:
dataset_essay_1['essay'][0]
Out[16]:
In [17]:
print("Mean word count: ", dataset_essay_1['word_count'].mean())
print("Max word count: ", dataset_essay_1['word_count'].max())
print("Min word count: ", dataset_essay_1['word_count'].min())
print("STD word count: ", dataset_essay_1['word_count'].std())
In [18]:
dataset_essay_1_dropped_NaN_columns = dataset_essay_1.dropna(axis=1, how='all')
dataset_essay_1_dropped_NaN_columns.shape
Out[18]:
In [19]:
dataset_essay_1_dropped_NaN_columns.head(2)
Out[19]:
In [20]:
# We are interested in only two columns:
# the data (X) is the essay text and the truth value (y) is 'rater1_domain1'
dataset = dataset_essay_1_dropped_NaN_columns[['essay', 'rater1_domain1']]
dataset.rater1_domain1.value_counts() # this is the rater 1 human score distribution
Out[20]:
In [21]:
def convert_dataframe_to_arrays(dataset):
essay_array = np.array(dataset['essay'].tolist()) # data
essay_rater1 = np.array(dataset['rater1_domain1'].tolist()) # truth value
return essay_array, essay_rater1
In [22]:
from sklearn.model_selection import train_test_split
def split_train_test_X_Y(dataset):
X, y = convert_dataframe_to_arrays(dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)
return X_train, X_test, y_train, y_test
In [23]:
# Split it in to train and test arrays
X_train, X_test, y_train, y_test = split_train_test_X_Y(dataset)
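Since the rater scores are imbalanced across score points (see the value_counts output above), a stratified split can keep the score distribution similar in train and test. A minimal sketch on toy labels; the only change from the cell above is the `stratify` argument:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for essays and imbalanced rater scores
X = np.array([f"essay {i}" for i in range(100)])
y = np.array([0] * 80 + [1] * 20)

# stratify=y preserves the 80/20 label ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

print(np.bincount(y_train))  # [64 16]
print(np.bincount(y_test))   # [16  4]
```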
In [24]:
X_train[0]
Out[24]:
In [25]:
y_train[0]
Out[25]:
In [26]:
X_test[0]
Out[26]:
In [27]:
y_test[0]
Out[27]:
Let us use a basic CountVectorizer() and tf-idf (see https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). We could also use pre-trained embeddings and build features from them.
In [28]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
Out[28]:
In [29]:
X_train_counts[0]
Out[29]:
In [30]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
Out[30]:
In [31]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
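The two-step CountVectorizer-then-TfidfTransformer sequence above can also be collapsed into a single TfidfVectorizer; a sketch on toy documents showing that, with default parameters, both routes produce the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the cat sat", "the dog sat on the mat"]

# Two-step: raw counts, then tf-idf weighting (as in the cells above)
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step equivalent
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```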
Out[31]:
In [32]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
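The vectorize / transform / fit sequence can also be wrapped in a scikit-learn Pipeline, which then applies the same fitted steps to test data automatically. A minimal sketch on made-up toy texts and labels (with the real dataset, X_train / y_train would be used instead):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data for illustration only
train_texts = ["good clear essay", "poor unclear essay", "good essay", "unclear poor"]
train_labels = [1, 0, 1, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_texts, train_labels)

# predict() runs the new text through the same vectorizer and transformer
print(text_clf.predict(["good essay"]))
```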
In [33]:
X_test_counts = count_vect.transform(X_test)
X_test_counts.shape
Out[33]:
In [34]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
X_test_tfidf.shape
Out[34]:
In [35]:
predicted = clf.predict(X_test_tfidf)
This is a very weak model and the predictions are poor; I will work on this again tonight and build a better one.
In [36]:
predicted
Out[36]:
In [37]:
y_test
Out[37]:
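The ASAP competition scored submissions with quadratic weighted kappa, so agreement between the predictions and the rater scores can be checked with scikit-learn. A sketch on toy arrays standing in for `predicted` and `y_test`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy stand-in arrays; with the real run, use y_test and predicted
y_true = np.array([8, 9, 7, 10, 8, 9])
y_pred = np.array([8, 8, 7, 9, 8, 9])

print("Accuracy:", accuracy_score(y_true, y_pred))
# Quadratic weighted kappa, the official ASAP metric
print("QWK:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))
```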
In [ ]: