This notebook provides a basic overview of the Kaggle ASAP dataset: https://www.kaggle.com/c/asap-aes


In [14]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [15]:
# Load data
dataset_essay_1 = pd.read_csv("/data/data/automated_scoring_public_dataset.csv")
dataset_essay_1.shape


Out[15]:
(1783, 7)

For the Essay 5 prompt text and passage, refer to the Word document in the data folder.

Below is a sample response from essay set 1.

Note: The Kaggle dataset organizers replaced sensitive information, such as person names and phone numbers, with entity-type placeholders. You may therefore see tokens like @PERSON in the student text.


In [16]:
dataset_essay_1['essay'][0]


Out[16]:
"Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of troble! Thing about! Dont you think so? How would you feel if your teenager is always on the phone with friends! Do you ever time to chat with your friends or buisness partner about things. Well now - there's a new way to chat the computer, theirs plenty of sites on the internet to do so: @ORGANIZATION1, @ORGANIZATION2, @CAPS1, facebook, myspace ect. Just think now while your setting up meeting with your boss on the computer, your teenager is having fun on the phone not rushing to get off cause you want to use it. How did you learn about other countrys/states outside of yours? Well I have by computer/internet, it's a new way to learn about what going on in our time! You might think your child spends a lot of time on the computer, but ask them so question about the economy, sea floor spreading or even about the @DATE1's you'll be surprise at how much he/she knows. Believe it or not the computer is much interesting then in class all day reading out of books. If your child is home on your computer or at a local library, it's better than being out with friends being fresh, or being perpressured to doing something they know isnt right. You might not know where your child is, @CAPS2 forbidde in a hospital bed because of a drive-by. Rather than your child on the computer learning, chatting or just playing games, safe and sound in your home or community place. Now I hope you have reached a point to understand and agree with me, because computers can have great effects on you or child because it gives us time to chat with friends/new people, helps us learn about the globe and believe or not keeps us out of troble. Thank you for listening."

In [17]:
print("Mean word count: ", dataset_essay_1['word_count'].mean())
print("Max word count: ", dataset_essay_1['word_count'].max())
print("Min word count: ", dataset_essay_1['word_count'].min())
print("STD word count: ", dataset_essay_1['word_count'].std())


Mean word count:  365.68143578238926
Max word count:  785
Min word count:  8
STD word count:  119.60914896447912

In [18]:
dataset_essay_1_dropped_NaN_columns = dataset_essay_1.dropna(axis=1, how='all')
dataset_essay_1_dropped_NaN_columns.shape


Out[18]:
(1783, 7)

In [19]:
dataset_essay_1_dropped_NaN_columns.head(2)


Out[19]:
essay_id essay_set essay rater1_domain1 rater2_domain1 domain1_score word_count
0 1 1 Dear local newspaper, I think effects computer... 4 4 8 338
1 2 1 Dear @CAPS1 @CAPS2, I believe that using compu... 5 4 9 419

Let us use this data for feature building and model building.

1. Before building features, we will split the dataset into train and test arrays.

In [20]:
# we are interested in only two columns:
# the data (X) is the essay text, and the label (y) is 'rater1_domain1'
dataset = dataset_essay_1_dropped_NaN_columns[['essay', 'rater1_domain1']]
dataset.rater1_domain1.value_counts() # this is the rater 1 human score distribution


Out[20]:
4    922
5    507
3    196
6    120
2     28
1     10
Name: rater1_domain1, dtype: int64
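Note the heavy class imbalance: a score of 4 accounts for more than half of the essays. As a quick sketch (with the counts above hard-coded for illustration), the majority-class baseline any model must beat is:

```python
# Rater-1 score distribution from the value_counts output above
counts = {4: 922, 5: 507, 3: 196, 6: 120, 2: 28, 1: 10}
total = sum(counts.values())  # 1783 essays

# Accuracy of always predicting the most frequent score (4)
majority_accuracy = max(counts.values()) / total
print(f"Majority-class baseline accuracy: {majority_accuracy:.3f}")  # 0.517
```

A classifier that always predicts 4 would already be right about 52% of the time, so raw accuracy alone is a weak yardstick here.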

In [21]:
def convert_dataframe_to_arrays(dataset):
    essay_array = np.array(dataset['essay'].tolist()) # data
    essay_rater1 = np.array(dataset['rater1_domain1'].tolist()) # truth value
    return essay_array, essay_rater1

In [22]:
from sklearn.model_selection import train_test_split
def split_train_test_X_Y(dataset):
    X, y = convert_dataframe_to_arrays(dataset)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    print("X_train shape: ", X_train.shape)
    print("X_test shape: ", X_test.shape)
    print("y_train shape: ", y_train.shape)
    print("y_test shape: ", y_test.shape)
    return X_train, X_test, y_train, y_test

In [23]:
# Split it into train and test arrays
X_train, X_test, y_train, y_test = split_train_test_X_Y(dataset)


X_train shape:  (1426,)
X_test shape:  (357,)
y_train shape:  (1426,)
y_test shape:  (357,)

In [24]:
X_train[0]


Out[24]:
"Computers and the @CAPS1 were a technological break through. It exposed to the average world, things that were never thought possitive. But as these things advanced over the years, they've become an addiction so bad of an addiction its begun to threaten peoples lives I've been given a choice to s'de with the addicting computers, or to offose them. The only clear choice is to offose. First off, computers have caused the world a decrease in exercise. Studies show @NUM1 out of @NUM2 people who use a computer, do not exercise with less exercise throughout the world, nations are becoming more over weight. This is a huge problem in the united states computers are main cause to why the @CAPS2.S is over weight and unhealthy by cutting down computers use, we can get our world back into great shape we can bring exercise and health back. Nextly, I'm sure you've all heard of online predators. It's scary just to think about, well as computer technology increased online predators numbers went up. I remember a couple of years ago, I was watching the news and a story came a normal teenage girl, being killed by someone she meton myspace. things like this still go on, and the rate at which they happen are in creasing. by putting people on computers you're putting then at risk of death this is an extreme problem computers have caused. Thus is the last efferct computers have, out of many that I'm going to state time on the computer, is time taken away from family and friends.This can ruin relationship in and outside the family. Now I'm sure you now many people have become extremely addicted to online games. the more they play these games the more they pull away from everyone they knew and loved computers are a leading cause in disfunctional families. they steal the user away from the outside word. these people need to get their families and friends back. You have to act now, before its too late and computers have over taken the world. 
If you know anyone who has fallen prey to a computer addiction, do what you can to help get them back we need to cutt back on any kind of computer use fast. Hurry, it's now or never."

In [25]:
y_train[0]


Out[25]:
5

In [26]:
X_test[0]


Out[26]:
"I think computers have a postitive affect on people. Computers are very important to society. You can still interact with Your Family and friends either in @CAPS1 chat or @CAPS2. Computer are also helpful for applying for @CAPS3 like you can look for @CAPS3 you like. You can do @CAPS5 exciseses and useful for @CAPS5 essays. If you have a laptop you can bring it anywhere you want you don't have connect anything it take wireless and you can connect to the @CAPS6. You can look up stuff on the @CAPS6 for your homework and other stuff. Buy things without having to go to the store. These are reasons why computers have a postitive affect on people."

In [27]:
y_test[0]


Out[27]:
3
2. Let us build features

Let us use a basic CountVectorizer() and tf-idf (see https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). We could also use pre-trained embeddings and build features from them.


In [28]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape


Out[28]:
(1426, 14015)

In [29]:
X_train_counts[0]


Out[29]:
<1x14015 sparse matrix of type '<class 'numpy.int64'>'
	with 195 stored elements in Compressed Sparse Row format>

In [30]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape


Out[30]:
(1426, 14015)

In [31]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


Out[31]:
(1426, 14015)
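As an aside, the two steps above (CountVectorizer followed by TfidfTransformer) can be collapsed into a single TfidfVectorizer. A small sketch with a toy corpus standing in for X_train:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Toy corpus standing in for X_train
docs = ["computers help us learn",
        "computers keep us out of trouble",
        "learn about the globe"]

# One-step version
one_step = TfidfVectorizer().fit_transform(docs)

# Two-step version used in this notebook
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

print(one_step.shape)                                       # (3, 11)
print(np.allclose(one_step.toarray(), two_step.toarray()))  # True
```

With default parameters both routes produce the same tf-idf matrix, so the choice is purely a matter of convenience.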
3. Model building

In [32]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
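The vectorizer, tf-idf transform, and classifier can also be chained with sklearn's Pipeline, which keeps train and test preprocessing consistent automatically. A minimal sketch, using hypothetical toy documents and labels in place of X_train and y_train:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Equivalent to the three separate steps above, but fit/predict in one call
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Toy data standing in for X_train / y_train
docs = ["computers help us learn", "computers are bad for exercise",
        "learn about the globe", "online predators are scary"]
labels = [5, 3, 5, 3]
text_clf.fit(docs, labels)
print(text_clf.predict(["computers help us learn"]))  # predicts 5, matching its training label
```

Calling predict on the pipeline applies the same fitted vocabulary and idf weights to new text, which is exactly what the manual transform calls in the next section do by hand.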
4. Model Prediction

In [33]:
X_test_counts = count_vect.transform(X_test)
X_test_counts.shape


Out[33]:
(357, 14015)

In [34]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
X_test_tfidf.shape


Out[34]:
(357, 14015)

In [35]:
predicted = clf.predict(X_test_tfidf)

This is a very bad model: it predicts the majority score of 4 for every essay. I will work on this again and build a better model.


In [36]:
predicted


Out[36]:
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4])

In [37]:
y_test


Out[37]:
array([3, 3, 4, 4, 5, 4, 3, 4, 4, 5, 5, 6, 4, 5, 4, 5, 3, 5, 5, 3, 5, 3,
       5, 3, 3, 5, 4, 4, 3, 4, 5, 5, 5, 4, 6, 4, 3, 5, 5, 4, 5, 4, 4, 5,
       4, 4, 5, 5, 4, 4, 4, 4, 4, 4, 5, 4, 4, 3, 3, 4, 3, 5, 4, 5, 5, 3,
       4, 4, 5, 4, 4, 5, 4, 4, 6, 6, 6, 4, 5, 4, 3, 4, 5, 5, 3, 2, 4, 4,
       3, 4, 4, 4, 5, 5, 4, 4, 5, 2, 4, 5, 5, 4, 4, 4, 3, 4, 4, 6, 3, 4,
       1, 5, 3, 2, 6, 4, 4, 5, 4, 4, 4, 3, 3, 3, 4, 5, 4, 4, 5, 6, 4, 4,
       3, 4, 5, 2, 3, 3, 5, 4, 5, 4, 4, 4, 4, 6, 5, 4, 4, 4, 6, 5, 5, 3,
       4, 5, 4, 3, 5, 3, 6, 5, 5, 4, 4, 4, 4, 2, 4, 5, 6, 5, 4, 5, 4, 4,
       4, 5, 4, 4, 4, 3, 4, 4, 5, 5, 4, 5, 3, 4, 4, 3, 4, 5, 3, 5, 4, 2,
       3, 4, 3, 5, 3, 4, 4, 4, 4, 5, 4, 4, 5, 5, 5, 3, 5, 4, 4, 4, 2, 3,
       6, 3, 4, 4, 5, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 3, 4,
       3, 5, 4, 4, 2, 4, 4, 4, 4, 4, 5, 3, 5, 4, 5, 5, 4, 5, 4, 4, 4, 4,
       2, 4, 4, 5, 4, 6, 6, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 5, 5,
       4, 4, 3, 4, 5, 4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 6, 4, 5, 5, 4, 5, 4,
       3, 5, 4, 4, 5, 5, 4, 5, 4, 4, 3, 4, 4, 4, 4, 6, 4, 5, 3, 5, 6, 4,
       5, 4, 3, 5, 5, 4, 5, 4, 5, 4, 4, 5, 4, 4, 5, 5, 4, 4, 4, 4, 4, 4,
       5, 4, 4, 6, 4])
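To quantify how poor this constant prediction is, we can compare it against y_test using accuracy and quadratic weighted kappa (the metric used in the ASAP competition). A sketch using sklearn.metrics; the stand-in arrays below (the first ten y_test values) illustrate the point, and in the notebook you would pass the real predicted and y_test instead:

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Stand-in arrays; in the notebook, use the real `predicted` and `y_test`
y_true = np.array([3, 3, 4, 4, 5, 4, 3, 4, 4, 5])  # first ten y_test values
y_pred = np.full_like(y_true, 4)                   # constant prediction of 4

print("Accuracy:", accuracy_score(y_true, y_pred))                     # 0.5
print("QWK:", cohen_kappa_score(y_true, y_pred, weights='quadratic'))  # 0.0
```

A constant predictor always scores a kappa of 0, i.e. no better than chance agreement, even though its raw accuracy matches the majority-class rate.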

In [ ]: