Who Am I?

Small Town Living

I grew up in a town not far from Morton enjoying math and science classes, scholastic bowl, band, and sports. While my experience was a little more rural than the average MHS student, our high school experiences would have a lot in common.

University of Illinois at Urbana-Champaign

Path to finding a major:

  • Industrial Engineering
  • Mechanical Engineering
  • Computer Science
  • Mathematics



Master's Degree:

  • Mathematics Education

Started Coaching Volleyball & Teaching

After a few years, in 2004, I ended up in a little place known as the "Home of the Potters." I was hired as a math teacher and volleyball coach, and soon became acquainted with my favorite student activity of all time...

Math Team is the Bomb

I loved a lot of things about my time at MHS, but math team takes the cake. MHS has some smart cookies and we had quite a run.

MOOCs Are Changing Education



I have this thing where I have to keep reinventing myself. While I think it'd be hard to make a big change like this with only an online education, online coursework can help you pivot careers quickly as a supplement to a 4-year (or higher) degree. I spent all of my 2014 summer break in the library completing an intensive MOOC program in data science.


Topics to Cover Today

Big ideas:

  • What does a data scientist do?
  • What are the families of problems?
  • Reflection: The biggest disservice I did to my students as a math teacher


Our problem:


Can we use data science to identify students who make similar Tweets?

This Presentation

I've prepared this presentation in a "Jupyter Notebook." This is a web-based computation environment that allows me to mix and match web syntax like markdown/HTML (for text, images, bullet points, etc.) with code from a general-purpose programming language, in this case Python, that can be executed right in the browser.


I won't be showing much Python until close to the end of this presentation, but just to give you an idea of what I mean, if I type "2+2" into the next cell of this notebook and press Enter...


In [1]:
2+2


Out[1]:
4

The notebook understands that I was asking for the execution of a command. OK, with that concept of what you are looking at out of the way, let's move on.


What Does a Data Scientist Do?

Computer programming

What kind of programming?

  • Open source
  • Languages that have great statistics libraries (R, Python, Scala)


Analyze large data sets for business insights

What is big?

"Big Data" means different thing to different people. You might take it to mean:

  • Too big for your graphing calculator
  • Too big to open in a spreadsheet program
  • Bigger than the available memory (RAM) on your computer
  • Bigger than the available disk space on your computer
  • Repeat the last two statements for a large server, meaning we have to move to a cluster of servers





Develop models using data

Key Idea:

Examples of events that happened in the past can be useful to train a computer model


Supervised vs. Unsupervised Learning

Supervised Learning

This is the task of inferring a function from labeled training data. From your classes, think of linear regression, where the function is a line, and the labels are the y values assigned to each observation in your data.

Some ways that supervised learning, as data scientists practice it, differs from a high school linear regression problem:

  • Not limited to a single dimension
  • Many methods other than linear regression exist
  • Classification (What class is something?) is just as important as regression
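
To make that concrete, here's a minimal supervised regression sketch with more than one feature per observation (hypothetical numbers, using scikit-learn; this is just an illustration, not code from this presentation).

# A tiny supervised-learning example: the y values are the "labels" we learn from.
from sklearn.linear_model import LinearRegression

X = [[1, 2], [2, 1], [3, 4], [4, 3]]   # two features per observation, not just one x
y = [5, 4, 11, 10]                      # labeled outcomes we want to predict
reg = LinearRegression().fit(X, y)
print(reg.predict([[5, 5]]))            # predict a label for a brand-new observation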

Data Science as Sport

Supervised learning competitions have become increasingly popular in recent years. Google recently purchased the most well-known such competition site, Kaggle.

Unsupervised Learning

This is the task of inferring a function to describe hidden structure from "unlabeled" data. Clustering, or the task of grouping similar things together, is a common example of unsupervised learning.
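
For example, here's a minimal clustering sketch (hypothetical points, using scikit-learn; an illustration only): we never tell the computer what the groups are, and it finds them anyway.

# Group four points into two clusters without giving the computer any labels.
from sklearn.cluster import KMeans

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]]
kmeans = KMeans(n_clusters=2, random_state=0).fit(points)
print(kmeans.labels_)   # something like [0 0 1 1]: two groups, discovered automatically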


Structured Data vs. Unstructured Data

Structured data

This is data organized into a row and column structure. You may think of words like:

  • Matrices
  • Arrays
  • Spreadsheets



Unstructured data

This is data without an identifiable internal structure. Examples include:

  • Images
  • Audio
  • Text
  • Video

The image pixel data of a cuddly puppy can be transformed into a format recognizable to a computer, something like what you see below.
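
As a toy version of the idea (made-up numbers, just for illustration), a tiny grayscale image is nothing more than a grid of pixel brightness values:

import numpy as np

# A hypothetical 3x3 grayscale "image": each number is a pixel brightness from 0 (black) to 255 (white).
puppy_pixels = np.array([[ 12,  40, 200],
                         [ 35, 180, 220],
                         [ 90, 210, 255]], dtype=np.uint8)
print(puppy_pixels.shape)   # (3, 3): rows and columns of pixels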


Iterative Approximation vs. Direct Solutions

A Teacher's Confession

While I always tried to have the same level of enthusiasm regardless of what I was teaching, the reality was that some topics were less appealing. In these cases I held my nose and charged ahead.

Probably like most students, my reasons for finding a topic unappealing typically had to do with not seeing the application of the mathematics.




Examples of "Holding my nose"

Typically, I'm talking about examples where there were two methods of solving a problem: one that involved a more iterative, step-by-step approach, and one that involved solving directly for an answer with some clever tricks:

  • Empirical probability vs. theoretical probability
  • Newton's method vs. calculating the derivative algebraically
  • Solving a system of equations with matrices vs. substitution



I like clever tricks. In retrospect, why do things the "hard way"? Because oftentimes, what seems hard to us is what a computer understands.
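
For instance, here's a minimal sketch of Newton's method, the iterative "hard way," approximating sqrt(2) by finding a root of f(x) = x^2 - 2 (an illustration, not code from this presentation).

# Newton's method: repeatedly improve a guess using x_new = x - f(x)/f'(x).
def f(x):
    return x**2 - 2        # we want the x where this equals zero, i.e. sqrt(2)

def f_prime(x):
    return 2 * x           # the derivative of f

x = 1.0                    # starting guess
for _ in range(6):         # a handful of iterations is plenty here
    x = x - f(x) / f_prime(x)
print(x)                   # about 1.41421356, very close to sqrt(2)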



What is involved in solving the types of problems we encounter?

Linear Algebra & Calculus

  • Matrix multiplication
  • Dot products
  • Gradients (derivatives in higher dimensions)
  • In short, optimization methods
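
To make "optimization methods" a little more concrete, here's a minimal gradient descent sketch (an illustration, not code from this presentation): follow the gradient downhill to minimize a simple function.

# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x = 0.0                    # starting point
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (x - 3)
    x = x - learning_rate * gradient
print(x)                   # approaches 3, the value that minimizes f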


Let's Talk Twitter Profiles


OK, now that we've covered a few basics of what a data scientist does, let's talk about our problem. To summarize the problem we were given:

Given a "Tweet" from every student in the class, we would like to be able to find the Tweets that are most similar to any particular student that we choose.

Before we solve the problem, we'll want to think about a few of the concepts we discussed earlier.

  • Is our data structured or unstructured?
  • Is this a supervised or unsupervised learning problem?



Vectors

Remember from your classes that vectors are simply a direction and a magnitude. We data scientists love using vectors to represent data.

Perhaps, for example, the x component could tell us how often the word "pizza" appears in your Twitter profile and the y component could tell us how often the word "party" appears. A profile with two appearances of "pizza" and three of "party" would become the vector (2, 3). You can quickly see that this 2D vector representation would have serious limitations...
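
As a quick illustration of the counting idea (hypothetical profile text, not real student data), building such a 2-D count vector might look like this:

import numpy as np

# Count how often "pizza" and "party" appear and stack the counts into a 2-D vector.
profile = "pizza party tonight, then pizza and a party tomorrow, party all weekend"
profile_vector = np.array([profile.count("pizza"), profile.count("party")])
print(profile_vector)      # [2 3]: two "pizza"s and three "party"s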


Word Embeddings

One important limitation is the number of dimensions: just 2. Of course, we can expand that. Here's a picture with 3. While it's hard to visualize more than 3 dimensions, there's no reason we have to stop at 3 in our vector representation for the computer.



It would also be nice if we had a method of converting words to vector space that could:

  • Automatically recognize that similar words have similar meanings (e.g., "bad" and "terrible")
  • Give all of the profiles the same vector length



One tool for doing this is Word2Vec, an open source tool released by Google. It uses neural networks to teach the computer the meanings of words (or documents) and then converts those meanings into vectors. Words or documents with similar meanings have vectors that are "embedded" near each other in a high-dimensional space. I realize I'm glossing over the details of how this is done, but those details go beyond the scope of this lesson.

Cosine Similarity

Once we have the vector for each profile, how can we know if they are similar? We want vectors that are close together to be ranked as similar, and those that are far apart to be ranked as dissimilar. A popular method is cosine similarity.

Many of you already know the exact formula for cosine similarity from your Pre-Calculus class. Since it measures the angle between two vectors, it gives an answer close to 1 for vectors that point in nearly the same direction and close to -1 for vectors that point in opposite directions. What's the difference between here and your Pre-Calculus class? Very little, except that we are not limited to two dimensions.

Let's Talk Data

OK, so we've talked a little bit about what we'd like to do. Where are we going to get the data to train Word2Vec to understand how the kids talk nowadays? I gathered it from two places.

  1. Google was nice enough to provide the public an already trained model of Word2Vec that had learned the meanings of words from many thousands of Google News articles.

  2. Considering that people (especially teens) may speak differently than reporters, I gathered about 100,000 Tweets from these celebrities and, in the case of one popular teen, from her followers. I then created my own version of a Word2Vec model to include that language. It won't know nearly as many words, but it should know how some words are used specifically on Twitter.

So who did these Tweets come from? I'm betting you've heard of at least some of these.

  • Kim Kardashian
  • Adam Savage
  • Bill Nye
  • Neil deGrasse Tyson
  • Donald Trump
  • Hillary Clinton
  • Richard Dawkins
  • Commander Scott Kelly
  • Barack Obama
  • NASA
  • The Onion
  • Landry Bender (followers only)
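
Going from a pile of Tweets like these to a trained Word2Vec model with gensim looks roughly like the sketch below. The token lists and parameters shown here are placeholders for illustration, not the actual ones used for the model loaded later.

from gensim.models import Word2Vec

# Hypothetical example: each Tweet becomes a list of lowercase tokens.
tweet_tokens = [
    ["pizza", "party", "tonight"],
    ["pizza", "party", "all", "weekend"],
    ["rt", "omg", "so", "crazy"],
    # ... roughly 100,000 more Tweets in the real data set
]

# Train a 300-D model to match the size of the Google News vectors.
# (Newer versions of gensim call the `size` argument `vector_size`.)
twitter_model = Word2Vec(tweet_tokens, size=300, window=5, min_count=1, workers=4)
twitter_model.save('twitter_model')    # placeholder path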

Using Word2Vec to take Words to Vectors

Word2Vec literally (yes, literally, not figuratively) does what its name says. Let's take a look at Word2Vec converting a word to a vector. We'll start with the version of the model that Google has trained for us using Google News.


In [2]:
from gensim.models import Word2Vec
import gensim

In [3]:
model = gensim.models.KeyedVectors.load_word2vec_format(
    './private_data/GoogleNews-vectors-negative300.bin', binary=True)

In [4]:
# What does a word vector look like?
model.word_vec("cheeseburger")


Out[4]:
array([ -6.10351562e-02,   6.00585938e-02,   1.25000000e-01,
         6.75781250e-01,  -9.03320312e-02,   1.98242188e-01,
         2.55859375e-01,   1.77734375e-01,   1.57226562e-01,
         1.96289062e-01,  -1.69921875e-01,   7.95898438e-02,
         2.06054688e-01,  -1.74804688e-01,  -6.17675781e-02,
         3.67187500e-01,   9.33837891e-03,   1.89208984e-02,
         2.69531250e-01,  -1.55273438e-01,  -1.21093750e-01,
         2.69775391e-02,   1.90429688e-01,   1.34765625e-01,
        -1.32812500e-01,  -1.22558594e-01,   8.25195312e-02,
         2.83203125e-01,   1.36718750e-01,   2.09960938e-01,
        -8.05664062e-02,  -8.00781250e-02,   1.97753906e-02,
         9.46044922e-03,  -2.12402344e-02,   9.46044922e-03,
         6.67968750e-01,   1.09375000e-01,   1.58203125e-01,
         4.04296875e-01,  -3.47656250e-01,  -3.86718750e-01,
         8.78906250e-02,   1.37695312e-01,  -7.37304688e-02,
        -4.06250000e-01,   1.52587891e-02,   1.12792969e-01,
         1.07421875e-01,   7.89642334e-04,  -1.88476562e-01,
         1.58203125e-01,   1.78710938e-01,  -2.38281250e-01,
        -2.10571289e-03,   7.32421875e-02,   7.12890625e-02,
        -1.44531250e-01,  -2.19726562e-01,  -1.14257812e-01,
        -4.73632812e-02,   2.77343750e-01,   6.64062500e-02,
         2.57812500e-01,   2.16796875e-01,  -1.72851562e-01,
        -3.08593750e-01,  -3.02734375e-01,   2.46093750e-01,
         4.22363281e-02,   3.10546875e-01,  -1.87500000e-01,
        -9.42382812e-02,   2.51953125e-01,  -3.24218750e-01,
        -2.59765625e-01,  -4.02343750e-01,   7.87353516e-03,
        -3.34472656e-02,  -2.75390625e-01,  -2.95410156e-02,
        -1.19140625e-01,  -2.71484375e-01,  -7.65991211e-03,
         7.47070312e-02,   9.13085938e-02,  -1.80664062e-01,
         6.36718750e-01,   1.89453125e-01,  -3.14453125e-01,
        -4.27734375e-01,  -1.37695312e-01,  -9.03320312e-02,
        -5.12695312e-02,  -7.47070312e-02,  -1.35742188e-01,
         2.04101562e-01,   9.13085938e-02,   1.37695312e-01,
        -2.77343750e-01,  -1.24023438e-01,   5.56640625e-02,
         3.06640625e-01,   2.05078125e-01,   4.47265625e-01,
        -2.18750000e-01,   2.10937500e-01,  -2.02148438e-01,
        -2.03125000e-01,  -3.18908691e-03,   8.69140625e-02,
        -9.66796875e-02,   2.07519531e-02,   3.41796875e-02,
         1.35742188e-01,  -7.51953125e-02,   1.38671875e-01,
        -3.29589844e-02,  -1.12304688e-01,  -1.72119141e-02,
        -3.66210938e-02,   5.12695312e-02,  -2.40234375e-01,
        -9.71679688e-02,   7.32421875e-02,   1.65039062e-01,
         2.20703125e-01,   4.66308594e-02,   4.49218750e-02,
        -2.23388672e-02,  -8.30078125e-02,   1.03027344e-01,
         1.40625000e-01,   1.08886719e-01,  -3.00781250e-01,
         3.75000000e-01,   2.87109375e-01,   1.01562500e-01,
         2.42187500e-01,   2.55126953e-02,   2.79296875e-01,
        -9.71679688e-02,  -8.10546875e-02,  -2.00195312e-01,
        -1.32812500e-01,  -1.06933594e-01,  -2.38281250e-01,
        -1.57226562e-01,  -1.60156250e-01,   4.25781250e-01,
         1.99218750e-01,   4.63867188e-02,  -1.98242188e-01,
        -4.63867188e-02,  -2.69775391e-02,   4.32128906e-02,
         6.17980957e-04,   1.30859375e-01,  -5.11718750e-01,
        -4.41894531e-02,  -5.68847656e-02,   3.95507812e-02,
         2.45117188e-01,   3.54003906e-02,  -1.74804688e-01,
        -1.11816406e-01,  -3.95507812e-02,  -8.49609375e-02,
         2.80761719e-02,   1.00585938e-01,  -7.12890625e-02,
        -5.78613281e-02,   6.15234375e-02,   6.88476562e-02,
        -3.78906250e-01,   6.54296875e-02,   1.04003906e-01,
         3.41796875e-01,  -1.54296875e-01,   2.44140625e-01,
        -7.17773438e-02,  -1.88476562e-01,  -1.03515625e-01,
        -1.45507812e-01,   2.92968750e-01,  -2.06054688e-01,
         1.01928711e-02,   1.85546875e-01,  -1.86523438e-01,
         2.17285156e-02,   1.10839844e-01,   2.59765625e-01,
        -8.83789062e-02,   1.45263672e-02,  -6.25000000e-02,
         4.45312500e-01,  -8.10546875e-02,  -1.01562500e-01,
        -1.19140625e-01,  -1.92382812e-01,  -1.67968750e-01,
         8.74023438e-02,   6.29882812e-02,  -2.14843750e-02,
         2.42187500e-01,  -5.12695312e-02,  -1.67968750e-01,
        -1.07910156e-01,   1.66015625e-02,  -7.17773438e-02,
        -2.79296875e-01,   2.18505859e-02,  -2.67578125e-01,
         6.03027344e-02,  -2.63671875e-01,  -5.56640625e-02,
         2.61718750e-01,   6.49414062e-02,  -1.27929688e-01,
        -5.98144531e-02,  -1.18652344e-01,  -1.80664062e-01,
         4.51171875e-01,  -8.39843750e-02,   3.92578125e-01,
         7.56835938e-02,   6.12792969e-02,   3.51562500e-01,
        -9.96093750e-02,  -1.37695312e-01,   5.32226562e-02,
        -1.08398438e-01,  -2.12890625e-01,   2.83203125e-02,
         3.45703125e-01,  -2.27539062e-01,  -1.62353516e-02,
         2.71484375e-01,   5.76171875e-02,   3.56445312e-02,
         2.73437500e-01,   6.31713867e-03,   1.06201172e-02,
        -2.53906250e-01,   3.33984375e-01,  -1.33789062e-01,
         2.81250000e-01,   1.50390625e-01,   2.04101562e-01,
         1.47460938e-01,   1.83593750e-01,   1.10473633e-02,
         3.73046875e-01,  -1.10839844e-01,   2.42614746e-03,
        -1.87500000e-01,  -2.53906250e-02,  -1.50390625e-01,
         2.40478516e-02,  -1.57226562e-01,   7.22656250e-02,
         4.10156250e-01,  -2.12890625e-01,  -1.70898438e-01,
         1.13281250e-01,  -3.61328125e-01,  -5.54199219e-02,
         8.44726562e-02,  -4.55078125e-01,   1.55273438e-01,
        -7.91015625e-02,   2.51953125e-01,  -1.66992188e-01,
         6.25000000e-02,   1.01074219e-01,   1.67968750e-01,
        -1.91650391e-02,  -1.29394531e-02,  -3.32031250e-02,
        -2.83203125e-01,   2.25585938e-01,   4.56542969e-02,
        -6.54296875e-02,   2.63671875e-01,  -3.73046875e-01,
        -1.63085938e-01,  -1.40625000e-01,   7.08007812e-02,
        -2.35351562e-01,  -1.12304688e-01,  -1.41601562e-02,
        -1.08886719e-01,  -3.29589844e-02,   2.69531250e-01,
         1.08398438e-01,  -1.91406250e-01,  -8.00781250e-02,
         3.28125000e-01,   1.23046875e-01,   1.32812500e-01], dtype=float32)

Comparing with Cosine Similarity

So that's pretty nice. However, it's going to be impossible to do a meaningful visual comparison of two vectors like that to see how similar they are. We need an objective method for comparing them, and that's where cosine similarity comes in. Remember our formula from Pre-Calculus (if you've taken the class)? Here it is again in case you forgot.
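
cos(θ) = (a · b) / (|a| |b|)

where a · b is the dot product of the two vectors and |a| and |b| are their magnitudes (lengths).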


Let's use Python to work this out and see how similar a couple of words are based on their vector representations.

What would we expect a cheeseburger to be more like if our model is working properly, a hamburger or a Corvette?


In [5]:
cheeseburger = model.word_vec("cheeseburger")
hamburger = model.word_vec("hamburger")
corvette = model.word_vec("corvette")

In [6]:
# Numpy is a linear algebra library that we can use to help us efficiently find the dot product and
# vector magnitude (much like the distance formula or Pythagorean theorem)
import numpy as np

In [7]:
# How similar are the words "cheeseburger" and "hamburger"?
numerator = np.dot(cheeseburger,hamburger)
denominator = np.sqrt(np.sum(cheeseburger**2)) * np.sqrt(np.sum(hamburger**2))
print(numerator/denominator)


0.711572

In [8]:
# How similar are the words "cheeseburger" and "corvette"?
numerator = np.dot(cheeseburger,corvette)
denominator = np.sqrt(np.sum(cheeseburger**2)) * np.sqrt(np.sum(corvette**2))
print(numerator/denominator)


0.104262

As we would hope, cheeseburgers and hamburgers are pretty similar, while cheeseburgers and Corvettes are not.

Finding the Average Word Embedding

Of course, for our problem, we need to do more than analyze a single word. We want to analyze the "Tweets" from the class. How do we do this? We can find the average vector of all the words in the Tweet. That should give us the best understanding of what the Tweet was about.


How do you find an average vector? Just like you would find the average (or mean) with numbers. Take the sum of the vectors and divide by the number of vectors. Check out this simple example.


In [9]:
vector_a = np.array([1,0])
vector_b = np.array([0,.5])
average_a_b = (vector_a+vector_b )/2
print(average_a_b)


[ 0.5   0.25]

I've written some code that will find the average word vector (or "embedding") for a particular Tweet with our 300-D vectors.
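
The helper itself lives in a separate module, but its core is something like this sketch (a hypothetical function, assuming a gensim KeyedVectors model like the Google News one loaded above; the real code may differ):

import numpy as np

def average_word_vector(words, model, size=300):
    # Average the 300-D vectors of the words the model knows, skipping unknown words.
    # (Uses the same older gensim API, word_vec/vocab, seen earlier in this notebook.)
    vectors = [model.word_vec(w) for w in words if w in model.vocab]
    if not vectors:
        return np.zeros(size)          # no known words: fall back to the zero vector
    return np.mean(vectors, axis=0)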

Compare Two Tweets

Once we've converted our Tweets into average word vectors, we want to see how similar they are, right? To make this process a little more efficient, I wrote some code that checks, within a particular class period, who had the most similar Tweets in the data set. If we searched through all of Twitter's history of 200 billion Tweets per year, we would likely find a nice match for each Tweet. However, with a data set this small (90 Tweets), the matches won't be quite as easy to find, so I'm just looking for those of you who really had the best matches.
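
Under the hood, that comparison boils down to something like the following sketch (a hypothetical helper, not the exact code in the `embeddings` module): compute the cosine similarity between every pair of average word vectors and report the most similar pair.

import numpy as np

def most_similar_pair(df, vector_cols):
    # Cosine similarity between every pair of rows, ignoring each row's similarity to itself.
    vectors = df[vector_cols].values
    norms = np.linalg.norm(vectors, axis=1)
    sims = (vectors @ vectors.T) / np.outer(norms, norms)
    np.fill_diagonal(sims, -1)
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return i, j, sims[i, j]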


Let's find an example for each class period:


In [10]:
import pandas as pd
tweets = pd.read_csv('./private_data/no_names.csv')
tweets.head()


Out[10]:
hour text
0 1st Not much.
1 1st I took an AP Physics test
2 1st I finished a long book series
3 1st I fell down the stairs.
4 1st My friend gave me a giant bag of Hershey's kis...

In [11]:
from gensim.models import Word2Vec
import gensim
morton_model = gensim.models.Word2Vec.load('./private_data/morton_model')

In [12]:
# Just for fun. What has our model learned from Twitter?
morton_model.wv.most_similar(['donald'])


Out[12]:
[('mr', 0.6487404108047485),
 ('fred', 0.471854567527771),
 ('realdonaldtrump', 0.4691108465194702),
 ('macleod', 0.44352903962135315),
 ('sensanders', 0.4414896070957184),
 ('foxnewsinsider', 0.4314236044883728),
 ('tower', 0.4279685616493225),
 ('prez', 0.4263092875480652),
 ('carson', 0.41344374418258667),
 ('bashing', 0.41049301624298096)]

In [13]:
morton_model.wv.most_similar(['rt'])


Out[13]:
[('selenagomez', 0.554801881313324),
 ('dovecameron', 0.5021473169326782),
 ('realliampayne', 0.4966311454772949),
 ('retweet', 0.47674068808555603),
 ('hayesgrier', 0.4758262634277344),
 ('landrybender', 0.47537535429000854),
 ('officialfiym', 0.4609455466270447),
 ('sabrina', 0.4601680636405945),
 ('omg', 0.4588630795478821),
 ('rowblanchard', 0.45870542526245117)]

In [14]:
morton_model.wv.most_similar(['crazy'])


Out[14]:
[('weird', 0.6290851831436157),
 ('funny', 0.6102160215377808),
 ('cute', 0.6079419851303101),
 ('insane', 0.5588301420211792),
 ('scary', 0.5566126108169556),
 ('ok', 0.5545148849487305),
 ('hot', 0.5457345247268677),
 ('bored', 0.5343079566955566),
 ('adorable', 0.5337340235710144),
 ('kinda', 0.5333009958267212)]

Our model does not have the vocabulary of Google News, but it has clearly learned a few interesting relationships.

We now have 2 models.

  • One model has been trained with billions of words (over a million unique) with Google News data.
  • The other model has been trained with over a million words using Twitter data.

This is an interesting case for comparing the amount of training data vs. the domain of the training data (also worth noting: I'm sure the Google data scientists tuned the heck out of their model, whereas I took some "off the shelf" tuning parameters, so there's that).


In [15]:
from embeddings import embeddings

First, Let's Look at Our Model Built From Twitter Data


In [16]:
# Let's start by adding a 300-D word vector for every Tweet you guys sent out.
tweets1 = embeddings.append_word_vector_cols(tweets,morton_model,text_col='text')
tweets1.head()


Out[16]:
hour text wv0 wv1 wv2 wv3 wv4 wv5 wv6 wv7 ... wv290 wv291 wv292 wv293 wv294 wv295 wv296 wv297 wv298 wv299
0 1st Not much. -0.214690 -0.544352 -0.270313 0.003636 0.469348 -0.349296 -1.142005 0.939013 ... -0.416468 0.028199 -0.286156 -0.186573 0.890733 -0.267682 0.247837 0.825677 1.475202 -0.686134
1 1st I took an AP Physics test 0.053713 0.196875 -0.820789 1.164464 0.158858 -0.296072 -0.035544 -0.688193 ... -0.378202 0.097029 0.267748 0.313274 0.374856 0.024395 -0.564848 -0.418977 -0.066759 0.231382
2 1st I finished a long book series 0.088673 0.063467 -0.575716 -0.050583 0.643594 -0.079214 0.006667 -0.237967 ... -0.445731 -0.125031 0.619952 -0.315375 -0.346248 -0.074158 -0.566364 0.072197 -0.105081 -0.561076
3 1st I fell down the stairs. 0.471152 0.131318 -0.149470 0.047055 0.062587 0.249302 -0.536971 -0.076366 ... -0.028005 -0.026351 0.215953 0.568723 0.006598 -0.372337 -0.508275 0.387816 -0.095147 0.213411
4 1st My friend gave me a giant bag of Hershey's kis... -0.072086 -0.060336 -0.117156 -0.103744 0.187204 0.140894 -0.174904 0.398089 ... -0.372076 0.048365 0.299534 -0.083932 0.106779 0.048091 0.102877 0.136756 -0.119633 0.071890

5 rows × 302 columns


In [17]:
embeddings.most_similar_one_class(tweets1,"1st")


Out[17]:
(7,
 'The most interesting thing that has happened to me in the past week was when I had to speak at NHS inductions for MHS. It was interesting to speak in front of adults because I have never done that before. ',
 85,
 0.90625441,
 'The most interesting thing that has happened to me in the past week was that I was sick and missed school for three days. This gave me a lot of time to catch my breath from school, even though I had more work on the end if it all. ')

In [18]:
embeddings.most_similar_one_class(tweets1,"2nd")


Out[18]:
(19,
 'I did a double heel flip on my skateboard off a 3 stair.',
 51,
 0.61830491,
 'My cat left the top half of a rabbit on my patio for me. ')

In [19]:
embeddings.most_similar_one_class(tweets1,"3rd")


Out[19]:
(32,
 'In this past week the most interesting thing that has happened to me was that I was teaching a Junior high color guard tryouts and a little sixth grader yelled out get set when that is what the instructors do. It shocked me that she was yelling at older members to get set. ',
 85,
 0.90001875,
 'The most interesting thing that has happened to me in the past week was that I was sick and missed school for three days. This gave me a lot of time to catch my breath from school, even though I had more work on the end if it all. ')

In [20]:
embeddings.most_similar_one_class(tweets1,"4th")


Out[20]:
(85,
 'The most interesting thing that has happened to me in the past week was that I was sick and missed school for three days. This gave me a lot of time to catch my breath from school, even though I had more work on the end if it all. ',
 7,
 0.90625441,
 'The most interesting thing that has happened to me in the past week was when I had to speak at NHS inductions for MHS. It was interesting to speak in front of adults because I have never done that before. ')

In [21]:
embeddings.most_similar_one_class(tweets1,"6th")


Out[21]:
(46,
 'Something interesting that has happened to me in the past week is being able to move into the upstairs room. Another interesting thing that happened to me this past week was giving cookies to teachers for teacher appreciation week. ',
 49,
 0.86710334,
 'The most interesting thing that has happened to me in the past week would either be the choir concert or making honors.')

In [22]:
embeddings.most_similar_one_class(tweets1,"7th")


Out[22]:
(49,
 'The most interesting thing that has happened to me in the past week would either be the choir concert or making honors.',
 23,
 0.87258679,
 'The most interesting thing that happened to me in the past week is that I actually gave effort in gym class.')

Now Let's Look at the Model Trained with Google News Data


In [23]:
tweets2 = embeddings.append_word_vector_cols(tweets, model, keyed_vec=True, text_col='text')
tweets2.head()


Out[23]:
hour text wv0 wv1 wv2 wv3 wv4 wv5 wv6 wv7 ... wv290 wv291 wv292 wv293 wv294 wv295 wv296 wv297 wv298 wv299
0 1st Not much. 0.126953 -0.015625 0.017334 0.142822 -0.093262 0.069580 0.176270 -0.145996 ... -0.205078 0.006348 -0.036057 0.011108 -0.051495 0.005127 0.106445 -0.140381 0.009766 -0.168457
1 1st I took an AP Physics test -0.015527 0.085303 0.124951 -0.026178 -0.087500 -0.060107 0.105762 -0.154395 ... 0.058398 0.052954 -0.107617 0.202148 0.008228 -0.020825 0.019116 -0.074329 -0.004590 0.043555
2 1st I finished a long book series -0.035248 0.089539 -0.075653 0.040283 0.148132 -0.090302 -0.061584 -0.016174 ... -0.072296 -0.015785 -0.103821 -0.050537 -0.107605 -0.058746 0.011307 0.055878 -0.012823 -0.050789
3 1st I fell down the stairs. -0.035614 -0.001465 0.065735 0.042542 0.026978 -0.186157 0.046875 -0.006836 ... -0.050781 0.032608 -0.196609 0.039047 -0.153992 0.078659 0.014313 -0.025818 0.101746 0.128418
4 1st My friend gave me a giant bag of Hershey's kis... 0.067940 -0.013024 0.058724 0.121065 -0.048121 -0.055465 0.013849 -0.076047 ... -0.088939 0.055033 -0.130973 -0.092790 -0.066445 -0.068718 0.010868 -0.051497 -0.010246 -0.065836

5 rows × 302 columns


In [24]:
embeddings.most_similar_one_class(tweets2,"1st")


Out[24]:
(10,
 'I took the AP Physics exam on Tuesday. ',
 64,
 0.92693609,
 'Took the AP physics exam ')

In [25]:
embeddings.most_similar_one_class(tweets2,"2nd")


Out[25]:
(18,
 'Absolutely nothing ',
 44,
 0.64287114,
 'Nothing really interesting happened to me this week.')

In [26]:
embeddings.most_similar_one_class(tweets2,"3rd")


Out[26]:
(23,
 'The most interesting thing that happened to me in the past week is that I actually gave effort in gym class.',
 37,
 0.92807037,
 'The most interesting thing that has happened in the past week is that two my classes that I needed to go up, went up to an A. ')

In [27]:
embeddings.most_similar_one_class(tweets2,"4th")


Out[27]:
(85,
 'The most interesting thing that has happened to me in the past week was that I was sick and missed school for three days. This gave me a lot of time to catch my breath from school, even though I had more work on the end if it all. ',
 37,
 0.92766279,
 'The most interesting thing that has happened in the past week is that two my classes that I needed to go up, went up to an A. ')

In [28]:
embeddings.most_similar_one_class(tweets2,"6th")


Out[28]:
(37,
 'The most interesting thing that has happened in the past week is that two my classes that I needed to go up, went up to an A. ',
 23,
 0.92807037,
 'The most interesting thing that happened to me in the past week is that I actually gave effort in gym class.')

In [29]:
embeddings.most_similar_one_class(tweets2,"7th")


Out[29]:
(49,
 'The most interesting thing that has happened to me in the past week would either be the choir concert or making honors.',
 23,
 0.89895618,
 'The most interesting thing that happened to me in the past week is that I actually gave effort in gym class.')

Similar Phrasing

Both models seem to key in on how frequently phrases like "the most interesting thing" show up. That said, I do think there's a winner here between the two models, but I'll leave you to be the judge.