I grew up in a town not far from Morton enjoying math and science classes, scholastic bowl, band, and sports. While my experience was a little more rural than the average MHS student, our high school experiences would have a lot in common.
Path to finding a major:
Master's Degree:
After a few years, in 2004, I ended up in a little place known as the "Home of the Potters." I was hired as a math teacher and volleyball coach, and soon became acquainted with my favorite student activity of all time...
I loved a lot of things about my time at MHS, but math team takes the cake. MHS has some smart cookies, and we had quite a run.
I have this thing where I have to keep reinventing myself. I think it'd be hard to make a big change like this with only an online education, but online education can help you pivot careers quickly as a supplement to a 4-year (or higher) degree. I spent all of my 2014 summer break in the library completing an intensive program of MOOCs in data science.
I've prepared this presentation in a "Jupyter Notebook." This is a web-based computing environment that lets me mix and match web syntax like Markdown/HTML (for text, images, bullet points, etc.) with code from a general-purpose programming language, in this case Python, that can be executed right in the browser.
I won't be showing much Python until close to the end of this presentation, but just to give you an idea of what I mean, if I type "2+2" into the next cell of this notebook and press enter...
In [1]:
2+2
Out[1]:
4
The notebook understands that I was asking it to execute a command. OK, with that concept of what you are looking at out of the way, let's move on.
"Big Data" means different thing to different people. You might take it to mean:
Examples of events that happened in the past can be useful to train a computer model
This is the task of inferring a function from labeled training data. From your classes, think of linear regression, where the function is a line, and the labels are the y values assigned to each observation in your data.
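As a tiny, made-up illustration (the numbers here are not from any class data): we can "learn" a line from a handful of labeled points with NumPy and then use it to predict the label for a new observation.

import numpy as np

# Made-up labeled training data: x values and their y "labels"
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Infer the function (here, a line) from the labeled examples
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)          # about 1.96 and 0.14

# Use the learned line to predict the label for a new observation
print(slope * 6 + intercept)     # about 11.9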
Some differences about how data scientists perform supervised learning from a high school linear regression problem:
Supervised learning competitions have become increasingly popular in recent years. Google recently purchased the best-known such competition site, Kaggle.
This is the task of inferring a function to describe hidden structure from "unlabeled" data. Clustering, or the task of grouping similar things together, is a common example of unsupervised learning.
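For a quick, made-up illustration of clustering (scikit-learn isn't used anywhere else in this notebook; this is just a sketch):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points: two obvious groups, but nothing tells the computer that
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

# Ask for two clusters and let the algorithm discover the grouping on its own
labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1] -- similar points end up grouped together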
This is data organized into a row and column structure. You may think of words like:
This is data without an identifiable internal structure. Examples include:
The image pixel data of a cuddly puppy can be transformed into a format recognizable to a computer, something like what you see below.
While I always tried to have the same level of enthusiasm regardless of what I was teaching, the reality was that some topics were less appealing. In these cases I held my nose and charged ahead.
Probably like most students, I found a topic less appealing when I couldn't see the application of the mathematics.
Typically, I'm talking about examples where there were two methods of solving a problem: one that involved a more iterative, step-by-step approach, and one that involved solving directly for an answer with some clever tricks.
I like clever tricks. In retrospect, why do things the "hard way"? Because oftentimes, what seems hard to us is what a computer understands.
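As a tiny illustration of what I mean (my own made-up example): adding the numbers 1 through 100 can be done the "hard," step-by-step way a computer is happy with, or with the clever closed-form trick you learn in class.

# The "hard way": iterate and accumulate, one step at a time
total = 0
for n in range(1, 101):
    total += n
print(total)            # 5050

# The "clever trick": solve directly with the closed-form formula n(n+1)/2
print(100 * 101 // 2)   # 5050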
Linear Algebra & Calculus
OK, now that we've covered a few basics of what a data scientist does, let's talk about our problem. To summarize the problem we were given:
Given a "Tweet" from every student in the class, we would like to be able to find the Tweets that are most similar to the Tweet of any particular student we choose.
Before we solve the problem, we'll want to think about a few of the concepts we discussed earlier.
Perhaps, for example, the x component could tell us how often the word "pizza" appears in your Twitter profile and the y component could tell us how often the word "party" appears. This profile would have two appearances of "pizza" and three of "party." You can quickly see that this 2D vector representation would have serious limitations...
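Just to make the idea concrete, here's a toy sketch (the profile text is made up) of building that 2-D count vector:

# A hypothetical profile (made-up text, just for illustration)
profile = "pizza party tonight! more pizza, more party, even more party"

# Count the two words we care about to build a 2-D "count" vector
x = profile.lower().count("pizza")
y = profile.lower().count("party")
print([x, y])   # [2, 3]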
It would also be nice if we had a method of converting words to vector space that could:
One tool for doing this is Word2Vec, an open-source tool released by Google. It uses neural networks to teach the computer what words (or documents) mean, and then converts each one into a vector. Words/documents with similar meanings have vectors that are "embedded" near each other in a high-dimensional space. I realize I'm glossing over the details of how this is done, but those details go beyond the scope of this lesson.
Once we have the vector for each profile, how can we know if they are similar? We want vectors that are close together to be ranked as similar, and those that are far apart to be ranked as dissimilar. A popular method is cosine similarity.
Many of you already know the exact formula for cosine similarity from your Pre-Calculus class. Since it compares the angle between two vectors, it gives an answer close to 1 for vectors pointing in nearly the same direction and close to -1 for vectors pointing in opposite directions. What's the difference between here and your Pre-Calculus class? Very little, except that we are not limited to two dimensions.
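For reference, it's the same formula you saw in class, just applied to vectors with any number of components:

$$\text{similarity} = \cos\theta = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert}$$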
OK, so we've talked a little bit about what we'd like to do. Where are we going to get the data to train Word2Vec to understand how the kids talk nowadays? I gathered it from two places.
Google was nice enough to provide the public an already trained model of Word2Vec that had learned the meanings of words from many thousands of Google News articles.
Considering that people (especially teens) may speak differently than reporters, I gathered about 100,000 Tweets from these celebrities and, in the case of one popular teen, from her followers. I then trained my own version of a Word2Vec model to include that language. It won't know nearly as many words, but it should know how some words are used specifically on Twitter.
So who did these Tweets come from? I'm betting you've heard of at least some of these.
In [2]:
from gensim.models import Word2Vec
import gensim
In [3]:
model = gensim.models.KeyedVectors.load_word2vec_format(
'./private_data/GoogleNews-vectors-negative300.bin', binary=True)
In [4]:
# What does a word vector look like?
model.word_vec("cheeseburger")
Out[4]:
Let's use Python to work this out and see how similar a couple of words are based on their vector representations.
In [5]:
cheeseburger = model.word_vec("cheeseburger")
hamburger = model.word_vec("hamburger")
corvette = model.word_vec("corvette")
In [6]:
# NumPy is a numerical computing library that we can use to efficiently find the dot product and
# vector magnitude (much like the distance formula or Pythagorean theorem)
import numpy as np
In [7]:
# How similar are the words "cheeseburger" and "hamburger"?
numerator = np.dot(cheeseburger,hamburger)
denominator = np.sqrt(np.sum(cheeseburger**2)) * np.sqrt(np.sum(hamburger**2))
print(numerator/denominator)
In [8]:
# How similar are the words "cheeseburger" and "corvette"?
numerator = np.dot(cheeseburger,corvette)
denominator = np.sqrt(np.sum(cheeseburger**2)) * np.sqrt(np.sum(corvette**2))
print(numerator/denominator)
As we would hope, cheeseburgers and hamburgers are pretty similar, while cheeseburgers and corvettes are not.
Of course, for our problem, we need to do more than analyze a single word. We want to analyze the "Tweets" from the class. How do we do this? We can find the average vector of all the words in the Tweet. That should give us the best understanding of what the Tweet was about.
How do you find an average vector? Just like you would find the average (or mean) with numbers. Take the sum of the vectors and divide by the number of vectors. Check out this simple example.
In [9]:
vector_a = np.array([1,0])
vector_b = np.array([0,.5])
average_a_b = (vector_a + vector_b) / 2
print(average_a_b)
I've written some code that will find the average word vector (or "embedding") for a particular Tweet with our 300-D vectors.
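The real code lives in a helper module imported below, but a minimal sketch of the idea might look something like this (the function name and details here are just illustrative, assuming a pre-4.0 gensim KeyedVectors object like the one loaded above):

import numpy as np

def average_word_vector(text, word_vectors, dim=300):
    # Keep only the words the model actually knows
    words = [w for w in text.lower().split() if w in word_vectors.vocab]
    if not words:
        return np.zeros(dim)                              # no known words: fall back to the zero vector
    vectors = [word_vectors.word_vec(w) for w in words]   # one 300-D vector per word
    return np.mean(vectors, axis=0)                       # element-wise average across the words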
Once we've converted our Tweets into average word vectors, we want to see how similar they are, right? To make this process a little more efficient, I wrote some code that checks, within a particular class period, who had the most similar Tweets in the data set. If we searched through Twitter's entire history of roughly 200 billion Tweets per year, we would likely find a nice match for each Tweet. With a data set this small (90 Tweets), however, the matches won't be quite as easy to find, so I'm just looking for those of you who really had the best matches.
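Conceptually, that helper just compares every pair of Tweet vectors with cosine similarity and keeps the best-scoring pair. A rough sketch of the idea (not the actual most_similar_one_class code) could be:

import numpy as np

def cosine_similarity(a, b):
    # Same formula as above: dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar_pair(vectors):
    # vectors: list of average-word-vector arrays, one per Tweet
    best_pair, best_score = None, -1.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine_similarity(vectors[i], vectors[j])
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair, best_score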
Let's find an example for each class period:
In [10]:
import pandas as pd
tweets = pd.read_csv('./private_data/no_names.csv')
tweets.head()
Out[10]:
In [11]:
from gensim.models import Word2Vec
import gensim
morton_model = gensim.models.Word2Vec.load('./private_data/morton_model')
In [12]:
# Just for fun. What has our model learned from Twitter?
morton_model.wv.most_similar(['donald'])
Out[12]:
In [13]:
morton_model.wv.most_similar(['rt'])
Out[13]:
In [14]:
morton_model.wv.most_similar(['crazy'])
Out[14]:
We now have 2 models.
This is an interesting case for comparing the amount of training data vs. the domain of the training data (also worth noting: I'm sure the Google data scientists tuned the heck out of their model, whereas I took some "off-the-shelf" tuning parameters, so there's that).
In [15]:
from embeddings import embeddings
In [16]:
# Let's start by adding a 300-D word vector for every Tweet you guys sent out.
tweets1 = embeddings.append_word_vector_cols(tweets,morton_model,text_col='text')
tweets1.head()
Out[16]:
In [17]:
embeddings.most_similar_one_class(tweets1,"1st")
Out[17]:
In [18]:
embeddings.most_similar_one_class(tweets1,"2nd")
Out[18]:
In [19]:
embeddings.most_similar_one_class(tweets1,"3rd")
Out[19]:
In [20]:
embeddings.most_similar_one_class(tweets1,"4th")
Out[20]:
In [21]:
embeddings.most_similar_one_class(tweets1,"6th")
Out[21]:
In [22]:
embeddings.most_similar_one_class(tweets1,"7th")
Out[22]:
In [23]:
# Now repeat the process, this time using Google's pre-trained news vectors instead of our Twitter-trained model
tweets2 = embeddings.append_word_vector_cols(tweets, model, keyed_vec=True, text_col='text')
tweets2.head()
Out[23]:
In [24]:
embeddings.most_similar_one_class(tweets2,"1st")
Out[24]:
In [25]:
embeddings.most_similar_one_class(tweets2,"2nd")
Out[25]:
In [26]:
embeddings.most_similar_one_class(tweets2,"3rd")
Out[26]:
In [27]:
embeddings.most_similar_one_class(tweets2,"4th")
Out[27]:
In [28]:
embeddings.most_similar_one_class(tweets2,"6th")
Out[28]:
In [29]:
embeddings.most_similar_one_class(tweets2,"7th")
Out[29]: