Get the data and check it


In [1]:
# This command allows plots to appear in the jupyter notebook.
%matplotlib inline  
# Import the pandas package and load the cleaned json file into a dataframe called df.
import pandas as pd
df_input = pd.read_json('JEOPARDY_QUESTIONS1_cleaned.json')
# Division is float division
from __future__ import division

In [2]:
# Check on the dataframe.
pd.set_option('max_colwidth', 300)
df_input.head()


Out[2]:
air_date answer category question round show_number value
0 2004-12-31T00:00:00.000Z Copernicus HISTORY For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory Jeopardy! 4680 200.0
1 2004-12-31T00:00:00.000Z Jim Thorpe ESPN's TOP 10 ALL-TIME ATHLETES No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves Jeopardy! 4680 200.0
10 2004-12-31T00:00:00.000Z Jackie Gleason EPITAPHS & TRIBUTES "And away we go" Jeopardy! 4680 400.0
100 2010-07-06T00:00:00.000Z a German Shepherd JUST THE FACTS This dog breed seen here is a loyal and protective companion Double Jeopardy! 5957 1200.0
1000 2000-05-04T00:00:00.000Z Vanessa Williams MR. OR MS. WILLIAMS This devoted mom has been called the most famous Miss America of all time Double Jeopardy! 3619 2000.0

(Notice the fourth question, about the German Shepherd. This is the question that helped me find the regular expression error in my previous post.)


In [3]:
# Check the data types
df_input.dtypes


Out[3]:
air_date        object
answer          object
category        object
question        object
round           object
show_number      int64
value          float64
dtype: object

In [4]:
# Let's convert air_date to date/time, rather than a string.
df_input['air_date'] = pd.to_datetime(df_input['air_date'], yearfirst= True)

In [5]:
# Check data types again.
df_input.dtypes


Out[5]:
air_date       datetime64[ns]
answer                 object
category               object
question               object
round                  object
show_number             int64
value                 float64
dtype: object

In [6]:
# Make sure all the data is still there.
df_input.count()


Out[6]:
air_date       216930
answer         216930
category       216930
question       216930
round          216930
show_number    216930
value          213296
dtype: int64

In [7]:
# Let's only look at the years where the data is well-sampled.
df1 = df_input[(df_input['air_date'] >= '01-01-1997') & (df_input['air_date'] <= '12-31-2000')]
df2 = df_input[(df_input['air_date'] >= '01-01-2004') & (df_input['air_date'] <= '12-31-2011')]
df = pd.concat([df1, df2])

State of Jeopardy

One thing I noticed when looking at the categories is that geography seems to be a recurring topic. Is there a way I can find a common theme among the geography questions? Since geography is a large topic, I decided to narrow my focus and only look at questions where the answer was a U.S. state.


In [8]:
# Geography is a common theme for Jeopardy categories.
# What are the top categories?
category_counts = df['category'].value_counts() 
category_counts[:15]


Out[8]:
BEFORE & AFTER             413
SCIENCE                    276
LITERATURE                 257
AMERICAN HISTORY           255
WORD ORIGINS               252
POTPOURRI                  236
COLLEGES & UNIVERSITIES    224
BODIES OF WATER            220
AMERICANA                  214
U.S. CITIES                207
WORLD CAPITALS             206
STUPID ANSWERS             205
RHYME TIME                 199
WORLD HISTORY              199
COMMON BONDS               192
Name: category, dtype: int64

In [9]:
# How many questions are in the most popular category "BEFORE & AFTER"?
a = df['question'].count()
b = df[df['category']=='BEFORE & AFTER']['question'].count()
print "Total :", a
print "Before and After:", b
print "Percentage:", float(b)/float(a)


Total : 162384
Before and After: 413
Fraction: 0.00254335402503

About 0.25% of the questions in the dataset are from the BEFORE & AFTER category. How does this compare to the number of questions with a U.S. state as an answer? First, let's create a dataframe with only U.S. states as answers.


In [10]:
list_of_states = ['Alabama','Alaska','Arizona','Arkansas','California', 
                  'Colorado','Connecticut', 'Delaware', 'Florida','Georgia',
                  'Hawaii','Idaho','Illinois','Indiana', 'Iowa', 'Kansas',
                  'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts',
                  'Michigan','Minnesota','Mississippi', 'Missouri','Montana','Nebraska', 
                  'Nevada','New Hampshire', 'New Jersey','New Mexico', 'New York',
                  'North Carolina', 'North Dakota','Ohio','Oklahoma', 'Oregon',
                  'Pennsylvania', 'Rhode Island','South Carolina', 'South Dakota',
                  'Tennessee','Texas','Utah','Vermont', 'Virginia', 'Washington', 
                  'West Virginia', 'Wisconsin', 'Wyoming']
len(list_of_states)


Out[10]:
50

In [11]:
# Create new dataframe with only states as answers.
state_answers = df[df['answer'].isin(list_of_states)]

In [12]:
# How many of the data set questions have U.S. states as an answer?
c = state_answers['question'].count()
print "Total :", a
print "States:", c
print "Percentage:", float(c)/float(a)


Total : 162384
States: 2887
Fraction: 0.0177788452064

The top category, BEFORE & AFTER, contains only 0.25% of all the questions in the dataset. Questions whose answer is a U.S. state are substantially more common, making up about 1.8% of the dataset.


In [13]:
# Let's take a look at some of these questions.
state_answers.head()


Out[13]:
air_date answer category question round show_number value
1001 2000-05-04 Tennessee AIN'T THAT AMERICA From 1784 to 1788 the eastern part of this state was a separate state called Franklin Double Jeopardy! 3619 2000.0
100165 1999-01-04 Hawaii CLOTHES MAKE THE LAND It's the U.S. state where you'll find the native skirts seen here (grass skirts) Jeopardy! 3301 200.0
100950 1999-01-15 Georgia NATIONAL MONUMENTS Ocmulgee National Monument at Macon in this state preserves some prehistoric & historic Indian villages Double Jeopardy! 3310 800.0
100962 1999-01-15 Arizona NATIONAL MONUMENTS Organ Pipe Cactus National Monument is the largest of the 13 national monuments in this SW state Double Jeopardy! 3310 1600.0
101390 1997-07-18 Virginia AMERICANA Shoppers, take note: the Potomac Mills discount mall is this state's No. 1 tourist destination Jeopardy! 2985 200.0

What is the most popular state? Are some states more popular than others? If so, how much more popular?


In [14]:
# Count up how many answers there are for each state. 
# Sort them and print last few.
count_state_answers = state_answers.answer.value_counts()
count_state_answers.sort_values().tail()


Out[14]:
Florida       113
Alaska        115
Texas         116
Hawaii        116
California    133
Name: answer, dtype: int64

In [15]:
# Let's plot the counts too.
ax = count_state_answers.plot(kind='bar', figsize=(16,6), fontsize = 15)
ax.set_xlabel("State",fontsize = 20)
ax.set_ylabel("Number of Appearances",fontsize = 20);


Certain states seem to appear more frequently than others. Why is that?

My first thought was that it's somehow related to population. The most populous states in the U.S. are California, Texas, and Florida. But Alaska and Hawaii are the 3rd and 11th least populous states. So population alone doesn't explain a state's popularity on Jeopardy, though it may be somewhat related. I want to look into this metric in more detail, but before I get ahead of myself, let's make sure that the differences I am seeing are statistically significant.

I'll test this with a chi-square goodness-of-fit test, which compares observed values against expected values. For the expected values I'll use a flat (uniform) distribution, so the null hypothesis is that every state is equally likely to appear on Jeopardy, and the alternative hypothesis is that they are not equally likely. Here is the definition of the statistic $\chi^2$:

$ \chi ^2 = \sum \frac{(O-E)^2}{E}$

where $O$ is the observed value and $E$ is the expected value. Luckily the scipy package has a function for calculating this.


In [16]:
from scipy.stats import chisquare
chisq, p = chisquare(count_state_answers)
print 'Chisquare = ', chisq
print 'p = ', p


Chisquare =  552.608590232
p =  2.05724197003e-86

With such a low p-value, we can reject the null hypothesis that the distribution of U.S. states is flat. The differences we see in state popularity are statistically significant.
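
As a sanity check on the formula above, the same statistic can be computed by hand from count_state_answers, with a flat distribution as the expected counts (a minimal sketch; it should reproduce scipy's number):

# Hand-rolled chi-square goodness-of-fit statistic against a flat distribution.
import numpy as np
observed = count_state_answers.values.astype(float)
expected = np.full_like(observed, observed.sum()/len(observed))
chisq_manual = ((observed - expected)**2/expected).sum()
print 'Chisquare = ', chisq_manual   # should match scipy's value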

Money Money Money!

The point of Jeopardy is to answer questions correctly in order to collect the most money. Let's look at how learning about the states can help you earn that money!

Let's break it down.

  • Should a contestant study the states that appear the most frequently because they are more likely to appear?

    OR

  • Should a contestant study the states that appear less frequently because their dollar amounts are possibly higher than the more popular states?

Let me first look into the relationship between a state's popularity and its average dollar amount. Is there a relationship?

To answer this question, I'll create a new dataframe (state_data) with one row per state, as opposed to the dataframe I'm using now (state_answers), which has one row per question.


In [17]:
# Group the questions about states by their dollar values and find the mean.
avg_values = state_answers.groupby('answer')['value'].mean()
# Put state data together in a single dataframe.
state_data = pd.concat([count_state_answers, avg_values], axis = 1)
state_data.columns = ['total_count', 'value_avg']
state_data.head()


Out[17]:
total_count value_avg
Alabama 37 837.837838
Alaska 115 699.122807
Arizona 63 952.380952
Arkansas 34 750.000000
California 133 589.473684

In [18]:
# Plot the relationship between average dollar value 
# and total number of appearances on Jeopardy for each U.S. state.
ax = state_data.plot(x='value_avg', y='total_count',  kind = 'scatter', fontsize = 15)
ax.set_xlabel("Average Dollar Value", fontsize = 15)
ax.set_ylabel("Total Count", fontsize = 15);


The plot seems to show a negative correlation between popularity and dollar amount for U.S. states. Let's investigate how strong the correlation is using the Pearson correlation coefficient, $r$, computed with the function scipy.stats.pearsonr.


In [19]:
from scipy.stats import pearsonr
r, pvalue = pearsonr(state_data['value_avg'], state_data['total_count'])
print "p = ", pvalue
print "r = ", r


p =  0.00149233948007
r =  -0.437379909823

Let's take a look at what these $r$ and $p$ values mean. The null hypothesis states that there is no relationship between the variables. However, since $p$ is less than 0.05, I'll reject this hypothesis.

The value of $r$ can range from -1 to +1. A value of 0 indicates no correlation; negative and positive values indicate negative and positive correlations. In our case $r$ is negative, which agrees with the negative correlation seen in the plot. The magnitude of $r$ indicates the strength of the correlation: as it approaches +1 or -1, the correlation gets stronger. My value of $r \approx -0.44$ indicates only a moderate correlation.
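
For intuition about what pearsonr computes, $r$ can also be obtained directly from its definition, the covariance divided by the product of the standard deviations (a quick sketch using the state_data columns above; it should match the scipy value):

# Pearson r from its definition: cov(x, y) / (std(x) * std(y)).
import numpy as np
xv = state_data['value_avg'].values
yv = state_data['total_count'].values
r_manual = ((xv - xv.mean())*(yv - yv.mean())).mean() / (xv.std()*yv.std())
print 'r (manual) = ', r_manual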

When all else fails, fit a line!

I can also investigate the relationship between the average value and the total number of appearances for U.S. states using linear regression. Since the correlation isn't very strong, though, I'd be hesitant to use this model to make any confident predictions. Let's go through the steps anyway and see what we get. Scikit-learn has a useful tool for performing this operation.


In [20]:
# Calculate the parameters of the model.
from sklearn.linear_model import LinearRegression
x = state_data['value_avg']
y = state_data['total_count']
x = x.values.reshape(-1,1)
y = y.values.reshape(-1,1)
model = LinearRegression()
model.fit(x,y)


Out[20]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [21]:
# Plot the best fit line.
import matplotlib.pyplot as plt
plt.scatter(x,y)
plt.plot(x, model.predict(x))
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel("Average Dollar Value", fontsize = 15)
plt.ylabel("Total Count", fontsize = 15);



In [22]:
# Print the equation of the line.
# (model.coef_ and model.intercept_ hold the slope and intercept directly, but
# here we recover them from two predictions; predict expects a 2-D array, hence [[...]].)
m = (model.predict([[1000]]) - model.predict([[500]]))/500
m = float(m[0])
b = model.predict([[1000]]) - m*1000
b = float(b[0])
print "y = ", m, "x + ", b


y =  -0.090933237871 x +  135.258479027

In [23]:
# Use LinearRegression to get statistics also.
coefdeter = model.score(x,y)
print "Linear Regression r^2 = ", coefdeter
print "Pearson           r^2 = ", r*r


Linear Regression r^2 =  0.191301185517
Pearson           r^2 =  0.191301185517

It's pretty nice that LinearRegression's score method returns the coefficient of determination $R^2$, which for a simple linear regression is just the square of the Pearson $r$. Notice that it agrees with the value calculated previously by pearsonr.
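
As an aside, the same line can be recovered with numpy.polyfit; this quick cross-check (using the state_data columns, not part of the fit above) should reproduce the slope and intercept printed earlier:

# Fit a degree-1 polynomial; polyfit returns [slope, intercept].
import numpy as np
slope, intercept = np.polyfit(state_data['value_avg'].values,
                              state_data['total_count'].values, 1)
print "y = ", slope, "x + ", intercept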

I fit a line. Now what?

Although the fit between the popularity of a state and its average dollar amount is not as tight as I'd like it to be, I'm going to use it anyway to answer the question of whether it's a good idea to focus on studying the more lucrative, less popular states, or if it's better to study less lucrative but more popular states.

Looking back at the figure above, if you squint, you can see that the fitted line can be treated as a probability density function: for a given average dollar value, it predicts a number of questions. If I randomly pick a question about a U.S. state, it is more likely to have a lower dollar amount.

Probability density function $= mx+b$ (up to a normalization constant)

where $x$ = dollar amount.

Let's assume this holds over the range spanned by the per-state average dollar values, roughly \$600 to \$1100.

What if we conduct an experiment? Let's sample from our probability density function (PDF) as if we were on Jeopardy. Are we going to make more money if we correctly answer the popular, lower-dollar questions or the rare, higher-dollar questions?

I'm going to need to randomly sample from my given probability density function. To do that, first I'll have to normalize it. Then I'll calculate the cumulative distribution function (CDF), which I then have to invert. Once I have the inverted CDF, I can use a random number generator to give me numbers between 0 and 1. Finally, I'll plug these random numbers into my inverted CDF and I'll get random numbers that sample my PDF. (Here's a nice explanation of this procedure.)
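
To make the recipe concrete before applying it to my fitted line, here is a tiny toy example of the same procedure for an exponential distribution, whose CDF $1 - e^{-\lambda x}$ inverts by hand to $-\ln(1-u)/\lambda$ (the rate $\lambda$ below is made up purely for illustration):

# Toy inverse-transform sampling: draw from an exponential distribution.
import numpy as np
lam = 1.0/800.0                     # hypothetical rate, illustration only
u = np.random.uniform(0, 1, 5)      # uniform random numbers in [0, 1)
samples = -np.log(1.0 - u)/lam      # plug into the inverted CDF
print samples                       # exponentially distributed values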

To investigate this, first I'll normalize this function.

$ A \int^{1100}_{600} (mx+b) dx = 1$

$A \left(\frac{mx^2}{2} + bx\right) \Big]^{x=1100}_{x=600} = 1$

$A [(m/2 \cdot 1100^2 + b \cdot 1100) - (m/2 \cdot 600^2 + b \cdot 600)] = 1$

$ A = 1/ [(m/2 \cdot 1100^2 + b \cdot 1100) - (m/2 \cdot 600^2 + b \cdot 600)]$

Geesh, that's ugly. Let's have the computer deal with this ugliness.


In [24]:
# Normalize the function
upper = 1100.
lower = 600.
A = (  ((m*0.5*upper*upper) + (b*upper))-((m*0.5*lower*lower) + (b*lower))   )
A = 1./A
print A


3.4503444723e-05

The cumulative distribution function (CDF) is just the integral of the PDF.

$PDF = A (mx + b)$

$CDF(x) = \int PDF \; dx' = \int^{x'=x}_{x'=600} A\,(mx' + b)\, dx'$

Once we have the CDF, it will be in the form $y = f(x)$. I'll need to invert it so that it is $x = f^{-1}(y)$. This will let me plug in a random value from 0 to 1 and get out the dollar value that follows my derived PDF.

After some integration, I get the following for the CDF.

$l =$ \$600

$CDF = A \cdot (0.5\, m \cdot x^2 + bx - 0.5\, m \cdot l^2 - bl)$
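
Before trusting the closed form, a quick numerical spot-check with scipy.integrate.quad (a sketch using the A, m, b, and lower values computed above; the test point is arbitrary):

# Compare the closed-form CDF with direct numerical integration of the PDF.
from scipy.integrate import quad
test_x = 900.0   # arbitrary dollar amount inside the [600, 1100] range
cdf_closed = A*(0.5*m*test_x**2 + b*test_x - 0.5*m*lower**2 - b*lower)
cdf_numeric, err = quad(lambda xp: A*(m*xp + b), lower, test_x)
print cdf_closed, cdf_numeric   # the two values should agree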


In [25]:
# Plot CDF.
# Input x
x = range(int(lower), int(upper), 50)
# Output y
y = [(0.5*m*float(xx)*float(xx) + b*float(xx) - 0.5*m*lower*lower - b*lower)*A for xx in x]
plt.scatter(x,y)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel("x: Dollar amount", fontsize = 15)
plt.ylabel("y: CDF", fontsize = 15);


The next step is to invert the CDF, i.e., solve for $x$. Now, you might have noticed that the CDF is a quadratic function. Awesome! To solve for $x$, I'll need to complete the square. Fun!

Eventually, I end up with...

$ x = \pm \sqrt{\left(\frac{b^2}{2m} - Q\right)\frac{2}{m}} - \frac{b}{m}$

where $Q = -y/A - 0.5ml^2 -bl$
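
(If you'd rather not trust the hand algebra, the quadratic can also be inverted symbolically; a minimal sketch, assuming the sympy package is available:)

# Symbolically invert the CDF: solve A*(0.5*m*x**2 + b*x - 0.5*m*l**2 - b*l) = y for x.
import sympy as sp
xs, ys, As, ms, bs, ls = sp.symbols('x y A m b l')
cdf_expr = As*(sp.Rational(1, 2)*ms*xs**2 + bs*xs - sp.Rational(1, 2)*ms*ls**2 - bs*ls)
roots = sp.solve(sp.Eq(cdf_expr, ys), xs)
print roots   # two roots of the quadratic; compare with the hand-derived formula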

Now, I'm pretty confident in my math skills, but not confident enough to move forward without double-checking them. So, I plotted the inverted CDF to make sure it matches the original CDF.


In [26]:
# Plot the inverted CDF in order to check the math. The plot should look the same as the previous plot.
# Plotting will also confirm that using the negative root is the way to go.
# Input y
yy = [0.1,0.2, 0.4,0.5, 0.6, 0.7,0.8,0.9,1.]
# Output x
QQ = [-y/A -0.5*m*lower*lower -b*lower for y in yy]
x = [((((b*b/2./m -Q)*2./m)**(0.5))*-1.)-(b/m) for Q in QQ]
plt.scatter(x,yy)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.xlabel("x: Dollar amount", fontsize = 15)
plt.ylabel("y: CDF", fontsize = 15);


The plot looks good. Let's go ahead and define a function that will calculate the inverse of the CDF.


In [27]:
# Function calculates the inverse of the CDF.
# yy: input vector of probabilities between 0 and 1.
# A, b, m, and lower are the parameters derived above.
# Returns the corresponding dollar amounts.
def inverse_CDF(yy, A, b, m, lower):
    QQ = [-y/A - 0.5*m*lower*lower - b*lower for y in yy]
    return [((((b*b/2./m - Q)*2./m)**(0.5))*-1.) - (b/m) for Q in QQ]
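
As a quick round-trip check (a sketch using the A, b, m, and lower values from above): feeding a few probabilities through inverse_CDF and then through the forward CDF should return the original probabilities.

# Round-trip check: apply the forward CDF to the inverse CDF's output;
# we should recover the original probabilities (up to floating-point error).
probs = [0.1, 0.5, 0.9]
dollars = inverse_CDF(probs, A, b, m, lower)
recovered = [A*(0.5*m*d*d + b*d - 0.5*m*lower*lower - b*lower) for d in dollars]
print recovered   # should be close to [0.1, 0.5, 0.9]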

Sampling the Probability Density Function

Great! Now we can get to the fun part. With the inverted CDF, I can now give it random numbers between 0 and 1 and it will output samples from my original probability density function. Let's give it a try.


In [28]:
# Import numpy for its random number generators.
import numpy as np

In [29]:
# Let's start with 5 samples and see the dollar amounts we end up with.
number_of_samples = 5
random0to1 = np.random.uniform(0, 1, number_of_samples)
print 'Random numbers between 0 and 1'
print random0to1
y = inverse_CDF(random0to1, A, b, m, lower)
print 'Dollar amounts sampled from our PDF'
print y


Random numbers between 0 and 1
[ 0.53194885  0.06793867  0.02837611  0.31778805  0.78809509]
Dollar amounts sampled from our PDF
[817.76583492251177, 624.74493254867093, 610.25038496211164, 722.60122819452704, 953.41214151219697]

Awesome! Now back to my main question. Is it better to study states that appear less frequently but yield a higher per-question dollar amount, or to study the more frequently appearing but less lucrative states?

I'm going to sample my probability distribution 5 times, representing 5 U.S. state clues per game. For each game, I'll calculate how much money a contestant would earn if she correctly answered only the high-dollar questions, and how much she would earn if she correctly answered only the low-dollar questions. I'll assume one of her opponents answered the remaining U.S. state clues correctly. Of the U.S. state questions offered in a Jeopardy game, how much more money will she have than her opponent?


In [30]:
number_of_clues = 5
number_of_games = 300
# Midway dollar amount. Divides high-dollar from low-dollar questions.
high_low = (upper-lower)/2.0 + lower
bigmoneylist = []
smallmoneylist = []
difflist = []
for i in range(number_of_games):
    random0to1 = np.random.uniform(0, 1, number_of_clues)
    y = inverse_CDF(random0to1, A, b, m, lower)
    # Sum up total possible money to be earned in 1 game.
    totalmoney = sum(y)
    # Sum of low popularity, high dollar amount clues.
    bigmoney = 0
    # Sum of high popularity, low dollar amount clues.
    smallmoney = 0
    for num in y:
        if num >= high_low:
            bigmoney += num
        else:
            smallmoney += num
    bigmoneylist.append(bigmoney)
    smallmoneylist.append(smallmoney)
    difflist.append((smallmoney-bigmoney)/totalmoney*100)

In [31]:
# Number of bins in histogram
binnum = 5

In [32]:
plt.hist(difflist, normed=True, bins=binnum)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 8)
plt.title('% Difference between Small and Large Value Clues', fontsize = 15)
plt.xlabel("Percentage Difference", fontsize = 15)
plt.ylabel('Probability', fontsize = 15);


Let me take a minute to think about this plot. In the positive region, from 0 to 100%, our Jeopardy player answers all of the small-value clues in a game correctly while her opponent makes money off of the large-value clues. In the negative region, our contestant is the one answering all of the large-value clues correctly. The distribution looks fairly even, so let's look into the details. What is the average? What is the median?


In [33]:
print 'Average:  ', np.mean(difflist)
print 'Median :  ', np.median(difflist)


Average:   5.70473591151
Median :   4.63906319201

Hmm... that's about what I expected from looking at my histogram. Our contestant will earn roughly 5% more money than her opponent if she studies the popular Jeopardy states and gets all of her questions right. I'm guessing that being able to push the buzzer before an opponent who might also know the answers to the popular-state questions will probably wash away this small bump from the extra studying. But who knows, it might make the extra difference.