I've been watching Jeopardy for a while now and have suspected that there are common themes for certain answers. For example, if the words "planet" and "acid" appear in the same clue, then it's likely that the answer is "Venus" because of its clouds of sulfuric acid. To investigate this, I used the database of Jeopardy questions compiled by the j-archive, which was scraped by reddit user trexmatt and posted as a JSON file.
This JSON file is described as follows:
The JSON file is an unordered list of questions, where each question has:
'category' : the question category, e.g. "HISTORY"
'value' : Dollar value of the question as string, e.g. "$200" (Note: This is "None" for Final Jeopardy! and Tiebreaker questions)
'question' : text of the question (Note: This sometimes contains hyperlinks and other messy text, such as when there's a picture or video clue)
'answer' : text of answer
'round' : one of "Jeopardy!","Double Jeopardy!","Final Jeopardy!" or "Tiebreaker" (Note: Tiebreaker questions do happen but they're very rare (like once every 20 years))
'show_number' : string of show number, e.g '4680'
'air_date' : the show air date in format YYYY-MM-DD
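To make the schema concrete, here is a hypothetical record matching the field descriptions above (all values invented for illustration, not taken from the actual file):

```python
# A hypothetical record illustrating the schema (values invented).
sample = {
    'category': 'HISTORY',
    'value': '$200',
    'question': "'This ancient city was famously destroyed by a wooden horse trick'",
    'answer': 'Troy',
    'round': 'Jeopardy!',
    'show_number': '4680',
    'air_date': '2004-12-31',
}
```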
To get started with this jupyter notebook, first download the Jeopardy json file and place it in the same directory as this notebook.
I'll be using the pandas package for Python to explore the dataset.
In [1]:
# First import the pandas package and load the json file into a dataframe called df.
import pandas as pd
df = pd.read_json('JEOPARDY_QUESTIONS1.json')
In [2]:
# Let's take a look at the first few rows.
df.head()
Out[2]:
In [3]:
# While it's fun that the first question is about astronomy,
# I can't see the whole question. Let's fix that.
pd.set_option('display.max_colwidth', 300)
df.head()
Out[3]:
In [4]:
# Now let's see how much data we have.
df.count()
Out[4]:
In [5]:
# There are 216,930 questions in the set,
# but the 'value' column is missing some.
# It must have null values for the Final Jeopardy!
# and Tiebreaker rounds. Let's check to make sure.
df[df['value'].isnull()]['round'].value_counts()
Out[5]:
In [6]:
# Make sure everything adds up.
213296+3631+3
Out[6]:
In [7]:
# I'm curious about those tiebreaker questions...
df[df['round'] == "Tiebreaker"]
Out[7]:
It looks like Tiebreakers happen rarely, roughly every 5 years, rather than every 20 years as stated in the file description.
In [8]:
# Now let's take a look at the top categories
# by creating a list of the categories and their counts
category_counts = df['category'].value_counts()
# Here are the top categories in this list.
category_counts[:15]
Out[8]:
In [9]:
# Here are some questions from the most popular category, Before & After.
df[df['category']=='BEFORE & AFTER'].head()
Out[9]:
I enjoy wordplay, and Before & After is one of my favorite categories, but these questions can be TOUGH. Jeremy Singer-Vine did some Jeopardy analysis and found that Before & After is the number one category in Double Jeopardy as well. It makes sense that this difficult category appears in Double Jeopardy, the more lucrative of the two main rounds.
Let's go back to the idea that keywords in a clue can hint at what the answer is without needing to understand the clue in its entirety, as in the case of planet + acid = Venus. Let's see if this holds for Venus and whether I can find more of these patterns.
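Combining two boolean masks with `&` is the idiom used here. As a quick sanity check of how it behaves, here it is on a toy DataFrame of invented clues; note that `str.contains` is case-sensitive by default:

```python
import pandas as pd

# Toy DataFrame of invented clues to demonstrate combining
# two str.contains masks with the & operator.
toy = pd.DataFrame({'question': [
    'This planet has clouds of sulfuric acid',
    'This planet is known as the Red Planet',
    'This acid is found in car batteries',
]})

# Only rows containing BOTH keywords survive the combined filter.
both = toy[toy['question'].str.contains('planet') & toy['question'].str.contains('acid')]
```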
In [10]:
df[(df["question"].str.contains('planet')) & (df["question"].str.contains('acid'))]
Out[10]:
In [11]:
# Here's another fun one :)
df[(df["question"].str.contains('sandworms'))]
Out[11]:
After playing with the data and getting a feel for the format of the columns and rows, it's a good idea to start cleaning it up. This will help when I do more statistical work later on. Also, once I have a nice clean dataset, I can explore the idea of keywords in more detail.
To get started, let's see what sort of data types we are dealing with.
In [12]:
# Check data types
df.dtypes
Out[12]:
The object data type here means these columns hold strings. Let's convert some of these columns to a more useful format and clean up the strings.
In [13]:
# Convert 'air_date' to date/time format.
df['air_date'] = pd.to_datetime(df['air_date'], yearfirst=True)
# Convert 'value' to float after removing the dollar signs and commas.
df['value'] = df['value'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
In [14]:
# Check data types again.
df.dtypes
Out[14]:
In [15]:
# We should also remove the html text from the questions.
# Let's take a look at the questions with links in them.
df[df['question'].str.contains('http')].head(3)
Out[15]:
In [16]:
# This step can be skipped when working with the notebook.
# It is only needed to display the HTML correctly on the blog page.
# (Note: cgi.escape was removed in Python 3.8; html.escape is its replacement.)
import cgi
df['question_htmlview'] = df['question'].apply(lambda x: cgi.escape(x))
df[df['question'].str.contains('http')].head(3)
Out[16]:
In [17]:
# How important are these questions with links?
x = df[df['question'].str.contains('http')]['question'].count()
y = df['question'].count()
float(x)/float(y)
Out[17]:
About 5% of Jeopardy questions contain a link, which may point to a relevant image or sound file. I won't miss much if I ignore the information behind the links, i.e., the pictures, videos, and audio they reference.
In order to remove the html text within the angled brackets, I initially used the following regular expression replacement:
df['question'] = df['question'].str.replace('<.*>','')
My first attempt was close to correct, but I only discovered later, by accident, that my regular expression was removing too much text.
The question

<a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" target="_blank">This dog breed seen here</a> is a loyal and protective companion

was modified to

is a loyal and protective companion

Uh-oh. It should be: This dog breed seen here is a loyal and protective companion
The regular expression was removing everything between the very first angled bracket and the very last angled bracket. This is because regular expression quantifiers are "greedy" by default; they try to match as much text as possible. Luckily, this can be turned off by adding a question mark after the quantifier. The Google Python class gives a really nice explanation of this.
To fix my problem I added a question mark. (Many problems in life can be fixed by asking a question. ;) )
df['question'] = df['question'].str.replace('<.*?>', "")
(By the way, I found this mistake by chance when checking the JSON file I create at the end of this notebook. The first rows of the file were reordered to show the problem question above. Argh!)
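The greedy-vs-non-greedy difference can be demonstrated directly with Python's `re` module on a shortened version of the problem question:

```python
import re

# Shortened version of the problem question, with an opening and closing tag.
s = '<a href="x.jpg">This dog breed seen here</a> is a loyal companion'

# Greedy: '<.*>' matches from the FIRST '<' all the way to the LAST '>',
# so the visible link text is swallowed along with the tags.
greedy = re.sub('<.*>', '', s)

# Non-greedy: '<.*?>' matches as little as possible, so each tag
# is removed separately and the link text survives.
lazy = re.sub('<.*?>', '', s)
```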
In [18]:
# Let's do this! Remove the text within <...> to get rid of the links.
df['question'] = df['question'].str.replace('<.*?>', "") #GREEDY --> replace('<.*>','')
# Check results: both of these should come back empty.
print(df[df['question'].str.contains('<')].head())
print(df[df['question'].str.contains('>')].head())
In [19]:
# Check the problem question from above.
df[df['question'].str.contains('protective companion')]
Out[19]:
In [20]:
# Check if there are any other links.
df[df['question'].str.contains('http')]
Out[20]:
It looks like all the http links have been removed except the first one shown above. I'll fix that one by hand.
In [21]:
df['question'] = df['question'].str.replace('http://www.j-archive.com/','')
df['question'] = df['question'].str.replace('showguide_bioalex.php','')
# Check work
df[df['question'].str.contains('Read all about me')]
Out[21]:
In [22]:
# Now let's remove the beginning and ending quotation marks.
df['question'] = df['question'].str.strip("'")
# Let's also remove the column "question_htmlview"
del df['question_htmlview']
The dollar amounts on Jeopardy used to be half of what they are now. The first Jeopardy round's dollar amounts originally ranged from \$100 to \$500; today they range from \$200 to \$1000. For the Double Jeopardy round, the dollar amounts now range from \$400 to \$2000. This changed on November 26, 2001. (Thank you, Wikipedia.)
To make the dollar amounts consistent across the years, my plan is to multiply dollar amounts before November 26, 2001 by a factor of 2. But first I'll check that the data appears to reflect this transition date and then I'll change the dollar amounts and check the result.
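The planned adjustment can be sketched on a toy frame (invented rows). Conditional assignment with `df.loc[mask, column]` only touches the rows where the mask is True, and pandas coerces the date string when comparing it against a datetime column:

```python
import pandas as pd

# Toy frame with one pre-2001 row and one post-2001 row (invented values).
toy = pd.DataFrame({
    'air_date': pd.to_datetime(['1995-05-01', '2005-05-01']),
    'value': [100.0, 200.0],
})

# Double only the values aired before the November 26, 2001 changeover.
toy.loc[toy['air_date'] < '2001-11-26', 'value'] *= 2.0
```

After this, both rows hold comparable dollar amounts.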
In [23]:
# This command is needed in order to take advantage of matplotlib features.
import matplotlib.pyplot as plt
# The following command allows plots to appear in the jupyter notebook.
%matplotlib inline
ax = df.plot(x='air_date', y='value', style = '.',
legend = False, figsize=(14,4), fontsize = 15)
ax.set_xlabel("Air Date",fontsize = 20)
ax.set_ylabel("Dollar Amounts",fontsize = 20)
Out[23]:
In [24]:
# Use a semicolon at the end of the last line to remove the "matplotlib" text above the plot.
ax = df.plot(x='air_date', y='value', style = '.',
legend = False, figsize=(14,4), fontsize = 15)
ax.set_xlabel("Air Date",fontsize = 20)
ax.set_ylabel("Dollar Amounts",fontsize = 20);
Now, this plot shows the dollar amounts for all clues, even the Daily Doubles and Final Jeopardy clues, which can be any dollar amount because the contestant sets the amount. If I look only in the dollar range that is offered by plain old Jeopardy clues I'll get a clearer picture of what is going on. The smallest offered value is \$100 from before 2001 and the largest is \$2000 from after 2001.
Another thing I noticed about this plot is that there doesn't seem to be the same density of points before 1997. I'll look into that after I adjust the dollar amounts.
In [25]:
# Zooming in on the non-Daily Double dollar range, from $100-$2000
ax = df.plot(x='air_date', y='value', style = '.',
legend = False, figsize=(14,4), fontsize = 15)
ax.set_xlabel("Air Date",fontsize = 20)
ax.set_ylabel("Dollar Amounts",fontsize = 20)
ymin = 0.0
ymax = 2100.0
ax.set_ylim(ymin,ymax);
Take a look at the dollar amounts for \$100, \$300, and \$500. Before 2001, every Jeopardy episode had at least one clue worth \$100, \$300, and \$500. But notice that after 2001, clues at these amounts become sparser; the points that remain come from wagers only, not from regular board clues. So it looks like the dollar amounts in the dataset haven't been adjusted in any way.
In order to get all of the clues normalized to the same dollar amount, I'm going to multiply values before 2001 by a factor of 2. This way, when we compare values at any time, we'll be making a fair comparison that takes into account the "inflation" of the dollar amounts.
In [26]:
# Make a new column that contains the original value
# so we can compare the values before and after.
df['value_original'] = df['value']
df.head()
Out[26]:
In [27]:
# Double values for the dataframe where date is before November 26, 2001.
df.loc[df['air_date'] < "2001-11-26", ['value']] = 2.0 * df['value']
In [28]:
# Check work. Original_value and value should differ by a factor of two.
df[df['air_date']<"2001-11-26"].head()
Out[28]:
It looks like the doubling worked. Let's visualize this by zooming in on the dollar amounts most affected -- the range between \$0 and \$2000.
In [29]:
fig, ax = plt.subplots(2,1,figsize=(14,8),sharex=True)
df.plot(x='air_date', y='value_original', style = '.', alpha = 0.005,
legend = False,ax=ax[0], fontsize = 15);
df.plot(x='air_date', y='value', style = '.', alpha = 0.005,
legend = False, ax=ax[1], fontsize = 15);
ymin = 0.0
ymax = 2100.0
ax[0].set_title('BEFORE Doubling Dollar Values',fontsize = 20);
ax[0].set_ylim(ymin,ymax);
ax[1].set_title('AFTER Doubling Dollar Values',fontsize = 20);
ax[1].set_ylim(ymin,ymax)
plt.xlabel("Air Date",fontsize = 20);
plt.ylabel("Dollar Values",fontsize = 20);
That looks good! Notice that in the "BEFORE plot," the most common values under \$500 before 2001 are \$100, \$200, \$300, \$400, and \$500. Compare this to the "AFTER plot" which has common values of \$200 and \$400 at all times. The doubling looks successful!
The previous plot shows that the data wasn't sampled very evenly in time. Before around 1997, the data looks sparse.
How many questions should there be every year? Assuming all questions are revealed in every game, what is the maximum number of questions per year? There are 30 questions in each of the two main rounds, plus the Final Jeopardy question, so there are at most 61 questions per game. Then, assuming 52 weeks per year and 5 games per week, there should be at most 15,860 questions per year.
$ (61 \textrm{ questions/game}) \times (5 \textrm{ games/week}) \times (52 \textrm{ weeks/year}) = 15860 \textrm{ questions/year } $
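The arithmetic above is easy to verify directly:

```python
# Quick arithmetic check of the per-year ceiling:
# two rounds of 30 clues plus the Final Jeopardy question.
questions_per_game = 2 * 30 + 1
per_year = questions_per_game * 5 * 52  # 5 games/week, 52 weeks/year
```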
Now let's see how that compares with the data we have.
In [30]:
# Resample the data yearly ('A' = annual) and count the questions.
resample_df = df.resample('A', on='air_date')['question'].count()
resample_df = pd.DataFrame(resample_df)
resample_df.head()
Out[30]:
In [31]:
# Now plot it.
ax = resample_df.plot.bar(legend = False, figsize=(18,4), fontsize = 15)
ax.set_xlabel("Air Date",fontsize = 20)
ax.set_ylabel("Counts",fontsize = 20);
In [32]:
# Those x axis labels are pretty awful. Let's clean them up.
resample_df['year'] = resample_df.index.year
resample_df = resample_df.set_index('year')
In [33]:
ax = resample_df['question'].plot.bar(figsize=(18,4), fontsize = 15)
ax.set_xlabel("Air Date",fontsize = 20)
ax.set_ylabel("Counts",fontsize = 20);
Yes, it looks like the dataset was not sampled evenly.
In [34]:
# What is the maximum number of questions in a year?
resample_df.max()
Out[34]:
Remember, there should be at most 15,860 questions per year. The maximum value is just over 14,000, which makes sense because not every question is revealed in each round of Jeopardy.
Sometimes a whole column goes unrevealed, which accounts for about 10% of a game's questions (6/61). Assuming that some questions aren't revealed and that some years have fewer game days, I'll only use years within 20% of the maximum of 15,860.
In [35]:
0.8*15860
Out[35]:
In [36]:
resample_df[resample_df['question'] >= 12688]
Out[36]:
It looks like the years 1997-2000 and 2004-2011 are well-sampled.
In order to work with this cleaned dataset later, let's output it to another JSON file that pandas can read. This way I can start with this cleaned dataset next time.
In [37]:
# First remove unnecessary column.
del df['value_original']
# Now output the dataframe to a JSON file.
pd.DataFrame.to_json(df, 'JEOPARDY_QUESTIONS1_cleaned.json', date_format='iso') # Output date/times as strings.