Jeopardy Questions

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here.

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

  • Show Number -- the Jeopardy episode number of the show this question was in.
  • Air Date -- the date the episode aired.
  • Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
  • Category -- the category of the question.
  • Value -- the number of dollars answering the question correctly is worth.
  • Question -- the text of the question.
  • Answer -- the text of the answer.

In [2]:
import pandas as pd

# Read the dataset into a Pandas DataFrame
jeopardy = pd.read_csv('../data/jeopardy.csv')

# Print out the first 5 rows
jeopardy.head(5)


Out[2]:
Show Number Air Date Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams

In [3]:
# Print out the columns
jeopardy.columns


Out[3]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# Remove the spaces from column names
col_names = jeopardy.columns
col_names = [s.strip() for s in col_names]
jeopardy.columns = col_names
jeopardy.columns


Out[4]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Normalizing Text

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you lowercase words and remove punctuation so Don't and don't aren't considered to be different words when you compare them.


In [5]:
import re
def normalize_text(text):
    """ Function to normalize questions and answers.
    
    @param text : str - input string
    @return str - normalized version of input string
    """
    # Convert the string to lowercase
    text = text.lower()
    
    # Remove all punctuation from the string (raw string avoids invalid escape warnings)
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    
    # Return the normalized string
    return text

In [6]:
# Normalize the Question column
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)

In [7]:
# Normalize the Answer column
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)

Normalizing columns

Now that you've normalized the text columns, there are also some other columns to normalize.

The Value column should also be numeric, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.


In [8]:
jeopardy.dtypes


Out[8]:
Show Number        int64
Air Date          object
Round             object
Category          object
Value             object
Question          object
Answer            object
clean_question    object
clean_answer      object
dtype: object

In [9]:
def normalize_values(text):
    """ Function to normalize numeric values.
    
    @param text : str - input value as a string
    @return int - integer value, or 0 if conversion fails
    """
    # Remove any punctuation in the string (raw string avoids invalid escape warnings)
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    
    # Convert the string to an integer, assigning 0 on error (e.g. "None" values)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text

In [10]:
# Normalize the Value column
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [11]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [12]:
jeopardy.dtypes


Out[12]:
Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

Answers In Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

  • How often the answer is deducible from the question.
  • How often new questions are repeats of older questions.

You can answer the first question by seeing how many times words in the answer also occur in the question, and the second by seeing how often complex words (> 6 characters) reoccur. We'll work on the first question now, and come back to the second.


In [13]:
def count_matches(row):
    """ Function to take in a row of jeopardy as a Series and count the
    number of terms in the answer which also appear in the question.
    
    @param row : pd.Series - row from jeopardy DataFrame
    @return float - fraction of answer terms found in the question
    """
    # Split the clean_answer and clean_question columns on the space character
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    # "The" doesn't have any meaningful use in finding the answer
    if "the" in split_answer:
        split_answer.remove("the")
    
    # Prevent a division by 0 error later
    if len(split_answer) == 0:
        return 0
    
    # Loop through each item in split_answer and see if it occurs in split_question
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    
    # Divide match_count by the length of split_answer, and return the result
    return match_count / len(split_answer)

In [14]:
# Count how many times terms in clean_answer occur in clean_question
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [15]:
# Find the mean of the answer_in_question column
jeopardy["answer_in_question"].mean()


Out[15]:
0.060493257069335914

Answer terms in the question

The answer only appears in the question about 6% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

Recycled Questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:

  • Sort jeopardy in order of ascending air date.
  • Maintain a set called terms_used that will be empty initially.
  • Iterate through each row of jeopardy.
  • Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    • If it does, increment a counter.
    • Add each word to terms_used.

This will enable you to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables you to filter out words like the and than, which are commonly used but don't tell you a lot about a question.


In [16]:
# Create an empty list and an empty set
question_overlap = []
terms_used = set()

In [17]:
# Use the iterrows() DataFrame method to loop through each row of jeopardy
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()


Out[17]:
0.6908737315671878

Question overlap

There is about 69% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it compares single terms rather than phrases, so the result isn't conclusive on its own, but it does suggest that question recycling is worth investigating further.

Low Value Vs High Value Questions

Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

  • Low value -- Any row where Value is $800 or less.
  • High value -- Any row where Value is greater than $800.

You'll then be able to loop through each of the terms from the last screen, terms_used, and:

  • Find the number of low value questions the word occurs in.
  • Find the number of high value questions the word occurs in.
  • Find the percentage of questions the word occurs in.
  • Based on the percentage of questions the word occurs in, find expected counts.
  • Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.


In [18]:
def determine_value(row):
    """ Determine if this is a "Low" or "High" value question.
    
    @param row : pd.Series - row from jeopardy DataFrame
    @return int - 1 if High Value, 0 if Low Value
    """
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [19]:
def count_usage(term):
    """ Take in a word, loop through each row in the jeopardy DataFrame,
    and count its usage.  Usage is counted separately for High Value and
    Low Value rows.
    
    @param term : str - word to count usage for
    @return (int, int) - (high_count, low_count)
    """
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Pick a small sample of terms (note: set ordering is arbitrary)
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected


Out[19]:
[(1, 0), (0, 1), (0, 1), (1, 2), (1, 0)]

Applying the Chi-Squared Test

Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.


In [20]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared


Out[20]:
[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.031881167234403623, pvalue=0.85828871632352932),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047)]

Chi-squared results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, all of the observed frequencies were lower than 5, so the chi-squared approximation isn't as valid. It would be better to run this test with only terms that have higher frequencies.
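As a sketch of that improvement, terms could be filtered by overall frequency before testing. The `frequent_terms` helper below is hypothetical (not part of the mission code) and assumes a list of already-normalized question strings, like the values in `clean_question`:

```python
from collections import Counter

def frequent_terms(clean_questions, min_count=5, min_length=6):
    """Return terms at least min_length characters long that appear in at
    least min_count questions, so expected counts in the test aren't tiny."""
    counts = Counter()
    for question in clean_questions:
        # Count each term at most once per question
        counts.update(set(w for w in question.split() if len(w) >= min_length))
    return [term for term, n in counts.items() if n >= min_count]

# Toy example: only "planet" appears in at least 5 questions
questions = ["planet orbits a star", "planet with rings", "red planet mars",
             "planet nine theory", "planet orbits slowly"]
print(frequent_terms(questions, min_count=5))  # ['planet']
```

On the real dataset you would pass `jeopardy["clean_question"]` and feed the surviving terms to `count_usage` instead of an arbitrary sample of `terms_used`.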

Next Steps

Here are some potential next steps:

  • Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    • Manually create a list of words to remove, like the, than, etc.
    • Find a list of stopwords to remove.
    • Remove words that occur in more than a certain percentage (like 5%) of questions.
  • Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    • Use the apply method to make the code that calculates frequencies more efficient.
    • Only select terms that have high frequencies across the dataset, and ignore the others.
  • Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    • See which categories appear the most often.
    • Find the probability of each category appearing in each round.
  • Use the whole Jeopardy dataset (available here) instead of the subset we used in this mission.
  • Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
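As a rough sketch of the phrase idea, overlap could be computed over word bigrams (adjacent word pairs) instead of single terms. The helpers below are illustrative only, not mission code, and assume the question text has already been normalized:

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a normalized question."""
    words = text.split()
    return set(zip(words, words[1:]))

def bigram_overlap(questions):
    """For each question, the fraction of its bigrams seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        grams = bigrams(question)
        if grams:
            overlaps.append(len(grams & seen) / len(grams))
        else:
            overlaps.append(0)
        seen |= grams
    return overlaps

# "capital of" recurs, so the second question shows 50% bigram overlap
print(bigram_overlap(["capital of france", "capital of spain"]))  # [0.0, 0.5]
```

Because bigrams carry more context than single words, this measure should be less prone to counting coincidental word reuse as recycling.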

In [ ]: