GP12: Jeopardy Questions

1. Read Data



In [1]:

    
import pandas

jeopardy = pandas.read_csv("../data/GP12/jeopardy.csv")

jeopardy.head(5)









    Out[1]:







  
    
      
   Show Number  Air Date    Round      Category                         Value  Question                                           Answer
0  4680         2004-12-31  Jeopardy!  HISTORY                          $200   For the last 8 years of his life, Galileo was ...  Copernicus
1  4680         2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe
2  4680         2004-12-31  Jeopardy!  EVERYBODY TALKS ABOUT IT...      $200   The city of Yuma in this state has a record av...  Arizona
3  4680         2004-12-31  Jeopardy!  THE COMPANY LINE                 $200   In 1963, live on "The Art Linkletter Show", th...  McDonald's
4  4680         2004-12-31  Jeopardy!  EPITAPHS & TRIBUTES              $200   Signer of the Dec. of Indep., framer of the Co...  John Adams



In [2]:

    
jeopardy.columns









    Out[2]:





Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')



In [3]:

    
# The CSV header pads every name after the first with a leading space; reset the names without it.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
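
The leading spaces seen in Out[2] come straight from the CSV header. As an alternative to retyping every name, here is a minimal sketch that strips the whitespace with `Index.str.strip()`. It assumes surrounding whitespace is the only problem with the names, and the small `demo` frame is a hypothetical stand-in for the real data.

```python
import pandas

# Stand-in frame whose headers mimic the padded names shown in Out[2].
demo = pandas.DataFrame(
    [[4680, "2004-12-31", "Jeopardy!"]],
    columns=["Show Number", " Air Date", " Round"],
)

# Strip surrounding whitespace from every column name in one step.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))  # ['Show Number', 'Air Date', 'Round']
```

This keeps working if columns are added or reordered in the file, whereas the hard-coded list has to match the file exactly.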

2. Normalizing Text



In [4]:

    
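import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits, and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip punctuation such as "$" and ",", then convert to an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Anything that is not a number becomes 0.
        text = 0
    return text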
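
To make the behaviour of the two helpers concrete, the sketch below repeats them in compact form and applies them to a few sample strings. `"McDonald's"` and `"$200"` appear in Out[1]; `"$2,000"` and `"None"` are made-up inputs chosen to show the comma and fallback cases.

```python
import re

def normalize_text(text):
    # Lowercase and remove everything except letters, digits, and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

def normalize_values(text):
    # Remove punctuation, then fall back to 0 when no integer remains.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

print(normalize_text("McDonald's"))   # mcdonalds
print(normalize_values("$200"))       # 200
print(normalize_values("$2,000"))     # 2000 (illustrative input)
print(normalize_values("None"))       # 0 (illustrative input)
```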



In [5]:

    
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)



In [6]:

    
jeopardy.head(5)









    Out[6]:







  
    
      
   Show Number  Air Date    Round      Category                         Value  Question                                           Answer      clean_question                                     clean_answer
0  4680         2004-12-31  Jeopardy!  HISTORY                          $200   For the last 8 years of his life, Galileo was ...  Copernicus  for the last 8 years of his life galileo was u...  copernicus
1  4680         2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  no 2 1912 olympian football star at carlisle i...  jim thorpe
2  4680         2004-12-31  Jeopardy!  EVERYBODY TALKS ABOUT IT...      $200   The city of Yuma in this state has a record av...  Arizona     the city of yuma in this state has a record av...  arizona
3  4680         2004-12-31  Jeopardy!  THE COMPANY LINE                 $200   In 1963, live on "The Art Linkletter Show", th...  McDonald's  in 1963 live on the art linkletter show this c...  mcdonalds
4  4680         2004-12-31  Jeopardy!  EPITAPHS & TRIBUTES              $200   Signer of the Dec. of Indep., framer of the Co...  John Adams  signer of the dec of indep framer of the const...  john adams

3. Normalizing Columns



In [7]:

    
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])

4. Answers In Questions



In [8]:

    
jeopardy.dtypes









    Out[8]:





Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object



In [9]:

    
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" is common and carries no information, so drop its first occurrence.
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    # Fraction of the remaining answer terms that also occur in the question.
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)



In [10]:

    
jeopardy["answer_in_question"].mean()









    Out[10]:





0.060493257069335914

Answer terms in the question

On average, only about 6% of the terms in an answer also appear in its question. That share is too small to rely on deducing the answer from the question alone, so we'll probably have to study.
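
As a worked example of this metric, the sketch below applies the same logic to two hand-made rows. The function name `answer_term_fraction` and both rows are hypothetical and not taken from the dataset; plain dictionaries stand in for DataFrame rows.

```python
def answer_term_fraction(row):
    """Fraction of answer terms (ignoring one "the") found in the question."""
    answer_terms = row["clean_answer"].split(" ")
    question_terms = row["clean_question"].split(" ")
    if "the" in answer_terms:
        answer_terms.remove("the")
    if not answer_terms:
        return 0
    matches = sum(1 for term in answer_terms if term in question_terms)
    return matches / len(answer_terms)

# Hypothetical rows for illustration only.
partial = {
    "clean_question": "this tower in paris was completed in 1889",
    "clean_answer": "the eiffel tower",
}
none = {
    "clean_question": "this state has a record average of sunshine",
    "clean_answer": "arizona",
}

print(answer_term_fraction(partial))  # 0.5: "tower" matches, "eiffel" does not
print(answer_term_fraction(none))     # 0.0: no answer term occurs in the question
```

The column mean is therefore an average of per-row fractions, not the share of questions that contain their full answer.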

5. Recycled Questions



In [11]:

    
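question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    # Keep only longer words (six or more characters) to skip common filler terms.
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    # Count the words that already appeared in an earlier question.
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # Record this question's words for later rows.
    for word in split_question:
        terms_used.add(word)
    # Convert the count to a fraction of this question's long words.
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap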

jeopardy["question_overlap"].mean()









    Out[11]:





0.6908737315671878

Question overlap

On average, about 70% of the terms in a question (counting only words of six or more characters) have already appeared in an earlier question.
This only looks at a small set of questions, and it compares single terms rather than phrases.
On its own the figure is therefore weak evidence, but it does suggest that the recycling of questions is worth investigating further.
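
To show how the overlap score behaves, here is a minimal sketch of the same loop over four made-up questions. The function name `question_overlaps` and the sample strings are hypothetical; the six-character minimum mirrors the `len(q) > 5` filter used above.

```python
def question_overlaps(clean_questions, min_length=6):
    """For each question, the fraction of its long words seen in earlier ones."""
    terms_used = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        seen = sum(1 for w in words if w in terms_used)
        # Add the words only after scoring, so a question never matches itself.
        terms_used.update(words)
        overlaps.append(seen / len(words) if words else 0)
    return overlaps

# Hypothetical questions for illustration only.
sample = [
    "this american president signed the declaration",
    "this president was born in virginia",
    "the cat sat",
    "virginia declaration author",
]

print(question_overlaps(sample))
```

The first question scores 0 because nothing has been seen yet, the second scores 0.5 ("president" repeats, "virginia" is new), the third scores 0 because it has no long words, and the fourth scores 2/3. Scores naturally rise as more questions are processed, which is one reason a high average does not by itself prove that questions are recycled.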

6. Low Value Vs High Value Questions



In [12]:

    
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)



In [13]:

    
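def count_usage(term):
    # Count the questions containing `term`, split into high-value and low-value rows.
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Sets are unordered, so this picks five arbitrary terms.
comparison_terms = list(terms_used)[:5]
# Each entry is an observed (high_count, low_count) pair; expected counts are computed later.
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))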

observed_expected









    Out[13]:





[(3, 3), (0, 2), (0, 4), (3, 8), (0, 1)]

7. Applying The Chi-Squared Test



In [14]:

    
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    # Questions containing the term, as a share of all rows.
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were used in proportion to each group's size.
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared









    Out[14]:





[Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881564),
 Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708),
 Power_divergenceResult(statistic=1.607851384507536, pvalue=0.2047940943922556),
 Power_divergenceResult(statistic=0.010522836989240831, pvalue=0.91829561813933991),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]
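
For reference, the statistic and p-value reported above can be reproduced by hand. The sketch below uses only the standard library: the statistic is the sum of (observed - expected)^2 / expected over the two groups, and with two categories there is one degree of freedom, for which the chi-squared survival function reduces to `erfc(sqrt(x / 2))`. The function name `chi_squared_two_groups` and the group sizes 30 and 70 are hypothetical, since the real `high_value_count` and `low_value_count` are not shown in the output.

```python
import math

def chi_squared_two_groups(high_obs, low_obs, high_total, low_total):
    """Return (statistic, p_value) for one term split across two groups."""
    total_rows = high_total + low_total
    # Share of all rows that contain the term.
    term_share = (high_obs + low_obs) / total_rows
    # Expected counts if the term were used in proportion to group size.
    high_exp = term_share * high_total
    low_exp = term_share * low_total
    statistic = ((high_obs - high_exp) ** 2 / high_exp
                 + (low_obs - low_exp) ** 2 / low_exp)
    # One degree of freedom: P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Observed pair (3, 8) is from Out[13]; the group sizes are made up.
stat, p = chi_squared_two_groups(3, 8, high_total=30, low_total=70)
print(stat, p)
```

The function assumes the term occurs at least once; with zero occurrences the expected counts are zero and the division fails.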

Chi-squared results

None of the terms showed a significant difference in usage between high-value and low-value rows; every p-value is well above 0.05.
Additionally, most of the observed counts were lower than 5, so the chi-squared test is less reliable here.
It would be better to run this test with only terms that have higher frequencies.
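
One way to pick such terms is to count how many questions each term appears in and keep only those above a threshold. The sketch below is a minimal illustration under stated assumptions: the function name `frequent_terms`, the `min_count` threshold, and the sample questions are all hypothetical. Note that a threshold on the total count does not guarantee that both the high-value and low-value expected counts reach 5.

```python
from collections import Counter

def frequent_terms(clean_questions, min_count=5, min_length=6):
    """Return the long terms that occur in at least `min_count` questions."""
    counts = Counter()
    for question in clean_questions:
        # Count each term once per question, as count_usage does per row.
        terms = {w for w in question.split(" ") if len(w) >= min_length}
        counts.update(terms)
    return sorted(term for term, n in counts.items() if n >= min_count)

# Hypothetical questions for illustration only.
sample = [
    "this president signed the treaty",
    "the president lived in virginia",
    "virginia is home to this president",
]

print(frequent_terms(sample, min_count=3))  # ['president']
print(frequent_terms(sample, min_count=2))  # ['president', 'virginia']
```

Applied to the notebook's data, the input would be `jeopardy["clean_question"]`, and the returned list would replace the five arbitrary terms taken from `terms_used`.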
