How Does the Length of a Jeopardy Question Relate to Its Value?

The American television game show Jeopardy! is probably one of the most famous shows ever aired on TV. A few years ago IBM's Watson conquered the show, and now it's time to conquer the dataset of all the questions asked over the years and see whether any interesting relations lie behind them.

Thanks to Reddit user trexmatt for providing CSV data of the questions which can be found here.

Structure of the Dataset

As explained in the Reddit post mentioned above, each row of the dataset contains information on a particular question:

  • Category
  • Value
  • Question text
  • Answer text
  • Round of the game in which the question was asked
  • Show number
  • Date

Hypothesis

Before diving into the analysis of the data, let's state a hypothesis about how two of the columns relate:

The value of a question is related to its length.

Setting up Data


In [1]:
# this line is required to see visualizations inline for Jupyter notebook
%matplotlib inline

# importing modules that we need for analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

In [2]:
# read the data from file and print out first few rows
jeopardy = pd.read_csv("jeopardy.csv")
print(jeopardy.head(3))


   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  

In [3]:
print(jeopardy.columns)


Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Apparently most of the column names start with a blank space. Let's get rid of it:


In [4]:
jeopardy.rename(columns = lambda x: x[1:] if x[0] == " " else x, inplace=True)
jeopardy.columns


Out[4]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
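
The same clean-up can also be written with the string methods of the column index. Here is a minimal sketch on a made-up one-row frame (the `toy` dataframe and its contents are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy frame with the same kind of leading-space headers as the CSV (made-up row).
toy = pd.DataFrame({"Show Number": [1], " Air Date": ["2004-12-31"], " Value": ["$200"]})

# Index.str.strip() removes leading and trailing whitespace from every label.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

Unlike slicing off the first character, `strip()` would also remove trailing spaces and more than one leading space, which may or may not be what you want for other files.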

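Let's make a copy of the dataframe so that the changes we make don't disturb further analysis.


In [5]:
# .copy() creates an independent dataframe; plain assignment would only add a second name for the same object
data1 = jeopardy.copy()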
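
Plain assignment in Python only binds a second name to the same dataframe, whereas `DataFrame.copy()` creates an independent object. A minimal sketch of the difference on a made-up two-row frame (the names `original`, `alias` and `independent` are illustrative only):

```python
import pandas as pd

original = pd.DataFrame({"Value": ["$200", "$400"]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

alias["Flag"] = True           # also shows up in `original`
independent["Other"] = 1       # does not touch `original`

print(alias is original)
print(list(original.columns))
```

Here the `Flag` column appears in `original` because `alias` and `original` are one and the same object, while `Other` exists only in the copy.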

In [6]:
data1["Question"].value_counts()[:10]


Out[6]:
[audio clue]                       17
[video clue]                       14
(audio clue)                        5
[filler]                            5
Hainan                              4
Abigail Smith                       4
Greenland                           4
1861-1865                           3
"A watched pot never" does this     3
"I Hope I Get It"                   3
Name: Question, dtype: int64

Some entries are only placeholders for media clues, and some questions begin with hyperlinks. These can distort our analysis, so we should get rid of them.


In [7]:
# regex pattern that matches questions starting with a hyperlink
pattern = re.compile("^<a href")

# remove hyperlink questions and media clue placeholders
data1 = data1[data1["Question"].str.contains(pattern) == False]
data1 = data1[data1["Question"] != "[audio clue]"]
data1 = data1[data1["Question"] != "(audio clue)"]
data1 = data1[data1["Question"] != "[video clue]"]
data1 = data1[data1["Question"] != "[filler]"]

data1["Question"].value_counts()[:10]


Out[7]:
Abigail Smith        4
Greenland            4
Hainan               4
"I Hope I Get It"    3
Egypt                3
Aruba                3
Walk like a duck     3
Thomas Jefferson     3
1861-1865            3
Sam Spade            3
Name: Question, dtype: int64
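
The filtering step above can also be expressed with two boolean masks that are combined once. This is only a sketch on made-up rows, assuming the column has no missing values; the `toy` frame, the example URL and the `placeholders` list are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "Question": [
        "[audio clue]",
        '<a href="http://example.com/clue.jpg">This landmark</a> is in Paris',
        "The city of Yuma is in this state",
        "[filler]",
    ]
})

placeholders = ["[audio clue]", "(audio clue)", "[video clue]", "[filler]"]

# True where the text is exactly one of the placeholder strings.
is_placeholder = toy["Question"].isin(placeholders)
# True where the text starts with an opening anchor tag.
starts_with_link = toy["Question"].str.contains(r"^<a href", regex=True)

# Keep only the rows that match neither mask.
cleaned = toy[~(is_placeholder | starts_with_link)]
print(cleaned["Question"].tolist())
```

Keeping the placeholder strings in one list makes it easier to add another one later without writing a new filter line.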

We can add a column to the dataframe for the length of each question.


In [8]:
data1["Question Length"] = data1["Question"].str.len()
data1["Question Length"][:12]


Out[8]:
0      96
1     107
2      88
3      84
4     104
5      77
6      76
7      70
8     109
9      98
10     16
11     73
Name: Question Length, dtype: int64

When we look at the "Value" column, we see that its entries are strings rather than integers, and that some of them are "None". We should clean those values.


In [9]:
data1["Value"].value_counts()[:15]


Out[9]:
$400      42043
$800      31682
$200      30334
$600      20265
$1000     19423
$1200     11242
$2000     11140
$1600     10726
$100       9011
$500       8994
$300       8643
None       3632
$1,000     2093
$2,000     1583
$3,000      767
Name: Value, dtype: int64

In [10]:
# get rid of None values
data1 = data1[data1["Value"] != "None"]

# parse integers from strings
pattern = "[0-9]"
data1["Value"] = data1["Value"].apply(lambda x: "".join(re.findall(pattern,x)))
data1["Value"] = data1["Value"].astype(int)
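
An equivalent way to parse the dollar amounts is pandas' vectorized string replacement instead of `re.findall`. A minimal sketch on a made-up series (`raw`, `kept` and `parsed` are illustrative names):

```python
import pandas as pd

raw = pd.Series(["$200", "$1,000", "None", "$2,000"])

# Drop the literal "None" strings, then strip every non-digit character.
kept = raw[raw != "None"]
parsed = kept.str.replace(r"[^0-9]", "", regex=True).astype(int)

print(parsed.tolist())
```

Removing everything that is not a digit handles both the dollar sign and the thousands separator in one pass.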

In [11]:
print(data1["Value"].value_counts()[:10])
print("Number of distinct values:" + str(len(data1["Value"].value_counts())))


400     42043
800     31682
200     30334
1000    21516
600     20265
2000    12723
1200    11683
1600    10965
100      9011
500      8994
Name: Value, dtype: int64
Number of distinct values:145

The "Value" column has 145 different values. For the sake of simplicity, let's keep only the ones that are multiples of 100 and no greater than 2500, which covers the standard clue values of both rounds.


In [12]:
data1 = data1[(data1["Value"]%100 == 0) & (data1["Value"]<= 2500)]
print(data1["Value"].value_counts())
print("Number of distinct values: " + str(len(data1["Value"].value_counts())))


400     42043
800     31682
200     30334
1000    21516
600     20265
2000    12723
1200    11683
1600    10965
100      9011
500      8994
300      8643
1500      545
2500      230
1400      228
700       203
1800      182
2200      147
2400      127
900       114
1300       75
1100       63
1700       44
1900       28
2300       23
2100       22
Name: Value, dtype: int64
Number of distinct values: 25

In [13]:
# set up the figure and plot length vs value on ax1
fig = plt.figure(figsize=(10,5))

ax1 = fig.add_subplot(1,1,1)
ax1.scatter(data1["Question Length"], data1["Value"])
ax1.set_xlim(0, 800)
ax1.set_ylim(0, 2700)
ax1.set_title("The Relation between Question Length and Value")
ax1.set_xlabel("Length of the Question")
ax1.set_ylabel("Value of the Question")
plt.show()


It looks like there isn't a correlation, but with so many overlapping points this graph isn't clear enough to draw conclusions from. Instead, let's calculate the average question length for each value and plot average length against value.


In [14]:
# find the average length for each value
average_lengths = []
values = data1["Value"].unique()
for value in values:
    rows = data1[data1["Value"] == value]
    average = rows["Question Length"].mean()
    average_lengths.append(average)
print(average_lengths)
print(values)


[79.14577701588976, 83.66122779059535, 85.393239575623, 88.08866233192349, 97.93319185726637, 88.41067112846254, 95.34400410853377, 92.53123575011399, 73.95272444789701, 76.74025222723591, 79.75194574160551, 89.72660550458716, 91.65384615384616, 92.46031746031746, 95.08163265306122, 93.35714285714286, 88.03448275862068, 90.44298245614036, 90.78740157480316, 90.87391304347825, 89.18666666666667, 101.0, 86.97368421052632, 69.8695652173913, 91.4090909090909]
[ 200  400  600  800 2000 1000 1200 1600  100  300  500 1500 1800 1100 2200
 1900  700 1400 2400 2500 1300 2100  900 2300 1700]
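
The loop above can also be written as a single `groupby`. Here is a small sketch on made-up numbers (the `toy` frame is an assumption for illustration); note that, unlike `unique()`, the result comes back sorted by value:

```python
import pandas as pd

toy = pd.DataFrame({
    "Value": [200, 200, 400, 400, 400],
    "Question Length": [80, 90, 100, 110, 120],
})

# One mean per distinct value; the result is indexed by value in sorted order.
mean_length = toy.groupby("Value")["Question Length"].mean()
print(mean_length)
```

The resulting series keeps each average attached to its value, so the two no longer have to be tracked in separate lists.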

In [15]:
# set up the figure and plot average length vs value on ax1
fig = plt.figure(figsize=(10,5))

ax1 = fig.add_subplot(1,1,1)
ax1.scatter(average_lengths, values)
ax1.set_title("The Relation between Average Question Length and Value")
ax1.set_xlabel("Average Question Length")
ax1.set_ylabel("Value")
ax1.set_xlim(70, 105)
ax1.set_ylim(0, 3000)
plt.show()



In [16]:
print("Correlation coefficient: " + str(np.corrcoef(average_lengths, values)[0,1]))


Correlation coefficient: 0.528962783499

Here we go! It's not a strong correlation, but we've found a moderate one (coefficient 0.53) between the value of a question and the average length of the questions at that value!
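
Keep in mind that the coefficient is computed over the 25 per-value averages, so a value with 22 questions counts as much as one with 42,043. One possible follow-up is to compare it with the correlation over individual questions. The sketch below shows both calculations on made-up numbers only; the `toy` frame is an assumption and says nothing about what the real dataset would give:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Value": [200, 200, 200, 400, 400, 600],
    "Question Length": [70, 90, 80, 85, 95, 100],
})

# Per-question Pearson correlation: every row counts once.
per_question = toy["Question Length"].corr(toy["Value"])

# Correlation of group means: every distinct value counts once.
means = toy.groupby("Value")["Question Length"].mean()
per_group = np.corrcoef(means.to_numpy(), means.index.to_numpy())[0, 1]

print(per_question, per_group)
```

In this toy example the group means lie on a straight line, so the averaged coefficient is much higher than the per-question one. Averaging hides the spread within each value, which is worth remembering when interpreting the 0.53 above.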