The American television game show Jeopardy is probably one of the most famous shows ever aired on TV. Few years ago IBM's Watson conquered the show, and now, it's time the conquer the dataset of all the questions that were asked in years and see if any interesting relations lie behind them.
Thanks to Reddit user trexmatt for providing CSV data of the questions which can be found here.
In [1]:
# this line is required to see visualizations inline for Jupyter notebook
%matplotlib inline
# importing modules that we need for analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re
In [2]:
# read the data from file and print out first few rows
jeopardy = pd.read_csv("jeopardy.csv")
print(jeopardy.head(3))
In [3]:
print(jeopardy.columns)
Apparently columns have a blank space in the beginning. Let's get rid of them:
In [4]:
jeopardy.rename(columns = lambda x: x[1:] if x[0] == " " else x, inplace=True)
jeopardy.columns
Out[4]:
Let's have a copy of dataframe so that changes we make doesn't disturb further analysis.
In [5]:
data1 = jeopardy
In [6]:
data1["Question"].value_counts()[:10]
Out[6]:
There are some media-based questions, and also some questions with hyper-links. These can disturb our analysis so we should get rid of them.
In [7]:
# regex pattern used to remove hyper-links
pattern = re.compile("^<a href")
# remove media clue questions
data1 = data1[data1["Question"].str.contains(pattern) == False]
data1 = data1[data1["Question"] != "[audio clue]"]
data1 = data1[data1["Question"] != "(audio clue)"]
data1 = data1[data1["Question"] != "[video clue]"]
data1 = data1[data1["Question"] != "[filler]"]
data1["Question"].value_counts()[:10]
Out[7]:
We can add a column to dataframe for lenght of questions.
In [8]:
data1["Question Length"] = data1["Question"].apply(lambda x: len(x))
data1["Question Length"][:12]
Out[8]:
When we look at the "Value" column, we see they are not integers but strings, also there are some "None" values. We should clean those values.
In [9]:
data1["Value"].value_counts()[:15]
Out[9]:
In [10]:
# get rid of None values
data1 = data1[data1["Value"] != "None"]
# parse integers from strings
pattern = "[0-9]"
data1["Value"] = data1["Value"].apply(lambda x: "".join(re.findall(pattern,x)))
data1["Value"] = data1["Value"].astype(int)
In [11]:
print(data1["Value"].value_counts()[:10])
print("Number of distinct values:" + str(len(data1["Value"].value_counts())))
The "Value" column has 145 different values. For the sake of simplicity, let's keep the ones that are multiples of 100 and between 200 and 2500 (first round questions has range of 200-1000, second round questions has range of 500-2500).
In [12]:
data1 = data1[(data1["Value"]%100 == 0) & (data1["Value"]<= 2500)]
print(data1["Value"].value_counts())
print("Number of distinct values: " + str(len(data1["Value"].value_counts())))
In [13]:
# set up the figure and plot length vs value on ax1
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(1,1,1)
ax1.scatter(data1["Question Length"], data1["Value"])
ax1.set_xlim(0, 800)
ax1.set_ylim(0, 2700)
ax1.set_title("The Relation between Question Length and Value")
ax1.set_xlabel("Lenght of the Question")
ax1.set_ylabel("Value of the Question")
plt.show()
It looks like there isn't a correlation, but this graph isn't structured well enough to draw conclusions. Instead, let's calculate average question length for each value and plot average length vs value.
In [14]:
#find the average length for each value
average_lengths = []
values = data1["Value"].unique()
for value in values:
rows = data1[data1["Value"] == value]
average = rows["Question Length"].mean()
average_lengths.append(average)
print(average_lengths)
print(values)
In [15]:
# set up the figure and plot average length vs value on ax1
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(1,1,1)
ax1.scatter(average_lengths, values)
ax1.set_title("The Relation between Average Question Length and Value")
ax1.set_xlabel("Average Question Length")
ax1.set_ylabel("Value")
ax1.set_xlim(70, 105)
ax1.set_ylim(0, 3000)
plt.show()
In [16]:
print("Correlation coefficient: " + str(np.corrcoef(average_lengths, values)[0,1]))
Here we go! Even though it's not a strong correlation, we've found a moderate correlation (with coefficient 0.53) between length of the question and it's value!