Loads the sports data

Run this script to load the data. Your job after loading it is to build a 20-questions-style game (see www.20q.net).

This dataset is a list of 25 sports, each rated (by Stephen) with a yes/no answer to each of 13 questions. Knowing the answers to all 13 questions uniquely identifies each sport. Can you do it in fewer than 13 questions? In fewer questions than the trained decision tree?
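
As a rough target for "fewer than 13", the sketch below computes the information-theoretic floor: each yes/no answer can at best halve the set of remaining candidates, so no strategy can guarantee separating 25 sports in fewer than ceil(log2(25)) questions in the worst case. Whether these particular 13 questions can actually reach that floor is not something the calculation tells you; the variable names are only illustrative.

```python
import math

n_sports = 25      # number of sports in the dataset
n_questions = 13   # number of yes/no questions available

# Each yes/no answer at best halves the candidates, so distinguishing
# n_sports items needs at least ceil(log2(n_sports)) questions in the worst case.
min_questions = math.ceil(math.log2(n_sports))

# For comparison: how many distinct answer patterns 13 yes/no questions allow.
n_patterns = 2 ** n_questions

print("Worst-case lower bound on questions:", min_questions)
print("Possible answer patterns with all questions:", n_patterns)
```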

Read in the list of sports

There should be 25 sports. We can print them out so you know what the choices are.


In [14]:
import csv
sports = []  # This is a python "list" data structure (it is "mutable")
# The file has a list of sports, one per line.
# There are spaces in some names, but no commas or weird punctuation
with open('data/SportsDataset_ListOfSports.csv','r') as csvfile:
    myreader = csv.reader(csvfile)
    for index, row in enumerate( myreader ):
        sports.append(' '.join(row) ) # the join() call merges all fields
# Make a look-up table: if you input the name of the sport, it tells you the index
# Also, print out a list of all the sports, to make sure it looks OK
Sport2Index = {}
for ind, sprt in enumerate( sports ):
    Sport2Index[sprt] = ind
    print('Sport #', ind,'is',sprt)
# An example usage of the index lookup:
print('The sport "', sports[7],'" has 0-based index', Sport2Index[sports[7]])


Sport # 0 is Diving
Sport # 1 is Swimming
Sport # 2 is Synchronized Swimming
Sport # 3 is Water Polo
Sport # 4 is Kayak
Sport # 5 is Basketball
Sport # 6 is Bicycling
Sport # 7 is Speed skating
Sport # 8 is Figure skating
Sport # 9 is Gymnastics
Sport # 10 is Volleyball
Sport # 11 is Wrestling
Sport # 12 is Track/running
Sport # 13 is Baseball
Sport # 14 is Boxing
Sport # 15 is Fencing
Sport # 16 is Field Hockey
Sport # 17 is Football
Sport # 18 is Golf
Sport # 19 is Sailing
Sport # 20 is Softball
Sport # 21 is Ping pong
Sport # 22 is Tennis
Sport # 23 is Ice hockey
Sport # 24 is Skiing
The sport " Speed skating " has 0-based index 7

Read in the list of questions/attributes

There are 13 questions.


In [18]:
# this csv file has only a single row
questions = []
with open('data/SportsDataset_ListOfAttributes.csv','r') as csvfile:
    myreader = csv.reader( csvfile )
    for row in myreader:
        questions = row
Question2Index = {}
for ind, quest in enumerate( questions ):
    Question2Index[quest] = ind
    print('Question #', ind,': ',quest)
# An example usage of the index lookup:
print('The question "', questions[10],'" has 0-based index', Question2Index[questions[10]])


Question # 0 :  Water Sport?
Question # 1 :  Necessarily a team Sport?
Question # 2 :  Ice involved?
Question # 3 :  Snow involved? 
Question # 4 :  Head to head matches?
Question # 5 :  Subjective scoring?
Question # 6 :  Race of some sort?
Question # 7 :  Is it always played outdoors?
Question # 8 :  Does it help to be tall?
Question # 9 :  Is there a goal or a hoop?
Question # 10 :  Does each person have large equipment (bike, boat; NOT skates, skis, stick...)
Question # 11 :  Does each participant have small gear like a racquet, stick, skates, skis, mitt, sword, etc.?
Question # 12 :  Is there a ball bigger than a baseball used?
The question " Does each person have large equipment (bike, boat; NOT skates, skis, stick...) " has 0-based index 10

Read in the training data

The columns of X correspond to questions, and the rows correspond to sports (one row per sport, in the same order as the list of sports). The entries of y are the sport indices. The values of X are 1 (yes), -1 (no) or 0 (unsure); see YesNoDict for the encoding.


In [60]:
YesNoDict = { "Yes": 1, "No": -1, "Unsure": 0, "": 0 }
# Load from the csv file.
# Note: the file only has "1"s, because blanks mean "No"

X = []
with open('data/SportsDataset_DataAttributes.csv','r') as csvfile:
    myreader = csv.reader(csvfile)
    for row in myreader:
        # Blank cells mean "No", so encode them as -1
        data = []
        for col in row:
            data.append(col or "-1")
        X.append(list(map(int, data)))  # integers, not strings

# This data file is listed in the same order as the sports
# The variable "y" contains the index of the sport
# range() is lazy in Python 3, so materialize it into an explicit list of ints
y = list(range(len(sports)))
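
Since the game relies on every sport having a distinct pattern of answers, it may be worth checking that no two rows of X are identical before training. The helper below is a minimal sketch of such a check; `find_duplicate_rows` and `toy_X` are hypothetical names, and the toy matrix is made up for illustration rather than taken from the data file.

```python
def find_duplicate_rows(X):
    """Return (first_index, later_index) pairs for rows of X that are identical."""
    seen = {}        # maps a row (as a tuple) to the index where it first appeared
    duplicates = []
    for index, row in enumerate(X):
        key = tuple(row)
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates

# Made-up example: rows 0 and 2 have the same answers, so they cannot be told apart.
toy_X = [[1, -1, 1],
         [1, -1, -1],
         [1, -1, 1]]
print(find_duplicate_rows(toy_X))  # [(0, 2)]
```

Calling `find_duplicate_rows(X)` on the loaded data should return an empty list if the 13 answers really do identify each sport uniquely.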

Your turn: train a decision tree classifier


In [ ]:
from sklearn import tree
# the rest is up to you
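
If you want a starting point, here is one possible sketch of fitting a tree and inspecting its size. It runs on a made-up toy matrix (4 sports, 3 questions, the last question answered "No" by everyone) so that it is self-contained; `toy_X` and `toy_y` are hypothetical stand-ins for the X and y loaded above, and the choice of `criterion="entropy"` is an assumption, not a requirement of the exercise.

```python
from sklearn import tree

# Made-up toy data: 1 = yes, -1 = no. The third question is the same for
# every sport, so it carries no information.
toy_X = [[1, 1, -1],
         [1, -1, -1],
         [-1, 1, -1],
         [-1, -1, -1]]
toy_y = [0, 1, 2, 3]  # one class label per sport

# Fit an unpruned tree; random_state only fixes tie-breaking between splits.
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(toy_X, toy_y)

# The depth is the largest number of questions the tree ever asks.
print("Training accuracy:", clf.score(toy_X, toy_y))
print("Tree depth (worst-case number of questions):", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())
```

On the real data, comparing `clf.get_depth()` against 13 shows how many questions the tree saves in the worst case.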

Use the trained classifier to play a 20 questions game

You may want to use `from sklearn.tree import _tree` and `tree.DecisionTreeClassifier`, together with the fitted classifier's `tree_` attribute and commands like `tree_.children_left[node]`, `tree_.value[node]`, `tree_.feature[node]`, and `tree_.threshold[node]`.


In [ ]:
# up to you
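
One way the game loop could look is sketched below: start at the root, ask the question stored in `tree_.feature[node]`, go left if the answer is at or below `tree_.threshold[node]` and right otherwise, and stop at a leaf. This is a sketch on made-up toy data, not a reference solution; `play`, `answer_fn`, `toy_sports`, `toy_X` and `toy_y` are hypothetical names, and the toy answers are my own illustration rather than values from the dataset.

```python
from sklearn import tree
from sklearn.tree import _tree

# Made-up toy data: 4 sports, 2 questions ("Water Sport?", "Race of some sort?").
toy_sports = ["Diving", "Swimming", "Golf", "Bicycling"]
toy_X = [[1, -1], [1, 1], [-1, -1], [-1, 1]]  # 1 = yes, -1 = no
toy_y = [0, 1, 2, 3]

clf = tree.DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

def play(clf, answer_fn):
    """Walk from the root to a leaf.

    answer_fn(question_index) must return 1 (yes) or -1 (no).
    Returns the guessed class label and the list of questions asked.
    """
    t = clf.tree_
    node = 0  # the root is always node 0
    asked = []
    while t.children_left[node] != _tree.TREE_LEAF:
        question = int(t.feature[node])
        asked.append(question)
        # scikit-learn sends samples with value <= threshold to the left child
        if answer_fn(question) <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    # At a leaf, pick the class with the largest weight in tree_.value
    guess = int(clf.classes_[t.value[node][0].argmax()])
    return guess, asked

# Simulate a player who is thinking of sport number 1.
secret = 1
guess, asked = play(clf, lambda q: toy_X[secret][q])
print("Guessed:", toy_sports[guess], "after", len(asked), "questions")
```

For an interactive version, `answer_fn` could instead print `questions[q]` and map the player's typed reply through `YesNoDict`.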