Run this script to load the data. Your job, once the data is loaded, is to build a 20 Questions-style game (see www.20q.net).
This dataset is a list of 25 sports, each rated (by Stephen) with a yes/no answer to each of 13 questions. Knowing the answers to all 13 questions uniquely identifies each sport. Can you do it in fewer than 13 questions? In fewer questions than the trained decision tree?
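As a rough guide to how few questions could ever suffice, here is a small counting sketch. It assumes only the figures stated above (25 sports, 13 yes/no questions); whether this bound is actually reachable depends on how evenly the available questions split the sports.

```python
import math

n_sports = 25      # number of sports stated above
n_questions = 13   # number of yes/no questions stated above

# Each yes/no answer can at best halve the set of remaining candidates,
# so at least ceil(log2(n_sports)) questions are needed in the worst case.
min_questions = math.ceil(math.log2(n_sports))
print(min_questions)  # 5
```

So any strategy needs at least 5 questions for some sport, and a good one should land somewhere between 5 and 13.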
In [14]:
import csv
sports = [] # This is a python "list" data structure (it is "mutable")
# The file has a list of sports, one per line.
# There are spaces in some names, but no commas or weird punctuation
with open('data/SportsDataset_ListOfSports.csv','r') as csvfile:
    myreader = csv.reader(csvfile)
    for index, row in enumerate( myreader ):
        sports.append(' '.join(row) ) # the join() call merges all fields
# Make a look-up table: if you input the name of the sport, it tells you the index
# Also, print out a list of all the sports, to make sure it looks OK
Sport2Index = {}
for ind, sprt in enumerate( sports ):
    Sport2Index[sprt] = ind
    print('Sport #', ind,'is',sprt)
# An example usage of the index lookup:
print('The sport "', sports[7],'" has 0-based index', Sport2Index[sports[7]])
In [18]:
# this csv file has only a single row
questions = []
with open('data/SportsDataset_ListOfAttributes.csv','r') as csvfile:
    myreader = csv.reader( csvfile )
    for row in myreader:
        questions = row
Question2Index = {}
for ind, quest in enumerate( questions ):
    Question2Index[quest] = ind
    print('Question #', ind,': ',quest)
# An example usage of the index lookup:
print('The question "', questions[10],'" has 0-based index', Question2Index[questions[10]])
In [60]:
YesNoDict = { "Yes": 1, "No": -1, "Unsure": 0, "": 0 }
# Load from the csv file.
# Note: the file only has "1"s, because blanks mean "No"
X = []
with open('data/SportsDataset_DataAttributes.csv','r') as csvfile:
    myreader = csv.reader(csvfile)
    for row in myreader:
        data = []
        for col in row:
            data.append( col or "-1") # a blank field means "No"
        X.append( list(map(int,data)) ) # integers, not strings
# This data file is listed in the same order as the sports
# The variable "y" contains the index of the sport
# In Python 3, range() returns a lazy range object rather than a list,
# so convert it to an explicit list of integer labels.
y = list( range(len(sports)) )
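The claim that all 13 answers uniquely identify each sport can be checked directly once the data is loaded. The helper below is a minimal sketch; `has_unique_rows` is a hypothetical name, not part of the original notebook, and the small lists are made-up stand-ins for `X`.

```python
def has_unique_rows(rows):
    """Return True if no two rows contain exactly the same answers."""
    # Lists are not hashable, so convert each row to a tuple first.
    return len({tuple(r) for r in rows}) == len(rows)

# Made-up examples using the same +1 / -1 encoding as X
print(has_unique_rows([[1, -1], [-1, 1], [1, 1]]))   # True
print(has_unique_rows([[1, -1], [1, -1]]))           # False
```

In the notebook, `has_unique_rows(X)` should return `True` if the dataset behaves as described.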
In [ ]:
from sklearn import tree
# the rest is up to you
You may want to use `from sklearn.tree import _tree` and `tree.DecisionTreeClassifier`, with commands like `tree_.children_left[node]`, `tree_.value[node]`, `tree_.feature[node]`, and `tree_.threshold[node]`.
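To show how those attributes fit together, here is a minimal sketch of one possible approach, not the intended solution. It fits a tree on a tiny made-up dataset (`toy_X` and `toy_y` are hypothetical, not the sports data) and walks it from the root, counting how many questions are asked. The function name `identify_item` is also an assumption for illustration.

```python
from sklearn import tree
from sklearn.tree import _tree

# Hypothetical toy data (NOT the sports dataset): 4 items, 3 yes/no questions,
# encoded as +1 for "yes" and -1 for "no", like X above.
toy_X = [[ 1,  1, -1],
         [ 1, -1, -1],
         [-1,  1,  1],
         [-1, -1,  1]]
toy_y = [0, 1, 2, 3]

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(toy_X, toy_y)

def identify_item(clf, answers):
    """Walk the fitted tree using a full list of answers.

    Returns (predicted label, number of questions actually asked)."""
    tree_ = clf.tree_
    node = 0          # the root is always node 0
    n_asked = 0
    # A leaf is marked by children_left[node] == _tree.TREE_LEAF
    while tree_.children_left[node] != _tree.TREE_LEAF:
        question = tree_.feature[node]
        n_asked += 1
        if answers[question] <= tree_.threshold[node]:
            node = tree_.children_left[node]
        else:
            node = tree_.children_right[node]
    # tree_.value[node] holds the per-class weights at this node
    best = tree_.value[node][0].argmax()
    return int(clf.classes_[best]), n_asked

for answers, label in zip(toy_X, toy_y):
    print(label, identify_item(clf, answers))
```

For an interactive game, the lookup `answers[question]` could be replaced by a prompt that prints `questions[question]` and reads a yes/no reply; with the real data, pass `X` and `y` to `fit` instead of the toy lists.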
In [ ]:
# up to you