Class 7: Kaggle Data Sets.
Kaggle runs competitions in which data scientists compete to build the model that best fits the data. The capstone project of this chapter features Kaggle’s Titanic data set. Before we get started with the Titanic example, it’s important to be aware of some Kaggle guidelines. First, most competitions end on a specific date. Website organizers have currently scheduled the Titanic competition to end on December 31, 2016. However, they have already extended the deadline several times, and a further extension is also possible. Second, the Titanic data set is considered a tutorial data set. In other words, there is no prize, and your score in the competition does not count towards becoming a Kaggle Master.
A typical Kaggle competition will have several components. Consider the Titanic tutorial:
Kaggle is provided with a data set by the competition sponsor. This data set is divided up as follows:
Code need not be submitted to Kaggle. For competitions, you are scored entirely on the accuracy of your submission file. A Kaggle submission file is always a CSV file that contains the Id of the row you are predicting and the answer. For the Titanic competition, a submission file looks something like this:
PassengerId,Survived
892,0
893,1
894,1
895,0
896,0
897,1
...
The above file states the prediction for each passenger listed. You should only predict on IDs that are in the test file. Likewise, you should render a prediction for every row in the test file. Some competitions will have different formats for their answers. For example, a multiclass classification competition will usually have a column for each class, holding your prediction for that class.
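The two rules above (predict only on IDs from the test file, and predict on every one of them) are easy to check before uploading. The following is a minimal sketch using only the Python standard library. The function names `write_submission` and `check_submission` are hypothetical helpers invented for this illustration, and the IDs and predictions are simply the sample rows shown earlier, not real model output.

```python
import csv
import io


def write_submission(test_ids, predictions, out):
    """Write one PassengerId,Survived row per test row to a file-like object."""
    if len(test_ids) != len(predictions):
        raise ValueError("need exactly one prediction per test row")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, prediction in zip(test_ids, predictions):
        writer.writerow([passenger_id, int(prediction)])


def check_submission(submission_text, test_ids):
    """Return True if the submission covers exactly the test IDs, once each."""
    rows = csv.DictReader(io.StringIO(submission_text))
    submitted = [int(row["PassengerId"]) for row in rows]
    return sorted(submitted) == sorted(test_ids)


# Hypothetical values copied from the sample file; a real run would take
# the IDs from the test file and the predictions from a trained model.
test_ids = [892, 893, 894, 895, 896, 897]
predictions = [0, 1, 1, 0, 0, 1]

buffer = io.StringIO()
write_submission(test_ids, predictions, buffer)
submission_text = buffer.getvalue()

print(submission_text)
print(check_submission(submission_text, test_ids))
```

To produce an actual file, pass an object opened with `open("submission.csv", "w", newline="")` in place of the `io.StringIO` buffer. The file name here is only an example.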
There have been many interesting competitions on Kaggle; these are some of my favorites.