T81-558: Applications of Deep Neural Networks

Class 7: Kaggle Data Sets.

What is Kaggle?

Kaggle runs competitions in which data scientists compete to build the model that best fits the data. The capstone project of this chapter features Kaggle’s Titanic data set. Before we get started with the Titanic example, it’s important to be aware of some Kaggle guidelines. First, most competitions end on a specific date. The website organizers have currently scheduled the Titanic competition to end on December 31, 2016. However, they have already extended the deadline several times, and an extension beyond 2016 is also possible. Second, the Titanic data set is considered a tutorial data set. In other words, there is no prize, and your score in the competition does not count towards becoming a Kaggle Master.

Kaggle Ranks

Kaggle ranks are achieved by earning gold, silver and bronze medals.

Typical Kaggle Competition

A typical Kaggle competition will have several components. Consider the Titanic tutorial:

How Kaggle Competitions are Scored

Kaggle is provided with a data set by the competition sponsor. This data set is divided up as follows:

  • Complete Data Set - This is the full data set supplied by the sponsor, before it is split.
    • Training Data Set - You are provided both the inputs and the outcomes for the training portion of the data set.
    • Test Data Set - You are provided the complete test data set; however, you are not given the outcomes. Your submission is your predicted outcomes for this data set.
      • Public Leaderboard - You are not told what part of the test data set contributes to the public leaderboard. Your public score is calculated based on this part of the data set.
      • Private Leaderboard - You are not told what part of the test data set contributes to the private leaderboard. Your final score/rank is calculated based on this part. You do not see your private leaderboard score until the end.
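
The split described above can be illustrated with a small sketch. This is not Kaggle's actual scoring code: the function names, the 50/50 split, the random seed, and the use of accuracy as the metric are all assumptions made for illustration. The point is only that one submission is scored twice, on two disjoint and hidden parts of the test set.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true outcomes."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def leaderboard_scores(y_true, y_pred, public_fraction=0.5, seed=42):
    """Score one submission on a hidden public/private split (hypothetical).

    public_fraction and seed are illustrative assumptions; competitors are
    not told which rows fall into which part.
    """
    rng = random.Random(seed)
    indices = list(range(len(y_true)))
    rng.shuffle(indices)

    n_public = int(len(indices) * public_fraction)
    public_idx = indices[:n_public]
    private_idx = indices[n_public:]

    public = accuracy([y_true[i] for i in public_idx],
                      [y_pred[i] for i in public_idx])
    private = accuracy([y_true[i] for i in private_idx],
                       [y_pred[i] for i in private_idx])
    return public, private


# Outcomes known only to the organizers, and one competitor's predictions.
hidden_outcomes = [0, 1, 1, 0, 0, 1, 0, 1]
predictions = [0, 1, 1, 0, 1, 0, 1, 0]

public_score, private_score = leaderboard_scores(hidden_outcomes, predictions)
print("public:", public_score, "private:", private_score)
```

Because the two scores come from different rows, a model tuned heavily against the public score can still rank differently once the private score is revealed.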

Preparing a Kaggle Submission

Code need not be submitted to Kaggle. For competitions, you are scored entirely on the accuracy of your submission file. A Kaggle submission file is always a CSV file that contains the Id of the row you are predicting and the answer. For the Titanic competition, a submission file looks something like this:

PassengerId,Survived
892,0
893,1
894,1
895,0
896,0
897,1
...

The above file states the prediction for each passenger listed. You should only predict on IDs that are in the test file. Likewise, you should render a prediction for every row in the test file. Some competitions will have different formats for their answers. For example, a multi-class classification competition will usually have a column for each class and your predictions for each class.
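
As a minimal sketch of the rules above, the following code writes a submission in the two-column format shown earlier and checks that the submitted IDs match the test file exactly. It uses only Python's standard csv module; the function names write_submission and check_submission are hypothetical helpers, not part of any Kaggle tooling, and the IDs and predictions are made-up sample values.

```python
import csv
import io


def write_submission(ids, predictions, out):
    """Write PassengerId,Survived rows to an open text file or buffer."""
    if len(ids) != len(predictions):
        raise ValueError("need exactly one prediction per id")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, prediction in zip(ids, predictions):
        writer.writerow([passenger_id, int(prediction)])


def check_submission(submitted_ids, test_ids):
    """Return (missing, extra): test IDs not predicted, and IDs not in the test file."""
    submitted = set(submitted_ids)
    expected = set(test_ids)
    return sorted(expected - submitted), sorted(submitted - expected)


# Sample values for illustration only.
test_ids = [892, 893, 894]
predictions = [0, 1, 1]

buffer = io.StringIO()
write_submission(test_ids, predictions, buffer)
print(buffer.getvalue())

missing, extra = check_submission(test_ids, test_ids)
print("missing:", missing, "extra:", extra)
```

To produce a real file, the same function can be called with a file opened as open("submit.csv", "w", newline=""), where the file name is again only an example.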

Programming Assignment 3

