There are usually 4-5 competitions running at once - a mix of image-based and statistical ones.
Competitions consist of a training set and a test set. In most competitions the test set is split: during the competition you only see scores on a public subset, and the larger private leaderboard is revealed after the competition ends. In some others, a second test set is released near the end of the competition.
This was Kaggle's first code competition, where instead of submitting predictions, Python kernels were uploaded and run on Kaggle's cloud servers. (Unfortunately, submissions are now offline and there's no test data yet(?), so there are a few things I'd like to look at but can't right now...)
This added a real constraint for all competitors, since there was a 60-minute runtime limit (using two cores of a ~3 GHz Xeon server box). It was possible to encode relatively small binaries of models directly in the kernel source (Gilberto Titericz Junior's team did this), but I didn't attempt it myself.
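I didn't try the model-embedding trick myself, but the mechanics aren't complicated; here's a minimal round-trip sketch (the tiny Ridge model and the compression choices are purely illustrative):

```python
import base64, bz2, pickle

import numpy as np
from sklearn.linear_model import Ridge

# "Offline": fit a small model, compress it, and base64-encode it so the
# resulting string can be pasted directly into a kernel's source code.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.1, -0.2, 0.0, 0.3, 0.05]) + rng.normal(scale=0.01, size=200)
model = Ridge(alpha=1.0).fit(X, y)

blob = base64.b64encode(bz2.compress(pickle.dumps(model))).decode('ascii')
print(f'embedded model is {len(blob)} characters of base64')

# Inside the kernel: reverse the process to recover the fitted model.
restored = pickle.loads(bz2.decompress(base64.b64decode(blob)))
assert np.allclose(model.predict(X), restored.predict(X))
```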
This dataset contains anonymized features pertaining to a time-varying value for a financial instrument. Each instrument has an id. Time is represented by the 'timestamp' feature and the variable to predict is 'y'. No further information will be provided on the meaning of the features, the transformations that were applied to them, the timescale, or the type of instruments that are included in the data.
Basically, it was about predicting daily time-series data on various financial instruments that came and went during the timeframe. In the middle of the competition, the IDs and timestamps on the server end were changed to prevent kernels from hard-coding explicit data references.
https://www.kaggle.com/c/two-sigma-financial-modeling#evaluation - the R metric was used (somewhat ironic, since R kernels were not supported by the code competition ;)
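As I read the evaluation page, R is the signed square root of the ordinary coefficient of determination; something like this (my paraphrase, not official scoring code):

```python
import numpy as np

def r_score(y_true, y_pred):
    # R2 is the usual coefficient of determination; the competition's R is
    # its signed square root, so a negative R2 maps to a negative R.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return np.sign(r2) * np.sqrt(np.abs(r2))
```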
The only view into the test set was the R score of the previous day's predictions. However, this signal was misleading, because it shifts considerably depending on that day's average y, while the final score was computed over all days together.
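From memory, the kernel's main loop against Kaggle's kagglegym API looked roughly like this (the package only exists on Kaggle's servers, so treat it as a sketch):

```python
import kagglegym  # only available inside Kaggle's kernel environment

env = kagglegym.make()
observation = env.reset()          # observation.train holds the training data

while True:
    target = observation.target   # DataFrame with 'id' and a 'y' to fill in
    target['y'] = 0.0             # trivial all-zero prediction for illustration
    observation, reward, done, info = env.step(target)
    # 'reward' is the R score for this single timestamp -- the noisy per-day
    # signal described above, not the averaged final score.
    if done:
        print(info['public_score'])
        break
```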
The core of my script was initially a blend of a popular kernel using ExtraTreesRegressor, combined with one using XGBoost, along with pandas manipulation that carried previous values forward as extra features.
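Not the actual script, but the general shape was something like the sketch below; the toy data, column names, and blend weights are all illustrative:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the competition data: per-instrument, per-timestamp rows
# with anonymized features and the target 'y'.
n = 2000
df = pd.DataFrame({
    'id': rng.integers(0, 50, n),
    'timestamp': rng.integers(0, 200, n),
    'technical_20': rng.normal(size=n),
    'technical_30': rng.normal(size=n),
})
df['y'] = 0.1 * df['technical_20'] - 0.05 * df['technical_30'] \
          + rng.normal(scale=0.1, size=n)

def add_lag_features(frame, cols):
    # Carry each instrument's previous value of each feature forward as an
    # extra column, giving the trees one step of history to work with.
    frame = frame.sort_values(['id', 'timestamp']).copy()
    for c in cols:
        frame[c + '_prev'] = frame.groupby('id')[c].shift(1)
    return frame

cols = ['technical_20', 'technical_30']
df = add_lag_features(df, cols).fillna(0.0)
features = cols + [c + '_prev' for c in cols]

et = ExtraTreesRegressor(n_estimators=100, max_depth=4, n_jobs=-1, random_state=0)
et.fit(df[features], df['y'])

xg = xgb.XGBRegressor(n_estimators=100, max_depth=4, learning_rate=0.05)
xg.fit(df[features], df['y'])

# Fixed-weight blend of the two models' predictions.
pred = 0.6 * et.predict(df[features]) + 0.4 * xg.predict(df[features])
```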
I tried to be careful to prevent overfitting - picking solutions which were stable on both the second half of the public data and the public half of the leaderboard data. Unfortunately, this meant I didn't gain enough score on the private half, so I dropped from #12 to #51. (My best submissions would have gotten me to about #28.)
Being blind to what was in the test set. Each time frame (each half of the public data, and the two splits in the private data) behaved very differently. This not only caused major overfitting issues, but fed into...
Having only two (successful) submissions per day. There were ways to cheat by intentionally erroring out submissions, but I didn't want to use them. So it was very easy to have something I thought would be good... and then it wasn't.
And my biggest mistake was assuming the private LB would act like an extrapolation of the public LB. In the end I gained some score, but not nearly as much as the top performers.
paulperry did more research: https://github.com/paulperry/kaggle/blob/master/two-sigma-financial-modeling/models/other_models.ipynb
Always look for a simpler solution!
On a 'blind' competition like this one, don't worry about the public leaderboard too much.
Don't spend too much time tuning what you already have (unless you're too tired to do something new :) )
(And write up notes right after the end of the competition, even if you're tired.)
Oussama Errabia added:
When ensembling, make sure the various members of the ensemble are different, to take advantage of different parts of the data.
Sometimes, it's not all about generalization. (I can speak to that!)
Trust your CV!
Kaggle's new corporate overlords (Google Cloud) recently held a conference with a lot of interesting talks. I watched these two on TensorFlow:
https://www.youtube.com/watch?v=u4alGiomYP4 - TensorFlow and Deep Learning without a PhD, Part 1 (Google Cloud Next '17)
So I decided to play with it a little bit - I'm hoping to work it into more Kaggle competitions as time goes on, but for now I've just got this fun demonstration that I started at the beginning of the talk.
Instead of Shakespeare, I fed in some David Weber ebooks, and in about a half hour to an hour (and with some luck), the neural network actually learns enough Engrish to make quasi-coherent sentences and paragraphs.
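The talk builds its RNN in raw TensorFlow; as a stand-in, here's a compact Keras character-level sketch of the same idea (the Shakespeare file is the one used in TensorFlow's tutorials, and any plain-text ebook can be swapped in):

```python
import numpy as np
from tensorflow import keras

path = keras.utils.get_file(
    'shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path).read()[:100_000]           # keep training quick

chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}

# Cut the text into fixed-length sequences; predict the next character.
seq_len, step = 40, 3
X = np.array([[idx[c] for c in text[i:i + seq_len]]
              for i in range(0, len(text) - seq_len, step)])
y = np.array([idx[text[i + seq_len]]
              for i in range(0, len(text) - seq_len, step)])

model = keras.Sequential([
    keras.layers.Embedding(len(chars), 64),
    keras.layers.LSTM(128),
    keras.layers.Dense(len(chars), activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(X, y, batch_size=128, epochs=5)

# Generate text by sampling one character at a time and feeding it back in.
out = text[:seq_len]
for _ in range(300):
    probs = model.predict(np.array([[idx[c] for c in out[-seq_len:]]]),
                          verbose=0)[0].astype('float64')
    probs /= probs.sum()                      # guard against float32 drift
    out += chars[np.random.choice(len(chars), p=probs)]
print(out)
```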