There are usually 4-5 competitions running at once - a mix of image-based and statistical ones.
Competitions consist of a training set and a test set. In most competitions the test set is split: during the competition you only see scores on a public subset, and the larger private leaderboard is revealed after the competition ends. In some others, a second test set is released near the end of the competition.
This was Kaggle's first code competition, where instead of submitting predictions, Python kernels were uploaded and run on Kaggle's cloud servers. (Unfortunately, submissions are now offline and there's no test data yet(?), so there are a few things I'd like to look at but can't right now...)
This added a real constraint for all competitors, since there was a 60-minute runtime limit (using two cores of a ~3 GHz Xeon server box). It was possible to encode relatively small binaries of models directly in the kernel source (Gilberto Titericz Junior's team did this), but I didn't attempt it myself.
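I didn't try the model-embedding trick myself, but the mechanics aren't complicated; here's a minimal round-trip sketch (the tiny Ridge model and the compression choices are purely illustrative):

```python
import base64, bz2, pickle

import numpy as np
from sklearn.linear_model import Ridge

# "Offline": fit a small model, compress it, and base64-encode it so the
# resulting string can be pasted directly into a kernel's source code.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.1, -0.2, 0.0, 0.3, 0.05]) + rng.normal(scale=0.01, size=200)
model = Ridge(alpha=1.0).fit(X, y)

blob = base64.b64encode(bz2.compress(pickle.dumps(model))).decode('ascii')
print(f'embedded model is {len(blob)} characters of base64')

# Inside the kernel: reverse the process to recover the fitted model.
restored = pickle.loads(bz2.decompress(base64.b64decode(blob)))
assert np.allclose(model.predict(X), restored.predict(X))
```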
This dataset contains anonymized features pertaining to a time-varying value for a financial instrument. Each instrument has an id. Time is represented by the 'timestamp' feature and the variable to predict is 'y'. No further information will be provided on the meaning of the features, the transformations that were applied to them, the timescale, or the type of instruments that are included in the data.
Basically, it was about predicting daily time-series data on various financial instruments that came and went during the timeframe. In the middle of the competition, the IDs and timestamps on the server end were changed to prevent kernels from hard-coding explicit data references.
https://www.kaggle.com/c/two-sigma-financial-modeling#evaluation - the R metric was used (somewhat ironic, since R kernels were not supported by the code competition ;)
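As I read the evaluation page, R is the signed square root of the ordinary coefficient of determination; something like this (my paraphrase, not official scoring code):

```python
import numpy as np

def r_score(y_true, y_pred):
    # R2 is the usual coefficient of determination; the competition's R is
    # its signed square root, so a negative R2 maps to a negative R.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return np.sign(r2) * np.sqrt(np.abs(r2))
```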
The only view into the test set was the R score of the previous day's predictions. However, this signal was misleading, because it shifts considerably depending on that day's average y, while the final score was computed over all days together.
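From memory, the kernel's main loop against Kaggle's kagglegym API looked roughly like this (the package only exists on Kaggle's servers, so treat it as a sketch):

```python
import kagglegym  # only available inside Kaggle's kernel environment

env = kagglegym.make()
observation = env.reset()          # observation.train holds the training data

while True:
    target = observation.target   # DataFrame with 'id' and a 'y' to fill in
    target['y'] = 0.0             # trivial all-zero prediction for illustration
    observation, reward, done, info = env.step(target)
    # 'reward' is the R score for this single timestamp -- the noisy per-day
    # signal described above, not the averaged final score.
    if done:
        print(info['public_score'])
        break
```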
The core of my script was initially a blend of a popular kernel using ExtraTreesRegressor, combined with one using XGBoost, along with pandas manipulation that carried previous values forward as extra features.
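Not the actual script, but the general shape was something like the sketch below; the toy data, column names, and blend weights are all illustrative:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the competition data: per-instrument, per-timestamp rows
# with anonymized features and the target 'y'.
n = 2000
df = pd.DataFrame({
    'id': rng.integers(0, 50, n),
    'timestamp': rng.integers(0, 200, n),
    'technical_20': rng.normal(size=n),
    'technical_30': rng.normal(size=n),
})
df['y'] = 0.1 * df['technical_20'] - 0.05 * df['technical_30'] \
          + rng.normal(scale=0.1, size=n)

def add_lag_features(frame, cols):
    # Carry each instrument's previous value of each feature forward as an
    # extra column, giving the trees one step of history to work with.
    frame = frame.sort_values(['id', 'timestamp']).copy()
    for c in cols:
        frame[c + '_prev'] = frame.groupby('id')[c].shift(1)
    return frame

cols = ['technical_20', 'technical_30']
df = add_lag_features(df, cols).fillna(0.0)
features = cols + [c + '_prev' for c in cols]

et = ExtraTreesRegressor(n_estimators=100, max_depth=4, n_jobs=-1, random_state=0)
et.fit(df[features], df['y'])

xg = xgb.XGBRegressor(n_estimators=100, max_depth=4, learning_rate=0.05)
xg.fit(df[features], df['y'])

# Fixed-weight blend of the two models' predictions.
pred = 0.6 * et.predict(df[features]) + 0.4 * xg.predict(df[features])
```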
I tried to be careful to prevent overfitting - picking solutions which were stable on both the second half of the public data and the public half of the leaderboard data. Unfortunately, this meant I didn't gain enough score on the private half, so I dropped from #12 to #51. (My best submissions would have gotten me to about #28.)
Being blind to what was in the test set. Each time frame (each half of the public data, and the two splits in the private data) behaved very differently. This not only caused major overfitting issues, but fed into...
Having only two (successful) submissions per day. There were ways to cheat by intentionally erroring out submissions, but I didn't want to use them. So it was very easy to have something I thought would be good... and then it wasn't.
And my biggest mistake was assuming the private LB would act like an extrapolation of the public LB. In the end I gained some score, but not nearly as much as the top performers.
paulperry did more research: https://github.com/paulperry/kaggle/blob/master/two-sigma-financial-modeling/models/other_models.ipynb
Always look for a simpler solution!
On a 'blind' competition like this one, don't worry about the public leaderboard too much.
Don't spend too much time tuning what you already have (unless you're too tired to do something new :) )
(And write up notes right after the end of the competition, even if you're tired.)
Oussama Errabia added:
When ensembling, make sure the various members of the ensemble are different, to take advantage of different parts of the data.
Sometimes, it's not all about generalization. (I can speak to that!)
Trust your CV!
Kaggle's new corporate overlords (Google Cloud) recently held a conference with a lot of interesting talks. I watched these two on TensorFlow:
https://www.youtube.com/watch?v=u4alGiomYP4 - TensorFlow and Deep Learning without a PhD, Part 1 (Google Cloud Next '17)
So I decided to play with it a little bit - I'm hoping to work it into more Kaggle competitions as time goes on, but for now I've just got this fun demonstration that I started at the beginning of the talk.
Instead of Shakespeare, I fed in some David Weber ebooks, and in about a half hour to an hour (and with some luck), the neural network actually learns enough Engrish to make quasi-coherent sentences and paragraphs.
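The talk builds its RNN in raw TensorFlow; as a stand-in, here's a compact Keras character-level sketch of the same idea (the Shakespeare file is the one used in TensorFlow's tutorials, and any plain-text ebook can be swapped in):

```python
import numpy as np
from tensorflow import keras

path = keras.utils.get_file(
    'shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path).read()[:100_000]           # keep training quick

chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}

# Cut the text into fixed-length sequences; predict the next character.
seq_len, step = 40, 3
X = np.array([[idx[c] for c in text[i:i + seq_len]]
              for i in range(0, len(text) - seq_len, step)])
y = np.array([idx[text[i + seq_len]]
              for i in range(0, len(text) - seq_len, step)])

model = keras.Sequential([
    keras.layers.Embedding(len(chars), 64),
    keras.layers.LSTM(128),
    keras.layers.Dense(len(chars), activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(X, y, batch_size=128, epochs=5)

# Generate text by sampling one character at a time and feeding it back in.
out = text[:seq_len]
for _ in range(300):
    probs = model.predict(np.array([[idx[c] for c in out[-seq_len:]]]),
                          verbose=0)[0].astype('float64')
    probs /= probs.sum()                      # guard against float32 drift
    out += chars[np.random.choice(len(chars), p=probs)]
print(out)
```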