Kaggle Talk 3/22/2017

(revised 3/23/2017)

This is going to be an overview of some of my Kaggle competitions - in particular the recent Two Sigma Financial Modeling competition.

I've been Kaggling since 2015 - I've gotten two #19 finishes (a bit short of a gold medal!) and a few other decent finishes :)

My kaggle profile is https://www.kaggle.com/happycube.

About Kaggle Competitions

Kaggle is a Data Science site best known for its competitions, though it now also hosts datasets outside of a competition context. It was bought by Google Cloud in March 2017, and the long-term effects of that are not fully known as of this writing.

There are usually 4-5 competitions running at once - a mix of image-based and statistical competitions. Each one has hundreds of participants, from people who only submit one sample submission to fierce competitors.

Competitions consist of a training and test set. In most competitions, the test set is split - during the competition you see only a public subset, with the larger private leaderboard shown after the competition ends. In some others - usually large imaging competitions - a second test set is revealed near the end of the competition.

About Two Sigma Financial Modeling

This was Kaggle's first code competition: instead of submitting predictions, Python kernels were uploaded and run on Kaggle's cloud servers. (Unfortunately, submissions are now offline and there's no test data yet(?), so there are a few things I'd like to look at but can't right now...)

This added a real constraint for all competitors, since there was a 60-minute runtime limit (using two cores of a ~3 GHz Xeon server box). It was possible to encode relatively small binaries of models inside a kernel (Gilberto Titericz Junior's team did this), but I didn't attempt it myself.

Data overview

This dataset contains anonymized features pertaining to a time-varying value for a financial instrument. Each instrument has an id. Time is represented by the 'timestamp' feature and the variable to predict is 'y'. No further information will be provided on the meaning of the features, the transformations that were applied to them, the timescale, or the type of instruments that are included in the data.

Basically, it was about predicting daily time-series data for various financial instruments that came and went during the timeframe. In the middle of the competition, the IDs and timestamps on the server side were changed to prevent kernels from hardcoding explicit data references.

Evaluation metric

https://www.kaggle.com/c/two-sigma-financial-modeling#evaluation - the R metric was used (somewhat ironic, since R kernels were not supported in the code competition ;)

The only view into the test set was the R score reported for the previous day. This feedback was a flawed signal, however, because a single day's R shifts considerably with that day's average y value, while the overall result is computed across all days together.
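
For reference, here's a minimal sketch of the R metric as I read the evaluation page - the signed square root of R² - written as a plain Python function (this is my own reconstruction, not Kaggle's scoring code):

import numpy as np

def r_score(y_true, y_pred):
    # R = sign(R^2) * sqrt(|R^2|), where R^2 is the usual coefficient of determination
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return np.sign(r2) * np.sqrt(np.abs(r2))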

Useful tools

  • Devin and Frans Slothauber developed an emulation of the kagglegym environment used on the servers for running the code, allowing people (including me) to validate what they were working on at home (a skeleton of the prediction loop is sketched after this list).
  • XGBoost is used in most non-imaging competitions - it's a very effective gradient-boosted tree model (a cousin of Random Forest) with a lot of tunables for good flexibility.
  • Kaggle scripts/kernels run inside a Docker container which is easy to install and update - allowing you to make sure that you're always running the same versions of code that your submissions use. More info here: http://blog.kaggle.com/2016/02/05/how-to-get-started-with-data-science-in-containers/
    • They revised the container several times during the competition - even upgrading Python from 3.5 to 3.6. At one point I thought they had put in slower code, but it turned out they were rescoring everything so that people couldn't use hardcoded timestamps and stock IDs...
  • For cheap CPU and RAM - if you have somewhere to stash the box - look into used Xeon (>=Nehalem) server hardware.
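
As an aside, here's roughly what the kagglegym prediction loop (from the first bullet above) looked like, whether on Kaggle's servers or in the emulator - a from-memory sketch with a placeholder all-zero prediction, not exact starter code:

import kagglegym

env = kagglegym.make()
observation = env.reset()            # observation.train holds the training DataFrame

while True:
    target = observation.target     # DataFrame with 'id' and 'y' columns to fill in
    target['y'] = 0.0               # placeholder: predict zero for every instrument
    observation, reward, done, info = env.step(target)  # reward is that day's R score
    if done:
        print('Public score:', info['public_score'])
        break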

The main feature(s)

  • While no direct information was revealed, it didn't take long to learn that technical_20 and technical_30 were the two most influential features. For a while, it was believed that technical_30 broke down on the training set, so technical_20 was the most trusted.
  • I stumbled upon the truth when I ran a reverse time-travel experiment, fitting to the previous day's y value (pandas' groupby is very good for this) - I worked out a linear model using technical_20, technical_30, and their previous values, and got about a .8 R score between them and the previous day's y, which provided a major source of signal (a rough sketch is below).
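
Here's a rough sketch of that kind of lagged-feature experiment - not my exact code; 'train' stands in for the kagglegym training DataFrame (o.train), and the linear model and column choices are illustrative:

import pandas as pd
from sklearn.linear_model import LinearRegression

# 'train' is assumed to be the competition training DataFrame (o.train from kagglegym)
df = train.sort_values(['id', 'timestamp']).copy()

# previous-day values of the two key features and of y, per instrument id
for col in ['technical_20', 'technical_30', 'y']:
    df['prev_' + col] = df.groupby('id')[col].shift(1)
df = df.dropna(subset=['prev_technical_20', 'prev_technical_30', 'prev_y'])

# fit the *previous* day's y from the current and lagged technical_20/30 values
X = df[['technical_20', 'technical_30', 'prev_technical_20', 'prev_technical_30']]
lr = LinearRegression().fit(X, df['prev_y'])
print('R^2 against previous day y:', lr.score(X, df['prev_y']))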

The leaderboards

  • Because of the time-based nature of this competition, the public leaderboard (based on ~1/3 of the hidden test data) and the private leaderboard (the rest) were very different. Each of the top 3 on the public leaderboard fell into the 900s. The eventual top finishers posted significantly higher R values on the private set, while mine only went up by about .002 - which dropped me from #12 to #51.
  • This was compounded by only being able to select two submissions for the private LB. My best (unselected) model would have placed around #29... although many other competitors' unselected models would have placed even higher.
  • It's quite possible that many of the people at the top of the private leaderboard cheated by intentionally erroring out submissions to 'probe' the leaderboard and/or discard bad runs (for instance, using the previous day's y derived from technical_20/30 to estimate the LB score) - even three weeks later, the final private leaderboard still hasn't been issued!

What I did

The core of my submission was initially a blend of a popular kernel using ExtraTreesClassifier and one using XGBoost, along with Pandas manipulation that kept previous values around to provide more features.
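
A minimal sketch of that kind of two-model blend follows - synthetic stand-in data and illustrative weights, and I use ExtraTreesRegressor here (since y is continuous) rather than the exact setup of the public kernels:

import numpy as np
import xgboost as xgb
from sklearn.ensemble import ExtraTreesRegressor

# synthetic stand-ins for the engineered feature matrix and target
rng = np.random.RandomState(0)
X_train, y_train = rng.normal(size=(1000, 20)), rng.normal(size=1000)
X_test = rng.normal(size=(200, 20))

# two fairly different base models: extremely-randomized trees and gradient boosting
et = ExtraTreesRegressor(n_estimators=100, max_depth=4, n_jobs=-1, random_state=0).fit(X_train, y_train)
xg = xgb.XGBRegressor(n_estimators=100, max_depth=4, learning_rate=0.05).fit(X_train, y_train)

# simple weighted blend of the two predictions (weights here are illustrative only)
y_pred = 0.6 * et.predict(X_test) + 0.4 * xg.predict(X_test)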

I tried to be careful to prevent overfitting - picking solutions which were stable on both the second half of the public (training) data and the public portion of the leaderboard data. This kept me from collapsing on the leaderboard, but it also meant I did not take advantage of the properties of the hidden data.

The hardest parts

  • Being blind as to what's in the test set (each time frame - each half of the public data, and the two splits in the private data - was very different). This not only caused major overfitting issues, but fed into...

  • Having only two (successful) submissions per day. There were ways to cheat with intentionally erroring out submissions, but I didn't want to use them. So it was very easy to have something I thought would be good... and then it wasn't.

And my biggest mistake was assuming the private LB would act like an extrapolation of the public LB.

Other (better and/or clever-er) solutions:

  • https://www.kaggle.com/tks0123456789/two-sigma-financial-modeling/xgb-500-600-001

    • #16 on the private LB - and #525 public! This kernel was both simple and daring: it uses a mixture of eight XGBoost models with no feature engineering.

    • For validation, tks used timestamps 905 to 1405/1505, which excluded the most volatile area of the second half of the data. This choice reduced the public LB score, but was very stable on the private LB (a rough sketch of this kind of split follows this list).
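
A rough sketch of that kind of timestamp-based split ('train' again standing in for the training DataFrame; the cutoffs are the ones from the bullet above):

# hold out the chosen timestamp range for validation, train on the rest
val_mask = (train.timestamp >= 905) & (train.timestamp <= 1405)
train_part, valid_part = train[~val_mask], train[val_mask]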

paulperry did more research: https://github.com/paulperry/kaggle/blob/master/two-sigma-financial-modeling/models/other_models.ipynb

A bit of very tight Python code from tks

I found this to be a nice example of what you can do with just a few lines of Python (using Pandas and XGBoost):

# o is the kagglegym observation; o.train is the training DataFrame, cols the feature column list
# per-timestamp mean of y, then its 10-day rolling standard deviation (a volatility estimate)
roll_std = o.train.groupby('timestamp').y.mean().rolling(window=10).std().fillna(0)
# boolean mask keeping only the low-volatility timestamps
train_idx = o.train.timestamp.isin(roll_std[roll_std < 0.009].index)

y_train = o.train['y'][train_idx]
# pack the selected rows and feature columns into XGBoost's DMatrix format
xgmat_train = xgb.DMatrix(o.train.loc[train_idx, cols], label=y_train)

  • The first line groups the training data by timestamp, takes the mean of y for each timestamp, computes the 10-day rolling standard deviation of those daily means, and fills the resulting NaNs with 0 so they don't propagate.

  • The second line creates a boolean mask that keeps only the lower-volatility days.

  • The last two lines apply that mask (along with the column list set up earlier in the code) and pack the result into the DMatrix format needed by XGBoost (which is written in C++).

Takeaways/Suggestions:

  • Always look for a simpler solution!

  • On a 'blind' competition like this one, don't worry about the public leaderboard too much.

    • (and don't get too excited if you wind up in the top 10!)
  • Don't spend too much time tuning what you already have (unless you're too tired to do something new :) )

  • (And write up notes right after the end of the competition, even if you're tired.)

Oussama Errabia added:

  • When ensembling, make sure the various members of the ensemble are different, to take advantage of different parts of the data.

  • Sometimes, it's not all about generalization. (I can speak to that!)

  • Trust your CV!

More random thoughts

  • I expect Kaggle to do more code competitions once the Google Cloud integration is complete, since they will have access to rather large datacenter capacity ;) (One that depends on Google Cloud ML?)

  • From here I expect to work more with TensorFlow/Keras to continue learning new techniques. I find RNN character generation (e.g. Shakespeare/code generation) very intriguing, even though it can't understand anything conceptually... yet.