(revised 3/23/2017)
This is going to be an overview of some of my Kaggle competitions - in particular the recent Two Sigma Financial Modeling competition.
I've been Kaggling since 2015 - I've gotten two #19 finishes (a bit short of a gold medal!) and a few other decent finishes :)
My Kaggle profile is https://www.kaggle.com/happycube.
Kaggle is a data science site best known for its competitions, but it has more recently started hosting datasets outside of a competition context. It was acquired by Google Cloud in March 2017, and the long-term effects of that are not yet known as of this revision.
There are usually 4-5 competitions running at once - a mix of image-based and statistical competitions. Each one has hundreds of participants, from people who only submit one sample submission to fierce competitors.
Competitions consist of a training and test set. In most competitions, the test set is split - during the competition you see only a public subset, with the larger private leaderboard shown after the competition ends. In some others - usually large imaging competitions - a second test set is revealed near the end of the competition.
This was Kaggle's first code competition, where instead of submitting predictions, Python kernels were uploaded and run on Kaggle's cloud servers. (Unfortunately, submissions are now offline and there's no test data yet(?), so there are a few things I'd like to look at but can't right now...)
This added a real constraint for all competitors, since there was a 60-minute runtime limit (using two cores of a ~3GHz Xeon server box). It was possible to encode relatively small binaries of models into the kernel (Gilberto Titericz Junior's team did this), but I didn't attempt it myself.
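For reference, a kernel skeleton for the kagglegym API looked roughly like this - I'm writing it from memory, so treat the details as approximate, and it only runs inside Kaggle's kernel environment:

import kagglegym

env = kagglegym.make()        # only available on Kaggle's servers
observation = env.reset()     # observation.train holds the training DataFrame

while True:
    target = observation.target    # 'id' and 'y' columns for the current timestamp
    target['y'] = 0.0              # fill in your predictions here
    observation, reward, done, info = env.step(target)
    # reward is the R score for the previous timestamp - the only feedback you get
    if done:
        break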
This dataset contains anonymized features pertaining to a time-varying value for a financial instrument. Each instrument has an id. Time is represented by the 'timestamp' feature and the variable to predict is 'y'. No further information will be provided on the meaning of the features, the transformations that were applied to them, the timescale, or the type of instruments that are included in the data.
Basically, it was about predicting daily time-series data for various financial instruments that came and went over the timeframe. In the middle of the competition, the IDs and timestamps on the server side were changed to prevent kernels from hard-coding explicit data references.
https://www.kaggle.com/c/two-sigma-financial-modeling#evaluation - the R metric was used (somewhat ironic, since R kernels were not supported in the code competition ;)
The only view into the test set was the R score for the previous day. However, this was a flawed signal, because the daily score shifts considerably depending on that day's mean y, while the final score is computed against the mean over all days.
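For reference, the metric is the signed square root of R². A minimal NumPy sketch (the function name is mine):

import numpy as np

def r_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, with the mean of y_true as the baseline
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # the leaderboard reports the signed square root of R^2
    return np.sign(r2) * np.sqrt(np.abs(r2))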
The core of my submission started as a blend of a popular kernel using ExtraTreesClassifier and one using XGBoost, along with pandas manipulation to carry previous values forward as additional features.
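A minimal sketch of the lagged-feature idea in pandas - the data and column names here are made up, since the real features were anonymized:

import pandas as pd

# toy stand-in for the real training data
df = pd.DataFrame({
    'id':        [1, 1, 1, 2, 2],
    'timestamp': [0, 1, 2, 0, 1],
    'feat':      [0.5, 0.7, 0.6, 1.0, 0.9],
})

df = df.sort_values(['id', 'timestamp'])
# previous timestamp's value of the feature, per instrument id
df['feat_prev'] = df.groupby('id')['feat'].shift(1)
# day-over-day change, often more useful than the raw lagged value
df['feat_diff'] = df['feat'] - df['feat_prev']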
I tried to be careful to prevent overfitting - picking solutions which were stable both on the second half of the training data and on the public portion of the leaderboard. This kept me from collapsing on the leaderboard, but I did not take advantage of the properties of the hidden data.
Being blind to what was in the test set (each time frame - each half of the public data and the two splits of the private data - behaved very differently). This not only caused major overfitting issues, but fed into...
Having only two (successful) submissions per day. There were ways to cheat by intentionally erroring out submissions, but I didn't want to use them. So it was very easy to have something I thought would be good... and then it wasn't.
And my biggest mistake was assuming the private LB would act like an extrapolation of the public LB.
https://www.kaggle.com/tks0123456789/two-sigma-financial-modeling/xgb-500-600-001
#16 private LB - and #525 public! This kernel was both simple and daring - it uses a mixture of eight XGBoost models with no feature engineering.
For validation, tks excluded the most volatile area of the second half, using timestamps 905 to 1405/1505 as the validation range. This reduced the public LB score, but was very stable on the private LB.
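A rough sketch of that style of approach - several XGBoost models averaged together, validated against a fixed timestamp range. The data, parameters, and boosting rounds below are placeholders, not tks's actual settings:

import numpy as np
import pandas as pd
import xgboost as xgb

# toy stand-in for the real (anonymized) training data
rng = np.random.RandomState(0)
train = pd.DataFrame({'timestamp': np.repeat(np.arange(1812), 5),
                      'f0': rng.randn(9060), 'f1': rng.randn(9060),
                      'y': rng.randn(9060) * 0.01})
cols = ['f0', 'f1']

# hold out a fixed timestamp range for validation (905-1405 here)
valid_mask = train.timestamp.between(905, 1405)
dtrain = xgb.DMatrix(train.loc[~valid_mask, cols], label=train.loc[~valid_mask, 'y'])
dvalid = xgb.DMatrix(train.loc[valid_mask, cols], label=train.loc[valid_mask, 'y'])

# illustrative parameters only
params = {'eta': 0.05, 'max_depth': 4, 'subsample': 0.9, 'objective': 'reg:linear'}
models = [xgb.train(dict(params, seed=s), dtrain, num_boost_round=100) for s in range(8)]

# the ensemble prediction is just the average over the eight models
pred = np.mean([m.predict(dvalid) for m in models], axis=0)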
paulperry did more research: https://github.com/paulperry/kaggle/blob/master/two-sigma-financial-modeling/models/other_models.ipynb
I found this to be a nice example of what you can do with just a few lines of Python (using pandas and XGBoost):
roll_std = o.train.groupby('timestamp').y.mean().rolling(window=10).std().fillna(0)
train_idx = o.train.timestamp.isin(roll_std[roll_std < 0.009].index)
y_train = o.train['y'][train_idx]
xgmat_train = xgb.DMatrix(o.train.loc[train_idx, cols], label=y_train)
The first line groups the training data by timestamp, takes the mean y for each timestamp, computes the standard deviation over a 10-day rolling window of those means, and fills any NaNs with zero.
The second line creates a boolean mask that selects only the lower-volatility days (rolling standard deviation below 0.009).
The last two lines apply that mask (with column restrictions set up earlier in the code) to build the DMatrix format needed by XGBoost (which is written in C++).
Always look for a simpler solution!
On a 'blind' competition like this one, don't worry about the public leaderboard too much.
Don't spend too much time tuning what you already have (unless you're too tired to do something new :) )
(And write up notes right after the end of the competition, even if you're tired.)
I expect Kaggle to do more code competitions once the Google Cloud integration is complete, since they will have access to rather large datacenter capacity ;) (One that depends on Google Cloud ML?)
From here I expect to work more with TensorFlow/Keras to continue learning new techniques. I find RNN character generation (e.g. Shakespeare or code generation) very intriguing, even though it can't understand anything conceptually... yet.