In [20]:
from traitlets.config.manager import BaseJSONConfigManager
path = "/Users/matthiaszunhammer/anaconda/etc/jupyter/nbconfig"
cm = BaseJSONConfigManager(config_dir=path)
cm.update('livereveal', {
              'theme': 'simple',
              'transition': 'none',
              'start_slideshow_at': 'selected',
})

cm.update('livereveal', {
              'width': 1024,
              'height': 768,
})


Out[20]:
{'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'none',
 'width': 1024}

Background

After a Sunday leisure trip to the "Rennbahn Düsseldorf", I got the idea that horse racing is an ideal training ground for practicing machine learning (ML) and Big Data handling, because:

  • Interesting topic and good conversation starter (I'm not actually into betting, though)
  • Lots of cases for prediction available (200-800 races/day)
  • Lots of data available for each horse (jockey weight, horse age, past performance...)
  • "Parimutuel betting":, i.e. you compete against all other betters, rather than a bookie and his ML team (after paying a hefty house commission of approx. 15-20%)

Aims

  • Improve Python skills (I mainly work with MATLAB and R in neuroscience)
  • Get exposure to database systems like MongoDB and SQL (I mainly work with single-file-based data in neuroscience)
  • Ultimately: find out if it is possible to "beat the odds" with machine learning (which I doubt)

Roadmap

  1. Get data
  2. Clean data
  3. Train model
  4. Test model
  5. Reflect on results

1.) Where to get data? – The problem

Some sites offer data (e.g. Betwise, Betfair). But:

  • A) They usually charge money (>100 € at Betwise)
  • B) They usually offer a limited scope of variables (especially Betfair)
  • C) They usually offer data in a format different from the race sheets of upcoming races, making it difficult to implement an ML workflow

1.) Where to get data? – The solution

  • Betting sites usually provide lots of data for upcoming and past races.
  • The requests package for Python offers a powerful tool to download webpages.
  • ... so I wrote a couple of functions to download and parse data from one of the big betting services (a minimal sketch follows).
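
For illustration, a minimal sketch of such a download; the URL is a placeholder, not the actual service:

import requests

url = "https://www.example-betting-site.com/races/12345"  # placeholder, not the real host
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors
html = response.text         # raw HTML, parsed in step 2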

1.) Where to get data? – The solution: scraping (cont.)

In this process, called "web-scraping", I learned:

  • how to use the requests package
  • how to use PHP query strings to access data
  • how to use HTTP's GET and POST methods to log in automatically
  • some JSON (to send HTTP headers)
  • that servers do not like receiving (too many) requests...
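
A minimal sketch of the login step, assuming a hypothetical endpoint and form-field names (the real ones live in a private .ini file):

import requests

session = requests.Session()                      # keeps the login cookie across requests
headers = {"User-Agent": "Mozilla/5.0"}           # some servers reject the default user agent
payload = {"user": "my_user", "pass": "my_pass"}  # hypothetical form-field names
login = session.post("https://www.example-betting-site.com/login",  # placeholder URL
                     data=payload, headers=headers, timeout=10)
login.raise_for_status()
# all further GETs through `session` are authenticated; pausing between
# requests keeps the server happy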

1.) Where to get data? – The solution: scraping (cont.)

The code can be found in the "scrape" module.

Note: it will not work on your machine, since the host info and my login credentials are stored in a local .ini file for privacy reasons.


In [ ]:
import hracing.scrape
hracing.scrape.main()

2.) Parse data — The problem

A pretty simple problem:

  • The information is sitting in HTML elements
  • It needs to be extracted and stored in some readily accessible format

2.) Parse data — The solution

  • At first, I extracted all data with regular expressions...
  • ... then I learned that what I was trying to do is called parsing...
  • ... and what parsers are good for.
  • BeautifulSoup makes this task easy (see the sketch below).
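
A minimal, self-contained example of the BeautifulSoup approach; the HTML snippet and class names are made up, the real race pages differ:

from bs4 import BeautifulSoup

html = """<table class="runners">
            <tr><td class="horse">Sea Biscuit</td><td class="jockey">J. Doe</td></tr>
            <tr><td class="horse">Secretariat</td><td class="jockey">A. Smith</td></tr>
          </table>"""
soup = BeautifulSoup(html, "html.parser")
horses = [td.get_text(strip=True) for td in soup.select("table.runners td.horse")]
print(horses)  # ['Sea Biscuit', 'Secretariat']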

3.) Storing data — The problem

  • One race consists of the following info:
    • Race-level info (e.g. race_ID, time of day, location...)
    • Horse-level info (e.g. name, weight, sex, jockey...)
    • Short forms: a table describing the latest performances of each horse
    • Long forms: a table describing the all-time performance of each horse
    • A table describing the finish (sometimes for all horses, sometimes only the first three, etc., depending on the track)

      → A hierarchical data structure: past performances are nested in horses; horses and finishers are nested in races.
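
As a sketch, that structure maps naturally onto a nested Python dict (field names are illustrative, not the actual schema):

race = {
    "race_id": 12345,
    "time": "2017-06-18 14:30",
    "location": "Duesseldorf",
    "horses": [
        {"name": "Sea Biscuit", "weight": 54.5, "sex": "m", "jockey": "J. Doe",
         "short_form": [{"date": "2017-05-01", "rank": 2}],
         "long_form": [{"year": 2016, "starts": 12, "wins": 3}]},
    ],
    "finish": [{"rank": 1, "name": "Sea Biscuit"}],
}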

3.) Storing data — The solution, take 1

  • create a Race class with variables stored as properties
  • save class instances in separate "pickled" files (sketched below)
  • give up...

++ Good to practice class syntax

-- Inefficient data storage (lots of disk space, inaccessible, inflexible)
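
A minimal sketch of take 1; class and attribute names are illustrative:

import pickle

class Race:
    """One race: race-level info plus a list of per-horse records."""
    def __init__(self, race_id, location, horses):
        self.race_id = race_id
        self.location = location
        self.horses = horses

race = Race(12345, "Duesseldorf", [{"name": "Sea Biscuit", "weight": 54.5}])
with open("race_12345.pkl", "wb") as f:
    pickle.dump(race, f)  # one pickle per race -- hence the disk-space problem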

3.) Storing data — The solution, take 2

  • create an SQL database
  • save races in a relation keyed by race_ID and horses in a relation keyed by horse_ID (sketched below)
  • give up...

++ Good to practice SQL syntax

-- Not an efficient way to store this data, as it cannot handle the data hierarchy and is inflexible (esp. when new variables become available)
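
A minimal sketch of take 2, using sqlite3 for illustration (the actual database system may have differed):

import sqlite3

con = sqlite3.connect("hracing.db")
con.execute("CREATE TABLE IF NOT EXISTS races (race_id INTEGER PRIMARY KEY, location TEXT)")
con.execute("""CREATE TABLE IF NOT EXISTS horses (
                   horse_id INTEGER PRIMARY KEY,
                   race_id INTEGER REFERENCES races(race_id),
                   name TEXT, weight REAL)""")
con.execute("INSERT OR REPLACE INTO races VALUES (12345, 'Duesseldorf')")
con.execute("INSERT OR REPLACE INTO horses VALUES (1, 12345, 'Sea Biscuit', 54.5)")
con.commit()
# every newly scraped variable needs an ALTER TABLE -- hence 'inflexible'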

3.) Storing data — The solution, take 3

  • parse the data into a hierarchical dict
  • create a MongoDB database and store the dicts keyed by race ID (sketched below)
  • :)

++ Efficient storage, conserves natural data hierarchy, flexible if new variables become available

-- Why did I not try this earlier?
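
A minimal sketch of take 3, assuming a local MongoDB instance and the pymongo package:

from pymongo import MongoClient

races = MongoClient("localhost", 27017).hracing.races  # db 'hracing', collection 'races'

race = {"_id": 12345,  # the race ID doubles as the document key
        "location": "Duesseldorf",
        "horses": [{"name": "Sea Biscuit", "weight": 54.5,
                    "past_performances": [{"date": "2017-05-01", "rank": 2}]}],
        "finish": [{"rank": 1, "name": "Sea Biscuit"}]}
races.replace_one({"_id": race["_id"]}, race, upsert=True)  # idempotent on re-scrapes

Nested lists and dicts are stored as-is, so the hierarchy from the previous slide survives unchanged.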

4.) Machine learning

5.) Bet?