In [20]:
from traitlets.config.manager import BaseJSONConfigManager
path = "/Users/matthiaszunhammer/anaconda/etc/jupyter/nbconfig"
cm = BaseJSONConfigManager(config_dir=path)
cm.update('livereveal', {
              'theme': 'simple',
              'transition': 'none',
              'start_slideshow_at': 'selected',
})

cm.update('livereveal', {
              'width': 1024,
              'height': 768,
})


Out[20]:
{'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'none',
 'width': 1024}

Background

After a Sunday leisure trip to the "Rennbahn Düsseldorf", I got the idea that horse racing is an ideal training ground for practicing machine learning (ML) and Big Data handling, because:

  • Interesting topic and good conversation starter (I'm not actually into betting, though)
  • Lots of cases for prediction available (200-800 races/day)
  • Lots of data available for each horse (jockey weight, horse age, past performance...)
  • "Parimutuel betting":, i.e. you compete against all other betters, rather than a bookie and his ML team (after paying a hefty house commission of approx. 15-20%)

Aims

  • Improve Python skills (I mainly work with MATLAB and R in neuroscience)
  • Get exposure to database systems like MongoDB and SQL (I mainly work with single-file-based data in neuroscience)
  • Ultimately: find out if it is possible to "beat the odds" with machine learning (which I doubt)

Roadmap

  1. Get data
  2. Clean data
  3. Train model
  4. Test model
  5. Reflect on results

1.) Where to get data? – The problem

Some sites offer data (e.g. Betwise, Betfair). But:

  • A) They usually charge money (>100 € at Betwise)
  • B) They usually offer a limited scope of variables (especially Betfair)
  • C) They usually offer data in a format different from the race sheets of upcoming races, making it difficult to implement an ML workflow

1.) Where to get data? – The solution

  • Betting sites usually provide lots of data for upcoming and past races.
  • The requests package for Python offers a powerful tool to download webpages.
  • ... so I wrote a couple of functions to download and parse data from one of the big betting services (a minimal sketch follows).
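
For illustration, a minimal sketch of such a download; the URL is a placeholder, not the actual service:

import requests

url = "https://www.example-betting-site.com/races/12345"  # placeholder, not the real host
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors
html = response.text         # raw HTML, parsed in step 2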

1.) Where to get data? – The solution: scraping (cont.)

In this process, called "web-scraping", I learned:

  • how to use the requests package
  • how to use PHP query strings to access data
  • how to use HTTP's GET and POST methods to log in automatically
  • some JSON (to send HTTP headers)
  • that servers do not like receiving (too many) requests...
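
A minimal sketch of the login step, assuming a hypothetical endpoint and form-field names (the real ones live in a private .ini file):

import requests

session = requests.Session()                      # keeps the login cookie across requests
headers = {"User-Agent": "Mozilla/5.0"}           # some servers reject the default user agent
payload = {"user": "my_user", "pass": "my_pass"}  # hypothetical form-field names
login = session.post("https://www.example-betting-site.com/login",  # placeholder URL
                     data=payload, headers=headers, timeout=10)
login.raise_for_status()
# all further GETs through `session` are authenticated; pausing between
# requests keeps the server happy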

1.) Where to get data? – The solution: scraping (cont.)

The code can be found in the "scrape" module.

Note: it will not work on your machine, since the host info and my login credentials are stored in a local .ini file for privacy reasons.


In [ ]:
import hracing.scrape
hracing.scrape.main()

2.) Parse data — The problem

A pretty simple problem:

  • The information is sitting in HTML elements
  • It needs to be extracted and stored in some readily accessible format

2.) Parse data — The solution

  • At first, I extracted all data with regular expressions...
  • ... then I learned that what I was trying to do is called parsing...
  • ... and what parsers are good for.
  • BeautifulSoup makes this task easy (see the sketch below).
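
A minimal, self-contained example of the BeautifulSoup approach; the HTML snippet and class names are made up, the real race pages differ:

from bs4 import BeautifulSoup

html = """<table class="runners">
            <tr><td class="horse">Sea Biscuit</td><td class="jockey">J. Doe</td></tr>
            <tr><td class="horse">Secretariat</td><td class="jockey">A. Smith</td></tr>
          </table>"""
soup = BeautifulSoup(html, "html.parser")
horses = [td.get_text(strip=True) for td in soup.select("table.runners td.horse")]
print(horses)  # ['Sea Biscuit', 'Secretariat']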

3.) Storing data — The problem

  • One race consists of the following info:
    • Race-level info (e.g. race_ID, time of day, location...)
    • Horse-level info (e.g. name, weight, sex, jockey...)
    • Short forms: a table describing the latest performances of each horse
    • Long forms: a table describing the all-time performance of each horse
    • A table describing the finish (sometimes for all horses, sometimes only the first three, etc., depending on the track)

      → A hierarchical data structure: past performances are nested in horses; horses and finishers are nested in races.
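
As a sketch, that structure maps naturally onto a nested Python dict (field names are illustrative, not the actual schema):

race = {
    "race_id": 12345,
    "time": "2017-06-18 14:30",
    "location": "Duesseldorf",
    "horses": [
        {"name": "Sea Biscuit", "weight": 54.5, "sex": "m", "jockey": "J. Doe",
         "short_form": [{"date": "2017-05-01", "rank": 2}],
         "long_form": [{"year": 2016, "starts": 12, "wins": 3}]},
    ],
    "finish": [{"rank": 1, "name": "Sea Biscuit"}],
}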

3.) Storing data — The solution, take 1

  • create a Race class with variables stored as properties
  • save class instances in separate "pickled" files (sketched below)
  • give up...

++ Good to practice class syntax

-- Inefficient data storage (lots of disk space, inaccessible, inflexible)
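
A minimal sketch of take 1; class and attribute names are illustrative:

import pickle

class Race:
    """One race: race-level info plus a list of per-horse records."""
    def __init__(self, race_id, location, horses):
        self.race_id = race_id
        self.location = location
        self.horses = horses

race = Race(12345, "Duesseldorf", [{"name": "Sea Biscuit", "weight": 54.5}])
with open("race_12345.pkl", "wb") as f:
    pickle.dump(race, f)  # one pickle per race -- hence the disk-space problem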

3.) Storing data — The solution, take 2

  • create an SQL database
  • save races in a relation keyed by race_ID and horses in a relation keyed by horse_ID (sketched below)
  • give up...

++ Good to practice SQL syntax

-- Not an efficient way to store this data, as it cannot handle the data hierarchy and is inflexible (esp. when new variables become available)
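
A minimal sketch of take 2, using sqlite3 for illustration (the actual database system may have differed):

import sqlite3

con = sqlite3.connect("hracing.db")
con.execute("CREATE TABLE IF NOT EXISTS races (race_id INTEGER PRIMARY KEY, location TEXT)")
con.execute("""CREATE TABLE IF NOT EXISTS horses (
                   horse_id INTEGER PRIMARY KEY,
                   race_id INTEGER REFERENCES races(race_id),
                   name TEXT, weight REAL)""")
con.execute("INSERT OR REPLACE INTO races VALUES (12345, 'Duesseldorf')")
con.execute("INSERT OR REPLACE INTO horses VALUES (1, 12345, 'Sea Biscuit', 54.5)")
con.commit()
# every newly scraped variable needs an ALTER TABLE -- hence 'inflexible'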

3.) Storing data — The solution, take 3

  • parse the data into a hierarchical dict
  • create a MongoDB database and store the dicts keyed by race ID (sketched below)
  • :)

++ Efficient storage, conserves natural data hierarchy, flexible if new variables become available

-- Why did I not try this earlier?
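
A minimal sketch of take 3, assuming a local MongoDB instance and the pymongo package:

from pymongo import MongoClient

races = MongoClient("localhost", 27017).hracing.races  # db 'hracing', collection 'races'

race = {"_id": 12345,  # the race ID doubles as the document key
        "location": "Duesseldorf",
        "horses": [{"name": "Sea Biscuit", "weight": 54.5,
                    "past_performances": [{"date": "2017-05-01", "rank": 2}]}],
        "finish": [{"rank": 1, "name": "Sea Biscuit"}]}
races.replace_one({"_id": race["_id"]}, race, upsert=True)  # idempotent on re-scrapes

Nested lists and dicts are stored as-is, so the hierarchy from the previous slide survives unchanged.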

4.) Machine learning

5.) Bet?