Making a faster automatic GalaxyZoo classifier in Python

GalaxyZoo has a fabulous data set for machine learning. There was a Kaggle competition a few years back to create an automatic galaxy classifier based on the GalaxyZoo data; the winning entry (code, explanation) works very well, but it takes days to train. Let's see if we can create something almost as good that runs much faster. (I'm using Python 3, but I'll try to make this notebook backwards compatible with Python 2 as well.)

I. Data retrieval and handling

First, we need the data from the Kaggle competition (the raw GalaxyZoo data is also available, but we don't need nearly that much data to get started). You'll need to create a Kaggle account (free) in order to access this. Download images_training_rev1.zip (~60,000 training images), images_test_rev1.zip (~80,000 test images), and training_solutions_rev1.csv (the classifications for the training images).

You'll also need the following Python packages (and their dependencies) installed: numpy, pandas, matplotlib, scikit-image, and scikit-learn. (The deep learning section at the end will also want Keras and TensorFlow.)
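
If you're missing any of these, something like the following should work from inside the notebook (assuming a pip-based environment; adjust for conda or whatever you use):

In [ ]:
## Assumes a pip-based environment; adjust for conda etc.
!pip install numpy pandas matplotlib scikit-image scikit-learn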

Let's start by testing out image reading, since that will be pretty important for this project.


In [ ]:
## An attempt at Python 2 backward compatibility
from __future__ import absolute_import, division, print_function, unicode_literals

In [1]:
%matplotlib inline
from skimage import io
filename = "images_training_rev1/100008.jpg"
galaxy = io.imread(filename)
io.imshow(galaxy)
io.show()


Next, let's have a look at the beginning of that CSV file of galaxy classifications.


In [2]:
import pandas as pd
training_solutions = pd.read_csv("training_solutions_rev1.csv",index_col=0)
training_solutions.head()


Out[2]:
Class1.1 Class1.2 Class1.3 Class2.1 Class2.2 Class3.1 Class3.2 Class4.1 Class4.2 Class5.1 ... Class9.3 Class10.1 Class10.2 Class10.3 Class11.1 Class11.2 Class11.3 Class11.4 Class11.5 Class11.6
GalaxyID
100008 0.383147 0.616853 0.000000 0.000000 0.616853 0.038452 0.578401 0.418398 0.198455 0.0 ... 0.000000 0.279952 0.138445 0.000000 0.000000 0.092886 0.0 0.0 0.0 0.325512
100023 0.327001 0.663777 0.009222 0.031178 0.632599 0.467370 0.165229 0.591328 0.041271 0.0 ... 0.018764 0.000000 0.131378 0.459950 0.000000 0.591328 0.0 0.0 0.0 0.000000
100053 0.765717 0.177352 0.056931 0.000000 0.177352 0.000000 0.177352 0.000000 0.177352 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000
100078 0.693377 0.238564 0.068059 0.000000 0.238564 0.109493 0.129071 0.189098 0.049466 0.0 ... 0.000000 0.094549 0.000000 0.094549 0.189098 0.000000 0.0 0.0 0.0 0.000000
100090 0.933839 0.000000 0.066161 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.000000

5 rows × 37 columns

Each column gives the fraction of answers given in the Galaxy Zoo Decision Tree. (There is a nice visualization of the tree here, but annoyingly, the questions in that tree are indexed from 0, while the Kaggle set indexes them from 1.) So, for example, there are three possible answers to the first question, which is why we have columns Class1.1, Class1.2, and Class1.3. For the first image (the one we displayed as a test, image 100008), about 38.3% of users gave answer 1 to question 1, 61.7% gave answer 2, and nobody gave answer 3.
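
As a quick sanity check, every user answers the top-level question, so the three answer fractions for question 1 should sum to 1 for every galaxy:

In [ ]:
# Answer fractions for the top-level question should sum to 1 in every row.
training_solutions[["Class1.1", "Class1.2", "Class1.3"]].sum(axis=1).head()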

The overall goal is to predict the distribution of all of these answers for all of the test images, given the above distribution for the training images. Let's get to work!

II. Trying something stupid, then trying something even stupider

Let's start with one of the dumbest "algorithms" imaginable: generating an entirely random set of answer distributions for each image. We won't even have to look at the images for this one! This will give us a baseline: if we can't do better than this, we're in trouble.

First, we need to know how big our data set is.


In [29]:
import os
import time
import numpy as np
from collections import OrderedDict

# First, let's find out exactly how many files there are in the training set.
# We could get this from the training set solution table, but let's just count the files instead.
training_filenames = os.listdir("images_training_rev1/")  # positional arg keeps this Python-2-compatible
# We need to make sure these are all images, rather than anything else that somehow ended up in the folder 
# (e.g. OSX invisible files)
training_filenames = [f for f in training_filenames if os.path.splitext(f)[1] == ".jpg"]
n_files = len(training_filenames)
print("There are", n_files, "files in the training set.")

# Now we need to know how many questions there are, and how many answers there are for each question. 
# To do that, let's have a look at the GalaxyZoo decision tree:
# https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/details/the-galaxy-zoo-decision-tree
# Again, we could pull this from the training set solution table, but I don't feel like doing that, 
# because regular expressions make me sad.
questions_dict = {
    1:3,
    2:2,
    3:2,
    4:2,
    5:4,
    6:2,
    7:3,
    8:7,
    9:3,
    10:3,
    11:6
                 }
questions_dict = OrderedDict(sorted(questions_dict.items()))
n_questions = len(questions_dict)
n_columns = sum(questions_dict.values())
print("There are", n_questions, "questions and", n_columns, "possible answers in the GalaxyZoo tree.")


There are 61578 files in the training set.
There are 11 questions and 37 possible answers in the GalaxyZoo tree.
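
Just to be sure the hand-typed dictionary matches the data (a quick cross-check, nothing more):

In [ ]:
# One row per training image, one column per possible answer.
assert len(training_solutions) == n_files
assert len(training_solutions.columns) == n_columns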

Now we can create our random matrix. We'll make it not entirely stupid, and keep the probabilities normalized.


In [30]:
def randrow():
    '''
    Creates a random row of numbers.
    Takes into account the constraints imposed by the number of questions:
    namely, that the total probability for each question must sum to 1.
    (This is a "stick-breaking" scheme, so earlier answers tend to receive
    larger probabilities than later ones; that bias is fine for a dumb baseline.)
    NB: returns a list, not an array.
    '''
    row = []
    for n_answers in questions_dict.values():
        remaining_probability = 1
        while n_answers > 1:
            r = remaining_probability*np.random.random()
            row.append(r)
            remaining_probability -= r
            n_answers -= 1
        row.append(remaining_probability)
    return row

def randarray(rows):
    '''
    Returns a random array with the specified number of rows.
    Each row is generated using randrow().
    '''
    ra = [randrow() for i in range(rows)]
    return np.array(ra)

start = time.time()
ray = randarray(n_files)
delta_t = time.time()-start
print("Generated random array with shape", ray.shape, "in", delta_t, "seconds.")


Generated random array with shape (61578, 37) in 1.6347661018371582 seconds.

Now we have our terrible guess. But how terrible is it? Kaggle uses root-mean-square error to evaluate the submissions for this competition. Let's do that — and let's compare our dumb idea against an even dumber one: an array full of ones.
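Concretely, for an N x 37 table of guesses g and true answers a, RMSE = sqrt(mean of (g - a)^2 over all N*37 cells). sklearn's mean_squared_error averages over the output columns for us, so all we have to add is the square root.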


In [4]:
from numpy import sqrt, asarray, ones
from sklearn.metrics import mean_squared_error

training_solutions_array = asarray(training_solutions)

def rmse(guess_array, test_array):
    '''Gives us the root-mean-squared error between the guess and the test set answers.'''
    rmse = sqrt(mean_squared_error(guess_array, test_array))
    return rmse

print("Error for our random array is", rmse(ray, training_solutions_array))
print("Error for an even sillier array (all ones) is", rmse(ones(training_solutions_array.shape), training_solutions_array))


Error for our random array is 0.372992067292
Error for an even sillier array (all ones) is 0.894771769035

Now we have our baseline. Just to be sure, let's generate a bunch of those random arrays and get the average RMSE across all of them. (Though, given the sheer number of elements in our arrays, it's likely that we'll get something very close to what we already have.)


In [82]:
n = 50
print("Generating", n, "random arrays...")
t1 = time.time()
random_errs = [rmse(randarray(n_files), training_solutions_array) for i in range(n)]
t2 = time.time()
print("Finished in", t2 - t1, "seconds.")
random_baseline_error = sum(random_errs)/n
print("Baseline RMSE is", baseline)


Generating 50 random arrays...
Finished in 64.84260702133179 seconds.
Baseline RMSE is 0.373006271715

Astonishingly, it turns out there's something even dumber we can do that will nonetheless do better than our random arrays.


In [5]:
from numpy import zeros
zeros_baseline_error = rmse(zeros(training_solutions_array.shape), training_solutions_array)
print("All-zeros baseline RMSE:", zeros_baseline_error)


All-zeros baseline RMSE: 0.271455208108

So, that's the new goal: do better than an array full of zeros. Let's see if we can manage it.
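
Why do zeros beat our random guesses? The all-zeros RMSE is just the root-mean-square of the answers themselves, and most entries in the table are small (answers to the deeper questions are weighted by the fraction of users who actually reached them), so guessing zero everywhere is surprisingly hard to beat:

In [ ]:
# rmse(zeros, a) == sqrt(mean(a**2)), so the zeros baseline is small
# whenever the typical answer fraction is small.
print("Mean answer fraction:", training_solutions_array.mean())
print("Root-mean-square answer fraction:", sqrt((training_solutions_array**2).mean()))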

III. Data pipeline

Now we will have to look at the images, which means we'll need a pipeline to take them in and process them. For starters, we can probably downsample the images: at full resolution they likely contain a lot of redundant information. Just how big are they, anyhow?


In [43]:
io.imshow(galaxy)
io.show()
galaxy.shape


Out[43]:
(424, 424, 3)

424x424 pixels, with three color channels? Yeah, we don't need all of that. Let's resize it to 50x50 and convert it to grayscale instead.


In [5]:
from skimage import transform, color
smallgalaxy = transform.resize(galaxy, (50, 50))
smallbwgalaxy = color.rgb2gray(smallgalaxy)
io.imshow(smallbwgalaxy)
io.show()


Much better. But that's an elliptical galaxy — can we still resolve the arms of a spiral galaxy, or the bars, or the "odd features" that are part of the Galaxy Zoo decision tree?


In [61]:
fn = np.random.choice(training_filenames)
randgalaxy = io.imread("images_training_rev1/" + fn)
randsmallgalaxy = color.rgb2gray(transform.resize(randgalaxy, (50, 50)))
io.imshow(randgalaxy)
io.show()
io.imshow(randsmallgalaxy)
io.show()


That's...not great. If you showed me the lower image without seeing the upper one, I might think it was an elliptical galaxy, rather than a spiral. How does 100x100 compare?


In [66]:
fn = np.random.choice(training_filenames)
randgalaxy = io.imread("images_training_rev1/" + fn)
randsmallgalaxy = color.rgb2gray(transform.resize(randgalaxy, (50, 50)))
randmedgalaxy = color.rgb2gray(transform.resize(randgalaxy, (100, 100)))
io.imshow(randgalaxy)
io.show()
io.imshow(randmedgalaxy)
io.show()
io.imshow(randsmallgalaxy)
io.show()


OK, looks like we might want to go up to 100x100. Let's try 50x50 first, though, and then go back up to 100x100 later.

Splitting the training set and creating a pipeline

Unfortunately, the only way to check our guesses for the test images is to submit them to Kaggle. We don't want to do that every time, so let's train our machine-learning algorithms on half of our training set (randomly selected), and use the other half as our "test" set. If we end up with an algorithm that performs well on that test set, we'll retrain the algorithm on the entire training set, then use the actual test set to create a submission to Kaggle.


In [6]:
import random
shuffled_training_set = list(zip(training_filenames, training_solutions_array))
random.shuffle(shuffled_training_set)
small_training_set = shuffled_training_set[:n_files//2]
small_test_set = shuffled_training_set[n_files//2:]
print(len(small_training_set), len(small_test_set))
small_training_filenames = [i[0] for i in small_training_set]
small_training_answers = [i[1] for i in small_training_set]
small_test_filenames = [i[0] for i in small_test_set]
small_test_answers = [i[1] for i in small_test_set]


30789 30789

We'll also need a pipeline function for these images:


In [7]:
def downsample_and_grayscale(filename, size):
    '''
    Grabs the file, scales it down to the given size (in pixels), and grayscales it.
    Returns a numpy array with the downscaled and grayscaled image data.
    '''
    raw_image = io.imread("images_training_rev1/" + filename)
    downscaled_gray_image = color.rgb2gray(transform.resize(raw_image, (size, size)))
    return downscaled_gray_image

io.imshow(downsample_and_grayscale(small_training_filenames[2], 50))
io.show()


How fast is our pipeline? It might be too slow.


In [32]:
n = 300
pix = 10
t1 = time.time()
images = [downsample_and_grayscale(fn, pix) for fn in small_training_filenames[:n]]
t2 = time.time()
print("Processed", n, "images at", pix, "x", pix, "resolution in", t2-t1, "seconds.")


Processed 300 images at 10 x 10 resolution in 4.74242091178894 seconds.

It's probably file I/O that's taking up most of the time. 300 images take about 5 seconds, so roughly 0.02 s per image. That's a long time! At that rate, reading in all the images in the smaller training set alone will take...


In [33]:
print(0.02*len(small_training_set)/60/60, "hours")


0.17105 hours

That's not too bad after all. But for now, let's see what we can do with a small subset of the data. We're still not done with our pipeline: we need to flatten the image arrays.


In [8]:
def pipeline(fn, size):
    '''Downsamples, grayscales, and flattens.'''
    return downsample_and_grayscale(fn, size).flatten()
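
A quick check of the output: one 50x50 image flattens to a 2500-element vector.

In [ ]:
pipeline(small_training_filenames[0], 50).shape  # -> (2500,)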

We're finally ready to start doing some machine learning. Let's grow a forest!

IV. Random forest on pixels

Let's start with a random forest algorithm, taking the pixels of the images themselves as input. We'll do 200 images at 20x20 resolution, just to test our forest-growing abilities.


In [9]:
from sklearn.ensemble import RandomForestRegressor
size = 20
n_images = 200
randforest = RandomForestRegressor(n_estimators=100, 
                            n_jobs=4, verbose=1, warm_start=False)
X = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
X = np.array(X)
Y = np.array(small_training_answers[:n_images])
randforest.fit(X, Y)


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.7s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    1.7s finished
Out[9]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=4, oob_score=False, random_state=None,
           verbose=1, warm_start=False)

In [19]:
newX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
newX = np.array(newX)
newY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(newX)
print(newY.shape, predictedY.shape)
print(rmse(newY, predictedY))


(200, 37) (200, 37)
0.158176712781
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.1s finished

That is surprisingly un-horrible! Let's see if we can do better with the same training set by altering the hyperparameters (number of trees, number of pixels fed in, size of the training set, and so on). We'll change them one at a time, and see what gets us the biggest improvement.


In [87]:
randforest2 = RandomForestRegressor(n_estimators=400, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest2.fit(X, Y)
predictedY = randforest2.predict(newX)
print("RMSE with more trees:", rmse(newY, predictedY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.8s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    3.7s
RMSE with more trees: 0.164698296042
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:    7.7s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:    0.1s finished

In [85]:
n_images = 400
size = 20
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set:", rmse(predictedY, testlongerY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    2.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    4.8s finished
RMSE with a larger training set: 0.156437728168
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished

In [86]:
n_images = 200
size = 40
biggerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
biggerX = np.array(biggerX)
biggerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest.fit(biggerX, biggerY)
testbiggerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testbiggerX = np.array(testbiggerX)
testbiggerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testbiggerX)
print("RMSE with larger images:", rmse(predictedY, testbiggerY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.3s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    8.0s finished
RMSE with larger images: 0.164147792591
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished

Unsurprisingly, a larger training set wins. We'll have to find a way to get that to work before we can go any further with this, I think.

V. Increasing the size of the training set

Let's start by profiling our existing file I/O.


In [90]:
import cProfile
N = 800
offset = 400
cProfile.run("list(io.imread('images_training_rev1/' + fn) for fn in small_training_filenames[offset:N+offset])")


         203204 function calls in 38.234 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      800    0.001    0.000    0.006    0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
      801    0.006    0.000   38.195    0.048 <string>:1(<genexpr>)
        1    0.040    0.040   38.234   38.234 <string>:1(<module>)
      800    0.005    0.000    0.005    0.000 BmpImagePlugin.py:54(_accept)
      800    0.002    0.000    0.002    0.000 GifImagePlugin.py:45(_accept)
      800    0.004    0.000    8.447    0.011 Image.py:1128(getdata)
     1600    0.003    0.000    0.003    0.000 Image.py:1691(seek)
      800    0.002    0.000    0.002    0.000 Image.py:2235(_decompression_bomb_check)
      800    0.035    0.000   16.677    0.021 Image.py:2249(open)
      800    0.027    0.000    0.319    0.000 Image.py:2291(_open_core)
      800    0.002    0.000    0.002    0.000 Image.py:277(_conv_type_shape)
      800    0.004    0.000    0.004    0.000 Image.py:349(preinit)
      800    0.005    0.000    0.012    0.000 Image.py:409(_getdecoder)
      800    0.003    0.000    0.007    0.000 Image.py:426(_getencoder)
      800    0.011    0.000    0.011    0.000 Image.py:499(__init__)
     4800    0.015    0.000    0.303    0.000 Image.py:617(__getattr__)
      800    0.013    0.000    0.286    0.000 Image.py:649(tobytes)
     2400    0.004    0.000    0.006    0.000 Image.py:747(load)
     1600    0.049    0.000    8.445    0.005 ImageFile.py:120(load)
      800    0.004    0.000    0.092    0.000 ImageFile.py:252(load_prepare)
      800    0.000    0.000    0.000    0.000 ImageFile.py:261(load_end)
     7200    0.009    0.000    0.014    0.000 ImageFile.py:505(_safe_read)
      800    0.001    0.000    0.001    0.000 ImageFile.py:66(_tilesort)
      800    0.013    0.000    0.248    0.000 ImageFile.py:78(__init__)
      800    0.023    0.000    0.034    0.000 JpegImagePlugin.py:134(SOF)
     1600    0.024    0.000    0.033    0.000 JpegImagePlugin.py:182(DQT)
      800    0.002    0.000    0.002    0.000 JpegImagePlugin.py:275(_accept)
      800    0.057    0.000    0.218    0.000 JpegImagePlugin.py:287(_open)
      800    0.001    0.000    0.008    0.000 JpegImagePlugin.py:395(_getmp)
      800    0.006    0.000    0.006    0.000 JpegImagePlugin.py:457(_getmp)
     4000    0.012    0.000    0.027    0.000 JpegImagePlugin.py:55(Skip)
      800    0.022    0.000    0.035    0.000 JpegImagePlugin.py:60(APP)
      800    0.016    0.000    0.272    0.000 JpegImagePlugin.py:727(jpeg_factory)
      800    0.003    0.000    0.003    0.000 TiffImagePlugin.py:241(_accept)
    20000    0.012    0.000    0.012    0.000 _binary.py:23(i8)
    19200    0.024    0.000    0.036    0.000 _binary.py:52(i16be)
      800    0.023    0.000   38.188    0.048 _io.py:17(imread)
     1600    0.005    0.000    0.009    0.000 _util.py:13(isPath)
      800    0.003    0.000    0.009    0.000 contextlib.py:131(helper)
      800    0.005    0.000    0.006    0.000 contextlib.py:37(__init__)
      800    0.001    0.000    0.011    0.000 contextlib.py:57(__enter__)
      800    0.008    0.000    0.011    0.000 contextlib.py:63(__exit__)
      800    0.028    0.000   38.130    0.048 manage_plugins.py:175(call_plugin)
      800    7.169    0.009   38.100    0.048 pil_plugin.py:10(imread)
      800    0.020    0.000    9.203    0.012 pil_plugin.py:43(pil_to_ndarray)
      800    0.004    0.000    0.009    0.000 util.py:16(is_url)
     1600    0.002    0.000    0.011    0.000 util.py:22(file_or_url_context)
      800    0.004    0.000    0.004    0.000 {built-in method PIL._imaging.jpeg_decoder}
      800    0.087    0.000    0.087    0.000 {built-in method PIL._imaging.new}
      800    0.002    0.000    0.002    0.000 {built-in method PIL._imaging.raw_encoder}
    19200    0.013    0.000    0.013    0.000 {built-in method _struct.unpack}
      800    0.002    0.000    0.002    0.000 {built-in method builtins.divmod}
        1    0.000    0.000   38.234   38.234 {built-in method builtins.exec}
     2400    0.005    0.000    0.005    0.000 {built-in method builtins.getattr}
     4000    0.015    0.000    0.018    0.000 {built-in method builtins.hasattr}
     5600    0.006    0.000    0.006    0.000 {built-in method builtins.isinstance}
     8000    0.003    0.000    0.003    0.000 {built-in method builtins.len}
      800    0.001    0.000    0.001    0.000 {built-in method builtins.max}
     1600    0.003    0.000    0.013    0.000 {built-in method builtins.next}
      800    5.050    0.006    5.050    0.006 {built-in method io.open}
      800    0.435    0.001    0.731    0.001 {built-in method numpy.core.multiarray.array}
    11200    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
      800    0.000    0.000    0.000    0.000 {method 'cleanup' of 'ImagingDecoder' objects}
      800    7.588    0.009    7.588    0.009 {method 'decode' of 'ImagingDecoder' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     7200    0.209    0.000    0.209    0.000 {method 'encode' of 'ImagingEncoder' objects}
      800    0.002    0.000    0.002    0.000 {method 'endswith' of 'str' objects}
      800    0.049    0.000    0.049    0.000 {method 'join' of 'bytes' objects}
      800    0.001    0.000    0.001    0.000 {method 'lower' of 'str' objects}
      800    0.005    0.000    0.005    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
     1600    0.002    0.000    0.002    0.000 {method 'pixel_access' of 'ImagingCore' objects}
      800    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
    32000   16.994    0.001   16.994    0.001 {method 'read' of '_io.BufferedReader' objects}
     2400    0.015    0.000    0.015    0.000 {method 'seek' of '_io.BufferedReader' objects}
      800    0.002    0.000    0.002    0.000 {method 'setimage' of 'ImagingDecoder' objects}
      800    0.001    0.000    0.001    0.000 {method 'setimage' of 'ImagingEncoder' objects}
      800    0.005    0.000    0.006    0.000 {method 'sort' of 'list' objects}
      800    0.001    0.000    0.001    0.000 {method 'startswith' of 'str' objects}


OK, that's not terrible, and there isn't much fat to trim in our own code: nearly half the time goes to raw file reads (the _io.BufferedReader read calls), and most of the rest to JPEG decoding. Let's try a larger training set and see how we do.


In [91]:
n_images = 1000
size = 20
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set:", rmse(predictedY, testlongerY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    6.8s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   17.2s finished
RMSE with a larger training set: 0.150243672651
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished

Not bad: we're about a third of the way up the leaderboard on Kaggle. Let's make the images bigger now and see if that helps.


In [92]:
n_images = 1000
size = 50
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set and larger images:", rmse(predictedY, testlongerY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   47.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:  1.8min finished
RMSE with a larger training set and larger images: 0.147535680338
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished

OK, that gained us a few leaderboard spots. Let's try making the images bigger again, and increasing the number of trees to compensate for the larger input vectors. Increasing those individually didn't help before, but doing it in tandem might.


In [93]:
n_images = 1000
size = 100
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=400, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set, a bigger forest, and much larger images:", rmse(predictedY, testlongerY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  3.3min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 15.0min
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed: 30.1min finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.2s
RMSE with a larger training set, a bigger forest, and much larger images: 0.146889039777
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:    0.4s finished

OK, let's try 50x50 images with more trees, since we probably didn't fully sample the space there.


In [94]:
n_images = 1000
size = 50
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=500, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set and larger images:", rmse(predictedY, testlongerY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   48.4s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  3.6min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  8.2min
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:  9.4min finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.1s
RMSE with a larger training set and larger images: 0.146675323561
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:    0.4s finished

Wow, that's not a lot better! So that means the way to crank down that RMSE on a pixel-based approach is really just to crank up the number of images. Let's do that.


In [12]:
n_images = 3000
size = 20
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100, 
                            n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE:", rmse(predictedY, testlongerY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   25.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   54.1s finished
RMSE: 0.14346268324
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.1s finished

That's only a marginal gain for three times the data. I think we've reached the limits of what a random forest is capable of when acting on raw pixels. We'd need to run something with a huge forest and large images on the whole dataset to be sure, but I doubt we'd get much better. Time to try something else.

VI. Messing around with a few other ideas

Gaussian processes


In [38]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF as SquaredExponential
from sklearn.gaussian_process.kernels import ConstantKernel as Amplitude

kernel = Amplitude(1.0, (1E-3, 1E3)) * SquaredExponential(10, (1e-2, 1e2))

gp = GaussianProcessRegressor(kernel=kernel,
                              n_restarts_optimizer=10)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, Y)

predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)


0.173910318913
Out[38]:
200

Let's try altering the kernel.


In [39]:
kernel = Amplitude(1.0, (1E-3, 1E3)) + SquaredExponential(10, (1e-2, 1e2))

gp = GaussianProcessRegressor(kernel=kernel,
                              n_restarts_optimizer=10)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, Y)

predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)


0.218676481165
Out[39]:
200

In [36]:
from sklearn.gaussian_process.kernels import Matern

kernel = Amplitude(1.0, (1E-3, 1E3)) * Matern(1.0, (1e-3, 1e3), 2.5)

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)

gp.fit(X, Y)

predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)


0.162933773478
Out[36]:
200

In [44]:
kernel = Amplitude(1.0, (1E-4, 1E4)) * Matern(1.0, (1e-4, 1e4), 0.5)

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)

gp.fit(X, Y)

predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)


0.15766040735
Out[44]:
200

In [45]:
from sklearn.gaussian_process.kernels import RationalQuadratic

kernel = Amplitude(1.0, (1E-4, 1E4)) * RationalQuadratic()

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)

gp.fit(X, Y)

predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)


0.156793034718
Out[45]:
200

In [33]:
kernel = Amplitude(1.0, (1E-3, 1E3))

gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)

gp.fit(X, Y)

predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)


/usr/local/lib/python3.5/site-packages/sklearn/gaussian_process/gpr.py:424: UserWarning: fmin_l_bfgs_b terminated abnormally with the  state: {'task': b'ABNORMAL_TERMINATION_IN_LNSRCH', 'warnflag': 2, 'grad': array([  1.47661416e+08]), 'funcalls': 21, 'nit': 0}
  " state: %s" % convergence_dict)
/usr/local/lib/python3.5/site-packages/sklearn/gaussian_process/gpr.py:424: UserWarning: fmin_l_bfgs_b terminated abnormally with the  state: {'task': b'ABNORMAL_TERMINATION_IN_LNSRCH', 'warnflag': 2, 'grad': array([ -1.20702726e+09]), 'funcalls': 72, 'nit': 2}
  " state: %s" % convergence_dict)
0.164758416845
Out[33]:
200

Gaussian processes are pretty robust against kernel choices, and also perform comparably to random forests! But they eat a lot more RAM when the number of images goes up, so we'll leave it at this.
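
To put a number on that: exact GP regression stores an n x n kernel matrix, so memory grows as n^2 (and training time as n^3). At 200 images the kernel is a trivial 200 x 200, but at our half-training-set size of 30,789 images it would be 30,789^2 x 8 bytes, roughly 7.6 GB, before the linear algebra even starts.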

Return to the random forest

There were a bunch of things we didn't do with the random forest that we could have done. Let's try some of them now. First, let's see what happens if we use more trees on larger images, but let each split consider only a random quarter of the features (max_features=0.25), which makes the individual trees cheaper to grow and less correlated with each other.


In [31]:
size = 100
n_images = 200
randforest = RandomForestRegressor(n_estimators=500, max_features = 0.25, 
                            n_jobs=4, verbose=1, warm_start=False)
X = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
X = np.array(X)
Y = np.array(small_training_answers[:n_images])
randforest.fit(X, Y)

newX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
newX = np.array(newX)
newY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(newX)
print(newY.shape, predictedY.shape)
print(rmse(newY, predictedY))


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    5.2s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   25.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   56.1s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:  1.0min finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:    0.1s finished
(200, 37) (200, 37)
0.151650328554

That did very little! The answer is comparable to what we got before. We're clearly limited by the size of the training set.

Rather than making the forest bigger and shorter, let's make the forest smarter instead: we'll try boosting the performance of the trees with AdaBoost, which repeatedly re-weights the training examples so that later trees concentrate on the ones earlier trees handled poorly. More on this algorithm in the scikit-learn documentation.


In [53]:
size = 20
n_images = 200
X = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
X = np.array(X)
Y = np.array(small_training_answers[:n_images])
newX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
newX = np.array(newX)
newY = np.array(small_training_answers[n_images:2*n_images])

In [55]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

#AdaBoost in sklearn can't handle multi-dimensional outputs! So let's roll our own.
# This should be an object, but I'm going to make functions instead because that's how I roll.
def ABRegressors_init(X, Y, n_estimators):
    '''
    Create a group of AdaBoostRegressors suitable for the given X and Y input.
    '''
    abregressors = [AdaBoostRegressor(DecisionTreeRegressor(), n_estimators = n_estimators) for i in range(Y.shape[1])]
    return abregressors

def ABRegressors_fit(ABregressors, X, Y):
    '''
    Use a group of AdaBoostRegressors to fit to the training set X and Y.
    '''
    return [abr.fit(X, Y[:, i]) for i, abr in enumerate(ABregressors)]
    

def ABRegressors_predict(ABregressors, X):
    '''
    Use a group of AdaBoostRegressors to predict for the test set X.
    '''
    predictions = [abr.predict(X) for abr in ABregressors]
    predictions = np.array(predictions)
    predictions = predictions.transpose()
    return predictions

boostedforest = ABRegressors_init(X, Y, 100)
ABRegressors_fit(boostedforest, X, Y)
predictedY = ABRegressors_predict(boostedforest, newX)

print(rmse(newY, predictedY))


0.167577836867

Well, that's weird — it's worse than the un-boosted algorithm! I think we really are hugely limited by the size of the dataset. So let's try using the best algorithm we've got with a really big fraction of the training set and see how much that improves things. This will require being a little careful with RAM use if we want to run this on a laptop, but scikit-learn makes it relatively easy to do this.


In [13]:
from math import ceil
from multiprocessing import Pool

image_size = 100
def pipefunc_size(fn): return pipeline(fn, image_size)
base_chunk_size = 40 
# base_chunk_size determines how much data is read into memory at any one time.
# The smaller it is, the less RAM this will all use. 
# Don't make it much larger than 200, and definitely don't make it over 1000 if you want to run this on a laptop.
n_jobs = 5
# n_jobs determines how many processors will be used. 
# Make sure that chunk size is an even multiple of n_jobs!
chunk_size = base_chunk_size * n_jobs
trainingXfiles = small_training_filenames
trainingY = np.array(small_training_answers)
testXfiles = small_test_filenames
testY = np.array(small_test_answers)

# Start with a function for training the forest chunkily. 

def chunkyfit(randforest, Xfiles, Y, chunksize=chunk_size, njobs=n_jobs):
    '''
    Feeds the list of training X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
    Then it feeds those chunks of X, along with the corresponding chunks of Y, into randforest.
    '''
    n_chunks = ceil(len(Xfiles)/chunksize)
    for i in range(n_chunks):
        print("Processing training chunk", i+1, "out of", n_chunks)
        pool = Pool(njobs)
        Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
        Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
        pool.close()
        Xchunk = np.array(Xchunk)
        Ychunk = Y[i*chunksize:(i+1)*chunksize]
        # With warm_start=True, each fit call grows the forest by 5 new trees
        # trained only on this chunk; the existing trees are kept untouched.
        randforest.n_estimators += 5
        randforest.fit(Xchunk, Ychunk)
    return randforest

# A function that tells your forest to make a prediction on each chunk of the test set, then stitch them together.

def chunkypredict(randforest, Xfiles, chunksize=chunk_size, njobs=n_jobs):
    '''
    Feeds the list of test X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
    Then it feeds those chunks of X into randforest.predict, and stitches the answers together.
    '''
    predictY = []
    n_chunks = ceil(len(Xfiles)/chunksize)
    for i in range(n_chunks):
        print("Processing test chunk", i+1, "out of", n_chunks)
        pool = Pool(njobs)
        Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
        Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
        pool.close()
        Xchunk = np.array(Xchunk)
        Ychunk = randforest.predict(Xchunk)
        predictY.extend(Ychunk)
    return np.array(predictY)

# We set warm_start = True, 
# so we can train new trees in the forest on new chunks of the data set without losing the old trees.
randforest = RandomForestRegressor(n_estimators=0, max_features = 0.3333,
                            n_jobs=n_jobs, verbose=0, warm_start=True)

maxfiles = 30000
chunkyfit(randforest, trainingXfiles[:maxfiles], trainingY[:maxfiles])

predictedY = chunkypredict(randforest, testXfiles[:maxfiles])

print(rmse(testY[:maxfiles], predictedY))


Processing training chunk 1 out of 150
Processing training chunk 2 out of 150
...
Processing training chunk 150 out of 150
Processing test chunk 1 out of 150
Processing test chunk 2 out of 150
...
Processing test chunk 150 out of 150
0.153361271479

Well, that's definitely the limit of random forest acting on pixels, then. Time to move on to an entirely different approach.

Except...we did a little better with a much smaller data set earlier. What gives? Probably something about the fraction of the feature set each split was allowed to use (the max_features hyperparameter). Let's try that approach again with this fraction of the data, leaving max_features at its default. I'm guessing that it won't be too much better, and then we can move on to something else.


In [28]:
from math import ceil
maxfiles = 30000
image_size = 40
def pipefunc_size(fn): return pipeline(fn, image_size)
base_chunk_size = 160 
# base_chunk_size determines how much data is read into memory at any one time.
# The smaller it is, the less RAM this will all use. 
# Don't make it much larger than 200, and definitely don't make it over 1000 if you want to run this on a laptop.
n_jobs = 5
# n_jobs determines how many processors will be used. 
# Make sure that chunk size is an even multiple of n_jobs!
chunk_size = base_chunk_size * n_jobs

def chunkyfit(randforest, Xfiles, Y, chunksize=chunk_size, njobs=n_jobs):
    '''
    Feeds the list of training X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
    Then it feeds those chunks of X, along with the corresponding chunks of Y, into randforest.
    '''
    n_chunks = ceil(len(Xfiles)/chunksize)
    for i in range(n_chunks):
        print("Processing training chunk", i+1, "out of", n_chunks)
        pool = Pool(njobs)
        Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
        Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
        pool.close()
        Xchunk = np.array(Xchunk)
        Ychunk = Y[i*chunksize:(i+1)*chunksize]
        # Grow the forest by 40 new trees per chunk this time (fewer, larger chunks).
        randforest.n_estimators += 40
        randforest.fit(Xchunk, Ychunk)
    return randforest

# A function that tells your forest to make a prediction on each chunk of the test set, then stitch them together.

def chunkypredict(randforest, Xfiles, chunksize=chunk_size, njobs=n_jobs):
    '''
    Feeds the list of test X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
    Then it feeds those chunks of X into randforest.predict, and stitches the answers together.
    '''
    predictY = []
    n_chunks = ceil(len(Xfiles)/chunksize)
    for i in range(n_chunks):
        print("Processing test chunk", i+1, "out of", n_chunks)
        pool = Pool(njobs)
        Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
        Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
        pool.close()
        Xchunk = np.array(Xchunk)
        Ychunk = randforest.predict(Xchunk)
        predictY.extend(Ychunk)
    return np.array(predictY)

# We set warm_start = True, 
# so we can train new trees in the forest on new chunks of the data set without losing the old trees.
randforest = RandomForestRegressor(n_estimators=0,
                            n_jobs=n_jobs, verbose=0, warm_start=True)

chunkyfit(randforest, trainingXfiles[:maxfiles], trainingY[:maxfiles], chunksize = chunk_size)


predictedY = chunkypredict(randforest, testXfiles[:maxfiles], chunksize = chunk_size)

print(rmse(testY[:maxfiles], predictedY))


Processing training chunk 1 out of 38
Processing training chunk 2 out of 38
...
Processing training chunk 38 out of 38
Processing test chunk 1 out of 38
Processing test chunk 2 out of 38
...
Processing test chunk 38 out of 38
0.14654572635

OK, no real improvement. Time to try featurization.

VII. Featurization

Now let's try pulling features out of the images and training on those, rather than running on the pixels themselves. We could engineer features by hand, but I'd rather use an algorithm that finds features automatically, then train the ML algorithm on those.
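
One candidate (just a sketch, untried here): let PCA find the features by projecting each flattened image onto its leading principal components, then train the forest on that compressed representation. This reuses X, Y, newX, and newY from the size-20 random forest cells above; n_components=50 is an arbitrary guess.

In [ ]:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

# Compress the raw pixel vectors down to their first 50 principal components,
# then train the forest on those instead of the pixels.
pca = PCA(n_components=50)
X_feat = pca.fit_transform(X)    # fit the components on the training images
newX_feat = pca.transform(newX)  # apply the same projection to the test images
pcaforest = RandomForestRegressor(n_estimators=100, n_jobs=4)
pcaforest.fit(X_feat, Y)
print(rmse(newY, pcaforest.predict(newX_feat)))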

VIII. Deep learning with Keras and TensorFlow
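
(This section is still a stub. As a placeholder, here is a minimal sketch, assuming Keras with the TensorFlow backend, of the kind of model it would start from: a small convolutional network mapping a 50x50 grayscale image to the 37 answer fractions. All layer sizes are guesses, not tuned values.)

In [ ]:
## A minimal sketch, not a tuned model; assumes Keras with the TensorFlow backend.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(50, 50, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(37, activation='sigmoid'),  # one output per answer column
])
# RMSE is just the square root of MSE, so minimizing MSE minimizes our metric.
model.compile(optimizer='adam', loss='mse')
# model.fit would then take an (n, 50, 50, 1) image array and the (n, 37) answers.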

IX. Can we beat the best answer?


In [ ]: