GalaxyZoo has a fabulous data set for machine learning. There was a Kaggle competition a few years back to create an automatic galaxy classifier based on the GalaxyZoo data; the winning entry (code, explanation) works very well, but it takes days to train. Let's see if we can create something almost as good that runs much faster. (I'm using Python 3, but I'll try to make this notebook backwards compatible with Python 2 as well.)
First, we need the data from the Kaggle competition (the raw GalaxyZoo data is also available, but we don't need nearly that much data to get started). You'll need to create a Kaggle account (free) in order to access this. Download images_training_rev1.zip (~60,000 training images), images_test_rev1.zip (~80,000 test images), and training_solutions_rev1.csv (the classifications for the training images).
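If you'd rather unpack the archives from Python, here's a minimal sketch (it assumes the zip files sit in your working directory):
In [ ]:
import zipfile
# Extract both image archives next to the notebook.
for archive in ("images_training_rev1.zip", "images_test_rev1.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall()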
You'll also need the following Python packages (and their dependencies) installed: numpy, pandas, matplotlib, scikit-image, and scikit-learn.
Let's start by testing out image reading, since that will be pretty important for this project.
In [ ]:
## An attempt at Python 2 backward compatibility
from __future__ import absolute_import, division, print_function, unicode_literals
In [1]:
%matplotlib inline
from skimage import data, io, filters
filename = "images_training_rev1/100008.jpg"
galaxy = io.imread(filename)
io.imshow(galaxy)
io.show()
Next, let's have a look at the beginning of that CSV file of galaxy classifications.
In [2]:
import pandas as pd
training_solutions = pd.read_csv("training_solutions_rev1.csv",index_col=0)
training_solutions.head()
Out[2]:
Each column gives the fraction of answers given in the Galaxy Zoo Decision Tree. (There is a nice visualization of the tree here, but annoyingly, the questions in that tree are indexed from 0, while the Kaggle set indexes them from 1.) So, for example, there are three possible answers to the first question, which is why we have columns Class 1.1, Class 1.2, and Class 1.3. For the first image (the one we displayed as a test, image 100008), about 38.3% of users gave answer 1 to question 1, 61.7% gave answer 2, and nobody gave answer 3.
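As a quick sanity check (a sketch; I'm assuming the Kaggle columns are literally named Class1.1, Class1.2, and Class1.3, as head() suggests), the answer fractions for question 1 should sum to 1 for every galaxy, since every volunteer answers the first question:
In [ ]:
# Every row's question-1 fractions should total ~1.
q1_totals = training_solutions[["Class1.1", "Class1.2", "Class1.3"]].sum(axis=1)
print(q1_totals.min(), q1_totals.max())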
The overall goal is to predict the distribution of all of these answers for all of the test images, given the above distribution for the training images. Let's get to work!
Let's start with one of the dumbest "algorithms" imaginable: generating an entirely random set of answer distributions for each image. We won't even have to look at the images for this one! This will give us a baseline: if we can't do better than this, we're in trouble.
First, we need to know how big our data set is.
In [29]:
import os
import time
import numpy as np
from collections import OrderedDict
# First, let's find out exactly how many files there are in the training set.
# We could get this from the training set solution table, but let's just count the files instead.
training_filenames = os.listdir("images_training_rev1/")  # positional argument keeps this Python 2 compatible
# We need to make sure these are all images, rather than anything else that somehow ended up in the folder
# (e.g. OSX invisible files)
training_filenames = [f for f in training_filenames if os.path.splitext(f)[1] == ".jpg"]
n_files = len(training_filenames)
print("There are", n_files, "files in the training set.")
# Now we need to know how many questions there are, and how many answers there are for each question.
# To do that, let's have a look at the GalaxyZoo decision tree:
# https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/details/the-galaxy-zoo-decision-tree
# Again, we could pull this from the training set solution table, but I don't feel like doing that,
# because regular expressions make me sad.
questions_dict = {
1:3,
2:2,
3:2,
4:2,
5:4,
6:2,
7:3,
8:7,
9:3,
10:3,
11:6
}
questions_dict = OrderedDict(sorted(questions_dict.items()))
n_questions = len(questions_dict)
n_columns = sum(questions_dict.values())
print("There are", n_questions, "questions and", n_columns, "possible answers in the GalaxyZoo tree.")
Now we can create our random matrix. We'll make it not entirely stupid: within each question, we'll keep the answer probabilities normalized so they sum to 1.
In [30]:
def randrow():
'''
Creates a random row of numbers.
Takes into account the constraints imposed by the number of questions:
namely, that the total probability for each question must sum to 1.
NB: returns a list, not an array.
'''
row = []
for k in questions_dict.keys():
n_answers = questions_dict[k]
remaining_probability = 1
while n_answers > 1:
r = remaining_probability*np.random.random()
row.append(r)
remaining_probability -= r
n_answers -= 1
row.append(remaining_probability)
return row
def randarray(rows):
'''
Returns a random array with the specified number of rows.
Each row is generated using randrow().
'''
ra = [randrow() for i in range(rows)]
return np.array(ra)
start = time.time()
ray = randarray(n_files)
delta_t = time.time()-start
print("Generated random array with shape", ray.shape, "in", delta_t, "seconds.")
Now we have our terrible guess. But how terrible is it? Kaggle uses root-mean-square error to evaluate the submissions for this competition. Let's do that — and let's compare our dumb idea against an even dumber one: an array full of ones.
In [4]:
from numpy import sqrt, asarray, ones
from sklearn.metrics import mean_squared_error
training_solutions_array = asarray(training_solutions)
def rmse(guess_array, test_array):
'''Gives us the root-mean-squared error between the guess and the test set answers.'''
rmse = sqrt(mean_squared_error(guess_array, test_array))
return rmse
print("Error for our random array is", rmse(ray, training_solutions_array))
print("Error for an even sillier array (all ones) is", rmse(ones(training_solutions_array.shape), training_solutions_array))
Now we have our baseline. Just to be sure, let's generate a bunch of those random arrays and get the average RMSE across all of them. (Though, given the sheer number of elements in our arrays, it's likely that we'll get something very close to what we already have.)
In [82]:
n = 50
print("Generating", n, "random arrays...")
t1 = time.time()
random_errs = [rmse(randarray(n_files), training_solutions_array) for i in range(n)]
t2 = time.time()
print("Finished in", t2 - t1, "seconds.")
random_baseline_error = sum(random_errs)/n
print("Baseline RMSE is", baseline)
Astonishingly, it turns out there's something even dumber we can do that will nonetheless do better than our random arrays.
In [5]:
from numpy import zeros
zeros_baseline_error = rmse(zeros(training_solutions_array.shape), training_solutions_array)
print("All-zeros baseline RMSE:", zeros_baseline_error)
So, that's the new goal: do better than an array full of zeros. (In retrospect, this makes sense: most of the 37 answer fractions are close to zero, because the deeper questions in the tree are only reached by a fraction of volunteers, so an all-zeros guess is already nearly right for most columns.) Let's see if we can manage it.
In [43]:
io.imshow(galaxy)
io.show()
galaxy.shape
Out[43]:
424x424 pixels, with three color channels? Yeah, we don't need all of that. Let's make it 50x50 and black & white instead.
In [5]:
from skimage import transform, color
smallgalaxy = transform.resize(galaxy, (50, 50))
smallbwgalaxy = color.rgb2gray(smallgalaxy)
io.imshow(smallbwgalaxy)
io.show()
Much better. But that's an elliptical galaxy — can we still resolve the arms of a spiral galaxy, or the bars, or the "odd features" that are part of the Galaxy Zoo decision tree?
In [61]:
fn = np.random.choice(training_filenames)
randgalaxy = io.imread("images_training_rev1/" + fn)
randsmallgalaxy = color.rgb2gray(transform.resize(randgalaxy, (50, 50)))
io.imshow(randgalaxy)
io.show()
io.imshow(randsmallgalaxy)
io.show()
That's...not great. If you showed me the lower image without seeing the upper one, I might think it was an elliptical galaxy, rather than a spiral. How does 100x100 compare?
In [66]:
fn = np.random.choice(training_filenames)
randgalaxy = io.imread("images_training_rev1/" + fn)
randsmallgalaxy = color.rgb2gray(transform.resize(randgalaxy, (50, 50)))
randmedgalaxy = color.rgb2gray(transform.resize(randgalaxy, (100, 100)))
io.imshow(randgalaxy)
io.show()
io.imshow(randmedgalaxy)
io.show()
io.imshow(randsmallgalaxy)
io.show()
OK, looks like we might want to go up to 100x100. Let's try 50x50 first, though, and then go back up to 100x100 later.
Unfortunately, the only way to check our guesses for the test images is to submit them to Kaggle. We don't want to do that every time, so let's train our machine-learning algorithms on half of our training set (randomly selected), and use the other half as our "test" set. If we end up with an algorithm that performs well on that test set, we'll retrain the algorithm on the entire training set, then use the actual test set to create a submission to Kaggle.
In [6]:
import random
# NB: os.listdir makes no ordering guarantees, so sort the filenames numerically
# by GalaxyID to match the row order of the solutions CSV (which, judging from
# head() above, is sorted by GalaxyID).
training_filenames.sort(key=lambda f: int(os.path.splitext(f)[0]))
shuffled_training_set = list(zip(training_filenames, training_solutions_array))
random.shuffle(shuffled_training_set)
small_training_set = shuffled_training_set[:n_files//2]
small_test_set = shuffled_training_set[n_files//2:]
print(len(small_training_set), len(small_test_set))
small_training_filenames = [i[0] for i in small_training_set]
small_training_answers = [i[1] for i in small_training_set]
small_test_filenames = [i[0] for i in small_test_set]
small_test_answers = [i[1] for i in small_test_set]
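(For reference, scikit-learn's train_test_split does the same shuffle-and-split in one call. A sketch, with illustrative alt_* names:)
In [ ]:
from sklearn.model_selection import train_test_split
# Equivalent 50/50 split; train_test_split shuffles by default.
(alt_training_filenames, alt_test_filenames,
 alt_training_answers, alt_test_answers) = train_test_split(
    training_filenames, training_solutions_array, test_size=0.5)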
We'll also need a pipeline function for these images:
In [7]:
def downsample_and_grayscale(filename, size):
'''
Grabs the file, scales it down to the given size (in pixels), and grayscales it.
Returns a numpy array with the downscaled and grayscaled image data.
'''
raw_image = io.imread("images_training_rev1/" + filename)
downscaled_gray_image = color.rgb2gray(transform.resize(raw_image, (size, size)))
return downscaled_gray_image
io.imshow(downsample_and_grayscale(small_training_filenames[2], 50))
io.show()
How fast is our pipeline? It might be too slow.
In [32]:
n = 300
pix = 10
t1 = time.time()
images = [downsample_and_grayscale(fn, pix) for fn in small_training_filenames[:n]]
t2 = time.time()
print("Processed", n, "images at", pix, "x", pix, "resolution in", t2-t1, "seconds.")
It's probably file I/O that's taking up most of the time. 300 images take about 6 seconds, so roughly 0.02 s per image. That's a long time! At that rate, reading in all the images in the smaller training set alone will take...
In [33]:
print(0.02*len(small_training_set)/60/60, "hours")
That's not too bad after all. But for now, let's see what we can do with a small subset of the data. We're still not done with our pipeline: we need to flatten the image arrays.
In [8]:
def pipeline(fn, size):
'''Downsamples, grayscales, and flattens.'''
return downsample_and_grayscale(fn, size).flatten()
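A quick check of what the pipeline emits:
In [ ]:
# A 50x50 grayscale image flattens to a 2500-element feature vector.
print(pipeline(small_training_filenames[0], 50).shape)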
We're finally ready to start doing some machine learning. Let's grow a forest!
In [9]:
from sklearn.ensemble import RandomForestRegressor
size = 20
n_images = 200
randforest = RandomForestRegressor(n_estimators=100,
n_jobs=4, verbose=1, warm_start=False)
X = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
X = np.array(X)
Y = np.array(small_training_answers[:n_images])
randforest.fit(X, Y)
Out[9]:
In [19]:
newX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
newX = np.array(newX)
newY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(newX)
print(newY.shape, predictedY.shape)
print(rmse(newY, predictedY))
That is surprisingly un-horrible! Let's see if we can do better with the same training set by altering the hyper-parameters (e.g. number of trees, number of pixels fed in, size of the training set, etc.) We'll change them one at a time, and see what gets us the biggest improvement.
In [87]:
randforest2 = RandomForestRegressor(n_estimators=400,
n_jobs=4, verbose=1, warm_start=False)
randforest2.fit(X, Y)
predictedY = randforest2.predict(newX)
print("RMSE with more trees:", rmse(newY, predictedY))
In [85]:
n_images = 400
size = 20
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100,
n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set:", rmse(predictedY, testlongerY))
In [86]:
n_images = 200
size = 40
biggerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
biggerX = np.array(biggerX)
biggerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100,
n_jobs=4, verbose=1, warm_start=False)
randforest.fit(biggerX, biggerY)
testbiggerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testbiggerX = np.array(testbiggerX)
testbiggerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testbiggerX)
print("RMSE with larger images:", rmse(predictedY, testbiggerY))
Unsurprisingly, a larger training set wins. We'll have to find a way to get that to work before we can go any further with this, I think.
In [90]:
import cProfile
N = 800
offset = 400
cProfile.run("list(io.imread('images_training_rev1/' + fn) for fn in small_training_filenames[offset:N+offset])")
OK, that's not terrible. Let's try a larger training set and see how we do.
In [91]:
n_images = 1000
size = 20
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100,
n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set:", rmse(predictedY, testlongerY))
Not bad -- we're about a third of the way up the leaderboard on Kaggle. Let's make the images bigger now and see if that helps.
In [92]:
n_images = 1000
size = 50
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100,
n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set and larger images:", rmse(predictedY, testlongerY))
OK, that gained us a few leaderboard spots. Let's try making the images bigger again, and increasing the number of trees to compensate for the larger input vectors. Increasing those individually didn't help before, but doing it in tandem might.
In [93]:
n_images = 1000
size = 100
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=400,
n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set, a bigger forest, and much larger images:", rmse(predictedY, testlongerY))
OK, let's try 50x50 images with more trees, since we probably didn't fully sample the space there.
In [94]:
n_images = 1000
size = 50
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=500,
n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE with a larger training set and larger images:", rmse(predictedY, testlongerY))
Wow, that's not a lot better! So that means the way to crank down that RMSE on a pixel-based approach is really just to crank up the number of images. Let's do that.
In [12]:
n_images = 3000
size = 20
longerX = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
longerX = np.array(longerX)
longerY = np.array(small_training_answers[:n_images])
randforest = RandomForestRegressor(n_estimators=100,
n_jobs=4, verbose=1, warm_start=False)
randforest.fit(longerX, longerY)
testlongerX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
testlongerX = np.array(testlongerX)
testlongerY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(testlongerX)
print("RMSE:", rmse(predictedY, testlongerY))
Nope. I think we've reached the limits of what random forest is capable of when acting on raw pixels. We'd need to run something with a huge forest and large images on the whole dataset to be sure, but I doubt we'd get much better. Time to try something else.
In [38]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF as SquaredExponential
from sklearn.gaussian_process.kernels import ConstantKernel as Amplitude
kernel = Amplitude(1.0, (1E-3, 1E3)) * SquaredExponential(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel,
n_restarts_optimizer=10)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, Y)
predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)
Out[38]:
Let's try altering the kernel.
In [39]:
kernel = Amplitude(1.0, (1E-3, 1E3)) + SquaredExponential(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel,
n_restarts_optimizer=10)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, Y)
predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)
Out[39]:
In [36]:
from sklearn.gaussian_process.kernels import Matern
kernel = Amplitude(1.0, (1E-3, 1E3)) * Matern(1.0, (1e-3, 1e3), 2.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, Y)
predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)
Out[36]:
In [44]:
kernel = Amplitude(1.0, (1E-4, 1E4)) * Matern(1.0, (1e-4, 1e4), 0.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, Y)
predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)
Out[44]:
In [45]:
from sklearn.gaussian_process.kernels import RationalQuadratic
kernel = Amplitude(1.0, (1E-4, 1E4)) * RationalQuadratic()
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, Y)
predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)
Out[45]:
In [33]:
kernel = Amplitude(1.0, (1E-3, 1E3))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, Y)
predictedY = gp.predict(newX)
print(rmse(newY, predictedY))
len(newX)
Out[33]:
Gaussian processes are pretty robust against kernel choices, and they perform comparably to random forests! But they eat a lot more RAM as the number of images goes up (exact GP regression builds and factorizes an n x n kernel matrix over the training examples, so memory grows quadratically), so we'll leave it at this.
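A back-of-the-envelope illustration of that quadratic scaling (my own numbers, not measured from this notebook):
In [ ]:
# An exact GP stores an n x n matrix of float64s for n training images.
for n in (1000, 10000, 30000):
    print(n, "images ->", n**2 * 8 / 1e9, "GB for the kernel matrix alone")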
There were a bunch of things we didn't do with the random forest that we could have done. Let's try some of them now. First, let's see what happens if we use more trees on larger images, but restrict the fraction of features each split is allowed to consider (the max_features hyperparameter).
In [31]:
size = 100
n_images = 200
randforest = RandomForestRegressor(n_estimators=500, max_features = 0.25,
n_jobs=4, verbose=1, warm_start=False)
X = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
X = np.array(X)
Y = np.array(small_training_answers[:n_images])
randforest.fit(X, Y)
newX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
newX = np.array(newX)
newY = np.array(small_training_answers[n_images:2*n_images])
predictedY = randforest.predict(newX)
print(newY.shape, predictedY.shape)
print(rmse(newY, predictedY))
That did very little! The answer is comparable to what we got before. We're clearly limited by the size of the training set.
Rather than making the forest bigger, let's make the forest smarter instead: we'll try boosting the performance of the trees with AdaBoost. AdaBoost fits its trees in sequence, reweighting the training examples after each one so that the next tree concentrates on the examples the ensemble has handled worst so far. More on this algorithm in the scikit-learn documentation.
In [53]:
size = 20
n_images = 200
X = [pipeline(fn, size) for fn in small_training_filenames[:n_images]]
X = np.array(X)
Y = np.array(small_training_answers[:n_images])
newX = [pipeline(fn, size) for fn in small_training_filenames[n_images:2*n_images]]
newX = np.array(newX)
In [55]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
#AdaBoost in sklearn can't handle multi-dimensional outputs! So let's roll our own.
# This should be an object, but I'm going to make functions instead because that's how I roll.
def ABRegressors_init(X, Y, n_estimators):
'''
Create a group of AdaBoostRegressors suitable for the given X and Y input.
'''
abregressors = [AdaBoostRegressor(DecisionTreeRegressor(), n_estimators = n_estimators) for i in range(Y.shape[1])]
return abregressors
def ABRegressors_fit(ABregressors, X, Y):
'''
Use a group of AdaBoostRegressors to fit to the training set X and Y.
'''
return [abr.fit(X, Y[:, i]) for i, abr in enumerate(ABregressors)]
def ABRegressors_predict(ABregressors, X):
'''
Use a group of AdaBoostRegressors to predict for the test set X.
'''
predictions = [abr.predict(X) for abr in ABregressors]
predictions = np.array(predictions)
predictions = predictions.transpose()
return predictions
boostedforest = ABRegressors_init(X, Y, 100)
ABRegressors_fit(boostedforest, X, Y)
predictedY = ABRegressors_predict(boostedforest, newX)
print(rmse(newY, predictedY))
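(As an aside, scikit-learn's MultiOutputRegressor wrapper does essentially the same one-regressor-per-column trick as the functions above; a sketch:)
In [ ]:
from sklearn.multioutput import MultiOutputRegressor
# One AdaBoost regressor per answer column, fit and predicted in one object.
boosted = MultiOutputRegressor(
    AdaBoostRegressor(DecisionTreeRegressor(), n_estimators=100))
boosted.fit(X, Y)
print(rmse(newY, boosted.predict(newX)))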
Well, that's weird: boosting actually made the RMSE worse than the un-boosted forest! I think we really are hugely limited by the size of the dataset. So let's try using the best algorithm we've got with a really big fraction of the training set and see how much that improves things. This will require being a little careful with RAM use if we want to run this on a laptop, but scikit-learn makes it relatively easy to do.
In [13]:
from math import ceil
from multiprocessing import Pool
image_size = 100
def pipefunc_size(fn): return pipeline(fn, image_size)
base_chunk_size = 40
# base_chunk_size determines how much data is read into memory at any one time.
# The smaller it is, the less RAM this will all use.
# Don't make it much larger than 200, and definitely don't make it over 1000 if you want to run this on a laptop.
n_jobs = 5
# n_jobs determines how many processors will be used.
# Make sure that chunk size is an even multiple of n_jobs!
chunk_size = base_chunk_size * n_jobs
trainingXfiles = small_training_filenames
trainingY = np.array(small_training_answers)
testXfiles = small_test_filenames
testY = np.array(small_test_answers)
# Start with a function for training the forest chunkily.
def chunkyfit(randforest, Xfiles, Y, chunksize = chunk_size, njobs= n_jobs):
'''
Feeds the list of training X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
Then it feeds those chunks of X, along with the corresponding chunks of Y, into randforest.
'''
    for i in range(ceil(len(Xfiles)/chunksize)):
        print("Processing training chunk", i+1, "out of", ceil(len(Xfiles)/chunksize))
pool = Pool(njobs)
Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
pool.close()
Xchunk = np.array(Xchunk)
Ychunk = Y[i*chunksize:(i+1)*chunksize]
randforest.n_estimators += 5
randforest.fit(Xchunk, Ychunk)
return randforest
# A function that tells your forest to make a prediction on each chunk of the test set, then stitch them together.
def chunkypredict(randforest, Xfiles, chunksize = chunk_size, njobs= n_jobs):
'''
Feeds the list of test X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
Then it feeds those chunks of X into randforest.predict, and stitches the answers together.
'''
predictY = []
    for i in range(ceil(len(Xfiles)/chunksize)):
        print("Processing test chunk", i+1, "out of", ceil(len(Xfiles)/chunksize))
pool = Pool(njobs)
Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
pool.close()
Xchunk = np.array(Xchunk)
Ychunk = randforest.predict(Xchunk)
predictY.extend(Ychunk)
return np.array(predictY)
# We set warm_start = True,
# so we can train new trees in the forest on new chunks of the data set without losing the old trees.
randforest = RandomForestRegressor(n_estimators=0, max_features = 0.3333,
n_jobs=n_jobs, verbose=0, warm_start=True)
maxfiles = 30000
chunkyfit(randforest, trainingXfiles[:maxfiles], trainingY[:maxfiles])
predictedY = chunkypredict(randforest, testXfiles[:maxfiles])
print(rmse(testY[:maxfiles], predictedY))
Well, that's definitely the limit of random forest acting on pixels, then. Time to move on to an entirely different approach.
Except...we did a little better with a much smaller data set earlier. What gives? Probably something about the fraction of the feature set allowed (the max_features hyperparameter). Let's try that approach again with this fraction of the data. I'm guessing that it won't be too much better, and then we can move on to something else.
In [28]:
from math import ceil
maxfiles = 30000
image_size = 40
def pipefunc_size(fn): return pipeline(fn, image_size)
base_chunk_size = 160
# base_chunk_size determines how much data is read into memory at any one time.
# The smaller it is, the less RAM this will all use.
# Don't make it much larger than 200, and definitely don't make it over 1000 if you want to run this on a laptop.
n_jobs = 5
# n_jobs determines how many processors will be used.
# Make sure that chunk size is an even multiple of n_jobs!
chunk_size = base_chunk_size * n_jobs
def chunkyfit(randforest, Xfiles, Y, chunksize = chunk_size, njobs= n_jobs):
'''
Feeds the list of training X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
Then it feeds those chunks of X, along with the corresponding chunks of Y, into randforest.
'''
    for i in range(ceil(len(Xfiles)/chunksize)):
        print("Processing training chunk", i+1, "out of", ceil(len(Xfiles)/chunksize))
pool = Pool(njobs)
Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
pool.close()
Xchunk = np.array(Xchunk)
Ychunk = Y[i*chunksize:(i+1)*chunksize]
randforest.n_estimators += 40
randforest.fit(Xchunk, Ychunk)
return randforest
# A function that tells your forest to make a prediction on each chunk of the test set, then stitch them together.
def chunkypredict(randforest, Xfiles, chunksize = chunk_size, njobs= n_jobs):
'''
Feeds the list of test X files through the pipeline, chunk by chunk, so the RAM doesn't get overloaded.
Then it feeds those chunks of X into randforest.predict, and stitches the answers together.
'''
predictY = []
    for i in range(ceil(len(Xfiles)/chunksize)):
        print("Processing test chunk", i+1, "out of", ceil(len(Xfiles)/chunksize))
pool = Pool(njobs)
Xfiles_chunk = Xfiles[i*chunksize:(i+1)*chunksize]
Xchunk = pool.map(pipefunc_size, Xfiles_chunk)
pool.close()
Xchunk = np.array(Xchunk)
Ychunk = randforest.predict(Xchunk)
predictY.extend(Ychunk)
return np.array(predictY)
# We set warm_start = True,
# so we can train new trees in the forest on new chunks of the data set without losing the old trees.
randforest = RandomForestRegressor(n_estimators=0,
n_jobs=n_jobs, verbose=0, warm_start=True)
chunkyfit(randforest, trainingXfiles[:maxfiles], trainingY[:maxfiles], chunksize = chunk_size)
predictedY = chunkypredict(randforest, testXfiles[:maxfiles], chunksize = chunk_size)
print(rmse(testY[:maxfiles], predictedY))
OK, no real improvement. Time to try featurization.