Replicating: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson2-image_models.ipynb
In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
In [3]:
PATH = "data/planet/"
In [8]:
!ls {PATH}
In [4]:
from fastai.plots import *
In [6]:
def get_1st(path): return glob(f'{path}/*.*')[0]
In [7]:
dc_path = "data/dogscats/valid/"
list_paths = [get_1st(f"{dc_path}cats"), get_1st(f"{dc_path}dogs")]
plots_from_files(list_paths, titles=["cat", "dog"], maintitle="Single-label Classification")
In single-label classification each sample belongs to one class. In the previous example, each image is either a dog or a cat.
In [8]:
list_paths = [f"{PATH}train-jpg/train_0.jpg", f"{PATH}train-jpg/train_1.jpg"]
titles = ["haze primary", "agriculture clear primary water"]
plots_from_files(list_paths, titles=titles, maintitle="Multi-label Classification")
Softmax wouldn't be good because it wants to "pick a thing". Instead, we'll use the Sigmoid. In multi-label clsfn, each sample can belong to one or more classes. In the previous example, the 1st images belong to two classes: haze and primary. The 2nd belongs to four classes: agriculture, clear, primary, and water.
fastai is baller AF bc it'll look at the labels in the CSV, and if there're > 1 labels ever, for any item, it'll automatically switch to multi-label mode.
In [5]:
from planet import f2
metrics = [f2]
f_model = resnet34
In [6]:
label_csv = f'{PATH}train_v2.csv'
n = len(list(open(label_csv)))-1
val_idxs = get_cv_idxs(n, cv_idx=4) # using cv_idx=4 to keep same permutation as orig. notebook
We use a different set of DAs for this dataset -- we also allow vertical flips, since we don't expect the vertical orientation of satellite images to change our classifications.
In [7]:
def get_data(sz,bs=64):
tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down, max_zoom=1.05)
return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms, bs=bs,
suffix='.jpg', val_idxs=val_idxs, test_name='test-jpg')
In [12]:
data = get_data(256) # the planet images are 256x256
In [13]:
# turning a dataloader into an iterator:
x,y = next(iter(data.val_dl)) # note: .._ds: dataset; .._dl: dataloader | PyTorch concepts
# idea: dataset gives you a single image or object back
# dataloader gives you back a single (transformed) mini-batch (next mini-batch only)
In [14]:
y # 64x17: batch size x number of classes
Out[14]:
In [15]:
# zip, basically zips two lists together into an iterator
list(zip(data.classes, y[0])) # getting the 0th image's labels -- from validation set
Out[15]:
Behind the scenes, FastAI & PyTorch are turning our labels into 1-Hot Encoded Labels.
Storing 1H-Encs as separate arrays is v.inefficient, instead the index values of positive-encodings are used, although the actual 1H-Enc vectors are dealt w/ deep in PyTorch.
In [16]:
# data.val_ds.fnames[:15]
In [17]:
plt.imshow(data.val_ds.get_x(0))
Out[17]:
Images are just matrices of numbers, so we can just multiply them by a number if they're too washed out / hazy:
In [18]:
plt.imshow(data.val_ds.denorm(to_np(x))[0]*1.4);
# NOTE: same as: plt.imshow(data.val_ds.get_x(0) * 1.4)
From here on, we just proceed as normal.
What's interesting about this dataset is it's nothing like ImageNet.
The first thing we do is resize the data to 64 x 64. The data starts out as 256 x 256. You wouldn't want to do this for the DogsCats competition because the pretrained models start out almost perfect for them (being trained on similarly-sized ImageNet images). If we resized to 64 x 64 and then retrained the entire network: we'd essentially destroy the weights that were already pretrained to be v.good. Most ImageNet-trained models are trained at 224 x 224 or 299 x 299.
However there's no satellite imagery in ImageNet. So only the earlier Conv layers are going to be useful to us -- edges, gradients, repeating patterns, etc.
Starting out by training smaller images tends to work well for satellite images -- coming from an ImageNet-pretrained model.
In [19]:
sz = 64 # resize to 64x64
data = get_data(sz) # grab data
data = data.resize(int(sz*1.3), 'tmp') # <-- this line not necessary
A NOTE on the data.resize(int(sz*1.3), 'tmp') line above. The dataloader only resizes images when it outputs them to the model --> the images go into the dataloader at full size. The data.resize(.) method takes a maximum size (here int(64 * 1.3) = 83) and creates a resized copy of the entire dataset to a temporary folder. This technique is purely a speed-up measure, with no actual effect on model performace. ... But very handy in practice when you think about it... good to know..
http://forums.fast.ai/t/dog-breed-identification-challenge/7464/51
In [20]:
learn = ConvLearner.pretrained(f_model, data, metrics=metrics) # build the model
In [22]:
learn.lr_find()
learn.sched.plot()
The optimal starting learning rates turns out to be quite high: O(1e0) to O(1e-1)
Because the dataset is so unlike ImageNet, we'll have to do quite a bit of fitting of the last layer -- do that until it starts to flatten out.
Another difference from ImageNet-like datasets: instead of Dfntl LRs differing by orders of magnitude (λr÷100, λr÷10, λr), we use a much more aggressive Dfntl of 3: (λr÷9, λr÷3, λr)
With the idea being: the earlier layers are not as close to what they need to be compared to the ImageNet-like datasets.
Then we unfreeze and train for a while:
In [13]:
lr = 0.2
In [24]:
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
In [14]:
lrs = np.array([lr/9, lr/3, lr])
In [26]:
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
In [27]:
learn.save(f'{sz}')
We can see the loss-spikes at the start of each SGDR cycle:
In [28]:
learn.sched.plot_loss()
Now we increase (double here) the size of our images and train again:
In [29]:
sz = 128
In [30]:
learn.set_data(get_data(sz))
learn.freeze()
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
In [31]:
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
learn.save(f'{sz}')
Double image size again (finally up to original size) & repeat:
In [12]:
??learn.save
In [33]:
learn.save(f'{sz}_unfrozen')
<<< Restarted Notebook -- Loading saved Weights >>>>
I thought my weights were deleted for a while there. For some reason the file 128_unfrozen.h5 was saved to {PATH}/tmp/83/models/ instead of to {PATH}/models/. Is the FastAI library setting its current working directory to the temporary images folder after data.resize(.)?
I'll just move the models folder up to where it should be and carry on as normal.
In [11]:
data = get_data(sz = 128)
learn = ConvLearner.pretrained(f_model, data, metrics=metrics)
learn.load(f'128_unfrozen')
In [9]:
sz = 256
In [28]:
learn.set_data(get_data(sz))
learn.freeze()
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
In [29]:
learn.save(f'{sz}')
learn.data.bs = 32
In [13]:
learn.load(f'256')
learn.set_data(get_data(sz, bs=24))
In [10]:
# restarting after a network error
data = get_data(sz, bs=24)
learn = ConvLearner.pretrained(f_model, data, metrics=metrics)
learn.load(f'256')
In [17]:
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
learn.save(f'{sz}')
Finally predict with Test Time Augmentation (this is on validation data)
In [18]:
tta = learn.TTA()
In [19]:
f2(*tta)
Out[19]:
This approach gets up to 50th place on the Kaggle Planet Amazon competition! A competition full of machine-learning / data-scientist wizards. Hail RoboLord. Hail. ..But seriously, better tools are awesome.
In [1]:
# reloading everything in one cell after dead kernel:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
PATH = "data/planet/"
from fastai.plots import *
from planet import f2
metrics = [f2]
f_model = resnet34
label_csv = f'{PATH}train_v2.csv'
# n = len(list(open(label_csv)))-1
# val_idxs = get_cv_idxs(n, cv_idx=4)
val_idxs = [0]
def get_data(sz,bs=64):
tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down, max_zoom=1.05)
return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms, bs=bs,
suffix='.jpg', val_idxs=val_idxs, test_name='test-jpg')
data = get_data(sz=256)
learn = ConvLearner.pretrained(f_model, data, metrics=metrics)
learn.load(f'256')
In [3]:
learn.summary()
Out[3]:
In [4]:
learn.data.sz
Out[4]:
In [4]:
# It seems this is still necessary despite similar problems solved in this issue:
# https://github.com/fastai/fastai/issues/23
import resource
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))
In [8]:
test_log_preds = learn.TTA(is_test=True)
# test_preds = np.exp(test_log_preds)
In [24]:
# cut off the unneeded Zero labels
test_log_preds = test_log_preds[0]
In [25]:
test_preds = np.exp(test_log_preds)
In [31]:
data.classes
Out[31]:
In [32]:
data.test_ds.fnames[:10]
Out[32]:
In [34]:
x = 'test-jpg/test_26652.jpg'
In [41]:
print(x)
print(x.split('/'))
print(x.split('/')[1])
print(x.split('/')[1].split('.')[0])
Submission format:
image_name,tags
test_0,agriculture road water
test_1,primary clear
test_2,haze primary
etc.
In [49]:
test_preds[5]
Out[49]:
In [101]:
arr = [chr(ord('ა') + i) for i in range(10)]
' '.join(arr)
Out[101]:
In [67]:
arr = [chr(ord('a') + i) for i in range(10)]
num = [i % 2 for i in range(10)]
print(arr)
print(idxs)
# NOTE: np.where(.) needs NumPy arrays!
arr = np.array(arr)
num = np.array(num)
print(arr[np.where(num==1)])
In [78]:
temp = np.array([i % 2 for i in range(6)])
classes = np.array(data.classes)
classes[np.where(temp==1)]
Out[78]:
Looking at test_preds I'm noticing a base value of 1.0 and predictions that go up from there to, I think, a max of around 2.0. Judging by that and tips in this fastai thread: I'll set a threshold of 1.02 and anything higher than that will be a positive match.
I don't know why my predictions are starting from 1 and not zero, if I did something wrong or if it's just like that.
In [102]:
threshold = 1.02
classes = np.array(data.classes)
predicted_tags = [[' '.join(classes[np.where(pred >= threshold)])] for pred in test_preds]
In [104]:
submission = pd.DataFrame(predicted_tags)
submission.columns = ['tags']
In [106]:
submission.insert(0, 'image_name', [f.split('/')[1].split('.')[0] for f in data.test_ds.fnames])
submission.head()
Out[106]:
In [107]:
# for single-class; won't work here
# submission = pd.DataFrame(test_preds)
# submission.columns = data.classes
# submission.insert(0, 'id', [f.split('/')[1].split('.')[0] for f in data.test_ds.fnames])
In [108]:
SUBM = PATH + 'subm/'
os.makedirs(SUBM, exist_ok=True)
In [110]:
submission.to_csv(f'{SUBM}FADL1-L3CA-submission-RN34-00.csv.gz', compression='gzip', index=False)
In [111]:
# FileLink(f'{SUBM}FADL1-L3CA-submission-RN34-00.csv.gz')
Out[111]:
I made a mistake here. I didn't add the additional test set images. Will do that below. Not sure if there's a cleaner way to add more test data in fastai:
In [113]:
# does tfms matter since I'm only adding another test-set ?
tfms = tfms_from_model(f_model, sz=256, aug_tfms=transforms_top_down, max_zoom=1.05)
data = ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms, bs=24, suffix='.jpg',
val_idxs=val_idxs, test_name='test-jpg-additional')
learn.set_data(data)
In [114]:
log_preds_adtl = learn.TTA(is_test=True)[0]
preds_adtl = np.exp(log_preds_adtl)
In [117]:
data.test_ds.fnames[0]
Out[117]:
In [118]:
predicted_tags_adtl = [[' '.join(classes[np.where(pred >= threshold)])] for pred in preds_adtl]
df_adtl = pd.DataFrame(predicted_tags_adtl, columns = ['tags'])
df_adtl.insert(0, 'image_name', [f.split('/')[1].split('.')[0] for f in data.test_ds.fnames])
In [119]:
df_adtl.head()
Out[119]:
Pandas DataFrame appending: Docs λink
In [120]:
submission.append(df_adtl)
Out[120]:
In [133]:
submission.shape
Out[133]:
So append doesn't automatically change what you're appending to.
In [134]:
submission = submission.append(df_adtl)
In [130]:
x, y = 'a', 'b'
f'{x,y}', f'{x}{y}'
Out[130]:
In [135]:
subm_name = "FADL1-L3CA-submission-RN34-00"
submission.to_csv(f'{SUBM}{subm_name}.csv.gz', compression='gzip', index=False)
In [136]:
FileLink(f'{SUBM}{subm_name}.csv.gz')
Out[136]:
Submissino FADL1-L3CA-submission-RN34-00 scored 0.88148 -- 616/938 Private.
Saving raw log predictions incase I try to tweak thresholds later:
In [139]:
len(test_log_preds), len(log_preds_adtl)
Out[139]:
In [144]:
pd.DataFrame(test_log_preds, columns=data.classes).to_feather(f'{SUBM}test_log_preds.feather')
pd.DataFrame(log_preds_adtl, columns=data.classes).to_feather(f'{SUBM}test_log_preds_adtl.feather')
Tweaking thresholds and seeing if that changes things:
In [146]:
temp = pd.read_feather(f'{SUBM}test_log_preds.feather')
In [147]:
temp.head()
Out[147]:
In [149]:
temp = temp.as_matrix();
type(temp)
Out[149]:
In [151]:
temp[0]
Out[151]:
Oh Shiiit. Those are Logarithmic Predictions?? Aha.. So.. I shouldn't've been exponentiating them in the first place: that explains why I had strange values. Right. So instead of needing a fancy function to iterate & test thresholds (which I can still do), I just need to make a submission of the log predictions. Got it.
In [153]:
log_preds_adtl[0]
Out[153]:
In [158]:
a = [i for i in range(5)]; b = [-i for i in range(5)];
c = np.concatenate((a,b)); c
Out[158]:
In [178]:
In [201]:
def get_data_2(test_name=None):
label_csv = f'{PATH}train_v2.csv'
f_model = resnet34
tfms = tfms_from_model(f_model, sz=256, aug_tfms=transforms_top_down, max_zoom=1.05)
return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, bs=64, tfms=tfms,
suffix='.jpg', val_idxs=[0], test_name=test_name)
def generate_submission(threshold=0.02):
SUBM = PATH + 'subm/'
test_log_preds = pd.read_feather(f'{SUBM}test_log_preds.feather').as_matrix()
test_log_preds_adtl = pd.read_feather(f'{SUBM}test_log_preds_adtl.feather').as_matrix()
data = get_data_2(test_name='test-jpg')
fnames = data.test_ds.fnames
data = get_data_2(test_name='test-jpg-additional')
fnames = np.concatenate((fnames, data.test_ds.fnames))
classes = np.array(data.classes)
preds = np.concatenate((test_log_preds, test_log_preds_adtl))
names = [f.split('/')[1].split('.')[0] for f in fnames]
predicted_tags = [[' '.join(classes[np.where(pred >= threshold)])] for pred in preds]
submission = pd.DataFrame(predicted_tags, columns=['tags'])
submission.insert(0, 'image_name', names)
return submission
In [202]:
submission = generate_submission(threshold=0.02)
submission.head()
Out[202]:
In [203]:
subm_name = "FADL1-L3CA-submission-RN34-01"
submission.to_csv(f'{SUBM}{subm_name}.csv.gz', compression='gzip', index=False)
FileLink(f'{SUBM}{subm_name}.csv.gz')
Out[203]:
In [206]:
submission = generate_submission(threshold=0.03)
subm_name = "FADL1-L3CA-submission-RN34-04"
submission.to_csv(f'{SUBM}{subm_name}.csv.gz', compression='gzip', index=False)
FileLink(f'{SUBM}{subm_name}.csv.gz')
Out[206]:
In [207]:
submission = generate_submission(threshold=0.035)
subm_name = "FADL1-L3CA-submission-RN34-05"
submission.to_csv(f'{SUBM}{subm_name}.csv.gz', compression='gzip', index=False)
FileLink(f'{SUBM}{subm_name}.csv.gz')
Out[207]:
This scores: 0.90181 -- 517/938 Private, when using a threshold of 0.035 and Log predictions.
In [208]:
submission = generate_submission(threshold=0.04)
subm_name = "FADL1-L3CA-submission-RN34-06"
submission.to_csv(f'{SUBM}{subm_name}.csv.gz', compression='gzip', index=False)
FileLink(f'{SUBM}{subm_name}.csv.gz')
Out[208]:
In [209]:
submission = generate_submission(threshold=0.045)
subm_name = "FADL1-L3CA-submission-RN34-07"
submission.to_csv(f'{SUBM}{subm_name}.csv.gz', compression='gzip', index=False)
FileLink(f'{SUBM}{subm_name}.csv.gz')
Out[209]:
0.90596 -- 488/938 Private, threshold of 0.045
Oh.. hah ffs.. Looks like the optimal threshold was around 0.2, not 0.02 as I thought. fastai link. Well I used up my daily submits at this point -- will see how high this model can get in about 23 hours.
In [ ]: