In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
In [2]:
from fastai.conv_learner import *
from fastai.dataset import *
from pathlib import Path
import json
from PIL import ImageDraw, ImageFont
from matplotlib import patches, patheffects
We'll be looking at the Pascal VOC dataset. It's quite slow, so you may prefer to download from this mirror. There are two different competition/research datasets, from 2007 and 2012. We'll use the 2007 version. You can use the larger 2012 for better results, or even combine them (but be careful to avoid data leakage between the validation sets if you do this).
Unlike previous lessons, we're using the Python 3 standard library pathlib
for our paths and file access. Note that it returns an OS-specific class (on Unix: PosixPath
) so your output may look a little different. Most libraries that take paths as input can take a pathlib object - although some (like cv2) can't, in which case you can use str()
to convert to a string.
In [6]:
PATH = Path('data/pascal/')
list(PATH.iterdir())
Out[6]:
As well as images, there are also annotations - bounding boxes showing where each object is. These were hand labeled. The original versions were in XML, which is a little hard to work with nowadays, so we use the more recent JSON version.
pathlib includes the ability to open files, and much more.
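As a small sketch (using the training JSON file we open below), a Path can be joined with /, inspected, and converted back to a plain string when a library insists on one:
p = PATH/'pascal_train2007.json'   # '/' joins path components
p.name, p.suffix                   # ('pascal_train2007.json', '.json')
p.exists()                         # True if the file is there
str(p)                             # plain string form, for libraries (like cv2) that only accept strings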
Here we want to open JSON files that contain the bounding boxes and object classes. The fastest way to do this in Python is with the JSON library -- although there are Google versions for super-large files.
In [7]:
trn_j = json.load((PATH/'pascal_train2007.json').open())
trn_j.keys()
Out[7]:
JSON - JavaScript Object Notation - is kind of a standard way to pass around hierarchical structured data now.
In [8]:
IMAGES, ANNOTATIONS, CATEGORIES = ['images', 'annotations', 'categories']
trn_j[IMAGES][:5]
Out[8]:
In [9]:
trn_j[ANNOTATIONS][:2]
Out[9]:
The segmentation field contains polygon segmentation; we'll use the bounding boxes instead.
In [10]:
trn_j[CATEGORIES][:4]
Out[10]:
It's helpful to use constants instead of string literals, since we get tab-completion and don't mistype.
We can turn this categories list into a dictionary of id -> name:
In [11]:
FILE_NAME, ID, IMG_ID, CAT_ID, BBOX = 'file_name', 'id', 'image_id', 'category_id', 'bbox'
cats = {o[ID]:o['name'] for o in trn_j[CATEGORIES]}
trn_fns = {o[ID]:o[FILE_NAME] for o in trn_j[IMAGES]}
trn_ids = [o[ID] for o in trn_j[IMAGES]]
In [12]:
list((PATH/'VOCdevkit'/'VOC2007').iterdir())
Out[12]:
In [13]:
JPEGS = 'VOCdevkit/VOC2007/JPEGImages'
In [14]:
IMG_PATH = PATH/JPEGS
list(IMG_PATH.iterdir())[:5]
Out[14]:
Each image has a unique ID
In [15]:
im0_d = trn_j[IMAGES][0]
im0_d[FILE_NAME], im0_d[ID]
Out[15]:
A defaultdict is useful any time you want a default entry for new keys. Here we create a dict from image IDs to a list of annotations (tuples of bounding box and class id).
We convert VOC's x/y/width/height boxes into top-left/bottom-right, and switch the x/y coordinates to row/column order to be consistent with NumPy.
The idea here is to create a dictionary where the key is the image id and the value is the list of all its annotations. So: go through each annotation; if it isn't marked as ignored, append its bounding box and class to the appropriate dictionary item (where that item is a list).
But if that dictionary item doesn't exist yet, there's no list to append to. That's what collections.defaultdict is for: it behaves just like a regular dictionary, except that if you access a key that does not exist, it creates that key with the default value returned by a function you specify -- in this case lambda: [].
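A tiny standalone illustration of that behaviour (toy keys and values, not the real annotations):
import collections
d = collections.defaultdict(lambda: [])
d[12].append(('first bbox', 7))      # no KeyError: the empty list is created on first access
d[12].append(('second bbox', 13))
dict(d)                              # {12: [('first bbox', 7), ('second bbox', 13)]}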
NOTE that the dimensions are reversed in hw_bb. This is because computer vision usually uses width x height, whereas mathematics uses rows x columns. fastai follows the NumPy/PyTorch convention of rows x columns, and uses top-left/bottom-right coordinates instead of VOC's top-left plus width/height.
In [16]:
def hw_bb(bb): return np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1])
trn_anno = collections.defaultdict(lambda:[])
for o in trn_j[ANNOTATIONS]:
    if not o['ignore']:
        bb = o[BBOX]
        bb = hw_bb(bb)
        trn_anno[o[IMG_ID]].append((bb, o[CAT_ID]))
len(trn_anno)
Out[16]:
Now we have a dictionary from image id -> list of (bounding_box_coords, class_id) tuples.
In [17]:
im_a = trn_anno[im0_d[ID]]; im_a
Out[17]:
In [18]:
im0_a = im_a[0]; im0_a
Out[18]:
In [19]:
cats[7]
Out[19]:
In [20]:
trn_anno[17]
Out[20]:
In [21]:
cats[15], cats[13]
Out[21]:
Some libraries take VOC-format bounding boxes, so this lets us convert back when required:
In [22]:
def bb_hw(a): return np.array([a[1], a[0], a[3]-a[1]+1, a[2]-a[0]+1])
In [23]:
bb_voc = [155, 96, 196, 174]
bb_fastai = hw_bb(bb_voc)
In [24]:
f'expected: {bb_voc}, actual: {bb_hw(bb_fastai)}'
Out[24]:
You can use Visual Studio Code (vscode - an open source editor that comes with recent versions of Anaconda, or can be installed separately), or most editors and IDEs, to find out all about the open_image function. vscode things to know:
Command palette (Ctrl-shift-p)
Go to symbol (Ctrl-t)
Find references (Shift-F12)
Go to definition (F12)
Go back (alt-left)
Hide sidebar (Ctrl-b)
Zen mode (Ctrl-k,z)
In [25]:
im = open_image(IMG_PATH/im0_d[FILE_NAME])
Matplotlib's plt.subplots is a really useful wrapper for creating plots, regardless of whether you have more than one subplot. NOTE that Matplotlib has an optional object-oriented API which is much easier to understand and use (although few examples online use it).
In [26]:
def show_img(im, figsize=None, ax=None):
    if not ax: fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    return ax
A simple but rarely used trick for making text visible regardless of background is to use white text with a black outline, or vice versa. Here's how to do it in matplotlib:
In [27]:
def draw_outline(o, lw):
    o.set_path_effects([patheffects.Stroke(
        linewidth=lw, foreground='black'), patheffects.Normal()])
Note that * in argument lists is the splat operator. In this case it's a little shortcut compared to writing out b[-2], b[-1].
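A quick illustration, using the box from the earlier bb_voc example in bb_hw form:
b = [155, 96, 196, 174]      # an [x, y, width, height] box, as returned by bb_hw
print(b[:2], *b[-2:])        # the splat expands to: print(b[:2], b[-2], b[-1])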
In [28]:
def draw_rect(ax, b):
    patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill=False, edgecolor='white', lw=2))
    draw_outline(patch, 4)
In [29]:
def draw_text(ax, xy, txt, sz=14):
    text = ax.text(*xy, txt,
        verticalalignment='top', color='white', fontsize=sz, weight='bold')
    draw_outline(text, 1)
In [30]:
ax = show_img(im)
b = bb_hw(im0_a[0])
draw_rect(ax, b)
draw_text(ax, b[:2], cats[im0_a[1]]) # b[:2] is top-left; im0_a[1] is class
So because Matplotlib has an OO API, we can just create a text object in draw_text, and pass it off to draw_outline to draw an outline around it. Same for the bounding box: draw_rect creates a patch object called patch and sends it to draw_outline, which puts a black outline around the white rectangle.
Matplotlib's .add_patch is called on an axes object to draw a rectangle, passing it a patches.Rectangle.
What's great is now that we have all that set up, we can use it for all our Object Detection work going forward! So let's package that all up a bit for quick use later.
In [34]:
# draw image with annotations
def draw_im(im, ann):
    ax = show_img(im, figsize=(16,8))
    for b,c in ann: # destructuring assignment
        b = bb_hw(b)
        draw_rect(ax, b)
        draw_text(ax, b[:2], cats[c], sz=16)

# draw image at a particular index
def draw_idx(i):
    im_a = trn_anno[i] # grab the annotations for this image ID
    im = open_image(IMG_PATH/trn_fns[i]) # open image
    print(im.shape)
    draw_im(im, im_a)
In [35]:
draw_idx(17)
A lambda function is simply a way to define an anonymous function inline. Here we use it to describe how to sort the annotations for each image - by bounding box size (descending).
In [36]:
def get_lrg(b):
    if not b: raise Exception()
    b = sorted(b, key=lambda x: np.product(x[0][-2:] - x[0][:2]), reverse=True)
    return b[0]
In [37]:
trn_lrg_anno = {a: get_lrg(b) for a,b in trn_anno.items()}
Here's something cool -- J.Howard started with the second of the two cells above (the dictionary comprehension), then wrote the first (get_lrg). He started with the API he wanted to work with -- then implemented it: something that takes all of the bounding boxes for a particular image and finds the largest.
He does that by sorting the bounding boxes by the product of (the last two items of the bounding-box list, the bottom-right corner) minus (the first two items, the top-left corner). Bottom-right minus top-left gives the height and width; the product of those is the area of the bounding box. Cool.
Now we have a dictionary from image id to a single bounding box - the largest for that image.
In [38]:
b, c = trn_lrg_anno[23]
b = bb_hw(b)
ax = show_img(open_image(IMG_PATH/trn_fns[23]), figsize=(5,10))
draw_rect(ax, b)
draw_text(ax, b[:2], cats[c], sz=16)
It's very important to look at your work at every stage in the pipeline.
In [39]:
(PATH/'tmp').mkdir(exist_ok=True) # making a new folder in our directory
CSV = PATH/'tmp/lrg.csv' # path to large-objects csv file
Often it's easiest to simply create a CSV of the data you want to model, rather than trying to create a custom dataset. Here we use Pandas to help us create a CSV of the image filename and class. Basically: we already have an ImageClassifierData.from_csv method, so there's no reason to build a custom dataloader; just put the labels & ids into a CSV file.
--> this is actually exactly what I did for my GLoC Detector.
Below: the easiest way to create a CSV is a Pandas DataFrame. Create it as a dictionary of 'name of column' : 'list of things in that column'. columns is specified even though the columns are already given because dictionaries are unordered -- and order matters here.
--> Learned that the hard way in my GLoC Detector.
In [40]:
df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids],
                   'cat': [cats[trn_lrg_anno[o][1]] for o in trn_ids]}, columns=['fn','cat'])
df.to_csv(CSV, index=False)
In [41]:
f_model = resnet34
sz = 224
bs = 64
From here on it's just like Dogs vs Cats! We have a CSV file containing a bunch of file names and, for each one, the class of its largest object.
In [42]:
tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_side_on, crop_type=CropType.NO)
md = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms)
In [43]:
x, y = next(iter(md.val_dl))
In [44]:
show_img(md.val_ds.denorm(to_np(x))[0]);
Some differences from how things were done in Part 1. crop_type is different: to resize an image to 224x224, the standard approach resizes it so the smallest side is 224, then takes a random square crop during training; during validation it takes a center crop (or multiple random crops if using data augmentation). We don't want to do that for object detection, because objects can be anywhere in an image -- in image classification the object is usually near the center, and here we don't want to risk cropping out the object we want to detect.
crop_type=CropType.NO means no crop -- the image is just squished into a square. Most CV models work better if you crop rather than squish, but they still work pretty well nonetheless.
md is a ModelData object. Its .trn_dl is a training dataloader: an iterator that returns the next minibatch. To use it manually, iter(md.val_dl) returns an iterator from which you can call next(.) to get the next minibatch.
However, we can't take x and y from the next minibatch and send them straight to show_img, because all standard ImageNet-pretrained models expect data to have been normalized to zero mean and unit standard deviation.
So you use the method denorm, via md.val_ds.denorm(.), on the dataset, which denormalizes the image and reorders its dimensions. The normalization statistics are hardcoded in fastai from ImageNet, Inception, etc. The denormalization depends on the transform - and the dataset knows which transform was used to create it.
In [41]:
# x[0] # x : minibatch of 64x3x224x224
In [41]:
learn = ConvLearner.pretrained(f_model, md, metrics=[accuracy])
learn.opt_fn = optim.Adam
In [53]:
lrf = learn.lr_find(1e-5, 100)
When your LR finder graph looks like this, you can ask for more points on each end:
In [54]:
learn.sched.plot()
In [55]:
learn.sched.plot(n_skip=5, n_skip_end=1)
In [56]:
# NB: disabling monitor thread to fix annoying tqdm errors - https://github.com/tqdm/tqdm/issues/481
# also: https://github.com/tqdm/tqdm/issues/481#issuecomment-378067008
# tqdm.monitor_interval = 0 ## <-- doesn't seem to change anything
In [57]:
lr = 2e-2
In [58]:
learn.fit(lr, 1, cycle_len=1)
Out[58]:
In [59]:
lrs = np.array([lr/1000, lr/100, lr])
In [60]:
learn.freeze_to(-2)
In [61]:
lrf = learn.lr_find(lrs/1000)
learn.sched.plot(1)
In [62]:
learn.fit(lrs/5, 1, cycle_len=1)
Out[62]:
In [63]:
learn.unfreeze()
Accuracy isn't improving much - since many images have multiple different objects, it's going to be impossible to be that accurate.
In [64]:
learn.fit(lrs/5, 1, cycle_len=2)
Out[64]:
In [65]:
learn.save('class_one')
In [42]:
learn.load('class_one')
In [43]:
x,y = next(iter(md.val_dl))
probs = F.softmax(predict_batch(learn.model, x), -1)
x,preds = to_np(x),to_np(probs)
preds = np.argmax(preds, -1)
You can use the python debugger pdb to step through code.
pdb.set_trace() to set a breakpoint
%debug magic to trace an error
Commands you need to know:
s / n / c (step into / next line / continue)
u / d (up / down the call stack)
p (print an expression)
l (list the lines around the current one)
In [56]:
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i,ax in enumerate(axes.flat):
    ima = md.val_ds.denorm(x)[i]
    b = md.classes[preds[i]]
    ax = show_img(ima, ax=ax)
    draw_text(ax, (0,0), b)
plt.tight_layout()
A way to see what the code above does is to take the contents of the loop, outdent them, set i = 0, put each line in a separate cell, and run each cell, printing its output.
In [ ]:
i = 0
In [ ]:
ima = md.val_ds.denorm(x)[i]
In [ ]:
b = md.classes[preds[i]]
In [ ]:
ax = show_img(ima, ax=ax)
In [ ]:
draw_text(ax, (0,0), b)
The Python debugger is also very useful. If you know an issue is happening at a specific minibatch/iteration, you can set a breakpoint with pdb.set_trace() and make it trigger conditionally. h brings up help. Pdb shows you the line it's about to run. If you want to print something out, you can write any Python expression - hit enter - and it'll display it: in this case md.val_ds.denorm(x).
Then, to see where that piece of code sits: l (list) displays where in the code/loop you are, with an arrow pointing at the line you are about to run.
To run that line and go to the next: n. We enter n again to go to the next line -- and if you just hit enter, pdb repeats the last thing you entered. At this point, if we want to see b -- b is also a pdb command (break), so to force pdb to print the b variable we use p b. Then we enter n for the next line.
At this point the code is about to draw the image. We don't want to draw it - but we want to see how it's drawn -- so we want to step into the function with s.
s takes us into draw_text. We can enter n to go to the next line inside draw_text, and l to see where we are inside the function.
If we want to continue to the next breakpoint, we enter c.
Example case: say we step into denorm: n -> s -> l.
What will often happen is you're debugging something in your PyTorch module, it's hit an exception, and you find yourself six layers deep inside PyTorch when what you really want is to see back up to where you called it from.
In this case we're inside a @property, but we want to know what was going on up the call stack: we hit u - which doesn't run anything, but changes the context of the debugger to show us what called it -- at which point we can enter things to find out about that environment, like p i to print the value of i.
After that, if we want to go back down again: d.
ipdb is the IPython debugger and it's prettier.
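As mentioned above, the breakpoint can also be made conditional so it only fires on the iteration you care about. A minimal standalone sketch (the index 6 is just an arbitrary example):
import pdb

for i in range(12):
    if i == 6: pdb.set_trace()   # drop into the debugger only when i == 6
    total = i * 2                # once stopped you can `p i`, `p total`, `n`, `c`, etc.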
In [57]:
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i,ax in enumerate(axes.flat):
    pdb.set_trace() # <-- pdb breakpoint
    ima = md.val_ds.denorm(x)[i]
    b = md.classes[preds[i]]
    ax = show_img(ima, ax=ax)
    draw_text(ax, (0,0), b)
plt.tight_layout()
The sequence of debugger commands used in the walkthrough above: h - md.val_ds.denorm(x) - l - n - n - p b - n - s - n - l - c -- n - s - l - u - p i - d - exit
The other place the debugger comes in particularly handy is when you've got an Exception - particularly one raised deep inside PyTorch.
Imagine we wrote preds[i*100] instead of preds[i]. In this case it's easy to see what's wrong, but often it isn't, so here's what we do.
In [58]:
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i,ax in enumerate(axes.flat):
    ima = md.val_ds.denorm(x)[i]
    b = md.classes[preds[i*100]]
    ax = show_img(ima, ax=ax)
    draw_text(ax, (0,0), b)
plt.tight_layout()
%debug pops open the debugger at the point the exception happened, so now we can check what happened. Try len(preds). Try p i*100 to print i*100; and you can go up and down the call stack, etc. J.Howard does all of the fastai library and course development interactively in Jupyter notebooks, and uses %debug all the time - along with copying functions out into individual cells and running them piecemeal.
In [59]:
%debug
Next, from here, we want to create the bounding box. We can create a regression instead of a classification neural network. A classification neural network has a sigmoid or softmax output and a cross entropy, binary cross entropy, or negative log likelihood loss function. If we drop the softmax or sigmoid at the end and use mean squared error as the loss function, it's now a regression model, so we can use it to predict continuous numbers rather than a category.
We also know that we can have multiple outputs -- we did multiple-object classification in the Planet competition.
So we can combine those two ideas and do a multiple-column regression. In this case we have 4 numbers (top-left x,y; bottom-right x,y) - so we create a neural net with 4 activations, no softmax/sigmoid, and an MSE loss function.
Here is where you think in terms of differentiable programming. You're not thinking "how do I create a bounding box model"; instead it's: "I need 4 numbers, therefore I need a neural network with 4 activations. That's half of what I need to know. The other half is the loss function: what's a loss function that, when it is lower, means the 4 numbers are better? If I can do those 2 things, I'm done."
Well, if the x is close to the first activation, and the y to the second, and so forth... then I'm done! So that's it: I just need to create a model with 4 activations and an MSE loss function, and that should be it.
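In raw PyTorch terms the idea is nothing more than the sketch below (made-up feature size and dummy data; the notebook below swaps MSE for L1 loss, but the principle is the same):
import torch
import torch.nn as nn

n_features = 25088                  # e.g. a flattened 512x7x7 conv output
head = nn.Linear(n_features, 4)     # 4 activations, no softmax/sigmoid on top
crit = nn.MSELoss()                 # lower loss <=> the 4 numbers are closer to the true box

x = torch.randn(2, n_features)      # dummy minibatch of 2 feature vectors
y = torch.randn(2, 4)               # dummy target boxes
loss = crit(head(x), y)             # a single number we can backpropagate through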
Now we'll try to find the bounding box of the largest object. This is simply a regression with 4 outputs. So we can use a CSV with multiple 'labels'.
In [45]:
BB_CSV = PATH/'tmp/bb.csv'
In [46]:
bb = np.array([trn_lrg_anno[o][0] for o in trn_ids]) # largest item dictionary
bbs = [' '.join(str(p) for p in o) for o in bb] # bbxs separated by spaces via list comprehension
df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 'bbox': bbs}, columns=['fn','bbox'])
df.to_csv(BB_CSV, index=False) # turn dataframe to csv
From Part 1: to do multiple-label classification, the multiple labels have to be space-separated, and they're separated from the filename by a comma.
In [47]:
BB_CSV.open().readlines()[:5]
Out[47]:
In [50]:
f_model = resnet34
sz = 224
bs = 64
Set continuous=True
to tell fastai this is a regression problem, which means it won't one-hot encode the labels, and will use MSE as the default crit.
NOTE that we have to tell the transforms constructor that our labels are coordinates, so that it can handle the transforms correctly.
Also, we use CropType.NO
because we want to squish
the rectangular images into squares, rather than center cropping, so that we don't accidentally crop out some of the objects (This is less of an issue in something like ImageNet, where there's a single object to classify, and it's generally large and centrally located).
NOTE that when we're doing scaling and data augmentation - that has to be applied to the bounding boxes as well as the images $\longrightarrow$ tfm_y=TfmType.COORD
The transforms are defined inside the fastai transforms module as just a list. You can always create your own list of augmentations:
In [51]:
augs = [RandomFlip(),
        RandomRotate(30),
        RandomLighting(0.1, 0.1)]
In [52]:
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True)
Now we can grab a minibatch of data:
In [57]:
x,y = next(iter(md.val_dl))
In [58]:
ima = md.val_ds.denorm(to_np(x))[0] # denormalize
b = bb_hw(to_np(y[0])); b # cvt bb -> hw to display
Out[58]:
Let's go through and rerun the iterator a few times with the new model data object & augmentations:
In [59]:
idx = 3
fig,axes = plt.subplots(3, 3, figsize=(9,9))
for i,ax in enumerate(axes.flat):
    x,y = next(iter(md.aug_dl))
    ima = md.val_ds.denorm(to_np(x))[idx]
    b = bb_hw(to_np(y[idx]))
    print(b)
    show_img(ima, ax=ax)
    draw_rect(ax, b)
This is the problem with data augmentation when your dependent variable is pixel values or in some way connected to your independent variable: the two need to be augmented together.
Looking at the arrays above: the image is larger than 224 but we're asking for 224 without any scaling or cropping of the box values. Our dependent variable needs to go through all the same geometric transformations as our independent variable.
To do that, every transformation has an optional 'transform y' parameter. It takes a TfmType enum with a few options. The COORD option says the y values represent coordinates, so if you flip or rotate the image, you need to change those coordinates to match. So we just add TfmType.COORD to all our augmentations.
We also have to add the same thing to tfms_from_model, because it does the cropping, zooming, padding, and resizing -- and all of that needs to happen to the dependent variable as well.
In [53]:
augs = [RandomFlip(tfm_y=TfmType.COORD),
        RandomRotate(30, tfm_y=TfmType.COORD),
        RandomLighting(0.1, 0.1, tfm_y=TfmType.COORD)]
In [54]:
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=TfmType.COORD, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True, bs=4)
In [64]:
idx = 3
fig,axes = plt.subplots(3, 3, figsize=(9,9))
for i,ax in enumerate(axes.flat):
    x,y = next(iter(md.aug_dl))
    ima = md.val_ds.denorm(to_np(x))[idx]
    b = bb_hw(to_np(y[idx]))
    print(b)
    show_img(ima, ax=ax)
    draw_rect(ax, b)
Now you'll see the bounding box changes each time, and matches the transform of the picture.
You need to be careful not to do too much rotation with bounding boxes: because the box has to stay axis-aligned, the transformed box can only enclose the rotated object loosely, so the labels get less accurate. (Polygons or segmentation masks would survive rotation fine.)
We'll use a maximum of 3° rotation, and only half the time.
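To see why rotation loosens the labels, here's a small NumPy sketch (hypothetical box, not fastai code) that rotates a box's corners and takes the axis-aligned box that encloses them:
import numpy as np

def rotated_enclosing_box(x, y, w, h, deg):
    cx, cy = x + w/2, y + h/2                        # rotate about the box centre
    corners = np.array([[x,y], [x+w,y], [x,y+h], [x+w,y+h]]) - [cx, cy]
    t = np.deg2rad(deg)
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    rc = corners @ R.T + [cx, cy]
    return rc.min(0), rc.max(0)                      # the axis-aligned box must cover all rotated corners

print(rotated_enclosing_box(96, 155, 196, 174, 3))   # barely changes
print(rotated_enclosing_box(96, 155, 196, 174, 30))  # a noticeably looser box
At 30° the enclosing box grows well beyond the object itself, which is why we cap the rotation at 3° here.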
In [55]:
tfm_y = TfmType.COORD
augs = [RandomFlip(tfm_y=tfm_y),
        RandomRotate(3, p=0.5, tfm_y=tfm_y),
        RandomLighting(0.05, 0.05, tfm_y=tfm_y)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=tfm_y, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True)
fastai lets you use a custom_head to add your own module on top of a ConvNet, instead of the adaptive pooling and fully connected net which is added by default. In this case, we don't want to do any pooling, since we need to know the activations of each grid cell.
The final layer has 4 activations, one per bounding box coordinate. Our target is continuous, not categorical, so the MSE loss function used does not apply any sigmoid or softmax to the module outputs.
We want to create a ConvNet based on ResNet34, but we don't want to add the standard set of fully-connected layers that create a classifier; we'll add a single linear layer with 4 outputs. L1 loss vs MSE: instead of adding up the squared errors, it adds up the absolute errors.
In [68]:
512*7*7
Out[68]:
In [56]:
head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088, 4)) # flatten prev layer, add linear layer
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4) # add custom head to resnet34 model
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss() # L1 loss instead of MSE
.summary runs a small batch of data through the model and prints out how big it is at every layer.
We can see that at the end of the convolutional section, before we hit the Flatten, it's 512x7x7. A rank-3 tensor of size 512x7x7 flattened into a rank-1 tensor (a vector) is 25,088 long.
That's why we have the line nn.Linear(25088, 4) $\longrightarrow$ it takes the flattened tensor as input and outputs 4 numbers for our bounding box coordinates.
So now we just stick that on top of a pretrained ResNet.
In [70]:
learn.summary()
Out[70]:
In [71]:
learn.lr_find(1e-5, 100)
learn.sched.plot(5)
In [72]:
lr = 2e-3
In [73]:
learn.fit(lr, 2, cycle_len=1, cycle_mult=2)
Out[73]:
In [74]:
lrs = np.array([lr/100, lr/10, lr])
In [75]:
learn.freeze_to(-2)
In [76]:
lrf = learn.lr_find(lrs/1000)
learn.sched.plot(1)
In [77]:
learn.fit(lrs, 2, cycle_len=1, cycle_mult=2)
Out[77]:
In [78]:
learn.freeze_to(-3)
In [79]:
learn.fit(lrs, 1, cycle_len=2)
Out[79]:
In [80]:
learn.save('reg4')
In [81]:
learn.load('reg4')
In [82]:
x,y = next(iter(md.val_dl))
learn.model.eval()
preds = to_np(learn.model(VV(x)))
In [83]:
fig, axes = plt.subplots(3, 4, figsize=(12,8))
for i,ax in enumerate(axes.flat):
    ima = md.val_ds.denorm(to_np(x))[i]
    b = bb_hw(preds[i])
    ax = show_img(ima, ax=ax)
    draw_rect(ax, b)
plt.tight_layout()
Now let's put those pieces together and get something that classifies and does bounding boxes.
There are 3 things we need whenever we want to train a neural network:
Data
Architecture
Loss function: "anything that gives a lower number here is a better network, using this data and architecture."
In [57]:
f_model = resnet34
sz = 224
bs = 64
val_idxs = get_cv_idxs(len(trn_fns))
We need to create those 3 things for our classification + bounding box regression. We need a ModelData object whose independent variable is the images, and whose dependent variable is a tuple of the bounding box coordinates and the class.
There are many ways to do this. The way J.Howard chose is to simply create two ModelData objects representing the two dependent variables, using CSVs as before.
In [58]:
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=TfmType.COORD, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
continuous=True, val_idxs=val_idxs)
In [59]:
md2 = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms_from_model(f_model, sz))
And we'll just create a class to merge them together.
A dataset can be anything with __len__ and __getitem__. Here's a dataset that adds a 2nd label to an existing dataset:
In [60]:
class ConcatLblDataset(Dataset):
    def __init__(self, ds, y2): self.ds, self.y2 = ds, y2
    def __len__(self): return len(self.ds)
    def __getitem__(self, i): # indexer - lets you use []
        x,y = self.ds[i]
        return (x, (y, self.y2[i]))
We'll use it to add the classes to the bounding box labels.
In [61]:
trn_ds2 = ConcatLblDataset(md.trn_ds, md2.trn_y)
val_ds2 = ConcatLblDataset(md.val_ds, md2.val_y)
In [63]:
val_ds2[0][1]
Out[63]:
We can replace the dataloader's datasets with these new ones:
In [64]:
md.trn_dl.dataset = trn_ds2
md.val_dl.dataset = val_ds2
We can test it by grabbing a minibatch of data.
We have to denormalize the images from the dataloader before they can be plotted.
In [66]:
x,y = next(iter(md.val_dl))
idx = 3
ima = md.val_ds.ds.denorm(to_np(x))[idx]
b = bb_hw(to_np(y[0][idx])); b
Out[66]:
In [67]:
ax = show_img(ima)
draw_rect(ax, b)
draw_text(ax, b[:2], md2.classes[y[1][idx]])
That's one way to customize the dataset.
So we have our data. Now we need the architecture. The architectures will be the same as the ones we used for the classifier and the bounding box regression -- we're just going to combine them.
If there are C classes, then the number of activations we need in the final layer is 4 + C: one output activation for each class (for its probability) plus one for each bounding box coordinate. We'll use an extra linear layer this time, plus some dropout, to help us train a more flexible model. That's why there's a Flatten layer at the start of this head (to feed the new linear layer). There's no batchnorm at the start because the ResNet backbone already has batchnorm in its final layer.
In [68]:
head_reg4 = nn.Sequential(
    Flatten(),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(25088, 256),
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.5),
    nn.Linear(256, 4+len(cats)), # final layer: 4+C activations
)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam
We've got data and architecture; now we need a loss function.
The loss function needs to look at those 4+C activations and decide how good they are. For the first 4 we use L1 loss, just as in the bounding box regression before (L1 loss is like MSE, but instead of summing squares it sums absolute values). For the remaining activations we can use cross entropy loss.
Placing BN after ReLU means the block cannot output negative numbers; putting ReLU before BN can, and works a bit better. The bb_i = F.sigmoid(bb_i)*224 forces our outputs into the right range (the image is 224 pixels across) -- this helps the network train.
A great thing about dropout is that it has a parameter -- parameters are great, especially for regularization, because they let you build a great big overparameterized model and then decide how much to regularize it.
Finally, with detn_loss (detection loss): now that we have our inputs and targets, we can just calculate the L1 loss and add the cross entropy to it:
F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t)*20
And that's our loss function. The cross entropy and the L1 loss may be of (wildly) different scales, and the larger one will dominate the total. So J.Howard ran them in the debugger, found how big each was, and found that multiplying the cross entropy by 20 made them both about the same scale.
As you're training, it's nice to print out information as you go. So J.Howard pulled the L1 part of the loss out into the function detn_l1, and also created a function for accuracy, so both could be used as metrics to print out.
In [70]:
def detn_loss(input, target): # input: activations; target: ground truth
    bb_t,c_t = target # destructuring assignment to grab bbs & classes
    bb_i,c_i = input[:, :4], input[:, 4:] # batch dim; first 4 (bbox); 4 onwards (classes)
    bb_i = F.sigmoid(bb_i)*224 # we know bboxes are between 0 and 224 (img size)
    # these quantities were looked at separately first, then a
    # multiplier was chosen to make them approximately equal
    return F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t)*20

def detn_l1(input, target):
    bb_t,_ = target
    bb_i = input[:, :4]
    bb_i = F.sigmoid(bb_i)*224
    return F.l1_loss(V(bb_i),V(bb_t)).data

def detn_acc(input, target):
    _,c_t = target
    c_i = input[:, 4:]
    return accuracy(c_i, c_t)

learn.crit = detn_loss
learn.metrics = [detn_acc, detn_l1]
In [71]:
learn.lr_find()
learn.sched.plot()
In [88]:
lr = 1e-2
Now we have something that's printing out our Object Detection Loss, Accuracy, and Detection L1:
In [73]:
learn.fit(lr, 1, cycle_len=3, use_clr=(32,5))
Out[73]:
In [74]:
learn.save('reg1_0')
In [75]:
learn.freeze_to(-2)
In [76]:
lrs = np.array([lr/100, lr/10, lr])
In [77]:
learn.lr_find(lrs/1000)
learn.sched.plot(0)
In [78]:
learn.fit(lrs/5, 1, cycle_len=5, use_clr=(32,10))
Out[78]:
In [79]:
learn.save('reg1_1')
In [80]:
learn.load('reg1_1')
In [81]:
learn.unfreeze()
In [82]:
learn.fit(lrs/10, 1, cycle_len=10, use_clr=(32,10))
Out[82]:
In [83]:
learn.save('reg1')
In [84]:
learn.load('reg1')
In [85]:
y = learn.predict()
x,_ = next(iter(md.val_dl))
Detection accuracy is still in the low 80s, which isn't surprising: ResNet was designed for classification, so we're unlikely to improve it with something this simple.
ResNet wasn't designed to do bounding box regression -- it was explicitly designed not to care about geometry: it takes that last 7x7 grid of activations and averages them all together, throwing away a lot of spatial information.
When we only train the head, the detection L1 is very bad (24.564463 in the first set of epochs), and it improves a lot as we train more of the network - even though the classification accuracy doesn't improve much.
Interestingly, the detection L1 when we do accuracy and bounding box at the same time (18.090298) seems a lot better than when we just do bounding box regression (19.628677 in section 3).
Figuring out what the main object of an image is is the 'hard' part, and where its bounding box is is the 'easy' part. A single network that has to say both what and where an object is shares all the computation involved in finding the object -- and all that shared computation is very efficient.
So when we backprop the errors in the class and the place, all of that computation helps in finding the main object.
>> any time you have multiple tasks that share some concept of what those tasks need to do to complete their work, it's very likely they should share some layers of the network.
In [86]:
from scipy.special import expit
In [87]:
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for i,ax in enumerate(axes.flat):
    ima = md.val_ds.ds.denorm(to_np(x))[i]
    bb = expit(y[i][:4])*224
    b = bb_hw(bb)
    c = np.argmax(y[i][4:])
    ax = show_img(ima, ax=ax)
    draw_rect(ax, b)
    draw_text(ax, b[:2], md2.classes[c])
plt.tight_layout()
In [ ]: