Practical Deep Learning for Coders, v3

Lesson3_planet

Multi-label prediction with the Planet Amazon dataset


In [ ]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [ ]:
from fastai.vision import *

Getting the data

The planet dataset isn't available on the fastai dataset page due to copyright restrictions.

You can download it from Kaggle, however. Let's see how to do this with the Kaggle API, as it's going to be pretty useful if you want to join a competition or use other Kaggle datasets later on.

First, install the Kaggle API by uncommenting the following line and executing it, or by running it in your terminal. Depending on your platform you may need to modify this slightly, either by running source activate fastai (or similar) first or by prefixing pip with a path. Have a look at how conda install is called for your platform in the appropriate Returning to work section of https://course.fast.ai/. Depending on your environment, you may also need to append "--user" to the command.


In [ ]:
# ! {sys.executable} -m pip install kaggle --upgrade

Then you need to upload your Kaggle credentials to your instance. Log in to Kaggle and click on your profile picture in the top right corner, then 'My account'. Scroll down until you find a button named 'Create New API Token' and click on it. This will trigger the download of a file named 'kaggle.json'.

Upload this file to the directory this notebook is running in by clicking "Upload" on your main Jupyter page, then uncomment and execute the next two commands (or run them in a terminal). On Windows, uncomment the last two commands instead.


In [ ]:
# ! mkdir -p ~/.kaggle/
# ! mv kaggle.json ~/.kaggle/

# For Windows, uncomment these two commands
# ! mkdir %userprofile%\.kaggle
# ! move kaggle.json %userprofile%\.kaggle
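The same step can be sketched in plain Python, which behaves the same way on Linux, macOS and Windows. This is only an illustration: install_kaggle_token is a hypothetical helper (not part of the Kaggle API or fastai), and the demo uses a fake token in a temporary directory rather than your real home directory. It also restricts the file to owner read/write, which is what the chmod 600 warning printed by the Kaggle CLI later in this notebook asks for.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_file, home=None):
    """Move kaggle.json into <home>/.kaggle and restrict its permissions.

    Hypothetical helper for illustration; `home` defaults to the current
    user's home directory.
    """
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    dest = kaggle_dir / "kaggle.json"
    shutil.move(str(token_file), str(dest))
    # Owner read/write only, like `chmod 600` (has limited effect on Windows).
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return dest


# Demo in a throwaway directory so no real credentials are touched.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "kaggle.json"
    fake_token.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(fake_token, home=Path(tmp) / "home")
    print(installed)
```

To use it for real, you would call install_kaggle_token('kaggle.json') from the directory you uploaded the file to.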

You're all set to download the data from the planet competition. You first need to go to its main page and accept its rules, then run the two cells below (uncomment the shell commands to download and unzip the data). If you get a 403 forbidden error, it means you haven't accepted the competition rules yet: go to the competition page, click on the Rules tab, and scroll to the bottom to find the accept button.


In [ ]:
path = Config.data_path()/'planet'
path.mkdir(parents=True, exist_ok=True)
path


Out[ ]:
PosixPath('/home/ubuntu/.fastai/data/planet')

In [ ]:
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train-jpg.tar.7z -p {path}  
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train_v2.csv -p {path}  
# ! unzip -q -n {path}/train_v2.csv.zip -d {path}

To extract the contents of train-jpg.tar.7z, we'll need 7zip, so uncomment the following line if you need to install it (or run sudo apt install p7zip-full in your terminal).


In [ ]:
# ! conda install --yes --prefix {sys.prefix} -c haasad eidl7zip

And now we can unpack the data (uncomment to run; this might take a few minutes to complete).


In [ ]:
# ! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path.as_posix()}

Multi-label classification

Unlike the pets dataset studied in the last lesson, here each picture can have multiple labels. If we take a look at the csv file containing the labels ('train_v2.csv' here), we see that each 'image_name' is associated with several tags separated by spaces.


In [ ]:
df = pd.read_csv(path/'train_v2.csv')
df.head()


Out[ ]:
image_name tags
0 train_0 haze primary
1 train_1 agriculture clear primary water
2 train_2 clear primary
3 train_3 clear primary
4 train_4 agriculture clear habitation primary road
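To make the label format concrete, here is a minimal sketch of how space-separated tags can be turned into one multi-hot target vector per image. It uses plain Python on three of the rows shown above; build_vocab and multi_hot are hypothetical helper names for illustration, not fastai functions, and fastai's own label processing may differ in detail.

```python
# Three rows copied from the df.head() output above.
rows = [
    ("train_0", "haze primary"),
    ("train_1", "agriculture clear primary water"),
    ("train_2", "clear primary"),
]


def build_vocab(tag_strings, delim=" "):
    """Collect every distinct tag and sort them to fix a column order."""
    return sorted({tag for s in tag_strings for tag in s.split(delim)})


def multi_hot(tag_string, classes, delim=" "):
    """Return a 0/1 list with one entry per class."""
    present = set(tag_string.split(delim))
    return [1 if c in present else 0 for c in classes]


classes = build_vocab(tags for _, tags in rows)
encoded = {name: multi_hot(tags, classes) for name, tags in rows}

print(classes)
for name, vec in encoded.items():
    print(name, vec)
```

Each image gets a vector with as many entries as there are classes, and several entries can be 1 at once, which is what distinguishes this from the single-label setup of lesson 1.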

To put this in a DataBunch using the data block API, we need to use ImageList (and not ImageDataBunch). This makes sure the model created has the proper loss function to deal with multiple labels per image.


In [ ]:
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)

We use parentheses around the data block pipeline below, so that we can write a multiline statement without needing to add '\' at the end of each line.


In [ ]:
np.random.seed(42)
src = (ImageList.from_csv(path, 'train_v2.csv', folder='train-jpg', suffix='.jpg')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim=' '))

In [ ]:
data = (src.transform(tfms, size=128)
        .databunch().normalize(imagenet_stats))

show_batch still works, and shows us the different labels separated by ;.


In [ ]:
data.show_batch(rows=3, figsize=(12,9))


To create a Learner we use the same function as in lesson 1. Our base architecture is resnet50 again, but the metrics are a little bit different: we use accuracy_thresh instead of accuracy. In lesson 1, we determined the prediction for a given image by picking the class whose final activation was the biggest, but here each label can independently be present (1) or absent (0). accuracy_thresh selects the activations that are above a certain threshold (0.5 by default) and compares the result to the ground truth.
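As a rough illustration of the idea, thresholded accuracy can be sketched in plain Python as below. This is not fastai's implementation: accuracy_thresh_sketch is a hypothetical name, the numbers are made up, and the inputs are assumed to already be probabilities between 0 and 1 (fastai's accuracy_thresh can apply a sigmoid to raw activations first).

```python
def accuracy_thresh_sketch(probs, targets, thresh=0.5):
    """Fraction of (image, label) pairs where `prob > thresh` matches the target."""
    correct = 0
    total = 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            correct += int((p > thresh) == bool(t))
            total += 1
    return correct / total


# Two images, three labels each (made-up values).
probs = [[0.9, 0.3, 0.1],
         [0.6, 0.1, 0.8]]
targets = [[1, 1, 0],
           [1, 0, 1]]

print(accuracy_thresh_sketch(probs, targets, thresh=0.5))
print(accuracy_thresh_sketch(probs, targets, thresh=0.2))
```

With the default threshold the label predicted at 0.3 is missed, while a threshold of 0.2 recovers it. That is the motivation for trying a lower threshold such as the 0.2 used in this notebook.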

As for Fbeta, it's the metric that Kaggle used for this competition. See here for more details.
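For intuition, here is a minimal sketch of a per-image F-beta score averaged over images. It assumes beta=2 (which weights recall more heavily than precision) and inputs that are already probabilities. fbeta_sketch is a hypothetical name, and fastai's fbeta may differ in details such as the sigmoid step and numerical handling.

```python
def fbeta_sketch(probs, targets, thresh=0.2, beta=2.0, eps=1e-9):
    """Average over images of F-beta computed from thresholded predictions."""
    b2 = beta ** 2
    scores = []
    for p_row, t_row in zip(probs, targets):
        pred = [p > thresh for p in p_row]
        # True positives: labels predicted present that really are present.
        tp = sum(1 for pr, t in zip(pred, t_row) if pr and t)
        precision = tp / (sum(pred) + eps)
        recall = tp / (sum(t_row) + eps)
        scores.append((1 + b2) * precision * recall / (b2 * precision + recall + eps))
    return sum(scores) / len(scores)


# One image, four labels (made-up values): three labels are predicted,
# two of them correctly, and no true label is missed.
probs = [[0.9, 0.3, 0.6, 0.05]]
targets = [[1, 0, 1, 0]]

f2 = fbeta_sketch(probs, targets, thresh=0.2, beta=2.0)
f1 = fbeta_sketch(probs, targets, thresh=0.2, beta=1.0)
print(round(f2, 3), round(f1, 3))
```

Here precision is 2/3 and recall is 1, so the F2 score comes out higher than F1 would: extra false positives cost less than missed labels, which is one reason a fairly low threshold can work well with this metric.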


In [ ]:
arch = models.resnet50

In [ ]:
acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)
learn = cnn_learner(data, arch, metrics=[acc_02, f_score])


Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/ubuntu/.torch/models/resnet50-19c8e357.pth
100%|██████████| 102502400/102502400 [00:01<00:00, 100859665.66it/s]

We use the LR Finder to pick a good learning rate.


In [ ]:
learn.lr_find()


LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.

In [ ]:
learn.recorder.plot()


Then we can fit the head of our network.


In [ ]:
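# lr is not defined earlier in this notebook; 0.01 is an assumed value
# (pick yours from the LR Finder plot above).
lr = 0.01
learn.fit_one_cycle(5, slice(lr))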


Total time: 03:30

epoch train_loss valid_loss accuracy_thresh fbeta
1 0.125128 0.110038 0.944010 0.904893
2 0.112533 0.101303 0.955964 0.910774
3 0.100574 0.093255 0.955187 0.918653
4 0.096056 0.087997 0.954983 0.924016
5 0.092320 0.086761 0.956400 0.925110


In [ ]:
learn.save('stage-1-rn50')

...And fine-tune the whole model:


In [ ]:
learn.unfreeze()

In [ ]:
learn.lr_find()
learn.recorder.plot()


LR Finder complete, type {learner_name}.recorder.plot() to see the graph.

In [ ]:
learn.fit_one_cycle(5, slice(1e-5, lr/5))


Total time: 04:00

epoch train_loss valid_loss accuracy_thresh fbeta
1 0.097016 0.094868 0.952004 0.916215
2 0.095774 0.088899 0.954540 0.922340
3 0.090646 0.085958 0.959249 0.924921
4 0.085097 0.083291 0.958849 0.928195
5 0.079197 0.082855 0.958602 0.928259


In [ ]:
learn.save('stage-2-rn50')

In [ ]:
data = (src.transform(tfms, size=256)
        .databunch().normalize(imagenet_stats))

learn.data = data
data.train_ds[0][0].shape


Out[ ]:
torch.Size([3, 256, 256])

In [ ]:
learn.freeze()

In [ ]:
learn.lr_find()
learn.recorder.plot()


LR Finder complete, type {learner_name}.recorder.plot() to see the graph.

In [ ]:
lr=1e-2/2

In [ ]:
learn.fit_one_cycle(5, slice(lr))


Total time: 09:01

epoch train_loss valid_loss accuracy_thresh fbeta
1 0.087761 0.085013 0.958006 0.926066
2 0.087641 0.083732 0.958260 0.927459
3 0.084250 0.082856 0.958485 0.928200
4 0.082347 0.081470 0.960091 0.929166
5 0.078463 0.080984 0.959249 0.930089


In [ ]:
learn.save('stage-1-256-rn50')

In [ ]:
learn.unfreeze()

In [ ]:
learn.fit_one_cycle(5, slice(1e-5, lr/5))


Total time: 11:25

epoch train_loss valid_loss accuracy_thresh fbeta
1 0.082938 0.083548 0.957846 0.927756
2 0.086312 0.084802 0.958718 0.925416
3 0.084824 0.082339 0.959975 0.930054
4 0.078784 0.081425 0.959983 0.929634
5 0.074530 0.080791 0.960426 0.931257


In [ ]:
learn.recorder.plot_losses()



In [ ]:
learn.save('stage-2-256-rn50')

You won't really know how you're doing until you submit to Kaggle, since the leaderboard isn't scored on the same subset as we have for training. But as a guide, 50th place (out of 938 teams) on the private leaderboard was a score of 0.930.


In [ ]:
learn.export()

fin

(This section will be covered in part 2 - please don't ask about it just yet! :) )


In [ ]:
#! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg.tar.7z -p {path}  
#! 7za -bd -y -so x {path}/test-jpg.tar.7z | tar xf - -C {path}
#! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg-additional.tar.7z -p {path}  
#! 7za -bd -y -so x {path}/test-jpg-additional.tar.7z | tar xf - -C {path}

In [ ]:
test = ImageList.from_folder(path/'test-jpg').add(ImageList.from_folder(path/'test-jpg-additional'))
len(test)


Out[ ]:
61191

In [ ]:
learn = load_learner(path, test=test)
preds, _ = learn.get_preds(ds_type=DatasetType.Test)

In [ ]:
thresh = 0.2
labelled_preds = [' '.join([learn.data.classes[i] for i,p in enumerate(pred) if p > thresh]) for pred in preds]

In [ ]:
labelled_preds[:5]


Out[ ]:
['agriculture cultivation partly_cloudy primary road',
 'clear haze primary water',
 'agriculture clear cultivation primary',
 'clear primary',
 'partly_cloudy primary']

In [ ]:
fnames = [f.name[:-4] for f in learn.data.test_ds.items]

In [ ]:
df = pd.DataFrame({'image_name':fnames, 'tags':labelled_preds}, columns=['image_name', 'tags'])

In [ ]:
df.to_csv(path/'submission.csv', index=False)

In [ ]:
! kaggle competitions submit planet-understanding-the-amazon-from-space -f {path/'submission.csv'} -m "My submission"


Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/ubuntu/.kaggle/kaggle.json'
100%|██████████████████████████████████████| 2.18M/2.18M [00:02<00:00, 1.05MB/s]
Successfully submitted to Planet: Understanding the Amazon from Space

Private Leaderboard score: 0.9296 (around 80th)