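Chapter 4 [Practice] Deep Learning

This notebook reproduces the generic object recognition with GoogLeNet described in Chapter 4.

Environment setup

To reproduce this notebook on your own computer, you need a working Caffe environment and an IPython installation. IPython is installed as part of the environment setup for Chapter 2, so please refer to the Chapter 2 notebook.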

Installing Caffe

First, clone Caffe from GitHub. The Caffe repository BVLC/caffe has been added to this repository as a submodule, so run the following commands to fetch it and enter the caffe directory.

$ git submodule update --init caffe
$ cd caffe

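Next, install Caffe. The procedure is described on the "Installation" page of the Caffe documentation; follow it to build your environment.

This notebook uses pycaffe. The Caffe Installation page treats pycaffe as optional, but be sure to install it as well. The pycaffe installation steps are summarized below.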

$ for req in $(cat python/requirements.txt); do pip install $req; done
$ make pycaffe

Preparation

Prepare the notebook environment.

matplotlib settings

Start with the matplotlib settings.


In [1]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

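The following settings enlarge the images displayed by pyplot.imshow and select nearest-neighbor interpolation and a grayscale colormap by default.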


In [2]:
plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

Loading pycaffe

Next, load pycaffe.


In [3]:
import os
import sys

caffe_root = os.path.expanduser("caffe")
sys.path.insert(0, caffe_root + '/python')
import caffe

Caffe settings

Configure Caffe to use GPU 0. If you built Caffe in CPU-only mode, do not run the following code.


In [4]:
caffe.set_device(0)
caffe.set_mode_gpu()

Downloading and loading the required files

Download the model file if it does not exist.


In [5]:
googlenet_dir = os.path.expanduser(caffe_root + '/models/bvlc_googlenet/')
if not os.path.isfile(googlenet_dir + 'bvlc_googlenet.caffemodel'):
    print("Downloading pre-trained GoogLeNet model...")
    !caffe/scripts/download_model_binary.py caffe/models/bvlc_googlenet


Downloading pre-trained GoogLeNet model...
...100%, 51 MB, 9564 KB/s, 5 seconds passed

Likewise, download the label file if it does not exist.


In [6]:
# Download the ImageNet labels if they are missing
imagenet_labels_filename = caffe_root + '/data/ilsvrc12/synset_words.txt'
if not os.path.isfile(imagenet_labels_filename):
    print("Downloading ImageNet labels...")
    !caffe/data/ilsvrc12/get_ilsvrc_aux.sh


Downloading ImageNet labels...
Downloading...
--2015-10-14 16:44:53--  http://dl.caffe.berkeleyvision.org/caffe_ilsvrc12.tar.gz
Resolving dl.caffe.berkeleyvision.org (dl.caffe.berkeleyvision.org)... 169.229.222.251
Connecting to dl.caffe.berkeleyvision.org (dl.caffe.berkeleyvision.org)|169.229.222.251|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17858008 (17M) [application/octet-stream]
Saving to: ‘caffe_ilsvrc12.tar.gz’

100%[======================================>] 17,858,008  5.76MB/s   in 3.0s   

2015-10-14 16:44:56 (5.76 MB/s) - ‘caffe_ilsvrc12.tar.gz’ saved [17858008/17858008]

Unzipping...
Done.
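
The two download cells above follow the same pattern: check whether a file exists and fetch it only when it is missing. A minimal sketch of that pattern as a reusable helper might look like the following. The names `ensure_file` and `fake_fetch` are hypothetical and not part of Caffe, and the example writes to a temporary directory rather than to the real model paths.

```python
import os
import tempfile


def ensure_file(path, fetch):
    """Call fetch(path) only if path does not exist yet.

    Returns True when a fetch was triggered, False otherwise.
    Both this helper and the fetch callback are hypothetical.
    """
    if os.path.isfile(path):
        return False
    fetch(path)
    return True


def fake_fetch(path):
    # Stand-in for a real download: just writes a small placeholder file.
    with open(path, "w") as f:
        f.write("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "model.bin")
    first = ensure_file(target, fake_fetch)   # file missing, so fetch runs
    second = ensure_file(target, fake_fetch)  # file present, so fetch is skipped

print(first, second)
```

In this notebook, the fetch step corresponds to the download scripts invoked with the `!` shell escape.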

Load the downloaded model and labels.


In [7]:
# Load the model file
net = caffe.Net(googlenet_dir + 'deploy.prototxt',
                googlenet_dir + 'bvlc_googlenet.caffemodel',
                caffe.TEST)
# Load the label file
imagenet_labels = np.loadtxt(imagenet_labels_filename, str, delimiter='\t')
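
Each entry of the label file is a single string made of a WordNet ID followed by one or more comma-separated names, as the prediction output in this notebook shows. If the ID and the names are needed separately, a small parser along these lines could be used. The function name `parse_label` is hypothetical, and the two sample lines are copied from that output.

```python
# Two sample lines in the format used by synset_words.txt.
sample_lines = [
    "n02105162 malinois",
    "n01877812 wallaby, brush kangaroo",
]


def parse_label(line):
    """Split 'wnid name1, name2' into (wnid, [name1, name2])."""
    wnid, names = line.split(" ", 1)
    return wnid, [name.strip() for name in names.split(",")]


parsed = [parse_label(line) for line in sample_lines]
for wnid, names in parsed:
    print(wnid, names)
```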

Creating the input image transformer


In [8]:
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1))
transformer.set_mean('data', np.load(caffe_root + '/python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2,1,0))
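
The four settings above describe how an image loaded by caffe.io.load_image (height x width x channel, RGB, values in [0, 1]) is converted into the layout the network expects. The NumPy sketch below mimics those steps on a tiny synthetic image. It is an approximation for illustration, not Caffe's implementation: it omits resizing, and the mean values are placeholders assumed to be in BGR order, not the values stored in ilsvrc_2012_mean.npy.

```python
import numpy as np

# Tiny synthetic image: H x W x C, RGB, values in [0, 1].
image = np.linspace(0.0, 1.0, 4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)

# Placeholder per-channel mean in BGR order (illustrative values only).
mean_bgr = np.array([104.0, 117.0, 123.0], dtype=np.float32)

x = image.transpose(2, 0, 1)       # set_transpose: H x W x C -> C x H x W
x = x[[2, 1, 0], :, :]             # set_channel_swap: RGB -> BGR
x = x * 255.0                      # set_raw_scale: [0, 1] -> [0, 255]
x = x - mean_bgr[:, None, None]    # set_mean: subtract the per-channel mean

print(x.shape)
```

Note that the order in which the settings are applied does not follow the order in which they are declared: in this sketch the mean is subtracted after the channel swap and the rescaling.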

Loading the input image and visualizing the transformed result


In [9]:
image_file = 'momosan.jpg'
image = caffe.io.load_image(image_file)
net.blobs['data'].data[...] = transformer.preprocess('data', image)

In [10]:
plt.imshow(transformer.deprocess('data', net.blobs['data'].data[0]))
plt.axis('off')


Out[10]:
(-0.5, 223.5, 223.5, -0.5)

Computing the classification and retrieving the labels with the highest predicted probability


In [13]:
# Extract the top predictions from the softmax output
out = net.forward()
# Every batch slot holds the same image, so inspect the first one
print("Predicted class is #{}.".format(out['prob'][0].argmax()))
prob = net.blobs['prob'].data[0].flatten()
top_k = prob.argsort()[-1:-10:-1]
for i, label in enumerate(imagenet_labels[top_k], start=1):
    print("{}: {}".format(i, label))


Predicted class is #225.
1: n02105162 malinois
2: n01877812 wallaby, brush kangaroo
3: n02106662 German shepherd, German shepherd dog, German police dog, alsatian
4: n02105412 kelpie
5: n02115641 dingo, warrigal, warragal, Canis dingo
6: n02091467 Norwegian elkhound, elkhound
7: n02114712 red wolf, maned wolf, Canis rufus, Canis niger
8: n02113186 Cardigan, Cardigan Welsh corgi
9: n02115913 dhole, Cuon alpinus
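
The slice `[-1:-10:-1]` used above walks backwards from the last element and stops before index -10, so it selects nine entries, which matches the nine labels printed. The small example below, with made-up probabilities, compares that slice style with a form that returns exactly k entries.

```python
import numpy as np

prob = np.array([0.05, 0.40, 0.10, 0.30, 0.15])  # made-up probabilities

k = 3
top_k = prob.argsort()[::-1][:k]       # indices of the k largest values, best first
one_short = prob.argsort()[-1:-k:-1]   # same slice style as above: k - 1 entries

print(top_k)
print(one_short)
```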

Visualizing the hidden layers of GoogLeNet

Output size of each layer


In [14]:
# Shape of each layer's output blob
[(k, v.data.shape) for k, v in net.blobs.items()]


Out[14]:
[('data', (10, 3, 224, 224)),
 ('conv1/7x7_s2', (10, 64, 112, 112)),
 ('pool1/3x3_s2', (10, 64, 56, 56)),
 ('pool1/norm1', (10, 64, 56, 56)),
 ('conv2/3x3_reduce', (10, 64, 56, 56)),
 ('conv2/3x3', (10, 192, 56, 56)),
 ('conv2/norm2', (10, 192, 56, 56)),
 ('pool2/3x3_s2', (10, 192, 28, 28)),
 ('pool2/3x3_s2_pool2/3x3_s2_0_split_0', (10, 192, 28, 28)),
 ('pool2/3x3_s2_pool2/3x3_s2_0_split_1', (10, 192, 28, 28)),
 ('pool2/3x3_s2_pool2/3x3_s2_0_split_2', (10, 192, 28, 28)),
 ('pool2/3x3_s2_pool2/3x3_s2_0_split_3', (10, 192, 28, 28)),
 ('inception_3a/1x1', (10, 64, 28, 28)),
 ('inception_3a/3x3_reduce', (10, 96, 28, 28)),
 ('inception_3a/3x3', (10, 128, 28, 28)),
 ('inception_3a/5x5_reduce', (10, 16, 28, 28)),
 ('inception_3a/5x5', (10, 32, 28, 28)),
 ('inception_3a/pool', (10, 192, 28, 28)),
 ('inception_3a/pool_proj', (10, 32, 28, 28)),
 ('inception_3a/output', (10, 256, 28, 28)),
 ('inception_3a/output_inception_3a/output_0_split_0', (10, 256, 28, 28)),
 ('inception_3a/output_inception_3a/output_0_split_1', (10, 256, 28, 28)),
 ('inception_3a/output_inception_3a/output_0_split_2', (10, 256, 28, 28)),
 ('inception_3a/output_inception_3a/output_0_split_3', (10, 256, 28, 28)),
 ('inception_3b/1x1', (10, 128, 28, 28)),
 ('inception_3b/3x3_reduce', (10, 128, 28, 28)),
 ('inception_3b/3x3', (10, 192, 28, 28)),
 ('inception_3b/5x5_reduce', (10, 32, 28, 28)),
 ('inception_3b/5x5', (10, 96, 28, 28)),
 ('inception_3b/pool', (10, 256, 28, 28)),
 ('inception_3b/pool_proj', (10, 64, 28, 28)),
 ('inception_3b/output', (10, 480, 28, 28)),
 ('pool3/3x3_s2', (10, 480, 14, 14)),
 ('pool3/3x3_s2_pool3/3x3_s2_0_split_0', (10, 480, 14, 14)),
 ('pool3/3x3_s2_pool3/3x3_s2_0_split_1', (10, 480, 14, 14)),
 ('pool3/3x3_s2_pool3/3x3_s2_0_split_2', (10, 480, 14, 14)),
 ('pool3/3x3_s2_pool3/3x3_s2_0_split_3', (10, 480, 14, 14)),
 ('inception_4a/1x1', (10, 192, 14, 14)),
 ('inception_4a/3x3_reduce', (10, 96, 14, 14)),
 ('inception_4a/3x3', (10, 208, 14, 14)),
 ('inception_4a/5x5_reduce', (10, 16, 14, 14)),
 ('inception_4a/5x5', (10, 48, 14, 14)),
 ('inception_4a/pool', (10, 480, 14, 14)),
 ('inception_4a/pool_proj', (10, 64, 14, 14)),
 ('inception_4a/output', (10, 512, 14, 14)),
 ('inception_4a/output_inception_4a/output_0_split_0', (10, 512, 14, 14)),
 ('inception_4a/output_inception_4a/output_0_split_1', (10, 512, 14, 14)),
 ('inception_4a/output_inception_4a/output_0_split_2', (10, 512, 14, 14)),
 ('inception_4a/output_inception_4a/output_0_split_3', (10, 512, 14, 14)),
 ('inception_4b/1x1', (10, 160, 14, 14)),
 ('inception_4b/3x3_reduce', (10, 112, 14, 14)),
 ('inception_4b/3x3', (10, 224, 14, 14)),
 ('inception_4b/5x5_reduce', (10, 24, 14, 14)),
 ('inception_4b/5x5', (10, 64, 14, 14)),
 ('inception_4b/pool', (10, 512, 14, 14)),
 ('inception_4b/pool_proj', (10, 64, 14, 14)),
 ('inception_4b/output', (10, 512, 14, 14)),
 ('inception_4b/output_inception_4b/output_0_split_0', (10, 512, 14, 14)),
 ('inception_4b/output_inception_4b/output_0_split_1', (10, 512, 14, 14)),
 ('inception_4b/output_inception_4b/output_0_split_2', (10, 512, 14, 14)),
 ('inception_4b/output_inception_4b/output_0_split_3', (10, 512, 14, 14)),
 ('inception_4c/1x1', (10, 128, 14, 14)),
 ('inception_4c/3x3_reduce', (10, 128, 14, 14)),
 ('inception_4c/3x3', (10, 256, 14, 14)),
 ('inception_4c/5x5_reduce', (10, 24, 14, 14)),
 ('inception_4c/5x5', (10, 64, 14, 14)),
 ('inception_4c/pool', (10, 512, 14, 14)),
 ('inception_4c/pool_proj', (10, 64, 14, 14)),
 ('inception_4c/output', (10, 512, 14, 14)),
 ('inception_4c/output_inception_4c/output_0_split_0', (10, 512, 14, 14)),
 ('inception_4c/output_inception_4c/output_0_split_1', (10, 512, 14, 14)),
 ('inception_4c/output_inception_4c/output_0_split_2', (10, 512, 14, 14)),
 ('inception_4c/output_inception_4c/output_0_split_3', (10, 512, 14, 14)),
 ('inception_4d/1x1', (10, 112, 14, 14)),
 ('inception_4d/3x3_reduce', (10, 144, 14, 14)),
 ('inception_4d/3x3', (10, 288, 14, 14)),
 ('inception_4d/5x5_reduce', (10, 32, 14, 14)),
 ('inception_4d/5x5', (10, 64, 14, 14)),
 ('inception_4d/pool', (10, 512, 14, 14)),
 ('inception_4d/pool_proj', (10, 64, 14, 14)),
 ('inception_4d/output', (10, 528, 14, 14)),
 ('inception_4d/output_inception_4d/output_0_split_0', (10, 528, 14, 14)),
 ('inception_4d/output_inception_4d/output_0_split_1', (10, 528, 14, 14)),
 ('inception_4d/output_inception_4d/output_0_split_2', (10, 528, 14, 14)),
 ('inception_4d/output_inception_4d/output_0_split_3', (10, 528, 14, 14)),
 ('inception_4e/1x1', (10, 256, 14, 14)),
 ('inception_4e/3x3_reduce', (10, 160, 14, 14)),
 ('inception_4e/3x3', (10, 320, 14, 14)),
 ('inception_4e/5x5_reduce', (10, 32, 14, 14)),
 ('inception_4e/5x5', (10, 128, 14, 14)),
 ('inception_4e/pool', (10, 528, 14, 14)),
 ('inception_4e/pool_proj', (10, 128, 14, 14)),
 ('inception_4e/output', (10, 832, 14, 14)),
 ('pool4/3x3_s2', (10, 832, 7, 7)),
 ('pool4/3x3_s2_pool4/3x3_s2_0_split_0', (10, 832, 7, 7)),
 ('pool4/3x3_s2_pool4/3x3_s2_0_split_1', (10, 832, 7, 7)),
 ('pool4/3x3_s2_pool4/3x3_s2_0_split_2', (10, 832, 7, 7)),
 ('pool4/3x3_s2_pool4/3x3_s2_0_split_3', (10, 832, 7, 7)),
 ('inception_5a/1x1', (10, 256, 7, 7)),
 ('inception_5a/3x3_reduce', (10, 160, 7, 7)),
 ('inception_5a/3x3', (10, 320, 7, 7)),
 ('inception_5a/5x5_reduce', (10, 32, 7, 7)),
 ('inception_5a/5x5', (10, 128, 7, 7)),
 ('inception_5a/pool', (10, 832, 7, 7)),
 ('inception_5a/pool_proj', (10, 128, 7, 7)),
 ('inception_5a/output', (10, 832, 7, 7)),
 ('inception_5a/output_inception_5a/output_0_split_0', (10, 832, 7, 7)),
 ('inception_5a/output_inception_5a/output_0_split_1', (10, 832, 7, 7)),
 ('inception_5a/output_inception_5a/output_0_split_2', (10, 832, 7, 7)),
 ('inception_5a/output_inception_5a/output_0_split_3', (10, 832, 7, 7)),
 ('inception_5b/1x1', (10, 384, 7, 7)),
 ('inception_5b/3x3_reduce', (10, 192, 7, 7)),
 ('inception_5b/3x3', (10, 384, 7, 7)),
 ('inception_5b/5x5_reduce', (10, 48, 7, 7)),
 ('inception_5b/5x5', (10, 128, 7, 7)),
 ('inception_5b/pool', (10, 832, 7, 7)),
 ('inception_5b/pool_proj', (10, 128, 7, 7)),
 ('inception_5b/output', (10, 1024, 7, 7)),
 ('pool5/7x7_s1', (10, 1024, 1, 1)),
 ('loss3/classifier', (10, 1000)),
 ('prob', (10, 1000))]
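
The spatial sizes and channel counts in the listing above can be checked by hand. The sketch below assumes that convolution output sizes are rounded down and pooling output sizes are rounded up, and that conv1/7x7_s2 uses a padding of 3. The padding value is not shown in this notebook, so treat it as an assumption. The helper names are hypothetical.

```python
import math


def conv_out(size, kernel, stride, pad=0):
    # Convolution output size, rounded down.
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride, pad=0):
    # Pooling output size, rounded up.
    return math.ceil((size + 2 * pad - kernel) / stride) + 1


s = conv_out(224, kernel=7, stride=2, pad=3)  # conv1/7x7_s2
sizes = [s]
for _ in range(4):                            # pool1 .. pool4, each 3x3 with stride 2
    s = pool_out(s, kernel=3, stride=2)
    sizes.append(s)
print(sizes)

# An inception output concatenates its four branches along the channel axis.
branches_3a = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}
total_3a = sum(branches_3a.values())
print(total_3a)
```

The resulting sizes agree with the 112, 56, 28, 14, and 7 that appear in the listing, and the channel sum agrees with the 256 channels of inception_3a/output.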

Size of each layer's weight parameters


In [15]:
# Shape of each layer's weights
[(k, v[0].data.shape) for k, v in net.params.items()]


Out[15]:
[('conv1/7x7_s2', (64, 3, 7, 7)),
 ('conv2/3x3_reduce', (64, 64, 1, 1)),
 ('conv2/3x3', (192, 64, 3, 3)),
 ('inception_3a/1x1', (64, 192, 1, 1)),
 ('inception_3a/3x3_reduce', (96, 192, 1, 1)),
 ('inception_3a/3x3', (128, 96, 3, 3)),
 ('inception_3a/5x5_reduce', (16, 192, 1, 1)),
 ('inception_3a/5x5', (32, 16, 5, 5)),
 ('inception_3a/pool_proj', (32, 192, 1, 1)),
 ('inception_3b/1x1', (128, 256, 1, 1)),
 ('inception_3b/3x3_reduce', (128, 256, 1, 1)),
 ('inception_3b/3x3', (192, 128, 3, 3)),
 ('inception_3b/5x5_reduce', (32, 256, 1, 1)),
 ('inception_3b/5x5', (96, 32, 5, 5)),
 ('inception_3b/pool_proj', (64, 256, 1, 1)),
 ('inception_4a/1x1', (192, 480, 1, 1)),
 ('inception_4a/3x3_reduce', (96, 480, 1, 1)),
 ('inception_4a/3x3', (208, 96, 3, 3)),
 ('inception_4a/5x5_reduce', (16, 480, 1, 1)),
 ('inception_4a/5x5', (48, 16, 5, 5)),
 ('inception_4a/pool_proj', (64, 480, 1, 1)),
 ('inception_4b/1x1', (160, 512, 1, 1)),
 ('inception_4b/3x3_reduce', (112, 512, 1, 1)),
 ('inception_4b/3x3', (224, 112, 3, 3)),
 ('inception_4b/5x5_reduce', (24, 512, 1, 1)),
 ('inception_4b/5x5', (64, 24, 5, 5)),
 ('inception_4b/pool_proj', (64, 512, 1, 1)),
 ('inception_4c/1x1', (128, 512, 1, 1)),
 ('inception_4c/3x3_reduce', (128, 512, 1, 1)),
 ('inception_4c/3x3', (256, 128, 3, 3)),
 ('inception_4c/5x5_reduce', (24, 512, 1, 1)),
 ('inception_4c/5x5', (64, 24, 5, 5)),
 ('inception_4c/pool_proj', (64, 512, 1, 1)),
 ('inception_4d/1x1', (112, 512, 1, 1)),
 ('inception_4d/3x3_reduce', (144, 512, 1, 1)),
 ('inception_4d/3x3', (288, 144, 3, 3)),
 ('inception_4d/5x5_reduce', (32, 512, 1, 1)),
 ('inception_4d/5x5', (64, 32, 5, 5)),
 ('inception_4d/pool_proj', (64, 512, 1, 1)),
 ('inception_4e/1x1', (256, 528, 1, 1)),
 ('inception_4e/3x3_reduce', (160, 528, 1, 1)),
 ('inception_4e/3x3', (320, 160, 3, 3)),
 ('inception_4e/5x5_reduce', (32, 528, 1, 1)),
 ('inception_4e/5x5', (128, 32, 5, 5)),
 ('inception_4e/pool_proj', (128, 528, 1, 1)),
 ('inception_5a/1x1', (256, 832, 1, 1)),
 ('inception_5a/3x3_reduce', (160, 832, 1, 1)),
 ('inception_5a/3x3', (320, 160, 3, 3)),
 ('inception_5a/5x5_reduce', (32, 832, 1, 1)),
 ('inception_5a/5x5', (128, 32, 5, 5)),
 ('inception_5a/pool_proj', (128, 832, 1, 1)),
 ('inception_5b/1x1', (384, 832, 1, 1)),
 ('inception_5b/3x3_reduce', (192, 832, 1, 1)),
 ('inception_5b/3x3', (384, 192, 3, 3)),
 ('inception_5b/5x5_reduce', (48, 832, 1, 1)),
 ('inception_5b/5x5', (128, 48, 5, 5)),
 ('inception_5b/pool_proj', (128, 832, 1, 1)),
 ('loss3/classifier', (1000, 1024))]
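
Each convolution weight shape above reads as (output channels, input channels, kernel height, kernel width), so the number of weights in a layer is the product of its dimensions. As a rough illustration, the sketch below counts the weights for a few shapes copied from the listing. Biases are not included, since the listing only shows the first parameter blob of each layer.

```python
from functools import reduce
from operator import mul

# A few weight shapes copied from the listing above.
weight_shapes = {
    "conv1/7x7_s2": (64, 3, 7, 7),
    "conv2/3x3": (192, 64, 3, 3),
    "inception_3a/5x5": (32, 16, 5, 5),
    "loss3/classifier": (1000, 1024),
}

# Number of weights per layer = product of the shape's dimensions.
counts = {name: reduce(mul, shape, 1) for name, shape in weight_shapes.items()}
for name, n in counts.items():
    print(name, n)
```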

Function definition for visualization

The following function is used to visualize the filters and the hidden layers.


In [16]:
# take an array of shape (n, height, width) or (n, height, width, channels)
# and visualize each (height, width) thing in a grid of size approx. sqrt(n) by sqrt(n)
def vis_square(data, padsize=1, padval=0, gamma=1):
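    # normalize to [0, 1] on a copy so the network's own arrays are not modified in place
    data = (data - data.min()) / (data.max() - data.min())
    data = np.power(data, gamma)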

    # force the number of filters to be square
    n = int(np.ceil(np.sqrt(data.shape[0])))
    padding = ((0, n ** 2 - data.shape[0]), (0, padsize), (0, padsize)) + ((0, 0),) * (data.ndim - 3)
    data = np.pad(data, padding, mode='constant', constant_values=(padval, padval))
    
    # tile the filters into an image
    data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
    data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])
    
    plt.imshow(data)
    plt.axis('off')
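
The padding and tiling steps in vis_square can be hard to follow from the reshape and transpose calls alone. The sketch below repeats the same steps on three tiny 2x2 maps, without normalization or plotting, so that the intermediate shapes can be inspected. The variable names are chosen for this illustration only.

```python
import numpy as np

data = np.arange(3 * 2 * 2, dtype=np.float32).reshape(3, 2, 2)  # 3 maps of 2x2
padsize = 1

n = int(np.ceil(np.sqrt(data.shape[0])))  # grid side length: 2
# Add one empty map to reach n * n maps, and one pixel of padding per map.
padding = ((0, n ** 2 - data.shape[0]), (0, padsize), (0, padsize))
padded = np.pad(data, padding, mode='constant', constant_values=0)

h, w = padded.shape[1:]
tiled = padded.reshape(n, n, h, w).transpose(0, 2, 1, 3)  # (row, y, col, x)
grid = tiled.reshape(n * h, n * w)                        # single 2D image

print(padded.shape, grid.shape)
```

Map number i ends up in grid row i // n and grid column i % n, with the padding forming the gaps between neighboring maps.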

Visualizing the C1 layer

The parameters of the C1 layer (Figure 8) are shown below.


In [17]:
# the parameters are a list of [weights, biases]
filters = net.params['conv1/7x7_s2'][0].data
vis_square(filters.transpose(0, 2, 3, 1), padsize=1)


Part of the C1 layer output after gamma correction (Figure 9) is shown below.


In [18]:
feat = net.blobs['conv1/7x7_s2'].data[0]
print(feat.shape)
vis_square(feat[28:64], padval=1, padsize=2, gamma=0.2)


(64, 112, 112)
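
The call above passes gamma=0.2 to vis_square, which raises the normalized values to the power 0.2. For values in [0, 1] this leaves 0 and 1 unchanged and lifts everything in between, so weak activations become visible. A minimal numeric illustration with made-up values:

```python
values = [0.0, 0.01, 0.1, 0.5, 1.0]  # made-up activations, already normalized to [0, 1]
gamma = 0.2

corrected = [v ** gamma for v in values]
for v, c in zip(values, corrected):
    print("{:.2f} -> {:.3f}".format(v, c))
```

An activation at 1% of the maximum is displayed at roughly 40% brightness after the correction.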

Visualizing the C2 layer

Next are the parameters and output of the C2 layer, which are not shown in the printed article.


In [19]:
filters = net.params['conv2/3x3'][0].data
print(filters.shape)
vis_square(filters[:64].reshape(64**2, 3, 3), padval=0, padsize=1)


(192, 64, 3, 3)
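
The reshape in the cell above takes the first 64 output filters, each with 64 input channels, and flattens them into 64 x 64 = 4096 separate 3x3 tiles. The sketch below uses a dummy array with the same shape as the conv2/3x3 weights to show which tile corresponds to which filter and channel. The array contents are synthetic.

```python
import numpy as np

# Dummy array with the same shape as the conv2/3x3 weights.
filters = np.arange(192 * 64 * 3 * 3, dtype=np.float32).reshape(192, 64, 3, 3)

tiles = filters[:64].reshape(64 ** 2, 3, 3)
print(tiles.shape)

# Tile i * 64 + j holds the 3x3 kernel of output filter i, input channel j.
i, j = 5, 7
same = np.array_equal(tiles[i * 64 + j], filters[i, j])
print(same)
```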

In [20]:
feat = net.blobs['conv2/3x3'].data[0]
print(feat.shape)
vis_square(feat[28:64], padval=1, padsize=1, gamma=0.2)


(192, 56, 56)