Much of the world isn't mapped. This seems odd at first, but it basically comes down to a question of cash, and a large chunk of the world doesn't have enough of it. Maps are important, and when big charities like the Red Cross or Médecins Sans Frontières try to respond to crises or run public health projects, the lack of mapping is a serious problem. This is why the Missing Maps project came into existence. It's a volunteer project with the goal of putting the world's most vulnerable people on the map. In more concrete terms, volunteers spend time poring over satellite imagery, tracing features like roads and buildings (you can learn more here), and this data is then available for anyone to use. It's a time-consuming process, and much of the world is pretty empty (you don't see many buildings in the rainforest, or the desert).

The MapSwipe app was created to help accelerate the mapping process by pre-filtering the tiles. MapSwipe users scroll through pieces of satellite imagery in a mobile app and identify images containing buildings or other features (depending on the project). Once this data has been gathered, the mapping volunteers can maximize their productivity by going straight to the tiles that need mapping, rather than wasting their time poring over large expanses of forest (say).
When I first heard about this, I thought that it sounded like a machine learning problem. I'm not necessarily looking to automate MapSwipe - that might well be quite hard. A good chunk of the tiles in a MapSwipe project are pretty easy to identify, though, and it makes sense for humans to be principally involved in the more difficult ones. A good ML solution could also be used to partially verify the output of the human mappers - it might help to spot missing buildings or roads, for example. It's also a useful exercise in working towards the eventual Missing Maps problem: generating maps straight from the raw satellite imagery.

Before we continue, we need to define the MapSwipe problem properly. MapSwipe is a classification problem - users classify a single tile of satellite imagery as one of the following:
| Class | Description |
|---|---|
| Bad imagery | Something on the ground can't be seen. This is often because of cloud cover obstructing the satellite's view, or sometimes because something seems to be broken with the satellite. |
| Built | There are buildings in view. |
| Empty | The imagery contains no buildings. |
To make life a little easier, I chose to only consider the projects that are solely focussed on finding buildings (roads can be tackled another day).
For my first attempt at using machine learning to solve the MapSwipe problem, I followed the approach laid out in the first few lectures of the fast.ai course. Basically, you take a neural network that has already been trained to solve the ImageNet problem, and adapt it for your own computer vision problem. The next section outlines exactly what I did, but feel free to skip to the results section.
All scripts used are present in my mapswipe-ml repository.
I started by generating a dataset. There's a fuller explanation of the generate_dataset.py script in the repository, but essentially it downloads as many examples as possible of the three categories - bad imagery, built and empty - whilst keeping the sizes of the three groups the same. The projects that I selected were all those with their lookFor property set to buildings only. (It now transpires that there's a similar category, which some of the newer projects fall into, of just buildings; these were not included.) This comes to approximately 1.4 million images, split 80-10-10 into a training set, a validation set and a test set.
python3 generate_dataset.py 124 303 407 692 1166 1333 1440 1599 1788 1901 2020 2158 2293 2473 2644 2671 2809 2978 3121 3310 3440 3610 3764 3906 4103 4242 4355 4543 4743 4877 5061 5169 5291 5368 5519 5688 5870 5990 6027 6175 6310 6498 6628 6637 6646 6794 6807 6918 6930 7049 7056 7064 7108 7124 7125 7260 7280 7281 7605 7738 7871 8059 8324 -k <bing maps api key> -o experiment_1/all_projects_dataset --inner-test-dir-for-keras
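generate_dataset.py handles the downloading (via the Bing Maps API) and the class balancing; the 80-10-10 split itself amounts to something like the following sketch (illustrative only, not the script's actual code):

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle a list of (quadkey, label) examples and split it
    80/10/10 into training, validation and test sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    train_end = int(0.8 * n)
    val_end = int(0.9 * n)
    return examples[:train_end], examples[train_end:val_end], examples[val_end:]
```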
To actually create the model, I used Keras to fine-tune Google's InceptionV3 model. This means removing its top layer of output neurons and replacing it with three fully connected output neurons (one for each class) with a softmax activation (see the script for exact details - I've omitted a couple of layers for brevity). During the training process, only the top (newly added) layers are trained.
python3 train.py --dataset-dir experiment_1/all_projects_dataset --output-dir experiment_1/inception_v3_fine_tuned --fine-tune --num-epochs 1
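For reference, the core of this kind of fine-tuning setup in Keras looks roughly like the following. This is a minimal sketch using the standard keras.applications API; the actual train.py adds a couple of extra layers and handles the data generators, so treat the details (optimizer included) as illustrative:

```python
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Load InceptionV3 with ImageNet weights, minus its original classification head
base = InceptionV3(weights='imagenet', include_top=False)

# Pool the convolutional features and add a new 3-way softmax head
# (bad_imagery, built, empty)
x = GlobalAveragePooling2D()(base.output)
predictions = Dense(3, activation='softmax')(x)
model = Model(inputs=base.input, outputs=predictions)

# Freeze the pre-trained layers so that only the new head is trained
for layer in base.layers:
    layer.trainable = False

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```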
After one epoch of training, you get a model with a validation accuracy of approximately 54%. With extra epochs of fine-tuning this increases slightly, but I didn't feel it was particularly worth doing. Instead, I thought about the ImageNet problem. ImageNet is primarily concerned with identifying the one object that dominates the foreground of any particular photo. MapSwipe is fundamentally different, in that it's more about considering the whole image: any part of the image may have something obscuring it (in the case of bad imagery), or contain a building, and that changes the entire image's classification. The objects being identified are less complex than in ImageNet (where you need to be able to, say, differentiate between a cat's face and a dog's), but the whole image matters more in the MapSwipe problem (whereas ImageNet has a better separation of foreground and background). With this hypothesis in mind, I decided to train all layers of the network for several epochs:
python3 train.py --dataset-dir experiment_1/all_projects_dataset --output-dir experiment_1/inception_v3_all_layers --num-epochs 10 --start_model experiment_1/inception_v3_fine_tuned/model.01-0.906-0.539.hdf5
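Conceptually, the change from the fine-tuning run is just to unfreeze every layer before compiling. Again, this is a sketch assuming the same model object as in the previous snippet (the real script resumes from the fine-tuned checkpoint via --start_model, and the learning rate shown is illustrative):

```python
from keras.optimizers import SGD

# Make every layer trainable, not just the newly added head
for layer in model.layers:
    layer.trainable = True

# Re-compile with a small learning rate so the pre-trained weights are
# adjusted gently rather than overwritten
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```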
I let it train for 9 epochs and then stopped it to see how it was progressing (I was using an AWS p3.2xlarge instance, which isn't cheap). The final trained model had a validation accuracy of 65%. The accuracy was still increasing, but the rate of increase had slowed significantly. I suspect that there's more improvement to be had by training for longer, but I wanted to start analysing the results.
To classify the test set:
python3 test.py --dataset-dir experiment_1/all_projects_dataset/test/ -m experiment_1/inception_v3_all_layers/model.01-0.906-0.539.hdf5.09-0.737-0.649.hdf5 -o experiment_1/inception_v3_all_layers.results
In [1]:
from mapswipe_analysis import *
all_projects_solution = Solution(
    ground_truth_solutions_file_to_map('../experiment_1/all_projects_dataset/test/solutions.csv'),
    predictions_file_to_map('../experiment_1/inception_v3_all_layers.results')
)
all_projects_solution.accuracy
Out[1]:
So, we're about 64% accurate. This means that 64% of the time, we select the right class for the tile (bad imagery, built, or empty). If we guessed at random, we'd expect to be about 33% accurate (there are three classes, so we have a one in three chance of being correct). Let's break down that accuracy into per-category accuracies:
In [2]:
category_accuracies_df = pd.DataFrame(all_projects_solution.category_accuracies, index=class_names, columns=['Test dataset'])
display(HTML(category_accuracies_df.transpose().to_html()))
It seems almost suspicious that our bad image detection accuracy is so much lower than the other categories. Let's break down this accuracy data further into a confusion matrix:
In [3]:
conf_matrix_df = pd.DataFrame(all_projects_solution.confusion_matrix, index=class_names, columns=class_names)
display(HTML(conf_matrix_df.to_html()))
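The Solution class computes this matrix for us; conceptually, it's just a tally over the test tiles, along the lines of the sketch below (the maps from quadkey to class name are assumed, and this isn't the actual mapswipe_analysis code):

```python
import numpy as np

# Mirrors the class_names imported from mapswipe_analysis
class_names = ['bad_imagery', 'built', 'empty']

def build_confusion_matrix(predicted, solution):
    """Count how many tiles fall into each (predicted class, solution class) pair."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = np.zeros((len(class_names), len(class_names)), dtype=int)
    for quadkey, predicted_class in predicted.items():
        matrix[index[predicted_class], index[solution[quadkey]]] += 1
    return matrix
```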
The rows correspond to what our model predicted, and the columns correspond to the official solution. If our model were perfect, we'd expect non-zero entries only on the main diagonal (top left to bottom right), and zeroes everywhere else. The largest off-diagonal entry corresponds to examples that are officially (according to the MapSwipe data) bad imagery, but that our model has classified as empty. Let's take a look at the examples where we were most confident that the imagery was empty, but it was actually bad (according to the official solution).
In [4]:
quadkeys = [x[0] for x in all_projects_solution.classified_as(predicted_class='empty', solution_class='bad_imagery')[0:9]]
tableau(quadkeys, all_projects_solution)
(note that the prediction vectors have the form $(\mathbb{P}(\text{bad\_imagery}), \mathbb{P}(\text{built}), \mathbb{P}(\text{empty}))$, where $\mathbb{P}$ denotes a probability)
As you can see, all of these images seem perfectly fine, and all in fact show land with no buildings. Now, we've only looked at the 9 that the model's most confident about, but I've skimmed through a large number of them (not included here for brevity) and whilst the occasional one has a small amount of cloud cover, the vast majority are absolutely fine.
I'm not sure why this is happening, but I have a few hypotheses:
It's also interesting to review some other scenarios. Here are some images that the solution defines as empty, but the model believes that they contain buildings:
In [5]:
quadkeys = [x[0] for x in all_projects_solution.classified_as(predicted_class='built', solution_class='empty')[0:9]]
tableau(quadkeys, all_projects_solution)
So, it's not quite as open-and-shut as the previous set of examples, but it still helps build confidence in the model, and supports the hypothesis that the MapSwipe data is far from accurate.
Everything we've done so far has considered one giant dataset, composed of a large number of projects (where each project corresponds to a relatively small geographic area). It's interesting to see whether the model's accuracy varies between the individual projects. To do this, I generated individual datasets for each project (using a similar workflow to the one described previously), and then used the same model as before to grade each individual project's test dataset.
In [6]:
import json
from os.path import isdir, join
import os
import urllib.request
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.io import output_notebook, show
with urllib.request.urlopen("http://api.mapswipe.org/projects.json") as url:
    projects = json.loads(url.read().decode())
individual_projects_dir = '../individual_projects/'
project_dirs = [d for d in os.listdir(individual_projects_dir) if isdir(join(individual_projects_dir, d))]
project_dirs.sort(key=int)
project_ids = []
accuracies = []
names = []
tile_counts = []
for project_id in project_dirs:
    solutions_csv = join(individual_projects_dir, project_id, 'test', 'solutions.csv')
    if os.path.getsize(solutions_csv) > 0:
        solution = Solution(
            ground_truth_solutions_file_to_map(solutions_csv),
            predictions_file_to_map(join(individual_projects_dir, project_id, 'initial_inception_v3_all_layers.out'))
        )
        project_ids.append(project_id)
        accuracies.append(solution.accuracy * 100)
        names.append(projects[project_id]['name'])
        tile_counts.append(solution.tile_count)
output_notebook()
source = ColumnDataSource(data=dict(
    x=project_ids,
    y=accuracies,
    names=names,
    tile_counts=tile_counts
))
hover = HoverTool(tooltips=[
    ("Project ID", "@x"),
    ("Accuracy", "@y%"),
    ("Name", "@names"),
    ("Tile count", "@tile_counts")
])
p = figure(plot_width=800, plot_height=600, tools=[hover],
           title="Test accuracy for each MapSwipe project")
p.circle('x', 'y', size=10, source=source)
show(p)
In this figure, we're graphing the accuracy of the model against the project ID. Project IDs are assigned at the time a project is created, and newer projects get larger IDs, so the x-axis represents the passage of time on an arbitrary (and unlikely to be anything like linear) scale. It's interesting to note that there isn't a huge amount of variation in the individual project accuracies (project 6027 is tiny, so it's barely worth considering). The only real insight is that the model seems to be particularly effective in the Cambodia / Laos region (you can hover your mouse over a mark on the scatter plot to see some project details).
I think it's pretty clear from what we've seen that a significant problem facing MapSwipe is data quality. A machine learning model is only as good as the data that goes into it, and mislabelled data could create confounding results for researchers trying to solve the problem. There are two obvious ways to try to solve this problem:
If we consider the engineering solution suggested above, it provides an opportunity to consider a fundamentally different data model. I propose that the data model should consist of a set of tiles. For each tile, a number of questions can be asked, for instance "Does this tile contain any buildings?", and the answer to each question is yes, no or maybe. Multiple questions can be assigned to a tile, which allows a tile to simultaneously contain buildings and be bad imagery (if it's partially obscured by cloud); this can't happen in the current model, and it will confound many simple ML models. It also allows tiles to be explicitly marked as empty by users, as opposed to just being skipped with no data recorded. This is critically important for training ML models in future, as the empty tiles are just as important as the built ones, and we must have a high degree of confidence in the training dataset's annotations for both categories.
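As a rough sketch of what such a data model might look like (the type names and question names here are purely illustrative, not an actual MapSwipe schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

class Answer(Enum):
    YES = 'yes'
    NO = 'no'
    MAYBE = 'maybe'

@dataclass
class Tile:
    quadkey: str
    # Each question (e.g. "Does this tile contain any buildings?",
    # "Is any part of this tile obscured?") maps to the answers users gave,
    # so a tile can be marked as built and as partially bad imagery at once.
    answers: Dict[str, List[Answer]]

# Example: a tile explicitly marked as containing buildings and explicitly
# marked as not obscured (rather than simply being skipped).
tile = Tile(
    quadkey='123012301230123',
    answers={
        'contains_buildings': [Answer.YES, Answer.YES, Answer.MAYBE],
        'is_obscured': [Answer.NO],
    },
)
```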