Much of the world isn't mapped. This seems odd at first, but it basically comes down to a question of cash, and a large chunk of the world doesn't have enough of it. Maps are important, and when big charities like the Red Cross or Médecins Sans Frontières try to respond to crises or run public health projects, the lack of mapping is a serious problem. This is why the Missing Maps project came into existence. It's a volunteer project with the goal of putting the world's most vulnerable people on the map. In more concrete terms, volunteers spend time poring over satellite imagery, tracing features like roads and buildings (you can learn more here), and this data is then available for anyone to use. It's a time-consuming process, and much of the world is pretty empty (you don't see many buildings in the rainforest, or the desert).

The MapSwipe app was created to help accelerate the mapping process by pre-filtering the tiles. MapSwipe users scroll through pieces of satellite imagery in a mobile app and identify images containing buildings or other features (depending on the project). Once this data has been gathered, the mapping volunteers can maximize their productivity by going straight to the tiles that need mapping, rather than wasting their time poring over large expanses of forest (say).
When I first heard about this, I thought that it sounded like a machine learning problem. I'm not necessarily looking to automate MapSwipe - that might well be quite hard. A good chunk of the tiles in a MapSwipe project are pretty easy to identify, though, and it makes sense for humans to be principally involved in the more difficult ones. A good ML solution could also be used to partially verify the output of the human mappers - it might help to spot missing buildings or roads, for example. It's also a useful exercise in working towards the eventual Missing Maps problem: generating maps straight from the raw satellite imagery.

Before we continue, we need to define the MapSwipe problem properly. MapSwipe is a classification problem - users classify a single tile of satellite imagery as one of the following:
| Class | Description |
|---|---|
| Bad imagery | Something on the ground can't be seen. This is often because of cloud cover obstructing the satellite's view, or sometimes because something seems to be broken with the satellite. |
| Built | There are buildings in view. |
| Empty | The imagery contains no buildings. |
To make life a little easier, I chose to only consider the projects that are solely focussed on finding buildings (roads can be tackled another day).
For my first attempt at using machine learning to solve the MapSwipe problem, I followed the approach laid out in the first few lectures of the fast.ai course. Basically, you take a neural network that has already been trained to solve the ImageNet problem, and adapt it for your own computer vision problem. The next section outlines exactly what I did, but feel free to skip to the results section.
All scripts used are present in my mapswipe-ml repository.
I started by generating a dataset. There's a fuller explanation of the generate_dataset.py script in the repository, but essentially it downloads as many examples as possible of the three categories - bad imagery, built and empty - whilst keeping the sizes of the three groups the same. The projects that I selected were all those with their lookFor property set to buildings only. (It now transpires that there's a similar category, which some of the newer projects fall into, of just buildings; these were not included.) This comes to approximately 1.4 million images, split 80-10-10 into a training set, a validation set and a test set.
python3 generate_dataset.py 124 303 407 692 1166 1333 1440 1599 1788 1901 2020 2158 2293 2473 2644 2671 2809 2978 3121 3310 3440 3610 3764 3906 4103 4242 4355 4543 4743 4877 5061 5169 5291 5368 5519 5688 5870 5990 6027 6175 6310 6498 6628 6637 6646 6794 6807 6918 6930 7049 7056 7064 7108 7124 7125 7260 7280 7281 7605 7738 7871 8059 8324 -k <bing maps api key> -o experiment_1/all_projects_dataset --inner-test-dir-for-keras
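generate_dataset.py handles the downloading (via the Bing Maps API) and the class balancing; the 80-10-10 split itself amounts to something like the following sketch (illustrative only, not the script's actual code):

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle a list of (quadkey, label) examples and split it
    80/10/10 into training, validation and test sets."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    train_end = int(0.8 * n)
    val_end = int(0.9 * n)
    return examples[:train_end], examples[train_end:val_end], examples[val_end:]
```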
To actually create the model, I used Keras to fine-tune Google's InceptionV3 model. This means removing its top layer of output neurons and replacing it with three fully connected output neurons (one for each class) with a softmax activation (see the script for exact details - I've omitted a couple of layers for brevity). During the training process, only the top (newly added) layers are trained.
python3 train.py --dataset-dir experiment_1/all_projects_dataset --output-dir experiment_1/inception_v3_fine_tuned --fine-tune --num-epochs 1
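For reference, the core of this kind of fine-tuning setup in Keras looks roughly like the following. This is a minimal sketch using the standard keras.applications API; the actual train.py adds a couple of extra layers and handles the data generators, so treat the details (optimizer included) as illustrative:

```python
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Load InceptionV3 with ImageNet weights, minus its original classification head
base = InceptionV3(weights='imagenet', include_top=False)

# Pool the convolutional features and add a new 3-way softmax head
# (bad_imagery, built, empty)
x = GlobalAveragePooling2D()(base.output)
predictions = Dense(3, activation='softmax')(x)
model = Model(inputs=base.input, outputs=predictions)

# Freeze the pre-trained layers so that only the new head is trained
for layer in base.layers:
    layer.trainable = False

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```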
After one epoch of training, you get a model with a validation accuracy of approximately 54%. With extra epochs of fine-tuning this increases slightly, but I didn't feel it was particularly worth doing. Instead, I thought about the ImageNet problem. ImageNet is primarily concerned with identifying the one object that dominates the foreground of any particular photo. MapSwipe is fundamentally different, in that it's more about considering the whole image: any part of the image may have something obscuring it (in the case of bad imagery), or contain a building, and that changes the entire image's classification. The objects being identified are less complex than in ImageNet (where you need to be able to, say, differentiate between a cat's face and a dog's), but the whole image matters more in the MapSwipe problem (whereas ImageNet has a better separation of foreground and background). With this hypothesis in mind, I decided to train all layers of the network for several epochs:
python3 train.py --dataset-dir experiment_1/all_projects_dataset --output-dir experiment_1/inception_v3_all_layers --num-epochs 10 --start_model experiment_1/inception_v3_fine_tuned/model.01-0.906-0.539.hdf5
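Conceptually, the change from the fine-tuning run is just to unfreeze every layer before compiling. Again, this is a sketch assuming the same model object as in the previous snippet (the real script resumes from the fine-tuned checkpoint via --start_model, and the learning rate shown is illustrative):

```python
from keras.optimizers import SGD

# Make every layer trainable, not just the newly added head
for layer in model.layers:
    layer.trainable = True

# Re-compile with a small learning rate so the pre-trained weights are
# adjusted gently rather than overwritten
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```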
I let it train for 9 epochs and then stopped it to see how it was progressing (I was using an AWS p3.2xlarge instance, which isn't cheap). The final trained model had a validation accuracy of 65%. The accuracy was still increasing, but the rate of increase had slowed significantly. I suspect that there's more improvement to be had by training for longer, but I wanted to start analysing the results.
To classify the test set:
python3 test.py --dataset-dir experiment_1/all_projects_dataset/test/ -m experiment_1/inception_v3_all_layers/model.01-0.906-0.539.hdf5.09-0.737-0.649.hdf5 -o experiment_1/inception_v3_all_layers.results
In [1]:
from mapswipe_analysis import *
all_projects_solution = Solution(
    ground_truth_solutions_file_to_map('../experiment_1/all_projects_dataset/test/solutions.csv'),
    predictions_file_to_map('../experiment_1/inception_v3_all_layers.results')
)
all_projects_solution.accuracy
Out[1]:
So, we're about 64% accurate. This means that 64% of the time, we select the right class for the tile (bad imagery, built, or empty). If we guessed at random, we'd expect to be about 33% accurate (there are three classes, so we have a one in three chance of being correct). Let's break down that accuracy into per-category accuracies:
In [2]:
category_accuracies_df = pd.DataFrame(all_projects_solution.category_accuracies, index=class_names, columns=['Test dataset'])
display(HTML(category_accuracies_df.transpose().to_html()))
It seems almost suspicious that our bad image detection accuracy is so much lower than the other categories. Let's break down this accuracy data further into a confusion matrix:
In [3]:
conf_matrix_df = pd.DataFrame(all_projects_solution.confusion_matrix, index=class_names, columns=class_names)
display(HTML(conf_matrix_df.to_html()))
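The Solution class computes this matrix for us; conceptually, it's just a tally over the test tiles, along the lines of the sketch below (the maps from quadkey to class name are assumed, and this isn't the actual mapswipe_analysis code):

```python
import numpy as np

# Mirrors the class_names imported from mapswipe_analysis
class_names = ['bad_imagery', 'built', 'empty']

def build_confusion_matrix(predicted, solution):
    """Count how many tiles fall into each (predicted class, solution class) pair."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = np.zeros((len(class_names), len(class_names)), dtype=int)
    for quadkey, predicted_class in predicted.items():
        matrix[index[predicted_class], index[solution[quadkey]]] += 1
    return matrix
```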
The rows correspond to what our model predicted, and the columns correspond to the official solution. If our model were perfect, we'd expect non-zero entries only on the main diagonal (top left to bottom right), and zeroes everywhere else. The largest off-diagonal entry corresponds to examples that are officially (according to the MapSwipe data) bad imagery, but that our model has classified as empty. Let's take a look at the examples where we were most confident that the imagery was empty, but it was actually bad (according to the official solution).
In [4]:
quadkeys = [x[0] for x in all_projects_solution.classified_as(predicted_class='empty', solution_class='bad_imagery')[0:9]]
tableau(quadkeys, all_projects_solution)
(note that the prediction vectors have the form $(\mathbb{P}(\text{bad\_imagery}), \mathbb{P}(\text{built}), \mathbb{P}(\text{empty}))$, where $\mathbb{P}$ denotes a probability)
As you can see, all of these images seem perfectly fine, and all in fact show land with no buildings. Now, we've only looked at the 9 that the model's most confident about, but I've skimmed through a large number of them (not included here for brevity) and whilst the occasional one has a small amount of cloud cover, the vast majority are absolutely fine.
I'm not sure why this is happening, but I have a few hypotheses:
It's also interesting to review some other scenarios. Here are some images that the solution defines as empty, but the model believes that they contain buildings:
In [5]:
quadkeys = [x[0] for x in all_projects_solution.classified_as(predicted_class='built', solution_class='empty')[0:9]]
tableau(quadkeys, all_projects_solution)
So, it's not quite as open-and-shut as the previous set of examples, but it still helps build confidence in the model, and supports the hypothesis that the MapSwipe data is far from accurate.
Everything we've done so far has considered one giant dataset, composed of a large number of projects (where each project corresponds to a relatively small geographic area). It's interesting to see whether the model's accuracy varies between the individual projects. To do this, I generated individual datasets for each project (using a similar workflow to the one described previously), and then used the same model as before to grade each individual project's test dataset.
In [6]:
import json
from os.path import isdir, join
import os
import urllib.request
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.io import output_notebook, show
with urllib.request.urlopen("http://api.mapswipe.org/projects.json") as url:
    projects = json.loads(url.read().decode())
individual_projects_dir = '../individual_projects/'
project_dirs = [d for d in os.listdir(individual_projects_dir) if isdir(join(individual_projects_dir, d))]
project_dirs.sort(key=int)
project_ids = []
accuracies = []
names = []
tile_counts = []
for project_id in project_dirs:
    solutions_csv = join(individual_projects_dir, project_id, 'test', 'solutions.csv')
    if os.path.getsize(solutions_csv) > 0:
        solution = Solution(
            ground_truth_solutions_file_to_map(solutions_csv),
            predictions_file_to_map(join(individual_projects_dir, project_id, 'initial_inception_v3_all_layers.out'))
        )
        project_ids.append(project_id)
        accuracies.append(solution.accuracy * 100)
        names.append(projects[project_id]['name'])
        tile_counts.append(solution.tile_count)
output_notebook()
source = ColumnDataSource(data=dict(
    x=project_ids,
    y=accuracies,
    names=names,
    tile_counts=tile_counts
))
hover = HoverTool(tooltips=[
    ("Project ID", "@x"),
    ("Accuracy", "@y%"),
    ("Name", "@names"),
    ("Tile count", "@tile_counts")
])
p = figure(plot_width=800, plot_height=600, tools=[hover],
           title="Test accuracy for each MapSwipe project")
p.circle('x', 'y', size=10, source=source)
show(p)
In this figure, we're graphing the accuracy of the model against the project ID. Project IDs are assigned at the time a project is created, and newer projects get larger IDs, so the x-axis represents the passage of time on an arbitrary (and unlikely to be anything like linear) scale. It's interesting to note that there isn't a huge amount of variation in the individual project accuracies (project 6027 is tiny, so it's barely worth considering). The only real insight is that the model seems to be particularly effective in the Cambodia / Laos region (you can hover your mouse over a mark on the scatter plot to see some project details).
I think it's pretty clear from what we've seen that a significant problem facing MapSwipe is data quality. A machine learning model is only as good as the data that goes into it, and mislabelled data could create confounding results for researchers trying to solve the problem. There are two obvious ways to try to solve this problem:
If we consider the engineering solution suggested above, it provides an opportunity to consider a fundamentally different data model. I propose that the data model should consist of a set of tiles. For each tile, a number of questions can be asked, for instance "Does this tile contain any buildings?", and the answer to each question is yes, no or maybe. Multiple questions can be assigned to a tile, which allows a tile to simultaneously contain buildings and be bad imagery (if it's partially obscured by cloud); this can't happen in the current model, and it will confound many simple ML models. It also allows tiles to be explicitly marked as empty by users, as opposed to just being skipped with no data recorded. This is critically important for training ML models in future, as the empty tiles are just as important as the built ones, and we must have a high degree of confidence in the training dataset's annotations for both categories.
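As a rough sketch of what such a data model might look like (the type names and question names here are purely illustrative, not an actual MapSwipe schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

class Answer(Enum):
    YES = 'yes'
    NO = 'no'
    MAYBE = 'maybe'

@dataclass
class Tile:
    quadkey: str
    # Each question (e.g. "Does this tile contain any buildings?",
    # "Is any part of this tile obscured?") maps to the answers users gave,
    # so a tile can be marked as built and as partially bad imagery at once.
    answers: Dict[str, List[Answer]]

# Example: a tile explicitly marked as containing buildings and explicitly
# marked as not obscured (rather than simply being skipped).
tile = Tile(
    quadkey='123012301230123',
    answers={
        'contains_buildings': [Answer.YES, Answer.YES, Answer.MAYBE],
        'is_obscured': [Answer.NO],
    },
)
```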