Machine Learning Applications for Astronomy

========

(but seriously - how does a machine learn?)

Version 0.1


By AA Miller 2018 Nov 05

What is Machine Learning?

Artificial Intelligence, clustering, data mining, pattern recognition, ...

Lots of hype, lots of buzz, and lots and lots of \$\$\$\$.

Summarizing the excitement for machine learning with one figure:

(credit: https://www.researchgate.net/Winner-results-of-the-ImageNet-large-scale-visual-recognition-challenge-LSVRC-of-the_fig7_324476862)

So... what is it?

Bishop's textbook provides a useful four-word summary:

(credit: Springer-Verlag New York)

Briefly, machine learning algorithms build highly complex, non-linear models (i.e., these things are black boxes) to map input information to outcomes.

Ultimately, machine learning is fundamentally concerned with classification (i.e., predicting outcomes).

As astronomers and astrophysicists, this is great news!

The history of astronomy is a long story of classification. Essentially, point a telescope at some new location in the sky, and there is a decent chance you might find something that has never been seen before. Now, figure out how this new thing relates to all the things that you already know (classification).

Astronomy is (and has been) an observationally/experimentally led field. The typical pattern is that observers find some weird thing, and then theorists try to explain what is going on (there are obviously exceptions; predictions of kilonovae prior to the LIGO detection of a neutron star-neutron star merger are a recent example).

This makes us very different from physics, where theory generates predictions that then lead the observations (e.g., Higgs boson, general relativity, etc.).

Thus, if machine learning is fundamentally about classification, and astronomers spend all their time classifying objects, that must mean that machine learning and astronomy are a match made in heaven.

Right?

This is where I say – not so fast.$^\dagger$

Even though astronomy is an observationally-led, classification-concerned field, ultimately, like physicists, we care about the development of a physical understanding of how the Universe works. And this is not what machine learning is fundamentally built to do.

$^\dagger$ A long list of people would dispute this assertion.

In other words,

machine learning $\longleftrightarrow$ prediction

astronomy $\longleftrightarrow$ inference

And thus, astronomy and machine learning may not be a match made in heaven.

Basic Concepts of Machine Learning

Broadly, there are 2 types of machine learning:

  • There are no known outcomes for the data $\longleftrightarrow$ unsupervised machine learning
  • Outcomes are known for a subset of the data $\longleftrightarrow$ supervised machine learning

Unsupervised Machine Learning

(aka clustering, data mining)

  1. Algorithm is designed to cluster data into groups
  2. No natural metric for measuring the quality of the clustering (should the data be split into 3, 4, or 17 clusters?)
  3. Can be very useful for data exploration (see the sketch below)
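As a minimal sketch of unsupervised clustering (using scikit-learn; the synthetic data and the choice of n_clusters=3 are assumptions for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2D data: two measured features per source (purely illustrative)
rng = np.random.RandomState(23)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

# partition the data into groups; n_clusters must be chosen by hand -
# there is no natural metric that says 3 is "correct"
kmeans = KMeans(n_clusters=3, random_state=23)
cluster_labels = kmeans.fit_predict(X)
```

Inspecting the features within each cluster is often a productive first step for data exploration.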

Supervised Machine Learning

  1. Algorithm builds a map between data and (known) outcomes for a subset of the data
  2. Algorithm is optimized to maximize accuracy
  3. Useful for both classification and regression (see the sketch below)
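And a minimal supervised sketch (again scikit-learn; the random forest and the toy features/labels are assumptions, not a recommendation for any particular problem):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy training set: features and known labels for 100 sources
rng = np.random.RandomState(23)
X_train = rng.normal(0, 1, (100, 4))
y_train = (X_train[:, 0] > 0).astype(int)   # labels from a made-up rule
X_new = rng.normal(0, 1, (10, 4))           # sources with unknown outcomes

# learn the map from features to labels, then predict for new sources
clf = RandomForestClassifier(n_estimators=100, random_state=23)
clf.fit(X_train, y_train)
predictions = clf.predict(X_new)
```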

Terminology

Features – measured properties (numerical or categorical)

Labels – outcomes, i.e., the thing the algorithm predicts

Training set – subset of the full data set where features and labels are known (used to train the model)

Adam's 1-slide summary of supervised machine learning:

True positive (TP) = + classified as +

False positive (FP) = - classified as + (type I error)

True negative (TN) = - classified as -

False negative (FN) = + classified as - (type II error)

Problem 1

What is more detrimental when building a model: false positives or false negatives?

Take a few minutes to discuss with your partner

Ultimately, this depends on the problem at hand. If you are building a model to detect cancer, false negatives are really really bad. If you are building a model to find extremely metal-poor stars, and you then obtain a 10 hr spectrum on a 10 m telescope to confirm each candidate, false positives are really really bad.

From TP, FP, TN, and FN, it is possible to calculate several useful metrics for evaluating your model:

Accuracy $= \frac{TP + TN}{TP + FP + TN + FN}$

True Positive Rate (TPR) $= \frac{TP}{TP + FN}$

False Positive Rate (FPR) $= \frac{FP}{TN + FP}$
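Each of these metrics is a simple ratio of the four counts defined above; a quick sketch (the counts are made up for illustration):

```python
# hypothetical counts from a classifier evaluated on labeled data
TP, FP, TN, FN = 85, 10, 90, 15

accuracy = (TP + TN) / (TP + FP + TN + FN)   # 0.875
tpr = TP / (TP + FN)   # 0.85 - fraction of + sources correctly recovered
fpr = FP / (TN + FP)   # 0.10 - fraction of - sources incorrectly flagged as +
```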

By varying the classification threshold, it is possible to determine the TPR as a function of the FPR, a relation known as the Receiver Operating Characteristic (ROC) curve.

(credit: Zahiri et al. 2013)
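In practice the ROC curve can be traced with scikit-learn's roc_curve, given the true labels and the classifier's scores (the toy labels and scores below are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_curve

# toy true labels and classifier probabilities for the + class
# (in practice, e.g., clf.predict_proba(X)[:, 1])
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9])

# each (fpr[i], tpr[i]) pair corresponds to one classification threshold;
# plotting tpr against fpr traces the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```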

Precision $= \frac{TP}{TP + FP}$

This is an incredibly useful metric for astronomical applications, because it reflects how much is lost to false positives when obtaining follow-up observations (follow-up is expensive; e.g., we can only obtain spectra for a small fraction of LSST sources - how do we choose which ones to observe?)

Cross validation (CV) – method to estimate any of the above metrics using the training set alone

Basic idea for $k$-fold CV: remove 1/$k$ of the training set, train the model on the remaining data, and predict labels for the held-out 1/$k$. Repeat $k$ times, holding out a different 1/$k$ each time.

Enables unique predictions for every training set source.

CV enables the construction of the confusion matrix:

(credit: Richards et al. 2013)
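A minimal $k$-fold CV sketch with scikit-learn (the 10 folds, random forest, and toy data are all assumptions): cross_val_predict returns an out-of-fold prediction for every training source, from which the confusion matrix follows directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# toy training set
rng = np.random.RandomState(23)
X = rng.normal(0, 1, (100, 4))
y = (X[:, 0] + 0.5*rng.normal(size=100) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=23)
# every source is predicted by a model that never saw it during training
y_cv = cross_val_predict(clf, X, y, cv=10)
cm = confusion_matrix(y, y_cv)
```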

Why is Adam Warning Me About Machine Learning?

Training sets are hard.

Why is this? Bias. Pattern matching cannot inherently know about bias, and so biases in the training set will be propagated to the final model predictions.

Furthermore, nearly every survey is guaranteed to have a biased training set. Any new survey is likely to be observing the sky in some unique fashion, and thus probing parameter space in some new way. This new survey mode will therefore find sources that were not present in the previous survey.

For example, a new survey that goes 1.5 mag deeper than the previous survey will find a lot of galaxies at higher redshifts. Furthermore, at fixed redshift, the new survey will find more intrinsically faint galaxies. These high-redshift and intrinsically faint galaxies will not have counterparts in a training set based on the previous survey, and therefore they will be classified incorrectly.

This is known as sample selection bias and it is nasty.

Problem 2

Based on very sound theoretical reasoning, you expect to find the following in a sample of 100 stars: 60 orange stars, 30 purple stars, and 10 grey stars.

Data from another survey includes 1000 orange stars, 200 purple stars, and 14 grey stars.

Furthermore, 14 of the orange stars, 7 of the purple stars, and 5 of the grey stars have features that are missing.

You need to build a model to classify stars as either orange, purple, or grey. What do you do?

Take a few minutes to discuss with your partner

There is no correct answer here. Given what we know, I'd do the following:

  • throw away the sources with the missing data
  • fit an unsupervised clustering model to the data to identify whether the classes are easily separable
  • sample from the training set to create a class distribution that matches the theoretical prediction (as sketched below)
  • fit the machine learning model and make final predictions
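A sketch of the resampling step above, using the numbers from the problem (names are illustrative): after discarding sources with missing features there are 986 orange, 193 purple, and 9 grey stars, so the grey stars limit how large a prior-matched sample can be.

```python
import numpy as np

rng = np.random.RandomState(23)

# usable training sources after discarding those with missing features
n_avail = {'orange': 1000 - 14, 'purple': 200 - 7, 'grey': 14 - 5}
prior = {'orange': 0.6, 'purple': 0.3, 'grey': 0.1}

# the class that is scarcest relative to its prior sets the total sample size
n_total = min(n_avail[c] / prior[c] for c in prior)   # grey: 9 / 0.1 = 90

for c in prior:
    n_draw = int(round(prior[c] * n_total))
    # draw (indices of) sources without replacement to match the prior
    keep = rng.choice(n_avail[c], size=n_draw, replace=False)
    print(c, n_draw)   # orange 54, purple 27, grey 9
```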

Conclusions

Machine learning algorithms are extremely powerful (the next generation may not learn to drive)

Machine learning = prediction; astronomy = inference (be careful about equating the two)

All astronomical training sets are biased – very difficult to properly interpret (some) predictions as a result