In this breakout we'll explore unsupervised clustering and anomaly detection in variable sources from the LINEAR survey. This is similar to part of the analysis done in this paper (see also this astroML example).
We'll start with our standard imports, then take a look at the data:
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()
First we will download the data with astroML. This data is a set of variable sources from the LINEAR survey: we have both photometry (u-g, g-i, i-K, J-K colors) and light-curve measurements (logP, amplitude, and skew).
We have two views of the data. The first is a catalog derived from the light curves:
In [2]:
from astroML.datasets import fetch_LINEAR_geneva
data = fetch_LINEAR_geneva()
The data is in the form of a record array. That is, each item of the data is a compound object containing multiple values:
In [3]:
print(data.shape)
In [4]:
data[0]
Out[4]:
To learn what is in the data, we can look at the field names from the data type:
In [5]:
print(data.dtype.names)
Any individual column of values can be found with the item getter interface. For example, to get an array of all the RA values we can write:
In [6]:
data['RA']
Out[6]:
If we want to do some learning on this data, we need to put it in the form of a matrix. For example, we can choose the following features and construct the matrix this way:
In [7]:
feature_names = ['ug', 'gi', 'iK', 'JK', 'logP', 'amp', 'skew']
X = np.vstack([data[f] for f in feature_names]).T
X.shape
Out[7]:
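Aside: these features span very different numeric ranges (colors are of order unity, while logP, amplitude, and skew are not). If you plan to apply PCA or a GMM below, it is usually worth standardizing the features first. Here is a minimal sketch using scikit-learn's StandardScaler; the name Xs is our own shorthand for the scaled matrix, and the sketches later in this notebook assume it:
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean and unit variance, so that no single
# feature dominates variance-based methods like PCA or GMM
Xs = StandardScaler().fit_transform(X)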
The object ID is also available if you'd like to refer back to the raw LINEAR data:
In [8]:
data['LINEARobjectID']
Out[8]:
The second view is the set of raw light curves themselves, which we can also fetch with astroML:
In [9]:
from astroML.datasets import fetch_LINEAR_sample
lightcurves = fetch_LINEAR_sample()
In [10]:
print(lightcurves)
The result is an object which contains references to all the LINEAR data; if we're interested in any individual light curve, we can access it via the ID:
In [11]:
lc = lightcurves.get_light_curve(10003298)
lc.shape
Out[11]:
This light curve has 196 observations, and the columns are [time, magnitude, mag_error]. One shortcut to access this data is to transpose the array:
In [12]:
t, y, dy = lc.T
We can now plot this data:
In [13]:
plt.errorbar(t, y, dy, fmt='o');
To connect the catalog with the raw light curves, we can look up the LINEAR object ID for a given row of the catalog:
In [14]:
lc_index = 20
lc_id = data['LINEARobjectID'][lc_index]
lc_id
Out[14]:
With this ID in hand, we can fetch the corresponding light curve:
In [15]:
lc = lightcurves.get_light_curve(int(lc_id))
t, y, dy = lc.T
Then we can plot the phased light curve using the best-fit period from the catalog:
In [16]:
P = 10 ** data['logP'][lc_index]
phase = t % P
plt.errorbar(phase, y, dy, fmt='o');
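If you'd like to browse more of the sample before diving into the machine learning, here is a small sketch (our addition; the figure size and number of panels are arbitrary choices) that phase-folds and plots the first few objects in a grid:
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.ravel()):
    obj_id = data['LINEARobjectID'][i]
    t, y, dy = lightcurves.get_light_curve(int(obj_id)).T
    P = 10 ** data['logP'][i]
    # plot phase (normalized to [0, 1)) against magnitude
    ax.errorbar((t % P) / P, y, dy, fmt='.', ms=3)
    ax.set_title(str(obj_id))
    ax.invert_yaxis()  # magnitudes: smaller is brighter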
This exercise is very free-form, but here are some suggestions for how to proceed with exploring this data:
Using Principal Component Analysis (a starter sketch follows this list):
A. How many dimensions are needed to represent the data?
B. Which objects are not well-approximated by this representation?
C. Can you find any clusters in the data by-eye?
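Here is a minimal PCA sketch to get you started on the questions above, assuming scikit-learn and the standardized matrix Xs defined earlier; this is one possible starting point, not the full analysis:
from sklearn.decomposition import PCA

pca = PCA().fit(Xs)

# A: the cumulative explained variance suggests how many components
# are needed to represent the data
plt.plot(np.arange(1, Xs.shape[1] + 1),
         np.cumsum(pca.explained_variance_ratio_), 'o-')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

# C: project onto the first two components and look for clusters by-eye
X_proj = pca.transform(Xs)
plt.figure()
plt.scatter(X_proj[:, 0], X_proj[:, 1], s=4)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
For question B, one option is to reconstruct the data from the first few components and look at which objects have the largest reconstruction error.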
Using GMM (a starter sketch follows this list):
A. How many clusters best fit the data? (You can use BIC, AIC, or some other method.)
B. Does the best-fit GMM model show any outliers in the data? Are these "real" outliers or artifacts?
C. Do any of these clusters dominate (e.g. small covariance, large weight)? Can you come up with a good visualization to convey which points belong to which clusters?
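And a minimal GMM sketch, again assuming scikit-learn and the standardized Xs; the range of component numbers here is an arbitrary choice:
from sklearn.mixture import GaussianMixture

# A: fit GMMs over a range of component numbers and compare BIC/AIC
n_components = np.arange(1, 15)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(Xs)
          for n in n_components]
bic = [m.bic(Xs) for m in models]
plt.plot(n_components, bic, 'o-', label='BIC')
plt.plot(n_components, [m.aic(Xs) for m in models], 's-', label='AIC')
plt.xlabel('number of components')
plt.ylabel('information criterion')
plt.legend()

# B: points with the lowest per-sample log-likelihood under the
# BIC-preferred model are candidate outliers
best = models[np.argmin(bic)]
log_prob = best.score_samples(Xs)
candidates = np.argsort(log_prob)[:10]  # the ten least-likely points
print(data['LINEARobjectID'][candidates])
For question C, best.weights_ and best.covariances_ give the mixture weights and covariances of each component, and best.predict(Xs) labels each point with a cluster for visualization.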
The purpose of this exercise is not necessarily to answer these questions, but to spend time using the tools we've learned on some real data, so you can see just how messy the process of data mining is! (And hey, if you find something interesting, maybe you should write a quick paper about it!) Good luck!