Cleaning Astronomical Datasets

I have posed two problems for you to work on in this hands-on exercise.

  • Concept drift: Do the training and test set distributions differ?
  • Mislabeled examples: Can you find mislabeled examples in the labeled ZTF data provided?

More information for each is provided below. You may want or need to cut and paste code from your other notebooks. But first...

0a. Imports

These are all the imports that will be used in this notebook. All should be available in the DSFP conda environment.


In [7]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
# You can add anything you need as you work

0b. Data Location

We will use the same data from the Day 2 clustering exercise (see that notebook to download the data).

Please specify paths for the following:


In [8]:
F_META = '../Day2/dsfp_ztf_meta.npy'
F_FEATS = '../Day2/dsfp_ztf_feats.npy'
D_STAMPS = '../Day2/dsfp_ztf_png_stamps'

0c. Load Data


In [10]:
meta_np = np.load(F_META)
feats_np = np.load(F_FEATS)

COL_NAMES = ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap',
             'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky',
             'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio',
             'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg',
             'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig',
             'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']

# NOTE FROM Umaa: I've decided to eliminate the following features, so they are dropped below.
COL_TO_DROP = ['ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 
               'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 
               'classtar', 'ssdistnr', 'ssmagnr', 'aimagerat', 'bimagerat', 
               'magapbig', 'sigmagapbig', 'scorr']
feats_df = pd.DataFrame(data=feats_np, index=meta_np['candid'], columns=COL_NAMES)
# Drop the unwanted features first, so the report below shows what is actually left.
feats_df.drop(columns=COL_TO_DROP, inplace=True)
print("There are {} columns left.".format(len(feats_df.columns)))
print("They are: {}".format(list(feats_df.columns)))
#feats_df.describe()


There are 19 columns left.
They are: ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap', 'sky', 'magdiff', 'fwhm', 'mindtoedge', 'magfromlim', 'seeratio', 'aimage', 'bimage', 'elong', 'nneg', 'nbad', 'sumrat', 'label']

1. Concept Drift

In the last exercise, you created a training and test set for the purposes of building a classifier. The goal of this exercise is to note any changes in the feature distributions for these two sets.

Per feature, can you:

  • plot the test vs. train distributions for both real and bogus examples, and note any regions that do not overlap
  • quantitatively measure the difference using the Kullback-Leibler divergence, and print or plot the scores for all features

Which feature exhibits the most drift between train and test?
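
If you are unsure where to start on the second bullet, here is a minimal sketch of one way to score drift per feature. It is not the only approach: it estimates the Kullback-Leibler divergence from histograms on shared bin edges, and the bin count and the pseudocount added to every bin (to avoid taking the log of zero) are assumptions you should tune. The function names `kl_divergence` and `drift_scores` are hypothetical helpers, and the demo arrays are synthetic stand-ins, not the ZTF features.

```python
import numpy as np

def kl_divergence(train_vals, test_vals, n_bins=30, pseudocount=1.0):
    """Histogram estimate of D_KL(train || test), in nats, for one feature."""
    train_vals = np.asarray(train_vals, dtype=float)
    test_vals = np.asarray(test_vals, dtype=float)
    # Use the same bin edges for both sets so the histograms are comparable.
    lo = min(train_vals.min(), test_vals.min())
    hi = max(train_vals.max(), test_vals.max())
    if lo == hi:
        return 0.0  # a constant feature cannot drift
    bins = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(train_vals, bins=bins)
    q, _ = np.histogram(test_vals, bins=bins)
    # Assumption: smooth the counts so empty bins do not give log(0).
    p = (p + pseudocount) / (p + pseudocount).sum()
    q = (q + pseudocount) / (q + pseudocount).sum()
    return float(np.sum(p * np.log(p / q)))

def drift_scores(train, test, names, **kwargs):
    """Return (feature name, KL score) pairs, most drift first."""
    scores = [(name, kl_divergence(train[:, i], test[:, i], **kwargs))
              for i, name in enumerate(names)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Synthetic demo: the second column is deliberately shifted in the test set.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(5000, 2))
test = np.column_stack([rng.normal(0.0, 1.0, 5000),
                        rng.normal(1.5, 1.0, 5000)])
ranked = drift_scores(train, test, ["stable", "shifted"])
for name, score in ranked:
    print("{:>8s}: {:.3f}".format(name, score))
```

To use this on the real data, you could pass the `.values` of your own train and test DataFrames (with the `label` column removed) and run it once for the real examples and once for the bogus ones. Keep in mind that the KL divergence is not symmetric, so swapping train and test changes the scores.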


In [ ]:

2. Finding Mislabeled Examples

This task ties together the work you did for the unsupervised and supervised exercises. Here's how to get started.

  1. Cluster the entire labeled set provided to you. How you do the clustering is up to you. I recommend projecting your clustering results into a two-dimensional space that you can plot, but this is not strictly necessary.

  2. Apply the labels to the clusters you've created and plot them. If you're working in a space with three or more dimensions, find a way to print the candids in each cluster. You can sort the list so that the examples closest to the centroid come first, and print their associated labels.

  3. Look at some postage stamps of examples you suspect are mislabeled. Can you devise a simple way to identify a set of mislabeled examples? Can you come up with an estimate of how many examples are mislabeled?
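
The steps above can be sketched as follows. This is only one possible starting point, under several assumptions: it uses a small hand-written k-means (you may prefer the clustering method from your Day 2 notebook), it assumes labels are coded as 1 for real and 0 for bogus, and it flags an example as suspect when its label disagrees with the majority label of a cluster that is otherwise at least `purity` pure. The names `kmeans` and `flag_suspects`, the value of `purity`, and the synthetic demo data are all hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centroids = [X[0]]
    for _ in range(1, k):
        # Next centroid: the point farthest from all centroids chosen so far.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1), centroids

def flag_suspects(X, labels, k=2, purity=0.9):
    """Flag examples whose label disagrees with a nearly pure cluster."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize features
    assign, centroids = kmeans(Xs, k)
    dist = np.linalg.norm(Xs - centroids[assign], axis=1)  # distance to own centroid
    suspects = np.zeros(len(X), dtype=bool)
    for j in range(k):
        members = assign == j
        if not members.any():
            continue
        frac_real = labels[members].mean()  # assumes 1 = real, 0 = bogus
        majority = int(frac_real >= 0.5)
        if max(frac_real, 1.0 - frac_real) >= purity:
            suspects |= members & (labels != majority)
    return suspects, assign, dist

# Synthetic demo: two well-separated blobs with three labels flipped on purpose.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 3)),
               rng.normal(5.0, 0.5, size=(100, 3))])
labels = np.array([0] * 100 + [1] * 100)
flipped = [3, 50, 120]
labels[flipped] = 1 - labels[flipped]

suspects, assign, dist = flag_suspects(X, labels)
print("suspect rows:", np.flatnonzero(suspects))
print("estimated mislabeled fraction: {:.3f}".format(suspects.mean()))
```

On the real data, the row positions of the suspects can be mapped back to candids through `feats_df.index`, which lets you pull up the matching postage stamps. Real clusters will be far less clean than this demo, so treat the flagged fraction as a rough estimate and check it against the stamps.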


In [9]: