I have posed two problems for you to work on in this hands-on exercise.
More information for each is provided below. You may want or need to cut and paste code from your other notebooks. But first...
In [7]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
# You can add anything you need as you work
We will use the same data from the Day 2 clustering exercise (see that notebook to download the data).
Please specify paths for the following:
In [8]:
F_META = '../Day2/dsfp_ztf_meta.npy'
F_FEATS = '../Day2/dsfp_ztf_feats.npy'
D_STAMPS = '../Day2/dsfp_ztf_png_stamps'
In [10]:
meta_np = np.load(F_META)
feats_np = np.load(F_FEATS)
COL_NAMES = ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap',
'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky',
'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio',
'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg',
'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig',
'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']
# NOTE FROM Umaa: I've decided to eliminate the following features. Dropping them.
#
COL_TO_DROP = ['ndethist', 'ncovhist', 'jdstarthist', 'jdendhist',
'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr',
'classtar', 'ssdistnr', 'ssmagnr', 'aimagerat', 'bimagerat',
'magapbig', 'sigmagapbig', 'scorr']
feats_df = pd.DataFrame(data=feats_np, index=meta_np['candid'], columns=COL_NAMES)
# Drop the unwanted features first, so the messages below describe what actually remains.
feats_df.drop(columns=COL_TO_DROP, inplace=True)
print("There are {} columns left.".format(len(feats_df.columns)))
print("They are: {}".format(list(feats_df.columns)))
#feats_df.describe()
In the last exercise, you created a training and test set for the purposes of building a classifier. The goal of this exercise is to note any changes in the feature distributions for these two sets.
Per feature, can you:
Which feature exhibits the most drift between train and test?
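One possible way to rank features by drift is sketched below. It scores each feature with the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical CDFs of the two sets (0 means identical distributions, 1 means no overlap). This is only one reasonable choice of drift measure, not the required one. The names `train_df` and `test_df` are assumptions standing in for whatever you called your splits in the last exercise, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def rank_feature_drift(train_df, test_df, label_col='label'):
    """Return a Series of KS statistics per feature, largest drift first."""
    # Assumption: both frames share the same feature columns.
    features = [c for c in train_df.columns if c != label_col]
    drift = {c: ks_statistic(train_df[c].dropna(), test_df[c].dropna())
             for c in features}
    return pd.Series(drift, name='ks_stat').sort_values(ascending=False)
```

Calling `rank_feature_drift(train_df, test_df).head()` lists the features with the largest gaps. Plotting overlaid histograms of the top few is a good sanity check on what the number is telling you.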
In [ ]:
This task ties together the work you did for the unsupervised and supervised exercises. Here's how to get started.
Cluster the entire labeled set provided to you. How you choose to do the clustering is up to you. I recommend getting your clustering results into a two-dimensional space that you can plot, but this is not strictly necessary.
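If you want a starting point, here is a minimal NumPy-only sketch of one way to do this: standardize the features, cluster with a plain k-means loop, and project onto the first two principal components for plotting. The algorithm and the number of clusters are entirely your call; a library implementation such as scikit-learn's would be the more usual choice in practice. All function names here are hypothetical, and the sketch assumes the feature matrix contains only finite values.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X - X.mean(axis=0)) / std


def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T


def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        assign = nearest_centroid(X, centroids)
        # Keep the old centroid if a cluster ends up empty.
        updated = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return nearest_centroid(X, centroids), centroids
```

With the frame built above, you might run `X = standardize(feats_df.drop(columns='label').to_numpy())`, then `assign, centroids = kmeans(X, k)` for a `k` of your choosing, and plot with `xy = pca_2d(X)` followed by `plt.scatter(xy[:, 0], xy[:, 1], c=assign, s=2)`.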
Apply the labels to the clusters you've created and plot them. If you're working in a space with three or more dimensions, find a way to print the candids in each cluster. You can sort each cluster's list so that the examples closest to the centroid come first, and print their associated labels.
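The sorting step could look something like the sketch below. It assumes you already have a feature matrix `X`, an array of cluster assignments, and an array of centroids from whatever clustering you ran; those inputs and the function name are hypothetical, not part of the provided data.

```python
import numpy as np
import pandas as pd


def members_by_centroid_distance(X, assign, centroids, candids, labels):
    """List every example with its cluster, ordered by distance to that cluster's centroid."""
    X = np.asarray(X, dtype=float)
    assign = np.asarray(assign)
    centroids = np.asarray(centroids, dtype=float)
    # Distance from each example to the centroid of its own cluster.
    dist = np.linalg.norm(X - centroids[assign], axis=1)
    table = pd.DataFrame({
        'candid': np.asarray(candids),
        'cluster': assign,
        'dist_to_centroid': dist,
        'label': np.asarray(labels),
    })
    return table.sort_values(['cluster', 'dist_to_centroid']).reset_index(drop=True)
```

For example, `members_by_centroid_distance(X, assign, centroids, feats_df.index, feats_df['label']).groupby('cluster').head(10)` shows the ten most central examples per cluster along with their labels.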
Look at some postage stamps of examples you suspect are mislabeled. Can you devise a simple way to identify a set of mislabeled examples? Can you come up with an estimate of how many examples are mislabeled?
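One simple heuristic, offered only as a starting point: within clusters dominated by a single label, flag the examples whose label disagrees with the cluster majority. The fraction flagged gives a rough estimate of the mislabeling rate, though it depends heavily on how good the clustering is, so confirm a sample against the postage stamps. The function name and the `min_purity` threshold are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd


def flag_label_disagreements(assign, labels, candids, min_purity=0.9):
    """Mark examples whose label differs from the majority label of a nearly pure cluster."""
    df = pd.DataFrame({
        'candid': np.asarray(candids),
        'cluster': np.asarray(assign),
        'label': np.asarray(labels),
    })
    grouped = df.groupby('cluster')['label']
    # Most common label in each cluster, and the fraction of members carrying it.
    majority = grouped.agg(lambda s: s.value_counts().idxmax())
    purity = grouped.agg(lambda s: s.value_counts(normalize=True).max())
    df['majority_label'] = df['cluster'].map(majority)
    df['cluster_purity'] = df['cluster'].map(purity)
    # Only trust the majority vote in clusters that are nearly pure.
    df['suspect'] = (df['label'] != df['majority_label']) & (df['cluster_purity'] >= min_purity)
    return df
```

The mean of the `suspect` column is then a crude estimate of the mislabeled fraction, and `df.loc[df['suspect'], 'candid']` gives the candids whose stamps are worth inspecting by eye.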
In [9]: