What do I want?

  • Match COSMOS galaxies to HSC objects
  • Remove non-matched HSC objects
  • Create a classification based on COSMOS features
    • (z < 0.15) and ($10^8$ < Mstellar < $10^9$ M$\odot$)
  • Use that COSMOS classification to create an HSC training set
  • Run Random Forest & Logistic Regression on that data set (pure HSC features! no COSMOS features)
  • Create visual diagnostics

I'll test two basic classifiers: a RandomForest classifier and a Logistic Regression classifier. I'll try a very simplistic weak-learner: a straight i-band cut.

For my training data, I started by getting objects and labels from COSMOS. For input features, I then matched those COSMOS galaxies to their nearest HSC counterpart. I then used HSC i-band magnitude, along with HSC g-r, r-i, i-z, z-y colors. Finally, I augment it with some HSC photo-z information (FRANKEN-Z).

In this notebook I'll look at the full decision curves for the classifiers, in hopes of better understanding my results.

Code


In [1]:
# give access to importing dwarfz
import os, sys
dwarfz_package_dir = os.getcwd().split("dwarfz")[0]
if dwarfz_package_dir not in sys.path:
    sys.path.insert(0, dwarfz_package_dir)

import dwarfz
    
# back to regular import statements

%matplotlib inline
from matplotlib import pyplot as plt

import seaborn as sns
sns.set(context="poster", style="ticks", font_scale=1.4)

import numpy as np
import pandas as pd

from scipy.special import expit

import pathlib

In [2]:
import matplotlib as mpl
mpl.rcParams['savefig.dpi'] = 80
mpl.rcParams['figure.dpi'] = 80
mpl.rcParams['figure.figsize'] = 2*np.array((8,6))
mpl.rcParams['figure.facecolor'] = "white"

In [3]:
COSMOS_filename = pathlib.Path(dwarfz.data_dir_default) / "COSMOS_reference.sqlite"
COSMOS = dwarfz.datasets.COSMOS(COSMOS_filename)

In [4]:
HSC_filename = pathlib.Path(dwarfz.data_dir_default) / "HSC_COSMOS_median_forced.sqlite3"
HSC = dwarfz.datasets.HSC(HSC_filename)

In [5]:
matches_filename = pathlib.Path(dwarfz.data_dir_default) / "matches.sqlite3"
matches_df = dwarfz.matching.Matches.load_from_filename(matches_filename)

In [6]:
combined = matches_df[matches_df.match].copy()
combined["ra"]       = COSMOS.df.loc[combined.index].ra
combined["dec"]      = COSMOS.df.loc[combined.index].dec
combined["photo_z"]  = COSMOS.df.loc[combined.index].photo_z
combined["log_mass"] = COSMOS.df.loc[combined.index].mass_med

photometry_cols = [
    "gcmodel_flux","gcmodel_flux_err","gcmodel_flux_flags", "gcmodel_mag",
    "rcmodel_flux","rcmodel_flux_err","rcmodel_flux_flags", "rcmodel_mag",
    "icmodel_flux","icmodel_flux_err","icmodel_flux_flags", "icmodel_mag",
    "zcmodel_flux","zcmodel_flux_err","zcmodel_flux_flags", "zcmodel_mag",
    "ycmodel_flux","ycmodel_flux_err","ycmodel_flux_flags", "ycmodel_mag",
]

for col in photometry_cols:
    combined[col] = HSC.df.loc[combined.catalog_2_ids][col].values

Turn magnitudes into colors


In [7]:
combined["g_minus_r"] = combined.gcmodel_mag - combined.rcmodel_mag
combined["r_minus_i"] = combined.rcmodel_mag - combined.icmodel_mag
combined["i_minus_z"] = combined.icmodel_mag - combined.zcmodel_mag
combined["z_minus_y"] = combined.zcmodel_mag - combined.ycmodel_mag

Filter out bad data


In [8]:
mask =    np.isfinite(combined["g_minus_r"]) & np.isfinite(combined["r_minus_i"]) \
        & np.isfinite(combined["i_minus_z"]) & np.isfinite(combined["z_minus_y"]) \
        & np.isfinite(combined["icmodel_mag"]) \
        & (~combined.gcmodel_flux_flags) & (~combined.rcmodel_flux_flags) \
        & (~combined.icmodel_flux_flags) & (~combined.zcmodel_flux_flags) \
        & (~combined.ycmodel_flux_flags)

combined = combined[mask]

Get FRANKENZ photo-z's


In [9]:
df_frankenz = pd.read_sql_table("photo_z",
                                "sqlite:///{}".format(
                                    pathlib.Path(dwarfz.data_dir_default)
                                                 / "HSC_matched_to_FRANKENZ.sqlite"),
                                index_col="object_id")

df_frankenz.head()


Out[9]:
FRANKENZ_ids photoz_best photoz_risk_best
object_id
43158034708430853 43158034708462402 3.39 0.410430
43158034708430859 43158034708462728 2.68 0.186912
43158034708430861 43158034708462729 2.57 0.565077
43158034708430866 43158034708438547 1.59 0.466111
43158034708430870 43158034708438694 2.37 0.363387

In [10]:
combined = combined.join(df_frankenz[["photoz_best", "photoz_risk_best"]],
                         on="catalog_2_ids")

Create classification labels


In [11]:
low_z    = (combined.photo_z  < .15)
low_mass = (combined.log_mass > 8) & (combined.log_mass < 9)

In [12]:
combined["low_z_low_mass"] = (low_z & low_mass)
combined.low_z_low_mass.mean()


Out[12]:
0.002185072976027315

In [13]:
combined.low_z_low_mass.sum()


Out[13]:
727

In [14]:
combined.shape


Out[14]:
(332712, 35)

Build Classifiers


In [15]:
features = combined.loc[:,["g_minus_r", "r_minus_i", "i_minus_z", "z_minus_y",
                           "icmodel_mag",
                           "photoz_best",
                           "photoz_risk_best" # The risk of photoz_best being outside of the range z_true +- 0.15(1+z_true). It ranges from 0 (safe) to 1(risky)
                          ]]

target = combined.loc[:,["low_z_low_mass"]]

In [16]:
target.mean()


Out[16]:
low_z_low_mass    0.002185
dtype: float64

In [17]:
COSMOS_field_area = 2 # sq. degree
N_COSMOS_total = HSC.df.shape[0]
N_COSMOS_good = combined.shape[0]

true_dwarf_density = target.sum().values[0] / COSMOS_field_area 
print("true dwarf density: {:.2f} / sq. deg.".format(true_dwarf_density))


true dwarf density: 363.50 / sq. deg.

Partition training and testing sets


In [21]:
testing_fraction = .1
np.random.seed(0)
shuffled_indices = np.random.permutation(target.index.values)

N_testing_indices = int(testing_fraction*shuffled_indices.size)
testing_set_indices  = shuffled_indices[:N_testing_indices]
training_set_indices = shuffled_indices[N_testing_indices:]

features_train = features.loc[training_set_indices]
features_test  = features.loc[testing_set_indices]

target_train   = target.loc[training_set_indices]
target_test    = target.loc[testing_set_indices]

true_dwarf     =  target_test.values.flatten()
true_non_dwarf = ~target_test.values.flatten()

In [18]:
# def get_classification_characteristics(target_prob, threshold_prob, verbose=False):

#     target_prediction = (target_prob > threshold_prob)
    
#     prediction_dwarf     =  target_prediction
#     prediction_non_dwarf = ~target_prediction
    
#     completeness = (true_dwarf & prediction_dwarf).sum() / true_dwarf.sum() 
    
#     purity = (true_dwarf & prediction_dwarf).sum() / prediction_dwarf.sum() 
    
#     sample_size_reduction = prediction_dwarf.size / prediction_dwarf.sum()
    
#     true_positives  = np.sum(true_dwarf     & prediction_dwarf)
#     false_positives = np.sum(true_non_dwarf & prediction_dwarf)
    
#     true_negatives  = np.sum(true_non_dwarf & prediction_non_dwarf)
#     false_negatives = np.sum(true_dwarf     & prediction_non_dwarf)
    
#     true_positive_rate  = true_positives  / true_dwarf.sum()
#     false_positive_rate = false_positives / true_non_dwarf.sum()
    
#     objects_per_sq_deg = N_COSMOS_good / COSMOS_field_area / sample_size_reduction
    
#     if verbose:
#         print("completeness:          ", completeness)
#         print("purity:                ", purity)
#         print("sample_size_reduction: ", sample_size_reduction)
#         print("true  positive rate:   ", true_positive_rate)
#         print("false positive rate:   ", false_positive_rate)
#         print("objects per sq deg:    ", objects_per_sq_deg)
        
#     return {
#         "completeness": completeness,
#         "purity": purity,
#         "sample_size_reduction": sample_size_reduction,
#         "threshold_prob": threshold_prob,
#         "true_positive_rate": true_positive_rate,
#         "false_positive_rate": false_positive_rate,
#         "objects_per_sq_deg" : objects_per_sq_deg,
#            }

Shared Plotting Parameters


In [22]:
color_RF = "g"
color_LR = "b"
color_MC = "r"

label_RF = "Random Forest"
label_LR = "Logistic Regression"
label_MC = "Magnitude Cut"

linewidth = 4

Cross-validation Wrapper

Note: this doesn't perfectly handle when the number of objects doesn't evenly divide by the number of folds. In practice an example might be held out multiple times, and I'll just throw those few examples away.


In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [29]:
n_folds_default = 10
def get_cross_validation_matrix(classifier, seed=0, folds=n_folds_default, only_i_mag=False):
    
    np.random.seed(seed)
    testing_fraction = 1/folds

    shuffled_indices = np.random.permutation(target.index.values)
    all_indices_set = set(shuffled_indices)
    
    results = {HSC_id: []
               for HSC_id in combined.loc[shuffled_indices].catalog_2_ids.drop_duplicates().values}
    
    for fold in range(folds):
        print("\rfold: {} / {}".format(fold+1, folds), end="", flush=True)
        if fold == folds-1:
            testing_set_indices  = shuffled_indices[fold*N_testing_indices:]
        else:
            testing_set_indices  = shuffled_indices[fold*N_testing_indices:(fold+1)*N_testing_indices]

        training_set_indices = np.array(list(all_indices_set - set(testing_set_indices)))

        
        features_train = features.loc[training_set_indices]
        features_test  = features.loc[testing_set_indices]
        if only_i_mag:
            features_train = features_train[["icmodel_mag"]]
            features_test  = features_test[["icmodel_mag"]]


        target_train   = target.loc[training_set_indices]
        target_test    = target.loc[testing_set_indices]
        
        
        classifier.fit(features_train, target_train.values.flatten())
        target_prob = classifier.predict_proba(features_test)[:,1]
        
        
        for i, COSMOS_id in enumerate(testing_set_indices):
            HSC_id = combined.loc[COSMOS_id].catalog_2_ids
            results[HSC_id].append(target_prob[i])
        
    return results

i-mag cut


In [30]:
classifier_i_mag = LogisticRegression(class_weight=None, 
                                      solver="lbfgs", 
                                      max_iter=300)


fold: 10 / 10

Logistic Regression


In [32]:
classifier_LR = LogisticRegression(class_weight=None, 
                                   solver="lbfgs", 
                                   max_iter=300)


fold: 10 / 10

Random Forest


In [33]:
classifier_RF = RandomForestClassifier(n_estimators=1000, 
                                       n_jobs=4)


fold: 10 / 10

Save all model results

Note, I previously saved this data at ../data/galaxy_images_training/2017_09_26-dwarf_galaxy_scores.csv. That's the version I use in constructing my DNN training set. The file created below isn't likely to be significantly better/different, but it uses different random seeds and thus will have some small variations in the probabilities assigned to each galaxy.


In [34]:
hdf_file = pathlib.Path("results_cross-validated_all.hdf5")
overwrite = False
if (not hdf_file.is_file()) or overwrite:
    results_RF = get_cross_validation_matrix(classifier_RF)
    results_LR = get_cross_validation_matrix(classifier_LR)
    results_i_mag = get_cross_validation_matrix(classifier_i_mag, 
                                                only_i_mag=True)

    HSC_ids = list(sorted(results_LR.keys()))

    HSC_ids = [HSC_id for HSC_id in HSC_ids
               if len(results_LR[HSC_id])==1]

    df_results = pd.DataFrame({
        "HSC_id":  HSC_ids,
        "LR_prob": [results_LR[HSC_id][0] for HSC_id in HSC_ids],
        "RF_prob": [results_RF[HSC_id][0] for HSC_id in HSC_ids],
        "i_mag_prob": [results_i_mag[HSC_id][0] for HSC_id in HSC_ids],
        "target": combined.set_index("catalog_2_ids").loc[HSC_ids].low_z_low_mass

    })

    df_results.to_hdf(hdf_file, key="results")
else:
    df_results = pd.read_hdf(hdf_file)
    
df_results.head()


Out[34]:
HSC_id LR_prob RF_prob i_mag_prob target
catalog_2_ids
43158176442354198 43158176442354198 0.000259 0.0 0.000574 False
43158176442354210 43158176442354210 0.000655 0.0 0.000479 False
43158176442354213 43158176442354213 0.002109 0.0 0.000546 False
43158176442354230 43158176442354230 0.000326 0.0 0.000607 False
43158176442354240 43158176442354240 0.000597 0.0 0.001646 False

Get Metrics from cross-validated results


In [35]:
threshold_probs = expit(np.linspace(-9, 6))
threshold_probs = np.array([-1e-6, *threshold_probs, 1+1e-6])

In [36]:
def get_purities(key, df_results=df_results, threshold_probs=threshold_probs):
    
    purities = np.empty_like(threshold_probs)
    df_tmp = df_results[[key, "target"]]
    
    for i, threshold_prob in enumerate(threshold_probs):
        mask = df_tmp[key] > threshold_prob
        purities[i] = df_tmp["target"][mask].mean()
    
    return purities

def get_completenesses(key, df_results=df_results, threshold_probs=threshold_probs):
    
    completenesses = np.empty_like(threshold_probs)
    df_tmp = df_results[[key, "target"]]
    df_tmp = df_tmp[df_tmp.target]

    for i, threshold_prob in enumerate(threshold_probs):
        mask = df_tmp[key] > threshold_prob

        completenesses[i] = mask.mean()
    
    return completenesses


def get_selected_object_density(key, df_results=df_results, threshold_probs=threshold_probs):
    """per sq deg"""
    object_density = np.empty_like(threshold_probs)
    df_tmp = df_results[[key, "target"]]

    for i, threshold_prob in enumerate(threshold_probs):
        mask = df_tmp[key] > threshold_prob

        object_density[i] = mask.sum()
    
    return object_density / COSMOS_field_area
    
    
def get_FPRs(key, df_results=df_results, threshold_probs=threshold_probs):
    
    FPRs = np.empty_like(threshold_probs)
    df_tmp = df_results[[key, "target"]]
    df_tmp = df_tmp[~df_tmp.target]

    for i, threshold_prob in enumerate(threshold_probs):
        mask = df_tmp[key] > threshold_prob

        FPRs[i] = mask.mean()
    
    return FPRs

In [37]:
purities_RF = get_purities("RF_prob")
completenesses_RF = get_completenesses("RF_prob")
TPR_RF = completenesses_RF
FPR_RF = get_FPRs("RF_prob")
object_density_RF = get_selected_object_density("RF_prob")


purities_LR = get_purities("LR_prob")
completenesses_LR = get_completenesses("LR_prob")
TPR_LR = completenesses_LR
FPR_LR = get_FPRs("LR_prob")
object_density_LR = get_selected_object_density("LR_prob")

purities_i_mag = get_purities("i_mag_prob")
completenesses_i_mag = get_completenesses("i_mag_prob")
TPR_i_mag = completenesses_i_mag
FPR_i_mag = get_FPRs("i_mag_prob")
object_density_i_mag = get_selected_object_density("i_mag_prob")

In [38]:
import sklearn
import sklearn.metrics

In [39]:
AUC_RF = sklearn.metrics.average_precision_score(df_results.target, df_results.RF_prob)
plt.plot(completenesses_RF, purities_RF,
         marker="o", color=color_RF, label="Random Forest        (AUC={:.2f})".format(AUC_RF), 
         linewidth=linewidth,
)

AUC_LR = sklearn.metrics.average_precision_score(df_results.target, df_results.LR_prob)
plt.plot(completenesses_LR, purities_LR,
         marker="o", color=color_LR, label="Logistic Regression (AUC={:.2f})".format(AUC_LR), 
         linestyle="dashed",
         linewidth=linewidth,
)
AUC_i_mag = sklearn.metrics.average_precision_score(df_results.target, df_results.i_mag_prob)
plt.plot(completenesses_i_mag, purities_i_mag,
         marker="o", color=color_MC, label="$i$-band cut                 (AUC={:.2f})".format(AUC_i_mag), 
         linestyle="dotted", 
         linewidth=linewidth,
)


plt.xlabel("Completeness")
plt.ylabel("Purity")
plt.ylim(0,1)


leg = plt.legend(loc="best")


filename = "plots_for_thesis/purity-completeness-all"
plt.tight_layout()
plt.savefig(filename + ".pdf")
plt.savefig(filename + ".png")



In [40]:
AUC_RF = sklearn.metrics.roc_auc_score(df_results.target, df_results.RF_prob)
plt.plot(FPR_RF, TPR_RF,
         marker="o", color=color_RF, label="Random Forest        (AUC={:.3f})".format(AUC_RF),
         drawstyle="steps-post",
         linewidth=linewidth,

)

AUC_LR = sklearn.metrics.roc_auc_score(df_results.target, df_results.LR_prob)
plt.plot(FPR_LR, TPR_LR,
         marker="o", color=color_LR, label="Logistic Regression (AUC={:.3f})".format(AUC_LR),
         linestyle="dashed",
         drawstyle="steps-post",
         linewidth=linewidth,

)
AUC_i_mag = sklearn.metrics.roc_auc_score(df_results.target, df_results.i_mag_prob)
plt.plot(FPR_i_mag, TPR_i_mag,
         marker="o", color=color_MC, label="$i$-band cut                (AUC={:.3f})".format(AUC_i_mag),
         linestyle="dotted",
         drawstyle="steps-post",
         linewidth=linewidth,

)

plt.plot([0,1], [0,1 ], linestyle="dotted", color="k", label="Random guessing",
         linewidth=linewidth,
        )


plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
# plt.xlim(0,1)
# plt.ylim(0,1)


plt.legend(loc="best")

filename = "plots_for_thesis/ROC-all"
plt.tight_layout()
plt.savefig(filename + ".pdf")
plt.savefig(filename + ".png")



In [41]:
f, (ax1, ax2) = plt.subplots(2, sharex=True)
f.subplots_adjust(hspace=0.1)


ax1.plot(object_density_RF, purities_RF, 
         marker="o", color=color_RF, label=label_RF,
         linewidth=linewidth,
        )

ax1.axvline(1e3,
           color="black", linestyle="dashed", label="DNN Training Set Density")

ax1.set_ylabel("Purity")
ax1.set_xscale("log")
ax1.set_ylim(0,1)

ax2.plot(object_density_RF, completenesses_RF, 
         marker="o", color=color_RF, label=label_RF,
         linewidth=linewidth,
        )

ax2.axvline(1e3,
            color="black", linestyle="dashed", label="DNN Training Set Density",
            linewidth=linewidth,
           )

ax2.set_xlabel("Number of Selected Objects per sq. deg.")
ax2.set_ylabel("Completeness")
ax2.set_xscale("log")
ax2.set_ylim(0,1)

ax2.legend(loc="best")

plt.tight_layout()


filename = "plots_for_thesis/purity-completeness-RF"
plt.tight_layout()
plt.savefig(filename + ".pdf")
plt.savefig(filename + ".png")



In [42]:
theoretical_probs=np.linspace(0,1,num=11)
empirical_probs_RF = np.empty(theoretical_probs.size-1)
num_in_bin_RF = np.empty_like(empirical_probs_RF)

for i in range(theoretical_probs.size-1):
    prob_lim_low  = theoretical_probs[i]
    prob_lim_high = theoretical_probs[i+1]
    
    mask_RF = (df_results["RF_prob"] >= prob_lim_low) & (df_results["RF_prob"] < prob_lim_high)
    empirical_probs_RF[i] = df_results["target"][mask_RF].mean()
    num_in_bin_RF[i] = df_results["target"][mask_RF].size

f, (ax1, ax2) = plt.subplots(2, sharex=True, 
                             gridspec_kw = {'height_ratios':[1, 3]},
                            )

ax1.plot(theoretical_probs, [num_in_bin_RF[0], *num_in_bin_RF],
         drawstyle="steps", color=color_RF,
         linewidth=linewidth,
        )

ax1.set_yscale("log")
ax1.set_ylim(bottom=10**-.5, top=10**6.5)
ax1.yaxis.set_ticks([1e0, 1e3, 1e6])
ax1.set_ylabel("Number of \nGalaxies in Bin")

ax2.step(theoretical_probs, [empirical_probs_RF[0], *empirical_probs_RF], 
         linestyle="steps", color=color_RF, label=label_RF,
         linewidth=linewidth,
        )

ax2.fill_between(theoretical_probs, theoretical_probs-theoretical_probs[1], theoretical_probs, 
                 step="pre", color="black", label="ideal", alpha=.2,
                 linewidth=linewidth,
                )

plt.xlabel("Reported Probability")
plt.ylabel("Actual (Binned) Probability")

plt.legend(loc="best")

plt.xlim(0,1)
plt.ylim(0,1)


plt.tight_layout()


filename = "plots_for_thesis/probability-calibration-RF"
plt.tight_layout()
plt.savefig(filename + ".pdf")
plt.savefig(filename + ".png")



In [43]:
sklearn.metrics.log_loss(df_results.target, df_results.RF_prob)


Out[43]:
0.008584162549760178

In [44]:
n_trees = classifier_RF.n_estimators
n_pseudo_obs = 2
pseudo_obs_class_balance = 0.5
df_results["RF_prob_softened"] = (df_results["RF_prob"] * n_trees + n_pseudo_obs * pseudo_obs_class_balance) \
                                    / (n_trees + n_pseudo_obs)

sklearn.metrics.log_loss(df_results.target, df_results.RF_prob_softened)


Out[44]:
0.006868875300331472

In [45]:
sklearn.metrics.log_loss(df_results.target, df_results.LR_prob)


Out[45]:
0.009958606240087662

In [46]:
sklearn.metrics.log_loss(df_results.target, df_results.i_mag_prob)


Out[46]:
0.011930754057327415

Predicted Probabilities


In [50]:
plt.hist(df_results["RF_prob"], bins=np.linspace(0,1), alpha=.5, color=color_RF, label=label_RF)
plt.hist(df_results["LR_prob"], bins=np.linspace(0,1), alpha=.5, color=color_LR, label=label_LR)

plt.yscale("log")
plt.xlabel("p(dwarf | model)")
plt.ylabel("Number of objects")

plt.legend(loc="best")


Out[50]:
<matplotlib.legend.Legend at 0x14fa91d68>

In [51]:
plt.hist(1-df_results["RF_prob"],
         cumulative=True, alpha=0.9,
         label="RF",
         color=color_RF)

plt.hist(1-df_results["LR_prob"],
         cumulative=True,
         label="LR",
         color=color_LR)

plt.ylim(ymin=1e-5)
plt.yscale("log")

plt.legend(loc="best")
plt.xlabel("1 - prob(dwarf)")
plt.ylabel("CDF")


/Users/egentry/anaconda3/envs/tf36/lib/python3.6/site-packages/matplotlib/axes/_base.py:3604: MatplotlibDeprecationWarning: 
The `ymin` argument was deprecated in Matplotlib 3.0 and will be removed in 3.2. Use `bottom` instead.
  alternative='`bottom`', obj_type='argument')
Out[51]:
Text(0, 0.5, 'CDF')

How do I actually get the images for the best candidate galaxies?

Check out ../data/get_training_galaxy_images.ipynb


In [ ]: