Galaxies in VVV

Classification with Machine Learning

Visually confirmed objects vs. candidate objects

We use a catalogue of galaxies identified in the VVV tiles d010 and d115 by L. Baravalle.

To see where these tiles are located, we use the VVV survey area map.

In these tiles they found 574 objects with morphological, photometric, and colour properties typical of galaxies. Ninety of them have been visually inspected and constitute a bona fide sample of galaxies in the VVV.

Data analysis

First we load the required libraries.


In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import display

from astropy.io import ascii
from astropy.table import Table, Column

%matplotlib inline

We read the data from the table of Baravalle et al. (2017).

The catalogue is in CDS format, but the missing entries had to be modified first: the -- values were simply replaced with blank spaces.
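That cleaning step is not shown in the notebook; below is a minimal sketch of one way it could be reproduced, assuming the raw CDS table is edited in place and that '--' appears only as a missing-value marker (the file name Table1.all.txt is the one used in the next cell).

# Hypothetical sketch of the cleaning step described above: blank out the '--'
# markers so the CDS reader treats those entries as missing values. Replacing
# them with two spaces preserves the fixed-width column layout of the file.
with open('Table1.all.txt') as fin:
    raw = fin.read()
with open('Table1.all.txt', 'w') as fout:
    fout.write(raw.replace('--', '  '))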


In [66]:
cat = ascii.read('Table1.all.txt', format='cds')

We create a column in which to flag the visually identified objects.


In [67]:
# Objects that were visually inspected carry a trailing '*' in their Id
visual = []
for row in cat:
    if row['Id'].endswith('*'):
        visual.append(1)
    else:
        visual.append(0)

In [68]:
# Boolean flag: True for visually confirmed objects, False for candidates
vis = Column(data=visual, name='Visual', dtype='bool')
cat.add_column(vis)

The catalogue then looks like this:


In [71]:
cat


Out[71]:
<Table masked=True length=574>
Id                       RAh RAm RAs    DE- DEd DEm DEs    Z     Y     J     H     Ks     Z_2"  Y_2"  J_2"  H_2"  Ks_2"  R_1/2 C    ε    n     Visual
                          h  min  s         deg  '   "    mag   mag   mag   mag   mag    mag   mag   mag   mag   mag    arcsec
VVV-J114419.03-603025.9  11 44 19.03   -  60 30 25.9  15.86 15.85 15.76 15.3  15.02  15.94 15.96 15.77 15.19 15.06  2.46 3.54 0.14 3.97  False
VVV-J114428.39-603158.4  11 44 28.39   -  60 31 58.4     --    -- 17.17 17.0  16.44     --    -- 17.14 16.91 16.46  1.03 2.34 0.5  7.38  False
VVV-J114431.78-601626.8  11 44 31.78   -  60 16 26.8     --    -- 16.81 16.34 16.27     --    -- 16.82 16.35 16.25  1.25 3.34 0.1  7.18  False
VVV-J114433.70-602742.8  11 44 33.7    -  60 27 42.8  16.82 16.78 16.63 16.2  16.11  16.77 16.83 16.65 16.13 16.08  1.09 2.56 0.18 4.12  False
VVV-J114450.83-603356.9  11 44 50.83   -  60 33 56.9  17.18 17.02 16.75 16.22 16.06  16.97 17.16 16.8  16.13 16.05  2.2  3.65 0.43 8.36  False
VVV-J114453.09-601805.8  11 44 53.09   -  60 18  5.8  16.38 16.35 16.28 15.75 15.73  16.36 16.4  16.29 15.77 15.72  1.48 2.97 0.46 4.62  False
VVV-J114455.73-600020.9  11 44 55.73   -  60  0 20.9  17.26 17.17 17.03 16.66 16.5   17.2  17.28 17.06 16.64 16.46  1.48 2.66 0.26 1.15  False
VVV-J114456.52-603249.6  11 44 56.52   -  60 32 49.6     --    -- 16.56 16.02 15.86     --    -- 16.6  15.96 15.89  1.25 2.46 0.35 2.64  False
VVV-J114457.91-603958.3  11 44 57.91   -  60 39 58.3  17.8  17.76 17.6  17.31 16.43  17.76 17.77 17.57 17.32 16.4   2.51 3.09 0.42 4.9   False
...
VVV-J135038.26-641427.9  13 50 38.26   -  64 14 27.9     --    -- 16.78 16.68 16.36     --    -- 16.76 16.69 16.26  1.03 3.15 0.63 4.49  False
VVV-J135041.18-642132.0  13 50 41.18   -  64 21 32.0  16.85 16.27 16.12 15.7  15.65  16.28 17.0  15.96 15.59 15.41  1.08 2.68 0.57 2.48  False
VVV-J135041.21-641842.1  13 50 41.21   -  64 18 42.1     --    -- 16.74 16.32 16.32     --    -- 16.79 16.42 16.29  1.11 2.27 0.14 2.77  False
VVV-J135043.34-641748.5  13 50 43.34   -  64 17 48.5  17.56 16.85 16.73 16.26 16.21  16.79 17.59 16.71 16.21 16.13  1.01 2.17 0.45 1.33  False
VVV-J135044.11-641907.3  13 50 44.11   -  64 19  7.3     --    -- 14.67 14.14 13.91     --    -- 14.76 14.16 13.99  1.87 3.31 0.44 3.66  False
VVV-J135047.21-642130.9  13 50 47.21   -  64 21 30.9  17.47 17.04 16.55 16.28 16.26  17.02 17.61 16.77 16.17 16.08  1.39 3.45 0.49 4.56  False
VVV-J135047.21-642243.0  13 50 47.21   -  64 22 43.0  17.39 17.06 16.86 16.48 16.4   17.12 17.58 16.92 16.46 16.36  2.22 3.44 0.15 5.81  False
VVV-J135047.69-641911.6  13 50 47.69   -  64 19 11.6  16.56 16.77 16.64 16.13 15.96  16.84 16.7  16.57 16.08 15.83  1.26 3.12 0.61 1.22  False
VVV-J135050.30-642103.2  13 50 50.3    -  64 21  3.2     --    -- 16.23 16.02 15.87     --    -- 16.36 16.34 15.76  1.36 3.42 0.6  6.78  False
VVV-J135054.74-642240.8  13 50 54.74   -  64 22 40.8     --    -- 17.4  16.93 16.46     --    -- 17.27 16.8  16.29  1.1  3.25 0.35 4.08  False

We save the catalogue in CSV format in order to preserve the visual-identification column.


In [72]:
cat.write('cat_lau.csv', format='csv', overwrite=True)

Let's look at the distributions of the data and the dependencies between the different parameters.


In [73]:
# Convert the astropy Table to a pandas DataFrame for plotting and the ML steps below
pdcat = cat.to_pandas()
print(pdcat.columns)


Index([u'Id', u'RAh', u'RAm', u'RAs', u'DE-', u'DEd', u'DEm', u'DEs', u'Z',
       u'Y', u'J', u'H', u'Ks', u'Z$_{2\prime\prime}$', u'Y$_{2\prime\prime}$',
       u'J$_{2\prime\prime}$', u'H$_{2\prime\prime}$', u'Ks$_{2\prime\prime}$',
       u'R$_{1/2}$', u'C', u'$\epsilon$', u'n', u'Visual'],
      dtype='object')

In [74]:
numeric_cols = pdcat[[u'Z', u'Y', u'J', u'H', u'Ks', u'Z$_{2\prime\prime}$', u'Y$_{2\prime\prime}$', 
                      u'J$_{2\prime\prime}$', u'H$_{2\prime\prime}$', u'Ks$_{2\prime\prime}$', 
                      u'R$_{1/2}$', u'C', u'$\epsilon$', u'n']]
pd.scatter_matrix(numeric_cols,  alpha=0.3, figsize=(12, 12), diagonal='kde')
plt.show()


Now that we have checked the distributions of the magnitudes and of the extra parameters, we can see that, at first glance, the sample comes from a single population of objects.
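One quick way to back up this impression quantitatively (not part of the original analysis; just a sketch that assumes scipy is available and uses the Ks band as an example) is a two-sample Kolmogorov-Smirnov test between the visually confirmed objects and the remaining candidates:

# Sketch: compare the Ks-band magnitude distributions of visually confirmed
# galaxies vs. the remaining candidates with a two-sample KS test.
from scipy import stats

is_confirmed = pdcat['Visual'].astype(bool)
ks_confirmed = pdcat.loc[is_confirmed, u'Ks'].dropna()
ks_candidates = pdcat.loc[~is_confirmed, u'Ks'].dropna()
stat, pval = stats.ks_2samp(ks_confirmed, ks_candidates)
print('KS statistic = {:.3f}, p-value = {:.3f}'.format(stat, pval))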


In [103]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

In [76]:
def experiment(clf, x, y, nfolds=10):
    """Run stratified k-fold cross-validation and collect the out-of-fold results."""
    skf = StratifiedKFold(n_splits=nfolds)
    probabilities = np.array([])
    predictions = np.array([])
    y_testing = np.array([])

    for train, test in skf.split(x, y):
        # fit the classifier on the k-1 training folds
        x_train = x[train]
        y_train = y[train]
        clf.fit(x_train, y_train)

        # predict labels and class-0 probabilities on the held-out fold
        x_test = x[test]
        y_test = y[test]
        pr = clf.predict(x_test)
        probs = clf.predict_proba(x_test)[:, 0]

        probabilities = np.hstack([probabilities, probs])
        predictions = np.hstack([predictions, pr])
        y_testing = np.hstack([y_testing, y_test])

    print(metrics.classification_report(y_testing, predictions))
    # ROC curve for the positive class: 1 - P(class 0) = P(class 1)
    fpr, tpr, thresholds = metrics.roc_curve(y_testing, 1. - probabilities)
    roc_auc = metrics.auc(fpr, tpr)
    return {'fpr': fpr,
            'tpr': tpr,
            'thresh': thresholds,
            'roc_auc': roc_auc,
            'y_test': y_testing,
            'predictions': predictions,
            'probabilities': probabilities,
            'confusion_matrix': metrics.confusion_matrix(y_testing, predictions),
            }

In [98]:
# Feature matrix: total and 2-arcsec aperture magnitudes in J, H and Ks,
# plus the morphological parameters (R_1/2, C, ellipticity and Sersic index)
X = pdcat[[u'J', u'H', u'Ks', 
           u'J$_{2\prime\prime}$', u'H$_{2\prime\prime}$', u'Ks$_{2\prime\prime}$', 
           u'R$_{1/2}$', u'C', u'$\epsilon$', u'n']]
X = X.values

# Labels: 1 for visually confirmed galaxies, 0 for candidates
# Y = pdcat[u'Visual']
# Y = Y.as_matrix()

Y = np.asarray(pdcat['Visual'], dtype="int")

In [99]:
# Remove two objects (rows 378 and 442) from the features and the labels
X = np.delete(X, [378, 442], axis=0)
Y = np.delete(Y, [378, 442], axis=0)

In [121]:
res_Dtree = experiment(DecisionTreeClassifier(max_leaf_nodes=5), X, Y, 10)


             precision    recall  f1-score   support

        0.0       0.86      0.95      0.90       482
        1.0       0.34      0.14      0.20        90

avg / total       0.77      0.82      0.79       572


In [122]:
res_knn = experiment(neighbors.KNeighborsClassifier(n_neighbors=3, weights='uniform'), X, Y, 10)


             precision    recall  f1-score   support

        0.0       0.86      0.92      0.89       482
        1.0       0.29      0.17      0.21        90

avg / total       0.77      0.80      0.78       572


In [123]:
res_svc = experiment(svm.SVC(kernel='linear', probability=True), X, Y, 10)


             precision    recall  f1-score   support

        0.0       0.84      1.00      0.91       482
        1.0       0.00      0.00      0.00        90

avg / total       0.71      0.84      0.77       572


In [124]:
res_Rforest = experiment(RandomForestClassifier(n_estimators=100, min_samples_leaf=5), X, Y, 10)


             precision    recall  f1-score   support

        0.0       0.85      0.98      0.91       482
        1.0       0.29      0.06      0.09        90

avg / total       0.76      0.83      0.78       572


In [125]:
res_C45 = experiment(DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=5), X, Y, 10)


             precision    recall  f1-score   support

        0.0       0.84      1.00      0.91       482
        1.0       0.00      0.00      0.00        90

avg / total       0.71      0.84      0.77       572

Now that we have tried several classifiers, we can compare them with ROC curves, which plot the true positive rate against the false positive rate as the decision threshold varies.


In [127]:
plt.figure(figsize=(6, 6))
#plt.figaspect(.8)

fpr = res_Dtree['fpr']
tpr = res_Dtree['tpr']
roc_auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1.5, color='blue', label='Dtree AUC={:06.4f}'.format(roc_auc))

fpr = res_C45['fpr']
tpr = res_C45['tpr']
roc_auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1.5, color='red', label='C45 AUC={:06.4f}'.format(roc_auc))

fpr = res_Rforest['fpr']
tpr = res_Rforest['tpr']
roc_auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1.5, color='green', label='Rforest AUC={:06.4f}'.format(roc_auc))

fpr = res_svc['fpr']
tpr = res_svc['tpr']
roc_auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1.5, color='black', label='SVM AUC={:06.4f}'.format(roc_auc))

fpr = res_knn['fpr']
tpr = res_knn['tpr']
roc_auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1.5, color='magenta', label='KNN AUC={:06.4f}'.format(roc_auc))


plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.title('Receiver Operating Characteristic Curve')


Out[127]:
<matplotlib.text.Text at 0x7fe975ecd350>
