Find out Blended stars

Today our job is to find out blending stars and distinguish them with the true signals.

When two stars are close to each other in the CCD, part of their lights may fall into the same pixel. If one of the star has a eclipsing signal, the others will see a similar signal in their light curves. Usually the mirrored signal will be more diluted than the main signal. But there could be exceptions.



In [1]:

    
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
import scipy as sp
%matplotlib inline









    



/Users/chelsea/miniconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

Let's create two stars as if their point spread functions (PSFs) are one-d gaussians with full width half maximum (FWHM) of 1.2 pixels, which means sigma ~0.5 pixel. Star one is twice the brightness as star two, and the two stars are one pixel apart. Note I haven't pixelize the flux in the left figure below.



In [2]:

    
from scipy.stats import norm
FWHM=1.2
x=np.linspace(-3,3,50)
rv = norm(scale=FWHM/2.35)
starone=rv.pdf(x)*1000.
rv = norm(scale=FWHM/2.35)
startwo=(rv.pdf(x-1.0))*500.
fig=plt.figure()
ax=fig.add_subplot(121)
ax.plot(x,starone,label="star1")
ax.plot(x,startwo,label="star2")
ax.plot(x,starone+startwo,label="Total")
ax.set_xlabel("Pixel")
ax.set_ylabel("Flux")
ax.legend()
x_pixel=np.arange(7)-3
def pixelize(x,flux,x_pixel):
    flux_pixel=np.zeros(len(x_pixel))
    for i in xrange(len(x_pixel)):
        index=(x>=x_pixel[i]-0.5)*(x<x_pixel[i]+0.5) 
        flux_pixel[i]=np.sum(flux[index])
    return flux_pixel
starone_pixel=pixelize(x,starone,x_pixel)
startwo_pixel=pixelize(x,startwo,x_pixel)
ax2=fig.add_subplot(122)
ax2.plot(x_pixel,starone_pixel,label="star1")
ax2.plot(x_pixel,startwo_pixel,label="star2")
total=starone_pixel+startwo_pixel
ax2.plot(x_pixel,total,label="Total")
ax2.set_xlabel("Pixel")
#ax2.set_ylabel("Flux")
#ax2.legend()
plt.show()

Let's first create a Eclipse signal that makes star one dimmer by 10%.



In [3]:

    
fig=plt.figure()
ax=fig.add_subplot(121)
ax.plot(x,starone*0.9,label="star1")
ax.plot(x,startwo,label="star2")
ax.plot(x,starone*0.9+startwo,label="Total")
ax.set_xlabel("Pixel")
ax.set_ylabel("Flux")
ax.legend()
starone_pixel_ec=pixelize(x,starone*0.9,x_pixel)
startwo_pixel_ec=pixelize(x,startwo,x_pixel)
ax2=fig.add_subplot(122)
ax2.plot(x_pixel,starone_pixel_ec,label="star1")
ax2.plot(x_pixel,startwo_pixel_ec,label="star2")
total_ec=starone_pixel_ec+startwo_pixel_ec
ax2.plot(x_pixel,total_ec,label="Total")
ax2.set_xlabel("Pixel")
#ax2.set_ylabel("Flux")
plt.show()

Let's look at the actual dimming of two stars assuming we take a radius equals to one apperture around both stars.



In [4]:

    
plt.scatter(np.arange(2),[np.sum(total[2:5]),np.sum(total_ec[2:5])],marker='o',s=50,label='star1_obs')
plt.scatter(np.arange(2),[np.sum(total[3:6]),np.sum(total_ec[3:6])],marker='d',s=50,label='star2_obs')
plt.legend()









    Out[4]:





<matplotlib.legend.Legend at 0x10ec7bf90>

Both star dimmed, however, star 1 dimed by about 7%, while star 2 dimed by slightly less than 6%.

Task:

Try to change the configuration of the two stars and compare the results.

For example:

what if the transit happen around Star 2;
what if the two stars are closer/further away from each other;
what if the brightness ratio vary;



In [5]:

    
import pandas as pd
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.utils import shuffle
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV









    



/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/cross_validation.py:42: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)



In [47]:

    
def make_ROC_curve(testY, predY, name):
    fig2 = plt.figure()
    ax= fig2.add_subplot(1,1,1)
    fpr, tpr, _ = roc_curve(testY, predY)
    ax.plot(fpr, tpr, label = name)
    ax.set_title(('ROC Curve for %s') % name)
    ax.set_ylabel('True Positive Rate')
    ax.set_xlabel('False Positive Rate')
def collect_lc_feature(idlist):
    LCfeature=np.zeros([len(idlist),481])
    count=0
    for i in idlist:
        #print i
        infile="LTFsmall/"+str(i)+".ltf"
        lc=np.loadtxt(infile)[:,1]
        LCfeature[count,0]=i
        LCfeature[count,1:]=lc
        count+=1
    return LCfeature
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(2)
    plt.xticks(tick_marks, ['false positives', 'transits'], rotation=45)
    plt.yticks(tick_marks, ['false positives', 'transits'])
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
def fit(model,name,data,cv=True):
    trainX,trainY,testX,testY,X,Y=data
    model.fit(trainX, trainY)
    predY = model.predict(testX)
    f1score = metrics.f1_score(testY, predY)
    cm = metrics.confusion_matrix(testY, predY)
    plot_confusion_matrix(cm)
    predY=model.predict_proba(testX)[:,1]
    rocscore = metrics.roc_auc_score(testY, predY)
    precision, recall, thresholds = metrics.precision_recall_curve(testY, predY)
    aucscore=metrics.auc(precision,recall,reorder=True)
    
    print "#####################################"
    print "Result using",model
    print "f1 score from train test split %f" % f1score
    print "roc score from train test split %f" % rocscore
    print "auc score from train test split %f" % aucscore
    if cv:
        f1cvscore= cross_val_score(model, X, Y, cv = 5, scoring = 'f1')
        cvscore= cross_val_score(model, X, Y, cv = 5, scoring = 'roc_auc')
        print "roc_auc score from CV5 %f +- %f" % (np.mean(cvscore),np.std(cvscore))
        print "f1 score from CV5 %f +- %f" % (np.mean(f1cvscore),np.std(f1cvscore))
    
    
    print cm
    make_ROC_curve(testY,predY,name)
    return



In [7]:

    
df=pd.read_csv("TESSfield_19h_44d_combinedfeatures.csv")

Let's first have compare the Manuel selected signals with the catalog selected signals.

As a reminder:

Manuel Y is those signals selected by human eye as looks like transit shape
Catalog Y is those stars have signals injected in them at the original input catalog that we determined have enough signal to noise to be recovered



In [8]:

    
#print df.shape
plt.plot(df['ManuleY']+np.random.randn(df.shape[0])*0.1,df['CatalogY']+np.random.randn(df.shape[0])*0.1,'.')
plt.xlim([-0.4,1.3])
plt.ylim([-0.4,1.3])
plt.xlabel('ManuleY')
plt.ylabel('CatalogY')









    Out[8]:





<matplotlib.text.Text at 0x112ee57d0>

There are four groups of cluster above:

left bottom corner: manule selection determines most of the stars without a signal correctly.
right top corner: manule selection determines some of the stars with a catalog signal correctly.
left top coner: singals that's not sensitive to human eyes.
right bottom coner: why would signals that's not in catalog appears to be ture to human eye?

The answer to the question is blending. And our task today is to further distinuish the stars affected by blending and those with true astrophysical signals.



In [9]:

    
Manule_selected=df.loc[df['ManuleY']==1,:]
Manule_selected.shape









    Out[9]:





(700, 32)

We drop all the irrelevant columns and all the informations from catalog to creat our innitial feature set.



In [10]:

    
X=Manule_selected.drop(['Ids','CatalogY','ManuleY','CombinedY','Catalog_Period','Depth','Catalog_Epoch','SNR','Catmag','CatX','CatY'],axis=1)



In [11]:

    
Y=Manule_selected['CatalogY']



In [12]:

    
trainX, testX, trainY, testY= train_test_split(X, Y,test_size = 0.2)
data=[trainX,trainY,testX,testY,X,Y]
print X.shape, Y[Y==1].shape









    



(700, 21) (235,)

This is a slightly inbalanced sample, with the false sample twice as large as the true samples.



In [26]:

    
model=RandomForestClassifier(n_estimators=3000,n_jobs=-1)
name="RFC"
fit(model,name,data)









    



#####################################
Result using RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
f1 score from train test split 0.297297
roc score from train test split 0.616596
auc score from train test split 0.119950
roc_auc score from CV5 0.697575 +- 0.054884
[[77 19]
 [33 11]]

Note that we added the new auc score which computes the area under the precision recall curve.



In [27]:

    
model=GradientBoostingClassifier(n_estimators=3000)
name="GBC"
fit(model,name,data)









    



#####################################
Result using GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=3000,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
f1 score from train test split 0.417582
roc score from train test split 0.611979
auc score from train test split 0.167026
roc_auc score from CV5 0.656509 +- 0.037402
[[68 28]
 [25 19]]



In [28]:

    
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=3000)
model.fit(trainX,trainY)
name="XGBoost"
fit(model,name,data)









    



#####################################
Result using XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=3000, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
f1 score from train test split 0.395349
roc score from train test split 0.612926
auc score from train test split 0.142112
roc_auc score from CV5 0.690231 +- 0.053082
[[71 25]
 [27 17]]

Let's make some new features: One feature that makes sense is the to compare the detected period of planet with any other detected signals and compute an dimentionless distance.



In [84]:

    
import scipy as sp
from scipy import special
def find_distance(Parr,Earr,threshold=0.005):
    #print Parr
    dist_P=np.zeros(len(Parr))+0.5
    dist_E=np.zeros(len(Parr))+0.5
    for i in xrange(len(Parr)):
        for j in xrange(len(Parr)):
            if i==j:
                continue
            if Parr[j]>Parr[i]:
                delt_P= np.abs(np.abs(Parr[j]-Parr[i])/Parr[i]-round(np.abs(Parr[j]-Parr[i])/Parr[i]))
            else:
                delt_P= np.abs(np.abs(Parr[j]-Parr[i])/Parr[j]-round(np.abs(Parr[j]-Parr[i])/Parr[j]))
            if dist_P[i]>delt_P:
                dist_P[i]=delt_P
            
            delt_E=np.abs(np.abs(Earr[j]-Earr[i])/Parr[i]-round(np.abs(Earr[j]-Earr[i])/Parr[i]))
            if dist_E[i]>delt_E:
                dist_E[i]=delt_E
        if dist_P[i]==0:
            dist_P[i]=10
        else: 
            dist_P[i]=np.sqrt(2)*sp.special.erfcinv(dist_P[i])
        #print i, dist_P[i]
        if dist_E[i]==0:
            dist_E[i]=10
        else:
            dist_E[i]=np.sqrt(2)*sp.special.erfcinv(dist_E[i])
    return [dist_P,dist_E]
#add back the magnitude of stars from the catalog
X=Manule_selected.drop(['Ids','CatalogY','ManuleY','CombinedY','Catalog_Period','Depth','Catalog_Epoch','SNR','CatX','CatY'],axis=1)
Pdist,Edist=find_distance(np.array(X['BLS_Period_1_0']),np.array(X['BLS_Tc_1_0']))
X['dist_E']=Edist
X['dist_P']=Pdist
X['diff_mag']=X['Catmag']-X['BLS_OOTmag_1_0']
plt.plot(Pdist,Edist,'.')









    Out[84]:





[<matplotlib.lines.Line2D at 0x11869b850>]



In [85]:

    
trainX, testX, trainY, testY= train_test_split(X, Y,test_size = 0.2)
data=[trainX,trainY,testX,testY,X,Y]
print X.shape, Y[Y==1].shape









    



(700, 25) (235,)



In [50]:

    
model=RandomForestClassifier(n_estimators=5000,class_weight={0:1,1:3},n_jobs=-1)
name="RFC"
fit(model,name,data)









    



#####################################
Result using RandomForestClassifier(bootstrap=True, class_weight={0: 1, 1: 3},
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
f1 score from train test split 0.376812
roc score from train test split 0.550904
auc score from train test split 0.163879
roc_auc score from CV5 0.688790 +- 0.044227
f1 score from CV5 0.362331 +- 0.192535
[[84  9]
 [34 13]]



In [51]:

    
model=GradientBoostingClassifier(n_estimators=3000)
name="GBC"
fit(model,name,data)









    



#####################################
Result using GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=3000,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
f1 score from train test split 0.425532
roc score from train test split 0.560055
auc score from train test split 0.103696
roc_auc score from CV5 0.661542 +- 0.039940
f1 score from CV5 0.461036 +- 0.081371
[[66 27]
 [27 20]]



In [86]:

    
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=3000,max_depth=18,colsample_bytree=1.0,min_child_weight=1.0,learning_rate=0.001)
model.fit(trainX,trainY)
name="XGBoost"
fit(model,name,data)









    



#####################################
Result using XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1.0,
       gamma=0, learning_rate=0.001, max_delta_step=0, max_depth=18,
       min_child_weight=1.0, missing=None, n_estimators=3000, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
f1 score from train test split 0.494118
roc score from train test split 0.697781
auc score from train test split 0.291113
roc_auc score from CV5 0.724365 +- 0.034988
f1 score from CV5 0.504518 +- 0.052050
[[76 17]
 [26 21]]



In [21]:

    
featurelist=X.columns
rfc= RandomForestClassifier(n_estimators=3000)
rfc.fit(trainX, trainY)









    Out[21]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)



In [22]:

    
importances = rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
print indices.shape,trainX.shape
# Print the feature ranking
print("Feature ranking:")
threshold=0.03
droplist=[]
for f in range(X.shape[1]):
    try:
        if importances[indices[f]]<threshold:
            droplist.append(featurelist[indices[f]])
        print("%d. feature %d (%s %f)" % (f + 1, indices[f], featurelist[indices[f]],importances[indices[f]]))
    except:
        pass









    



(26,) (560, 26)
Feature ranking:
1. feature 22 (Catmag 0.065428)
2. feature 3 (BLS_SR_1_0 0.062854)
3. feature 5 (BLS_Depth_1_0 0.059267)
4. feature 18 (BLS_Whitenoise_1_0 0.050444)
5. feature 8 (BLS_OOTmag_1_0 0.049806)
6. feature 11 (BLS_deltaChi2_1_0 0.049762)
7. feature 24 (dist_E 0.047468)
8. feature 25 (dist_P 0.043370)
9. feature 19 (BLS_SignaltoPinknoise_1_0 0.040501)
10. feature 7 (BLS_Qingress_1_0 0.040213)
11. feature 20 (CatX 0.039680)
12. feature 4 (BLS_SDE_1_0 0.039580)
13. feature 12 (BLS_fraconenight_1_0 0.039018)
14. feature 9 (BLS_i1_1_0 0.036927)
15. feature 17 (BLS_Rednoise_1_0 0.036609)
16. feature 10 (BLS_i2_1_0 0.035976)
17. feature 2 (BLS_SN_1_0 0.035917)
18. feature 21 (CatY 0.031186)
19. feature 23 (Ids.1 0.031009)
20. feature 0 (BLS_Period_1_0 0.028615)
21. feature 6 (BLS_Qtran_1_0 0.028235)
22. feature 16 (BLS_Npointsaftertransit_1_0 0.025640)
23. feature 13 (BLS_Npointsintransit_1_0 0.024095)
24. feature 15 (BLS_Npointsbeforetransit_1_0 0.022962)
25. feature 1 (BLS_Tc_1_0 0.019492)
26. feature 14 (BLS_Ntransits_1_0 0.015947)



In [23]:

    
X_selected=X.drop(droplist,axis=1)
X_selected.head()









    Out[23]:






  
    
      
      BLS_SN_1_0
      BLS_SR_1_0
      BLS_SDE_1_0
      BLS_Depth_1_0
      BLS_Qingress_1_0
      BLS_OOTmag_1_0
      BLS_i1_1_0
      BLS_i2_1_0
      BLS_deltaChi2_1_0
      BLS_fraconenight_1_0
      BLS_Rednoise_1_0
      BLS_Whitenoise_1_0
      BLS_SignaltoPinknoise_1_0
      CatX
      CatY
      Catmag
      Ids.1
      dist_E
      dist_P
    
  
  
    
      400
      12.53336
      0.00634
      7.57704
      0.03424
      0.36902
      7.37420
      0.94231
      1.02001
      -403.87175
      0.05159
      0.00310
      0.00829
      34.89784
      325.972487
      11.685035
      12.568078
      401
      3.291958
      4.438688
    
    
      402
      12.73811
      0.00347
      7.72676
      0.01991
      0.40737
      7.02694
      0.44397
      0.52002
      -307.10972
      0.04950
      0.00170
      0.00463
      36.61236
      324.873895
      14.207794
      13.762556
      403
      4.520055
      3.999749
    
    
      403
      12.55868
      0.00386
      7.72838
      0.02155
      0.38419
      7.23076
      0.44914
      0.52277
      -317.47072
      0.05163
      0.00183
      0.00495
      36.48016
      323.930348
      9.362986
      13.920750
      404
      3.512163
      4.165658
    
    
      505
      8.73484
      0.00179
      9.24720
      0.01374
      0.24426
      8.69045
      0.01177
      0.03695
      -53.01366
      0.25765
      0.00099
      0.00828
      9.67684
      191.119809
      0.133972
      13.979692
      506
      4.524330
      3.158509
    
    
      593
      11.14045
      0.01313
      8.34805
      0.08545
      0.22411
      8.64580
      0.19679
      0.22928
      -561.32033
      0.22492
      0.00102
      0.00968
      59.23343
      81.152584
      6.024888
      13.277000
      594
      3.510703
      3.500817



In [24]:

    
model=RandomForestClassifier(n_estimators=3000,n_jobs=-1,class_weight='balanced_subsample')
name="RFC"
fit(model,name,data)









    



#####################################
Result using RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=3000, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
f1 score from train test split 0.277778
roc score from train test split 0.614228
auc score from train test split 0.107593
roc_auc score from CV5 0.692839
[[78 18]
 [34 10]]

Task:

Try to improve the result:

create more features, for example calculate the minimum distance to another star.
Fine tune the algorithms.



In [ ]:

	BLS_SN_1_0	BLS_SR_1_0	BLS_SDE_1_0	BLS_Depth_1_0	BLS_Qingress_1_0	BLS_OOTmag_1_0	BLS_i1_1_0	BLS_i2_1_0	BLS_deltaChi2_1_0	BLS_fraconenight_1_0	BLS_Rednoise_1_0	BLS_Whitenoise_1_0	BLS_SignaltoPinknoise_1_0	CatX	CatY	Catmag	Ids.1	dist_E	dist_P
400	12.53336	0.00634	7.57704	0.03424	0.36902	7.37420	0.94231	1.02001	-403.87175	0.05159	0.00310	0.00829	34.89784	325.972487	11.685035	12.568078	401	3.291958	4.438688
402	12.73811	0.00347	7.72676	0.01991	0.40737	7.02694	0.44397	0.52002	-307.10972	0.04950	0.00170	0.00463	36.61236	324.873895	14.207794	13.762556	403	4.520055	3.999749
403	12.55868	0.00386	7.72838	0.02155	0.38419	7.23076	0.44914	0.52277	-317.47072	0.05163	0.00183	0.00495	36.48016	323.930348	9.362986	13.920750	404	3.512163	4.165658
505	8.73484	0.00179	9.24720	0.01374	0.24426	8.69045	0.01177	0.03695	-53.01366	0.25765	0.00099	0.00828	9.67684	191.119809	0.133972	13.979692	506	4.524330	3.158509
593	11.14045	0.01313	8.34805	0.08545	0.22411	8.64580	0.19679	0.22928	-561.32033	0.22492	0.00102	0.00968	59.23343	81.152584	6.024888	13.277000	594	3.510703	3.500817