Sloan Digital Sky Survey Classification

In this notebook we use a logistic regression classifier to classify observations of space objects as stars, galaxies, or quasars.

We are using data from the Sloan Digital Sky Survey (Data Release 14).

Data Acquisition


In [1]:
import pandas as pd

sdss = pd.read_csv('../datasets/Skyserver_SQL2_27_2018 6_51_39 PM.csv', skiprows=1)  # skip the first line (a non-data header in the SkyServer export)

In [4]:
sdss.head(2)


Out[4]:
objid ra dec u g r i z run rerun camcol field specobjid class redshift plate mjd fiberid
0 1237648704577142822 183.531326 0.089693 19.47406 17.04240 15.94699 15.50342 15.22531 752 301 4 267 3722360139651588096 STAR -0.000009 3306 54922 491
1 1237648704577142859 183.598371 0.135285 18.66280 17.21449 16.67637 16.48922 16.39150 752 301 4 267 363814405953054720 STAR -0.000055 323 51615 541

The class column identifies an object as a galaxy, star, or quasar.
This is the output variable we will predict: a multi-class target.


In [5]:
sdss['class'].value_counts()


Out[5]:
GALAXY    4998
STAR      4152
QSO        850
Name: class, dtype: int64

In [6]:
sdss.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
objid        10000 non-null int64
ra           10000 non-null float64
dec          10000 non-null float64
u            10000 non-null float64
g            10000 non-null float64
r            10000 non-null float64
i            10000 non-null float64
z            10000 non-null float64
run          10000 non-null int64
rerun        10000 non-null int64
camcol       10000 non-null int64
field        10000 non-null int64
specobjid    10000 non-null uint64
class        10000 non-null object
redshift     10000 non-null float64
plate        10000 non-null int64
mjd          10000 non-null int64
fiberid      10000 non-null int64
dtypes: float64(8), int64(8), object(1), uint64(1)
memory usage: 1.4+ MB

The dataset has 10000 examples, 17 features and 1 target.


In [7]:
sdss.describe()


Out[7]:
objid ra dec u g r i z run rerun camcol field specobjid redshift plate mjd fiberid
count 1.000000e+04 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.0 10000.000000 10000.000000 1.000000e+04 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.237650e+18 175.529987 14.836148 18.619355 17.371931 16.840963 16.583579 16.422833 981.034800 301.0 3.648700 302.380100 1.645022e+18 0.143726 1460.986400 52943.533300 353.069400
std 1.173967e+12 47.783439 25.212207 0.828656 0.945457 1.067764 1.141805 1.203188 273.305024 0.0 1.666183 162.577763 2.013998e+18 0.388774 1788.778371 1511.150651 206.298149
min 1.237647e+18 8.235100 -5.382632 12.988970 12.799550 12.431600 11.947210 11.610410 308.000000 301.0 1.000000 11.000000 2.995782e+17 -0.004136 266.000000 51578.000000 1.000000
25% 1.237649e+18 157.370946 -0.539035 18.178035 16.815100 16.173333 15.853705 15.618285 752.000000 301.0 2.000000 184.000000 3.389250e+17 0.000081 301.000000 51900.000000 186.750000
50% 1.237649e+18 180.394514 0.404166 18.853095 17.495135 16.858770 16.554985 16.389945 756.000000 301.0 4.000000 299.000000 4.966580e+17 0.042591 441.000000 51997.000000 351.000000
75% 1.237651e+18 201.547279 35.649397 19.259232 18.010145 17.512675 17.258550 17.141447 1331.000000 301.0 5.000000 414.000000 2.881300e+18 0.092579 2559.000000 54468.000000 510.000000
max 1.237652e+18 260.884382 68.542265 19.599900 19.918970 24.802040 28.179630 22.833060 1412.000000 301.0 6.000000 768.000000 9.468834e+18 5.353854 8410.000000 57481.000000 1000.000000

No missing values!
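
We can also verify this programmatically. A quick check (a sketch, not part of the original run):

# Count missing values across all columns; 0 confirms a complete dataset
sdss.isnull().sum().sum()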

Data preprocessing

Remove unnecessary features


In [8]:
sdss.columns.values


Out[8]:
array(['objid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run', 'rerun',
       'camcol', 'field', 'specobjid', 'class', 'redshift', 'plate', 'mjd',
       'fiberid'], dtype=object)

From the project website, we can see that objid and specobjid are just identifiers for accessing rows in the original database. We will not need them for classification, as they are not related to the outcome.

Moreover, the features run, rerun, camcol and field describe the camera and the scan at the moment the observation was made; e.g. run identifies the scan that captured the object.

We will drop these columns, as any correlation with the outcome would be coincidental.


In [9]:
sdss.drop(['objid', 'run', 'rerun', 'camcol', 'field', 'specobjid'], axis=1, inplace=True)

In [10]:
sdss.head(2)


Out[10]:
ra dec u g r i z class redshift plate mjd fiberid
0 183.531326 0.089693 19.47406 17.04240 15.94699 15.50342 15.22531 STAR -0.000009 3306 54922 491
1 183.598371 0.135285 18.66280 17.21449 16.67637 16.48922 16.39150 STAR -0.000055 323 51615 541

In [11]:
# Extract output values
y = sdss['class'].copy() # copy the target values into y

sdss.drop(['class'], axis=1, inplace=True) # then, drop y column

Training

Split the dataset

We will split the data into a training set and a test set (70/30). The models will be trained on the training set and evaluated on the test set.


In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sdss, y, test_size=0.3)
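
Note that train_test_split shuffles the rows randomly, so the exact split (and the scores below) will vary between runs. A sketch of a reproducible, stratified variant (the seed value is a hypothetical choice, not used above):

X_train, X_test, y_train, y_test = train_test_split(
    sdss, y, test_size=0.3,
    random_state=42,  # hypothetical seed, fixes the split for reproducibility
    stratify=y)       # preserve the GALAXY/STAR/QSO proportions in both splits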

Data scaling

Scaling all features to the [0, 1] interval prevents attributes with exceptionally large values from dominating the others and makes some algorithms converge faster. Note that the scaler is fit on the training data only, so no information leaks from the test set.


In [13]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)
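
As a sanity check (a sketch, not in the original run), the scaled training features should lie exactly in [0, 1], while the test features may fall slightly outside that range because the scaler was fit on the training data only:

print(X_train_scaled.min(), X_train_scaled.max())  # expected: 0.0 1.0
print(X_test_scaled.min(), X_test_scaled.max())    # may slightly exceed [0, 1]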

Fit one-versus-all model

The one-versus-all (or one-versus-rest) approach is exactly what the name suggests: we train one binary classifier per category.

For example, one classifier treats all the stars as the positive class and everything else (here, galaxies and quasars) as the negative class; it simply learns to separate stars from the other sky objects.

We learn one such model for each class: stars, then galaxies, then quasars.

To predict, the class with the highest probability wins. In other words, if the probability of 'star versus everything else' is higher than for the other classes, the object is classified as a star.

All of this is handled by the sklearn classifier when multi_class is set to 'ovr' (for one-versus-rest):


In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
# Create One-vs-rest logistic regression object
ovr = LogisticRegression(multi_class='ovr', solver='liblinear')

In [16]:
modelOvr = ovr.fit(X_train_scaled, y_train)

In [17]:
modelOvr.score(X_test_scaled, y_test)


Out[17]:
0.90766666666666662

Even a very simple classifier like this already achieves a high accuracy of about 91%.
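
We can make the "highest probability wins" rule explicit by inspecting the per-class probabilities. A sketch (not part of the original run):

import numpy as np

probs = modelOvr.predict_proba(X_test_scaled[:3])  # one row per example, one column per class
print(modelOvr.classes_)                           # column order of the classes
print(probs)
# The predicted class is the column with the highest probability,
# which matches what modelOvr.predict() returns:
print(modelOvr.classes_[np.argmax(probs, axis=1)])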

Fit a logistic regression model with the cross-entropy loss function

Sklearn also offers a classifier that uses the cross-entropy as its loss function.
The loss minimised is the multinomial loss fit across the entire probability distribution.
Note that this does not work with the liblinear solver, so we are going to use the Newton-CG optimisation algorithm. (New in sklearn 0.18 is the Stochastic Average Gradient descent solver for the multinomial case; trying other solvers can be useful when the model is not converging.)


In [18]:
# Create cross-entropy-loss logistic regression object
xe = LogisticRegression(multi_class='multinomial', solver='newton-cg')

In [19]:
# Train model
modelXE = xe.fit(X_train_scaled, y_train)

In [20]:
preds = modelXE.predict(X_test_scaled)

In [21]:
modelXE.score(X_test_scaled, y_test)


Out[21]:
0.91633333333333333

A small improvement over the one-versus-rest model.
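
Accuracy alone hides how each class fares, in particular the under-represented quasars. A per-class breakdown (a sketch, not in the original run) could use sklearn's confusion matrix and classification report:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, preds, labels=modelXE.classes_))  # rows: true class, columns: predicted
print(classification_report(y_test, preds))                      # precision/recall/F1 per class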