Sloan Digital Sky Survey Classification

In this notebook we use a logistic regression classifier to classify observations of space objects as stars, galaxies, or quasars.

We are using data from the Sloan Digital Sky Survey (Data Release 14).

Data Acquisition


In [1]:
import pandas as pd

sdss = pd.read_csv('../datasets/Skyserver_SQL2_27_2018 6_51_39 PM.csv', skiprows=1)  # skip the first line (a non-data header in the SkyServer export)

In [4]:
sdss.head(2)


Out[4]:
objid ra dec u g r i z run rerun camcol field specobjid class redshift plate mjd fiberid
0 1237648704577142822 183.531326 0.089693 19.47406 17.04240 15.94699 15.50342 15.22531 752 301 4 267 3722360139651588096 STAR -0.000009 3306 54922 491
1 1237648704577142859 183.598371 0.135285 18.66280 17.21449 16.67637 16.48922 16.39150 752 301 4 267 363814405953054720 STAR -0.000055 323 51615 541

The class column identifies an object as a galaxy, star, or quasar.
This is the output variable we will predict: a multi-class target.


In [5]:
sdss['class'].value_counts()


Out[5]:
GALAXY    4998
STAR      4152
QSO        850
Name: class, dtype: int64

In [6]:
sdss.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
objid        10000 non-null int64
ra           10000 non-null float64
dec          10000 non-null float64
u            10000 non-null float64
g            10000 non-null float64
r            10000 non-null float64
i            10000 non-null float64
z            10000 non-null float64
run          10000 non-null int64
rerun        10000 non-null int64
camcol       10000 non-null int64
field        10000 non-null int64
specobjid    10000 non-null uint64
class        10000 non-null object
redshift     10000 non-null float64
plate        10000 non-null int64
mjd          10000 non-null int64
fiberid      10000 non-null int64
dtypes: float64(8), int64(8), object(1), uint64(1)
memory usage: 1.4+ MB

The dataset has 10000 examples, 17 features and 1 target.


In [7]:
sdss.describe()


Out[7]:
objid ra dec u g r i z run rerun camcol field specobjid redshift plate mjd fiberid
count 1.000000e+04 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.0 10000.000000 10000.000000 1.000000e+04 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.237650e+18 175.529987 14.836148 18.619355 17.371931 16.840963 16.583579 16.422833 981.034800 301.0 3.648700 302.380100 1.645022e+18 0.143726 1460.986400 52943.533300 353.069400
std 1.173967e+12 47.783439 25.212207 0.828656 0.945457 1.067764 1.141805 1.203188 273.305024 0.0 1.666183 162.577763 2.013998e+18 0.388774 1788.778371 1511.150651 206.298149
min 1.237647e+18 8.235100 -5.382632 12.988970 12.799550 12.431600 11.947210 11.610410 308.000000 301.0 1.000000 11.000000 2.995782e+17 -0.004136 266.000000 51578.000000 1.000000
25% 1.237649e+18 157.370946 -0.539035 18.178035 16.815100 16.173333 15.853705 15.618285 752.000000 301.0 2.000000 184.000000 3.389250e+17 0.000081 301.000000 51900.000000 186.750000
50% 1.237649e+18 180.394514 0.404166 18.853095 17.495135 16.858770 16.554985 16.389945 756.000000 301.0 4.000000 299.000000 4.966580e+17 0.042591 441.000000 51997.000000 351.000000
75% 1.237651e+18 201.547279 35.649397 19.259232 18.010145 17.512675 17.258550 17.141447 1331.000000 301.0 5.000000 414.000000 2.881300e+18 0.092579 2559.000000 54468.000000 510.000000
max 1.237652e+18 260.884382 68.542265 19.599900 19.918970 24.802040 28.179630 22.833060 1412.000000 301.0 6.000000 768.000000 9.468834e+18 5.353854 8410.000000 57481.000000 1000.000000

No missing values!
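
We can also verify this programmatically. A quick check (a sketch, not part of the original run):

# Count missing values across all columns; 0 confirms a complete dataset
sdss.isnull().sum().sum()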

Data preprocessing

Remove unnecessary features


In [8]:
sdss.columns.values


Out[8]:
array(['objid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run', 'rerun',
       'camcol', 'field', 'specobjid', 'class', 'redshift', 'plate', 'mjd',
       'fiberid'], dtype=object)

From the project website, we can see that objid and specobjid are just identifiers for accessing rows in the original database. We will not need them for classification, as they are not related to the outcome.

Moreover, the features run, rerun, camcol and field describe the camera and the scan at the moment the observation was made; e.g. run identifies the scan that captured the object.

We will drop these columns, as any correlation with the outcome would be coincidental.


In [9]:
sdss.drop(['objid', 'run', 'rerun', 'camcol', 'field', 'specobjid'], axis=1, inplace=True)

In [10]:
sdss.head(2)


Out[10]:
ra dec u g r i z class redshift plate mjd fiberid
0 183.531326 0.089693 19.47406 17.04240 15.94699 15.50342 15.22531 STAR -0.000009 3306 54922 491
1 183.598371 0.135285 18.66280 17.21449 16.67637 16.48922 16.39150 STAR -0.000055 323 51615 541

In [11]:
# Extract output values
y = sdss['class'].copy() # copy the target values into y

sdss.drop(['class'], axis=1, inplace=True) # then, drop y column

Training

Split the dataset

We will split the data into a training set and a test set (70/30). The models will be trained on the training set and evaluated on the test set.


In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sdss, y, test_size=0.3)
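
Note that train_test_split shuffles the rows randomly, so the exact split (and the scores below) will vary between runs. A sketch of a reproducible, stratified variant (the seed value is a hypothetical choice, not used above):

X_train, X_test, y_train, y_test = train_test_split(
    sdss, y, test_size=0.3,
    random_state=42,  # hypothetical seed, fixes the split for reproducibility
    stratify=y)       # preserve the GALAXY/STAR/QSO proportions in both splits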

Data scaling

Scaling all features to the [0, 1] interval prevents attributes with exceptionally large values from dominating the others and makes some algorithms converge faster. Note that the scaler is fit on the training data only, so no information leaks from the test set.


In [13]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)
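
As a sanity check (a sketch, not in the original run), the scaled training features should lie exactly in [0, 1], while the test features may fall slightly outside that range because the scaler was fit on the training data only:

print(X_train_scaled.min(), X_train_scaled.max())  # expected: 0.0 1.0
print(X_test_scaled.min(), X_test_scaled.max())    # may slightly exceed [0, 1]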

Fit one-versus-all model

The one-versus-all (or one-versus-rest) approach is exactly what the name suggests: we train one binary classifier per category.

For example, one classifier treats all the stars as the positive class and everything else (here, galaxies and quasars) as the negative class; it simply learns to separate stars from the other sky objects.

We learn one such model for each class: stars, then galaxies, then quasars.

To predict, the class with the highest probability wins. In other words, if the probability of 'star versus everything else' is higher than for the other classes, the object is classified as a star.

All of this is handled by the sklearn classifier when multi_class is set to 'ovr' (for one-versus-rest):


In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
# Create One-vs-rest logistic regression object
ovr = LogisticRegression(multi_class='ovr', solver='liblinear')

In [16]:
modelOvr = ovr.fit(X_train_scaled, y_train)

In [17]:
modelOvr.score(X_test_scaled, y_test)


Out[17]:
0.90766666666666662

Even a very simple classifier like this already achieves a high accuracy of about 91%.
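
We can make the "highest probability wins" rule explicit by inspecting the per-class probabilities. A sketch (not part of the original run):

import numpy as np

probs = modelOvr.predict_proba(X_test_scaled[:3])  # one row per example, one column per class
print(modelOvr.classes_)                           # column order of the classes
print(probs)
# The predicted class is the column with the highest probability,
# which matches what modelOvr.predict() returns:
print(modelOvr.classes_[np.argmax(probs, axis=1)])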

Fit a logistic regression model with the cross-entropy loss function

Sklearn also offers a classifier that uses the cross-entropy as its loss function.
The loss minimised is the multinomial loss fit across the entire probability distribution.
Note that this does not work with the liblinear solver, so we are going to use the Newton-CG optimisation algorithm. (New in sklearn 0.18 is the Stochastic Average Gradient descent solver for the multinomial case; trying other solvers can be useful when the model is not converging.)


In [18]:
# Create cross-entropy-loss logistic regression object
xe = LogisticRegression(multi_class='multinomial', solver='newton-cg')

In [19]:
# Train model
modelXE = xe.fit(X_train_scaled, y_train)

In [20]:
preds = modelXE.predict(X_test_scaled)

In [21]:
modelXE.score(X_test_scaled, y_test)


Out[21]:
0.91633333333333333

A small improvement over the one-versus-rest model.
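
Accuracy alone hides how each class fares, in particular the under-represented quasars. A per-class breakdown (a sketch, not in the original run) could use sklearn's confusion matrix and classification report:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, preds, labels=modelXE.classes_))  # rows: true class, columns: predicted
print(classification_report(y_test, preds))                      # precision/recall/F1 per class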