In this notebook we use a logistic regression classifier to classify observations of space objects as either stars, galaxies or quasars.
We are using data from the Sloan Digital Sky Survey (Data Release 14).
In [1]:
import pandas as pd
sdss = pd.read_csv('../datasets/Skyserver_SQL2_27_2018 6_51_39 PM.csv', skiprows=1)
In [4]:
sdss.head(2)
Out[4]:
The class column identifies an object as either a galaxy, star or quasar.
This is the multi-class target variable that we will predict.
In [5]:
sdss['class'].value_counts()
Out[5]:
In [6]:
sdss.info()
The dataset has 10,000 examples, 17 features and 1 target.
In [7]:
sdss.describe()
Out[7]:
No missing values!
In [8]:
sdss.columns.values
Out[8]:
From the project website, we can see that objid and specobjid are just identifiers for accessing rows in the original database, so we will not need them for classification as they are not related to the outcome.
Moreover, the features run, rerun, camcol and field describe the state of the camera at the moment of the observation; e.g. run identifies the scan which captured the object.
We will drop these columns, as any correlation with the outcome would be coincidental.
In [9]:
sdss.drop(['objid', 'run', 'rerun', 'camcol', 'field', 'specobjid'], axis=1, inplace=True)
In [10]:
sdss.head(2)
Out[10]:
In [11]:
# Extract output values
y = sdss['class'].copy() # copy the 'class' column out as the target
sdss.drop(['class'], axis=1, inplace=True) # then drop it from the feature set
In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sdss, y, test_size=0.3)
Scaling all values into the [0, 1] interval reduces the distortion caused by exceptionally large values and helps some algorithms converge faster.
In [13]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
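As a quick sanity check (not part of the original analysis), we can confirm that the scaled training features lie in [0, 1]; the test features may fall slightly outside this range because the scaler was fitted on the training set only:
In [ ]:
# Illustrative check of the scaled ranges
print(X_train_scaled.min(), X_train_scaled.max())   # exactly 0.0 and 1.0
print(X_test_scaled.min(), X_test_scaled.max())     # may slightly exceed [0, 1]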
The one-versus-all (or one-versus-rest) model is extremely simple and does exactly what the name suggests: you train one binary classifier per category.
For example, you train a classifier where the positive category is all the stars and the negative category is everything else (in our example, galaxies and quasars); it simply tries to learn a boundary that separates most of the stars from the other sky objects.
We learn one such model for each class: stars, then galaxies, then quasars.
For prediction, the class with the highest probability wins. In other words, if the star-versus-rest classifier gives the highest probability for an input, the object is predicted to be a star.
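To make this concrete, here is a minimal sketch of one-versus-rest done by hand (purely illustrative; it assumes the class labels are 'GALAXY', 'QSO' and 'STAR', and in practice sklearn handles all of this for us as shown below):
In [ ]:
# Illustrative one-vs-rest by hand -- sklearn does the same internally
import numpy as np
from sklearn.linear_model import LogisticRegression

classes = ['GALAXY', 'QSO', 'STAR']           # assumed label names
binary_models = {}
for c in classes:
    clf = LogisticRegression(solver='liblinear')
    clf.fit(X_train_scaled, y_train == c)     # this class vs. everything else
    binary_models[c] = clf

# for each test example, the class whose binary classifier is most confident wins
probas = np.column_stack([binary_models[c].predict_proba(X_test_scaled)[:, 1]
                          for c in classes])
manual_preds = np.array(classes)[probas.argmax(axis=1)]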
All of this is handled by the sklearn classifier when multi_class='ovr' (one-versus-rest) is set:
In [14]:
from sklearn.linear_model import LogisticRegression
In [15]:
# Create One-vs-rest logistic regression object
ovr = LogisticRegression(multi_class='ovr', solver='liblinear')
In [16]:
modelOvr = ovr.fit(X_train_scaled, y_train)
In [17]:
modelOvr.score(X_test_scaled, y_test)
Out[17]:
Even a very simple classifier like this already reaches a high accuracy of about 90%.
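To see where the remaining errors are made (an optional check, not in the original notebook), we could inspect the confusion matrix of the one-vs-rest model on the test set:
In [ ]:
# Illustrative per-class breakdown of the one-vs-rest predictions
from sklearn.metrics import confusion_matrix

ovr_preds = modelOvr.predict(X_test_scaled)
print(modelOvr.classes_)                                          # row/column order
print(confusion_matrix(y_test, ovr_preds, labels=modelOvr.classes_))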
Sklearn also offers a classifier that uses cross-entropy as the loss function.
The loss minimised is the multinomial loss, fit across the entire probability distribution.
Note that this option does not work with the liblinear solver, so we use the Newton optimisation algorithm ('newton-cg') instead; new in sklearn version 0.18, the Stochastic Average Gradient descent solver can also handle the multinomial case (it can be useful to try other solvers when the model is not converging).
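As a reminder of what this loss measures, here is a tiny hand computation of the multinomial cross-entropy on invented probabilities for three classes (the numbers are purely illustrative):
In [ ]:
# Cross-entropy: the negative log of the probability assigned to the true class,
# averaged over examples (invented numbers, for illustration only)
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],     # predicted distribution, true class = index 0
                  [0.1, 0.8, 0.1]])    # predicted distribution, true class = index 1
true_idx = np.array([0, 1])
loss = -np.mean(np.log(probs[np.arange(len(true_idx)), true_idx]))
print(loss)   # 0 would mean perfectly confident, correct predictions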
In [18]:
# Create cross-entropy-loss logistic regression object
xe = LogisticRegression(multi_class='multinomial', solver='newton-cg')
In [19]:
# Train model
modelXE = xe.fit(X_train_scaled, y_train)
In [20]:
preds = modelXE.predict(X_test_scaled)
In [21]:
modelXE.score(X_test_scaled, y_test)
Out[21]:
A small improvement.
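For a fuller comparison between the two models, one could also look at per-class precision and recall (an optional extra, not part of the original notebook):
In [ ]:
# Illustrative per-class report for the multinomial model
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))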