Dataset Preparation


In [3]:
import numpy as np
import pandas as pd
from astropy.io import fits
from sklearn.preprocessing import StandardScaler
from mclearn.preprocessing import csv_to_hdf
from mclearn.photometry import (reddening_correction_w14,
                                correct_magnitudes,
                                compute_colours)
from mclearn.tools import save_results, results_exist

SDSS Dataset

Getting the Labelled SDSS Dataset

The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. We are interested in a subset of this survey, namely the photometric measurements and spectroscopic labels of around 2.8 million objects. Given the size of the dataset (385 MB), we need to query the SDSS server manually. To do this, create an account on the SDSS CasJobs site and submit the following SQL query to the DR12 catalog:

    SELECT
    -- right ascension and declination in degrees
    p.ra, p.dec,

    -- class of object, expert opinion (galaxy, star, or quasar)
    CASE s.class WHEN 'GALAXY' THEN 'Galaxy'
                 WHEN 'STAR' THEN 'Star'
                 WHEN 'QSO' THEN 'Quasar'
                 END AS class,

    -- subclass of object
    s.subclass,

    -- redshift of object from spectrum with error, expert opinion
    s.z AS redshift,
    s.zErr AS redshiftErr,

    -- 0 if spectrum is ok
    s.zWarning,

    -- PSF magnitude measurements in 5 bands (ugriz) with error, assuming object is a point source
    p.psfMag_u, p.psfMagErr_u,
    p.psfMag_g, p.psfMagErr_g,
    p.psfMag_r, p.psfMagErr_r,
    p.psfMag_i, p.psfMagErr_i,
    p.psfMag_z, p.psfMagErr_z,

    -- Petrosian magnitude measurements in 5 bands (ugriz) with error, assuming object is an extended source
    p.petroMag_u, p.petroMagErr_u,
    p.petroMag_g, p.petroMagErr_g,
    p.petroMag_r, p.petroMagErr_r,
    p.petroMag_i, p.petroMagErr_i,
    p.petroMag_z, p.petroMagErr_z,

    -- extinction values
    p.extinction_u, p.extinction_g, p.extinction_r, p.extinction_i, p.extinction_z,

    -- size measurement in r-band in arc seconds
    p.petroRad_r, p.petroRadErr_r

FROM PhotoObj AS p
   JOIN SpecObj AS s
   ON s.bestobjid = p.objid

WHERE
    -- only include objects with complete and reasonably accurate data
    p.psfMagErr_u BETWEEN 0 AND 3
    AND p.psfMagErr_g BETWEEN 0 AND 3
    AND p.psfMagErr_r BETWEEN 0 AND 3
    AND p.psfMagErr_i BETWEEN 0 AND 3
    AND p.psfMagErr_z BETWEEN 0 AND 3
    AND p.petroMagErr_u BETWEEN 0 AND 3
    AND p.petroMagErr_g BETWEEN 0 AND 3
    AND p.petroMagErr_r BETWEEN 0 AND 3
    AND p.petroMagErr_i BETWEEN 0 AND 3
    AND p.petroMagErr_z BETWEEN 0 AND 3
    AND p.petroRadErr_r BETWEEN 0 AND 3
    AND s.zErr BETWEEN 0 AND 0.1
    AND s.zWarning = 0    -- spectrum is ok

The WHERE conditions ensure that we only select the best possible data. Alternatively, you can download the dataset from here. From now on, we assume the dataset is stored in data/sdss_dr7_photometry_source.csv.gz.

About the Columns

The downloaded dataset is a compressed CSV file and can be loaded as a pandas DataFrame.


In [2]:
sdss = pd.read_csv("../data/sdss_dr7_photometry_source.csv.gz", compression="gzip")

Here are the first four rows of the table. Note that the DataFrame has been transposed below for easier viewing. Without the transpose, each row is an object in the sky and each column is a feature/input.


In [3]:
sdss.head(4).transpose()


Out[3]:
0 1 2 3
ra 189.4298 189.4538 189.4687 196.2366
dec -0.1310421 -0.09731284 -0.03599975 0.4123472
class Star Star Star Galaxy
subclass A0 F5 F5 null
redshift 0.0006484753 5.906141e-07 0.0005791125 0.6179363
redshiftErr 7.360589e-06 9.354951e-06 1.180595e-05 0.0001318748
zWarning 0 0 0 0
psfMag_u 17.84807 17.66626 17.28147 24.66923
psfMagErr_u 0.01703377 0.01657351 0.01586868 0.8418523
psfMag_g 16.66706 16.64595 16.2146 23.05412
psfMagErr_g 0.0144116 0.01439991 0.01434032 0.2069454
psfMag_r 16.85531 16.28934 15.76875 21.47531
psfMagErr_r 0.01748326 0.01707548 0.01702425 0.1176952
psfMag_i 17.0429 16.17299 15.50953 20.41323
psfMagErr_i 0.01451051 0.01167456 0.01160269 0.165401
psfMag_z 17.16665 16.13468 15.47708 20.04133
psfMagErr_z 0.02462654 0.01768295 0.01717788 0.2461791
petroMag_u 17.88087 17.70506 17.33332 23.97661
petroMagErr_u 0.0111319 0.009922555 0.008038488 2.120464
petroMag_g 16.66306 16.67624 16.253 23.00093
petroMagErr_g 0.003165843 0.002867172 0.002438473 0.6729416
petroMag_r 16.87562 16.33335 15.8059 20.86409
petroMagErr_r 0.008610857 0.002749886 0.002263545 0.2500357
petroMag_i 17.0827 16.21605 15.5625 19.80136
petroMagErr_i 0.0201462 0.003140237 0.002287731 0.3799355
petroMag_z 17.19835 16.1933 15.51777 19.40505
petroMagErr_z 0.04253768 0.009198451 0.005646828 0.4743574
extinction_u 0.1196574 0.1151118 0.1186043 0.1026673
extinction_g 0.08804275 0.08469814 0.08726794 0.07554164
extinction_r 0.06385595 0.06143015 0.06329399 0.0547891
extinction_i 0.04842003 0.04658063 0.04799392 0.04154491
extinction_z 0.03433041 0.03302624 0.03402828 0.02945586
petroRad_r 1.286998 1.265791 1.265033 2.018322
petroRadErr_r 0.01953742 0.01779405 0.01784949 0.390549

The first two columns (ra and dec) are the right ascension and the declination of the object in degrees. The third column (class) is the spectroscopic class (Star, Galaxy, or Quasar) as determined by expert opinion. This will be the target vector in the classification. Some objects are also further divided into subclasses, as indicated by the fourth column. The columns redshift and redshiftErr are the redshift (with error) of the object, also determined by expert opinion.

There are 11 columns that we can use as feature vectors. These are the different PSF and Petrosian magnitude measurements:

  • psfMag_u: PSF magnitude measurement in u-band, assuming the object is a point source
  • psfMag_g: PSF magnitude measurement in g-band, assuming the object is a point source
  • psfMag_r: PSF magnitude measurement in r-band, assuming the object is a point source
  • psfMag_i: PSF magnitude measurement in i-band, assuming the object is a point source
  • psfMag_z: PSF magnitude measurement in z-band, assuming the object is a point source
  • petroMag_u: Petrosian magnitude measurement in u-band, assuming the object is an extended source
  • petroMag_g: Petrosian magnitude measurement in g-band, assuming the object is an extended source
  • petroMag_r: Petrosian magnitude measurement in r-band, assuming the object is an extended source
  • petroMag_i: Petrosian magnitude measurement in i-band, assuming the object is an extended source
  • petroMag_z: Petrosian magnitude measurement in z-band, assuming the object is an extended source
  • petroRad_r: size measurement of the object in r-band in arc seconds

Each of these 11 measurements also has an associated error.

In addition, there are 5 columns that contain extinction values. These values should be subtracted from the magnitude measurements to correct for the scattering of light by galactic dust:

  • extinction_u: Extinction value in the u-band
  • extinction_g: Extinction value in the g-band
  • extinction_r: Extinction value in the r-band
  • extinction_i: Extinction value in the i-band
  • extinction_z: Extinction value in the z-band
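As a minimal sketch of this correction, using hypothetical values in the style of the table above, the extinction value in each band is simply subtracted from the corresponding magnitude:

```python
import pandas as pd

# two hypothetical objects with an r-band PSF magnitude and its extinction value
df = pd.DataFrame({'psfMag_r': [16.86, 21.48],
                   'extinction_r': [0.064, 0.055]})

# subtracting the extinction yields the dereddened magnitude
df['psfMag_r_dered'] = df['psfMag_r'] - df['extinction_r']
```

Later in this notebook we apply a refined version of this correction (the W14 corrections) rather than subtracting the catalogue values directly.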

Getting the Full SDSS Dataset

In a few notebooks we will also use the full dataset, which contains photometric measurements of around 800 million objects. To obtain these data, remove all the WHERE conditions and LEFT JOIN (instead of JOIN) PhotoObj with SpecObj. Note that since the full set is extremely large (around 200 GB), you will not be able to use CasJobs. Instead, you need to email the SDSS Help Desk directly for a custom transfer. Here is the required query:

    SELECT
    p.ra, p.dec,
    CASE s.class WHEN 'GALAXY' THEN 'Galaxy'
                 WHEN 'STAR' THEN 'Star'
                 WHEN 'QSO' THEN 'Quasar'
                 END AS class,
    s.subclass,
    s.z AS redshift,
    s.zErr AS redshiftErr,
    s.zWarning,
    p.psfMag_u, p.psfMagErr_u,
    p.psfMag_g, p.psfMagErr_g,
    p.psfMag_r, p.psfMagErr_r,
    p.psfMag_i, p.psfMagErr_i,
    p.psfMag_z, p.psfMagErr_z,
    p.petroMag_u, p.petroMagErr_u,
    p.petroMag_g, p.petroMagErr_g,
    p.petroMag_r, p.petroMagErr_r,
    p.petroMag_i, p.petroMagErr_i,
    p.petroMag_z, p.petroMagErr_z,
    p.extinction_u, p.extinction_g, p.extinction_r, p.extinction_i, p.extinction_z,
    p.petroRad_r, p.petroRadErr_r

FROM PhotoObj AS p
   LEFT JOIN SpecObj AS s
   ON s.bestobjid = p.objid

Assuming the files are split into 100 files with names ANUdata00.csv, ANUdata01.csv, etc., we now combine them into one HDF5 table.


In [ ]:
csv_path = '../data/sdss_full/ANUdata{0:02d}.csv'
hdf_path = '../data/sdss_full.h5'
sdss_cols = ['objID', 'ra', 'dec', 'specobjid', 'class', 'subclass', 'redshift',
             'redshiftErr', 'zWarning', 'type', 'clean', 'flags', 'probPSF', 'psfMag_u',
             'psfMagErr_u', 'psfMag_g', 'psfMagErr_g', 'psfMag_r', 'psfMagErr_r',
             'psfMag_i', 'psfMagErr_i', 'psfMag_z', 'psfMagErr_z', 'petroMag_u',
             'petroMagErr_u', 'petroMag_g', 'petroMagErr_g', 'petroMag_r', 'petroMagErr_r',
             'petroMag_i', 'petroMagErr_i', 'petroMag_z', 'petroMagErr_z', 'extinction_u',
             'extinction_g', 'extinction_r', 'extinction_i', 'extinction_z', 'petroRad_r',
             'petroRadErr_r']
csv_to_hdf(csv_path, no_files=100, hdf_path=hdf_path, data_cols=sdss_cols,
           expectedrows=7569900, min_itemsize=40, table_name='sdss_full')

Extracting Colour Features

We now extract the colour features and scale the data.


In [3]:
# compute reddening correction
A_u_w14, A_g_w14, A_r_w14, A_i_w14, A_z_w14 = reddening_correction_w14(sdss['extinction_r'])

# useful variables
psf_magnitudes = ['psfMag_u', 'psfMag_g', 'psfMag_r', 'psfMag_i', 'psfMag_z']
petro_magnitudes = ['petroMag_u', 'petroMag_g', 'petroMag_r', 'petroMag_i', 'petroMag_z']
w14_corrections = [A_u_w14, A_g_w14, A_r_w14, A_i_w14, A_z_w14]
colours = [('psfMag_u', 'psfMag_g'), ('psfMag_g', 'psfMag_r'), ('psfMag_r', 'psfMag_i'), ('psfMag_i', 'psfMag_z'),
           ('petroMag_u', 'petroMag_g'), ('petroMag_g', 'petroMag_r'), ('petroMag_r', 'petroMag_i'), ('petroMag_i', 'petroMag_z')]

# calculate the corrected magnitudes
correct_magnitudes(sdss, psf_magnitudes, w14_corrections, '_w14')
correct_magnitudes(sdss, petro_magnitudes, w14_corrections, '_w14')

# compute the colour features
compute_colours(sdss, colours, '_w14')

# scale data to zero mean and unit variance
scaler = StandardScaler()
w14_feature_cols = ['psfMag_r_w14', 'psf_u_g_w14', 'psf_g_r_w14', 'psf_r_i_w14',
                'psf_i_z_w14', 'petroMag_r_w14', 'petro_u_g_w14', 'petro_g_r_w14',
                'petro_r_i_w14', 'petro_i_z_w14', 'petroRad_r']
sdss[w14_feature_cols] = scaler.fit_transform(sdss[w14_feature_cols])
save_results(scaler, '../pickle/01_dataset_prepration/sdss_scaler.pickle')

# save as HDF5 table
relevant_cols = ['ra', 'dec', 'class', 'psfMag_r_w14', 'psf_u_g_w14', 'psf_g_r_w14', 'psf_r_i_w14',
                'psf_i_z_w14', 'petroMag_r_w14', 'petro_u_g_w14', 'petro_g_r_w14',
                'petro_r_i_w14', 'petro_i_z_w14', 'petroRad_r']
sdss[relevant_cols].to_hdf('../data/sdss.h5', 'sdss')
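The colour columns created above (psf_u_g_w14, petro_g_r_w14, and so on) are, under the usual definition, differences between corrected magnitudes in adjacent bands, which is what we assume compute_colours produces. A minimal sketch with hypothetical values:

```python
import pandas as pd

# one hypothetical object with extinction-corrected u- and g-band magnitudes
df = pd.DataFrame({'psfMag_u_w14': [17.73], 'psfMag_g_w14': [16.58]})

# a colour index is the difference between magnitudes in two adjacent bands
df['psf_u_g_w14'] = df['psfMag_u_w14'] - df['psfMag_g_w14']
```

Colours are useful features because, unlike raw magnitudes, they are insensitive to the overall brightness of the object.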

Extracting Dust Extinction Vector


In [6]:
ebv_path = '../data/sdss_ebv.h5'
if not results_exist(ebv_path):
    sdss_full = pd.read_hdf('../data/sdss_full.h5', 'sdss_full', columns=['ra', 'dec', 'extinction_r'], chunksize=100000)
    store = pd.HDFStore(ebv_path, complevel=9, complib='zlib', fletcher32=True)
    for i, chunk in enumerate(sdss_full):
        chunk['ebv'] = chunk['extinction_r'] / 2.751  # convert r-band extinction to E(B-V)
        store.append('sdss_ebv', chunk[['ra', 'dec', 'ebv']], index=False, expectedrows=7569900)
        if i % 10 == 0: print(i // 10, end=' ')
    store.close()

VST ATLAS Dataset


In [2]:
sources = ['../data/vstatlas/lib_gal__AT.fits',
           '../data/vstatlas/lib_qso__AT.fits',
           '../data/vstatlas/lib_star_AT.fits',
           '../data/vstatlas/lib_WD___AT.fits']

classes = ['Galaxy', 'Quasar', 'Star', 'White Dwarf']
vstatlas_features = ['rmagC', 'umg', 'gmr', 'rmi', 'imz', 'rmw1', 'w1m2']

dfs = []
for source, obj in zip(sources, classes):
    data = fits.open(source)
    df = pd.DataFrame(data[1].data)
    df['class'] = obj
    dfs.append(df)
    
vstatlas = pd.concat(dfs, axis=0, ignore_index=True)
vstatlas = vstatlas[['class', 'rmagC', 'umg', 'gmr', 'rmi', 'imz', 'rmw1', 'w1m2']]
vstatlas = vstatlas.iloc[np.random.permutation(len(vstatlas))]
vstatlas = vstatlas.reset_index(drop=True)

scaler = StandardScaler()
vstatlas[vstatlas_features] = scaler.fit_transform(vstatlas[vstatlas_features])

vstatlas.to_hdf('../data/vstatlas.h5', 'vstatlas')

To represent a random patch of the sky, we rescale the class counts so that their proportions match the a-priori likelihood of each class.


In [11]:
# a-priori likelihood for a random object in the sky
counts = np.array([6559, 2303, 25604, 590])
likelihoods = [6559. / 127. * 2496,
               2303. / 5. * 16,
               25604. / 697. * 14868,
               590. / 3. * 38]

np.array(np.sum(counts) * (np.array(likelihoods) / np.sum(likelihoods)), dtype=int)


Out[11]:
array([ 6550,   374, 27751,   379])
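The cell above reweights the observed class counts (Galaxy, Quasar, Star, White Dwarf) by estimated sky likelihoods and redistributes the same total across the four classes. The general pattern, with hypothetical normalised weights standing in for the likelihood estimates, looks like this:

```python
import numpy as np

# observed class counts (Galaxy, Quasar, Star, White Dwarf)
counts = np.array([6559, 2303, 25604, 590])

# hypothetical a-priori class weights for a random patch of sky
weights = np.array([0.187, 0.011, 0.792, 0.011])

# redistribute the same total number of objects in proportion to the weights
rescaled = (counts.sum() * weights / weights.sum()).astype(int)
```

The result keeps the overall sample size (up to integer truncation) while shifting objects between classes to reflect their true relative abundance.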