In [3]:
import numpy as np
import pandas as pd
from astropy.io import fits
from sklearn.preprocessing import StandardScaler
from mclearn.preprocessing import csv_to_hdf
from mclearn.photometry import (reddening_correction_w14,
                                correct_magnitudes,
                                compute_colours)
from mclearn.tools import save_results, results_exist
The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. We are interested in a subset of this survey, namely the photometric measurements and spectroscopic labels of around 2.8 million objects. Given the size of the dataset (385MB), we will need to manually query the SDSS server. To do this, create an account on the SDSS CasJobs site and submit the following SQL query to the DR12 catalog:
SELECT
-- right ascension and declination in degrees
p.ra, p.dec,
-- class of object, expert opinion (galaxy, star, or quasar)
CASE s.class WHEN 'GALAXY' THEN 'Galaxy'
WHEN 'STAR' THEN 'Star'
WHEN 'QSO' THEN 'Quasar'
END AS class,
-- subclass of object
s.subclass,
-- redshift of object from spectrum with error, expert opinion
s.z AS redshift,
s.zErr AS redshiftErr,
-- 0 if spectrum is ok
s.zWarning,
-- PSF magnitude measurements in 5 bands (ugriz) with error, assuming object is a point source
p.psfMag_u, p.psfMagErr_u,
p.psfMag_g, p.psfMagErr_g,
p.psfMag_r, p.psfMagErr_r,
p.psfMag_i, p.psfMagErr_i,
p.psfMag_z, p.psfMagErr_z,
-- Petrosian magnitude measurements in 5 bands (ugriz) with error, assuming object is an extended source
p.petroMag_u, p.petroMagErr_u,
p.petroMag_g, p.petroMagErr_g,
p.petroMag_r, p.petroMagErr_r,
p.petroMag_i, p.petroMagErr_i,
p.petroMag_z, p.petroMagErr_z,
-- extinction values
p.extinction_u, p.extinction_g, p.extinction_r, p.extinction_i, p.extinction_z,
-- size measurement in r-band in arc seconds
p.petroRad_r, p.petroRadErr_r
FROM PhotoObj AS p
JOIN SpecObj AS s
ON s.bestobjid = p.objid
WHERE
-- only include objects with complete and reasonably accurate data
p.psfMagErr_u BETWEEN 0 AND 3
AND p.psfMagErr_g BETWEEN 0 AND 3
AND p.psfMagErr_r BETWEEN 0 AND 3
AND p.psfMagErr_i BETWEEN 0 AND 3
AND p.psfMagErr_z BETWEEN 0 AND 3
AND p.petroMagErr_u BETWEEN 0 AND 3
AND p.petroMagErr_g BETWEEN 0 AND 3
AND p.petroMagErr_r BETWEEN 0 AND 3
AND p.petroMagErr_i BETWEEN 0 AND 3
AND p.petroMagErr_z BETWEEN 0 AND 3
AND p.petroRadErr_r BETWEEN 0 AND 3
AND s.zErr BETWEEN 0 AND 0.1
AND s.zWarning = 0 -- spectrum is ok
The WHERE conditions ensure that we only select the best possible data. Alternatively, you can download the dataset from here. From now on, we assume the dataset is stored in data/sdss_dr7_photometry_source.csv.gz.
The downloaded dataset is a compressed CSV file and can be loaded as a pandas DataFrame.
In [2]:
sdss = pd.read_csv("../data/sdss_dr7_photometry_source.csv.gz", compression="gzip")
Here are the first four rows of the table. Note that the DataFrame has been transposed below for easier viewing. Without the transpose, each row is an object in the sky and each column is a feature/input.
In [3]:
sdss.head(4).transpose()
Out[3]:
The first two columns (ra and dec) are the right ascension and the declination of the object in degrees. The third column (class) is the spectroscopic class (Star, Galaxy, or Quasar) as determined by expert opinion. This will be the target vector in the classification. Some objects are also further divided into subclasses, as indicated by the fourth column. The redshift and redshiftErr columns are the redshift (with error) of the object, also determined by expert opinion.
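Since class is the classification target, it is worth a quick (purely illustrative) look at how the three classes are distributed:

# distribution of the spectroscopic classes (the classification target)
print(sdss['class'].value_counts())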
There are 11 columns that we can use as feature vectors. These are the different PSF and Petrosian magnitude measurements:

- psfMag_u: PSF magnitude measurement in the u-band, assuming the object is a point source
- psfMag_g: PSF magnitude measurement in the g-band, assuming the object is a point source
- psfMag_r: PSF magnitude measurement in the r-band, assuming the object is a point source
- psfMag_i: PSF magnitude measurement in the i-band, assuming the object is a point source
- psfMag_z: PSF magnitude measurement in the z-band, assuming the object is a point source
- petroMag_u: Petrosian magnitude measurement in the u-band, assuming the object is an extended source
- petroMag_g: Petrosian magnitude measurement in the g-band, assuming the object is an extended source
- petroMag_r: Petrosian magnitude measurement in the r-band, assuming the object is an extended source
- petroMag_i: Petrosian magnitude measurement in the i-band, assuming the object is an extended source
- petroMag_z: Petrosian magnitude measurement in the z-band, assuming the object is an extended source
- petroRad_r: size measurement of the object in the r-band in arc seconds

Each of these 11 measurements also has an associated error.
In addition, there are 5 columns that contain extinction values. These values should be subtracted from the magnitude measurements to correct for the scattering of light by the galactic dust (a minimal sketch of this subtraction follows the list):

- extinction_u: extinction value in the u-band
- extinction_g: extinction value in the g-band
- extinction_r: extinction value in the r-band
- extinction_i: extinction value in the i-band
- extinction_z: extinction value in the z-band
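As an illustration, the correction is a simple column subtraction in pandas. The _dered column names below are hypothetical; the actual pipeline further on uses the mclearn helpers instead:

# subtract the extinction in each band from the corresponding magnitudes
# (the *_dered column names are illustrative only)
for band in ['u', 'g', 'r', 'i', 'z']:
    sdss['psfMag_%s_dered' % band] = sdss['psfMag_' + band] - sdss['extinction_' + band]
    sdss['petroMag_%s_dered' % band] = sdss['petroMag_' + band] - sdss['extinction_' + band]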
In a few notebooks, we will also use the full dataset, which contains photometric measurements of around 800 million objects. To obtain these data, remove all the WHERE conditions and LEFT JOIN (instead of JOIN) PhotoObj with SpecObj. Note that since the full set is extremely large (around 200GB), you will not be able to use CasJobs. Instead, you need to email the SDSS Help Desk directly for a custom transfer. Here is the required query:
SELECT
p.ra, p.dec,
CASE s.class WHEN 'GALAXY' THEN 'Galaxy'
WHEN 'STAR' THEN 'Star'
WHEN 'QSO' THEN 'Quasar'
END AS class,
s.subclass,
s.z AS redshift,
s.zErr AS redshiftErr,
s.zWarning,
p.psfMag_u, p.psfMagErr_u,
p.psfMag_g, p.psfMagErr_g,
p.psfMag_r, p.psfMagErr_r,
p.psfMag_i, p.psfMagErr_i,
p.psfMag_z, p.psfMagErr_z,
p.petroMag_u, p.petroMagErr_u,
p.petroMag_g, p.petroMagErr_g,
p.petroMag_r, p.petroMagErr_r,
p.petroMag_i, p.petroMagErr_i,
p.petroMag_z, p.petroMagErr_z,
p.extinction_u, p.extinction_g, p.extinction_r, p.extinction_i, p.extinction_z,
p.petroRad_r, p.petroRadErr_r
FROM PhotoObj AS p
LEFT JOIN SpecObj AS s
ON s.bestobjid = p.objid
Assuming the data are split into 100 files named ANUdata00.csv, ANUdata01.csv, etc., we now combine them into one HDF5 table.
In [ ]:
csv_path = '../data/sdss_full/ANUdata{0:02d}.csv'
hdf_path = '../data/sdss_full.h5'
sdss_cols = ['objID', 'ra', 'dec', 'specobjid', 'class', 'subclass', 'redshift',
'redshiftErr', 'zWarning', 'type', 'clean', 'flags', 'probPSF', 'psfMag_u',
'psfMagErr_u', 'psfMag_g', 'psfMagErr_g', 'psfMag_r', 'psfMagErr_r',
'psfMag_i', 'psfMagErr_i', 'psfMag_z', 'psfMagErr_z', 'petroMag_u',
'petroMagErr_u', 'petroMag_g', 'petroMagErr_g', 'petroMag_r', 'petroMagErr_r',
'petroMag_i', 'petroMagErr_i', 'petroMag_z', 'petroMagErr_z', 'extinction_u',
'extinction_g', 'extinction_r', 'extinction_i', 'extinction_z', 'petroRad_r',
'petroRadErr_r']
csv_to_hdf(csv_path, no_files=100, hdf_path=hdf_path, data_cols=sdss_cols,
expectedrows=7569900, min_itemsize=40, table_name='sdss_full')
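Once the conversion finishes, a quick sanity check (assuming csv_to_hdf wrote a table-format store, as the expectedrows argument suggests) is to read back a handful of rows without loading the whole file:

# preview the first few rows of the converted HDF5 table
preview = pd.read_hdf(hdf_path, 'sdss_full', start=0, stop=5)
print(preview.shape)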
Extract the colour features and scale the data
In [3]:
# compute reddening correction
A_u_w14, A_g_w14, A_r_w14, A_i_w14, A_z_w14 = reddening_correction_w14(sdss['extinction_r'])
# useful variables
psf_magnitudes = ['psfMag_u', 'psfMag_g', 'psfMag_r', 'psfMag_i', 'psfMag_z']
petro_magnitudes = ['petroMag_u', 'petroMag_g', 'petroMag_r', 'petroMag_i', 'petroMag_z']
w14_corrections = [A_u_w14, A_g_w14, A_r_w14, A_i_w14, A_z_w14]
colours = [('psfMag_u', 'psfMag_g'), ('psfMag_g', 'psfMag_r'), ('psfMag_r', 'psfMag_i'), ('psfMag_i', 'psfMag_z'),
('petroMag_u', 'petroMag_g'), ('petroMag_g', 'petroMag_r'), ('petroMag_r', 'petroMag_i'), ('petroMag_i', 'petroMag_z')]
# calculate the corrected magnitudes
correct_magnitudes(sdss, psf_magnitudes, w14_corrections, '_w14')
correct_magnitudes(sdss, petro_magnitudes, w14_corrections, '_w14')
# compute the colour features from the corrected magnitudes
compute_colours(sdss, colours, '_w14')
# scale data to zero mean and unit variance
scaler = StandardScaler()
w14_feature_cols = ['psfMag_r_w14', 'psf_u_g_w14', 'psf_g_r_w14', 'psf_r_i_w14',
'psf_i_z_w14', 'petroMag_r_w14', 'petro_u_g_w14', 'petro_g_r_w14',
'petro_r_i_w14', 'petro_i_z_w14', 'petroRad_r']
sdss[w14_feature_cols] = scaler.fit_transform(sdss[w14_feature_cols])
save_results(scaler, '../pickle/01_dataset_prepration/sdss_scaler.pickle')
# save as HDF5 table
relevant_cols = ['ra', 'dec', 'class', 'psfMag_r_w14', 'psf_u_g_w14', 'psf_g_r_w14', 'psf_r_i_w14',
'psf_i_z_w14', 'petroMag_r_w14', 'petro_u_g_w14', 'petro_g_r_w14',
'petro_r_i_w14', 'petro_i_z_w14', 'petroRad_r']
sdss[relevant_cols].to_hdf('../data/sdss.h5', 'sdss')
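As a quick illustrative check, the scaled feature columns should now have roughly zero mean and unit variance:

# means should be ~0 and standard deviations ~1 after scaling
print(sdss[w14_feature_cols].mean().round(2))
print(sdss[w14_feature_cols].std().round(2))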
In [6]:
ebv_path = '../data/sdss_ebv.h5'
if not results_exist(ebv_path):
    sdss_full = pd.read_hdf('../data/sdss_full.h5', 'sdss_full',
                            columns=['ra', 'dec', 'extinction_r'], chunksize=100000)
    store = pd.HDFStore(ebv_path, complevel=9, complib='zlib', fletcher32=True)
    for i, chunk in enumerate(sdss_full):
        # convert the r-band extinction to E(B-V), using A_r = 2.751 * E(B-V)
        chunk['ebv'] = chunk['extinction_r'] / 2.751
        store.append('sdss_ebv', chunk[['ra', 'dec', 'ebv']], index=False,
                     expectedrows=7569900)
        if i % 10 == 0:
            print(i // 10, end=' ')  # progress indicator
    store.close()
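The per-object E(B-V) values can then be read back in slices, for example (an illustrative check, assuming the store was written as above):

# peek at the first few rows of the E(B-V) table
ebv_preview = pd.read_hdf(ebv_path, 'sdss_ebv', start=0, stop=5)
print(ebv_preview)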
In [2]:
sources = ['../data/vstatlas/lib_gal__AT.fits',
'../data/vstatlas/lib_qso__AT.fits',
'../data/vstatlas/lib_star_AT.fits',
'../data/vstatlas/lib_WD___AT.fits']
classes = ['Galaxy', 'Quasar', 'Star', 'White Dwarf']
vstatlas_features = ['rmagC', 'umg', 'gmr', 'rmi', 'imz', 'rmw1', 'w1m2']
dfs = []
for source, obj in zip(sources, classes):
    # read the FITS table and attach the class label
    data = fits.open(source)
    df = pd.DataFrame(data[1].data)
    df['class'] = obj
    dfs.append(df)
vstatlas = pd.concat(dfs, axis=0, ignore_index=True)
vstatlas = vstatlas[['class', 'rmagC', 'umg', 'gmr', 'rmi', 'imz', 'rmw1', 'w1m2']]
# shuffle the rows and reset the index
vstatlas = vstatlas.iloc[np.random.permutation(len(vstatlas))]
vstatlas = vstatlas.reset_index(drop=True)
scaler = StandardScaler()
vstatlas[vstatlas_features] = scaler.fit_transform(vstatlas[vstatlas_features])
vstatlas.to_hdf('../data/vstatlas.h5', 'vstatlas')
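As a quick illustrative check, we can read the table back and look at the class distribution:

# class counts in the shuffled VST ATLAS table
vstatlas = pd.read_hdf('../data/vstatlas.h5', 'vstatlas')
print(vstatlas['class'].value_counts())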
To represent a random patch of the sky, we redistribute the total number of objects across the four classes in proportion to their a-priori likelihoods:
In [11]:
# a-priori likelihood for a random object in the sky
counts = np.array([6559, 2303, 25604, 590])
likelihoods = [6559. / 127. * 2496,
2303. / 5. * 16,
25604. / 697. * 14868,
590. / 3. * 38]
np.array(np.sum(counts) * (np.array(likelihoods) / np.sum(likelihoods)), dtype=int)
Out[11]: