The SDSS dataset contains 100,000 spectra and catalog data for the same sources from the BOSS survey.
spPlate-merged.hdf5All of the spectra have the same rest-frame wavelength grid (things are redshifted). They have 4603 pixels, and we provide the wavelength grid, flux values, and inverse-variance (uncertainties) for each source.
specObj-merged.hdf5We provide a row-matched table of spectroscopic catalog information derived from the spectra by the SDSS pipeline. This is provided as a table stored in an HDF5 file.
photoPosPlate-merged.hdf5We also provide a row-matched table of photometric catalog information from the SDSS imaging and derived by the SDSS photometric pipeline. This is provided as a table stored in an HDF5 file.
In [1]:
    
from os import path
from astropy.table import Table
import h5py
import matplotlib.pyplot as plt
plt.style.use('notebook.mplstyle')
%matplotlib inline
import numpy as np
    
In [2]:
    
data_path = '../data/'
    
In [3]:
    
with h5py.File(path.join(data_path, 'sdss', 'spPlate-merged.hdf5')) as f:
    print(list(f.keys()))
    
    print(f['flux'].shape)
    
    
The spectral flux for all objects is stored in the 'flux' dataset as a single 2D array. There are 100000 spectra, each with 4603 pixels.
The inverse-variance (uncertainty) for the flux is stored in the 'ivar' dataset, also as a single 2D array, with the same shape as the flux array. The inverse-variance array will be 0 where the flux data are bad.
The wavelength array is stored as a 1D array in the 'wave' dataset: the wavelength grid is the same for all spectra.
Example: Let's read a random spectrum (at index 71924), and plot the wavelength, flux, and inverse-variance:
In [4]:
    
with h5py.File(path.join(data_path, 'sdss', 'spPlate-merged.hdf5')) as f:
    wave = f['wave'][:]
    flux = f['flux'][71924]
    ivar = f['ivar'][71924]
    
In [5]:
    
plt.figure(figsize=(12,4))
plt.plot(wave, flux, marker='None', linewidth=1,
         linestyle='-', drawstyle='steps-mid')
plt.plot(wave, 1/np.sqrt(ivar), marker='None', linestyle='-', 
         drawstyle='steps-mid', alpha=0.75, linewidth=1)
plt.xlim(wave.min(), wave.max())
plt.xlabel(r'wavelength [${\rm \AA}$]')
    
    
    Out[5]:
    
In [6]:
    
specObj = Table.read(path.join(data_path, 'sdss', 'specObj-merged.hdf5'), 
                     path='specObj')
len(specObj)
    
    Out[6]:
This table has many columns - some that might be useful as labels or for filtering:
CLASS - the best guess of the type of object, can be "STAR", "QSO", or "GALAXY"Z and Z_ERR - the estimated redshift and redshift errorVDISP - the stellar velocity dispersion (measured from line widths) from a galaxy spectrumSN_MEDIAN - the signal to noise. might be useful if you want to first test on only high SN sourcesELODIE_TEFF, ELODIE_LOGG, ELODIE_FEH - for stars, the effective temperature, surface gravity, and metallicity measured from template spectra
In [7]:
    
print(specObj.colnames)
    
    
Example: What are the possible spectral classes and how many spectra do we have of each?
In [8]:
    
spec_class = specObj['CLASS'].astype(str)
spec_classes = np.unique(spec_class)
for cls in spec_classes:
    print(cls, (spec_class == cls).sum())
    
    
Example: What are the redshift distributions of all objects classified as GALAXY or QSO?
In [9]:
    
bins = np.linspace(0, 5, 24)
for cls in ['GALAXY', 'QSO']:
    plt.hist(specObj['Z'][specObj['CLASS'] == cls], 
             bins=bins, label=cls, alpha=0.4)
plt.legend(fontsize=20)
plt.xlabel('redshift, $z$')
    
    Out[9]:
    
In [10]:
    
photoPos = Table.read(path.join(data_path, 'sdss', 'photoPosPlate-merged.hdf5'), 
                      path='photoPosPlate')
    
In [11]:
    
print(photoPos.colnames)
    
    
This column has 5 elements per source, one magnitude for each of $ugriz$:
In [12]:
    
photoPos['PSFMAG'].shape
    
    Out[12]:
Example: Let's plot the g-r, r-i colors of all of our sources for each of the spectroscopic classes:
In [13]:
    
g_r = photoPos['PSFMAG'][:,1] - photoPos['PSFMAG'][:,2]
r_i = photoPos['PSFMAG'][:,2] - photoPos['PSFMAG'][:,3]
    
In [14]:
    
fig, axes = plt.subplots(1, len(spec_classes), figsize=(12.5,5), 
                         sharex=True, sharey=True)
for i, cls in enumerate(spec_classes):
    axes[i].plot(g_r[spec_class == cls], r_i[spec_class == cls], 
                 marker='.', linestyle='none', alpha=0.1)
    axes[i].set_title(cls)
    axes[i].set_xlabel('$g-r$ [mag]')
axes[0].set_xlim(-0.5, 2.5)
axes[0].set_ylim(-1, 2)
axes[0].set_ylabel('$r-i$ [mag]')
fig.tight_layout()
    
    
Example: Select all spectra that meet some photometric cuts
We'll define a color cut in $g-r$, $r-i$ colors and co-add all spectra in that box that are also stars:
In [15]:
    
color_cut = (g_r > 0.45) & (g_r < 0.55) & (r_i > 0.) & (r_i < 0.4)
print("{0} objects pass this cut".format(color_cut.sum()))
    
    
In [16]:
    
fig, ax = plt.subplots(1, 1, figsize=(5,5))
ax.plot(g_r[spec_class == 'STAR'], r_i[spec_class == 'STAR'], 
        marker='.', linestyle='none', alpha=0.25, color='#aaaaaa')
ax.plot(g_r[(spec_class == 'STAR') & color_cut], 
        r_i[(spec_class == 'STAR') & color_cut], 
        marker='.', linestyle='none', alpha=0.5)
ax.set_xlim(-0.5, 2.5)
ax.set_ylim(-1, 2)
ax.set_xlabel('$g-r$ [mag]')
ax.set_ylabel('$r-i$ [mag]')
    
    Out[16]:
    
Now we'll load the spectra for those objects:
In [17]:
    
with h5py.File(path.join(data_path, 'sdss', 'spPlate-merged.hdf5')) as f:
    wave = f['wave'][:]
    color_cut_flux = f['flux'][color_cut, :]
    color_cut_ivar = f['ivar'][color_cut, :]
    
In [18]:
    
color_cut_coadd = np.sum(color_cut_flux, axis=0)
    
In [19]:
    
plt.figure(figsize=(12,6))
plt.plot(wave, color_cut_coadd, marker='None', linewidth=1,
         linestyle='-', drawstyle='steps-mid')
plt.xlim(wave.min(), wave.max())
plt.xlabel(r'wavelength [${\rm \AA}$]')
    
    Out[19]: