Getting the Datasets

AstroML includes functions for loading various datasets. The first time one of these functions is called, it fetches the data from the web, does some preprocessing, and caches the result on disk.

To avoid inadvertently bringing down the network, we need to do a bit of preparation and import the files from the memory stick provided by the conference organizers. The following function does this, provided you give it the correct path to the files copied from the memory stick:


In [ ]:
import os
import sys
import shutil
import numpy as np
from astroML.datasets import get_data_home

def load_LGAStat_data(src_dir):
    data_dir = get_data_home()
    assert(os.path.exists(src_dir))
    assert(os.path.isdir(src_dir))
    
    # specgals
    filename = 'SDSSspecgalsDR8.fit'
    print("copying {0}".format(filename))
    sys.stdout.flush()
    shutil.copyfile(os.path.join(src_dir, filename),
                    os.path.join(data_dir, filename))
    
    # RR Lyrae mags
    filename = 'RRLyrae.fit'
    print("copying {0}".format(filename))
    sys.stdout.flush()
    shutil.copyfile(os.path.join(src_dir, filename),
                    os.path.join(data_dir, filename))
    
    # Stripe 82
    filename = 'stripe82calibStars_v2.6.dat.gz'
    print("processing {0}".format(filename))
    sys.stdout.flush()
    archive_file = 'sdss_S82standards.npy'
    DTYPE = [('RA', 'f8'),
             ('DEC', 'f8'),
             ('RArms', 'f4'),
             ('DECrms', 'f4'),
             ('Ntot', 'i4'),
             ('A_r', 'f4')]

    for band in 'ugriz':
        DTYPE += [('Nobs_%s' % band, 'i4')]
        DTYPE += [(s + '_' + band, 'f4')
                  for s in ['mmed', 'mmu', 'msig', 'mrms', 'mchi2']]
    
    # The first column is the string 'CALIBSTARS'.  We'll ignore it by
    # reading only columns 1 through len(DTYPE).
    COLUMNS = list(range(1, len(DTYPE) + 1))
    kwargs = dict(usecols=COLUMNS, dtype=DTYPE)

    data = np.loadtxt(os.path.join(src_dir, filename), **kwargs)
    np.save(os.path.join(data_dir, archive_file), data)
    
    # make sure that it worked
    from astroML.datasets import fetch_rrlyrae_combined
    from astroML.datasets import fetch_sdss_specgals
    X, y = fetch_rrlyrae_combined(download_if_missing=False)
    photoz_gals = fetch_sdss_specgals(download_if_missing=False)
    
    assert(X.shape == (93141, 4))
    assert(photoz_gals.shape == (661598,))
    
    print("Finished!")
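
The Stripe 82 step is the least obvious part of the function above, so here is a small standalone sketch that rebuilds the same list of (name, type) pairs and the matching column indices without reading any file. The helper name build_stripe82_layout is hypothetical and is not part of astroML; the field names are the ones used in the function above.

```python
def build_stripe82_layout():
    """Rebuild the (name, type) pairs and column indices used above.

    Hypothetical helper for illustration only; not part of astroML.
    """
    # Six fields that describe the star as a whole.
    dtype = [('RA', 'f8'), ('DEC', 'f8'),
             ('RArms', 'f4'), ('DECrms', 'f4'),
             ('Ntot', 'i4'), ('A_r', 'f4')]

    # Six more fields for each of the five bands.
    for band in 'ugriz':
        dtype.append(('Nobs_' + band, 'i4'))
        dtype.extend((stat + '_' + band, 'f4')
                     for stat in ['mmed', 'mmu', 'msig', 'mrms', 'mchi2'])

    # Column 0 holds the 'CALIBSTARS' label, so the data start at column 1.
    columns = list(range(1, len(dtype) + 1))
    return dtype, columns


dtype, columns = build_stripe82_layout()
print(len(dtype), columns[0], columns[-1])
print([name for name, _ in dtype[6:12]])
```

The layout has 6 + 5 x 6 = 36 fields, read from columns 1 through 36 of the text file. The second print shows the six u-band fields, which repeat with the suffixes _g, _r, _i and _z.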

To Get the Data Without Killing the Network

  1. Copy the LocalGroupAstrostatistics2015/obs/astroML directory into the directory that contains this notebook.
  2. Run the following code:

In [ ]:
load_LGAStat_data('./astroML/')

This will take a bit of time, because the function reads and reformats some of the data. If the code block runs without raising an error, the data are cached on your system.
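
If you would like a quick check that does not load any data, the sketch below simply looks for the three files that load_LGAStat_data writes. The helper missing_cached_files is hypothetical, and the path ~/astroML_data is an assumption about the default cache location; in the notebook you can pass the value returned by get_data_home() instead.

```python
import os


def missing_cached_files(data_dir, filenames):
    """Return the names in `filenames` that are not regular files in `data_dir`.

    Hypothetical helper for illustration only; not part of astroML.
    """
    return [name for name in filenames
            if not os.path.isfile(os.path.join(data_dir, name))]


# Files that load_LGAStat_data places in the astroML data directory.
EXPECTED_FILES = ['SDSSspecgalsDR8.fit',
                  'RRLyrae.fit',
                  'sdss_S82standards.npy']

# Assumption: the default cache location. Replace this with
# astroML.datasets.get_data_home() when running inside the notebook.
data_dir = os.path.expanduser('~/astroML_data')

missing = missing_cached_files(data_dir, EXPECTED_FILES)
if missing:
    print("not cached yet: {0}".format(", ".join(missing)))
else:
    print("all expected files are cached")
```

This only confirms that the files are present. The shape assertions at the end of load_LGAStat_data remain the real test that the data can be read.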

Now you can run these commands, and astroML will load the cached versions of the datasets we'll use:


In [ ]:
from astroML.datasets import fetch_rrlyrae_combined
X, y = fetch_rrlyrae_combined(download_if_missing=False)

In [ ]:
from astroML.datasets import fetch_sdss_specgals
photoz_gals = fetch_sdss_specgals(download_if_missing=False)

The default value of download_if_missing is True. If you leave it at the default (or set it back to True) and have not already cached the data, the call will start a download of about 200 MB!
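
To illustrate why the flag matters, here is a minimal sketch of the general cache-or-download pattern. It is not astroML's implementation: fetch_cached, its downloader argument, and fake_downloader are hypothetical names invented for this example, and no network access takes place.

```python
import os
import tempfile


def fetch_cached(filename, data_home, download_if_missing=True,
                 downloader=None):
    """Return the path of a cached file, downloading it only if allowed.

    Hypothetical helper that mimics the caching pattern; not astroML code.
    """
    path = os.path.join(data_home, filename)
    if os.path.exists(path):
        # Cache hit: nothing is downloaded.
        return path

    if not download_if_missing:
        raise IOError("{0} is not cached and download_if_missing=False"
                      .format(filename))

    if downloader is None:
        raise ValueError("no downloader supplied")

    # Cache miss: this is the step that would use the network.
    os.makedirs(data_home, exist_ok=True)
    downloader(path)
    return path


def fake_downloader(path):
    """Stand-in for a real download: writes a small placeholder file."""
    with open(path, 'w') as f:
        f.write('placeholder')


cache = tempfile.mkdtemp()

try:
    fetch_cached('example.dat', cache, download_if_missing=False)
except IOError as err:
    print("refused to download:", err)

# With the default flag the file is "downloaded" once and then reused.
fetch_cached('example.dat', cache, downloader=fake_downloader)
fetch_cached('example.dat', cache, download_if_missing=False)
```

Passing download_if_missing=False, as the cells above do, turns a missing cache into an immediate error instead of a large download over the shared network.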