astroML includes commands for loading various datasets. The first time such a command is run, it fetches the data from the web, does some preprocessing, and caches the result on disk.
To avoid inadvertently bringing down the network, we need to do a bit of prep to import the files from the memory stick provided by the conference organizers. The following function will do this, provided you specify the correct path to the memory stick:
In [ ]:
import os
import sys
import shutil
import numpy as np
from astroML.datasets import get_data_home
def load_LGAStat_data(src_dir):
    # directory where astroML caches its datasets
    data_dir = get_data_home()

    # fail early if the memory-stick path is wrong
    assert os.path.exists(src_dir)
    assert os.path.isdir(src_dir)

    # specgals
    filename = 'SDSSspecgalsDR8.fit'
    print("copying {0}".format(filename))
    sys.stdout.flush()
    shutil.copyfile(os.path.join(src_dir, filename),
                    os.path.join(data_dir, filename))

    # RR Lyrae mags
    filename = 'RRLyrae.fit'
    print("copying {0}".format(filename))
    sys.stdout.flush()
    shutil.copyfile(os.path.join(src_dir, filename),
                    os.path.join(data_dir, filename))

    # Stripe 82
    filename = 'stripe82calibStars_v2.6.dat.gz'
    print("processing {0}".format(filename))
    sys.stdout.flush()
    archive_file = 'sdss_S82standards.npy'
    DTYPE = [('RA', 'f8'),
             ('DEC', 'f8'),
             ('RArms', 'f4'),
             ('DECrms', 'f4'),
             ('Ntot', 'i4'),
             ('A_r', 'f4')]
    # per-band columns: number of observations plus five magnitude statistics
    for band in 'ugriz':
        DTYPE += [('Nobs_%s' % band, 'i4')]
        DTYPE += [(s + '_' + band, 'f4')
                  for s in ['mmed', 'mmu', 'msig', 'mrms', 'mchi2']]

    # first column is 'CALIBSTARS'. We'll ignore this.
    COLUMNS = list(range(1, len(DTYPE) + 1))
    kwargs = dict(usecols=COLUMNS, dtype=DTYPE)
    data = np.loadtxt(os.path.join(src_dir, filename), **kwargs)
    np.save(os.path.join(data_dir, archive_file), data)

    # make sure that it worked
    from astroML.datasets import fetch_rrlyrae_combined
    from astroML.datasets import fetch_sdss_specgals
    X, y = fetch_rrlyrae_combined(download_if_missing=False)
    photoz_gals = fetch_sdss_specgals(download_if_missing=False)
    assert X.shape == (93141, 4)
    assert photoz_gals.shape == (661598,)
    print("Finished!")
In [ ]:
load_LGAStat_data('./astroML/')
This will take a bit of time, because it is reading and re-formatting some of the binary data. If that code block runs without raising an error, then you have the data cached on your system.
Now you can run these commands, and astroML will load the cached versions of the datasets we'll use:
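As an extra sanity check, you can confirm that the three files written by the function are present in the cache directory. The following is a minimal sketch: `missing_files` is a hypothetical helper (not part of astroML), and it assumes the files sit directly in the data directory, as they do in the function above.

```python
import os

# File names written into the cache by load_LGAStat_data
EXPECTED_FILES = ['SDSSspecgalsDR8.fit',
                  'RRLyrae.fit',
                  'sdss_S82standards.npy']


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the names in `expected` that are not files in `data_dir`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]
```

With astroML installed, `missing_files(get_data_home())` should return an empty list once the import has succeeded.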
In [ ]:
from astroML.datasets import fetch_rrlyrae_combined
X, y = fetch_rrlyrae_combined(download_if_missing=False)
In [ ]:
from astroML.datasets import fetch_sdss_specgals
photoz_gals = fetch_sdss_specgals(download_if_missing=False)
The default value of download_if_missing is True; if you switch it back to True (or leave it out) and have not already cached the data, the call will start a ~200MB download!
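To make the role of this flag concrete, here is a minimal sketch of the fetch-and-cache pattern described at the top of this section. It is not astroML's actual implementation: `fetch_cached` and its `download` argument are hypothetical names used only for illustration.

```python
import os


def fetch_cached(filename, data_home, download, download_if_missing=True):
    """Return the path to a cached file, downloading it only if allowed.

    `download` is a hypothetical callable that writes the file to the
    path it is given; it stands in for the real network fetch.
    """
    path = os.path.join(data_home, filename)
    if os.path.exists(path):
        # cache hit: no network access at all
        return path
    if not download_if_missing:
        # cache miss, and we were asked not to touch the network
        raise IOError("data not present on disk; "
                      "set download_if_missing=True to download it")
    # cache miss: fetch once, then reuse the file on later calls
    download(path)
    return path
```

In this sketch, passing `download_if_missing=False` turns a missing file into an immediate error rather than a silent download, which is the behavior you want while sharing a network with a room full of people.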