Introduction: Data --> Knowledge

Some definitions:

  • Data Mining -- "unsupervised learning"

    • looking for and describing patterns in large data sets
    • "knowledge discovery"
    • description of the data itself
  • Machine Learning -- "supervised learning"

    • "fitting"
    • interpreting those data with respect to models
    • (regression, classification, maximum likelihood, Bayesian)

Stats Terminology:

  • set of individual measurements: $x_i$ (where $i = 1, \ldots, N$)

  • True Distribution

    $h(x)$ - function that generates x
    $h(x)dx\equiv$ probability distribution function (population pdf)

    $H(x)=\int_{-\infty}^{x}h(x')dx'$ cumulative distribution function

  • Empirical Distribution

    $f(x)$ - distribution derived from the measured $x_i$
    $f(x)dx\equiv$ empirical probability distribution function (empirical pdf)

    $F(x)=\int_{-\infty}^{x}f(x')dx'$ cumulative empirical distribution function

    (Normalized such that $H(\infty)=F(\infty)=1$)

  • Since data sets are never infinitely large (and well sampled) $f(x)\neq h(x)$:

    $f(x)$ is a model of $h(x)$

  • Errors (associated with measurement $x_i$)

    $e(x)=p(x|\mu,I)$ - $\mu$: true value, $I$: describes the error distribution

    for a Gaussian:

    $p(x|\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)}$

  • Ramble about broad $f(x)$:

    could be due to measurement errors (a larger sample will lead to a better derivation of $h(x)$), or it could be due to an intrinsically broad $h(x)$
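To make the terminology concrete, here is a minimal sketch using NumPy. It assumes a Gaussian true distribution $h(x)$ and Gaussian measurement errors; the values of `mu`, `sigma_h`, `sigma_e`, and `N` are illustrative choices, not numbers from the text. It compares the empirical cumulative distribution $F(x)$ with the true $H(x)$, then shows how measurement errors broaden the observed distribution.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(42)

# Assumed true distribution h(x): a Gaussian (illustrative parameters)
mu, sigma_h = 0.0, 1.0
N = 1000
x_true = rng.normal(mu, sigma_h, size=N)  # N draws from h(x)


def empirical_cdf(samples, grid):
    """F(x): fraction of samples less than or equal to each grid value."""
    s = np.sort(samples)
    return np.searchsorted(s, grid, side="right") / s.size


def gaussian_cdf(grid, mu, sigma):
    """H(x) for a Gaussian h(x), written with the error function."""
    return np.array(
        [0.5 * (1.0 + erf((g - mu) / (sigma * sqrt(2.0)))) for g in grid]
    )


grid = np.linspace(-4.0, 4.0, 201)
F = empirical_cdf(x_true, grid)
H = gaussian_cdf(grid, mu, sigma_h)

# With finite N, F(x) only approximates H(x)
max_diff = np.max(np.abs(F - H))
print("largest |F - H| on the grid:", max_diff)

# Add Gaussian measurement errors of width sigma_e to every x_i
sigma_e = 2.0
x_obs = x_true + rng.normal(0.0, sigma_e, size=N)

# The observed width is close to the quadrature sum of the two widths
print("std without errors:", x_true.std())
print("std with errors:   ", x_obs.std())
print("sqrt(sigma_h^2 + sigma_e^2):", np.hypot(sigma_h, sigma_e))
```

Increasing `N` shrinks the gap between $F$ and $H$, but it does not remove the broadening caused by `sigma_e`; the width alone cannot tell you whether a broad $f(x)$ comes from the errors or from $h(x)$ itself.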

Data used in the text:

AstroML has "fetching functions" to download all of the datasets. To see a list in an IPython terminal, one could type:

In [ ]: from astroML.datasets import [TAB]

and it would list the options (in an IPython notebook this comes up as a scrolling list)
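Tab completion only works interactively. As a rough alternative for scripts, the same list can be built by inspecting the module. The helper name `list_fetchers` is hypothetical (it is not part of astroML), and the sketch assumes the fetching functions all start with `fetch_`, as the imports in these notes do.

```python
def list_fetchers(module):
    """Return the sorted names of callables in `module` that start with 'fetch_'."""
    return sorted(
        name
        for name in dir(module)
        if name.startswith("fetch_") and callable(getattr(module, name))
    )


# Only list the astroML fetchers if astroML is actually installed
try:
    import astroML.datasets as datasets
except ImportError:
    datasets = None

if datasets is not None:
    for name in list_fetchers(datasets):
        print(name)
```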

  • SDSS:

    • SDSS imaging (p.16)

      examples from the text (fetching the imaging data for 330753 objects)

I'm going to start writing code, so I'll include the following command so plots show up in the window:


In [12]:
%matplotlib inline

In [13]:
from astroML.datasets import fetch_imaging_sample
data = fetch_imaging_sample()

# determine the shape (size) of the downloaded data
data.shape


Out[13]:
(330753,)

In [15]:
# The code below lists the field names and prints the first five positions (RA, Dec)
print(data.dtype.names)
print(data['ra'][:5], data['dec'][:5])


('ra', 'dec', 'run', 'rExtSFD', 'uRaw', 'gRaw', 'rRaw', 'iRaw', 'zRaw', 'uErr', 'gErr', 'rErr', 'iErr', 'zErr', 'uRawPSF', 'gRawPSF', 'rRawPSF', 'iRawPSF', 'zRawPSF', 'upsfErr', 'gpsfErr', 'rpsfErr', 'ipsfErr', 'zpsfErr', 'type', 'ISOLATED')
[ 0.358174  0.358382  0.357898  0.35791   0.358881] [-0.508718 -0.551157 -0.570892 -0.426526 -0.505625]
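The returned object is a NumPy structured array, so columns are selected by field name and rows by boolean masks. The sketch below uses a small synthetic stand-in (the values are made up for illustration) that borrows a few of the field names printed above, so it runs without downloading anything. The helper `color_gr` and the magnitude cut at 20 are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in with a few of the field names listed above
# (made-up values, NOT real SDSS measurements)
sample = np.zeros(
    4, dtype=[("ra", "f8"), ("dec", "f8"), ("gRaw", "f8"), ("rRaw", "f8")]
)
sample["ra"] = [0.3582, 0.3584, 0.3579, 0.3589]
sample["dec"] = [-0.5087, -0.5512, -0.5709, -0.4265]
sample["gRaw"] = [18.5, 21.0, 19.2, 22.3]
sample["rRaw"] = [18.0, 20.4, 19.0, 21.1]


def color_gr(data):
    """g - r color computed from the gRaw and rRaw fields."""
    return data["gRaw"] - data["rRaw"]


# Column arithmetic: one color per object
gr = color_gr(sample)
print(gr)

# Row selection with a boolean mask: keep objects brighter than r = 20
bright = sample[sample["rRaw"] < 20.0]
print(bright.shape, bright["ra"])
```

The same indexing should carry over to the array returned by `fetch_imaging_sample()`, since it exposes the same field names.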

SDSS Spectroscopy: (p.19)

from astroML.datasets import fetch_sdss_spectrum

  • Galaxies (p.21)

    from astroML.datasets import fetch_sdss_specgals

  • DR7 Quasar Catalog (p.23)

    from astroML.datasets import fetch_dr7_quasar

  • SEGUE stellar parameters pipeline (SSPP) (p.25)

    from astroML.datasets import fetch_sdss_sspp

  • SDSS standard stars from STRIPE 82 (p.26)

    from astroML.datasets import fetch_sdss_S82standards

  • SDSS moving object catalog (p.30)

    from astroML.datasets import fetch_moving_objects

LINEAR stellar light curves: (p.27)

from astroML.datasets import fetch_LINEAR_sample

Plotting

  • Tufte book (The Visual Display of Quantitative Information). I have a copy - come look at it if you haven't seen it before!

    • many numbers in a small space
    • make large datasets coherent
    • reveal the data at several levels of detail
    • encourage the eye to compare different pieces of data
  • 2D plotting:

    • easiest, but can be difficult with LARGE datasets (in dense spaces)
  • 3+D plotting:

    • can use colors/density contours or possibly find helpful projections
    • Spherical projections: in Matplotlib, HEALPix (Hierarchical Equal Area isoLatitude Pixelization)
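One common way around overplotting in dense 2D spaces is to bin the points and show counts per pixel, often called a Hess diagram. The sketch below is a minimal version using `np.histogram2d`. The data are synthetic Gaussian draws standing in for two measured quantities, and the helper names `hess_counts` and `plot_hess` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a large catalog: two correlated quantities
n = 200_000
x = rng.normal(0.5, 0.3, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)


def hess_counts(x, y, bins=100):
    """Bin the points on a regular grid; return counts and the bin edges."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    return counts, xedges, yedges


def plot_hess(counts, xedges, yedges):
    """Show log10(counts) as an image; empty pixels are left blank."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Replace empty pixels with NaN before the log so they are not drawn
    img = np.log10(np.where(counts > 0, counts, np.nan))
    im = ax.imshow(
        img.T,  # histogram2d puts x along the first axis, imshow along the second
        origin="lower",
        extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
        aspect="auto",
    )
    cb = fig.colorbar(im, ax=ax)
    cb.set_label("log10(count per pixel)")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig


counts, xedges, yedges = hess_counts(x, y)
print(counts.shape, int(counts.sum()))
```

Calling `plot_hess(counts, xedges, yedges)` in a notebook with `%matplotlib inline` draws the figure. The logarithmic scale keeps the sparse outskirts visible next to the dense core, which a plain scatter plot of 200,000 points would hide.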

In [36]:
# I'm including these partly to demonstrate that you can pull examples straight from the textbook using their URLs!

from IPython.display import Image
from IPython.display import display

simple_plot = Image(url='http://www.astroml.org/_images/fig_sdss_S82standards_1.png',width=300)
contour_plot = Image(url='http://www.astroml.org/_images/fig_S82_scatter_contour_1.png',width=300)
hist_plot = Image(url='http://www.astroml.org/_images/fig_S82_hess_1.png',width=300)

display(simple_plot,contour_plot,hist_plot)


