This notebook contains the text, figures, and source code for my Scientific American blog post.
Note that I have hidden much of the code for this post at the end of the notebook. To execute the code this notebook, please start by running the functions defined at the end of the text, then return to the beginning.
The content of this notebook is BSD licensed, ©2014, Jake VanderPlas
One of the largest treasure troves of astronomical data comes from the Sloan Digital Sky Survey (SDSS), an ongoing scan of the firmament that began 15 years ago. Its catalogue covers 35 percent of the sky and contains multicolor observations of hundreds of millions of distinct galaxies, stars and quasars. If a person were to attempt to individually inspect each of these objects at a rate of one per second through the workday, it would be a full-time job lasting over 60 years!
Fortunately, such individual inspection is not how astronomers work. Instead, we use various specialized algorithms to automatically sift through and categorize this vast data set and dream up novel visualization schemes to make clear in a glance the relationships between thousands or millions of individual objects.
One of my favorite examples of this type of visualization involves a subset of the above data, the SDSS Moving Object Catalog, which my colleague Zeljko Ivezic has been instrumental in producing and collating. This catalogue contains detailed data on nearly half a million small asteroids orbiting our sun, allowing us not only to track the orbital path of individual asteroids but also to gain insight into the chemical composition and formation history of individual objects and the solar system as a whole. The following video simulation from Alex Parker gives a glimpse into the orbital characteristics of this data:
In [9]:
from IPython.display import YouTubeVideo
YouTubeVideo('Xo7A-tKITlo')
Out[9]:
The SDSS data gives us much more than just the orbital dynamics. Multiband imaging gives us detailed measurements of the color of reflected sunlight off each of these asteroids. Just as on Earth our eyes can distinguish white quartz from dark basalt based on how they each reflect sunlight, the SDSS telescope can distinguish among different chemical compositions of asteroids based on how their surfaces reflect sunlight.
We can summarize this chemical information with two "color" measurements: color in the optical range and color in the near-infrared range. Combining this with the semimajor axis (a measure of the size of the orbit around the sun) and the inclination (a measure of how "tilted" the orbit is compared with Earth’s orbital plane) gives us a four-dimensional data set: four properties of each asteroid that contain information about its orbit and chemical composition.
With this four-dimensional data set decided, we can now think about how to best visualize it. One-dimensional data fits on a number-line; two-dimensional data can be plotted on a flat page or screen; three-dimensional data can be conceived as a 3-D plot, perhaps rotating on a computer screen; but how do you effectively plot four-dimensional data?
We can start simple, by splitting the data into chemical indicators on one hand and orbital indicators on the other:
In [10]:
plot_basic()
This visualization is full of information. The left panel is known as a color–color diagram and distinguishes between broad classes of asteroid chemistries. The left-most clump in this panel is primarily carbonaceous (C-type) asteroids whereas the right-most clump is primarily silicaceous (S-type) asteroids. The faint downward extension of the right-most clump is V-type asteroids, known to be associated with the asteroid Vesta.
The right panel showing the orbital characteristics offers even more insight. Immediately we see that there is some intriguing structure to the data: clumps and clusters as well as specific regions that are devoid of any asteroids at all. These clumps are known as asteroid families, and the vertical voids near 2.5 astronomical units (AU) and 2.8 AU are areas of orbital resonance with Jupiter: In this particular region of the solar system, these resonance effects quickly kick any asteroids out of their orbits. (An astronomical unit is the mean Earth–sun distance, or about 149.6 million kilometers.)
These are all interesting insights, but what we'd really like is to see intuitively how the chemistry reflected in the left panel is related to the orbital dynamics reflected in the right panel. Such a four-dimensional relationship is very difficult to capture.
One common way to visualize high-dimensional data is to use a grid of multiple two-dimensional plots. In this way we can plot each pair of features against one another and look at the correlations. Of course, the two panels from above are just a subset of the six distinct plots (each with a mirror-image) created by this method:
In [11]:
plot_all_pairs()
This plot conveys a lot of information, and there are some intriguing pieces. For example, in the panel comparing near-infrared color to orbital inclination (top row, second from the left) we see a distinct clump of data: These are points that are clustered both in color and inclination. Further investigation shows this clump reflects the Vesta family, a chemically similar group of asteroids that also shares the same orbital inclination. We'll return to these below.
Another common high-dimensional visualization technique is to treat color as an added dimension. This way a standard two-dimensional plot can reflect three-dimensions of information. Let's try visualizing the four dimensions in this way: We'll do two versions of the orbital inclination plot, using a different color scale in each plot:
In [12]:
plot_3D()
In the left plot the colors of the points correspond to the optical color whereas in the right plot the colors correspond to the near-infrared color. With this enhancement, we're really getting somewhere: The left panel makes clear that the clumps of asteroids in orbital space are generally grouped according to their optical colors—that is, their position on the carbonaceous–silicaceous spectrum. The right plot shows the Vesta group that we pointed out above—the group of asteroids near 2.4 AU with a blue-infrared color.
Let's put these all together. Rather than using two separate color scales to identify these asteroid groups, we can define a single two-dimensional color scale reflecting the asteroid chemistry and use these colors when plotting the same points in orbital space. The result is a plot very similar to the one that appeared in Parker et al., 2008, where this work was first reported:
In [13]:
plot_multicolor()
This final plot offers a full, intuitive view of the relationships between the four measured asteroid characteristics. From this visualization, it becomes clear that asteroid chemistry (reflected in the color of the individual points defined by the left panel) is strongly related to the orbital distribution of asteroids (reflected in the clumping of asteroid families in the right panel). What this shows is that families of asteroids not only orbit near one another in space, but also have largely the same chemical composition!
This observation lends evidence to the theory that asteroid families are formed by collisions of larger bodies. At some time in the past, two larger asteroids likely collided, shattering into hundreds or thousands of smaller bodies. Because these smaller bodies each came from the same source, they will be chemically similar and continue to orbit in the same region for several hundred million years.
As the volume and complexity of data grow, novel visualization techniques like this are an important part of mining large data sets in search of such insights.
Editor’s Note: This post was created using the IPython notebook, an interactive computing environment that aids in producing fully reproducible scientific analysis. The original notebook is available for download on GitHub, and contains all the code and data necessary to create the figures in this post.
About the Author: Jake Vanderplas is an astronomer by training and currently works as a data scientist in the University of Washington's interdisciplinary eScience Institute. You can follow him on Twitter at @jakevdp, see his coding projects on GitHub or see what he's been thinking about lately on his blog, Pythonic Perambulations.
In [1]:
# Below is the code that downloads this data and creates the plots shown above
# If executing the notebook
# Run this first before running the figure generation functions above
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [3]:
from astroML.datasets import fetch_moving_objects
from astroML.plotting.tools import devectorize_axes
from contextlib import contextmanager
def black_bg_subplot(*args, **kwargs):
"""Create a subplot with black background"""
kwargs['axisbg'] = 'k'
ax = plt.subplot(*args, **kwargs)
# set ticks and labels to white
for spine in ax.spines.values():
spine.set_color('w')
for tick in ax.xaxis.get_major_ticks() + ax.yaxis.get_major_ticks():
for child in tick.get_children():
child.set_color('w')
return ax
def compute_color(mag_a, mag_i, mag_z, a_crit=-0.1):
"""
Compute the scatter-plot color using code adapted from
TCL source used in Parker 2008.
"""
# define the base color scalings
R = np.ones_like(mag_i)
G = 0.5 * 10 ** (-2 * (mag_i - mag_z - 0.01))
B = 1.5 * 10 ** (-8 * (mag_a + 0.0))
# enhance green beyond the a_crit cutoff
G += 10. / (1 + np.exp((mag_a - a_crit) / 0.02))
# normalize color of each point to its maximum component
RGB = np.vstack([R, G, B])
RGB /= RGB.max(0)
# return an array of RGB colors, which is shape (n_points, 3)
return RGB.T
def fetch_data(truncate=None):
global _PROCESSED_DATA
if _PROCESSED_DATA is not None:
mag_a, mag_i, mag_z, a, sini, color = _PROCESSED_DATA
else:
rng = np.random.RandomState(0)
#------------------------------------------------------------
# Fetch data and extract the desired quantities
data = fetch_moving_objects(Parker2008_cuts=True)
mag_a = data['mag_a']
mag_i = data['mag_i']
mag_z = data['mag_z']
a = data['aprime']
sini = data['sin_iprime']
# dither: magnitudes are recorded only to +/- 0.01
mag_a += -0.005 + 0.01 * rng.rand(*mag_a.shape)
mag_i += -0.005 + 0.01 * rng.rand(*mag_i.shape)
mag_z += -0.005 + 0.01 * rng.rand(*mag_z.shape)
# compute RGB color based on magnitudes
color = compute_color(mag_a, mag_i, mag_z)
if truncate is not None:
mag_a = mag_a[::truncate]
mag_i = mag_i[::truncate]
mag_z = mag_z[::truncate]
a = a[::truncate]
sini = sini[::truncate]
color = color[::truncate]
return mag_a, mag_i, mag_z, a, sini, color
_PROCESSED_DATA = None
In [4]:
def plot_basic():
mag_a, mag_i, mag_z,a, sini, color = fetch_data()
fig, axx = plt.subplots(1, 2, figsize=(10.5, 5))
fig.subplots_adjust(left=0.08, right=0.95, wspace=0.2,
bottom=0.1, top=0.9)
ax = axx[0]
ax.scatter(mag_a, mag_i - mag_z, s=1, lw=0, c='k')
ax.set_xlim(-0.3, 0.4)
ax.set_ylim(-0.8, 0.6)
#ax.plot([0, 0], [-0.8, 0.6], '--k', lw=2)
#ax.plot([0, 0.4], [-0.15, -0.15], '--k', lw=2)
ax.set_xlabel('Optical Color (a*)')
ax.set_ylabel('Near-IR Color (i - z)')
# plot the orbital parameters plot
ax = axx[1]
ax.scatter(a, sini, c='k', s=1, lw=0)
#ax.plot([2.5, 2.5], [-0.02, 0.3], '--k', lw=2)
#ax.plot([2.82, 2.82], [-0.02, 0.3], '--k', lw=2)
ax.set_xlim(2.0, 3.3)
ax.set_ylim(-0.02, 0.3)
ax.set_xlabel('Semi-major Axis (AU)')
ax.set_ylabel('Sine of Orbital Inclination')
In [5]:
def plot_all_pairs():
mag_a, mag_i, mag_z, a, sini, color = fetch_data()
fig, ax = plt.subplots(4, 4, figsize=(10, 10), sharex='col', sharey='row')
fig.subplots_adjust(hspace=0.05, wspace=0.05)
data = [mag_a, mag_i - mag_z, a, sini]
lims = [(-0.3, 0.4), (-0.8, 0.6), (2.0, 3.3), (-0.02, 0.3)]
labels = ['Optical Color (a*)', 'Near-IR Color (i - z)',
'Semi-major Axis (AU)', 'Sine of Orbital Inclination']
for i in range(4):
for j in range(4):
this_ax = ax[3 - j, i]
this_ax.xaxis.set_major_locator(plt.MaxNLocator(4, prune='upper'))
this_ax.yaxis.set_major_locator(plt.MaxNLocator(4, prune='upper'))
if i == j:
this_ax.axis('off')
this_ax.text(0.5, 0.5, labels[i], ha='center', va='center',
rotation=45, transform=this_ax.transAxes, fontsize=14)
else:
this_ax.scatter(data[i], data[j], c='k', s=1, lw=0, alpha=0.1)
this_ax.set_xlim(lims[i])
this_ax.set_ylim(lims[j])
In [6]:
def plot_3D():
mag_a, mag_i, mag_z, a, sini, color = fetch_data()
fig, axx = plt.subplots(1, 2, figsize=(12, 5), sharey=True,
subplot_kw=dict(axisbg='k'))
fig.subplots_adjust(left=0.08, right=0.95, wspace=0.1,
bottom=0.1, top=0.9)
# plot the color-magnitude plot
ax = axx[0]
scat = ax.scatter(a, sini, c=mag_a, s=1, lw=0, cmap=plt.cm.RdBu_r)
ax.set_xlim(2.0, 3.3)
ax.set_ylim(-0.02, 0.3)
ax.set_xlabel('Semi-major Axis (AU)')
ax.set_ylabel('Sine of Orbital Inclination')
fig.colorbar(scat, ax=ax).set_label('Optical Color (a*)')
# plot the orbital parameters plot
ax = axx[1]
scat = ax.scatter(a, sini, c=mag_i - mag_z, s=1, lw=0, cmap=plt.cm.RdBu_r)
ax.set_xlim(2.0, 3.3)
ax.set_ylim(-0.02, 0.3)
ax.set_xlabel('Semi-major Axis (AU)')
fig.colorbar(scat, ax=ax).set_label('Near-IR Color (i - z)')
In [7]:
def plot_multicolor():
mag_a, mag_i, mag_z,a, sini, color = fetch_data()
#------------------------------------------------------------
# set up the plot
fig = plt.figure(figsize=(10.5, 5), facecolor='k')
fig.subplots_adjust(left=0.08, right=0.95, wspace=0.2,
bottom=0.1, top=0.9)
# plot the color-magnitude plot
ax = black_bg_subplot(121)
ax.scatter(mag_a, mag_i - mag_z,
c=color, s=1, lw=0, edgecolors=color)
ax.plot([0, 0], [-0.8, 0.6], '--w', lw=2)
ax.plot([0, 0.4], [-0.15, -0.15], '--w', lw=2)
ax.set_xlim(-0.3, 0.4)
ax.set_ylim(-0.8, 0.6)
ax.set_xlabel('Optical Color (a*)', color='w')
ax.set_ylabel('Near-IR Color (i - z)', color='w')
# plot the orbital parameters plot
ax = black_bg_subplot(122)
ax.scatter(a, sini,
c=color, s=1, lw=0, edgecolors=color)
ax.plot([2.5, 2.5], [-0.02, 0.3], '--w')
ax.plot([2.82, 2.82], [-0.02, 0.3], '--w')
ax.set_xlim(2.0, 3.3)
ax.set_ylim(-0.02, 0.3)
ax.set_xlabel('Semi-major Axis (AU)', color='w')
ax.set_ylabel('Sine of Orbital Inclination', color='w')
# label the plot
text_kwargs = dict(color='w', fontsize=14,
transform=plt.gca().transAxes,
ha='center', va='bottom')
ax.text(0.25, 1.01, 'Inner', **text_kwargs)
ax.text(0.53, 1.01, 'Mid', **text_kwargs)
ax.text(0.83, 1.01, 'Outer', **text_kwargs)