Often, you will be given data in the form of an ASCII table. Each row of the table represents an object (a star, a galaxy, or a particle in a simulation, for example), and the columns give different properties of that object, such as its ID, brightness, mass, or distance. Before we can select different samples of objects to study, we need to read in this data. There are many ways to do this in Python: numpy, pandas, and astropy are commonly used packages.
Let's first import these packages. Again, use CTRL-ENTER to execute each cell.
In [ ]:
import numpy as np
import pandas as pd
from astropy.io import ascii
Let's start with an example of reading data into a Pandas DataFrame. Just as a numpy array is like a python list with extra super-powers, you can think of a Pandas DataFrame as an augmented numpy array with extra member functions to help you work with the data.
The data we will be using in this notebook should have been downloaded with the rest of the package into a "data" folder, so that is the location we use for the file in the next command. If the file is somewhere else (e.g., your Downloads folder) or missing, you'll get an error.
In [ ]:
data1=pd.read_csv('data/SNIa_DM.dat',delim_whitespace=True, skiprows=4)
Note that we had to tell Pandas to skip the first 4 rows (comments) and use white-space (spaces and tabs) to delimit the columns.
We can now look at the properties of data1, such as its dimensions:
In [ ]:
print(data1.shape)
There are 277 rows and 5 columns. What are the names of the columns?
In [ ]:
print(data1.columns.values)
And what if we want to peek at all of the properties of the first 5 SNe? Pandas DataFrames act just like arrays in this respect. Use the numpy array slicing you learned in our last tutorial to print out only the first 5 rows of data1.
In [ ]:
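One possible solution (slicing works just like it does for a numpy array; the head method is an equivalent shortcut):
In [ ]:
# slice the first 5 rows, just as with a numpy array
print(data1[0:5])
# the built-in head() method does the same thing
print(data1.head(5))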
If you want to look at the entries for a specific row, or use something more complicated than a slice (e.g., an index array), you need to use the iloc attribute of a Pandas DataFrame.
In [ ]:
print(data1.iloc[5])
We can also just select specific columns by using the names you found above.
In [ ]:
print(data1['DM'])
Now see if you can print out just the first 5 rows of the 'DM' column. Try other combinations of column names and row slices.
In [ ]:
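One possible solution, combining the column name with a row slice:
In [ ]:
print(data1['DM'][0:5])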
Note that the index is always shown here. If we want just the raw data as a numpy array, we do the following:
In [ ]:
print(data1['DM'].values)
A nice feature of pandas dataframes is that you can use a short-hand notation for specifying columns as member data, as long as the label of the column doesn't confuse python. So I can access the 'DM' column this way:
In [ ]:
print (data1.DM[0:5])
Try to do the same thing with the '+/-' column. You should get an error. That's because +, /, and - are all numerical operators in python, so you can't use them as part of member variable names. The same goes for a column whose name starts with a number or a hash (#), etc. To access those columns, you need to fall back on the string index:
In [ ]:
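One possible solution, assuming the error column really is labelled '+/-' as described above:
In [ ]:
# the string index works even when the column name is not a valid Python identifier
print(data1['+/-'][0:5])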
Next, we do a similar exercise, reading the same data into an astropy Table:
In [ ]:
data2=ascii.read('data/SNIa_DM.dat')
print(data2.colnames)
print(data2[:5])
print(data2['DM'][:5])
Note that astropy.io.ascii requires fewer arguments (it's a little smarter about comments and delimiters), but astropy tables have a bit less functionality. Luckily, the astropy developers recognise that Pandas is pretty awesome, so you can convert an Astropy Table object to a Pandas DataFrame object! (You can also convert Pandas DataFrames to Astropy Tables, as shown below.)
In [ ]:
data3=data2.to_pandas()
print(data3[:5])
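The reverse conversion mentioned above is also available; a minimal sketch using astropy's Table.from_pandas:
In [ ]:
from astropy.table import Table
data2b = Table.from_pandas(data3)   # convert the Pandas DataFrame back into an Astropy Table
print(data2b[:5])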
The astropy.io module also has the ability to read FITS images and tables, and it understands far more data formats common to astronomers than pandas does. So a good strategy is to use astropy.io to read in the data and then convert it to a pandas dataframe. Or, you might find that astropy Tables are sufficient for your needs.
Something we do all the time is take one table of data and merge it with another table, based on the values in one or more columns that match. Here, pandas has a very handy method we can use. As an example, we have another data file, data/SN_HostMass.dat, that contains galaxy host masses for some of the supernovae in the data/SNIa_DM.dat file. Load up this data, call it data4, and print out its shape and its first five rows.
In [ ]:
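One possible solution; the skiprows value below is an assumption, so adjust it to match the number of comment lines actually present in data/SN_HostMass.dat:
In [ ]:
data4=pd.read_csv('data/SN_HostMass.dat',delim_whitespace=True,skiprows=4)  # skiprows is a guess
print(data4.shape)
print(data4[0:5])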
The simplest thing we can do is merge the two tables using the supernova name (column SN) to do the matching:
In [ ]:
merged = pd.merge(data3,data4, on='SN')
print (merged[0:7])
print (merged.shape)
This creates a table with the combined columns. Note that the result has fewer rows than data3, but more rows than data4. In cases where a SN is missing from either table, that row is dropped from the final table. In cases where multiple lines match, you get all combinations (notice that SN2004ef shows up four times because it is repeated in both tables). This is called an "inner join". If you want to include rows with missing data, you can specify the how argument:
how = "inner": (default) only include rows that match in both input tables
how = "left": include all rows from the first table, but with possible missing data
how = "right": include all rows from the second table, but with possible missing data
how = "outer": include all rows from both tables, but with possible missing data
If there is a one-to-one correspondence between the rows in both tables, then these are all equivalent. For more information about how to combine tables, consult the Pandas documentation. Try using a different how argument in the command above and see what happens.
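For example, a left join keeps every supernova in data3 and fills in NaN where no host mass is available (a sketch of one possibility):
In [ ]:
merged_left = pd.merge(data3,data4, on='SN', how='left')
print(merged_left.shape)   # at least as many rows as data3, with NaN host masses where there was no match
print(merged_left[0:7])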
Pandas can also write a DataFrame back out to a text file, using the to_csv method:
In [ ]:
data3.to_csv('data/output.dat',index=False,sep=' ')
# the 'index' keyword gives you the option of also printing the Pandas index
# the 'sep' keyword specifies the delimiter between columns
Most of the time you will be dealing with subsets of your data set: e.g., galaxies at a certain redshift, stars at a particular distance or brightness, or data with the outliers removed. We therefore need to select the desired sample using conditionals. We'll use a Pandas DataFrame as an example. Here, we create a pandas Series called ix that contains a True/False value for each row in data1 based on a condition: True if the redshift is less than 0.02, False otherwise. This boolean array, if used as an index, will pick out only those rows that are True. This is often referred to as creating a 'mask'.
In [ ]:
ix = (data1.zcmb < 0.02)
print(data1[ix][0:5])
Try selecting the data from each survey and computing its median redshift. Hint: dataframes and their columns have a median function.
In [ ]:
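One possible solution, assuming the survey column is literally named 'survey' (check data1.columns.values for the actual name):
In [ ]:
# loop over the unique survey identifiers and compute each median redshift
for s in data1['survey'].unique():      # 'survey' is an assumed column name
    ix = (data1['survey'] == s)
    print(s, data1['zcmb'][ix].median())
# or, more compactly, with groupby:
print(data1.groupby('survey')['zcmb'].median())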
The most popular plotting package in Python is matplotlib, but there are several others one might want to explore. Here we will show you how to make basic plots in matplotlib; the matplotlib documentation (in particular its gallery of examples) is a useful reference.
First, let's invoke an iPython magic command (i.e., one beginning with a %) so that the plots that are made will show up in this notebook.
In [ ]:
%matplotlib inline
pyplot is the main plotting interface in matplotlib; it is commonly imported on its own as plt.
In [ ]:
import matplotlib.pyplot as plt
Now let's make some fake data to play with and plot. Create an array x from 0 to 9 inclusive (this is just a suggestion, you can make whatever x you want). Now make a new array y that is the square of x (or whatever function you like).
In [ ]:
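For example:
In [ ]:
x=np.arange(10)     # 0 through 9 inclusive
y=x**2              # or any other function you like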
Armed with this data, this is probably the quickest way to make a figure:
In [ ]:
plt.plot(x,y,'bo')
# The string 'bo' above indicates that the points will be blue(b) circles(o)
Note that matplotlib does a lot of things automatically for you, like setting the limits on the axes as well as the interval between major tick marks. These are all things that you can adjust manually with more code.
Below is the long form for getting the same symbols as above. By including more code, you can start to tinker with different aspects of the plot. Many of matplotlib's optional arguments have a long and a short form (e.g., you can specify linestyle='-' or the shorter ls='-'). Try different linestyles: '--', '-.'. Try different symbols: 's', 'd', '*'.
In [ ]:
plt.plot(x,y,color='blue',marker='o',ms=6,ls='None')
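For instance, the same data with square markers joined by a dashed line:
In [ ]:
plt.plot(x,y,color='blue',marker='s',ms=6,ls='--')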
Yet another way to make the same kind of plot, using the scatter function:
In [ ]:
plt.scatter(x,y,s=10*np.sqrt(y),c=np.log10(x),edgecolors='black')
The above examples are the quickest, easiest ways to produce a figure in matplotlib and are great for quick-and-dirty data exploration. However, when it comes time to make "publication quality" graphs, you'll find that including a few more lines of code makes things easier down the road, as it provides more functionality. Below, we invoke the subplots function, which returns figure and axis objects. We can also play around with the colors, symbols, and line styles. This way of using matplotlib is more pythonic; the previous way is more akin to using MATLAB, which the developers intended to feel more intuitive to users of that language. Most examples use the pythonic methods.
In [ ]:
fig, ax = plt.subplots()
# We then use ax to do our plotting
ax.plot(x,y,color='royalblue',marker='*',ms=15,ls='None')
ax.plot(x,y,color='red',ls='-',lw=2,alpha=1)
ax.invert_yaxis()
So this produces the same kind of plot as above, but now we can use ax to play with the axes:
In [ ]:
fig, ax = plt.subplots()
ax.set_ylim((90,0)) # set limits on y-axis
ax.set_xlabel('x') # set label for x-axis
ax.set_ylabel('y') # set label for y-axis
ax.plot(x,y,color='mediumseagreen',marker='s',ms=15,ls='None')
ax.plot(x,y,color='darkorange',ls='--',lw=2)
Often, you'll want an axis to be on a logarithmic scale:
In [ ]:
fig, ax = plt.subplots()
ax.set_yscale('log') # set y-axis to be in log
ax.set_xlabel('x',fontsize=15)
ax.set_ylabel('y',fontsize=15)
ax.plot(x,y,color='gold',mec='mediumvioletred',mew=2,marker='p',ms=25,ls='None')
ax.plot(x,y,color='mediumvioletred',ls='-.',lw=2)
In [ ]:
# two subplots, side by side, in one figure:
fig, axarr = plt.subplots(1,2,figsize=(10,5))
# axarr is an array of axis objects, with each element representing one subplot
# first subplot
ax=axarr[0]
ax.plot(x,y,color='blue',marker='s',ms=15,ls='None')
# second subplot
ax=axarr[1]
ax.plot(x,y,color='red',marker='*',ms=15,ls='None')
# change title on first subplot
ax=axarr[0]
ax.set_xlabel('x',fontsize=16)
Using subplots, which returns an array of axis objects, is useful for going back and forth between different panels. Note how we can go back to axarr[0] at the end and change the label on the x axis.
If you have a bunch of images or the same type of figure for multiple objects, it helps to make a giant grid of subplots. And rather than manually declaring a new subplot each time, it helps to automate the process with for loop(s).
In [ ]:
nrow=2
ncol=3
fig, axarr = plt.subplots(nrow,ncol,figsize=(9,6))
for ii in range(nrow):
    for jj in range(ncol):
        ax=axarr[ii,jj]
        if ii==nrow-1: ax.set_yscale('log')
        ax.set_xlabel('x', fontsize=20)
        ax.set_ylabel('y', fontsize=20)
        ax.plot(x,y,color='blue',marker='s',ms=5,ls='None')
        ax.plot(x,y,color='red',ls='--',lw=2)
fig.tight_layout()
subplots has many useful features, like the sharex and sharey keywords, which allow you to declare that the axes of each subplot have the same scale. Invoking subplots_adjust can then optionally allow you to squish the subplots together.
In [ ]:
nrow=2
ncol=3
fig, axarr = plt.subplots(nrow,ncol,figsize=(9,6),sharex=True,sharey=True)
plt.subplots_adjust(hspace=0,wspace=0)
for ii in range(nrow):
    for jj in range(ncol):
        ax=axarr[ii,jj]
        if ii==nrow-1:
            ax.set_xlabel('x',fontsize=16)
        if jj==0:
            ax.set_ylabel('y',fontsize=16)
        ax.plot(x,y,color='blue',marker='s',ms=5,ls='None')
        ax.plot(x,y,color='red',ls='--',lw=2)
axarr[1,2].set_ylim(0,20)
You can label each data set as you plot it and then display a legend, and you can place text (including LaTeX-style math) anywhere on the figure:
In [ ]:
x=np.arange(10)+1.
y1=x**2
y2=np.sqrt(x)
fig, ax = plt.subplots()
ax.set_yscale('log')
# The `label` keyword below is used to specify the label for the particular data set
ax.plot(x,y1,color='blue',mec='orange',mew=2,marker='*',ms=20,ls='None',label='$y=x^2$')
ax.plot(x,y2,color='red',marker='s',ms=15,ls='None',label=r'$y=\sqrt{x}$')
ax.text(6,10,'$y=x^{n}$',fontsize=24)
ax.legend(loc='upper left',fontsize=16,numpoints=1)
Error bars:
In [ ]:
# first, generate some fake data
x=np.arange(10)+1.
y=x**2
xerr=np.zeros(10)+0.5 # fixed error in x
yerr=np.sqrt(y) # Poisson error in y
plt.errorbar(x,y,xerr=xerr,yerr=yerr,marker='o',ls='None')
Histogram of Gaussians:
In [ ]:
import numpy.random as npr
x1=npr.randn(10000) # mean=0, std=1.0
x2=npr.randn(10000)*0.5+1. # mean=1, std=0.5
bins=np.linspace(-5.,5.,21)
info1 = plt.hist(x1,bins=bins, color='red',lw=1, histtype='step')
info2 = plt.hist(x2,bins=bins, color='blue',lw=1, histtype='step')
Display an image from a FITS file. First download this file: im3433.fits
In [ ]:
# Display an image from a FITS file
from astropy.io import fits
im3433=fits.open('data/im3433.fits')
im=im3433[0].data # the first extension (i.e., index 0), contains the image data
fig, ax = plt.subplots(figsize=(10,10))
ax.imshow(im,interpolation='none',origin='lower',cmap='gray')
ax.get_xaxis().set_visible(False) # comment these two lines to see what happens
ax.get_yaxis().set_visible(False)
Here's another example using some supernova data, which will have been downloaded as part of our github package. We will plot these two images side-by-side. We specify vmin=-40 and vmax=40 in the imshow() function, which will set appropriate limits on the color map (there are saturated pixels that will cause the image to be washed out otherwise). We also choose the reverse color map (gray_r). Question: where's the supernova?
In [ ]:
im1 = fits.open('data/SN2011iv_B_SWO_DC_2011_12_11SN.fits')
im2 = fits.open('data/SN2011iv_B_template.fits')
fig,ax = plt.subplots(1,2, figsize=(15,8))
ax[0].imshow(im1[0].data, vmin=-40,vmax=40, cmap='gray_r')
ax[1].imshow(im2[0].data, vmin=-40, vmax=40, cmap='gray_r')
These two images are from different epochs. So if you subtract one from the other and plot out the results, the supernova (and anything else that changed) should stand out. Try this. There's a surprise.
In [ ]:
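A minimal sketch of one way to do this, assuming the two images are registered to the same pixel grid:
In [ ]:
diff = im1[0].data - im2[0].data      # subtract the template from the science image
fig, ax = plt.subplots(figsize=(8,8))
ax.imshow(diff, vmin=-40, vmax=40, cmap='gray_r')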
To save a figure to a file (e.g., a PDF you can include in a paper) rather than just displaying it in the notebook, use fig.savefig:
In [ ]:
x=np.arange(10)+1.
y1=x**2
y2=np.sqrt(x)
fig, ax = plt.subplots()
ax.set_yscale('log')
ax.plot(x,y1,color='blue',marker='*',ms=15,ls='None',label='$y=x^2$')
ax.plot(x,y2,color='red',marker='s',ms=15,ls='None',label=r'$y=\sqrt{x}$')
ax.text(6,10,'Hi there!')
ax.legend(loc='upper left',fontsize=16,numpoints=1)
# Note these two lines
fig.savefig('example.pdf',format='pdf')
plt.close()
Now that you've learned the basics of Python and its plotting package, matplotlib, download one of the data sets provided, make some figures, and tell us what you see. For example, the supernova data in data/SNIa_DM.dat contains a redshift (zcmb), a distance modulus (DM), its error (eDM), and a survey number; try making a Hubble diagram (DM on the y-axis, zcmb on the x-axis) with errorbars.
There's tons more "out there" for helping you visualize your data. In the folder you downloaded, there is another ipython notebook called Skyfit.ipynb, which shows you an example of using one such package: Bokeh. We encourage you to have a look at it, as it may give you ideas for handling your own summer research project. The math is pretty dense, but the real point is to see how data can be manipulated, fit, and visualized.
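As a starting point for that Hubble diagram, here is a minimal sketch. The name of the error column is an assumption (it appears as '+/-' in the header read earlier, but your file may call it eDM); check data1.columns.values and adjust accordingly.
In [ ]:
data1 = pd.read_csv('data/SNIa_DM.dat',delim_whitespace=True,skiprows=4)
err_col = '+/-'     # assumed column name for the DM uncertainty; could be 'eDM' instead
fig, ax = plt.subplots()
ax.errorbar(data1['zcmb'],data1['DM'],yerr=data1[err_col],fmt='o',ms=4)
ax.set_xlabel('zcmb',fontsize=15)
ax.set_ylabel('DM',fontsize=15)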