This notebook is based on the 2016 AAS Python Workshop tutorial on tables, available on [GitHub](https://github.com/spacetelescope/AAS2016), though it has been modified. Some of the pandas stuff was borrowed from a notebook put together by [Jake Vanderplas](http://www.vanderplas.com) and has been modified to suit the purposes of this course, including expansion/modification of explanations and additional exercises. Source and license info for the original is on [GitHub](https://github.com/jakevdp/2014_fall_ASTR599/)
The astropy Table class provides an extension of NumPy structured arrays for storing and manipulating heterogeneous tables of data. A few notable features of this package are:
The Pandas package provides a powerful, high-performance table object via the DataFrame class. Pandas has a few downsides, including its lack of support for multidimensional table columns, but Pandas is the most widely used Python tables package, and so we will use it here as well. Pandas DataFrame functionality is very complementary to astropy Tables, so astropy 1.1 and later provides interfaces for converting between astropy Tables and DataFrames. If you wish to learn more about Pandas, there are many resources available online. A good starting point is the main tutorials site at http://pandas.pydata.org/pandas-docs/stable/tutorials.html.
For more information about the features presented below, you can read the astropy.table docs.
In [ ]:
from astropy.table import Table
import numpy as np
import matplotlib
matplotlib.use('nbagg') # interactive backend; note that %matplotlib inline below switches to static inline plots
import matplotlib.pyplot as plt
%matplotlib inline
There is a great deal of flexibility in the way that a table can be initially constructed:
From an existing data structure in memory:
See the documentation section on Constructing a table for the gory details and plenty of examples.
In [ ]:
t = Table()
t['name'] = ['larry', 'curly', 'moe', 'shemp']
t['flux'] = [1.2, 2.2, 3.1, 4.3]
In [ ]:
t
If you did the same in a terminal session you would get a different view that isn't as pretty but does give a bit more information about the table:
>>> t
<Table rows=4 names=('name','flux')>
array([('larry', 1.2), ('curly', 2.2), ('moe', 3.1),
       ('shemp', 4.3)],
      dtype=[('name', '<U5'), ('flux', '<f8')])
To get a plain view which is the same in notebook and terminal, use print():
In [ ]:
print(t)
In [ ]:
# similar, but nicer when there are lots and lots of rows/columns
t.pprint()
To get the table column names and data types, use the colnames and dtype properties:
In [ ]:
t.colnames
In [ ]:
t.dtype
Astropy 1.1 and later provides a show_in_notebook() method that allows more interactive exploration of tables. It can be especially handy for large tables.
In [ ]:
t.show_in_notebook()
In [ ]:
t['flux'] # Flux column (notice meta attributes)
In [ ]:
t['flux'][1] # Row 1 of flux column
In [ ]:
t[1]['flux'] # equivalent!
In [ ]:
t[1][1] # also equivalent. Which is the column index? Play with this to find out.
In [ ]:
t[1] # one index = row number
In [ ]:
t[1:3] # 2nd and 3rd rows in a new table (remember that a:b indexing is not inclusive of b)
In [ ]:
t[1:3]['flux']
In [ ]:
t[[1, 3]] # the second and fourth rows of t in a new table
One of the most powerful concepts is using boolean selection masks to filter tables.
In [ ]:
mask = t['flux'] > 3.0 # Define boolean (True/False) mask for all flux values > 3
mask
In [ ]:
t[mask] # Create a new table with only the "True" rows
Once the table exists with defined columns there are a number of ways to modify the table in place. These are fully documented in the section Modifying a Table.
To give a couple of simple examples, you can add rows with the add_row() method or add new columns using dict-style assignment:
In [ ]:
t.add_row(('joe', 10.1)) # Add a new source at the end
t['logflux'] = np.log10(t['flux']) # Compute the log10 of the flux
t
Notice that the logflux column really has too many output digits given the precision of the input values. We can fix this by setting the format using normal Python formatting syntax:
The format operator in Python acts on an object and reformats it according to your specifications. The syntax is always object.format = '%format_string', where format_string tells it how to format the output. For now let's just deal with two of the more useful types:
Float Formatting: Floats are denoted with '%A.Bf', where A is the total number of characters you want, including the decimal point, and B is the number of characters that you want after the decimal. The f tells it that you would like the output as a float. If you don't specify A, Python will keep as many characters as are currently to the left of the decimal point. If you specify more characters to the left of the decimal than are there, Python will usually print the extra space as blank characters. If you want it to print leading zeroes instead, use the format '%0A.Bf'. This is not the case in tables though, where white space and leading zeroes will be ignored.
Scientific Notation Formatting: Sometimes in tables we will be dealing with very large numbers. Exponential formatting is similar to float formatting in that you are formatting the float that comes before the "e" (meaning 10 to some power). Numbers in scientific notation print as X.YeNN, where NN is the power of the exponent. The formatting string for floating point exponentials looks like '%A.Be' or '%A.BE', where e and E print lowercase and capital Es, respectively.
Should you need it in the future, here is a more detailed reference regarding string formatting.
Also useful is printing numbers in a given format, for which you use the syntax print('%format_code' % object), as demonstrated below. Play around with the cells below to make sure you understand the subtleties here before moving on.
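For instance, here is a quick sketch of the leading-zero and exponential format codes described above (the values are arbitrary):
In [ ]:
print('%8.2f' % 3.14159)   # width 8, 2 digits after the decimal -> '    3.14'
print('%08.2f' % 3.14159)  # same, but padded with leading zeros -> '00003.14'
print('%.3e' % 1234567.0)  # lowercase exponential -> '1.235e+06'
print('%.3E' % 1234567.0)  # uppercase exponential -> '1.235E+06'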
In [ ]:
t['flux'].format = '%.2f'
t['logflux'].format = '%.2f'
t
In [ ]:
print('%11.2f' % 100000)
print('%8.2f' % 100000)
In [ ]:
t['flux'].format = '%5.2e'
t['logflux'].format = '%.2E'
t
In [ ]:
print('%5.2e' % 0.0005862341)
print('%4.2E' % 246001)
Sometimes you may not want or be able to use a Table object and prefer to work with a plain numpy array (for example, if you read in data and then want to manipulate it). This is easily done by passing the table to the np.array() constructor.
This makes a copy of the data. If you have a huge table and don't want to waste memory, supply copy=False to the constructor, but be warned that changing the output numpy array will change the original table.
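For example, a minimal sketch of the no-copy conversion just described (note that in NumPy 2.0 and later, copy=False raises an error if a copy cannot be avoided):
In [ ]:
# View the table's data without copying; changes to 'a' would change t itself
a = np.array(t, copy=False)
a.dtype.names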
In [ ]:
np.array(t)
In [ ]:
np.array(t['flux'])
In [ ]:
t2 = Table([['x', 'y', 'z'],
[1.1, 2.2, 3.3]],
names=['name', 'value'],
masked=True)
t2
In [ ]:
t2['value'].mask = [False, True, False]
In [ ]:
print(t2)
In [ ]:
t2['value'].fill_value = -99
print(t2.filled())
So far we've just worked with one table at a time and viewed that table as a monolithic entity. Astropy also supports high-level Table operations that manipulate multiple tables or view one table as a collection of sub-tables (groups).
Documentation | Description |
---|---|
Grouped operations | Group tables and columns by keys |
Stack vertically | Concatenate input tables along rows |
Stack horizontally | Concatenate input tables along columns |
Join | Database-style join of two tables |
Here we'll just introduce the join operation but go into more detail on the others in the exercises.
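Before that, here is a minimal sketch of the two stacking operations listed above, using the table t from earlier (the snr column is made up purely for illustration):
In [ ]:
from astropy.table import vstack, hstack

# vstack concatenates tables that share column names along rows
print(vstack([t, t]))

# hstack pastes tables with the same number of rows side by side
extra = Table({'snr': [5.1, 8.3, 2.2, 9.7, 4.4]})  # 5 rows, matching t
print(hstack([t, extra]))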
In [ ]:
from astropy.table import join
Now recall our original table t:
In [ ]:
t
Now say that we got some additional flux values from a different reference for a different, but overlapping sample of sources:
In [ ]:
t2 = Table()
t2['name'] = ['larry', 'moe', 'groucho']
t2['flux2'] = [1.4, 3.5, 8.6]
Now we can get a master table of flux measurements, joined by matching values in the name column. This includes every row from each of the two tables, which is known as an outer join.
In [ ]:
t3 = join(t, t2, keys=['name'], join_type='outer')
print(t3)
In [ ]:
np.mean(t3['flux2'])
Alternately we could choose to keep only rows where both tables had a valid measurement using an inner join:
In [ ]:
join(t, t2, keys=['name'], join_type='inner')
You can write data using the Table.write() method:
In [ ]:
t3.write('test.fits', overwrite=True)
In [ ]:
t3.write('test.vot', format='votable', overwrite=True)
You can read data using the Table.read() method:
In [ ]:
t4 = Table.read('test.fits')
t4
Some formats, such as FITS and HDF5, are automatically identified by file extension, while most others require format to be explicitly provided. A number of common ASCII formats are supported, such as IPAC, SExtractor, DAOphot, and CSV. Refer to the documentation for a full listing.
In [ ]:
Table.read?
In [ ]:
t_2mass = Table.read("data/2mass.tbl", format="ascii.ipac")
t_2mass.show_in_notebook()
Although astropy Tables has some nice functionality that Pandas doesn't and is also a simpler, easier-to-use package, Pandas is the more versatile and commonly used table manipulator for Python, so I recommend you use it wherever possible.
Astropy 1.1 includes new to_pandas() and from_pandas() methods that facilitate conversion to/from pandas DataFrame objects. There are a few caveats in making these conversions:
- Tables with multidimensional columns cannot be converted.
- Masked values are converted to numpy.nan. Numerical columns, int or float, are thus converted to numpy.float while string columns with missing values are converted to object columns with numpy.nan values to indicate missing or masked data. Therefore, one cannot always round-trip between Table and DataFrame.
In [ ]:
import pandas as pd
Data frames are defined like dictionaries, with a column header/label (similar to a key) and a list of entries.
In [ ]:
df = pd.DataFrame({'a': [10,20,30],
'b': [40,50,60]})
In [ ]:
df
Think of DataFrames as numpy arrays plus some extra pointers to their columns and the indices of the row entries that make them amenable to use as tables.
In [ ]:
df.columns
In [ ]:
df.index
In [ ]:
# hit Shift + Tab twice after df. below to read more about dataframe objects and operations
df.
Pandas has built-in functions for reading many kinds of data. In the cell below, hit Tab once after the r to see all of the read functions. In our case here, read_table will work fine.
In [ ]:
pd.r
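For example, a minimal sketch of reading a whitespace-delimited text file (the filename data/example.txt is hypothetical; substitute your own file):
In [ ]:
# Hypothetical file -- replace with a real path
df_example = pd.read_table('data/example.txt', sep=r'\s+')
df_example.head()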
We can also convert the table that we already made with astropy Tables to a pandas DataFrame:
In [ ]:
pd_2mass = t_2mass.to_pandas()
pd_2mass
And the opposite operation (conversion from a pandas DataFrame to an astropy Table) works as well:
In [ ]:
t_pd = Table.from_pandas(pd_2mass)
t_pd.show_in_notebook()
Unlike astropy Tables, pandas can also read Excel spreadsheets:
In [ ]:
asteroids = pd.read_excel("data/asteroids5000.xlsx")
asteroids
#excel_data = Table.from_pandas(pd.read_excel("2mass.xls"))
#excel_data.show_in_notebook()
Pandas dataframe columns can be called as pandas Series objects using the syntax dataframe.columnlabel, as below, which is why it usually makes sense to define a column name/label that is short and has no spaces.
In [ ]:
asteroids.ra
This calling method allows you to use some useful built-in functions as well.
In [ ]:
#this one counts how many occurrences there are in the table for each unique value
asteroids.ph_qual.value_counts()
To pull up individual rows or entries, it helps that pandas dataframes always print the row indices off of their lefthand side. You index dataframes with .loc (label-based, using row/column names) or .iloc (integer position-based), as below.
In [ ]:
asteroids.loc[4,"ra"]
In [ ]:
asteroids.iloc[4,0] #same because column 0 is "ra"
You can always check that the column you're indexing is the one you want, as below:
In [ ]:
asteroids.columns[0]
Although indices are nice for reference, sometimes you might want the row labels to be more descriptive. What is the line below doing?
In [ ]:
# make the row names more interesting than numbers starting from zero
asteroids.index = ['Asteroid %d'%(i+1) for i in asteroids.index]
In [ ]:
#and you can index multiple columns/rows in the usual way
asteroids.iloc[:10,:2]
You can do lots more with this as well, including logical operations to filter the table:
In [ ]:
asteroids.columns
ast_new = asteroids[asteroids.dist < 500]
ast_new
To start with, read in the two data files representing the master source list and observations source list. The fields for the two tables are documented below.
You may use either pandas or astropy tables.
In [ ]:
## code to read in source data here
master_sources
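One possible sketch for the cell above, assuming the files are FITS tables with hypothetical filenames (substitute the actual files distributed with the course data):
In [ ]:
# Hypothetical filenames -- replace with the actual exercise data files
master_sources = Table.read('data/cdfs_master_sources.fits')
obs_sources = Table.read('data/cdfs_obs_sources.fits')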
Each distinct X-ray source identified on the sky is represented in the catalog by a single "master source" entry and one or more "source observation" entries, one for each observation in which the source has been detected. The master source entry records the best estimates of the properties of a source, based on the data extracted from the set of observations in which the source has been detected. The subset of fields in our exercise table file are:
Name | Description |
---|---|
msid | Master source ID |
name | Source name in the Chandra catalog |
ra | Source RA (deg) |
dec | Source Dec (deg) |
obs_sources
The individual source entries record all of the properties about a detection extracted from a single observation, as well as associated file-based data products, which are observation-specific. The subset of fields in our exercise table file are:
Name | Description |
---|---|
obsid | Observation ID |
obi | Observation interval |
targname | Target name |
gti_obs | Observation date |
flux_aper_b | Broad band (0.5 - 7 keV) flux (erg/cm2/sec) |
src_cnts_aper_b | Broad band source counts |
ra_b | Source RA (deg) |
dec_b | Source Dec (deg) |
livetime | Observation duration (sec) |
posid | Position ID |
theta | Off-axis angle (arcmin) |
msid | Master source ID |
Do the following to explore the two tables (a brief sketch follows the cell below):
- Use <TAB> completion to easily discover all the attributes and methods, e.g. type master_sources. and then hit the <TAB> key.
- Print the master_sources table using its pprint() method (astropy tables only).
In [ ]:
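A minimal sketch of the kind of exploration meant above (assuming master_sources was read in as an astropy Table):
In [ ]:
len(master_sources)       # number of rows
master_sources.colnames   # column names
master_sources.pprint()   # formatted printout of the whole table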
For our analysis we don't actually need the obi (observation interval) column in the obs_sources table.
- Remove the obi column from the obs_sources table.

The gti_obs column name is a bit obscure (GTI is a good time interval, FWIW).
- Rename the gti_obs column to obs_date.

It would be nice to have a count rate in addition to the source counts.
- Add a new column src_rate_aper_b which is the source counts divided by observation duration in sec.

Some of the sources have a negative net flux in the broad band. A sketch of the first three steps follows the cell below.
In [ ]:
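A sketch of these three steps using the astropy Table API (the column names come from the tables documented above):
In [ ]:
# Remove the unneeded observation-interval column
obs_sources.remove_column('obi')

# Rename the obscure column
obs_sources.rename_column('gti_obs', 'obs_date')

# Add a count rate column: source counts divided by observation duration in sec
obs_sources['src_rate_aper_b'] = obs_sources['src_cnts_aper_b'] / obs_sources['livetime']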
For each source detected in an individual observation (in the obs_sources table), let's look at the source flux values.
- Use the matplotlib hist() function to make a histogram of the source fluxes. Since the fluxes vary by orders of magnitude, use numpy.log10 to put the fluxes in log space.
- Also make the same plot but using only sources within 4 arcmin of the center. HINT: use a boolean mask to select values of theta that are less than 4.0.

One possible sketch follows the cell below.
In [ ]:
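One possible sketch, assuming obs_sources is an astropy Table; a positive-flux mask is included because the negative fluxes noted above would make log10 undefined:
In [ ]:
flux = obs_sources['flux_aper_b']
pos = flux > 0  # log10 is only defined for positive fluxes

plt.hist(np.log10(flux[pos]), bins=50, label='all sources')

# Same histogram restricted to sources within 4 arcmin of the center
central = pos & (obs_sources['theta'] < 4.0)
plt.hist(np.log10(flux[central]), bins=50, label='theta < 4 arcmin')
plt.xlabel('log10(flux_aper_b)')
plt.legend()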
The master_sources and obs_sources tables share a common msid column. What we now want is to join the master RA and Dec positions and master source names with the individual observations table.
- Make a new table sources that has the master RA, Dec, and name included for each observation source. HINT: the defaults for keys and join_type='inner' are correct in this case, so the simplest possible call to join() will work! (A sketch follows the cell below.)
- Intermediate: Is the length of the new sources the same as obs_sources? What happened?
- Advanced: Make a scatter plot of the RA (x-axis) and Dec (y-axis) difference between the master source position and the observation source position. You'll need to use coordinates!
In [ ]:
sources
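A sketch of the join step (join() here is astropy.table.join, imported earlier; by default it joins on the shared msid column):
In [ ]:
sources = join(master_sources, obs_sources)
len(sources), len(obs_sources)  # compare lengths to see what the join kept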
Finally, we can look at the variability properties of sources in the CDFS using the group_by() functionality.
This method makes a new table in which all the sources with identical master ID are next to each other.
- Make a new table g_sources which is the sources table grouped by the msid key using the group_by() method.

The g_sources table is just a regular table with all the sources in a particular order. The attribute g_sources.groups is an object that provides access to the msid sub-groups. You can access the $i^{th}$ group with g_sources.groups[i].
In addition the g_sources.groups.indices attribute is an array with the indices of the group boundaries.
- Using np.diff(), find the number of repeat observations of each master source. HINT: use the indices, Luke.

The real power of grouping comes in the ability to create aggregate values for each of the groups, for instance the mean flux for each unique source. This is done with the aggregate() method, which takes a function reference as its input. This function must take as input an array of values and return a single value.
Aggregate returns a new table that has a length equal to the number of groups.
- Compute the mean of all columns for each unique source using aggregate and the np.mean function. Call this table g_sources_mean.
- Use the join() function to restore the master_sources information to g_sources_mean.

A sketch of these steps follows the cell below.
In [ ]:
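A sketch of the grouping and aggregation steps described above (all standard astropy Table calls; assumes the sources table from the join exercise):
In [ ]:
# Group by master source ID
g_sources = sources.group_by('msid')

# Number of repeat observations per master source, from the group boundaries
n_obs = np.diff(g_sources.groups.indices)

# Mean of each numeric column within each group
# (non-numeric columns are dropped with a warning)
g_sources_mean = g_sources.groups.aggregate(np.mean)

# Restore the master source info
g_sources_mean = join(g_sources_mean, master_sources)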