This notebook is based on the 2016 AAS Python Workshop tutorial on tables, available on [GitHub](https://github.com/spacetelescope/AAS2016), though it has been modified. Some of the pandas stuff was borrowed from a notebook put together by [Jake Vanderplas](http://www.vanderplas.com) and has been modified to suit the purposes of this course, including expansion/modification of explanations and additional exercises. Source and license info for the original is on [GitHub](https://github.com/jakevdp/2014_fall_ASTR599/)

Names: [Insert Your Names Here]

Lab 7 - Python Tables

Lab 7 Contents

  1. Astropy Tables
    • Constructing Tables
    • Displaying Tables in Notebook
    • Indexing Tables
    • Modifying Tables
    • Converting Tables to Numpy
    • Masking Tables
    • High-level Table Operations
    • Reading and Writing Tabular Data
  2. Pandas

In [ ]:
from astropy.table import Table
from numpy import *
import matplotlib 
matplotlib.use('nbagg')  # required for interactive plotting
import matplotlib.pyplot as plt
%matplotlib inline

1. Astropy Tables

The astropy Table class provides an extension of NumPy structured arrays for storing and manipulating heterogeneous tables of data. A few notable features of this package are:

  • Initialize a table from a wide variety of input data structures and types.
  • Modify a table by adding or removing columns, changing column names, or adding new rows of data.
  • Handle tables containing missing values.
  • Include table and column metadata as flexible data structures.
  • Specify a description, units and output formatting for columns.
  • Perform operations like database joins, concatenation, and grouping.
  • Manipulate multidimensional columns.
  • Methods for Reading and writing Table objects to files
  • Integration with Astropy Units and Quantities

or more information about the features and functionalities of Astropy tables, you can read the astropy.table docs.

1.1 Constructing Tables

There is great deal of flexibility in the way that a table can be initially constructed:

  • Read an existing table from a file or web URL
  • Add columns of data one by one
  • Add rows of data one by one
  • From an existing data structure in memory:

    • List of data columns
    • Dict of data columns
    • List of row dicts
    • Numpy homgeneous array or structured array
    • List of row records

See the documentation section on Constructing a table for the gory details and plenty of examples.


In [ ]:
t = Table()
t['name'] = ['larry', 'curly', 'moe', 'shemp']
t['flux'] = [1.2, 2.2, 3.1, 4.3]

1.2 Displaying Tables in Notebook

In IPython notebook, showing a table will produce a nice HTML representation of the table:


In [ ]:
t

If you did the same in a terminal session you get a different view that isn't as pretty but does give a bit more information about the table:

>>> t
<Table rows=4 names=('name','flux')>
array([('source 1', 1.2), ('source 2', 2.2), ('source 3', 3.1),
       ('source 4', 4.3)], 
      dtype=[('name', 'S8'), ('flux', '<f8')])

To get a plain view which is the same in notebook and terminal use print():


In [ ]:
print(t)

In [ ]:
##similar, but nicer when there are lots and lots of rows/columns
t.pprint()

To get the table column names and data types using the colnames and dtype properties:


In [ ]:
t.colnames

In [ ]:
t.dtype

Astropy 1.1 and later provides a show_in_notebook() method that allows more interactive exploration of tables. It can be especially handy for large tables.


In [ ]:
t.show_in_notebook()

1.3 Indexing Tables

We can access the columns and rows in a way similar to accessing discionary entries (with dict[key]), but here the syntax is table[column]. Table objects can also be indexed by row or column, and the column index can be swapped with column name.


In [ ]:
t['flux']  # Flux column (notice meta attributes)

In [ ]:
t['flux'][1]  # Row 1 of flux column

In [ ]:
t[1]['flux']  # equivalent!

In [ ]:
t[1][1]  # also equivalent. Which is the column index? Play with this to find out.

In [ ]:
t[1]  # one index = row number

In [ ]:
t[1:3]  # 2nd and 3rd rows in a new table (remember that the a:b indexing is not inclusive of b)

In [ ]:
t[1:3]['flux']

In [ ]:
t[[1, 3]] # the second and fourth rows of t in a new table

1.4 Modifying Tables

Once the table exists with defined columns there are a number of ways to modify the table in place. These are fully documented in the section Modifying a Table.

To give a couple of simple examples, you can add rows with the add_row() method or add new columns using dict-style assignment:


In [ ]:
t.add_row(('joe', 10.1))  # Add a new source at the end
t['logflux'] = log10(t['flux'])  # Compute the log10 of the flux
t

In [ ]:
t['flux'].format = '%.2f'
t['logflux'].format = '%.2f'
t

In [ ]:
print('%11.2f'% 100000)
print('%8.2f'% 100000)

In [ ]:
t['flux'].format = '%5.2e'
t['logflux'].format = '%.2E'
t

In [ ]:
print('%5.2e'% 0.0005862341)
print('%4.2E'% 246001)

1.5 Converting Tables to Numpy

Sometimes you may not want or be able to use a Table object and prefer to work with a plain numpy array (like if you read in data and then want to manipulate it. This is easily done by passing the table to the np.array() constructor.

This makes a copy of the data. If you have a huge table and don't want to waste memory, supply copy=False to the constructor, but be warned that changing the output numpy array will change the original table.


In [ ]:
array(t)

In [ ]:
array(t['flux'])

1.6 Masking Tables

One of the most powerful concepts in table manipulation is using boolean selection masks to select only table entries that meet certain criteria.


In [ ]:
mask = t['flux'] > 3.0  # Define boolean (True/False) mask for all flux values > 3
mask

In [ ]:
t[mask]  # Create a new table with only the "True" rows

In [ ]:
t2 = Table([['x', 'y', 'z'], 
            [1.1, 2.2, 3.3]],
           names=['name', 'value'],
           masked=True)
t2

In [ ]:
t2['value'].mask = [False, True, False]

In [ ]:
print(t2)

In [ ]:
t2['value'].fill_value = -99
print(t2.filled())

1.7 High-Level Table Operations

So far we've just worked with one table at a time and viewed that table as a monolithic entity. Astropy also supports high-level Table operations that manipulate multiple tables or view one table as a collection of sub-tables (groups).

Documentation Description
Grouped operations Group tables and columns by keys
Stack vertically Concatenate input tables along rows
Stack horizontally Concatenate input tables along columns
Join Database-style join of two tables

Here we'll just introduce the join operation but go into more detail on the others in the exercises.


In [ ]:
from astropy.table import join

Now recall our original table t:


In [ ]:
t

Now say that we now got some additional flux values from a different reference for a different, but overlapping sample of sources:


In [ ]:
t2 = Table()
t2['name'] = ['larry', 'moe', 'groucho']
t2['flux2'] = [1.4, 3.5, 8.6]

Now we can get a master table of flux measurements which are joined matching the values the name column. This includes every row from each of the two tables, which is known as an outer join.


In [ ]:
t3 = join(t, t2, keys=['name'], join_type='outer')
print(t3)

In [ ]:
mean(t3['flux2'])

Alternately we could choose to keep only rows where both tables had a valid measurement using an inner join:


In [ ]:
join(t, t2, keys=['name'], join_type='inner')

1.8 Writing and Reading Tabular Data

You can write data using the Table.write() method:


In [ ]:
t3.write('test.fits', overwrite=True)

In [ ]:
t3.write('test.vot', format='votable', overwrite=True)

You can read data using the Table.read() method:


In [ ]:
t4 = Table.read('test.fits')
t4

Some formats, such as FITS and HDF5, are automatically identified by file extention while most others will require format to be explicitly provided. A number of common ascii formats are supported such as IPAC, sextractor, daophot, and CSV. Refer to the documentation for a full listing.


In [ ]:
Table.read?

In [ ]:
t_2mass = Table.read("data/2mass.tbl", format="ascii.ipac")
t_2mass.show_in_notebook()

2. Pandas

Although astropy Tables has some nice functionality that Pandas doesn't and is also a simpler, easier to use package, Pandas is the more versatile and commonly used table manipluator for Python so I recommend you use it wherever possible.

Astropy 1.1 includes new to_pandas() and from_pandas() methods that facilitate conversion to/from pandas DataFrame objects. There are a few caveats in making these conversions:

  • Tables with multi-dimensional columns cannot be converted.
  • Masked values are converted to numpy.nan. Numerical columns, int or float, are thus converted to numpy.float while string columns with missing values are converted to object columns with numpy.nan values to indicate missing or masked data. Therefore, one cannot always round-trip between Table and DataFrame.

In [ ]:
import pandas as pd

Data frames are defined like dictionaries with a column header/label (similar to a key) and a list of entries.


In [ ]:
df = pd.DataFrame({'a': [10,20,30],
                   'b': [40,50,60]})

In [ ]:
df

think of DataFrames as numpy arrays plus some extra pointers to their columns and the indices of the row entries that make them amenable for tables


In [ ]:
df.columns

In [ ]:
df.index

In [ ]:
#hit shift + tab tab in the cell below to read more about dataframe objects and operations
df.

pandas has built-in functions for reading all kinds of types of data. In the cell below, hit tab once after the r to see all of the read functions. In our case here, read_table will work fine


In [ ]:
pd.r

we can also convert the table that we already made with Astropy Tables to pandas dataframe format


In [ ]:
pd_2mass = t_2mass.to_pandas()
pd_2mass

And the opposite operation (conversion from pandas dataframe to astropy table) works as well


In [ ]:
t_pd = Table.from_pandas(pd_2mass)
t_pd.show_in_notebook()

Unlike astropy Tables, pandas can also read excel spreadsheets


In [ ]:
asteroids = pd.read_excel("data/asteroids5000.xlsx")
asteroids
#excel_data = Table.from_pandas(pd.read_excel("2mass.xls"))
#excel_data.show_in_notebook()

pandas dataframe columns can be called as python series using the syntax dataframe.columnlabel, as below, which is why it usually makes sense to define a column name/label that is short and has no spaces


In [ ]:
asteroids.ra

this calling method allows you to do use some useful built-in functions as well


In [ ]:
#this one counts how many occurrences there are in the table for each unique value
asteroids.ph_qual.value_counts()

To pull up individual rows or entries, the fact that pandas dataframes always print the indices of rows off of their lefthand side helps. You index dataframes with .loc (if using column name) or .iloc (if using column index), as below


In [ ]:
asteroids.loc[4,"ra"]

In [ ]:
asteroids.iloc[4,0] #same because column 0 is "ra"

you can always check that the column you're indexing is the one you want as below


In [ ]:
asteroids.columns[0]

Although indices are nice for reference, sometimes you might want the row labels to be more descriptive. What is the line below doing?


In [ ]:
# make the row names more interesting than numbers starting from zero
asteroids.index = ['Asteroid %d'%(i+1) for i in asteroids.index]

In [ ]:
#and you can index multiple columns/rows in the usual way 
asteroids.iloc[:10,:2]

You can do lots more with this as well, including logical operations to parse the table


In [ ]:
asteroids.columns
ast_new = asteroids[asteroids.dist < 500]
ast_new

3. Exercises

In these exercises, you will be dealing with two tables of information, described below. We'll be doing lots of manipulation of pandas dataframes in Labs 9, 11 and 13, so these exercises focus mostly on special functions of Astropy tables, but you should, wherever possible, try to figure out how to do the same thing with a Pandas dataframe.

master_sources

Each distinct X-ray source identified on the sky is represented in the catalog by a single "master source" entry and one or more "source observation" entries, one for each observation in which the source has been detected. The master source entry records the best estimates of the properties of a source, based on the data extracted from the set of observations in which the source has been detected. The subset of fields in our exercise table file are:

Name Description
msid Master source ID
name Source name in the Chandra catalog
ra Source RA (deg)
dec Source Dec (deg)

obs_sources

The individual source entries record all of the properties about a detection extracted from a single observation, as well as associated file-based data products, which are observation-specific. The subset of fields in our exercise table file are:

Name Description
obsid Observation ID
obi Observation interval
targname Target name
gti_obs Observation date
flux_aper_b Broad band (0.5 - 7 keV) flux (erg/cm2/sec)
src_cnts_aper_b Broad band source counts
ra_b Source RA (deg)
dec_b Source Dec (deg)
livetime Observation duration (sec)
posid Position ID
theta Off-axis angle (arcmin)
msid Master source ID

Exercise 1 - Read the data


To start with, read in the two data files representing the master source list and observations source list. The fields for the two tables are respectively documented in:

Read them in as astropy tables and then convert them to pandas. In the end you should have four table objects, two astropy tables and two pandas dataframes


In [ ]:
## code to read in source data here

In [ ]:
## code to convert to pandas dataframes here

Get a list of the column names for each table.

Hint: use <TAB> completion to easily discover all the attributes and methods, e.g. type master_sources. and then hit the <TAB> key. This will reveal some built-in methods to do things like print column names, as well as some of the other things below


In [ ]:
## code to print list of column names

Find and print the length of each table.


In [ ]:
## code to print the length of each table here

Find the column datatypes for each table (also a built-in method).


In [ ]:
## code to print the column datatypes

Display all the rows of the master_sources table using its pprint() method (astropy tables only).


In [ ]:
## code to pprint the rows of master_sources

Exercise 2 - Modifying tables


Remove the obi column from the obs_sources table.


In [ ]:
#code to remove column here

The gti_obs column name is a bit obscure (GTI is "good time interval", but it really just means "date"). Rename the gti_obs column to obs_date.


In [ ]:
# code to rename column here

The source count column tells you how many photons were collected by the detector, but it would also be nice to have a count rate (number of photons per second). Add a new column src_rate_aper_b which is the source counts divided by observation duration in sec.


In [ ]:
# code to create and add new column here

Exercise 3 - Visualizing Data

Use the matplotlib hist() function to make a histogram of the source flux column. Since the fluxes vary by orders of magnitude, use numpy.log10 to put the fluxes in log space.


In [ ]:
# code to create histogram

Let's now remove any sources that we think might not be associated with the source we pointed at ("background/foreground objects" - things that appear near the location of our source in the sky, but that aren't physically associated with it and are actually either much closer or much farther away). To remove these potentially unassociated objects, make the same plot but using only observations where the source was within 4 arcminutes of the place where the telescope was pointed. HINT: use a boolean mask to select values of theta that are less than 4.0.


In [ ]:
# code to mask table and create new histogram

Exercise 4 - Join the master_sources and obs_sources tables

The master_sources and obs_sources tables share a common msid column. What we now want is to join the master list of sky positions (RA and Dec columns - essentially celestial longitude and latitude) and source names with the individual observations table.

Use the table.join() function to make a single table called sources that has the master RA, Dec, and name included for each observation source.

HINT: the defaults for keys and join_type='inner' are correct in this case, so the simplest possible call to join() will work!


In [ ]:
# code to join tables

Is the length of the new sources the same as obs_sources? What happened? Use specific examples in your explanations.


In [ ]:
# code to investigate lengths

insert explanation here

Exercise 5 - Grouped properties of sources


When using tables, we may occasionally wish to group entries based on various properties, which is done using the group_by() functionality.

This method makes a new table in which all the sources with the same entry for some property (that property is specified in the function call) are next to one another.

Make a new table g_sources which is the sources table grouped by the msid key using the group_by() method.


In [ ]:
## code to group sources by the msid key and write into new table g_sources

The new g_sources table is just a regular table with all the sources in a particular order. The attribute g_sources.groups has also been created and is an object that provides access to the msid sub-groups. You can access the $i^{th}$ group with g_sources.groups[i].

In addition the g_sources.groups.indices attribute is an array with the indicies of the group boundaries.

Using np.diff() find the number of repeat observations of each master sources. HINT: use the indices, Luke.


In [ ]:
## code to find the number of observations for each source

Print the 50th group and note which columns are the same for all group members and which are different. Does this make sense? In these few observations how many different target names were provided by observers?


In [ ]:
## code to print info for the 50th group

Exercise 6 - Aggregation


The real power of grouping comes in the ability to create aggregate values for each of the groups, for instance the mean flux for each unique source. This is done with the aggregate() method, which takes a function reference as its input. This function must take as input an array of values and return a single value.

Aggregate returns a new table that has a length equal to the number of groups.

Compute the mean of all columns for each unique source (i.e. each group) using aggregate and the np.mean function. Call this table g_sources_mean.


In [ ]:
## code to create a new table with the group means

Notice that aggregation cannot form a mean for certain columns and these are dropped from the output. Use the join() function to restore the master_sources information to g_sources_mean.


In [ ]:
## code to add back in the columns from master_sources

In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[1]: