Some of the code in this lab is based on a program written by Dmitry Savransky (Cornell) to parse the IPAC exoplanet table

Names: [Insert Your Names Here]

Lab 11 - Data Investigation 2 (Week 1) - Exoplanet Database

Lab 11 Contents

  1. Introduction to the Exoplanet Database
  2. Adding a Third Dimension to Scatterplots
  3. Computing Correlation Coefficients
  4. Data Investigation 2 - Week 2 Instructions

In [ ]:
#data from: http://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=planets
#download table -> csv format, all rows, all columns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as st

In [ ]:
# these set the pandas defaults so that it will print ALL values, even for very long lists and large dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

1. Introduction and Preliminaries

This lab will use the publicly-available IPAC exoplanet database. This is a constantly updated database that contains all measured properties of confirmed extrasolar planets as well as some properties of their host stars. For descriptions of the columns/quantities in the databse and their units, see the living table


In [ ]:
#read in the data, skipping the first 73 rows of ancillary information
data=pd.read_csv('planets032717.csv', skiprows=73)

In [ ]:
#print the columns. 
data.columns

In [ ]:
#this truncates to only planet detection methods with >30 successful detections (skip if you want all of them)
methods,methods_inds,methods_counts = np.unique(data['pl_discmethod'],return_index=True,return_counts=True)
methods = methods[methods_counts> 30]
methods

In [ ]:
data

In [ ]:
#find the indices of all entries where pl_discmethod is one of these four
inds = [j for j in range(len(data)) if data['pl_discmethod'][j] in methods]

#write a new dataframe with just these entries
data = data.loc[inds]

##note this cell can't be run twice because it redefines (overwrites) the data dataframe. 
#If you need to reexecute, start at the beginning.

In [ ]:
#note that the table is now shorter
len(data)

2. Adding a Third Dimension to Scatterplots

This part will walk you through how to take a third quantity (in this case planetary discovery method) and show it by giving various data categories different plotting symbols. This is a good method when you have one categorical variable and two continuous variables to plot.


In [ ]:
#Some setup to plot stuff
#List of symbol styles (uses shorthand that you can find described here: http://matplotlib.org/api/markers_api.html)
syms = 'os^pvD<>8*'

#color list in [red, green, blue] format
cmap = [[0,0,1],[1,0,0],[0.1,1,0.1],[1,0.6,0],
        [0.75,0.5,0],[1,0.75,0.8],[0.75,0,0.75],
        [0.7,0.75,1],[0.85,0.85,0.85],[0,0.75,1]]

In [ ]:
#Some info about solar system planets

planetnames = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune','Pluto']

#planet masses in 10^24kg
Ms=np.array([0.33,4.87,5.97,0.642,1898,568,86.8,102,0.0146])
#to Jupiter masses
Ms = Ms/Ms[4]

#Planet radii in km
Rs = np.array([2440.,6052.,6378.,3397.,71492.,60268.,25559.,24766.,24766.])
#to Jupiter radii
Rs = Rs/Rs[4]

#planet semi-major axes in AU
smas = np.array([0.3871,0.7233,	1,1.524,5.203,9.539,19.19,30.06,39.48])

#placement info for planet name labels on plot. All relative to point
has = ['left','right','right','left','left','left','right','right','left']
offs = [(0,0),(-4,-12),(-4,4),(0,0),(6,-4),(5,1),(-6,-8),(-5,4),(0,0)]

In [ ]:
# Plot Planet mass vs. semi-major axis (distance from star)
fig,ax = plt.subplots(figsize=(7,7))

#loop over all of the methods and their corresponding symbols and colors
for m,s,c in zip(methods,syms,cmap):
    #find all of the indices for that method
    inds = data['pl_discmethod'] == m
    #pull the planet masses and semi-major axis for those entries
    mj = data[inds]['pl_bmassj']
    a = data[inds]['pl_orbsmax']
    # make a scatterplot with symbols from symbol array s, colors from color array c, label=method m
    ax.scatter(a,mj,marker=s,s=60,
            facecolors=c,edgecolors=None,alpha=0.75,label=m)

#overplot solar system planets
ax.scatter(smas,Ms,marker='o',s=60,facecolors='yellow',edgecolors=None,alpha=1)
for a,m,n,ha,off in zip(smas,Ms,planetnames,has,offs):
    #add label with planet name
    ax.annotate(n,(a,m),ha=ha,xytext=off,textcoords='offset points')

#log scale axes
ax.set_xscale('log')
ax.set_yscale('log')

#set limits to reasonable ranges
ax.set_xlim([1e-2,1e3])
ax.set_ylim([1e-3,40])

#label axes and plot
ax.set_xlabel('Semi-Major Axis (AU)', fontsize=14)
ax.set_ylabel('(Minimum) Ma4s (M$_J$)', fontsize=14)
ax.legend(loc='lower right',scatterpoints=1,prop={'size':12})

#define a second axis on right for mass in earth masses
ax2 = ax.twinx()
ax2.set_yscale('log')
ax2.set_ylim(np.array(ax.get_ylim())/Ms[2])
ax2.set_ylabel('M$_\oplus$', fontsize=14)

Exercise 1


Manipulate the code above to answer the following questions, and describe the results in words. Integrate plots that you create into your explanations to support your answer, but do it by adding a save statement to the code above and saving a different file each time. In each case, make sure your new image "looks good" before saving it. Generally, this will mean manipulating things like the legend location, font size and axis limits. Before moving on to answering the next question, make sure to undo whatever manipulations you've made to the plot so that you start with a clean slate each time.

a) What happens when you don't filter the data to include only those methods with > 30 successful discoveries? List one or two interesting things that you notice in this view and 1 or two questions that it raises for you.

b) What happens when (now back to just four methods) you don't use a log scale for the axes? Manipulate the x and y axis ranges as best you can to make an informative plot with linear axis scales and then compare it to the log plot. What are the advantages and disadvantages of each?

c) Manipulate the plot until you can see ALL of the solar system planets (there are a few missing in this default view). List one or two interesting things that you notice in this view and 1 or two questions that it raises for you.

This markdown cell is for part a

This markdown cell is for part b

This markdown cell is for part c

Exercise 2


Using the code for the graphic above as a model, make a similar plot for planet mass (x-axis) and radius (y-axis). This time, include error bars (so use plt.errorbar instead of plt.scatter).

Hints:
0) I suggest starting with making just a scatterplot that you like with plt.scatter, and then translating it to a plot with errorbars with plt.errorbar
1) The scatter keywords "facecolors" and "marker" are (functionally) the same as the errorbar keywords "color" and "fmt" so you'll want to "translate". The scatter keywords s and edgecolors shouldn't be necessary and won't be recognized by errorbar, so just remove them.
2) Note that errors in both mass AND radius are given in the table, so you should use the xerr **
and** yerr keywords).
3) There are two errors given in the table for each quantity (mass, radius, etc) - a postitive error and a negative one (sometimes the uncertainty in a measurement is greater in one direction than the other). To specify asymmetric error bars, use the following basic syntax: plt.errorbar(x,y,xerr=[negerr,poserr],yerr=[negerr,poserr])
4) Since the negative errorbars are specified in the table with a negative sign and plt.errorbar's xerr and yerr keywords don't know how to handle negative numbers, you'll have to use the absolute value. In other words, the basic syntax is actually: plt.errorbar(x,y,xerr=[abs(negerr),poserr],yerr=[abs(negerr),poserr])


In [ ]:
#code for plot goes here

3. Computing Correlation Coefficients

If you look at your mass/radius plot, it should be fairly clear to you that the two quantitites are correlated, meaning that there is a relationship between the value of one variable and the value of the other.

One way to seek out correlations in the entire dataset would be to plot every variable versus every other, and indeed we do this below, but scatterplots can only tell us so much, so let's talk about how to calculate the correlation coefficient "r" that we discussed in class.


In [ ]:
data.dtypes
#select only those columns that are numeric (these are the only ones that make sense to calculate correlations for)
data2=data.select_dtypes(exclude=["object"])

In [ ]:
#np.corrcoef(data2.values,rowvar=0) # alternate syntax, but doesn't deal well with nans
data2.corr()

There are way too many things to look at here, so let's remove all of the error columns and those with "lim" or "blend" in the name (which are more complicated to understand than the other variables and are not as useful in this context) and just look at correlations between what's left


In [ ]:
#create list of column names
cols = data2.columns

In [ ]:
#create an empty list to store column indices for non-error columns
colidx=[]

#loop over column list
for j in np.arange(len(cols)):
    #find all columns that don't contain "error", "lim" or "blend" in the name 
    if "err" not in cols[j] and "lim" not in cols[j] and "blend" not in cols[j]:
        colidx.append(j)

In [ ]:
#create a new dataframe with just these "non-error" columns
data3 = data2[colidx]

In [ ]:
corr_mat = data3.corr()
#corr_mat

Exercise 3


Write a function that replaces any values in a dataframe less than a user-specified value with NaNs. For example, if the function were called isolate_vals and it were operating on the dataframe df, then isolate_vals(df, 0.3) would return a dataframe with all values < 0.3 replaces with NaNs.

Then, use that program to identify the strongest correlations in this matrix and describe them (look at the online table for descriptions of what the columnns are). Which ones do you think might be interesting/meaningful and why?


In [ ]:
#program goes here

4. Data Investigation 2 - Week 2 Instructions

Now that you are familar with the exoplanet database, you and your partner must come up with an investigation that you would like to complete using this data. It's completely up to you what you choose to investigate, but here are a few broad ideas to guide your thinking:

  • You might choose to isolate a population of interesting planets that you noticed in one of the plots and attempt to understand it (descriptive statistics, correlations, etc) and/or compare it to another population
  • You might make a quantitative comparsion of the TYPES of planets detected with different methods and connect this to the limitations of that method (e.g. what types of planets is it best at discovering and why?)
  • You might isolate a region of a plot with an apparent correlation and attempt to fit a model to it.
  • You might consider adding a fourth variable to one of the plots you made by sizing the points to represent that variable.

In all cases, I can provide suggestions and guidance, and would be happy to discuss at office hours or by appointment.

Before 5pm next Monday evening (4/11), you must send me a brief e-mail (that you write together, one e-mail per group) describing a plan for how you will approach a question that you have developed. What do you need to know that you don't know already? What kind of plots will you make and what kinds of statistics will you compute? What is your first thought for what your final data representations will look like?


In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[1]: