In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook

Wranglin' –– Corraling Unruly Data

One bit at a time

Version 0.1

By AA Miller 2019 Mar 26

Webster's Dictionary$^\ast$ defines wrangler as:

wrangler noun

wran·gler | raŋ-g(ə-)lər

(short for horse-wrangler, probably partial translation of Mexican Spanish caballerango groom): a ranch hand who takes care of the saddle horses broadly : cowboy

$^\ast$actually https://www.merriam-webster.com/dictionary/ - Webster's didn't define wrangler in the way I wanted

How then, as astronomers, are we all like cowhands?

Data are often like horses in that: they all differ, rarely conform to a single standard set of behavior, and they love to eat hay.$^\dagger$

$^\dagger$I made that last one up.

Thus, in our efforts to better understand the Universe, we must often manipulate, coax, and, in some cases, force our data to "behave." This involves a variety of tasks, such as: gathering, cleaning, matching, restructuring, transforming, filtering, combining, merging, verifying, and fixing data.

Here is a brief and unfortunate truth, there isn't a single person in the entire world that would organize data in exactly the same way that you would.

As a result, you may find that data that are useful to you are not organized in an optimal fashion for use in your workflow/software.

Hence: the need to wrangle.

There is one important and significant way in which our lives as astronomers are much better than the average data scientist: even though our data are "worthless," virtually all of it is numbers.

Furthermore, I contend that most astronomical data can easily be organized into a simple tabular structure.

Yesterday we joked that you could write a 10 page text document describing every individual galaxy in a galaxy cluster with 5000 members, but no one would ever do that.

Nevertheless, as you will see during the exercises, even with relatively simple, small numerical data sets there is a need for wrangling.

And wrangling brings up a lot of issues...

Consider the following data set that contains the street names for my best friends from childhood:

['Ewing', 'Isabella', 'Reese', 'Isabella', 
 'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece']

Do you notice anything interesting?

Either my hometown has a street named "Reese" and a street named "Reece", or the last entry was input incorrectly.

If the later is true, then we have to raise the question of: what should we do?

For this particular data set, it would be possible to create a verification procedure to test for similar errors.

  1. Collect the names of every street in the city (from post office?)
  2. Confirm every instance in the data set has a counterpart.

For any instances where this isn't the case, one could then intervene with a correction.

This particular verification catches this street name error, but it doesn't correct for the possibility that the person doing the data entry may have been reading addresses really quickly and the third "Reese" entry should have actually said "Lawndale."

(verification is really hard)

This also raises a question about data provenance.

Data provenance is a historical record of the data and its origins.

If you are making "corrections" to the data, then each and every one of those corrections should be reported (this is similar to the idea of logging from yesterday's databases talk). Ideally, these reports would live with the data so others could understand how things have changed.

If you did change "Reece" to "Reese", anyone working with the data should be able to confirm those changes.

Suppose now you wanted to use the same street name data set to estimate which street I lived on while growing up.

One way to mathematically approach this problem would be to convert the streets in the data set to GPS coordinates, and then to perform a KDE of the PDF for the coordinates of where I lived.

This too is a form of wrangling, because the data you have (street names) are not the data you need (coordinates).

Why harp on this?

In practice, data scientists (including astronomers) spend an unreasonable amount of time manipulating and quality checking data (some indsutry experts estimate that up to 80% of their time is spent warnglin').

Today, we will work through several examples that require wrangling, while, hopefully, building some strategies to minimize the amount of time you spend on these tasks in the future.

For completeness, I will mention that there is a famous canonical paper about data wrangling, which introduces the Wrangler, a tool specifically designed to take heterogeneous (text) data, and provide a set of suggested operations/manipulations to create a homogenous table amenable to standard statistical analysis.

One extremely nice property of the Wrangler is that it records every operation performed on the data, ensuring high-fidelity reporting on the data provenance. We should do a better job of this in astronomy.

Today, we are going to focus on python solutions to some specific astronomical data sets.

Hopefully you learn some tricks to make your work easier in the future.

Problem 1) The Sins of Our Predecessors

If at any point in your career you need to access archival infrared data, you will likely need to retrieve that information from the NASA IPAC InfRed Science Archive. IRSA houses the data for every major NASA IR mission, and several ground-based missions as well (e.g., 2MASS, IRTF). Whether you are sudying brown dwarfs, explosive transients, solar system objects, star-formation, galaxy evolution, Milky Way dust and the resulting extinction of extragalactic observations, or quasars (and much more) the IR plays a critical role.

Given the importance of IR observations, it stands to reason that IRSA would provide data in a simple to read format for modern machines, such as comma separated values or FITS binary tables...

Right?...

Right?...

In fact, IRSA has created their own standard for storing data in a text file. If you haven't already, please download irsa_catalog_WISE_iPTF14jg_search_results.tbl, a file that is written in the standard IRSA format.

shameless plug alert! iPTF14jg is a really strange star that exhibited a large outburst that we still don't totally understand. The associated data file includes NEOWISE observations of the mid-IR evolution of this outburst.

Problem 1a

Using pandas read the data in the IRSA table file into a DataFrame object.

Hint 1 - you absolutely should look at the text file to develop a strategy to accomplish this goal.

Hint 2 - you may want to manipulate the text file so that it can more easily be read by pandas. If you do this be sure to copy the file to another name as you will want to leave the original intact.


In [5]:
# Solution 1 - pure python solution with pandas

with open('irsa_catalog_WISE_iPTF14jg_search_results.tbl') as f:
    ll = f.readlines()
    for linenum, l in enumerate(ll):
        if l[0] == '|':
            header = l.replace('|', ',').replace(' ', '')
            header = list(header[1:-2].split(','))
            break
            
irsa_tbl = pd.read_csv("irsa_catalog_WISE_iPTF14jg_search_results.tbl", 
            skiprows=linenum+4, delim_whitespace=True, 
            header=None, names=header)
irsa_tbl.head(5)


Out[5]:
ra dec sigra sigdec sigradec w1mpro w1sigmpro w1snr w1rchi2 w2mpro ... w4sigmpro_allwise tmass_key j_m_2mass j_msig_2mass h_m_2mass h_msig_2mass k_m_2mass k_msig_2mass dist angle
0 40.125566 60.879306 0.0947 0.0936 -0.0337 13.251 0.038 28.8 0.8767 12.735 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.028668 242.330013
1 40.125574 60.879297 0.0925 0.0874 -0.0397 13.241 0.036 30.0 1.2610 12.629 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.048011 192.659014
2 40.125623 60.879318 0.0808 0.0794 0.0098 12.753 0.039 27.8 0.7640 11.901 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.079498 69.867298
3 40.125553 60.879286 0.0663 0.0608 -0.0156 12.774 0.028 38.6 1.3610 11.935 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.097389 209.287467
4 40.125572 60.879280 0.0645 0.0599 -0.0223 12.840 0.028 38.7 1.5550 11.897 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.107707 187.006050

5 rows × 37 columns

That pure python solution is a bit annoying as it requires a for loop with a break, and specific knowledge about how IRSA tables handler data headers (hence the use of linenum + 4 for skiprows). Alternatively, one could manipulate the data file to read in the data.


In [6]:
# solution 2 - edit the text file
# !cp irsa_catalog_WISE_iPTF14jg_search_results.tbl tmp.tbl
### delete lines 1-89, and 90-92
### replace whitespace with commas (may require multiple operations)
### replace '|' with commas
### replace ',,' with single commas
### replace ',\n,' with '\n'
### delete the comma at the very beginning and very end of the file

tedit_tbl = pd.read_csv('tmp.tbl')
tedit_tbl.head(5)


Out[6]:
ra dec sigra sigdec sigradec w1mpro w1sigmpro w1snr w1rchi2 w2mpro ... w4sigmpro_allwise tmass_key j_m_2mass j_msig_2mass h_m_2mass h_msig_2mass k_m_2mass k_msig_2mass dist angle
0 40.125566 60.879306 0.0947 0.0936 -0.0337 13.251 0.038 28.8 0.8767 12.735 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.028668 242.330013
1 40.125574 60.879297 0.0925 0.0874 -0.0397 13.241 0.036 30.0 1.2610 12.629 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.048011 192.659014
2 40.125623 60.879318 0.0808 0.0794 0.0098 12.753 0.039 27.8 0.7640 11.901 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.079498 69.867298
3 40.125553 60.879286 0.0663 0.0608 -0.0156 12.774 0.028 38.6 1.3610 11.935 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.097389 209.287467
4 40.125572 60.879280 0.0645 0.0599 -0.0223 12.840 0.028 38.7 1.5550 11.897 ... NaN NaN NaN NaN NaN NaN NaN NaN 0.107707 187.006050

5 rows × 37 columns

That truly wasn't all that better - as it required a bunch of clicks/text editor edits. (There are programs such as sed and awk that could be used to execute all the necessary edits from the command line, but that too is cumbersome and somewhat like the initial all python solution).

If astronomers are creating data in a "standard" format, then it ought to be easy for other astronomers to access that data.

Fortunately, in this particular case, there is an easy solution - astropy Tables.

IRSA tables are so commonly used throughout the community, that the folks at astropy have created a convenience method for all of us to read in tables created in that particular (unusual?) format.

Problem 1b

Use Table.read() to read in irsa_catalog_WISE_iPTF14jg_search_results.tbl to an astropy Table object.


In [28]:
from astropy.table import Table

Table.read('irsa_catalog_WISE_iPTF14jg_search_results.tbl', format='ipac')


Out[28]:
Table masked=True length=137
radecsigrasigdecsigradecw1mprow1sigmprow1snrw1rchi2w2mprow2sigmprow2snrw2rchi2nbnacc_flagsph_qualqual_framemjdallwise_cntrw1mpro_allwisew1sigmpro_allwisew2mpro_allwisew2sigmpro_allwisew3mpro_allwisew3sigmpro_allwisew4mpro_allwisew4sigmpro_allwisetmass_keyj_m_2massj_msig_2massh_m_2massh_msig_2massk_m_2massk_msig_2massdistangle
degdegarcsecarcsecarcsecmagmagmagmagmjdatemagmagmagmagmagmagmagmagmagmagmagmagmagmagarcsecdeg
float64float64float64float64float64float64float64float64float64float64float64float64float64int64int64str4str2int64float64int64float64float64float64float64float64float64float64float64int64float64float64float64float64float64float64float64float64
40.125565560.87930630.09470.0936-0.033713.2510.03828.80.876712.7350.05818.60.7135100000AA1057621.23501414--------------------------------0.028668242.330013
40.12557460.8792970.09250.0874-0.039713.2410.03630.01.26112.6290.05121.21.121100000AA557621.62810604--------------------------------0.048011192.659014
40.125622660.87931760.08080.07940.009812.7530.03927.80.76411.9010.03432.01.059100000AA1057063.79830186--------------------------------0.07949869.867298
40.125552860.87928640.06630.0608-0.015612.7740.02838.61.36111.9350.03729.30.7133100000AA1057256.77485551--------------------------------0.097389209.287467
40.125572560.87928030.06450.0599-0.022312.840.02838.71.55511.8970.03135.11.045100000AA557257.10275142--------------------------------0.107707187.00605
40.125573660.87927990.11640.1059-0.03913.5870.04126.41.64713.1450.07614.31.03100000AA557987.97028435--------------------------------0.108928185.907602
40.125564460.87927440.08560.086-0.03113.1980.03531.21.72512.6440.0521.70.7308100000AA557620.84192211--------------------------------0.131052192.038104
40.125551760.87927130.06790.0685-0.021812.8070.02838.81.46611.9890.03828.30.7736100000AA1057256.64356978--------------------------------0.147884199.589209
40.125571860.87926870.12240.1104-0.054713.4960.04424.60.976712.7390.06417.11.444100000AA1057787.46478715--------------------------------0.149378185.518975
...............................................................................................................
40.125386360.87923240.13620.12-0.041813.8030.04922.30.782513.1450.07115.20.8457100000AA557988.16676654--------------------------------0.439544230.53831
40.125401760.87921880.17560.1349-0.035413.4170.0813.51.91312.7710.07215.20.4021100000AA1057790.47736456--------------------------------0.453174223.574167
40.125450460.87919530.0650.0627-0.023612.8290.03135.02.80311.9030.03729.21.251100000AA557423.45964858--------------------------------0.471232208.805123
40.125419560.8791910.07360.0604-0.025512.7770.02739.90.882111.9640.03333.10.8097100000AA1057063.99542123--------------------------------0.512441213.279623
40.125359460.87921630.06310.0675-0.009312.7910.0336.61.1311.9720.03927.50.6409100000AA1057422.73853115--------------------------------0.512985228.885562
40.125277160.87934110.1150.1078-0.039713.5110.04126.21.29912.6680.06117.80.6642100000AA557790.01894558--------------------------------0.542346281.913746
40.125317260.87921860.10350.1091-0.027213.3280.03828.61.86812.8930.09111.91.075100000AA557790.21542903--------------------------------0.565899234.448038
40.125624760.87915180.14630.1293-0.057513.7340.04822.90.995413.2570.1228.90.5377100000AB557988.36312131--------------------------------0.574875172.170603
40.125278960.87923520.11740.1057-0.049713.360.04922.30.785612.7290.0618.10.7291100000AA557787.33375567--------------------------------0.59227242.957045
40.126043860.87918450.19110.1661-0.054914.0360.0715.44.36913.4720.09411.61.701100000AA057988.49402461--------------------------------0.929713119.074968

A benefit to using this method, as opposed to pandas, is that data typing and data units are naturally read from the IRSA table and included with the associated columns. Thus, if you are uncertain if some brightness measurement is in magnitudes or Janskys, the astropy Table can report on that information.

Unfortunately, astropy does not know about every strange formating decision that every astronomer has made at some point in their lives (as we are about to see...)

Problem 2) The Sins of Our Journals

Unlike IRSA/IPAC, which uses a weird but nevertheless consistent format for data tables, data retrieved from Journal articles essentially follows no rules. In principle, tables in Journal articles are supposed to be provided in a machine readable format. In practice, as we are about to see, this is far from the case.

For this particular wrangling case study we will focus on supernova light curves, a simple thing to report: time, filter, brightness, uncertainty on that brightness, that the community has nevertheless managed to mangle into some truly wild and difficult to parse forms.

(Sorry for the heavy emphasis on time-domain examples - I'm pulling straight from my own life today, but the issues described here are not perfectly addressed by any subfield within the astro umbrella)

Here is the LaTeX-formatted version of Table 4 from Miller et al. 2011:

<img style="display: block; margin-left: auto; margin-right: auto" src="images/Miller11_tbl4.png" align=\"middle">

That is a very simple table to interpret, no?

Have a look at the "machine-readible" file that ApJ provides for readers that might want to evaluate these photometric measurements.

Problem 2a

Download the ApJ version of Table 4 from Miller et al. 2011 and read that table into either a pandas DataFrame or an astropy Table.


In [38]:
# pure python solution with pandas

tbl4 = pd.read_csv('Miller_et_al2011_table4.txt', 
                   skiprows=5, delim_whitespace=True,
                   skipfooter=3, engine='python',
                   names=['t_mid', 'J', 'Jdum', 'J_unc', 
                                   'H', 'Hdum', 'H_unc',
                                   'K', 'Kdum', 'K_unc'])
tbl4.drop(columns=['Jdum', 'Hdum', 'Kdum'], inplace=True)

print(tbl4)


       t_mid      J  J_unc     H  H_unc     K  K_unc
0  55466.137  10.04   0.03  9.14   0.03  8.65   0.03
1  55468.145   9.99   0.03  9.06   0.04  8.64   0.04
2  55469.148  10.04   0.03  9.07   0.03  8.70   0.03
3  55479.109  10.11   0.03  9.11   0.03  8.63   0.04
4  55504.164  10.20   0.03  9.24   0.03  8.74   0.03
5  55513.195  10.29   0.03  9.34   0.03  8.79   0.03
6  55518.168  10.32   0.03  9.34   0.04  8.84   0.03
7  55527.117  10.35   0.03  9.40   0.03  8.89   0.03
8  55531.145  10.40   0.03  9.44   0.03  8.97   0.03
9  55543.066  10.45   0.03  9.48   0.04  9.06   0.04

That wasn't too terrible. But what if we consider a more typical light curve table, where there are loads of missing data, such as Table 2 from Foley et al. 2009:

<img style="display: block; margin-left: auto; margin-right: auto" src="images/Foley09_tbl2.png" align=\"middle">

Again, this table is straightforward to read, and it isn't hard to imagine how one could construct a machine-readable csv or other file from this information. But alas, this is not what is available from ApJ. So, we will need to figure out how to deal with both the missing data, "...", and the weird convention that many astronomers use where the uncertainties are (a) not reported in their own column, and (b) are not provided in the same units as the measurement itself. I can understand the former, but the later is somewhat baffling...

Problem 2b

Download the ApJ version of Table 2 from Foley et al. 2009 and read that table into either a pandas DataFrame or an astropy Table.


In [48]:
# a (not terribly satisfying) pure python solution
# read the file in, parse and write another file that plays nice with pandas

with open('Foley_et_al2009_for_pd.csv','w') as fw:
    print('JD,Bmag,Bmag_unc,Vmag,Vmag_unc,Rmag,Rmag_unc,Imag,Imag_unc,Unfiltmag,Unfiltmag_unc,Telescope',file=fw)
    with open('Foley_et_al2009_table2.txt') as f:
        ll = f.readlines()
        for l in ll:
            if l[0] == '2':
                print_str = l.split()[0] + ','
                for col in l.split()[1:]:
                    if col == 'sdotsdotsdot':
                        print_str += '-999,-999,'
                    elif col[0] == '>':
                        print_str += '{},-999,'.format(col[1:])
                    elif col == 'KAIT':
                        print_str += 'KAIT'
                    elif col == 'Nickel':
                        print_str += 'Nickel'
                    elif col[0] == '(':
                        print_str += '0.{},'.format(col[1:-1])
                    else:
                        print_str += '{},'.format(col)

                print(print_str,file=fw)

pd.read_csv('Foley_et_al2009_for_pd.csv')


Out[48]:
JD Bmag Bmag_unc Vmag Vmag_unc Rmag Rmag_unc Imag Imag_unc Unfiltmag Unfiltmag_unc Telescope
0 2454764.80 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 19.500 -999.000 KAIT
1 2454778.69 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 18.069 0.092 KAIT
2 2454781.76 18.340 0.084 17.828 0.037 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 KAIT
3 2454783.74 18.229 0.062 17.718 0.042 17.509 0.041 17.377 0.054 -999.000 -999.000 KAIT
4 2454784.64 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 17.736 0.091 KAIT
5 2454784.71 18.230 0.030 17.635 0.030 17.570 0.030 17.392 0.030 -999.000 -999.000 Nickel
6 2454785.67 18.385 0.030 17.660 0.027 17.610 0.023 17.425 0.034 17.683 0.038 KAIT
7 2454786.80 18.415 0.030 17.710 0.030 17.544 0.030 17.358 0.030 -999.000 -999.000 Nickel
8 2454787.67 18.596 0.030 17.762 0.030 17.552 0.030 17.376 0.030 -999.000 -999.000 KAIT
9 2454789.68 18.904 0.030 17.827 0.014 17.573 0.011 17.353 0.030 17.732 0.028 KAIT
10 2454790.71 18.969 0.030 17.871 0.030 17.560 0.030 17.278 0.030 -999.000 -999.000 Nickel
11 2454792.70 19.563 0.054 18.127 0.022 17.689 0.016 17.443 0.030 -999.000 -999.000 KAIT
12 2454794.68 19.789 0.072 18.377 0.032 17.904 0.025 17.541 0.037 -999.000 -999.000 KAIT
13 2454795.78 20.297 0.216 -999.000 -999.000 -999.000 -999.000 -999.000 -999.000 18.009 0.103 KAIT
14 2454798.65 20.390 0.123 18.791 0.033 18.308 0.022 17.876 0.039 18.388 0.022 KAIT
15 2454799.66 -999.000 -999.000 18.921 0.046 18.343 0.029 17.919 0.050 18.468 0.040 KAIT
16 2454800.67 -999.000 -999.000 18.942 0.035 18.347 0.029 17.892 0.033 18.482 0.027 KAIT
17 2454801.64 -999.000 -999.000 19.004 0.044 18.438 0.026 17.964 0.036 18.589 0.031 KAIT
18 2454802.74 -999.000 -999.000 19.064 0.082 18.472 0.037 17.921 0.030 18.635 0.026 KAIT
19 2454803.63 -999.000 -999.000 19.130 0.054 18.569 0.031 18.087 0.047 18.589 0.030 KAIT
20 2454805.63 -999.000 -999.000 19.211 0.095 18.851 0.055 18.142 0.056 18.682 0.048 KAIT
21 2454809.67 -999.000 -999.000 19.288 0.206 19.104 0.135 -999.000 -999.000 18.909 0.078 KAIT
22 2454811.61 -999.000 -999.000 19.462 0.103 19.069 0.073 18.374 0.067 19.150 0.060 KAIT
23 2454812.62 -999.000 -999.000 19.458 0.207 19.118 0.065 18.482 0.077 19.070 0.049 KAIT
24 2454831.63 21.372 0.030 20.191 0.030 19.647 0.030 18.864 0.030 -999.000 -999.000 Nickel

Okay - there is nothing elegant about that particular solution. But it works, and wranglin' ain't pretty.

It is likely that you developed a solution that looks very different from this one, and that is fine. When data are provided in an unrulely format, the most important thing is to develop some method, any method, for converting the information into a useful format. Following whatever path you used above, it should now be easy to plot the light curve of SN 2008ha.

Problem 3) My Heart Will Go On

Sometimes there is no difficultly whatsoever in reading in the data (as was the case in Problems 1 and 2), but instead the difficultly lies in wranglin' the data to be appropriate for the model that you are building.

For the next problem we will work with the famous Titanic survival data set. If you haven't already, please download the csv file with the relevant information.

Briefly, the Titantic is a famous historical ship that was thought to be unsinkable. Spoiler alert it hit an iceberg and sank. The data in the Titanic data set includes information about 891 passengers from the Titanic, as well as whether or not they survived. The aim of this data set is to build a machine learning model to predict which passengers survived and which did not.

The features include:

Feature Description
PassengerId Running index that describes the individual passengers
Pclass A proxy for socio-economic status (1 = Upper class, 2 = Middle Class, 3 = Lower Class)
Name The passenger's name
Sex The passenger's sex
Age The passenger's age - note age's ending in 0.5 are estimated
SibSp The sum of the passenger's sibblings and spouces on board
Parch The sum of the passenger's parents and children on board
Ticket The ticket number for the passenger
Fare The price paid for the ticket by th passenger
Cabin The Cabin in which the passenger stayed
Embarked The point of Origin for the Passenger: C = Cherbourg, S = Southampton, Q = Queenstown

And of course, we are trying to predict:

Label Description
Survived 1 = yes; 0 = no

Problem 3a

Read in the Titanic training data and create the scikit-learn standard X and y arrays to hold the features and the labels, respectively.


In [124]:
titanic_df = pd.read_csv('titanic_kaggle_training_set.csv', comment='#')

feat_list = list(titanic_df.columns)
label = 'Survived'
feat_list.remove(label)
X = titanic_df[feat_list].values
y = titanic_df[label]

Now that we have the data in the appropriate X and y arrays, estimate the accuracy with which a K nearest neighbors classification model can predict whether or not a passenger would survive the Titanic disaster. Use $k=10$ fold cross validation for the prediction.

Problem 3b

Train a $k=7$ nearest neighbors machine learning model on the Titanic training set.


In [95]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=7)
knn_clf.fit(X, y)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-95-a87f7dab2ac9> in <module>()
      1 from sklearn.neighbors import KNeighborsClassifier
      2 knn_clf = KNeighborsClassifier(n_neighbors=7)
----> 3 knn_clf.fit(X, y)

~/miniconda3/envs/DSFP/lib/python3.6/site-packages/sklearn/neighbors/base.py in fit(self, X, y)
    763         """
    764         if not isinstance(X, (KDTree, BallTree)):
--> 765             X, y = check_X_y(X, y, "csr", multi_output=True)
    766 
    767         if y.ndim == 1 or y.ndim == 2 and y.shape[1] == 1:

~/miniconda3/envs/DSFP/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    572                     ensure_2d, allow_nd, ensure_min_samples,
--> 573                     ensure_min_features, warn_on_dtype, estimator)
    574     if multi_output:
    575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~/miniconda3/envs/DSFP/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

ValueError: could not convert string to float: 'Q'

Note - that should have failed! And for good reason - recall that kNN models measure the Euclidean distance between all points within the feature space. So, when considering the sex of a passenger, what is the numerical distance between male and female?

In other words, we need to wrangle this data before we can run the machine learning model.

Most of the features in this problem are non-numeric (i.e. we are dealing with categorical features), and therefore we need to figure out how to include them in the kNN model.

The first step when wrangling for machine learning is to figure out if anything can be thrown away. We certainly want to avoid including any uninformative features in the model.

If you haven't already, now would be a good time to create a new cell and examine the contents of the csv


In [96]:
titanic_df


Out[96]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
859 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
860 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
861 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
862 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
863 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
864 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
865 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
866 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
867 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
868 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
869 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
870 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
871 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
872 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
873 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
874 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
875 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
876 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
877 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
878 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
879 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
880 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
881 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
882 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
883 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
884 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
885 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
886 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
887 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
888 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

889 rows × 12 columns

Problem 3c

Are there any features that obviously will not help with this classification problem?

If you answer yes - ignore those features moving forward

Solution 3c

write your solution here

The Name of the passenger will not be useful for this classification task. In particular, the useful information that might be learned from the name, such as the sex (e.g., Mr. vs. Mrs.), or the age (e.g., Mr. vs. Master), are already summarized elsewhere in the data.

It is also highly likely that the PassengerId, which is just a running index for each person in the dataset, is unlikely to be useful when classifying this data.

Given that we have both categorical and numeric features, let's start with the numerical features and see how well they can predict survival on the Titanic.

One note - for now we are going to exclude Age, because as you saw when you examined the data, there are some passengers that do not have any age information. This problem, known as "missing data" is one that we will deal with before the end of this problem.

Problem 3d

How accurately can the numeric features, Pclass, SibSp, Parch, and Fare predict survival on the Titanic? Use a $k = 7$ Nearest Neighbors model, and estimate the model accuracy using 10-fold cross validation.

Hint 1 - you'll want to redefine your features vector X

Hint 2 - you may find cross_val_score from scikit-learn helpful


In [97]:
from sklearn.model_selection import cross_val_score

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare']]

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, X, y, cv=10)

print('The accuracy from numeric features = {:.2f}%'.format(100*np.mean(cv_results)))


The accuracy from numeric features = 68.51%

An accuracy of 68% isn't particularly inspiring. But, there's a lot of important information that we are excluding. As far as the Titanic is concerned, Kate and Leo taught us that female passengers are far more likely to survive, while male passengers are not. So, if we can include gender in the model then we may be able to achieve more accurate predictions.

Problem 3e

Create a new feature called gender that equals 1 for male passengers and 2 for female passengers. Add this feature to your dataframe, and include it in a kNN model with the other numeric features. Does the inclusion of this feature improve the 10-fold CV accuracy?


In [98]:
gender = np.ones(len(titanic_df['Sex']))
gender[np.where(titanic_df['Sex'] == 'female')] = 2

titanic_df['gender'] = gender

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'gender']]

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, X, y, cv=10)

print('The accuracy when including gender = {:.2f}%'.format(100*np.mean(cv_results)))


The accuracy when including gender = 77.28%

A 14% improvement is pretty good! But, we can wrangle even more out of the gender feature. Recall that kNN models measure the Euclidean distance between sources, meaning the scale of each feature really matters. Given that the fare ranges from 0 up to 512.3292, the kNN model will see this feature as far more important than gender, for no other reason than the units that have been adopted.

If women are far more likely to survive than men, then we want to be sure that gender is weighted at least the same as all the other features, which we can do with a minmax scaler. As a brief reminder - a minmax scaler scales all values of a feature to be between 0 and 1 by subtracting the minimum value of each feature and then dividing by the maximum minus the minimum.

Problem 3f

Scale all the features from the previous problem using a minmax scaler and evaluate the CV accuracy of the kNN model.

Hint - you may find MinMaxScaler helpful


In [99]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy when scaling features = {:.2f}%'.format(100*np.mean(cv_results)))


The accuracy when scaling features = 80.43%

Scaling the features leads to further improvement!

Now turn your attention to another categorical feature, Embarked, which includes the point of origin for each passenger and has three categories, S, Q, and C. We need to convert these values to a numeric representation for inclusion in the model.

Problem 3g

Convert the categorical feature Embarked to a numeric representation, and add it to the titanic_df.


In [100]:
# following previous example, set C = 0, S = 1, Q = 2

porigin = np.empty(len(titanic_df['Sex'])).astype(int)
porigin[np.where(titanic_df['Embarked'] == 'C')] = 0
porigin[np.where(titanic_df['Embarked'] == 'S')] = 1
porigin[np.where(titanic_df['Embarked'] == 'Q')] = 2

titanic_df['porigin'] = porigin

But wait! Does this actually make sense?

Our "numerification" has now introduced order where there previously was none. We are effectively telling the model that Cherbourg and Queenstown are far apart (not in distance but in terms of the similarity of the passengers that boarded the ship in each location), while each are equally close to Southampton. Is there actually any evidence to support this conclusion?

By definition categorical features do not have order (e.g., cat, dog, horse, elephant), and therefore we should not impose any when converting these features to numeric values for inclusion in our model. Instead, we should be creating a new set of binary features for every category within the feature set. Thus, Embarked will now need to be represented by 3 different features, where the feature Queenstown equals one for passengers that boarded there and zero for everyone else.

Problem 3h

Complete the function below that will automatically create binary arrays for a categorical feature.


In [53]:
def create_bin_cat_feats(feature_array):
    categories = np.unique(feature_array)
    feat_dict = {}
    for cat in categories:
        exec('{} = np.zeros(len(feature_array)).astype(int)'.format(cat))
        exec('{0}[np.where(feature_array == "{0}")] = 1'.format(cat))
        exec('feat_dict["{0}"] = {0}'.format(cat))
    
    return feat_dict

Problem 3i

Use the create_bin_cat_feats function to convert the Embarked and Sex, yes we need to do this for Sex as well where we otherwise previously introduced order, categorical features to a numeric representation. Add these features to the titanic_df data frame.


In [125]:
gender_dict = create_bin_cat_feats(titanic_df['Sex'])
porigin_dict = create_bin_cat_feats(titanic_df['Embarked'])

for feat in gender_dict.keys():
    titanic_df[feat] = gender_dict[feat]
    
for feat in porigin_dict.keys():
    titanic_df[feat] = porigin_dict[feat]

Problem 3j

Use the newly created female, male, S, Q, and C features in combination with the Pclass, SibSp, Parch, and Fare features to estimate the classification accuracy of a $k = 7$ nearest neighbors model with 10-fold cross validation.

How does the addition of the point of origin feature affect the final model output?

Hint - don't forget to scale the features in the model


In [126]:
from sklearn.preprocessing import MinMaxScaler

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with categorical features = {:.2f}%'.format(100*np.mean(cv_results)))


The accuracy with categorical features = 80.09%

The last thing we'd like to add to the model is the Age feature. Unfortunately, for 177 passengers we do not have a reported value for their age. This is a standard issue when building models known as "missing data" and this happens in astronomy all the time (for example, LSST is going to observe millions of L and T dwarfs that are easily detected in the $y$-band, but which do not have a detection in $u$-band).

There are several different strategies for dealing with missing data. The first and most straightforward is to simply remove observations with missing data (note - to simplify this example I already did this by removing the 2 passengers from the training set that did not have an entry for Embarked).

This strategy is perfectly fine if only a few sources have missing information (2/891 for Embarked - and none of the test set sources are missing Embarked). If, however, a significant fraction are missing data, this strategy would remove a lot of useful data from the model.

If you cannot remove the sources with missing data, then it is essential to ask the following question:

Does the missing information have meaning?

In the LSST L/T dwarf example, the lack of a $u$-band detection is meaningful: these stars are too faint for LSST. When this is the case, an indicator value (e.g., -999) allows the model to recognize the non-detection.

For the Titanic data, the lack of age information is not meaningful. Simply put, there are some passengers that did not have recorded ages. We will now show this to be the case.

Problem 3k

Replace the unknown ages with a value of -999, and estimate the accuracy of the model via 10-fold cross validation.


In [127]:
age_impute = titanic_df['Age'].copy()
age_impute[np.isnan(age_impute)] = -999

titanic_df['age_impute'] = age_impute

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with -999 for missing ages = {:.2f}%'.format(100*np.mean(cv_results)))


The accuracy with -999 for missing ages = 80.42%

The accuracy of the model hasn't improved by adding the age information (even though we know children were more likely to survive than adults).

Given that the missing ages don't have meaning, we need to develop alternative strategies for "imputing" the missing data. The most simple approach in this regard is to replace the missing values with the mean value of the feature distribution for sources that do have measurements (use the median if the distribution has significant outliers).

Problem 3l

Replace the unknown ages with the mean age of passengers, and estiamte the accuracy of the model via 10-fold cross validation.


In [216]:
age_impute = titanic_df['Age'].copy().values
age_impute[np.isnan(age_impute)] = np.mean(age_impute[np.isfinite(age_impute)])

titanic_df['age_impute'] = age_impute

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with the mean for missing ages = {:.2f}%'.format(100*np.mean(cv_results)))


The accuracy with the mean for missing ages = 80.76%

Using the mean age for missing values provides a marginal improvement over the models with no age information. Is there anything else we can do? Yes - we can build a machine learning model to predict the values of the missing data. So there will be a machine learning model within the final machine learning model. In order to predict ages, we will need to build a regression model. Simple algorithms include Linear or Logistic Regression, while more complex examples include kNN or random forest regression.

I quickly tested the above four methods, and found that the scikit-learn LinearRegression performs best when using the model defaults.

Probem 3m

Build a LinearRegression model to predict a passenger's age based on the Pclass, SibSp, Parch, Fare, female, male, S, Q, C features. The model should be trained with passengers that have known ages. Use 10-fold cross validation to determine the performance of this model.

Hint - note that for regression models the typical metric of evaluation is the mean squared error, and that for consistency within the scikit-learn API, the negative mean squared error is returned rather than the mean squared error.


In [217]:
from sklearn.linear_model import LinearRegression

has_ages = np.where(np.isfinite(titanic_df['Age']))[0]

impute_X_train = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']].iloc[has_ages]
impute_y_train = titanic_df['Age'].iloc[has_ages]

scaler = MinMaxScaler()
scaler.fit(impute_X_train)
Xminmax = scaler.transform(impute_X_train)

lr_age = LinearRegression().fit(Xminmax, impute_y_train)

cv_results = cross_val_score(LinearRegression(), Xminmax, impute_y_train, cv=10, scoring='neg_mean_squared_error')

print('Missing ages have RMSE = {:.2f}'.format(np.mean((-1*cv_results)**0.5)))


Missing ages have RMSE = 12.77

Problem 3n

Use the age regression model to predict the ages for passengers with missing data.


In [218]:
missing_ages = np.where(np.isnan(titanic_df['Age']))[0]

impute_X_missing = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']].iloc[missing_ages]

X_missing_minmax = scaler.transform(impute_X_missing)

age_preds = lr_age.predict(X_missing_minmax)

Problem 3o

Use the imputed age estimates to predict the passenger survival via 10-fold cross validation.


In [229]:
age_impute = titanic_df['Age'].copy().values
age_impute[missing_ages] = age_preds

titanic_df['age_impute'] = age_impute

X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]

scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_clf = KNeighborsClassifier(n_neighbors=7)

cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)

print('The accuracy with the mean for missing ages = {:.2f}%'.format(100*np.mean(cv_results)))


The accuracy with the mean for missing ages = 80.20%

As far as ages are concerned, imputation of the missing data does not significantly improve the model.

Which brings us to the concluding lesson in wrangling the Titanic data - not every piece of information is useful. This is critical to remember when building machine learning models.

(As a quick aside - it wouldn't be entirely fair to say there is no useful information in the age feature. It is clear, for example, that "children," i.e. those with Age < 10, had a much higher probability of survival than adults. Perhaps the creation of a child feature based on age would improve the model... or using the age in combination with other features, e.g., AgexPclass which will further highlight that 1st class passengers were more likely to survive than 3rd class passengers)

Finally - note that you can try to build a model and submit it to Kaggle to see how well you preform on blind data.

https://www.kaggle.com/c/titanic - the classifications are not revealed, but from the leaderboard it is clear that some people were able to build models that perfectly classified the blind data.

Problem 4) Wrangling an astro machine learning model

Now we will put together the above examples to work on an astronomical problem of great importance for LSST: photometric redshifts. As we have previously discussed, LSST will observe many, many, many, many more galaxies than we can possibly observe with spectroscopic instruments. As a result, the vast majority of the galaxies observed by LSST will not have precise redshift estimates, and instead we will need to estimate redshifts via photometry alone.

This week we have heard a lot about SDSS CasJobs, and for today's problem we will use 50,000 SDSS sources with spectroscopic measurements to train a machine learning model to estimate redshifts from photometry. The following query was used to generate this training set:

select top 50000 s.specObjID, s.z, s.type,
   psfMag_u, psfMag_g, psfMag_r, psfMag_i, psfMag_z, 
   modelMag_u, modelMag_g, modelMag_r, modelMag_i, modelMag_z,
   extinction_u, extinction_g, extinction_r, extinction_i, extinction_z,
   w.w1mpro, w1snr, w.w2mpro, w2snr, w.w3mpro, w3snr, w.w4mpro, w4snr
   from specphotoall s 
     join (WISE_xmatch wx  
       join WISE_allsky w on wx.wise_cntr = w.cntr) on s.objid = wx.sdss_objid

A few notes about the data: specObjID is the SDSS identifier, z is the redshift, type is the photometric type ("ps" for point source or "ext" for extended), psfMag is the SDSS PSF mag in each filter, modelMag is the aperture-matched model mag in each filter, extinction is the SFD extinction estimate, w?mpro is the WISE satellite mid-IR magnitude in the W1, W2, W3, and W4 bands, and w?msnr is the SNR in each of the respective filters.

quick aside - the SDSS photometric type is reported as either 6 or 3, and, for convenience, I have converted these to 'ps' and 'ext' in the file used for the problem below.

Important note - not every SDSS source will have detections in each of the WISE filters. If w?msnr < 2, then the w?mpro number represents an upper limit, not a detection (i.e., these are cases with missing data).

Problem 4a

Download and read in the training set for the model.


In [79]:
sdss = pd.read_csv("DSFP_SDSS_spec_train.csv")
sdss[:5]


Out[79]:
specObjID z type psfMag_u psfMag_g psfMag_r psfMag_i psfMag_z modelMag_u modelMag_g ... extinction_i extinction_z w1mpro w1snr w2mpro w2snr w3mpro w3snr w4mpro w4snr
0 299567742770505728 0.071414 ext 20.88291 19.23907 18.56170 18.09715 17.76469 19.62189 18.03702 ... 0.061267 0.045572 14.395 21.1 14.236 15.6 11.029 6.8 8.579 -0.9
1 299568017178650624 0.071380 ext 20.88291 19.23907 18.56170 18.09715 17.76469 19.62189 18.03702 ... 0.061267 0.045572 14.395 21.1 14.236 15.6 11.029 6.8 8.579 -0.9
2 299566643258877952 0.088173 ext 20.84844 18.96040 18.08027 17.62953 17.31857 20.18508 18.26120 ... 0.047582 0.035392 14.162 35.0 13.970 22.3 12.233 2.9 9.067 -0.3
3 299569116690278400 0.088161 ext 20.84844 18.96040 18.08027 17.62953 17.31857 20.18508 18.26120 ... 0.047582 0.035392 14.162 35.0 13.970 22.3 12.233 2.9 9.067 -0.3
4 299568292056557568 0.066539 ext 21.28256 19.61427 18.98529 18.52956 18.26322 20.18081 18.47435 ... 0.058709 0.043669 14.734 30.6 14.512 17.6 11.078 9.8 9.054 2.3

5 rows × 26 columns

Problem 4b

Are there categorical features in the dataset? If yes, convert the categorical features for use in machine learning models.

Hint - recall Problem 3h and 3i.


In [82]:
type_dict = create_bin_cat_feats(sdss['type'])


for feat in type_dict.keys():
    sdss[feat] = type_dict[feat]

Problem 4c

Is there missing data in the training set? If yes, how will you deal these data?

Hint - you were already told there is missing data.


In [83]:
# WISE non-detections have SNR < 2

for wsnr in ['w1snr', 'w2snr', 'w3snr', 'w4snr']:
    frac_missing = sum(sdss[wsnr] < 2)/len(sdss[wsnr])
    print('{:.2f}% of the obs in {} are non-detections'.format(100*frac_missing, wsnr[0:2]))


0.01% of the obs in w1 are non-detections
1.11% of the obs in w2 are non-detections
34.22% of the obs in w3 are non-detections
64.98% of the obs in w4 are non-detections

Given that there are obs missing in each filter, with a substantial number missing in W3 and W4, we will create a new categorical variable for detection in each filter. We will also replace "upper limits" with -9.99.

(Alternative strategies that could prove worthwhile include: removing W3 and W4 as they are largely missing, imputing the missing values for non-detections, or removing individual sources with non-detections. This last choice would really cut the training set).


In [86]:
for filt in ['w1', 'w2', 'w3', 'w4']:
    det = np.ones(len(sdss)).astype(int)
    det[np.where(sdss['{}snr'.format(filt)] < 2)] = 0
    sdss['{}det'.format(filt)] = det
    mag = sdss['{}mpro'.format(filt)].values
    mag[det == 0] = -9.99
    sdss['{}mag'.format(filt)] = mag

Now that we have dealt with missing and categorical data we can construct out machine learning model. We will use a $k$ nearest neighbors regression model to determine how well we can measure photometric redshifts.

The sdss dataframe now includes far more columns than are necessary for the model. Should any of these be excluded from the fitting?

Problem 4d

Build a scikit-learn appropriate X and y array for the features and labels (i.e. redshifts) of the photo-z training set.


In [90]:
X = np.array(sdss[['psfMag_u', 'psfMag_g', 'psfMag_r',
       'psfMag_i', 'psfMag_z', 'modelMag_u', 'modelMag_g', 'modelMag_r',
       'modelMag_i', 'modelMag_z', 'extinction_u', 'extinction_g',
       'extinction_r', 'extinction_i', 'extinction_z','ext', 'ps',
       'w1det', 'w1mag', 'w2det', 'w2mag', 'w3det', 'w3mag', 'w4det', 'w4mag']])

y = np.array(sdss['z'])

Problem 4e

Estimate the RMSE for a KNeighborsRegression via 10 fold CV.

Hint - use cross_val_predict in this case so you can plot $z_\mathrm{SDSS}$ vs. $z_\mathrm{kNN}$.

Hint 2 - make the plot in a separate cell so you can adjust it without needing to re-run the CV.


In [95]:
# cross validation goes here
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

knn_reg = KNeighborsRegressor(n_neighbors=7)
y_preds = cross_val_predict(knn_reg, X, y, cv=10)
print('The RMSE = {}'.format(np.sqrt(mean_squared_error(y,y_preds))))


The RMSE = 0.21209874958474734

In [106]:
# plotting goes here

fig, ax = plt.subplots()
ax.scatter(y, y_preds, alpha=0.02)
ax.plot([0,6],[0,6], 'Crimson')
ax.set_xlabel('z_SDSS')
ax.set_ylabel('z_kNN')


Out[106]:
Text(0,0.5,'z_kNN')

If you made a plot of redshift vs. predictions, you likely saw that there are many sources that are far from the 1 to 1 line. These are called catastrophic outliers, and they are a serious problem for science programs that rely on photometric redshifts.

Problem 4f

Write a function that takes input arrays ground_truth and predictions and determines the fraction of the dataset that is catastrophic outliers. For today's purposes we will say a catastrophic outlier is one where the prediction differs by 20%.


In [97]:
def catastrophic_fraction(ground_truth, predictions, threshold=0.2):
    '''Function to calculate fraction of predictions that are catastrophic outliers
    
    Parameter
    ---------
    ground_truth : array-like
        Correct labels for the model sources
    
    predictions : array-like
        Predictions for the model sources
    
    threshold : float (optional, default=0.2)
        The threshold to determine if a "miss" is catastrophic or not
    
    Returns
    -------
    oh_nos : float
        Fractional number of catastrophic outliers
    '''
    n_outliers = len(np.where(np.abs(ground_truth - predictions)/ground_truth > threshold)[0])
    oh_nos = n_outliers/len(ground_truth)
    
    return oh_nos

Problem 4g

How many catastrophic outliers did your previous predictions have?


In [98]:
catastrophic_fraction(y, y_preds)


/Users/adamamiller/miniconda3/envs/DSFP/lib/python3.6/site-packages/ipykernel_launcher.py:20: RuntimeWarning: divide by zero encountered in true_divide
Out[98]:
0.39788

Earlier we saw that the performance of a kNN model greatly improves with feature scaling.

Problem 4h

Use a MinMax scaler to scale the features, performs 10 fold cross-validation, and estimate the catastrophic outlier fraction.


In [100]:
scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)

knn_reg = KNeighborsRegressor(n_neighbors=7)
y_preds = cross_val_predict(knn_reg, Xminmax, y, cv=10)
print('The RMSE = {}'.format(np.sqrt(mean_squared_error(y,y_preds))))
print('{} are catastrophic'.format(catastrophic_fraction(y, y_preds)))


The RMSE = 0.2785871143378626
0.6464 are catastrophic
/Users/adamamiller/miniconda3/envs/DSFP/lib/python3.6/site-packages/ipykernel_launcher.py:20: RuntimeWarning: divide by zero encountered in true_divide

In [102]:
# plotting goes here

fig, ax = plt.subplots()
ax.scatter(y, y_preds, alpha=0.02)
ax.plot([0,6],[0,6], 'Crimson')
ax.set_xlabel('z_SDSS')
ax.set_ylabel('z_kNN')


Out[102]:
Text(0,0.5,'z_kNN')

The MinMax scaler didn't help at all!

Perhaps we need a different kind of wrangling for this data set - feature engineering. (For example, does it make sense to include extinction as a feature, or should we instead combine the extinction with the observed magnitudes to get extinction corrected brightness measurements?)

Finally - we close with a machine learning question, not a data wrangling question. Does it actually make sense to use a kNN model for datasets that have categorical features?

Problem 4i

Build a better machine learning model, possibly with the use of different features, to improve the model classification performance.


In [108]:
from sklearn.ensemble import RandomForestRegressor

for filt in ['u', 'r', 'i', 'z']:
    sdss['psf_g-{}'.format(filt)] = sdss['psfMag_g'] - sdss['psfMag_{}'.format(filt)]
    sdss['model_g-{}'.format(filt)] = sdss['modelMag_g'] - sdss['modelMag_{}'.format(filt)]

X = np.array(sdss[['ps', 'ext', 
                   'w1det', 'w1mag', 'w2det', 'w2mag', 'w3det', 'w3mag', 'w4det', 'w4mag',
                   'psf_g-u', 'psf_g-r', 'psf_g-i', 'psf_g-z', 
                   'model_g-u', 'model_g-r', 'model_g-i', 'model_g-z']])

rf_reg = RandomForestRegressor(n_estimators=100)
y_preds = cross_val_predict(rf_reg, X, y, cv=10)
print('The RMSE = {}'.format(np.sqrt(mean_squared_error(y,y_preds))))
print('{} are catastrophic'.format(catastrophic_fraction(y, y_preds)))


The RMSE = 0.2231823202772911
0.32776 are catastrophic
/Users/adamamiller/miniconda3/envs/DSFP/lib/python3.6/site-packages/ipykernel_launcher.py:20: RuntimeWarning: divide by zero encountered in true_divide

In [111]:
# plotting goes here

fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(y, y_preds, alpha=0.02)
ax.plot([0,6],[0,6], 'Crimson')
ax.set_xlim(-0.1,4)
ax.set_ylim(-0.1,4)
ax.set_xlabel('z_SDSS')
ax.set_ylabel('z_kNN')


Out[111]:
Text(0,0.5,'z_kNN')