In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
Webster's Dictionary$^\ast$ defines wrangler as:
wrangler noun
wran·gler | raŋ-g(ə-)lər
(short for horse-wrangler, probably a partial translation of Mexican Spanish caballerango 'groom'): a ranch hand who takes care of the saddle horses; broadly: cowboy
$^\ast$actually https://www.merriam-webster.com/dictionary/ - Webster's didn't define wrangler in the way I wanted
How then, as astronomers, are we all like cowhands?
Data are often like horses in that: they all differ, rarely conform to a single standard set of behavior, and they love to eat hay.$^\dagger$
$^\dagger$I made that last one up.
Thus, in our efforts to better understand the Universe, we must often manipulate, coax, and, in some cases, force our data to "behave." This involves a variety of tasks, such as: gathering, cleaning, matching, restructuring, transforming, filtering, combining, merging, verifying, and fixing data.
Here is a brief and unfortunate truth: there isn't a single person in the entire world who would organize data in exactly the same way that you would.
As a result, you may find that data that are useful to you are not organized in an optimal fashion for use in your workflow/software.
Hence: the need to wrangle.
There is one important and significant way in which our lives as astronomers are much better than the average data scientist's: even though our data are "worthless," virtually all of them are numbers.
Furthermore, I contend that most astronomical data can easily be organized into a simple tabular structure.
Yesterday we joked that you could write a 10 page text document describing every individual galaxy in a galaxy cluster with 5000 members, but no one would ever do that.
Nevertheless, as you will see during the exercises, even with relatively simple, small numerical data sets there is a need for wrangling.
And wrangling brings up a lot of issues...
Consider the following data set that contains the street names for my best friends from childhood:
['Ewing', 'Isabella', 'Reese', 'Isabella',
'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece']
Do you notice anything interesting?
Either my hometown has a street named "Reese" and a street named "Reece", or the last entry was input incorrectly.
If the latter is true, then we have to ask: what should we do?
For this particular data set, it would be possible to create a verification procedure to test for similar errors.
For any instances where this isn't the case, one could then intervene with a correction.
This particular verification catches this street name error, but it does not catch every possible mistake: perhaps the person doing the data entry was reading addresses really quickly and the third "Reese" entry should actually have said "Lawndale."
(verification is really hard)
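As a concrete sketch of such a procedure (the canonical street list here is hypothetical), one could flag any entry that does not appear on a list of known streets and suggest the closest valid name using difflib from the python standard library:

import difflib

known_streets = ['Ewing', 'Isabella', 'Reese', 'Thayer', 'Lawndale']
entries = ['Ewing', 'Isabella', 'Reese', 'Isabella',
           'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece']

for entry in entries:
    if entry not in known_streets:
        # flag the entry and suggest a correction, rather than silently "fixing" it
        close = difflib.get_close_matches(entry, known_streets, n=1)
        if close:
            print('"{}" not recognized - did you mean "{}"?'.format(entry, close[0]))
        else:
            print('"{}" not recognized; no close match found'.format(entry))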
This also raises a question about data provenance.
Data provenance is a historical record of the data and its origins.
If you are making "corrections" to the data, then each and every one of those corrections should be reported (this is similar to the idea of logging from yesterday's databases talk). Ideally, these reports would live with the data so others could understand how things have changed.
If you did change "Reece" to "Reese", anyone working with the data should be able to confirm those changes.
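There is no universal standard for recording provenance, but even a simple change log kept alongside the data goes a long way. A minimal sketch (the record format here is entirely made up):

provenance = []

def correct_entry(data, index, new_value, reason):
    '''Apply a correction and record what changed, where, and why'''
    provenance.append({'index': index, 'old': data[index],
                       'new': new_value, 'reason': reason})
    data[index] = new_value

streets = ['Ewing', 'Isabella', 'Reese', 'Isabella',
           'Thayer', 'Reese', 'Reese', 'Ewing', 'Reece']
correct_entry(streets, 8, 'Reese', 'probable typo - no known street "Reece"')
print(provenance)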
Suppose now you wanted to use the same street name data set to estimate which street I lived on while growing up.
One way to mathematically approach this problem would be to convert the streets in the data set to GPS coordinates, and then to perform a kernel density estimate (KDE) of the probability density function (PDF) for the coordinates of where I lived.
This too is a form of wrangling, because the data you have (street names) are not the data you need (coordinates).
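Here is a sketch of what the KDE step might look like, using scipy and entirely made-up coordinates standing in for the geocoded street names:

from scipy.stats import gaussian_kde

# made-up (longitude, latitude) pairs - one per friend's street
coords = np.array([[-87.68, 42.05], [-87.70, 42.06], [-87.69, 42.05],
                   [-87.70, 42.06], [-87.71, 42.04], [-87.69, 42.05],
                   [-87.69, 42.05], [-87.68, 42.05], [-87.69, 42.05]])

kde = gaussian_kde(coords.T)   # gaussian_kde expects shape (n_dim, n_points)
# the coordinate pair that maximizes the density is a natural estimate
best_guess = coords[np.argmax(kde(coords.T))]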
Why harp on this?
In practice, data scientists (including astronomers) spend an unreasonable amount of time manipulating and quality checking data (some industry experts estimate that up to 80% of their time is spent wranglin').
Today, we will work through several examples that require wrangling, while, hopefully, building some strategies to minimize the amount of time you spend on these tasks in the future.
For completeness, I will mention that there is a famous canonical paper about data wrangling, which introduces the Wrangler, a tool specifically designed to take heterogeneous (text) data and provide a set of suggested operations/manipulations to create a homogeneous table amenable to standard statistical analysis.
One extremely nice property of the Wrangler is that it records every operation performed on the data, ensuring high-fidelity reporting on the data provenance. We should do a better job of this in astronomy.
Today, we are going to focus on python solutions to some specific astronomical data sets.
Hopefully you learn some tricks to make your work easier in the future.
If at any point in your career you need to access archival infrared data, you will likely need to retrieve that information from the NASA/IPAC Infrared Science Archive (IRSA). IRSA houses the data for every major NASA IR mission, and several ground-based missions as well (e.g., 2MASS, IRTF). Whether you are studying brown dwarfs, explosive transients, solar system objects, star formation, galaxy evolution, Milky Way dust and the resulting extinction of extragalactic observations, or quasars (and much more), the IR plays a critical role.
Given the importance of IR observations, it stands to reason that IRSA would provide data in a simple to read format for modern machines, such as comma separated values or FITS binary tables...
Right?...
Right?...
In fact, IRSA has created its own standard for storing data in a text file. If you haven't already, please download irsa_catalog_WISE_iPTF14jg_search_results.tbl, a file that is written in the standard IRSA format.
shameless plug alert! iPTF14jg is a really strange star that exhibited a large outburst that we still don't totally understand. The associated data file includes NEOWISE observations of the mid-IR evolution of this outburst.
Problem 1a
Using pandas, read the data in the IRSA table file into a DataFrame object.
Hint 1 - you absolutely should look at the text file to develop a strategy to accomplish this goal.
Hint 2 - you may want to manipulate the text file so that it can more easily be read by pandas. If you do this, be sure to copy the file to another name as you will want to leave the original intact.
In [5]:
# Solution 1 - pure python solution with pandas
with open('irsa_catalog_WISE_iPTF14jg_search_results.tbl') as f:
    ll = f.readlines()

# find the column names - IRSA marks the header row with '|'
for linenum, l in enumerate(ll):
    if l[0] == '|':
        header = l.replace('|', ',').replace(' ', '')
        header = list(header[1:-2].split(','))
        break

# the IRSA header block spans 4 lines (names, types, units, nulls), hence linenum+4
irsa_tbl = pd.read_csv("irsa_catalog_WISE_iPTF14jg_search_results.tbl",
                       skiprows=linenum+4, delim_whitespace=True,
                       header=None, names=header)
irsa_tbl.head(5)
Out[5]:
That pure python solution is a bit annoying, as it requires a for loop with a break, and specific knowledge about how IRSA tables handle data headers (hence the use of linenum + 4 for skiprows). Alternatively, one could manipulate the data file to read in the data.
In [6]:
# solution 2 - edit the text file
# !cp irsa_catalog_WISE_iPTF14jg_search_results.tbl tmp.tbl
### delete lines 1-89, and 90-92
### replace whitespace with commas (may require multiple operations)
### replace '|' with commas
### replace ',,' with single commas
### replace ',\n,' with '\n'
### delete the comma at the very beginning and very end of the file
tedit_tbl = pd.read_csv('tmp.tbl')
tedit_tbl.head(5)
Out[6]:
That truly wasn't much better - it required a bunch of clicks/text editor edits. (There are programs such as sed and awk that could be used to execute all the necessary edits from the command line, but that too is cumbersome and somewhat like the initial all-python solution.)
If astronomers are creating data in a "standard" format, then it ought to be easy for other astronomers to access that data.
Fortunately, in this particular case, there is an easy solution - astropy Tables.
IRSA tables are so commonly used throughout the community that the folks at astropy have created a convenience method for all of us to read in tables created in that particular (unusual?) format.
Problem 1b
Use Table.read() to read in irsa_catalog_WISE_iPTF14jg_search_results.tbl to an astropy Table object.
In [28]:
from astropy.table import Table
Table.read('irsa_catalog_WISE_iPTF14jg_search_results.tbl', format='ipac')
Out[28]:
A benefit to using this method, as opposed to pandas, is that data typing and data units are naturally read from the IRSA table and included with the associated columns. Thus, if you are uncertain whether some brightness measurement is in magnitudes or Janskys, the astropy Table can report that information.
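For example, evaluating the info attribute in a notebook cell lists the name, dtype, and unit of every column read from the file:

tbl = Table.read('irsa_catalog_WISE_iPTF14jg_search_results.tbl', format='ipac')
tbl.info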
Unfortunately, astropy does not know about every strange formatting decision that every astronomer has made at some point in their lives (as we are about to see...).
Unlike IRSA/IPAC, which uses a weird but nevertheless consistent format for data tables, data retrieved from Journal articles essentially follows no rules. In principle, tables in Journal articles are supposed to be provided in a machine readable format. In practice, as we are about to see, this is far from the case.
For this particular wrangling case study we will focus on supernova light curves - a simple thing to report (time, filter, brightness, and the uncertainty on that brightness) that the community has nevertheless managed to mangle into some truly wild and difficult-to-parse forms.
(Sorry for the heavy emphasis on time-domain examples - I'm pulling straight from my own life today, but the issues described here are not perfectly solved by any subfield under the astro umbrella.)
Here is the LaTeX-formatted version of Table 4 from Miller et al. 2011:
<img style="display: block; margin-left: auto; margin-right: auto" src="images/Miller11_tbl4.png" align="middle">
That is a very simple table to interpret, no?
Have a look at the "machine-readable" file that ApJ provides for readers who might want to evaluate these photometric measurements.
Problem 2a
Download the ApJ version of Table 4 from Miller et al. 2011 and read that table into either a pandas DataFrame or an astropy Table.
In [38]:
# pure python solution with pandas
tbl4 = pd.read_csv('Miller_et_al2011_table4.txt',
skiprows=5, delim_whitespace=True,
skipfooter=3, engine='python',
names=['t_mid', 'J', 'Jdum', 'J_unc',
'H', 'Hdum', 'H_unc',
'K', 'Kdum', 'K_unc'])
tbl4.drop(columns=['Jdum', 'Hdum', 'Kdum'], inplace=True)
print(tbl4)
That wasn't too terrible. But what if we consider a more typical light curve table, where there are loads of missing data, such as Table 2 from Foley et al. 2009:
<img style="display: block; margin-left: auto; margin-right: auto" src="images/Foley09_tbl2.png" align="middle">
Again, this table is straightforward to read, and it isn't hard to imagine how one could construct a machine-readable csv or other file from this information. But alas, this is not what is available from ApJ. So, we will need to figure out how to deal with both the missing data, "...", and the weird convention that many astronomers use where the uncertainties are (a) not reported in their own column, and (b) not provided in the same units as the measurement itself. I can understand the former, but the latter is somewhat baffling...
Problem 2b
Download the ApJ version of Table 2 from Foley et al. 2009 and read that table into either a pandas DataFrame or an astropy Table.
In [48]:
# a (not terribly satisfying) pure python solution
# read the file in, parse and write another file that plays nice with pandas
with open('Foley_et_al2009_for_pd.csv','w') as fw:
    print('JD,Bmag,Bmag_unc,Vmag,Vmag_unc,Rmag,Rmag_unc,Imag,Imag_unc,Unfiltmag,Unfiltmag_unc,Telescope', file=fw)
    with open('Foley_et_al2009_table2.txt') as f:
        ll = f.readlines()
    for l in ll:
        if l[0] == '2':   # data rows all begin with the JD
            print_str = l.split()[0] + ','
            for col in l.split()[1:]:
                if col == 'sdotsdotsdot':   # missing data, i.e. "..." in the table
                    print_str += '-999,-999,'
                elif col[0] == '>':   # upper limit, so no uncertainty
                    print_str += '{},-999,'.format(col[1:])
                elif col == 'KAIT':
                    print_str += 'KAIT'
                elif col == 'Nickel':
                    print_str += 'Nickel'
                elif col[0] == '(':   # uncertainty, reported in hundredths of a mag
                    print_str += '0.{},'.format(col[1:-1])
                else:
                    print_str += '{},'.format(col)
            print(print_str, file=fw)

pd.read_csv('Foley_et_al2009_for_pd.csv')
Out[48]:
Okay - there is nothing elegant about that particular solution. But it works, and wranglin' ain't pretty.
It is likely that you developed a solution that looks very different from this one, and that is fine. When data are provided in an unruly format, the most important thing is to develop some method, any method, for converting the information into a useful format. Following whatever path you used above, it should now be easy to plot the light curve of SN 2008ha, as shown below.
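For example, using the csv file written above (and the -999 flags introduced for missing values and upper limits), the R-band light curve requires only a few lines:

lc = pd.read_csv('Foley_et_al2009_for_pd.csv')
det = (lc['Rmag'] > -999) & (lc['Rmag_unc'] > -999)   # detections only

fig, ax = plt.subplots()
ax.errorbar(lc['JD'][det], lc['Rmag'][det], lc['Rmag_unc'][det], fmt='o')
ax.set_xlabel('JD')
ax.set_ylabel('R (mag)')
ax.invert_yaxis()   # smaller magnitudes are brighter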
Sometimes there is no difficulty whatsoever in reading in the data (as was the case in Problems 1 and 2); instead the difficulty lies in wranglin' the data to be appropriate for the model that you are building.
For the next problem we will work with the famous Titanic survival data set. If you haven't already, please download the csv file with the relevant information.
Briefly, the Titanic is a famous historical ship that was thought to be unsinkable. Spoiler alert: it hit an iceberg and sank. The Titanic data set includes information about 891 passengers, as well as whether or not they survived. The aim of this data set is to build a machine learning model to predict which passengers survived and which did not.
The features include:
Feature | Description |
---|---|
PassengerId | Running index that describes the individual passengers |
Pclass | A proxy for socio-economic status (1 = Upper class, 2 = Middle Class, 3 = Lower Class) |
Name | The passenger's name |
Sex | The passenger's sex |
Age | The passenger's age - note ages ending in 0.5 are estimated |
SibSp | The sum of the passenger's siblings and spouses on board |
Parch | The sum of the passenger's parents and children on board |
Ticket | The ticket number for the passenger |
Fare | The price paid for the ticket by the passenger |
Cabin | The cabin in which the passenger stayed |
Embarked | The passenger's point of origin: C = Cherbourg, S = Southampton, Q = Queenstown |
And of course, we are trying to predict:
Label | Description |
---|---|
Survived | 1 = yes; 0 = no |
Problem 3a
Read in the Titanic training data and create the scikit-learn standard X and y arrays to hold the features and the labels, respectively.
In [124]:
titanic_df = pd.read_csv('titanic_kaggle_training_set.csv', comment='#')
feat_list = list(titanic_df.columns)
label = 'Survived'
feat_list.remove(label)
X = titanic_df[feat_list].values
y = titanic_df[label]
Now that we have the data in the appropriate X and y arrays, estimate the accuracy with which a $k$ nearest neighbors classification model can predict whether or not a passenger would survive the Titanic disaster. Use 10-fold cross validation for the prediction.
Problem 3b
Train a $k=7$ nearest neighbors machine learning model on the Titanic training set.
In [95]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=7)
knn_clf.fit(X, y)
Note - that should have failed! And for good reason - recall that kNN models measure the Euclidean distance between all points within the feature space. So, when considering the sex of a passenger, what is the numerical distance between male and female?
In other words, we need to wrangle this data before we can run the machine learning model.
Most of the features in this problem are non-numeric (i.e., we are dealing with categorical features), and therefore we need to figure out how to include them in the kNN model.
The first step when wrangling for machine learning is to figure out if anything can be thrown away. We certainly want to avoid including any uninformative features in the model.
If you haven't already, now would be a good time to create a new cell and examine the contents of the csv file.
In [96]:
titanic_df
Out[96]:
Problem 3c
Are there any features that obviously will not help with this classification problem?
If you answer yes - ignore those features moving forward
Solution 3c
write your solution here
The Name of the passenger will not be useful for this classification task. In particular, the useful information that might be learned from the name, such as the sex (e.g., Mr. vs. Mrs.) or the age (e.g., Mr. vs. Master), is already summarized elsewhere in the data.
The PassengerId, which is just a running index for each person in the dataset, is also unlikely to be useful when classifying this data.
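If you want to make this explicit (the solutions below simply never select these columns), you can drop them from the dataframe:

titanic_df = titanic_df.drop(columns=['PassengerId', 'Name'])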
Given that we have both categorical and numeric features, let's start with the numerical features and see how well they can predict survival on the Titanic.
One note - for now we are going to exclude Age, because as you saw when you examined the data, there are some passengers that do not have any age information. This problem, known as "missing data," is one that we will deal with before the end of this problem.
Problem 3d
How accurately can the numeric features Pclass, SibSp, Parch, and Fare predict survival on the Titanic? Use a $k = 7$ nearest neighbors model, and estimate the model accuracy using 10-fold cross validation.
Hint 1 - you'll want to redefine your features vector X.
Hint 2 - you may find cross_val_score from scikit-learn helpful.
In [97]:
from sklearn.model_selection import cross_val_score
X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare']]
knn_clf = KNeighborsClassifier(n_neighbors=7)
cv_results = cross_val_score(knn_clf, X, y, cv=10)
print('The accuracy from numeric features = {:.2f}%'.format(100*np.mean(cv_results)))
An accuracy of 68% isn't particularly inspiring. But, there's a lot of important information that we are excluding. As far as the Titanic is concerned, Kate and Leo taught us that female passengers were far more likely to survive than male passengers. So, if we can include gender in the model then we may be able to achieve more accurate predictions.
Problem 3e
Create a new feature called gender that equals 1 for male passengers and 2 for female passengers. Add this feature to your dataframe, and include it in a kNN model with the other numeric features. Does the inclusion of this feature improve the 10-fold CV accuracy?
In [98]:
gender = np.ones(len(titanic_df['Sex']))
gender[np.where(titanic_df['Sex'] == 'female')] = 2
titanic_df['gender'] = gender
X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'gender']]
knn_clf = KNeighborsClassifier(n_neighbors=7)
cv_results = cross_val_score(knn_clf, X, y, cv=10)
print('The accuracy when including gender = {:.2f}%'.format(100*np.mean(cv_results)))
A 14% improvement is pretty good! But, we can wrangle even more out of the gender feature. Recall that kNN models measure the Euclidean distance between sources, meaning the scale of each feature really matters. Given that the fare ranges from 0 up to 512.3292, the kNN model will see this feature as far more important than gender, for no other reason than the units that have been adopted.
If women are far more likely to survive than men, then we want to be sure that gender is weighted at least the same as all the other features, which we can do with a minmax scaler. As a brief reminder - a minmax scaler scales all values of a feature to be between 0 and 1 by subtracting the minimum value of each feature and then dividing by the maximum minus the minimum.
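In other words, for each feature the scaler computes $(x - x_\mathrm{min})/(x_\mathrm{max} - x_\mathrm{min})$. With the dataframe X defined above, the equivalent by-hand operation would be:

X_by_hand = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))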
Problem 3f
Scale all the features from the previous problem using a minmax scaler and evaluate the CV accuracy of the kNN model.
Hint - you may find MinMaxScaler helpful.
In [99]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)
knn_clf = KNeighborsClassifier(n_neighbors=7)
cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)
print('The accuracy when scaling features = {:.2f}%'.format(100*np.mean(cv_results)))
Scaling the features leads to further improvement!
Now turn your attention to another categorical feature, Embarked, which includes the point of origin for each passenger and has three categories: S, Q, and C. We need to convert these values to a numeric representation for inclusion in the model.
Problem 3g
Convert the categorical feature Embarked to a numeric representation, and add it to the titanic_df.
In [100]:
# following previous example, set C = 0, S = 1, Q = 2
porigin = np.empty(len(titanic_df['Sex'])).astype(int)
porigin[np.where(titanic_df['Embarked'] == 'C')] = 0
porigin[np.where(titanic_df['Embarked'] == 'S')] = 1
porigin[np.where(titanic_df['Embarked'] == 'Q')] = 2
titanic_df['porigin'] = porigin
But wait! Does this actually make sense?
Our "numerification" has now introduced order where there previously was none. We are effectively telling the model that Cherbourg and Queenstown are far apart (not in distance but in terms of the similarity of the passengers that boarded the ship in each location), while each are equally close to Southampton. Is there actually any evidence to support this conclusion?
By definition categorical features do not have order (e.g., cat, dog, horse, elephant), and therefore we should not impose any when converting these features to numeric values for inclusion in our model. Instead, we should create a new set of binary features, one for every category within the feature set. Thus, Embarked will now need to be represented by 3 different features, where, for example, the feature corresponding to Queenstown equals one for passengers that boarded there and zero for everyone else.
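(Worth knowing: pandas provides this operation out of the box via get_dummies, which produces one binary column per category. Writing the function yourself, as in the next problem, is still a useful exercise.)

pd.get_dummies(titanic_df['Embarked'], prefix='Embarked').head()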
Problem 3h
Complete the function below that will automatically create binary arrays for a categorical feature.
In [53]:
def create_bin_cat_feats(feature_array):
    '''Create a binary (0/1) indicator array for every category in feature_array'''
    categories = np.unique(feature_array)
    feat_dict = {}
    for cat in categories:
        bin_feat = np.zeros(len(feature_array)).astype(int)
        bin_feat[np.where(feature_array == cat)] = 1
        feat_dict[cat] = bin_feat

    return feat_dict
Problem 3i
Use the create_bin_cat_feats function to convert the Embarked and Sex categorical features to a numeric representation (yes, we need to do this for Sex as well - our earlier gender feature introduced an arbitrary order). Add these features to the titanic_df data frame.
In [125]:
gender_dict = create_bin_cat_feats(titanic_df['Sex'])
porigin_dict = create_bin_cat_feats(titanic_df['Embarked'])

for feat in gender_dict.keys():
    titanic_df[feat] = gender_dict[feat]
for feat in porigin_dict.keys():
    titanic_df[feat] = porigin_dict[feat]
Problem 3j
Use the newly created female, male, S, Q, and C features in combination with the Pclass, SibSp, Parch, and Fare features to estimate the classification accuracy of a $k = 7$ nearest neighbors model with 10-fold cross validation.
How does the addition of the point of origin feature affect the final model output?
Hint - don't forget to scale the features in the model
In [126]:
from sklearn.preprocessing import MinMaxScaler
X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']]
scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)
knn_clf = KNeighborsClassifier(n_neighbors=7)
cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)
print('The accuracy with categorical features = {:.2f}%'.format(100*np.mean(cv_results)))
The last thing we'd like to add to the model is the Age feature. Unfortunately, for 177 passengers we do not have a reported value for their age. This is a standard issue when building models, known as "missing data," and it happens in astronomy all the time (for example, LSST is going to observe millions of L and T dwarfs that are easily detected in the $y$-band, but which do not have a detection in $u$-band).
There are several different strategies for dealing with missing data. The first and most straightforward is to simply remove observations with missing data (note - to simplify this example I already did this by removing the 2 passengers from the training set that did not have an entry for Embarked).
This strategy is perfectly fine if only a few sources have missing information (2/891 for Embarked - and none of the test set sources are missing Embarked). If, however, a significant fraction are missing data, this strategy would remove a lot of useful data from the model.
If you cannot remove the sources with missing data, then it is essential to ask the following question:
Does the missing information have meaning?
In the LSST L/T dwarf example, the lack of a $u$-band detection is meaningful: these stars are too faint for LSST. When this is the case, an indicator value (e.g., -999) allows the model to recognize the non-detection.
For the Titanic data, the lack of age information is not meaningful. Simply put, there are some passengers that did not have recorded ages. We will now show this to be the case.
Problem 3k
Replace the unknown ages with a value of -999, and estimate the accuracy of the model via 10-fold cross validation.
In [127]:
age_impute = titanic_df['Age'].copy()
age_impute[np.isnan(age_impute)] = -999
titanic_df['age_impute'] = age_impute
X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]
scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)
knn_clf = KNeighborsClassifier(n_neighbors=7)
cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)
print('The accuracy with -999 for missing ages = {:.2f}%'.format(100*np.mean(cv_results)))
The accuracy of the model hasn't improved by adding the age information (even though we know children were more likely to survive than adults).
Given that the missing ages don't have meaning, we need to develop alternative strategies for "imputing" the missing data. The simplest approach is to replace the missing values with the mean value of the feature distribution for sources that do have measurements (use the median if the distribution has significant outliers).
Problem 3l
Replace the unknown ages with the mean age of passengers, and estimate the accuracy of the model via 10-fold cross validation.
In [216]:
age_impute = titanic_df['Age'].copy().values
age_impute[np.isnan(age_impute)] = np.mean(age_impute[np.isfinite(age_impute)])
titanic_df['age_impute'] = age_impute
X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]
scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)
knn_clf = KNeighborsClassifier(n_neighbors=7)
cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)
print('The accuracy with the mean for missing ages = {:.2f}%'.format(100*np.mean(cv_results)))
Using the mean age for missing values provides a marginal improvement over the models with no age information. Is there anything else we can do? Yes - we can build a machine learning model to predict the values of the missing data, meaning there will be a machine learning model within the final machine learning model. In order to predict ages, we will need to build a regression model. Simple algorithms include linear or logistic regression, while more complex examples include kNN or random forest regression.
I quickly tested the above four methods, and found that the scikit-learn LinearRegression performs best when using the model defaults.
Problem 3m
Build a LinearRegression model to predict a passenger's age based on the Pclass, SibSp, Parch, Fare, female, male, S, Q, and C features. The model should be trained with passengers that have known ages. Use 10-fold cross validation to determine the performance of this model.
Hint - note that for regression models the typical metric of evaluation is the mean squared error, and that for consistency within the scikit-learn API, the negative mean squared error is returned rather than the mean squared error.
In [217]:
from sklearn.linear_model import LinearRegression
has_ages = np.where(np.isfinite(titanic_df['Age']))[0]
impute_X_train = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']].iloc[has_ages]
impute_y_train = titanic_df['Age'].iloc[has_ages]
scaler = MinMaxScaler()
scaler.fit(impute_X_train)
Xminmax = scaler.transform(impute_X_train)
lr_age = LinearRegression().fit(Xminmax, impute_y_train)
cv_results = cross_val_score(LinearRegression(), Xminmax, impute_y_train, cv=10, scoring='neg_mean_squared_error')
print('Missing ages have RMSE = {:.2f}'.format(np.mean((-1*cv_results)**0.5)))
Problem 3n
Use the age regression model to predict the ages for passengers with missing data.
In [218]:
missing_ages = np.where(np.isnan(titanic_df['Age']))[0]
impute_X_missing = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C']].iloc[missing_ages]
X_missing_minmax = scaler.transform(impute_X_missing)
age_preds = lr_age.predict(X_missing_minmax)
Problem 3o
Use the imputed age estimates to predict the passenger survival via 10-fold cross validation.
In [229]:
age_impute = titanic_df['Age'].copy().values
age_impute[missing_ages] = age_preds
titanic_df['age_impute'] = age_impute
X = titanic_df[['Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'S', 'Q', 'C', 'age_impute']]
scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)
knn_clf = KNeighborsClassifier(n_neighbors=7)
cv_results = cross_val_score(knn_clf, Xminmax, y, cv=10)
print('The accuracy with imputed missing ages = {:.2f}%'.format(100*np.mean(cv_results)))
As far as ages are concerned, imputation of the missing data does not significantly improve the model.
Which brings us to the concluding lesson in wrangling the Titanic data - not every piece of information is useful. This is critical to remember when building machine learning models.
(As a quick aside - it wouldn't be entirely fair to say there is no useful information in the age feature. It is clear, for example, that "children," i.e. those with Age < 10, had a much higher probability of survival than adults. Perhaps the creation of a child feature based on age would improve the model... or using the age in combination with other features, e.g., Age × Pclass, which would further highlight that 1st class passengers were more likely to survive than 3rd class passengers.)
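A quick sketch of both ideas, in case you want to experiment (the feature names are my own, and this assumes the age_impute column created above):

titanic_df['child'] = (titanic_df['age_impute'] < 10).astype(int)
titanic_df['age_x_pclass'] = titanic_df['age_impute'] * titanic_df['Pclass']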
Finally - note that you can try to build a model and submit it to Kaggle to see how well you perform on blind data.
https://www.kaggle.com/c/titanic - the classifications are not revealed, but from the leaderboard it is clear that some people were able to build models that perfectly classified the blind data.
Now we will put together the above examples to work on an astronomical problem of great importance for LSST: photometric redshifts. As we have previously discussed, LSST will observe many, many, many, many more galaxies than we can possibly observe with spectroscopic instruments. As a result, the vast majority of the galaxies observed by LSST will not have precise redshift estimates, and instead we will need to estimate redshifts via photometry alone.
This week we have heard a lot about SDSS CasJobs, and for today's problem we will use 50,000 SDSS sources with spectroscopic measurements to train a machine learning model to estimate redshifts from photometry. The following query was used to generate this training set:
select top 50000 s.specObjID, s.z, s.type,
psfMag_u, psfMag_g, psfMag_r, psfMag_i, psfMag_z,
modelMag_u, modelMag_g, modelMag_r, modelMag_i, modelMag_z,
extinction_u, extinction_g, extinction_r, extinction_i, extinction_z,
w.w1mpro, w1snr, w.w2mpro, w2snr, w.w3mpro, w3snr, w.w4mpro, w4snr
from specphotoall s
join (WISE_xmatch wx
join WISE_allsky w on wx.wise_cntr = w.cntr) on s.objid = wx.sdss_objid
A few notes about the data: specObjID is the SDSS identifier, z is the redshift, type is the photometric type ("ps" for point source or "ext" for extended), psfMag is the SDSS PSF mag in each filter, modelMag is the aperture-matched model mag in each filter, extinction is the SFD extinction estimate, w?mpro is the WISE satellite mid-IR magnitude in the W1, W2, W3, and W4 bands, and w?snr is the SNR in each of the respective filters.
quick aside - the SDSS photometric type is reported as either 6 or 3, and, for convenience, I have converted these to 'ps' and 'ext' in the file used for the problem below.
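That conversion, which has already been applied to the provided file, amounts to a one-liner along these lines (shown commented out so it isn't re-run on the already-converted csv):

# sdss['type'] = sdss['type'].map({6: 'ps', 3: 'ext'})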
Important note - not every SDSS source will have detections in each of the WISE filters. If w?snr < 2, then the w?mpro number represents an upper limit, not a detection (i.e., these are cases with missing data).
Problem 4a
Download and read in the training set for the model.
In [79]:
sdss = pd.read_csv("DSFP_SDSS_spec_train.csv")
sdss[:5]
Out[79]:
Problem 4b
Are there categorical features in the dataset? If yes, convert the categorical features for use in machine learning models.
Hint - recall Problem 3h and 3i.
In [82]:
type_dict = create_bin_cat_feats(sdss['type'])
for feat in type_dict.keys():
    sdss[feat] = type_dict[feat]
Problem 4c
Is there missing data in the training set? If yes, how will you deal with these data?
Hint - you were already told there is missing data.
In [83]:
# WISE non-detections have SNR < 2
for wsnr in ['w1snr', 'w2snr', 'w3snr', 'w4snr']:
    frac_missing = sum(sdss[wsnr] < 2)/len(sdss[wsnr])
    print('{:.2f}% of the obs in {} are non-detections'.format(100*frac_missing, wsnr[0:2]))
Given that there are obs missing in each filter, with a substantial number missing in W3 and W4, we will create a new categorical variable for detection in each filter. We will also replace "upper limits" with -9.99.
(Alternative strategies that could prove worthwhile include: removing W3 and W4 as they are largely missing, imputing the missing values for non-detections, or removing individual sources with non-detections. This last choice would really cut the training set).
In [86]:
for filt in ['w1', 'w2', 'w3', 'w4']:
    det = np.ones(len(sdss)).astype(int)
    det[np.where(sdss['{}snr'.format(filt)] < 2)] = 0
    sdss['{}det'.format(filt)] = det

    mag = sdss['{}mpro'.format(filt)].values
    mag[det == 0] = -9.99
    sdss['{}mag'.format(filt)] = mag
Now that we have dealt with missing and categorical data we can construct our machine learning model. We will use a $k$ nearest neighbors regression model to determine how well we can measure photometric redshifts.
The sdss dataframe now includes far more columns than are necessary for the model. Should any of these be excluded from the fitting?
Problem 4d
Build a scikit-learn appropriate X and y array for the features and labels (i.e. redshifts) of the photo-z training set.
In [90]:
X = np.array(sdss[['psfMag_u', 'psfMag_g', 'psfMag_r',
'psfMag_i', 'psfMag_z', 'modelMag_u', 'modelMag_g', 'modelMag_r',
'modelMag_i', 'modelMag_z', 'extinction_u', 'extinction_g',
'extinction_r', 'extinction_i', 'extinction_z','ext', 'ps',
'w1det', 'w1mag', 'w2det', 'w2mag', 'w3det', 'w3mag', 'w4det', 'w4mag']])
y = np.array(sdss['z'])
Problem 4e
Estimate the RMSE for a KNeighborsRegressor via 10-fold CV.
Hint - use cross_val_predict in this case so you can plot $z_\mathrm{SDSS}$ vs. $z_\mathrm{kNN}$.
Hint 2 - make the plot in a separate cell so you can adjust it without needing to re-run the CV.
In [95]:
# cross validation goes here
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error
knn_reg = KNeighborsRegressor(n_neighbors=7)
y_preds = cross_val_predict(knn_reg, X, y, cv=10)
print('The RMSE = {}'.format(np.sqrt(mean_squared_error(y,y_preds))))
In [106]:
# plotting goes here
fig, ax = plt.subplots()
ax.scatter(y, y_preds, alpha=0.02)
ax.plot([0,6],[0,6], 'Crimson')
ax.set_xlabel('z_SDSS')
ax.set_ylabel('z_kNN')
Out[106]:
If you made a plot of redshift vs. predictions, you likely saw that there are many sources that are far from the 1 to 1 line. These are called catastrophic outliers, and they are a serious problem for science programs that rely on photometric redshifts.
Problem 4f
Write a function that takes input arrays ground_truth and predictions and determines the fraction of the dataset that are catastrophic outliers. For today's purposes we will say a catastrophic outlier is one where the prediction differs from the true redshift by more than 20%.
In [97]:
def catastrophic_fraction(ground_truth, predictions, threshold=0.2):
    '''Calculate the fraction of predictions that are catastrophic outliers

    Parameters
    ----------
    ground_truth : array-like
        Correct labels for the model sources
    predictions : array-like
        Predictions for the model sources
    threshold : float (optional, default=0.2)
        The threshold to determine if a "miss" is catastrophic or not

    Returns
    -------
    oh_nos : float
        Fractional number of catastrophic outliers
    '''
    n_outliers = len(np.where(np.abs(ground_truth - predictions)/ground_truth > threshold)[0])
    oh_nos = n_outliers/len(ground_truth)

    return oh_nos
Problem 4g
How many catastrophic outliers did your previous predictions have?
In [98]:
catastrophic_fraction(y, y_preds)
Out[98]:
Earlier we saw that the performance of a kNN model greatly improves with feature scaling.
Problem 4h
Use a MinMax scaler to scale the features, perform 10-fold cross-validation, and estimate the catastrophic outlier fraction.
In [100]:
scaler = MinMaxScaler()
scaler.fit(X)
Xminmax = scaler.transform(X)
knn_reg = KNeighborsRegressor(n_neighbors=7)
y_preds = cross_val_predict(knn_reg, Xminmax, y, cv=10)
print('The RMSE = {}'.format(np.sqrt(mean_squared_error(y,y_preds))))
print('{} are catastrophic'.format(catastrophic_fraction(y, y_preds)))
In [102]:
# plotting goes here
fig, ax = plt.subplots()
ax.scatter(y, y_preds, alpha=0.02)
ax.plot([0,6],[0,6], 'Crimson')
ax.set_xlabel('z_SDSS')
ax.set_ylabel('z_kNN')
Out[102]:
The MinMax scaler didn't help at all!
Perhaps we need a different kind of wrangling for this data set - feature engineering. (For example, does it make sense to include extinction as a feature, or should we instead combine the extinction with the observed magnitudes to get extinction-corrected brightness measurements?)
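For example, extinction-corrected magnitudes are a one-line addition per filter - a sketch worth testing, not a guaranteed improvement:

for filt in ['u', 'g', 'r', 'i', 'z']:
    sdss['modelMag_{}_corr'.format(filt)] = (sdss['modelMag_{}'.format(filt)]
                                             - sdss['extinction_{}'.format(filt)])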
Finally - we close with a machine learning question, not a data wrangling question. Does it actually make sense to use a kNN model for datasets that have categorical features?
Problem 4i
Build a better machine learning model, possibly with the use of different features, to improve the model classification performance.
In [108]:
from sklearn.ensemble import RandomForestRegressor
for filt in ['u', 'r', 'i', 'z']:
    sdss['psf_g-{}'.format(filt)] = sdss['psfMag_g'] - sdss['psfMag_{}'.format(filt)]
    sdss['model_g-{}'.format(filt)] = sdss['modelMag_g'] - sdss['modelMag_{}'.format(filt)]
X = np.array(sdss[['ps', 'ext',
'w1det', 'w1mag', 'w2det', 'w2mag', 'w3det', 'w3mag', 'w4det', 'w4mag',
'psf_g-u', 'psf_g-r', 'psf_g-i', 'psf_g-z',
'model_g-u', 'model_g-r', 'model_g-i', 'model_g-z']])
rf_reg = RandomForestRegressor(n_estimators=100)
y_preds = cross_val_predict(rf_reg, X, y, cv=10)
print('The RMSE = {}'.format(np.sqrt(mean_squared_error(y,y_preds))))
print('{} are catastrophic'.format(catastrophic_fraction(y, y_preds)))
In [111]:
# plotting goes here
fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(y, y_preds, alpha=0.02)
ax.plot([0,6],[0,6], 'Crimson')
ax.set_xlim(-0.1,4)
ax.set_ylim(-0.1,4)
ax.set_xlabel('z_SDSS')
ax.set_ylabel('z_RF')
Out[111]: