In [ ]:
from __future__ import print_function, division, absolute_import
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import Delaunay
# from matplotlib.collections import LineCollection
import os
import copy
%matplotlib inline
In this problem, we will query SkyServer spatially (based on RA and Dec). You may enter the queries into http://skyserver.sdss.org/dr13/en/tools/search/sql.aspx or use astroquery (demonstrated in the database re-introduction yesterday). We are interested in the approximate timing of the queries. Don't take the query times too seriously, since there are many factors affecting response time on a public database with many users.
We're interested in getting all the photometry for objects within 1 arcminute of (18.27, 0.94). SkyServer has angular distance functions, for example:
dbo.fDistanceArcMinEq(ra1, dec1, ra2, dec2)
One might naively filter for all sources whose distance to (18.27, 0.94) is less than 1 arcminute:
-- BAD! Don't actually do this in real life! --
SELECT COUNT(objid)
FROM PhotoObjAll p
WHERE dbo.fDistanceArcMinEq(18.27, 0.94, p.ra, p.dec) < 1;
Feel free to try out this query (you can cancel when you get tired of waiting). It will compute the distance of every row in PhotoObjAll (N = 1,231,051,050) to the search coordinate. What are some ways to utilize the indices to cut down on the query time?
Hint: You can assume that the table has indices on ra, dec, and htmID.
Problem 1a
One way to improve query time is to pre-filter the data on an indexed column. Fill in an additional WHERE clause to filter the data on RA before computing the distance. How long does this take?
SELECT COUNT(objid)
FROM PhotoObjAll p
WHERE -- COMPLETE THIS LINE
AND dbo.fDistanceArcMinEq(18.27, 0.94, p.ra, p.dec) < 1
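For reference, one possible prefilter looks like the sketch below. The search radius is 1 arcminute, i.e. 1/60 ≈ 0.017 degrees, so a generous window of ±0.02 degrees in ra is enough (the exact margin is your choice):
-- sketch only: prefilter on the indexed ra column before computing distances
SELECT COUNT(objid)
FROM PhotoObjAll p
WHERE p.ra BETWEEN 18.27 - 0.02 AND 18.27 + 0.02
AND dbo.fDistanceArcMinEq(18.27, 0.94, p.ra, p.dec) < 1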
Problem 1b
This probably improved the runtime dramatically. But the database still had to compute the distance for a large number of rows. Write a query to find out how many times it computed the distance in your query in Problem 1a. What would be the effect of a prefiltering clause based on both ra and dec?
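For example (a sketch, assuming you prefiltered with a simple BETWEEN on ra as above), the number of distance evaluations is just the number of rows that survive the prefilter:
-- sketch: how many rows passed the ra prefilter (and so had their distance computed)?
SELECT COUNT(objid)
FROM PhotoObjAll p
WHERE p.ra BETWEEN 18.27 - 0.02 AND 18.27 + 0.02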
Problem 1c
We can do better by using the HTMID. SkyServer defines Table Valued Functions for spatial queries, fGetNearbyObjEq and fGetObjFromRectEq.
These functions return a list of the ObjIDs within the search area you define. As in the demo, behind the scenes they join on HTMID ranges and perform the join against the PhotoObjAll table.
Inspect the results of:
SELECT * FROM fGetNearbyObjEq(18.27, 0.94, 1)
Complete this query to get the photometry from PhotoObjAll, utilizing fGetNearbyObjEq(18.27, 0.94, 1):
SELECT p.objid,
p.run, p.rerun, p.camcol, p.field, p.obj,
p.type, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
p.Err_u, p.Err_g, p.Err_r, p.Err_i, p.Err_z
FROM PhotoObjAll p
INNER JOIN fGetNearbyObjEq(18.27, 0.94, 1) n
ON -- COMPLETE
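For reference, one completed version is sketched below; the join condition simply matches object IDs (inspect the output of fGetNearbyObjEq above to confirm the name of its ID column):
-- sketch only: join PhotoObjAll to the function's output on the object ID
SELECT p.objid,
       p.run, p.rerun, p.camcol, p.field, p.obj,
       p.type, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
       p.Err_u, p.Err_g, p.Err_r, p.Err_i, p.Err_z
FROM PhotoObjAll p
INNER JOIN fGetNearbyObjEq(18.27, 0.94, 1) n
    ON p.objid = n.objID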
Problem 1d - revisiting the challenge problem from yesterday
The challenge problem yesterday relied on the table ROSAT, which is the result of a spatial cross-match between PhotoPrimary and the positions of ROSAT sources. Perform your own spatial cross-match between the RA/DEC in the ROSAT table and the PhotoPrimary view. How many rows in the ROSAT table have a source in the PhotoPrimary table within 1 arcmin?
HINT: Use the SkyServer function dbo.fGetNearestObjIdEq(). The answer is 47070 for DR13, and the query should take < 10 seconds.
Finally, augment this query to return the TOP 10 of these matches with their ROSAT.SOURCENAME and the ugriz photometry from PhotoPrimary. Write your query below:
In [ ]:
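For reference, here is one possible shape for these queries (a sketch only; it assumes the ROSAT table exposes ra, dec, and SOURCENAME columns, as the problem statement suggests):
-- sketch: count ROSAT rows with a PhotoPrimary source within 1 arcmin
SELECT COUNT(*)
FROM ROSAT x
WHERE dbo.fGetNearestObjIdEq(x.ra, x.dec, 1) IS NOT NULL

-- sketch: TOP 10 matches with SOURCENAME and the ugriz photometry
SELECT TOP 10 x.SOURCENAME, p.u, p.g, p.r, p.i, p.z
FROM ROSAT x
JOIN PhotoPrimary p
    ON p.objid = dbo.fGetNearestObjIdEq(x.ra, x.dec, 1)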
Pandas supports SQL-like operations on relations, or tables, called DataFrames. If your data can fit in memory, or can be broken into chunks that fit in memory, this is a great way to manipulate your tables in Python. :)
For this problem, you may want to make use of the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/ for syntax
We're going to go through some basic operations on DataFrames: Select, Join, and Group By. First let's load a dataset of i-band photometry performed on a patch of a coadd from Stripe 82, re-processed with the LSST DM stack.
Setup
Download the datafile and load it into a DataFrame
$ curl -O https://lsst-web.ncsa.illinois.edu/~yusra/survey_area/DeepSourceAll_i_lt300_narrow_32610.csv.gz
In [ ]:
# Please forgive my LSSTDM/C++ style variable names
# CHANGE IF YOU NEED:
DATA_DIR = '.' # Path to datafile wherever you put the datafile
In [ ]:
def getSourceDataFrame():
    TYPE = {'deepSourceId': 'int64',
            'parentDeepSourceId': str,
            'deepCoaddId': int,
            'ra': float,
            'decl': float,
            'psfMag': float,
            'psfMagSigma': float,
            'tract': int,
            'patch': str,
            'detect_is_primary': int
            }
    df = pd.read_csv(os.path.join(DATA_DIR, 'DeepSourceAll_i_lt300_narrow_32610.csv.gz'),
                     index_col=0, dtype=TYPE, compression='gzip')
    # Replace the NULLs with -1. pandas can't represent NULLs in integer columns
    # (only NaNs in float columns or Nones in object columns).
    df['deepSourceId'] = df['deepSourceId'].astype('int64')
    df.loc[df['parentDeepSourceId'].isnull(), 'parentDeepSourceId'] = -1
    df['parentDeepSourceId'] = df['parentDeepSourceId'].astype(float).astype(int)
    df.index = df['deepSourceId']
    return df
df = getSourceDataFrame()
df.head()
Problem 2a Projection and Selection
Recall that in SQL, we can SELECT particular columns and particular rows.
In pandas there are different interfaces for this: http://pandas.pydata.org/pandas-docs/stable/indexing.html#
This includes some syntax that's intuitive for numpy users. For example, you can select columns like:
df['psfMag']
df[['psfMag', 'psfMagSigma']] # or even
df.psfMag
and rows like:
df[df.psfMag < 18]
Make a scatter plot of ra and decl, with two separate colors for detect_is_primary 0 or 1.
Note: The photometry pipeline deblended all multi-peak detections (parents) into their component peaks (children). If a source was deblended, it was marked detect_is_primary = 0. The children's flux measurements were entered into the table with detect_is_primary marked 1, and the deepSourceId of their parent put in the parentDeepSourceId column. All deepSourceIds in the parentDeepSourceId column are also in the deepSourceId column (but with detect_is_primary = 0).
In [ ]:
# e.g.
plt.scatter( #
plt.scatter( #
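If you get stuck, here is a minimal sketch of one approach, using boolean masks and two calls to plt.scatter (the colors and marker size are arbitrary choices):
# sketch: overplot non-primary and primary sources in different colors
primary = df[df.detect_is_primary == 1]
nonprimary = df[df.detect_is_primary == 0]
plt.scatter(nonprimary.ra, nonprimary.decl, s=2, color='gray', label='detect_is_primary = 0')
plt.scatter(primary.ra, primary.decl, s=2, color='blue', label='detect_is_primary = 1')
plt.xlabel('ra')
plt.ylabel('decl')
plt.legend()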
Problem 2b Selection and aggregation
How many "primary" sources (that is, rows with detect_is_primary = 1) have a psfMag between 19 and 20?
Can you do this in one line?
In [ ]:
df[ #COMPLETE
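One possible one-liner (a sketch combining boolean masks):
# sketch: primary sources with psfMag between 19 and 20
len(df[(df.detect_is_primary == 1) & (df.psfMag >= 19) & (df.psfMag <= 20)])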
Problem 2c Group By
Pandas also provides grouping: http://pandas.pydata.org/pandas-docs/stable/groupby.html
Count the number of primary vs. non-primary sources using groupby.
In [ ]:
grouped = df.groupby(#COMPLETE
grouped.#COMPLETE
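A minimal sketch of one way to do it:
# sketch: count rows per value of detect_is_primary
grouped = df.groupby('detect_is_primary')
grouped.deepSourceId.count()   # grouped.size() works too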
Now use groupby again to make a histogram of sources in integer bins of psfMag, for only the primary sources. Plot the log number counts vs. psfMag. What's the approximate depth of this dataset in mags?
In [ ]:
grouped = df[# COMPLETE
counts = grouped.deepSourceId.# COMPLETE
plt.plot(counts.index, # COMPLETE
plt.xlabel("PsfMag")
plt.ylabel("log10 Counts")
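One possible completion is sketched below; it groups the primary sources by the integer part of psfMag:
# sketch: histogram of primary sources in integer psfMag bins
primary = df[df.detect_is_primary == 1]
grouped = primary.groupby(np.floor(primary.psfMag))
counts = grouped.deepSourceId.count()
plt.plot(counts.index, np.log10(counts.values), 'o-')
plt.xlabel("PsfMag")
plt.ylabel("log10 Counts")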
Problem 2d Group By with UDF
One particularly powerful feature of pandas is the ease with which you can apply any function to a group (an aggregate user-defined function). Remember, all your data is in memory, so there's no need to limit yourself to the usual aggregations like count, mean, sum, first, and last.
Write a function to be applied to the psfMag bins. Using your same groups from above, use your new function as the aggregate. Your function doesn't have to be scientifically meaningful; one idea would be the interquartile range.
In [ ]:
def new_function(arr):
    # where arr is array-like
    # your code here
    return value  # a single value

grouped.  # Apply your new aggregate UDF new_function to your groups.
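For example, here is a sketch of an interquartile-range aggregate (iqr is just an illustrative name, and it assumes grouped is the integer-psfMag grouping from above):
# sketch: an aggregate UDF computing the interquartile range of psfMagSigma per bin
def iqr(arr):
    # arr is array-like; return a single value
    return np.percentile(arr, 75) - np.percentile(arr, 25)

grouped.psfMagSigma.agg(iqr)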
Problem 2e Join
Backstory: analysis has revealed that the photometry is bad for any children of parent sources that are brighter than 18th magnitude. Too much scattered light from the background. We want to remove all sources that have a parent brighter than 18th mag.
Fortunately, pandas provides joins. Pandas DataFrames have two methods, merge and join, which both implement joins. There is also a static pd.merge(). Use any of these to join the DataFrame df to itself, in order to add the psfMag of the parent to each row. Call it "parent_psfMag". The result should be a new DataFrame dfJoined.
What type of join? We want the resulting table to have the same number of rows as df, regardless of whether a source has a parent or not.
HINT: In pandas, deepSourceId isn't a column, but an "index." As a result, the arguments to merge or join are not as symmetric as writing SQL. If you use pd.merge, use right_index=True.
In [ ]:
dfJoined = # complete
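One possible approach (a sketch; a left join keeps every row of df, and right_index=True matches parentDeepSourceId against the parent's deepSourceId index):
# sketch: left-join df to itself, picking up the parent's psfMag as parent_psfMag
dfJoined = pd.merge(df,
                    df[['psfMag']].rename(columns={'psfMag': 'parent_psfMag'}),
                    how='left', left_on='parentDeepSourceId', right_index=True)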
Now select just the rows where detect_is_primary = 1 and parent_psfMag is either null or fainter than 18.
In [ ]:
dfGood =
dfGood.count() #should return 4088
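A sketch of one possible selection (remember that fainter means a larger magnitude):
# sketch: primary sources whose parent is absent or fainter than 18th mag
dfGood = dfJoined[(dfJoined.detect_is_primary == 1) &
                  (dfJoined.parent_psfMag.isnull() | (dfJoined.parent_psfMag > 18))]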
Problem 2f (BONUS)
If you're done with the challenge problem, I encourage you to download Topcat and reproduce your solutions to Problem 2 with it: http://www.star.bris.ac.uk/~mbt/topcat/
The goal in this problem is to convert the histogram you made above into a density in units of deg$^{-2}$ mag$^{-1}$. But to do this we need to compute the survey area. The scatter plot of the positions shows some pretty significant gaps in the area covered by the survey, because of large galaxies in the field of view. We want to compute this area, not including the gaps.
First let's make a Delaunay triangulation of all the good points, and delete all the triangles that are bigger than a certain cutoff:
In [ ]:
# RUN THIS. No need to change anything.
def isTriangleLargerThan(triangle, cutoff=0.006):
    # Returns True if the triangle's circumradius is SMALLER than the cutoff,
    # i.e. True for the triangles we want to keep.
    pa, pb, pc = triangle
    a = np.sqrt((pa[0] - pb[0])**2 + (pa[1] - pb[1])**2)
    b = np.sqrt((pb[0] - pc[0])**2 + (pb[1] - pc[1])**2)
    c = np.sqrt((pc[0] - pa[0])**2 + (pc[1] - pa[1])**2)
    s = (a + b + c) / 2.0
    area = np.sqrt(s * (s - a) * (s - b) * (s - c))
    circum_r = a * b * c / (4.0 * area)
    return circum_r < cutoff

allpoints = dfGood[['ra', 'decl']].values
# coords = copy.copy(allpoints)
tri = Delaunay(allpoints)
triangles = allpoints[tri.simplices]
idx = [isTriangleLargerThan(triangle) for triangle in triangles]
# Take note of these options for our discussion tomorrow on MapReduce
# idx = map(isTriangleLargerThan, triangles)
# onlySmallTriangles = filter(isTriangleLargerThan, triangles)
onlySmallTriangles = tri.simplices[idx]
We now have a collection of triangles. Each triangle is made up of 3 vertices, but it is also made up of 3 edges. Lets convert this list of triangles into a bag of edges. There will be duplicates.
In [ ]:
edges = np.vstack((onlySmallTriangles[:,0:2], onlySmallTriangles[:,1:3], onlySmallTriangles[:,[0,2]]))
edges
Now we want to find the boundary of the whole structure. How do we know if an edge is an outer edge or an inner edge? Outer edges will only appear once in the bag; MOST of the edges were part of 2 triangles and will appear twice.
We don't care about direction, so edge 4 -> 8 is the same as 8 -> 4.
Find all the edges that appear only ONCE in this bag of edges. There are many ways to do this. In the spirit of SQL, this sounds like a ... GROUP BY HAVING COUNT(*) = 1. Pandas will do that for you.
In [ ]:
dfEdges = pd.DataFrame(edges)
# COMPLETE
singleEdges = #
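A minimal sketch of one way to do it in pandas: sort each edge's vertex pair so that direction doesn't matter, then group by the pair and keep those with a count of 1:
# sketch: undirected edges -> group by sorted vertex pair, keep pairs seen exactly once
dfEdges = pd.DataFrame(np.sort(edges, axis=1), columns=['v0', 'v1'])
counts = dfEdges.groupby(['v0', 'v1']).size().reset_index(name='n')
singleEdges = counts.loc[counts.n == 1, ['v0', 'v1']].values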
In [ ]:
# Shapely can turn a bag of edges into a polygon.
# If you haven't installed shapely, do that now.
from shapely.geometry import MultiLineString
from shapely.ops import polygonize
lines = MultiLineString([((line[0][0], line[0][1]),
                          (line[1][0], line[1][1])) for line in allpoints[singleEdges]])
polygonList = list(polygonize(lines))
print("The polygon's area is: %s" % (polygonList[0].area))
NOTE: This procedure measured Euclidean area rather than solid angle. Why is that OK here for this dataset?
In [ ]: