In [ ]:
from __future__ import print_function, division, absolute_import

Data Management Part 1

For April 25, 2017


Yusra AlSayyad (Princeton University)

This exercise demonstrates the power of spatial indexes in databases and in SQL-style operations outside of a database.


In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import Delaunay
# from matplotlib.collections import LineCollection
import os
import copy

%matplotlib inline

Problem 1) Spatial Indexes in SkyServer

In this problem, we will query SkyServer spatially (based on RA and Dec). You may enter the queries into http://skyserver.sdss.org/dr13/en/tools/search/sql.aspx or use astroquery (demonstrated in the database re-introduction yesterday). We are interested in the approximate timing of the queries. Don't take the query times too seriously, since many factors affect the response time of a public database with many users.
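If you'd rather stay in Python, here is a minimal sketch using astroquery (this assumes a recent astroquery whose SDSS.query_sql accepts a data_release argument):

from astroquery.sdss import SDSS

# Submit an arbitrary SQL string to SkyServer; returns an astropy Table
result = SDSS.query_sql("SELECT TOP 10 objid, ra, dec FROM PhotoObjAll",
                        data_release=13)
print(result)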

We're interested in getting all the photometry for objects within 1 arcminute of (18.27, 0.94). SkyServer has angular distance functions, for example:

dbo.fDistanceArcMinEq(ra1, dec1, ra2, dec2)

One might naively filter on all sources for which the distance from the source to (18.27, 0.94) is < 1:

-- BAD! Don't actually do this in real life!
SELECT COUNT(objid)
FROM PhotoObjAll p
WHERE dbo.fDistanceArcMinEq(18.27, 0.94, p.ra, p.dec) < 1;

Feel free to try out this query (you can cancel when you get tired of waiting). It will compute the distance from each row in PhotoObjAll (N = 1231051050) to the search coordinate. What are some ways to utilize the indices to cut down on the query time?

Hint: You can assume that the table has indices on ra, dec, and htmID.

Problem 1a

One way to improve query time is to pre-filter the data on an indexed column. Fill in an additional WHERE-clause predicate to filter the data on ra before computing the distance. How long does this take?

SELECT COUNT(objid)
FROM PhotoObjAll p
WHERE -- COMPLETE THIS LINE
  AND dbo.fDistanceArcMinEq(18.27, 0.94, p.ra, p.dec) < 1
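One possible completion, sketched with astroquery below (the +/- 0.02 degree RA window is an assumption: it is just wide enough to contain a 1 arcminute radius, with some margin):

from astroquery.sdss import SDSS

# Pre-filter on the indexed ra column before computing any distances (sketch)
SDSS.query_sql("""
    SELECT COUNT(objid)
    FROM PhotoObjAll p
    WHERE p.ra BETWEEN 18.27 - 0.02 AND 18.27 + 0.02
      AND dbo.fDistanceArcMinEq(18.27, 0.94, p.ra, p.dec) < 1
""", data_release=13)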

Problem 1b

This probably improved the runtime dramatically. But the database still had to compute the distance for a large number of rows. Write a query to find out how many times it computed the distance in your query in problem 1a. What would be the effect of a prefiltering clause based on both ra and dec?
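One way to see this (a sketch, assuming you prefiltered on ra with the same window as in 1a): every row that survives the prefilter has its distance evaluated, so count those rows.

from astroquery.sdss import SDSS

# Count the rows that pass the ra prefilter alone; the distance function
# in 1a was evaluated once per such row (sketch).
SDSS.query_sql("""
    SELECT COUNT(objid)
    FROM PhotoObjAll p
    WHERE p.ra BETWEEN 18.27 - 0.02 AND 18.27 + 0.02
""", data_release=13)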

Problem 1c

We can do better by using the HTMID. SkyServer defines table-valued functions for spatial queries, for example fGetNearbyObjEq and fGetObjFromRectEq.

These functions return a list of the objIds within the search area you define. Like the demo from yesterday, behind the scenes they compute HTMID ranges and join them against the PhotoObjAll table.

Inspect the results of:

SELECT * FROM fGetNearbyObjEq(18.27, 0.94, 1)

Complete this query to get the photometry from PhotoObjAll utilizing fGetNearbyObjEq(18.27, 0.94, 1):

SELECT p.objid, p.run, p.rerun, p.camcol, p.field, p.obj, p.type,
       p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
       p.Err_u, p.Err_g, p.Err_r, p.Err_i, p.Err_z
FROM PhotoObjAll p
INNER JOIN fGetNearbyObjEq(18.27, 0.94, 1) n ON -- COMPLETE
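A possible completion (a sketch; it assumes the join key is the objID column returned by fGetNearbyObjEq, which you can confirm by inspecting the SELECT * output above). The column list is abbreviated here:

from astroquery.sdss import SDSS

SDSS.query_sql("""
    SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z
    FROM PhotoObjAll p
    INNER JOIN fGetNearbyObjEq(18.27, 0.94, 1) n ON p.objid = n.objid
""", data_release=13)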

Problem 1d - revisiting the challenge problem from yesterday

The challenge problem yesterday relied on the table ROSAT, which is the result of a spatial cross-match between PhotoPrimary and the positions of ROSAT sources. Perform your own spatial cross-match between the RA/Dec in the ROSAT table and the PhotoPrimary view.

How many rows in the ROSAT table have a source in the PhotoPrimary table within 1 arcmin?

HINT: Use the SkyServer function dbo.fGetNearestObjIdEq(). The answer is 47070 for DR13, and the query should take < 10 seconds.

Finally, augment this query to return the TOP 10 of these matches with their ROSAT.SOURCENAME and the ugriz photometry from PhotoPrimary. Write your query below:


In [ ]:

Problem 2) SQL operations with Pandas

Pandas supports SQL-like operations on relations, or tables, called DataFrames. If your data can fit in memory, or can be broken into chunks that fit in memory, this is a great way to manipulate your tables in Python. :)
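As a warm-up, here is a toy DataFrame (hypothetical data) showing how the SQL operations we'll use map onto pandas:

import pandas as pd

toy = pd.DataFrame({'mag': [17.2, 19.5, 20.1, 18.3],
                    'primary': [1, 0, 1, 1]})
toy[['mag']]                   # SELECT mag FROM toy
toy[toy.primary == 1]          # SELECT * FROM toy WHERE primary = 1
toy.groupby('primary').size()  # SELECT primary, COUNT(*) FROM toy GROUP BY primary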

For this problem, you may want to make use of the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/ for syntax

We're going to go through some basic operations on DataFrames: Select, Join, and Group By. First let's load a dataset of i-band photometry performed on a patch of a coadd from Stripe 82 re-processed with the LSST DM stack.

Setup

Download the datafile and load it into a DataFrame

$ curl -O https://lsst-web.ncsa.illinois.edu/~yusra/survey_area/DeepSourceAll_i_lt300_narrow_32610.csv.gz

In [ ]:
# Please forgive my LSSTDM/C++ style variable names

# CHANGE IF YOU NEED:
DATA_DIR = '.'  # Path to datafile wherever you put the datafile

In [ ]:
def getSourceDataFrame():
    TYPE = {'deepSourceId': 'int64',
            'parentDeepSourceId': str,
            'deepCoaddId': int,
            'ra': float,
            'decl': float,
            'psfMag': float,
            'psfMagSigma': float,
            'tract': int,
            'patch': str,
            'detect_is_primary': int
            }
    df = pd.read_csv(os.path.join(DATA_DIR, 'DeepSourceAll_i_lt300_narrow_32610.csv.gz'),
                     index_col=0, dtype=TYPE, compression='gzip')
    df['deepSourceId'] = df['deepSourceId'].astype('int64')
    # Replace the NULLs with -1; pandas can't represent NULLs in integer columns
    # (only NaNs in float columns or Nones in object columns)
    df.loc[df['parentDeepSourceId'].isnull(), 'parentDeepSourceId'] = -1
    df['parentDeepSourceId'] = df['parentDeepSourceId'].astype(float).astype(int)
    df.index = df['deepSourceId']
    return df

df = getSourceDataFrame()
df.head()

Problem 2a Projection and Selection

Recall that in SQL, we can SELECT particular columns and particular rows.

In pandas there are different interfaces for this: http://pandas.pydata.org/pandas-docs/stable/indexing.html#

This includes some syntax that's intuitive if you're used to numpy. For example, you can select columns like:

df['psfMag']
df[['psfMag', 'psfMagSigma']]
df.psfMag

and rows like:

df[df.psfMag < 18]

Make a scatter plot of ra and decl with two separate colors for detect_is_primary 0 or 1.

Note: The photometry pipeline deblended all multi-peak detections (parents) into their component peaks (children). If a source was deblended, it was marked detect_is_primary = 0. The children's flux measurements were entered into the table with detect_is_primary = 1, and the deepSourceId of their parent was put in the parentDeepSourceId column. All deepSourceIds in the parentDeepSourceId column also appear in the deepSourceId column (but with detect_is_primary = 0).


In [ ]:
# e.g.
plt.scatter( #
plt.scatter( #
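One possible sketch (the marker size and colors are arbitrary choices):

primary = df[df.detect_is_primary == 1]
notPrimary = df[df.detect_is_primary == 0]
plt.scatter(notPrimary.ra, notPrimary.decl, s=2, c='red', label='not primary')
plt.scatter(primary.ra, primary.decl, s=2, c='blue', label='primary')
plt.xlabel('ra')
plt.ylabel('decl')
plt.legend()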

Problem 2b Selection and aggregation

How many "primary" sources (that is, rows with detect_is_primary = 1) have a psfMag between 19 and 20? Can you do this in one line?


In [ ]:
df[ #COMPLETE
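One possible one-liner (a sketch; it treats the bounds as exclusive and uses len() to count the filtered rows):

len(df[(df.detect_is_primary == 1) & (df.psfMag > 19) & (df.psfMag < 20)])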

Problem 2c Group By

Pandas also provides grouping: http://pandas.pydata.org/pandas-docs/stable/groupby.html

Count the number of primary vs. non-primary sources using groupby.


In [ ]:
grouped = df.groupby(#COMPLETE
grouped.#COMPLETE
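One possible sketch:

grouped = df.groupby('detect_is_primary')
grouped.deepSourceId.count()  # grouped.size() works too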

Now use groupby again to make a histogram of sources in integer bins of psfMag for only the primary sources. Plot the log number counts vs. psfMag you get. What's the approximate depth of this dataset in mags?


In [ ]:
grouped = df[# COMPLETE
counts = grouped.deepSourceId.# COMPLETE
plt.plot(counts.index, # COMPLETE

plt.xlabel("PsfMag")
plt.ylabel("log10 Counts")
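One possible sketch (binning on np.floor(psfMag) is one choice of integer bins):

primary = df[df.detect_is_primary == 1]
grouped = primary.groupby(np.floor(primary.psfMag))
counts = grouped.deepSourceId.count()
plt.plot(counts.index, np.log10(counts.values), 'o-')
plt.xlabel("PsfMag")
plt.ylabel("log10 Counts")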

Problem 2d Group By with UDF

One particularly powerful feature of pandas is the ease with which you can apply any function to a group (an aggregate user-defined function, or UDF). Remember, all your data is in memory, so there's no need to limit yourself to commutative operations like count, mean, sum, first, and last.

Write a function to be applied to the psfMag bins. Using your same groups from above, use your new function as the aggregate.

Your function doesn't have to make sense. One idea would be the interquartile range.


In [ ]:
def new_function(arr):
    # where arr is array-like
    # your code here
    return value # as single value

grouped. # Apply your new aggregate UDF new_function to your groups.
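For example, an interquartile-range aggregate might look like this (a sketch, reusing the integer psfMag groups from above):

def iqr(arr):
    # interquartile range of an array-like: 75th minus 25th percentile
    return np.percentile(arr, 75) - np.percentile(arr, 25)

grouped.psfMag.agg(iqr)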

Problem 2e Join

Backstory: Analysis has revealed that the photometry is bad for any children of parent sources that are brighter than 18th magnitude: too much scattered light from the background. We want to remove all sources that have a parent brighter than 18th mag.

Fortunately, pandas provides joins. Pandas DataFrames have two methods, merge and join, which both implement joins. There is also a top-level function pd.merge(). Use any of these to join the DataFrame df to itself, in order to add the psfMag of each source's parent to its row. Call the new column "parent_psfMag". The result should be a new DataFrame dfJoined.

What type of join? We want the resulting table to have the same number of rows as df, regardless of whether a source has a parent or not.

HINT: In pandas, deepSourceId isn't a column but an "index." As a result, the arguments to merge or join are not as symmetric as in SQL. If you use pd.merge, use right_index=True.


In [ ]:
dfJoined = # complete
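One possible sketch using pd.merge (a LEFT join, renaming the column first so it comes out as parent_psfMag):

# Rows with no parent (parentDeepSourceId == -1) match nothing and get NaN
parentMags = df[['psfMag']].rename(columns={'psfMag': 'parent_psfMag'})
dfJoined = pd.merge(df, parentMags, how='left',
                    left_on='parentDeepSourceId', right_index=True)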

Now select just the rows where detect_is_primary=1 and parent_psfMag is either null or fainter than 18.


In [ ]:
dfGood = 
dfGood.count() #should return 4088
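One possible selection (a sketch):

dfGood = dfJoined[(dfJoined.detect_is_primary == 1) &
                  (dfJoined.parent_psfMag.isnull() | (dfJoined.parent_psfMag > 18))]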

Problem 2f (BONUS) If you're done with the challenge problem, I encourage you to download TOPCAT and reproduce your solutions to Problem 2 with it: http://www.star.bris.ac.uk/~mbt/topcat/

Challenge Problem: Graphs

The goal in this problem is to convert the histogram you made above into a density in units of deg$^{-2}$ mag$^{-1}$. But to do this we need to compute the survey area. The scatter plot of the positions shows some pretty significant gaps in the area covered by the survey, because of large galaxies in the field of view. We want to compute this area, not including the gaps.

First let's make a Delaunay triangulation of all the good points, and delete all the triangles that are bigger than a certain cutoff:


In [ ]:
# RUN THIS. No need to change anything. 

def isTriangleLargerThan(triangle, cutoff=0.006):
    # NB: despite the name, this returns True for *small* triangles,
    # i.e. those whose circumradius is less than the cutoff.
    pa, pb, pc = triangle
    a = np.sqrt((pa[0] - pb[0])**2 + (pa[1] - pb[1])**2)
    b = np.sqrt((pb[0] - pc[0])**2 + (pb[1] - pc[1])**2)
    c = np.sqrt((pc[0] - pa[0])**2 + (pc[1] - pa[1])**2)
    s = (a + b + c) / 2.0                             # semi-perimeter
    area = np.sqrt(s * (s - a) * (s - b) * (s - c))   # Heron's formula
    circum_r = a * b * c / (4.0 * area)               # circumradius
    return circum_r < cutoff

allpoints = dfGood[['ra', 'decl']].values
# coords = copy.copy(allpoints)
tri = Delaunay(allpoints)
triangles = allpoints[tri.simplices]


idx = [isTriangleLargerThan(triangle) for triangle in triangles]

# Take note of these options for our discussion tomorrow on MapReduce
# idx = map(isTriangleLargerThan, triangles)
# onlySmallTriangles = filter(isTriangleLargerThan, triangles)

onlySmallTriangles = tri.simplices[idx]

We now have a collection of triangles. Each triangle is made up of 3 vertices, but it is also made up of 3 edges. Let's convert this list of triangles into a bag of edges. There will be duplicates.


In [ ]:
edges = np.vstack((onlySmallTriangles[:,0:2], onlySmallTriangles[:,1:3], onlySmallTriangles[:,[0,2]]))
edges

Now we want to find the boundary of the whole structure. How do we know whether an edge is an outer edge or an inner edge? Outer edges appear only once in the bag; MOST of the edges were part of 2 triangles and will appear twice.

We don't care about direction, so edge 4 -> 8 is the same as 8 -> 4.

Find all the edges that appear only ONCE in this bag of edges. There are many ways to do this. In the spirit of SQL, this sounds like a ... GROUP BY ... HAVING COUNT(*) = 1. Pandas will do that for you.


In [ ]:
dfEdges = pd.DataFrame(edges)

# COMPLETE

singleEdges = #
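One possible sketch: sorting each vertex pair makes 4 -> 8 and 8 -> 4 identical, and size() plays the role of COUNT(*):

sortedEdges = np.sort(edges, axis=1)  # direction no longer matters
dfEdges = pd.DataFrame(sortedEdges, columns=['v0', 'v1'])
counts = dfEdges.groupby(['v0', 'v1']).size()
# ... HAVING COUNT(*) = 1: keep the vertex-index pairs that appear exactly once
singleEdges = np.array(list(counts[counts == 1].index))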

In [ ]:
# Shapely can turn a bag of edges into a polygon. 
# If you haven't installed shapely, do that now.

from shapely.geometry import MultiLineString
from shapely.ops import polygonize

lines = MultiLineString([((line[0][0], line[0][1]),
                          (line[1][0], line[1][1])) for line in allpoints[singleEdges]])
polygonList = list(polygonize(lines))
print("The polygon's area is: %s" % (polygonList[0].area))

NOTE: This procedure measured Euclidean area rather than solid angle. Why is that OK for this dataset?


In [ ]: