In [ ]:
from __future__ import print_function, absolute_import, division

Re-Introduction to Databases:

Selecting Sources from the Sloan Digital Sky Survey

Version 0.1


By AA Miller 2017 Apr 19

During the first session of the DSFP we learned about the basics of database operation and writing queries/code in SQL. Here, we will review some basics and complete some problems concerning queries on the SDSS database.


In [ ]:
import matplotlib.pyplot as plt
%matplotlib notebook

At the most basic level - databases store your bytes, and later return those bytes (or a subset of them) when queried.

They provide a highly efficient means for filtering your bytes (there are many different strategies that the user can employ).

The backend for most databases is the Structured Query Language or SQL, which is a standard declarative language.

There are many different libraries that implement SQL: MySQL, PostgreSQL, Greenplum, Microsoft SQL server, IBM DB2, Oracle Database, etc.

Problem 1) Basic SQL Operations with SQLite

The most basic implementation is SQLite a self-contained, SQL database engine. We will discuss SQLite further later in the week, but in brief - it is a nice stand alone package that works really well for small problems (such as the example that we are about to encounter).


In [ ]:
import sqlite3

Without diving too much into the weeds (we'll investigate this further later this week), we need to establish a connection to the database. From the connection we create a cursor, which allows us to actually interact with the database.


In [ ]:
conn = sqlite3.connect("reintro.db")
cur = conn.cursor()

And just like that - we have now created a new database reintro.db, with which we can "store bytes" or later "retrieve bytes" once we have added some data to the database.

Aside - note that unlike many SQL libraries, SQLite does not require a server and creates an actual database file on your hard drive. This improves portability, but also creates some downsides as well.

Now we need to create a table and insert some data. We will interact with the database via the execute() method for the cursor object.

Recall that creating a table requires a specification of the table name, the columns in the table, and the data type for each column. Here's an example where I create a table to store info on my pets:

cur.execute("""create table PetInfo(
                                    Name text, 
                                    Species text,
                                    Age tinyint,
                                    FavoriteFood text
                                    )""")

Problem 1a

Create a new table in the database called DSFPstudents with columns Name, Institution, and Year, where Year is the year in graduate school.


In [ ]:
cur.execute( # complete

Once a table is created, we can use the database to store bytes. If I were to populate my PetInfo table I would do the following:

cur.execute("""insert into PetInfo(Name, Species, Age, FavoriteFood) 
                            values ("Rocky", "Dog", 12, "Bo-Nana")""")
cur.execute("""insert into PetInfo(Name, Species, Age, FavoriteFood) 
                            values ("100 Emoji-Flames Emoji", "Red Panda", 2, "bamboo leaves")""")

Note - column names do not need to be explicitly specified, but for clarity this is always preferred.

Problem 1b

Insert data for yourself, and the two people sitting next to you into the database.


In [ ]:
cur.execute( # complete

Now that we have bytes in the database, we can retrieve those bytes with one (or several) queries. There are 3 basic building blocks to a query:

SELECT...
FROM...
WHERE...

Where SELECT specifies the information we want to retrieve from the database, FROM specifies the tables being queried in the database, and WHERE specifies the conditions for the query.

Problem 1c

Select the institutions for all students in the DSFPstudents table who have been in grad school for more than 2 years.

Hint - to display the results of your query run cur.fetchall().


In [ ]:
cur.execute( # complete
cur.fetchall()

As we round out this brief review of databases, recall that the success of databases relies on ACID properties:

  1. Atomicity - all parts of transaction succeed, or rollback state of database
  2. Consistency - data always meets validation rules
  3. Isolation - no interference across transactions (even if concurrent)
  4. Durability - a committed transaction remains committed (even if there's a power outage, etc)

Problem 2) Complex Queries with SDSS

Above we looked at the most basic operations possible with a database (recall - databases are unnecessary, and possibly cumbersome, with small data sets). A typical database consists of many tables, and these tables may be joined together to unlock complex questions for the data.

As a reminder on (some of) this functionality, we are now going to go through some problems using the SDSS database. The full SDSS schema explains all of the tables, columns, views and functions for querying the database. We will keep things relatively simple in that regard.


In [ ]:
from astroquery.sdss import SDSS

As we have seen previously, astroquery enables seemless connections to the SDSS database via the Python shell.

Problem 2a

Select 20 random sources from the PhotoObjAll table and return all columns in the table.

Hint - while this would normally be accomplished by ending the query ... limit 20, SDSS CasJobs uses Microsoft's SQL Server, which adopts select top 20 ... at the beginning of the query to accomplish an identical result.


In [ ]:
SDSS.query_sql( # complete

That's more columns than we will likely ever need. Instead, let's focus on objID, a unique identifier, cModelMag_u, cModelMag_g, cModelMag_r, cModelMag_i, and cModelMag_z, the source magnitude in $u', g', r', i', z'$, respectively.

We will now (re-)introduce the concept of joining two tables.

The most common operation is known as an inner join (which is often referred to as just join). An inner join returns records that have matching sources in both tables in the join.

Less, but nevertheless still powerful, is the outer join. An outer join returns all records in either table, with NULL values for columns in a table in which the record does not exist.

Specialized versions of the outer join include the left join and right join, whereby all records in either the left or right table, respectively, are returned along with their counterparts.

Problem 2b

Select objid and $u'g'r'i'z'$ from PhotoObjAll and the corresponding class from specObjAll for 20 random sources.

There are multiple columns you could use to join the tables, in this case match objid to bestobjid from specObjAll and use an inner join.


In [ ]:
SDSS.query_sql( # complete

Problem 2c

Perform an identical query to the one above, but this time use a left outer join (or left join).

How do your results compare to the previous query?


In [ ]:
SDSS.query_sql( # complete

Problem 2d

This time use a right outer join (or right join).

How do your results compare to the previous query?


In [ ]:
SDSS.query_sql( # complete

Challenge Problem

To close the notebook we will perform a nested query. In brief, the idea is to join the results of one query with a separate query.

Here, we are going to attempt to identify bright AGN that don't have SDSS spectra. To do so we will need the photoObjAll table, the specObjAll table, both of which we've seen before, and the rosat table, which includes all cross matches between SDSS sources and X-ray sources detected by the Rosat satellite.

Create a nested query that selects all Rosat sources that don't have SDSS spectra with cModelFlux_u + cModelFlux_g + cModelFlux_r + cModelFlux_i + cModelFlux_z > 10000 (this flux contraints ensures the source is bright without making any cuts on color) and type = 3, this last constraint means the source is extended in SDSS images.

Hint - you may run into timeout issues in which case you should run the query on CasJobs.


In [ ]:
SDSS.query_sql( # complete