In [1]:
from __future__ import print_function, absolute_import, division
During the first session of the DSFP we learned about the basics of database operation and writing queries/code in SQL. Here, we will review some basics and complete some problems concerning queries on the SDSS database.
In [2]:
import matplotlib.pyplot as plt
%matplotlib notebook
At the most basic level - databases store your bytes, and later return those bytes (or a subset of them) when queried.
They provide a highly efficient means for filtering your bytes (there are many different strategies that the user can employ).
The backend for most databases is the Structured Query Language or SQL, which is a standard declarative language.
There are many different libraries that implement SQL: MySQL, PostgreSQL, Greenplum, Microsoft SQL server, IBM DB2, Oracle Database, etc.
The most basic implementation is SQLite
a self-contained, SQL database engine. We will discuss SQLite
further later in the week, but in brief - it is a nice stand alone package that works really well for small problems (such as the example that we are about to encounter).
In [3]:
import sqlite3
Without diving too much into the weeds (we'll investigate this further later this week), we need to establish a connection
to the database. From the connection
we create a cursor
, which allows us to actually interact with the database.
In [4]:
conn = sqlite3.connect("reintro.db")
cur = conn.cursor()
And just like that - we have now created a new database reintro.db
, with which we can "store bytes" or later "retrieve bytes" once we have added some data to the database.
Aside - note that unlike many SQL libraries, SQLite
does not require a server and creates an actual database file on your hard drive. This improves portability, but also creates some downsides as well.
Now we need to create a table and insert some data. We will interact with the database via the execute()
method for the cursor
object.
Recall that creating a table requires a specification of the table name, the columns in the table, and the data type for each column. Here's an example where I create a table to store info on my pets:
cur.execute("""create table PetInfo(
Name text,
Species text,
Age tinyint,
FavoriteFood text
)""")
Problem 1a
Create a new table in the database called DSFPstudents
with columns Name
, Institution
, and Year
, where Year
is the year in graduate school.
In [ ]:
cur.execute( # complete
In [10]:
cur.execute("""create table DSFPstudents(
Name text,
Institution text,
Year tinyint
)""")
Out[10]:
Once a table is created, we can use the database to store bytes. If I were to populate my PetInfo
table I would do the following:
cur.execute("""insert into PetInfo(Name, Species, Age, FavoriteFood)
values ("Rocky", "Dog", 12, "Bo-Nana")""")
cur.execute("""insert into PetInfo(Name, Species, Age, FavoriteFood)
values ("100 Emoji-Flames Emoji", "Red Panda", 2, "bamboo leaves")""")
Note - column names do not need to be explicitly specified, but for clarity this is always preferred.
Problem 1b
Insert data for yourself, and the two people sitting next to you into the database.
In [ ]:
cur.execute( # complete
In [12]:
cur.execute("""insert into DSFPstudents(Name, Institution, Year)
values ("Adam Miller", "Northwestern", 10)""")
cur.execute("""insert into DSFPstudents(Name, Institution, Year)
values ("Lucianne Walkowicz", "Adler", 13)""")
Out[12]:
Now that we have bytes in the database, we can retrieve those bytes with one (or several) queries. There are 3 basic building blocks to a query:
SELECT...
FROM...
WHERE...
Where SELECT
specifies the information we want to retrieve from the database, FROM
specifies the tables being queried in the database, and WHERE
specifies the conditions for the query.
Problem 1c
Select the institutions for all students in the DSFPstudents
table who have been in grad school for more than 2 years.
Hint - to display the results of your query run cur.fetchall()
.
In [ ]:
cur.execute( # complete
cur.fetchall()
In [13]:
cur.execute("""select Institution from DSFPstudents where year > 2""")
cur.fetchall()
Out[13]:
As we round out this brief review of databases, recall that the quality of databases relies on ACID properties:
As we round out this brief review of databases, recall that the success of databases relies on ACID properties:
As we round out this brief review of databases, recall that the success of databases relies on ACID properties:
As we round out this brief review of databases, recall that the success of databases relies on ACID properties:
As we round out this brief review of databases, recall that the success of databases relies on ACID properties:
Above we looked at the most basic operations possible with a database (recall - databases are unnecessary, and possibly cumbersome, with small data sets). A typical database consists of many tables, and these tables may be joined together to unlock complex questions for the data.
As a reminder on (some of) this functionality, we are now going to go through some problems using the SDSS database. The full SDSS schema explains all of the tables, columns, views and functions for querying the database. We will keep things relatively simple in that regard.
In [16]:
from astroquery.sdss import SDSS
As we have seen previously, astroquery
enables seemless connections to the SDSS database via the Python shell.
Problem 2a
Select 20 random sources from the PhotoObjAll
table and return all columns in the table.
Hint - while this would normally be accomplished by starting the query select limit 20 ...
, SDSS CasJobs uses Microsoft's SQL Server, which adopts select top 20 ...
to accomplish an identical result.
In [ ]:
SDSS.query_sql( # complete
In [19]:
SDSS.query_sql("""select top 20 * from PhotoObjAll""")
Out[19]:
That's more columns than we will likely ever need. Instead, let's focus on objID
, a unique identifier, cModelMag_u
, cModelMag_g
, cModelMag_r
, cModelMag_i
, and cModelMag_z
, the source magnitude in $u', g', r', i', z'$, respectively.
We will now (re-)introduce the concept of joining two tables.
The most common operation is known as an inner join
(which is often referred to as just join
). An inner join
returns records that have matching sources in both tables in the join.
Less, but nevertheless still powerful, is the outer join
. An outer join returns all records in either table, with NULL
values for columns in a table in which the record does not exist.
Specialized versions of the outer join
include the left join
and right join
, whereby all records in either the left or right table, respectively, are returned along with their counterparts.
Problem 2b
Select objid
and $u'g'r'i'z'$ from PhotoObjAll
and the corresponding class
from specObjAll
for 20 random sources.
There are multiple columns you could use to join the tables, in this case match objid
to bestobjid
from specObjAll
and use an inner join
.
In [ ]:
SDSS.query_sql( # complete
In [20]:
SDSS.query_sql("""select top 20 objid, cModelMag_u, cModelMag_g, cModelMag_r, cModelMag_i, cModelMag_z,
class
from photoobjall p
inner join specobjall s on p.objid = s.bestobjid""")
Out[20]:
Problem 2c
Perform an identical query to the one above, but this time use a left outer join
(or left join
).
How do your results compare to the previous query?
In [21]:
SDSS.query_sql("""select top 20 objid, cModelMag_u, cModelMag_g, cModelMag_r, cModelMag_i, cModelMag_z,
class
from photoobjall p
left outer join specobjall s on p.objid = s.bestobjid""")
Out[21]:
Problem 2d
This time use a right outer join
(or right join
).
How do your results compare to the previous query?
In [37]:
SDSS.query_sql("""select top 20 objid, cModelMag_u, cModelMag_g, cModelMag_r, cModelMag_i, cModelMag_z,
class
from photoobjall p
right outer join specobjall s on s.bestobjid = p.objid
""")
Out[37]:
To close the notebook we will perform a nested query. In brief, the idea is to join the results of one query with a separate query.
Here, we are going to attempt to identify bright AGN that don't have SDSS spectra. To do so we will need the photoObjAll
table, the specObjAll
table, both of which we've seen before, and the rosat
table, which includes all cross matches between SDSS sources and X-ray sources detected by the Rosat satellite.
Create a nested query that selects all Rosat sources that don't have SDSS spectra with cModelFlux_u + cModelFlux_g + cModelFlux_r + cModelFlux_i + cModelFlux_z > 10000
(this flux contraints ensures the source is bright without making any cuts on color) and type = 3
, this last constraint means the source is extended in SDSS images.
Hint - you may run into timeout issues in which case you should run the query on CasJobs.
In [42]:
SDSS.query_sql("""select rm.*
from
(select r.objid, r.sourcename, r.ra, r.dec, r.cps, r.hr1, r.hr2, cModelMag_u, cModelMag_g, cModelMag_r, cModelMag_i, cModelMag_z
from photoobjall p join rosat r on p.objid = r.objid
where (cModelFlux_u + cModelFlux_g + cModelFlux_r + cModelFlux_i + cModelFlux_z > 10000)
and p.type = 3) as rm
left join specobjall p on rm.objid = p.bestobjid
where p.bestobjid is null
""")
In [ ]: