The Union of Concerned Scientists maintains a database of ~1000 Earth satellites:
http://www.ucsusa.org/nuclear-weapons/space-weapons/satellite-database.html
For the majority of satellites, it includes kinematic, material, electrical, political, functional, and economic characteristics, such as dry mass, launch date, orbit type, country of operator, and purpose. The data appears to have been mirrored on other satellite search websites, e.g. http://satellites.findthedata.com/ .
This IPython notebook describes a sequence of interactions with a snapshot of this database, using the bayeslite implementation of BayesDB via the Python bayeslite client library. The snapshot includes a population of satellites defined using the UCS data as well as a constellation of generative probabilistic models for this population.
We have already analyzed this population of satellite data. We first download the results of that analysis:
In [1]:
import os
import subprocess
if not os.path.exists('satellites.bdb'):
    subprocess.check_call(['curl', '-O', 'http://probcomp.csail.mit.edu/bayesdb/downloads/satellites.bdb'])
In [2]:
# Load the bayeslite client library
import bayeslite
import bdbcontrib
from bdbcontrib import Population
# Load the satellites snapshot into a local instance of bayeslite
satellites = Population(name='satellites', bdb_path='satellites.bdb')
Before querying the implications of a population, it can be useful to look at a sample of the raw data and metadata. This can be done using a combination of ordinary SQL and convenience functions built into bayeslite. We start by finding one of the most well-known satellites, the International Space Station:
In [3]:
satellites.q("""
SELECT * FROM satellites
WHERE Name LIKE 'International Space Station%'
""").transpose()
Out[3]:
In [4]:
satellites.q("""SELECT COUNT(*) FROM satellites;""")
Out[4]:
In [5]:
satellites.q("""SELECT * FROM satellites WHERE Name LIKE '%GPS%'""").transpose()
Out[5]:
In [6]:
satellites.q("""
SELECT name, dry_mass_kg, period_minutes, class_of_orbit FROM satellites
ORDER BY period_minutes LIMIT 10;
""")
Out[6]:
bayeslite includes statistical graphics procedures designed for easy use with data extracted from an SQL database. Consider the problem of visualizing a table with two columns: dry_mass_kg, a NUMERICAL column, and class_of_orbit, a CATEGORICAL column. The bdbcontrib.histogram procedure renders this data by producing overlaid histograms, one per distinct value of class_of_orbit.
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
In [8]:
satellites.histogram("""SELECT dry_mass_kg, class_of_orbit FROM satellites""",
nbins=35, normed=True);
Consider the following "what if?" question about satellites:
Suppose there is a satellite in geosynchronous orbit that we know has a dry mass of 500 kilograms. What is its probable purpose? and what countries might be operating it?
In some applications, these "what if?" scenarios may be of intrinsic interest. They also provide a way for domain experts to scrutinize the models that come with this snapshot, by qualitatively checking their implications via simulated examples.
It is straightforward to pose this question using BQL:
In [9]:
satellites.q("""DROP TABLE IF EXISTS "satellite_purpose";""");
satellites.q("""
CREATE TEMP TABLE IF NOT EXISTS satellite_purpose AS
SIMULATE country_of_operator, purpose FROM %g
GIVEN Class_of_orbit = 'GEO', Dry_mass_kg = 500
LIMIT 1000;
""");
Note that everything after the AS is a perfectly valid query. CREATE TEMP TABLE satellite_purpose AS saves the result of the query that follows it into a table called satellite_purpose, which we can refer to later. Temporary tables are destroyed when the session is closed.
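The same TEMP TABLE semantics can be observed in plain SQLite (the engine underlying bayeslite's storage). A minimal sketch using Python's built-in sqlite3 module, with a made-up two-row table for illustration:

```python
import sqlite3

# Open an in-memory database; temp tables live only as long as the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE satellites_demo (name TEXT, period_minutes REAL)")
conn.executemany("INSERT INTO satellites_demo VALUES (?, ?)",
                 [("A", 1436.0), ("B", 98.0)])

# CREATE TEMP TABLE ... AS saves a query result under a name we can reuse.
conn.execute("""CREATE TEMP TABLE geo_like AS
                SELECT * FROM satellites_demo WHERE period_minutes > 1000""")
rows = conn.execute("SELECT name FROM geo_like").fetchall()
print(rows)  # the temp table is queryable while the session is open

conn.close()  # closing the session destroys the temp table
```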
The %g specifies the set of generative population models created for this population during the analysis step that we performed for you, and stored in the bdb file that we downloaded in the first step of the notebook. Creating those models is beyond the scope of this segment of the tutorial. The important thing to note is that one has to use %g (or, explicitly, the name of the set of models, in this case satellites_cc, as seen in the BQL echoed above) whenever we are trying to SIMULATE or ESTIMATE from those models.
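Conceptually, SIMULATE draws each row from the weighted collection of population models: pick a model in proportion to its weight, then sample from that model's conditional distribution given the constraints. A toy sketch of this two-stage sampling; the weights, categories, and probabilities below are invented for illustration and are not the snapshot's actual models:

```python
import random

random.seed(1)

# Hypothetical posterior weights for three population models.
model_weights = [0.5, 0.3, 0.2]

# Each model's (invented) conditional distribution over `purpose`
# given Class_of_orbit = 'GEO' and Dry_mass_kg = 500.
conditionals = [
    {"Communications": 0.7, "Navigation": 0.2, "Earth Science": 0.1},
    {"Communications": 0.5, "Navigation": 0.4, "Earth Science": 0.1},
    {"Communications": 0.9, "Navigation": 0.05, "Earth Science": 0.05},
]

def simulate_purpose():
    # Stage 1: choose a model in proportion to its weight.
    model = random.choices(conditionals, weights=model_weights)[0]
    # Stage 2: sample from that model's conditional distribution.
    values, probs = zip(*model.items())
    return random.choices(values, weights=probs)[0]

samples = [simulate_purpose() for _ in range(1000)]
print(samples.count("Communications") / len(samples))
```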
To inspect the results, we first create a derived Country-Purpose variable, aggregate over it, and sort the results, all using SQL, and then visualize it using bdbcontrib.barplot:
In [10]:
satellites.barplot("""
SELECT country_of_operator || "--" || purpose AS "Country-Purpose",
COUNT("Country-Purpose") AS frequency
FROM satellite_purpose
GROUP BY "Country-Purpose"
ORDER BY frequency DESC
LIMIT 20;
""");
What if you approached this question by querying the data and not its implications? One approach would be to find existing satellites that are in GEO and have a dry_mass_kg that is close to 500:
In [11]:
satellites.q("""
SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg
FROM satellites
WHERE Class_of_orbit = 'GEO'
AND Dry_Mass_kg BETWEEN 400 AND 600""")
Out[11]:
Without understanding the joint distribution of dry_mass_kg, it is difficult to know how wide a net to cast. Broadening the range by another 100 kg on each side still yields an idiosyncratic list.
In [12]:
satellites.q("""
SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg
FROM satellites
WHERE Class_of_orbit = 'GEO'
AND Dry_Mass_kg BETWEEN 300 AND 700""")
Out[12]:
In general, as the constraints for the hypothetical get narrower, the results of a SELECT-based approach grow more unstable.
For data sources that can be arranged into statistical populations, a key exploratory question is
Which variables probably predict one another?
It is closely related to the key confirmatory analysis question of
How much evidence is there for a predictive relationship between two variables?
BayesDB makes it easy to ask and answer both of these questions by using simple BQL queries executed against baseline models built using the Metamodeling Language (MML).
To quantify the evidence for (or against) a predictive relationship between a pair of variables, BQL relies on information theory. The notion of dependence between two variables A and B is taken to be mutual information; the amount of evidence for dependence is then the probability that the mutual information between A and B is nonzero. This can be defined in terms of a weighted collection of population models {(G_i, w_i)} as follows:
Pr[ A dep B ] = Pr[ I( A ; B ) > 0 ] = \sum_i w_i Pr[ I( A ; B ) > 0 | G_i ]
If the population models are obtained by posterior inference in a meta-model — as is the case with MML — then this probability approximates the posterior probability (or strength of evidence) that the mutual information is nonzero.
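Under this definition, the evidence computation reduces to a weighted vote across the model ensemble. A minimal sketch; the weights and per-model judgments below are invented for illustration:

```python
# Hypothetical ensemble: each model G_i carries a weight w_i and a
# 0-or-1 judgment of whether I(A;B) > 0 under that model.
weights = [0.25, 0.25, 0.25, 0.25]
dependent_in_model = [1, 1, 1, 0]  # Pr[I(A;B) > 0 | G_i] for each model

# Pr[A dep B] = sum_i w_i * Pr[I(A;B) > 0 | G_i]
prob_dependent = sum(w * d for w, d in zip(weights, dependent_in_model))
print(prob_dependent)  # -> 0.75
```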
The Python client for bayeslite makes it straightforward to examine the overall matrix of pairwise dependence probabilities. Cell (i,j) in this matrix records Pr[ variable i is dependent on variable j ]. The matrix is reordered using a clustering algorithm to make higher-order predictive relationships (cases where some group of variables is probably mutually interdependent) more visually apparent.
In [13]:
satellites.heatmap("""ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE COLUMNS OF %g;""");
This heatmap shows several groups of variables with high probability of mutual interdependence. For example, we see a definite block of geopolitically related variables, such as the country of contractor & operator, the contractor's identity, and the location of the satellite (if it is in geosynchronous orbit). The kinematic variables describing orbits, such as perigee, apogee, period, and orbit class, are also shown as strongly interdependent. A domain expert with sufficiently confident domain knowledge can thus use this overview of the predictive relationships to critically assess the value of the data and the efficacy of MML.
It is also instructive to compare the heatmap of pairwise dependence probabilities with standard alternatives from statistics, such as datatype-appropriate measures of correlation:
In [14]:
# WARNING: This may take a couple minutes.
satellites.heatmap("""ESTIMATE CORRELATION FROM PAIRWISE COLUMNS OF %g;""");
The results from correlation are sufficiently noisy that it would be difficult to trust inferences from techniques that use correlation to select variables. Furthermore, the most causally unambiguous relationships, such as the geometric constraint relating perigee and apogee, are not detected by correlation.
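The perigee/apogee case is an instance of a general failure mode: Pearson correlation can be exactly zero for a perfectly deterministic but nonlinear relationship. A self-contained illustration with a made-up symmetric dataset, using only the standard library:

```python
import math

# y is a deterministic function of x, yet the Pearson correlation is zero
# because the relationship is symmetric rather than linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (sa * sb)

print(pearson(xs, ys))  # -> 0.0: correlation misses the deterministic relation
```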
A key step in data cleaning is choosing a method for handling missing values. BQL provides the INFER primitive to make it straightforward to obtain point estimates and confidence scores for arbitrary cells in the database.
Consider the variable type_of_orbit. It is easy to see that many values are missing:
In [15]:
satellites.q("""SELECT COUNT(*) FROM satellites WHERE type_of_orbit IS NULL;""")
Out[15]:
The following query produces a table with several variables, along with both a predicted value for type_of_orbit and the confidence associated with that value:
In [16]:
satellites.q("""DROP TABLE IF EXISTS "inferred_orbit";""")
satellites.q("""
CREATE TEMP TABLE IF NOT EXISTS inferred_orbit AS
INFER EXPLICIT anticipated_lifetime, perigee_km,
period_minutes, class_of_orbit,
PREDICT type_of_orbit AS inferred_orbit_type
CONFIDENCE inferred_orbit_type_conf
FROM %g
WHERE type_of_orbit IS NULL;
""")
Out[16]:
We can visualize the result both in tabular and graphical form.
In [17]:
satellites.q("""SELECT * FROM inferred_orbit LIMIT 30;""")
Out[17]:
In [18]:
satellites.pairplot("""
SELECT inferred_orbit_type, inferred_orbit_type_conf, class_of_orbit
FROM inferred_orbit;
""",
colorby='class_of_orbit');
The plot on the bottom left shows that the confidence depends on the orbit class and on the predicted value for the inferred orbit type. For example, there is typically moderate to high confidence for the orbit type of LEO satellites, and high confidence (but some variability in confidence) for those with Sun-Synchronous orbits. Satellites with Elliptical orbits may be assigned a Sun-Synchronous type with moderate confidence, but for other target labels confidence is generally lower.
Note that many standard techniques for imputation from statistics correspond to INFER ... WITH CONFIDENCE 0, as they have no natural notion of confidence.
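For comparison, classical mode imputation fills every missing cell with the single most frequent observed value and attaches no per-cell confidence. A sketch, with an invented column of orbit types:

```python
from collections import Counter

def impute_mode(values):
    """Fill None entries with the most common observed value.

    Unlike INFER ... CONFIDENCE, every missing cell gets the same answer
    and there is no confidence score attached to it.
    """
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

orbits = ["Sun-Synchronous", None, "Sun-Synchronous", "Molniya", None]
print(impute_mode(orbits))
```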
Recall that we mentioned earlier that some of the relations are governed by the laws of physics and are thus nearly deterministic. We can use this determinism, coupled with our notion of anomalousness, to search the table for data-entry errors. A geosynchronous orbit should take 24 hours (1440 minutes). Let us display the anomalous values for satellites in geosynchronous orbit.
In [19]:
satellites.q("""DROP TABLE IF EXISTS "unlikely_periods";""")
satellites.q("""
CREATE TEMP TABLE IF NOT EXISTS unlikely_periods AS
ESTIMATE name, class_of_orbit, period_minutes,
PREDICTIVE PROBABILITY OF period_minutes
AS "Relative Probability of Period"
FROM %g;
""")
Out[19]:
In [20]:
satellites.q("""
SELECT * FROM unlikely_periods
WHERE class_of_orbit = 'GEO'
AND period_minutes IS NOT NULL
ORDER BY "Relative Probability of Period" ASC LIMIT 10;
""")
Out[20]:
We see a couple of oddities. There are satellites with 24-minute periods; it appears that these entries were recorded in hours rather than minutes. Other entries have periods that are far too short, which appear to be decimal errors.
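Once flagged, such entries can be checked mechanically: a GEO period near 24 minutes is consistent with an hours-for-minutes entry, and one near 144 minutes with a shifted decimal point. A hypothetical checker sketching this heuristic (the function and thresholds are our own, not part of BayesDB):

```python
GEO_PERIOD_MIN = 1436.0  # sidereal day in minutes; ~1440 is also common

def flag_geo_period(period_minutes, tol=0.05):
    """Classify a suspicious GEO period. Heuristic, for illustration only."""
    def close(a, b):
        return abs(a - b) <= tol * b
    if close(period_minutes, GEO_PERIOD_MIN):
        return "ok"
    if close(period_minutes * 60, GEO_PERIOD_MIN):
        return "hours-for-minutes"   # e.g. 23.94 entered instead of 1436
    if close(period_minutes * 10, GEO_PERIOD_MIN):
        return "decimal-shift"       # e.g. 143.6 entered instead of 1436
    return "unexplained"

print(flag_geo_period(23.94))   # hours entered as minutes
print(flag_geo_period(143.6))   # decimal point shifted one place
```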
NOTE: We have reported these errors to the database maintainers.
We need your help to improve the quality of this database and BayesDB snapshot. Here are some ideas to get you started:
Detecting and patching bad data. We've found many ETL errors using simple queries that find satellites with the least likely measured values. This dataset appears to have been mirrored many times, so fixing errors could help many downstream users.
Assuring model and inference quality. We've shown our exploratory analysis to domain experts from NASA and to MIT colleagues. How much room is there to improve results by customizing the model and fixing inference quality bugs? and how easy is it for domain experts to distinguish between model problems, data problems, and their own misconceptions?
We're also interested in exploring new applications. For example:
Build a satellite search engine for satellite spotters. This could answer queries like "I see a satellite with a particular velocity that I know is not a communications satellite. What are the most probable matches?"
What-if scenarios. Given a new Chinese project to launch a communications satellite, who are the most likely contractors? and how heavy is it likely to be? What about if it had been launched in 1970?
If you are a satellite (or satellite industry) enthusiast and want to help out, please contact us about additional possibilities, such as mapping the characteristics of contractors or automatically finding similar contractors and launch sites. These applications require analyzing other populations that can be derived from the same data source.
Copyright (c) 2010-2016, MIT Probabilistic Computing Project
Licensed under Apache 2.0 (edit cell for details).