Exploring and cleaning the Union of Concerned Scientists database of Earth Satellites

The Union of Concerned Scientists maintains a database of ~1000 Earth satellites:

http://www.ucsusa.org/nuclear-weapons/space-weapons/satellite-database.html

For the majority of satellites, it includes kinematic, material, electrical, political, functional, and economic characteristics, such as dry mass, launch date, orbit type, country of operator, and purpose. The data appears to have been mirrored on other satellite search websites, e.g. http://satellites.findthedata.com/ .

This iPython notebook describes a sequence of interactions with a snapshot of this database using the bayeslite implementation of BayesDB, using the Python bayeslite client library. The snapshot includes a population of satellites defined using the UCS data as well as a constellation of generative probabilistic models for this population.

We have already analyzed this population of satellite data. We first download the results of that analysis:


In [1]:
import os
import subprocess
if not os.path.exists('satellites.bdb'):
    subprocess.check_call(['curl', '-O', 'http://probcomp.csail.mit.edu/bayesdb/downloads/satellites.bdb'])

In [2]:
# Load the bayeslite client library
import bayeslite
import bdbcontrib
from bdbcontrib import Population

# Load the satellites snapshot into a local instance of bayeslite
satellites = Population(name='satellites', bdb_path='satellites.bdb')


BQL [SELECT * FROM bayesdb_generator] ()

Querying the data using SQL

Before querying the implications of a population, it can be useful to look at a sample of the raw data and metadata. This can be done using a combination of ordinary SQL and convenience functions built into bayeslite. We start by finding one of the most well-known satellites, the International Space Station:


In [3]:
satellites.q("""
SELECT * FROM satellites 
WHERE Name LIKE 'International Space Station%'
""").transpose()


BQL [ SELECT * FROM satellites WHERE Name LIKE 'International Space Station%' ] ()
Out[3]:
0
Name International Space Station (ISS [first elemen...
Country_of_Operator Multinational
Operator_Owner NASA/Multinational
Users Government
Purpose Scientific Research
Class_of_Orbit LEO
Type_of_Orbit Intermediate
Perigee_km 401
Apogee_km 422
Eccentricity 0.00155
Period_minutes 92.8
Launch_Mass_kg NaN
Dry_Mass_kg NaN
Power_watts NaN
Date_of_Launch 36119
Anticipated_Lifetime 30
Contractor Boeing Satellite Systems (prime)/Multinational
Country_of_Contractor Multinational
Launch_Site Baikonur Cosmodrome
Launch_Vehicle Proton
Source_Used_for_Orbital_Data www.satellitedebris.net 12/12
longitude_radians_of_geo NaN
Inclination_radians 0.90059

In [4]:
satellites.q("""SELECT COUNT(*) FROM satellites;""")


BQL [SELECT COUNT(*) FROM satellites;] ()
Out[4]:
"COUNT"(*)
0 1167

In [5]:
satellites.q("""SELECT * FROM satellites WHERE Name LIKE '%GPS%'""").transpose()


BQL [SELECT * FROM satellites WHERE Name LIKE '%GPS%'] ()
Out[5]:
0 1 2 3 4 5 6 7 8 9 ... 22 23 24 25 26 27 28 29 30 31
Name Navstar GPS II-10 (Navstar SVN 23, PRN 32, US... Navstar GPS II-14 (Navstar SVN 26, PRN 26, USA... Navstar GPS II-21 (Navstar SVN 39, PRN 09, USA... Navstar GPS II-23 (Navstar SVN 34, PRN 04, USA... Navstar GPS II-24 (Navstar SVN 36, PRN 06, USA... Navstar GPS II-25 (Navstar SVN 33, PRN 03, USA... Navstar GPS II-26 (Navstar SVN 40, PRN 10, USA... Navstar GPS II-28 (Navstar SVN 38, PRN 08, USA... Navstar GPS II-35 (Navstar SVN 35, PRN 30, USA... Navstar GPS IIF-1 (Navstar SVN 62, PRN 25, USA... ... Navstar GPS IIR-7 (Navstar SVN 54, PRN 18, USA... Navstar GPS IIR-8 (Navstar SVN 56, PRN 16, USA... Navstar GPS IIR-9 (Navstar SVN 45, PRN 21, USA... Navstar GPS IIR-M-1 (Navstar SVN 53, PRN 17, U... Navstar GPS IIR-M-2 (Navstar SVN 52, PRN 31, U... Navstar GPS IIR-M-3 (Navstar SVN 58, PRN 12, U... Navstar GPS IIR-M-4 (Navstar SVN 55, PRN 15, U... Navstar GPS IIR-M-5 (Navstar SVN 57, PRN 29, U... Navstar GPS IIR-M-6 (Navstar SVN 48, PRN 07, U... Navstar GPS IIR-M-8 (Navstar SVN 50, PRN 05, U...
Country_of_Operator USA USA USA USA USA USA USA USA USA USA ... USA USA USA USA USA USA USA USA USA USA
Operator_Owner DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force ... DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force DoD/US Air Force
Users Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial ... Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial Military/Commercial
Purpose Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning ... Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning Navigation/Global Positioning
Class_of_Orbit MEO MEO MEO MEO MEO MEO MEO MEO MEO MEO ... MEO MEO MEO MEO MEO MEO MEO MEO MEO MEO
Type_of_Orbit N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A ... N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Perigee_km 19781 19959 20120 20104 19986 20080 20134 19912 20109 20188 ... 20104 20155 20063 20142 20020 20206 20149 20150 20135 20160
Apogee_km 20582 20403 20244 20260 20315 20284 20227 20449 20257 20224 ... 20266 20344 20433 20221 20342 20366 20213 20311 20152 20209
Eccentricity 0.01508 0.00836 0.00234 0.00294 0.0062 0.00384 0.00175 0.01011 0.00279 0.00068 ... 0.00305 0.00355 0.00695 0.00149 0.00606 0.003 0.00121 0.00303 0.00032 0.00092
Period_minutes 717.94 717.93 717.97 717.97 716.69 717.97 717.91 717.91 718.01 718.94 ... 718.09 720.71 720.65 717.95 717.93 722.19 717.93 719.92 716.4 718.07
Launch_Mass_kg 1816 1816 1816 1816 1816 1816 1816 1816 1816 1630 ... 2217 2217 2217 2217 2060 2060 2217 2060 2217 2059
Dry_Mass_kg 844 844 844 844 844 844 844 844 844 NaN ... 980 980 980 980 NaN NaN 980 NaN 980 980
Power_watts 700 700 700 700 700 700 700 700 700 NaN ... 1136 1136 1136 1136 NaN NaN 1136 NaN 1136 1136
Date_of_Launch 33203 33792 34146 34268 34403 35152 35262 35740 34211 40326 ... 36921 37650 37711 38621 38985 39038 39372 39436 39522 40042
Anticipated_Lifetime 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 12 ... 10 10 10 10 10 10 10 10 10 10
Contractor Rockwell International Rockwell International Rockwell International Rockwell International Rockwell International Rockwell International Rockwell International Rockwell International Rockwell International Boeing Satellite Systems ... Lockheed Martin Lockheed Martin Lockheed Martin Lockheed Martin Lockheed Martin Missiles and Space Lockheed Martin Missiles and Space Lockheed Martin Missiles and Space Lockheed Martin Missiles and Space Lockheed Martin Lockheed Martin Missiles and Space
Country_of_Contractor USA USA USA USA USA USA USA USA USA USA ... USA USA USA USA USA USA USA USA USA USA
Launch_Site Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral ... Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral Cape Canaveral
Launch_Vehicle Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 4 ... Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2 Delta 2
Source_Used_for_Orbital_Data None JMSatcat304 JMSatcat304 JMSatcat304 JMSatcat304 JMSatcat304 JMSatcat304 JMSatcat304 JMSatcat304 SC - ASCR ... JMSatcat304 JMSatcat304 JMSatcat304 SC - ASCR SC - ASCR SC - ASCR None SC - ASCR SC - ASCR SC - ASCR
longitude_radians_of_geo NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Inclination_radians 0.973545 0.959931 0.954695 0.959931 0.958186 0.954695 0.959931 0.958186 0.95644 0.959931 ... 0.959931 0.959931 0.958186 0.961327 0.958884 0.96028 0.95644 0.959233 0.961676 0.962025

23 rows × 32 columns


In [6]:
satellites.q("""
SELECT name, dry_mass_kg, period_minutes, class_of_orbit FROM satellites 
ORDER BY period_minutes LIMIT 10;
""")


BQL [ SELECT name, dry_mass_kg, period_minutes, class_of_orbit FROM satellites ORDER BY period_minutes LIMIT 10; ] ()
Out[6]:
Name Dry_Mass_kg Period_minutes Class_of_Orbit
0 Advanced Orion 5 (NRO L-32, USA 223) NaN NaN GEO
1 IGS-8B (Information Gathering Satellite 8B, IG... NaN NaN LEO
2 Interstellar Boundary EXplorer (IBEX) NaN 0.22 Elliptical
3 Spektr-R/RadioAstron NaN 0.22 Elliptical
4 SDS III-6 (Satellite Data System) NRO L-27, Gr... NaN 14.36 GEO
5 Advanced Orion 6 (NRO L-15, USA 237) NaN 23.94 GEO
6 SDS III-7 (Satellite Data System) NRO L-38, Dr... NaN 23.94 GEO
7 RISat-2 (Radar Imaging Satellite 2) NaN 41.20 LEO
8 Kuaizhou-1 (KZ-1) NaN 90.61 LEO
9 X37-B OTV-1 (USA 240) NaN 91.54 LEO

bayeslite includes statistical graphics procedures designed for easy use with data extracted from an SQL database. Consider the problem of visualizing a table with two columns: dry_mass_kg, a NUMERICAL column, and class_of_orbit, a CATEGORICAL column. The bdbcontrib.histogram procedure renders this data by producing overlaid histograms, one per distinct value of class_of_orbit.


In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
satellites.histogram("""SELECT dry_mass_kg, class_of_orbit FROM satellites""", 
                     nbins=35, normed=True);


BQL [SELECT dry_mass_kg, class_of_orbit FROM satellites] ()

Querying the implications using BQL — What if? questions

Consider the following "what if?" question about satellites:

Suppose there is a satellite in geosynchronous orbit that we know has a dry mass of 500 kilograms. What is its probable purpose? and what countries might be operating it?

In some applications, these "what if?" scenarios may be of intrinsic interest. They also provide a way for domain experts to scrutinize the models that come with this snapshot, by qualitatively checking their implications via simulated examples.

It is straightforward to pose this question using BQL:


In [9]:
satellites.q("""DROP TABLE IF EXISTS "satellite_purpose";""");
satellites.q("""
CREATE TEMP TABLE IF NOT EXISTS satellite_purpose AS
    SIMULATE country_of_operator, purpose FROM %g
    GIVEN Class_of_orbit = 'GEO', Dry_mass_kg = 500 
    LIMIT 1000;
""");


BQL [DROP TABLE IF EXISTS "satellite_purpose";] ()
BQL [ CREATE TEMP TABLE IF NOT EXISTS satellite_purpose AS SIMULATE country_of_operator, purpose FROM "satellites_cc" GIVEN Class_of_orbit = 'GEO', Dry_mass_kg = 500 LIMIT 1000; ] ()

Note that everything after the AS is a perfectly valid query. CREATE TEMP TABLE satellite_purpose AS saves the result of the query that follows it into a table called satellite_purpose which we can refer to later. Temporary tables are destroyed when the session is closed.

The %g specifies the set of generative population models created for this population during the analysis step that we performed for you on this population, and stored in the bdb file that we downloaded in the first step of the notebook. Creating those models is beyond the scope of this segment of the tutorial. The important thing to note is that one has to use %g (or explicitly the name of the set of models, in this case satellites_cc as we can see in the BQL echoed above) whenever we are trying to SIMULATE or ESTIMATE from those models.

To inspect the results, we first create a derived Country-Purpose variable, aggregate over it, and sort the results, all using SQL, and visualize it using bdbcontrib.barplot:


In [10]:
satellites.barplot("""
SELECT country_of_operator || "--" || purpose AS "Country-Purpose", 
    COUNT("Country-Purpose") AS frequency
FROM satellite_purpose
    GROUP BY "Country-Purpose"
    ORDER BY frequency DESC
LIMIT 20;
""");


BQL [ SELECT country_of_operator || "--" || purpose AS "Country-Purpose", COUNT("Country-Purpose") AS frequency FROM satellite_purpose GROUP BY "Country-Purpose" ORDER BY frequency DESC LIMIT 20; ] ()

What if you approached this question by querying the data and not its implications? One approach would be to find existing satellites that are in GEO and have a dry_mass_kg that is close to 500:


In [11]:
satellites.q("""
SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg
    FROM satellites
        WHERE Class_of_orbit = 'GEO'
            AND Dry_Mass_kg BETWEEN 400 AND 600""")


BQL [ SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg FROM satellites WHERE Class_of_orbit = 'GEO' AND Dry_Mass_kg BETWEEN 400 AND 600] ()
Out[11]:
Country_of_Operator Purpose Class_of_Orbit Dry_Mass_kg
0 India Communications GEO 559
1 India Meteorology GEO 500

Without understanding the joint distribution of dry_mass_kg, it is difficult to know how wide a net to cast. Broadening the range another 100kg still yields an idiosyncratic list.


In [12]:
satellites.q("""
SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg
    FROM satellites
        WHERE Class_of_orbit = 'GEO'
            AND Dry_Mass_kg BETWEEN 300 AND 700""")


BQL [ SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg FROM satellites WHERE Class_of_orbit = 'GEO' AND Dry_Mass_kg BETWEEN 300 AND 700] ()
Out[12]:
Country_of_Operator Purpose Class_of_Orbit Dry_Mass_kg
0 Malaysia Communications GEO 650
1 Israel Communications GEO 646
2 Luxembourg Communications GEO 700
3 Russia Communications GEO 620
4 China (PR) Earth Science GEO 620
5 China (PR) Earth Science GEO 620
6 China (PR) Earth Science GEO 620
7 India Communications GEO 559
8 India Navigation GEO 614
9 India Meteorology GEO 500
10 Malaysia Communications GEO 650
11 Multinational Earth Science/Meteorology GEO 320
12 United Kingdom Communications GEO 660
13 Norway Communications GEO 646

In general, as the constraints for the hypothetical get narrower, the results of a SELECT based approach grow more unstable.

Querying the implications using BQL — Exploring predictive relationships

For data sources that can be arranged into statistical populations, a key exploratory question is

Which variables probably predict one another?

It is closely related to the key confirmatory analysis question of

How much evidence is there for a predictive relationship between two variables?

BayesDB makes it easy to ask and answer both of these questions by using simple BQL queries executed against baseline models built using Meta-modeling Language (MML).

To quantify the evidence for (or against) a predictive relationship between two pairs of variables, BQL relies on information theory. The notion of dependence between two variables A and B is taken to be mutual information; the amount of evidence for dependence is then the probability that the mutual information between A and B is nonzero. This can be defined in terms of a weighted collection of population models {(G_i, m_i)} as follows:

Pr[ A dep B ] = Pr[ I( A ; B ) > 0 ] = \sum_i G_i w_i Pr[ I( A ; B ) > 0 | G_i ]

If the population models are obtained by posterior inference in a meta-model — as is the case with MML — then this probability approximates the posterior probability (or strength of evidence) that the mutual information is nonzero.

The Python client for bayeslite makes it straightforward to examine the overall matrix of pairwise dependence probabilities. Cell (i,j) in this matrix records Pr[ variable i is dependent on variable j ]. The matrix is reordered using a clustering algorithm to make higher-order predictive relationships — cases where some group of variables are probably all mutually independent — more visually apparent.


In [13]:
satellites.heatmap("""ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE COLUMNS OF %g;""");


BQL [ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE COLUMNS OF "satellites_cc";] ()
Detected value limits as [0, 1]. Override with vmin and vmax.
/Users/probcomp/GoogleDrive/ProbComp/venv-2.7.11-0.1.4/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

This heatmap shows several groups of variables with high probability of mutual interdependence. For example, we see a definite block of geopolitically related variables, such as the country of contractor & operator, the contractor's identity, and the location of the satellite (if it is in geosynchronous orbit). The kinematic variables describing orbits, such as perigee, apogee, period, and orbit class, are also shown as strongly interdependent. A domain expert with sufficiently confident domain knowledge can thus use this overview of the predictive relationships to critically assess the value of the data and the efficacy of MML.

It is also instructive to compare the heatmap of pairwise dependence probabilities with standard alternatives from statistics, such as datatype-appropriate measures of correlation:


In [14]:
# WARNING: This may take a couple minutes.
satellites.heatmap("""ESTIMATE CORRELATION FROM PAIRWISE COLUMNS OF %g;""");


BQL [ESTIMATE CORRELATION FROM PAIRWISE COLUMNS OF "satellites_cc";] ()
Detected value limits as [0, 1]. Override with vmin and vmax.

The results from correlation are sufficiently noisy that it would be difficult to trust inferences from techniques that use correlation to select variables. Furthermore, the most causally unambiguous relationships, such as the geometric constraint relating perigee and apogee, are not detected by correlation.

Querying the implications using BQL — Inferring missing values

A key step in data cleaning is choosing a method for handling missing values. BQL provides the INFER primitive to make it straightforward to obtain point estimates and confidence scores for arbitrary cells in the database.

Consider the variable type_of_orbit. It is easy to see that many values are missing:


In [15]:
satellites.q("""SELECT COUNT(*) FROM satellites WHERE type_of_orbit IS NULL;""")


BQL [SELECT COUNT(*) FROM satellites WHERE type_of_orbit IS NULL;] ()
Out[15]:
"COUNT"(*)
0 123

The following query produces a table with several variables, along with both a predicted value for type_of_orbit and the confidence associated with that value:


In [16]:
satellites.q("""DROP TABLE IF EXISTS "inferred_orbit";""")
satellites.q("""
CREATE TEMP TABLE IF NOT EXISTS inferred_orbit AS
    INFER EXPLICIT anticipated_lifetime, perigee_km,
        period_minutes, class_of_orbit,
        PREDICT type_of_orbit AS inferred_orbit_type
            CONFIDENCE inferred_orbit_type_conf
    FROM %g
        WHERE type_of_orbit IS NULL;
""")


BQL [DROP TABLE IF EXISTS "inferred_orbit";] ()
BQL [ CREATE TEMP TABLE IF NOT EXISTS inferred_orbit AS INFER EXPLICIT anticipated_lifetime, perigee_km, period_minutes, class_of_orbit, PREDICT type_of_orbit AS inferred_orbit_type CONFIDENCE inferred_orbit_type_conf FROM "satellites_cc" WHERE type_of_orbit IS NULL; ] ()
Out[16]:

We can visualize the result both in tabular and graphical form.


In [17]:
satellites.q("""SELECT * FROM inferred_orbit LIMIT 30;""")


BQL [SELECT * FROM inferred_orbit LIMIT 30;] ()
Out[17]:
anticipated_lifetime perigee_km period_minutes class_of_orbit inferred_orbit_type inferred_orbit_type_conf
0 1.0 770 100.42 LEO Sun-Synchronous 0.81
1 NaN 480 97.40 LEO Sun-Synchronous 0.59
2 NaN 495 97.50 LEO Sun-Synchronous 0.69
3 NaN 499 97.60 LEO Sun-Synchronous 0.79
4 2.0 443 98.00 LEO Sun-Synchronous 0.77
5 2.0 443 98.00 LEO Sun-Synchronous 0.74
6 3.0 510 95.10 LEO Sun-Synchronous 0.53
7 2.0 570 96.20 LEO Sun-Synchronous 0.69
8 NaN 599 96.90 LEO Sun-Synchronous 0.76
9 3.0 558 96.06 LEO Sun-Synchronous 0.74
10 1.0 264 150.40 Elliptical Sun-Synchronous 0.34
11 NaN 443 98.00 LEO Sun-Synchronous 0.78
12 5.0 503 94.70 LEO Sun-Synchronous 0.80
13 NaN 403 96.85 LEO Sun-Synchronous 0.38
14 2.0 325 103.11 Elliptical Sun-Synchronous 0.48
15 NaN 538 95.50 LEO Sun-Synchronous 0.64
16 3.0 713 99.20 LEO Sun-Synchronous 0.83
17 NaN 443 98.00 LEO Sun-Synchronous 0.76
18 0.5 326 102.89 LEO Sun-Synchronous 0.40
19 1.5 326 103.00 LEO Sun-Synchronous 0.53
20 1.0 306 101.08 Elliptical Sun-Synchronous 0.55
21 NaN 538 95.40 LEO Sun-Synchronous 0.68
22 NaN 488 94.49 LEO Sun-Synchronous 0.82
23 NaN 488 94.49 LEO Sun-Synchronous 0.76
24 NaN 827 101.45 LEO Sun-Synchronous 0.80
25 NaN 537 95.60 LEO Sun-Synchronous 0.70
26 NaN 1101 107.35 LEO Sun-Synchronous 0.74
27 NaN 1068 107.00 LEO Sun-Synchronous 0.72
28 NaN 1066 106.70 LEO Sun-Synchronous 0.70
29 3.0 443 98.00 LEO Sun-Synchronous 0.73

In [18]:
satellites.pairplot("""
SELECT inferred_orbit_type, inferred_orbit_type_conf, class_of_orbit
FROM inferred_orbit;
""", 
                    colorby='class_of_orbit');


BQL [ SELECT inferred_orbit_type, inferred_orbit_type_conf, class_of_orbit FROM inferred_orbit; ] ()
/Users/probcomp/GoogleDrive/ProbComp/venv-2.7.11-0.1.4/lib/python2.7/site-packages/numpy/lib/function_base.py:564: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  n = np.zeros(bins, ntype)
/Users/probcomp/GoogleDrive/ProbComp/venv-2.7.11-0.1.4/lib/python2.7/site-packages/numpy/lib/function_base.py:600: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  n += np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)

The plot on the bottom left shows that the confidence depends on the orbit class and on the predicted value for the inferred orbit type. For example, there is typically moderate to high confidence for the orbit type of LEO satellites — and high confidence (but some variability in confidence) for those with Sun-Synchronous orbits. Satellites with Elliptical orbits may be assigned a Sun-Synchronous type with moderate confidence, but for other target labels confidence is generally lower.

Note that many standard techniques for imputation from statistics correspond to INFER ... WITH CONFIDENCE 0, as they have no natural notion of confidence.

Query the implications using BQL — Identify anomalies

Another core data cleaning operation is identifying erroneous values — including (univariate) outliers and other kinds of anomalies. BQL makes it straightforward to assess the predictive probability density of each cell in the data.

Recall earlier that we mentioned that some of the relations are governed by the laws of physics and are thus nearly deterministic? We can use this determinism coupled with our notion of anomalousness to search the table for data-entry errors. A geosynchronous orbit should take 24 hours (1440 minutes). Let us display the anomalous values for satellites in geosynchronous orbit.


In [19]:
satellites.q("""DROP TABLE IF EXISTS "unlikely_periods";""")
satellites.q("""
CREATE TEMP TABLE IF NOT EXISTS unlikely_periods AS
    ESTIMATE name, class_of_orbit, period_minutes,
        PREDICTIVE PROBABILITY OF period_minutes
            AS "Relative Probability of Period"
    FROM %g;
""")


BQL [DROP TABLE IF EXISTS "unlikely_periods";] ()
BQL [ CREATE TEMP TABLE IF NOT EXISTS unlikely_periods AS ESTIMATE name, class_of_orbit, period_minutes, PREDICTIVE PROBABILITY OF period_minutes AS "Relative Probability of Period" FROM "satellites_cc"; ] ()
Out[19]:

In [20]:
satellites.q("""
SELECT * FROM unlikely_periods
    WHERE class_of_orbit = 'GEO'
        AND period_minutes IS NOT NULL
    ORDER BY "Relative Probability of Period" ASC LIMIT 10;
""")


BQL [ SELECT * FROM unlikely_periods WHERE class_of_orbit = 'GEO' AND period_minutes IS NOT NULL ORDER BY "Relative Probability of Period" ASC LIMIT 10; ] ()
Out[20]:
Name Class_of_Orbit Period_minutes Relative Probability of Period
0 AEHF-3 (Advanced Extremely High Frequency sate... GEO 1306.29 0.001295
1 AEHF-2 (Advanced Extremely High Frequency sate... GEO 1306.29 0.001295
2 DSP 20 (USA 149) (Defense Support Program) GEO 142.08 0.002638
3 Intelsat 903 GEO 1436.16 0.003249
4 BSAT-3B GEO 1365.61 0.003418
5 Intelsat 902 GEO 1436.10 0.003443
6 SDS III-6 (Satellite Data System) NRO L-27, Gr... GEO 14.36 0.003735
7 Advanced Orion 6 (NRO L-15, USA 237) GEO 23.94 0.003863
8 SDS III-7 (Satellite Data System) NRO L-38, Dr... GEO 23.94 0.003863
9 QZS-1 (Quazi-Zenith Satellite System, Michibiki) GEO 1436.00 0.004522

We see a couple of oddities. There are satellites with 24-minute periods. It appears that these entries are in hours rather than minutes. There are other entries that have too-short periods, which appear to be decimal errors.

NOTE: We have reported these errors to the database maintainers.

Request for help to improve the database

We need your help to improve the quality of this database and BayesDB snapshot. Here are some ideas to get you started:

Detecting and patching bad data. We've found many ETL errors using simple queries that find satellites with the least likely measured values. This dataset appears to have been mirrored many times, so fixing errors could help many downstream users.

Assuring model and inference quality. We've shown our exploratory analysis to domain experts from NASA and to MIT colleagues. How much room is there to improve results by customizing the model and fixing inference quality bugs? and how easy is it for domain experts to distinguish between model problems, data problems, and their own misconceptions?

We're also interested in exploring new applications. For example:

Build a satellite search engine for satellite spotters. This could answer queries like "I see a satellite with a particular velocity that I know is not a communications satellite. What are the most probable matches?"

What-if scenarios. Given a new Chinese project to launch a communications satellite, who are the most likely contractors? and how heavy is it likely to be? What about if it had been launched in 1970?

If you are a satellite (or satellite industry) enthusiast, and want to help out, please contact us about additional possibilities, such as as mapping the characteristics of contractors or automatically finding similar contractors and launch sites. These applications require analyzing other populations that can be derived from the same data source.


In [ ]:


Copyright (c) 2010-2016, MIT Probabilistic Computing Project

Licensed under Apache 2.0 (edit cell for details).