Before we can query a population, we must have one. We will use a population of satellites as an example.
In this portion of the tutorial we will query and plot the undigested population data, and not their implications. All of the queries here are analogous to similar or identical queries in Structured Query Language (SQL).
For simplicity we still use the same pre-analyzed population database (.bdb file) as in other portions of the tutorial, even though it is not important that any probabilistic analysis has been done. If we have not yet downloaded that pre-analyzed database, do so now:
In [1]:
import os
import subprocess
if not os.path.exists('satellites.bdb'):
subprocess.check_call(['curl', '-O', 'http://probcomp.csail.mit.edu/bayesdb/downloads/satellites.bdb'])
We construct a Population instance that helps us read, query, and visualize a particular population.
In [2]:
from bdbcontrib import Population
satellites = Population(name='satellites', bdb_path='satellites.bdb')
Before querying the implications of a population, it can be useful to look at a sample of the raw data and metadata. This can be done using a combination of ordinary SQL and convenience functions built into bayeslite. We start by finding one of the most well-known satellites, the International Space Station:
In [3]:
satellites.q("""
SELECT * FROM satellites
WHERE Name LIKE 'International Space Station%'
""").transpose()
Out[3]:
In [4]:
satellites.q("""SELECT COUNT(*) FROM satellites;""")
Out[4]:
We can select multiple items using a SQL wildcard, in this case the match-anything '%' on either side of "GPS".
We ask for variables as rows and observations as columns by using .transpose()
as we did for the ISS above. By default, observations map to rows, and variables map to columns.
In [5]:
satellites.q("""SELECT * FROM satellites WHERE Name LIKE '%GPS%'""").transpose()
Out[5]:
Select just a few variables in the data, ordering by the number of minutes it takes for the satellite to complete one orbit, measured in minutes, and sorted ascending (as opposed to DESC
), again as in SQL:
In [6]:
satellites.q("""
SELECT name, dry_mass_kg, period_minutes, class_of_orbit FROM satellites
ORDER BY period_minutes ASC LIMIT 10;
""")
Out[6]:
Note that NaN is ordered before 0 in this sort.
Bayeslite includes statistical graphics procedures designed for easy use with data extracted from an SQL database.
Before we introduce those, let the notebook know that we would like to use and display matplotlib figures within the notebook:
In [7]:
%matplotlib inline
Let's see a menu of the easily available plotting utilities:
In [8]:
satellites.help("plot")
We will get more detailed help on each plotting utility as we introduce it.
The methods pairplot
and pairplot_vars
are intended to plot all pairs within a group of variables. The plots are arranged as a lower-triangular matrix of plots.
Along the diagonal, there are histograms with the values of the given variable along the x axis, and the counts of occurrences of those values (or bucketed ranges of those values) on the y axis.
The rest of the lower triangle plots the row variable on the y axis against the column variable on the x axis.
Different kinds of plots are used for categorical vs. numeric values.
The fuller documentation:
In [9]:
help(satellites.pairplot)
pairplot_vars
is a shortcut to help you just name the variables you want to see, rather than writing the BQL to select those variables. As we will see, you may often start with pairplot_vars, and decide to refine your query in BQL to focus on particular areas of interest:
In [10]:
satellites.pairplot_vars(['purpose', 'power_watts', 'launch_mass_kg'],
colorby='class_of_orbit', show_contour=False);
In [11]:
satellites.pairplot("""SELECT purpose, power_watts, launch_mass_kg, class_of_orbit
FROM satellites
WHERE purpose LIKE '%Meteorology%';""",
colorby='class_of_orbit', show_contour=False);
We might learn that meteorology satellites in geosynchronous orbit use about as much or more power than meteorology satellites in low-earth orbit (see power_watts row of plots), but that they use a little less power at a given mass (see scatter of launch mass vs. power_watts), and that there are no meteorology satellites in medium earth orbit or in elliptical orbits (class_of_orbit color legend box).
An expert might be able to help us interpret these observations, e.g. why certain orbits are preferred for meteorology, what the driving considerations are for power consumption and launch mass, etc., but pairplots are a powerful tool for visually finding questions to ask.
In [12]:
satellites.pairplot("""SELECT purpose, power_watts, launch_mass_kg, class_of_orbit
FROM satellites
WHERE purpose LIKE '%Meteorology%';""",
colorby='class_of_orbit', show_contour=True);
So when we show them, the way the underlying plotting utility works, we see suggestions of negative wattages and masses!
The contours in the power vs. mass plot also obscure the small number of data points, lending a false sense of meaning.
When there are enough data points, it can be useful to plot kernel density estimators (contours) on each plot, to see tendencies overlaid above the data points, so long as one keeps the above shortcomings in mind:
In [13]:
satellites.pairplot("""SELECT power_watts, launch_mass_kg
FROM satellites""",
show_contour=True);
In [14]:
satellites.pairplot_vars(['purpose', 'class_of_orbit']);
In [15]:
satellites.pairplot("""SELECT purpose, class_of_orbit FROM %t
GROUP BY purpose
HAVING COUNT(purpose) >= 5;""");
In [16]:
satellites.q('''SELECT apogee_km FROM %t WHERE period_minutes is NULL;''')
Out[16]:
When we pairplot these, normally that data point would simply be missing, but with show_missing
, there is a line indicating that period_minutes could be anything at an apogee around 35k.
In [17]:
satellites.pairplot_vars(['period_minutes', 'apogee_km'], show_missing=True);
In [18]:
satellites.pairplot("""
SELECT period_minutes / 60.0 as period_hours,
apogee_km / 1000.0 as apogee_x1000km FROM %t""",
show_missing=True, show_contour=False);
In [19]:
help(satellites.barplot)
In [20]:
satellites.barplot("""
SELECT class_of_orbit, count(*) AS class_count FROM satellites
GROUP BY class_of_orbit
ORDER BY class_count DESC
""");
Let's add the type of orbit too:
In [21]:
satellites.barplot("""
SELECT class_of_orbit || "--" || type_of_orbit as class_type,
count(*) AS class_type_count
FROM satellites
GROUP BY class_type
ORDER BY class_type_count DESC
""");
One can even do a bit of computation here, in this case computing and plotting the average power_watts, rather than the merely the count:
In [22]:
satellites.barplot("""
SELECT class_of_orbit || "--" || type_of_orbit as class_type,
sum(power_watts)/count(*) AS average_power
FROM satellites
GROUP BY class_type
ORDER BY average_power DESC
""");
In [23]:
help(satellites.histogram)
In [24]:
satellites.histogram("""SELECT dry_mass_kg FROM %t""", nbins=35);
We can break down that silhouette according to a categorical column that comes second.
We can also show percentages rather than absolute counts using normed
.
In [25]:
satellites.histogram("""
SELECT dry_mass_kg, class_of_orbit FROM satellites
WHERE dry_mass_kg < 5000
""", nbins=15, normed=True);
In [26]:
help(satellites.heatmap)
In [27]:
satellites.heatmap("""
SELECT users, country_of_operator, COUNT(country_of_operator) as country_count FROM %t
GROUP BY country_of_operator
HAVING COUNT(country_of_operator) >= 5;
""")
Out[27]:
In [28]:
satellites.heatmap("""
SELECT users, country_of_operator, COUNT(country_of_operator) as country_count FROM %t
GROUP BY country_of_operator
HAVING COUNT(country_of_operator) >= 5;""",
figsize=(12, 10))
Out[28]:
In [ ]:
Copyright (c) 2010-2016, MIT Probabilistic Computing Project
Licensed under Apache 2.0 (edit cell for details).
In [ ]: