In this notebook, we'll define an instaquery() function that lets you:
pandas.query syntaxThis tiny function can be handy for quick, throwaway exploration that you do not want captured permanently in the notebook (e.g., exploration off the primary track).
In [1]:
%matplotlib inline
In [105]:
from IPython.display import display, Image
from IPython.html.widgets import interact_manual
In [78]:
def instaquery(df, renderer=lambda df, by: display(df)):
'''
Creates an interactive query widget with an optional custom renderer.
df: DataFrame to query
renderer: Render function of the form lambda df, by where df is the subset of the DataFrame rows
matching the query and by is the column selected for a group-by option. The default render
function simply displays the rows matching the query and ignores the group-by.
'''
by_vals = tuple(['---'] + list(df.columns))
@interact_manual(query='', by=by_vals)
def instaquery(query, by):
'''Inner function that gets called when the user interacts with the widgets.'''
try:
sub_df = df.query(query)
except Exception:
sub_df = df
# replace sentinel with None
by = None if by == '---' else by
renderer(sub_df, by)
It doesn't look like much, but hers's a screenshot of just one thing it can do.
In [107]:
Image('./instaquery.png', retina=True, )
Out[107]:
In [97]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import seaborn as sns
In [98]:
df = sns.load_dataset('iris')
In [99]:
df.head()
Out[99]:
We initialize the instaquery with the iris DataFrame and a custom render function. In our function, we render plots of all pairwise column combinations. We color points / bars by a selected column.
species in the by dropdown and click Run instaquery again. You still see the full dataset, but with each feature vector color coded according to its species category. petal_width > 0.5 and petal_length < 5 and click Run instaquery. Now you only see the data that fall within the query parameters colorized according to species.
In [104]:
instaquery(df, lambda df, by: sns.pairplot(df, size=2.5, hue=by))
In [109]:
df = sns.load_dataset('tips')
In this dataset, not all columns have numeric values. Since we're interested in using pairplot, we need to filter out some columns from our pairings. But, at the same time, we gain more categorical columns to use for grouping.
In [111]:
df.head()
Out[111]:
In this invocation of instaquery, we pass a fixed list of columns of interest for ploitting. We also specify the plot fill opacity since our data is denser than in the iris dataset.
sex, day, smoker, etc.day, try the query total_bill * 0.15 < tip. This plots customer tips greater than 15% of the total bill amount colored by day.
In [113]:
instaquery(df, lambda df, by: sns.pairplot(df, vars=['tip', 'total_bill', 'size'], size=3.5, hue=by, plot_kws={'alpha' : 0.7}))
As a final example, we cease plotting and use a different renderer: a pandas table giving the basic summary stats of all numeric columns, optionally grouped by a column.
sex.sex with the query time == 'Lunch'.
In [115]:
instaquery(df, lambda df, by: display(df.groupby(by).describe()) if by else display(df.describe()))
In [126]:
sns.violinplot(df.tip, df.sex)
Out[126]: