Introduction to pdvega

Full article at pbpython.com


In [1]:
import pandas as pd
import pdvega

In [2]:
%matplotlib inline

Read in the FiveThirtyEight data on candy


In [3]:
df = pd.read_csv("https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv?raw=True")

In [4]:
# Clean up broken apostrophe
df['competitorname'].replace(regex=True,inplace=True,to_replace=r'Õ',value=r"'")

In [5]:
df.head()


Out[5]:
competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
0 100 Grand 1 0 1 0 0 1 0 1 0 0.732 0.860 66.971725
1 3 Musketeers 1 0 0 0 1 0 0 1 0 0.604 0.511 67.602936
2 One dime 0 0 0 0 0 0 0 0 0 0.011 0.116 32.261086
3 One quarter 0 0 0 0 0 0 0 0 0 0.011 0.511 46.116505
4 Air Heads 0 1 0 0 0 0 0 0 0 0.906 0.511 52.341465

Try a pandas plot first


In [6]:
df["winpercent"].plot.hist()


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73b242b780>

Try the same thing using pdvega


In [7]:
df["winpercent"].vgplot.hist()


KDE plots work as expected


In [8]:
df["sugarpercent"].vgplot.kde()


We can look at the sugar and price percentile distributions


In [9]:
df["sugarpercent"].vgplot.hist()



In [10]:
df["pricepercent"].vgplot.hist()



In [11]:
df[["sugarpercent", "pricepercent"]].vgplot.hist()


Compare it to the pure pandas example


In [12]:
df[["sugarpercent", "pricepercent"]].plot.hist(alpha=0.5)


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73a1d405c0>

Let's try some scatter plots


In [13]:
df.vgplot.scatter(x='pricepercent', y='sugarpercent')



In [14]:
df.vgplot.scatter(x='winpercent', y='sugarpercent')


The pandas version does not look as nice


In [15]:
df.plot.scatter(x='winpercent', y='sugarpercent', c='bar')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73a1d5ff98>

pdvega suppports encoding the size and color based on values in columns of the dataframe


In [16]:
df.vgplot.scatter(x='winpercent', y='sugarpercent', s='pricepercent', c='bar')


The scatter matrix is really helpful


In [17]:
pdvega.scatter_matrix(df[["sugarpercent", "winpercent", "pricepercent"]], "winpercent")


Here's a simple bar chart. Unfortunately I could not figure out how to sort by the winpercent


In [18]:
df.sort_values(by=['winpercent'], ascending=False).head(10)


Out[18]:
competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
52 Reese's Peanut Butter cup 1 0 0 1 0 0 0 0 0 0.720 0.651 84.180290
51 Reese's Miniatures 1 0 0 1 0 0 0 0 0 0.034 0.279 81.866257
79 Twix 1 0 1 0 0 1 0 1 0 0.546 0.906 81.642914
28 Kit Kat 1 0 0 0 0 1 0 1 0 0.313 0.511 76.768600
64 Snickers 1 0 1 1 1 0 0 1 0 0.546 0.651 76.673782
53 Reese's pieces 1 0 0 1 0 0 0 0 1 0.406 0.651 73.434990
36 Milky Way 1 0 1 0 1 0 0 1 0 0.604 0.651 73.099556
54 Reese's stuffed with pieces 1 0 0 1 0 0 0 0 0 0.988 0.651 72.887901
32 Peanut butter M&M's 1 0 0 1 0 0 0 0 1 0.825 0.651 71.465050
42 Nestle Butterfinger 1 0 0 1 0 0 0 1 0 0.604 0.767 70.735641

In [19]:
df.sort_values(by=['winpercent'], ascending=False).head(15).plot.barh(x='competitorname', y='winpercent')


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f73a0497080>

In [20]:
df.sort_values(by=['winpercent'], ascending=False).head(15).vgplot.barh(x='competitorname', y='winpercent')



In [ ]: