Full article at pbpython.com
In [1]:
import pandas as pd
import pdvega
In [2]:
%matplotlib inline
Read in the FiveThirtyEight data on candy
In [3]:
df = pd.read_csv("https://github.com/fivethirtyeight/data/blob/master/candy-power-ranking/candy-data.csv?raw=True")
In [4]:
# Clean up broken apostrophe
df['competitorname'] = df['competitorname'].replace(to_replace=r'Õ', value="'", regex=True)
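The same cleanup can be tried on a small, made-up Series before touching the real data. The names below are illustrative stand-ins rather than rows copied from the dataset, and the sketch keeps the returned Series instead of modifying anything in place.

```python
import pandas as pd

# Hypothetical sample names containing the mis-encoded apostrophe
names = pd.Series(["ReeseÕs Pieces", "HersheyÕs Kisses", "Twix"])

# Replace the stray character with a plain apostrophe and keep the result
cleaned = names.str.replace("Õ", "'", regex=False)

print(cleaned.tolist())
```

Names without the stray character pass through unchanged, so the replacement is safe to run over the whole column.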
In [5]:
df.head()
Out[5]:
Try a pandas plot first
In [6]:
df["winpercent"].plot.hist()
Out[6]:
Try the same thing using pdvega
In [7]:
df["winpercent"].vgplot.hist()
KDE plots work as expected
In [8]:
df["sugarpercent"].vgplot.kde()
We can look at the sugar and price percentile distributions
In [9]:
df["sugarpercent"].vgplot.hist()
In [10]:
df["pricepercent"].vgplot.hist()
In [11]:
df[["sugarpercent", "pricepercent"]].vgplot.hist()
Compare it to the pure pandas example
In [12]:
df[["sugarpercent", "pricepercent"]].plot.hist(alpha=0.5)
Out[12]:
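To compare the two distributions numerically rather than by eye, the counts can be computed on one shared set of bin edges. This is a minimal sketch with made-up values standing in for the two columns; the choice of five equal-width bins over 0 to 1 is an assumption for illustration, not what either plotting library does internally.

```python
import numpy as np
import pandas as pd

# Made-up percentile values standing in for the two columns
toy = pd.DataFrame({
    "sugarpercent": [0.1, 0.25, 0.5, 0.7, 0.9, 0.95],
    "pricepercent": [0.05, 0.3, 0.5, 0.55, 0.65, 0.85],
})

# Percentiles lie in [0, 1], so use five equal-width bins over that range
edges = np.linspace(0, 1, 6)

# Count how many values of each column fall into each shared bin
counts = {col: np.histogram(toy[col], bins=edges)[0] for col in toy.columns}

for col, values in counts.items():
    print(col, values.tolist())
```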
Let's try some scatter plots
In [13]:
df.vgplot.scatter(x='pricepercent', y='sugarpercent')
In [14]:
df.vgplot.scatter(x='winpercent', y='sugarpercent')
The pandas version does not look as nice
In [15]:
df.plot.scatter(x='winpercent', y='sugarpercent', c='bar')
Out[15]:
pdvega supports encoding the size and color based on values in columns of the dataframe
In [16]:
df.vgplot.scatter(x='winpercent', y='sugarpercent', s='pricepercent', c='bar')
The scatter matrix is really helpful
In [17]:
pdvega.scatter_matrix(df[["sugarpercent", "winpercent", "pricepercent"]], "winpercent")
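A correlation table gives a rough numeric counterpart to what the scatter matrix shows visually. The sketch below uses made-up values in place of the real columns, so the numbers it prints say nothing about the candy data itself.

```python
import pandas as pd

# Made-up values standing in for the three columns
toy = pd.DataFrame({
    "sugarpercent": [0.1, 0.4, 0.6, 0.9],
    "winpercent": [30.0, 45.0, 50.0, 70.0],
    "pricepercent": [0.2, 0.5, 0.4, 0.9],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr()

print(corr.round(2))
```

On the real data, the equivalent call would be `df[["sugarpercent", "winpercent", "pricepercent"]].corr()`.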
Here's a simple bar chart. Unfortunately, I could not figure out how to sort it by winpercent.
In [18]:
df.sort_values(by=['winpercent'], ascending=False).head(10)
Out[18]:
In [19]:
df.sort_values(by=['winpercent'], ascending=False).head(15).plot.barh(x='competitorname', y='winpercent')
Out[19]:
In [20]:
df.sort_values(by=['winpercent'], ascending=False).head(15).vgplot.barh(x='competitorname', y='winpercent')
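For the pandas version, one possible workaround for the ordering is to arrange the rows before plotting. Matplotlib draws the first row of a horizontal bar chart at the bottom, so sorting the top entries in ascending order puts the highest value at the top. This is a sketch on made-up data that only prepares the frame; the plotting call is left as a comment, and whether pdvega honours the row order is not something it addresses.

```python
import pandas as pd

# Made-up names and scores standing in for the candy data
toy = pd.DataFrame({
    "competitorname": ["A", "B", "C", "D"],
    "winpercent": [55.0, 80.0, 40.0, 70.0],
})

# Keep the three highest-rated rows, then order them ascending for barh
top = toy.nlargest(3, "winpercent").sort_values("winpercent")

# top.plot.barh(x="competitorname", y="winpercent")  # requires matplotlib
print(top["competitorname"].tolist())
```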