In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
import scipy.stats as ss
import statsmodels
Here are some resources on KDE: https://yyiki.org/wiki/Kernel%20density%20estimation/ Feel free to check them out if you want to learn more about KDE.
Let's import the IMDb data.
In [2]:
import vega_datasets
movies = vega_datasets.data.movies()
movies.head()
Out[2]:
Q: Can you drop the rows that have a NaN value in either IMDB Rating or Rotten Tomatoes Rating?
In [3]:
# TODO
We can plot histogram and KDE using pandas:
In [4]:
movies['IMDB Rating'].hist(bins=10, density=True)
movies['IMDB Rating'].plot(kind='kde')
Out[4]:
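For intuition about what the KDE curve represents, here is a minimal sketch that builds a Gaussian KDE by hand with NumPy. It uses synthetic ratings rather than the movies data, and the function name gaussian_kde_1d and the bandwidth value are illustrative assumptions, not what pandas or seaborn call internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance from every grid point to every sample, in bandwidth units.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Dividing by n * bandwidth makes the curve integrate to 1.
    return bumps.sum(axis=1) / (len(samples) * bandwidth)

# Synthetic "ratings" (an assumption made only for this illustration).
rng = np.random.default_rng(0)
samples = rng.normal(loc=6.0, scale=1.0, size=200)

grid = np.linspace(0, 12, 1201)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density should have total area close to 1.
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

Each data point contributes one small bell curve, and the KDE is their average, which is why the result looks like a smoothed histogram.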
Or using seaborn:
In [5]:
sns.distplot(movies['IMDB Rating'], bins=15)
Out[5]:
Q: Can you plot the histogram and KDE of the Rotten Tomatoes Rating?
In [6]:
# TODO: implement this using pandas
Out[6]:
In [7]:
f = plt.figure(figsize=(15,8))
sample_sizes = [10, 50, 100, 500, 1000, 2000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2,3,i)
    plt.title("Sample size: {}".format(N))
    plt.xlim(0, 10)  # same x range in every panel, so the panels are comparable
    # draw five independent samples of size N to see how much the KDE varies
    for j in range(5):
        s = movies['IMDB Rating'].sample(N)
        sns.kdeplot(s, kernel='gau', legend=False)
Let's try all the kernel types supported by seaborn's kdeplot(). Plot the same 2x3 grid with all kernels: https://seaborn.pydata.org/generated/seaborn.kdeplot.html#seaborn.kdeplot To see what the kernels look like, just sample 2 data points and plot them. If you get an error, just re-run it.
In [8]:
# Implement here
Q: We can also play with the bandwidth option. Make sure to set the xlim so that all plots have the same x range and can be compared.
In [9]:
f = plt.figure(figsize=(15,8))
bw = ['scott', 'silverman', 0.01, 0.1, 1, 5]
sample_size = 10
kernel = 'gau'
# Implement here
Q: What's your takeaway? Explain how bandwidth affects the result of your visualization.
In [ ]:
One area where interpolation is used a lot is image processing. Play with it!
https://matplotlib.org/examples/images_contours_and_fields/interpolation_methods.html
In [10]:
methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16',
'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric',
'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos']
np.random.seed(0)
grid = np.random.rand(4, 4)
plt.imshow(grid, interpolation=None, cmap='viridis')
Out[10]:
In [11]:
plt.imshow(grid, interpolation='bicubic', cmap='viridis')
Out[11]:
Let's look at some time series data.
In [12]:
co2 = vega_datasets.data.co2_concentration()
co2.head()
Out[12]:
In [13]:
co2.drop(['adjusted CO2'], axis=1, inplace=True)
In [14]:
co2.Date.dtype
Out[14]:
The Date column is stored as strings. Let's convert it to datetime so that we can manipulate it.
In [15]:
pd.to_datetime(co2.Date).head()
Out[15]:
In [16]:
co2.Date = pd.to_datetime(co2.Date)
In [17]:
co2.set_index('Date', inplace=True)
co2.head()
Out[17]:
In [18]:
co2.plot()
Out[18]:
😢
In [19]:
recent_co2 = co2.tail(8)
In [20]:
recent_co2.plot()
Out[20]:
The standard line chart above can be thought of as a chart with linear interpolation between data points.
The data contains measurements at a resolution of about a month. Let's up-sample the data. This process creates new rows that fill the gaps between data points. We can use the interpolate() function to fill the gaps.
In [21]:
upsampled = recent_co2.resample('D')
upsampled.interpolate().head()
Out[21]:
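To see what up-sampling plus interpolation does on something small enough to check by hand, here is a minimal sketch on a made-up two-point series (the dates and values are assumptions for illustration, not CO2 data).

```python
import pandas as pd

# Two "monthly" measurements, 31 days apart (toy values).
monthly = pd.Series(
    [0.0, 31.0],
    index=pd.to_datetime(["2020-01-01", "2020-02-01"]),
)

# resample('D') creates one row per day; the new rows start out empty
# and interpolate() fills them on a straight line between known points.
daily = monthly.resample("D").interpolate(method="linear")

print(len(daily))                          # one row per day, endpoints included
print(daily.loc[pd.Timestamp("2020-01-11")])
```

Because the toy values rise by exactly one unit per day, every interpolated value equals the number of days since the first measurement, which makes the result easy to verify.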
If we do linear interpolation, we get more or less the same plot, just with more points.
In [22]:
recent_co2.resample('D').interpolate(method='linear').plot(style='o-')
Out[22]:
In [23]:
recent_co2.plot(style='o-')
Out[23]:
Nearest interpolation simply assigns the value of the nearest data point to each missing row.
In [24]:
ax = recent_co2.resample('D').interpolate(method='nearest').plot(style='o-')
recent_co2.plot(ax=ax, style='o', ms=5)
Out[24]:
Let's try a spline too.
In [25]:
ax = recent_co2.resample('D').interpolate(method='spline', order=5).plot(style='+-', lw=1)
recent_co2.plot(ax=ax, style='o', ms=5)
Out[25]:
Pandas has a nice method called rolling(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
It lets you do operations on rolling windows. For instance, if you want to calculate the moving average, you can simply do:
In [26]:
ax = co2[-100:].plot(lw=0.5)
co2[-100:].rolling(12).mean().plot(ax=ax)
Out[26]:
By default, it weights every data point inside each window equally (win_type=None), but many other window types from scipy are supported. Also by default, the mean value is placed at the right end of the window (a trailing average).
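The difference between a trailing and a centered window is easiest to see on a tiny series. The sketch below uses made-up numbers and the default equal weighting.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: the result is stored at the right end of each window,
# so the first two positions have no complete window yet.
trailing = s.rolling(3).mean()

# center=True stores the result at the middle of each window instead,
# so one position at each end is left empty.
centered = s.rolling(3, center=True).mean()

print(pd.DataFrame({"value": s, "trailing": trailing, "centered": centered}))
```

The same three averages appear in both columns; only their alignment with the original data changes.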
Q: Can you create a plot with the triang window type and a centered average?
In [27]:
# Implement here
Out[27]:
Remember Anscombe's quartet? The dataset is included not only in vega_datasets but also in seaborn.
In [28]:
df = sns.load_dataset("anscombe")
df.head()
Out[28]:
All four datasets are in this single data frame, and the 'dataset' indicator is one of the columns. This form is often called tidy data, and it is easy to manipulate and plot. In tidy data, each row is an observation and the columns are the properties of that observation. Seaborn makes use of the tidy form. Using seaborn's lmplot, you can very quickly examine relationships between variables, separated by some facets of the dataset.
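As a rough illustration of the tidy form, the sketch below reshapes a small made-up "wide" table (one column per dataset) into one row per observation using pandas' melt(). The column names and values are assumptions chosen for the example, not the Anscombe data.

```python
import pandas as pd

# Wide form: each dataset's y values live in their own column.
wide = pd.DataFrame({
    "x": [10, 8],
    "I": [8.04, 6.95],
    "II": [9.14, 8.14],
})

# Tidy form: one row per observation, with the dataset name as a column.
tidy = wide.melt(id_vars="x", var_name="dataset", value_name="y")

print(tidy)
```

Once the dataset label is an ordinary column, it can be handed to plotting functions as a grouping or faceting variable.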
Q: Can you produce the plot below using lmplot()?
In [29]:
# plotting parameters you can use
palette = "muted"
scatter_kws={"s": 50, "alpha": 1}
ci=None
height=4
# Implement
Out[29]:
Q: So let's look at the relationship between IMDB Rating and Rotten Tomatoes Rating in the movies dataset, separated with respect to MPAA Rating. Put 4 plots in a row.
In [30]:
# Implement
Out[30]:
It may be interesting to dig up which movies have a super high Rotten Tomatoes rating and a super low IMDB rating (and vice versa)!
Another useful method for examining relationships is jointplot(), which produces a scatter plot with two marginal histograms.
In [31]:
# pass x and y by keyword; newer seaborn versions no longer accept them positionally
g = sns.jointplot(x=movies['IMDB Rating'], y=movies['Rotten Tomatoes Rating'], s=5, alpha=0.2, facecolors='none', edgecolors='b')
In 2D, a heatmap can be considered a color-based histogram. You divide the space into bins and show the frequency with colors. A common binning method is the hexagonal bin.
We can again use jointplot(), this time setting kind to 'hex'.
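The "divide the space into bins and count" idea can be sketched numerically with NumPy's histogram2d(). Note the assumptions: this uses square bins rather than hexagons, and random synthetic points rather than the movie ratings.

```python
import numpy as np

# Synthetic 2D points (an assumption made only for this illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(size=500)

# Count how many points fall into each cell of a 10 x 10 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape)        # one count per bin
print(int(counts.sum()))   # every point lands in exactly one bin
```

A heatmap is then just this count matrix drawn with a colormap; hexagonal binning follows the same logic with a different cell shape.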
Q: Can you create one?
In [32]:
# implement
In [33]:
cmap = "Reds"
shade = True # what happens if you change this?
shade_lowest = True # what happens if you change this?
# implement
Out[33]:
Or again using jointplot() by setting the kind parameter. Look, we also have the 1D marginal KDE plots!
Q: Create a jointplot with KDE. (Note that the x-axis is in log-scale here.)
In [34]:
# Implement: draw a joint plot with bivariate KDE as well as marginal distributions with KDE
Out[34]:
In [ ]: