Module 9: Estimation


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
import scipy.stats as ss
import statsmodels

Kernel density estimation

Btw, here are some resources on KDE: https://yyiki.org/wiki/Kernel%20density%20estimation/ Feel free to check out if you want to learn more about KDE.

Let's import the IMDb data.


In [2]:
import vega_datasets

movies = vega_datasets.data.movies()
movies.head()


Out[2]:
Title US Gross Worldwide Gross US DVD Sales Production Budget Release Date MPAA Rating Running Time min Distributor Source Major Genre Creative Type Director Rotten Tomatoes Rating IMDB Rating IMDB Votes
0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0

Q: Can you drop rows that have NaN value in either IMDB Rating or Rotten Tomatoes Rating?


In [3]:
# TODO

We can plot histogram and KDE using pandas:


In [4]:
movies['IMDB Rating'].hist(bins=10, density=True)
movies['IMDB Rating'].plot(kind='kde')


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6d992cad0>

Or using seaborn:


In [5]:
sns.distplot(movies['IMDB Rating'], bins=15)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c965b1d0>

Q: Can you plot the histogram and KDE of the Rotten Tomatoes Rating?


In [6]:
# TODO: implement this using pandas


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9653b50>

We can get a random sample using the pandas' sample() function. The kdeplot() function in seaborn provides many options (like kernel types) to do KDE. Let's sample some data points and see how does KDE plot changes with the size of the samples.


In [7]:
f = plt.figure(figsize=(15,8))
plt.xlim(0, 10)

sample_sizes = [10, 50, 100, 500, 1000, 2000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2,3,i)
    plt.title("Sample size: {}".format(N))
    for j in range(5):
        s = movies['IMDB Rating'].sample(N)
        sns.kdeplot(s, kernel='gau', legend=False)


Let's try all kernel types supported by seaborn's kdeplot(). Plot the same 2x3 grid with all kernels: https://seaborn.pydata.org/generated/seaborn.kdeplot.html#seaborn.kdeplot To see how do the kernels look like, just sample 2 data points and plot them. if you get an error, just re-run it.


In [8]:
# Implement here


Q: We can also play with the bandwidth option. Make sure to set the xlim so that all plots have the same x range, so that we can compare.


In [9]:
f = plt.figure(figsize=(15,8))
bw = ['scott', 'silverman', 0.01, 0.1, 1, 5]
sample_size = 10
kernel = 'gau'

# Implement here


Q: What's your takeaway? Explain how bandwidth affects the result of your visualization.


In [ ]:

Interpolation

One area where interpolation is used a lot is image processing. Play with it!

https://matplotlib.org/examples/images_contours_and_fields/interpolation_methods.html


In [10]:
methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16',
           'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric',
           'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos']
np.random.seed(0)
grid = np.random.rand(4, 4)

plt.imshow(grid, interpolation=None, cmap='viridis')


Out[10]:
<matplotlib.image.AxesImage at 0x7fc6c9f24cd0>

In [11]:
plt.imshow(grid, interpolation='bicubic', cmap='viridis')


Out[11]:
<matplotlib.image.AxesImage at 0x7fc6c9e3fc10>

Let's look at some time series data.


In [12]:
co2 = vega_datasets.data.co2_concentration()
co2.head()


Out[12]:
Date CO2 adjusted CO2
0 1958-03-01 315.70 314.44
1 1958-04-01 317.46 315.16
2 1958-05-01 317.51 314.71
3 1958-07-01 315.86 315.19
4 1958-08-01 314.93 316.19

In [13]:
co2.drop(['adjusted CO2'], axis=1, inplace=True)

In [14]:
co2.Date.dtype


Out[14]:
dtype('O')

The Date colume is stored as strings. Let's convert it to datetime so that we can manipulate.


In [15]:
pd.to_datetime(co2.Date).head()


Out[15]:
0   1958-03-01
1   1958-04-01
2   1958-05-01
3   1958-07-01
4   1958-08-01
Name: Date, dtype: datetime64[ns]

In [16]:
co2.Date = pd.to_datetime(co2.Date)

In [17]:
co2.set_index('Date', inplace=True)
co2.head()


Out[17]:
CO2
Date
1958-03-01 315.70
1958-04-01 317.46
1958-05-01 317.51
1958-07-01 315.86
1958-08-01 314.93

In [18]:
co2.plot()


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e6fcd0>

😢


In [19]:
recent_co2 = co2.tail(8)

In [20]:
recent_co2.plot()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e58510>

This standard line chart above can be considered as a chart with linear interpolation between data points.

The data contains measurements at the resolution of about a month. Let's up-sample the data. This process create new rows that fill the gap between data points. We can use interpolate() function to fill the gaps.


In [21]:
upsampled = recent_co2.resample('D')
upsampled.interpolate().head()


Out[21]:
CO2
Date
2019-09-01 408.550
2019-09-02 408.546
2019-09-03 408.542
2019-09-04 408.538
2019-09-05 408.534

If we do linear interpolation, we get more or less the same plot, but just with more points.


In [22]:
recent_co2.resample('D').interpolate(method='linear').plot(style='o-')


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9d7e990>

In [23]:
recent_co2.plot(style='o-')


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e51410>

Nearest interpolation is just a process of assigning the nearest value to each missing rows.


In [24]:
ax = recent_co2.resample('D').interpolate(method='nearest').plot(style='o-')
recent_co2.plot(ax=ax, style='o', ms=5)


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9d83b10>

Let's try a spline too.


In [25]:
ax = recent_co2.resample('D').interpolate(method='spline', order=5).plot(style='+-', lw=1)
recent_co2.plot(ax=ax, style='o', ms=5)


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9d6bd90>

Moving average

Pandas has a nice method called rolling(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html

It lets you do operations on the rolling windows. For instance, if you want to calculate the moving average, you can simply


In [26]:
ax = co2[-100:].plot(lw=0.5)
co2[-100:].rolling(12).mean().plot(ax=ax)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9f54990>

By default, it consider every data point inside each window equally (win_type=None) but there are many window types supported by scipy. Also by default, the mean value is put at the right end of the window (trailing average).

Q: can you create a plot with triang window type and centered average?


In [27]:
# Implement here


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e71e90>

Examining relationsips

Remember Anscombe's quartet? Actually, the dataset is not only included in vega_datasets but also in seaborn.


In [28]:
df = sns.load_dataset("anscombe")
df.head()


Out[28]:
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33

All four datasets are in this single data frame and the 'dataset' indicator is one of the columns. This is a form often called tidy data, which is easy to manipulate and plot. In tidy data, each row is an observation and columns are the properties of the observation. Seaborn makes use of the tidy form. Using seaborn's lmplot, you can very quickly examine relationships between variables, separated by some facets of the dataset.

Q: Can you produce the plot below using lmplot()?


In [29]:
# plotting parameters you can use
palette = "muted"
scatter_kws={"s": 50, "alpha": 1}
ci=None
height=4

# Implement


Out[29]:
<seaborn.axisgrid.FacetGrid at 0x7fc6c9f90d10>

Q: So let's look at the relationship between IMDB Rating and Rotten Tomatoes Rating in the movies dataset, separated with respect to MPAA Rating. Put 4 plots in a row.


In [30]:
# Implement


Out[30]:
<seaborn.axisgrid.FacetGrid at 0x7fc6ca063790>

It may be interesting to dig up what are the movies that have super high Rotten Tomatoes rating and super low IMDB rating (and vice versa)!

Another useful method for examining relationships is jointplot(), which produces a scatter plot with two marginal histograms.


In [31]:
g = sns.jointplot(movies['IMDB Rating'], movies['Rotten Tomatoes Rating'], s=5, alpha=0.2, facecolors='none', edgecolors='b')


Hexbin density plot

In 2D, heatmap can be considered as a color-based histogram. You divide the space into bins and show the frequency with colors. A common binning method is the hexagonal bin.

We can again use the jointplot() and setting the kind to be hexbin.

Q: Can you create one?


In [32]:
# implement


2D KDE

We can also do 2D KDE using seaborn's kdeplot() function.

Q: Can you draw one like this?


In [33]:
cmap = "Reds"
shade = True         # what happens if you change this?
shade_lowest = True  # what happens if you change this?

# implement


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6ca1de7d0>

Or again using jointplot() by setting the kind parameter. Look, we also have the 1D marginal KDE plots!

Q: create jointplot with KDE

(Note that x-axis is in log-scale here)


In [34]:
# Implement: draw a joint plot with bivariate KDE as well as marginal distributions with KDE


Out[34]:
<seaborn.axisgrid.JointGrid at 0x7fc6ca1faad0>

In [ ]: