Module 9: Estimation



In [1]:

    
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
import scipy.stats as ss
import statsmodels

Kernel density estimation

Btw, here are some resources on KDE: https://yyiki.org/wiki/Kernel%20density%20estimation/ Feel free to check out if you want to learn more about KDE.

Let's import the IMDb data.



In [2]:

    
import vega_datasets

movies = vega_datasets.data.movies()
movies.head()









    Out[2]:







  
    
      
      Title
      US Gross
      Worldwide Gross
      US DVD Sales
      Production Budget
      Release Date
      MPAA Rating
      Running Time min
      Distributor
      Source
      Major Genre
      Creative Type
      Director
      Rotten Tomatoes Rating
      IMDB Rating
      IMDB Votes
    
  
  
    
      0
      The Land Girls
      146083.0
      146083.0
      NaN
      8000000.0
      Jun 12 1998
      R
      NaN
      Gramercy
      None
      None
      None
      None
      NaN
      6.1
      1071.0
    
    
      1
      First Love, Last Rites
      10876.0
      10876.0
      NaN
      300000.0
      Aug 07 1998
      R
      NaN
      Strand
      None
      Drama
      None
      None
      NaN
      6.9
      207.0
    
    
      2
      I Married a Strange Person
      203134.0
      203134.0
      NaN
      250000.0
      Aug 28 1998
      None
      NaN
      Lionsgate
      None
      Comedy
      None
      None
      NaN
      6.8
      865.0
    
    
      3
      Let's Talk About Sex
      373615.0
      373615.0
      NaN
      300000.0
      Sep 11 1998
      None
      NaN
      Fine Line
      None
      Comedy
      None
      None
      13.0
      NaN
      NaN
    
    
      4
      Slam
      1009819.0
      1087521.0
      NaN
      1000000.0
      Oct 09 1998
      R
      NaN
      Trimark
      Original Screenplay
      Drama
      Contemporary Fiction
      None
      62.0
      3.4
      165.0

Q: Can you drop rows that have NaN value in either IMDB Rating or Rotten Tomatoes Rating?



In [3]:

    
# TODO

We can plot histogram and KDE using pandas:



In [4]:

    
movies['IMDB Rating'].hist(bins=10, density=True)
movies['IMDB Rating'].plot(kind='kde')









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6d992cad0>

Or using seaborn:



In [5]:

    
sns.distplot(movies['IMDB Rating'], bins=15)









    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c965b1d0>

Q: Can you plot the histogram and KDE of the Rotten Tomatoes Rating?



In [6]:

    
# TODO: implement this using pandas









    Out[6]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9653b50>

We can get a random sample using the pandas' sample() function. The kdeplot() function in seaborn provides many options (like kernel types) to do KDE. Let's sample some data points and see how does KDE plot changes with the size of the samples.



In [7]:

    
f = plt.figure(figsize=(15,8))
plt.xlim(0, 10)

sample_sizes = [10, 50, 100, 500, 1000, 2000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2,3,i)
    plt.title("Sample size: {}".format(N))
    for j in range(5):
        s = movies['IMDB Rating'].sample(N)
        sns.kdeplot(s, kernel='gau', legend=False)

Let's try all kernel types supported by seaborn's kdeplot(). Plot the same 2x3 grid with all kernels: https://seaborn.pydata.org/generated/seaborn.kdeplot.html#seaborn.kdeplot To see how do the kernels look like, just sample 2 data points and plot them. if you get an error, just re-run it.



In [8]:

    
# Implement here

Q: We can also play with the bandwidth option. Make sure to set the xlim so that all plots have the same x range, so that we can compare.



In [9]:

    
f = plt.figure(figsize=(15,8))
bw = ['scott', 'silverman', 0.01, 0.1, 1, 5]
sample_size = 10
kernel = 'gau'

# Implement here

Q: What's your takeaway? Explain how bandwidth affects the result of your visualization.



In [ ]:

Interpolation

One area where interpolation is used a lot is image processing. Play with it!

https://matplotlib.org/examples/images_contours_and_fields/interpolation_methods.html



In [10]:

    
methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16',
           'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric',
           'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos']
np.random.seed(0)
grid = np.random.rand(4, 4)

plt.imshow(grid, interpolation=None, cmap='viridis')









    Out[10]:





<matplotlib.image.AxesImage at 0x7fc6c9f24cd0>



In [11]:

    
plt.imshow(grid, interpolation='bicubic', cmap='viridis')









    Out[11]:





<matplotlib.image.AxesImage at 0x7fc6c9e3fc10>

Let's look at some time series data.



In [12]:

    
co2 = vega_datasets.data.co2_concentration()
co2.head()









    Out[12]:







  
    
      
      Date
      CO2
      adjusted CO2
    
  
  
    
      0
      1958-03-01
      315.70
      314.44
    
    
      1
      1958-04-01
      317.46
      315.16
    
    
      2
      1958-05-01
      317.51
      314.71
    
    
      3
      1958-07-01
      315.86
      315.19
    
    
      4
      1958-08-01
      314.93
      316.19



In [13]:

    
co2.drop(['adjusted CO2'], axis=1, inplace=True)



In [14]:

    
co2.Date.dtype









    Out[14]:





dtype('O')

The Date colume is stored as strings. Let's convert it to datetime so that we can manipulate.



In [15]:

    
pd.to_datetime(co2.Date).head()









    Out[15]:





0   1958-03-01
1   1958-04-01
2   1958-05-01
3   1958-07-01
4   1958-08-01
Name: Date, dtype: datetime64[ns]



In [16]:

    
co2.Date = pd.to_datetime(co2.Date)



In [17]:

    
co2.set_index('Date', inplace=True)
co2.head()



In [18]:

    
co2.plot()









    Out[18]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e6fcd0>

😢



In [19]:

    
recent_co2 = co2.tail(8)



In [20]:

    
recent_co2.plot()









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e58510>

This standard line chart above can be considered as a chart with linear interpolation between data points.

The data contains measurements at the resolution of about a month. Let's up-sample the data. This process create new rows that fill the gap between data points. We can use interpolate() function to fill the gaps.



In [21]:

    
upsampled = recent_co2.resample('D')
upsampled.interpolate().head()

If we do linear interpolation, we get more or less the same plot, but just with more points.



In [22]:

    
recent_co2.resample('D').interpolate(method='linear').plot(style='o-')









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9d7e990>



In [23]:

    
recent_co2.plot(style='o-')









    Out[23]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e51410>

Nearest interpolation is just a process of assigning the nearest value to each missing rows.



In [24]:

    
ax = recent_co2.resample('D').interpolate(method='nearest').plot(style='o-')
recent_co2.plot(ax=ax, style='o', ms=5)









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9d83b10>

Let's try a spline too.



In [25]:

    
ax = recent_co2.resample('D').interpolate(method='spline', order=5).plot(style='+-', lw=1)
recent_co2.plot(ax=ax, style='o', ms=5)









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9d6bd90>

Moving average

Pandas has a nice method called rolling(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html

It lets you do operations on the rolling windows. For instance, if you want to calculate the moving average, you can simply



In [26]:

    
ax = co2[-100:].plot(lw=0.5)
co2[-100:].rolling(12).mean().plot(ax=ax)









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9f54990>

By default, it consider every data point inside each window equally (win_type=None) but there are many window types supported by scipy. Also by default, the mean value is put at the right end of the window (trailing average).

Q: can you create a plot with triang window type and centered average?



In [27]:

    
# Implement here









    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6c9e71e90>

Examining relationsips

Remember Anscombe's quartet? Actually, the dataset is not only included in vega_datasets but also in seaborn.



In [28]:

    
df = sns.load_dataset("anscombe")
df.head()

All four datasets are in this single data frame and the 'dataset' indicator is one of the columns. This is a form often called tidy data, which is easy to manipulate and plot. In tidy data, each row is an observation and columns are the properties of the observation. Seaborn makes use of the tidy form. Using seaborn's lmplot, you can very quickly examine relationships between variables, separated by some facets of the dataset.

Q: Can you produce the plot below using lmplot()?



In [29]:

    
# plotting parameters you can use
palette = "muted"
scatter_kws={"s": 50, "alpha": 1}
ci=None
height=4

# Implement









    Out[29]:





<seaborn.axisgrid.FacetGrid at 0x7fc6c9f90d10>

Q: So let's look at the relationship between IMDB Rating and Rotten Tomatoes Rating in the movies dataset, separated with respect to MPAA Rating. Put 4 plots in a row.



In [30]:

    
# Implement









    Out[30]:





<seaborn.axisgrid.FacetGrid at 0x7fc6ca063790>

It may be interesting to dig up what are the movies that have super high Rotten Tomatoes rating and super low IMDB rating (and vice versa)!

Another useful method for examining relationships is jointplot(), which produces a scatter plot with two marginal histograms.



In [31]:

    
g = sns.jointplot(movies['IMDB Rating'], movies['Rotten Tomatoes Rating'], s=5, alpha=0.2, facecolors='none', edgecolors='b')

Hexbin density plot

In 2D, heatmap can be considered as a color-based histogram. You divide the space into bins and show the frequency with colors. A common binning method is the hexagonal bin.

We can again use the jointplot() and setting the kind to be hexbin.

Q: Can you create one?



In [32]:

    
# implement

2D KDE

We can also do 2D KDE using seaborn's kdeplot() function.

Q: Can you draw one like this?



In [33]:

    
cmap = "Reds"
shade = True         # what happens if you change this?
shade_lowest = True  # what happens if you change this?

# implement









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fc6ca1de7d0>

Or again using jointplot() by setting the kind parameter. Look, we also have the 1D marginal KDE plots!

Q: create jointplot with KDE

(Note that x-axis is in log-scale here)



In [34]:

    
# Implement: draw a joint plot with bivariate KDE as well as marginal distributions with KDE









    Out[34]:





<seaborn.axisgrid.JointGrid at 0x7fc6ca1faad0>



In [ ]:

	CO2
Date
2019-09-01	408.550
2019-09-02	408.546
2019-09-03	408.542
2019-09-04	408.538
2019-09-05	408.534

	dataset	x	y
0	I	10.0	8.04
1	I	8.0	6.95
2	I	13.0	7.58
3	I	9.0	8.81
4	I	11.0	8.33

	Title	US Gross	Worldwide Gross	US DVD Sales	Production Budget	Release Date	MPAA Rating	Running Time min	Distributor	Source	Major Genre	Creative Type	Director	Rotten Tomatoes Rating	IMDB Rating	IMDB Votes
0	The Land Girls	146083.0	146083.0	NaN	8000000.0	Jun 12 1998	R	NaN	Gramercy	None	None	None	None	NaN	6.1	1071.0
1	First Love, Last Rites	10876.0	10876.0	NaN	300000.0	Aug 07 1998	R	NaN	Strand	None	Drama	None	None	NaN	6.9	207.0
2	I Married a Strange Person	203134.0	203134.0	NaN	250000.0	Aug 28 1998	None	NaN	Lionsgate	None	Comedy	None	None	NaN	6.8	865.0
3	Let's Talk About Sex	373615.0	373615.0	NaN	300000.0	Sep 11 1998	None	NaN	Fine Line	None	Comedy	None	None	13.0	NaN	NaN
4	Slam	1009819.0	1087521.0	NaN	1000000.0	Oct 09 1998	R	NaN	Trimark	Original Screenplay	Drama	Contemporary Fiction	None	62.0	3.4	165.0

	Date	CO2	adjusted CO2
0	1958-03-01	315.70	314.44
1	1958-04-01	317.46	315.16
2	1958-05-01	317.51	314.71
3	1958-07-01	315.86	315.19
4	1958-08-01	314.93	316.19