In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
%matplotlib inline
In [2]:
import matplotlib
matplotlib.__version__
Out[2]:
Let's revisit the table from class:
| Hours | Frequency |
|---|---|
| 0-1 | 4,300 |
| 1-3 | 6,900 |
| 3-5 | 4,900 |
| 5-10 | 2,000 |
| 10-24 | 2,100 |
You can draw a histogram by just providing bins and counts instead of a list of numbers. So, let's try that.
In [3]:
bins = [0, 1, 3, 5, 10, 24]
data = {0.5: 4300, 2: 6900, 4: 4900, 7: 2000, 15: 2100}
In [4]:
data.keys()
Out[4]:
Q: Draw a histogram using this data. Don't normalize it for now. Useful query: a Google search for "matplotlib histogram pre-counted".
In [5]:
# TODO: draw a histogram with weighted data.
Out[5]:
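One possible solution, as a sketch: pass the bin midpoints (the keys of `data`) as the values and the pre-counted frequencies as `weights`, using the `bins` defined above.
plt.hist(list(data.keys()), bins=bins, weights=list(data.values()))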
As you can see, the default histogram does not normalize by bin width and simply shows the counts! This can be very misleading if you are working with variable bin widths (e.g., logarithmic bins). So please be mindful when you work with variable bins.
Q: You can fix this by using the `density` option.
In [6]:
# TODO: fix it with density option.
Out[6]:
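A sketch of the fix: the same call with `density=True`, so the bar heights are normalized by the bin widths and the total area of the bars is 1.
plt.hist(list(data.keys()), bins=bins, weights=list(data.values()), density=True)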
In [7]:
import vega_datasets
In [8]:
movies = vega_datasets.data.movies()
movies.head()
Out[8]:
Let's plot the histogram of IMDB ratings.
In [9]:
plt.hist(movies['IMDB Rating'])
Out[9]:
Did you get an error or a warning? What's going on?
The problem is that the column contains `NaN` (Not a Number) values, which represent missing data points. The following command checks whether each value is `NaN` and returns the result.
In [10]:
movies['IMDB Rating'].isna()
Out[10]:
As you can see, there are a bunch of missing values. You can count them,
In [11]:
sum(movies['IMDB Rating'].isna())
Out[11]:
or drop them.
In [12]:
IMDB_ratings_nan_dropped = movies['IMDB Rating'].dropna()
len(IMDB_ratings_nan_dropped)
Out[12]:
In [13]:
213 + 2988
Out[13]:
`dropna` can be applied to the dataframe too.
Q: Drop rows from the `movies` dataframe where either `IMDB Rating` or `IMDB Votes` is `NaN`.
In [14]:
# TODO
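# One possible solution, as a sketch: dropna with the subset parameter
# restricted to the two columns of interest.
movies = movies.dropna(subset=['IMDB Rating', 'IMDB Votes'])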
In [15]:
# Both should be zero.
print(sum(movies['IMDB Rating'].isna()), sum(movies['IMDB Votes'].isna()))
How does `matplotlib` decide the bins? Actually, `matplotlib`'s `hist` function uses `numpy`'s `histogram` function under the hood.
Q: Plot the histogram of movie ratings (`IMDB Rating`) using the `plt.hist()` function.
In [16]:
# TODO
Out[16]:
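For instance, now that the rows with missing values are dropped, the earlier call should work without warnings:
plt.hist(movies['IMDB Rating'])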
Have you noticed that this function returns three objects? Take a look at the documentation to figure out what they are.
To get the three returned objects:
In [17]:
n_raw, bins_raw, patches = plt.hist(movies['IMDB Rating'])
print(n_raw)
print(bins_raw)
Here, `n_raw` contains the values of the histogram, i.e., the number of movies in each of the 10 bins. Thus, the sum of the elements in `n_raw` should be equal to the total number of movies.
Q: Test whether the sum of values in `n_raw` is equal to the number of movies in the `movies` dataset.
In [18]:
# TODO: test whether the sum of the numbers in n_raw is equal to the number of movies.
Out[18]:
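One way to test it, as a sketch (`n_raw` is a numpy array, so it has a `sum` method):
n_raw.sum() == len(movies)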
The second returned object (`bins_raw`) is a list containing the edges of the 10 bins: the first bin is [1.4, 2.18], the second [2.18, 2.96], and so on. What's the width of the bins?
In [19]:
np.diff(bins_raw)
Out[19]:
The width is the same as the maximum value minus the minimum value, divided by 10.
In [20]:
min_rating = min(movies['IMDB Rating'])
max_rating = max(movies['IMDB Rating'])
print(min_rating, max_rating)
print( (max_rating-min_rating) / 10 )
Now, let's plot a normalized (density) histogram.
In [21]:
n, bins, patches = plt.hist(movies['IMDB Rating'], density=True)
print(n)
print(bins)
The ten bins do not change. But now `n` represents the density of the data inside each bin. In other words, the sum of the areas of the bars will equal 1.
Q: Can you verify this?
Hint: the area of each bar is calculated as height * width. You may get something like 0.99999999999999978 instead of 1.
In [22]:
# TODO
Out[22]:
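A one-line check, as a sketch: multiply each bar's height by its bin width and sum.
np.sum(n * np.diff(bins))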
Anyway, the data generated by the `hist` function is calculated with `numpy`'s `histogram` function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
Note that the result of `np.histogram()` is the same as that of `plt.hist()`.
In [23]:
np.histogram(movies['IMDB Rating'])
Out[23]:
In [24]:
plt.hist(movies['IMDB Rating'])
Out[24]:
If you look at the documentation, you can see that `numpy` simply uses 10 as the default number of bins. But you can set it manually or set it to `'auto'`, which is the "Maximum of the `sturges` and `fd` estimators." Let's try this `auto` option.
In [25]:
_ = plt.hist(movies['IMDB Rating'], bins='auto')
Let's explore the effect of bin size using small multiples. In `matplotlib`, you can use `subplot` to put multiple plots into a single figure.
For instance, you can do something like:
In [26]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
movies['IMDB Rating'].hist(bins=3)
plt.subplot(1,2,2)
movies['IMDB Rating'].hist(bins=20)
Out[26]:
What does the argument in `plt.subplot(1,2,1)` mean? If you're not sure, check out: http://stackoverflow.com/questions/3584805/in-matplotlib-what-does-the-argument-mean-in-fig-add-subplot111
Q: Create 8 subplots (2 rows and 4 columns) with the following bin sizes.
In [27]:
nbins = [2, 3, 5, 10, 30, 40, 60, 100]
figsize = (18, 10)
# TODO
# ...
for i, bins in enumerate(nbins):
    # TODO: use subplot and hist() function to draw 8 plots
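One possible solution, as a sketch (a 2x4 grid, one histogram per entry in `nbins`):
plt.figure(figsize=figsize)
for i, bins in enumerate(nbins):
    plt.subplot(2, 4, i + 1)  # 2 rows, 4 columns, (i+1)-th panel
    movies['IMDB Rating'].hist(bins=bins)
    plt.title('{} bins'.format(bins))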
Do you see the issues with having too few bins or too many bins? In particular, do you notice weird patterns that emerge with `bins=30`?
Q: Can you guess why you see such patterns? What are the peaks and what are the empty bars? What do they tell you about choosing the bin size in histograms?
In [28]:
# TODO: Provide your answer and evidence here
Explanation: ...
In [33]:
N = len(movies)
plt.figure(figsize=(12,4))
# Sqrt
nbins = int(np.sqrt(N))
plt.subplot(1,3,1)
plt.title("SQRT, {} bins".format(nbins))
movies['IMDB Rating'].hist(bins=nbins)
# Sturge's formula
nbins = int(np.ceil(np.log2(N) + 1))
plt.subplot(1,3,2)
plt.title("Sturge, {} bins".format(nbins))
movies['IMDB Rating'].hist(bins=nbins)
# Freedman-Diaconis
iqr = np.percentile(movies['IMDB Rating'], 75) - np.percentile(movies['IMDB Rating'], 25)
width = 2*iqr/np.power(N, 1/3)
nbins = int((max(movies['IMDB Rating']) - min(movies['IMDB Rating'])) / width)
plt.subplot(1,3,3)
plt.title("F-D, {} bins".format(nbins))
movies['IMDB Rating'].hist(bins=nbins)
Out[33]:
We can also use the built-in formulae. Let's try all of them.
In [34]:
plt.figure(figsize=(20,4))
plt.subplot(161)
movies['IMDB Rating'].hist(bins='fd')
plt.subplot(162)
movies['IMDB Rating'].hist(bins='doane')
plt.subplot(163)
movies['IMDB Rating'].hist(bins='scott')
plt.subplot(164)
movies['IMDB Rating'].hist(bins='rice')
plt.subplot(165)
movies['IMDB Rating'].hist(bins='sturges')
plt.subplot(166)
movies['IMDB Rating'].hist(bins='sqrt')
Out[34]:
Some are decent, but several of them tend to overestimate the appropriate number of bins, especially as the number of data points grows. Particularly in our case, because of the limited precision of the ratings, we shouldn't increase the number of bins too much.
So what's the conclusion? Use Scott's rule or Sturges' formula?
No, I think the take-away is that you should understand how an inappropriate number of bins can mislead you, and that you should try multiple bin counts to obtain an accurate picture of the data. Although the default may work in most cases, don't blindly trust it! Don't judge the distribution of a dataset based on a single histogram. Try multiple parameters to get the full picture!
In [35]:
movies['IMDB Rating'].hist(cumulative=True)
Out[35]:
You can also combine it with options such as `histtype` and `density`.
In [36]:
movies['IMDB Rating'].hist(histtype='step', cumulative=True, density=True)
Out[36]:
And increase the number of bins.
In [37]:
movies['IMDB Rating'].hist(cumulative=True, density=True, bins=1000)
Out[37]:
This method works fine. By increasing the number of bins, you can get a CDF at the resolution you want. But let's also build it manually to better understand what's going on. First, we should sort all the values.
In [38]:
rating_sorted = movies['IMDB Rating'].sort_values()
rating_sorted.head()
Out[38]:
We need to know the number of data points:
In [39]:
N = len(rating_sorted)
N
Out[39]:
And I think this may be useful for you.
In [40]:
n = 50
np.linspace(1/n, 1.0, num=n)
Out[40]:
Q: Now you're ready to draw a proper CDF. Draw the CDF plot of this data.
In [41]:
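# One possible solution, as a sketch: plot the sorted values against
# evenly spaced cumulative fractions from 1/N to 1.
plt.plot(rating_sorted, np.linspace(1/N, 1.0, num=N))
plt.xlabel('IMDB Rating')
plt.ylabel('CDF')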
In [42]:
vega_datasets.data.movies.url
Out[42]:
In [43]:
# Choose based on your environment
# alt.renderers.enable('jupyterlab')
# alt.renderers.enable('notebook')
Out[43]:
As mentioned before, in `altair` a histogram is not special. It is just a plot that uses bars (`mark_bar()`), where the X axis is defined by `IMDB Rating` with bins (`bin=True`) and the Y axis is defined by the `count()` aggregation function.
In [44]:
alt.Chart(vega_datasets.data.movies.url).mark_bar().encode(
alt.X("IMDB Rating:Q", bin=True),
alt.Y('count()')
)
Out[44]:
Have you noticed that it is `IMDB Rating:Q`, not `IMDB Rating`? This is a shorthand for:
In [45]:
alt.Chart(vega_datasets.data.movies.url).mark_bar().encode(
alt.X('IMDB Rating', type='quantitative', bin=True),
alt.Y(aggregate='count', type='quantitative')
)
Out[45]:
In altair, you want to specify the data types using one of the four categories: quantitative, ordinal, nominal, and temporal. https://altair-viz.github.io/user_guide/encoding.html#data-types
Although you can adjust the bins in `altair`, it does not encourage you to set the bins directly. For instance, although there is a `step` parameter that directly sets the bin size, there are parameters such as `maxbins` (maximum number of bins), `minstep` (minimum allowable step size), and `nice` (attempts to make the bin boundaries more human-friendly) that encourage you not to specify the bins directly.
In [46]:
from altair import Bin
alt.Chart(vega_datasets.data.movies.url).mark_bar().encode(
alt.X("IMDB Rating:Q", bin=Bin(step=0.09)),
alt.Y('count()')
)
Out[46]:
In [47]:
alt.Chart(vega_datasets.data.movies.url).mark_bar().encode(
alt.X("IMDB Rating:Q", bin=Bin(nice=True, maxbins=20)),
alt.Y('count()')
)
Out[47]:
In [48]:
chart1 = alt.Chart(vega_datasets.data.movies.url).mark_bar().encode(
alt.X("IMDB Rating:Q", bin=Bin(step=0.1)),
alt.Y('count()')
).properties(
width=300,
height=150
)
chart2 = alt.Chart(vega_datasets.data.movies.url).mark_bar().encode(
alt.X("IMDB Rating:Q", bin=Bin(nice=True, maxbins=20)),
alt.Y('count()')
).properties(
width=300,
height=150
)
In [49]:
chart1 | chart2
Out[49]:
In [50]:
alt.hconcat(chart1, chart2)
Out[50]:
Vertical composition?
In [51]:
alt.vconcat(chart1, chart2)
Out[51]:
In [52]:
chart1 & chart2
Out[52]:
Shall we avoid some repetition? You can define an empty base chart first and then assign encodings later when you put multiple charts together. Here is an example: https://altair-viz.github.io/user_guide/compound_charts.html#repeated-charts
Q: Using the base chart approach, create a 2x2 chart where the top row shows two histograms of `IMDB Rating` with `maxbins` of 10 and 50, respectively, and the bottom row shows two histograms of `IMDB Votes` with `maxbins` of 10 and 50.
In [53]:
# TODO
Out[53]:
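One possible solution, as a sketch following the repeated-charts pattern (the helper function `make_hist` is hypothetical, just for illustration):
base = alt.Chart(vega_datasets.data.movies.url).mark_bar().properties(width=300, height=150)
def make_hist(field, maxbins):
    # specialize the shared base chart for one field and bin setting
    return base.encode(
        alt.X(field, type='quantitative', bin=Bin(maxbins=maxbins)),
        alt.Y('count()')
    )
(make_hist('IMDB Rating', 10) | make_hist('IMDB Rating', 50)) & (make_hist('IMDB Votes', 10) | make_hist('IMDB Votes', 50))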