Module 10: Logscale


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
import vega_datasets
%matplotlib inline

Ratio and logarithm

Visualizing ratios on a linear scale can be quite misleading.

Let's first create some ratios.


In [2]:
x = np.array([1,    1,   1,  1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1,  1,   1   ])
ratio = x/y
print(ratio)


[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
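As a quick numerical check of why a logarithm suits ratios, the sketch below takes log10 of the same ratios. The name `log_ratio` is introduced here only for illustration.

```python
import numpy as np

x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1])
ratio = x / y

# On a linear axis, every ratio below 1 is squeezed into the interval (0, 1),
# while the ratios above 1 spread out all the way to 1000.
# After log10, a ratio r and its inverse 1/r sit at equal distance from 0.
log_ratio = np.log10(ratio)
print(log_ratio)
```

The transformed values are evenly spaced and symmetric around 0, which is the position of ratio = 1.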

Q: Plot the ratios on a linear scale using the scatter() function. Also draw a horizontal line at ratio=1 as a reference.


In [11]:
# Implement


Out[11]:
Text(0,0.5,'Ratio')

Q: Explain what's bad about this plot.


In [ ]:

Q: Can you fix it?


In [ ]:
# Implement

Log-binning

Let's first see what happens if we do not use the log scale for a dataset with a heavy tail.

Q: Load the movie dataset from vega_datasets and remove the NaN rows based on the following three columns: IMDB_Rating, IMDB_Votes, Rotten_Tomatoes_Rating.
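The mechanics of dropping rows based on a subset of columns can be sketched on a small made-up dataframe. The frame `toy` and its values are invented for illustration and are not the movies data.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN marks a missing value.
toy = pd.DataFrame({
    "IMDB_Rating": [7.1, np.nan, 6.3, 8.0],
    "IMDB_Votes": [1200, 300, np.nan, 4500],
    "Title": ["A", "B", "C", None],
})

# Only the columns listed in `subset` are checked for missing values,
# so the last row survives even though its Title is missing.
cleaned = toy.dropna(subset=["IMDB_Rating", "IMDB_Votes"])
print(cleaned)
```

Note that the surviving rows keep their original index labels, so the index of the cleaned frame can have gaps.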


In [20]:
# Implement

If you simply call the hist() method on a dataframe object, it identifies all the numeric columns and draws a histogram for each.

Q: Draw all possible histograms of the movie dataframe. Adjust the size of the plots if needed.


In [26]:
# Implement


As we can see, a majority of the columns are not normally distributed. In particular, the histogram of the worldwide gross variable shows only a couple of bars with visible counts. Is this a problem of resolution? How about increasing the number of bins?

Q: Play with the number of bins, and then increase the number of bins to 200.


In [39]:
# Implement


Out[39]:
Text(0,0.5,'Frequency')

Maybe a bit more useful, but it still tells us nothing about how the data are distributed above a certain point.

Q: How about changing the vertical scale to a logarithmic scale?


In [38]:
# Implement


Out[38]:
Text(0,0.5,'Frequency')

Now, let's try log-binning. Recall that when plotting histograms we can specify the edges of the bins through the bins parameter. For example, we can set the bin edges to [0, 1, 2, ... , 10] as follows.


In [40]:
movies.IMDB_Rating.hist(bins=range(0,11))


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1102c5fd0>

Here, we can specify the edges of the bins in a similar way. Instead of spacing them evenly on the linear scale, we space them evenly in log space. Some useful resources:

Hint: since $10^{\text{start}} = \text{min(Worldwide_Gross)}$, $\text{start} = \log_{10}(\text{min(Worldwide_Gross)})$
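The hint can be tried out on a handful of made-up positive values. This is a minimal sketch assuming `np.logspace` is used to generate the edges; the array `values` and the choice of four edges are illustrative only.

```python
import numpy as np

# Made-up, strictly positive values spanning three orders of magnitude.
values = np.array([1.0, 5.0, 40.0, 300.0, 1000.0])

# Exponents of the smallest and largest value, as in the hint.
start = np.log10(values.min())
stop = np.log10(values.max())

# num counts the edges, so 4 edges give 3 bins.
edges = np.logspace(start, stop, num=4)
print(edges)

# Neighbouring edges differ by a constant factor rather than a constant step.
print(edges[1:] / edges[:-1])
```

Because `num` is the number of edges, asking for N bins means passing `num=N + 1`.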


In [41]:
min(movies.Worldwide_Gross)


Out[41]:
0.0

Because there seem to be movies that made $0, and because log(0) is undefined while log(1) = 0, let's add 1 to the variable.
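The effect of the shift can be checked on a tiny made-up array (the values in `gross` are invented for illustration):

```python
import numpy as np

gross = np.array([0.0, 9.0, 999.0])  # made-up values, including a zero

# NumPy returns -inf for log10(0) and emits a warning, silenced here.
with np.errstate(divide="ignore"):
    raw = np.log10(gross)

# After adding 1, the zero maps to log10(1) = 0 and every value is finite.
shifted = np.log10(gross + 1.0)

print(raw)
print(shifted)
```

For values in the millions, adding 1 changes the logarithm only negligibly, so the shift mostly affects the entries at or near zero.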


In [63]:
movies.Worldwide_Gross = movies.Worldwide_Gross+1.0

Q: Now create logarithmic bins. Create 20 bins from the minimum value to the maximum value.


In [ ]:
# Implement

Now we can plot a histogram with log-binning. Set both axes to log scale.


In [103]:
# Implement


Out[103]:
Text(0,0.5,'Frequency')

What is going on? Is this the right plot?

Q: Explain and fix.


In [ ]:

Q: Can you explain the plot? Why are there gaps?


In [ ]:

CCDF

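The CCDF (complementary cumulative distribution function) is a nice alternative for examining distributions with heavy tails. The idea is the same as for the CDF, but the direction of aggregation is reversed: instead of the fraction of data at or below a value, it gives the fraction above it. We have drawn a CDF before, and the CCDF needs only a small change to that code.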
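One possible way to compute an empirical CCDF is sketched below on made-up values. It uses the convention P(X >= x), which is an assumption of this sketch; with that convention the largest observation keeps a nonzero value and can still be drawn on a log axis.

```python
import numpy as np

sample = np.array([3.0, 1.0, 4.0, 2.0])  # made-up values without ties

x_sorted = np.sort(sample)
n = len(x_sorted)

# For the i-th smallest value (i starting at 0), n - i observations are
# greater than or equal to it, so the fraction is 1 - i / n.
ccdf = 1.0 - np.arange(n) / n

print(x_sorted)
print(ccdf)
```

With tied values this shortcut assigns different heights to identical x values, so real data may need the ties handled explicitly.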

Q: Draw a CCDF in log-log scale


In [100]:
# implement


We can also try a semilog scale (only one axis is on a log scale), where the horizontal axis is linear.

Q: Draw a CCDF in semilog scale


In [109]:
# Implement


A straight line on a semilog scale means exponential decay (whereas a straight line on a log-log scale means power-law decay). So it seems that the amount of money a movie makes across the world roughly follows an exponential distribution, with a few outliers that make an extraordinary amount of money.
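The link between straight lines and decay types can be verified numerically on synthetic curves. The rate 0.5 and the exponent 2 below are arbitrary choices for this sketch, not estimates from the movies data.

```python
import numpy as np

x = np.linspace(1.0, 50.0, 200)

exp_decay = np.exp(-0.5 * x)  # exponential decay with rate 0.5
power_law = x ** -2.0         # power-law decay with exponent 2

# Semilog view: log(y) against x is a line with slope -0.5 for the exponential.
slope_semilog = np.polyfit(x, np.log(exp_decay), 1)[0]

# Log-log view: log(y) against log(x) is a line with slope -2 for the power law.
slope_loglog = np.polyfit(np.log(x), np.log(power_law), 1)[0]

print(slope_semilog, slope_loglog)
```

The fitted slope recovers the decay rate in the semilog case and the exponent in the log-log case.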

Q: Which is the most successful movie in our dataset?

You can use the following

Which one should you use, loc or iloc? How are they different from each other?
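The difference between the two indexers shows up as soon as the index labels are not 0, 1, 2, ..., which is the case after rows have been dropped. The small frame below is made up for illustration; its column names and values are hypothetical.

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions.
toy = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Gross": [10.0, 30.0, 20.0]},
    index=[100, 200, 300],
)

# idxmax() returns the index *label* of the largest value: use loc with it.
label = toy["Gross"].idxmax()
by_label = toy.loc[label, "Title"]

# argmax() on the underlying array returns the integer *position*: use iloc.
position = toy["Gross"].to_numpy().argmax()
by_position = toy.iloc[position]["Title"]

print(label, position, by_label, by_position)
```

Mixing them up, for example passing a label to iloc, either raises an error or silently selects a different row.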


In [ ]:
# Implement