Module 10: Logscale



In [1]:

    
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
import vega_datasets
%matplotlib inline

Ratio and logarithm

If you use a linear scale to visualize ratios, the result can be quite misleading.

Let's first create some ratios.



In [2]:

    
x = np.array([1,    1,   1,  1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1,  1,   1   ])
ratio = x/y
print(ratio)









    



[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
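A minimal sketch of why the logarithm is a natural companion to ratios, using the same `x` and `y` as above. The names `linear_distance` and `log_ratio` are introduced here only for illustration. The sketch compares how far each ratio sits from the reference value 1 on the linear scale and on the log scale.

```python
import numpy as np

# Same ratios as in the cell above.
x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1])
ratio = x / y

# Distance from the reference ratio of 1 on the linear scale:
# every ratio below 1 is squeezed into the interval (0, 1).
linear_distance = np.abs(ratio - 1)

# Position relative to the reference on the log scale (log10(1) = 0):
# a ratio of 1/1000 is as far from 1 as a ratio of 1000.
log_ratio = np.log10(ratio)

print(linear_distance)
print(log_ratio)
```

On the log scale the values are symmetric around 0, so a thousandfold decrease and a thousandfold increase get the same visual weight.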

Q: Plot the ratios on the linear scale using the scatter() function. Also draw a horizontal line at ratio=1 as a reference.



In [11]:

    
# Implement









    Out[11]:





Text(0,0.5,'Ratio')

Q: Explain what's bad about this plot.



In [ ]:

Q: Can you fix it?



In [ ]:

    
# Implement

Log-binning

Let's first see what happens if we do not use the log scale for a dataset with a heavy tail.

Q: Load the movie dataset from vega_datasets and remove the NaN rows based on the following three columns: IMDB_Rating, IMDB_Votes, Rotten_Tomatoes_Rating.



In [20]:

    
# Implement

If you simply call the hist() method on a dataframe object, it identifies all the numeric columns and draws a histogram for each.

Q: Draw all possible histograms of the movie dataframe. Adjust the size of the plots if needed.



In [26]:

    
# Implement

As we can see, a majority of the columns are not normally distributed. In particular, if you look at the worldwide gross variable, only a couple of bins in the histogram contain a visible amount of data. Is this a problem of resolution? How about increasing the number of bins?

Q: Play with the number of bins, and then increase the number of bins to 200.



In [39]:

    
# Implement









    Out[39]:





Text(0,0.5,'Frequency')

Maybe a bit more useful, but it doesn't tell us anything about the data distribution above a certain point.

Q: How about changing the vertical scale to a logarithmic scale?



In [38]:

    
# Implement









    Out[38]:





Text(0,0.5,'Frequency')

Now, let's try log-binning. Recall that when plotting histograms we can specify the bin edges through the bins parameter. For example, we can set the bin edges to [0, 1, 2, ..., 10] as follows.



In [40]:

    
movies.IMDB_Rating.hist(bins=range(0,11))









    Out[40]:





<matplotlib.axes._subplots.AxesSubplot at 0x1102c5fd0>

Here, we can specify the bin edges in a similar way. Instead of spacing them evenly on the linear scale, we space them evenly in log space. Some useful resources:

Hint: since $10^{\text{start}} = \text{min(Worldwide_Gross)}$, $\text{start} = \log_{10}(\text{min(Worldwide_Gross)})$
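The hint can be sketched with `np.logspace`, which takes exponents rather than raw values. The array `values` below is hypothetical and only stands in for a positive, heavy-tailed column. It is not the movie data.

```python
import numpy as np

# Hypothetical positive values standing in for a heavy-tailed column.
values = np.array([1.0, 3.0, 50.0, 700.0, 2.0e4, 1.0e6])

# Exponents of the smallest and largest value: 10**start == values.min().
start = np.log10(values.min())
stop = np.log10(values.max())

# 20 bins need 21 edges, evenly spaced in log10 space.
num_bins = 20
bins = np.logspace(start, stop, num_bins + 1)

# Consecutive edges differ by a constant factor, not a constant amount.
factors = bins[1:] / bins[:-1]
print(bins[:3], factors[:3])
```

Because the edges grow by a constant factor, the bins look equally wide on a log axis even though their linear widths differ by orders of magnitude.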



In [41]:

    
min(movies.Worldwide_Gross)









    Out[41]:





0.0

Because there seem to be movies that made $0, and because log(0) is undefined while log(1) = 0, let's add 1 to the variable.



In [63]:

    
movies.Worldwide_Gross = movies.Worldwide_Gross+1.0
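To see why the shift by 1 matters, here is a small sketch on a hypothetical array `gross` (not the movie data) that contains a zero. It compares the logarithm before and after the shift.

```python
import numpy as np

gross = np.array([0.0, 9.0, 999.0])  # hypothetical values, including a zero

# log10(0) evaluates to -inf (NumPy also emits a RuntimeWarning, silenced here).
with np.errstate(divide="ignore"):
    raw_log = np.log10(gross)

# After shifting by 1, the zero maps to log10(1) = 0 and every value is finite.
shifted_log = np.log10(gross + 1.0)

print(raw_log)
print(shifted_log)
```

For values in the millions the shift is negligible, so it changes the lower end of the range without noticeably distorting the tail.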

Q: Now create logarithmic bins. Create 20 bins from the minimum value to the maximum value.



In [ ]:

    
# Implement

Now we can plot a histogram with log-binning. Set both axes to log scale.



In [103]:

    
# Implement









    Out[103]:





Text(0,0.5,'Frequency')

What is going on? Is this the right plot?

Q: Explain and fix.



In [ ]:

Q: Can you explain the plot? Why are there gaps?



In [ ]:

CCDF

The CCDF (complementary cumulative distribution function) is a nice alternative for examining distributions with heavy tails. The idea is the same as the CDF, but the direction of aggregation is the opposite. We have done the CDF before. It's just a small change to that code.
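One way the "opposite direction of aggregation" might look in code, on a small hypothetical sample with no repeated values. This sketch takes the CCDF as P(X >= x), which is one common convention and an assumption here. The names `values`, `sorted_values` and `ccdf` are illustrative.

```python
import numpy as np

# Hypothetical sample standing in for a heavy-tailed variable.
values = np.array([5.0, 1.0, 20.0, 3.0, 100.0])

# Sort ascending; the empirical CDF at the i-th sorted value would be (i + 1) / n.
sorted_values = np.sort(values)
n = len(sorted_values)

# CCDF taken as P(X >= x): the fraction of points at or above each sorted value.
# (Using P(X > x) = 1 - CDF instead would make the last point 0,
# which cannot be drawn on a log axis.)
ccdf = 1.0 - np.arange(n) / n

print(sorted_values)
print(ccdf)
```

Plotting `ccdf` against `sorted_values` then gives a curve that starts at 1 and decreases, with the largest observation sitting at 1/n.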

Q: Draw a CCDF in log-log scale



In [100]:

    
# Implement

We can also try a semilog scale (only one axis is on a log scale), where the horizontal axis is linear.

Q: Draw a CCDF in semilog scale



In [109]:

    
# Implement

A straight line on a semilog scale means exponential decay (cf. a straight line on a log-log scale means power-law decay). So it seems that the amount of money a movie makes across the world roughly follows an exponential distribution, while there are some outliers that make an insane amount of money.
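The straight-line claim can be checked numerically. The sketch below uses two synthetic curves with hypothetical parameters `rate` and `alpha`, and fits a line in the corresponding transformed coordinates with `np.polyfit`.

```python
import numpy as np

x = np.linspace(1.0, 50.0, 200)

# Hypothetical decay parameters chosen only for illustration.
rate = 0.2    # exponential: y = exp(-rate * x)
alpha = 1.5   # power law:   y = x ** (-alpha)

y_exp = np.exp(-rate * x)
y_pow = x ** (-alpha)

# Semilog (log y vs. linear x): the exponential is a line with slope -rate.
slope_semilog, _ = np.polyfit(x, np.log(y_exp), 1)

# Log-log (log y vs. log x): the power law is a line with slope -alpha.
slope_loglog, _ = np.polyfit(np.log(x), np.log(y_pow), 1)

print(slope_semilog, slope_loglog)
```

Taking logs of y = exp(-rate * x) gives log y = -rate * x, which is linear in x. Taking logs of y = x^(-alpha) gives log y = -alpha * log x, which is linear in log x.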

Q: Which is the most successful movie in our dataset?

You can use the following

Which one should you use, loc or iloc? How are they different from each other?
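A small sketch of the difference, on a hypothetical dataframe `df` whose index labels no longer match the row positions (as happens after dropping rows without resetting the index). Using `idxmax` and `argmax` here is an assumption made for illustration, not necessarily the approach intended above.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame(
    {"title": ["A", "B", "C"], "gross": [10.0, 300.0, 25.0]},
    index=[2, 5, 7],
)

# Index label of the maximum vs. integer position of the maximum.
label = df["gross"].idxmax()
position = int(df["gross"].to_numpy().argmax())

by_label = df.loc[label]          # loc looks rows up by index label
by_position = df.iloc[position]   # iloc looks rows up by integer position

print(label, position, by_label["title"], by_position["title"])
```

Both lookups return the same row here, but only because each is paired with the matching kind of index. Passing a label to `iloc` or a position to `loc` would select a different row or raise an error.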



In [ ]:

    
# Implement