In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
import vega_datasets
%matplotlib inline
In [2]:
x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1 ])
ratio = x/y
print(ratio)
Q: Plot on the linear scale using the scatter() function. Also draw a horizontal line at ratio=1 for a reference.
In [11]:
# Implement
Out[11]:
Q: Explain what's bad about this plot.
In [ ]:
Q: Can you fix it?
In [ ]:
# Implement
In [20]:
# Implement
If you simply call the hist() method on a dataframe object, it identifies all the numeric columns and draws a histogram for each.
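As a quick illustration of this behavior, here is a minimal sketch on a tiny made-up dataframe (the column names and values are invented for the example, not taken from the movie data). Only the two numeric columns should get a histogram; the text column is skipped.

```python
import pandas as pd

# Made-up frame: two numeric columns and one text column.
toy = pd.DataFrame({
    "rating": [6.1, 7.4, 5.0, 8.2, 6.9],
    "budget": [1.0, 20.0, 3.5, 150.0, 40.0],
    "title": ["a", "b", "c", "d", "e"],
})

# hist() returns a 2-D array of matplotlib Axes, one per numeric column.
# figsize is passed along to control the overall figure size.
axes = toy.hist(figsize=(8, 3))

# Each subplot is titled with the column it was drawn from.
titles = sorted(ax.get_title() for ax in axes.ravel())
print(titles)
```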
Q: draw all possible histograms of the movie dataframe. Adjust the size of the plots if needed.
In [26]:
# Implement
As we can see, a majority of the columns are not normally distributed. In particular, if you look at the worldwide gross variable, the histogram shows only a couple of meaningful bars. Is this a problem of resolution? How about increasing the number of bins?
Q: Play with the number of bins, and then increase the number of bins to 200.
In [39]:
# Implement
Out[39]:
Maybe a bit more useful, but it doesn't tell us anything about the data distribution above a certain point.
Q: How about changing the vertical scale to a logarithmic scale?
In [38]:
# Implement
Out[38]:
Now, let's try log-binning. Recall that when plotting histograms we can specify the edges of bins through the bins parameter. For example, we can set the bin edges to [0, 1, 2, ..., 10] as follows.
In [40]:
movies.IMDB_Rating.hist(bins=range(0,11))
Out[40]:
Here, we can specify the edges of bins in a similar way. Instead of spacing them evenly on the linear scale, we space them evenly in log space. Some useful resources:
Hint: since $10^{\text{start}} = \text{min(Worldwide_Gross)}$, $\text{start} = \log_{10}(\text{min(Worldwide_Gross)})$
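To see how the hint turns into bin edges, here is a minimal sketch using np.logspace on a handful of made-up positive values (the numbers and the choice of 6 bins are assumptions for illustration only, not the movie data). The exponents of the smallest and largest value become the start and stop of the log-spaced edges.

```python
import numpy as np

# Made-up positive values spanning several orders of magnitude.
values = np.array([1.0, 3.0, 12.0, 250.0, 4_000.0, 90_000.0, 1_000_000.0])

# Exponents of the extremes: 10**start equals the minimum value.
start = np.log10(values.min())
stop = np.log10(values.max())

# n_bins bins need n_bins + 1 edges, evenly spaced in log space.
n_bins = 6
edges = np.logspace(start, stop, n_bins + 1)

# Neighboring edges differ by a constant factor, not a constant amount.
factors = edges[1:] / edges[:-1]
print(edges)
print(factors)
```

Because the edges grow by a constant factor, each bin covers the same width on a logarithmic axis.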
In [41]:
movies.Worldwide_Gross.min()
Out[41]:
Because there seem to be movies that made $0, and because log(0) is undefined while log(1) = 0, let's add 1 to the variable.
In [63]:
movies.Worldwide_Gross = movies.Worldwide_Gross+1.0
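The effect of the +1 shift can be checked on a few made-up numbers (the array below is an assumption for illustration, not the movie data). Without the shift, a zero maps to negative infinity; with it, zero maps to 0 and large values barely move.

```python
import numpy as np

# Made-up gross values, including a zero.
gross = np.array([0.0, 9.0, 99.0])

# log10(0) is not finite; silence the divide-by-zero warning for the demo.
with np.errstate(divide="ignore"):
    raw = np.log10(gross)

# After adding 1, every value has a finite logarithm.
shifted = np.log10(gross + 1.0)
print(raw)
print(shifted)
```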
Q: now create logarithmic bins. Create 20 bins from the minimum value to the maximum value.
In [ ]:
# Implement
Now we can plot a histogram with log bins. Set both axes to log scale.
In [103]:
# Implement
Out[103]:
What is going on? Is this the right plot?
Q: explain and fix
In [ ]:
Q: Can you explain the plot? Why are there gaps?
In [ ]:
In [100]:
# implement
We can also try a semilog scale (only one axis is logarithmic), where the horizontal axis is linear.
Q: Draw a CCDF in semilog scale
In [109]:
# Implement
A straight line in semilog scale means exponential decay (cf. a straight line in log-log scale means power-law decay). So it seems like the amount of money a movie makes across the world roughly follows an exponential distribution, while there are some outliers that make an insane amount of money.
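The straight-line claim can be checked numerically. The sketch below uses idealized CCDFs with arbitrarily chosen parameters (lam and alpha are assumptions for the demo, not estimates from the movie data) and fits a line in each scale.

```python
import numpy as np

x = np.linspace(1.0, 50.0, 200)

# Exponential CCDF: P(X > x) = exp(-lam * x), with lam picked arbitrarily.
lam = 0.2
ccdf_exp = np.exp(-lam * x)

# Power-law CCDF: P(X > x) = x**(-alpha) for x >= 1, with alpha picked arbitrarily.
alpha = 1.5
ccdf_pow = x ** (-alpha)

# Semilog: log(CCDF) against x is a line with slope -lam for the exponential.
slope_semilog, _ = np.polyfit(x, np.log(ccdf_exp), 1)

# Log-log: log(CCDF) against log(x) is a line with slope -alpha for the power law.
slope_loglog, _ = np.polyfit(np.log(x), np.log(ccdf_pow), 1)

print(slope_semilog, slope_loglog)
```

The fitted slopes recover the decay rate and the exponent, which is why the shape of a CCDF in each scale hints at the underlying distribution.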
Q: Which is the most successful movie in our dataset?
You can use the following:
idxmax(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html
loc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html or iloc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
Which one should you use, loc or iloc? How are they different from each other?
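To make the distinction concrete, here is a small sketch on a made-up frame whose index labels deliberately differ from the row positions (the titles, numbers, and index values are invented for the example). idxmax() returns an index label, so it pairs with label-based lookup.

```python
import pandas as pd

# Made-up frame with a non-default index, so labels != positions.
toy = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Gross": [10.0, 300.0, 25.0]},
    index=[101, 102, 103],
)

# idxmax() gives the index *label* of the largest value.
label = toy["Gross"].idxmax()

# argmax() on the underlying array gives the integer *position*.
position = toy["Gross"].to_numpy().argmax()

row_by_label = toy.loc[label]          # label-based lookup
row_by_position = toy.iloc[position]   # position-based lookup

# Feeding the label to iloc fails here, because there is no row at position 102.
try:
    toy.iloc[label]
    label_works_with_iloc = True
except IndexError:
    label_works_with_iloc = False

print(label, position, row_by_label["Title"], label_works_with_iloc)
```

With a default 0, 1, 2, ... index the two lookups happen to agree, which is why the difference is easy to miss.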
In [ ]:
# Implement