By Evgenia "Jenny" Nitishinskaya and Delaney Granizo-Mackenzie
Notebook released under the Creative Commons Attribution 4.0 License.
A histogram displays a frequency distribution using bars. It lets us quickly see where most of the observations are clustered. The data range is divided into intervals (bins), and the height of each bar represents the number of observations that fall within that bar's interval.
In [59]:
import numpy as np
import matplotlib.pyplot as plt
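# Get daily returns data for SPY, used here as a stand-in for the S&P 500
# (get_pricing is not imported; it is assumed to be supplied by the research environment)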
start = '2014-01-01'
end = '2015-01-01'
spy = get_pricing('SPY', fields='price', start_date=start, end_date=end).pct_change()[1:]
# Plot a histogram using 20 bins
fig = plt.figure(figsize = (16, 7))
_, bins, _ = plt.hist(spy, 20)
labels = ['%.3f' % a for a in bins] # Reduce precision so labels are legible
plt.xticks(bins, labels)
plt.xlabel('Returns')
plt.ylabel('Number of Days')
plt.title('Frequency distribution of S&P 500 returns, 2014');
The graph above shows, for example, that the daily returns on the S&P 500 were between 0.010 and 0.013 on 10 of the days in 2014. Note that we are completely discarding the dates corresponding to these returns.
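To make the counting behind the bars concrete, here is a minimal sketch that bins a series without plotting it. It uses synthetic, normally distributed returns as a stand-in for the SPY data (an assumption, since `get_pricing` is only available on the research platform), and the variable names are illustrative only.

```python
import numpy as np

# Synthetic daily returns standing in for the SPY series (assumption: not real market data)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0005, scale=0.007, size=251)

# np.histogram performs the binning step without drawing anything
counts, edges = np.histogram(returns, bins=20)

# 20 bins are delimited by 21 edges, and every observation lands in exactly one bin
print(len(counts), len(edges))
print(counts.sum() == len(returns))

# The tallest bar marks the interval where the observations cluster
tallest = np.argmax(counts)
print('Most populated interval: %.3f to %.3f' % (edges[tallest], edges[tallest + 1]))
```

Notice that only the values are binned. The order of the observations, and therefore their dates, plays no role in the counts.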
An alternative way to display the data is cumulatively, as in a cumulative distribution function: the height of each bar represents the number of observations that lie in that bin or in any of the bins before it. This graph is always nondecreasing, since no bin can contain a negative number of observations. The choice of graph depends on the information you are interested in.
In [56]:
# Example of a cumulative histogram
fig = plt.figure(figsize = (16, 7))
_, bins, _ = plt.hist(spy, 20, cumulative=True)
labels = ['%.3f' % a for a in bins]
plt.xticks(bins, labels)
plt.xlabel('Returns')
plt.ylabel('Number of Days')
plt.title('Cumulative distribution of S&P 500 returns, 2014');
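The cumulative bar heights are just a running total of the ordinary bin counts. The sketch below checks this on synthetic returns (an assumption standing in for the SPY series, with illustrative variable names), and confirms that the result never decreases and ends at the total number of observations.

```python
import numpy as np

# Synthetic daily returns standing in for the SPY series (assumption: not real market data)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0005, scale=0.007, size=251)

# Ordinary bin counts, then a running total across the bins
counts, edges = np.histogram(returns, bins=20)
cumulative = np.cumsum(counts)

# Counts are never negative, so the running total never decreases
print(np.all(np.diff(cumulative) >= 0))

# The last bar accounts for every observation
print(cumulative[-1] == len(returns))
```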
A scatter plot is useful for visualizing the relationship between two data sets. The two data sets must correspond in some way, for example by sharing the dates on which the measurements were taken. Each point represents a pair of corresponding values, one from each data set. However, we don't plot the dates themselves, so the plot shows how the two quantities move together but not when.
In [57]:
# Get returns data for some security
asset = get_pricing('MSFT', fields='price', start_date=start, end_date=end).pct_change()[1:]
# Plot the asset returns vs S&P 500 returns
plt.scatter(asset, spy)
plt.xlabel('MSFT')
plt.ylabel('SPY')
plt.title('Returns in 2014');
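One way to see that a scatter plot discards the dates is to reorder the days and check that the set of points does not change. This is a small sketch on synthetic data: the two series and the relationship between them are hypothetical and stand in for the MSFT and SPY returns.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily returns for a market index and a related asset (not real data)
market = rng.normal(0, 0.007, size=250)
asset = 1.1 * market + rng.normal(0, 0.005, size=250)

# Each scatter point pairs the two values observed on the same day
points = np.column_stack([asset, market])

# Reorder the days, applying the same permutation to both series
order = rng.permutation(len(points))
shuffled = points[order]

# The set of plotted points is identical, so the scatter plot would look the same
same_points = set(map(tuple, points)) == set(map(tuple, shuffled))
print(same_points)
```

What matters is the pairing: the same reordering is applied to both series. Shuffling only one of them would break the day-by-day correspondence and produce a different plot.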
A line graph can be used when we want to track the development of the y value as the x value changes. For instance, when we are plotting the price of a stock, showing it as a line graph instead of just plotting the data points makes it easier to follow the price over time. This necessarily involves "connecting the dots" between the data points.
In [58]:
spy.plot()
plt.ylabel('Returns');
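"Connecting the dots" means the line implies values between observations that were never actually measured. As a rough sketch, straight segments between points amount to linear interpolation, which we can reproduce with `np.interp`. The prices below are hypothetical.

```python
import numpy as np

# Hypothetical closing prices observed on four consecutive days
days = np.array([0.0, 1.0, 2.0, 3.0])
prices = np.array([100.0, 102.0, 101.0, 104.0])

# Values the straight line segments pass through halfway between observations
midpoints = days[:-1] + 0.5
implied = np.interp(midpoints, days, prices)

print(implied.tolist())  # [101.0, 101.5, 102.5]
```

None of these intermediate values were observed. They are an artifact of drawing the line, which is worth keeping in mind when the data points are far apart.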