The data we get can often be modeled as multiple instances, or realizations, of a random variable. For example, a dataset might simply list the results of 100 tosses of a die. The result of each toss is one realization of the random variable.
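As a rough illustration of the die example, the following sketch simulates 100 tosses and tabulates how often each face came up. The seed, the variable names (`tosses`, `rel_freq`), and the assumption of a fair six-sided die are all chosen here for illustration only.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)

# 100 realizations of a (assumed fair) six-sided die.
# `high` is exclusive, so this draws integers from 1 to 6.
tosses = rng.integers(low=1, high=7, size=100)

# Count how often each face appeared, then turn counts into relative frequencies.
faces, counts = np.unique(tosses, return_counts=True)
rel_freq = counts / tosses.size

for face, freq in zip(faces, rel_freq):
    print(face, freq)
```

Each entry of `tosses` is one realization, and the relative frequencies are an empirical estimate of the probability of each face.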
When we get data like this, made up of multiple instances of some random variable, we're often interested in finding out the PDF of the random variable that generated the data. We'd like to know whether the data came from a normal distribution and, if so, what the parameters of that specific normal distribution are.
If we have multiple instances of a random variable, such as the results of multiple die tosses, we can get an approximation of the PDF of that random variable by using a histogram. A histogram of a dataset is created by partitioning an interval into disjoint subintervals (bins) and counting the data points that fall in each subinterval. If we're interested in finding out how much a cat's heart weighs and we have heart weights for several cats, we can approximate the PDF by looking at the histogram of this data.
In [6]:
import pandas as pd
catsData = pd.read_csv('../data/cats.csv')
catsData.head()
Out[6]:
In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
catsData.Hwt.hist()
Out[7]:
The histogram tells us a few things:
We can generate values (or samples) from the normal distribution with a specific mean and standard deviation using numpy. Here we generate 500 samples from a normal distribution with $\mu=-2$ and $\sigma=0.5$ (so $\sigma^2=0.25$); note that the `scale` argument of `np.random.normal` is the standard deviation, not the variance. We store these samples as a vector in a variable called `normal_samples`. We then plot a histogram of these samples.
In [9]:
import numpy as np
# loc is the mean; scale is the standard deviation (not the variance)
normal_samples = np.random.normal(loc=-2, scale=0.5, size=500)
In [10]:
%matplotlib inline
plt.hist(normal_samples)
Out[10]:
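To make the link between a histogram and a PDF a little more concrete, here is a minimal sketch that bins normal samples by hand with `np.histogram` and compares the normalized bar heights with the normal density at the bin centers. It reuses the same `loc` and `scale` values as the call above; the seed, the choice of 20 bins, and the names `density`, `centers`, and `max_gap` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
mu, sigma = -2.0, 0.5           # mean and standard deviation
samples = rng.normal(loc=mu, scale=sigma, size=500)

# Partition the data range into 20 disjoint bins and count samples per bin.
# density=True rescales the counts so that the bar areas sum to 1.
density, edges = np.histogram(samples, bins=20, density=True)
widths = np.diff(edges)
centers = (edges[:-1] + edges[1:]) / 2

# Normal PDF evaluated at the bin centers, for comparison.
pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

total_area = np.sum(density * widths)    # area under the histogram
max_gap = np.max(np.abs(density - pdf))  # largest bar-vs-PDF difference
print(total_area, max_gap)
```

The area under the normalized histogram is 1, as for any PDF, and the gap between the bars and the true density should generally shrink as the number of samples grows.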
Going back to the cat example, we might be interested in asking questions like how many cat hearts weigh more than 10 grams, or how many weigh less than 6 grams. A histogram can give you rough answers to these sorts of questions. We can also work directly with the data and find what are called quantiles.
In [17]:
# The mean of a boolean Series is the fraction of True values; scale to percent.
print('Percent of heart weights greater than 10 g:', 100 * (catsData.Hwt > 10).mean())
A quantile is a value $q$ that a random variable $X$ can take on, defined by a probability $p$ between 0 and 1, such that the probability of the random variable being less than or equal to that value is $p$: $P(X \le q) = p$.
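The definition can be checked numerically. The sketch below estimates a quantile from data with `np.quantile` and then confirms that roughly a fraction $p$ of the samples fall at or below it. The data here are synthetic stand-ins, not the cat measurements: the distribution, its parameters, the seed, and the names `weights`, `q`, and `frac_below` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Synthetic stand-in data (NOT the cats dataset): 1000 draws from a normal
# distribution with an arbitrarily chosen mean and standard deviation.
weights = rng.normal(loc=10.0, scale=2.0, size=1000)

p = 0.9
q = np.quantile(weights, p)  # empirical 0.9 quantile of the sample

# By definition, about a fraction p of the samples should be <= q.
frac_below = (weights <= q).mean()
print(q, frac_below)
```

The same idea should carry over to a pandas column, for instance via `catsData.Hwt.quantile(0.9)`.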