Histograms and Univariate Data Exploration

The data we get can often be modeled as multiple instances or realizations of a random variable. For example, we can have a dataset that simply lists the results of 100 tosses of a die. The result of each toss was a realization or instance of a random variable.

When our data consists of multiple instances of some random variable, we're often interested in the PDF of the random variable that generated it. We'd like to know whether the data comes from a normal distribution and, if it does, what the parameters of that specific normal distribution are.

Definition of a Histogram

If we have multiple instances of a random variable, such as the results of multiple die tosses, we can approximate the PDF of that random variable with a histogram. A histogram of a dataset is created by partitioning an interval into disjoint subintervals (bins) and counting the data points that fall in each one. For example, if we're interested in how much a cat's heart weighs and we have the heart weights of several cats, we can approximate the PDF by looking at the histogram of this data.
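
The partition-and-count procedure can be sketched in a few lines of plain Python. This is only an illustration of the definition, not how pandas or matplotlib build histograms internally; the function name histogram_counts, the choice of equal-width bins, and the heart weights in the list are all assumptions made up for the example.

```python
# Hypothetical heart weights in grams, made up for illustration only.
weights = [7.0, 7.4, 9.5, 7.2, 7.3, 10.1, 11.0, 12.7, 9.9, 10.4]

def histogram_counts(data, low, high, n_bins):
    """Count how many values fall in each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / float(n_bins)
    counts = [0] * n_bins
    for x in data:
        if x < low or x > high:
            continue  # ignore values outside the interval
        # The last bin is closed on the right so that x == high is still counted.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Bins: [6, 8), [8, 10), [10, 12), [12, 14]
counts = histogram_counts(weights, low=6.0, high=14.0, n_bins=4)
print(counts)  # [4, 2, 3, 1]
```

Every value lands in exactly one bin, so the counts add up to the number of data points inside the interval.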

Plotting a Histogram with Pandas and Matplotlib


In [6]:
import pandas as pd

catsData = pd.read_csv('../data/cats.csv')
catsData.head()


Out[6]:
Sex Bwt Hwt
0 F 2.0 7.0
1 F 2.0 7.4
2 F 2.0 9.5
3 F 2.1 7.2
4 F 2.1 7.3

In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
catsData.Hwt.hist()


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x109f4de50>

The histogram tells us a few things:

  1. The heart weights range from 6 to 20 grams.
  2. 10 grams seems to be the most common heart weight, i.e. the mode of the random variable.
  3. The average, or expected value, of the random variable also seems to be around 10 grams.
  4. The histogram suggests that the cat heart weights may be approximately normally distributed.
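
Readings like these are taken off the plot by eye, so it can help to check them against the numbers. The sketch below shows one way to do that with numpy. The array hwt is a made-up stand-in for catsData.Hwt, and the helper summarize is a hypothetical name introduced here, so the printed values say nothing about the real cats data.

```python
import numpy as np

# Hypothetical stand-in for catsData.Hwt; these values are made up for illustration.
hwt = np.array([7.0, 7.4, 9.5, 7.2, 7.3, 10.1, 11.0, 12.7, 9.9, 10.4, 10.6, 9.7])

def summarize(values, bins=10):
    """Return the range, the mean, and the edges of the fullest histogram bin."""
    counts, edges = np.histogram(values, bins=bins)
    fullest = int(np.argmax(counts))  # index of the bin holding the most values
    return {
        "min": float(np.min(values)),
        "max": float(np.max(values)),
        "mean": float(np.mean(values)),
        "modal_bin": (float(edges[fullest]), float(edges[fullest + 1])),
    }

summary = summarize(hwt, bins=3)
print(summary)
```

With the real data, the same function could be called as summarize(catsData.Hwt.values). Note that the fullest bin depends on the number of bins chosen, so it gives only a rough location for the mode.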

The Connections Between Histograms and PDFs

We can generate values (or samples) from a normal distribution with a specific mean and standard deviation using numpy. Here we generate 500 samples from a normal distribution with mean $\mu=-2$ and standard deviation $\sigma=0.5$ (that is, variance $\sigma^2=0.25$). Note that the scale argument of np.random.normal is the standard deviation, not the variance. We store these samples as a vector in a variable called normal_samples and then plot a histogram of them.


In [9]:
import numpy as np
normal_samples = np.random.normal(loc=-2, scale=0.5, size=500)

In [10]:
%matplotlib inline
plt.hist(normal_samples)


Out[10]:
(array([   5.,   25.,   53.,   91.,  133.,  106.,   59.,   20.,    7.,    1.]),
 array([-3.42924824, -3.12093907, -2.8126299 , -2.50432072, -2.19601155,
        -1.88770237, -1.5793932 , -1.27108403, -0.96277485, -0.65446568,
        -0.3461565 ]),
 <a list of 10 Patch objects>)
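
To make the connection to the PDF concrete, the bin counts can be rescaled so that the bars enclose a total area of 1, and the resulting heights compared with the normal PDF at the bin centers. The following is a minimal sketch using numpy only. The fixed seed and the variable names are assumptions introduced here, so the numbers will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility
mu, sigma = -2.0, 0.5
samples = rng.normal(loc=mu, scale=sigma, size=500)

# density=True rescales the bin counts so that the bars enclose a total area of 1.
density, edges = np.histogram(samples, bins=10, density=True)
widths = np.diff(edges)
centers = (edges[:-1] + edges[1:]) / 2

# Normal PDF evaluated at the bin centers, written out from its formula.
pdf = np.exp(-((centers - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

area = float(np.sum(density * widths))
for c, d, p in zip(centers, density, pdf):
    print("center %6.2f  histogram %.3f  pdf %.3f" % (c, d, p))
print("total area under the histogram: %.6f" % area)
```

The histogram heights should roughly track the PDF values, and the agreement generally improves as the number of samples grows. Recent versions of matplotlib accept the same density=True argument in plt.hist, which draws this normalized histogram directly.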

Quantiles

Going back to the cat example, we might want to ask what fraction of cat hearts weigh more than 10 grams, or what fraction weigh less than 6 grams. A histogram can give rough answers to these sorts of questions. We can also work directly with the data and find what are called quantiles.


In [17]:
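# Comparing against 10 gives a boolean Series; its mean is the fraction of True values.
print('Fraction of heart weights greater than 10 g: %.12f' % (catsData.Hwt > 10).mean())


Fraction of heart weights greater than 10 g: 0.548611111111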

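A quantile is a value of a random variable that is defined by a probability p between 0 and 1: the p-quantile is the number x such that the probability of the random variable being less than or equal to x is p. For example, the 0.5-quantile is the median, since half of the probability lies at or below it.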
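
The definition can be checked empirically. The sketch below draws samples from a normal distribution, computes a few sample quantiles with np.quantile, and then measures the fraction of samples at or below each one. The seed, the distribution parameters, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility
samples = rng.normal(loc=-2.0, scale=0.5, size=500)

results = {}
for p in (0.1, 0.5, 0.9):
    q = float(np.quantile(samples, p))            # sample estimate of the p-quantile
    fraction_below = float(np.mean(samples <= q))  # fraction of samples at or below q
    results[p] = (q, fraction_below)
    print("p = %.1f  quantile = %.3f  fraction <= quantile = %.3f" % (p, q, fraction_below))
```

In each case the fraction of samples at or below the estimated quantile comes out very close to p, which is the defining property. For a pandas column such as catsData.Hwt, the Series.quantile method computes the same kind of estimate.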