A distribution describes the spread and tendency of a collection of numeric data. Here, spread is the relative distance of each data point from the other data points: data points can be grouped close to one another or spread far apart. A common measure of spread is variance, which is the average squared distance of the data points from the mean of the distribution.
The mean measures the central tendency of a distribution. Tendency refers to the way data points "tend" to group around a particular value. The mean is calculated by summing the values of all the data points and dividing by the total number of data points n.
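The definitions of mean and variance above can be written out directly. The following is a minimal sketch using NumPy; the `values` array is made-up illustrative data, not part of the sci-analysis examples.

```python
import numpy as np

# Illustrative data only (an assumption for this sketch).
values = np.array([1.5, 2.0, 2.5, 1.8, 2.2])

n = len(values)

# Mean: the sum of all values divided by the number of data points n.
mean = values.sum() / n

# Variance: the average squared distance of each value from the mean.
variance = ((values - mean) ** 2).sum() / n

print("mean:", mean)
print("variance:", variance)
```

The same results are available from `np.mean(values)` and `np.var(values)`; the long form is shown only to make the arithmetic visible.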
A distribution is represented on a graph by a histogram. A histogram charts a distribution by separating the numeric data in the distribution into discrete bins along the x-axis. These bins are charted as bars, where the height of each bar represents how many numeric values are in that bin.
A distribution represented by a histogram closely resembles a probability density function for continuous numeric data, or a probability mass function for discrete numeric data.
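To make the binning concrete, here is a small sketch that builds the numbers behind a histogram with `np.histogram`. It does not use sci-analysis; the random data and the choice of 20 bins are assumptions made for illustration.

```python
import numpy as np

# Illustrative data: 2000 draws from a normal distribution (an assumption).
rng = np.random.default_rng(0)
values = rng.normal(loc=2, scale=0.45, size=2000)

# Split the range of the data into 20 equal-width bins and count
# how many values fall into each one.
counts, edges = np.histogram(values, bins=20)

# Dividing by the total turns counts into the proportion of data per bin.
proportions = counts / counts.sum()

for left, right, p in zip(edges[:-1], edges[1:], proportions):
    print(f"[{left:.2f}, {right:.2f}): {p:.3f}")
```

Each printed row corresponds to one bar: the interval is the bin on the x-axis and the proportion is the bar height.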
The distribution analysis can show three graphs: the histogram, the boxplot, and the cumulative distribution plot.
Let's first import sci-analysis and set up some variables to use in these examples.
In [1]:
import numpy as np
import scipy.stats as st
from sci_analysis import analyze
%matplotlib inline
In [2]:
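# Seed the random number generator so the examples are reproducible.
np.random.seed(987654321)
# Draw 2000 values from a normal distribution with mean 2 and standard deviation 0.45.
sequence = st.norm.rvs(2, 0.45, size=2000)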
The histogram, as described above, separates numeric data into discrete bins along the x-axis. The y-axis is the probability that a data point from a given distribution will belong to a particular bin.
In [4]:
analyze(
    sequence,
    boxplot=False,
)
Boxplots in sci-analysis are actually a hybrid of two distribution visualization techniques, the boxplot and the violin plot. Boxplots are a good way to quickly understand a distribution, but can be misleading when the distribution is multimodal. A violin plot does a much better job at showing local maxima and minima of a distribution.
In the center of each box is a red line and a green triangle. The green triangle represents the mean of the group, while the red line represents the median, sometimes referred to as the second quartile (Q2) or 50% line. The circles that might be seen at either end of the boxplot are outliers, referred to as such because they fall in the bottom 5% or the top 5% of the distribution (that is, below the 5th percentile or above the 95th percentile).
In [5]:
analyze(sequence)
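The quantities marked on the boxplot can be computed by hand. This is a rough sketch with NumPy, assuming the percentile-based outlier rule described above; it is not taken from the sci-analysis source, and the data are illustrative.

```python
import numpy as np

# Illustrative data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.normal(loc=2, scale=0.45, size=2000)

# Quartiles: the median is the second quartile (Q2), the 50% line.
q1, median, q3 = np.percentile(values, [25, 50, 75])
mean = values.mean()

# Assumed outlier rule: points below the 5th or above the 95th percentile.
low, high = np.percentile(values, [5, 95])
outliers = values[(values < low) | (values > high)]

print("Q1, median, Q3:", q1, median, q3)
print("mean:", mean)
print("number of outliers:", len(outliers))
```

For a symmetric distribution such as this one, the mean and median land close together, which is why the triangle and the line nearly overlap in the plot.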
The cumulative distribution function (cdf) differs from the probability density function in that it directly shows the probability of a data point being less than or equal to a particular value in the distribution. The x-axis is the value in the distribution and the y-axis is the cumulative probability. For example, the center line of the y-axis is the 0.5 line (also known as the second quartile or Q2), and the x value where the cdf crosses the 0.5 line is the median of the distribution.
In [3]:
analyze(
    sequence,
    boxplot=False,
    cdf=True,
)
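The relationship between the cdf and the median can be checked numerically. The sketch below builds an empirical cdf with NumPy and reads off the value where it reaches 0.5. The data and variable names are illustrative assumptions, and this is not necessarily how sci-analysis draws its plot.

```python
import numpy as np

# Illustrative data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.normal(loc=2, scale=0.45, size=2000)

# Empirical cdf: sort the values, then assign each one the fraction
# of data points that are less than or equal to it.
x = np.sort(values)
cum_prob = np.arange(1, len(x) + 1) / len(x)

# The first x value at which the cumulative probability reaches 0.5.
median_from_cdf = x[np.searchsorted(cum_prob, 0.5)]

print("median read from the cdf:", median_from_cdf)
print("np.median:", np.median(values))
```

The two printed values agree up to the spacing between neighbouring data points, since `np.median` averages the two middle values when n is even.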
There are two types of data that accompany the distribution analysis -- the summary statistics and the test for normality.
The Shapiro-Wilk test evaluates the null hypothesis that the data were drawn from a normal distribution. If the p value is less than or equal to alpha, the null hypothesis is rejected and the distribution is considered to not be normally distributed.
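The decision rule can be reproduced outside of sci-analysis with `scipy.stats.shapiro`. This is a minimal sketch; the value `alpha = 0.05` is a conventional choice assumed here, and the data are illustrative.

```python
import numpy as np
from scipy import stats

# Illustrative data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.normal(loc=2, scale=0.45, size=2000)

# Significance level: 0.05 is an assumed, conventional choice.
alpha = 0.05

# shapiro returns the test statistic W and the p value.
statistic, p_value = stats.shapiro(values)

if p_value <= alpha:
    print("Data is not normally distributed")
else:
    print("No evidence against normality")
```

A p value above alpha does not prove the data are normal; it only means the test found no evidence against normality.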
The sequence argument is the bare minimum requirement for performing a Distribution analysis. It should be an array-like of numeric values.
In [4]:
analyze(sequence)
The boxplot argument controls whether the boxplot above the histogram is displayed or not. The default value is True.
In [6]:
analyze(
    sequence,
    boxplot=False,
)
The cdf argument controls whether the cumulative distribution function is displayed or not. The default value is False.
In [7]:
analyze(
    sequence,
    cdf=True,
)
The sample argument controls whether the analysis treats sequence as a sample (True) or as a population (False). The default value is False.
In [8]:
analyze(
    sequence,
    sample=True,
)
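One place where the sample versus population distinction commonly matters is the standard deviation. The sketch below shows the two conventions with NumPy's `ddof` parameter. That the `sample` flag maps onto this choice is an assumption made for illustration, not something confirmed by the text above.

```python
import numpy as np

# Illustrative data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.normal(loc=2, scale=0.45, size=2000)
n = len(values)

# Population standard deviation: divide the squared deviations by n.
population_std = np.std(values, ddof=0)

# Sample standard deviation: divide by n - 1 (Bessel's correction).
sample_std = np.std(values, ddof=1)

print("population std:", population_std)
print("sample std:", sample_std)
```

With 2000 data points the two values differ only slightly; the gap grows as n gets smaller.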
The bins argument controls the number of bins to use for the histogram. The default value is 20.
In [9]:
analyze(
    sequence,
    bins=100,
)
The title argument sets the title of the distribution to display above the graph.
In [10]:
analyze(
    sequence,
    title='This is a Title',
)
The name of the distribution to display on the x-axis, passed as name or, as in the second example below, as xname.
In [11]:
analyze(
    sequence,
    name='Sequence Name',
)
In [12]:
analyze(
    sequence,
    xname='Sequence Name',
)
The yname argument sets the label to display along the y-axis.
In [13]:
analyze(
    sequence,
    yname='Prob',
)