Location testing is useful for comparing groups (also known as categories or treatments) of similar values to see if their locations match. Here, location refers to a central value around which the values in a group tend to cluster, usually the mean or median of the group.
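As a small illustration of what "location" means, the sketch below computes the mean and median of two groups using only the Python standard library. The group values are made up for illustration and are not taken from sci-analysis.

```python
import statistics

# Hypothetical measurements for two groups (illustrative values only).
groups = {
    "A": [4.8, 5.1, 5.0, 4.9, 5.2],
    "B": [4.9, 5.0, 5.1, 5.0, 9.0],  # contains one extreme value
}

# Two common measures of location for each group.
locations = {
    name: {"mean": statistics.mean(values), "median": statistics.median(values)}
    for name, values in groups.items()
}

for name, loc in locations.items():
    print(f"{name}: mean={loc['mean']:.2f}, median={loc['median']:.2f}")
```

Both groups share a median of 5.0, but the extreme value in B pulls its mean upward, which is one reason to look at both measures.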
The Location Test analysis performs two tests: one comparing variances between groups, and a second comparing locations between groups. Both help determine how similar or dissimilar the distributions of the groups are to one another.
The Location Test produces a graph with three charts by default: Boxplots, Tukey-Kramer circles, and a Normal Quantile Plot. Let's examine these individually.
In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import scipy.stats as st
from sci_analysis import analyze
%matplotlib inline
Boxplots in sci-analysis are actually a hybrid of two distribution visualization techniques, the boxplot and the violin plot. Boxplots are a good way to quickly understand a distribution, but can be misleading when the distribution is multimodal. A violin plot does a much better job at showing the local maxima and minima of a distribution.
In [2]:
np.random.seed(987654321)
a = st.norm.rvs(0, 1, 1000)
b = np.append(st.norm.rvs(4, 2, 500), st.norm.rvs(0, 1, 500))
analyze(
    {'A': a, 'B': b},
    circles=False,
    nqp=False,
)
In the center of each box is a red line and green triangle. The green triangle represents the mean of the group while the red line represents the median, sometimes referred to as the second quartile (Q2) or 50% line.
The boxplot graph also shows a short dotted line and long dotted line that represent the grand median and grand mean respectively.
In [3]:
np.random.seed(987654321)
a = np.append(st.norm.rvs(2, 1, 500), st.norm.rvs(-2, 2, 500))
b = np.append(st.norm.rvs(8, 1, 500), st.norm.rvs(4, 2, 500))
analyze(
    {'A': a, 'B': b},
    circles=False,
    nqp=False,
)
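The text does not say how the grand mean and grand median are computed, so the following is only a sketch of two plausible definitions: pooling every value, or summarizing the per-group statistics. The data and variable names are hypothetical.

```python
import statistics

# Hypothetical groups of unequal size (illustrative values only).
groups = {
    "A": [1.0, 2.0, 3.0],
    "B": [5.0, 6.0, 7.0, 8.0, 9.0],
}

# Definition 1: pool every value, then take the mean and median.
pooled = [value for values in groups.values() for value in values]
grand_mean_pooled = statistics.mean(pooled)
grand_median_pooled = statistics.median(pooled)

# Definition 2: take the mean of group means and the median of group medians.
group_means = [statistics.mean(values) for values in groups.values()]
group_medians = [statistics.median(values) for values in groups.values()]
grand_mean_of_means = statistics.mean(group_means)
grand_median_of_medians = statistics.median(group_medians)

print(grand_mean_pooled, grand_median_pooled)
print(grand_mean_of_means, grand_median_of_medians)
```

The two definitions agree when all groups have the same size and can differ otherwise, as they do here.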
Tukey-Kramer Circles, also referred to as comparison circles, are based on the Tukey HSD test. Each circle is centered on the mean of its group, and the radius of the circle is calculated from the mean standard error and the size of the group. The radius grows with the standard error and shrinks as the group size grows. Therefore, higher variation or a smaller group size will produce a larger circle.
In [4]:
np.random.seed(987654321)
a = st.norm.rvs(0, 1, 100)
b = st.norm.rvs(0, 3, 100)
c = st.norm.rvs(0, 1, 20)
analyze(
    {'A': a, 'B': b, 'C': c},
    nqp=False,
)
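The exact radius formula used for the circles is not given here, so the sketch below only illustrates the quantity that drives it: the standard error of the mean, computed as the sample standard deviation divided by the square root of the group size. The helper name `standard_error` and the generated samples are assumptions for illustration; the samples mirror the shapes used in the cell above but are not the same numbers.

```python
import math
import random
import statistics

random.seed(0)


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Same spreads and sizes as groups A, B and C above, generated with the stdlib.
a = [random.gauss(0, 1) for _ in range(100)]  # baseline
b = [random.gauss(0, 3) for _ in range(100)]  # higher variation
c = [random.gauss(0, 1) for _ in range(20)]   # smaller group

se = {name: standard_error(values) for name, values in {"A": a, "B": b, "C": c}.items()}
for name, value in se.items():
    print(f"{name}: standard error = {value:.3f}")
```

Groups B and C both end up with a larger standard error than A, which matches the larger circles expected for higher variation or a smaller group.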
If circles of different groups are mostly overlapping, the means of those groups are likely matched. However, if circles are not touching each other or only partly overlap, the means of those groups are likely different.
In [5]:
np.random.seed(987654321)
a = st.norm.rvs(0, 1, 50)
b = st.norm.rvs(0.1, 1, 50)
c = st.norm.rvs(1, 1, 20)
analyze(
    {'A': a, 'B': b, 'C': c},
    nqp=False,
)
A Normal Quantile Plot is a specific type of Quantile-Quantile (Q-Q) plot where the quantiles on the x-axis correspond to the quantiles of the normal distribution. In the case of the Normal Quantile Plot, one quantile corresponds to one standard deviation.
If the plotted points for a group on the Normal Quantile Plot closely resemble a straight line (regardless of slope), then the group is likely normally distributed. In the example below, group C is not normally distributed, as seen by its downward curved shape on the Normal Quantile Plot.
In [26]:
np.random.seed(987654321)
a = st.norm.rvs(0, 1, size=50)
b = st.norm.rvs(0.1, 1, size=50)
c = st.weibull_max.rvs(0.95, size=50)
analyze(
    {'A': a, 'B': b, 'C': c},
    circles=False,
)
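To make the "straight line" idea concrete, the following sketch builds the points of a normal quantile plot by hand and measures how linear they are with a correlation coefficient. It is not how sci-analysis draws its plot: the plotting position `(i + 0.5) / n`, the helper names, and the use of an exponential sample as the skewed group are all assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def normal_quantile_points(values):
    """Pair each sorted value with the standard normal quantile of its rank."""
    ordered = sorted(values)
    n = len(ordered)
    # (i + 0.5) / n is one common plotting position; other conventions exist.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, ordered


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(500)]
skewed_sample = [random.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(*normal_quantile_points(normal_sample))
r_skewed = pearson_r(*normal_quantile_points(skewed_sample))
print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

A correlation close to 1 corresponds to points that fall on a straight line; the skewed sample scores lower because its points curve away from the line.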
The slope of the data points on the Normal Quantile Plot indicates the relative variance of a particular group compared to the other groups. A steeper slope corresponds to a larger variance.
In [20]:
np.random.seed(987654321)
a = st.norm.rvs(0, 1, 50)
b = st.norm.rvs(0, 2, 50)
c = st.norm.rvs(0, 3, 50)
analyze(
    {'A': a, 'B': b, 'C': c},
    circles=False,
)
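The link between slope and spread can be checked numerically. This sketch fits a least-squares line through hand-built normal quantile points and shows that the fitted slope tracks the standard deviation of the sample. The helper `nqp_slope` and the plotting position are assumptions for illustration, not part of sci-analysis.

```python
import random
from statistics import NormalDist


def nqp_slope(values):
    """Least-squares slope of sorted values against standard normal quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    # (i + 0.5) / n is one common plotting position; other conventions exist.
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_q = sum(quantiles) / n
    mean_v = sum(ordered) / n
    sxy = sum((q - mean_q) * (v - mean_v) for q, v in zip(quantiles, ordered))
    sxx = sum((q - mean_q) ** 2 for q in quantiles)
    return sxy / sxx


random.seed(2)
# One sample per standard deviation, all centered on zero.
slopes = {sd: nqp_slope([random.gauss(0, sd) for _ in range(200)]) for sd in (1, 2, 3)}
for sd, slope in slopes.items():
    print(f"standard deviation {sd}: fitted slope = {slope:.2f}")
```

For normally distributed data the fitted slope is roughly the standard deviation, so groups with larger variance appear steeper.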
When performing a Location Test analysis, two statistics tables are given, the Overall Statistics and the Group Statistics.
The Overall Statistics shows the number of groups in the dataset, total number of data points in the dataset, Grand Mean, Grand Median, and Pooled Standard Deviation.
The Group Statistics list summary statistics for each group in a table. The summary statistics shown are the number of data points in the group (n), the Mean, Standard Deviation, Minimum, Median, Maximum, and group name.
The remaining two statistics are both hypothesis tests. The first test attempts to determine whether the variances of the groups are matched. The second test attempts to determine whether the locations of the groups are matched. Each hypothesis test shows the significance level (alpha), test statistic, and p-value. The hypothesis test used depends on a few different factors. The test for equal variance is fairly simple and depends on whether all the data points in the dataset are normally distributed. If they are, the Bartlett Test is used; otherwise, the Levene Test is used.
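The choice between the two variance tests can be sketched with SciPy as follows. This is not the sci-analysis implementation: the helper `equal_variance_test` is hypothetical, and using Shapiro-Wilk on each group separately as the normality check is an assumption, since the text does not say which normality test is applied or whether it runs on pooled data.

```python
import numpy as np
from scipy import stats


def equal_variance_test(samples, alpha=0.05):
    """Hypothetical helper: pick Bartlett or Levene based on a normality check."""
    # Assumption: Shapiro-Wilk per group stands in for the normality check.
    all_normal = all(stats.shapiro(sample).pvalue > alpha for sample in samples)
    if all_normal:
        name, result = "Bartlett", stats.bartlett(*samples)
    else:
        name, result = "Levene", stats.levene(*samples)
    return name, result.statistic, result.pvalue


rng = np.random.default_rng(0)
normal_group = rng.normal(0, 1, 200)
skewed_group = rng.exponential(1.0, 200)  # clearly not normally distributed

name, statistic, p_value = equal_variance_test([normal_group, skewed_group])
print(f"{name}: statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```

Because one group is strongly skewed, the normality check fails and the sketch falls back to the Levene Test, which is less sensitive to departures from normality than the Bartlett Test.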
The logic for determining which hypothesis test to use for checking location is more complex and depends on the number of groups, whether the data points in the dataset are normally distributed, and the size of the smallest group.
The five possible hypothesis tests from most sensitive to least sensitive are:
The last thing shown for each hypothesis test is a statement of either the null hypothesis or the alternative hypothesis. Each hypothesis test has a null hypothesis that is assumed to be true. If the p-value of the test is lower than the significance level (alpha) of the test, the null hypothesis is rejected and the alternative hypothesis is stated. Rejecting the null hypothesis means that a result at least this extreme would be unlikely to occur by chance if the null hypothesis were true, so the alternative hypothesis is taken as the more plausible explanation.
Because the conclusion of a hypothesis test depends on an arbitrarily chosen significance level of 0.05, conclusions should be treated with some caution. This is why sci-analysis tries to use the most appropriate test for the supplied data and pairs the test with graphs as a second source of truth.
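The decision rule described above is simple enough to write down directly. The function name `conclude` is hypothetical and only mirrors the rule as stated; it is not part of sci-analysis.

```python
def conclude(p_value, alpha=0.05):
    """Return which hypothesis is stated for a p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # The null hypothesis is rejected only when the p-value is below alpha.
    return "alternative" if p_value < alpha else "null"


# The same p-value leads to different conclusions at different alpha levels.
p = 0.03
print(conclude(p, alpha=0.05))  # alternative
print(conclude(p, alpha=0.01))  # null
```

This is why the choice of significance level matters: a borderline p-value can flip the stated hypothesis.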
Let's set up some variables to use in these examples, reusing the imports from the first cell.
In [6]:
# Create sequence and groups from random variables for stacked data examples.
stacked = st.norm.rvs(2, 0.45, size=3000)
vals = 'ABCD'
stacked_groups = []
for _ in range(3000):
    stacked_groups.append(vals[np.random.randint(0, 4)])
When analyzing stacked data, both sequence and groups are required.
In [7]:
analyze(
    stacked,
    groups=stacked_groups,
)
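Stacked data keeps every value in one sequence with a parallel sequence of group labels, while unstacked data keeps one sequence per group. The sketch below converts the first form into the second; the helper `unstack` and the sample values are hypothetical and are not part of sci-analysis.

```python
from collections import defaultdict


def unstack(sequence, groups):
    """Split a stacked sequence into a dict of lists keyed by group label."""
    if len(sequence) != len(groups):
        raise ValueError("sequence and groups must have the same length")
    result = defaultdict(list)
    for value, label in zip(sequence, groups):
        result[label].append(value)
    return dict(result)


# Hypothetical stacked data: one value and one label per observation.
values = [2.1, 1.9, 2.4, 2.0, 1.7]
labels = ["A", "B", "A", "C", "B"]

unstacked = unstack(values, labels)
print(unstacked)  # {'A': [2.1, 2.4], 'B': [1.9, 1.7], 'C': [2.0]}
```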
When analyzing unstacked data, sequences can be a dictionary or an array-like of array-likes.
In [8]:
# Create sequences from random variables for unstacked data examples.
np.random.seed(987654321)
a = st.norm.rvs(2, 0.45, size=750)
b = st.norm.rvs(2, 0.45, size=750)
c = st.norm.rvs(2, 0.45, size=750)
d = st.norm.rvs(2, 0.45, size=750)
If sequences is an array-like of array-likes, and groups is None, category labels will be automatically generated starting at 1.
In [9]:
analyze([a, b, c, d])
If sequences is a dictionary, the keys will be used as category labels.
Note: When sequences is a dictionary, the categories will not necessarily be shown in order.
In [10]:
analyze({'A': a, 'B': b, 'C': c, 'D': d})
If analyzing stacked data, groups should be an array-like with the same length as sequence. If analyzing unstacked data, groups should be the same length as sequences and all values in groups should be unique.
In [11]:
analyze(
    [a, b, c, d],
    groups=['A', 'B', 'C', 'D'],
)
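The two rules for groups can be expressed as a small check. The function `check_groups` is a hypothetical sketch of those rules as described above; it does not reproduce any validation that sci-analysis itself performs.

```python
def check_groups(data, groups, stacked):
    """Hypothetical check of the length and uniqueness rules for groups."""
    # Stacked: one label per value. Unstacked: one label per sequence.
    if len(groups) != len(data):
        raise ValueError("groups must have the same length as the data")
    # Unstacked labels name whole sequences, so they must not repeat.
    if not stacked and len(set(groups)) != len(groups):
        raise ValueError("group labels must be unique for unstacked data")
    return True


# Stacked: labels may repeat because each one tags a single value.
print(check_groups([2.1, 1.9, 2.4], ["A", "B", "A"], stacked=True))

# Unstacked: one unique label per sequence.
print(check_groups([[2.1, 2.4], [1.9, 1.7]], ["A", "B"], stacked=False))
```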
The nqp argument controls whether the Normal Quantile Plot is displayed. The default value is True.
In [12]:
analyze(
    stacked,
    groups=stacked_groups,
    nqp=False,
)
The circles argument controls whether the Tukey-Kramer circles are displayed. The default value is True.
In [13]:
analyze(
    stacked,
    groups=stacked_groups,
    circles=False,
)
The alpha argument sets the significance level to use for hypothesis testing.
In [14]:
analyze(
    stacked,
    groups=stacked_groups,
    alpha=0.01,
)
The title argument sets the title to display above the graph.
In [15]:
analyze(
    stacked,
    groups=stacked_groups,
    title='This is a Title',
)
The name of the category labels to display on the x-axis. The examples below pass it first as categories and then as xname.
In [16]:
analyze(
    stacked,
    groups=stacked_groups,
    categories='Generated Categories',
)
In [17]:
analyze(
    stacked,
    groups=stacked_groups,
    xname='Generated Categories',
)
The label to display on the y-axis. The examples below pass it first as name and then as yname.
In [18]:
analyze(
    stacked,
    groups=stacked_groups,
    name='Generated Values',
)
In [19]:
analyze(
    stacked,
    groups=stacked_groups,
    yname='Generated Values',
)