A bivariate analysis differs from a univariate, or distribution, analysis in that it analyzes two separate sets of data. These two sets of data are compared to one another to check for correlation, that is, a tendency of one set of data to "predict" corresponding values in the other data set. If a linear or higher-order model can be applied to describe, or model, the two sets of data, they are said to be correlated.
When two distributions are correlated, it is possible that the data in one of the distributions can be used to predict a corresponding value in the second distribution. This first distribution is referred to as the predictor and the second distribution as the response. Both predictor and response are graphed by a scatter plot, typically with the predictor on the x-axis and the response on the y-axis.
Note: Just because two sets of data correlate with one another does not necessarily mean that one predicts the other. It merely means that one possibly predicts the other. This is summarized by the saying "Correlation does not imply causation." Use caution when drawing conclusions from a bivariate analysis. It is a good idea to study both data sets more carefully to determine whether a predictive relationship actually exists.
Let's first import sci-analysis and set up some variables to use in these examples.
In [2]:
import numpy as np
import scipy.stats as st
from sci_analysis import analyze
%matplotlib inline
In [3]:
# Create x-sequence and y-sequence from random variables.
np.random.seed(987654321)
x_sequence = st.norm.rvs(2, size=2000)
y_sequence = np.array([x + st.norm.rvs(0, 0.5, size=1) for x in x_sequence])
A scatter plot is used in sci-analysis to visualize the correlation between two sets of data. For this to work, each value in the first set of data has a corresponding value in the second set of data. The two values are tied together by the matching index value in each set of data. The lengths of the two sets of data have to be equal to one another, and the index values of each data set have to be contiguous. If there is a missing value or values in one data set, the matching value at the same index in the other data set will be dropped.
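The pairwise dropping of missing values described above can be sketched with plain NumPy (an illustration of the behavior, not sci-analysis internals; the variable names are hypothetical):

```python
import numpy as np

# Two paired sequences with a missing value at index 2 of y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.2, np.nan, 3.9])

# Keep only the indices where both values are present.
mask = ~np.isnan(x) & ~np.isnan(y)
x_clean, y_clean = x[mask], y[mask]

print(x_clean)  # [1. 2. 4.]
print(y_clean)  # [1.1 2.2 3.9]
```

The value 3.0 in `x` is dropped along with the missing value in `y`, so the two cleaned sequences stay paired by index.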
By default, the best-fit line (assuming a linear relationship) is drawn as a dotted red line.
In [4]:
analyze(x_sequence, y_sequence)
Boxplots can be displayed alongside the x and y axes of the scatter plot. This is a useful tool for visualizing the distributions of the two sets of data on the x and y axes while still displaying the scatter plot.
In [6]:
analyze(x_sequence, y_sequence, boxplot_borders=True)
In certain cases, such as when one of the sets of data is discrete and the other is continuous, it might be difficult to determine where the data points are centered. In this case, density contours can be used to help visualize the joint probability distribution between the two sets of data.
In [17]:
x_continuous = st.weibull_max.rvs(2.7, size=2000)
y_discrete = st.geom.rvs(0.5, loc=0, size=2000)
analyze(x_continuous, y_discrete, contours=True, fit=False)
If each set of data contains discrete and equivalent groups, the scatter plot can show each group in a separate color.
In [22]:
# Create new x-grouped and y-grouped from independent groups A, B, and C.
a_x = st.norm.rvs(2, size=500)
a_y = np.array([x + st.norm.rvs(0, 0.5, size=1) for x in a_x])
b_x = st.norm.rvs(4, size=500)
b_y = np.array([1.5 * x + st.norm.rvs(0, 0.65, size=1) for x in b_x])
c_x = st.norm.rvs(1.5, size=500)
c_y = np.array([3 * x + st.norm.rvs(0, 0.95, size=1) - 1 for x in c_x])
x_grouped = np.concatenate((a_x, b_x, c_x))
y_grouped = np.concatenate((a_y, b_y, c_y))
grps = np.concatenate((['Group A'] * 500, ['Group B'] * 500, ['Group C'] * 500))
In [23]:
analyze(
x_grouped,
y_grouped,
groups=grps,
boxplot_borders=False,
)
The linear regression finds the least-squares best-fit line between the predictor and response. The linear relationship between the predictor and response is described by the equation y = mx + b, where x is the predictor, y is the response, m is the slope, and b is the y-intercept.
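For reference, the slope m and intercept b of such a least-squares fit can be recovered with np.polyfit; a minimal sketch on synthetic data (independent of sci-analysis), where the true slope is 1.5 and the true intercept is 0.5:

```python
import numpy as np

np.random.seed(987654321)
x = np.random.normal(2, 1, size=2000)
y = 1.5 * x + 0.5 + np.random.normal(0, 0.5, size=2000)

# Fit y = m*x + b by least squares; coefficients are returned
# in descending degree order, so slope first.
m, b = np.polyfit(x, y, deg=1)
print(round(m, 1), round(b, 1))  # m ≈ 1.5, b ≈ 0.5
```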
If the data points in both sets of data are normally distributed, the Pearson correlation coefficient is calculated; otherwise, the Spearman rank correlation coefficient is calculated. A correlation coefficient of 0 indicates no relationship, whereas 1 or -1 indicates a perfect positive or negative correlation between the predictor and response. For both correlation coefficients, the null hypothesis is that the correlation coefficient is 0, signifying no relationship between the predictor and response. If the p value is less than the significance level $\alpha$, the predictor and response are correlated.
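The Pearson coefficient itself follows directly from its definition, the covariance of x and y divided by the product of their standard deviations; a minimal NumPy sketch (independent of sci-analysis):

```python
import numpy as np

np.random.seed(987654321)
x = np.random.normal(2, 1, size=2000)
y = x + np.random.normal(0, 0.5, size=2000)

# Pearson r: covariance of x and y divided by the product of
# their sample standard deviations.
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(r, 2))  # close to 0.89 for this data
```

For x with unit variance and added noise of standard deviation 0.5, the expected value of r is 1 / sqrt(1.25), or about 0.894.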
These are the bare minimum requirements for performing a Bivariate analysis. The lengths of x-sequence and y-sequence must be equal; otherwise, an UnequalVectorLengthError is raised.
In [4]:
analyze(
x_sequence,
y_sequence,
)
Controls whether the best-fit line is displayed or not.
In [5]:
analyze(
x_sequence,
y_sequence,
fit=False,
)
Controls whether the data points of the scatter plot are displayed or not.
In [6]:
analyze(
x_sequence,
y_sequence,
points=False,
)
Controls whether boxplots are displayed for x-sequence and y-sequence.
In [7]:
analyze(
x_sequence,
y_sequence,
boxplot_borders=True,
)
Controls whether the density contours are displayed or not. The contours can be useful when analyzing joint probability distributions.
In [8]:
analyze(
x_sequence,
y_sequence,
contours=True,
)
Used in conjunction with one another, the labels and highlight arguments display data values for individual data points on the scatter plot.
In [9]:
labels = np.random.randint(low=10000, high=99999, size=2000)
In [10]:
analyze(
x_sequence,
y_sequence,
labels=labels,
highlight=[66286]
)
The groups argument can be used to perform a Bivariate analysis on separate collections of data points that can be compared to one another.
In [18]:
# Create new x-grouped and y-grouped from independent groups A, B, and C.
a_x = st.norm.rvs(2, size=500)
a_y = np.array([x + st.norm.rvs(0, 0.5, size=1) for x in a_x])
b_x = st.norm.rvs(4, size=500)
b_y = np.array([1.5 * x + st.norm.rvs(0, 0.65, size=1) for x in b_x])
c_x = st.norm.rvs(1.5, size=500)
c_y = np.array([3 * x + st.norm.rvs(0, 0.95, size=1) - 1 for x in c_x])
x_grouped = np.concatenate((a_x, b_x, c_x))
y_grouped = np.concatenate((a_y, b_y, c_y))
grps = np.concatenate((['Group A'] * 500, ['Group B'] * 500, ['Group C'] * 500))
In [19]:
analyze(
x_grouped,
y_grouped,
groups=grps,
)
Using the groups argument is a great way to compare treatments. When combined with the highlight argument, a particular group can be highlighted on the scatter plot to stand out from the others.
In [20]:
analyze(
x_grouped,
y_grouped,
groups=grps,
highlight=['Group A'],
)
Multiple groups can also be highlighted.
In [21]:
analyze(
x_grouped,
y_grouped,
groups=grps,
highlight=['Group A', 'Group B'],
)
The title to display above the graph.
In [16]:
x_sequence = st.norm.rvs(2, size=2000)
y_sequence = np.array([x + st.norm.rvs(0, 0.5, size=1) for x in x_sequence])
In [17]:
analyze(
x_sequence,
y_sequence,
title='This is a Title',
)
The name of the data on the x-axis.
In [18]:
analyze(
x_sequence,
y_sequence,
xname='This is the x-axis data'
)
The name of the data on the y-axis.
In [20]:
analyze(
x_sequence,
y_sequence,
yname='This is the y-axis data'
)