Timothy Helton
NOTE:
This notebook uses code found in the
k2datascience.hr_analytics module.
To execute all of the cells, ensure the k2datascience package is importable by the active Python interpreter (for example, by installing it into the current environment).
In [1]:
from k2datascience import hr_analytics
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
In [2]:
hr = hr_analytics.HR()
The data set we will use for this exercise comes from a Kaggle challenge and is often used for predictive analytics, namely to predict why the best and most experienced employees tend to leave the company. We won't be using it for any predictive purposes here, but will instead use this data set to review many of the concepts explored in the Statistical Inference text.
This data contains fields for various measures of employee performance and reported satisfaction levels, as well as some categorical variables for events and salary level. For now, just explore the data a bit to get a general idea of what is going on.
In [3]:
print(f'Data Shape\n\n{hr.data.shape}')
print('\n\nColumns\n\n{}'.format('\n'.join(hr.data.columns)))
hr.data.head()
Out[3]:
In [4]:
hr.box_plot();
The concepts of probability, expectation values, and variance are the bedrock of statistical inference. Let's begin by employing some of these concepts to see whether we can find some interesting avenues that may provide insight into the inner workings of this company.
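These probabilities can also be computed directly from the DataFrame. A minimal sketch (not part of the hr_analytics module) is shown below; the column name work_accident is an assumption about how the module labels that field, while the left column is used later in this notebook.
# Minimal sketch: marginal and joint probabilities straight from the DataFrame.
# NOTE: the column name 'work_accident' is an assumption about the module's naming.
p_left = (hr.data.left == 1).mean()                                   # P(left)
p_accident = (hr.data.work_accident == 1).mean()                      # P(accident)
p_both = ((hr.data.left == 1) & (hr.data.work_accident == 1)).mean()  # P(left and accident)
print(f'{p_left:.3f}, {p_accident:.3f}, {p_both:.3f}')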
In [5]:
print(f'P(employee left the company) = {hr.p_left_company:.3f}')
print(f'P(employee experienced a work accident) = {hr.p_work_accident:.3f}')
print(f'P(employee experienced accident and left company) = {hr.p_left_and_accident:.3f}')
In [6]:
hr.compare_satisfaction()
Out[6]:
In [7]:
hours_variance, hours_std = hr.calc_hours_stats()
print(f'Hours Worked Variance: {hours_variance:.3f}')
print(f'Hours Worked Standard Deviation: {hours_std:.3f}')
In [8]:
satisfaction_ex, satisfaction_current = hr.compare_satisfaction_variance()
print(f'Ex-Employee Job Satisfaction Variance: {satisfaction_ex:.3f}')
print(f'Current Employee Job Satisfaction Variance: {satisfaction_current:.3f}')
In [9]:
hr.calc_satisfaction_salary()
Out[9]:
In [10]:
hr.calc_p_hours_salary()
Out[10]:
In [11]:
hr.calc_p_left_salary()
Out[11]:
In [12]:
hr.calc_p_salary_promotion()
Out[12]:
In [13]:
print(f'Approximate Sample Satisfaction Mean: {hr.data.satisfaction.mean():.3f}')
In [14]:
sample_n = 10
sample_means = []
for n in range(sample_n):
    sample_means.append(hr.calc_satisfaction_random_sample(50))
sample_mean = '\n'.join([f'{x:.3f}' for x in sample_means])
print(f'Actual Sample Satisfaction Mean: {sum(sample_means) / sample_n}')
print('Actual Sample Satisfaction Values:')
print(f'{sample_mean}')
Bernoulli distributions arise from a random variable with a binary outcome, like a coin flip or a medical test giving a positive or negative result. A Bernoulli random variable $X$ takes only the values 1 and 0, with probabilities $p$ and $1 - p$ respectively; it has mean $p$, variance $p(1 - p)$, and PMF given by $$ P(X = x) = p^x (1 - p)^{1 - x} $$ Bernoulli random variables crop up frequently in statistical analysis, most often in the form of Binomial trials, i.e. as a sum of $n$ independent Bernoulli variables, with PMF given by $$ P(X = x) = {n \choose x} p^x (1 - p)^{n - x} $$ where $$ {n \choose x} = \frac{n!}{x!(n - x)!} $$ In this exercise you'll take a look at the HR data and apply these concepts to gain some insight.
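As a quick, self-contained illustration of the binomial PMF above (independent of the hr_analytics module), the sketch below compares a hand-computed probability with scipy; the values of p, n, and k are arbitrary.
from math import comb
from scipy import stats

p, n, k = 0.24, 20, 5                            # hypothetical success probability, trials, successes
manual = comb(n, k) * p**k * (1 - p)**(n - k)    # binomial PMF computed by hand
library = stats.binom.pmf(k, n, p)               # same value via scipy
print(f'P(X = {k}) = {manual:.4f} (scipy: {library:.4f})')
print(f'mean = {n * p:.2f}, variance = {n * p * (1 - p):.2f}')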
Using the HR data, answer the following.
In [15]:
print('\n'.join(hr.bernoulli_vars))
In [16]:
hr.calc_p_bernoulli()
Out[16]:
In [17]:
hr.calc_bernoulli_variance()
Out[17]:
In [18]:
hr.calc_p_bernoulli_k()
Out[18]:
In [19]:
hr.calc_p_bernoulli_k(cumulative=True)
Out[19]:
In [20]:
hr.bernoulli_plot()
The Normal distribution (sometimes called the Bell Curve or Gaussian) is by far the most prevalent and useful distribution in any field that utilizes statistical techniques. In fact, it can be shown that the means of random variables sampled from almost any distribution approach a normal distribution given a sufficiently large sample size.
A normal distribution is characterized by the PDF $$p(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$
where $\mu$ is the mean and $\sigma^2$ is the variance, so the distribution is characterized by its mean and variance alone. In this exercise, you'll examine some of the variables in the HR dataset and construct normal distributions approximating them.
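As a sanity check of the PDF above (again independent of the hr_analytics module), the sketch below evaluates the formula by hand and compares it with scipy; mu and sigma are arbitrary illustrative values.
import numpy as np
from scipy import stats

mu, sigma = 0.6, 0.25                  # hypothetical mean and standard deviation
x = np.linspace(0, 1, 5)               # a few points at which to evaluate the PDF
manual = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
library = stats.norm.pdf(x, loc=mu, scale=sigma)
print(np.allclose(manual, library))    # True: the hand computation matches scipy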
Using the HR data, answer the following
In [21]:
print('\n'.join(hr.normal_vars))
In [22]:
hr.gaussian_plot()
In [23]:
hr.norm_stats
Out[23]:
In [24]:
hr.gaussian_plot(normal_overlay=True)
The Poisson distribution is very versatile but is typically used to model counts, such as the number of clicks per advertisement or the number of flights arriving per unit time. It has a PMF given by $$ P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} $$ where the mean and variance are both equal to $\lambda$.
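The sketch below (independent of the hr_analytics module) evaluates this PMF by hand and with scipy for an arbitrary rate lam.
from math import exp, factorial
from scipy import stats

lam = 3.8                                       # hypothetical mean count per unit time
for k in range(5):
    manual = lam**k * exp(-lam) / factorial(k)  # Poisson PMF computed by hand
    library = stats.poisson.pmf(k, lam)         # same value via scipy
    print(f'P(X = {k}) = {manual:.4f} (scipy: {library:.4f})')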
Using the HR data, answer the following.
In [25]:
print('\n'.join(hr.poisson_vars))
In [26]:
poisson = hr.poisson_distributions()
poisson
poisson[[f'p_{x}' for x in hr.poisson_vars]]
Out[26]:
Out[26]:
The Central Limit Theorem is perhaps one of the most remarkable results in statistics and mathematics in general. In short, it says that the distribution of means of independent random variables, sampled from any distribution, tends to approach a normal distribution as the sample size increases.
An example of this would be taking a pair of dice, rolling them, and recording the mean of each roll. The Central Limit Theorem states that, after enough rolls, the distribution of these means will be approximately normal. Stated formally, the result is $$ \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \approx N(0, 1), \qquad \text{or equivalently} \qquad \bar{X}_n \approx N\left(\mu, \sigma^2 / n\right) $$ In this exercise, you'll conduct some simulation experiments to explore this idea.
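The simulation sketch below (independent of the hr_analytics module) draws sample means from a decidedly non-normal exponential distribution and shows that their spread shrinks like $\sigma / \sqrt{n}$ as the sample size grows.
import numpy as np

rng = np.random.default_rng(0)
scale = 2.0                                    # exponential mean and standard deviation
for n in (10, 100, 1000):
    # 1000 replications of "draw n values, take the mean"
    means = rng.exponential(scale=scale, size=(1000, n)).mean(axis=1)
    print(f'n = {n:4d}: mean of means = {means.mean():.3f}, '
          f'std of means = {means.std():.3f} (theory: {scale / np.sqrt(n):.3f})')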
Using the HR data, answer the following.
In [27]:
print('\n'.join(hr.central_limit_vars))
For each of these variables, draw n = 10, n = 100, n = 500, and n = 1000 samples and take the mean. Repeat this 1000 times for each variable.
In [28]:
hr.central_limit_plot()
Hypothesis testing is essentially using the data to answer questions of interest. For example, does a new medication provide any benefit over a placebo? Is a subset of the population disproportionately more susceptible to a particular disease? Is the difference between two companies' profits significant or due to chance alone?
Before doing some hypothesis testing on the HR data, recall that hypotheses typically come in pairs of the form $H_0$, called the null hypothesis, versus $H_a$, called the alternative hypothesis. The null hypothesis represents the "default" assumption -- that a medication has no effect, for example -- while the alternative hypothesis represents what we are actually looking to discover; in the medication case, whether it provides a significant benefit. Another common case is testing the difference between two means. Here, the null hypothesis is that there is no difference between the two population means, whereas the alternative hypothesis is that there is a difference. Stated more precisely $$H_0: \mu_1 - \mu_2 = 0$$ $$H_a: \mu_1 - \mu_2 \ne 0$$
Hypotheses are usually tested by constructing a confidence interval around the test statistic and selecting a "cut-off" significance level, denoted $\alpha$; a typical choice is $\alpha = 0.05$. If a test produces a p-value of $\alpha$ or below, then the null hypothesis can be rejected, strengthening the case for the alternative hypothesis. It is very important to remember that hypothesis testing can only tell you whether a result is statistically significant; it cannot tell you whether the result is scientifically significant, which requires much more evidence.
In this exercise you'll explore the HR data more and test some hypotheses.
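Before using the module's helpers, here is a minimal sketch of a two-sample test for a difference in means using scipy directly; the left and satisfaction columns are the same ones used in the cells below.
from scipy import stats

left = hr.data.query('left == 1').satisfaction
stayed = hr.data.query('left == 0').satisfaction

# Welch's t-test: does not assume the two groups share a variance.
t_stat, p_value = stats.ttest_ind(left, stayed, equal_var=False)
print(f't = {t_stat:.3f}, p = {p_value:.3g}')  # p <= alpha (0.05) -> reject H0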
Using the HR data, answer the following.
In [29]:
left = hr.data.query('left == 1').satisfaction
stayed = hr.data.query('left == 0').satisfaction
comparison = hr.compare_confidence(left, 'left', stayed, 'stayed', 0.95)
comparison
Out[29]:
In [30]:
ttest = hr.t_test(left, 'left', stayed, 'stayed')
mean_diff = abs(left.mean() - stayed.mean())
variance_diff = abs(left.var() - stayed.var())
print(f'T-test P-value: {ttest[1]:.3f}')
print(f'Difference of Means: {mean_diff:.3f}')
print(f'Difference of Variances: {variance_diff:.3f}')
In [31]:
low = hr.data.query('salary == "low"').satisfaction
medium = hr.data.query('salary == "medium"').satisfaction
high = hr.data.query('salary == "high"').satisfaction
In [32]:
hr.t_test(low, 'low',
hr.data.satisfaction, 'All Satisfaction Data',
independent_vars=False)
Out[32]:
In [33]:
hr.t_test(medium, 'medium',
hr.data.satisfaction, 'All Satisfaction Data',
independent_vars=False)
Out[33]:
In [34]:
hr.t_test(high, 'high',
hr.data.satisfaction, 'All Satisfaction Data',
independent_vars=False)
Out[34]:
In [35]:
hr.t_test(low, 'low', medium, 'medium', high, 'high');
In [36]:
low = hr.data.query('salary == "low"').evaluation
medium = hr.data.query('salary == "medium"').evaluation
high = hr.data.query('salary == "high"').evaluation
In [37]:
hr.t_test(low, 'low',
hr.data.evaluation, 'All Evaluation Data',
independent_vars=False)
Out[37]:
In [38]:
hr.t_test(medium, 'medium',
hr.data.evaluation, 'All Evaluation Data',
independent_vars=False)
Out[38]:
In [39]:
hr.t_test(high, 'high',
hr.data.evaluation, 'All Evaluation Data',
independent_vars=False)
Out[39]:
In [40]:
hr.t_test(low, 'low', medium, 'medium', high, 'high');
In [41]:
high.mean()
hr.data.evaluation.mean()
Out[41]:
Out[41]:
In [42]:
medium = hr.data.query('salary == "medium"').satisfaction
high = hr.data.query('salary == "high"').satisfaction
hr.calc_power(medium, high)
Out[42]:
Bootstrapping is an immensely useful technique in practice. Very often you may find yourself in a situation where you want to estimate a statistic, or the uncertainty in it, but lack sufficient data (or a convenient formula) to do so directly. Bootstrapping works as a remedy to this problem by repeatedly resampling the data you do have.
Recall that the bootstrapping algorithm breaks down as follows:
1. Draw a sample of size n from the observed data, sampling with replacement.
2. Compute the statistic of interest (here, the median) on that resampled data.
3. Repeat steps 1 and 2 b times, then use the resulting collection of b statistics to estimate the sampling distribution, e.g. its mean, standard deviation, or a confidence interval.
In this exercise you will implement this algorithm on the HR data.
Write a function that can perform bootstrapping for the median of a set of n samples in the HR data set. Test this function on the satisfaction_level data with n = 100 and b = 100, and compare your results to the true median. Also compute the standard deviation of the bootstrapped medians.
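A minimal numpy sketch of that procedure is shown below; the helper name bootstrap_median is hypothetical, and the notebook's actual answer uses hr.bootstrap in the next cell.
import numpy as np

def bootstrap_median(values, n=100, b=100, seed=0):
    """Return b bootstrapped medians, each computed from n values resampled with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    return np.array([np.median(rng.choice(values, size=n, replace=True))
                     for _ in range(b)])

medians = bootstrap_median(hr.data.satisfaction)
print(f'bootstrapped median: {medians.mean():.3f} +/- {medians.std():.3f} '
      f'(true median: {hr.data.satisfaction.median():.3f})')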
In [43]:
hr.bootstrap(hr.data.satisfaction, n=100, sets=100)
Out[43]: