Significance level
In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis.[3] More precisely, the significance level defined for a study, α, is the probability of the study rejecting the null hypothesis, given that it were true; and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis were true. The result is statistically significant, by the standards of the study, when p < α.
Link to Wikipedia article
Normal Distribution
In [1]:
# Kurt's Introduction
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/umJQ6gVT8kY" frameborder="0" allowfullscreen></iframe>')
Out[1]:
In [8]:
# Why is Statistics Useful?
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/DyeRm96wH5M" frameborder="0" allowfullscreen></iframe>')
Out[8]:
In [10]:
# Introduction to Normal (Gauss Distribution)
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ZfOTcwXAdEw" frameborder="0" allowfullscreen></iframe>')
Out[10]:
The equation for the normal distribution is:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
To be more explicit: $\mu$ is the mean of the distribution and $\sigma^2$ is its variance ($\sigma$ is the standard deviation).
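The formula can be checked numerically against scipy's built-in implementation. This is a quick sketch, not part of the original notebook; the values of $\mu$ and $\sigma$ below are arbitrary assumptions.
In [ ]:
# Sketch: evaluate the normal pdf formula above and compare it to scipy.stats.norm.pdf
import numpy as np
import scipy.stats

mu, sigma = 0.0, 1.0                    # assumed mean and standard deviation
x = np.linspace(-3, 3, 7)

# The formula from above
f_x = 1.0 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

# scipy's implementation of the same density
print(np.allclose(f_x, scipy.stats.norm.pdf(x, loc=mu, scale=sigma)))   # True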
If you would like to learn more about the t-test, check out this lesson in Intro to Inferential Statistics.
Welch's T-Test in Python
You can check out additional information about the scipy implementation of the t-test here: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
In [1]:
# t-Test video
from IPython.display import HTML
HTML('<iframe width="369" height="208" src="https://www.youtube.com/embed/tjSj2OkV51A" frameborder="0" allowfullscreen></iframe>')
Out[1]:
In [13]:
# Welch's Two-Sample t-Test
from IPython.display import HTML
HTML('<iframe width="369" height="208" src="https://www.youtube.com/embed/B_1cnwYn7so" frameborder="0" allowfullscreen></iframe>')
Out[13]:
Perform a t-test on two sets of baseball data (left-handed and right-handed hitters).
You will be given a csv file that has three columns: a player's name, handedness ('L' for left-handed or 'R' for right-handed), and their career batting average (called 'avg'). You can look at the csv file by downloading the baseball_stats file from Downloadables below.
Write a function that reads the csv file into a pandas data frame and runs Welch's t-test on the two cohorts defined by handedness. We have included the scipy.stats library to help you implement Welch's t-test: http://docs.scipy.org/doc/scipy/reference/stats.html
With a significance level of 0.05 (i.e., 95% confidence), if there is no difference between the two cohorts, return a tuple consisting of True and the tuple returned by scipy.stats.ttest_ind. If there is a difference, return a tuple consisting of False and the tuple returned by scipy.stats.ttest_ind.
For example, the tuple that you return may look like: (True, (9.93570222, 0.000023))
Supporting materials: baseball_stats.csv
In [ ]:
import numpy
import scipy.stats
import pandas

def compare_averages(filename):
    """
    Read the baseball csv file and run Welch's t-test on the career batting
    averages of left-handed vs. right-handed hitters, as described in the text above.
    """
    baseball_data = pandas.read_csv(filename)
    lh_player = baseball_data.loc[baseball_data['handedness'] == 'L', 'avg']
    rh_player = baseball_data.loc[baseball_data['handedness'] == 'R', 'avg']

    # Welch's t-test (unequal variances)
    (t, p) = scipy.stats.ttest_ind(lh_player, rh_player, equal_var=False)

    # True if we cannot reject the null hypothesis at the 0.05 level.
    result = (p > 0.05, (t, p))
    return result
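Calling the function on the supplied file is then a one-liner. This is just a usage sketch; the filename matches the Supporting materials above.
In [ ]:
compare_averages('baseball_stats.csv')   # e.g. (False, (t, p)) if the cohorts differ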
Your calculated t-statistic is 9.93570222624
The correct t-statistic is +/-9.93570222624
In [14]:
# Explanation for Welch's t-Test exercise
from IPython.display import HTML
HTML('<iframe width="550" height="309" src="https://www.youtube.com/embed/TrSU-GH7TDY" frameborder="0" allowfullscreen></iframe>')
Out[14]:
When performing the t-test, we assume that our data is normally distributed. In the wild, you'll often encounter probability distributions that are distinctly not normal. They might look like the two diagrams below, or something completely different.
As you might imagine, there are still statistical tests we can use when our data is not normal.
First off, we should have some machinery in place for determining whether or not our data is Gaussian in the first place. A crude, inaccurate way of determining whether or not our data is normal is simply to plot a histogram of the data and ask: does this look like a bell curve? In both of the cases above, the answer would definitely be no. But we can do a little better than that. There are statistical tests we can use to measure the likelihood that a sample is drawn from a normally distributed population. One such test is the Shapiro-Wilk test. The theory behind this test is outside the scope of this course, but you can run it easily like this:
(W, p) = scipy.stats.shapiro(data)
W is the Shapiro-Wilk test statistic and p is the corresponding p-value, which should be interpreted the same way as the p-value for our t-test: given the null hypothesis that this data is drawn from a normal distribution, what is the likelihood that we would observe a value of W at least as extreme as the one we see?
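Here is a minimal sketch (not from the original notebook) showing how the test behaves on simulated data; the sample sizes and random seed are arbitrary assumptions.
In [ ]:
# Sketch: Shapiro-Wilk on a normal sample vs. a clearly skewed sample
import numpy as np
import scipy.stats

np.random.seed(42)
normal_sample = np.random.normal(loc=0.0, scale=1.0, size=100)
skewed_sample = np.random.exponential(scale=1.0, size=100)

print(scipy.stats.shapiro(normal_sample))   # p should be large: no evidence against normality
print(scipy.stats.shapiro(skewed_sample))   # p should be tiny: reject normality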
A non-parametric test is a statistical test that does not assume our data is drawn from any particular underlying probability distribution.
One such test is the Mann-Whitney U test, which tests the null hypothesis that two populations are the same:
(U, p) = scipy.stats.mannwhitneyu(x, y)
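As with the Shapiro-Wilk example, here is a hedged sketch (not from the original notebook) of running the test; the simulated samples and seed are illustrative assumptions.
In [ ]:
# Sketch: Mann-Whitney U test on two non-normal samples with different scales
import numpy as np
import scipy.stats

np.random.seed(0)
x = np.random.exponential(scale=1.0, size=200)
y = np.random.exponential(scale=1.5, size=200)

(U, p) = scipy.stats.mannwhitneyu(x, y)
print(U, p)   # a small p suggests the two populations differ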
These have just been some of the methods that we can use when performing statistical tests on data. As you can imagine, there are a number of additional ways to handle data from different probability distributions or data that looks like it came from no probability distribution whatsoever.
Data scientists can perform many statistical procedures, but it's vital to understand the underlying structure of the data set and, consequently, which statistical tests are appropriate for the data we have.
There are many different types of statistical tests and even many different schools of thought within statistics regarding the correct way to analyze data. This has really just been an opportunity to get your feet wet with statistical analysis. It's just the tip of the iceberg.
In addition to statistics, many data scientists are well versed in machine learning. Machine Learning is a branch of artificial intelligence that's focused on constructing systems that learn from large amounts of data to make predictions.
There are many potential applications of machine learning.
In [3]:
# Why is Machine Learning Useful?
from IPython.display import HTML
HTML('<iframe width="846" height="476" src="https://www.youtube.com/embed/uKEm9_HvkKQ" frameborder="0" allowfullscreen></iframe>')
Out[3]:
In [2]:
# Kurt's Favorite ML Algorithm
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/qwUYjU_kmdc" frameborder="0" allowfullscreen></iframe>')
Out[2]:
In [ ]:
# Gradient Descent in Python
import numpy
import pandas

def compute_cost(features, values, theta):
    """
    Compute the cost of a list of parameters, theta, given a list of features
    (input data points) and values (output data points).
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2 * m)
    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """
    # Write code here that performs num_iterations updates to the elements of theta.
    # Every time you compute the cost for a given list of thetas, append it
    # to cost_history. See the Instructor notes for hints.
    cost_history = []
    m = len(values)

    ###########################
    ### YOUR CODE GOES HERE ###
    ###########################
    for iteration in range(num_iterations):
        # Record the cost for the current theta before updating it.
        cost_history.append(compute_cost(features, values, theta))

        # Batch gradient descent update: theta <- theta + (alpha/m) * X^T (y - X*theta)
        diff = numpy.dot(features.transpose(), values - numpy.dot(features, theta))
        theta = theta + (alpha / m) * diff

    return theta, pandas.Series(cost_history)  # leave this line for the grader
In [ ]:
Theta =
[ 45.35759233 -9.02442042 13.69229668]
Cost History =
0 3769.194036
1 3748.133469
2 3727.492258
3 3707.261946
4 3687.434249
5 3668.001052
6 3648.954405
7 3630.286519
8 3611.989767
9 3594.056675
10 3576.479921
11 3559.252334
12 3542.366888
13 3525.816700
14 3509.595027
15 3493.695263
16 3478.110938
17 3462.835711
18 3447.863371
19 3433.187834
20 3418.803138
21 3404.703444
22 3390.883030
23 3377.336290
24 3364.057733
25 3351.041978
26 3338.283754
27 3325.777897
28 3313.519347
29 3301.503147
...
970 2686.739779
971 2686.739192
972 2686.738609
973 2686.738029
974 2686.737453
975 2686.736881
976 2686.736312
977 2686.735747
978 2686.735186
979 2686.734628
980 2686.734074
981 2686.733523
982 2686.732975
983 2686.732431
984 2686.731891
985 2686.731354
986 2686.730820
987 2686.730290
988 2686.729764
989 2686.729240
990 2686.728720
991 2686.728203
992 2686.727690
993 2686.727179
994 2686.726672
995 2686.726168
996 2686.725668
997 2686.725170
998 2686.724676
999 2686.724185
dtype: float64
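The numbers above come from the grader's data set. As a minimal sketch of how the function can be exercised (the synthetic data, learning rate, and iteration count below are assumptions, not the grader's values):
In [ ]:
# Sketch: run gradient_descent on a small synthetic linear-regression problem
import numpy
import pandas

numpy.random.seed(1)
m = 50
features = numpy.column_stack([numpy.ones(m), numpy.random.rand(m, 2)])   # intercept + 2 features
true_theta = numpy.array([2.0, -3.0, 5.0])
values = numpy.dot(features, true_theta) + numpy.random.normal(scale=0.1, size=m)

theta, cost_history = gradient_descent(features, values, numpy.zeros(3),
                                        alpha=0.1, num_iterations=1000)
print(theta)                 # should move toward [2, -3, 5]
print(cost_history.tail())   # cost should decrease as the iterations proceed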
We need some way to evaluate the effectiveness of our models. One such measure is the coefficient of determination, also referred to as $R^2$, which we can define as:
$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $y_i$ are the observed values, $f_i$ are the model's predictions, and $\bar{y}$ is the mean of the observed values.
Note:
The closer $R^2$ is to 1, the better our model fits the data; the closer it is to 0, the poorer the fit.
In [ ]:
import numpy as np

def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data' and 'predictions',
    # returns the coefficient of determination, R^2, for the model that produced
    # the predictions.
    #
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.

    # YOUR CODE GOES HERE
    r_squared = 1 - np.sum((predictions - data)**2) / np.sum((data - np.mean(data))**2)
    return r_squared
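A quick sanity check of the function (a sketch with made-up numbers, not course data):
In [ ]:
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
perfect = data.copy()
noisy = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

print(compute_r_squared(data, perfect))   # 1.0 for perfect predictions
print(compute_r_squared(data, noisy))     # close to, but below, 1.0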
Qualitative viewpoint: for any problem you're looking at, it's always very valuable to start by thinking about the data qualitatively.
Using k-means clustering and PCA is a good start for doing some sort of dimensionality reduction, some way of getting the data to the point where you can look at it and get some qualitative insights. Once you understand the general structure, you can start to see patterns emerging from the data that make sense, and that either confirm or go against other theories or ingrained beliefs people have. Getting the data down to that point is very important.
Quantitative viewpoint: trying to understand causal connections, that is, which features are actually causing the outcome. It's important to use a lot of caution here, and never just dump a bunch of data into a model with lots of features and then naively look at the things with the strongest weights in your model and declare that's what's driving it.
There are three areas or parts to this kind of work, and you should think about which parts you really enjoy the most.
Note: A quick overview on dimensionality reduction and PCA (principal component analysis): http://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
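As a hedged illustration of the qualitative workflow described above, here is a sketch using scikit-learn (which is not introduced elsewhere in these notes); the random data, number of components, and cluster count are arbitrary assumptions.
In [ ]:
# Sketch: reduce dimensionality with PCA, then take a first qualitative look with k-means
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

np.random.seed(0)
X = np.random.rand(200, 10)              # 200 samples, 10 features (illustrative data only)

# Project down to 2 dimensions so the data can be plotted and inspected
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Group the projected points into 3 clusters as a rough first pass
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)

print(pca.explained_variance_ratio_)     # how much variance the 2 components capture
print(np.bincount(labels))               # cluster sizes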
In [1]:
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/zS9SmHPVjJs" frameborder="0" allowfullscreen></iframe>')
Out[1]:
In [3]:
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/wuUQl3o_hVI" frameborder="0" allowfullscreen></iframe>')
Out[3]:
In [ ]:
# Assignment 3
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/OWGuZuBxS8E" frameborder="0" allowfullscreen></iframe>')
In [ ]:
# Lesson 3 Recap
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/u1Sh-BjiFfM" frameborder="0" allowfullscreen></iframe>')