Lesson 3: Data Analysis

Statistics

Terminology

  • Significance level: In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. More precisely, the significance level defined for a study, α, is the probability of the study rejecting the null hypothesis, given that it is true; and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true. The result is statistically significant, by the standards of the study, when p < α. (See the Wikipedia article on statistical significance.)
  • Normal Distribution

In [1]:
# Kurt's Introduction
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/umJQ6gVT8kY" frameborder="0" allowfullscreen></iframe>')


Out[1]:

In [8]:
# Why is Statistics Useful?
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/DyeRm96wH5M" frameborder="0" allowfullscreen></iframe>')


Out[8]:

In [10]:
# Introduction to Normal (Gauss Distribution)
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ZfOTcwXAdEw" frameborder="0" allowfullscreen></iframe>')


Out[10]:

The equation for the normal distribution is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
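As a quick sanity check on this formula, here is a small sketch (not part of the original lesson; the mu, sigma, and x values are purely illustrative) that evaluates the density directly and compares it with scipy.stats.norm.pdf:

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0                      # illustrative parameters
x = np.linspace(-3, 3, 7)                 # a few sample points

# Evaluate the density formula above directly
f = 1.0 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

# Compare against scipy's implementation of the same density
print(np.allclose(f, norm.pdf(x, loc=mu, scale=sigma)))   # True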

T-Test

To be more explicit:

  • It is important to note that you cannot "accept" a null hypothesis.
  • You can only "retain" it or "fail to reject" it.

If you would like to learn more about the t-test, check out this lesson in Intro to Inferential Statistics.

Welch's t-Test in Python

You can check out additional information about the scipy implementation of the t-test here: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html


In [1]:
# t-Test video
from IPython.display import HTML
HTML('<iframe width="369" height="208" src="https://www.youtube.com/embed/tjSj2OkV51A" frameborder="0" allowfullscreen></iframe>')


Out[1]:

In [13]:
# Welch's Two-Sample t-Test
from IPython.display import HTML
HTML('<iframe width="369" height="208" src="https://www.youtube.com/embed/B_1cnwYn7so" frameborder="0" allowfullscreen></iframe>')


Out[13]:

14. Quiz - Welch's t-Test Exercise

Perform a t-test on two sets of baseball data (left-handed vs. right-handed hitters).

You will be given a csv file that has three columns: a player's name, handedness ('L' for left-handed or 'R' for right-handed), and their career batting average (called 'avg'). You can look at the csv file by downloading the baseball_stats file from Downloadables below.

Write a function that will:

  • read the csv file into a pandas data frame, and
  • run Welch's t-test on the two cohorts defined by handedness:
    • one cohort should be a data frame of right-handed batters, and
    • the other cohort should be a data frame of left-handed batters.

We have included the scipy.stats library to help you implement Welch's t-test: http://docs.scipy.org/doc/scipy/reference/stats.html

Using a 95% confidence level (α = 0.05): if there is no significant difference between the two cohorts, return a tuple consisting of True, and then the tuple returned by scipy.stats.ttest_ind.

If there is a significant difference, return a tuple consisting of False, and then the tuple returned by scipy.stats.ttest_ind.

For example, the tuple that you return may look like: (True, (9.93570222, 0.000023))

Supporting materials: baseball_stats.csv


In [ ]:
import numpy
import scipy.stats
import pandas

def compare_averages(filename):
    """
    Read the baseball csv file and run Welch's t-test on the career batting
    averages of left-handed vs. right-handed hitters (see the quiz
    description above).
    """
    baseball_data = pandas.read_csv(filename)
    lh_player = baseball_data.loc[baseball_data['handedness'] == 'L', 'avg']
    rh_player = baseball_data.loc[baseball_data['handedness'] == 'R', 'avg']
    
    # Welch's t-test (equal_var=False gives the Welch variant)
    (t, p) = scipy.stats.ttest_ind(lh_player, rh_player, equal_var=False)
    
    # True if we fail to reject the null hypothesis at alpha = 0.05
    result = (p > 0.05, (t, p))
    
    return result

Your calculated t-statistic is 9.93570222624
The correct t-statistic is +/-9.93570222624


In [14]:
# Explanation of the Welch's t-Test exercise
from IPython.display import HTML
HTML('<iframe width="550" height="309" src="https://www.youtube.com/embed/TrSU-GH7TDY" frameborder="0" allowfullscreen></iframe>')


Out[14]:

Non-normal Data

When performing the t-test, we assume that our data is normal. In the wild, you'll often encounter probability distributions that are distinctly not normal; they might look like the two diagrams shown in the lesson, or something completely different.

As you might imagine, there are still statistical tests that we can utilize when our data is not normal.

First off, we should have some machinery in place for determining whether or not our data is Gaussian in the first place. A crude, inaccurate way of deciding whether our data is normal is simply to plot a histogram of it and ask: does this look like a bell curve? In both of the cases shown, the answer would definitely be no. But we can do a little better than that. There are statistical tests we can use to measure the likelihood that a sample is drawn from a normally distributed population. One such test is the Shapiro-Wilk test. The theory behind this test is outside the scope of this course, but you can run it easily like this:

(W, p) = scipy.stats.shapiro(data)
  • W is the Shapiro-Wilk test statistic,
  • p is the p-value, which should be interpreted the same way as we would interpret the p-value for our t-test.

That is, given the null hypothesis that this data is drawn from a normal distribution, what is the likelihood that we would observe a value of W at least as extreme as the one we see?
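As a small sketch of how this is used (not part of the original lesson; the data is randomly generated, so the exact numbers will vary):

import numpy as np
import scipy.stats

np.random.seed(0)
normal_sample = np.random.normal(loc=0.0, scale=1.0, size=500)    # should look normal
skewed_sample = np.random.exponential(scale=1.0, size=500)        # clearly not normal

# Shapiro-Wilk test: a small p-value is evidence against normality
(W1, p1) = scipy.stats.shapiro(normal_sample)
(W2, p2) = scipy.stats.shapiro(skewed_sample)
print(W1, p1)   # large p: no evidence against normality
print(W2, p2)   # very small p: reject normality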

Non-Parametric Test

A non-parametric test is a statistical test that does not assume our data is drawn from any particular underlying probability distribution.

The Mann-Whitney U test tests the null hypothesis that two populations are the same:

(U, p) = scipy.stats.mannwhitneyu(x, y)
  • where x and y are the two samples.
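A minimal usage sketch (again not from the lesson; the samples are randomly generated and purely illustrative):

import numpy as np
import scipy.stats

np.random.seed(0)
x = np.random.exponential(scale=1.0, size=200)    # first sample
y = np.random.exponential(scale=1.5, size=200)    # second sample, different scale

# A small p-value suggests the two populations differ.
# (Depending on the scipy version, the default alternative may be one- or two-sided.)
(U, p) = scipy.stats.mannwhitneyu(x, y)
print(U, p)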

Note

These have just been some of the methods that we can use when performing statistical tests on data. As you can imagine, there are a number of additional ways to handle data from different probability distributions or data that looks like it came from no probability distribution whatsoever.

Data scientists can perform many statistical procedures, but it's vital to understand the underlying structure of the data set and, consequently, which statistical tests are appropriate given the data that we have.

There are many different types of statistical tests and even many different schools of thought within statistics regarding the correct way to analyze data. This has really just been an opportunity to get your feet wet with statistical analysis. It's just the tip of the iceberg.

2. What is Machine Learning?

In addition to statistics, many data scientists are well versed in machine learning. Machine Learning is a branch of artificial intelligence that's focused on constructing systems that learn from large amounts of data to make predictions.

There are many potential applications of machine learning, some of which are discussed in the video below.


In [3]:
# Why is Machine Learning Useful?

from IPython.display import HTML
HTML('<iframe width="846" height="476" src="https://www.youtube.com/embed/uKEm9_HvkKQ" frameborder="0" allowfullscreen></iframe>')


Out[3]:

Statistics vs. Machine Learning

What is the difference between statistics and machine learning?


In [2]:
# Kurt's Favorite ML Algorithm
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/qwUYjU_kmdc" frameborder="0" allowfullscreen></iframe>')


Out[2]:

Different Types of Learning

Prediction with Regression

Linear Regression with Gradient Descent

Cost Function
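These headings follow the lesson videos. For reference (this equation is not written out in the notes, but it is what the compute_cost function below implements), the squared-error cost for linear regression with m data points, features $x^{(i)}$, observed values $y^{(i)}$, and parameters $\theta$ is:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)^2$$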

How to minimize the cost function
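Gradient descent minimizes $J(\theta)$ by repeatedly stepping against the gradient. The update performed on each iteration by the gradient_descent function below, with learning rate $\alpha$ and feature matrix $X$, is:

$$\theta := \theta + \frac{\alpha}{m}\,X^T\left(y - X\theta\right)$$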

Gradient Descent in Python


In [ ]:
# Gradient Descent in Python
import numpy
import pandas

def compute_cost(features, values, theta):
    """
    Compute the cost of a list of parameters - theta, given a list of features 
    (input data points) and values (output data points).
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)

    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """

    # Write code here that performs num_iterations updates to the elements of theta.
    # Every time you compute the cost for a given list of thetas, append it
    # to cost_history.
    # See the Instructor notes for hints.
    
    cost_history = []
    m = len(values) 
    ###########################
    ### YOUR CODE GOES HERE ###
    ###########################
    for iteration in range(num_iterations):
        # Append the cost for the current theta to cost_history
        cost_history.append(compute_cost(features, values, theta))
        # Gradient descent update: theta := theta + (alpha/m) * X^T (y - X*theta)
        diff = numpy.dot(features.transpose(), values - numpy.dot(features, theta))
        theta = theta + (alpha/m)*diff
    
    return theta, pandas.Series(cost_history) # leave this line for the grader

Sample output:
Theta =
[ 45.35759233  -9.02442042  13.69229668]

Cost History = 
0      3769.194036
1      3748.133469
2      3727.492258
3      3707.261946
4      3687.434249
5      3668.001052
6      3648.954405
7      3630.286519
8      3611.989767
9      3594.056675
10     3576.479921
11     3559.252334
12     3542.366888
13     3525.816700
14     3509.595027
15     3493.695263
16     3478.110938
17     3462.835711
18     3447.863371
19     3433.187834
20     3418.803138
21     3404.703444
22     3390.883030
23     3377.336290
24     3364.057733
25     3351.041978
26     3338.283754
27     3325.777897
28     3313.519347
29     3301.503147
          ...     
970    2686.739779
971    2686.739192
972    2686.738609
973    2686.738029
974    2686.737453
975    2686.736881
976    2686.736312
977    2686.735747
978    2686.735186
979    2686.734628
980    2686.734074
981    2686.733523
982    2686.732975
983    2686.732431
984    2686.731891
985    2686.731354
986    2686.730820
987    2686.730290
988    2686.729764
989    2686.729240
990    2686.728720
991    2686.728203
992    2686.727690
993    2686.727179
994    2686.726672
995    2686.726168
996    2686.725668
997    2686.725170
998    2686.724676
999    2686.724185
dtype: float64
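As a self-contained sanity check of the gradient_descent implementation above (a sketch on synthetic data, separate from the output shown; all numbers are made up), one could run:

import numpy as np

# Synthetic problem: values = 2 + 3*x plus a little noise
np.random.seed(0)
x = np.random.uniform(0, 10, size=50)
features = np.column_stack([np.ones(len(x)), x])   # intercept column plus one feature
values = 2.0 + 3.0 * x + np.random.normal(scale=0.5, size=len(x))

theta0 = np.zeros(2)
theta, cost_history = gradient_descent(features, values, theta0, alpha=0.01, num_iterations=5000)
print(theta)   # should be close to [2, 3]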

Coefficient of Determination

We need some way to evaluate the effectiveness of our models. One measure we can use is a quantity called the coefficient of determination, also referred to as $R^2$. We can define the coefficient of determination ($R^2$) as:

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$
  • Note:

    • data: $y_1, \ldots, y_n$
    • predictions: $f_1, \ldots, f_n$
    • average of the data: $\bar{y}$
  • The closer $R^2$ is to 1, the better our model.

  • The closer $R^2$ is to 0, the poorer our model.

Quiz: Calculating R^2

import numpy as np

def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced 
    # predictions.
    # 
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.

    # YOUR CODE GOES HERE
    r_squared = 1 - np.sum((predictions - data)**2) / np.sum((data - np.mean(data))**2)

    return r_squared
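A quick usage sketch (the numbers below are made up for illustration):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predictions = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

print(compute_r_squared(data, predictions))   # about 0.99, since the predictions track the data closely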

Other Considerations

  • Other types of linear regression
    • Ordinary least squares regression
  • Parameter estimation
  • Under/Overfitting
  • Multiple local minima

Kurt's Advice For ML Best Practices

Qualitative viewpoint: for any problem you're looking at, it's always very valuable to start by thinking about:

  • What sort of things do we know?
  • What sort of expectations do we have?
  • What sort of qualitative things can we get from an exploratory analysis of the data?

So, using k-Means clustering and PCA is a good start for doing some sort of dimensionality reduction, for getting the data to the point where you can look at it and get some qualitative insights. Once you understand its general structure, you can start to see patterns emerging in the data that make sense, and that either confirm or possibly go against other theories or ingrained beliefs that people have. Getting the data down to that point is very important.

Quantitative viewpoint: trying to understand causal connections, i.e. which features are actually causing the outcome. It's important to use a lot of caution here, and never just dump a bunch of data into a model with lots of features, naively look at the things that have the strongest weights in the model, and conclude that's what's driving it.

Tips for aspiring Data Scientists

There are three areas or parts to this kind of work, and you should think about which parts you really enjoy the most.

  • For some people, it's the process of building things, of writing code, ...
  • If you really enjoy the analysis part, the statistical and mathematical side of things, there's a lot more you can do there in terms of coming up to speed with new machine learning techniques and learning statistics in more depth.
  • On the communication and strategy side, there's obviously a lot you can do to improve your communication skills, and to understand how to abstract away from the details and pull out the high-level issues that are important to a company.

Note: A quick overview on dimensionality reduction and PCA (principal component analysis): http://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
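As a small sketch of what dimensionality reduction with PCA looks like in code (this uses scikit-learn, which is not used elsewhere in this lesson, and the data is random and purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 5 features, where 4 of them are strongly correlated
np.random.seed(0)
base = np.random.normal(size=(100, 2))
X = np.hstack([base,
               base + np.random.normal(scale=0.1, size=(100, 2)),
               np.random.normal(size=(100, 1))])

# Project down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component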


In [1]:
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/zS9SmHPVjJs" frameborder="0" allowfullscreen></iframe>')


Out[1]:

In [3]:
from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/wuUQl3o_hVI" frameborder="0" allowfullscreen></iframe>')


Out[3]:

In [ ]:
# Assignment 3

from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/OWGuZuBxS8E" frameborder="0" allowfullscreen></iframe>')

In [ ]:
# Lesson 3 Recap

from IPython.display import HTML
HTML('<iframe width="798" height="449" src="https://www.youtube.com/embed/u1Sh-BjiFfM" frameborder="0" allowfullscreen></iframe>')