Probability

Bayes' Theorem

$$ P(A|B) = \frac{P(B|A)}{P(B)} \cdot P(A) $$

$$ P(B) = P(B|A)P(A)+P(B|\bar{A})P(\bar{A}) $$
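As a minimal sketch of how the two formulas combine, the snippet below computes the posterior for a diagnostic-test style problem. The function name `posterior` and the numbers (1% prior, 95% true positive rate, 5% false positive rate) are assumptions chosen for illustration.

```python
def posterior(p_b_given_a: float, p_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) using Bayes' theorem and the law of total probability."""
    p_not_a = 1.0 - p_a
    # P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.95, P(B|not A) = 0.05.
p = posterior(p_b_given_a=0.95, p_a=0.01, p_b_given_not_a=0.05)
print(round(p, 3))  # roughly 0.161
```

Even with a fairly accurate test, the small prior keeps the posterior low.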

Combinatorics

$$ _nP_r=\frac{n!}{(n-r)!} $$

$$ _nC_k=\frac{n!}{k!(n-k)!} $$

$$ P(\text{exactly } k \text{ in } n \text{ attempts})=\binom{n}{k}f^k(1-f)^{n-k} $$

where $f$ is the probability of success on a single attempt.
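These formulas map directly onto Python's standard library (`math.perm` and `math.comb`, available from Python 3.8). The helper name `binomial_pmf` is an assumption for illustration.

```python
from math import comb, perm

def binomial_pmf(k: int, n: int, f: float) -> float:
    """Probability of exactly k successes in n attempts with success probability f."""
    return comb(n, k) * f**k * (1 - f) ** (n - k)

print(perm(5, 2))               # 20 ordered selections of 2 from 5
print(comb(5, 2))               # 10 unordered selections of 2 from 5
print(binomial_pmf(2, 5, 0.5))  # 0.3125
```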

Math and Statistical Concepts

Central Limit Theorem

The sampling distribution of the mean of a random sample drawn from a population with finite variance is approximately normal when the sample size is large, regardless of the shape of the population's distribution. This lets us use a normal approximation for the sample mean even when the underlying data are not normally distributed.

Because of the CLT, it is possible to make probabilistic inferences about population parameters based on sample statistics.
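A small simulation can make this concrete. The sketch below assumes a skewed Exp(1) population and a sample size of 50, both chosen only for illustration, and checks that the sample means cluster the way the theorem predicts.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n: int, trials: int) -> list:
    """Means of `trials` random samples of size n from a skewed Exp(1) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]

means = sample_means(n=50, trials=2000)

# Exp(1) has mean 1 and standard deviation 1, so the sample means should
# center near 1 with a spread close to 1 / sqrt(50), about 0.141.
print(statistics.fmean(means), statistics.stdev(means))
```

A histogram of `means` would look roughly bell-shaped even though the population itself is strongly skewed.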

Law of Large Numbers

As the sample size approaches infinity, the sample mean converges to the population mean. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

Curse of Dimensionality

As we add dimensions, we increase the processing power needed to analyze the data, and we also increase the amount of training data required to build meaningful models. As the number of features increases, the classifier's performance improves until an optimal number of features is reached; adding more features while keeping the same amount of data will then degrade the classifier's performance.

KNN is very susceptible to overfitting due to the curse of dimensionality. The curse of dimensionality also describes the phenomenon where, for a fixed-size training dataset, the feature space becomes increasingly sparse as the number of dimensions grows. Intuitively, we can think of even the closest neighbors being too far away in a high-dimensional space to give a good estimate. As our data becomes increasingly sparse we risk overfitting and performing poorly on a test set.
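One way to see why nearest neighbors stop being informative is to compare the nearest and farthest distances from a query point as the dimension grows. This is a rough sketch using uniformly random points in the unit cube; the function name and the point counts are assumptions.

```python
import math
import random

random.seed(0)

def nearest_to_farthest_ratio(dim: int, n_points: int = 200) -> float:
    """Ratio of the nearest to the farthest distance from one random query point."""
    query = [random.random() for _ in range(dim)]
    dists = [
        math.dist(query, [random.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    # As dim grows the ratio approaches 1: every point is about equally far away.
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```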

Probability Space

Sample space: The set of all possible outcomes.
Event space: A collection of events, where each event is a subset of the sample space.
Probability measure: A function that assigns a probability to each event in the event space.

Eigenvectors

In essence, an eigenvector v of a linear transformation T is a non-zero vector that, when T is applied to it, does not change direction. Applying T to the eigenvector only scales the eigenvector by the scalar value λ, called an eigenvalue.
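The defining property can be checked numerically with NumPy. The matrix `T` below is a hypothetical example chosen for illustration.

```python
import numpy as np

# Hypothetical 2x2 linear transformation.
T = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(T)

# np.linalg.eig returns eigenvectors as the columns of the second array.
for lam, v in zip(eigenvalues, eigenvectors.T):
    # Applying T only rescales v by its eigenvalue.
    assert np.allclose(T @ v, lam * v)
    print(lam, v)
```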

Singular Value Decomposition

A more general version of eigenvalue decomposition. Eigendecomposition can only be applied to a diagonalizable square matrix, whereas SVD exists for any m × n matrix.
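As a short sketch, NumPy's `np.linalg.svd` factors a non-square matrix, which has no eigendecomposition, and the factors multiply back to the original. The matrix `A` is an arbitrary example.

```python
import numpy as np

# A 2x3 matrix: not square, so eigendecomposition does not apply.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Reduced SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

print(s)                          # singular values, in descending order
print(np.allclose(A, A_rebuilt))  # True
```

The squared singular values equal the eigenvalues of `A @ A.T`, which is the link back to eigenvalue decomposition.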

Linear Discriminant Analysis

Used to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

Probability vs. Likelihood

Probability is the area under the curve of a distribution between two values. Imagine a normal distribution of mouse weights; we may write the probability as p(weight between 32 and 34 grams | mean=32 and std=2.5) = .29. As we change the values we find new probabilities, where the right side of the expression describes the distribution and the left side describes the area under the curve.

Likelihood assumes we have some observed data. If we have a 34 gram mouse, we can read the y-axis of our distribution at 34 grams to get the likelihood of the left side of the expression, which holds the parameters of the distribution. This is written as L(mean=32 and std=2.5 | mouse weighs 34 grams). The right side stays fixed because it defines our data, and we vary the left side to get the likelihood that those parameters define the distribution that our data comes from.

MLE, or maximum likelihood estimation, finds the parameter values that maximize the likelihood of our observed data under an assumed distribution (a normal distribution in the mouse example). MLE is an example of statistical inference. There are other approaches to inference, such as Bayesian inference.
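The mouse-weight example can be reproduced with `statistics.NormalDist` from the standard library. The shifted mean of 34 in the last step is an assumption added to show how the likelihood moves with the parameters.

```python
from statistics import NormalDist

dist = NormalDist(mu=32, sigma=2.5)

# Probability: area under the curve between 32 and 34 grams.
prob = dist.cdf(34) - dist.cdf(32)

# Likelihood: height of the curve at the observed weight of 34 grams.
likelihood = dist.pdf(34)

# Same observation, different parameters: centering the curve on the
# observation gives a higher likelihood.
likelihood_shifted = NormalDist(mu=34, sigma=2.5).pdf(34)

print(round(prob, 2))  # 0.29
print(likelihood < likelihood_shifted)  # True
```

For a single observation and a fixed standard deviation, the likelihood is largest when the mean equals the observed value, which is the idea MLE generalizes to many data points.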

Lasso (L1) vs Ridge (L2) Regularization

The difference is the penalty term added to the loss. Ridge uses the squared magnitude of the coefficients as the penalty term. Lasso uses the absolute value of the coefficients as the penalty term.

Lasso penalizes the sum of the absolute values; as a result, for high values of lambda, many coefficients are set exactly to zero under lasso, which is generally not the case in ridge regression, where coefficients shrink toward zero without reaching it. Lasso tends to do well if there are a small number of significant predictor parameters, and others are close to zero. Ridge tends to do better when there are many large parameters of about the same value. Ultimately we should run cross-validation and choose whichever performs better.
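Written out, the two objectives differ only in the penalty. The squared-error loss and the symbols $y_i$, $x_i$, $\beta$, and $p$ are assumptions introduced here for illustration, since the notes do not fix a loss function.

$$ \hat{\beta}_{ridge} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 $$

$$ \hat{\beta}_{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| $$

Setting $\lambda = 0$ recovers ordinary least squares in both cases.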

Standard Error

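The standard deviation of the sampling distribution of a statistic, most commonly the sample mean. For the sample mean it is estimated as $s/\sqrt{n}$, where $s$ is the sample standard deviation and $n$ is the sample size.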

AUC-ROC Curve

The AUC-ROC curve is used in binary classification to tell us how well the model can distinguish between classes. The higher the AUC, the better the model is at separating the classes. The ROC curve plots TPR (true positive rate, also called recall or sensitivity) against FPR (false positive rate, equal to 1 - specificity) across classification thresholds, and the AUC is the area under that curve. We can use AUC-ROC for multi-class classification by drawing a ROC curve for each class and averaging with equal weight (macro-averaging), or by drawing one curve that treats each element as a binary prediction (micro-averaging).
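A compact way to compute AUC without plotting is its rank interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The sketch below uses that equivalence; the function name and the toy labels and scores are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count a win when the positive outscores the negative, half for a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

This pairwise version is quadratic in the number of samples, so it suits small examples rather than large datasets.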

Regression

General Outline

In linear regression we have 5 key assumptions
1. Independent and dependent variables have a linear relationship
2. Multivariate normality
3. No multicollinearity
4. No autocorrelation
5. Homoskedasticity (constant finite variance)
Generate a scatterplot and analyze the correlation and directionality of the variables. Check for linear relationships if doing linear regression.
Look for outliers (consider looking up those data points and seeing if they deviate for a non-statistical reason)
If the independent variables show high pairwise correlations be on the lookout for multicollinearity
Split the data into train and test, or use cross-validation with k folds and iterate through the folds testing with the current fold and training with the others.
Fit a least-squares model (or whatever the situation calls for)
Use a Pearson goodness-of-fit or chi-square test
Consider several changes to get a better fit
1. Especially if the early scatterplot shows non-linear relationships, consider a transformation like Box-Cox
2. Check for multicollinearity and remove the offending features
3. Add regularization if overfitting the observed data
4. If underfitting, consider adding features or using a more flexible model; resampling techniques like bootstrapping mainly help by decreasing variance
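The split, fit, and evaluate steps of the outline can be sketched with NumPy alone. The synthetic data (y = 3 + 2x plus noise), the 150/50 split, and the use of R² as the fit measure are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: y = 3 + 2x + noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

# Split into train and test sets.
idx = rng.permutation(len(x))
train, test = idx[:150], idx[150:]

# Fit ordinary least squares on the training set (intercept column plus x).
X_train = np.column_stack([np.ones(len(train)), x[train]])
coef, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# Evaluate on the held-out test set with R^2.
X_test = np.column_stack([np.ones(len(test)), x[test]])
residuals = y[test] - X_test @ coef
r2 = 1 - np.sum(residuals**2) / np.sum((y[test] - y[test].mean()) ** 2)

print(coef, r2)
```

Plotting `residuals` against `x[test]` is a quick visual check for non-linearity and non-constant variance.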

Logistic Formula

$$P= \frac{1}{1 + e^{-(b_0+b_1x)}}$$
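The formula translates directly into code. The function name `logistic` and the sample coefficients are assumptions for illustration.

```python
import math

def logistic(x: float, b0: float, b1: float) -> float:
    """Predicted probability P for input x with intercept b0 and slope b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(logistic(0.0, b0=0.0, b1=1.0))  # 0.5, the midpoint of the curve
print(logistic(4.0, b0=0.0, b1=1.0))  # close to 1
```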

Tests

Z-Test

In a z-test, the sample is assumed to be normally distributed. A z-score is calculated with population parameters such as “population mean” and “population standard deviation” and is used to validate a hypothesis that the sample drawn belongs to the same population.

$H_o$: Sample mean is same as the population mean

$H_a$: Sample mean is not same as the population mean
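A two-sided one-sample z-test can be sketched with the standard library. The function name `z_test` and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test(sample_mean: float, pop_mean: float, pop_std: float, n: int):
    """Two-sided one-sample z-test; returns (z, p_value)."""
    z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: population mean 100, population std 15, sample of 100.
z, p = z_test(sample_mean=103.0, pop_mean=100.0, pop_std=15.0, n=100)
print(round(z, 2), round(p, 4))  # z = 2.0, p is about 0.0455
```

At a 5% significance level this sample would lead us to reject $H_o$.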

T-Test

Compares two averages and tells you whether they are different from each other. The t-test also tells you how significant the differences are. Use it to compare the means of two groups and estimate the probability that their differences are the result of chance. A t-test is used when the population parameters (in particular the standard deviation) are not known. The t score is the ratio of the difference between two groups to the variability within the groups.

Independent samples t-test which compares mean for two groups
Paired sample t-test which compares means from the same group at different times
One sample t-test which tests the mean of a single group against a known mean
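The t statistics for the independent and paired cases can be computed with the standard library. This sketch uses Welch's form for independent samples (no equal-variance assumption), which is a choice made here rather than something the notes specify; it stops at the statistic because the t distribution's CDF is not in the standard library. The data are made up.

```python
import math
import statistics

def welch_t(a, b) -> float:
    """t statistic for two independent samples (Welch's form)."""
    mean_diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return mean_diff / se

def paired_t(before, after) -> float:
    """t statistic for paired measurements on the same group."""
    diffs = [y - x for x, y in zip(before, after)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.5, 4.4, 4.9, 4.6]
print(welch_t(group_a, group_b))
print(paired_t(group_b, group_a))
```

A larger absolute t value means the difference between groups is large relative to the variability within them.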

Chi-Square Test

The chi-square test is used to compare categorical variables. There are two types of chi-square test

Goodness of fit test, which determines whether a sample matches the population.
Test of independence, which compares two categorical variables in a contingency table to check whether they are related.

a. A small chi-square value means that data fits

b. A high chi-square value means that data doesn’t fit.

$H_o$: Variable A and Variable B are independent

$H_a$: Variable A and Variable B are not independent.
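For a contingency table, the statistic sums the squared gaps between observed counts and the counts expected under independence. The function name and the 2x2 table below are hypothetical.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table of counts.
stat, dof = chi_square_independence([[30, 10], [20, 40]])
print(round(stat, 2), dof)  # 16.67 1
```

The statistic is then compared against a chi-square distribution with `dof` degrees of freedom to obtain a p-value.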

ANOVA

ANOVA, also known as analysis of variance, is used to compare multiple (three or more) samples with a single test. There are 2 major flavors of ANOVA

One-way ANOVA: Used to compare the means of three or more samples/groups defined by a single independent variable.
MANOVA: Allows us to test the effect of one or more independent variables on two or more dependent variables. In addition, MANOVA can also detect differences in correlation between dependent variables across the groups of the independent variables.

$H_o$: All pairs of samples are same i.e. all sample means are equal

$H_a$: At least one pair of samples is significantly different
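The one-way ANOVA F statistic is the ratio of between-group to within-group variance. A minimal sketch with made-up groups:

```python
import statistics

def one_way_anova_f(groups) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = statistics.fmean(x for g in groups for x in g)
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: three groups of three observations.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(groups))  # approximately 13.0
```

A large F suggests at least one group mean differs; a p-value comes from the F distribution with (k - 1, n - k) degrees of freedom.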

Other Tests

Auto-correlation
1. Durbin-Watson
2. Breusch-Godfrey
3. Ljung-Box
Normality
1. Shapiro-Wilk
2. Q-Q Plot
3. Jarque-Bera
Stationarity
1. Augmented Dickey-Fuller
2. KPSS

Distributions

Bernoulli: Two discrete outcomes, such as heads or tails, that are not necessarily equally likely.
Binomial: The number of successes in n independent Bernoulli trials with the same success probability.
Hypergeometric: Like the binomial distribution, but for draws made without replacement.
Multinomial: A generalization of the binomial distribution, it models the probability of counts for rolling a k-sided die n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
Poisson: Like the binomial distribution, it is the distribution of a count, the number of times something happened. Poisson is the distribution to use when counting events over a period of time given a constant average rate of events occurring. Poisson describes the packets that arrive at routers, the customers that arrive at stores, and things waiting in some kind of queue.
Geometric: How many times does a flipped coin come up tails before it first comes up heads? This count of tails follows a geometric distribution. If the binomial distribution is “How many successes?” then the geometric distribution is “How many failures until a success?”
Exponential: A continuous distribution, like a mix of the Poisson and Geometric distributions; it answers 'How long until an event?'. Given events whose count per time follows a Poisson distribution, the time between events follows an exponential distribution with the same rate parameter lambda. A Weibull distribution generalizes the exponential distribution to allow increasing or decreasing rates of failure.
Log-Normal: Takes on values whose logarithm is normally distributed. Products of many independent positive random variables are approximately log-normally distributed.
Chi-Square: The distribution of the sum of squares of independent standard normal values. It is the distribution underpinning the chi-squared test, which is itself based on a sum of squared differences that are assumed to be normally distributed.
Gamma: A generalization of both the exponential and chi-squared distributions. The gamma distribution comes up when modeling the time until the next n events occur.
Beta: Comes up when discussing conjugate priors in Bayesian machine learning. The Beta is the conjugate prior for the Bernoulli, binomial, and geometric distributions.

Properties

Kurtosis: A descriptor of the shape of a probability distribution. Distributions with large kurtosis have tails that are heavier than those of a normal distribution. Kurtosis is the combined weight of a distribution's tails relative to the center of the distribution.
Skewness: A measure of the asymmetry of the probability distribution. A negative skew (skewed to the left) has a long tail toward lower values, with the mode and median typically to the right of the mean; a skew of -1 or lower is considered very skewed. A positive skew (skewed to the right) has a long tail running away from the origin, with the median and mode typically to the left of the mean.
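Both properties are standardized moments and can be computed directly. The sketch below uses population (biased) formulas and reports excess kurtosis (kurtosis minus 3); those conventions and the toy data are assumptions.

```python
import statistics

def skewness(data) -> float:
    """Population skewness: mean of cubed standardized deviations."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return statistics.fmean(((x - m) / s) ** 3 for x in data)

def excess_kurtosis(data) -> float:
    """Population kurtosis minus 3, so a normal distribution scores about 0."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return statistics.fmean(((x - m) / s) ** 4 for x in data) - 3.0

long_upper_tail = [1, 1, 1, 2, 10]             # tail toward higher values
long_lower_tail = [-x for x in long_upper_tail]  # mirror image

print(skewness(long_upper_tail) > 0)  # True: positive skew
print(skewness(long_lower_tail) < 0)  # True: negative skew
# With the long upper tail, the mean sits above the median.
print(statistics.fmean(long_upper_tail), statistics.median(long_upper_tail))
```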

Common Tracking Metrics

Retention Rate: How many users remained customers over a period of time. Retention is the opposite of churn.
Conversion: Converting people aware of your marketing into customers.
1. Volume: How many responses are there?
2. Conversion: How the sales marketing is performing through the entire sales cycle.
3. Velocity: How long does it take to turn an initial response into a close?
Session length
Time-in-app
Active users
Stickiness: How addictive the app is.

Machine Learning

Recommender Systems

Collaborative Filtering

Based on collecting and analyzing a large amount of information on users' behaviors, activities, or preferences and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it doesn't require understanding the characteristics of items.
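A minimal user-based sketch of this idea: score items a user has not seen by the ratings of other users, weighted by how similar those users are. The ratings, user names, and the choice of cosine similarity are all assumptions for illustration, not a description of any production system.

```python
import math

# Hypothetical user -> {item: rating} data.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "D": 4},
    "carol": {"A": 1, "C": 5, "D": 2},
}

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two users' rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user: str) -> dict:
    """Predict ratings for unseen items as a similarity-weighted average."""
    scores, weights = {}, {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return {item: scores[item] / weights[item] for item in scores if weights[item] > 0}

print(recommend("alice"))  # a predicted rating for item "D"
```

Note that nothing in the code looks at what the items are; only the pattern of ratings matters.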

Content-Based Filtering

Based on the description of an item and the profile of a user's preferences. In content-based recommender systems, you need to have a profile of each item, like Pandora, where a single seed is used to get other songs with similar musical characteristics. As the user interacts with Pandora, it can build a profile of the user.