Statistics and Probability Refresher

Skipping this section, as it is fairly basic.

See my DistributionMetrics.ipynb for some basics, or lectures 7-18 of the course.

Linear Regression

  • fit a line to observations
  • use line to predict values of other data

Methods

  • 'least squares': minimize squared-error
  • Gradient Descent: better for higher-dimensional data, but prone to getting stuck in local minima depending on the starting position

r-squared: 0 = bad, 1 = perfect (all of the variance is captured by the model)

The code below (in regression.ipynb) is my modified version of the code originally in LinearRegression.ipynb and PolynomialRegression.ipynb.

Modify this to a multivariate/polynomial regression example

Make distribution more complicated to see if scikit-learn can fit it

With a high-degree polynomial, the model is unlikely to hold up to future testing: it fits the data it was trained on well but generalizes poorly (overfitting).
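
A minimal sketch of this idea (not the exact notebook code): fit a degree-4 polynomial to synthetic, non-linear data with numpy and score it with r-squared. The variable names and data here are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

np.random.seed(2)
# synthetic, non-linear relationship (illustrative only)
page_speeds = np.random.normal(3.0, 1.0, 100)
purchase_amount = np.random.normal(50.0, 10.0, 100) / page_speeds

# least-squares fit of a degree-4 polynomial
coeffs = np.polyfit(page_speeds, purchase_amount, 4)
model = np.poly1d(coeffs)

# r-squared: 0 = bad, 1 = perfect
print(r2_score(purchase_amount, model(page_speeds)))
```

Raising the degree pushes r-squared toward 1 on this data while generalizing worse, which is the overfitting described above.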

Multivariate Regression

Just the regression above, with more than one variable being fit.

Key points

  • Avoid using features that don't provide additional information
  • combine features together when it makes sense to reduce dimensionality
  • Must assume the features are independent of each other, even though that is not always (or even usually) true

Example of using Multivariate Regression to estimate car prices based on their features is covered in MultivariateRegression.ipynb
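
A rough sketch of the same idea with scikit-learn; the notebook's actual data and columns may differ, and the values below are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical car data; the real notebook loads a proper dataset
df = pd.DataFrame({
    'Mileage':  [10000, 30000, 55000, 80000, 120000],
    'Cylinder': [4, 4, 6, 6, 8],
    'Doors':    [4, 2, 4, 4, 2],
    'Price':    [21000, 19500, 17000, 14500, 12000],
})

X = df[['Mileage', 'Cylinder', 'Doors']]   # features, assumed independent
y = df['Price']

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# predict the price of a hypothetical car
new_car = pd.DataFrame({'Mileage': [45000], 'Cylinder': [6], 'Doors': [4]})
print(model.predict(new_car))
```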

Multi-Level Models

  • Effects happen at different levels
    • a lower level feature depends on its environment, and the level above that...

Multi-Level Models attempt to model these interdependencies.

Commonly applied in healthcare.

Not covered in more detail beyond the general discussion in lecture 22; a [book] is recommended instead for further reading.

Bayesian Methods: Concepts

Bayes' Theorem (not covering it here as it is in my PhD and MSc theses)

One good real-world application is a spam filter. Naive Bayes can be used to develop a model that discriminates normal (ham) emails from garbage (spam). There are lots of ways to improve it, but it works fairly well even in a basic form.

For more, check lectures 25-26.

Spam Classifier/Filter with Naive Bayes

Supervised learning.

Steps

  • Read in emails and their ham/spam classification (this is the bulk of the code)
  • Vectorize emails to numbers representing each word
  • Get an object that will perform Multinomial Naive Bayes from sklearn
  • Fit vectorized emails
  • Check it worked with test cases

For more code and details, see NaiveBayes.ipynb.
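
A minimal sketch of those steps with scikit-learn, using a tiny made-up corpus instead of real email files:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny illustrative corpus; the notebook reads real ham/spam emails from disk
data = pd.DataFrame({
    'message': ['Free money now!!!', 'Hi Bob, lunch tomorrow?',
                'Win a prize, click here', 'Meeting notes attached'],
    'class':   ['spam', 'ham', 'spam', 'ham'],
})

# vectorize emails into word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'])

# fit Multinomial Naive Bayes to the vectorized emails
classifier = MultinomialNB()
classifier.fit(counts, data['class'])

# check it worked with test cases
tests = ['Free prize money', 'See you at the meeting']
print(classifier.predict(vectorizer.transform(tests)))
```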

K-Means Clustering

Unsupervised learning

  • Attempts to split data into K groups that are closest to K centroids.

    • (1) Centroids are adjusted to the center of the points that were closest to them.

    • (2) Points are then reassigned to whichever centroid they are now closest to.

  • Repeat 1 & 2 until the error (the distance the centroids move) converges.

Caveats

  • choosing K

    • try increasing K until you stop getting large reductions in $\chi^2$
  • use different randomly chosen initial centroids to avoid local minima

  • Still need to determine labels for clusters found.

Example of its use can be found in KMeans.ipynb
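
A short sketch with scikit-learn on synthetic data; the centroid locations and choice of K below are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# synthetic 2-D blobs around four made-up centers (illustrative only)
np.random.seed(10)
data = np.concatenate([np.random.normal(center, 0.5, size=(20, 2))
                       for center in [(-3, -3), (0, 0), (3, 3), (3, -3)]])

model = KMeans(n_clusters=4, n_init=10)   # K is a choice; see the caveats above
model.fit(scale(data))                    # normalizing features first usually helps

print(model.labels_)    # which cluster each point was assigned to
print(model.inertia_)   # total squared distance to centroids; useful for picking K
```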

Entropy

  • A measure of a data set's disorder - how similar or different its contents are.
  • Classify data set into N classes.
    • Entropy of 0, implies all data is the same class
    • High entropy, implies there are many types of classes in the data

Computing Entropy

  • $H(S) = -p_1 \ln(p_1) - \dots - p_n \ln(p_n)$
  • $p_i$ represents the proportion of the data with class/label $i$
  • Terms where $p_i = 1$ (all the data is that class) or $p_i = 0$ (none of the data is that class) contribute zero to the entropy, so the entropy is non-zero only when portions of the data are in different classes.
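
A small sketch of the formula in numpy (the class labels are made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels (natural log, so in nats)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()           # p_i = proportion of data in class i
    return -np.sum(p * np.log(p))       # classes with p_i = 0 never appear in counts

print(entropy(['a', 'a', 'a', 'a']))    # 0.0  - all data is the same class
print(entropy(['a', 'b', 'c', 'd']))    # high - many different classes
```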

Decision Trees

Supervised learning

  • flowcharts to assist with classification choices
  • Ex. a tree of resume contents organized by their relation to the chances of being hired.

Random Forests

Can use 'from sklearn import tree', plus pandas to organize the data going into the trees. Graphviz can be used to visualize the resulting trees.

  • Decision trees are very susceptible to overfitting
    • construct many trees in a 'forest' and have them all 'vote' towards the outcome classification
      • MUST randomly sample data used to make each tree!
      • Also, randomize the attributes each tree is fitting.

Steps

  • read in data with pandas
  • convert columns used to make decisions into ordinal numbers with a map function
  • push these into a 'features' list
  • get an array of matching decisions from the supervised portion for training
  • make the decision tree
  • use graphviz to display the resulting tree

Then upgrade to using a random forest (see the sketch below).
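
A hedged sketch of those steps; the hiring data below is made up, and the graphviz rendering step from the notebook is omitted.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# hypothetical resume/hiring data
df = pd.DataFrame({
    'Years Experience': [10, 0, 7, 2, 20],
    'Employed?':        ['Y', 'N', 'N', 'Y', 'N'],
    'Top-tier school':  ['N', 'Y', 'N', 'Y', 'Y'],
    'Hired':            ['Y', 'N', 'N', 'Y', 'Y'],
})

# convert decision columns to ordinal numbers with a map
d = {'Y': 1, 'N': 0}
for col in ['Employed?', 'Top-tier school', 'Hired']:
    df[col] = df[col].map(d)

features = ['Years Experience', 'Employed?', 'Top-tier school']
X, y = df[features], df['Hired']

tree = DecisionTreeClassifier().fit(X, y)                   # single tree: prone to overfitting
forest = RandomForestClassifier(n_estimators=10).fit(X, y)  # forest of randomized trees that vote

candidate = pd.DataFrame([[10, 1, 0]], columns=features)
print(tree.predict(candidate), forest.predict(candidate))
```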

Ensemble Learning

Multiple models work together to make a prediction

  • Ex. random forests.

Methods

  • Bagging (bootstrap aggregating): many models built by training on randomly-drawn subsets of the data
  • Boosting: additional models are added to help address data mis-classified by the previous model
    • XGBoost is a very good and powerful tool for local and cluster/distributed solutions, available in C++, Python, R, Java, Scala, and Julia.
  • 'bucket of models': Train multiple models, then pick one that works best on test data
  • Stacking: run multiple models on same data, then combine output results

Advanced Ensemble Learning

  • Bayes Optimal Classifier (BOC)
    • Theoretically the best - but almost always impractical
  • Bayesian Parameter Averaging
    • Attempts to make BOC practical. Still susceptible to overfitting and often outperformed by simple bagging
  • Bayesian Model Combination
    • Tries to fix all of these
    • BUT, ends up about the same as finding best combination of models with cross-validation

Support Vector Machines (SVM)

  • Works well for high-dimensional data (lots of features)
  • Solves for high-dimensional support vectors to help divide up the data
  • Applies a 'kernel trick' to represent the data in higher dimensions in order to find hyperplanes that are not apparent in the lower dimensions.
    • This is computationally expensive, which is why it is not as useful for low-dimensional data.

Ex. Identify types of iris flower by length and width of sepal.

General:

With a simple linear kernel (see the sketch below).
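
A minimal sketch of this example with scikit-learn's built-in iris data, keeping only the two sepal features:

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data[:, :2]     # sepal length and width only, to keep it 2-D
y = iris.target          # the three iris species

# simple linear kernel; swap in kernel='rbf' or 'poly' for non-linear boundaries
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.predict([[6.1, 2.8]]))   # predicted species for a new sepal measurement
```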

Recommender Systems

User-Based Collaborative Filtering

  • Build matrix of things each user bought/viewed/rated
  • Compute similarity scores between users
  • Find similar users
  • Recommend stuff similar users bought... that the current user hasn't seen yet.

Caveats

  • People's likes change
  • Number of people commonly >> number of items. Thus, needs lots of filtering.
  • People intentionally fabricate fake users to boost or trash items to their advantage
    • Shilling attack

Item-based Collaborative Filtering

Resolves some of the problems mentioned above that arise from basing recommendations on people's actions.

  • fewer items than people, so faster to compute.
  • harder for people/users to game

Idea

  • Find all pairs of items bought/viewed/rated by same user
  • Measure the similarity of the items' ratings/purchases... across all users who bought/viewed both
  • sort by item
  • sort by similarity
  • Use a look-up table of the results to make recommendations to users

Steps

  • import data with pandas
  • convert data into a table of items, users, and rating/frequency bought...
  • Calculate the correlation between ratings/frequency bought with pandas
  • Clean out spurious results. This is tricky, but it is the most important part for making sure recommendations actually produce sales. Expect many rounds of cleaning the input data, tweaking the correlation function, and cleaning the resulting correlations.
  • Use cleaned correlations array(s) to make recommendations
  • Try grouping results to help find top matches
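
A rough pandas sketch of the idea; the ratings below are made up, while the notebook works with a real ratings dataset.

```python
import pandas as pd

# hypothetical user/item ratings
ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 3],
    'item':    ['A', 'B', 'A', 'C', 'A', 'B', 'C'],
    'rating':  [5, 4, 4, 2, 5, 5, 1],
})

# rows = users, columns = items, values = rating
matrix = ratings.pivot_table(index='user_id', columns='item', values='rating')

# pairwise item-item correlations; min_periods drops spurious correlations
# based on too few shared users (this threshold needs tuning/cleaning)
item_similarity = matrix.corr(method='pearson', min_periods=2)

# items most similar to item 'A', as candidate recommendations
print(item_similarity['A'].drop('A').sort_values(ascending=False))
```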

K-Nearest Neighbours (KNN)

Supervised learning

  • Similar to K Means Clustering
  • Classify new data points based on their distance from known data
  • Find the K nearest neighbours, based on this 'distance'
  • Allow all KNN to vote on classification

Example in KNN.ipynb.

Steps

  • import data with pandas
  • Group data by features of interest
  • Convert the features used for classification into a normalized form
  • Make a distance calculating function
  • Find KNN
  • Sort by distance and return the results
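
A small sketch of the manual approach on a made-up, already-normalized feature table:

```python
import numpy as np
import pandas as pd

# hypothetical normalized features for a handful of items
items = pd.DataFrame(
    {'feature_1': [0.1, 0.9, 0.2, 0.8], 'feature_2': [0.3, 0.7, 0.1, 0.9]},
    index=['item_a', 'item_b', 'item_c', 'item_d'],
)

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(a - b)

def k_nearest(query, k=2):
    """Return the k items closest to the query item."""
    target = items.loc[query].values
    dists = items.drop(query).apply(lambda row: distance(row.values, target), axis=1)
    return dists.sort_values().head(k)

print(k_nearest('item_a'))   # the neighbours can then vote on a classification
```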

Principal Component Analysis (PCA)

When data has too many dimensions, extract a set of basis vectors that can be combined to reproduce the high-dimensional data sufficiently well. Put another way: find a representation of the data with minimal dimensions that sufficiently preserves its variance.

Very useful in image compression and face recognition.

  • A common implementation is Singular Value Decomposition (SVD)

Ex. Identify types of iris flower by length and width of sepal. Data comes with scikit-learn.

  • With PCA, the 4 features (length & width of petals & sepals, 4D) are reduced to 2D

See PCA.ipynb for all code

Steps

  • Import data
  • Apply PCA
  • Check how much variance was captured
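
A short sketch of those steps on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

pca = PCA(n_components=2, whiten=True)   # reduce the 4-D measurements to 2-D
X_2d = pca.fit_transform(iris.data)

# how much of the original variance the two components preserve
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```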

Data Warehousing

ETL: Extract, Transform, Load

The more 'traditional' approach.

  • raw data from operational systems periodically extracted
  • raw data is transformed into a required schema
  • transformed data is loaded into warehouse

  • BUT, step 2 (transform) can be a big problem with "big data"

ELT: Extract, Load, Transform

Push intensive transformation step to the end where it can be better optimized. This approach is now much more scalable than ETL.

  • Extract raw data as before
  • load it into the data warehouse raw
  • let cluster (Hadoop) process and manage data in-place
  • Query reduced data with new methods such as NoSQL, Spark or MapReduce

Reinforcement Learning

One example is Pac-Man.

Idea:

  • agent 'explores' space
  • allow agent to learn values of different state changes in different conditions
  • state & choice values then used to make informed future decisions

Q-Learning

Implementation of reinforcement learning.

  • have:

    • set of environmental states s
    • set of possible actions a for each state
    • value of state/action Q
  • Start all Q's at 0

  • explore
  • bad things -> reduce Q for that state/action
  • good things -> increase Q
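
The standard Q-learning update rule captures this increase/decrease idea, with learning rate $\alpha$ and discount factor $\gamma$, where $r$ is the observed reward and $s'$ is the resulting state:

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$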

The exploration problem

Use Bayes' theorem to include intelligent randomness in the exploration and increase the learning efficiency. The resulting framework is a Markov Decision Process (MDP).

Use this in tandem with Q-learning to build up a table of all possible states and the reward values (Q values) for every available action in each state. In some cases/terms this can be considered to implement dynamic programming or memoization.

Dealing with Real-World Data

Lectures 48-53 cover issues that arise when applying the course fundamentals to real-world data.

Apache Spark: Machine Learning on Big Data

Using MLLib to essentially do things like K-Means Clustering, Decision Trees... as reviewed in pure Python before, but in a way that can be run locally OR on a Hadoop cluster with Amazon Web Services (AWS).
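
A hedged sketch of what MLLib K-Means looks like with the newer DataFrame-based API; the file path and column handling here are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# hypothetical CSV of numeric features; in practice this might live on HDFS or S3
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# MLLib estimators expect a single vector column of features
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
assembled = assembler.transform(df)

model = KMeans(k=5, seed=1).fit(assembled)
print(model.clusterCenters())

spark.stop()
```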




Common models and tools in ML/AI:



Neural Network Models:


Types:

Feedforward (acyclic graphs)

  • Autoencoders
  • Denoising autoencoders
  • Restricted Boltzmann machines (stacked, they form deep-belief networks)

Convolutional

Deep convolutional networks are SOTA for images. There are many well known architectures, including AlexNet and VGGNet.

Convolutional networks usually involve a combination of convolutional layers along with subsampling (pooling) and fully connected feedforward layers.

Recurrent

These handle time series data especially well. They can be combined with convolutional networks to generate captions for images.

Recursive

These handle natural language especially well.


  • MLP Multi-layer perceptron
  • CNN Convolutional Neural Network
  • RNN Recurrent Neural Network
  • RvNN Recursive Neural Network
  • LSTM Long Short Term Memory
  • FRN Fully recurrent network
  • HN Hopfield network
  • EN Elman network
  • JN Jordan network
  • ESN Echo state network
  • BRNN Bi-directional RNN

Stochastic gradient descent

Optimize the cost function while training the model to give the highest accuracy.

  • Momentum SGD
  • AdaGrad
  • RMSprop
  • AdaDelta
  • Adam
  • Nesterov's Accelerated Gradient Descent
  • Graves' RMSProp

Activation function

Outputs of perceptrons/neurons/nodes generated by passing weighted inputs through an 'activation function'.

  • ReLU (simple rectifier; returns max(x, 0))
  • Sigmoid
  • Soft Max
  • Max Out
  • Tanh
  • Identity
  • Leaky ReLU
  • Clipped ReLU
  • Exponential Linear Unit
  • Log Soft Max
  • Soft Plus
  • Parametric ReLU
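
A few of these are easy to write out directly in numpy; a quick sketch, not how a framework actually implements them:

```python
import numpy as np

def relu(x):        # simple rectifier: max(x, 0)
    return np.maximum(x, 0)

def sigmoid(x):     # squashes input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):     # turns a vector of scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), softmax(x))
```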

Pre-training

Training done beforehand to estimate good initial weights for the inputs to nodes.

  • Denoising Auto-Encoder
  • Auto-encoder
  • Deep Boltzmann Machine
  • Restricted Boltzmann Machine Gibbs sampling
  • Restricted Boltzmann Machine Contrastive Divergence
  • Deep Belief Network
  • Gaussian unit
  • RELU unit

Data Normalization

  • standardization
  • PCA whitening
  • ZCA whitening



Big Data Solutions/Distributed computing notes

Spark

  • Alternative to tools like MapReduce.
  • More flexible and can work with other file systems: Cassandra, AWS S3...
  • Keeps most data in memory to be faster, BUT RAM can overflow
  • MapReduce writes back to disk after each step, so slower...
  • MLLib provides powerful, easy to use ML & data mining tools
  • RDDs (Resilient Distributed Datasets) are the OLD way; the latest versions use DataFrames like those in SQL and Pandas



Misc Others:

An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
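
A hedged sketch of fitting an ARIMA model with statsmodels on a synthetic series; the series and the (p, d, q) order below are made up.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# synthetic monthly series with a trend plus noise (illustrative only)
index = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.linspace(10, 30, 48) + np.random.normal(0, 1, 48), index=index)

# order = (p, d, q): p autoregressive terms, d differences, q moving-average terms
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))   # forecast the next 6 months
```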

