Statistics and Probability Refresher

Skipping this section, as it is fairly basic.

See my DistributionMetrics.ipynb for some basics, or lectures 7-18 of the course.

Linear Regression

  • fit a line to observations
  • use line to predict values of other data

Methods

  • 'least squares': minimize squared-error
  • Gradient Descent: better for higher-dimensional data, but prone to getting stuck in local minima depending on the starting position

r-squared: 0 = bad, 1 = perfect (all of the variance is captured by the model)

The code below (in regression.ipynb) is my modified version of the code originally in LinearRegression.ipynb and PolynomialRegression.ipynb.

Modify this to a multivariate/polynomial regression example

Make distribution more complicated to see if scikit-learn can fit it

With a high-degree polynomial, the model is unlikely to hold up to future testing: it fits the data it was trained on well but generalizes poorly (overfitting).
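
A minimal sketch of this idea (not the exact notebook code): fit a degree-4 polynomial to synthetic, non-linear data with numpy and score it with r-squared. The variable names and data here are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

np.random.seed(2)
# synthetic, non-linear relationship (illustrative only)
page_speeds = np.random.normal(3.0, 1.0, 100)
purchase_amount = np.random.normal(50.0, 10.0, 100) / page_speeds

# least-squares fit of a degree-4 polynomial
coeffs = np.polyfit(page_speeds, purchase_amount, 4)
model = np.poly1d(coeffs)

# r-squared: 0 = bad, 1 = perfect
print(r2_score(purchase_amount, model(page_speeds)))
```

Raising the degree pushes r-squared toward 1 on this data while generalizing worse, which is the overfitting described above.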

Multivariate Regression

Just the regression above, with more than one variable being fit.

Key points

  • Avoid using features that don't provide additional information
  • combine features together when it makes sense to reduce dimensionality
  • Must assume the features are independent of each other, even though that is not always (or even usually) true

Example of using Multivariate Regression to estimate car prices based on their features is covered in MultivariateRegression.ipynb
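
A rough sketch of the same idea with scikit-learn; the notebook's actual data and columns may differ, and the values below are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical car data; the real notebook loads a proper dataset
df = pd.DataFrame({
    'Mileage':  [10000, 30000, 55000, 80000, 120000],
    'Cylinder': [4, 4, 6, 6, 8],
    'Doors':    [4, 2, 4, 4, 2],
    'Price':    [21000, 19500, 17000, 14500, 12000],
})

X = df[['Mileage', 'Cylinder', 'Doors']]   # features, assumed independent
y = df['Price']

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# predict the price of a hypothetical car
new_car = pd.DataFrame({'Mileage': [45000], 'Cylinder': [6], 'Doors': [4]})
print(model.predict(new_car))
```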

Multi-Level Models

  • Effects happen at different levels
    • a lower level feature depends on its environment, and the level above that...

Multi-Level Models attempt to model these interdependencies.

Commonly applied in healthcare.

Not covered in more detail beyond the general discussion in lecture 22; a [book] is recommended instead for further reading.

Bayesian Methods: Concepts

Bayes' Theorem (not covering it here as it is in my PhD and MSc theses)

One good real-world application is a spam filter. Naive Bayes can be used to develop a model that discriminates normal (ham) emails from garbage (spam). There are lots of ways to improve it, but it works fairly well even in a basic form.

For more, check lectures 25-26.

Spam Classifier/Filter with Naive Bayes

Supervised learning.

Steps

  • Read in emails and their ham/spam classification (this is the bulk of the code)
  • Vectorize emails to numbers representing each word
  • Get an object that will perform Multinomial Naive Bayes from sklearn
  • Fit vectorized emails
  • Check it worked with test cases

For more code and details, see NaiveBayes.ipynb.
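
A minimal sketch of those steps with scikit-learn, using a tiny made-up corpus instead of real email files:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny illustrative corpus; the notebook reads real ham/spam emails from disk
data = pd.DataFrame({
    'message': ['Free money now!!!', 'Hi Bob, lunch tomorrow?',
                'Win a prize, click here', 'Meeting notes attached'],
    'class':   ['spam', 'ham', 'spam', 'ham'],
})

# vectorize emails into word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'])

# fit Multinomial Naive Bayes to the vectorized emails
classifier = MultinomialNB()
classifier.fit(counts, data['class'])

# check it worked with test cases
tests = ['Free prize money', 'See you at the meeting']
print(classifier.predict(vectorizer.transform(tests)))
```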

K-Means Clustering

Unsupervised learning

  • Attempts to split data into K groups that are closest to K centroids.

    • (1) Centroids are adjusted to the center of the points that were closest to them.

    • (2) Points are then reassigned to whichever centroid they are now closest to.

  • Repeat 1 & 2 until the error (the distance the centroids move) converges.

Caveats

  • choosing K

    • try increasing K until you stop getting large reductions in $\chi^2$
  • use different randomly chosen initial centroids to avoid local minima

  • Still need to determine labels for clusters found.

Example of its use can be found in KMeans.ipynb
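
A short sketch with scikit-learn on synthetic data; the centroid locations and choice of K below are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# synthetic 2-D blobs around four made-up centers (illustrative only)
np.random.seed(10)
data = np.concatenate([np.random.normal(center, 0.5, size=(20, 2))
                       for center in [(-3, -3), (0, 0), (3, 3), (3, -3)]])

model = KMeans(n_clusters=4, n_init=10)   # K is a choice; see the caveats above
model.fit(scale(data))                    # normalizing features first usually helps

print(model.labels_)    # which cluster each point was assigned to
print(model.inertia_)   # total squared distance to centroids; useful for picking K
```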

Entropy

  • A measure of a data set's disorder - how similar or different its contents are.
  • Classify data set into N classes.
    • Entropy of 0, implies all data is the same class
    • High entropy, implies there are many types of classes in the data

Computing Entropy

  • $H(S) = -p_1 \ln(p_1) - \dots - p_n \ln(p_n)$
  • $p_i$ represents the proportion of the data with class/label $i$
  • Terms where $p_i = 1$ (all the data is that class) or $p_i = 0$ (none of the data is that class) contribute zero to the entropy, so the entropy is non-zero only when portions of the data are in different classes.
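
A small sketch of the formula in numpy (the class labels are made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels (natural log, so in nats)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()           # p_i = proportion of data in class i
    return -np.sum(p * np.log(p))       # classes with p_i = 0 never appear in counts

print(entropy(['a', 'a', 'a', 'a']))    # 0.0  - all data is the same class
print(entropy(['a', 'b', 'c', 'd']))    # high - many different classes
```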

Decision Trees

Supervised learning

  • flowcharts to assist with classification choices
  • Ex. a tree of resume contents organized by their relation to the chances of being hired.

Random Forests

Can use 'from sklearn import tree', plus pandas to organize the data going into the trees. Graphviz can be used to visualize the resulting trees.

  • Decision trees are very susceptible to overfitting
    • construct many trees in a 'forest' and have them all 'vote' towards the outcome classification
      • MUST randomly sample data used to make each tree!
      • Also, randomize the attributes each tree is fitting.

Steps

  • read in data with pandas
  • convert columns used to make decisions into ordinal numbers with a map function
  • push these into a 'features' list
  • get an array of matching decisions from the supervised portion for training
  • make the decision tree
  • use graphviz to display the resulting tree

Then upgrade to using a random forest (see the sketch below).
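
A hedged sketch of those steps; the hiring data below is made up, and the graphviz rendering step from the notebook is omitted.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# hypothetical resume/hiring data
df = pd.DataFrame({
    'Years Experience': [10, 0, 7, 2, 20],
    'Employed?':        ['Y', 'N', 'N', 'Y', 'N'],
    'Top-tier school':  ['N', 'Y', 'N', 'Y', 'Y'],
    'Hired':            ['Y', 'N', 'N', 'Y', 'Y'],
})

# convert decision columns to ordinal numbers with a map
d = {'Y': 1, 'N': 0}
for col in ['Employed?', 'Top-tier school', 'Hired']:
    df[col] = df[col].map(d)

features = ['Years Experience', 'Employed?', 'Top-tier school']
X, y = df[features], df['Hired']

tree = DecisionTreeClassifier().fit(X, y)                   # single tree: prone to overfitting
forest = RandomForestClassifier(n_estimators=10).fit(X, y)  # forest of randomized trees that vote

candidate = pd.DataFrame([[10, 1, 0]], columns=features)
print(tree.predict(candidate), forest.predict(candidate))
```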

Ensemble Learning

Multiple models work together to make a prediction

  • Ex. random forests.

Methods

  • Bagging (bootstrap aggregating): many models built by training on randomly-drawn subsets of the data
  • Boosting: additional models are added to help address data mis-classified by the previous model
    • XGBoost is a very good and powerful tool for local and cluster/distributed solutions, available in C++, Python, R, Java, Scala, and Julia.
  • 'bucket of models': Train multiple models, then pick one that works best on test data
  • Stacking: run multiple models on same data, then combine output results

Advanced Ensemble Learning

  • Bayes Optimal Classifier (BOC)
    • Theoretically the best - but almost always impractical
  • Bayesian Parameter Averaging
    • Attempts to make BOC practical. Still susceptible to overfitting and often outperformed by simple bagging
  • Bayesian Model Combination
    • Tries to fix all of these
    • BUT, ends up about the same as finding best combination of models with cross-validation

Support Vector Machines (SVM)

  • Works well for high-dimensional data (lots of features)
  • Solves for high-dimensional support vectors to help divide up the data
  • Applies a 'kernel trick' to represent the data in higher dimensions in order to find hyperplanes that are not apparent in the lower dimensions.
    • This is computationally expensive, which is why it is not as useful for low-dimensional data.

Ex. Identify types of iris flower by length and width of sepal.

General:

With a simple linear kernel (see the sketch below).
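
A minimal sketch of this example with scikit-learn's built-in iris data, keeping only the two sepal features:

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data[:, :2]     # sepal length and width only, to keep it 2-D
y = iris.target          # the three iris species

# simple linear kernel; swap in kernel='rbf' or 'poly' for non-linear boundaries
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.predict([[6.1, 2.8]]))   # predicted species for a new sepal measurement
```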

Recommender Systems

User-Based Collaborative Filtering

  • Build matrix of things each user bought/viewed/rated
  • Compute similarity scores between users
  • Find similar users
  • Recommend stuff similar users bought... that the current user hasn't seen yet.

Caveats

  • People's likes change
  • Number of people commonly >> number of items. Thus, needs lots of filtering.
  • People intentionally fabricate fake users to boost or trash items to their advantage
    • Shilling attack

Item-based Collaborative Filtering

Resolves some of the problems mentioned above that arise from basing recommendations on people's actions.

  • fewer items than people, so faster to compute.
  • harder for people/users to game

Idea

  • Find all pairs of items bought/viewed/rated by same user
  • Measure the similarity of the items' ratings/purchases... across all users who bought/viewed both
  • sort by item
  • sort by similarity
  • Use a look-up table of the results to make recommendations to users

Steps

  • import data with pandas
  • convert data into a table of items, users, and rating/frequency bought...
  • Calculate the correlation between ratings/frequency bought with pandas
  • Clean out spurious results. This is tricky, but it is the most important part for making sure recommendations actually produce sales. Expect many rounds of cleaning the input data, tweaking the correlation function, and cleaning the resulting correlations.
  • Use cleaned correlations array(s) to make recommendations
  • Try grouping results to help find top matches
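
A rough pandas sketch of the idea; the ratings below are made up, while the notebook works with a real ratings dataset.

```python
import pandas as pd

# hypothetical user/item ratings
ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 3],
    'item':    ['A', 'B', 'A', 'C', 'A', 'B', 'C'],
    'rating':  [5, 4, 4, 2, 5, 5, 1],
})

# rows = users, columns = items, values = rating
matrix = ratings.pivot_table(index='user_id', columns='item', values='rating')

# pairwise item-item correlations; min_periods drops spurious correlations
# based on too few shared users (this threshold needs tuning/cleaning)
item_similarity = matrix.corr(method='pearson', min_periods=2)

# items most similar to item 'A', as candidate recommendations
print(item_similarity['A'].drop('A').sort_values(ascending=False))
```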

K-Nearest Neighbours (KNN)

Supervised learning

  • Similar to K Means Clustering
  • Classify new data points based on their distance from known data
  • Find the K nearest neighbours, based on this 'distance'
  • Allow all KNN to vote on classification

Example in KNN.ipynb.

Steps

  • import data with pandas
  • Group data by features of interest
  • Convert the features used for classification into a normalized form
  • Make a distance calculating function
  • Find KNN
  • Sort by distance and return the results
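
A small sketch of the manual approach on a made-up, already-normalized feature table:

```python
import numpy as np
import pandas as pd

# hypothetical normalized features for a handful of items
items = pd.DataFrame(
    {'feature_1': [0.1, 0.9, 0.2, 0.8], 'feature_2': [0.3, 0.7, 0.1, 0.9]},
    index=['item_a', 'item_b', 'item_c', 'item_d'],
)

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(a - b)

def k_nearest(query, k=2):
    """Return the k items closest to the query item."""
    target = items.loc[query].values
    dists = items.drop(query).apply(lambda row: distance(row.values, target), axis=1)
    return dists.sort_values().head(k)

print(k_nearest('item_a'))   # the neighbours can then vote on a classification
```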

Principal Component Analysis (PCA)

When data has too many dimensions, extract a set of basis vectors that can be combined to reproduce the high-dimensional data sufficiently well. Put another way: find a representation of the data with minimal dimensions that sufficiently preserves its variance.

Very useful in image compression and face recognition.

  • A common implementation is Singular Value Decomposition (SVD)

Ex. Identify types of iris flower by length and width of sepal. Data comes with scikit-learn.

  • With PCA, the 4 features (length & width of petals & sepals, 4D) are reduced to 2D

See PCA.ipynb for all code

Steps

  • Import data
  • Apply PCA
  • Check how much variance was captured
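
A short sketch of those steps on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

pca = PCA(n_components=2, whiten=True)   # reduce the 4-D measurements to 2-D
X_2d = pca.fit_transform(iris.data)

# how much of the original variance the two components preserve
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```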

Data Warehousing

ETL: Extract, Transform, Load

The more 'traditional' approach.

  • raw data from operational systems periodically extracted
  • raw data is transformed into a required schema
  • transformed data is loaded into warehouse

  • BUT, step 2 (transform) can be a big problem with "big data"

ELT: Extract, Load, Transform

Push intensive transformation step to the end where it can be better optimized. This approach is now much more scalable than ETL.

  • Extract raw data as before
  • load it into the data warehouse raw
  • let cluster (Hadoop) process and manage data in-place
  • Query reduced data with new methods such as NoSQL, Spark or MapReduce

Reinforcement Learning

One example is Pac-Man.

Idea:

  • agent 'explores' space
  • allow agent to learn values of different state changes in different conditions
  • state & choice values then used to make informed future decisions

Q-Learning

Implementation of reinforcement learning.

  • have:

    • set of environmental states s
    • set of possible actions a for each state
    • value of state/action Q
  • Start all Q's at 0

  • explore
  • bad things -> reduce Q for that state/action
  • good things -> increase Q
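
The standard Q-learning update rule captures this increase/decrease idea, with learning rate $\alpha$ and discount factor $\gamma$, where $r$ is the observed reward and $s'$ is the resulting state:

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$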

The exploration problem

Use Bayes' theorem to include intelligent randomness in the exploration and increase the learning efficiency. The resulting framework is a Markov Decision Process (MDP).

Use this in tandem with Q-learning to build up a table of all possible states and the reward values (Q values) for every available action in each state. In some cases/terms this can be considered to implement dynamic programming or memoization.

Dealing with Real-World Data

Lectures 48-53 cover issues that arise when applying the course fundamentals to real-world data.

Apache Spark: Machine Learning on Big Data

Using MLLib to essentially do things like K-Means Clustering, Decision Trees... as reviewed in pure Python before, but in a way that can be run locally OR on a Hadoop cluster with Amazon Web Services (AWS).
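
A hedged sketch of what MLLib K-Means looks like with the newer DataFrame-based API; the file path and column handling here are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# hypothetical CSV of numeric features; in practice this might live on HDFS or S3
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# MLLib estimators expect a single vector column of features
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
assembled = assembler.transform(df)

model = KMeans(k=5, seed=1).fit(assembled)
print(model.clusterCenters())

spark.stop()
```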




Common models and tools in ML/AI:



Neural Network Models:


Types:

Feedforward (acyclic graphs)

  • Autoencoders
  • Denoising autoencoders
  • Restricted Boltzmann machines (stacked, they form deep-belief networks)

Convolutional

Deep convolutional networks are SOTA for images. There are many well known architectures, including AlexNet and VGGNet.

Convolutional networks usually involve a combination of convolutional layers along with subsampling (pooling) and fully connected feedforward layers.

Recurrent

These handle time series data especially well. They can be combined with convolutional networks to generate captions for images.

Recursive

These handle natural language especially well.


  • MLP Multi-layer perceptron
  • CNN Convolutional Neural Network
  • RNN Recurrent Neural Network
  • RvNN Recursive Neural Network
  • LSTM Long Short Term Memory
  • FRN Fully recurrent network
  • HN Hopfield network
  • EN Elman network
  • JN Jordan network
  • ESN Echo state network
  • BRNN Bi-directional RNN

Stochastic gradient descent

Optimize the cost function while training the model to give the highest accuracy.

  • Momentum SGD
  • AdaGrad
  • RMSprop
  • AdaDelta
  • Adam
  • Nesterov's Accelerated Gradient Descent
  • Graves' RMSProp

Activation function

Outputs of perceptrons/neurons/nodes generated by passing weighted inputs through an 'activation function'.

  • ReLU (simple rectifier; returns max(x, 0))
  • Sigmoid
  • Soft Max
  • Max Out
  • Tanh
  • Identity
  • Leaky ReLU
  • Clipped ReLU
  • Exponential Linear Unit
  • Log Soft Max
  • Soft Plus
  • Parametric ReLU
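
A few of these are easy to write out directly in numpy; a quick sketch, not how a framework actually implements them:

```python
import numpy as np

def relu(x):        # simple rectifier: max(x, 0)
    return np.maximum(x, 0)

def sigmoid(x):     # squashes input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):     # turns a vector of scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), softmax(x))
```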

Pre-training

Training done beforehand to estimate good initial weights for the inputs to nodes.

  • Denoising Auto-Encoder
  • Auto-encoder
  • Deep Boltzmann Machine
  • Restricted Boltzmann Machine Gibbs sampling
  • Restricted Boltzmann Machine Contrastive Divergence
  • Deep Belief Network
  • Gaussian unit
  • RELU unit

Data Normalization

  • standardization
  • PCA whitening
  • ZCA whitening



Big Data Solutions/Distributed computing notes

Spark

  • Alternative to tools like MapReduce.
  • More flexible and can work with other file systems: Cassandra, AWS S3...
  • Keeps most data in memory to be faster, BUT RAM can overflow
  • MapReduce writes back to disk after each step, so slower...
  • MLLib provides powerful, easy to use ML & data mining tools
  • RDDs (Resilient Distributed Datasets) are the OLD way; the latest versions use DataFrames like those in SQL and Pandas



Misc Others:

An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
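
A hedged sketch of fitting an ARIMA model with statsmodels on a synthetic series; the series and the (p, d, q) order below are made up.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# synthetic monthly series with a trend plus noise (illustrative only)
index = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.linspace(10, 30, 48) + np.random.normal(0, 1, 48), index=index)

# order = (p, d, q): p autoregressive terms, d differences, q moving-average terms
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))   # forecast the next 6 months
```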

