The Machine Learning landscape

Types of ML systems

ML systems can be classified along several criteria:

  • whether or not they are trained with human supervision [supervised, unsupervised, semi-supervised, reinforcement learning]
  • whether or not they can learn incrementally on the fly [online vs batch]
  • whether they work by comparing new data points to known training examples, or by detecting patterns in the training data and building a model to predict new data [instance-based vs model-based]

Classification based on supervision during training

Supervised learning

You feed labeled data to the algorithm: each training example comes with the desired output (its label). Classification and regression are the typical problems solved with supervised learning. Some popular algorithms:

  • k-Nearest Neighbors (KNN)
  • Linear regression
  • Logistic regression
  • Support Vector Machines (SVM)
  • Decision Trees and Random Forests
  • Neural networks
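
To make "labeled data" concrete, here is a minimal sketch of one algorithm from the list, logistic regression, trained by plain gradient descent on a single feature. The dataset (hours studied, pass/fail) and the function names are made up for illustration; a real project would more likely use a library implementation.

```python
import math


def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x   # gradient of the log loss w.r.t. w
            grad_b += (p - y)       # gradient of the log loss w.r.t. b
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def predict(w, b, x):
    """Predict class 1 when the estimated probability exceeds 0.5."""
    return 1 if w * x + b > 0 else 0


# Hypothetical labeled data: hours studied -> passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(hours, passed)
print(predict(w, b, 1.0), predict(w, b, 4.0))
```

The labels in `passed` are what make this supervised: the algorithm is told the right answer for every training example.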

Unsupervised learning

The training data is unlabeled, and the system tries to figure out the relationships in it without a teacher. Clustering, anomaly detection, and dimensionality reduction are typical problems for this type of learning. Some popular algorithms:

  • Clustering
    • K-Means
    • Hierarchical Cluster Analysis (HCA)
    • Expectation Maximization
  • Visualization and dimensionality reduction
    • Principal Component Analysis (PCA)
    • Kernel PCA
    • Locally-Linear Embedding (LLE)
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Association rule learning (dig into large amounts of data to find interesting relationships between attributes)
    • Apriori
    • Eclat
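
As an illustration of learning without labels, the following is a minimal sketch of K-Means (Lloyd's algorithm) on 2-D points. The sample points and the choice of starting centroids are assumptions made for the example; real implementations usually pick starting centroids more carefully and stop when assignments no longer change.

```python
def k_means(points, centroids, iterations=10):
    """Run Lloyd's algorithm on 2-D points from the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: two visually obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge only from the distances between points.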

Semi-supervised learning

Algorithms that can learn from a small amount of labeled data combined with a lot of unlabeled data. Some examples:

  • Deep belief networks (DBNs)
  • Restricted Boltzmann machines (RBMs), the unsupervised components that DBNs stack on top of one another
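
DBNs and RBMs are too involved for a short example, so the sketch below swaps in a much simpler semi-supervised technique, self-training with a nearest-centroid classifier, purely to show how a few labels and many unlabeled points can be combined. The data, labels, and function names are all hypothetical.

```python
def nearest_centroid_label(x, centroids):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def self_train(labeled, unlabeled, rounds=5):
    """labeled: list of (x, label) pairs; unlabeled: list of x values."""
    data = list(labeled)
    centroids = {}
    for _ in range(rounds):
        # Fit: one centroid per class from everything labeled so far.
        for label in {lab for _x, lab in data}:
            values = [x for x, lab in data if lab == label]
            centroids[label] = sum(values) / len(values)
        # Pseudo-label the unlabeled points and use them in the next fit.
        data = list(labeled) + [
            (x, nearest_centroid_label(x, centroids)) for x in unlabeled
        ]
    return centroids


# Two labeled examples and six unlabeled ones (hypothetical 1-D data).
labeled = [(1.0, "a"), (9.0, "b")]
unlabeled = [1.5, 2.0, 2.5, 7.0, 8.0, 8.5]

centroids = self_train(labeled, unlabeled)
print(centroids)
print(nearest_centroid_label(4.0, centroids), nearest_centroid_label(6.0, centroids))
```

The unlabeled points pull each class centroid away from the single labeled example toward the center of its group, which is the benefit semi-supervised methods aim for.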

Reinforcement learning

The learning system (the agent) observes the environment, selects and performs actions, and gets rewards or penalties in return. It must learn by itself the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should take in a given situation.
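
The agent, reward, and policy ideas can be sketched with tabular Q-learning, one common reinforcement learning algorithm (chosen here for brevity, not named in these notes). The corridor environment, reward values, and parameter names below are assumptions for illustration: the agent starts in cell 0 and is rewarded only when it reaches cell 4.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # step left or step right


def step(state, action):
    """Hypothetical environment: reward 1.0 on reaching the goal, else 0.0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values by random exploration, then return the greedy policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # explore with uniformly random actions
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted best future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    # The policy maps each non-goal state to its highest-valued action.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}


policy = q_learning()
print(policy)
```

The returned dictionary is the policy: for every situation (cell) it names the action to take, here a step to the right toward the goal.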

Classification based on incremental learning

Batch learning

  • the system is incapable of learning incrementally: it must be trained on all the available data. Since this takes a lot of time and resources, it is typically done offline, hence the name offline learning.
  • when new data arrives, a new version of the system must be trained from scratch on the full dataset (not just the new part) and then swapped in for the old one.

Online learning

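  • the system is trained incrementally by feeding it data sequentially, either one instance at a time or in small groups called mini-batches.
  • this also helps when the training data is so large that it does not fit in one machine's memory (out-of-core learning): the data is loaded in mini-batches, and each batch can be discarded before the next one is loaded.
  • the learning rate determines how fast the system adapts to new or changing data. A high learning rate adapts quickly but also forgets old data quickly; a low learning rate adapts more slowly but is less sensitive to noise in new data.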
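
The incremental updates and the effect of the learning rate can be sketched as follows. This is a minimal example, assuming a linear model updated by mini-batch gradient descent and a made-up, noise-free data stream drawn from y = 2x + 1; the function names are hypothetical.

```python
def sgd_update(w, b, batch, learning_rate):
    """Apply one gradient step for y ~ w*x + b using a single mini-batch."""
    n = len(batch)
    grad_w = sum((w * x + b - y) * x for x, y in batch) / n
    grad_b = sum((w * x + b - y) for x, y in batch) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


def stream_of_batches(n_batches, batch_size=4):
    """Hypothetical data stream: noise-free samples of y = 2x + 1."""
    for i in range(n_batches):
        xs = [((i * batch_size + j) % 10) / 10 for j in range(batch_size)]
        yield [(x, 2 * x + 1) for x in xs]


def train_online(learning_rate, n_batches=300):
    """Consume the stream one mini-batch at a time, never storing old batches."""
    w, b = 0.0, 0.0
    for batch in stream_of_batches(n_batches):
        w, b = sgd_update(w, b, batch, learning_rate)
    return w, b


def error(w, b):
    """Distance of the learned parameters from the true ones (2 and 1)."""
    return abs(w - 2) + abs(b - 1)


fast_error = error(*train_online(learning_rate=0.5))
slow_error = error(*train_online(learning_rate=0.01))
print(fast_error, slow_error)
```

After the same number of mini-batches, the run with the higher learning rate ends up much closer to the true parameters. With noisy data the trade-off would show up as well: a high rate would also chase the noise.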

Classification based on generalization

Instance-based learning

  • the system learns the examples by heart, then generalizes to new cases by comparing them to the stored examples using a measure of similarity
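
k-Nearest Neighbors is the classic instance-based method, and a minimal sketch looks like this. The "training" is just keeping the examples; Euclidean distance stands in for the similarity measure. The feature vectors and the ham/spam labels are invented for the example.

```python
import math
from collections import Counter


def knn_predict(training_set, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples."""
    neighbors = sorted(training_set, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _features, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored examples: (feature vector, label).
training_set = [
    ((1, 1), "ham"), ((1, 2), "ham"), ((2, 1), "ham"),
    ((8, 8), "spam"), ((9, 8), "spam"), ((8, 9), "spam"),
]

print(knn_predict(training_set, (2, 2)))
print(knn_predict(training_set, (7, 9)))
```

Nothing is fitted ahead of time: every prediction goes back to the memorized examples and measures how similar the new case is to each of them.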

Model-based learning

  • the system builds a model from the training data and then uses that model to make predictions on new data.
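
In contrast with storing examples, a model-based learner compresses the training data into a few parameters. A minimal sketch, assuming a straight-line model fitted by ordinary least squares on made-up data:

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope*x + intercept for one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # predict for an unseen x
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```

Once `slope` and `intercept` are known, the training data is no longer needed to make predictions.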

Main challenges of ML

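  • insufficient training data - see "The Unreasonable Effectiveness of Data" (Halevy, Norvig, and Pereira, 2009), which argues that for complex problems the amount of data can matter more than the choice of algorithm
  • non-representative or low-quality training data - sampling bias, poor-quality data (errors, outliers, noise), and irrelevant features; the last can be addressed through feature engineering (feature selection and feature extraction)
  • overfitting the training data - happens when the model is too complex relative to the amount and noisiness of the training data. One solution is to constrain the model (regularization).
    • hyperparameters - parameters of the learning algorithm itself (not of the model), set before training; the amount of regularization to apply is controlled by a hyperparameter
  • underfitting the training data - happens when the model is too simple to learn the underlying structure of the data.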
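
Constraining a model through a hyperparameter can be sketched with ridge (L2) regularization on a one-feature linear model. Ridge is one possible choice of constraint, picked here because it has a simple closed form; the data and the name `alpha` for the regularization strength are assumptions of the example.

```python
def fit_ridge_line(xs, ys, alpha):
    """Fit y ~ slope*x + intercept, penalizing slope**2 with weight alpha."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / (s_xx + alpha)        # alpha = 0 gives ordinary least squares
    intercept = mean_y - slope * mean_x  # the intercept is not penalized
    return slope, intercept


# Hypothetical noisy training data.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

free_slope, _ = fit_ridge_line(xs, ys, alpha=0.0)
constrained_slope, _ = fit_ridge_line(xs, ys, alpha=5.0)
print(free_slope, constrained_slope)
```

Here `alpha` is a hyperparameter: it is set before training and is not learned from the data. Larger values force a flatter, simpler line, which lowers the risk of overfitting; a value that is too large pushes the model toward underfitting.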
