Machine Learning - Introduction

Introduction

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

--Tom M. Mitchell, Carnegie Mellon University

So if your program is to predict, say, weather on long weekends (task T), you can run it through a machine learning algorithm with past data (experience E) and, if it has successfully “learned”, it will be better at predicting weather forcasts (performance measure P).

Few other most popular definitions are

Ability of a machine to improve its own performance through the use of artificial intelligence techniques in order to mimic the ways of humans seem to learn (repetition and experience)

and

Machine learning is to use generic algorithms with a set of data (Problem specific) without having to write any custom code specific to the problem. Instead, data is feed to those generic algorithm(s) and it builds its own logic based on the provided data

Examples of Machine Learning

  • Auto-driving cars/trucks etc
  • Siri, Google Now
  • Auto spell correction
  • Handwriting detection
  • Spam detection for emails and webpages

ML methods

Following are the common ML methods

  • Supervised Learning in which the data comes with additional attributes that we want to predict. This problem can be either:
    • Classification - The action or process of classifying something. Classification means to group the output into a class.
    • Regression - a measure of the relation between the mean value of one variable (e.g. output) and corresponding values of other variables (e.g. time and cost). It is defined as E[Y | X] (the expectation of Y given X). A subset of these types of models where Y is binary or categorical (including logistic regression and multinomial regression along with many machine learning algorithms that essential have the same target [such as classification trees]) are useful for classification.
  • Unsupervised Learning in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization (Click here to go to the Scikit-Learn unsupervised learning page).
  • Semi-supervised learning in which combines both labeled and unlabeled example to generate an appropriate function or classifier.
  • Reinforcement Learning
  • Association Learning
  • Transduction is similar to supervised learning, but does not explicitly construct a function: instead, tries to predict new outputs based on training inputs, training outputs, and new inputs.
  • Learning to learn is where the algorithm learns its own inductive bias based on previous experience.

Supervised machine learning algorithm

In it the program is "trained" using the past event's data (called training data), which then facilitate its ability to predict future events accurately when provided new data.

In this algorithm, data is a set of training examples with the associated "correct answers" and it learns to predict the correct answer from this training set.

In other words, it can be said that we have sets of correct/validated questions and answers and using them system tries to find the answer to the asked question.

There are many analysis techniques which are employed, few of them are as follows

Regression analysis

Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent and independent variable(s).

It is manly used in forecasting, time series modelling and finding the causal effect relationship between the variables. For example, relationship between rash driving and number of road accidents by a driver is best studied through regression.

Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line to the data points, in such a manner that the differences between the distances of data points from the curve or line is minimized. I’ll explain this in more details in coming sections.

Types of Regression techniques

There are many kinds of regression techniques available for predictions and are mostly defined by the following three metrics

  • number of independent variables
  • type of dependent variables
  • shape of regression line

Following are the most common regression algorithms

  • Linear Regression
  • Logistic Regression
  • Polynomial Regression
  • Stepwise Regression
  • Ridge Regression
  • Lasso Regression
  • ElasticNet Regression

Classification

Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is already defined.

Examples of Classification

  • text categorization (e.g., spam filtering)
  • fraud detection
  • optical character recognition ORC for exams and print to text
  • Face detection
  • natural-language processing (e.g. Apple's Siri & Google Now)
  • market segmentation

Unsupervised learning Algorithm

In an unsupervised learning algorithm, the algorithm can find trends in the data it is given, without looking for some specific “correct answer”. Examples of unsupervised learning algorithms involve clustering (grouping similar data points) or anomaly detection (detecting unusual data points in a data set).

Association Learning

Basket analysis: P (Y | X ) probability that somebody who buys X also buys Y where X and Y are products/services. Example: P ( chips | beer ) = 0.7

Steps used in Machine Learning

The most common steps in creating a solution using Machine Learning are as follows

Step 1 - Data Collection

In this step historic data is collected from various data sources, such as weather station, news archive etc depending on the requirement.

It is a good practice to have large variety, density and volume of relevant data, because better the data better prospects for the machine to predict.

Data Sets - Training Sets & Data Sets

Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the training set on which we learn data properties and one that we call the testing set on which we test these properties.

Step 2 - Data Preparation/Sanitization

Step 3 - Training Model

Step 4 - Evaluating Model


In [ ]:

Step 5 - Performance tweaking


In [ ]:

Keyworks in Machine Learning

Deep Learning

Deep Learning is associated with Artificial Neural Network. ANN uses the concept of human brain to facilitate the modeling of arbitrary functions. ANN requires a vast amount of data and this algorithm is highly flexible when it comes to model multiple outputs simultaneously.

Natural Language Processing


In [ ]:

Data Mining


In [ ]:

Data Science


In [ ]:

Correlation: Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price.

Reference:

1. Dataset: 
    - http://archive.ics.uci.edu/ml/index.php
2. Tutorials
    - https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/