Copyright (c) 2015, 2016 Sebastian Raschka, Li-Yi Wei
Problem
Algorithm
Analysis
In [1]:
# a simple Python program to find the minimum of a list of numbers
# note that it remains the same regardless of the input data
import math

# function
def findmin(numbers):
    answer = math.inf
    for value in numbers:
        if value < answer:
            answer = value
    return answer

# main
test = [3.14, 2.2, 8, -9.2, 100000, 0]
print(findmin(test))
Problem
Algorithm
In [2]:
# code is left as an exercise
The program can learn from data and change its structure/behavior, as sketched below
The programmer still writes (part of) the program, but it is not fixed
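As a minimal sketch (a toy threshold model, not from these notes), here is a program whose decision rule comes from data rather than being hard-coded by the programmer:

# learn a decision threshold as the midpoint between class means
def train(positives, negatives):
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2

def predict(threshold, value):
    return value >= threshold

# the behavior (threshold) comes from the data, not from the programmer
threshold = train(positives=[7, 8, 9], negatives=[1, 2, 3])
print(predict(threshold, 6.5))  # True, because 6.5 >= (8 + 2) / 2 = 5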
Problem
Traditional programming?
Machine learning
Sometimes it is much easier to say what (example data) than how (algorithm)
Given sequences of actions of an agent and feedback from an environment, learn to select action sequences in a way that maximizes the expected reward, e.g.:
playing games
self-driving cars
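As a minimal sketch of this agent-environment loop (a toy two-armed bandit with hypothetical reward probabilities, not from these notes):

import random

true_reward = {"left": 0.3, "right": 0.7}   # hidden from the agent
estimates = {"left": 0.0, "right": 0.0}     # the agent's learned values
counts = {"left": 0, "right": 0}

for step in range(1000):
    # epsilon-greedy: mostly pick the action with the best estimate so far
    if random.random() < 0.1:
        action = random.choice(["left", "right"])
    else:
        action = max(estimates, key=estimates.get)
    # the environment returns a reward (the feedback)
    reward = 1.0 if random.random() < true_reward[action] else 0.0
    counts[action] += 1
    # incremental average of the observed rewards for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # approaches the hidden reward probabilities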
Types of learning
Supervised learning
Unsupervised learning
Reinforcement learning
Types of data
Discrete/continuous $\times$ input/output
Discrete output (classification)
Continuous output (regression)
We can represent the model as a function $f$, with a set of parameters $\Theta$.
Given a set of inputs $\mathbf{X}$, the model computes outcomes $\mathbf{Y}$.
$$\mathbf{Y} = f(\mathbf{X}, \Theta)$$
For example, in digit recognition, $\mathbf{X}$ and $\mathbf{Y}$ are the digit images and digit labels (0 to 9), respectively.
The parameters $\Theta$ consist of those optimized automatically and those manually picked by humans. The latter are called hyper-parameters.
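As a minimal sketch of this distinction (an illustrative polynomial fit with numpy, not from these notes):

import numpy as np

degree = 2                          # hyper-parameter: picked by a human
X = np.array([0.0, 1.0, 2.0, 3.0])
T = np.array([1.0, 2.0, 5.0, 10.0])
coeffs = np.polyfit(X, T, degree)   # parameters: optimized from the data
print(coeffs)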
Every machine learning task has a goal, which can be formalized as a loss function: $$L(\mathbf{X}, \mathbf{T}, \mathbf{Y}),$$ where $\mathbf{T}$ is some form of target or auxiliary information, such as ground-truth labels in supervised learning.
In addition to the objective, we often care about the simplicity of the model, for better efficiency and generalization (avoiding over-fitting). The complexity of the model can be measured by another penalty function: $$P(\Theta)$$ Common penalty functions measure the number and/or magnitude of the parameters.
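For example, one common magnitude-based penalty (one possible choice, not prescribed here) is the L2 penalty: $$P(\Theta) = \lambda \sum_k \theta_k^2,$$ where the weight $\lambda$ is itself a hyper-parameter.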
We can sum up both the loss and regularization terms as the total objective: $$\Phi(\mathbf{X}, \mathbf{T}, \Theta) = L\left(\mathbf{X}, \mathbf{T}, \mathbf{Y}=f(\mathbf{X}, \Theta)\right) + P(\Theta)$$
During training, the goal is to optimize the parameters $\Theta$ with respect to the given training data $\mathbf{X}$ and $\mathbf{T}$: $$\operatorname{argmin}_\Theta \; \Phi(\mathbf{X}, \mathbf{T}, \Theta)$$ We then hope the trained model will generalize well to future data.
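As a minimal sketch of this optimization (gradient descent on a toy one-parameter objective; the notes do not prescribe a particular optimizer):

# toy objective Phi(theta) = (theta - 3)^2, minimized at theta = 3
def grad_phi(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 0.1
for step in range(100):
    theta -= learning_rate * grad_phi(theta)  # step against the gradient

print(theta)  # close to argmin_theta Phi = 3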
Given a set of data points $\left(\mathbf{X}, \mathbf{T}\right)$, fit a model curve to describe their relationship.
This is actually a regression problem, but since we have all seen it in prior math/coding classes, it serves as a good introductory example of machine learning.
Recall $\mathbf{Y} = f(\mathbf{X}, \Theta)$ is our model.
For 2D linear curve fitting, the model is a straight line: $y = w_1 x + w_0$, so the parameters $\Theta = \{w_0, w_1\}$.
The loss function is $L\left(\mathbf{X}, \mathbf{T}, \mathbf{Y}\right) = \sum_i \left( T^{(i)} - Y^{(i)}\right)^2 = \sum_i \left( T^{(i)} - w_1 X^{(i)} - w_0 \right)^2$.
($\mathbf{X}$ is a matrix/tensor, and each data sample is a row. We denote the ith sample/row as $\mathbf{X}^{(i)}$.)
For this simple example we don't care about regularization, thus $P(\Theta) = 0$.
The goal is to optimize $\Theta = \{w_0, w_1 \}$ given $\left(\mathbf{X}, \mathbf{T}\right)$ to minimize $L$. For simple cases like this, we can optimize directly via calculus: $$ \begin{align} \frac{\partial L}{\partial w_0} & = 0 \\ \frac{\partial L}{\partial w_1} & = 0 \end{align} $$
The math and coding are left as an exercise.
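As a sanity check for that exercise, here is a minimal sketch of the resulting closed-form (normal-equation) solution, assuming numpy and synthetic data:

import numpy as np

# synthetic data (hypothetical): T = 2x + 1 plus a little noise
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
T = 2.0 * X + 1.0 + np.random.normal(scale=0.1, size=X.shape)

# setting dL/dw0 = dL/dw1 = 0 yields the normal equations:
n = len(X)
w1 = (n * np.sum(X * T) - np.sum(X) * np.sum(T)) / (n * np.sum(X ** 2) - np.sum(X) ** 2)
w0 = (np.sum(T) - w1 * np.sum(X)) / n
print(w0, w1)  # close to 1 and 2
# np.polyfit(X, T, 1) should agree, up to numerical error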
Artificial intelligence
Cognitive science
In [3]:
from IPython.display import YouTubeVideo
YouTubeVideo("-O01G3tSYpU")
Out[3]: