Find an example that illustrates all of these concepts:
Let's imagine we want to build a handwritten digit recognizer. Suppose that, using some advanced pattern recognition technique, we are able to identify the number of straight horizontal lines in each sample containing a digit. These straight horizontal lines are the features.
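As an illustration, here is a minimal sketch of such a feature extractor on a made-up binary image; the image, the strokes, and the `minRun` threshold are all assumptions for the example:

```matlab
% Hypothetical feature extractor: count horizontal strokes in a binary digit image.
% A row with at least minRun ink pixels is counted as one horizontal line.
img = false(28, 28);        % made-up 28x28 binary image
img(5, 6:22)  = true;       % one horizontal stroke
img(20, 8:20) = true;       % another horizontal stroke
minRun = 10;                % assumed threshold for "a straight horizontal line"
n_horizontal_lines = sum(sum(img, 2) >= minRun);
disp(['Feature value (number of horizontal lines): ', num2str(n_horizontal_lines)]);
```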
A prediction model is an example of supervised learning. Imagine a dataset where you want to predict whether a house will be sold based on its price and the average household income of the area where it is located, such as:
| House Price | Average Income | Sold? |
|---|---|---|
| 120000 | 24000 | Yes |
| 135000 | 30000 | Yes |
| 180000 | 24000 | No |
You could feed this data to a learning algorithm so that it learns how the $n$ features contribute to a house being sold, and then use it to predict whether a new house will sell.
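For instance, a minimal nearest-neighbour sketch on the toy table above (the query house is made up):

```matlab
% 1-nearest-neighbour prediction on the toy house dataset
X_train = [120000 24000; 135000 30000; 180000 24000];  % [price, average income]
y_train = {'Yes', 'Yes', 'No'};                         % sold?
x_query = [150000 26000];                               % hypothetical new house
mu = mean(X_train);  sigma = std(X_train);              % normalize so price does not dominate
Xn = (X_train - mu) ./ sigma;
xq = (x_query - mu) ./ sigma;
[~, idx] = min(sum((Xn - xq).^2, 2));                   % closest training example
disp(['Prediction for the new house (sold?): ', y_train{idx}]);
```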
Try to predict if a picture contains a cat or a tiger.
Try to predict stock prices, or estimate the remaining battery of a phone based on several features.
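A minimal regression sketch for the battery example, with made-up features and measurements:

```matlab
% Ordinary least squares: predict remaining battery (%) from made-up usage features
X = [1.0 0.2; 2.5 0.5; 4.0 0.4; 5.5 0.8; 7.0 0.9];  % [hours of screen-on time, avg CPU load]
y = [90; 70; 55; 35; 15];                            % remaining battery (%)
w = [ones(size(X, 1), 1), X] \ y;                    % least-squares fit with intercept
x_new = [3.0 0.6];                                   % hypothetical phone state
y_hat = [1, x_new] * w;
disp(['Estimated remaining battery: ', num2str(y_hat), ' %']);
```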
Try to classify $n$ people into $m$ groups. For example, you have a dataset with the height and weight of your clients and you want to find the best fit for S, M, and L clothing sizes. Clustering would be a good approach.
Clustering is an unsupervised learning technique used to automatically group unlabeled data into groups, also known as clusters. Once clustered, the newly labeled data can be used to improve the model.
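A minimal sketch of the clothing-size example, assuming `kmeans` is available (Statistics Toolbox in Matlab, statistics package in Octave) and using made-up client data:

```matlab
% Cluster made-up client data (height in cm, weight in kg) into 3 groups: S, M, L
heights = [155 160 162 170 172 175 183 185 190]';
weights = [ 50  55  54  65  68  70  85  88  95]';
data = [heights, weights];
k = 3;                                    % S, M, L
[groupIdx, centroids] = kmeans(data, k);  % assign each client to one of k clusters
disp('Cluster centroids (height, weight):');
disp(centroids);
```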
Reminder: we define Bayes' Theorem as $p(A|B) = \frac{p(B|A)p(A)}{p(B)}$, where $A, B$ are events and $p(B) \neq 0$.
Let $A$ denote the event that the feline is a puma (and therefore dangerous). Let $B$ denote the event of observing a big feline, similar in size to a puma and bigger than 80% of cats. We know that the probability of a randomly chosen feline being a puma is 0.1, and of it being a cat is 0.9. We also know that 85% of pumas are as big as the feline observed, while at most 20% of cats are that big:
$p(A) = 0.1$
$p(B) = 0.1 \cdot 0.85 + 0.9 \cdot 0.2 = 0.265$
Then, the probability of observing a big feline ($B$) given that the feline is a puma ($A$) is:
$p(B|A) = 0.85$
Then, using Bayes' Theorem, we can calculate the probability of the feline being a puma given that we observed a big feline, $p(A|B)$:
$p(A|B) = \frac{p(B|A)p(A)}{p(B)} = \frac{0.85 \cdot 0.1}{0.265} \approx 0.32$
Indeed, I would run.
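The same calculation can be checked numerically; a minimal sketch using the numbers above:

```matlab
% Sanity check of the Bayes computation above
p_puma = 0.1;   p_cat = 0.9;               % priors
p_big_given_puma = 0.85;                   % p(B|A)
p_big_given_cat  = 0.20;
p_big = p_big_given_puma*p_puma + p_big_given_cat*p_cat;   % p(B) = 0.265
p_puma_given_big = p_big_given_puma*p_puma / p_big;        % p(A|B), about 0.32
disp(['p(puma | big feline) = ', num2str(p_puma_given_big)]);
```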
In [84]:
size_var = linspace(0, 120, 200);   % grid of feline sizes in cm
mu_cat = 40;        % mean cat size
mu_puma = 70;       % mean puma size
sigma_cat = 12;     % std. deviation of cat sizes
sigma_puma = 7;     % std. deviation of puma sizes
p_cat = normpdf(size_var, mu_cat, sigma_cat);      % pdf of cat sizes
p_puma = normpdf(size_var, mu_puma, sigma_puma);   % pdf of puma sizes
plot(size_var, p_cat, 'lineWidth', 2);
hold on;
plot(size_var, p_puma, 'lineWidth', 2);
title('PDFs of the two distributions');
ylabel('probability');
xlabel('Size [cm]');
% Explicit Gaussian pdfs, used to locate the intersection of the two curves
p_cat_func = @(x) exp(-(x-mu_cat).^2 / (2*sigma_cat^2)) / sqrt(2*sigma_cat^2*pi);
p_puma_func = @(x) exp(-(x-mu_puma).^2 / (2*sigma_puma^2)) / sqrt(2*sigma_puma^2*pi);
% The decision boundary lies where the two pdfs intersect
intersection = fzero(@(x) p_cat_func(x) - p_puma_func(x), (mu_cat + mu_puma)/2);
disp(['The decision boundary is at size = ', num2str(intersection)]);
line([intersection, intersection], [0, p_cat_func(intersection)], 'Color','red','LineStyle','--', 'lineWidth', 1.5);
plot(intersection, p_cat_func(intersection), 'r*', 'MarkerSize', 10, 'lineWidth', 2);
legend('Cat Prob. Dist.', 'Puma Prob. Dist.', 'Decision boundary', 'Intersection of distributions');
In [85]:
disp('Assuming the variance is the same for both distributions...')
size_var = linspace(0,120, 200);
mu_cat = 40;
mu_puma = 70;
sigma_cat = 12;
sigma_puma = 12;
p_cat = normpdf(size_var, mu_cat, sigma_cat);
p_puma = normpdf(size_var, mu_puma, sigma_puma);
plot(size_var, p_cat, 'lineWidth', 2);
hold on;
plot(size_var, p_puma, 'lineWidth', 2);
title('PDFs of the two distributions');
ylabel('probability');
xlabel('Size [cm]');
p_cat_func = @(x) exp(-(x-mu_cat).^2 / (2*sigma_cat^2)) / sqrt(2*sigma_cat^2*pi);
p_puma_func = @(x) exp(-(x-mu_puma).^2 / (2*sigma_puma^2)) / sqrt(2*sigma_puma^2*pi);
% With equal variances the two pdfs intersect exactly halfway between the means
intersection = fzero(@(x) p_cat_func(x) - p_puma_func(x), (mu_cat + mu_puma)/2);
disp(['At this boundary both classes are equally likely (decision error 0.5): size = ', num2str(intersection)]);
line([intersection, intersection], [0, p_cat_func(intersection)], 'Color','red','LineStyle','--', 'lineWidth', 1.5);
plot(intersection, p_cat_func(intersection), 'ro', 'lineWidth', 2);
legend('Cat Prob. Dist.', 'Puma Prob. Dist.', 'Decision boundary', 'Intersection of distributions');
What we have to do is calculate the probability density function (pdf) of each of the two distributions: one pdf for the chicken eggs and another for the goose eggs. This time, however, we have two variables, weight and size, so we have to use the multivariate Gaussian pdf, given by the formula:
$$\operatorname{det}(2\pi\boldsymbol\Sigma)^{-\frac{1}{2}} \, e^{ -\frac{1}{2}(\mathbf{x} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x} - \boldsymbol\mu) }$$

We then evaluate both pdfs at the sample, compare the values, and predict the class with the higher pdf. As explained in the previous exercise, the probability of predicting correctly given distribution A is the pdf of distribution A divided by the sum of the pdfs of all the distributions. Using Matlab we can easily visualize the results:
In [86]:
scale = 3;    % plot +/- 3 standard deviations around each mean
s = scale;
res = 40;     % grid resolution per axis
figure(1)
% Chicken?
meanWeight_chicken = 54;
meanHeight_chicken = 5;
mu = [meanWeight_chicken, meanHeight_chicken];
SIGMA = [5 .1; .1 .5];
[X1,X2] = meshgrid(linspace(meanWeight_chicken - s*sqrt(SIGMA(1, 1)), meanWeight_chicken+s*sqrt(SIGMA(1, 1)),res)',linspace(meanHeight_chicken-s*sqrt(SIGMA(2, 2)),meanHeight_chicken+s*sqrt(SIGMA(2, 2)),res)');
X = [X1(:) X2(:)];
p = mvnpdf(X,mu,SIGMA);
subplot(2, 1, 1);
surf(X1,X2,reshape(p,res,res));
hold on;
X = [60, 5];
p_chicken = mvnpdf(X,mu,SIGMA);
plot3(X(1), X(2), p_chicken, 'r*');
title(['Chicken pdf at point = ' num2str(mvnpdf(X,mu,SIGMA)*100)])
axis tight
view(45, 45)
% Goose?
meanWeight_goose = 65;
meanHeight_goose = 6;
mu = [meanWeight_goose, meanHeight_goose];   % use the goose means, not the chicken ones
SIGMA = [8 .2; .2 1];
[X1,X2] = meshgrid(linspace(meanWeight_goose - s*sqrt(SIGMA(1, 1)), meanWeight_goose+s*sqrt(SIGMA(1, 1)),res)',linspace(meanHeight_goose-s*sqrt(SIGMA(2, 2)),meanHeight_goose+s*sqrt(SIGMA(2, 2)),res)');
X = [X1(:) X2(:)];
p = mvnpdf(X,mu,SIGMA);
subplot(2, 1, 2);
surf(X1,X2,reshape(p,res,res));
hold on;
X = [60, 5];
p_goose = mvnpdf(X,mu,SIGMA);
plot3(X(1), X(2), p_goose, 'r*', 'lineWidth', 2);
title(['Goose pdf at point = ' num2str(mvnpdf(X,mu,SIGMA)*100)])
axis tight
view(45, 45)
disp(['Probability of being a goose is ' num2str(p_goose/(p_goose+p_chicken))]);
% Top Views
figure(2)
% Chicken?
meanWeight_chicken = 54;
meanHeight_chicken = 5;
mu = [meanWeight_chicken, meanHeight_chicken];
SIGMA = [5 .1; .1 .5];
[X1,X2] = meshgrid(linspace(meanWeight_chicken - s*sqrt(SIGMA(1, 1)), meanWeight_chicken+s*sqrt(SIGMA(1, 1)),res)',linspace(meanHeight_chicken-s*sqrt(SIGMA(2, 2)),meanHeight_chicken+s*sqrt(SIGMA(2, 2)),res)');
X = [X1(:) X2(:)];
p = mvnpdf(X,mu,SIGMA);
subplot(2, 1, 1);
surf(X1,X2,reshape(p,res,res));
hold on;
X = [60, 5];
p_chicken = mvnpdf(X,mu,SIGMA);
plot3(X(1), X(2), p_chicken, 'r*');
title(['Chicken pdf at point = ' num2str(mvnpdf(X,mu,SIGMA)*100)])
view(0, 90)
axis tight
% Goose?
meanWeight_goose = 65;
meanHeight_goose = 6;
mu = [meanWeight_goose, meanHeight_goose];   % use the goose means, not the chicken ones
SIGMA = [8 .2; .2 1];
[X1,X2] = meshgrid(linspace(meanWeight_goose - s*sqrt(SIGMA(1, 1)), meanWeight_goose+s*sqrt(SIGMA(1, 1)),res)',linspace(meanHeight_goose-s*sqrt(SIGMA(2, 2)),meanHeight_goose+s*sqrt(SIGMA(2, 2)),res)');
X = [X1(:) X2(:)];
p = mvnpdf(X,mu,SIGMA);
subplot(2, 1, 2);
surf(X1,X2,reshape(p,res,res));
hold on;
X = [60, 5];
p_goose = mvnpdf(X,mu,SIGMA);
plot3(X(1), X(2), p_goose, 'r*');
title(['Goose pdf at point = ' num2str(mvnpdf(X,mu,SIGMA)*100)])
view(0, 90)
axis tight
| Sepal length | Petal length | Sepal width | Petal width |
|---|---|---|---|
| 4.9 | 3.2 | 1.7 | 0.2 |
| 5.0 | 3.2 | 1.6 | 0.5 |
| 5.5 | 2.8 | 3.6 | 1.3 |
| 7.1 | 3.1 | 6.1 | 1.7 |
Gaussian Distribution Formula:
$$ \mathcal{N}(\mathbf{x};\,\boldsymbol\mu,\,\boldsymbol\Sigma) = \operatorname{det}(2\pi\boldsymbol\Sigma)^{-\frac{1}{2}} \, e^{ -\frac{1}{2}(\mathbf{x} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x} - \boldsymbol\mu) } $$
In [87]:
clearvars;
format short;
disp('Reading samples...');
samples = [4.9, 3.2, 1.7, 0.2; 5, 3.2, 1.6, 0.5; 5.5, 2.8, 3.6, 1.3; 7.1, 3.1, 6.1, 1.7]; % Imaginary Samples
% Labels
% 1: Setosa
% 2: Versicolor
% 3: Virginica
addpath('dataset');
dataset = csvread('data.csv');
labels = ["setosa", "versicolor", "virginica"];
% Clean the dataset
sepal_length = dataset(:, 1);
petal_length = dataset(:, 2);
sepal_width = dataset(:, 3);
petal_width = dataset(:, 4);
classId = dataset(:, 5);
X = [sepal_length, petal_length, sepal_width, petal_width]; % Features
y = classId; % Variable we want to predict
% Extract features from each class
X_setosa = X(find(y(:) == 1), :);
X_versicolor = X(find(y(:) == 2), :);
X_virginica = X(find(y(:) == 3), :);
% Calculate cov. matrices
setosa_covMat = cov(X_setosa);
versicolor_covMat = cov(X_versicolor);
virginica_covMat = cov(X_virginica);
% Extract the means
setosa_mean = mean(X_setosa, 1);
versicolor_mean = mean(X_versicolor, 1);
virginica_mean = mean(X_virginica, 1);
% Calculate the pdf of the samples
setosa_pdf = mvnpdf(samples, setosa_mean, setosa_covMat);
versicolor_pdf = mvnpdf(samples, versicolor_mean, versicolor_covMat);
virginica_pdf = mvnpdf(samples, virginica_mean, virginica_covMat);
% Probability of the samples being setosa
prob_setosa = (setosa_pdf)./(setosa_pdf + versicolor_pdf + virginica_pdf);
% Probability of the samples being versicolor
prob_versicolor = (versicolor_pdf)./(setosa_pdf + versicolor_pdf + virginica_pdf);
% Probability of the samples being virginica
prob_virginica = (virginica_pdf)./(setosa_pdf + versicolor_pdf + virginica_pdf);
matProbabilities = [prob_setosa, prob_versicolor, prob_virginica];
for i=1:size(matProbabilities, 1)
[value, idx] = max(matProbabilities(i, :));
disp(['The sample ', num2str(i), ' is a ', char(labels(idx)), '. Confidence: ', num2str(value*100), ' %']);
end
Covariance Formula:
$C_v = \frac{1}{N-1}\left(X-\mu\right)\left(X-\mu\right)^{T}$
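As a sanity check, this formula agrees with the built-in `cov` function (which also normalizes by $N-1$) when the observations are passed as rows; a minimal sketch:

```matlab
% Manual covariance (d x N data) vs. built-in cov (expects observations as rows)
dataset = [0,0; 1,1; 2,3; 3,2; 4,4]';
mu = mean(dataset, 2);
C_manual  = ((dataset - mu) * (dataset - mu)') / (size(dataset, 2) - 1);
C_builtin = cov(dataset');
disp(max(abs(C_manual(:) - C_builtin(:))));   % should be ~0
```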
In [88]:
clearvars;
dataset = [0,0; 1,1; 2,3;3,2; 4,4]';
disp('1. Draw the data');
plot(dataset(1, :), dataset(2, :),'bo','MarkerSize', 10, 'lineWidth', 2);
disp('2. Compute the covariance matrix');
mu = mean(dataset, 2);
covarMat = ((dataset - mu) * (dataset - mu)')./(size(dataset, 2)-1)
% apply PCA and find the basis where the data has the maximum variance
% Find the eigenvalues and vectors to do so
disp('3. Apply PCA and find the basis:');
[eigenVector, eigenValue] = eig(covarMat)
disp('We take the vector with the highest eigenValue.');
[value, idx] = max(eigenValue);
[value, idx] = max(value);
disp(['Highest Eigenvalue: ', num2str(value)]);
eigenVector_reduced = eigenVector(:, idx);
disp(['Max. var. Eigenvector / new basis: [' num2str(eigenVector_reduced'), ']']);
dimReduction = eigenVector_reduced' * dataset
projected_X = eigenVector_reduced * dimReduction
hold on;
plot(projected_X(1, :), projected_X(2, :), 'r+', 'MarkerSize', 14, 'lineWidth', 2);
hor_axis_helper = min(dataset(1, :)):0.1:max(dataset(1, :));  % x-range used to draw the projection axis
quiver(0, 0, eigenVector_reduced(1), eigenVector_reduced(2), 1, 'maxHeadSize', 1)
plot(hor_axis_helper, hor_axis_helper*eigenVector_reduced(2)/eigenVector_reduced(1));
title('Projection of datapoints using PCA');
legend('Datapoints', 'Projected Datapoints', 'New Basis', 'Projection Axis', 'location', 'northwest');
In [1]:
clearvars;
% Load the data
dataset = [0, 0, 1, -1; 1, -1, 0, 0]
% Convert to polar coordinates
disp('We go from cartesian to polar...');
[theta radius] = cart2pol(dataset(1, :), dataset(2, :));
dataset = [theta; radius]
% Plotting
figure(1);
polarplot(dataset(1, :), dataset(2, :),'bo','MarkerSize', 10, 'lineWidth', 2);
hold on;
rlim([0, 1.25]);
title('PCA with polar coordinates using raw dataset')
% Calculate covariance matrix
covariance_mat = cov(dataset')
% Calculate eigenvectors
[eigenVector eigenValue] = eig(covariance_mat)
% Take the most important eigenvectors
disp('We take the vector (in our new polar space) with the highest eigenValue.');
[value, idx] = max(eigenValue);
[value, idx] = max(value);
disp(['Highest Eigenvalue: ', num2str(value)]);
eigenVector_reduced = eigenVector(:, idx);
disp(['Max. var. Eigenvector / new basis: [' num2str(eigenVector_reduced'), ']']);
% Calculate the new values in the new basis
z = eigenVector_reduced' * dataset
% Reproject the data
projected_data = eigenVector_reduced * z
polarplot(projected_data(1, :), projected_data(2, :),'rx','MarkerSize', 14, 'lineWidth', 2);
legend('2D data', '1D data');
In [90]:
clearvars;
dataset = [0, 0, 1, -1; 1, -1, 0, 0]
disp('We go from cartesian to polar... SUBTRACTING THE MEAN!');
[theta radius] = cart2pol(dataset(1, :), dataset(2, :));
dataset = [theta; radius];
mu = mean(dataset, 2);
dataset = dataset - mu
% Calculate cov. mat.
covariance_mat = cov(dataset')
% Calculate Eigens
[eigenVector eigenValue] = eig(covariance_mat)
disp('We take the vector (in our new polar space) with the highest eigenValue.');
[value, idx] = max(eigenValue);
[value, idx] = max(value);
disp(['Highest Eigenvalue: ', num2str(value)]);
eigenVector_reduced = eigenVector(:, idx);
disp(['Max. var. Eigenvector / new basis: [' num2str(eigenVector_reduced'), ']']);
% Find new values on new basis
z = eigenVector_reduced' * dataset
% Add the mean again...
disp('# We add the mean to the projected data!');
projected_data = eigenVector_reduced * z;
projected_data = projected_data + mu
% Plots
figure(1);
polarplot(dataset(1, :) + mu(1), dataset(2, :) + mu(2),'bo','MarkerSize', 10, 'lineWidth', 2);
rlim([0, 1.25]);
title('PCA with polar coordinates subtracting \mu');
hold on;
polarplot(projected_data(1, :), projected_data(2, :),'rx','MarkerSize', 14, 'lineWidth', 2);
legend('2D data', '1D data')
DATASET: data
Dimension: $M$, an integer.
Features: $X$, a subset of the data.
Class: $y$, a subset of the data.
Steps to follow:
Normalize the data using mean normalization, obtaining $Z$.
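A minimal sketch of the mean-normalization step on made-up data (one observation per row):

```matlab
% Mean normalization: subtract each feature's column mean
data = [4.9 3.2 1.7 0.2; 5.0 3.2 1.6 0.5; 5.5 2.8 3.6 1.3];  % made-up samples
Z = data - mean(data, 1);   % zero-mean features
disp(mean(Z, 1));           % every column mean is now (numerically) zero
```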
In [91]:
clearvars;
format short;
disp('Reading samples...');
samples = [4.9, 3.2, 1.7, 0.2; 5, 3.2, 1.6, 0.5; 5.5, 2.8, 3.6, 1.3; 7.1, 3.1, 6.1, 1.7]; % Imaginary Samples
% Labels
% 1: Setosa
% 2: Versicolor
% 3: Virginica
addpath('dataset');
dataset = csvread('data.csv');
labels = ["setosa", "versicolor", "virginica"];
% Clean the dataset
sepal_length = dataset(:, 1);
petal_length = dataset(:, 2);
sepal_width = dataset(:, 3);
petal_width = dataset(:, 4);
classId = dataset(:, 5);
X = [sepal_length, petal_length, sepal_width, petal_width]; % Features
y = classId; % Variable we want to predict
disp('Applying PCA...');
%% Apply PCA to matrix X before extraction of features for each class
%====================================================================
% Generating covariance matrix
X_cov = cov(X);
% Generating the means for each feature
X_mu = mean(X, 1);
%% Extracting eigens
% U: Eigenvectors
% V: Eigenvalues
% Find the highest eigenvalues using Matlab's implementation of the singular value decomposition function
%[U, V, W] = svd(X_cov)
% For learning purposes, I decided to use eig
[U, V] = eig(X_cov);
% Find the highest N features.
NUMBER_FEATURES = 2;
V = sum(V, 2); % Transform matrix into vector of eigenvalues
totalVar = sum(V);
[V, idx] = sort(V, 'descend'); % Sort the eigenvalues in descending order
disp(['Using ' num2str(NUMBER_FEATURES) ' out of ' num2str(size(X, 2)) ' features...']);
V = V(1:NUMBER_FEATURES);
pickedVar= sum(V); % Store the variability
disp([num2str(100*pickedVar/totalVar), '% variability retained']);
U = U(:, idx(1:NUMBER_FEATURES));
%% Project the data onto the new basis
% Coordinates of the data in the reduced space (the dimension-reduced dataset)
T = X * U;
% Reconstruction of the data back in the original feature space
X_proj = T * U';
% PCA Finished! -> New dataset -> T
%====================================================================
% Extract features from each class
X_setosa = T(find(y(:) == 1), :);
X_versicolor = T(find(y(:) == 2), :);
X_virginica = T(find(y(:) == 3), :);
% Calculate cov. matrices
setosa_covMat = cov(X_setosa);
versicolor_covMat = cov(X_versicolor);
virginica_covMat = cov(X_virginica);
% Extract the means
setosa_mean = mean(X_setosa, 1);
versicolor_mean = mean(X_versicolor, 1);
virginica_mean = mean(X_virginica, 1);
% Calculate the pdf of the samples
% But convert samples to new space before
samples = samples * U;
setosa_pdf = mvnpdf(samples, setosa_mean, setosa_covMat);
versicolor_pdf = mvnpdf(samples, versicolor_mean, versicolor_covMat);
virginica_pdf = mvnpdf(samples, virginica_mean, virginica_covMat);
% Probability of the samples being setosa
prob_setosa = (setosa_pdf)./(setosa_pdf + versicolor_pdf + virginica_pdf);
% Probability of the samples being versicolor
prob_versicolor = (versicolor_pdf)./(setosa_pdf + versicolor_pdf + virginica_pdf);
% Probability of the samples being virginica
prob_virginica = (virginica_pdf)./(setosa_pdf + versicolor_pdf + virginica_pdf);
matProbabilities = [prob_setosa, prob_versicolor, prob_virginica];
for i=1:size(matProbabilities, 1)
[value, idx] = max(matProbabilities(i, :));
disp(['The sample ', num2str(i), ' is a ', char(labels(idx)), '. Confidence: ', num2str(value*100), ' %']);
end