We talked about machine learning and the unsupervised k-means approach last time. Now we're going to talk about one way that we can perform supervised machine learning.
Let's dig into how we actually build a model. In this case, we're going to use a method for classification. This means that we have items divided into groups, and we want to find the rules that allow us to assign new items into those groups.
This video explains the concepts behind SVMs: https://youtu.be/EbVs31qq1Is
Q1: What is the intuition behind how SVM decides where to draw the decision boundary?
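If you'd like to see what this looks like in practice, here is a minimal sketch of training a linear SVM. The use of scikit-learn and the tiny made-up dataset are just for illustration; they aren't part of the videos.

```python
# A minimal sketch of fitting an SVM classifier (scikit-learn and the toy data
# below are illustrative assumptions, not part of the course materials).
from sklearn.svm import SVC

# Each item is described by two measurements; labels 0 and 1 are the two groups.
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = [0, 0, 1, 1]

model = SVC(kernel="linear")  # a linear decision boundary
model.fit(X, y)               # learn the boundary from the labeled examples

# Assign a new, unlabeled item to one of the groups.
print(model.predict([[2.0, 3.0]]))
```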
In the previous video, the best place to put the decision boundary was relatively clear, but what if it weren't? What if we had to choose between a small gap between the points and the line that classifies every point correctly, and a large gap that misclassifies a few points? SVMs have a parameter, C, that controls this trade-off.
This video discusses C and provides some hypothetical examples: https://youtu.be/5oVQBF_p6kY
Q2: I am running an SVM and it predicts all of the points that I provided correctly, but it doesn't work on new points. Should I raise or lower C?
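To make the trade-off concrete, here is a rough sketch comparing a large and a small value of C. Again, scikit-learn and the made-up dataset are assumptions for illustration only.

```python
# A rough sketch of the trade-off that C controls (scikit-learn and the made-up
# dataset are illustrative assumptions, not part of the course materials).
from sklearn.svm import SVC

# Two clusters, plus one point at (1.5, 0.5) labeled with the "wrong" cluster.
X = [[0, 0], [1, 1], [2, 2], [3, 3], [1.5, 0.5], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

strict = SVC(kernel="linear", C=1000.0)  # large C: try hard to classify every training point
relaxed = SVC(kernel="linear", C=0.01)   # small C: prefer a wide margin, tolerate some errors

strict.fit(X, y)
relaxed.fit(X, y)

# Training accuracy: the large-C model is more likely to get every training
# point right, even if that means a narrow, awkwardly placed boundary.
print("large C:", strict.score(X, y))
print("small C:", relaxed.score(X, y))
```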
When we build a model, we'd really like to know how accurate it is. Specifically, how accurate are its predictions in cases where we don't already know the answer? To get at this, we need a way to estimate the model's performance on data it hasn't seen.
This video discusses how we can assess performance: https://youtu.be/s_qpzxbVViI
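As a concrete illustration of the hold-out idea, here is a sketch that fits a model on part of the data and measures accuracy on the part it never saw. The use of scikit-learn and the synthetic dataset are assumptions for illustration only.

```python
# A sketch of estimating performance with a held-out test set (scikit-learn and
# the synthetic data are illustrative assumptions, not part of the course materials).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the examples; the model is fit only on the remaining 75%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = SVC(kernel="linear").fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # how well it fits data it has seen
print("testing accuracy:", model.score(X_test, y_test))     # an estimate for unseen data
```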
Q3: A paper says: "We constructed a support vector machine classifier using gene expression data to identify genes associated with long-term memory. We used 105 examples from the screen (45 positives, 60 negatives) to construct the classifier and correctly classified 85 of them. We applied the classifier to all other genes, including 45 that we had screen information on. Of the remaining positives, 15 were correctly identified as memory-associated by the classifier while 5 were missed. Of the negatives, 2 were identified as memory-associated by the classifier while 23 were not."
In this case, what are the training and testing accuracies?
Sometimes we want to build and evaluate our models in a manner that lets us use all of our data. With the strategy discussed in the previous video, we have to divide our data into a training and a testing set. This lets us get an accurate idea of the quality of our predictions, but also means that we're only using a portion of our data to make those predictions.
Cross-validation is a strategy that allows us to use all of our data, at the cost of a bit of extra complexity. It's discussed at the end of this video: https://youtu.be/rMZYrneij-E
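If you want to try it yourself, this is roughly what five-fold cross-validation looks like in code. As before, scikit-learn and the synthetic dataset are assumptions for illustration only.

```python
# A sketch of five-fold cross-validation (scikit-learn and the synthetic dataset
# are illustrative assumptions, not part of the course materials).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 asks for five train/test rounds over different portions of the data.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)

print(scores)         # one accuracy estimate per fold
print(scores.mean())  # a single overall estimate, built from all of the data
```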
Q4: In five-fold cross-validation, how many times is each example used for training?
Q5: In five-fold cross-validation, how many times is each example used for testing?