Machine learning:

model fitting where we care about the output, not the parameters

In a lot of scientific model fitting, parameters are meaningful

If you fit a line to enzyme rate vs substrate data, you can extract things like $k_{cat}$ and $K_{M}$.

In some fits, we don't care about parameters at all

With a standard curve, we relate an observable to an unobservable quantity of interest

$$\hat{Conc} = \frac{abs - b}{m}$$
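
In code, the standard-curve workflow looks like this. A minimal sketch with made-up absorbance data; `predict_conc` is a hypothetical helper, not a library function:

```python
import numpy as np

# Hypothetical standards: absorbance measured at known concentrations
known_conc = np.array([0.0, 2.5, 5.0, 10.0, 20.0])   # e.g. ug/mL
absorbance = np.array([0.05, 0.21, 0.39, 0.77, 1.52])

# Fit the line abs = m * conc + b to the standards
m, b = np.polyfit(known_conc, absorbance, 1)

def predict_conc(abs_measured):
    """Invert the fit to estimate concentration from a new absorbance."""
    return (abs_measured - b) / m

print(predict_conc(0.77))  # roughly 10 for these made-up data
```

Here we never report $m$ or $b$; they exist only so we can map a new absorbance back to a concentration.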

To this point, we've talked about the first type of regression.

Do

  • Choose the model that fits the data best with the fewest parameters
  • Check your residuals for randomness
  • Use a biologically/physically informed model with independently testable/interpretable parameters

Don't

  • Transform your data (e.g. take a log) before fitting. Most regression approaches assume normally distributed measurement uncertainty, and a nonlinear transform distorts that error structure.
  • Try only one set of parameter guesses
  • Overfit your data (fit a model with more parameters than observations)
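
The "try more than one set of parameter guesses" rule can be sketched with `scipy.optimize.curve_fit` on hypothetical Michaelis-Menten data (the rates below are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(s, vmax, km):
    """Enzyme rate as a function of substrate concentration."""
    return vmax * s / (km + s)

# Hypothetical rate vs. substrate data
s = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 20.0])
v = np.array([0.9, 1.5, 2.2, 3.1, 3.6, 3.9])

# Try several initial guesses rather than one; keep the best fit
best = None
for guess in [(1.0, 1.0), (5.0, 5.0), (10.0, 0.1)]:
    try:
        popt, _ = curve_fit(michaelis_menten, s, v, p0=guess)
    except RuntimeError:
        continue  # this guess failed to converge
    ssr = np.sum((v - michaelis_menten(s, *popt)) ** 2)
    if best is None or ssr < best[1]:
        best = (popt, ssr)

vmax, km = best[0]
print(vmax, km)
```

With well-behaved data the guesses all converge to the same answer; with noisier data they may not, which is exactly why you try more than one.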

Machine learning is much more pragmatic: find a model that lets you predict a quantity given some set of observations.

  • Useful for:
    • Building models that predict from macromolecular sequence (transcription factor binding sites, protein secondary structure, etc.)
    • Objectively separating groups of bacteria into different classes based on growth
    • Image analysis

Rules:

  • Don't throw any data away
  • Choose a model that gives you power to observe what you want
  • Ignore the fit parameters
  • Check your model behavior using cross-validation

Two main types:

  • Supervised: Give a dataset with $X$ and $Y$ and figure out a relationship between them.
    • Classification
    • Regression (to predict quantitative values -- think standard curve).
  • Unsupervised: Figure out structure in the data given only the data.
    • Clustering
    • Dimensionality reduction (e.g. principal component analysis)

We'll be using sklearn for this analysis.

Check out their docs. They're amazing.

Random Forests

Built from binary decision trees.

Random forests are "bootstrap" (bagging) methods

  • Each tree is fit to a bootstrap sample of the observations (sampled with replacement)
  • Each split considers only a random subset of the features

This reduces overfitting: any single tree is noisy, but averaging many decorrelated trees is much more stable.

Only as effective as the categories you choose.

              is space alien? 
                     |
                     |
              Yes----------No
               |            |
              0.0         100.0
                            |
                      died-------survived
                        |           |
                       62%         38%
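
A random forest like the toy tree above takes only a few lines in sklearn. A minimal sketch using a synthetic dataset in place of real observations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset standing in for real observations
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each tree trains on a bootstrap sample of rows; each split
# considers a random subset of the features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # accuracy on held-out data
```

Note that we score on data the forest never saw, not on the training set.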

The decision of which classifier to use is pragmatic.

How do you decide if your classifier is working well?

Cross Validation

  • Take a subset of your data out before fitting your model.
  • Fit the model.
  • See how well you do at predicting the data you did not include in your model.

Training set
Test set
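
The hold-out/fit/score loop above is what `sklearn.model_selection.cross_val_score` automates. A minimal sketch on sklearn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: hold out 1/5 of the data, fit on the rest,
# score on the held-out fold, and repeat for each fold
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5)

print(scores.mean())
```

If the cross-validated score is much worse than the training score, the model is overfit.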

What about unsupervised methods?

Really common one: principal component analysis (PCA)
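
A minimal PCA sketch in sklearn, using fabricated correlated data so that one underlying variable dominates; with real data you would pass in your measurement matrix instead:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Fabricated data: 100 samples, 4 measured features, all driven
# by a single hidden variable plus a little noise
latent = rng.normal(size=(100, 1))
X = latent @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(100, 4))

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
projected = pca.fit_transform(X)

print(pca.explained_variance_ratio_)
```

Here the first component captures most of the variance, which is PCA telling you the four measurements are really one thing.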

Summary:

  • In machine learning, the goal is to classify/fit models that act as "black boxes" -- not to interpret parameters.
  • In supervised learning, you give categories to the observables
  • In unsupervised learning, you let the data speak for itself.
  • Practically: keep all your data, check your model with cross-validation