Machine learning:

model fitting where we care about the output, not the parameters

In a lot of scientific model fitting, parameters are meaningful

If you fit a line to enzyme rate vs substrate data, you can extract things like $k_{cat}$ and $K_{M}$.

In some fits, we don't care about parameters at all

With a standard curve, we relate an observable to an unobservable quantity of interest

$$\hat{Conc} = \frac{abs - b}{m}$$
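
In code, the standard-curve workflow looks like this. A minimal sketch with made-up absorbance data; `predict_conc` is a hypothetical helper, not a library function:

```python
import numpy as np

# Hypothetical standards: absorbance measured at known concentrations
known_conc = np.array([0.0, 2.5, 5.0, 10.0, 20.0])   # e.g. ug/mL
absorbance = np.array([0.05, 0.21, 0.39, 0.77, 1.52])

# Fit the line abs = m * conc + b to the standards
m, b = np.polyfit(known_conc, absorbance, 1)

def predict_conc(abs_measured):
    """Invert the fit to estimate concentration from a new absorbance."""
    return (abs_measured - b) / m

print(predict_conc(0.77))  # roughly 10 for these made-up data
```

Here we never report $m$ or $b$; they exist only so we can map a new absorbance back to a concentration.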

To this point, we've talked about the first type of regression.

Do

  • Choose the model that fits the data best with the fewest parameters
  • Check your residuals for randomness
  • Use a biologically/physically informed model with independently testable/interpretable parameters

Don't

  • Transform your data (e.g. take a log) before fitting. Most regression approaches assume normally distributed measurement uncertainty, and a nonlinear transform distorts that error structure.
  • Try only one set of parameter guesses
  • Overfit your data (fit a model with more parameters than observations)
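
The "try more than one set of parameter guesses" rule can be sketched with `scipy.optimize.curve_fit` on hypothetical Michaelis-Menten data (the rates below are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(s, vmax, km):
    """Enzyme rate as a function of substrate concentration."""
    return vmax * s / (km + s)

# Hypothetical rate vs. substrate data
s = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 20.0])
v = np.array([0.9, 1.5, 2.2, 3.1, 3.6, 3.9])

# Try several initial guesses rather than one; keep the best fit
best = None
for guess in [(1.0, 1.0), (5.0, 5.0), (10.0, 0.1)]:
    try:
        popt, _ = curve_fit(michaelis_menten, s, v, p0=guess)
    except RuntimeError:
        continue  # this guess failed to converge
    ssr = np.sum((v - michaelis_menten(s, *popt)) ** 2)
    if best is None or ssr < best[1]:
        best = (popt, ssr)

vmax, km = best[0]
print(vmax, km)
```

With well-behaved data the guesses all converge to the same answer; with noisier data they may not, which is exactly why you try more than one.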

Machine learning is much more pragmatic: find a model that lets you predict a quantity given some set of observations.

  • Useful for:
    • Building models that predict from macromolecular sequence (transcription factor binding sites, protein secondary structure, etc.)
    • Objectively separating groups of bacteria into different classes based on growth
    • Image analysis

Rules:

  • Don't throw any data away
  • Choose a model that gives you power to observe what you want
  • Ignore the fit parameters
  • Check your model behavior using cross-validation

Two main types:

  • Supervised: Give a dataset with $X$ and $Y$ and figure out a relationship between them.
    • Classification
    • Regression (to predict quantitative values -- think standard curve).
  • Unsupervised: Figure out structure in the data given only the data.
    • Clustering
    • Dimensionality reduction (e.g. principal component analysis)

We'll be using sklearn for this analysis.

Check out their docs. They're amazing.

Random Forests

Built from binary decision trees.

Random forests are "bootstrap" (bagging) methods

  • Each tree is fit to a bootstrap sample of the observations (sampled with replacement)
  • Each split considers only a random subset of the features

This reduces overfitting: any single tree is noisy, but averaging many decorrelated trees is much more stable.

Only as effective as the categories you choose.

              is space alien? 
                     |
                     |
              Yes----------No
               |            |
              0.0         100.0
                            |
                      died-------survived
                        |           |
                       62%         38%
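
A random forest like the toy tree above takes only a few lines in sklearn. A minimal sketch using a synthetic dataset in place of real observations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset standing in for real observations
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each tree trains on a bootstrap sample of rows; each split
# considers a random subset of the features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # accuracy on held-out data
```

Note that we score on data the forest never saw, not on the training set.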

The decision of which classifier to use is pragmatic.

How do you decide if your classifier is working well?

Cross Validation

  • Take a subset of your data out before fitting your model.
  • Fit the model.
  • See how well you do at predicting the data you did not include in your model.

Training set
Test set
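
The hold-out/fit/score loop above is what `sklearn.model_selection.cross_val_score` automates. A minimal sketch on sklearn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: hold out 1/5 of the data, fit on the rest,
# score on the held-out fold, and repeat for each fold
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5)

print(scores.mean())
```

If the cross-validated score is much worse than the training score, the model is overfit.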

What about unsupervised methods?

Really common one: principal component analysis (PCA)
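
A minimal PCA sketch in sklearn, using fabricated correlated data so that one underlying variable dominates; with real data you would pass in your measurement matrix instead:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Fabricated data: 100 samples, 4 measured features, all driven
# by a single hidden variable plus a little noise
latent = rng.normal(size=(100, 1))
X = latent @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(100, 4))

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
projected = pca.fit_transform(X)

print(pca.explained_variance_ratio_)
```

Here the first component captures most of the variance, which is PCA telling you the four measurements are really one thing.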

Summary:

  • In machine learning, the goal is to classify/fit models that act as "black boxes" -- not to interpret parameters.
  • In supervised learning, you give categories to the observables
  • In unsupervised learning, you let the data speak for itself.
  • Practically: keep all your data, check your model with cross-validation