Machine Learning

Machine learning is the application of algorithms that extract information from datasets by building an understanding of them. This "understanding" usually means fitting a model to the dataset. It overlaps considerably with data mining, where the emphasis is usually on extracting the information rather than on the modeling itself. It also overlaps with artificial intelligence, mathematical optimization and inferential statistics.

Few people are experts in machine learning, and even fewer can find the best model for a given dataset. Yet ML is becoming so ubiquitous that even schoolchildren need to learn it. From the average user's point of view, as long as the validation tests show a good fit, any model is good enough, so the question usually shifts from modeling to easy implementation and a sound validation procedure. In Python several libraries stand out; this is a personal list:

  • Scikit-learn: considered the best overall, it is friendly to newcomers and has good validation support.
  • mlpy: a competing general-purpose ML library.
  • PyBrain: AI and neural networks.
  • nltk: for natural language processing and text mining.
  • Theano + Pylearn2: use the graphics processor (GPU) for fast and "deep" learning.
  • MDP (Modular toolkit for Data Processing): build workflows using scikit-learn and other libraries.
  • Orange: visual framework for ML (similar to what Weka is in Java). Has a bioinformatics plugin.
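
To make the scikit-learn entry concrete, here is a minimal sketch of its fit/validate workflow. The synthetic dataset, the logistic regression model and the 25% hold-out split are illustrative choices, not recommendations:

```python
# Minimal scikit-learn workflow: fit a classifier on a synthetic
# dataset and validate it on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate a toy classification dataset (100 samples, 4 features).
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Hold out 25% of the samples for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy on the held-out data is the simplest validation check.
print("test accuracy:", model.score(X_test, y_test))
```

Every estimator in scikit-learn follows this same fit/predict/score interface, which is a large part of why it is so friendly to newcomers.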

This figure was made by the creator of Scikit-learn. The individual methods it names matter less than the distinction it draws between the problem classes. Keep in mind that this is only the simple core of ML: entire classes of algorithms are either not covered by scikit-learn (such as genetic algorithms or most neural networks) or covered only superficially (Bayesian learning). For these there are other, more specific Python libraries; additionally, a certain class of algorithms may only be available in a particular program or language, and bindings for Python are usually provided.

Another classification of ML problems is perhaps even more useful, and hopefully more amusing, as it not only separates problem classes but also creates social classes among programmers:

  • Supervised learning: there is a target that we are trying to predict. Datasets for supervised learning come with known outcomes for each sample, and the model is fitted to reproduce them. Regression and classification methods generally require a target. Example: measured omics datasets with measured phenotypes and good controls.
  • Unsupervised learning: no outcome is available, and the typical workflow consists of clustering, visualization and dimensionality reduction (feature selection). Example: "So, like, I have this gene expression dataset...", most astronomical measurements, etc. Coincidentally, this is what unaccomplished ML experts spend most of their time on.
  • Reinforcement learning: a model is trained on incomplete data and improves as new data (feedback) arrives. Here you find most of the cool-sounding algorithms in ML, such as the best of neural networks, Markov chain Monte Carlo, Bayesian training, etc. Many RL experts are hired by the financial sector to work on big data (and many use Python) for high fees. In hardcore science, where things are never fully known or fully measured, this class gets all the media frenzy. Robotics and gaming are also big players here.
  • Ensemble learning: running different ML algorithms on the same problem and building an improved model from all their outcomes. Decision trees and random forests find application here. Bioinformatics relies more and more on consensus methods, if for no other reason than to placate angry reviewers.

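The supervised/unsupervised split above can be sketched with scikit-learn on the same toy data. The dataset, the k-nearest-neighbors classifier and the KMeans parameters are illustrative choices:

```python
# Supervised vs. unsupervised learning on the same toy data.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Three well-separated groups of points; y holds the known labels.
X, y = make_blobs(n_samples=90, centers=3, random_state=0)

# Supervised: the labels y are given to the model during fitting.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("supervised predictions:", clf.predict(X[:5]))

# Unsupervised: KMeans sees only X and must discover the groups itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])
```

Note that the clusters KMeans finds may be numbered differently from the original labels: without a target there is no notion of which group is "class 0".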
Some observations:

  • Other learning classes exist, such as structural learning, representation learning or metric learning. They usually focus on different aspects of the dataset, such as finding a good representation of the inputs or finding associations among variables.
  • Many learning algorithms perform reasonably well across several problem classes. The methodologies can also come from different sources, such as statistics, optimization, AI, or heuristics like genetic programming and swarm intelligence.
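
The ensemble idea described above is directly supported by scikit-learn. This sketch shows two common flavors: a random forest, which averages many decision trees, and a voting classifier, which combines unrelated model families. All parameter values are illustrative:

```python
# Two ensemble flavors in scikit-learn: bagged trees and voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# A random forest: many decision trees trained on bootstrap samples,
# with predictions averaged over the whole forest.
forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
print("forest training accuracy:", forest.score(X, y))

# A voting ensemble over two different model families: each model
# votes and the majority class wins.
vote = VotingClassifier([
    ("tree", DecisionTreeClassifier(random_state=1)),
    ("logreg", LogisticRegression(max_iter=1000)),
]).fit(X, y)
print("voting training accuracy:", vote.score(X, y))
```

This is the sense in which a consensus method "relaxes angry reviewers": the combined prediction is less dependent on the quirks of any single algorithm.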

Task:

  • Take some time to explore the scikit-learn documentation for one or two methods of your choosing.