Movile Lightning Talk

Machine Learning Engineering using Scikit-Learn

Overview

Machine Learning in Data Science

"Data Science Venn Diagram". Licensed under CC BY-SA 3.0 via Wikimedia Commons

Main Concepts

  • Attributes (Predictors, Variables, Features): Columns of a dataset (thinking of a key-value dataset)
  • Instances (Tuples, Records, Lines): Rows of a dataset
  • Class (Target): The column that holds the value to be predicted for each instance
  • Method (Technique, Algorithm (A.K.A. Algo)): A function or algorithm designed to learn from some features of a dataset.
  • Model: A representation of the set of parameters that a method has learned from the data.
  • Data (Duh!)
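
A minimal sketch of how these terms map onto scikit-learn objects. The Iris dataset and the decision tree are assumptions chosen for illustration, not something the talk prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data      # one row per instance, one column per attribute
y = iris.target    # the class (target) of each instance

n_instances, n_attributes = X.shape
print(n_instances, n_attributes)   # 150 4
print(iris.feature_names)          # attribute names
print(iris.target_names)           # class names

# The method is the algorithm; the model is the method after it has
# learned its parameters from the data.
method = DecisionTreeClassifier(max_depth=2, random_state=0)
model = method.fit(X, y)           # fit() returns the fitted estimator itself
```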

Types of learning

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Supervised Learning

It's a type of learning in which a function is learned from labeled data: inputs (X) paired with a known target output (y). The learned function then maps new inputs to a predicted target.

Training: Examples X_train together with labels y_train. Testing: Given X_test, predict y_test.
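
The training and testing steps above can be sketched as follows. The dataset, the 25% test split, and logistic regression are illustrative assumptions; any scikit-learn classifier exposes the same `fit` / `predict` calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out part of the labeled data so the model is tested on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)             # training: examples plus labels
y_pred = clf.predict(X_test)          # testing: predict labels for X_test
accuracy = clf.score(X_test, y_test)  # fraction of correct predictions
print(accuracy)
```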

Common tasks:

  • Classification
  • Regression
  • Ranking

A.K.A. the house of Predictive Analytics!!!

Unsupervised Learning

In this type of learning the function has no labeled data. The learning comes from the structure of the data as a whole, captured through some statistical or algorithmic approximation, rather than from a target column.

Examples X. Learn something about X.

Common tasks:

  • Dimensionality reduction
  • Clustering
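
A minimal sketch of both tasks, assuming the Iris measurements with the labels deliberately discarded. PCA and k-means are just one common choice for each task, and the number of components and clusters are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # labels are thrown away on purpose

# Dimensionality reduction: project 4 attributes down to 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group instances using only the structure of X.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(X_2d.shape)        # (150, 2)
print(cluster_ids[:10])  # cluster index assigned to the first 10 instances
```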

A.K.A. learn from data itself

A.K.A. no structure known a priori

Reinforcement Learning

Reinforcement learning is a term stolen from robotics, where an agent makes decisions based on its environment (state), and each action earns the agent a penalty (for bad actions) or a reward (for good actions). The main objective is to maximize the cumulative reward.

It uses a reward-and-penalty structure together with a model that has memory.
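
scikit-learn does not ship reinforcement learning algorithms, so the sketch below uses plain Python. It is a toy tabular Q-learning agent on a hypothetical five-cell corridor; the environment, the reward values, and the hyperparameters are all assumptions made up for illustration. The table `q` plays the role of the memory.

```python
import random

N_STATES = 5        # positions 0..4; position 4 is the goal
ACTIONS = (-1, +1)  # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration

def step(state, action):
    """Toy environment: move, then return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True   # reward for reaching the goal
    return next_state, -0.1, False     # small penalty for every other move

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # memory: q[state][action index]

for episode in range(200):
    state, done = 0, False
    while not done:
        if rng.random() < EPSILON:
            a = rng.randrange(2)                        # explore
        else:
            a = max((0, 1), key=lambda i: q[state][i])  # exploit
        next_state, reward, done = step(state, ACTIONS[a])
        # Move the estimate toward reward + discounted best future value.
        target = reward + (0.0 if done else GAMMA * max(q[next_state]))
        q[state][a] += ALPHA * (target - q[state][a])
        state = next_state

# Greedy action per non-goal state after training.
policy = [ACTIONS[max((0, 1), key=lambda i: q[s][i])] for s in range(N_STATES - 1)]
print(policy)
```

In this toy setup the learned policy should settle on stepping right in every cell, since that is the route that collects the most cumulative reward.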

  • Ensemble Methods (?)

A.K.A. where the magic happens! Multiple models! Models learning based on other models, like the movie Inception!
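
A minimal sketch of an ensemble, assuming a simple majority vote over three different classifiers. The choice of base models and their settings is illustrative only; a random forest is itself an ensemble of decision trees, so this is an ensemble that contains another ensemble.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each base model votes for a class; the majority wins ("hard" voting).
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)
print(ensemble_accuracy)
```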

Machine Learning Workflow

Source: "Machine Learning in Python: Essential Techniques for Predictive Analysis". Licensed under CC BY-SA 3.0 via Wikimedia Commons

Introduction to Scikit-Learn

  • An open-source API/Library of Machine Learning in Python
  • Started in 2007 by David Cournapeau as a Google Summer of Code project
  • Main advantage: From Lab to Production
  • One of the most used Python modules in Kaggle competitions

Why Scikit-Learn?

  • No-brainer open-source toolkit
  • Does not use the Java Virtual Machine (the main bottleneck when dealing with big datasets and computation)
  • Scalability, scalability, scalability
  • Real deal programming module for data science
  • Reliable algorithms
  • Any programmer can use it
  • Integrate with H2O.ai, Microsoft Azure Machine Learning, R, Spark, and so on...

Let’s go to the IPython Notebook

Run time...

Mindflow to apply Machine Learning algorithms

  • 0.0) Preprocess your data in some RDBMS first!
  • 0.1) Preprocess your data in some RDBMS first!
  • 0.2) Read all three again!
  • 1) Get labeled data and split it into (i) a training set and (ii) a test set
  • 2) Choose the learning function
  • 3) Fit the model
  • 4) Predict using the model
  • 5) Check how well the model fits the data (evaluation)
  • 6) Deploy
  • 7) Repeat the cycle again
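
Steps 1 to 6 above can be sketched end to end as follows. The dataset, the scaler plus SVM pipeline, and the use of `pickle` as a stand-in for deployment are all assumptions for illustration; a real deployment would depend on the serving environment.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1) Labeled data, split into a training set and a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2) Choose the learning function (scaling step plus a classifier).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 3) Fit the model.
model.fit(X_train, y_train)

# 4) Predict using the model.
y_pred = model.predict(X_test)

# 5) Evaluate.
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

# 6) "Deploy": serialize the fitted model so another process can load it.
payload = pickle.dumps(model)
restored = pickle.loads(payload)
```

Step 7 is then a matter of rerunning the same script with new data or a different learning function and comparing the evaluation results.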

Contact Me

