Data Science Workshop

Goal: To learn how to start implementing ML.

Recap

  • Machine Learning is a subfield of Artificial Intelligence that is focused on self-learning.
  • Data Science is not a single-step process

    • Model Building: Linear Models, Support Vector Machines, Random Forest Models
    • Validation: How good is your model?
    • Presentation/Story-telling
  • Data Science Goal:

    • Supervised / Unsupervised
    • Regression Models / Classification Models

Questions

  • Question Polling
    • How many of you know Python?
    • How many of you know programming (and do it on a daily basis)?

Agenda

  • Learn about a few Python libraries
  • Hands-on: 3 basic Python data modelling stages of Linear Regression
  • Comparison of LR to other ML models

Python Libs

NumPy

NumPy is the fundamental package for scientific computing with Python.

It provides:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.


In [4]:
import numpy as np
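
As a minimal sketch of the N-dimensional array and broadcasting features mentioned above, the snippet below subtracts a per-column mean from every row of a small array. The numbers and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 rows (samples), 2 columns (features).
samples = np.array([[150.0, 50.0],
                    [160.0, 60.0],
                    [170.0, 70.0]])

# Mean of each column; the result has shape (2,).
column_means = samples.mean(axis=0)

# Broadcasting: the (2,) array is applied to each of the 3 rows,
# so no explicit loop is needed.
centered = samples - column_means

print(centered.shape)
print(centered)
```

After centering, each column of `centered` averages to zero, which is a common preprocessing step before fitting a model.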

Pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


In [5]:
import pandas as pd
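
A short sketch of the kind of data structure pandas offers: a `DataFrame` built from a dictionary, with a summary statistic and a derived column. The column names and values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "area_sqft": [500, 750, 1000, 1250],
    "price": [50.0, 72.0, 101.0, 124.0],
})

# A column-wise summary statistic.
mean_price = df["price"].mean()

# A derived column computed element-wise from two existing columns.
df["price_per_sqft"] = df["price"] / df["area_sqft"]

print(df.head())
print(mean_price)
```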

Plotting Libs


In [427]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
# Example

plt.plot([1, 2, 4, 3, 2, 1, 3], [5, 4, 5, 6, 6, 5, 6], 'r')


Out[427]:
[<matplotlib.lines.Line2D at 0x11e668080>]

Models


In [2]:
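# linear models: fit a straight line (or hyperplane) to the data
from sklearn import linear_model

# linear support vector regression: also fits a line; sklearn.svm.SVR with a
# non-linear kernel is the variant that can represent curves
from sklearn.svm import LinearSVR

# ensemble of decision trees: can represent non-linear (curved) relationships
from sklearn.ensemble import RandomForestRegressor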
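
The three estimators share the same `fit` / `predict` interface, which can be sketched as follows. The toy data (a straight line, y = 2x + 1) and the hyperparameter values are assumptions chosen only to keep the example small; they are not tuned recommendations.

```python
import numpy as np
from sklearn import linear_model
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor

# Toy data on a straight line: y = 2x + 1 (made up for illustration).
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

models = {
    "linear_regression": linear_model.LinearRegression(),
    "linear_svr": LinearSVR(max_iter=10000, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                        # same call for every estimator
    predictions[name] = model.predict(X)   # one prediction per input row
    print(name, predictions[name][:3])
```

Because the interface is shared, swapping one model for another usually means changing a single line.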

Scoring


In [ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
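
A minimal sketch of how these two scores are computed, importing both functions from `sklearn.metrics`. The true and predicted values are hypothetical numbers picked so the arithmetic is easy to follow by hand.

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

# Mean squared error: average of the squared differences.
mse = mean_squared_error(y_true, y_pred)

# Mean absolute error: average of the absolute differences.
mae = mean_absolute_error(y_true, y_pred)

print("MSE:", mse)
print("MAE:", mae)
```

The squared error penalises the single large miss (7.0 vs 8.0) more heavily than the absolute error does, which is why the two scores can rank models differently.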

Workshop begins

  • Linear Regression - Part 1
  • Linear Regression - Part 2
  • Linear Regression - Part 3

Advanced Concepts

  1. Data Analytics & Feature Engineering (Seaborn, PCA)

  2. Test-Train Data Split (Test Train Validation splits)

  3. Model Understanding & Working Intuitions (Model Tuning)

  4. Scoring Methods (Truth Tables, Precision, Recall)

Topics for another time

  1. Cross Validation
  2. Multiclass, Multi-variate

Data Analytics & Feature Engineering

Actual Data

Make the differences visible to the machine.

Learn: https://www.quora.com/What-is-feature-engineering

Test-Train Datasets
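
One common way to produce a test-train split is scikit-learn's `train_test_split`. The sketch below assumes a 75/25 split on made-up data; the split ratio and `random_state` are illustrative choices, not values taken from the workshop.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with one feature each.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel()

# Hold out 25% of the rows for testing; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test))
```

Scoring on the held-out rows gives an estimate of how the model behaves on data it was not trained on.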

Model Selections

Scoring Methods - Cautions

Find - Where do you fit?

Linear Regression

SVM Working

Basics of Sketching

There are only 3 major shapes in sketching - Lines, Curves and Oval shapes

Three basic principles on which SVM focuses

  1. Optimal Separation/Boundary Region

    Find a linear equation that represents our solution using hyperplanes.

  2. Maximum Margin Boundary Distances

    Separate the regions such that there is enough safety space on both sides of the boundary.

  3. Kernel Transformations

    The kernel function can be any of the following:

    • linear: <X, Y> (the inner product of X and Y)
    • polynomial: (gamma * <X, Y> + r)^d, where d is specified by keyword degree and r by coef0.
    • rbf: exp(-gamma * ||X - Y||^2), where gamma is specified by keyword gamma and must be greater than 0.
    • sigmoid: tanh(gamma * <X, Y> + r), where r is specified by coef0.
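
The four kernels listed above can be written out by hand for a pair of vectors. This is an illustrative NumPy sketch, not scikit-learn's implementation: the function names are hypothetical, the default parameter values are arbitrary, and the rbf kernel is written with the squared Euclidean distance between the two vectors.

```python
import numpy as np

def linear_kernel(x, y):
    # Inner product of the two vectors.
    return np.dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    # (gamma * <x, y> + r)^d with r = coef0 and d = degree.
    return (gamma * np.dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2); gamma must be greater than 0.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # tanh(gamma * <x, y> + r) with r = coef0.
    return np.tanh(gamma * np.dot(x, y) + coef0)

# Two made-up vectors for illustration.
x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])

print(linear_kernel(x, y))
print(polynomial_kernel(x, y, gamma=1.0, coef0=1.0, degree=2))
print(rbf_kernel(x, y, gamma=0.5))
print(sigmoid_kernel(x, y))
```

Each function returns a single similarity value; an SVM evaluates such a kernel for pairs of samples instead of transforming the data explicitly.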

Complete working style

Random Forest

  • ExtraTreesRegressor
  • Bagging, Boosting

Random Forest Real Life Application - Majority Rule

  • Decisions are made from the cumulative experience collected from a council.
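
The council idea can be sketched for regression with scikit-learn's `RandomForestRegressor`, where each fitted tree in `estimators_` gives its own prediction and the forest reports their average. The data, the number of trees, and the variable names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up non-linear data: y = sin(x) on random points in [0, 10].
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X.ravel())

# A small "council" of 5 trees.
forest = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# Ask every tree separately, then combine the answers.
X_new = np.array([[2.5], [7.0]])
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
council_average = per_tree.mean(axis=0)

print(per_tree)                 # one row of predictions per tree
print(council_average)          # combined answer of the council
print(forest.predict(X_new))    # the forest's own prediction
```

Individual trees can disagree noticeably; averaging them is what smooths the final prediction.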