Data Science Workshop

Goal: Learn how to start implementing ML.

Recap

Machine Learning is a subfield of Artificial Intelligence focused on systems that learn from data instead of being explicitly programmed.
Data Science is not a single-step process:
- Model Building: Linear Models, Support Vector Machines, Random Forest Models
- Validation: How good is your model?
- Presentation/Story-telling
Data Science Goal:
- Supervised / Unsupervised
- Regression Models / Classification Models

Questions

Question Polling
- How many of you know Python?
- How many of you know programming (and do it on a daily basis)?

Agenda

Learn about a few Python libraries
Hands-on: 3 basic Python data modelling stages of Linear Regression
Comparison of Linear Regression with other ML models

Python Libs

Numpy

NumPy is the fundamental package for scientific computing with Python.

Among other things, it contains:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.



In [4]:

    
import numpy as np
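
As a rough illustration of the capabilities listed above, the following sketch touches broadcasting, linear algebra, and random numbers. The arrays, the 2x2 system, and the seed are made-up values chosen only for the example.

```python
import numpy as np

# A 2-D array with 3 rows and 2 columns
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Broadcasting: the 1-D array of column means is applied to every row
centered = a - a.mean(axis=0)

# Linear algebra: solve the 2x2 system M @ x = b
M = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(M, b)

# Random numbers from a seeded generator (the seed 0 is arbitrary)
rng = np.random.default_rng(0)
noise = rng.normal(size=5)

print(centered)
print(x)
print(noise)
```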

Pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.



In [5]:

    
import pandas as pd
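
A minimal sketch of the kind of data structure pandas provides. The column names and values below are hypothetical toy data, not workshop data.

```python
import pandas as pd

# Hypothetical toy data: hours studied vs. exam score
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 78],
})

print(df.head())      # first rows of the table
print(df.describe())  # summary statistics per column

# Select rows with a boolean mask, then average one column
long_study = df[df["hours"] >= 3]
mean_score = long_study["score"].mean()
print(mean_score)
```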

Plotting Libs



In [427]:

    
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
# Example

plt.plot([1, 2, 4, 3, 2, 1, 3], [5, 4, 5, 6, 6, 5, 6], 'r')

Out[427]:

[<matplotlib.lines.Line2D at 0x11e668080>]

Models



In [2]:

    
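# fits a straight line (a hyperplane when there are several features)
from sklearn import linear_model

# linear support vector regression: also fits a line; curves need sklearn.svm.SVR with a non-linear kernel
from sklearn.svm import LinearSVR

# ensemble of decision trees: can approximate curves with piecewise-constant steps
from sklearn.ensemble import RandomForestRegressor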
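
To see how the three imported models are used, here is a small sketch that fits each of them to the same data. The synthetic data (y = 3x + 2 plus noise), the seeds, and the hyperparameters are assumptions made for illustration only.

```python
import numpy as np
from sklearn import linear_model
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for this sketch): y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)

# All three estimators share the same fit/predict interface
models = {
    "LinearRegression": linear_model.LinearRegression(),
    "LinearSVR": LinearSVR(random_state=0, max_iter=10000),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

X_new = np.array([[2.0], [5.0], [8.0]])
predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X_new)
    print(name, predictions[name])
```

Because every scikit-learn estimator exposes `fit` and `predict`, swapping one model for another usually changes only the line that constructs it.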

Scoring



In [ ]:

    
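# import from sklearn.metrics; the private sklearn.metrics.regression module was removed in newer scikit-learn releases
from sklearn.metrics import mean_squared_error, mean_absolute_error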
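
A short sketch of what these two scores measure. The true and predicted values are made up so the arithmetic can be checked by hand.

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up values: the errors are -0.5, 0.0, 1.0 and 0.5
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

# Mean squared error: average of the squared errors (penalises large misses more)
mse = mean_squared_error(y_true, y_pred)

# Mean absolute error: average of the absolute errors
mae = mean_absolute_error(y_true, y_pred)

print(mse)  # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
print(mae)  # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
```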

Workshop begins

Linear Regression - Part 1
Linear Regression - Part 2
Linear Regression - Part 3

Advanced Concepts

Data Analytics & Feature Engineering (Seaborn, PCA)
Test-Train Data Split (Test Train Validation splits)
Model Understanding & Working Intuitions (Model Tuning)
Scoring Methods (Truth Tables, Precision, Recall)

Topics for another session

Cross Validation
Multiclass, Multi-variate

Data Analytics & Feature Engineering

Actual Data

Make the differences in the data visible to the machine.

Learn: https://www.quora.com/What-is-feature-engineering

Test-Train Datasets
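
A minimal sketch of a test-train split: hold back part of the data, fit on the rest, and score on the held-back part. The synthetic data, the 25% test size, and the seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for this sketch): y = 2x + 1 plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=100)

# Keep 25% of the rows aside; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Score on both parts: a test error far above the train error hints at overfitting
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_mse, test_mse)
```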

Model Selections

Scoring Methods - Cautions

Find - Where do you fit?

Linear Regression

SVM Working

Basics of Sketching

There are only 3 major shapes in sketching - Lines, Curves and Oval shapes

Three basic principles on which SVM focuses

Optimal Separation/Boundary Region

Find a linear equation that represents our solution using hyperplanes.
Maximum Margin Boundary Distances

Separate the regions so that there is enough safety space on both sides of the boundary.
Kernel Transformations

The kernel function can be any of the following (<X, Y> denotes the dot product of X and Y):
- linear: <X, Y>
- polynomial: (gamma * <X, Y> + r)^d, where d is specified by keyword degree and r by coef0.
- rbf: exp(-gamma * ||X - Y||^2), where gamma is specified by keyword gamma and must be greater than 0.
- sigmoid: tanh(gamma * <X, Y> + r), where r is specified by coef0.
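
The four kernels can be computed by hand and compared with the helper functions in `sklearn.metrics.pairwise`. This is a sketch under assumed values: the two vectors and the settings gamma = 0.5, r = 1, d = 3 are arbitrary, and the formulas use the dot product and the squared Euclidean distance.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel,
    polynomial_kernel,
    rbf_kernel,
    sigmoid_kernel,
)

# Two arbitrary example vectors and arbitrary kernel settings
u = np.array([1.0, 2.0])
v = np.array([3.0, 0.5])
gamma, r, d = 0.5, 1.0, 3

dot = u @ v                      # dot product <u, v>
sq_dist = np.sum((u - v) ** 2)   # squared Euclidean distance ||u - v||^2

# Kernels written out by hand
k_linear = dot
k_poly = (gamma * dot + r) ** d
k_rbf = np.exp(-gamma * sq_dist)
k_sigmoid = np.tanh(gamma * dot + r)

# The scikit-learn helpers expect 2-D inputs of shape (n_samples, n_features)
U, V = u.reshape(1, -1), v.reshape(1, -1)
sk_linear = linear_kernel(U, V)[0, 0]
sk_poly = polynomial_kernel(U, V, degree=d, gamma=gamma, coef0=r)[0, 0]
sk_rbf = rbf_kernel(U, V, gamma=gamma)[0, 0]
sk_sigmoid = sigmoid_kernel(U, V, gamma=gamma, coef0=r)[0, 0]

print(k_linear, k_poly, k_rbf, k_sigmoid)
```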

Complete working style

Random Forest

ExtraTreesRegressor
Bagging, Boosting

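Random Forest Real-Life Application - Majority Rule

Decisions are made from the cumulative experience of a council: every tree gives its own answer, and the forest combines them (a vote for classification, an average for regression).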
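
The council idea can be sketched with `RandomForestRegressor`: each fitted tree in `estimators_` makes its own prediction, and the forest's prediction is their average, which is the regression counterpart of a majority vote. The sine-shaped synthetic data, the number of trees, and the seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for this sketch): a noisy sine curve
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=150)

forest = RandomForestRegressor(n_estimators=25, random_state=1).fit(X, y)

X_new = np.array([[2.5], [7.0]])

# Ask every council member (tree) for its own opinion
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])

# The council's decision is the average of the individual opinions
council_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_new)

print(council_average)
print(forest_prediction)
```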