Intro

Andreas Mueller, NYU Center for Data Science

Clone this repository:

https://github.com/amueller/advanced_training

Schedule

Start: 9am
End: 5pm

Lunch: ~11:45am
Snack: ~3pm

Outline

1 Basic algorithms

  • Review of supervised learning
  • Linear models for classification and regression
  • Loss functions, regularization, empirical risk minimization
  • Path algorithms

2 Preprocessing

  • Scaling and normalization
  • Continuous and discrete features
  • Feature selection
    • Univariate
    • Model-based
    • RFE
    • Forward / backward selection
  • Polynomial and interaction features

3 Basic tools

  • Cross-validation vs train/test split
  • GridSearchCV
  • Overfitting the parameters
  • Scoring Metrics

4 Advanced Supervised Learning

  • Decision Tree Recap
  • Random Forests
  • Gradient Boosting / xgboost
  • Kernel SVMs
  • Kernel approximation
  • Neural Networks

5 Advanced tools

  • Pipelines
  • FeatureUnion
  • FunctionTransformer

6 Unsupervised feature extraction and visualization

  • PCA
  • NMF
  • Robust PCA
  • t-SNE

7 Outlier Detection

  • Elliptic Envelope
  • Isolation Forest
  • KDE
  • Robust PCA

8 Gaussian Processes

9 Beyond standard sklearn

  • Out-of-core learning
  • Custom estimators