Data Science Training #01 (draft)

Roadmap

01. What is data science?

10. How data science fits into the big picture

11. Practical data science workflows (next)

01. What is data science?

01.1. Definitions / Venn war

The term "data science" is itself still contested, but here is one concise definition:

To do data science, you have to be able to find and process large datasets. You’ll often need to understand and use programming, math, and technical communication skills. You’ll need to be a unicorn that can put together a lot of different skillsets.

  • Roger Huang, Springboard blog - source

A longer definition is offered by the now-famous Harvard Business Review (HBR) article, Data Scientist: The Sexiest Job of the 21st Century (Oct 2012):

[...] what data scientists do is make discoveries while swimming in data [...] They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. [...]

[...] Often they are creative in displaying information visually and making the patterns they find clear and compelling. They advise executives and product managers on the implications of the data for products, processes, and decisions.

Given the nascent state of their trade, it often falls to data scientists to fashion their own tools and even conduct academic-style research. Yahoo, one of the firms that employed a group of data scientists early on, was instrumental in developing Hadoop. [...]

What kind of person does all this? What abilities make a data scientist successful? Think of him or her as a hybrid of data hacker, analyst, communicator, and trusted adviser. The combination is extremely powerful—and rare.

Source: hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

**The most frequently cited reference point**: ![](assets/map_data_science_2010.png) Source: [drewconway.com/zia/2013/3/26/the-data-science-venn-diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
**It gets crazier**: ![](assets/map_ds_multi.png)
[![](assets/map_ds_nyu.png)](http://datascience.nyu.edu/what-is-data-science/) Source: [datascience.nyu.edu/what-is-data-science/](http://datascience.nyu.edu/what-is-data-science/)

![](assets/map_venn.png) Source: `Deep Learning` - **Ian Goodfellow**, **Yoshua Bengio**, and **Aaron Courville** - [source](http://www.deeplearningbook.org/contents/intro.html) (pg. 9)

01.2. History of DS

![](http://www.brandavnu.com/assets/img/clients/sigkdd_logo.png) **SIGKDD** is the ACM's **Special Interest Group** (SIG) on **Knowledge Discovery and Data Mining** (official ACM SIG since 1998, started in 1995).

**Data Science: an Action Plan for Expanding the Technical Areas of the Field of Statistics**, William S. Cleveland, Statistics Research, Bell Labs. First published: **April 2001** - [Wiley page](http://onlinelibrary.wiley.com/doi/10.1111/j.1751-5823.2001.tb00477.x/abstract), [fulltext](http://www.datascienceassn.org/sites/default/files/Data%20Science%20An%20Action%20Plan%20for%20Expanding%20the%20Technical%20Areas%20of%20the%20Field%20of%20Statistics.pdf)

This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called “data science.” [...]

The six areas and their percentages are the following:

  • (25%) Multidisciplinary Investigations: data analysis collaborations in a collection of subject matter areas.
  • (20%) Models and Methods for Data: statistical models; methods of model building; methods of estimation and distribution based on probabilistic inference.
  • (15%) Computing with Data: hardware systems; software systems; computational algorithms.
  • (15%) Pedagogy: curriculum planning and approaches to teaching for elementary school, secondary school, college, graduate school, continuing education, and corporate training.
  • (5%) Tool Evaluation: surveys of tools in use in practice, surveys of perceived needs for new tools, and studies of the processes for developing new tools.
  • (20%) Theory: foundations of data science; general approaches to models and methods, to computing with data, to teaching, and to tool evaluation; mathematical investigations of models and methods, of computing with data, of teaching, and of evaluation.

10. How data science fits into the big picture

  • statistics: statistics

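  • learning and generalizing: machine learning (ML) / artificial neural networks (ANN)

  • Bayesian generalization: one-shot learning, PyMC ...

  • optimization: multi-criteria decision analysis (MCDA) / multi-objective decision analysis (MODA)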

Machine learning types:

11. Practical data science workflows


In [1]:
%%svg
<svg width="720" height="80"><g>
    <g><rect x="0" y="0" width="150" height="70" fill="#FFF" stroke="#000"></rect>
       <text x="10" y="30" font-family="Verdana" font-size="20" fill="#444">Analysis</text></g>
    <g transform="translate(170,0)">
       <polyline fill="none" stroke="#AAA" stroke-width="1" stroke-linecap="round" stroke-linejoin="round" points="
        0.375,0.375 45.63,38.087 0.375,75.8 "/>
    </g>

    <g transform="translate(230,0)">
      <rect x="0" y="0" width="400" height="70" fill="#FFF" stroke="#000"></rect>
        <text x="10" y="30" font-family="Verdana" font-size="20" fill="#444">Modelling</text></g>
</g></svg>



Machine learning steps

Asking the right question => Preparing the data => Selecting the algorithm => Training the model

The process is iterative and may require returning to an earlier step, for example:

  • changing the question
  • sanitizing, extending the data
  • changing the algorithm
  • (supervised learning) extending the test data
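
The steps above can be sketched end to end in a few lines of Python. This is only an illustrative sketch, not a recommended toolchain: the question ("can y be predicted from x with a straight line?"), the synthetic data, and all function names (`make_data`, `split`, `fit_line`, `mse`) are assumptions made for the example. It uses only the standard library so that it runs anywhere; in practice a library such as scikit-learn would normally take over the splitting, fitting, and scoring.

```python
import random
from statistics import mean

# Step 1 - the question (hypothetical): can y be predicted from x
# with a straight line?

# Step 2 - preparing the data. Here the data is synthetic:
# y = 3x + 2 plus Gaussian noise, so the "right answer" is known.
def make_data(n=200, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0, 1.0)) for x in xs]

def split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Step 3 - selecting the algorithm: ordinary least squares for one feature.
# Step 4 - training the model: compute slope and intercept from the
# training set only.
def fit_line(train):
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(model, data):
    """Mean squared error of the fitted line on a dataset."""
    slope, intercept = model
    return mean((y - (slope * x + intercept)) ** 2 for x, y in data)

data = make_data()
train, test = split(data)
model = fit_line(train)
test_error = mse(model, test)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} test MSE={test_error:.2f}")
```

If the error on the held-out test set is too high, that is the signal to loop back: rephrase the question, clean or extend the data, or try a different algorithm.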

Thanks!