Data Science Training #02

Workflows

Simplified: Exploration vs Exploitation

Simulated Annealing
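Simulated annealing makes the tradeoff concrete: a high "temperature" early on accepts worse solutions (exploration), and gradual cooling restricts the search to improvements only (exploitation). A minimal sketch, with a made-up toy objective:

import math
import random

def simulated_annealing(f, x0, temp=1.0, cooling=0.95, steps=200):
    """Minimize f starting from x0: hot = explore, cold = exploit."""
    x, fx = x0, f(x0)
    for _ in range(steps):
        candidate = x + random.uniform(-1, 1)  # random neighbour
        fc = f(candidate)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if fc < fx or random.random() < math.exp((fx - fc) / temp):
            x, fx = candidate, fc
        temp *= cooling  # cool down: less exploration over time
    return x, fx

# Toy objective with its minimum at x = 3
best_x, best_f = simulated_annealing(lambda x: (x - 3) ** 2, x0=0.0)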

Part 1: Exploration

pre-clean / scrape & transform to csv >> clean >> EDA (Exploratory Data Analysis)

Part 2: Exploitation

  • segmentation: clusterization (see the clustering sketch after this list)
    • result: groups of datapoints, e.g. high-prescribing doctors, new prescribers, all-round low prescribers
  • classification: labels from existing features ? supervised : clusterization >> supervised (i.e. if labels can be derived from existing features, train a supervised model directly; otherwise cluster first to create the labels, then go supervised)
    • result: categorical classification
  • forecasting: classic statistical methods (ARIMA) / regression / interpolation
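For the segmentation step, a minimal clustering sketch; the doctors dataframe and its rx_count / rx_growth columns are illustrative names, not real data:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative prescriber features: volume and growth of prescriptions
doctors = pd.DataFrame({"rx_count": [820, 15, 40, 900, 12],
                        "rx_growth": [0.10, 0.90, 0.05, 0.20, 0.80]})

X = StandardScaler().fit_transform(doctors)  # k-means is scale-sensitive
doctors["segment"] = KMeans(n_clusters=3, n_init=10,
                            random_state=0).fit_predict(X)
# Inspect each segment and label it by hand, e.g. high prescribers,
# new prescribers, all-round low prescribers.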

General advice, keep backups:

  • save notebooks
  • save data produced in notebooks (especially data fetched from endpoints), e.g. geocoding results should be saved as binary .pkl or serialized in a text format such as .yaml or .json
  • save sklearn models as .pkl (see the persistence sketch after this list)
  • depending on each dataframe's memory footprint, try to keep the different dataframes of the process in one notebook and drop them once no longer needed (garbage collection can be tricky; the easiest way to reclaim memory is to restart the Jupyter kernel)
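A minimal persistence sketch; the file names and the geocoded object are illustrative. For large numpy-backed sklearn models, joblib.dump / joblib.load follow the same pattern and are often preferred.

import pickle

geocoded = {"New York, NY": (40.7128, -74.0060)}  # illustrative endpoint result

# Save once, so a kernel restart doesn't mean re-hitting the API
with open("geocoding_results.pkl", "wb") as f:
    pickle.dump(geocoded, f)

# Reload in a later session
with open("geocoding_results.pkl", "rb") as f:
    geocoded = pickle.load(f)

# The same pattern persists a fitted sklearn model:
# with open("model.pkl", "wb") as f:
#     pickle.dump(model, f)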

General advice, folder structure:
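A minimal layout, as one illustrative possibility:

project/
├── data/          # raw and cleaned datasets (.csv, .pkl)
├── notebooks/     # exploration and cleaning notebooks
├── models/        # serialized models (.pkl)
└── src/           # reusable helper code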

Or a more complex version - [github.com/drivendata/cookiecutter-data-science](https://github.com/drivendata/cookiecutter-data-science)

Pre-cleaning / scraping

Scraping: Building abbreviation dictionaries

In medical datasets you may encounter many abbreviations.

Example of a glossary page: directory.csms.org/abbreviations/

// Append ':' to every abbreviation so each line reads "ABBR: meaning"
document.querySelectorAll('.abbrev').forEach(function(item) {
    item.innerHTML += ':';
});

Copy the output into Sublime Text, then use these key bindings to extend the selection across lines with multiple cursors:

{ "keys": ["ctrl+alt+shift+up"], "command": "select_lines", "args": {"forward": false} },
{ "keys": ["ctrl+alt+shift+down"], "command": "select_lines", "args": {"forward": true} },
The final result should be an abbreviation dictionary along these lines (the entries below are common medical abbreviations, shown for illustration, not the page's actual content):
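abbreviations = {
    "BP": "blood pressure",
    "Dx": "diagnosis",
    "Hx": "history",
    "Rx": "prescription, treatment",
}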

Cleaning

Cleaning: Types

Usually, you will first check which types Pandas has inferred:

import pandas as pd

data = pd.read_csv("train.csv", header=0)
data["store_and_fwd_flag"].dtypes  # dtype of a single column

By default, columns will be read as float64, int64, or object.
You will often need to cast types where appropriate:

data_clean = data.copy()  # work on a copy so the raw frame stays intact
data_clean["pickup_datetime"] = pd.to_datetime(data_clean["pickup_datetime"])

Cleaning: Column categories

demo: cleanup notebook (B2B example)
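If the demo's cleanup includes low-cardinality text columns, converting them to Pandas' category dtype is a typical step; a sketch using the taxi column from above:

data_clean["store_and_fwd_flag"] = data_clean["store_and_fwd_flag"].astype("category")
data_clean["store_and_fwd_flag"].cat.categories  # e.g. Index(['N', 'Y'], dtype='object')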

Cleaning: Outliers

It's usually good to keep outliers, at least during EDA, for the insights they provide. There is a lot of debate around:

  • keeping outliers
  • normalizing outliers
  • changing the model to fit the outliers

In this case, since we want to focus the analysis on New York City specifically, we set empirical limits for our map:

xlim = [-74.2, -73.7]   # NYC longitude limits
ylim = [40.55, 40.95]   # NYC latitude limits

data_normalized = data_clean.copy()
data_normalized = data_normalized[
    data_normalized.pickup_latitude.between(*ylim)
    & data_normalized.dropoff_latitude.between(*ylim)
    & data_normalized.pickup_longitude.between(*xlim)
    & data_normalized.dropoff_longitude.between(*xlim)
]

Cleaning: Outliers

For normalization, some of the best intuition comes from audio processing: Audacity's Amplify & Normalize effects.

Logarithmic smoothing

demo: NYC log hexbinning
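A minimal sketch of log hexbinning with matplotlib (column names come from the taxi data above; gridsize and colormap are arbitrary choices):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
# bins="log" colours each hexagon by log10(count), so the densest
# Manhattan cells no longer drown out the rest of the map
ax.hexbin(data_normalized.pickup_longitude, data_normalized.pickup_latitude,
          gridsize=200, bins="log", cmap="inferno")
ax.set(xlim=xlim, ylim=ylim, xlabel="longitude", ylabel="latitude")
plt.show()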

**Next Lesson:**

Part 2 of this lesson & EDA

Future lessons:

![](assets/DS_2/ds_training_map.png)