Data Science Training #02

Workflows

Simplified: Exploration vs Exploitation

Simulated Annealing
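Simulated annealing makes the tradeoff concrete: a high "temperature" early on accepts worse solutions (exploration), and gradual cooling restricts the search to improvements only (exploitation). A minimal sketch, with a made-up toy objective:

import math
import random

def simulated_annealing(f, x0, temp=1.0, cooling=0.95, steps=200):
    """Minimize f starting from x0: hot = explore, cold = exploit."""
    x, fx = x0, f(x0)
    for _ in range(steps):
        candidate = x + random.uniform(-1, 1)  # random neighbour
        fc = f(candidate)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if fc < fx or random.random() < math.exp((fx - fc) / temp):
            x, fx = candidate, fc
        temp *= cooling  # cool down: less exploration over time
    return x, fx

# Toy objective with its minimum at x = 3
best_x, best_f = simulated_annealing(lambda x: (x - 3) ** 2, x0=0.0)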

Part 1: Exploration

pre-clean / scrape & transform to csv >> clean >> EDA (Exploratory Data Analysis)

Part 2: Exploitation

  • segmentation: clusterization (see the clustering sketch after this list)
    • result: groups of datapoints, e.g. high-prescribing doctors, new prescribers, all-round low prescribers
  • classification: labels from existing features ? supervised : clusterization >> supervised (i.e. if labels can be derived from existing features, train a supervised model directly; otherwise cluster first to create the labels, then go supervised)
    • result: categorical classification
  • forecasting: classic statistical methods (ARIMA) / regression / interpolation
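For the segmentation step, a minimal clustering sketch; the doctors dataframe and its rx_count / rx_growth columns are illustrative names, not real data:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative prescriber features: volume and growth of prescriptions
doctors = pd.DataFrame({"rx_count": [820, 15, 40, 900, 12],
                        "rx_growth": [0.10, 0.90, 0.05, 0.20, 0.80]})

X = StandardScaler().fit_transform(doctors)  # k-means is scale-sensitive
doctors["segment"] = KMeans(n_clusters=3, n_init=10,
                            random_state=0).fit_predict(X)
# Inspect each segment and label it by hand, e.g. high prescribers,
# new prescribers, all-round low prescribers.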

General advice, keep backups:

  • save notebooks
  • save data produced in notebooks (especially data fetched from endpoints), e.g. geocoding results should be saved as binary .pkl or serialized in a text format such as .yaml or .json
  • save sklearn models as .pkl (see the persistence sketch after this list)
  • depending on each dataframe's memory footprint, try to keep the different dataframes of the process in one notebook and drop them once no longer needed (garbage collection can be tricky; the easiest way to reclaim memory is to restart the Jupyter kernel)
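A minimal persistence sketch; the file names and the geocoded object are illustrative. For large numpy-backed sklearn models, joblib.dump / joblib.load follow the same pattern and are often preferred.

import pickle

geocoded = {"New York, NY": (40.7128, -74.0060)}  # illustrative endpoint result

# Save once, so a kernel restart doesn't mean re-hitting the API
with open("geocoding_results.pkl", "wb") as f:
    pickle.dump(geocoded, f)

# Reload in a later session
with open("geocoding_results.pkl", "rb") as f:
    geocoded = pickle.load(f)

# The same pattern persists a fitted sklearn model:
# with open("model.pkl", "wb") as f:
#     pickle.dump(model, f)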

General advice, folder structure:
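A minimal layout, as one illustrative possibility:

project/
├── data/          # raw and cleaned datasets (.csv, .pkl)
├── notebooks/     # exploration and cleaning notebooks
├── models/        # serialized models (.pkl)
└── src/           # reusable helper code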

Or a more complex version - [github.com/drivendata/cookiecutter-data-science](https://github.com/drivendata/cookiecutter-data-science)

Pre-cleaning / scraping

Scraping: Building abbreviation dictionaries

In medical datasets you may encounter many abbreviations.

Example of a glossary page: directory.csms.org/abbreviations/

// Append ':' to every abbreviation so each line reads "ABBR: meaning"
document.querySelectorAll('.abbrev').forEach(function(item) {
    item.innerHTML += ':';
});

Copy the output into Sublime Text, then use these key bindings to extend the selection across lines with multiple cursors:

{ "keys": ["ctrl+alt+shift+up"], "command": "select_lines", "args": {"forward": false} },
{ "keys": ["ctrl+alt+shift+down"], "command": "select_lines", "args": {"forward": true} },
The final result should be an abbreviation dictionary along these lines (the entries below are common medical abbreviations, shown for illustration, not the page's actual content):
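abbreviations = {
    "BP": "blood pressure",
    "Dx": "diagnosis",
    "Hx": "history",
    "Rx": "prescription, treatment",
}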

Cleaning

Cleaning: Types

Usually, you will first check which types Pandas has inferred:

import pandas as pd

data = pd.read_csv("train.csv", header=0)
data["store_and_fwd_flag"].dtypes  # dtype of a single column

By default, columns will be read as float64, int64, or object.
You will often need to cast types where appropriate:

data_clean = data.copy()  # work on a copy so the raw frame stays intact
data_clean["pickup_datetime"] = pd.to_datetime(data_clean["pickup_datetime"])

Cleaning: Column categories

demo: cleanup notebook (B2B example)
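If the demo's cleanup includes low-cardinality text columns, converting them to Pandas' category dtype is a typical step; a sketch using the taxi column from above:

data_clean["store_and_fwd_flag"] = data_clean["store_and_fwd_flag"].astype("category")
data_clean["store_and_fwd_flag"].cat.categories  # e.g. Index(['N', 'Y'], dtype='object')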

Cleaning: Outliers

It's usually good to keep outliers, at least during EDA, for the insights they provide. There is a lot of debate around:

  • keeping outliers
  • normalizing outliers
  • changing the model to fit the outliers

In this case, since we want to focus the analysis on New York City specifically, we set empirical limits for our map:

xlim = [-74.2, -73.7]   # NYC longitude limits
ylim = [40.55, 40.95]   # NYC latitude limits

data_normalized = data_clean.copy()
data_normalized = data_normalized[
    data_normalized.pickup_latitude.between(*ylim)
    & data_normalized.dropoff_latitude.between(*ylim)
    & data_normalized.pickup_longitude.between(*xlim)
    & data_normalized.dropoff_longitude.between(*xlim)
]

Cleaning: Outliers

For normalization, some of the best intuition comes from audio processing: Audacity's Amplify & Normalize effects.

Logarithmic smoothing

demo: NYC log hexbinning
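A minimal sketch of log hexbinning with matplotlib (column names come from the taxi data above; gridsize and colormap are arbitrary choices):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
# bins="log" colours each hexagon by log10(count), so the densest
# Manhattan cells no longer drown out the rest of the map
ax.hexbin(data_normalized.pickup_longitude, data_normalized.pickup_latitude,
          gridsize=200, bins="log", cmap="inferno")
ax.set(xlim=xlim, ylim=ylim, xlabel="longitude", ylabel="latitude")
plt.show()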

**Next Lesson:**

Part 2 of this lesson & EDA

Future lessons:

![](assets/DS_2/ds_training_map.png)