Assorted notes:
labels from existing features ? supervised : clusterization
clusterization >> supervised
classic statistical methods (ARIMA) / regression / interpolation
models are often saved as .pkl, or serialized in some form such as .yaml or .json (some utilities here)
On medical datasets, you may encounter various abbreviations.
Example of glossary page: directory.csms.org/abbreviations/
// append a colon to every element with the .abbrev class
document.querySelectorAll('.abbrev').forEach(function(item) {
    item.innerHTML += ':';
});
Copy the result into Sublime Text, then use these key bindings (they extend the selection to the previous/next line):
{ "keys": ["ctrl+alt+shift+up"], "command": "select_lines", "args": {"forward": false} },
{ "keys": ["ctrl+alt+shift+down"], "command": "select_lines", "args": {"forward": true} },
Examples from: NYC Taxi Trip Duration EDA
Usually, you will check what types have been read by Pandas:
import pandas as pd

data = pd.read_csv("train.csv", header=0)
data.dtypes                           # dtypes of every column
data["store_and_fwd_flag"].dtype      # dtype of a single column
Usually, they will be read by default as: float64, int64 and object.
You will often need to change types where suitable:
data_clean = data.copy()  # work on a copy so the raw DataFrame stays intact
data_clean["pickup_datetime"] = pd.to_datetime(data_clean["pickup_datetime"])
Usually it's good to keep outliers, at least for EDA, since they can yield insights. A lot of discussion revolves around:
keeping outliers
normalizing outliers
changing the model to fit the outliers
In this case, since we want to focus our analysis on New York specifically, we empirically set limits for our map.
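As a sketch of the "normalizing outliers" option, one common approach is to clip values outside the Tukey fences (1.5 × IQR beyond the quartiles). This is my own illustration, not code from the notebook, and the toy durations below are made up:

```python
import pandas as pd

def clip_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# toy trip durations in seconds, with one extreme value
durations = pd.Series([300, 420, 380, 350, 400, 50000])
clipped = clip_outliers_iqr(durations)
```

Clipping keeps every row (unlike dropping) while bounding the influence of extreme values, which is often what you want before fitting a model.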
xlim = [-74.2, -73.7]
ylim = [40.55, 40.95]
data_normalized = data_clean.copy()
data_normalized = data_normalized[data_normalized.pickup_latitude.between(*ylim) & data_normalized.dropoff_latitude.between(*ylim)]
data_normalized = data_normalized[data_normalized.pickup_longitude.between(*xlim) & data_normalized.dropoff_longitude.between(*xlim)]
On normalization, some of the best resources come from audio processing: Audacity's Amplify & Normalize effects.
Logarithmic smoothing
demo: NYC log hexbinning
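"Logarithmic smoothing" here means coloring each bin by the log of its count, so a handful of very dense cells don't wash out the rest of the map (matplotlib's `hexbin` supports this via `bins='log'`). A square-bin sketch with NumPy, using synthetic coordinates drawn inside the bounding box set earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic pickup coordinates inside the NYC bounding box
lon = rng.uniform(-74.2, -73.7, 10_000)
lat = rng.uniform(40.55, 40.95, 10_000)

# 2D counts over a 50x50 grid, then compress their dynamic range
counts, xedges, yedges = np.histogram2d(lon, lat, bins=50)
log_counts = np.log1p(counts)  # log(1 + count): dense bins no longer dominate
```

For the hexagonal version, `plt.hexbin(lon, lat, gridsize=50, bins='log')` does the binning and log scaling in one call.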