Dirty data: dealing with missing, unbalanced, and diverse data

In the last few posts, we have covered some general principles of fitting machine learning models, optimizing hyperparameters of your models, and different techniques for building an ensemble model from individual models. However, we have mostly avoided talking about some common issues that plague real-world machine learning.

Namely, what do you do if you have a lot of missing data? Simply omitting rows with missing values can rapidly whittle your dataset down too much. And what happens if the outcome you're trying to predict is quite rare in your dataset, as in fraud detection? If your outcome occurs in 1% of cases, then a model that just predicts the majority case every time reaches 99% accuracy, but this is clearly not very useful. Finally, most datasets will be composed of a mixture of categorical and numerical variables. How can you make use of as much data as possible when building your models?

Of course, these are not the only tricky aspects of datasets. You may have to remove outliers, smooth noisy data, scale or normalize your features, handle redundant or duplicate data, and bin or discretize variables. Still, we focus on the three issues above since less has been written about them within a tutorial context.

Here is an overview of this blog post:

Filling in missing data

Easiest - omission.

Different techniques for imputation.
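To make the trade-off concrete, here is a minimal sketch comparing omission with mean imputation, one of the simplest imputation techniques. The toy rows and the helper names `drop_incomplete` and `impute_mean` are our own inventions for illustration, and `None` is assumed to mark a missing value.

```python
import statistics

# Toy dataset (hypothetical values); None marks a missing entry.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": None},
    {"age": 31, "income": 58000},
]

def drop_incomplete(rows):
    """Omission: keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows):
    """Mean imputation: replace each missing value with its column mean,
    computed from the observed (non-missing) values only."""
    columns = list(rows[0].keys())
    means = {
        c: statistics.mean([r[c] for r in rows if r[c] is not None])
        for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in columns}
        for r in rows
    ]

complete = drop_incomplete(rows)   # half of the rows are lost
imputed = impute_mean(rows)        # all rows are kept, gaps are filled

print(len(complete), "rows after omission;", len(imputed), "rows after imputation")
```

Omission throws away half of this tiny dataset, while imputation keeps every row at the cost of inventing plausible values. In practice you would likely reach for a library helper such as scikit-learn's `SimpleImputer` rather than hand-rolling this.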

Redressing the balance: dealing with unbalanced target variables

Many datasets have unbalanced outcome variables, i.e. very unequal proportions between the classes. This can trip up many classification algorithms and lead to unhelpful results.
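A quick sketch of why accuracy alone is misleading here. The labels are made up (10 rare positive cases out of 1,000), and the "model" simply predicts the majority class every time.

```python
# Hypothetical labels: 1 = rare outcome (e.g. fraud), 0 = everything else.
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: the fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: the fraction of true positive cases that the model actually finds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.99, recall = 0.00
```

The model looks excellent by accuracy yet never detects a single rare case, which is why metrics such as recall or precision are more informative on unbalanced data.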

Oversampling minority case (with or without replacement).

Undersampling majority case.

Synthetic minority case generation.

Attaching weights to classes within the algorithm. A few of scikit-learn's algorithms come with a `class_weight` option.
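The four strategies above can be sketched in plain Python. Everything here is illustrative: the data is made up, and the helper names (`resample`, `synthesize`, `balanced_class_weights`) are our own. Note that `synthesize` is a deliberately simplified stand-in for synthetic generation: it interpolates between random pairs of minority rows, whereas SMOTE proper interpolates toward nearest neighbours.

```python
import random
from collections import Counter

def split_by_class(X, y):
    """Group feature rows by their class label."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class

def resample(X, y, mode, seed=0):
    """mode='over': duplicate minority rows at random (with replacement).
    mode='under': keep a random subset of majority rows (without replacement)."""
    rng = random.Random(seed)
    by_class = split_by_class(X, y)
    sizes = [len(rows) for rows in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) < target:
            rows = rows + rng.choices(rows, k=target - len(rows))
        elif len(rows) > target:
            rows = rng.sample(rows, k=target)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out

def synthesize(rows, n, seed=0):
    """Create n synthetic rows by interpolating between random pairs of rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        a, b = rng.sample(rows, k=2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * n) for label, n in counts.items()}

# Hypothetical data: 16 majority rows (label 0), 4 minority rows (label 1).
X = [[float(i), float(i % 3)] for i in range(20)]
y = [0] * 16 + [1] * 4

X_over, y_over = resample(X, y, mode="over")
X_under, y_under = resample(X, y, mode="under")
minority = split_by_class(X, y)[1]
X_synth = synthesize(minority, n=12)
weights = balanced_class_weights(y)

print(Counter(y_over), Counter(y_under), weights)
```

The weighting formula is the same heuristic scikit-learn documents for `class_weight="balanced"`, so in practice you can often pass that option to an estimator instead of resampling at all. Whichever strategy you choose, resample only the training data so that your evaluation set keeps the real class proportions.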

Getting the most from mixed datasets

Certain algorithms, like trees and forests, are just as able to use numerical data as categorical data (i.e. trees just split on any type of variable as long as it increases purity in the terminal nodes).

For others, mixed data types are an issue.

Recoding categorical variables as numbers. The issue: if the categories have no natural ranking, the numbers imply an ordering that does not exist.

Recoding categorical variables as binary dummy variables.
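Here is a small sketch contrasting the two recodings on a made-up `colors` column. The function names are our own, and categories are sorted alphabetically purely so that the output is deterministic.

```python
colors = ["red", "green", "blue", "green"]

def integer_encode(values):
    """Map each category to an integer code (implies an order that may not exist)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Map each category to a binary dummy vector, one column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

codes, categories = integer_encode(colors)
dummies, _ = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red']
print(codes)       # [2, 1, 0, 1]
print(dummies)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

With integer codes, a model may conclude that red (2) is somehow "more" than blue (0). The dummy columns avoid that at the cost of one extra column per category. Library equivalents include `pandas.get_dummies` and scikit-learn's `OneHotEncoder`.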

Data can also be mixed in another way: numerical features may be on vastly different scales. Mean centering and variance normalization (standardization) address this.
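As a minimal sketch, assuming two made-up columns on very different scales, standardization subtracts each column's mean and divides by its standard deviation. We use the population standard deviation here; the `standardize` helper is our own.

```python
import statistics

def standardize(column):
    """Rescale a numeric column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

# Hypothetical features on very different scales.
ages = [20, 30, 40, 50]
incomes = [20000, 45000, 60000, 150000]

ages_scaled = standardize(ages)
incomes_scaled = standardize(incomes)

print([round(v, 2) for v in ages_scaled])
print([round(v, 2) for v in incomes_scaled])
```

After scaling, both columns are centered on zero with unit variance, so distance-based and gradient-based models no longer let income dominate simply because its raw numbers are larger. Remember to compute the mean and standard deviation on the training data only and reuse them for the test data.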

Conclusion

