Data Preprocessing

1. Noise detection

Definition: Noise refers to incorrect or distorted data values.

How to deal with noise depends on the type of data:

  • For time-series data, low-pass filters are usually used. Examples include the Butterworth filter and the Chebyshev filter.
  • For documents, use software for spell-checking, abbreviation expansion, etc.

In [6]:
import numpy as np
import wave
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import butter, filtfilt

import utils
utils.set_plotting_style()

%matplotlib inline

# Read the raw audio frames from the WAV file (assumed to be 16-bit mono PCM)
with wave.open('data/exhibition.wav', 'r') as spf:
    sample_rate = spf.getframerate()
    raw = spf.readframes(spf.getnframes())
signal = np.frombuffer(raw, dtype=np.int16)

# Low-pass Butterworth filter. The order (4) and the cutoff (1 kHz) are
# illustrative assumptions, not tuned values.
b, a = butter(4, 1000, btype='low', fs=sample_rate)
filtered = filtfilt(b, a, signal.astype(float))

fig = plt.figure(1, figsize=(30, 10))

# Top panel: the original signal, with x tick labels hidden
ax = fig.add_subplot(2, 1, 1)
ax.set_title('Original Signal Wave', size=30)
ax.tick_params(axis='x', labelbottom=False)
ax.tick_params(axis='y', labelsize=24)
ax.plot(signal)

# Bottom panel: the low-pass filtered signal
ax = fig.add_subplot(2, 1, 2)
ax.set_title('Filtered Signal Wave', size=30)
ax.tick_params(axis='x', labelsize=24, labelrotation=45)
ax.tick_params(axis='y', labelsize=24)
ax.plot(filtered)
plt.show()


2. Outliers

Definition: Data points/objects that are considerably different from most of the other objects in a dataset.
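
One simple way to flag such points, for a single numeric attribute, is the interquartile range (IQR) rule. The notes do not name a specific detection method, so the rule itself, the helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample values below are all assumptions made for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sample: most readings are close to 10, one is far away.
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0])
mask = iqr_outliers(data)
print(data[mask])
```

The rule is a heuristic: it works best for roughly symmetric, unimodal data, and a larger `k` makes it more conservative.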

3. Discretization methods

  • Equal width binning
  • Equal frequency
  • Clustering-based
  • Entropy-based
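
The first two methods in the list can be sketched with NumPy as follows. The function names and the sample array are hypothetical; the sketch only contrasts how the bin edges are chosen.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use only the interior edges so that the maximum lands in the last bin.
    return np.digitize(values, edges[1:-1])

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins intervals holding roughly equal counts."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    return np.digitize(values, edges[1:-1])

# Hypothetical skewed sample: eight small values and one very large one.
ages = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])
width_labels = equal_width_bins(ages, 3)
freq_labels = equal_frequency_bins(ages, 3)
print(width_labels)
print(freq_labels)
```

On this skewed sample, equal-width binning places eight of the nine values in the first bin, while equal-frequency binning places three values in each bin. This is the usual reason to prefer equal-frequency bins when the data has a long tail.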

4. Dimensionality reduction

Unsupervised methods

  • Principal Component Analysis (PCA)
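
A minimal sketch of PCA using the singular value decomposition of the centered data matrix. The function name `pca` and the synthetic two-column dataset are assumptions for illustration; in practice a library implementation such as scikit-learn's `PCA` would normally be used.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal directions."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    projected = X_centered @ components.T
    return projected, components, explained_variance[:n_components]

# Hypothetical data: the second column is almost a multiple of the first,
# so a single component should capture nearly all of the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])
projected, components, variance = pca(X, n_components=1)
print(components)
print(variance)
```

PCA is unsupervised because it uses only the attribute values, never the class labels, to choose the directions.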

Supervised methods

  • Curse of dimensionality
  • Feature subset selection
  • Feature extraction
  • Entropy of each attribute
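
One possible reading of the last item is to rank attributes by how much they reduce the entropy of the class label (information gain). The sketch below follows that reading. The function names and the toy data are hypothetical, and the notes do not confirm that this is the intended scoring.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute, target):
    """Reduction in target entropy after splitting on a discrete attribute."""
    attribute = np.asarray(attribute)
    target = np.asarray(target)
    conditional = 0.0
    for value in np.unique(attribute):
        subset = target[attribute == value]
        conditional += len(subset) / len(target) * entropy(subset)
    return entropy(target) - conditional

# Hypothetical toy data: 'outlook' determines the label, 'windy' does not.
outlook = np.array(['sunny', 'sunny', 'rain', 'rain'])
windy = np.array(['yes', 'no', 'yes', 'no'])
play = np.array(['no', 'no', 'yes', 'yes'])
gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)
print(gain_outlook, gain_windy)
```

Attributes with higher gain are better candidates to keep during feature subset selection. The score relies on the class labels, which is what makes the approach supervised.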

    ## 5.