DM_05_04

Import Libraries


In [ ]:
%matplotlib inline
import pylab
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope

pylab.rcParams.update({'font.size': 14})

Read CSV


In [ ]:
df = pd.read_csv("AnomalyData.csv")
df.head()

Save state_code to label outliers later. The "data" frame holds only the quantitative variables, the columns from "data science" through "Openness."


In [ ]:
state_code = df["state_code"]
data = df.loc[:, "data science": "Openness"]

Univariate Outliers

Create a box plot to display univariate outliers on "modern dance."


In [ ]:
param = "modern dance"

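Get the quartiles and compute 1.5 times the interquartile range (IQR). A value farther than that from the first or third quartile counts as an outlier.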


In [ ]:
qv1 = data[param].quantile(0.25)  # first quartile
qv2 = data[param].quantile(0.5)   # median (not used in the outlier limits)
qv3 = data[param].quantile(0.75)  # third quartile
qv_limit = 1.5 * (qv3 - qv1)      # 1.5 times the IQR

Get positions of outliers and use state_code for labels.


In [ ]:
un_outliers_mask = (data[param] > qv3 + qv_limit) | (data[param] < qv1 - qv_limit)
un_outliers_data = data[param][un_outliers_mask]
un_outliers_name = state_code[un_outliers_mask]
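
As a quick check of the 1.5 x IQR rule, here is a minimal sketch on a small made-up series. The values and the single-letter labels are hypothetical and are not taken from AnomalyData.csv; the labels only stand in for state_code.

```python
import pandas as pd

# Hypothetical values with letter labels standing in for state codes.
values = pd.Series([10, 12, 11, 13, 12, 11, 40, 12, 10, -15],
                   index=list("ABCDEFGHIJ"))

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
limit = 1.5 * (q3 - q1)  # distance beyond the quartiles that counts as outlying

# Same mask logic as above: beyond the upper limit OR below the lower limit.
mask = (values > q3 + limit) | (values < q1 - limit)
flagged = values[mask]
print(flagged)
```

Because the mask is a boolean Series aligned on the same index as the values, it can be reused to pick the matching labels, which is what the state_code lookup above relies on.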

Create box plot for "modern dance."


In [ ]:
fig = pylab.figure(figsize=(4,6))
ax = fig.add_subplot(1, 1, 1)
for name, y in zip(un_outliers_name, un_outliers_data):
    ax.text(1, y, name)
ax.boxplot(data[param])
ax.set_ylabel(param)
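
The manual limits line up with the plot because matplotlib's boxplot draws its whiskers at 1.5 times the IQR by default (the whis parameter). The sketch below, on made-up values and with a non-interactive backend chosen only so it runs outside a notebook, compares the plotted flier points with the manually computed ones.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display is needed for this check
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values, not taken from AnomalyData.csv.
values = np.array([10, 12, 11, 13, 12, 11, 40, 12, 10, -15], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
limit = 1.5 * (q3 - q1)
manual = values[(values > q3 + limit) | (values < q1 - limit)]

fig, ax = plt.subplots(figsize=(4, 6))
result = ax.boxplot(values)                  # default whis=1.5
fliers = result["fliers"][0].get_ydata()     # points drawn beyond the whiskers

print(sorted(manual.tolist()), sorted(fliers.tolist()))
```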

Bivariate Outliers

Create a scatterplot with an ellipse as a boundary for outliers.

Use the Google search terms "data science" and "ceo" for this example.


In [ ]:
params = ["data science", "ceo"]
params_data = df[params].to_numpy()

Compute the "elliptical envelope."


In [ ]:
# The default contamination is 0.1, so roughly 10% of the points fall outside the envelope.
ee = EllipticEnvelope()
ee.fit(params_data)

Get the names and positions of outliers.


In [ ]:
biv_outliers_mask = ee.predict(params_data) == -1
biv_outliers_data = params_data[biv_outliers_mask]
biv_outliers_name = state_code[biv_outliers_mask]
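
To see the -1 labels in isolation, here is a minimal sketch on synthetic data: a correlated two-dimensional cloud plus one planted point that runs against the correlation. The data, the seed, and the explicit contamination and random_state values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)

# Synthetic, positively correlated cloud (hypothetical data).
cov = [[1.0, 0.8], [0.8, 1.0]]
cloud = rng.multivariate_normal([0.0, 0.0], cov, size=200)

# One planted point that contradicts the correlation.
planted = np.array([[4.0, -4.0]])
X = np.vstack([cloud, planted])

envelope = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)
labels = envelope.predict(X)          # +1 for inliers, -1 for outliers

outlier_mask = labels == -1
print("planted point label:", labels[-1])
print("share flagged:", outlier_mask.mean())
```

Note that each coordinate of the planted point is not extreme enough on its own to look alarming in both directions of a univariate check; it stands out mainly because of where it sits relative to the shape of the cloud.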

Evaluate the decision function on a grid covering the data. Its zero contour is the decision boundary drawn on the scatterplot.


In [ ]:
xx, yy = np.meshgrid(np.linspace(params_data[:, 0].min(), params_data[:, 0].max(), 100),
                     np.linspace(params_data[:, 1].min(), params_data[:, 1].max(), 100))
zz = ee.decision_function(np.c_[xx.ravel(), yy.ravel()])
zz = zz.reshape(xx.shape)
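
The reason the zero level works as the boundary is that predict and decision_function agree on sign: non-negative scores are labeled +1 and negative scores -1. A minimal sketch on synthetic data (the data, seed, and grid range are assumptions) checks this on a grid:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))          # hypothetical two-variable data

envelope = EllipticEnvelope(random_state=0).fit(X)

# Grid of candidate points, flattened to (n_points, 2) for scoring.
gx, gy = np.meshgrid(np.linspace(-4, 4, 50), np.linspace(-4, 4, 50))
grid = np.c_[gx.ravel(), gy.ravel()]

scores = envelope.decision_function(grid)
labels = envelope.predict(grid)

# Inlier labels coincide with non-negative decision scores.
agree = np.array_equal(labels == 1, scores >= 0)
score_grid = scores.reshape(gx.shape)  # back to grid shape for contour plotting
print(agree, score_grid.shape)
```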

Draw the scatterplot with the elliptical envelope and label the outliers.


In [ ]:
fig = pylab.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1)
for name, xy in zip(biv_outliers_name, biv_outliers_data):
    ax.text(xy[0], xy[1], name)
ax.contour(xx, yy, zz, levels=[0], linewidths=2)
ax.scatter(params_data[:, 0], params_data[:, 1], color='black')
ax.set_xlabel(params[0])
ax.set_ylabel(params[1])

Multivariate Outliers

Use the one-class support vector machine (SVM) algorithm to classify unusual cases.


In [ ]:
# nu is an upper bound on the fraction of training errors; gamma is the RBF kernel coefficient.
ocsvm = OneClassSVM(nu=0.25, gamma=0.05)
ocsvm.fit(data)

List the names of the outlying states based on the one-class SVM.


In [ ]:
# predict returns -1 for outliers and +1 for inliers.
state_code[ocsvm.predict(data) == -1]
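
To get a feel for what nu does, here is a minimal sketch on synthetic data using the same nu and gamma as above. The three-variable data, the seed, and the far-away test point are assumptions for illustration. With nu=0.25, the share of training points labeled -1 should land somewhere near a quarter, so this setting flags many cases by design.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))           # hypothetical three-variable data

model = OneClassSVM(nu=0.25, gamma=0.05).fit(X)

train_labels = model.predict(X)         # -1 for outliers, +1 for inliers
flagged_share = (train_labels == -1).mean()

# A point far from every training case should be labeled an outlier.
far_point = np.full((1, 3), 25.0)
far_label = model.predict(far_point)[0]

print("share of training points flagged:", flagged_share)
print("far point label:", far_label)
```

The RBF kernel depends on distances between cases, so variables on very different scales can dominate the result; standardizing the columns first is one option worth considering.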