Up to this point we have been working with supervised learning - we have a set of features and labels (or outputs in the case of a regression) that we used to teach the machine. What if we don't have labels? What if all we have is a set of unlabeled data points? It is still possible to do some types of unsupervised machine learning. We'll cover a couple of these methods over the next few classes.
We'll begin with Outlier Detection. There are a couple of different ways we can think about outliers: either the outlier is a mistake in the data, or it is a genuine data point that really is different from the rest.
If we are dealing with the first case, the outlier is not real: it was, for example, a mis-recorded or mis-entered number, so the outlier is something we need to fix. However, if it is the second case, the outlier may be the thing we are looking for. We'll work with both cases and talk about strategies for dealing with them. Let's first look at how to find the outliers in the data, then we'll look at what to do with them.
I've prepared a sample dataset of fake data with a couple of outliers. Let's load the data and take a look at it graphically.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dfa = pd.read_csv("Class08_fakedata1.csv")
dfa.plot.scatter(x='A',y='B')
Out[1]:
It looks like there are a couple of data points that don't fit in with the rest of the group. Perhaps we can spot them by looking at just the 'A' column data?
In [2]:
dfa['A'].plot()
Out[2]:
It looks like one of the points is near row 100. But where is the other? Let's try plotting the 'B' column.
In [3]:
dfa['B'].plot()
Out[3]:
Well, it isn't clear from this plot that there are any problems in the 'B' column. There is another way we can visualize the data: using a boxplot. This type of plot is built from the median and quartiles of the data rather than the mean and standard deviation. The box spans the first to the third quartile (the interquartile range, or IQR), the line inside the box marks the median, and by default the whiskers reach out to the most extreme points that lie within 1.5 times the IQR of the box. Any points outside of this range are plotted as separate data points and should be visible.
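To make those limits concrete, here is a minimal sketch of how the cutoff values can be computed by hand. It assumes the usual convention (the matplotlib default, which the pandas boxplot relies on) of placing the limits 1.5 times the interquartile range beyond the first and third quartiles. The helper name `boxplot_fences` and the sample numbers are made up for illustration.

```python
import numpy as np

def boxplot_fences(values, whis=1.5):
    """Return the (lower, upper) limits beyond which points are drawn individually."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # interquartile range: the height of the box
    return q1 - whis * iqr, q3 + whis * iqr

# Made-up sample: nine ordinary values and one suspicious one.
sample = np.array([10, 11, 12, 12, 13, 13, 14, 15, 16, 40], dtype=float)

lower, upper = boxplot_fences(sample)
flagged = sample[(sample < lower) | (sample > upper)]

print("Limits: {} to {}".format(lower, upper))
print("Points drawn separately: {}".format(flagged))
```

Note that the whiskers themselves stop at the last real data point inside these limits, so they are usually a little shorter than the limits computed here.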
In [4]:
dfa.boxplot(['A','B'])
Out[4]:
This didn't really help: it maybe spotted an outlier in the 'B' data, but it has way too many in the 'A' data. Let's try going at this another way. We'll start with the 'A' data.
First we'll calculate the mean $\mu$ and standard deviation $\sigma$ for the data. Then we'll scale the data like this: we subtract the mean from each data point $x_i$ and divide by the standard deviation.
$\frac{x_i - \mu}{\sigma}$
This gives us a scaled "distance" between each data point and the mean where the scale is the standard deviation. Let's run this for the 'A' data.
In [5]:
X=dfa['A'].values.reshape(-1,1)
xv= range(0,len(X))
mean = X.mean()
stdev = X.std()
# Now calculate the scaled distance (which is always a positive number, so we take the absolute value of the distance)
dist = np.abs((X - mean)/stdev)
#Plot the data
fig, ax1 = plt.subplots()
ax1.scatter(xv,X,marker='^')
ax1.set_xlabel('Point Number')
ax1.set_ylabel('A Points')
ax2 = ax1.twinx()
ax2.set_ylabel('Scaled Distance',color='m')
ax2.scatter(xv,dist,color='m',marker='v')
Out[5]:
Again, there is one point in the 'A' data that stands out. That is certainly an outlier. But we'd like to use a tool that does this scaling for us and will let us take the 'B' data into consideration, too. Fortunately there is such a tool: the Mahalanobis distance. Let's implement this tool using just the 'A' column to compare it to our calculation.
In [6]:
from sklearn.covariance import EmpiricalCovariance
# We initialize the model and fit in one step here.
emp_covA = EmpiricalCovariance().fit(X)
# The Mahalanobis function returns the distance squared, so we take the square root.
mahal_dist = np.sqrt(emp_covA.mahalanobis(X))
# And plot the data
fig, ax1 = plt.subplots()
ax1.scatter(xv,X,marker='^')
ax1.set_xlabel('Point Number')
ax1.set_ylabel('A Points')
ax2 = ax1.twinx()
ax2.set_ylabel('Mahalanobis Distance',color='m')
ax2.scatter(xv,mahal_dist,color='m',marker='v')
Out[6]:
This looks exactly like our manual distance calculation. That's good. Now let's extend this to use both columns as the inputs. We'll again calculate the distances and plot them for each point.
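For reference, the two-column distance can also be written out with plain NumPy: $d(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$, where $\mu$ is the vector of column means and $\Sigma$ is the covariance matrix. The sketch below is only an illustration on synthetic data (not the class dataset); the function name `mahalanobis` and the test points are made up. It assumes the maximum-likelihood covariance (dividing by N), which should correspond to what `EmpiricalCovariance` estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, strongly correlated 2-D data centered on the origin.
pts = rng.multivariate_normal(mean=[0.0, 0.0],
                              cov=[[1.0, 0.8], [0.8, 1.0]],
                              size=500)

mu = pts.mean(axis=0)
# bias=True divides by N (maximum-likelihood estimate) instead of N-1.
cov = np.cov(pts, rowvar=False, bias=True)
prec = np.linalg.inv(cov)  # inverse covariance ("precision") matrix

def mahalanobis(points, mu, prec):
    """Mahalanobis distance of each row of `points` from `mu`."""
    diff = points - mu
    # Row-wise quadratic form diff^T * prec * diff gives the squared distance.
    d2 = np.einsum('ij,jk,ik->i', diff, prec, diff)
    return np.sqrt(d2)

dist = mahalanobis(pts, mu, prec)

# Two points equally far from the origin in the ordinary (Euclidean) sense:
# one lies along the trend of the data, the other cuts across it.
d_along = mahalanobis(np.array([[2.0, 2.0]]), mu, prec)[0]
d_across = mahalanobis(np.array([[2.0, -2.0]]), mu, prec)[0]

print("Along the trend:  {:.2f}".format(d_along))
print("Across the trend: {:.2f}".format(d_across))
```

The point that cuts across the trend gets a much larger distance even though both points are the same Euclidean distance from the center. That is what using both columns together buys us: a point can look ordinary in 'A' alone and in 'B' alone and still be far from the cloud of data.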
In [7]:
XAB=dfa[['A','B']].values
emp_covAB = EmpiricalCovariance().fit(XAB)
mahal_dist2 = np.sqrt(emp_covAB.mahalanobis(XAB))
fig, ax1 = plt.subplots()
ax1.scatter(xv,X,marker='^')
ax1.set_xlabel('Point Number')
ax1.set_ylabel('A Points')
ax2 = ax1.twinx()
ax2.set_ylabel('Mahalanobis Distance',color='m')
ax2.scatter(xv,mahal_dist2,color='m',marker='v')
Out[7]:
Now we can see there are a couple of points that have a Mahalanobis distance bigger than 4. That's a good indication that they are outliers for this dataset. We can plot the data as a general scatter plot and then overlay contour lines for the different Mahalanobis distances from the mean. That will let us see how the distances compare.
In [8]:
trainfig, ax = plt.subplots()
ax.scatter(dfa['A'].values,dfa['B'].values,color='black')
ax.set_title("Mahalanobis distances")
# Show contours of the distance functions
xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
                     np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
zz = np.c_[xx.ravel(), yy.ravel()]
mahal_emp_cov = emp_covAB.mahalanobis(zz)
mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
emp_cov_contour = ax.contour(xx, yy, np.sqrt(mahal_emp_cov),
                             cmap=plt.cm.PuBu_r,
                             linestyles='dashed',
                             levels=range(8))
plt.clabel(emp_cov_contour, inline=1, fontsize=10)
Out[8]:
The contours confirm that there are a couple of points with a distance bigger than 4.0. This cutoff is somewhat arbitrary: we choose how big or small we want to make it depending on the data and what we are doing with it. For now, we'll set the cutoff distance and then create a cleaned dataset that keeps only the points whose distance is less than or equal to this cutoff.
In [9]:
cutoff_distance = 4.0
dfa['dist'] = mahal_dist2
dfacl = dfa[dfa['dist'] <= cutoff_distance]
print("Initial number of points: {}".format(len(dfa.index)))
print("Cleaned number of points: {}".format(len(dfacl.index)))
In [10]:
dfa['outlier'] = (dfa['dist'] > cutoff_distance).astype('category')
print(dfa.dtypes)
dfa.head()
Out[10]:
So that first set of data was fairly easy to work with: the distribution of data points looked Gaussian (it was) and so the outliers were easy to detect using methods that depend on normal distributions. What if our data aren't normally distributed? Let's look at another fake dataset that has a couple of outliers.
In [11]:
dfb = pd.read_csv("Class08_fakedata2.csv")
dfb.plot.scatter(x='x',y='y')
Out[11]:
We can clearly see that there are two points that don't fit the pattern. How do we identify them? We could try a boxplot.
In [12]:
dfb.boxplot('y')
Out[12]:
That doesn't get us anywhere. How about trying the Mahalanobis distance?
In [13]:
xy=dfb[['x','y']].values
emp_covxy = EmpiricalCovariance().fit(xy)
mahal_distxy = np.sqrt(emp_covxy.mahalanobis(xy))
trainfig, ax = plt.subplots()
ax.scatter(dfb['x'].values,dfb['y'].values,color='black')
ax.set_title("Mahalanobis distances")
# Show contours of the distance functions
xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
                     np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
zz = np.c_[xx.ravel(), yy.ravel()]
mahal_emp_cov = emp_covxy.mahalanobis(zz)
mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
emp_cov_contour = ax.contour(xx, yy, np.sqrt(mahal_emp_cov),
                             cmap=plt.cm.PuBu_r,
                             linestyles='dashed')
ax.clabel(emp_cov_contour, inline=1, fontsize=10)
ax.set_xlabel('x')
ax.set_ylabel('y')
Out[13]:
That isn't helpful, either. We need a new technique. We'll start by creating a machine learning model to predict the data. In this case we'll use a support vector regression. It should end up fitting the majority of the data, which look like they are good. We'll also print out the score for the model fit. For a regression model this is the $R^2$ score (the coefficient of determination) rather than a classification accuracy: a value of 1 means the predictions match the data perfectly.
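For a regressor, scikit-learn's `score` method returns the coefficient of determination, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. As a small worked sketch of what that number means, the function below computes it by hand; the name `r_squared` and the toy values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # what the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation in the data
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                       # predictions match exactly
baseline = r_squared(y_true, np.full(4, y_true.mean()))   # always predict the mean
close = r_squared(y_true, [1.1, 1.9, 3.2, 3.8])           # small errors everywhere

print(perfect, baseline, close)
```

A perfect fit scores 1, always predicting the mean scores 0, and a model that does worse than predicting the mean scores below 0. Because the residuals are squared, a few points with large errors can pull the score down noticeably.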
In [14]:
from sklearn.svm import SVR
svrmodel = SVR()
svrmodel.fit(dfb['x'].values.reshape(-1,1), dfb['y'].values)
X_plot =np.linspace(-1, 1, 1000)
Y_pred = svrmodel.predict(X_plot[:,None])
fig, ax = plt.subplots()
# First plot our points
ax.scatter(dfb['x'].values,dfb['y'].values,color='black')
ax.plot(X_plot,Y_pred,c='r')
ax.set_xlabel('x')
ax.set_ylabel('y')
print("Score: {}".format(svrmodel.score(dfb['x'].values.reshape(-1,1),dfb['y'].values)))
Like we hoped, the model looks like it follows most of the data. Now we can look to see if we can find those outliers. We'll plot the model predictions against the actual y values.
In [15]:
pred = svrmodel.predict(dfb['x'].values.reshape(-1,1))
plt.scatter(dfb['y'].values,pred,color='blue')
plt.xlabel('Actual y values')
plt.ylabel('Model predicted y values')
Out[15]:
If the model prediction matches the actual value, the data point lies on a line with a slope of 1. We see that most of the data points are on that line! And there are two points that are off the line: those are our outliers. Another way to visualize this is to calculate the residuals: the difference between the actual y values and the predicted y values. We'll create a new column of the residuals and plot them.
In [16]:
dfb['residual'] = np.abs(dfb['y'].values - pred)
dfb.plot.scatter(x='x',y='residual')
Out[16]:
Now we can sort the data by the residual column and look for the largest values.
In [17]:
dfb.sort_values('residual',ascending=False).head()
Out[17]:
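The residual idea does not depend on which model we use, as long as the model follows the bulk of the data. As a rough, self-contained sketch, the example below repeats the steps on synthetic data with two outliers injected at known positions. Note that it swaps in a degree-5 polynomial fit from NumPy as a stand-in for the SVR, and that the data, the outlier positions, and the threshold value are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic curve with a little noise (not the class dataset).
x = np.linspace(-1.0, 1.0, 200)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.05, size=x.size)

# Inject two artificial outliers at known row positions.
y[50] += 1.5
y[150] -= 1.5

# Stand-in model: a degree-5 polynomial least-squares fit (NOT the SVR).
coeffs = np.polyfit(x, y, deg=5)
pred = np.polyval(coeffs, x)

# Absolute residuals: how far each point sits from the fitted curve.
residual = np.abs(y - pred)

threshold = 0.5  # hypothetical cutoff, chosen by eye for this synthetic data
flagged = np.flatnonzero(residual > threshold)

print("Rows flagged as outliers: {}".format(flagged))
```

With only two bad points out of 200, the fit is dominated by the good data, so the injected points end up with residuals far larger than the noise and are the only rows above the threshold.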
Now we'll create a new dataset where we select only the data where the residuals are less than a threshold value. We'll try fitting this data to see if our model has improved without the outliers.
In [18]:
threshold = 0.3
dfbcl = dfb[dfb['residual'] <= threshold]
svrmodel2 = SVR(C=10,gamma=1)
svrmodel2.fit(dfbcl['x'].values.reshape(-1,1), dfbcl['y'].values)
Y_pred = svrmodel2.predict(X_plot[:,None])
fig, ax = plt.subplots()
# First plot our points
ax.scatter(dfbcl['x'].values,dfbcl['y'].values,color='black')
ax.plot(X_plot,Y_pred,c='r')
print("Score: {}".format(svrmodel2.score(dfb['x'].values.reshape(-1,1),dfb['y'].values)))
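It looks like the score went up a little bit and the model looks a little better. Keep in mind that this second model also uses different settings (C=10 and gamma=1), so the change in score is not due to removing the outliers alone. Certainly we don't have those outliers any more!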
Take a look at the fake data in Class08_fakedata3.csv. There are 3 outlier points. Find them and check with me to see if you found them.
You may or may not have outliers in your data. Check and see and report on what you found (if anything).