Q3

In this question, you'll compute basic statistics of data and determine the best way to represent it under different circumstances.

A

In this question, you'll work with 2-dimensional data. The following code uses a subset of the UCI ML Boston housing prices dataset. It has been used in many machine learning papers, usually in a regression application. However, in our case, we'll look at how to summarize this data.

At the end of the following code block, you'll see a matrix X that is created for you. This matrix has 506 rows (data points) and 2 columns (dimensions), and is visualized in a scatter plot. You'll notice most of the data clusters together between 0 and 10 on the x-axis, but there are a few data points that are clear outliers.

Your job is to compute a summary statistic of this data that is robust to these outliers. Using your knowledge of summary statistics from lecture, compute a single 2D data point that summarizes the data. You can check your work by passing the data X and your summary statistic s to the function plot_data_and_stat, which will visualize the data as blue dots and the statistic as a yellow pentagon. If your statistic is robust to outliers, it should fall in the big cluster of data points, not in the open space between the cluster and the outliers.

Your method can include some data pre-processing! You just can't use any pre-packaged "outlier detection" methods, unless you implement them yourself, of course. Your method can be as complex or as simple as you'd like, so long as it meets the robust-to-outliers requirement.
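To make "robust to outliers" concrete, here is a minimal sketch on synthetic data. It is not the housing subset and not the intended solution; the names X_demo, mean_stat, and median_stat are made up for illustration, and the coordinate-wise median is just one well-known robust statistic among several you could choose.

```python
import numpy as np

# Synthetic stand-in data (an assumption, NOT the housing subset):
# 95 points in a main cluster near (5, 18) plus 5 distant outliers near (24, 20).
rng = np.random.default_rng(0)
cluster = rng.normal(loc=[5.0, 18.0], scale=[1.5, 2.0], size=(95, 2))
outliers = rng.normal(loc=[24.0, 20.0], scale=[0.1, 0.1], size=(5, 2))
X_demo = np.vstack([cluster, outliers])

# The mean is dragged toward the outliers along the x-axis.
mean_stat = X_demo.mean(axis=0)

# The coordinate-wise median takes the median of each column independently
# and stays close to the centre of the main cluster.
median_stat = np.median(X_demo, axis=0)

print("mean:  ", mean_stat)
print("median:", median_stat)
```

Either array has the 2-element shape that plot_data_and_stat expects, so the same check described above would apply to whatever statistic you end up computing on X.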



In [24]:

    
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston

# Call this with just the data X to visualize it.
def plot_data(X, alpha = 1.0):
    plt.scatter(X[:, 0], X[:, 1], alpha = alpha)

##############
# Call this WITH YOUR 2D SUMMARY STATISTIC to visualize it simultaneously with the data.
# "statistic" should be a 2-element array.
# - 1st element is x-value
# - 2nd element is y-value
##############
def plot_data_and_stat(X, statistic):
    plt.scatter(X[:, 0], X[:, 1], alpha = 0.05, label = 'Data')
    plt.plot(statistic[0], statistic[1], c = 'y', marker = 'p', ms = 12.0, mec = 'k', mew = 1.0, label = 'Statistic')
    plt.legend(loc = 0)

##########################
# 
# SETUP CODE STARTS HERE
# 
##########################
    
X = load_boston()['data'][:, [8, 10]]  # two clusters
print(X.shape)
plot_data(X)



In [ ]:

B

Justify your solution in the previous question. Are there any circumstances where you'd advise against this approach (i.e., are there any weaknesses to your solution)?
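As one illustration of the kind of weakness this part asks about, the sketch below assumes the chosen statistic was a coordinate-wise median (only one possible choice; yours may differ). The data is hypothetical: two equal-size clusters, stored in a made-up array X_two. In that situation the median lands in the empty space between the clusters rather than inside either one.

```python
import numpy as np

# Hypothetical data: two equal-size clusters of 50 points each,
# one around the origin and one shifted by +10 in both coordinates.
a = np.column_stack([np.linspace(-1.0, 1.0, 50), np.linspace(-1.0, 1.0, 50)])
b = a + 10.0
X_two = np.vstack([a, b])

# With an even split, each column's median is the average of the largest
# value in the first cluster and the smallest value in the second cluster.
median_two = np.median(X_two, axis=0)
print(median_two)

# No data point lies anywhere near that location along the x-axis.
gap = np.min(np.abs(X_two[:, 0] - median_two[0]))
print(gap)
```

Whether this matters depends on the data: it is only a concern when no single cluster clearly dominates.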