Analyze: Are you able to pick the relevant method or library to resolve specific stated questions?
Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?
By the end of this notebook, you will be expected to:
- Have a basic understanding of the concept of a distribution;
- Use the shape of a distribution, and distribution parameters, to distinguish between bias and noise; and
- Use the Python Imaging Library.
- Exercise 1: Reducing the effect of bias in data.
- Exercise 2: Understanding the effect of increasing sample size on estimated shape parameters of noisy and biased data.
This notebook compares noise and bias through a practical demonstration, as discussed by Arek Stopczynski in the Module 3 video content. A data set will be generated, and you will build on your understanding of how to create reusable functions. All examples demonstrated in the previous notebooks were created manually, or were based on external libraries and the functions contained therein. In your own work, you will likely repeat specific patterns of execution frequently that either do not exist in external libraries, or are unknown to you. You can define such functions in the notebook itself, or you can start to build libraries of functions that you reuse between projects, or even share between different individuals.
In order to demonstrate the principles of noise and bias, a set of functions will be used to create random points on the supplied image. Following this, an example will demonstrate the difference between noise and bias.
This notebook makes use of a new Python library called the Python Imaging Library (PIL). It is used in the first section of this notebook, and again later (in a brief deviation from the course content) to demonstrate how to manipulate images.
Note:
Students who may not be familiar with the advanced technical content should read the “Create functions” subsection for a logical description of steps, and should not be concerned if the code and syntax do not make sense. The code is included to demonstrate the principles. Advanced students will also benefit from having the example code accessible.
In [ ]:
from scipy.stats import norm  # Used below in place of the deprecated matplotlib.mlab.normpdf.
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
from PIL import Image
import random
import copy
import numpy as np
import operator
# Set plotting options.
%matplotlib inline
Bias and noise errors are to be expected in data collected in experiments, so one of the critical skills you will use in data analysis is identifying and estimating these errors. In this example, you will explore these errors using a simulation of GPS measurements of a student population, based on an experiment alluded to by Arek Stopczynski in the video content. The aim of the experiment was to estimate the time students spend indoors and outdoors.
Note:
Read the comments that precede the function descriptions, and ensure you understand the difference between noise and bias, before completing the exercises that follow.
In order to demonstrate the difference between noise and bias, a number of functions will be created.
In [ ]:
# Check if the point is inside a building based on the color of the point on the image.
def is_point_inside_building(x, y, pix, im, s=5):
    building_color = (244, 243, 236, 255)
    is_inside = False
    for v_x in range(max(x - s, 0), min(x + s, im.size[0])):
        for v_y in range(max(y - s, 0), min(y + s, im.size[1])):
            if pix[v_x, v_y] == building_color:
                is_inside = True
    return is_inside
# Add the markers to the map.
def put_points_on_map(points, pix, im, point_size=5):
    for point in points:
        put_point_on_map(point[0], point[1], point[2], pix, im)
# Set the color of the point based on whether the point is inside or outside the building. (Inside = blue; outside = red.)
def put_point_on_map(x, y, is_inside, pix, im, point_size=5):
    for v_x in range(max(x - point_size, 0), min(x + point_size, im.size[0])):
        for v_y in range(max(y - point_size, 0), min(y + point_size, im.size[1])):
            if is_inside:
                pix[v_x, v_y] = (0, 0, 255)   # color='blue'
            else:
                pix[v_x, v_y] = (255, 0, 0)   # color='red'
# Generate random points to be added to the image, using randint from the random library.
# https://docs.python.org/2/library/random.html
def generate_random_points(pix, im, n=200, threshold=0.5):
    points = set()
    frac_inside = 0
    while len(points) < n:
        # Use size-1 so that the generated coordinates stay within the image bounds.
        x = random.randint(0, im.size[0] - 1)
        y = random.randint(0, im.size[1] - 1)
        is_inside = is_point_inside_building(x, y, pix, im)
        if len(points) > 0:
            frac_inside = len([v for v in points if v[2]]) / float(len(points))
        # Skip outside points until the requested fraction of inside points is reached.
        if frac_inside < threshold and not is_inside:
            continue
        points.add((x, y, is_inside))
    return points
# Calculate the time spent inside by using the number of observations as a proxy for actual time.
def calculate_time_inside(points):
    return len([v for v in points if v[2]]) / float(len(points))
# Set plotting options and plot the diagram using Matplotlib.
def plot_map(points, im, pix, s, point_size=5):
    put_points_on_map(points, pix, im)
    f = plt.figure()
    plt.imshow(im)
    plt.title(s)
    f.set_size_inches(18.5, 10.5)
# Create a new points set in which random errors are introduced, using shuffle from the random library.
# https://docs.python.org/2/library/random.html
def introduce_random_error(points, error=0.4):
    points_list = list(points)
    random.shuffle(points_list)
    return set(points_list[:int(len(points_list) * error)])
# Simulate the effect of random error, using the previously created functions.
def calculate_random_error(points, error=0.4, k=100):
    xx = []
    for i in range(k):
        points_copy = copy.deepcopy(points)
        points_copy = introduce_random_error(points_copy, error=error)
        xx.append(calculate_time_inside(points_copy))
    plot_results(xx, 'Histogram for noisy data.')
    # Return the points from the final iteration, so that a single realization can be plotted on the map.
    return points_copy
# Simulate the effect of bias, using the previously created functions.
def calculate_bias(points, error=0.4, bias=0.6, k=100):
    xx = []
    for i in range(k):
        points_copy = copy.deepcopy(points)
        points_copy = introduce_bias(points_copy, error=error, bias=bias)
        xx.append(calculate_time_inside(points_copy))
    plot_results(xx, 'Histogram for biased data.')
    # Return the points from the final iteration, so that a single realization can be plotted on the map.
    return points_copy
# Produce plots using our previously created function.
# The mean and standard deviation are printed using NumPy functions.
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
# http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html
def plot_results(xx, s='Histogram'):
    plt.figure()
    plt.hist(xx)
    plt.axvline(np.mean(xx), color='r', linestyle='dashed', linewidth=2)
    plt.title(s)
    print('mean = {}, std = {}.'.format(np.mean(xx), np.std(xx)))
# Create a new points set in which bias is introduced, using choice and random from the random library.
# https://docs.python.org/2/library/random.html
def introduce_bias(points, error=0.4, bias=0.7):
    new_points = copy.deepcopy(points)
    # Remove points until only the fraction specified by "error" remains. Points inside
    # buildings are removed with probability "bias", and points outside with probability
    # (1 - bias), so the retained sample systematically over-represents one group.
    while len(new_points) > int(len(points) * error):
        point = random.choice(list(new_points))
        is_inside = point[2]
        if is_inside:
            if random.random() < bias:
                new_points.remove(point)
        else:
            if random.random() > bias:
                new_points.remove(point)
    return new_points
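Before applying these functions to the map image, it may help to see how the noise helpers compose on a tiny, hand-made data set. The following sketch is illustrative only (the coordinates, and the names toy_points and noisy_toy_points, are arbitrary choices, not part of the original example): it exercises calculate_time_inside and introduce_random_error without needing an image.
In [ ]:
# A minimal sketch: four hand-made observations of the form (x, y, is_inside).
toy_points = {(10, 20, True), (30, 40, True), (50, 60, False), (70, 80, False)}

# Half of the toy observations are inside a building, so this prints 0.5.
print(calculate_time_inside(toy_points))

# Randomly keep 50% of the observations. The resulting fraction inside
# fluctuates from run to run; this run-to-run fluctuation is the noise.
noisy_toy_points = introduce_random_error(toy_points, error=0.5)
print(calculate_time_inside(noisy_toy_points))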
Before demonstrating the concepts of bias and variance (noise), it is instructive to introduce some basic statistical concepts used to describe data. In statistics, a distribution captures the notion of the shape of a data set or variable. Formally, a distribution is a description of the relative number of times each possible outcome will occur in a number of trials. A number of statistical distributions exist, each expressed using a mathematical equation that describes its shape, for modeling a data set's underlying generating mechanism. The most common of these is the normal distribution, which is described by a bell-shaped curve. Comparing statistical estimates from an observed data set to those of an assumed, fitted distribution is an important technique that helps to detect abnormal trends and occurrences in the data. Visualizing the data helps to reveal deviations from a trend, and to detect anomalous patterns.
The normal distribution referred to above is used to describe data that tends to cluster around a central value, with no bias towards values to the left or right of that central value. In other words, 50% of the values occur below the central value, and 50% above it. The central value is referred to as the mean or average. You may recall the following example from Module 1, which used the normal distribution to generate IQ data and plot it in a histogram.
In [ ]:
# Example data.
mu = 100 # Mean of distribution.
sigma = 15 # Standard deviation of distribution.
x = mu + sigma * np.random.randn(10000)
num_bins = 50
# The histogram of the data.
n, bins, patches = plt.hist(x, num_bins, density=True, facecolor='green', alpha=0.5)
# Add a 'best fit' line.
y = norm.pdf(bins, mu, sigma)
plt.plot(bins, y, 'r--')
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'Histogram of IQ: $\mu=100$, $\sigma=15$')
In the IQ data example that was just plotted, the data has a central tendency (or mean $\mu$) of 100, and a standard deviation $\sigma$ of 15. The standard deviation defines a range above and below the mean within which a certain percentage of the data lies; in other words, it describes how spread out the data is. It is a useful parameter because you can make the following probability assertions about any value drawn from a normal distribution:
- Approximately 68% of values lie within one standard deviation of the mean;
- Approximately 95% of values lie within two standard deviations of the mean; and
- Approximately 99.7% of values lie within three standard deviations of the mean.
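These percentages can be checked empirically. The short sketch below is not part of the original example; it simply draws a large sample from the IQ distribution above and counts the fraction of values within one, two, and three standard deviations of the mean.
In [ ]:
# Empirically verify the percentages stated above using a large sample.
samples = np.random.normal(loc=100, scale=15, size=100000)
for n_std in (1, 2, 3):
    within = np.mean(np.abs(samples - 100) <= n_std * 15)
    print('Within {} standard deviation(s): {:.1%}'.format(n_std, within))
With these basic concepts in place, you can now return to the GPS example and generate the base data set.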
In [ ]:
# Create image object and load properties to a variable using the image library from PIL.
# Base image.
im_org = Image.open("dtu_map.png")
pix_org = im_org.load()
# Random error example.
im_random_error = Image.open("dtu_map.png")
pix_random_error = im_random_error.load()
# Bias example.
im_bias = Image.open("dtu_map.png")
pix_bias = im_bias.load()
## Generate random points.
# The function, defined previously, accepts the image, its properties, number of points and a threshold for points inside as parameters.
points = generate_random_points(pix_org, im_org, n=500, threshold=0.7)
## Calculate and print the time spent inside.
ts = calculate_time_inside(points)
print('Proportion of time spent inside (based on number of observations): {}.'.format(ts))
In [ ]:
# Plot the generated dataset.
plot_map(points, im_org, pix_org, 'Plot of generated random points on image.')
In the video content, Arek Stopczynski indicates that noise can be countered by collecting more data, because the mean of the noisy observations is centered around the true value. This can be visualized using the functions created earlier: a histogram is a convenient way of demonstrating this behavior. The function also prints the mean and standard deviation, and returns the points from a single iteration, which allows you to plot a new map showing the corrupted result set.
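Before running the full simulation, the intuition can be illustrated with a minimal sketch that is independent of the map example: when measurements contain only unbiased noise, their mean approaches the true value as the number of measurements grows. (The true value of 0.7 and the noise level of 0.1 below are assumed purely for illustration.)
In [ ]:
# Unbiased noise averages out as the sample size grows.
true_value = 0.7
for n in (10, 100, 10000):
    measurements = true_value + np.random.normal(0, 0.1, size=n)
    print('n = {:>5}: estimated mean = {:.4f}'.format(n, measurements.mean()))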
In [ ]:
# Introduce random error to the data assigned to the "points" variable
# and calculate "time spent inside" for the corrupted data.
# Repeat 1000 times and plot the histogram of the "time spent inside" values.
# Also, plot a map from a single iteration of the error generation algorithm.
random_error_points = calculate_random_error(points, k=1000, error=0.4)
plot_map(random_error_points, im_random_error, pix_random_error, 'Plot of generated random points with random errors on image.')
When systematic errors (bias) are present in your data set, you will need to employ alternative methods to clean your data. Take note of the shift in the histogram: it is no longer centered around the expected mean. As in the previous example, the mean and standard deviation are printed, and the updated map is plotted.
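By contrast with noise, a systematic offset does not average out. In the sketch below (again independent of the map example, with an assumed offset of 0.1), the estimate stays away from the true value no matter how many measurements are averaged.
In [ ]:
# A systematic offset (bias) remains in the estimate regardless of sample size.
true_value = 0.7
bias_offset = 0.1  # Assumed systematic error, for illustration only.
for n in (10, 100, 10000):
    measurements = true_value + bias_offset + np.random.normal(0, 0.1, size=n)
    print('n = {:>5}: estimated mean = {:.4f}'.format(n, measurements.mean()))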
In [ ]:
# Introduce bias error to the data assigned to the "points" variable
# and calculate "time spent inside" for the corrupted data.
# Repeat 1000 times and plot the histogram of the "time spent inside" values.
# Also, plot a map from a single iteration of the error generation algorithm.
bias_points = calculate_bias(points, k=1000, bias=0.6, error=0.4)
plot_map(bias_points, im_bias, pix_bias, 'Plot of generated random points with bias on image.')
Exercise 1: Reducing the effect of bias in data.
Based on the simulations above, describe how you would reduce the effect of bias in this data set.
Your markdown answer.
In [ ]:
Exercise complete:
This is a good time to "Save and Checkpoint".
Using the data generation and plotting functions introduced above, describe the effect on the estimated values of the mean and standard deviation when you increase the number of data points in:
Noisy data; and
Biased data.
Hint:
Try using values of 1000 and 10000 for $k$ in each case. You will need to make use of the following functions, and change the value of $k$ as indicated:
calculate_random_error()
calculate_bias()
In [ ]:
# Your solution here
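# For example, a sketch of one possible approach (uncomment and run each call
# in turn, then compare the printed means and standard deviations yourself):
# calculate_random_error(points, k=1000)
# calculate_random_error(points, k=10000)
# calculate_bias(points, k=1000)
# calculate_bias(points, k=10000)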
Exercise complete:
This is a good time to "Save and Checkpoint".
While you will not spend much time on this library in this course, this section of the notebook serves as a brief introduction to the Python Imaging Library, which was utilized in the previous section. Python can be applied to a far wider range of tasks than many users realize. In the previous example, an image was used as an input file, and randomly-generated points were classified as inside or outside based on the pixel color at each point on the image.
You can obtain similar images from Google Maps, and perform some basic data exploration and wrangling on them. You will need to update the colors used in the functions to reflect those present in your image; it is therefore advised that you select a basic image with few color variations.
In the previous example, it was assumed that everyone was outside, and a point's status was updated to inside if the pixel color matched the building color. This is much easier than trying to account for all the possible variations in color that you may come across.
Note:
The simplistic approach applied in this trivial example does not scale, and should not be used for image processing. Image processing is outside of the scope of this course and requires different tools and infrastructure configurations than those that have been provisioned. Attempts to perform non-trivial operations on your virtual analysis environment may result in the depletion of the resources required to complete this course, and should be avoided.
In the following example, Python tools and libraries are used to wrangle the data. This example, therefore, starts with the question of which colors are present in the image.
While this example is by no means extensive in demonstrating the capabilities of either Python or the PIL library, the aim is to emphasize that input data does not always have to be traditional data files.
In [ ]:
im = Image.open("dtu_map.png")
print('image format:{}, image size: {}, image mode: {}'.format(im.format, im.size, im.mode))
In [ ]:
# Create color dictionary.
colors = {}
for color in im.getdata():
colors[color] = colors.get(color, 0) + 1
# Create sorted list.
sorted_c = sorted(colors.items(), key=operator.itemgetter(1), reverse=True)
# Display the first 5 records of the list.
sorted_c[:5]
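As an aside, the same frequency count can be written more compactly using Counter from the standard library's collections module; a minimal equivalent sketch:
In [ ]:
# An equivalent, more idiomatic frequency count using collections.Counter.
from collections import Counter
color_counts = Counter(im.getdata())
# Display the five most common colors and their counts.
color_counts.most_common(5)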
In the previous example, the RGBA color (244, 243, 236, 255) was used to identify buildings. The next cell replaces this color with another, (50, 50, 255, 255). You can search for RGBA color codes online and substitute a color of your choice for the second one, should you wish to do so.
In [ ]:
# Obtain pixel data and replace a specific color in the image with another.
pixdata = im.load()
# Change the building color.
for y in range(im.size[1]):
    for x in range(im.size[0]):
        if pixdata[x, y] == (244, 243, 236, 255):  # Check for the current building color.
            pixdata[x, y] = (50, 50, 255, 255)     # Replace it with the new color.
# Save the updated image.
im.save('my_updated_file.png')
Note:
You can replace the filename in this markdown cell with "my_updated_file.png", as per the previous cell, if you want to display any changes you made to the image.
Look at the updated image:
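Alternatively, the saved file can be displayed directly from a code cell using the imshow function imported earlier; a minimal sketch:
In [ ]:
# Display the recolored image inline using Matplotlib.
im_updated = Image.open('my_updated_file.png')
imshow(im_updated)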