Bootstrap replicates

A bootstrap replicate is a statistic computed from a resampled version of the data; in other words, we repeat the same experiment a given number of times.


In [1]:
# importing required modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

# running the functions script
%run stats_func.py

In [2]:
# loading the iris dataset
df = pd.read_csv('iris.csv')
df.head()


Out[2]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa

In [3]:
# extracting sepal length and sepal width for further analysis
sepalLength = np.array(df['SepalLengthCm'])
sepalWidth = np.array(df['SepalWidthCm'])

The experiments below use the Sepal Length variable, and we focus on the bootstrap technique applied to the mean.


In [4]:
# calculating the mean from data
np.mean(sepalLength)


Out[4]:
5.8433333333333337

Use of numpy.random.choice for bootstrapping

For bootstrapping we will use the numpy.random.choice function, which randomly draws values (with replacement, by default) from a given variable to form a new data set of the same size. A function of choice can then be applied to calculate the required statistic; in our case, the mean.

The code of the function:

def bootstrap_replicate_1d(data, func):
    '''Generate bootstrap replicate of 1D data'''
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)
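
To make the resampling step concrete, here is a minimal sketch on a made-up five-element array (the values and the seed are illustrative assumptions, not iris data). Because numpy.random.choice samples with replacement by default, some values may appear more than once in the resample and others not at all:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, only to make the sketch repeatable

toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data, not from iris.csv

# Draw len(toy) values with replacement (replace=True is the default)
bs_sample = np.random.choice(toy, size=len(toy))

print("original :", toy)
print("resample :", bs_sample)
print("mean of original:", np.mean(toy))
print("mean of resample:", np.mean(bs_sample))
```

Each call produces a different resample, which is why the bootstrap mean varies from call to call.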

To show how resampling changes the mean value, the function is called three times in the cells below:


In [5]:
bootstrap_replicate_1d(sepalLength, np.mean)


Out[5]:
5.8513333333333337

In [6]:
bootstrap_replicate_1d(sepalLength, np.mean)


Out[6]:
5.8300000000000001

In [7]:
bootstrap_replicate_1d(sepalLength, np.mean)


Out[7]:
5.7560000000000002

It would be interesting to call the bootstrap_replicate_1d function many times to see how the mean value varies. For that reason another function is defined:

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""
    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

In this example we will call the bootstrap_replicate_1d function 10,000 times.


In [8]:
bs_replicates = draw_bs_reps(sepalLength, np.mean, size=10000)
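
The Python loop in draw_bs_reps is fine for 10,000 replicates, but the same idea can be expressed without a loop. The sketch below is one possible variant, not the version in stats_func.py: the name draw_bs_reps_vectorized and the seed parameter are hypothetical, and it assumes func accepts an axis argument (as np.mean and np.median do).

```python
import numpy as np

def draw_bs_reps_vectorized(data, func, size=1, seed=None):
    """Draw bootstrap replicates without a Python loop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # One row of random indices per replicate, sampled with replacement
    idx = rng.integers(0, len(data), size=(size, len(data)))
    # Apply the statistic along each row of resampled values
    return func(data[idx], axis=1)

# Usage on made-up data (not the iris measurements)
reps = draw_bs_reps_vectorized(np.arange(10.0), np.mean, size=1000, seed=0)
print(reps.shape)
```

The trade-off is memory: the index array holds size × len(data) integers, so for very large inputs the loop version may be the safer choice.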

In [9]:
# the histogram shows how the mean sepal length varies when the experiment is repeated 10,000 times
# (density=True replaces the normed argument, which was removed from recent Matplotlib releases)
plt.hist(bs_replicates, bins=30, density=True, edgecolor='black')
plt.xlabel("Mean sepal length [cm]")
plt.ylabel('PDF');



In [10]:
# calculating the 95% confidence interval for the mean based on the bootstrap technique
conf_intervals = np.percentile(bs_replicates, [2.5, 97.5])
conf_intervals


Out[10]:
array([ 5.70866667,  5.97735   ])
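
As a rough sanity check, a percentile bootstrap interval for a mean can be compared with the normal approximation, mean ± 1.96 × standard error. The sketch below uses synthetic, normally distributed data (the location, scale and seed are arbitrary assumptions, not the iris values) to show that the two intervals tend to agree closely for a sample of this size:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for repeatability

# Synthetic sample of 150 values; NOT the iris measurements
data = rng.normal(loc=5.8, scale=0.8, size=150)

# Percentile bootstrap interval for the mean
reps = np.array([np.mean(rng.choice(data, size=len(data)))
                 for _ in range(10000)])
boot_ci = np.percentile(reps, [2.5, 97.5])

# Normal-approximation interval: mean +/- 1.96 * standard error of the mean
sem = np.std(data, ddof=1) / np.sqrt(len(data))
normal_ci = np.mean(data) + np.array([-1.96, 1.96]) * sem

print("bootstrap CI:", boot_ci)
print("normal CI   :", normal_ci)
```

For strongly skewed data or small samples the two intervals can differ noticeably, which is one reason to prefer the bootstrap in those cases.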

Repeating the same exercise for Sepal Width


In [11]:
bs_replicates2 = draw_bs_reps(sepalWidth, np.mean, size=10000)

# the histogram shows how the mean sepal width varies when the experiment is repeated 10,000 times
# (density=True replaces the normed argument, which was removed from recent Matplotlib releases)
plt.hist(bs_replicates2, bins=30, density=True, edgecolor='black')
plt.xlabel("Mean sepal width [cm]")
plt.ylabel('PDF');



In [12]:
# calculating the 95% confidence interval for the mean based on the bootstrap technique
conf_intervals2 = np.percentile(bs_replicates2, [2.5, 97.5])
conf_intervals2


Out[12]:
array([ 2.98665   ,  3.12466667])
