THE ANSCOMBE QUARTET

Authors

Ndèye Gagnessiry Ndiaye and Christin Seifert

License

This work is licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/

This notebook reproduces the Anscombe Data Set from https://de.wikipedia.org/wiki/Anscombe-Quartett#cite_note-Anscombe-1. It compares the following statistics:

Mean for x, y
Variance for x, y
Pearson product-moment correlation between x and y



In [1]:

    
import matplotlib.pyplot as plt
import numpy as np
import pandas
import statistics
from statistics import variance
from pylab import *
from collections import OrderedDict

The Anscombe quartet consists of four sets of data points. Each of these four sets consists of eleven ( x , y ) points. We create the Anscombe data. The x values are the same for the first three sets.



In [2]:

    
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],dtype= float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],dtype= float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

data_set = [('Set I_x',x1),
         ('Set I_y',y1),
         ('Set II_x',x1),
         ("Set II_y",y2),
         ("Set III_x",x1),
         ('Set III_y', y3),
         ("Set IV_x",x4),
         ('Set IV_y',y4)]
data_set = OrderedDict(data_set)
pandas.DataFrame(data_set)

The table below compares the mean of x, the variance of x, the mean of y, the variance of y and the Pearson product-moment correlation between x and y of the four sets.



In [89]:

    
# Calculate statistics for each set
metric_list = ["Mean of x", "Mean of y", "Variance of x","Variance of y","Pearson product-moment correlation(x,y) "]
set_list = ["Set I", "Set II", "Set III","Set IV"]
data = np.array([[mean(x1),mean(y1) , variance(x1),variance(y1), np.corrcoef(x1,y1)[1,0]  ],
                 [mean(x1),mean(y2) , variance(x1), variance(y2), np.corrcoef(x1,y2)[1,0] ],
                 [mean(x1), mean(y3), variance(x1),variance(y3), np.corrcoef(x1,y3)[1,0]  ],
                 [mean(x4), mean(y4), variance(x4),variance(y4), np.corrcoef(x4,y4)[1,0]  ] ])


pandas.DataFrame(data, set_list, metric_list)









    Out[89]:







  
    
      
      Mean of x
      Mean of y
      Variance of x
      Variance of y
      Pearson product-moment correlation(x,y)
    
  
  
    
      Set I
      9.0
      7.500909
      11.0
      4.127269
      0.816421
    
    
      Set II
      9.0
      7.500909
      11.0
      4.127629
      0.816237
    
    
      Set III
      9.0
      7.500000
      11.0
      4.122620
      0.816287
    
    
      Set IV
      9.0
      7.500909
      11.0
      4.123249
      0.816521

The four sets have nearly the same statistical numbers different only in the second digit after the decimal point.

The figure below plots the sets. They have almost identical simple statistical properties, but plotted look very different.



In [3]:

    
fig = plt.figure(figsize=(15,10))

# Set I
ax1 = fig.add_subplot(221)
ax1.scatter(x1, y1, c='orangered',edgecolors= 'orangered')
m,b = np.polyfit(x1, y1, 1)
X = np.linspace(ax1.get_xlim()[0], ax1.get_xlim()[1], 100)
ax1.set_title("Set I")
ax1.plot(X, m*X+b, '-')

# Set II
ax2 = fig.add_subplot(222)
ax2.scatter(x1, y2, c='orangered',edgecolors= 'orangered')
m,b = np.polyfit(x1, y1, 1)
X = np.linspace(ax2.get_xlim()[0], ax2.get_xlim()[1], 100)
ax2.set_title("Set II")
ax2.plot(X, m*X+b, '-')

# Set III
ax3 = fig.add_subplot(223)
ax3.scatter(x1, y3, c='orangered',edgecolors= 'orangered')
m,b = np.polyfit(x1, y3, 1)
X = np.linspace(ax3.get_xlim()[0], ax3.get_xlim()[1], 100)
ax3.set_title("Set III")
ax3.plot(X, m*X+b, '-')
        
# Set IV
ax4 = fig.add_subplot(224)
ax4.scatter(x4, y4, c='orangered',edgecolors= 'orangered')
m,b = np.polyfit(x4, y4, 1)
X = np.linspace(ax4.get_xlim()[0], ax4.get_xlim()[1], 100)
ax4.set_title("Set IV")
ax4.plot(X, m*X+b, '-')

plt.show()



In [ ]:

	Set I_x	Set I_y	Set II_x	Set II_y	Set III_x	Set III_y	Set IV_x	Set IV_y
0	10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
1	8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
2	13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
3	9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
4	11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
5	14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6	6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
7	4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
8	12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
9	7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
10	5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

	Mean of x	Mean of y	Variance of x	Variance of y	Pearson product-moment correlation(x,y)
Set I	9.0	7.500909	11.0	4.127269	0.816421
Set II	9.0	7.500909	11.0	4.127629	0.816237
Set III	9.0	7.500000	11.0	4.122620	0.816287
Set IV	9.0	7.500909	11.0	4.123249	0.816521