In a catalog, each galaxy's measurements come with "error bars" providing information about how uncertain we should be about each property of each galaxy.
This means that the distribution of "observed" galaxy properties (as reported in the catalog) is not the same as the underlying or "intrinsic" distribution.
Let's look at the distribution of observed sizes in the SDSS photometric object catalog.
In [25]:
%load_ext autoreload
%autoreload 2
In [26]:
from __future__ import print_function
import numpy as np
import SDSS
import pandas as pd
import matplotlib
%matplotlib inline
In [27]:
galaxies = "SELECT top 1000 \
petroR50_i AS size, \
petroR50Err_i AS err \
FROM PhotoObjAll \
WHERE \
(type = '3' AND petroR50Err_i > 0)"
print (galaxies)
In [4]:
# Download data. This can take a few moments...
data = SDSS.select(galaxies)
data.head()
Out[4]:
In [5]:
!mkdir -p downloads
data.to_csv("downloads/SDSSgalaxysizes.csv")
In [23]:
data = pd.read_csv("downloads/SDSSgalaxysizes.csv",usecols=["size","err"])
data['size'].hist(bins=np.linspace(0.0,5.0,100),figsize=(12,7))
matplotlib.pyplot.xlabel('Size / arcsec',fontsize=16)
matplotlib.pyplot.title('SDSS Observed Size',fontsize=20)
Out[23]:
Things to notice:
Are these large galaxies actually large, or have they just been measured that way? Let's look at the reported uncertainties on these sizes:
In [8]:
data.plot(kind='scatter', x='size', y='err',s=100,figsize=(12,7));
In [9]:
def generate_galaxies(mu=np.log10(1.5),S=0.3,N=1000):
return pd.DataFrame({'size' : 10.0**(mu + S*np.random.randn(N))})
In [22]:
mu = np.log10(1.5)
S = 0.05
intrinsic = generate_galaxies(mu=mu,S=S,N=1000)
intrinsic.hist(bins=np.linspace(0.0,5.0,100),figsize=(12,7),color='green')
matplotlib.pyplot.xlabel('Size / arcsec',fontsize=16)
matplotlib.pyplot.title('Intrinsic Size',fontsize=20)
Out[22]:
Now let's add some observational uncertainty. We can model this by drawing random Gaussian offsets $\epsilon$ and add one to each intrinsic size.
In [11]:
def make_noise(sigma=0.3,N=1000):
return pd.DataFrame({'size' : sigma*np.random.randn(N)})
In [24]:
sigma = 0.3
errors = make_noise(sigma=sigma,N=1000)
observed = intrinsic + errors
observed.hist(bins=np.linspace(0.0,5.0,100),figsize=(12,7),color='red')
matplotlib.pyplot.xlabel('Size / arcsec',fontsize=16)
matplotlib.pyplot.title('Observed Size',fontsize=20)
Out[24]:
In [13]:
both = pd.DataFrame({'SDSS': data['size'], 'Model': observed['size']}, columns=['SDSS', 'Model'])
both.hist(alpha=0.5,bins=np.linspace(0.0,5.0,100),figsize=(12,7))
Out[13]:
One last thing: let's look at the variances of these distributions.
Recall:
$V(x) = \frac{1}{N} \sum_{i=1}^N (x_i - \nu)^2$
If $\nu$, the population mean of $x$, is not known, an estimator for $V$ is
$\hat{V}(x) = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2$
where $\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$, the sample mean.
In [14]:
V_data = np.var(data['size'])
print ("Variance of the SDSS distribution = ",V_data)
In [15]:
V_int = np.var(intrinsic['size'])
V_noise = np.var(errors['size'])
V_obs = np.var(observed['size'])
print ("Variance of the intrinsic distribution = ", V_int)
print ("Variance of the noise = ", V_noise)
print ("Variance of the observed distribution = ", V_int + V_noise, \
"cf", V_obs)
You may recall this last result from previous statistics courses.
Why is the variance of our mock dataset's galaxy sizes so much smaller than that of the SDSS sample?
The procedure of drawing numbers from the first, and then adding numbers from the second, produced mock data - which then appeared to have been drawn from:
which is broader than either the intrinsic distribution or the error distribution.
The three distributions are related by an integral:
${\rm Pr}(R_{\rm obs}|\mu,S,\sigma) = \int {\rm Pr}(R_{\rm obs}|R_{\rm true},\sigma) \; {\rm Pr}(R_{\rm true}|\mu,S) \; dR_{\rm true}$
In [28]:
from IPython.display import Image
Image(filename="samplingdistributions.png",width=300)
Out[28]: