In a catalog, each galaxy's measurements come with "error bars" providing information about how

*uncertain*we should be about each property of each galaxy.This means that the distribution of "observed" galaxy properties (as reported in the catalog) is not the same as the underlying or "intrinsic" distribution.

Let's look at the distribution of

*observed sizes*in the SDSS photometric object catalog.

```
In [16]:
```%load_ext autoreload
%autoreload 2

```
```

```
In [18]:
```from __future__ import print_function
import numpy as np
import SDSS
import pandas as pd
import matplotlib
%matplotlib inline

```
In [19]:
```galaxies = "SELECT top 1000 \
petroR50_i AS size, \
petroR50Err_i AS err \
FROM PhotoObjAll \
WHERE \
(type = '3' AND petroR50Err_i > 0)"
print (galaxies)

```
```

```
In [20]:
```# Download data. This can take a few moments...
data = SDSS.select(galaxies)
data.head()

```
Out[20]:
```

```
In [21]:
```!mkdir -p downloads
data.to_csv("downloads/SDSSgalaxysizes.csv")

```
In [22]:
```data = pd.read_csv("downloads/SDSSgalaxysizes.csv",usecols=["size","err"])
data['size'].hist(bins=np.linspace(0.0,5.0,100),figsize=(12,7))
matplotlib.pyplot.xlabel('Size / arcsec',fontsize=16)
matplotlib.pyplot.title('SDSS Observed Size',fontsize=20)

```
Out[22]:
```

Things to notice:

- No small objects (why not?)
- A "tail" to large size
- Some very large sizes that look a little odd

*actually* large, or have they just been measured that way?
Let's look at the reported uncertainties on these sizes:

```
In [23]:
```data.plot(kind='scatter', x='size', y='err',s=100,figsize=(12,7));

```
```

- Let's look at how distributions like this one can come about, by making a
**generative model**for this dataset.

- First, let's imagine a set of perfectly measured galaxies. They won't all have the same size, because the Universe isn't like that. Let's suppose the logarithm of their
*intrinsic sizes*are drawn from a Gaussian distribution of width $S$ and mean $\mu$.

- To model one mock galaxy, we
*draw a sample from this distribution*. To model the whole dataset, we draw 1000 samples.

- Note that this is a similar activity to making random catalogs for use in correlation function summaries; here, though, we want to start comparing real data with mock data to begin
*understanding*it.

```
In [24]:
```def generate_galaxies(mu=np.log10(1.5),S=0.3,N=1000):
return pd.DataFrame({'size' : 10.0**(mu + S*np.random.randn(N))})

```
In [25]:
```mu = np.log10(1.5)
S = 0.05
intrinsic = generate_galaxies(mu=mu,S=S,N=1000)
intrinsic.hist(bins=np.linspace(0.0,5.0,100),figsize=(12,7),color='green')
matplotlib.pyplot.xlabel('Size / arcsec',fontsize=16)
matplotlib.pyplot.title('Intrinsic Size',fontsize=20)

```
Out[25]:
```

```
In [26]:
```def make_noise(sigma=0.3,N=1000):
return pd.DataFrame({'size' : sigma*np.random.randn(N)})

```
In [27]:
```sigma = 0.3
errors = make_noise(sigma=sigma,N=1000)
observed = intrinsic + errors
observed.hist(bins=np.linspace(0.0,5.0,100),figsize=(12,7),color='red')
matplotlib.pyplot.xlabel('Size / arcsec',fontsize=16)
matplotlib.pyplot.title('Observed Size',fontsize=20)

```
Out[27]:
```

```
In [28]:
```both = pd.DataFrame({'SDSS': data['size'], 'Model': observed['size']}, columns=['SDSS', 'Model'])
both.hist(alpha=0.5,bins=np.linspace(0.0,5.0,100),figsize=(12,7))

```
Out[28]:
```

One last thing: let's look at the variances of these distributions.

Recall:

$V(x) = \frac{1}{N} \sum_{i=1}^N (x_i - \nu)^2$

If $\nu$, the population mean of $x$, is not known, an *estimator* for $V$ is

$\hat{V}(x) = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2$

where $\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$, the *sample mean*.

```
In [29]:
```V_data = np.var(data['size'])
print ("Variance of the SDSS distribution = ",V_data)

```
```

```
In [30]:
```V_int = np.var(intrinsic['size'])
V_noise = np.var(errors['size'])
V_obs = np.var(observed['size'])
print ("Variance of the intrinsic distribution = ", V_int)
print ("Variance of the noise = ", V_noise)
print ("Variance of the observed distribution = ", V_int + V_noise, \
"cf", V_obs)

```
```

You may recall this last result from previous statistics courses.

**Why is the variance of our mock dataset's galaxy sizes so much smaller than that of the SDSS sample?**

The procedure of drawing numbers from the first, and then adding numbers from the second, produced *mock data* - which then appeared to have been drawn from:

- ${\rm Pr}(R_{\rm obs}|\mu,S,\sigma)$

which is *broader* than either the intrinsic distribution or the error distribution.

The three distributions are related by an integral:

${\rm Pr}(R_{\rm obs}|\mu,S,\sigma) = \int {\rm Pr}(R_{\rm obs}|R_{\rm true},\sigma) \; {\rm Pr}(R_{\rm true}|\mu,S) \; dR_{\rm true}$

- Note that this is
*not*a convolution, in general - but it's similar to one.

- When we only plot the 1D histogram of
*observed*sizes, we are*summing over*or "marginalizing out" the intrinsic ones.

```
In [31]:
```from IPython.display import Image
Image(filename="samplingdistributions.png",width=300)

```
Out[31]:
```

- Each "node" (circle or dot) in the graph above represents a
*probability distribution*

- Nodes marked with dots represent PDFs that are
*delta functions*- that is, parameters that are asserted to have specific values (like we did with $\mu$ for example: ${\rm Pr}(\mu) = \delta(\mu - \log_{10}{1.5}$).

- The "edges" in the graph show the conditional dependence of the parameters with each other: a node with an edge leading to it is a
*conditional probability distribution*.

- We do not need to specify the functional form for any of the other PDFs before writing down the PGM - it just illustrates the connections between parameters in our probabilistic model.

- It's often helpful to write down the PGM
*before*writing down the probability distributions, as we'll see later.

- The integral over the intrinsic size is not implied by this graph: the PGM just shows how to generate observed sizes given other parameters.

```
In [ ]:
```