```
In [15]:
```import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import norm

the issue we have is the following: we are drawing indendent random numbers from a binary distribution of probability $p$ (think: the probability of a certain person liking the color blue) and we have two groups (think: male and female). Those two groups dont necessarily have the same size.

The question we ask is what difference we can expect in the spread of the ex-post estimation of $p$

We first define our population parameters

```
In [16]:
```N_people = 500
ratio_female = 0.30
proba = 0.40

of course we could have done this analytically using Normal approximation: we have two independent Normal random variables, both with expectation $p$. The variance of the $male$ variable is $p(1-p)/N_{male}$ and the one of the female one accordingly. The overall variance of the difference (or sum, it does not matter here because they are uncorrelated) is $$ var = p(1-p)\times \left(\frac{1}{N_{male}} + \frac{1}{N_{female}}\right) $$ Using the female/male ratio $r$ instead we can write for the standard deviation $$ sd = \sqrt{var} = \sqrt{\frac{1}{N}\frac{p(1-p)}{r(1-r)}} $$ meaning that we expect the difference in estimators for male and female of the order of $sd$

```
In [17]:
```def the_sd(N, p, r):
N = float(N)
p = float(p)
r = float(r)
return sqrt(1.0/N*(p*(1.0-p))/(r*(1.0-r)))

```
In [18]:
```def sd_func_factory(N,r):
def func(p):
return the_sd(N,p,r)
return func
f = sd_func_factory(N_people, ratio_female)
f2 = sd_func_factory(N_people/2, ratio_female)

`sd`

), or $5.9\%$ for if only half of the people replied (`sd2`

)

```
In [19]:
```p = linspace(0,0.25,5)
f = sd_func_factory(N_people, ratio_female)
f2 = sd_func_factory(N_people/2, ratio_female)
sd = list(map(f, p))
sd2 = list(map(f2, p))
pd.DataFrame(data= {'p':p, 'sd':sd, 'sd2':sd2})

```
Out[19]:
```

that's the same relationship as a plot

```
In [20]:
```p = linspace(0,0.25,50)
sd = list(map(f, p))
sd2 = list(map(f2, p))
plot (p,p, 'k')
plot (p,p-sd, 'g--')
plot (p,p+sd, 'g--')
plot (p,p-sd2, 'r--')
plot (p,p+sd2, 'r--')
grid(b=True, which='major', color='k', linestyle='--')

```
```

```
In [21]:
```z=linspace(1.,3,100)
plot(z,1. - (norm.cdf(z)-norm.cdf(-z)))
grid(b=True, which='major', color='k', linestyle='--')
plt.title("Probability of being beyond Z (2-sided) vs Z")

```
Out[21]:
```

```
In [31]:
```number_of_tries = 1000

We do some intermediate calculations...

```
In [32]:
```N_female = int (N_people * ratio_female)
N_male = N_people - N_female

...and then generate our random numbers...

```
In [33]:
```data_male = np.random.binomial(n=1, p=proba, size=(number_of_tries, N_male))
data_female = np.random.binomial(n=1, p=proba, size=(number_of_tries, N_female))

```
In [34]:
```proba_male = map(numpy.mean, data_male)
proba_female = map(numpy.mean, data_female)
proba_diff = list((pm-pf) for pm,pf in zip(proba_male, proba_female))
np.mean(proba_diff), np.std(proba_diff)

```
Out[34]:
```

```
In [ ]:
```