Spearman Rank Coefficient with Errors

Dennis asked how I was handling errors when calculating the rank correlation between e.g. size ratio and B/T. I wasn't accounting for errors, which is bad because the errors associated with the size ratio are quite variable.

To address this, I am implementing a monte-carlo error estimation for the spearman rank test.


In [74]:
import numpy as np
from pylab import *
%matplotlib inline
from scipy.stats.stats import spearmanr
from scipy.stats.mstats import normaltest
import warnings
warnings.filterwarnings('ignore')
import sys
sys.path.append("/Users/rfinn/Dropbox/pythonCode/")
sys.path.append("/anaconda/lib/python2.7/site-packages")
sys.path.append("/Users/rfinn/Ureka/variants/common/lib/python2.7/site-packages")

In [75]:
def spearman_with_errors(x,y,yerr,Nmc=1000,plotflag=False):
    ysim=np.zeros(Nmc,'f')
    rhosim=np.zeros(Nmc,'f')
    psim=np.zeros(Nmc,'f')
    for i in range(Nmc):
        ysim=np.random.normal(y,scale=yerr,size=len(y))
        rhosim[i],psim[i] = spearmanr(x,ysim)
    cave=np.mean(rhosim)
    cstd=np.std(rhosim)
    q1=50-34 # mean minus one std
    lower=np.percentile(rhosim,q1)
    q2=50+34 # mean minus one std
    upper=np.percentile(rhosim,q2)
    print 'mean = %5.2f, std = %5.2f'%(cave,cstd)
    print 'confidence interval from sorted list of MC fit values:'
    print 'lower = %5.2f (%5.2f), upper = %5.2f (%5.2f)'%(lower,cave-cstd, upper,cave+cstd)
    k,pnorm=normaltest(rhosim)
    print 'probability that distribution is normal = %5.2f'%(pnorm)
    if plotflag:
        plt.figure(figsize=(10,4))
        plt.subplot(1,2,1)
        plt.hist(rhosim,bins=10,normed=True)
        plt.xlabel(r'$Spearman \ \rho $')
        plt.axvline(x=cave,ls='-',color='k')
        plt.axvline(x=lower,ls='--',color='k')
        plt.axvline(x=upper,ls='--',color='k')
        plt.subplot(1,2,2)
        plt.hist(np.log10(psim),bins=10,normed=True)
        plt.xlabel(r'$\log_{10}(p \ value)$')
        plt.figure()
        plt.hexbin(rhosim,np.log10(psim))
        plt.xlabel(r'$Spearman \ \rho $')
        plt.ylabel(r'$\log_{10}(p \ value)$')
    return rhosim,psim

In [76]:
%run ~/Dropbox/pythonCode/LCSanalyzeblue.py


%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%

 prefix = 
all
%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%
PREFIX =  all

The Plot Dennis is referring to

Here is the plot of relative size vs. B/T. The right panel now shows the weighted mean in 5 bins vs B/T, so that is ok, although I am still not sure how to properly calculate the error in the weighted mean.


In [73]:
s.plotsizeBTblue()


[ 0.12        0.18000001  0.14        0.          0.34        0.05        0.22
  0.06        0.08        0.05        0.19        0.17        0.1         0.09
  0.02        0.          0.17        0.12        0.01        0.16        0.03
  0.07        0.09        0.40000001  0.1         0.15000001  0.          0.06
  0.27000001  0.11        0.11        0.03        0.23999999  0.17        0.12
  0.23        0.18000001  0.14        0.06        0.12        0.08        0.1
  0.25999999  0.02        0.04        0.09        0.          0.02        0.
  0.          0.69        0.19        0.03        0.02        0.1         0.02
  0.19        0.05        0.          0.19        0.03        0.64999998
  0.13        0.5         0.12        0.25999999  0.40000001  0.28999999
  0.          0.23      ] [ 0.47052389  0.38414401  0.23821501  0.77706665  0.20416158  0.76825398
  0.56480855  0.5804944   0.66424572  0.85999548  0.28608885  0.54184186
  0.36836162  0.61810166  0.80757105  0.47963026  0.45952863  0.69738179
  0.39682415  0.66043717  0.72247475  0.81206751  0.51526517  0.40053669
  0.81112099  0.75237596  0.80476749  0.21000375  0.62923038  0.40771228
  0.56135929  0.83003813  0.23042849  1.07600498  0.41129124  0.3771607
  0.66378087  0.33588114  0.8215881   0.53499657  0.62143767  0.36367753
  0.43279523  0.7667436   0.5633983   0.49729267  0.56953973  0.41814819
  0.80112261  0.68355811  0.19910863  0.65558773  0.54363078  0.55154163
  0.51045036  0.85481858  0.60620594  0.26745525  0.53927702  0.21956599
  0.36400738  0.1346392   0.45189422  0.21680041  0.68022084  0.34073141
  0.23649736  0.53238189  0.22750311  0.42872402] [ 0.01392034  0.05908264  0.00582964  0.12233033  0.04070742  0.0175813
  0.01230934  0.02296134  0.05763692  0.01165634  0.02115727  0.0096735
  0.01790053  0.01324119  1.25057042  0.6449843   0.46230903  0.88770288
  2.09502864  0.01631939  0.0281391   0.17086858  0.10813841  0.00748938
  0.0706467   0.01732622  2.11345863  0.02258834  0.05191611  0.01975799
  0.02568036  0.04161122  0.01064439  0.3462477   0.00889278  0.01218941
  0.01060422  0.0795773   0.15640345  0.00983063  0.0539565   0.00922232
  1.31250727  0.05273678  0.04949782  0.29486412  0.05193701  0.19858368
  0.08474055  0.02561063  0.00444195  0.04846012  0.00567002  2.42226624
  0.00364426  0.00935479  0.00825841  0.00447213  0.01940363  0.03882358
  0.0053852   0.00421248  0.00536405  0.00668069  0.01930372  0.00349522
  0.00268352  0.03663713  0.00248705  0.00578542]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-73-ef84d3ac99d9> in <module>()
----> 1 s.plotsizeBTblue()

/Users/rfinn/Dropbox/pythonCode/LCSanalyzeblue.py in plotsizeBTblue(self, scalepoint, blueflag, medianflag)
   1343                 sp=pl.scatter(x[point_flags[i]],y[point_flags[i]],c=self.logstellarmass[point_flags[i]],vmin=mstarmin,vmax=mstarmax,cmap='jet',s=size,label='GALFIT')
   1344                 print x[point_flags[i]],y[point_flags[i]],yerror[point_flags[i]]
-> 1345                 (rho,p)=spearman_with_errors(x[point_flags[i]],y[point_flags[i]],yerror[point_flags[i]])
   1346 
   1347                 ax=pl.gca()

NameError: global name 'spearman_with_errors' is not defined

Fixing the spearman rank correlation

I still need to account for errors when calculating the rank correlation coefficient in the left panels. For this, I am using a monte carlo approach to calculate the spearman rank correlation for many random realizations of the data. I am assuming that the galfit errors are normally distributed. This is likely incorrect, but I'm not sure how to handle it otherwise.


In [24]:
flag = s.sampleflag & ~s.agnflag & s.gim2dflag
membflag = flag & s.membflag
fieldflag = flag & ~s.membflag

In [65]:
t=spearman_with_errors(s.s.B_T_r[membflag], s.s.SIZE_RATIO[membflag],s.s.SIZE_RATIOERR[membflag],Nmc=1000,plotflag=True)


mean = -0.36, std =  0.08
confidence interval from sorted list of MC fit values:
lower = -0.44 (-0.44), upper = -0.28 (-0.28)
probability that distribution is normal =  0.02

In [61]:
t=spearman_with_errors(s.s.B_T_r[fieldflag], s.s.SIZE_RATIO[fieldflag],s.s.SIZE_RATIOERR[fieldflag],plotflag=True)


mean = -0.41, std =  0.05
confidence interval from sorted list of MC fit values:
lower = -0.46 (-0.46), upper = -0.35 (-0.36)
probability that distribution is normal =  0.00

Result

  • The rho values are smaller, but the correlations are still significant.
  • The correlation is slightly stronger for field galaxies than member galaxies.

In [ ]: