Statistics using SAS

  • Pearson chi-square test
  • Cramer's V statistic
  • Mantel-Haenszel chi-square test (Ordinal association)
  • Spearman Correlation Statistic
  • Cohen's kappa coefficient-A Test of Agreement
  • Yates' correction for continuity (Yates's chi-squared test)
  • Fisher’s Exact Test (Small Samples)
  • Profile Likelihood Confidence Intervals (Small Samples)
  • Hosmer and Lemeshow test
  • Three Classical Tests (LR, LM, Wald)
  • Goodness of fit vs Predictive Power
  • Deviance ($-2\log L$), AIC, and SC for comparing models; the SC value has a larger penalty than AIC for adding terms to the model (definitions below this list)
  • Quasi-complete Separation
  • Generalized $R^2 = 1-e^{-L^{2}/n}$, where $L^{2}$ is the likelihood-ratio chi-square
  • McFadden's $R^2 = \frac{L^{2}}{-2\log L_{0}}$, where $L_{0}$ is the likelihood of the intercept-only model
  • Tjur’s $R^{2}$

There are a few more pseudo-$R^2$ measures listed on Wikipedia.
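
For reference, with $\log L$ the maximized log-likelihood, $k$ the number of estimated parameters, and $n$ the sample size:

$$\text{AIC} = -2\log L + 2k, \qquad \text{SC} = -2\log L + k\,\log(n)$$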

  • $\text{Tau-a} = \frac{C-D}{N}$
  • $Gamma = \frac{C-D}{C+D}$
  • $\text{Somers' D} = \frac{C-D}{C+D+T}$
  • $c = 0.5\,(1 + \text{Somers' D}) = \frac{C+0.5T}{C+D+T}$
  • The ROC curve (plotted from the model-predicted probabilities, not bias-corrected probabilities) is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection[1] in machine learning. The false positive rate is also known as the fall-out or probability of false alarm[1] and can be calculated as (1 − specificity).
  • Sensitivity (also called the true positive rate, the recall, or probability of detection[1] in some fields) measures the proportion of positives that are correctly identified as such (i.e. the percentage of sick people who are correctly identified as having the condition).
  • Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (i.e., the percentage of healthy people who are correctly identified as not having the condition).
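
In confusion-matrix terms (TP, FP, TN, FN for true/false positives/negatives):

$$\text{Sensitivity (TPR)} = \frac{TP}{TP+FN}, \qquad \text{Specificity (TNR)} = \frac{TN}{TN+FP}, \qquad \text{FPR} = 1-\text{Specificity} = \frac{FP}{FP+TN}$$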

  • Latent Variable Model

  • Standardize to standard-deviation units for comparability of coefficients: $\beta^{*}_{j} = \frac{\beta_{j}\sigma_{j}}{\sigma_{d}}$, $j = 1, \dots, k$, where $\beta^{*}_{j}$ is the standardized coefficient, $\sigma_{d}$ is the standard deviation of the dependent variable, and $\sigma_{j}$ is the standard deviation of $x_{j}$.
      PROC FREQ TABLES options: chisq measures cl expected cellchi2
      PROC LOGISTIC plot requests: plots(only)=(effect); plots(only)=(effect oddsratio);
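
A minimal sketch of where these options attach (the variable lists are illustrative; the options themselves are standard PROC FREQ TABLES and PROC LOGISTIC PLOTS options):

      proc freq data=pen;
        tables deaths*black / chisq measures cl expected cellchi2;
      run;

      proc logistic data=logreg.penalty plots(only)=(effect oddsratio);
        model death(event='1') = blackd whitvic serious;
      run;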

Diagnostic Statistics:

  • Linear predictor—Predicted log-odds for each case. In matrix notation, this is $\mathbf{x}\boldsymbol{\beta}$, so it’s commonly referred to as XBETA.
  • Standard error of linear predictor—Used in generating confidence intervals.
  • Predicted values—Predicted probability of the event, based on the estimated model and values of the explanatory variables. For grouped data, this is the expected number of events.
  • Confidence intervals for predicted values—Confidence intervals are first calculated for the linear predictor by adding and subtracting an appropriate multiple of the standard error. Then, to get confidence intervals around the predicted values, the upper and lower bounds on the linear predictor are substituted into $1/(1+e^{-x})$, where x is either an upper or a lower bound.
  • Deviance residuals—Contribution of each observation to the deviance chi-square.
  • Pearson residuals—Contribution of each observation to the Pearson chi-square.

       PROC LOGISTIC PLOTS= requests: INFLUENCE DFBETAS PHAT DPC LEVERAGE
  • DFBETAS—These statistics tell you how much each regression coefficient changes when a particular observation is deleted. The actual change is divided by the standard error of the coefficient.

  • DIFDEV—Change in deviance with deletion of the observation.
  • DIFCHISQ—Change in Pearson chi-square with deletion of the observation.
  • C and CBAR—Measures of overall change in regression coefficients, analogous to Cook’s distance in linear regression.
  • LEVERAGE—Measures how extreme the observation is in the space of the explanatory variables. The leverage is the diagonal of the "hat" matrix.
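
A hedged sketch of how these diagnostics can be written to a dataset with the OUTPUT statement of PROC LOGISTIC (the output dataset name and the variable names after each keyword are arbitrary):

    PROC LOGISTIC DATA=logreg.penalty;
      MODEL death(EVENT='1') = blackd whitvic culp;
      OUTPUT OUT=diag XBETA=xb STDXBETA=se_xb PRED=phat LOWER=lcl UPPER=ucl
             RESDEV=r_dev RESCHI=r_pearson DFBETAS=_ALL_
             DIFDEV=difdev DIFCHISQ=difchisq C=c CBAR=cbar H=leverage;
    RUN;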

Logistic Regression

The logistic function ${\displaystyle \sigma (t)}$ is defined as follows:

$$ \sigma (t)={\frac {e^{t}}{e^{t}+1}}={\frac {1}{1+e^{-t}}}$$

Odds Ratio:

$${\displaystyle \mathrm {OR} ={\frac {\operatorname {odds} (x+1)}{\operatorname {odds} (x)}}={\frac {\left({\frac {F(x+1)}{1-F(x+1)}}\right)}{\left({\frac {F(x)}{1-F(x)}}\right)}}={\frac {e^{\beta _{0}+\beta _{1}(x+1)}}{e^{\beta _{0}+\beta _{1}x}}}=e^{\beta _{1}}}$$

Logistic response function (inverse logit):

$$\pi(x) = \frac{e^{\beta_{0}+\beta_{1} x}}{1+e^{\beta_{0}+\beta_{1} x}}$$

Likelihood Function:

$$ l(\beta) = \prod_{i=1}^n \pi(x_{i})^{y_{i}} \, [1 - \pi(x_{i})]^{1- y_{i}} $$

Log-likelihood Function:

$$L(\beta) = \ln[l(\beta)] = \sum_{i=1}^{n}\left\{ y_{i} \ln[\pi(x_{i})] + (1 - y_{i}) \ln[1 - \pi(x_{i})]\right\}$$

Logit Model:

$$ g(F(x))=\ln \left({\frac {F(x)}{1-F(x)}}\right)=\boldsymbol{X \beta}$$

where $F(x)={\frac {1}{1+e^{-(\boldsymbol{X \beta})}}}$

Probit Model:

$$\Phi^{-1}(p_{i}) = \boldsymbol{X \beta} $$

$$\begin{aligned}\Pr(Y=1\mid X) &= \Pr(Y^{\ast }>0)\\ &= \Pr(X^{T}\beta +\varepsilon >0)\\ &= \Pr(\varepsilon >-X^{T}\beta )\\ &= \Pr(\varepsilon <X^{T}\beta ) && \text{by symmetry of the normal distribution}\\ &= \Phi (X^{T}\beta )\end{aligned}$$

Deviance:

$$D=-2\ln {\frac {\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}}$$

LR test:

$$ D = -2 \sum_{i=1}^{n}\left[ y_{i} \ln\!\left(\frac{\widehat{\pi}_{i}}{y_{i}}\right) + (1-y_{i}) \ln\!\left(\frac{1 - \widehat{\pi}_{i}}{1-y_{i}}\right)\right]$$

where $\widehat{\pi}_{i} = \widehat{\pi}(x_{i})$. For individual-level binary data the saturated model fits each observation exactly, so

$$l(\text{saturated model}) = \prod_{i=1}^{n} y_{i}^{y_{i}} (1-y_{i})^{(1-y_{i})} = 1$$

and therefore

$$D = -2 \ln(\text{likelihood of the fitted model}).$$

Marginal Effect:

$$\frac{\partial p_{i}}{\partial x_{i}} = \beta p_{i} (1 - p_{i})$$

Using OLS to analyze dichotomous dependent variables violates two OLS assumptions:

  • Homoscedasticity
  • Normally distributed errors

The variance of the disturbance $\varepsilon_i$ must be different for different observations and, in particular, it varies as a function of $x$. The disturbance variance is at a maximum when $p_i = .5$ and gets small when $p_i$ is near 1 or 0.
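
In symbols, the disturbance variance for the linear probability model is

$$\operatorname{Var}(\varepsilon_i) = p_i\,(1-p_i),$$

which reaches its maximum of .25 at $p_i = .5$.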

Although OLS is still unbiased and consistent, it is no longer efficient.

If $x$ has no upper or lower bound, then for any value of $\beta$ there are values of $x$ for which $p_i$ is either greater than 1 or less than 0. In fact, when estimating a linear probability model by OLS, it’s quite common for predicted values generated by the model to be outside the (0, 1) interval.

An odds of 4 means we expect 4 times as many occurrences as non-occurrences. An odds of 1/5 means that we expect only one-fifth as many occurrences as non-occurrences.

Like probabilities, odds have a lower bound of 0. But unlike probabilities, there is no upper bound on the odds.
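
As a quick check of the arithmetic, with $\text{odds} = p/(1-p)$: a probability of .80 gives odds of $.80/.20 = 4$, and a probability of $1/6$ gives odds of $(1/6)/(5/6) = 1/5$.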

  • Binary logistic regression
  • Nominal logistic regression
  • Ordinal logistic regression

A probability is bounded by 0 and 1. The logit of the probability, $\log[p/(1-p)]$, removes those bounds, so the dependent variable of a logit model (the log-odds) has no lower or upper bound and can be modeled as a linear function of the explanatory variables.

If you have grouped data, there are three readily available methods: ordinary least squares, weighted least squares, and maximum likelihood.

Maximum likelihood (ML) is the third method for estimating the logistic model for grouped data and the only method in general use for individual-level data. With individual-level data, we simply observe a dichotomous dependent variable for each individual along with measured characteristics of the individual. OLS and WLS can’t be used with this kind of data unless the data can be grouped in some way. If $y_{i}$ can only have values of 1 and 0, it’s impossible to apply the logit transformation—you get either minus infinity or plus infinity. To put it another way, any transformation of a dichotomy is still a dichotomy.

First, ML estimators are known to have good properties in large samples. Under fairly general conditions, ML estimators are consistent, asymptotically efficient, and asymptotically normal. Consistency means that as the sample size gets larger the probability that the estimate is within some small distance of the true value also gets larger. No matter how small the distance or how high the specified probability, there is always a sample size that yields an even higher probability that the estimator is within that distance of the true value. One implication of consistency is that the ML estimator is approximately unbiased in large samples. Asymptotic efficiency means that, in large samples, the estimates will have standard errors that are, approximately, at least as small as those for any other estimation method. And, finally, the sampling distribution of the estimates will be approximately normal in large samples, which means that you can use the normal and chi-square distributions to compute confidence intervals and p-values.


In [ ]:
%let path=/folders/myshortcuts/_myfolders/ECST131;
libname statdata "&path";

In [ ]:
libname logreg "/folders/myshortcuts/_myfolders/SAS_LogReg";

proc print data = logreg.penalty (obs=10);
run;

In [ ]:
data pen;
set logreg.penalty;
length black $8 deaths $7 whit $8; /* set lengths so '2_NBlack' and '2_NWhite' are not truncated */
if blackd = 1 then black = '1_Black';
    else if blackd = 0 then black = '2_NBlack';
if death = 1  then deaths = '1_Death';
    else if death = 0 then deaths = '2_Life';
if whitvic = 1 then whit = '1_White';
    else if whitvic = 0 then whit = '2_NWhite';
run;

proc format;
value $bkord 
    '1_Black' = 'Blacks'
    '2_NBlack' = 'Non Blacks';
value $dthord 
    '1_Death' = 'Death'
    '2_Life' = 'Life';
value $whitvicord
    '1_White' = 'White'
    '2_NWhite' = 'Non White';
run;

proc freq data = pen;
table deaths*(black whit)/ nocol nocum norow measures;
format deaths $dthord. black $bkord. whit $whitvicord.;
run;
/*This recreates Table 2.2*/

In [ ]:
PROC REG DATA=logreg.penalty; 
    MODEL death=blackd whitvic serious / HCC HCCMETHOD=2;
RUN;

In [ ]:
/*Output 2.3*/
PROC LOGISTIC DATA=logreg.penalty;
    MODEL death(EVENT='1') = blackd whitvic serious;
RUN;

“Wald Chi-Square” is calculated by dividing each coefficient by its standard error and squaring the result. If we omitted the squaring operation (as many software packages do), we could call them z statistics, and they would have a standard normal distribution under the null hypothesis. In that case, the p-values calculated from a normal table would be exactly the same as the chi-square p-values reported here.

For instance, ${9.346 = 3.057^2}$ is the “Wald Chi-Square" of serious.

The odds ratios in the next table are obtained by simply exponentiating the coefficients in the first column, that is, calculating $e^{\beta}$. The 95 percent confidence intervals are obtained as follows. First, we get 95 percent confidence intervals around the original $\beta$ coefficients in the usual way. That is, we add and subtract 1.96 standard errors. To get confidence intervals around the odds ratios, we exponentiate those upper and lower confidence limits.
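
In symbols, the 95 percent confidence interval for an odds ratio is

$$\left(e^{\hat{\beta} - 1.96\,\widehat{se}(\hat{\beta})},\; e^{\hat{\beta} + 1.96\,\widehat{se}(\hat{\beta})}\right)$$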

For the linear probability model, a coefficient of 0.25 tells you that the predicted probability of the event increases by 0.25 for every 1-unit increase in the explanatory variable. By contrast, a logit coefficient of 0.25 tells you that the log-odds increases by 0.25 for every 1-unit increase in the explanatory variable.

For quantitative variables, it’s helpful to subtract 1 from the odds ratio and multiply by 100, that is, calculate $100(e^{\beta}-1)$. This tells us the percent change in the odds for each 1-unit increase in the independent variable. In this case, we find that a 1-unit increase in the SERIOUS scale is associated with a 21% increase in the predicted odds of a death sentence. Note that if a $\beta$ coefficient is significantly different from 0, then the corresponding odds ratio is significantly different from 1. There is no need for a separate test for the odds ratio.
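
For the SERIOUS coefficient of 0.1871, for example, $100(e^{0.1871}-1) \approx 21$ percent.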

These might be better described as adjusted odds ratios because they control for other variables in the model. Recall that BLACKD has a value of 1 for black defendants and 0 for everyone else. The odds ratio of 1.813 tells us that the predicted odds of a death sentence for black defendants are 1.813 times the odds for nonblack defendants. In other words, the odds of a death sentence for black defendants are 81% higher than the odds for other defendants. This compares with an unadjusted odds ratio of 1.47 found in Table 2.2. Although the adjusted odds ratio for BLACKD is not statistically significant (the 95% confidence interval includes 1, corresponding to no effect), it is still our best estimate of the effect of this variable.

For those who insist on interpreting logistic models in terms of probabilities, there are several graphical and tabular methods available (Long 1997). Perhaps the simplest approach is to make use of equation (2.6):

$$\frac{\partial p_{i}}{\partial x_{i}} = \beta p_{i} (1 - p_{i})$$

This equation says that the change in the probability for a 1-unit increase in x depends on the logistic regression coefficient for x, as well as on the value of the probability itself. For this to be practically useful, we need to know what probability we are starting from. If we have to choose one value, the most natural is the overall proportion of cases that have the event. In our example, 50 out of 147 defendants got the death penalty, so the overall proportion is .34. Taking .34 times (1 − .34), we get .224. We can multiply each of the coefficients in Output 2.3 by .224, and we get:

Marginal Effect for BLACKD = $0.133 = 0.5952*0.34*(1-0.34)$

Marginal Effect for WHITVIC = $0.057 = 0.2565*0.34*(1-0.34)$

Marginal Effect for SERIOUS = $0.042 = 0.1871*0.34*(1-0.34)$

We can then say that, on average, the probability of a death sentence is .133 higher if the defendant is black compared with nonblacks, .057 higher if the victim is white compared with non-white, and .042 higher for a 1-unit increase on the SERIOUS scale. These are sometimes called “marginal effects,” and they are of considerable interest in some fields. Of course, these numbers only give a rough indication of what actually happens for a given change in the x variable.
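
An alternative, sketched below, is to average $\hat{\beta}\,\hat{p}_i(1-\hat{p}_i)$ over all observations (“average marginal effects”) rather than plugging in the overall proportion. This is only a hedged sketch: the coefficients are hard-coded from Output 2.3, and the dataset names are arbitrary.

In [ ]:
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = blackd whitvic serious;
  OUTPUT OUT=pred PRED=phat;
RUN;

DATA me;
  SET pred;
  /* multiply each Output 2.3 coefficient by phat*(1-phat) for every observation */
  me_blackd  = 0.5952*phat*(1-phat);
  me_whitvic = 0.2565*phat*(1-phat);
  me_serious = 0.1871*phat*(1-phat);
RUN;

PROC MEANS DATA=me MEAN; /* the means of these variables are the average marginal effects */
  VAR me_blackd me_whitvic me_serious;
RUN;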


In [ ]:
PROC LOGISTIC DATA=logreg.penalty;
  CLASS culp(REF='5') /PARAM=REF;   /*Reference category = 5: dummy variables are*/
                                    /*created for culp=1-4; culp=5 has all dummies = 0*/
  MODEL death(EVENT='1') = blackd whitvic culp;
  CONTRAST 'Culp2 vs. Culp3' culp 0 1 -1 0;
  *TEST culp2=culp3;
RUN;

When you have a CLASS variable in a model, LOGISTIC provides an additional table, labeled “Type 3 Analysis of Effects.” For variables that are not CLASS variables, this table is completely redundant with the standard table below it—the chi-squares and p-values are exactly the same. For CLASS variables, on the other hand, it gives us something very useful: a test of the null hypothesis that all of the coefficients pertaining to this variable are zeros. In other words, it gives us a test of whether CULP has any impact on the probability of the death penalty. In this case, we clearly have strong evidence that CULP makes a difference. What’s particularly attractive about this test is that it is invariant to the choice of the omitted category, or even to the choice among very different methods for constructing design variables.


In [ ]:
/*Without specifying CLASS culp(REF='5') /PARAM=REF;*/

PROC LOGISTIC DATA=logreg.penalty;
  CLASS culp;
  MODEL death(EVENT='1') = blackd whitvic culp;
ESTIMATE 'coeff for 5' culp -1 -1 -1 -1; /*with default effect coding, the culp=5 effect is minus the sum of the other four*/
run;

In [ ]:
/* The following program is found on page 36 */

PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = blackd whitvic culp blackd*whitvic;
RUN;

In [ ]:
/* The following program is found on page 40 */

PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = blackd whitvic culp / CLODDS=BOTH CLPARM=BOTH ;
  UNITS culp=2 / DEFAULT=1; /*odds ratio for a 2-unit increase: 12.70 = 3.564**2, i.e., OR**k for a k-unit change*/
RUN;

/*The UNIT column indicates how much each independent variable is incremented 
to produce the estimated odds ratio. The default is 1 unit. For the variable CULP,
each 1-point increase on the culpability scale multiplies the odds of a death 
sentence by 3.564.*/

Quasi-complete Separation

In my experience, the most common cause of quasi-complete separation is a dummy predictor variable that has the following property: at one level of the dummy variable either every case has a 1 on the dependent variable or every case has a 0.

If you find a cell frequency of 0 in any of these tables, you’ve pinpointed a cause of quasi-complete separation (a crosstab sketch for scanning the tables follows the list below). If the problem is quasi-complete separation, there are additional options to consider:

  • Recode the problem variables. For instance, collapse the category that causes the zero cell into an adjacent category.

  • Exclude cases from the model.

  • Retain the model with quasi-complete separation but use likelihood-ratio tests.

  • Use exact methods

  • Use penalized likelihood
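
As noted above, one way to scan for zero cells is to crosstab each categorical predictor against the outcome; a hedged sketch (the suppression options just keep the output compact):

In [ ]:
PROC FREQ DATA=logreg.penalty;
  TABLES (blackd whitvic culp)*death / NOROW NOCOL NOPERCENT;
RUN;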


In [ ]:
/* The following program is found on pages 46-47 */

DATA compsep;
  INPUT x y;
  DATALINES;
1 0
2 0
3 0
4 1
5 1
6 1
;
PROC LOGISTIC;
  MODEL y(EVENT='1') = x / ITPRINT;
RUN;

/* The following program is found on page 49 */

DATA quasisep;
  INPUT x y;
  DATALINES;
1 0
2 0
3 0
4 0
4 1
5 1
6 1
;
PROC LOGISTIC;
  MODEL y(EVENT='1') = x/ ITPRINT;
RUN;

proc freq data = logreg.penalty;
table culp*death/ nocol norow nopercent chisq measures;
where blackd = 0;
run;

In [ ]:
/* The following program is found on page 50 */
PROC LOGISTIC DATA=logreg.penalty;
  WHERE blackd=0;
  CLASS culp /PARAM=REF;
  MODEL death(EVENT='1') = culp serious;
RUN;


/* The following program is found on page 52 */
DATA;
  SET logreg.penalty;
  IF culp=1 THEN culp=2;
PROC LOGISTIC;
  WHERE blackd=0;
  CLASS culp / PARAM=REF;
  MODEL death(EVENT='1')=culp serious;
RUN;


/* The following program is found on page 53 */
PROC LOGISTIC DATA=logreg.penalty;
  WHERE blackd=0 AND culp > 1;
  CLASS culp / PARAM=REF;
  MODEL death(EVENT='1')=culp serious;
RUN;


/* The following program is found on page 54 */
PROC LOGISTIC DATA=logreg.penalty;
  WHERE blackd=0;
  CLASS culp /PARAM=REF;
  MODEL death(EVENT='1') = culp serious / CLPARM=PL ALPHA=.01;
RUN;


/* The following program is found on page 55 */
PROC LOGISTIC DATA=logreg.penalty;
  WHERE blackd=0;
  CLASS culp /PARAM=REF;
  MODEL death(EVENT='1') = culp serious;
  EXACT culp serious / ESTIMATE=BOTH;
RUN;


/* The following program is found on page 58 */
PROC LOGISTIC DATA=logreg.penalty;
  WHERE blackd=0;
  CLASS culp /PARAM=REF;
  MODEL death(EVENT='1') = culp serious / FIRTH 
    CLPARM=PL;
RUN;


/* The following program is found on page 61 */
PROC REG DATA=logreg.penalty;
  MODEL death = blackd whitvic serious serious2 / TOL VIF; 
RUN;

/* The following program is found on page 62 */
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = blackd whitvic serious serious2;
  OUTPUT OUT=a PRED=phat;
DATA b;
  SET a;
  w = phat*(1-phat);
PROC REG DATA=b;
  WEIGHT w;
  MODEL death = blackd whitvic serious serious2 / TOL VIF;
RUN;

/* The following program is found on page 64 */
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = blackd whitvic culp / AGGREGATE SCALE=NONE;
RUN;


/* The following program is found on page 65 */
PROC LOGISTIC DATA=logreg.penalty;
CLASS culp;
MODEL death(EVENT='1') = blackd whitvic culp blackd*whitvic  blackd*culp whitvic*culp blackd*whitvic*culp ;
RUN;

/* The following program is found on page 66 */
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = blackd whitvic culp / LACKFIT;
RUN;

/* The following program is found on page 70 Tjur’s R2 = difference in means */
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1')=culp whitvic blackd;
  OUTPUT OUT=a PRED=yhat;
PROC MEANS; 
 CLASS death;
 VAR yhat; 
RUN;

/* The following program is found on page 76 */
PROC LOGISTIC DATA=logreg.penalty PLOTS(ONLY)=ROC(ID=CUTPOINT);
  MODEL death(EVENT='1')=blackd whitvic culp ;
RUN;

/* The following program is found on page 78 */
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1')=blackd whitvic culp;
  ROC 'omit culp' blackd whitvic; /*Plot ROC when one variable is omitted*/
  ROC 'omit blackd' whitvic culp;
  ROC 'omit whitvic' blackd culp;
  ROCCONTRAST / ESTIMATE=ALLPAIRS; /*tests for differences in the c statistics*/
RUN;
/*culp has the greatest impact when left out, as shown in the association statistics
for the three different models and in the ROCCONTRAST tables*/


/* The following program is found on page 82 */
PROC LOGISTIC DATA=logreg.penalty PLOTS(UNPACK LABEL)=  
    (INFLUENCE DFBETAS PHAT DPC LEVERAGE);
  MODEL death(EVENT='1')=blackd whitvic culp ;
RUN;

/*page 90: Standardized Coefficients for relative importance 
  of explanatory variables*/
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = culp blackd whitvic / STB;
RUN;

When none of the individual variables is significant but the entire set is significant, multicollinearity is a likely culprit.

Deviance chi-square test

answers the question, “Is there a better model than this one?” Again, a significant chi-square corresponds to a “yes” answer, but that leads to rejection of the model.

Remember that what the deviance, Pearson chi-square, and HL tests are evaluating is whether you can improve the model by including interactions and nonlinearities. In my judgment, there’s no good substitute for directly testing those possible interactions and non-linearities.


In [ ]:
PROC LOGISTIC DATA=logreg.penalty;
  MODEL death(EVENT='1') = blackd whitvic culp / AGGREGATE SCALE=NONE;
RUN;

In [1]:
proc product_status;
run;


Out[1]:

For Base SAS Software ...
Custom version information: 9.4_M4
Image version information: 9.04.01M4P110916
For SAS/STAT ...
Custom version information: 14.2
For SAS/ETS ...
Custom version information: 14.2
For SAS/IML ...
Custom version information: 14.2
For High Performance Suite ...
Custom version information: 2.2_M5
For SAS/ACCESS Interface to PC Files ...
Custom version information: 9.4_M4
NOTE: PROCEDURE PRODUCT_STATUS used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds


In [ ]: