Statistics using SAS

  • Pearson chi-square test
  • Cramer's V statistic
  • Mantel-Haenszel chi-square test (Ordinal association)
  • Spearman Correlation Statistic
  • Cohen's kappa coefficient-A Test of Agreement
  • Yates' correction for continuity (Yates's chi-squared test)
  • Fisher’s Exact Test (small samples)
  • Profile Likelihood Confidence Intervals (small samples)
  • Hosmer and Lemeshow test
  • Three Classical Tests (LR, LM, Wald)
  • Goodness of fit vs Predictive Power
  • Deviance $(-2\log L)$, AIC (adjusts for the number of explanatory variables), SC (Schwarz criterion, adjusts for the number of predictor variables)
  • Quasi-complete Separation
  • Generalized $R^{2} = 1-e^{-L^{2}/n}$, where $L^{2}$ is the likelihood-ratio chi-square
  • McFadden's $R^{2} = \frac{L^{2}}{-2\log(L_{0})}$, where $L_{0}$ is the likelihood of the intercept-only model
  • Tjur’s $R^{2}$
  • $\text{Tau-a} = \frac{C-D}{N}$
  • $Gamma = \frac{C-D}{C+D}$
  • $\text{Somers' D} = \frac{C-D}{C+D+T}$
  • $c = 0.5\,(1 + \text{Somers' D}) = \frac{C+0.5T}{C+D+T}$
  • ROC curve (based on predicted probabilities that are not bias-corrected)
  • Latent Variable Model
  • Standardize to standard-deviation units for comparability of coefficients: $\beta^{*}_{j} = \frac{\beta_{j}\sigma_{j}}{\sigma_{d}}$, $j = 1, \ldots, k$, where $\beta^{*}_{j}$ is the standardized coefficient, $\sigma_{d}$ is the standard deviation of the dependent variable, and $\sigma_{j}$ is the standard deviation of $x_{j}$.
      /* selected TABLES options */
      chisq measures cl expected cellchi2
      /* selected PROC LOGISTIC plot requests */
      plots(only)=(effect);
      plots(only)=(effect oddsratio);

Diagnostic Statistics:

  • Linear predictor—Predicted log-odds for each case. In matrix notation, this is $x\beta$, so it’s commonly referred to as XBETA.
  • Standard error of linear predictor—Used in generating confidence intervals.
  • Predicted values—Predicted probability of the event, based on the estimated model and values of the explanatory variables. For grouped data, this is the expected number of events.
  • Confidence intervals for predicted values—Confidence intervals are first calculated for the linear predictor by adding and subtracting an appropriate multiple of its standard error. Then, to get confidence intervals around the predicted values, the upper and lower bounds of the linear predictor are substituted into $1/(1+e^{-x})$, where $x$ is the upper or lower bound (see the sketch after this list).
  • Deviance residuals—Contribution of each observation to the deviance chi-square.
  • Pearson residuals—Contribution of each observation to the Pearson chi-square.

       plots(only)=(influence dfbetas phat dpc leverage);
  • DFBETAS—These statistics tell you how much each regression coefficient changes when a particular observation is deleted. The actual change is divided by the standard error of the coefficient.

  • DIFDEV—Change in deviance with deletion of the observation.
  • DIFCHISQ—Change in Pearson chi-square with deletion of the observation.
  • C and CBAR—Measures of overall change in regression coefficients, analogous to Cook’s distance in linear regression.
  • LEVERAGE—Measures how extreme the observation is in the space of the explanatory variables. The leverage is the diagonal of the "hat" matrix.
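
The sketch below shows one way to request these diagnostics. The OUTPUT statement keywords are standard PROC LOGISTIC syntax, but the data set and model here are assumptions used only for illustration.

In [ ]:
/* Sketch: writing the diagnostic statistics described above to a
   data set. The model Purchase = IncLevel is hypothetical. */
proc logistic data=statdata.sales_inc;
   model Purchase(event='1') = IncLevel;
   output out=diagnostics
          xbeta=xb                  /* linear predictor (XBETA)          */
          stdxbeta=se_xb            /* standard error of linear predictor*/
          p=phat                    /* predicted probability             */
          lower=lcl upper=ucl       /* CI for the predicted probability  */
          resdev=dev_res            /* deviance residual                 */
          reschi=pearson_res        /* Pearson residual                  */
          difdev=difdev             /* change in deviance on deletion    */
          difchisq=difchisq         /* change in Pearson chi-square      */
          c=c cbar=cbar             /* Cook's-distance analogs           */
          h=leverage                /* diagonal of the "hat" matrix      */
          dfbetas=_all_;            /* one DFBETA per coefficient        */
run;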

In [ ]:
%let path=/folders/myfolders/ECST131;
libname statdata "&path";

Examining the Distribution of Categorical Variables

PROC FREQ DATA=SAS-data-set <option(s)>;
    TABLES table-request(s) </ option(s)>;
    additional statements;
    RUN;

Selected Options in PROC FREQ
    Statement   Option(s)
    PROC FREQ   ORDER=
    TABLES      CELLCHI2, CHISQ (Pearson and Mantel-Haenszel), CL,
                EXPECTED, MEASURES, NOCOL, NOPERCENT, PLOTS=, RELRISK

Option    Description
AGREE     Kappa coefficient of agreement
ALL       A quick way to get CHISQ, MEASURES, and CMH
CHISQ     Computes chi-square, Fisher’s exact (for 2x2 tables), and several other tests
CMH       Cochran-Mantel-Haenszel statistics
FISHER    Fisher’s exact test (needed only for tables larger than 2x2)
MEASURES  Various measures of association, such as Pearson and Spearman coefficients
RELRISK   Computes relative risk (also called risk ratio)
TREND     Computes the Cochran-Armitage test for trend

In [ ]:
proc print data = statdata.sales (obs = 15);
run;

proc format;
value Incfmt
1 = 'Low'
2 = 'Medium'
3 = 'High';
run;

proc format;
value purfmt 
0 = '< $100'
1 = '> $100';
run;

In [ ]:
proc freq data=statdata.sales;
   tables Purchase Gender Income
          Gender*Purchase
          Income*Purchase 
          /
          plots=(freqplot); 
   
   title1 'Frequency Tables for Sales Data';
run;

In [ ]:
ods select histogram probplot;  

proc univariate data=statdata.sales normal;
   var Age;
   histogram Age / normal (mu=est
                   sigma=est); 
   probplot Age / normal (mu=est
                  sigma=est);
   title1 'Distribution of Age'; 
run;

title;

Ordering the Values in a Frequency or Cross-tabulation Table


In [ ]:
data statdata.sales_inc;
   set statdata.sales;
   if Income='Low' then IncLevel=1;
   else If Income='Medium' then IncLevel=2;
   else If Income='High' then IncLevel=3;
run;

proc freq data=statdata.sales_inc;
   tables IncLevel*Purchase / plots=freq;
   format IncLevel incfmt. Purchase purfmt.;
   title1 'Create variable IncLevel to correct Income';
run;

title;

In PROC FREQ output, the default order for character values is alphanumeric. In this frequency table, the values of Income are listed by default as High, Low, and Medium instead of in a logical order, by magnitude. In other words, Income is treated as a nominal variable instead of an ordinal variable. When you order values logically, your output tables and graphs are easier to read and interpret.



In [ ]:
proc print data = statdata.safety;
run;

proc freq data=statdata.safety;
   tables Unsafe Type Region;
   title 'Safety Data Frequencies';
run;
title;

In [ ]:
proc format;
    value $Levels 'Low'='1_Low'
                  'Medium'='2_Medium'
                  'High'='3_High';
run;

proc freq data=statdata.sales order=formatted;
   tables Income*Purchase / chisq measures cl;
   format Income $levels.;
   title1 'Ordinal Association between Income and Purchase?';
run;

title;

Pearson Chi-Square Test of Association

You start with the null hypothesis of no association between the variables and the alternative hypothesis of association. To determine whether there is an association, the Pearson chi-square test measures the difference between the observed cell frequencies and the cell frequencies that are expected if there is no association between the variables. In other words, the expected frequencies are the frequencies you expect to get if the null hypothesis is true. The chi-square test calculates the expected frequencies for each cell by multiplying the row total (R) by the column total (C), and then dividing the result by the total sample size (T). If the observed frequencies equal the expected frequencies, there is no association between the variables. If the observed frequencies do not equal the expected frequencies, there is an association between the variables. The chi-square statistic indicates the difference between observed frequencies and expected frequencies. A significant chi-square statistic is strong evidence that an association exists between your variables.
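
In symbols, for a table with row totals $R_i$, column totals $C_j$, and total sample size $T$, the expected count and the chi-square statistic are

$$E_{ij} = \frac{R_i \, C_j}{T}, \qquad \chi^2 = \sum_{i,j}\frac{(O_{ij}-E_{ij})^{2}}{E_{ij}}$$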

Keep in mind that the chi-square statistic and its corresponding p-value indicate only whether an association exists between two categorical variables. The p-value indicates how confident you can be that the null hypothesis of no association is false. However, neither the chi-square statistic nor its p-value tells you the magnitude of an association.

Chi-square statistics and their p-values depend on and reflect the sample size. A larger sample size yields a larger chi-square statistic and a smaller corresponding p-value, even though the association might not be all that strong. For example, if you double the size of your sample by duplicating each observation, you double the value of the chi‑square statistic, even though the strength of the association does not change.

Cramer's V Statistic

Cramer's V statistic is one measure of the strength of an association between two categorical variables. Cramer's V statistic is derived from the Pearson chi-square statistic.

For two-by-two tables, Cramer's V is in the range of -1 to 1. For larger tables, Cramer's V is in the range of 0 to 1.

Values farther away from 0 indicate a relatively strong association between the variables. The closer Cramer's V is to 0, the weaker the association is between the rows and columns. For a two-by-two table, a Cramer's V value close to 1 indicates a strong positive association. A Cramer's V value close to -1 shows a strong negative association.

Like other measures of strength of association, Cramer's V statistic is not affected by sample size.
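
For reference, Cramer's V is computed from the Pearson chi-square statistic as

$$V = \sqrt{\frac{\chi^{2}/n}{\min(r-1,\; c-1)}}$$

where $n$ is the sample size and $r$ and $c$ are the numbers of rows and columns in the table.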

Odds Ratio

$$\mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{\frac{F(x+1)}{1-F(x+1)}}{\frac{F(x)}{1-F(x)}} = \frac{e^{\beta_{0}+\beta_{1}(x+1)}}{e^{\beta_{0}+\beta_{1}x}} = e^{\beta_{1}}$$

To measure the strength of the association between a binary predictor variable and a binary outcome variable, you can use an odds ratio. An odds ratio indicates how much more likely it is, with respect to odds, that a certain event, or outcome, occurs in one group relative to its occurrence in another group. In this example, you want to know how much more likely it is, with respect to odds, that a Yes outcome occurs in Group B relative to its occurrence in Group A.

Now that we know the probabilities, let's calculate the odds. The odds of the outcome occurring in Group B are the probability of a Yes outcome in Group B (.90) divided by the probability of a No outcome in Group B (.10). The odds for group B are 9 (or 9:1), which means that we expect nine occurrences of the outcome to one non-occurrence in Group B.

Here's a question: What are the odds of having the outcome in Group A? To calculate the odds of the outcome occurring in Group A, you divide the probability of a Yes outcome in Group A (.75) by the probability of a No outcome in Group A (.25). The odds for Group A are 3 (or 3:1), which means that you expect three times as many occurrences as non-occurrences in Group A.

Now that you know the odds of the outcome in both groups, you can compare the Group B odds with the Group A odds by calculating an odds ratio. You divide the odds of an outcome in Group B (9) by the odds of an outcome in Group A (3), with a result of 3. An odds ratio of 3 means that the odds of getting the outcome in Group B are three times those of getting the outcome in Group A. In this example, the odds of saving at least 10% of income are three times higher for people who have received training.

Let's look at the possible range of values for an odds ratio. The value of the odds ratio can range from 0 to infinity; it cannot be negative. When the odds ratio is 1, there is no association between the predictor variable and the outcome variable. If the odds ratio is greater than 1, the group in the numerator of the fraction (here, Group B) is more likely to have the outcome. In the example about saving money, the odds ratio of 3 indicates that the odds for Group B are three times the odds for Group A. If the odds ratio is less than 1, the group in the denominator of the fraction (here, Group A) is more likely to have the outcome.

The odds ratio is approximately the same regardless of sample size. To estimate the true odds ratio while taking into account the variability of the sample statistic, you can calculate confidence intervals just as you can for any unknown parameter you are trying to estimate. A 95% confidence interval for an odds ratio means that you are 95% confident that the interval contains the true odds ratio. In other words, if you sample repeatedly and calculate a confidence interval for each sample odds ratio, 95% of the time your confidence interval will contain the true population odds ratio.

You can use an odds ratio to test for significance between two categorical variables. If the 95% confidence interval does not include 1, the odds ratio is significant at the 0.05 level. You are 95% confident that the true odds ratio is significantly different from 1. There is an association between the two variables.
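
As a sketch of how such an interval is usually formed (the standard asymptotic, or Woolf, interval, not taken from the text above): with 2x2 cell counts $a$, $b$, $c$, $d$, the limits are

$$\exp\!\left(\ln\widehat{OR} \;\pm\; z_{1-\alpha/2}\,\sqrt{\tfrac{1}{a}+\tfrac{1}{b}+\tfrac{1}{c}+\tfrac{1}{d}}\right)$$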

The last table below shows estimates of the relative risk for the row 1 variable to the row 2 variable in the crosstabulation table above. The RELRISK option generates this table. In the Estimates of the Relative Risk table, the odds ratio of females to males for spending less than $\$100$ is in the first row. However, remember that we are actually interested in the other outcome – spending $\$100$ or more. So we'll interpret this odds ratio as the odds of the row 2 value (males) compared to the odds of the row 1 value (females) of being in the second column. This interpretation is logically equivalent.

$$\text{odds ratio} = 0.6458 = \frac{139/101}{130/61} = \frac{1}{(101/139)\,/\,(61/130)} = \frac{1}{1.5485}$$

Remember that you can use the odds ratio to measure the strength of association for a two-by-two table. The odds ratio of 0.6458 means that the odds of a male spending at least $\$100$ are approximately $65\%$ (about $2/3$) of the odds for a female. It's often easier to report odds ratios by converting the decimal value to a percent-difference value. The formula for the conversion is

$$(\text{odds ratio} - 1) \times 100\% = (0.6458 - 1) \times 100\% = -35.4\%$$

In this example, males have 35.42 percent lower odds than females of spending $100 or more. An odds ratio of 1 would mean no association. However, the 95% confidence interval does not include 1, which indicates that the odds ratio is significant at the 0.05 significance level. This also means that the true odds ratio is statistically different from 1, so it's another way of testing whether the association between Gender and Purchase is significant.

$$\text{relative risk} = 0.8510 = \frac{57.92\%}{68.06\%}$$

In [ ]:
proc freq data=statdata.sales_inc;
   tables Gender*Purchase /
          chisq expected cellchi2 nocol nopercent  
          relrisk;
   *format Purchase purfmt.;
   title1 'Association between Gender and Purchase';
run;

title;

The CHISQ option produces the Pearson chi-square test of association and the measures of association that are based on the chi-square statistic. The output from this option alone will tell you whether there is a significant association between the specified variables.

The EXPECTED option prints the expected cell frequencies, which are the cell frequencies that we expect under the null hypothesis of no association. CELLCHI2 prints each cell's contribution to the total chi-square statistic. NOCOL suppresses the printing of the column percentages. NOPERCENT suppresses the printing of the cell percentages. Finally, we'll add the RELRISK (relative risk) option to print a table that contains risk ratios (probability ratios) and odds ratios.

At the top of the results is the requested crosstabulation table for Gender by Purchase. The total number of observations appears in the bottom right corner. Notice that we're working with a large sample. You can see how the options in the TABLES statement change the statistics that appear in each cell. The actual frequency appears first. Next, the EXPECTED option generates the expected frequency. The expected frequency is the count that we'd expect if the null hypothesis of no association between the two variables is true. The CELLCHI2 option generates the cell chi-square statistic, which indicates how much that cell contributes to the overall chi-square statistic.

Relative risk or risk ratio (RR) is the ratio of the probability of an event occurring (for example, developing a disease, being injured) in an exposed group to the probability of the event occurring in a comparison, non-exposed group. Relative risk includes two important features: (i) a comparison of risk between two "exposures" puts risks in context, and (ii) "exposure" is ensured by having proper denominators for each group representing the exposure:

$$RR = \frac{p_{\text{event when exposed}}}{p_{\text{event when not exposed}}}$$
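
To contrast the two measures with the earlier savings example: with event probabilities of .90 in Group B and .75 in Group A,

$$RR = \frac{0.90}{0.75} = 1.2, \qquad OR = \frac{0.90/0.10}{0.75/0.25} = \frac{9}{3} = 3$$

The relative risk compares probabilities directly, while the odds ratio compares odds, so the two can differ substantially when the event is common.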

Cochran–Mantel–Haenszel statistics

When two variables have an ordinal association, an increase in the value of one variable tends to be associated with an increase or decrease in the value of the other variable. For example, the Mantel-Haenszel chi-square test can help to determine whether, as the row values increase in size, the column values also increase in size. When the variables have more than two levels, the levels must be in a logical order for the test results to be meaningful.

For the Mantel-Haenszel chi-square test, the null hypothesis is that there is no ordinal association between the row and column variables. The alternative hypothesis is that there is an ordinal association between the row and column variables.

So, why is the Mantel-Haenszel chi-square statistic more powerful than the Pearson chi-square statistic for detecting an ordinal association? All of the power of the Mantel-Haenszel statistic is concentrated on ordinal associations. However, the power of the Pearson (or general) association statistic is dispersed over a greater number of alternatives.

The Mantel-Haenszel chi-square statistic and its corresponding p-value are similar to the Pearson chi-square statistic and its corresponding p-value in the following ways. They indicate only whether an association exists but not the magnitude of the association. They depend on and reflect the sample size.
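
For a two-way table, the Mantel-Haenszel chi-square is computed from the Pearson correlation $r$ between the row scores and the column scores:

$$Q_{MH} = (n-1)\,r^{2}$$

which has 1 degree of freedom.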


In [ ]:
proc freq data=statdata.sales_inc;
   tables IncLevel*Purchase / chisq measures cl;
   format IncLevel incfmt. Purchase purfmt.;
   title1 'Ordinal Association between IncLevel and Purchase?';
run;

title;

To measure the strength of the association between two ordinal variables, you can use the Spearman correlation statistic. You should use the Spearman correlation statistic only if both variables are ordinal and the values of each variable are in logical order. The Spearman correlation is considered to be a rank correlation because it provides a degree of association between the ranks of the ordinal variables. Ordinal variables are considered to be ranked because their values have a logical order.

This statistic has a range between -1 and 1. Values close to 1 indicate that there is a relatively high degree of positive correlation. Values close to -1 indicate that there is a relatively high degree of negative correlation. And values close to zero indicate a weak correlation.

Like other measures of strength of association, Spearman's correlation statistic is not affected by sample size.
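
The Spearman coefficient is the Pearson correlation computed on the ranks; in the no-ties case this simplifies to

$$r_s = 1 - \frac{6\sum_{i} d_i^{2}}{n(n^{2}-1)}$$

where $d_i$ is the difference between the two ranks of observation $i$.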



Notice that the 95% confidence interval does not contain 0. This means that the positive, ordinal relationship between income level and purchasing habits is significant at the 0.05 significance level. The confidence bounds are valid only if the sample size is large. A general guideline is to have a sample size of at least 25 for each degree of freedom in the Pearson chi-square statistic. The asymptotic standard error is used for large samples and is used to calculate the confidence intervals for various measures of association, including the Spearman correlation coefficient. The asymptotic standard error measures the variability of the correlation statistic. In other words, it indicates how much error we can expect if we use the sample Spearman correlation to estimate the population Spearman correlation.
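
As a sketch, the CL option forms these asymptotic limits in the usual normal-approximation way:

$$r_s \;\pm\; z_{1-\alpha/2}\,\text{ASE}$$

so for a 95% interval the bounds are $r_s \pm 1.96 \times \text{ASE}$.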


In [ ]:
proc freq data=statdata.safety;
   tables Region*Unsafe / chisq relrisk expected;
   *format Unsafe safefmt.;
   title "Association between Unsafe and Region";
run;

title;

proc freq data=statdata.safety;
   tables Size*Unsafe / chisq measures cl;
   *format Unsafe safefmt.;
   title "Association between Unsafe and Size";
run;

title;

Tables with Expected Values Less Than 5 (Fisher’s Exact Test)


In [ ]:
data small_counts;
input Group $ Outcome $ @@;
datalines;
A Good A Good A Good A Poor A Good A Good
B Poor B Poor B Good B Poor B Poor
;
title "Tables with Small Expected Frequencies";
proc freq data=small_counts;
tables Group * Outcome / chisq expected;
run;

Notice that SAS prints a warning when more than 20% of the cells have expected values that are less than 5. In this case, the chi-square value and resulting probability are not valid. The row in the output labeled Continuity Adj. Chi-Square is the Yates’ corrected value (2.2275). Under the heading Fisher’s Exact Test, you see the two-sided probability equal to .0801. If you were conducting a hypothesis test with alpha set to the usual value of .05, you would fail to reject the null hypothesis that the proportion of good outcomes was equal in the two groups. In practice, you might report that group A seemed to have a higher percentage of good outcomes, although the difference did not reach statistical significance. At this point, you would suggest that more research was needed (perhaps with a larger sample).
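
You can also request the exact test explicitly with the EXACT statement (standard PROC FREQ syntax); a minimal sketch:

In [ ]:
/* Request Fisher's exact test explicitly; for tables larger than
   2 x 2 the exact computation can be expensive on large samples. */
proc freq data=small_counts;
   tables Group * Outcome;
   exact fisher;
run;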

Using a Chi-Square Macro


In [ ]:
/***********************************************************
Macro CHISQ
Purpose: To compute chi-square (and any other valid
PROC FREQ TABLES options) from frequencies in a
2 x 2 table.
Sample Calling Sequences:
%CHISQ(10,20,30,40)
%CHISQ(10,20,30,40,OPTIONS=CMH)
%CHISQ(10,20,30,40,OPTIONS=CHISQ CMH)
************************************************************/
%macro chisq(a,b,c,d,options=chisq);
data chisq;
array cells[2,2] _temporary_ (&a &b &c &d);
do Row = 1 to 2;
do Col = 1 to 2;
Count = cells[Row,Col];
output;
end;
end;
run;
proc freq data=chisq;
tables Row*Col / &options;
weight Count;
run;
%mend chisq;

/*Calling the Macro*/

*%include 'c:\sasprogs\macros\chisq.sas';

%chisq(30,10,15,40)

Computing Coefficient Kappa—A Test of Agreement

The AGREE option on the TABLES statement computes the Kappa coefficient, also known as a coefficient of agreement. For example, suppose two psychiatrists are evaluating the same patients, and each one decides whether each patient has attention deficit disorder. By chance alone, the two raters might agree or disagree on any individual diagnosis. For example, if the probability of a positive diagnosis is 50%, they will agree 50% of the time, by chance alone. Coefficient Kappa adjusts for the chance agreement.

A Kappa value of 1 indicates perfect agreement, and a Kappa value of 0 indicates that the agreement is attributable to chance. It is possible for Kappa to be negative; agreement can be less than chance. You see from this output that Kappa is equal to .6927. A two-tailed hypothesis test for Kappa equal to zero yields a p-value <.0001. You can conclude that the agreement between the two raters was significant and not due to chance.
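
Formally, with $P_o$ the observed proportion of agreement and $P_e$ the proportion of agreement expected by chance,

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$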


In [ ]:
data kappa;
input Rater1 : $1. Rater2 : $1. @@;
datalines;
Y Y N N Y N N Y Y Y Y Y Y Y N N N N N N Y Y Y N Y Y N N N Y N
N N N N N
Y Y Y Y N N N N Y N Y Y Y Y N N N N N N N N Y Y N N Y Y N N
;
title "Computing Coefficient Kappa";
proc freq data=kappa;
tables Rater1 * Rater2 / agree;
test kappa;
run;

Yates's correction for continuity

In statistics, Yates' correction for continuity (or Yates' chi-squared test) is used in certain situations when testing for independence in a contingency table. In some cases, Yates' correction may adjust too far, and so its current use is limited.

Using the chi-squared distribution to interpret Pearson's chi-squared statistic requires one to assume that the discrete probability of observed binomial frequencies in the table can be approximated by the continuous chi-squared distribution. This assumption is not quite correct, and introduces some error.

To reduce the error in approximation, Frank Yates, an English statistician, suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the difference between each observed value and its expected value in a 2 × 2 contingency table. This reduces the chi-squared value obtained and thus increases its p-value.

The effect of Yates' correction is to prevent overestimation of statistical significance for small data. This formula is chiefly used when at least one cell of the table has an expected count smaller than 5. Unfortunately, Yates' correction may tend to overcorrect. This can result in an overly conservative result that fails to reject the null hypothesis when it should (a type II error). It has therefore been suggested that Yates' correction is unnecessary even with quite low total sample sizes, such as

$$\sum_{i=1}^{N}O_{i}=20$$

The following is the Yates-corrected version of Pearson's chi-squared statistic:

$$\chi^{2}_{\text{Yates}} = \sum_{i=1}^{N}\frac{(|O_{i}-E_{i}|-0.5)^{2}}{E_{i}}$$

where:

$O_{i}$ = an observed frequency

$E_{i}$ = an expected (theoretical) frequency, asserted by the null hypothesis

$N$ = number of distinct events


In [ ]:
* Variables in the Store data set:
  Region Advertising Gender
  Book_Sales Music_Sales Electronics_Sales Total_Sales;

proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
run;
data Store;
   length Region $ 5;
   call streaminit(57676);
   do Transaction = 1 to 200;
      R = ceil(rand('uniform')*10);
      select(R);
         when(1) Region = 'East';
         when(2) Region = 'West';
         when(3) Region = 'North';
         when(4) Region = 'South';
         otherwise;
      end;
      Advertising = rand('bernoulli',.6);
      if rand('uniform') lt .6 then Gender = 'Female';
         else Gender = 'Male';
      Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
                    + 30*Advertising,10)) ;
      Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
         + 30*(Region = 'East' and Gender = 'Male')
         - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
      Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
       + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South') 
       + 75*(Region = 'West'),10));
      Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
   output;
   end;
   drop R;
   format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
          Advertising yesno.;
run;
 
/*title "Listing of Store";*/
/*proc print data=store heading=h;*/
/*run;*/

/*proc univariate data=store;*/
/*   var Book_Sales -- Total_Sales;*/
/*   histogram;*/
/*run;*/
/**/
/*title "Scatter Matrix for Store Variables";*/
/*proc sgscatter data=store;*/
/*   matrix Book_Sales -- Total_Sales / group = Gender;*/
/*run;*/
/**/
/*proc sgplot data=store;*/
/*   scatter x=Book_Sales y=Total_Sales / group=Gender;*/
/*run;*/

proc rank data=store out=median_sales groups=2;
   var Total_Sales;
   ranks Sales_Group;
run;

proc format;
   value sales 0 = 'Low'
               1 = 'High';
run;

/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male');*/
/*   model Sales_Group = Gender;*/
/*   format Sales_Group sales.;*/
/*quit;*/
/**/
/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male')*/
/*         Advertising (param=ref ref='No');*/
/*   model Sales_Group = Gender Advertising;*/
/*   format Sales_Group sales.;*/
/*quit;*/

*Create test data set;
libname example 'c:\books\statistics by example';
data example.Blood_Pressure;
   call streaminit(37373);
   do Drug = 'Placebo','Drug A','Drug B';
      do i = 1 to 20;
         Subj + 1;
         if mod(Subj,2) then Gender = 'M';
         else Gender = 'F';
         SBP = rand('normal',130,10) +
               7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
         SBP = round(SBP,2);
         DBP = rand('normal',80,5) +
               3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
         DBP = round(DBP,2);
         if Subj in (5,15,25,55) then call missing(SBP, DBP);
         if Subj in (4,18) then call missing(Gender);
         output;
      end;
   end;
   drop i;
run;

/*title "Listing of the first 25 observations from Blood_Pressure";*/
/*proc print data=example.Blood_Pressure(obs=25) noobs;*/
/*   var Subj Drug SBP DBP;*/
/*run;*/

data exercise;
   call streaminit(7657657);
   do Subj = 1 to 50;
      Age = round(rand('normal',50,15));
      Pushups = abs(int(rand('normal',40,10) - .30*age));
      Rest_Pulse = round(rand('normal',50,8) + .35*age);
      Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
      Run_Pulse = round(max_pulse - rand('normal',3,3));
      output;
   end;
run;

*Data set for a paired t-test example;
data reading;
   input Subj Before After @@;
datalines;
1 100 110  2 120 121  3 130 140  4 90 110  5 85 92
6 133 137  7 210 209  8 155 179
;

/*title "Listing of Data Set READING";*/
/*proc print data=reading noobs;*/
/*run;*/

*Data set that violates assumptions for a t-test;
data salary;
   call streaminit(57575);
   do Subj = 1 to 50;
      do Gender = 'M','F';
         Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
         output;
      end;
   end;
run;
/*proc univariate data=salary;*/
/*   class Gender;*/
/*   id Subj;*/
/*   var Income;*/
/*   histogram Income;*/
/*run;*/

*Data set risk for logistic regression example;
proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
run;

data Risk;
   call streaminit(13579);
   length Age_Group $ 7;
   do i = 1 to 250;
      do Gender = 'F','M';
         Age = round(rand('uniform')*30 + 50);
         if missing(Age) then Age_Group = ' ';
         else if Age lt 60 then Age_Group = '1:< 60';
         else if Age le 70 then Age_Group = '2:60-70';
         else Age_Group = '3:71+';
         Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
         Chol = round(Chol);
         Score = .3*chol + age + 8*(Gender eq 'M');
         Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
         output;
       end;
   end;
   keep Gender Age Age_Group chol Heart_Attack;
   format Heart_Attack yesno.;
run;

/*title "Listing of first 100 observations from RISK";*/
/*proc print data=risk(obs=100);*/
/*run;*/