Options and diagnostics covered in this section:

- TABLES statement options: CHISQ, MEASURES, CL, EXPECTED, CELLCHI2
- PROC LOGISTIC plot requests: `plots(only)=(effect);` and `plots(only)=(effect oddsratio);`

Diagnostic Statistics:

- Pearson residuals: the contribution of each observation to the Pearson chi-square.
- Influence diagnostics: INFLUENCE, DFBETAS, PHAT, DPC, LEVERAGE
- DFBETAS: how much each regression coefficient changes when a particular observation is deleted; the actual change is divided by the standard error of the coefficient.
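The plot and influence options above belong to PROC LOGISTIC rather than PROC FREQ. Here is a minimal sketch of where they would go, using the sales data set introduced below purely as an illustration (the model itself is hypothetical):

proc logistic data=statdata.sales plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male');
   model Purchase(event='1') = Gender / influence;  *INFLUENCE requests DFBETAS, leverage, etc.;
run;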
In [ ]:
%let path=/folders/myfolders/ECST131;
libname statdata "&path";
PROC FREQ DATA=SAS-data-set <option(s)>;
TABLES table-request(s) </ option(s)>;
additional statements;
RUN;
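For example, the simplest call produces one-way frequency tables for each variable listed (using the sales data set read in above):

proc freq data=statdata.sales;
   tables Gender Purchase;   *one-way frequency table per variable;
run;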
Selected Options in PROC FREQ

Statement | Option |
---|---|
PROC FREQ | ORDER= |
TABLES | CELLCHI2 |
TABLES | CHISQ (Pearson and Mantel-Haenszel) |
TABLES | CL |
TABLES | EXPECTED |
TABLES | MEASURES |
TABLES | NOCOL |
TABLES | NOPERCENT |
TABLES | PLOTS= |
TABLES | RELRISK |
Option | Description |
---|---|
AGREE | Kappa coefficient of agreement |
ALL | A quick way to get CHISQ, MEASURES, and CMH |
CHISQ | Computes chi-square, Fisher’s exact (for 2x2), and several other tests |
CMH | Cochran-Mantel-Haenszel statistics |
FISHER | Fisher’s exact test (needed only for tables larger than 2x2) |
MEASURES | Various measures of association, such as Pearson and Spearman coefficients |
RELRISK | Computes relative risk (also called risk ratio) |
TREND | Computes Cochran-Armitage test for trend |
In [ ]:
proc print data = statdata.sales (obs = 15);
run;
proc format;
value Incfmt
1 = 'Low'
2 = 'Medium'
3 = 'High';
run;
proc format;
value purfmt
0 = '< $100'
1 = '> $100';
run;
In [ ]:
proc freq data=statdata.sales;
tables Purchase Gender Income
Gender*Purchase
Income*Purchase
/
plots=(freqplot);
title1 'Frequency Tables for Sales Data';
run;
In [ ]:
ods select histogram probplot;
proc univariate data=statdata.sales normal;
var Age;
histogram Age / normal (mu=est
sigma=est);
probplot Age / normal (mu=est
sigma=est);
title1 'Distribution of Age';
run;
title;
In [ ]:
data statdata.sales_inc;
set statdata.sales;
if Income='Low' then IncLevel=1;
else if Income='Medium' then IncLevel=2;
else if Income='High' then IncLevel=3;
run;
proc freq data=statdata.sales_inc;
tables IncLevel*Purchase / plots=freqplot;
format IncLevel incfmt. Purchase purfmt.;
title1 'Create variable IncLevel to correct Income';
run;
title;
In PROC FREQ output, the default order for character values is alphanumeric. In this frequency table, the values of Income are listed by default as High, Low, and Medium instead of in a logical order, by magnitude. In other words, Income is treated as a nominal variable instead of an ordinal variable. When you order values logically, your output tables and graphs are easier to read and interpret.
In [ ]:
proc print data = statdata.safety;
run;
proc freq data=statdata.safety;
tables Unsafe Type Region;
title 'Safety Data Frequencies';
run;
title;
In [ ]:
proc format;
value $Levels 'Low'='1_Low'
'Medium'='2_Medium'
'High'='3_High';
run;
proc freq data=statdata.sales order=formatted;
tables Income*Purchase / chisq measures cl;
format Income $levels.;
title1 'Ordinal Association between Income and Purchase?';
run;
title;
You start with the null hypothesis of no association between the variables and the alternative hypothesis of association. To determine whether there is an association, the Pearson chi-square test measures the difference between the observed cell frequencies and the cell frequencies that are expected if there is no association between the variables. In other words, the expected frequencies are the frequencies you expect to get if the null hypothesis is true. The chi-square test calculates the expected frequencies for each cell by multiplying the row total (R) by the column total (C), and then dividing the result by the total sample size (T). If the observed frequencies equal the expected frequencies, there is no association between the variables. If the observed frequencies do not equal the expected frequencies, there is an association between the variables. The chi-square statistic indicates the difference between observed frequencies and expected frequencies. A significant chi-square statistic is strong evidence that an association exists between your variables.
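In symbols, for the cell in row $i$ and column $j$, the expected count and the Pearson chi-square statistic computed from it are:
$$E_{ij} = \frac{R_i \times C_j}{T}, \qquad \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$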
Keep in mind that the chi-square statistic and its corresponding p-value indicate only whether an association exists between two categorical variables. The p-value indicates how confident you can be that the null hypothesis of no association is false. However, neither the chi-square statistic nor its p-value tells you the magnitude of an association.
Chi-square statistics and their p-values depend on and reflect the sample size. A larger sample size yields a larger chi-square statistic and a smaller corresponding p-value, even though the association might not be all that strong. For example, if you double the size of your sample by duplicating each observation, you double the value of the chi‑square statistic, even though the strength of the association does not change.
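You can see this for yourself with a WEIGHT statement. The cell counts below are hypothetical, and Count2 is simply Count doubled; the second run reports exactly twice the chi-square statistic of the first, even though the strength of association is unchanged:

data demo;
   input Row Col Count Count2;   *Count2 = 2 x Count;
   datalines;
1 1 30 60
1 2 10 20
2 1 15 30
2 2 40 80
;
proc freq data=demo;
   tables Row*Col / chisq;
   weight Count;
run;
proc freq data=demo;
   tables Row*Col / chisq;
   weight Count2;
run;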
Cramer's V statistic is one measure of the strength of an association between two categorical variables. Cramer's V statistic is derived from the Pearson chi-square statistic.
For two-by-two tables, Cramer's V is in the range of -1 to 1. For larger tables, Cramer's V is in the range of 0 to 1.
Values farther away from 0 indicate a relatively strong association between the variables. The closer Cramer's V is to 0, the weaker the association is between the rows and columns. For a two-by-two table, a Cramer's V value close to 1 indicates a strong positive association. A Cramer's V value close to -1 shows a strong negative association.
Like other measures of strength of association, Cramer's V statistic is not affected by sample size.
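For the record, Cramer's V is computed from the chi-square statistic as follows, where $n$ is the total sample size and $R$ and $C$ are the numbers of rows and columns; for a two-by-two table it reduces in magnitude to the phi coefficient, which is why it can carry a sign:
$$V = \sqrt{\frac{\chi^2 / n}{\min(R-1,\ C-1)}}$$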
To measure the strength of the association between a binary predictor variable and a binary outcome variable, you can use an odds ratio. An odds ratio indicates how much more likely it is, with respect to odds, that a certain event, or outcome, occurs in one group relative to its occurrence in another group. In this example, you want to know how much more likely it is, with respect to odds, that a Yes outcome occurs in Group B relative to its occurrence in Group A.
Now that we know the probabilities, let's calculate the odds. The odds of the outcome occurring in Group B are the probability of a Yes outcome in Group B (.90) divided by the probability of a No outcome in Group B (.10). The odds for group B are 9 (or 9:1), which means that we expect nine occurrences of the outcome to one non-occurrence in Group B.
Here's a question: What are the odds of having the outcome in Group A? To calculate the odds of the outcome occurring in Group A, you divide the probability of a Yes outcome in Group A (.75) by the probability of a No outcome in Group A (.25). The odds for Group A are 3 (or 3:1), which means that you expect three times as many occurrences as non-occurrences in Group A.
Now that you know the odds of the outcome in both groups, you can compare the Group B odds with the Group A odds by calculating an odds ratio. You divide the odds of an outcome in Group B (9) by the odds of an outcome in Group A (3), with a result of 3. An odds ratio of 3 means that the odds of getting the outcome in Group B are three times those of getting the outcome in Group A. In this example, the odds of saving at least 10% of income are three times higher for people who have received training.
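Written out, the calculation is:
$$\text{odds ratio} = \frac{\text{odds}_B}{\text{odds}_A} = \frac{0.90/0.10}{0.75/0.25} = \frac{9}{3} = 3$$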
Let's look at the possible range of values for an odds ratio. The value of the odds ratio can range from 0 to infinity; it cannot be negative. When the odds ratio is 1, there is no association between the predictor variable and the outcome variable. If the odds ratio is greater than 1, the group in the numerator of the fraction (here, Group B) is more likely to have the outcome. In the example about saving money, the odds ratio 3 indicates that the odds for Group B is three times the odds for Group A. If the odds ratio is less than 1, the group in the denominator of the fraction (here, Group A) is more likely to have the outcome.
The odds ratio is approximately the same regardless of sample size. To estimate the true odds ratio while taking into account the variability of the sample statistic, you can calculate confidence intervals just as you can for any unknown parameter you are trying to estimate. A 95% confidence interval for an odds ratio means that you are 95% confident that the interval contains the true odds ratio. In other words, if you sample repeatedly and calculate a confidence interval for each sample odds ratio, 95% of the time your confidence interval will contain the true population odds ratio.
You can use an odds ratio to test for significance between two categorical variables. If the 95% confidence interval does not include 1, the odds ratio is significant at the 0.05 level. You are 95% confident that the true odds ratio is significantly different from 1. There is an association between the two variables.
The last table below shows estimates of the relative risk comparing row 1 to row 2 of the crosstabulation table above. The RELRISK option generates this table. In the Estimates of the Relative Risk table, the odds ratio of females to males for spending less than $\$100$ is in the first row. However, remember that we are actually interested in the other outcome: spending $\$100$ or more. So we'll interpret this odds ratio as the odds of the row 2 value (males) compared to the odds of the row 1 value (females) of being in the second column. This interpretation is logically equivalent.
$$\text{odds ratio} = \frac{139/101}{130/61} = 0.6458 = \frac{1}{1.5486}, \qquad \text{where}\quad \frac{101/139}{61/130} = 1.5486$$
Remember that you can use the odds ratio to measure the strength of association for a two-by-two table. The odds ratio of 0.6458 means that the odds of a male spending at least $\$100$ are approximately $65\%$ (about $2/3$) of those of a female. It's often easier to report odds ratios by converting the decimal value to a percent-difference value. The formula for the conversion is
$$(\text{odds ratio} - 1) \times 100\% = (0.6458 - 1) \times 100\% = -35.4\%$$
In this example, males have 35.4 percent lower odds than females of spending $\$100$ or more. An odds ratio of 1 would mean no association. However, the 95% confidence interval does not include 1, which indicates that the odds ratio is significant at the 0.05 significance level. This also means that the true odds ratio is statistically different from 1, so this is another way of testing whether the association between Gender and Purchase is significant.
$$\text{relative risk} = \frac{57.92\%}{68.06\%} = 0.8510$$
In [ ]:
proc freq data=statdata.sales_inc;
tables Gender*Purchase /
chisq expected cellchi2 nocol nopercent
relrisk;
*format Purchase purfmt.;
title1 'Association between Gender and Purchase';
run;
title;
The CHISQ option produces the Pearson chi-square test of association and the measures of association that are based on the chi-square statistic. The output from this option alone will tell you whether there is a significant association between the specified variables.
The EXPECTED option prints the expected cell frequencies, which are the cell frequencies that we expect under the null hypothesis of no association. CELLCHI2 prints each cell's contribution to the total chi-square statistic. NOCOL suppresses the printing of the column percentages. NOPERCENT suppresses the printing of the cell percentages. Finally, we'll add the RELRISK (relative risk) option to print a table that contains risk ratios (probability ratios) and odds ratios.
At the top of the results is the requested crosstabulation table for Gender by Purchase. The total number of observations appears in the bottom right corner. Notice that we're working with a large sample. You can see how the options in the TABLES statement change the statistics that appear in each cell. The actual frequency appears first. Next, the EXPECTED option generates the expected frequency. The expected frequency is the count that we'd expect if the null hypothesis of no association between the two variables is true. The CELLCHI2 option generates the cell chi-square statistic, which indicates how much that cell contributes to the overall chi-square statistic.
Relative risk or risk ratio (RR) is the ratio of the probability of an event occurring (for example, developing a disease, being injured) in an exposed group to the probability of the event occurring in a comparison, non-exposed group. Relative risk includes two important features: (i) a comparison of risk between two "exposures" puts risks in context, and (ii) "exposure" is ensured by having proper denominators for each group representing the exposure:
$$RR = \frac{p_{\text{event when exposed}}}{p_{\text{event when not exposed}}}$$
When two variables have an ordinal association, an increase in the value of one variable tends to be associated with an increase or decrease in the value of the other variable. For example, the Mantel-Haenszel chi-square test can help to determine whether, as the row values increase in size, the column values also increase in size. When the variables have more than two levels, the levels must be in a logical order for the test results to be meaningful.
For the Mantel-Haenszel chi-square test, the null hypothesis is that there is no ordinal association between the row and column variables. The alternative hypothesis is that there is an ordinal association between the row and column variables.
So, why is the Mantel-Haenszel chi-square statistic more powerful than the Pearson chi-square statistic for detecting an ordinal association? All of the power of the Mantel-Haenszel statistic is concentrated on ordinal associations. However, the power of the Pearson (or general) association statistic is dispersed over a greater number of alternatives.
The Mantel-Haenszel chi-square statistic and its corresponding p-value are similar to the Pearson chi-square statistic and its corresponding p-value in the following ways. They indicate only whether an association exists but not the magnitude of the association. They depend on and reflect the sample size.
In [ ]:
proc freq data=statdata.sales_inc;
tables IncLevel*Purchase / chisq measures cl;
format IncLevel incfmt. Purchase purfmt.;
title1 'Ordinal Association between IncLevel and Purchase?';
run;
title;
To measure the strength of the association between two ordinal variables, you can use the Spearman correlation statistic. You should use the Spearman correlation statistic only if both variables are ordinal and the values of each variable are in logical order. The Spearman correlation is considered to be a rank correlation because it provides a degree of association between the ranks of the ordinal variables. Ordinal variables are considered to be ranked because their values have a logical order.
This statistic has a range between -1 and 1. Values close to 1 indicate that there is a relatively high degree of positive correlation. Values close to -1 indicate that there is a relatively high degree of negative correlation. And values close to zero indicate a weak correlation.
Like other measures of strength of association, Spearman's correlation statistic is not affected by sample size.
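For reference, when there are no tied ranks the Spearman correlation reduces to a simple closed form, where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the sample size (PROC FREQ uses a version that also accounts for ties):
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$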
Notice that the 95% confidence interval does not contain 0. This means that the positive, ordinal relationship between income level and purchasing habits is significant at the 0.05 significance level. The confidence bounds are valid only if the sample size is large. A general guideline is to have a sample size of at least 25 for each degree of freedom in the Pearson chi-square statistic. The asymptotic standard error is used for large samples and is used to calculate the confidence intervals for various measures of association, including the Spearman correlation coefficient. The asymptotic standard error measures the variability of the correlation statistic. In other words, it indicates how much error we can expect if we use the sample Spearman correlation to estimate the population Spearman correlation.
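The confidence limits that the CL option reports come from the usual large-sample formula; for a 95% interval,
$$\text{estimate} \pm 1.96 \times \text{ASE}$$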
In [ ]:
proc freq data=statdata.safety;
tables Region*Unsafe / chisq relrisk expected;
*format Unsafe safefmt.;
title "Association between Unsafe and Region";
run;
title;
proc freq data=statdata.safety;
tables Size*Unsafe / chisq measures cl;
*format Unsafe safefmt.;
title "Association between Unsafe and Size";
run;
title;
In [ ]:
data small_counts;
input Group $ Outcome $ @@;
datalines;
A Good A Good A Good A Poor A Good A Good
B Poor B Poor B Good B Poor B Poor
;
title "Tables with Small Expected Frequencies";
proc freq data=small_counts;
tables Group * Outcome / chisq expected;
run;
Notice that SAS prints a warning when more than 20% of the cells have expected values that are less than 5. In this case, the chi-square value and resulting probability are not valid. The row in the output labeled Continuity Adj. Chi-Square is the Yates' corrected value (2.2275). Under the heading Fisher's Exact Test, you see the two-sided probability equal to .0801. If you were conducting a hypothesis test with alpha set to the usual value of .05, you would fail to reject the null hypothesis that the proportion of good outcomes was equal in the two groups. In practice, you might report that group A seemed to have a higher percentage of good outcomes, although the difference did not reach statistical significance, and suggest that more research is needed (perhaps with a larger sample).
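As an aside, CHISQ already produces Fisher's exact test for a 2x2 table like this one; for larger tables you can request it explicitly with the EXACT statement, sketched here on the same data:

proc freq data=small_counts;
   tables Group * Outcome / expected;
   exact fisher;   *exact test, needed for tables larger than 2x2;
run;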
In [ ]:
/***********************************************************
Macro CHISQ
Purpose: To compute chi-square (and any other valid
PROC FREQ TABLES options) from frequencies in a
2 x 2 table.
Sample Calling Sequences:
%CHISQ(10,20,30,40)
%CHISQ(10,20,30,40,OPTIONS=CMH)
%CHISQ(10,20,30,40,OPTIONS=CHISQ CMH)
************************************************************/
%macro chisq(a,b,c,d,options=chisq);
data chisq;
array cells[2,2] _temporary_ (&a &b &c &d);
do Row = 1 to 2;
do Col = 1 to 2;
Count = cells[Row,Col];
output;
end;
end;
run;
proc freq data=chisq;
tables Row*Col / &options;
weight Count;
run;
%mend chisq;
/*Calling the Macro*/
*%include 'c:\sasprogs\macros\chisq.sas';
%chisq(30,10,15,40)
The AGREE option on the TABLES statement computes the Kappa coefficient, also known as a coefficient of agreement. For example, suppose two psychiatrists are evaluating the same patients, and each one decides whether each patient has attention deficit disorder. By chance alone, the two raters might agree or disagree on any individual diagnosis. For example, if the probability of a positive diagnosis is 50%, they will agree 50% of the time, by chance alone. Coefficient Kappa adjusts for the chance agreement.
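Formally, Kappa adjusts the observed agreement proportion $p_o$ for the agreement $p_e$ expected by chance alone (in the 50/50 example above, $p_e = 0.5 \times 0.5 + 0.5 \times 0.5 = 0.5$):
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$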
A Kappa value of 1 indicates perfect agreement, and a Kappa value of 0 indicates that the agreement is attributable to chance. It is possible for Kappa to be negative; agreement can be less than chance. You see from this output that Kappa is equal to .6927. A two-tailed hypothesis test for Kappa equal to zero yields a p-value <.0001. You can conclude that the agreement between the two raters was significant and not due to chance.
In [ ]:
data kappa;
input Rater1 : $1. Rater2 : $1. @@;
datalines;
Y Y N N Y N N Y Y Y Y Y Y Y N N N N N N Y Y Y N Y Y N N N Y N
N N N N N
Y Y Y Y N N N N Y N Y Y Y Y N N N N N N N N Y Y N N Y Y N N
;
title "Computing Coefficient Kappa";
proc freq data=kappa;
tables Rater1 * Rater2 / agree;
test kappa;
run;
In statistics, Yates' correction for continuity (or Yates' chi-squared test) is used in certain situations when testing for independence in a contingency table. In some cases, Yates' correction may adjust too far, and so its current use is limited.
Using the chi-squared distribution to interpret Pearson's chi-squared statistic requires one to assume that the discrete probability of observed binomial frequencies in the table can be approximated by the continuous chi-squared distribution. This assumption is not quite correct, and introduces some error.
To reduce the error in approximation, Frank Yates, an English statistician, suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the difference between each observed value and its expected value in a 2 × 2 contingency table. This reduces the chi-squared value obtained and thus increases its p-value.
The effect of Yates' correction is to prevent overestimation of statistical significance for small data sets. The formula is chiefly used when at least one cell of the table has an expected count smaller than 5. Unfortunately, Yates' correction may tend to overcorrect, resulting in an overly conservative test that fails to reject the null hypothesis when it should (a type II error). For this reason, some authors suggest that Yates' correction is unnecessary even with quite low total counts, such as:
$$\sum_{i=1}^{N} O_i = 20$$
The following is Yates' corrected version of Pearson's chi-squared statistic:
$$\chi_{\text{Yates}}^{2} = \sum_{i=1}^{N} \frac{(|O_i - E_i| - 0.5)^2}{E_i}$$
where:
$O_{i}$ = an observed frequency
$E_{i}$ = an expected (theoretical) frequency, asserted by the null hypothesis
$N$ = number of distinct events
In [ ]:
*Variables in the Store data set:
   Region
   Advertising
   Gender
   Book_Sales
   Music_Sales
   Electronics_Sales
   Total_Sales
;
proc format;
value yesno 1 = 'Yes'
            0 = 'No';
run;
data Store;
length Region $ 5;
call streaminit(57676);
do Transaction = 1 to 200;
R = ceil(rand('uniform')*10);
select(R);
when(1) Region = 'East';
when(2) Region = 'West';
when(3) Region = 'North';
when(4) Region = 'South';
otherwise;
end;
Advertising = rand('bernoulli',.6);
if rand('uniform') lt .6 then Gender = 'Female';
else Gender = 'Male';
Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
+ 30*Advertising,10)) ;
Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
+ 30*(Region = 'East' and Gender = 'Male')
- 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
+ 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South')
+ 75*(Region = 'West'),10));
Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
output;
end;
drop R;
format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
Advertising yesno.;
run;
/*title "Listing of Store";*/
/*proc print data=store heading=h;*/
/*run;*/
/*proc univariate data=store;*/
/* var Book_Sales -- Total_Sales;*/
/* histogram;*/
/*run;*/
/**/
/*title "Scatter Matrix for Store Variables";*/
/*proc sgscatter data=store;*/
/* matrix Book_Sales -- Total_Sales / group = Gender;*/
/*run;*/
/**/
/*proc sgplot data=store;*/
/* scatter x=Book_Sales y=Total_Sales / group=Gender;*/
/*run;*/
proc rank data=store out=median_sales groups=2;
var Total_Sales;
ranks Sales_Group;
run;
proc format;
value sales 0 = 'Low'
1 = 'High';
run;
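As a quick, optional sanity check that the median split worked, you can tabulate the new group variable (this uses the median_sales data set and sales format just created):

proc freq data=median_sales;
   tables Sales_Group;          *expect roughly 50% in each group;
   format Sales_Group sales.;
run;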
/*proc logistic data=median_sales order=formatted;*/
/* class Gender(param=ref ref='Male');*/
/* model Sales_Group = Gender;*/
/* format Sales_Group sales.;*/
/*quit;*/
/**/
/*proc logistic data=median_sales order=formatted;*/
/* class Gender(param=ref ref='Male')*/
/* Advertising (param=ref ref='No');*/
/* model Sales_Group = Gender Advertising;*/
/* format Sales_Group sales.;*/
/*quit;*/
*Create test data set;
libname example 'c:\books\statistics by example';
data example.Blood_Pressure;
call streaminit(37373);
do Drug = 'Placebo','Drug A','Drug B';
do i = 1 to 20;
Subj + 1;
if mod(Subj,2) then Gender = 'M';
else Gender = 'F';
SBP = rand('normal',130,10) +
7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
SBP = round(SBP,2);
DBP = rand('normal',80,5) +
3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
DBP = round(DBP,2);
if Subj in (5,15,25,55) then call missing(SBP, DBP);
if Subj in (4,18) then call missing(Gender);
output;
end;
end;
drop i;
run;
/*title "Listing of the first 25 observations from Blood_Pressure";*/
/*proc print data=example.Blood_Pressure(obs=25) noobs;*/
/* var Subj Drug SBP DBP;*/
/*run;*/
data exercise;
call streaminit(7657657);
do Subj = 1 to 50;
Age = round(rand('normal',50,15));
Pushups = abs(int(rand('normal',40,10) - .30*age));
Rest_Pulse = round(rand('normal',50,8) + .35*age);
Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
Run_Pulse = round(max_pulse - rand('normal',3,3));
output;
end;
run;
*Data set for a paired t-test example;
data reading;
input Subj Before After @@;
datalines;
1 100 110 2 120 121 3 130 140 4 90 110 5 85 92
6 133 137 7 210 209 8 155 179
;
/*title "Listing of Data Set READING";*/
/*proc print data=reading noobs;*/
/*run;*/
*Data set that violates assumptions for a t-test;
data salary;
call streaminit(57575);
do Subj = 1 to 50;
do Gender = 'M','F';
Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
output;
end;
end;
run;
/*proc univariate data=salary;*/
/* class Gender;*/
/* id Subj;*/
/* var Income;*/
/* histogram Income;*/
/*run;*/
*Data set risk for logistic regression example;
proc format;
value yesno 1 = 'Yes'
0 = 'No';
run;
data Risk;
call streaminit(13579);
length Age_Group $ 7;
do i = 1 to 250;
do Gender = 'F','M';
Age = round(rand('uniform')*30 + 50);
if missing(Age) then Age_Group = ' ';
else if Age lt 60 then Age_Group = '1:< 60';
else if Age le 70 then Age_Group = '2:60-70';
else Age_Group = '3:71+';
Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
Chol = round(Chol);
Score = .3*chol + age + 8*(Gender eq 'M');
Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
output;
end;
end;
keep Gender Age Age_Group chol Heart_Attack;
format Heart_Attack yesno.;
run;
/*title "Listing of first 100 observations from RISK";*/
/*proc print data=risk(obs=100);*/
/*run;*/