Statistics Using SAS


In [1]:
%let path=/folders/myfolders/ECST131;
libname statdata "&path";

/*Go to the bottom to input the dataset*/


Out[1]:

11   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
12
13 %let path=/folders/myshortcuts/_myfolders/ECST131;
14 libname statdata "&path";
NOTE: Library STATDATA does not exist.
15
16 /*Go to the bottom to input the dataset*/
17 ods html5 close;ods listing;

18

Basic Statistical Concepts

A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. The sample should be representative of the population. You can obtain a representative sample by collecting a simple random sample.

Parameters are numerical values that summarize characteristics of a population. Parameter values are typically unknown and are represented by Greek letters. Statistics summarize characteristics of a sample. You use letters from the English alphabet to represent sample statistics. You can measure characteristics of your sample and provide numerical values that summarize those characteristics.

Variables are classified according to their characteristics. They can be quantitative or categorical. Data that consists of counts or measurements is quantitative. Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data takes on only a finite, or countable, number of values. Continuous data has an infinite number of values and no breaks or jumps.

Categorical or attribute data consists of variables that denote groupings or labels. There are two main types: nominal and ordinal. A nominal categorical variable exhibits no ordering within its groups or categories. With ordinal categorical variables, the observed levels of the variable can be ordered in a meaningful way that implies differences due to magnitude.

There are two scales of measurement for categorical variables: nominal and ordinal. There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered and has a sensible spacing of observations such that differences between measurements are meaningful. However, interval scales lack the ability to calculate ratios between numbers on the scale because there is no true zero point. Data on a ratio scale includes a true zero point and can therefore accurately indicate the ratio of difference between two spaces on the measurement scale.

The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. Multivariate analysis examines two or more variables at the same time, in order to understand the relationships among them.

PROC MEANS DATA=SAS-data-set <options>;       
CLASS variables;
VAR variables;
RUN;
Option Description
N Number of nonmissing observations
NMISS Number of observations with missing values
MEAN Arithmetic mean
STD Standard deviation
STDERR Standard error
MIN Minimum value
MAX Maximum value
MEDIAN Median
MAXDEC= Maximum number of decimal places to display
CLM 95% confidence limit on the mean
CV Coefficient of variation
PROC UNIVARIATE DATA=SAS-data-set <options>;      
    VAR variables;
    ID variables;
    HISTOGRAM variables </options>;
    PROBPPLOT variables </options>;
    INSET keywords </options>;
RUN;

PROC SGPLOT DATA=SAS-data-set<options>;
    DOT category-variable </option(s)>;
    HBAR category-variable </option(s)>;
    VBAR category-variable </option(s)>;
    HBOX response-variable </option(s)>;
    VBOX response-variable </option(s)>;
    HISTOGRAM response-variable </option(s)>;
    SCATTER X=variable Y=variable </option(s)>;
    NEEDLE X=variable Y=numeric-variable </option(s)>;
    REG X=numeric-variable Y=numeric-variable </option(s)>;
    REFLINE variable | value-1 <... value-n> </option(s)>;
RUN;

ODS GRAPHICS ON <options>;
    statistical procedure code
ODS GRAPHICS OFF;

Sample Programs


In [2]:
proc print data=statdata.testscores;
run;


Out[2]:

20   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
21
22 proc print data=statdata.testscores;
ERROR: File STATDATA.TESTSCORES.DATA does not exist.
23 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

24 ods html5 close;ods listing;

25

Using Proc Freq

Plot Name Table type Require option on TABLES statement
AGREEPLOT two-way AGREE
CUMFREQPLOT one-way
DEVIATIONPLOT one-way CHISQ
FREQPLOT any request
KAPPAPLOT three-way AGREE
ODDSRATIOPLOT hx2x2 MEASURES or RELRISK
RELREISKPLOT hx2x2 MEASURES or RELRISK
RISKDIFFPLOT hx2x2 RISKDIFF
WTKAPPAPLOT hxrxr (r>2) AGREE

In [3]:
proc format;
value $gender 'M' = 'Male'
'F' = 'Female';
value score low-<1100 = 'Low'
            1100-<1400 = 'Middle'
            1400-High = 'High';
run;

proc freq data=statdata.testscores ORDER=FORMATTED;
    tables gender*SATScore / noprecent norow nocol CHISQ 
           plots = freqplot (twoway = stacked);
           *plots = freqplot (twoway = grouphorizontal);
    format Gender $gender. SATScore score.;
run;


Out[3]:

27   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
28
29 proc format;
30 value $gender 'M' = 'Male'
NOTE: Format $GENDER has been output.
31 'F' = 'Female';
32 value score low-<1100 = 'Low'
33 1100-<1400 = 'Middle'
NOTE: Format SCORE has been output.
34 1400-High = 'High';
35 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds

36
37 proc freq data=statdata.testscores ORDER=FORMATTED;
ERROR: File STATDATA.TESTSCORES.DATA does not exist.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
38 tables gender*SATScore / noprecent norow nocol CHISQ
---------
1
WARNING 1-322: Assuming the symbol NOPERCENT was misspelled as noprecent.

39 plots = freqplot (twoway = stacked);
40 *plots = freqplot (twoway = grouphorizontal);
41 format Gender $gender. SATScore score.;
42 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

43 ods html5 close;ods listing;

44

Using PROC MEANS


In [4]:
proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
           n mean median std sterr var q1 q3 clm t;
   class Gender;
   var SATScore;
   title 'Selected Descriptive Statistics for SAT Scores';
run;
title;


Out[4]:

46   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
47
48 proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
ERROR: File STATDATA.TESTSCORES.DATA does not exist.
49 n mean median std sterr var q1 q3 clm t;
-----
1
WARNING 1-322: Assuming the symbol STDERR was misspelled as sterr.

ERROR: No data set open to look up variables.
50 class Gender;
ERROR: No data set open to look up variables.
51 var SATScore;
52 title 'Selected Descriptive Statistics for SAT Scores';
53 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

54 title;
55 ods html5 close;ods listing;

56

Using proc sgplot, proc univariate


In [5]:
proc univariate data=statdata.testscores ALPHA=0.05;
   var SATScore;
   id idnumber;
   histogram SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis;
   cdfplot SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;


Out[5]:

58   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
59
60 proc univariate data=statdata.testscores ALPHA=0.05;
ERROR: File STATDATA.TESTSCORES.DATA does not exist.
61 var SATScore;
62 id idnumber;
ERROR: No data set open to look up variables.
63 histogram SATScore / normal(mu=est sigma=est);
64 inset skewness kurtosis / position=ne;
ERROR: No data set open to look up variables.
65 probplot SATScore / normal(mu=est sigma=est);
66 inset skewness kurtosis;
ERROR: No data set open to look up variables.
67 cdfplot SATScore / normal(mu=est sigma=est);
68 inset skewness kurtosis;
69 title 'Descriptive Statistics Using PROC UNIVARIATE';
70 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE UNIVARIATE used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

71 title;
72 ods html5 close;ods listing;

73

In [6]:
proc sgplot data = statdata.testscores ;
    *dot SATScore;
    *vbar SATscore;
    *scatter y=SATScore x=Gender;
    refline 1200 / axis=x lineattrs =(color=blue);
    /*hbox SATScore / group=Gender datalabel = IDNumber*/
    hbox SATScore / category=Gender datalabel = IDNumber missing;
    format IDnUMBER 8.;
run;


Out[6]:

75   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
76
77 proc sgplot data = statdata.testscores ;
ERROR: File STATDATA.TESTSCORES.DATA does not exist.
78 *dot SATScore;
79 *vbar SATscore;
80 *scatter y=SATScore x=Gender;
81 refline 1200 / axis=x lineattrs =(color=blue);
82 /*hbox SATScore / group=Gender datalabel = IDNumber*/
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
83 hbox SATScore / category=Gender datalabel = IDNumber missing;
84 format IDnUMBER 8.;
85 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

86 ods html5 close;ods listing;

87

In [7]:
title "Using PROC SGPLOT to Produce a Scatter Plot";
proc sgplot data=store;
scatter x=Book_Sales y=Music_Sales;
run;
quit;


Out[7]:

89   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
90
91 title "Using PROC SGPLOT to Produce a Scatter Plot";
92 proc sgplot data=store;
ERROR: File WORK.STORE.DATA does not exist.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
93 scatter x=Book_Sales y=Music_Sales;
94 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

95 quit;
96 ods html5 close;ods listing;

97

In [8]:
title "Using PROC SGPLOT to Produce a Scatter Plot";
title2 "Adding Gender Information to the Plot";
proc sgplot data=store;
scatter x=Book_Sales y=Music_Sales / group=Gender;
run;
quit;


Out[8]:

99   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
100
101 title "Using PROC SGPLOT to Produce a Scatter Plot";
102 title2 "Adding Gender Information to the Plot";
103 proc sgplot data=store;
ERROR: File WORK.STORE.DATA does not exist.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
104 scatter x=Book_Sales y=Music_Sales / group=Gender;
105 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

106 quit;
107 ods html5 close;ods listing;

108

In [9]:
title "Demonstrating the PLOT Statement of PROC SGSCATTER";
proc sgscatter data=store;
plot Book_Sales * Music_Sales Total_Sales * Electronics_Sales;
run;


Out[9]:

110  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
111
112 title "Demonstrating the PLOT Statement of PROC SGSCATTER";
113 proc sgscatter data=store;
ERROR: File WORK.STORE.DATA does not exist.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
114 plot Book_Sales * Music_Sales Total_Sales * Electronics_Sales;
115 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGSCATTER used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

116 ods html5 close;ods listing;

117

In [10]:
title "Switching Axes and Adding a GROUP= Option";
proc sgscatter data=store;
compare Y=Total_Sales
X=(Book_Sales Music_Sales Electronics_Sales) / group=Gender;
run;


Out[10]:

119  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
120
121 title "Switching Axes and Adding a GROUP= Option";
122 proc sgscatter data=store;
ERROR: File WORK.STORE.DATA does not exist.
123 compare Y=Total_Sales
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
124 X=(Book_Sales Music_Sales Electronics_Sales) / group=Gender;
125 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGSCATTER used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

126 ods html5 close;ods listing;

127

The left and right sides of the box represent the 1st and 3rd quartiles (sometimes abbreviated Q1 and Q3). The vertical bar inside the box is the median, and the diamond represents the mean. The lines extending from the left and right side of the box (called whiskers) represent data values that are less than 1.5 times the interquartile range from Q1 and Q3.

The option DATALABEL= lets you select a variable to identify specific outliers. If you use the DATALABEL option without naming a label variable, SGPLOT uses the numerical value of the response variable (SBP in this example) to label the outliers.

T test

PROC TTEST DATA=SAS-data-set <options eg sides = U>;
    CLASS variables;
    VAR variables;
    PARIED variable1*variable2;
RUN;

One-sample t-test


In [11]:
title "Conducting a One-Sample t-test Using PROC TTEST";
proc ttest data=exercise h0=50 sides=2 alpha=.05;
var Age;
run;


Out[11]:

129  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
130
131 title "Conducting a One-Sample t-test Using PROC TTEST";
132 proc ttest data=exercise h0=50 sides=2 alpha=.05;
ERROR: File WORK.EXERCISE.DATA does not exist.
133 var Age;
134 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE TTEST used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

135 ods html5 close;ods listing;

136

In [12]:
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Central_Air
                    connect=mean;
   title "Sale Price Differences across Central Air";
run;
 
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Fireplaces
                    connect=mean;
   title "Sale Price Differences across Fireplaces";
run;
 
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Heating_QC
                    connect=mean;
   title "Sale Price Differences across Heating Quality";
run;


Out[12]:

138  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
139
140 proc sgplot data=statdata.ameshousing3;
ERROR: File STATDATA.AMESHOUSING3.DATA does not exist.
ERROR: No data set open to look up variables.
141 vbox SalePrice / category=Central_Air
ERROR: No data set open to look up variables.
142 connect=mean;
143 title "Sale Price Differences across Central Air";
144 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

145
146 proc sgplot data=statdata.ameshousing3;
ERROR: File STATDATA.AMESHOUSING3.DATA does not exist.
ERROR: No data set open to look up variables.
147 vbox SalePrice / category=Fireplaces
ERROR: No data set open to look up variables.
148 connect=mean;
149 title "Sale Price Differences across Fireplaces";
150 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

151
152 proc sgplot data=statdata.ameshousing3;
ERROR: File STATDATA.AMESHOUSING3.DATA does not exist.
ERROR: No data set open to look up variables.
153 vbox SalePrice / category=Heating_QC
ERROR: No data set open to look up variables.
154 connect=mean;
155 title "Sale Price Differences across Heating Quality";
156 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

157 ods html5 close;ods listing;

158

Two-sample t-test

The two-sample t-test is a hypothesis test for answering questions about the means of two populations. This test enables you to examine the differences between populations for one or more continuous variables. You can assess whether the means of the two populations are statistically different from each other. The null hypothesis for the two-sample t-test is that the means for the two groups are equal, or that $μ1 - μ2$ equals 0.

When you compare the means of two populations using a two-sample t-test, you make three assumptions:

  • The data contains independent observations,
  • The distributions of the two populations are normal, and
  • The variances in these normal distributions are equal.

To evaluate the assumption of equal variances in the two populations, you can use the F-test for equality of variances. To test the hypothesis, you calculate the F statistic, which is the ratio of the maximum sample variance of the two groups to the minimum sample variance of the two groups. By construction, the F statistic is always greater than or equal to 1.


In [13]:
ods graphics on;
proc ttest data=statdata.testscores 
           /*plots(shownull)=interval;*/
           plots(unpack shownull) = all;
           /*SIDES=U;*/
           /*SIDES=L;*/
   class Gender;
   var SATScore;
   title "Two-Sample t=Test Comparing Girls to Boys";
run;
ods graphics off;


Out[13]:

160  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
161
162 ods graphics on;
163 proc ttest data=statdata.testscores
164 /*plots(shownull)=interval;*/
ERROR: File STATDATA.TESTSCORES.DATA does not exist.
165 plots(unpack shownull) = all;
166 /*SIDES=U;*/
167 /*SIDES=L;*/
168 class Gender;
169 var SATScore;
170 title "Two-Sample t=Test Comparing Girls to Boys";
171 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE TTEST used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

172 ods graphics off;
173 ods html5 close;ods listing;

174

In general, Pool variance is

$$s_{p}^{2}={\frac {\sum _{{i=1}}^{k}(n_{i}-1)s_{i}^{2}}{\sum _{{i=1}}^{k}(n_{i}-1)}}$$

Welch's t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. Welch's t-test is an adaptation of Student's t-test,[1] that is more reliable when the two samples have unequal variances and unequal sample sizes.[2] These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.

$$t\quad =\quad {\;\overline {X}_{1}-\overline {X}_{2}\; \over {\sqrt {\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\quad }}}\,$$

where ${\displaystyle {\overline {X}}_{1}}$, $ s_{1}^{2}$ and $ N_{1}$ are the 1st sample mean, population standard deviation and sample size, respectively. Unlike in Student's t-test, the denominator is not based on a pooled variance estimate.

The degrees of freedom ${\displaystyle \nu }$ associated with this variance estimate is approximated using the Welch–Satterthwaite equation:

$${\displaystyle \nu \quad \approx \quad {{\left(\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\;\right)^{2}} \over {\quad {s_{1}^{4} \over N_{1}^{2}\nu _{1}}\;+\;{s_{2}^{4} \over N_{2}^{2}\nu _{2}}\quad }}} \nu \quad \approx \quad {{\left(\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\;\right)^{2}} \over {\quad {s_{1}^{4} \over N_{1}^{2}\nu _{1}}\;+\;{s_{2}^{4} \over N_{2}^{2}\nu _{2}}\quad }}$$

Here ${\displaystyle \nu _{1}=N_{1}-1} {\displaystyle \nu _{1}=N_{1}-1}$, the degrees of freedom associated with the first variance estimate. ${\displaystyle \nu _{2}=N_{2}-1} {\displaystyle \nu _{2}=N_{2}-1}$, the degrees of freedom associated with the 2nd variance estimate.

In the above example,

$t=\frac{{\bar{X}}_{1}-{\bar{X}}_{2}}{s_{p}{\sqrt{2/n}}}$

$s_{p}={\sqrt{\frac {s_{X_{1}}^{2}+s_{X_{2}}^{2}}{2}}}$

$\sqrt{\frac{((157.4^{2}+130.9^{2})}{2}} = 144.76$

Aso,

${\text{SD}}_{\bar {x}}\ ={\frac {\sigma }{\sqrt {n}}}$

For female, ${\text{SD}}_{{Female}}\ = \frac{157.4}{\sqrt{40}}$

The SHOWNULL option displays a vertical reference line at the null hypothesis value (zero by default) on the plot of mean differences. That plot shows the value of the difference between the sample means and the confidence interval (95% by default) around the value. If the confidence interval includes the null hypothesis value, the implication is that the difference is not statistically significant at your chosen alpha level (ALPHA=0.05 by default).

The decision to choose the pooled or Satterthwaite values is somewhat controversial. Some statisticians recommend that you look at the p-value from the test for equality of variances to decide which t- and p-value to use. Others think that this is somewhat circular reasoning and that you should decide ahead of time which assumption is most reasonable.


In [14]:
title "Demonstrating a Paired T-test";
proc ttest data=reading; /*By default, it prints (summaryplot agreementplot profilesplot);*/
paired After*Before;
run;

/*You see that the reading program did indeed increase reading speeds (by an average of
9.375 words per minute), with a p-value of .0194.*/


Out[14]:

176  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
177
178 title "Demonstrating a Paired T-test";
179 proc ttest data=reading; /*By default, it prints (summaryplot agreementplot profilesplot);*/
ERROR: File WORK.READING.DATA does not exist.
180 paired After*Before;
181 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE TTEST used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

182
183 /*You see that the reading program did indeed increase reading speeds (by an average of
184 9.375 words per minute), with a p-value of .0194.*/
185 ods html5 close;ods listing;

186

ANOVA

ANOVA is a type of regression where independent variables are nominal variables. Nominal variable is one that have two or more levels, but there is no intrinsic ordering for the levels. ANOVA stands for Analysis of Variance. It is used to compare more than two means. As the name suggest, it estimate an variance and based on the variance, it allow us to make a conclusion about the comparison of means. It is true that we can also use t-test to compare more than two means. But, t-test will increase the type-I-error when t-test do multiple comparison on the same data. Depends on the number of independent variable, we can classify ANOVA in to different types.

PROC GLM DATA=SAS-data-set<options>;
    CLASS variable(s);
    MODEL dependents=independents </options>;
    MEANS effects </options>;
    LSMEANS effects </options>;
    STORE <OUT=>item-store-name </ LABEL='label'>;
RUN;
QUIT;

For one-way ANOVA, one can use:

PROC ANOVA;
CLASS variable-list;
MODEL dependent = effects;

ANOVA cautions

Balanced experiments (those with an equal sample size for each treatment) are relatively easy to interpret; Unbalanced experiments offer more complexity. For single factor (one way) ANOVA, the adjustment for unbalanced data is easy, but the unbalanced analysis lacks both robustness and power. For more complex designs the lack of balance leads to further complications. "The orthogonality property of main effects and interactions present in balanced data does not carry over to the unbalanced case. This means that the usual analysis of variance techniques do not apply. Consequently, the analysis of unbalanced factorials is much more difficult than that for balanced designs." In the general case, "The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered."[43] The simplest techniques for handling unbalanced data restore balance by either throwing out data or by synthesizing missing data. More complex techniques use regression.

ANOVA can be thought of as linear regression on dummy variables.

It is only in the interpretation of the model that a distinction is made.

You calculate the variability between the means and the variability of observations within each group, and then calculate a ratio between these two measurements. If the between-group variability is significantly larger than the within-group variability, you reject the null that all of the group means are equal. So, you partition out the variability using sums of squares. For ANOVA, you calculate three types of sums of squares:

  • Between Group Variation (model sums of squares or SSM,
  • Within Group Variation, and
  • Total Variation
  • The first assumption is one of independent observations
  • The second assumption is that the error terms are normally distributed. You verify this assumption by examining diagnostic plots of the residuals.
  • The third assumption is that the error terms have equal variances across treatments.

In [15]:
proc print data=statdata.mggarlic (obs=10);
   title "Partial Listing of Garlic Data";
run;

proc means data=statdata.mggarlic printalltypes maxdec=3;
    var BulbWt;
    class Fertilizer;
    title "Descriptive Statistics of Garlic Weight";
run;

ods graphics on / width=700;
proc sgplot data=statdata.mggarlic;
    vbox BulbWt / category=Fertilizer connect= mean datalabel=BedID;
    format BedID 5.;
    title "Box Plots of Garlic Weight";
run;
title;


Out[15]:

188  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
189
190 proc print data=statdata.mggarlic (obs=10);
ERROR: File STATDATA.MGGARLIC.DATA does not exist.
191 title "Partial Listing of Garlic Data";
192 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

193
194 proc means data=statdata.mggarlic printalltypes maxdec=3;
ERROR: File STATDATA.MGGARLIC.DATA does not exist.
ERROR: No data set open to look up variables.
195 var BulbWt;
ERROR: No data set open to look up variables.
196 class Fertilizer;
197 title "Descriptive Statistics of Garlic Weight";
198 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

199
200 ods graphics on / width=700;
201 proc sgplot data=statdata.mggarlic;
ERROR: File STATDATA.MGGARLIC.DATA does not exist.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
202 vbox BulbWt / category=Fertilizer connect= mean datalabel=BedID;
203 format BedID 5.;
204 title "Box Plots of Garlic Weight";
205 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

206 title;
207 ods html5 close;ods listing;

208

In [16]:
ods graphics on / width=700;

proc glm data=statdata.mggarlic plots(only)=diagnostics ;
   class Fertilizer;
   model BulbWt=Fertilizer;  /****/
   means Fertilizer / hovtest;
   title "Testing for Equality of Means with PROC GLM";
run;
quit;
title;


Out[16]:

210  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
211
212 ods graphics on / width=700;
213
214 proc glm data=statdata.mggarlic plots(only)=diagnostics ;
ERROR: File STATDATA.MGGARLIC.DATA does not exist.
215 class Fertilizer;
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
216 model BulbWt=Fertilizer; /****/
217 means Fertilizer / hovtest;
218 title "Testing for Equality of Means with PROC GLM";
219 run;
ERROR: A MODEL statement must be given.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE GLM used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds

220 quit;
221 title;
222 ods html5 close;ods listing;

223

In [17]:
proc anova data=statdata.mggarlic;
    class Fertilizer;
    model BulbWt=Fertilizer;
run;


Out[17]:

225  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
226
227 proc anova data=statdata.mggarlic;
ERROR: File STATDATA.MGGARLIC.DATA does not exist.
228 class Fertilizer;
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
229 model BulbWt=Fertilizer;
230 run;
231 ods html5 close;ods listing;

232
Interpretation of results

You want to see a random scatter of residuals above and below 0 for the four fertilizer groups. The plot looks good.

You use Levene's Test for Homogeneity to formally test the equal variance assumption. Because the p-value of 0.4173 is greater than 0.05, you fail to reject the null and conclude that the variances are equal. This is good. You verified the equal variance assumption.

Because the residuals follow the diagonal reference line fairly closely, you can say that they are approximately normal.

The mean square error is $0.00078$, which is an estimate of the population variance. SAS calculates MSE by dividing the error sum of squares by the error Degrees of Freedom ($0.00457996/3$), which gives you the average sum of squares for the error. SAS calculates the F-statistic by dividing the MSM by the MSE ($0.00078/0.00152665$). The F statistic is 1.96. Because the corresponding p-value of $0.1432$ is greater than 0.05, you can conclude that there is not a statistically significant difference between the mean bulb weights for the four fertilizers.

It's important for you to realize that the one-way ANOVA is an omnibus test statistic and cannot tell you which specific groups are significantly different from each other, only that at least two groups are different. To determine which specific groups differ from each other, you need to use a post-hoc test.

All in all, the PROC GLM output supports your conclusion that there's not a statistically significant difference between the mean bulb weights for the four fertilizers. (F-test);

Levene's test

The HOVTEST=BARTLETT option specifies Bartlett’s test (Bartlett; 1937), a modification of the normal-theory likelihood ratio test.

The HOVTEST=BF option specifies Brown and Forsythe’s variation of Levene’s test (Brown and Forsythe; 1974).

The HOVTEST=LEVENE option specifies Levene’s test (Levene; 1960), which is widely considered to be the standard homogeneity of variance test.

In statistics, Levene's test is an inferential statistic used to assess the equality of variances for a variable calculated for two or more groups. Some common statistical procedures assume that variances of the populations from which different samples are drawn are equal. Levene's test assesses this assumption. It tests the null hypothesis that the population variances are equal (called homogeneity of variance or homoscedasticity). If the resulting p-value of Levene's test is less than some significance level (typically 0.05), the obtained differences in sample variances are unlikely to have occurred based on random sampling from a population with equal variances. Thus, the null hypothesis of equal variances is rejected and it is concluded that there is a difference between the variances in the population.

The test statistic, $W$, is defined as follows:

$${\displaystyle W={\frac {(N-k)}{(k-1)}}{\frac {\sum _{i=1}^{k}N_{i}(Z_{i\cdot }-Z_{\cdot \cdot })^{2}}{\sum _{i=1}^{k}\sum _{j=1}^{N_{i}}(Z_{ij}-Z_{i\cdot })^{2}}},}$$

where

$k$ is the number of different groups to which the sampled cases belong,

$N_{i}$ is the number of cases in the $ith$ group,

$N$ is the total number of cases in all groups,

$Y_{{ij}}$ is the value of the measured variable for the $jth$ case from the $ith$ group,

${\displaystyle Z_{ij}={\begin{cases}|Y_{ij}-{\bar {Y}}_{i\cdot }|,&{\bar {Y}}_{i\cdot }{\text{ is a mean of the }}i{\text{-th group}},\\|Y_{ij}-{\tilde {Y}}_{i\cdot }|,&{\tilde {Y}}_{i\cdot }{\text{ is a median of the }}i{\text{-th group}}.\end{cases}}}$

ANOVA with Data from a Randomized Block Design

Along with the three original ANOVA assumptions of independent observations, normally distributed errors, and equal variances across treatments, you make two more assumptions when you include a blocking factor in the model. First, you assume that the treatments are randomly assigned within each block. In the T-cell count example, this means that you assume the three medications are randomly assigned to each of the three age groups. Next, you assume that the effects of the treatment factor are constant across the levels of the blocking factor, meaning that the effects of the treatment factor don't depend on the block they are in. When the effects of the treatment factor are not constant across the levels of another variable, it's called interaction. But when you use a randomized block design, you assume that the effects are the same within each block. In other words, you assume that there are no interactions with the blocking variable

The farmers divide the farm into eight sectors, each of which has four beds, and in each of the four beds, they randomly assign each of the four fertilizers. An experimental design like this is often referred to as a randomized block design. As you can see in this ANOVA model, Sector is the blocking variable.


In [18]:
proc print data=statdata.mggarlic_block (obs=10);
run;

proc glm data=statdata.mggarlic_block plots(only)=diagnostics;
     class Fertilizer Sector;
     model BulbWt=Fertilizer Sector;
     title "ANOVA for Randomized Block Design";
run;
quit;
title;


Out[18]:

234  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
235
NOTE: PROCEDURE ANOVA used (Total process time):
real time 0.18 seconds
cpu time 0.13 seconds

236 proc print data=statdata.mggarlic_block (obs=10);
ERROR: File STATDATA.MGGARLIC_BLOCK.DATA does not exist.
237 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

238
239 proc glm data=statdata.mggarlic_block plots(only)=diagnostics;
ERROR: File STATDATA.MGGARLIC_BLOCK.DATA does not exist.
240 class Fertilizer Sector;
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
241 model BulbWt=Fertilizer Sector;
242 title "ANOVA for Randomized Block Design";
243 run;
244 quit;
NOTE: PROCEDURE GLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

245 title;
246
247 ods html5 close;ods listing;

248

In [19]:
proc glm data=statdata.ads1 plots(only)=diagnostics;
   class Ad Area;
   model Sales=Ad Area;
   title 'ANOVA for Randomized Block Design';
run;
quit;
title;


Out[19]:

250  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
251
252 proc glm data=statdata.ads1 plots(only)=diagnostics;
ERROR: File STATDATA.ADS1.DATA does not exist.
253 class Ad Area;
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
254 model Sales=Ad Area;
255 title 'ANOVA for Randomized Block Design';
256 run;
257 quit;
NOTE: PROCEDURE GLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

258 title;
259 ods html5 close;ods listing;

260

ANOVA Post Hoc Tests

A pairwise comparison examines the difference between two treatment means. If your ANOVA results suggest that you reject the null hypothesis that the means are equal across groups, you can conduct multiple pairwise comparisons in a post hoc analysis to learn which means differ.

The chance that you make a Type I error increases each time you conduct a statistical test. The comparisonwise error rate, or CER, is the probability of a Type I error on a single pairwise test. The experimentwise error rate, or EER, is the probability of making at least one Type I error when performing all of the pairwise comparisons. The EER increases as the number of pairwise comparisons increases.

With a statistically conservative multiple comparison method, such as the Tukey or Dunnett method, you control for the EER, so there's a tendency to find fewer significant differences than might otherwise be found. When you make no adjustments for multiple comparisons, you are likely to find more significant differences than might otherwise be found. If only the comparisonwise error rate is controlled, the overall risk of a Type I error across all the comparisons is increased (and therefore the risk of Type II error is decreased), so the test might find more significant differences than would otherwise be found.

The Tukey method compares all possible pairs of means, so it can only be used when you make pairwise comparisons (See Tukey's range test) Tukey's procedure is only applicable for pairwise comparisons. It assumes independence of the observations being tested, as well as equal variation across observations (homoscedasticity). The procedure calculates for each pair the studentized range statistic: ${\frac {Y_{{A}}-Y_{{B}}}{SE}}$ where $ Y_{{A}}$ is the larger of the two means being compared, ${\displaystyle Y_{B}} Y_{{B}}$ is the smaller, and $SE$ is the standard error of the data in question. Tukey's test is essentially a Student's t-test, except that it corrects for family-wise error-rate.

Dunnett's method is a specialized multiple comparison test that enables you to compare a single control group to all other groups.


In [20]:
ods graphics on / width=700;
ods trace on;

proc glm data=statdata.mggarlic_block;
    class Fertilizer Sector;
    model BulbWt=Fertilizer Sector;             /*****/
    lsmeans Fertilizer / pdiff=all adjust=tukey;
    lsmeans Fertilizer / pdiff=controlu('4') adjust=dunnett;
    lsmeans Fertilizer / pdiff=all adjust=t;
    title "Garlic Data: Multiple Comparisons";
run;
quit;
title;


Out[20]:

262  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
263
264 ods graphics on / width=700;
265 ods trace on;
266
267 proc glm data=statdata.mggarlic_block;
ERROR: File STATDATA.MGGARLIC_BLOCK.DATA does not exist.
268 class Fertilizer Sector;
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
269 model BulbWt=Fertilizer Sector; /*****/
270 lsmeans Fertilizer / pdiff=all adjust=tukey;
ERROR: Variable Fertilizer not found.
NOTE: The previous statement has been deleted.
271 lsmeans Fertilizer / pdiff=controlu('4') adjust=dunnett;
ERROR: Variable Fertilizer not found.
NOTE: The previous statement has been deleted.
272 lsmeans Fertilizer / pdiff=all adjust=t;
273 title "Garlic Data: Multiple Comparisons";
274 run;
ERROR: A MODEL statement must be given.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE GLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

275 quit;
276 title;
277 ods html5 close;ods listing;

278

You can use the ODS TRACE statement in your program. When you add the ODS TRACE statement, SAS writes a trace record to the log that includes information about each output object, such as the path for each object and the label for each object. When you check the log, you can see a list of the ODS output objects and information about each one. Now that you know the names, you can decide what output to specify in the ODS SELECT statement. Let's turn tracing off, and then specify only the output we want to see. Now let's submit this program.

The only significant pairwise difference is between fertilizer 1 and fertilizer 4. The p-value of 0.0144 is less than your alpha, meaning that the bulb weights of these two fertilizers are significantly different from one another.

Because you performed one-sided, upper-tailed hypothesis tests of each of the organic fertilizers versus the chemical fertilizer, you only see the upper shaded region with the UDL in your plot. Remember that the bottom horizontal line is the least squares mean of your control group. The vertical line for fertilizer 1 is the only line that extends past the UDL.

Finally, let's examine the output that corresponds to your third LSMEANS statement. These t-tests do not adjust for multiple comparisons, and are therefore more liberal than tests that do control for the EER. Take a moment to look at the p-values. You might notice that the p-values in this table are smaller than those in the Tukey table. In fact, which additional significant pairwise difference does this method show? It shows that fertilizer 1 is significantly different from fertilizer 2 with a p-value of 0.0195. This is in addition to fertilizers 1 and 4 being statistically different with a p-value of 0.0029. Notice also that the comparison between fertilizers 3 and 4 is nearly significant. So with this test, there's a tendency to find more significant pairwise differences than might actually exist.

Lastly, let's take a look at the diffogram. Again, this reinforces what you know about fertilizers 1 and 4 and fertilizers 1 and 2. Using these multiple comparison techniques gives you options. If you feel strongly about controlling the EER, you shouldn't use the pairwise t-test results and should instead use the Tukey or Dunnett results. You knew before you performed these multiple comparison techniques that fertilizer 1 produced the garlic with the heaviest overall mean bulb weight, so that would be your first choice if you are not considering other factors like cost or availability. But what if fertilizer 1 is very expensive or hard to obtain? With these multiple comparison techniques, you now know which fertilizers are not statistically different from fertilizer 1, so the Montana Gourmet Garlic farmers have options for the fertilizer to use that will produce equally heavy garlic bulbs.

Two-Way ANOVA with Interactions

When you have two categorical predictor variables and a continuous response variable, you can analyze your data using two-way ANOVA. With two-way ANOVA, you can examine the effects of the two predictor variables concurrently. You can also determine whether they interact with respect to their effect on the response variable. An interaction means that the effects on one variable depend on the value of another variable. If there is no interaction, you can interpret the test for the individual factor effects to determine their significance. If an interaction exists between any factors, the test for the individual factor effects might be misleading due to the masking of these effects by the interaction.

You can include interactions and more than one predictor variable in the ANOVA model.

You can graphically explore the relationship between the response variable and the effect of the interaction between the two predictor variables using PROC SGPLOT.

You can use PROC GLM to determine whether the effects of the predictor variables and the interaction between them are statistically significant.

When running PROC GLM, you can add a STORE statement to save your analysis results. By using the STORE statement, you can run postprocessing analyses on the stored results, even if you no longer have access to the original data set. The STORE statement requests that the procedure save the context and results of the statistical analysis into an item store. To perform post-fitting statistical analyses and plotting for the contents of the store item, you use the PLM procedure.


In [21]:
proc format;
   value dosef
   1="Placebo"
   2="100 mg"
   3="200mg"
   4="500mg";
run;

proc means data=statdata.drug
           mean var std printalltypes;
   class Disease DrugDose;
   var BloodP;
   output out=means mean=BloodP_Mean;
   format DrugDose dosef.;
   title "Selected Descriptive Statistics for Drug Data Set";
run;
title;


Out[21]:

280  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
281
282 proc format;
283 value dosef
284 1="Placebo"
285 2="100 mg"
286 3="200mg"
NOTE: Format DOSEF has been output.
287 4="500mg";
288 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

289
290 proc means data=statdata.drug
ERROR: File STATDATA.DRUG.DATA does not exist.
291 mean var std printalltypes;
ERROR: No data set open to look up variables.
ERROR: No data set open to look up variables.
292 class Disease DrugDose;
ERROR: No data set open to look up variables.
293 var BloodP;
294 output out=means mean=BloodP_Mean;
295 format DrugDose dosef.;
296 title "Selected Descriptive Statistics for Drug Data Set";
297 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MEANS may be incomplete. When this step was stopped there were 0 observations and 0 variables.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

298 title;
299 ods html5 close;ods listing;

300

The Means data set contains the variable TYPE, with values ranging from 0 to 3 to represent the four tables this PROC MEANS program generates. Type 0 gives you the mean blood pressure change of all observations, regardless of disease type or drug dose. Type 1 gives you the mean blood pressure for each drug dose, regardless of disease type. Type 2 gives you the mean blood pressure for each disease type, regardless of drug dose. And Type 3 gives you the mean blood pressure for each disease type and drug dose combination.


In [22]:
ods graphics on / width=800;

proc sgplot data=means;
   where _TYPE_=3;     
   scatter x=DrugDose y=BloodP_Mean / 
           group=Disease markerattrs=(size=10);
   series x=DrugDose y=BloodP_Mean / group=Disease        
          lineattrs=(thickness=2);
   xaxis integer;
   format DrugDose dosef.;  
   title "Plot of Stratified Means in Drug Data Set";
run;
title;

ods graphics on / width=800;

proc glm data=statdata.drug;
   class DrugDose Disease;
   model Bloodp=DrugDose Disease DrugDose*Disease / ss3;
   format DrugDose dosef.;  
   title "Analyze the Effects of DrugDose and Disease";
   title2 "Including Interaction";
run;
quit;
title;


Out[22]:

302  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
303
304 ods graphics on / width=800;
305
306 proc sgplot data=means;
ERROR: Variable _TYPE_ is not on file WORK.MEANS.
307 where _TYPE_=3;
ERROR: Variable DRUGDOSE not found.
308 scatter x=DrugDose y=BloodP_Mean /
ERROR: Variable BLOODP_MEAN not found.
ERROR: Variable DISEASE not found.
309 group=Disease markerattrs=(size=10);
ERROR: Variable DRUGDOSE not found.
ERROR: Variable BLOODP_MEAN not found.
310 series x=DrugDose y=BloodP_Mean / group=Disease
ERROR: Variable DISEASE not found.
311 lineattrs=(thickness=2);
312 xaxis integer;
313 format DrugDose dosef.;
314 title "Plot of Stratified Means in Drug Data Set";
315 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SGPLOT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

316 title;
317
318 ods graphics on / width=800;
319
320 proc glm data=statdata.drug;
ERROR: File STATDATA.DRUG.DATA does not exist.
321 class DrugDose Disease;
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
322 model Bloodp=DrugDose Disease DrugDose*Disease / ss3;
323 format DrugDose dosef.;
324 title "Analyze the Effects of DrugDose and Disease";
325 title2 "Including Interaction";
326 run;
327 quit;
NOTE: PROCEDURE GLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

328 title;
329
330 ods html5 close;ods listing;

331

The p-value for the overall model is very small, so what does this tell you? You can reject the null hypothesis and conclude that at least one of the effects in the model is significant, in other words, there is at least one difference among the 12 group means, one for each drug dose and disease combination. Which factors explain this difference? You'll see in just a few moments.

The R square is 0.3479, so approximately 35% of the variation in blood pressure change can be explained by the model. The average blood pressure change of all the observations is –2.294, which is exactly what the PROC MEANS output showed.

The next tables show the breakdown of the main effects and interaction term in the model. Look at the Type I and Type III Sums of Squares values. Do you know why their values are not exactly the same? You don't have a balanced design in this experiment. In other words, you have a different number of observations in each drug dose and disease combination group. In most situations, you will want to use the Type III SS. The Type I, or sequential SS, are the sums of squares you obtain from fitting the effects in the order you specify in the model. The Type III, or marginal SS, are the sums of squares you obtain from fitting each effect after all the other terms in the model, that is, the sums of squares for each effect corrected for the other terms in the model. Type III SS does not depend upon the order you specify effects in the model. (Use the SS3 option on the MODEL statement, this portion of the output shows the results for Type III SS only.)

You want to look at the interaction term first. If it's significant, the main effects don't tell you the whole story. The p-value for DrugDose*Disease is 0.0001. Presuming an alpha of 0.05, you reject the null hypothesis. You have sufficient evidence to conclude that there is an interaction between the two factors, meaning that the effect of the level of drug dose on blood pressure changes for the different disease types. You don't need to worry all that much about the significance of the main effects at this point for two reasons: 1) Because the interaction term is significant, you know that the effect of the drug level changes for the different disease types. 2) Because the interaction term is significant, you want to include the main effects in the model, whether they are significant or not, to preserve model hierarchy.

Let's finally take a look at the interaction plot for blood pressure. SAS produces this plot by default when you have an interaction term in the model. This plot looks similar to the one you produced with PROC SGPLOT, except that this one plots each of the blood pressure change measurements, as well as the means for each drug dose and disease type combination. Well, you might be thinking, "Okay. I know the interaction is significant. What I really want to know is the effect of drug dose at each particular level of disease." You have to add the LSMEANS statement to your program to find the answer.

The PLM Procedure

The PLM procedure performs post-fitting statistical analyses and plotting for the contents of a SAS item store that were previously created with the STORE statement in some other SAS/STAT procedure.

The statements that are available in the PLM procedure are designed to reveal the contents of the source item store via the Output Delivery System (ODS) and to perform post-fitting tasks.

The use of item stores and PROC PLM enables you to separate common post-processing tasks, such as testing for treatment differences and predicting new observations under a fitted model, from the process of model building and fitting. A numerically expensive model fitting technique can be applied once to produce a source item store. The PLM procedure can then be called multiple times, and the results of the fitted model are analyzed without incurring the model fitting expenditure again.

Selected PROC PLM option:

  • RESTORE specifies the source item store for processing.
  • Selected PROC PLM procedure statements:
  • EFFECTPLOT produces a display of the fitted model and provides options for changing and enhancing the displays.
  • LSMEANS computes and compares least squares means (LS-means) of fixed effects.
  • LSMESTIMATE provides custom hypothesis tests among least squares means.
  • SHOW uses the Output Delivery System to display contents of the item store. This statement is useful for verifying that the contents of the item store apply to the analysis and for generating ODS tables.
  • SLICE provides a general mechanism for performing a partitioned analysis of the LS-means for an interaction. This analysis is also known as an analysis of simple effects. The SLICE statement uses the same options as the LSMEANS statement.
  • WHERE is used in the PLM procedure when the item store contains BY-variable information and you want to apply the PROC PLM statements to only a subset of the BY groups.
PROC PLM RESTORE=item-store-specification<options>;
    EFFECTPLOT <plot-type <(plot-definition options)>> 
          </ options>;
    LSMEANS <model-effects > </ options>;
    LSMESTIMATE model-effect <'label'> values 
      <divisor=n><,...<'label'> values
      <divisor=n>> </ options>;
SHOW options;
SLICE model-effect </ options>;
WHERE expression ;
RUN;

Saving Analysis Results with the STORE Statement

When running PROC GLM, you can add a STORE statement to save your analysis results. By using the STORE statement, you can run postprocessing analyses on the stored results, even if you no longer have access to the original data set.

The STORE statement requests that the procedure save the context and results of the statistical analysis into an item store. An item store is a binary file format that cannot be modified. You can process the contents of an item store with the PLM procedure.

For example, if you need to perform a time-consuming analysis, you can store the results by using the STORE statement.

At a later time, you can use PROC PLM to perform specific statistical analysis tasks based on the saved results of the previous analysis without having to fit the model again. This can be a great time saver!

Here is the syntax of the STORE statement. Following the keyword STORE and OUT= you specify the item store name and an optional label.

You can use the STORE statement in a number of SAS/STAT procedures. For more information about the STORE statement, click the Information button.


In [23]:
proc contents data = statdata.ameshousing3;
run;


Out[23]:

333  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
334
335 proc contents data = statdata.ameshousing3;
ERROR: File STATDATA.AMESHOUSING3.DATA does not exist.
336 run;
NOTE: Statements not processed because of errors noted above.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

337 ods html5 close;ods listing;

338

In [24]:
proc glm data=statdata.ameshousing3 
         order=internal 
         plots(only)=intplot;
   class Season_Sold Heating_QC;
   model SalePrice=Heating_QC Season_Sold Heating_QC*Season_Sold / ss3;
   lsmeans Heating_QC*Season_Sold / diff slice=Heating_QC;
   format Season_Sold ;
   store out=interact;
   title "Model with Heating Quality and Season as Interacting "
         "Predictors";
run;


Out[24]:

340  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
341
342 proc glm data=statdata.ameshousing3
ERROR: File STATDATA.AMESHOUSING3.DATA does not exist.
343 order=internal
344 plots(only)=intplot;
345 class Season_Sold Heating_QC;
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
346 model SalePrice=Heating_QC Season_Sold Heating_QC*Season_Sold / ss3;
347 lsmeans Heating_QC*Season_Sold / diff slice=Heating_QC;
348 format Season_Sold ;
349 store out=interact;
350 title "Model with Heating Quality and Season as Interacting "
351 "Predictors";
352 run;
ERROR: A MODEL statement must be given.
NOTE: The model item store WORK.INTERACT was deleted because of incomplete information for a subsequent analysis.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE GLM used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

353 ods html5 close;ods listing;

354

In [25]:
proc plm restore=interact plots=all;
   slice Heating_QC*Season_Sold / sliceby=Heating_QC adjust=tukey;
   effectplot interaction(sliceby=Heating_QC) / clm;
run;


Out[25]:

356  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
357
358 proc plm restore=interact plots=all;
ERROR: The file WORK.INTERACT does not exist or it is not a valid item store.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

359 slice Heating_QC*Season_Sold / sliceby=Heating_QC adjust=tukey;
360 effectplot interaction(sliceby=Heating_QC) / clm;
361 run;
362 ods html5 close;ods listing;

363

The option order=internal tells SAS to use the order of the variable values stored internally, rather than the order of the formatted values. The internal values for Season_Sold are 1 (formatted as Winter), 2 (formatted as Spring), 3 (for Summer), and 4 (for Fall). So, including order=internal tells SAS to display the seasons in the order Winter, Spring, Summer, and Fall, rather than in alphabetical order.

We specify plots(only)=intplot to request an interaction plot. You can request an interaction plot even if there isn’t an interaction in the model. In this case, there is an interaction, so we’ll be able to visualize the interaction in the plot. Because Season_Sold and Heating_QC are categorical variables, we need to include them in a CLASS statement.

In the MODEL statement, we specify the response variable SalePrice, equals, the main effects Heating_QC and Season_Sold, and then interaction effect. The interaction effect is represented as the two main effects separated by an asterisk. Here’s another way to represent the effects. Instead of listing the interaction effect, we could simply place a vertical bar between the main effects.

Recall that the LSMEANS option computes and compares least squares means of fixed effects. This LSMEANS statement specifies the interaction term Heating_QC by Season_Sold. By specifying slice=Heating_QC, we tell SAS to to slice the interaction effect by the different levels of Heating_QC.

This means that each slice will have one Heating_QC level and will show the Season_Sold effect across that slice. We'll format Season_Sold with a FORMAT statement. The statement store out=interact saves the analysis results as a SAS item store named interact in the Work library.

We want to produce more tables on Season_Sold by different levels of Heating_QC, so we add the SLICE statement slice Heating_QC by Season_Sold, sliceby=Heating_QC. Notice that the syntax sliceby= in the SLICE statement is different from slice= in the LSMEANS statement.

We include adjust=tukey in order to get the Tukey adjustment for multiple comparison tests.

By specifying plots=all, the output will include an effect plot. However, by adding an EFFECTPLOT statement, we can specify more options. We specify the interaction sliceby=Heating_QC, and we specify clm, which gives us confidence limits for the means.


In [26]:
*variables

Region
Advertising
Gender
Book_Sales
Music_Sales
Electronics_Sales
Total_Sales
;

proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
data Store;
   length Region $ 5;
   call streaminit(57676);
   do Transaction = 1 to 200;
      R = ceil(rand('uniform')*10);
      select(R);
         when(1) Region = 'East';
         when(2) Region = 'West';
         when(3) Region = 'North';
         when(4) Region = 'South';
         otherwise;
      end;
      Advertising = rand('bernouli',.6);
      if rand('uniform') lt .6 then Gender = 'Female';
         else Gender = 'Male';
      Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
                    + 30*Advertising,10)) ;
      Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
         + 30*(Region = 'East' and Gender = 'Male')
         - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
      Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
       + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South') 
       + 75*(Region = 'West'),10));
      Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
   output;
   end;
   drop R;
   format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
          Advertising yesno.;
run;
 
/*title "Listing of Store";*/
/*proc print data=store heading=h;*/
/*run;*/

/*proc univariate data=store;*/
/*   var Book_Sales -- Total_Sales;*/
/*   histogram;*/
/*run;*/
/**/
/*title "Scatter Matrix for Store Variables";*/
/*proc sgscatter data=store;*/
/*   matrix Book_Sales -- Total_Sales / group = Gender;*/
/*run;*/
/**/
/*proc sgplot data=store;*/
/*   scatter x=Book_Sales y=Total_Sales / group=Gender;*/
/*run;*/

proc rank data=store out=median_sales groups=2;
   var Total_Sales;
   ranks Sales_Group;
run;

proc format;
   value sales 0 = 'Low'
               1 = 'High';
run;

/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male');*/
/*   model Sales_Group = Gender;*/
/*   format Sales_Group sales.;*/
/*quit;*/
/**/
/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male')*/
/*         Advertising (param=ref ref='No');*/
/*   model Sales_Group = Gender Advertising;*/
/*   format Sales_Group sales.;*/
/*quit;*/

*Create test data set;
libname example 'c:\books\statistics by example';
data example.Blood_Pressure;
   call streaminit(37373);
   do Drug = 'Placebo','Drug A','Drug B';
      do i = 1 to 20;
         Subj + 1;
         if mod(Subj,2) then Gender = 'M';
         else Gender = 'F';
         SBP = rand('normal',130,10) +
               7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
         SBP = round(SBP,2);
         DBP = rand('normal',80,5) +
               3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
         DBP = round(DBP,2);
         if Subj in (5,15,25,55) then call missing(SBP, DBP);
         if Subj in (4,18) then call missing(Gender);
         output;
      end;
   end;
   drop i;
run;

/*title "Listing of the first 25 observations from Blood_Pressure";*/
/*proc print data=example.Blood_Pressure(obs=25) noobs;*/
/*   var Subj Drug SBP DBP;*/
/*run;*/

data exercise;
   call streaminit(7657657);
   do Subj = 1 to 50;
      Age = round(rand('normal',50,15));
      Pushups = abs(int(rand('normal',40,10) - .30*age));
      Rest_Pulse = round(rand('normal',50,8) + .35*age);
      Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
      Run_Pulse = round(max_pulse - rand('normal',3,3));
      output;
   end;
run;

*Data set for a paired t-test example;
data reading;
   input Subj Before After @@;
datalines;
1 100 110  2 120 121  3 130 140  4 90 110  5 85 92
6 133 137  7 210 209  8 155 179
;

/*title "Listing of Data Set READING";*/
/*proc print data=reading noobs;*/
/*run;*/

*Data set that violates assumptions for a t-test;
data salary;
   call streaminit(57575);
   do Subj = 1 to 50;
      do Gender = 'M','F';
         Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
         output;
      end;
   end;
run;
/*proc univariate data=salary;*/
/*   class Gender;*/
/*   id Subj;*/
/*   var Income;*/
/*   histogram Income;*/
/*run;*/

*Data set risk for logistic regression example;
proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
run;

data Risk;
   call streaminit(13579);
   length Age_Group $ 7;
   do i = 1 to 250;
      do Gender = 'F','M';
         Age = round(rand('uniform')*30 + 50);
         if missing(Age) then Age_Group = ' ';
         else if Age lt 60 then Age_Group = '1:< 60';
         else if Age le 70 then Age_Group = '2:60-70';
         else Age_Group = '3:71+';
         Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
         Chol = round(Chol);
         Score = .3*chol + age + 8*(Gender eq 'M');
         Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
         output;
       end;
   end;
   keep Gender Age Age_Group chol Heart_Attack;
   format Heart_Attack yesno.;
run;

/*title "Listing of first 100 observations from RISK";*/
/*proc print data=risk(obs=100);*/
/*run;*/


Out[26]:

365  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
366
367 *variables
368
369 Region
370 Advertising
371 Gender
372 Book_Sales
373 Music_Sales
374 Electronics_Sales
375 Total_Sales
376 ;
377
378 proc format;
379 value yesno 1 = 'Yes'
NOTE: Format YESNO has been output.
380 0 = 'No';
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

381 data Store;
382 length Region $ 5;
383 call streaminit(57676);
384 do Transaction = 1 to 200;
385 R = ceil(rand('uniform')*10);
386 select(R);
387 when(1) Region = 'East';
388 when(2) Region = 'West';
389 when(3) Region = 'North';
390 when(4) Region = 'South';
391 otherwise;
392 end;
393 Advertising = rand('bernouli',.6);
394 if rand('uniform') lt .6 then Gender = 'Female';
395 else Gender = 'Male';
396 Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
397 + 30*Advertising,10)) ;
398 Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
399 + 30*(Region = 'East' and Gender = 'Male')
400 - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
401 Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
402 + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South')
403 + 75*(Region = 'West'),10));
404 Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
405 output;
406 end;
407 drop R;
408 format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
409 Advertising yesno.;
410 run;
NOTE: The data set WORK.STORE has 200 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

411
412 /*title "Listing of Store";*/
413 /*proc print data=store heading=h;*/
414 /*run;*/
415
416 /*proc univariate data=store;*/
417 /* var Book_Sales -- Total_Sales;*/
418 /* histogram;*/
419 /*run;*/
420 /**/
421 /*title "Scatter Matrix for Store Variables";*/
422 /*proc sgscatter data=store;*/
423 /* matrix Book_Sales -- Total_Sales / group = Gender;*/
424 /*run;*/
425 /**/
426 /*proc sgplot data=store;*/
427 /* scatter x=Book_Sales y=Total_Sales / group=Gender;*/
428 /*run;*/
429
430 proc rank data=store out=median_sales groups=2;
431 var Total_Sales;
432 ranks Sales_Group;
433 run;
NOTE: The data set WORK.MEDIAN_SALES has 200 observations and 9 variables.
NOTE: PROCEDURE RANK used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

434
435 proc format;
436 value sales 0 = 'Low'
NOTE: Format SALES has been output.
437 1 = 'High';
438 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

439
440 /*proc logistic data=median_sales order=formatted;*/
441 /* class Gender(param=ref ref='Male');*/
442 /* model Sales_Group = Gender;*/
443 /* format Sales_Group sales.;*/
444 /*quit;*/
445 /**/
446 /*proc logistic data=median_sales order=formatted;*/
447 /* class Gender(param=ref ref='Male')*/
448 /* Advertising (param=ref ref='No');*/
449 /* model Sales_Group = Gender Advertising;*/
450 /* format Sales_Group sales.;*/
451 /*quit;*/
452
453 *Create test data set;
454 libname example 'c:\books\statistics by example';
NOTE: Library EXAMPLE does not exist.
455 data example.Blood_Pressure;
456 call streaminit(37373);
457 do Drug = 'Placebo','Drug A','Drug B';
458 do i = 1 to 20;
459 Subj + 1;
460 if mod(Subj,2) then Gender = 'M';
461 else Gender = 'F';
462 SBP = rand('normal',130,10) +
463 7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
464 SBP = round(SBP,2);
465 DBP = rand('normal',80,5) +
466 3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
467 DBP = round(DBP,2);
468 if Subj in (5,15,25,55) then call missing(SBP, DBP);
469 if Subj in (4,18) then call missing(Gender);
470 output;
471 end;
472 end;
473 drop i;
474 run;
ERROR: Library EXAMPLE does not exist.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.02 seconds

475
476 /*title "Listing of the first 25 observations from Blood_Pressure";*/
477 /*proc print data=example.Blood_Pressure(obs=25) noobs;*/
478 /* var Subj Drug SBP DBP;*/
479 /*run;*/
480
481 data exercise;
482 call streaminit(7657657);
483 do Subj = 1 to 50;
484 Age = round(rand('normal',50,15));
485 Pushups = abs(int(rand('normal',40,10) - .30*age));
486 Rest_Pulse = round(rand('normal',50,8) + .35*age);
487 Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
488 Run_Pulse = round(max_pulse - rand('normal',3,3));
489 output;
490 end;
491 run;
NOTE: The data set WORK.EXERCISE has 50 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

492
493 *Data set for a paired t-test example;
494 data reading;
495 input Subj Before After @@;
496 datalines;
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.READING has 8 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

499 ;
500
501 /*title "Listing of Data Set READING";*/
502 /*proc print data=reading noobs;*/
503 /*run;*/
504
505 *Data set that violates assumptions for a t-test;
506 data salary;
507 call streaminit(57575);
508 do Subj = 1 to 50;
509 do Gender = 'M','F';
510 Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
511 output;
512 end;
513 end;
514 run;
NOTE: The data set WORK.SALARY has 100 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

515 /*proc univariate data=salary;*/
516 /* class Gender;*/
517 /* id Subj;*/
518 /* var Income;*/
519 /* histogram Income;*/
520 /*run;*/
521
522 *Data set risk for logistic regression example;
523 proc format;
524 value yesno 1 = 'Yes'
NOTE: Format YESNO is already on the library WORK.FORMATS.
NOTE: Format YESNO has been output.
525 0 = 'No';
526 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

527
528 data Risk;
529 call streaminit(13579);
530 length Age_Group $ 7;
531 do i = 1 to 250;
532 do Gender = 'F','M';
533 Age = round(rand('uniform')*30 + 50);
534 if missing(Age) then Age_Group = ' ';
535 else if Age lt 60 then Age_Group = '1:< 60';
536 else if Age le 70 then Age_Group = '2:60-70';
537 else Age_Group = '3:71+';
538 Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
539 Chol = round(Chol);
540 Score = .3*chol + age + 8*(Gender eq 'M');
541 Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
542 output;
543 end;
544 end;
545 keep Gender Age Age_Group chol Heart_Attack;
546 format Heart_Attack yesno.;
547 run;
NOTE: The data set WORK.RISK has 500 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

548
549 /*title "Listing of first 100 observations from RISK";*/
550 /*proc print data=risk(obs=100);*/
551 /*run;*/
552
553
554 ods html5 close;ods listing;

555