Statistics Using SAS



In [27]:

    
%let path=/folders/myfolders/ECST131;
libname statdata "&path";

/*Go to the bottom to input the dataset*/









    Out[27]:









  
  
  




557  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
558  
559  %let path=/folders/myfolders/ECST131;
560  libname statdata "&path";
NOTE: Libref STATDATA was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /folders/myfolders/ECST131
561  
562  /*Go to the bottom to input the dataset*/
563  ods html5 close;ods listing;

564

Basic Statistical Concepts

A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. The sample should be representative of the population. You can obtain a representative sample by collecting a simple random sample.

Parameters are numerical values that summarize characteristics of a population. Parameter values are typically unknown and are represented by Greek letters. Statistics summarize characteristics of a sample. You use letters from the English alphabet to represent sample statistics. You can measure characteristics of your sample and provide numerical values that summarize those characteristics.

Variables are classified according to their characteristics. They can be quantitative or categorical. Data that consists of counts or measurements is quantitative. Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data takes on only a finite, or countable, number of values. Continuous data has an infinite number of values and no breaks or jumps.

Categorical or attribute data consists of variables that denote groupings or labels. There are two main types: nominal and ordinal. A nominal categorical variable exhibits no ordering within its groups or categories. With ordinal categorical variables, the observed levels of the variable can be ordered in a meaningful way that implies differences due to magnitude.

There are two scales of measurement for categorical variables: nominal and ordinal. There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered and has a sensible spacing of observations such that differences between measurements are meaningful. However, interval scales lack the ability to calculate ratios between numbers on the scale because there is no true zero point. Data on a ratio scale includes a true zero point and can therefore accurately indicate the ratio of difference between two spaces on the measurement scale.

The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. Multivariate analysis examines two or more variables at the same time, in order to understand the relationships among them.

PROC MEANS DATA=SAS-data-set <options>;       
CLASS variables;
VAR variables;
RUN;

Option	Description
N	Number of nonmissing observations
NMISS	Number of observations with missing values
MEAN	Arithmetic mean
STD	Standard deviation
STDERR	Standard error
MIN	Minimum value
MAX	Maximum value
MEDIAN	Median
MAXDEC=	Maximum number of decimal places to display
CLM	95% confidence limit on the mean
CV	Coefficient of variation

PROC UNIVARIATE DATA=SAS-data-set <options>;      
    VAR variables;
    ID variables;
    HISTOGRAM variables </options>;
    PROBPPLOT variables </options>;
    INSET keywords </options>;
RUN;

PROC SGPLOT DATA=SAS-data-set<options>;
    DOT category-variable </option(s)>;
    HBAR category-variable </option(s)>;
    VBAR category-variable </option(s)>;
    HBOX response-variable </option(s)>;
    VBOX response-variable </option(s)>;
    HISTOGRAM response-variable </option(s)>;
    SCATTER X=variable Y=variable </option(s)>;
    NEEDLE X=variable Y=numeric-variable </option(s)>;
    REG X=numeric-variable Y=numeric-variable </option(s)>;
    REFLINE variable | value-1 <... value-n> </option(s)>;
RUN;

ODS GRAPHICS ON <options>;
    statistical procedure code
ODS GRAPHICS OFF;

Sample Programs



In [28]:

    
proc print data=statdata.testscores;
run;









    Out[28]:










SAS Output





Model with Heating Quality and Season as Interacting Predictors 







Obs
Gender
SATScore
IDNumber




1
Male
1170
61469897


2
Female
1090
33081197


3
Male
1240
68137597


4
Female
1000
37070397


5
Male
1210
64608797


6
Female
970
60714297


7
Male
1020
16907997


8
Female
1490
9589297


9
Male
1200
93891897


10
Female
1260
85859397


11
Male
1150
38152597


12
Female
1390
99108497


13
Male
1240
59666697


14
Female
1370
70847197


15
Male
1140
47613397


16
Female
1160
53750297


17
Male
1050
95948597


18
Female
1110
3873197


19
Male
1100
25756097


20
Female
1080
43493297


21
Male
1120
27543197


22
Female
1080
26212897


23
Male
1050
8945097


24
Female
1200
51799397


25
Male
1600
39196697


26
Female
1100
48154497


27
Male
1050
55189597


28
Female
1060
46028397


29
Male
1140
75332897


30
Female
1100
29520797


31
Male
1340
55983497


32
Female
1240
93236497


33
Male
1090
6975697


34
Female
1180
29686297


35
Male
1170
76815697


36
Female
1130
64045497


37
Male
1290
9880297


38
Female
1380
23048597


39
Male
1010
76058697


40
Female
1280
42586897


41
Male
1050
62688897


42
Female
1520
73461797


43
Male
1360
44327297


44
Female
1260
2854197


45
Male
1290
42401697


46
Female
1070
51266697


47
Male
1260
61195297


48
Female
1350
71681397


49
Male
1140
43230697


50
Female
1330
6520097


51
Male
990
61728297


52
Female
1300
64179097


53
Male
1150
20494697


54
Female
1520
40177297


55
Male
1200
43183797


56
Female
1120
47146397


57
Male
1020
60633297


58
Female
1250
72168497


59
Male
1330
98377897


60
Female
1170
20745097


61
Male
1110
91775997


62
Female
1230
82634897


63
Male
1310
73772397


64
Female
1270
20755897


65
Male
1210
57288097


66
Female
1120
55860697


67
Male
1200
65246997


68
Female
1380
81994397


69
Male
890
76526697


70
Female
1590
23573597


71
Male
1110
61160997


72
Female
1150
23359697


73
Male
1280
19108097


74
Female
910
30834797


75
Male
1060
46806897


76
Female
1270
62919297


77
Male
1020
37534197


78
Female
1300
4440297


79
Male
1050
2012997


80
Female
1060
59149297

Using Proc Freq

Plot Name	Table type	Require option on TABLES statement
AGREEPLOT	two-way	AGREE
CUMFREQPLOT	one-way
DEVIATIONPLOT	one-way	CHISQ
FREQPLOT	any request
KAPPAPLOT	three-way	AGREE
ODDSRATIOPLOT	hx2x2	MEASURES or RELRISK
RELREISKPLOT	hx2x2	MEASURES or RELRISK
RISKDIFFPLOT	hx2x2	RISKDIFF
WTKAPPAPLOT	hxrxr (r>2)	AGREE



In [29]:

    
proc format;
value $gender 'M' = 'Male'
'F' = 'Female';
value score low-<1100 = 'Low'
            1100-<1400 = 'Middle'
            1400-High = 'High';
run;

proc freq data=statdata.testscores ORDER=FORMATTED;
    tables gender*SATScore / noprecent norow nocol CHISQ 
           plots = freqplot (twoway = stacked);
           *plots = freqplot (twoway = grouphorizontal);
    format Gender $gender. SATScore score.;
run;









    Out[29]:










SAS Output





Model with Heating Quality and Season as Interacting Predictors 


The FREQ Procedure












Frequency










Table of Gender by SATScore


Gender
SATScore


High
Low
Middle
Total




Female


4




9




27




40




Male


1




13




26




40




Total


5




22




53




80















Statistics for Table of Gender by SATScore







Statistic
DF
Value
Prob




WARNING: 33% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.





Chi-Square
2
2.5461
0.2800


Likelihood Ratio Chi-Square
2
2.6777
0.2622


Mantel-Haenszel Chi-Square
1
2.4303
0.1190


Phi Coefficient
 
0.1784
 


Contingency Coefficient
 
0.1756
 


Cramer's V
 
0.1784
 





Sample Size = 80

Using PROC MEANS



In [30]:

    
proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
           n mean median std sterr var q1 q3 clm t;
   class Gender;
   var SATScore;
   title 'Selected Descriptive Statistics for SAT Scores';
run;
title;









    Out[30]:










SAS Output





Selected Descriptive Statistics for SAT Scores 


The MEANS Procedure







Analysis Variable : SATScore


N Obs
N
Mean
Median
Std Dev
Std Error
Variance
Lower Quartile
Upper Quartile
Lower 95%
CL for Mean

Upper 95%
CL for Mean

t Value




80
80
1190.63
1170.00
147.06
16.44
21626.19
1085.00
1280.00
1157.90
1223.35
72.42










Analysis Variable : SATScore


Gender
N Obs
N
Mean
Median
Std Dev
Std Error
Variance
Lower Quartile
Upper Quartile
Lower 95%
CL for Mean

Upper 95%
CL for Mean

t Value




Female
40
40
1221.00
1215.00
157.40
24.89
24773.33
1100.00
1315.00
1170.66
1271.34
49.06


Male
40
40
1160.25
1145.00
130.92
20.70
17140.96
1050.00
1240.00
1118.38
1202.12
56.05

Using proc sgplot, proc univariate



In [31]:

    
proc univariate data=statdata.testscores ALPHA=0.05;
   var SATScore;
   id idnumber;
   histogram SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis;
   cdfplot SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;









    Out[31]:










SAS Output





Descriptive Statistics Using PROC UNIVARIATE 


The UNIVARIATE Procedure
Variable:  SATScore








Moments




N
80
Sum Weights
80


Mean
1190.625
Sum Observations
95250


Std Deviation
147.058447
Variance
21626.1867


Skewness
0.64202018
Kurtosis
0.42409987


Uncorrected SS
115115500
Corrected SS
1708468.75


Coeff Variation
12.3513656
Std Error Mean
16.4416342










Basic Statistical Measures


Location
Variability




Mean
1190.625
Std Deviation
147.05845


Median
1170.000
Variance
21626


Mode
1050.000
Range
710.00000


 
 
Interquartile Range
195.00000










Tests for Location: Mu0=0


Test
Statistic
p Value




Student's t
t
72.41525
Pr > |t|
<.0001


Sign
M
40
Pr >= |M|
<.0001


Signed Rank
S
1620
Pr >= |S|
<.0001










Quantiles (Definition 5)


Level
Quantile




100% Max
1600


99%
1600


95%
1505


90%
1375


75% Q3
1280


50% Median
1170


25% Q1
1085


10%
1020


5%
995


1%
890


0% Min
890










Extreme Observations


Lowest
Highest


Value
IDNumber
Obs
Value
IDNumber
Obs




890
76526697
69
1490
9589297
8


910
30834797
74
1520
73461797
42


970
60714297
6
1520
40177297
54


990
61728297
51
1590
23573597
70


1000
37070397
4
1600
39196697
25






Descriptive Statistics Using PROC UNIVARIATE 


The UNIVARIATE Procedure









Descriptive Statistics Using PROC UNIVARIATE 


The UNIVARIATE Procedure
Fitted Normal Distribution for SATScore








Parameters for Normal Distribution


Parameter
Symbol
Estimate




Mean
Mu
1190.625


Std Dev
Sigma
147.0584










Goodness-of-Fit Tests for Normal Distribution


Test
Statistic
p Value




Kolmogorov-Smirnov
D
0.08382224
Pr > D
>0.150


Cramer-von Mises
W-Sq
0.09964577
Pr > W-Sq
0.114


Anderson-Darling
A-Sq
0.70124822
Pr > A-Sq
0.068










Quantiles for Normal Distribution


Percent
Quantile


Observed
Estimated




1.0
890.000
848.516


5.0
995.000
948.735


10.0
1020.000
1002.162


25.0
1085.000
1091.436


50.0
1170.000
1190.625


75.0
1280.000
1289.814


90.0
1375.000
1379.088


95.0
1505.000
1432.515


99.0
1600.000
1532.734








Descriptive Statistics Using PROC UNIVARIATE 


The UNIVARIATE Procedure










Descriptive Statistics Using PROC UNIVARIATE 


The UNIVARIATE Procedure



In [32]:

    
proc sgplot data = statdata.testscores ;
    *dot SATScore;
    *vbar SATscore;
    *scatter y=SATScore x=Gender;
    refline 1200 / axis=x lineattrs =(color=blue);
    /*hbox SATScore / group=Gender datalabel = IDNumber*/
    hbox SATScore / category=Gender datalabel = IDNumber missing;
    format IDnUMBER 8.;
run;









    Out[32]:










SAS Output



In [33]:

    
title "Using PROC SGPLOT to Produce a Scatter Plot";
proc sgplot data=store;
scatter x=Book_Sales y=Music_Sales;
run;
quit;









    Out[33]:










SAS Output



In [34]:

    
title "Using PROC SGPLOT to Produce a Scatter Plot";
title2 "Adding Gender Information to the Plot";
proc sgplot data=store;
scatter x=Book_Sales y=Music_Sales / group=Gender;
run;
quit;









    Out[34]:










SAS Output



In [35]:

    
title "Demonstrating the PLOT Statement of PROC SGSCATTER";
proc sgscatter data=store;
plot Book_Sales * Music_Sales Total_Sales * Electronics_Sales;
run;









    Out[35]:










SAS Output



In [36]:

    
title "Switching Axes and Adding a GROUP= Option";
proc sgscatter data=store;
compare Y=Total_Sales
X=(Book_Sales Music_Sales Electronics_Sales) / group=Gender;
run;









    Out[36]:










SAS Output

The left and right sides of the box represent the 1st and 3rd quartiles (sometimes abbreviated Q1 and Q3). The vertical bar inside the box is the median, and the diamond represents the mean. The lines extending from the left and right side of the box (called whiskers) represent data values that are less than 1.5 times the interquartile range from Q1 and Q3.

The option DATALABEL= lets you select a variable to identify specific outliers. If you use the DATALABEL option without naming a label variable, SGPLOT uses the numerical value of the response variable (SBP in this example) to label the outliers.

T test

PROC TTEST DATA=SAS-data-set <options eg sides = U>;
    CLASS variables;
    VAR variables;
    PARIED variable1*variable2;
RUN;

One-sample t-test



In [37]:

    
title "Conducting a One-Sample t-test Using PROC TTEST";
proc ttest data=exercise h0=50 sides=2 alpha=.05;
var Age;
run;









    Out[37]:










SAS Output






Conducting a One-Sample t-test Using PROC TTEST 


The TTEST Procedure

Variable:  Age







N
Mean
Std Dev
Std Err
Minimum
Maximum




50
51.3600
16.1419
2.2828
19.0000
84.0000










Mean
95% CL Mean
Std Dev
95% CL Std Dev




51.3600
46.7725
55.9475
16.1419
13.4839
20.1150










DF
t Value
Pr > |t|




49
0.60
0.5541



In [38]:

    
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Central_Air
                    connect=mean;
   title "Sale Price Differences across Central Air";
run;
 
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Fireplaces
                    connect=mean;
   title "Sale Price Differences across Fireplaces";
run;
 
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Heating_QC
                    connect=mean;
   title "Sale Price Differences across Heating Quality";
run;









    Out[38]:










SAS Output

Two-sample t-test

The two-sample t-test is a hypothesis test for answering questions about the means of two populations. This test enables you to examine the differences between populations for one or more continuous variables. You can assess whether the means of the two populations are statistically different from each other. The null hypothesis for the two-sample t-test is that the means for the two groups are equal, or that $μ1 - μ2$ equals 0.

When you compare the means of two populations using a two-sample t-test, you make three assumptions:

The data contains independent observations,
The distributions of the two populations are normal, and
The variances in these normal distributions are equal.

To evaluate the assumption of equal variances in the two populations, you can use the F-test for equality of variances. To test the hypothesis, you calculate the F statistic, which is the ratio of the maximum sample variance of the two groups to the minimum sample variance of the two groups. By construction, the F statistic is always greater than or equal to 1.



In [39]:

    
ods graphics on;
proc ttest data=statdata.testscores 
           /*plots(shownull)=interval;*/
           plots(unpack shownull) = all;
           /*SIDES=U;*/
           /*SIDES=L;*/
   class Gender;
   var SATScore;
   title "Two-Sample t=Test Comparing Girls to Boys";
run;
ods graphics off;









    Out[39]:










SAS Output






Two-Sample t=Test Comparing Girls to Boys 


The TTEST Procedure

Variable:  SATScore







Gender
N
Mean
Std Dev
Std Err
Minimum
Maximum




Female
40
1221.0
157.4
24.8864
910.0
1590.0


Male
40
1160.3
130.9
20.7008
890.0
1600.0


Diff (1-2)
 
60.7500
144.8
32.3706
 
 










Gender
Method
Mean
95% CL Mean
Std Dev
95% CL Std Dev




Female
 
1221.0
1170.7
1271.3
157.4
128.9
202.1


Male
 
1160.3
1118.4
1202.1
130.9
107.2
168.1


Diff (1-2)
Pooled
60.7500
-3.6950
125.2
144.8
125.2
171.7


Diff (1-2)
Satterthwaite
60.7500
-3.7286
125.2
 
 
 










Method
Variances
DF
t Value
Pr > |t|




Pooled
Equal
78
1.88
0.0643


Satterthwaite
Unequal
75.497
1.88
0.0644










Equality of Variances


Method
Num DF
Den DF
F Value
Pr > F




Folded F
39
39
1.45
0.2545

In general, Pool variance is

$$s_{p}^{2}={\frac {\sum _{{i=1}}^{k}(n_{i}-1)s_{i}^{2}}{\sum _{{i=1}}^{k}(n_{i}-1)}}$$

Welch's t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. Welch's t-test is an adaptation of Student's t-test,[1] that is more reliable when the two samples have unequal variances and unequal sample sizes.[2] These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.

$$t\quad =\quad {\;\overline {X}_{1}-\overline {X}_{2}\; \over {\sqrt {\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\quad }}}\,$$

where ${\displaystyle {\overline {X}}_{1}}$, $ s_{1}^{2}$ and $ N_{1}$ are the 1st sample mean, population standard deviation and sample size, respectively. Unlike in Student's t-test, the denominator is not based on a pooled variance estimate.

The degrees of freedom ${\displaystyle \nu }$ associated with this variance estimate is approximated using the Welch–Satterthwaite equation:

$${\displaystyle \nu \quad \approx \quad {{\left(\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\;\right)^{2}} \over {\quad {s_{1}^{4} \over N_{1}^{2}\nu _{1}}\;+\;{s_{2}^{4} \over N_{2}^{2}\nu _{2}}\quad }}} \nu \quad \approx \quad {{\left(\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\;\right)^{2}} \over {\quad {s_{1}^{4} \over N_{1}^{2}\nu _{1}}\;+\;{s_{2}^{4} \over N_{2}^{2}\nu _{2}}\quad }}$$

Here ${\displaystyle \nu _{1}=N_{1}-1} {\displaystyle \nu _{1}=N_{1}-1}$, the degrees of freedom associated with the first variance estimate. ${\displaystyle \nu _{2}=N_{2}-1} {\displaystyle \nu _{2}=N_{2}-1}$, the degrees of freedom associated with the 2nd variance estimate.

In the above example,

$t=\frac{{\bar{X}}_{1}-{\bar{X}}_{2}}{s_{p}{\sqrt{2/n}}}$

$s_{p}={\sqrt{\frac {s_{X_{1}}^{2}+s_{X_{2}}^{2}}{2}}}$

$\sqrt{\frac{((157.4^{2}+130.9^{2})}{2}} = 144.76$

Aso,

${\text{SD}}_{\bar {x}}\ ={\frac {\sigma }{\sqrt {n}}}$

For female, ${\text{SD}}_{{Female}}\ = \frac{157.4}{\sqrt{40}}$

The SHOWNULL option displays a vertical reference line at the null hypothesis value (zero by default) on the plot of mean differences. That plot shows the value of the difference between the sample means and the confidence interval (95% by default) around the value. If the confidence interval includes the null hypothesis value, the implication is that the difference is not statistically significant at your chosen alpha level (ALPHA=0.05 by default).

The decision to choose the pooled or Satterthwaite values is somewhat controversial. Some statisticians recommend that you look at the p-value from the test for equality of variances to decide which t- and p-value to use. Others think that this is somewhat circular reasoning and that you should decide ahead of time which assumption is most reasonable.



In [40]:

    
title "Demonstrating a Paired T-test";
proc ttest data=reading; /*By default, it prints (summaryplot agreementplot profilesplot);*/
paired After*Before;
run;

/*You see that the reading program did indeed increase reading speeds (by an average of
9.375 words per minute), with a p-value of .0194.*/









    Out[40]:










SAS Output






Demonstrating a Paired T-test 


The TTEST Procedure

Difference:  After - Before







N
Mean
Std Dev
Std Err
Minimum
Maximum




8
9.3750
8.7821
3.1049
-1.0000
24.0000










Mean
95% CL Mean
Std Dev
95% CL Std Dev




9.3750
2.0330
16.7170
8.7821
5.8065
17.8739










DF
t Value
Pr > |t|




7
3.02
0.0194

ANOVA

ANOVA is a type of regression where independent variables are nominal variables. Nominal variable is one that have two or more levels, but there is no intrinsic ordering for the levels. ANOVA stands for Analysis of Variance. It is used to compare more than two means. As the name suggest, it estimate an variance and based on the variance, it allow us to make a conclusion about the comparison of means. It is true that we can also use t-test to compare more than two means. But, t-test will increase the type-I-error when t-test do multiple comparison on the same data. Depends on the number of independent variable, we can classify ANOVA in to different types.

PROC GLM DATA=SAS-data-set<options>;
    CLASS variable(s);
    MODEL dependents=independents </options>;
    MEANS effects </options>;
    LSMEANS effects </options>;
    STORE <OUT=>item-store-name </ LABEL='label'>;
RUN;
QUIT;

For one-way ANOVA, one can use:

PROC ANOVA;
CLASS variable-list;
MODEL dependent = effects;

ANOVA cautions

Balanced experiments (those with an equal sample size for each treatment) are relatively easy to interpret; Unbalanced experiments offer more complexity. For single factor (one way) ANOVA, the adjustment for unbalanced data is easy, but the unbalanced analysis lacks both robustness and power. For more complex designs the lack of balance leads to further complications. "The orthogonality property of main effects and interactions present in balanced data does not carry over to the unbalanced case. This means that the usual analysis of variance techniques do not apply. Consequently, the analysis of unbalanced factorials is much more difficult than that for balanced designs." In the general case, "The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered."[43] The simplest techniques for handling unbalanced data restore balance by either throwing out data or by synthesizing missing data. More complex techniques use regression.

ANOVA can be thought of as linear regression on dummy variables.

It is only in the interpretation of the model that a distinction is made.

You calculate the variability between the means and the variability of observations within each group, and then calculate a ratio between these two measurements. If the between-group variability is significantly larger than the within-group variability, you reject the null that all of the group means are equal. So, you partition out the variability using sums of squares. For ANOVA, you calculate three types of sums of squares:

Between Group Variation (model sums of squares or SSM,
Within Group Variation, and
Total Variation

The first assumption is one of independent observations
The second assumption is that the error terms are normally distributed. You verify this assumption by examining diagnostic plots of the residuals.
The third assumption is that the error terms have equal variances across treatments.



In [41]:

    
proc print data=statdata.mggarlic (obs=10);
   title "Partial Listing of Garlic Data";
run;

proc means data=statdata.mggarlic printalltypes maxdec=3;
    var BulbWt;
    class Fertilizer;
    title "Descriptive Statistics of Garlic Weight";
run;

ods graphics on / width=700;
proc sgplot data=statdata.mggarlic;
    vbox BulbWt / category=Fertilizer connect= mean datalabel=BedID;
    format BedID 5.;
    title "Box Plots of Garlic Weight";
run;
title;









    Out[41]:










SAS Output





Partial Listing of Garlic Data 







Obs
Fertilizer
BulbWt
Cloves
BedID




1
4
0.20901
11.5062
30402


2
3
0.25792
12.2550
23423


3
2
0.21588
12.0982
20696


4
4
0.24754
12.9199
25412


5
1
0.24402
12.5793
10575


6
3
0.20150
10.6891
21466


7
1
0.20891
11.5416
14749


8
4
0.15173
14.0173
25342


9
2
0.24114
9.9072
20383


10
3
0.23350
11.2130
23306








Descriptive Statistics of Garlic Weight 


The MEANS Procedure







Analysis Variable : BulbWt


N Obs
N
Mean
Std Dev
Minimum
Maximum




32
32
0.219
0.029
0.152
0.278










Analysis Variable : BulbWt


Fertilizer
N Obs
N
Mean
Std Dev
Minimum
Maximum




1
9
9
0.225
0.025
0.188
0.254


2
8
8
0.209
0.026
0.159
0.241


3
11
11
0.230
0.026
0.189
0.278


4
4
4
0.196
0.041
0.152
0.248



In [42]:

    
ods graphics on / width=700;

proc glm data=statdata.mggarlic plots(only)=diagnostics ;
   class Fertilizer;
   model BulbWt=Fertilizer;  /****/
   means Fertilizer / hovtest;
   title "Testing for Equality of Means with PROC GLM";
run;
quit;
title;









    Out[42]:










SAS Output





Testing for Equality of Means with PROC GLM 


The GLM Procedure








Class Level Information


Class
Levels
Values




Fertilizer
4
1 2 3 4










Number of Observations Read
32


Number of Observations Used
32







Testing for Equality of Means with PROC GLM 


The GLM Procedure

Dependent Variable: BulbWt









Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Model
3
0.00457996
0.00152665
1.96
0.1432


Error
28
0.02183054
0.00077966
 
 


Corrected Total
31
0.02641050
 
 
 










R-Square
Coeff Var
Root MSE
BulbWt Mean




0.173414
12.74520
0.027922
0.219082










Source
DF
Type I SS
Mean Square
F Value
Pr > F




Fertilizer
3
0.00457996
0.00152665
1.96
0.1432










Source
DF
Type III SS
Mean Square
F Value
Pr > F




Fertilizer
3
0.00457996
0.00152665
1.96
0.1432













Testing for Equality of Means with PROC GLM 


The GLM Procedure










Levene's Test for Homogeneity of BulbWt Variance
ANOVA of Squared Deviations from Group Means



Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Fertilizer
3
1.716E-6
5.719E-7
0.98
0.4173


Error
28
0.000016
5.849E-7
 
 








Testing for Equality of Means with PROC GLM 


The GLM Procedure








Level of
Fertilizer

N
BulbWt


Mean
Std Dev




1
9
0.22540667
0.02452242


2
8
0.20856500
0.02641979


3
11
0.22982091
0.02644365


4
4
0.19635250
0.04139664



In [43]:

    
proc anova data=statdata.mggarlic;
    class Fertilizer;
    model BulbWt=Fertilizer;
run;









    Out[43]:










SAS Output





The ANOVA Procedure








Class Level Information


Class
Levels
Values




Fertilizer
4
1 2 3 4










Number of Observations Read
32


Number of Observations Used
32







The ANOVA Procedure

Dependent Variable: BulbWt









Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Model
3
0.00457996
0.00152665
1.96
0.1432


Error
28
0.02183054
0.00077966
 
 


Corrected Total
31
0.02641050
 
 
 










R-Square
Coeff Var
Root MSE
BulbWt Mean




0.173414
12.74520
0.027922
0.219082










Source
DF
Anova SS
Mean Square
F Value
Pr > F




Fertilizer
3
0.00457996
0.00152665
1.96
0.1432

Interpretation of results

You want to see a random scatter of residuals above and below 0 for the four fertilizer groups. The plot looks good.

You use Levene's Test for Homogeneity to formally test the equal variance assumption. Because the p-value of 0.4173 is greater than 0.05, you fail to reject the null and conclude that the variances are equal. This is good. You verified the equal variance assumption.

Because the residuals follow the diagonal reference line fairly closely, you can say that they are approximately normal.

The mean square error is $0.00078$, which is an estimate of the population variance. SAS calculates MSE by dividing the error sum of squares by the error Degrees of Freedom ($0.00457996/3$), which gives you the average sum of squares for the error. SAS calculates the F-statistic by dividing the MSM by the MSE ($0.00078/0.00152665$). The F statistic is 1.96. Because the corresponding p-value of $0.1432$ is greater than 0.05, you can conclude that there is not a statistically significant difference between the mean bulb weights for the four fertilizers.

It's important for you to realize that the one-way ANOVA is an omnibus test statistic and cannot tell you which specific groups are significantly different from each other, only that at least two groups are different. To determine which specific groups differ from each other, you need to use a post-hoc test.

All in all, the PROC GLM output supports your conclusion that there's not a statistically significant difference between the mean bulb weights for the four fertilizers. (F-test);

Levene's test

The HOVTEST=BARTLETT option specifies Bartlett’s test (Bartlett; 1937), a modification of the normal-theory likelihood ratio test.

The HOVTEST=BF option specifies Brown and Forsythe’s variation of Levene’s test (Brown and Forsythe; 1974).

The HOVTEST=LEVENE option specifies Levene’s test (Levene; 1960), which is widely considered to be the standard homogeneity of variance test.

In statistics, Levene's test is an inferential statistic used to assess the equality of variances for a variable calculated for two or more groups. Some common statistical procedures assume that variances of the populations from which different samples are drawn are equal. Levene's test assesses this assumption. It tests the null hypothesis that the population variances are equal (called homogeneity of variance or homoscedasticity). If the resulting p-value of Levene's test is less than some significance level (typically 0.05), the obtained differences in sample variances are unlikely to have occurred based on random sampling from a population with equal variances. Thus, the null hypothesis of equal variances is rejected and it is concluded that there is a difference between the variances in the population.

The test statistic, $W$, is defined as follows:

$${\displaystyle W={\frac {(N-k)}{(k-1)}}{\frac {\sum _{i=1}^{k}N_{i}(Z_{i\cdot }-Z_{\cdot \cdot })^{2}}{\sum _{i=1}^{k}\sum _{j=1}^{N_{i}}(Z_{ij}-Z_{i\cdot })^{2}}},}$$

where

$k$ is the number of different groups to which the sampled cases belong,

$N_{i}$ is the number of cases in the $ith$ group,

$N$ is the total number of cases in all groups,

$Y_{{ij}}$ is the value of the measured variable for the $jth$ case from the $ith$ group,

${\displaystyle Z_{ij}={\begin{cases}|Y_{ij}-{\bar {Y}}_{i\cdot }|,&{\bar {Y}}_{i\cdot }{\text{ is a mean of the }}i{\text{-th group}},\\|Y_{ij}-{\tilde {Y}}_{i\cdot }|,&{\tilde {Y}}_{i\cdot }{\text{ is a median of the }}i{\text{-th group}}.\end{cases}}}$

ANOVA with Data from a Randomized Block Design

Along with the three original ANOVA assumptions of independent observations, normally distributed errors, and equal variances across treatments, you make two more assumptions when you include a blocking factor in the model. First, you assume that the treatments are randomly assigned within each block. In the T-cell count example, this means that you assume the three medications are randomly assigned to each of the three age groups. Next, you assume that the effects of the treatment factor are constant across the levels of the blocking factor, meaning that the effects of the treatment factor don't depend on the block they are in. When the effects of the treatment factor are not constant across the levels of another variable, it's called interaction. But when you use a randomized block design, you assume that the effects are the same within each block. In other words, you assume that there are no interactions with the blocking variable

The farmers divide the farm into eight sectors, each of which has four beds, and in each of the four beds, they randomly assign each of the four fertilizers. An experimental design like this is often referred to as a randomized block design. As you can see in this ANOVA model, Sector is the blocking variable.



In [44]:

    
proc print data=statdata.mggarlic_block (obs=10);
run;

proc glm data=statdata.mggarlic_block plots(only)=diagnostics;
     class Fertilizer Sector;
     model BulbWt=Fertilizer Sector;
     title "ANOVA for Randomized Block Design";
run;
quit;
title;









    Out[44]:










SAS Output












Obs
Sector
Position
Fertilizer
BulbWt
Cloves
BedId




1
1
1
3
0.25881
11.6322
22961


2
1
2
4
0.20746
12.5837
23884


3
1
3
1
0.27453
12.0597
19642


4
1
4
2
0.24467
12.1001
20384


5
2
1
3
0.21454
11.5863
20303


6
2
2
4
0.16953
12.7132
21004


7
2
3
1
0.22504
12.0470
16117


8
2
4
2
0.16809
11.9071
19686


9
3
1
4
0.21720
12.3655
26527


10
3
2
3
0.22551
11.6864
23574








ANOVA for Randomized Block Design 


The GLM Procedure








Class Level Information


Class
Levels
Values




Fertilizer
4
1 2 3 4


Sector
8
1 2 3 4 5 6 7 8










Number of Observations Read
32


Number of Observations Used
32







ANOVA for Randomized Block Design 


The GLM Procedure

Dependent Variable: BulbWt









Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Model
10
0.02307263
0.00230726
5.86
0.0003


Error
21
0.00826745
0.00039369
 
 


Corrected Total
31
0.03134008
 
 
 










R-Square
Coeff Var
Root MSE
BulbWt Mean




0.736202
9.085064
0.019842
0.218398










Source
DF
Type I SS
Mean Square
F Value
Pr > F




Fertilizer
3
0.00508630
0.00169543
4.31
0.0162


Sector
7
0.01798632
0.00256947
6.53
0.0004










Source
DF
Type III SS
Mean Square
F Value
Pr > F




Fertilizer
3
0.00508630
0.00169543
4.31
0.0162


Sector
7
0.01798632
0.00256947
6.53
0.0004



In [45]:

    
proc glm data=statdata.ads1 plots(only)=diagnostics;
   class Ad Area;
   model Sales=Ad Area;
   title 'ANOVA for Randomized Block Design';
run;
quit;
title;









    Out[45]:










SAS Output





ANOVA for Randomized Block Design 


The GLM Procedure








Class Level Information


Class
Levels
Values




Ad
4
display paper people radio


Area
18
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18










Number of Observations Read
144


Number of Observations Used
144







ANOVA for Randomized Block Design 


The GLM Procedure

Dependent Variable: Sales









Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Model
20
15131.38889
756.56944
8.43
<.0001


Error
123
11037.91667
89.73916
 
 


Corrected Total
143
26169.30556
 
 
 










R-Square
Coeff Var
Root MSE
Sales Mean




0.578211
14.17712
9.473076
66.81944










Source
DF
Type I SS
Mean Square
F Value
Pr > F




Ad
3
5866.083333
1955.361111
21.79
<.0001


Area
17
9265.305556
545.017974
6.07
<.0001










Source
DF
Type III SS
Mean Square
F Value
Pr > F




Ad
3
5866.083333
1955.361111
21.79
<.0001


Area
17
9265.305556
545.017974
6.07
<.0001

ANOVA Post Hoc Tests

A pairwise comparison examines the difference between two treatment means. If your ANOVA results suggest that you reject the null hypothesis that the means are equal across groups, you can conduct multiple pairwise comparisons in a post hoc analysis to learn which means differ.

The chance that you make a Type I error increases each time you conduct a statistical test. The comparisonwise error rate, or CER, is the probability of a Type I error on a single pairwise test. The experimentwise error rate, or EER, is the probability of making at least one Type I error when performing all of the pairwise comparisons. The EER increases as the number of pairwise comparisons increases.

With a statistically conservative multiple comparison method, such as the Tukey or Dunnett method, you control for the EER, so there's a tendency to find fewer significant differences than might otherwise be found. When you make no adjustments for multiple comparisons, you are likely to find more significant differences than might otherwise be found. If only the comparisonwise error rate is controlled, the overall risk of a Type I error across all the comparisons is increased (and therefore the risk of Type II error is decreased), so the test might find more significant differences than would otherwise be found.

The Tukey method compares all possible pairs of means, so it can only be used when you make pairwise comparisons (See Tukey's range test) Tukey's procedure is only applicable for pairwise comparisons. It assumes independence of the observations being tested, as well as equal variation across observations (homoscedasticity). The procedure calculates for each pair the studentized range statistic: ${\frac {Y_{{A}}-Y_{{B}}}{SE}}$ where $ Y_{{A}}$ is the larger of the two means being compared, ${\displaystyle Y_{B}} Y_{{B}}$ is the smaller, and $SE$ is the standard error of the data in question. Tukey's test is essentially a Student's t-test, except that it corrects for family-wise error-rate.

Dunnett's method is a specialized multiple comparison test that enables you to compare a single control group to all other groups.



In [46]:

    
ods graphics on / width=700;
ods trace on;

proc glm data=statdata.mggarlic_block;
    class Fertilizer Sector;
    model BulbWt=Fertilizer Sector;             /*****/
    lsmeans Fertilizer / pdiff=all adjust=tukey;
    lsmeans Fertilizer / pdiff=controlu('4') adjust=dunnett;
    lsmeans Fertilizer / pdiff=all adjust=t;
    title "Garlic Data: Multiple Comparisons";
run;
quit;
title;









    Out[46]:










SAS Output





Garlic Data: Multiple Comparisons 


The GLM Procedure








Class Level Information


Class
Levels
Values




Fertilizer
4
1 2 3 4


Sector
8
1 2 3 4 5 6 7 8










Number of Observations Read
32


Number of Observations Used
32







Garlic Data: Multiple Comparisons 


The GLM Procedure

Dependent Variable: BulbWt









Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Model
10
0.02307263
0.00230726
5.86
0.0003


Error
21
0.00826745
0.00039369
 
 


Corrected Total
31
0.03134008
 
 
 










R-Square
Coeff Var
Root MSE
BulbWt Mean




0.736202
9.085064
0.019842
0.218398










Source
DF
Type I SS
Mean Square
F Value
Pr > F




Fertilizer
3
0.00508630
0.00169543
4.31
0.0162


Sector
7
0.01798632
0.00256947
6.53
0.0004










Source
DF
Type III SS
Mean Square
F Value
Pr > F




Fertilizer
3
0.00508630
0.00169543
4.31
0.0162


Sector
7
0.01798632
0.00256947
6.53
0.0004













Garlic Data: Multiple Comparisons 


The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Tukey










Fertilizer
BulbWt LSMEAN
LSMEAN Number




1
0.23625000
1


2
0.21115125
2


3
0.22330125
3


4
0.20288875
4










Least Squares Means for effect Fertilizer
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: BulbWt





i/j
1
2
3
4




1
 
0.0840
0.5699
0.0144


2
0.0840
 
0.6186
0.8383


3
0.5699
0.6186
 
0.1995


4
0.0144
0.8383
0.1995
 



















Garlic Data: Multiple Comparisons 


The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Dunnett










Fertilizer
BulbWt LSMEAN
H0:LSMean=Control


Pr > t




1
0.23625000
0.0040


2
0.21115125
0.3978


3
0.22330125
0.0638


4
0.20288875
 



















Garlic Data: Multiple Comparisons 


The GLM Procedure
Least Squares Means










Fertilizer
BulbWt LSMEAN
LSMEAN Number




1
0.23625000
1


2
0.21115125
2


3
0.22330125
3


4
0.20288875
4










Least Squares Means for effect Fertilizer
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: BulbWt





i/j
1
2
3
4




1
 
0.0195
0.2059
0.0029


2
0.0195
 
0.2342
0.4143


3
0.2059
0.2342
 
0.0523


4
0.0029
0.4143
0.0523
 
















Note:To ensure overall protection level, only probabilities associated with pre-planned comparisons should be used.

You can use the ODS TRACE statement in your program. When you add the ODS TRACE statement, SAS writes a trace record to the log that includes information about each output object, such as the path for each object and the label for each object. When you check the log, you can see a list of the ODS output objects and information about each one. Now that you know the names, you can decide what output to specify in the ODS SELECT statement. Let's turn tracing off, and then specify only the output we want to see. Now let's submit this program.

The only significant pairwise difference is between fertilizer 1 and fertilizer 4. The p-value of 0.0144 is less than your alpha, meaning that the bulb weights of these two fertilizers are significantly different from one another.

Because you performed one-sided, upper-tailed hypothesis tests of each of the organic fertilizers versus the chemical fertilizer, you only see the upper shaded region with the UDL in your plot. Remember that the bottom horizontal line is the least squares mean of your control group. The vertical line for fertilizer 1 is the only line that extends past the UDL.

Finally, let's examine the output that corresponds to your third LSMEANS statement. These t-tests do not adjust for multiple comparisons, and are therefore more liberal than tests that do control for the EER. Take a moment to look at the p-values. You might notice that the p-values in this table are smaller than those in the Tukey table. In fact, which additional significant pairwise difference does this method show? It shows that fertilizer 1 is significantly different from fertilizer 2 with a p-value of 0.0195. This is in addition to fertilizers 1 and 4 being statistically different with a p-value of 0.0029. Notice also that the comparison between fertilizers 3 and 4 is nearly significant. So with this test, there's a tendency to find more significant pairwise differences than might actually exist.

Lastly, let's take a look at the diffogram. Again, this reinforces what you know about fertilizers 1 and 4 and fertilizers 1 and 2. Using these multiple comparison techniques gives you options. If you feel strongly about controlling the EER, you shouldn't use the pairwise t-test results and should instead use the Tukey or Dunnett results. You knew before you performed these multiple comparison techniques that fertilizer 1 produced the garlic with the heaviest overall mean bulb weight, so that would be your first choice if you are not considering other factors like cost or availability. But what if fertilizer 1 is very expensive or hard to obtain? With these multiple comparison techniques, you now know which fertilizers are not statistically different from fertilizer 1, so the Montana Gourmet Garlic farmers have options for the fertilizer to use that will produce equally heavy garlic bulbs.

Two-Way ANOVA with Interactions

When you have two categorical predictor variables and a continuous response variable, you can analyze your data using two-way ANOVA. With two-way ANOVA, you can examine the effects of the two predictor variables concurrently. You can also determine whether they interact with respect to their effect on the response variable. An interaction means that the effects on one variable depend on the value of another variable. If there is no interaction, you can interpret the test for the individual factor effects to determine their significance. If an interaction exists between any factors, the test for the individual factor effects might be misleading due to the masking of these effects by the interaction.

You can include interactions and more than one predictor variable in the ANOVA model.

You can graphically explore the relationship between the response variable and the effect of the interaction between the two predictor variables using PROC SGPLOT.

You can use PROC GLM to determine whether the effects of the predictor variables and the interaction between them are statistically significant.

When running PROC GLM, you can add a STORE statement to save your analysis results. By using the STORE statement, you can run postprocessing analyses on the stored results, even if you no longer have access to the original data set. The STORE statement requests that the procedure save the context and results of the statistical analysis into an item store. To perform post-fitting statistical analyses and plotting for the contents of the store item, you use the PLM procedure.



In [47]:

    
proc format;
   value dosef
   1="Placebo"
   2="100 mg"
   3="200mg"
   4="500mg";
run;

proc means data=statdata.drug
           mean var std printalltypes;
   class Disease DrugDose;
   var BloodP;
   output out=means mean=BloodP_Mean;
   format DrugDose dosef.;
   title "Selected Descriptive Statistics for Drug Data Set";
run;
title;









    Out[47]:










SAS Output





Selected Descriptive Statistics for Drug Data Set 


The MEANS Procedure







Analysis Variable : BloodP


N Obs
Mean
Variance
Std Dev




170
-2.2941176
620.3745214
24.9073186










Analysis Variable : BloodP


DrugDose
N Obs
Mean
Variance
Std Dev




Placebo
41
-2.4390244
303.6024390
17.4241912


100 mg
44
-3.1136364
483.9170190
21.9981140


200mg
41
-1.6097561
790.6939024
28.1192799


500mg
44
-1.9772727
935.0924947
30.5792821










Analysis Variable : BloodP


Disease
N Obs
Mean
Variance
Std Dev




A
59
-15.0169492
434.4997078
20.8446566


B
57
10.6666667
629.9404762
25.0986150


C
54
-2.0740741
476.1830887
21.8216198










Analysis Variable : BloodP


Disease
DrugDose
N Obs
Mean
Variance
Std Dev




A
Placebo
12
1.3333333
183.1515152
13.5333483


 
100 mg
16
-9.6875000
356.7625000
18.8881577


 
200mg
13
-26.2307692
329.0256410
18.1390640


 
500mg
18
-22.5555556
445.0849673
21.0970369


B
Placebo
15
-8.1333333
285.9809524
16.9109714


 
100 mg
15
5.4000000
479.1142857
21.8886794


 
200mg
14
24.7857143
563.7197802
23.7427838


 
500mg
13
23.2307692
556.3589744
23.5872630


C
Placebo
14
0.4285714
411.8021978
20.2929100


 
100 mg
13
-4.8461538
577.6410256
24.0341637


 
200mg
14
-5.1428571
195.5164835
13.9827209


 
500mg
13
1.3076923
828.5641026
28.7847894

The Means data set contains the variable TYPE, with values ranging from 0 to 3 to represent the four tables this PROC MEANS program generates. Type 0 gives you the mean blood pressure change of all observations, regardless of disease type or drug dose. Type 1 gives you the mean blood pressure for each drug dose, regardless of disease type. Type 2 gives you the mean blood pressure for each disease type, regardless of drug dose. And Type 3 gives you the mean blood pressure for each disease type and drug dose combination.



In [48]:

    
ods graphics on / width=800;

proc sgplot data=means;
   where _TYPE_=3;     
   scatter x=DrugDose y=BloodP_Mean / 
           group=Disease markerattrs=(size=10);
   series x=DrugDose y=BloodP_Mean / group=Disease        
          lineattrs=(thickness=2);
   xaxis integer;
   format DrugDose dosef.;  
   title "Plot of Stratified Means in Drug Data Set";
run;
title;

ods graphics on / width=800;

proc glm data=statdata.drug;
   class DrugDose Disease;
   model Bloodp=DrugDose Disease DrugDose*Disease / ss3;
   format DrugDose dosef.;  
   title "Analyze the Effects of DrugDose and Disease";
   title2 "Including Interaction";
run;
quit;
title;









    Out[48]:










SAS Output













Analyze the Effects of DrugDose and Disease 
Including Interaction 


The GLM Procedure








Class Level Information


Class
Levels
Values




DrugDose
4
100 mg 200mg 500mg Placebo


Disease
3
A B C










Number of Observations Read
170


Number of Observations Used
170







Analyze the Effects of DrugDose and Disease 
Including Interaction 


The GLM Procedure

Dependent Variable: BloodP









Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Model
11
36476.8353
3316.0759
7.66
<.0001


Error
158
68366.4589
432.6991
 
 


Corrected Total
169
104843.2941
 
 
 










R-Square
Coeff Var
Root MSE
BloodP Mean




0.347918
-906.7286
20.80142
-2.294118










Source
DF
Type III SS
Mean Square
F Value
Pr > F




DrugDose
3
335.73526
111.91175
0.26
0.8551


Disease
2
18742.62386
9371.31193
21.66
<.0001


DrugDose*Disease
6
17146.31698
2857.71950
6.60
<.0001

The p-value for the overall model is very small, so what does this tell you? You can reject the null hypothesis and conclude that at least one of the effects in the model is significant, in other words, there is at least one difference among the 12 group means, one for each drug dose and disease combination. Which factors explain this difference? You'll see in just a few moments.

The R square is 0.3479, so approximately 35% of the variation in blood pressure change can be explained by the model. The average blood pressure change of all the observations is –2.294, which is exactly what the PROC MEANS output showed.

The next tables show the breakdown of the main effects and interaction term in the model. Look at the Type I and Type III Sums of Squares values. Do you know why their values are not exactly the same? You don't have a balanced design in this experiment. In other words, you have a different number of observations in each drug dose and disease combination group. In most situations, you will want to use the Type III SS. The Type I, or sequential SS, are the sums of squares you obtain from fitting the effects in the order you specify in the model. The Type III, or marginal SS, are the sums of squares you obtain from fitting each effect after all the other terms in the model, that is, the sums of squares for each effect corrected for the other terms in the model. Type III SS does not depend upon the order you specify effects in the model. (Use the SS3 option on the MODEL statement, this portion of the output shows the results for Type III SS only.)

You want to look at the interaction term first. If it's significant, the main effects don't tell you the whole story. The p-value for DrugDose*Disease is 0.0001. Presuming an alpha of 0.05, you reject the null hypothesis. You have sufficient evidence to conclude that there is an interaction between the two factors, meaning that the effect of the level of drug dose on blood pressure changes for the different disease types. You don't need to worry all that much about the significance of the main effects at this point for two reasons: 1) Because the interaction term is significant, you know that the effect of the drug level changes for the different disease types. 2) Because the interaction term is significant, you want to include the main effects in the model, whether they are significant or not, to preserve model hierarchy.

Let's finally take a look at the interaction plot for blood pressure. SAS produces this plot by default when you have an interaction term in the model. This plot looks similar to the one you produced with PROC SGPLOT, except that this one plots each of the blood pressure change measurements, as well as the means for each drug dose and disease type combination. Well, you might be thinking, "Okay. I know the interaction is significant. What I really want to know is the effect of drug dose at each particular level of disease." You have to add the LSMEANS statement to your program to find the answer.

The PLM Procedure

The PLM procedure performs post-fitting statistical analyses and plotting for the contents of a SAS item store that were previously created with the STORE statement in some other SAS/STAT procedure.

The statements that are available in the PLM procedure are designed to reveal the contents of the source item store via the Output Delivery System (ODS) and to perform post-fitting tasks.

The use of item stores and PROC PLM enables you to separate common post-processing tasks, such as testing for treatment differences and predicting new observations under a fitted model, from the process of model building and fitting. A numerically expensive model fitting technique can be applied once to produce a source item store. The PLM procedure can then be called multiple times, and the results of the fitted model are analyzed without incurring the model fitting expenditure again.

Selected PROC PLM option:

RESTORE specifies the source item store for processing.
Selected PROC PLM procedure statements:
EFFECTPLOT produces a display of the fitted model and provides options for changing and enhancing the displays.
LSMEANS computes and compares least squares means (LS-means) of fixed effects.
LSMESTIMATE provides custom hypothesis tests among least squares means.
SHOW uses the Output Delivery System to display contents of the item store. This statement is useful for verifying that the contents of the item store apply to the analysis and for generating ODS tables.
SLICE provides a general mechanism for performing a partitioned analysis of the LS-means for an interaction. This analysis is also known as an analysis of simple effects. The SLICE statement uses the same options as the LSMEANS statement.
WHERE is used in the PLM procedure when the item store contains BY-variable information and you want to apply the PROC PLM statements to only a subset of the BY groups.

PROC PLM RESTORE=item-store-specification<options>;
    EFFECTPLOT <plot-type <(plot-definition options)>> 
          </ options>;
    LSMEANS <model-effects > </ options>;
    LSMESTIMATE model-effect <'label'> values 
      <divisor=n><,...<'label'> values
      <divisor=n>> </ options>;
SHOW options;
SLICE model-effect </ options>;
WHERE expression ;
RUN;

Saving Analysis Results with the STORE Statement

The STORE statement requests that the procedure save the context and results of the statistical analysis into an item store. An item store is a binary file format that cannot be modified. You can process the contents of an item store with the PLM procedure.

For example, if you need to perform a time-consuming analysis, you can store the results by using the STORE statement.

At a later time, you can use PROC PLM to perform specific statistical analysis tasks based on the saved results of the previous analysis without having to fit the model again. This can be a great time saver!

Here is the syntax of the STORE statement. Following the keyword STORE and OUT= you specify the item store name and an optional label.

You can use the STORE statement in a number of SAS/STAT procedures. For more information about the STORE statement, click the Information button.



In [49]:

    
proc contents data = statdata.ameshousing3;
run;









    Out[49]:










SAS Output





The CONTENTS Procedure








Data Set Name
STATDATA.AMESHOUSING3
Observations
300


Member Type
DATA
Variables
32


Engine
V9
Indexes
0


Created
05/18/2017 20:21:01
Observation Length
248


Last Modified
05/18/2017 20:21:01
Deleted Observations
0


Protection
 
Compressed
NO


Data Set Type
 
Sorted
NO


Label
 
 
 


Data Representation
SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
 
 


Encoding
utf-8  Unicode (UTF-8)
 
 










Engine/Host Dependent Information




Data Set Page Size
65536


Number of Data Set Pages
2


First Data Page
1


Max Obs per Page
263


Obs in First Data Page
235


Number of Data Set Repairs
0


Filename
/folders/myfolders/ECST131/ameshousing3.sas7bdat


Release Created
9.0401M4


Host Created
Linux


Inode Number
372


Access Permission
rwxrwx---


Owner Name
root


File Size
192KB


File Size (bytes)
196608










Alphabetic List of Variables and Attributes


#
Variable
Type
Len
Format
Informat
Label




21
Age_Sold
Num
8
BEST12.
BEST12.
Age of house when sold, in years


16
Basement_Area
Num
8
BEST12.
BEST12.
Basement area in square feet


10
Bedroom_AbvGr
Num
8
BEST12.
BEST12.
Bedrooms above grade


31
Bonus
Num
8
BEST12.
BEST12.
Sale Price > $175,000


8
Central_Air
Char
1
$CHAR1.
$CHAR1.
Presence of central air conditioning


20
Deck_Porch_Area
Num
8
BEST12.
BEST12.
Total area of decks and porches in square feet


11
Fireplaces
Num
8
BEST12.
BEST12.
Number of fireplaces


24
Foundation_2
Char
16
$CHAR16.
$CHAR16.
Foundation Type


17
Full_Bathroom
Num
8
BEST12.
BEST12.
Number of full bathrooms


12
Garage_Area
Num
8
BEST12.
BEST12.
Size of garage in square feet


23
Garage_Type_2
Char
8
$CHAR8.
$CHAR8.
Garage attached or detached


9
Gr_Liv_Area
Num
8
BEST12.
BEST12.
Above grade (ground) living area square feet


18
Half_Bathroom
Num
8
BEST12.
BEST12.
Number of half bathrooms


7
Heating_QC
Char
2
$CHAR2.
$CHAR2.
Heating quality and condition


3
House_Style
Char
6
$CHAR6.
$CHAR6.
Style of dwelling


27
House_Style2
Char
6
$CHAR6.
$CHAR6.
Style of dwelling


30
Log_Price
Num
8
BEST12.
BEST12.
Natural log of the sale price


2
Lot_Area
Num
8
BEST12.
BEST12.
Lot size in square feet


26
Lot_Shape_2
Char
10
$CHAR10.
$CHAR10.
Regular or irregular lot shape


25
Masonry_Veneer
Char
1
$CHAR1.
$CHAR1.
Masonry veneer or not


13
Mo_Sold
Num
8
BEST12.
BEST12.
Month Sold (MM)


5
Overall_Cond
Num
8
BEST12.
BEST12.
Overall condition of the house


29
Overall_Cond2
Num
8
BEST12.
BEST12.
Overall condition of the house


4
Overall_Qual
Num
8
BEST12.
BEST12.
Overall material and finish of the house


28
Overall_Qual2
Num
8
BEST12.
BEST12.
Overall material and finish of the house


1
PID
Char
10
$CHAR10.
$CHAR10.
 


15
SalePrice
Num
8
BEST12.
BEST12.
Sale price in dollars


22
Season_Sold
Num
8
BEST12.
BEST12.
Season when house sold


19
Total_Bathroom
Num
8
BEST12.
BEST12.
Total number of bathrooms (half bathrooms counted 10%)


6
Year_Built
Num
8
BEST12.
BEST12.
Original construction year


14
Yr_Sold
Num
8
BEST12.
BEST12.
Year Sold (YYYY)


32
score
Num
8



In [50]:

    
proc glm data=statdata.ameshousing3 
         order=internal 
         plots(only)=intplot;
   class Season_Sold Heating_QC;
   model SalePrice=Heating_QC Season_Sold Heating_QC*Season_Sold / ss3;
   lsmeans Heating_QC*Season_Sold / diff slice=Heating_QC;
   format Season_Sold ;
   store out=interact;
   title "Model with Heating Quality and Season as Interacting "
         "Predictors";
run;









    Out[50]:










SAS Output





Model with Heating Quality and Season as Interacting Predictors 


The GLM Procedure








Class Level Information


Class
Levels
Values




Season_Sold
4
1 2 3 4


Heating_QC
4
Ex Fa Gd TA










Number of Observations Read
300


Number of Observations Used
300







Model with Heating Quality and Season as Interacting Predictors 


The GLM Procedure

Dependent Variable: SalePrice   Sale price in dollars









Source
DF
Sum of Squares
Mean Square
F Value
Pr > F




Model
15
97609874155
6507324943.7
5.68
<.0001


Error
284
325613645356
1146526920.3
 
 


Corrected Total
299
423223519511
 
 
 










R-Square
Coeff Var
Root MSE
SalePrice Mean




0.230634
24.62130
33860.40
137524.9










Source
DF
Type III SS
Mean Square
F Value
Pr > F




Heating_QC
3
51116493768
17038831256
14.86
<.0001


Season_Sold
3
9318181844
3106060615
2.71
0.0455


Season_So*Heating_QC
9
24835058089
2759450899
2.41
0.0121













Model with Heating Quality and Season as Interacting Predictors 


The GLM Procedure
Least Squares Means










Season_Sold
Heating_QC
SalePrice LSMEAN
LSMEAN Number




1
Ex
145583.333
1


1
Fa
58100.000
2


1
Gd
124330.000
3


1
TA
121312.500
4


2
Ex
153765.244
5


2
Fa
98657.143
6


2
Gd
149619.833
7


2
TA
129404.412
8


3
Ex
154279.422
9


3
Fa
128800.000
10


3
Gd
113727.273
11


3
TA
134046.552
12


4
Ex
163726.933
13


4
Fa
45000.000
14


4
Gd
143812.500
15


4
TA
129345.455
16










Least Squares Means for effect Season_So*Heating_QC
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: SalePrice


i/j
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16




1
 
0.0003
0.2252
0.1354
0.5808
0.0133
0.8005
0.2815
0.5550
0.4137
0.0420
0.4276
0.2682
0.0063
0.9229
0.3455


2
0.0003
 
0.0032
0.0033
<.0001
0.0837
<.0001
0.0005
<.0001
0.0046
0.0080
0.0002
<.0001
0.7378
0.0002
0.0014


3
0.2252
0.0032
 
0.8252
0.0143
0.1250
0.0593
0.6773
0.0119
0.8097
0.4123
0.4027
0.0047
0.0263
0.2261
0.7349


4
0.1354
0.0033
0.8252
 
0.0013
0.1409
0.0156
0.4312
0.0009
0.6664
0.4959
0.1840
0.0006
0.0296
0.1260
0.5452


5
0.5808
<.0001
0.0143
0.0013
 
<.0001
0.6654
0.0021
0.9440
0.1207
<.0001
0.0046
0.3304
0.0017
0.4476
0.0345


6
0.0133
0.0837
0.1250
0.1409
<.0001
 
0.0008
0.0295
<.0001
0.1295
0.3059
0.0095
<.0001
0.1394
0.0105
0.0619


7
0.8005
<.0001
0.0593
0.0156
0.6654
0.0008
 
0.0415
0.6221
0.2249
0.0010
0.0894
0.2344
0.0029
0.6868
0.1188


8
0.2815
0.0005
0.6773
0.4312
0.0021
0.0295
0.0415
 
0.0014
0.9703
0.0917
0.5261
0.0012
0.0146
0.2798
0.9960


9
0.5550
<.0001
0.0119
0.0009
0.9440
<.0001
0.6221
0.0014
 
0.1115
<.0001
0.0029
0.3502
0.0016
0.4211
0.0294


10
0.4137
0.0046
0.8097
0.6664
0.1207
0.1295
0.2249
0.9703
0.1115
 
0.3697
0.7398
0.0467
0.0246
0.4374
0.9762


11
0.0420
0.0080
0.4123
0.4959
<.0001
0.3059
0.0010
0.0917
<.0001
0.3697
 
0.0172
<.0001
0.0481
0.0322
0.2127


12
0.4276
0.0002
0.4027
0.1840
0.0046
0.0095
0.0894
0.5261
0.0029
0.7398
0.0172
 
0.0027
0.0096
0.4451
0.6732


13
0.2682
<.0001
0.0047
0.0006
0.3304
<.0001
0.2344
0.0012
0.3502
0.0467
<.0001
0.0027
 
0.0008
0.1802
0.0110


14
0.0063
0.7378
0.0263
0.0296
0.0017
0.1394
0.0029
0.0146
0.0016
0.0246
0.0481
0.0096
0.0008
 
0.0063
0.0177


15
0.9229
0.0002
0.2261
0.1260
0.4476
0.0105
0.6868
0.2798
0.4211
0.4374
0.0322
0.4451
0.1802
0.0063
 
0.3586


16
0.3455
0.0014
0.7349
0.5452
0.0345
0.0619
0.1188
0.9960
0.0294
0.9762
0.2127
0.6732
0.0110
0.0177
0.3586
 






Model with Heating Quality and Season as Interacting Predictors 


The GLM Procedure
Least Squares Means







Season_So*Heating_QC Effect Sliced by Heating_QC for SalePrice


Heating_QC
DF
Sum of Squares
Mean Square
F Value
Pr > F




Ex
3
1759608339
586536113
0.51
0.6746


Fa
3
12318827232
4106275744
3.58
0.0143


Gd
3
14560964166
4853654722
4.23
0.0060


TA
3
2134918196
711639399
0.62
0.6021






Note:To ensure overall protection level, only probabilities associated with pre-planned comparisons should be used.



In [51]:

    
proc plm restore=interact plots=all;
   slice Heating_QC*Season_Sold / sliceby=Heating_QC adjust=tukey;
   effectplot interaction(sliceby=Heating_QC) / clm;
run;









    Out[51]:










SAS Output







Model with Heating Quality and Season as Interacting Predictors 


The PLM Procedure







Store Information




Item Store
WORK.INTERACT


Data Set Created From
STATDATA.AMESHOUSING3


Created By
PROC GLM


Date Created
03JUL17:23:17:39


Response Variable
SalePrice


Class Variables
Season_Sold Heating_QC


Model Effects
Intercept Heating_QC Season_Sold Season_So*Heating_QC










Class Level Information


Class
Levels
Values




Season_Sold
4
1 2 3 4


Heating_QC
4
Ex Fa Gd TA











F Test for Season_So*Heating_QC Least Squares Means Slice


Slice
Num DF
Den DF
F Value
Pr > F




Heating_QC Ex
3
284
0.51
0.6746










Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer


Slice
Season when house sold
Season when house sold
Estimate
Standard Error
DF
t Value
Pr > |t|
Adj P




Heating_QC Ex
1
2
-8181.91
14800
284
-0.55
0.5808
0.9457


Heating_QC Ex
1
3
-8696.09
14716
284
-0.59
0.5550
0.9348


Heating_QC Ex
1
4
-18144
16356
284
-1.11
0.2682
0.6841


Heating_QC Ex
2
3
-514.18
7310.43
284
-0.07
0.9440
0.9999


Heating_QC Ex
2
4
-9961.69
10218
284
-0.97
0.3304
0.7638


Heating_QC Ex
3
4
-9447.51
10095
284
-0.94
0.3502
0.7856















F Test for Season_So*Heating_QC Least Squares Means Slice


Slice
Num DF
Den DF
F Value
Pr > F




Heating_QC Fa
3
284
3.58
0.0143










Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer


Slice
Season when house sold
Season when house sold
Estimate
Standard Error
DF
t Value
Pr > |t|
Adj P




Heating_QC Fa
1
2
-40557
23366
284
-1.74
0.0837
0.3071


Heating_QC Fa
1
3
-70700
24728
284
-2.86
0.0046
0.0235


Heating_QC Fa
1
4
13100
39099
284
0.34
0.7378
0.9870


Heating_QC Fa
2
3
-30143
19827
284
-1.52
0.1295
0.4267


Heating_QC Fa
2
4
53657
36198
284
1.48
0.1394
0.4495


Heating_QC Fa
3
4
83800
37092
284
2.26
0.0246
0.1102















F Test for Season_So*Heating_QC Least Squares Means Slice


Slice
Num DF
Den DF
F Value
Pr > F




Heating_QC Gd
3
284
4.23
0.0060










Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer


Slice
Season when house sold
Season when house sold
Estimate
Standard Error
DF
t Value
Pr > |t|
Adj P




Heating_QC Gd
1
2
-25290
13355
284
-1.89
0.0593
0.2330


Heating_QC Gd
1
3
10603
12914
284
0.82
0.4123
0.8445


Heating_QC Gd
1
4
-19483
16061
284
-1.21
0.2261
0.6191


Heating_QC Gd
2
3
35893
10762
284
3.34
0.0010
0.0053


Heating_QC Gd
2
4
5807.33
14388
284
0.40
0.6868
0.9777


Heating_QC Gd
3
4
-30085
13980
284
-2.15
0.0322
0.1394















F Test for Season_So*Heating_QC Least Squares Means Slice


Slice
Num DF
Den DF
F Value
Pr > F




Heating_QC TA
3
284
0.62
0.6021










Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer


Slice
Season when house sold
Season when house sold
Estimate
Standard Error
DF
t Value
Pr > |t|
Adj P




Heating_QC TA
1
2
-8091.91
10265
284
-0.79
0.4312
0.8598


Heating_QC TA
1
3
-12734
9561.68
284
-1.33
0.1840
0.5434


Heating_QC TA
1
4
-8032.95
13262
284
-0.61
0.5452
0.9302


Heating_QC TA
2
3
-4642.14
7313.62
284
-0.63
0.5261
0.9207


Heating_QC TA
2
4
58.9572
11745
284
0.01
0.9960
1.0000


Heating_QC TA
3
4
4701.10
11135
284
0.42
0.6732
0.9747

The option order=internal tells SAS to use the order of the variable values stored internally, rather than the order of the formatted values. The internal values for Season_Sold are 1 (formatted as Winter), 2 (formatted as Spring), 3 (for Summer), and 4 (for Fall). So, including order=internal tells SAS to display the seasons in the order Winter, Spring, Summer, and Fall, rather than in alphabetical order.

We specify plots(only)=intplot to request an interaction plot. You can request an interaction plot even if there isn’t an interaction in the model. In this case, there is an interaction, so we’ll be able to visualize the interaction in the plot. Because Season_Sold and Heating_QC are categorical variables, we need to include them in a CLASS statement.

In the MODEL statement, we specify the response variable SalePrice, equals, the main effects Heating_QC and Season_Sold, and then interaction effect. The interaction effect is represented as the two main effects separated by an asterisk. Here’s another way to represent the effects. Instead of listing the interaction effect, we could simply place a vertical bar between the main effects.

Recall that the LSMEANS option computes and compares least squares means of fixed effects. This LSMEANS statement specifies the interaction term Heating_QC by Season_Sold. By specifying slice=Heating_QC, we tell SAS to to slice the interaction effect by the different levels of Heating_QC.

This means that each slice will have one Heating_QC level and will show the Season_Sold effect across that slice. We'll format Season_Sold with a FORMAT statement. The statement store out=interact saves the analysis results as a SAS item store named interact in the Work library.

We want to produce more tables on Season_Sold by different levels of Heating_QC, so we add the SLICE statement slice Heating_QC by Season_Sold, sliceby=Heating_QC. Notice that the syntax sliceby= in the SLICE statement is different from slice= in the LSMEANS statement.

We include adjust=tukey in order to get the Tukey adjustment for multiple comparison tests.

By specifying plots=all, the output will include an effect plot. However, by adding an EFFECTPLOT statement, we can specify more options. We specify the interaction sliceby=Heating_QC, and we specify clm, which gives us confidence limits for the means.



In [52]:

    
*variables

Region
Advertising
Gender
Book_Sales
Music_Sales
Electronics_Sales
Total_Sales
;

proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
data Store;
   length Region $ 5;
   call streaminit(57676);
   do Transaction = 1 to 200;
      R = ceil(rand('uniform')*10);
      select(R);
         when(1) Region = 'East';
         when(2) Region = 'West';
         when(3) Region = 'North';
         when(4) Region = 'South';
         otherwise;
      end;
      Advertising = rand('bernouli',.6);
      if rand('uniform') lt .6 then Gender = 'Female';
         else Gender = 'Male';
      Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
                    + 30*Advertising,10)) ;
      Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
         + 30*(Region = 'East' and Gender = 'Male')
         - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
      Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
       + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South') 
       + 75*(Region = 'West'),10));
      Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
   output;
   end;
   drop R;
   format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
          Advertising yesno.;
run;
 
/*title "Listing of Store";*/
/*proc print data=store heading=h;*/
/*run;*/

/*proc univariate data=store;*/
/*   var Book_Sales -- Total_Sales;*/
/*   histogram;*/
/*run;*/
/**/
/*title "Scatter Matrix for Store Variables";*/
/*proc sgscatter data=store;*/
/*   matrix Book_Sales -- Total_Sales / group = Gender;*/
/*run;*/
/**/
/*proc sgplot data=store;*/
/*   scatter x=Book_Sales y=Total_Sales / group=Gender;*/
/*run;*/

proc rank data=store out=median_sales groups=2;
   var Total_Sales;
   ranks Sales_Group;
run;

proc format;
   value sales 0 = 'Low'
               1 = 'High';
run;

/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male');*/
/*   model Sales_Group = Gender;*/
/*   format Sales_Group sales.;*/
/*quit;*/
/**/
/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male')*/
/*         Advertising (param=ref ref='No');*/
/*   model Sales_Group = Gender Advertising;*/
/*   format Sales_Group sales.;*/
/*quit;*/

*Create test data set;
libname example 'c:\books\statistics by example';
data example.Blood_Pressure;
   call streaminit(37373);
   do Drug = 'Placebo','Drug A','Drug B';
      do i = 1 to 20;
         Subj + 1;
         if mod(Subj,2) then Gender = 'M';
         else Gender = 'F';
         SBP = rand('normal',130,10) +
               7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
         SBP = round(SBP,2);
         DBP = rand('normal',80,5) +
               3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
         DBP = round(DBP,2);
         if Subj in (5,15,25,55) then call missing(SBP, DBP);
         if Subj in (4,18) then call missing(Gender);
         output;
      end;
   end;
   drop i;
run;

/*title "Listing of the first 25 observations from Blood_Pressure";*/
/*proc print data=example.Blood_Pressure(obs=25) noobs;*/
/*   var Subj Drug SBP DBP;*/
/*run;*/

data exercise;
   call streaminit(7657657);
   do Subj = 1 to 50;
      Age = round(rand('normal',50,15));
      Pushups = abs(int(rand('normal',40,10) - .30*age));
      Rest_Pulse = round(rand('normal',50,8) + .35*age);
      Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
      Run_Pulse = round(max_pulse - rand('normal',3,3));
      output;
   end;
run;

*Data set for a paired t-test example;
data reading;
   input Subj Before After @@;
datalines;
1 100 110  2 120 121  3 130 140  4 90 110  5 85 92
6 133 137  7 210 209  8 155 179
;

/*title "Listing of Data Set READING";*/
/*proc print data=reading noobs;*/
/*run;*/

*Data set that violates assumptions for a t-test;
data salary;
   call streaminit(57575);
   do Subj = 1 to 50;
      do Gender = 'M','F';
         Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
         output;
      end;
   end;
run;
/*proc univariate data=salary;*/
/*   class Gender;*/
/*   id Subj;*/
/*   var Income;*/
/*   histogram Income;*/
/*run;*/

*Data set risk for logistic regression example;
proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
run;

data Risk;
   call streaminit(13579);
   length Age_Group $ 7;
   do i = 1 to 250;
      do Gender = 'F','M';
         Age = round(rand('uniform')*30 + 50);
         if missing(Age) then Age_Group = ' ';
         else if Age lt 60 then Age_Group = '1:< 60';
         else if Age le 70 then Age_Group = '2:60-70';
         else Age_Group = '3:71+';
         Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
         Chol = round(Chol);
         Score = .3*chol + age + 8*(Gender eq 'M');
         Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
         output;
       end;
   end;
   keep Gender Age Age_Group chol Heart_Attack;
   format Heart_Attack yesno.;
run;

/*title "Listing of first 100 observations from RISK";*/
/*proc print data=risk(obs=100);*/
/*run;*/









    Out[52]:









  
  
  




911  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
912  
913  *variables
914  
915  Region
916  Advertising
917  Gender
918  Book_Sales
919  Music_Sales
920  Electronics_Sales
921  Total_Sales
922  ;
923  
924  proc format;
925     value yesno 1 = 'Yes'
NOTE: Format YESNO is already on the library WORK.FORMATS.
NOTE: Format YESNO has been output.
926                 0 = 'No';
NOTE: PROCEDURE FORMAT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
927  data Store;
928     length Region $ 5;
929     call streaminit(57676);
930     do Transaction = 1 to 200;
931        R = ceil(rand('uniform')*10);
932        select(R);
933           when(1) Region = 'East';
934           when(2) Region = 'West';
935           when(3) Region = 'North';
936           when(4) Region = 'South';
937           otherwise;
938        end;
939        Advertising = rand('bernouli',.6);
940        if rand('uniform') lt .6 then Gender = 'Female';
941           else Gender = 'Male';
942        Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
943                      + 30*Advertising,10)) ;
944        Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
945           + 30*(Region = 'East' and Gender = 'Male')
946           - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
947        Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
948         + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South')
949         + 75*(Region = 'West'),10));
950        Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
951     output;
952     end;
953     drop R;
954     format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
955            Advertising yesno.;
956  run;
NOTE: The data set WORK.STORE has 200 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds
      
957  
958  /*title "Listing of Store";*/
959  /*proc print data=store heading=h;*/
960  /*run;*/
961  
962  /*proc univariate data=store;*/
963  /*   var Book_Sales -- Total_Sales;*/
964  /*   histogram;*/
965  /*run;*/
966  /**/
967  /*title "Scatter Matrix for Store Variables";*/
968  /*proc sgscatter data=store;*/
969  /*   matrix Book_Sales -- Total_Sales / group = Gender;*/
970  /*run;*/
971  /**/
972  /*proc sgplot data=store;*/
973  /*   scatter x=Book_Sales y=Total_Sales / group=Gender;*/
974  /*run;*/
975  
976  proc rank data=store out=median_sales groups=2;
977     var Total_Sales;
978     ranks Sales_Group;
979  run;
NOTE: The data set WORK.MEDIAN_SALES has 200 observations and 9 variables.
NOTE: PROCEDURE RANK used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
      
980  
981  proc format;
982     value sales 0 = 'Low'
NOTE: Format SALES is already on the library WORK.FORMATS.
NOTE: Format SALES has been output.
983                 1 = 'High';
984  run;
NOTE: PROCEDURE FORMAT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
985  
986  /*proc logistic data=median_sales order=formatted;*/
987  /*   class Gender(param=ref ref='Male');*/
988  /*   model Sales_Group = Gender;*/
989  /*   format Sales_Group sales.;*/
990  /*quit;*/
991  /**/
992  /*proc logistic data=median_sales order=formatted;*/
993  /*   class Gender(param=ref ref='Male')*/
994  /*         Advertising (param=ref ref='No');*/
995  /*   model Sales_Group = Gender Advertising;*/
996  /*   format Sales_Group sales.;*/
997  /*quit;*/
998  
999  *Create test data set;
1000  libname example 'c:\books\statistics by example';
NOTE: Library EXAMPLE does not exist.
1001  data example.Blood_Pressure;
1002     call streaminit(37373);
1003     do Drug = 'Placebo','Drug A','Drug B';
1004        do i = 1 to 20;
1005           Subj + 1;
1006           if mod(Subj,2) then Gender = 'M';
1007           else Gender = 'F';
1008           SBP = rand('normal',130,10) +
1009                 7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
1010           SBP = round(SBP,2);
1011           DBP = rand('normal',80,5) +
1012                 3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
1013           DBP = round(DBP,2);
1014           if Subj in (5,15,25,55) then call missing(SBP, DBP);
1015           if Subj in (4,18) then call missing(Gender);
1016           output;
1017        end;
1018     end;
1019     drop i;
1020  run;
ERROR: Library EXAMPLE does not exist.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      
1021  
1022  /*title "Listing of the first 25 observations from Blood_Pressure";*/
1023  /*proc print data=example.Blood_Pressure(obs=25) noobs;*/
1024  /*   var Subj Drug SBP DBP;*/
1025  /*run;*/
1026  
1027  data exercise;
1028     call streaminit(7657657);
1029     do Subj = 1 to 50;
1030        Age = round(rand('normal',50,15));
1031        Pushups = abs(int(rand('normal',40,10) - .30*age));
1032        Rest_Pulse = round(rand('normal',50,8) + .35*age);
1033        Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
1034        Run_Pulse = round(max_pulse - rand('normal',3,3));
1035        output;
1036     end;
1037  run;
NOTE: The data set WORK.EXERCISE has 50 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.01 seconds
      
1038  
1039  *Data set for a paired t-test example;
1040  data reading;
1041     input Subj Before After @@;
1042  datalines;
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.READING has 8 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      
1045  ;
1046  
1047  /*title "Listing of Data Set READING";*/
1048  /*proc print data=reading noobs;*/
1049  /*run;*/
1050  
1051  *Data set that violates assumptions for a t-test;
1052  data salary;
1053     call streaminit(57575);
1054     do Subj = 1 to 50;
1055        do Gender = 'M','F';
1056           Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
1057           output;
1058        end;
1059     end;
1060  run;
NOTE: The data set WORK.SALARY has 100 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      
1061  /*proc univariate data=salary;*/
1062  /*   class Gender;*/
1063  /*   id Subj;*/
1064  /*   var Income;*/
1065  /*   histogram Income;*/
1066  /*run;*/
1067  
1068  *Data set risk for logistic regression example;
1069  proc format;
1070     value yesno 1 = 'Yes'
NOTE: Format YESNO is already on the library WORK.FORMATS.
NOTE: Format YESNO has been output.
1071                 0 = 'No';
1072  run;
NOTE: PROCEDURE FORMAT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      
1073  
1074  data Risk;
1075     call streaminit(13579);
1076     length Age_Group $ 7;
1077     do i = 1 to 250;
1078        do Gender = 'F','M';
1079           Age = round(rand('uniform')*30 + 50);
1080           if missing(Age) then Age_Group = ' ';
1081           else if Age lt 60 then Age_Group = '1:< 60';
1082           else if Age le 70 then Age_Group = '2:60-70';
1083           else Age_Group = '3:71+';
1084           Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
1085           Chol = round(Chol);
1086           Score = .3*chol + age + 8*(Gender eq 'M');
1087           Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
1088           output;
1089         end;
1090     end;
1091     keep Gender Age Age_Group chol Heart_Attack;
1092     format Heart_Attack yesno.;
1093  run;
NOTE: The data set WORK.RISK has 500 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      
1094  
1095  /*title "Listing of first 100 observations from RISK";*/
1096  /*proc print data=risk(obs=100);*/
1097  /*run;*/
1098  
1099  
1100  ods html5 close;ods listing;

1101

Obs	Gender	SATScore	IDNumber
1	Male	1170	61469897
2	Female	1090	33081197
3	Male	1240	68137597
4	Female	1000	37070397
5	Male	1210	64608797
6	Female	970	60714297
7	Male	1020	16907997
8	Female	1490	9589297
9	Male	1200	93891897
10	Female	1260	85859397
11	Male	1150	38152597
12	Female	1390	99108497
13	Male	1240	59666697
14	Female	1370	70847197
15	Male	1140	47613397
16	Female	1160	53750297
17	Male	1050	95948597
18	Female	1110	3873197
19	Male	1100	25756097
20	Female	1080	43493297
21	Male	1120	27543197
22	Female	1080	26212897
23	Male	1050	8945097
24	Female	1200	51799397
25	Male	1600	39196697
26	Female	1100	48154497
27	Male	1050	55189597
28	Female	1060	46028397
29	Male	1140	75332897
30	Female	1100	29520797
31	Male	1340	55983497
32	Female	1240	93236497
33	Male	1090	6975697
34	Female	1180	29686297
35	Male	1170	76815697
36	Female	1130	64045497
37	Male	1290	9880297
38	Female	1380	23048597
39	Male	1010	76058697
40	Female	1280	42586897
41	Male	1050	62688897
42	Female	1520	73461797
43	Male	1360	44327297
44	Female	1260	2854197
45	Male	1290	42401697
46	Female	1070	51266697
47	Male	1260	61195297
48	Female	1350	71681397
49	Male	1140	43230697
50	Female	1330	6520097
51	Male	990	61728297
52	Female	1300	64179097
53	Male	1150	20494697
54	Female	1520	40177297
55	Male	1200	43183797
56	Female	1120	47146397
57	Male	1020	60633297
58	Female	1250	72168497
59	Male	1330	98377897
60	Female	1170	20745097
61	Male	1110	91775997
62	Female	1230	82634897
63	Male	1310	73772397
64	Female	1270	20755897
65	Male	1210	57288097
66	Female	1120	55860697
67	Male	1200	65246997
68	Female	1380	81994397
69	Male	890	76526697
70	Female	1590	23573597
71	Male	1110	61160997
72	Female	1150	23359697
73	Male	1280	19108097
74	Female	910	30834797
75	Male	1060	46806897
76	Female	1270	62919297
77	Male	1020	37534197
78	Female	1300	4440297
79	Male	1050	2012997
80	Female	1060	59149297

Statistic	DF	Value	Prob
WARNING: 33% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
Chi-Square	2	2.5461	0.2800
Likelihood Ratio Chi-Square	2	2.6777	0.2622
Mantel-Haenszel Chi-Square	1	2.4303	0.1190
Phi Coefficient		0.1784
Contingency Coefficient		0.1756
Cramer's V		0.1784

Analysis Variable : SATScore
N Obs	N	Mean	Median	Std Dev	Std Error	Variance	Lower Quartile	Upper Quartile	Lower 95% CL for Mean	Upper 95% CL for Mean	t Value
80	80	1190.63	1170.00	147.06	16.44	21626.19	1085.00	1280.00	1157.90	1223.35	72.42

Analysis Variable : SATScore
Gender	N Obs	N	Mean	Median	Std Dev	Std Error	Variance	Lower Quartile	Upper Quartile	Lower 95% CL for Mean	Upper 95% CL for Mean	t Value
Female	40	40	1221.00	1215.00	157.40	24.89	24773.33	1100.00	1315.00	1170.66	1271.34	49.06
Male	40	40	1160.25	1145.00	130.92	20.70	17140.96	1050.00	1240.00	1118.38	1202.12	56.05

Moments
N	80	Sum Weights	80
Mean	1190.625	Sum Observations	95250
Std Deviation	147.058447	Variance	21626.1867
Skewness	0.64202018	Kurtosis	0.42409987
Uncorrected SS	115115500	Corrected SS	1708468.75
Coeff Variation	12.3513656	Std Error Mean	16.4416342

Basic Statistical Measures
Location		Variability
Mean	1190.625	Std Deviation	147.05845
Median	1170.000	Variance	21626
Mode	1050.000	Range	710.00000
		Interquartile Range	195.00000

Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	72.41525	Pr > \|t\|	<.0001
Sign	M	40	Pr >= \|M\|	<.0001
Signed Rank	S	1620	Pr >= \|S\|	<.0001

Quantiles (Definition 5)
Level	Quantile
100% Max	1600
99%	1600
95%	1505
90%	1375
75% Q3	1280
50% Median	1170
25% Q1	1085
10%	1020
5%	995
1%	890
0% Min	890

Extreme Observations
Lowest			Highest
Value	IDNumber	Obs	Value	IDNumber	Obs
890	76526697	69	1490	9589297	8
910	30834797	74	1520	73461797	42
970	60714297	6	1520	40177297	54
990	61728297	51	1590	23573597	70
1000	37070397	4	1600	39196697	25

Quantiles for Normal Distribution
Percent	Quantile
Percent	Observed	Estimated
1.0	890.000	848.516
5.0	995.000	948.735
10.0	1020.000	1002.162
25.0	1085.000	1091.436
50.0	1170.000	1190.625
75.0	1280.000	1289.814
90.0	1375.000	1379.088
95.0	1505.000	1432.515
99.0	1600.000	1532.734

Gender	N	Mean	Std Dev	Std Err	Minimum	Maximum
Female	40	1221.0	157.4	24.8864	910.0	1590.0
Male	40	1160.3	130.9	20.7008	890.0	1600.0
Diff (1-2)		60.7500	144.8	32.3706

Gender	Method	Mean	95% CL Mean		Std Dev	95% CL Std Dev
Female		1221.0	1170.7	1271.3	157.4	128.9	202.1
Male		1160.3	1118.4	1202.1	130.9	107.2	168.1
Diff (1-2)	Pooled	60.7500	-3.6950	125.2	144.8	125.2	171.7
Diff (1-2)	Satterthwaite	60.7500	-3.7286	125.2

Obs	Fertilizer	BulbWt	Cloves	BedID
1	4	0.20901	11.5062	30402
2	3	0.25792	12.2550	23423
3	2	0.21588	12.0982	20696
4	4	0.24754	12.9199	25412
5	1	0.24402	12.5793	10575
6	3	0.20150	10.6891	21466
7	1	0.20891	11.5416	14749
8	4	0.15173	14.0173	25342
9	2	0.24114	9.9072	20383
10	3	0.23350	11.2130	23306

Analysis Variable : BulbWt
Fertilizer	N Obs	N	Mean	Std Dev	Minimum	Maximum
1	9	9	0.225	0.025	0.188	0.254
2	8	8	0.209	0.026	0.159	0.241
3	11	11	0.230	0.026	0.189	0.278
4	4	4	0.196	0.041	0.152	0.248

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	0.00457996	0.00152665	1.96	0.1432
Error	28	0.02183054	0.00077966
Corrected Total	31	0.02641050

Levene's Test for Homogeneity of BulbWt Variance ANOVA of Squared Deviations from Group Means
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Fertilizer	3	1.716E-6	5.719E-7	0.98	0.4173
Error	28	0.000016	5.849E-7

Level of Fertilizer	N	BulbWt
Level of Fertilizer	N	Mean	Std Dev
1	9	0.22540667	0.02452242
2	8	0.20856500	0.02641979
3	11	0.22982091	0.02644365
4	4	0.19635250	0.04139664

Obs	Sector	Position	Fertilizer	BulbWt	Cloves	BedId
1	1	1	3	0.25881	11.6322	22961
2	1	2	4	0.20746	12.5837	23884
3	1	3	1	0.27453	12.0597	19642
4	1	4	2	0.24467	12.1001	20384
5	2	1	3	0.21454	11.5863	20303
6	2	2	4	0.16953	12.7132	21004
7	2	3	1	0.22504	12.0470	16117
8	2	4	2	0.16809	11.9071	19686
9	3	1	4	0.21720	12.3655	26527
10	3	2	3	0.22551	11.6864	23574

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	10	0.02307263	0.00230726	5.86	0.0003
Error	21	0.00826745	0.00039369
Corrected Total	31	0.03134008

Source	DF	Type I SS	Mean Square	F Value	Pr > F
Fertilizer	3	0.00508630	0.00169543	4.31	0.0162
Sector	7	0.01798632	0.00256947	6.53	0.0004

Class Level Information
Class	Levels	Values
Ad	4	display paper people radio
Area	18	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of Observations Read	144
Number of Observations Used	144

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	20	15131.38889	756.56944	8.43	<.0001
Error	123	11037.91667	89.73916
Corrected Total	143	26169.30556

Source	DF	Type I SS	Mean Square	F Value	Pr > F
Ad	3	5866.083333	1955.361111	21.79	<.0001
Area	17	9265.305556	545.017974	6.07	<.0001

Least Squares Means for effect Fertilizer Pr > \|t\| for H0: LSMean(i)=LSMean(j) Dependent Variable: BulbWt
i/j	1	2	3	4
1		0.0840	0.5699	0.0144
2	0.0840		0.6186	0.8383
3	0.5699	0.6186		0.1995
4	0.0144	0.8383	0.1995

Fertilizer	BulbWt LSMEAN	H0:LSMean=Control
Fertilizer	BulbWt LSMEAN	Pr > t
1	0.23625000	0.0040
2	0.21115125	0.3978
3	0.22330125	0.0638
4	0.20288875

Least Squares Means for effect Fertilizer Pr > \|t\| for H0: LSMean(i)=LSMean(j) Dependent Variable: BulbWt
i/j	1	2	3	4
1		0.0195	0.2059	0.0029
2	0.0195		0.2342	0.4143
3	0.2059	0.2342		0.0523
4	0.0029	0.4143	0.0523

Analysis Variable : BloodP
DrugDose	N Obs	Mean	Variance	Std Dev
Placebo	41	-2.4390244	303.6024390	17.4241912
100 mg	44	-3.1136364	483.9170190	21.9981140
200mg	41	-1.6097561	790.6939024	28.1192799
500mg	44	-1.9772727	935.0924947	30.5792821

Analysis Variable : BloodP
Disease	N Obs	Mean	Variance	Std Dev
A	59	-15.0169492	434.4997078	20.8446566
B	57	10.6666667	629.9404762	25.0986150
C	54	-2.0740741	476.1830887	21.8216198

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	11	36476.8353	3316.0759	7.66	<.0001
Error	158	68366.4589	432.6991
Corrected Total	169	104843.2941