Statistics Using SAS


In [27]:
%let path=/folders/myfolders/ECST131;
libname statdata "&path";

/*Go to the bottom to input the dataset*/


Out[27]:

557  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
558
559 %let path=/folders/myfolders/ECST131;
560 libname statdata "&path";
NOTE: Libref STATDATA was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders/ECST131
561
562 /*Go to the bottom to input the dataset*/
563 ods html5 close;ods listing;

564

Basic Statistical Concepts

A population is the complete set of observations or the entire group of objects that you are researching. A sample is a subset of the population. The sample should be representative of the population. You can obtain a representative sample by collecting a simple random sample.

Parameters are numerical values that summarize characteristics of a population. Parameter values are typically unknown and are represented by Greek letters. Statistics summarize characteristics of a sample. You use letters from the English alphabet to represent sample statistics. You can measure characteristics of your sample and provide numerical values that summarize those characteristics.

Variables are classified according to their characteristics. They can be quantitative or categorical. Data that consists of counts or measurements is quantitative. Quantitative data can be further distinguished by two types: discrete and continuous. Discrete data takes on only a finite, or countable, number of values. Continuous data has an infinite number of values and no breaks or jumps.

Categorical or attribute data consists of variables that denote groupings or labels. There are two main types: nominal and ordinal. A nominal categorical variable exhibits no ordering within its groups or categories. With ordinal categorical variables, the observed levels of the variable can be ordered in a meaningful way that implies differences due to magnitude.

There are two scales of measurement for categorical variables: nominal and ordinal. There are two scales of measurement for continuous variables: interval and ratio. Data from an interval scale can be rank-ordered and has a sensible spacing of observations such that differences between measurements are meaningful. However, interval scales lack the ability to calculate ratios between numbers on the scale because there is no true zero point. Data on a ratio scale includes a true zero point and can therefore accurately indicate the ratio of difference between two spaces on the measurement scale.

The appropriate statistical method for your data also depends on the number of variables involved. Univariate analysis provides techniques for analyzing and describing a single variable at a time. Bivariate analysis describes and explains the relationship between two variables and how they change, or covary, together. Multivariate analysis examines two or more variables at the same time, in order to understand the relationships among them.

PROC MEANS DATA=SAS-data-set <options>;       
CLASS variables;
VAR variables;
RUN;
Option Description
N Number of nonmissing observations
NMISS Number of observations with missing values
MEAN Arithmetic mean
STD Standard deviation
STDERR Standard error
MIN Minimum value
MAX Maximum value
MEDIAN Median
MAXDEC= Maximum number of decimal places to display
CLM 95% confidence limit on the mean
CV Coefficient of variation
PROC UNIVARIATE DATA=SAS-data-set <options>;      
    VAR variables;
    ID variables;
    HISTOGRAM variables </options>;
    PROBPPLOT variables </options>;
    INSET keywords </options>;
RUN;

PROC SGPLOT DATA=SAS-data-set<options>;
    DOT category-variable </option(s)>;
    HBAR category-variable </option(s)>;
    VBAR category-variable </option(s)>;
    HBOX response-variable </option(s)>;
    VBOX response-variable </option(s)>;
    HISTOGRAM response-variable </option(s)>;
    SCATTER X=variable Y=variable </option(s)>;
    NEEDLE X=variable Y=numeric-variable </option(s)>;
    REG X=numeric-variable Y=numeric-variable </option(s)>;
    REFLINE variable | value-1 <... value-n> </option(s)>;
RUN;

ODS GRAPHICS ON <options>;
    statistical procedure code
ODS GRAPHICS OFF;

Sample Programs


In [28]:
proc print data=statdata.testscores;
run;


Out[28]:
SAS Output

Model with Heating Quality and Season as Interacting Predictors

Obs Gender SATScore IDNumber
1 Male 1170 61469897
2 Female 1090 33081197
3 Male 1240 68137597
4 Female 1000 37070397
5 Male 1210 64608797
6 Female 970 60714297
7 Male 1020 16907997
8 Female 1490 9589297
9 Male 1200 93891897
10 Female 1260 85859397
11 Male 1150 38152597
12 Female 1390 99108497
13 Male 1240 59666697
14 Female 1370 70847197
15 Male 1140 47613397
16 Female 1160 53750297
17 Male 1050 95948597
18 Female 1110 3873197
19 Male 1100 25756097
20 Female 1080 43493297
21 Male 1120 27543197
22 Female 1080 26212897
23 Male 1050 8945097
24 Female 1200 51799397
25 Male 1600 39196697
26 Female 1100 48154497
27 Male 1050 55189597
28 Female 1060 46028397
29 Male 1140 75332897
30 Female 1100 29520797
31 Male 1340 55983497
32 Female 1240 93236497
33 Male 1090 6975697
34 Female 1180 29686297
35 Male 1170 76815697
36 Female 1130 64045497
37 Male 1290 9880297
38 Female 1380 23048597
39 Male 1010 76058697
40 Female 1280 42586897
41 Male 1050 62688897
42 Female 1520 73461797
43 Male 1360 44327297
44 Female 1260 2854197
45 Male 1290 42401697
46 Female 1070 51266697
47 Male 1260 61195297
48 Female 1350 71681397
49 Male 1140 43230697
50 Female 1330 6520097
51 Male 990 61728297
52 Female 1300 64179097
53 Male 1150 20494697
54 Female 1520 40177297
55 Male 1200 43183797
56 Female 1120 47146397
57 Male 1020 60633297
58 Female 1250 72168497
59 Male 1330 98377897
60 Female 1170 20745097
61 Male 1110 91775997
62 Female 1230 82634897
63 Male 1310 73772397
64 Female 1270 20755897
65 Male 1210 57288097
66 Female 1120 55860697
67 Male 1200 65246997
68 Female 1380 81994397
69 Male 890 76526697
70 Female 1590 23573597
71 Male 1110 61160997
72 Female 1150 23359697
73 Male 1280 19108097
74 Female 910 30834797
75 Male 1060 46806897
76 Female 1270 62919297
77 Male 1020 37534197
78 Female 1300 4440297
79 Male 1050 2012997
80 Female 1060 59149297

Using Proc Freq

Plot Name Table type Require option on TABLES statement
AGREEPLOT two-way AGREE
CUMFREQPLOT one-way
DEVIATIONPLOT one-way CHISQ
FREQPLOT any request
KAPPAPLOT three-way AGREE
ODDSRATIOPLOT hx2x2 MEASURES or RELRISK
RELREISKPLOT hx2x2 MEASURES or RELRISK
RISKDIFFPLOT hx2x2 RISKDIFF
WTKAPPAPLOT hxrxr (r>2) AGREE

In [29]:
proc format;
value $gender 'M' = 'Male'
'F' = 'Female';
value score low-<1100 = 'Low'
            1100-<1400 = 'Middle'
            1400-High = 'High';
run;

proc freq data=statdata.testscores ORDER=FORMATTED;
    tables gender*SATScore / noprecent norow nocol CHISQ 
           plots = freqplot (twoway = stacked);
           *plots = freqplot (twoway = grouphorizontal);
    format Gender $gender. SATScore score.;
run;


Out[29]:
SAS Output

Model with Heating Quality and Season as Interacting Predictors

The FREQ Procedure

Frequency
Table of Gender by SATScore
Gender SATScore
High Low Middle Total
Female
4
9
27
40
Male
1
13
26
40
Total
5
22
53
80

Statistics for Table of Gender by SATScore

Statistic DF Value Prob
WARNING: 33% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
Chi-Square 2 2.5461 0.2800
Likelihood Ratio Chi-Square 2 2.6777 0.2622
Mantel-Haenszel Chi-Square 1 2.4303 0.1190
Phi Coefficient   0.1784  
Contingency Coefficient   0.1756  
Cramer's V   0.1784  

Sample Size = 80

Using PROC MEANS


In [30]:
proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
           n mean median std sterr var q1 q3 clm t;
   class Gender;
   var SATScore;
   title 'Selected Descriptive Statistics for SAT Scores';
run;
title;


Out[30]:
SAS Output

Selected Descriptive Statistics for SAT Scores

The MEANS Procedure

Analysis Variable : SATScore
N Obs N Mean Median Std Dev Std Error Variance Lower Quartile Upper Quartile Lower 95%
CL for Mean
Upper 95%
CL for Mean
t Value
80 80 1190.63 1170.00 147.06 16.44 21626.19 1085.00 1280.00 1157.90 1223.35 72.42
Analysis Variable : SATScore
Gender N Obs N Mean Median Std Dev Std Error Variance Lower Quartile Upper Quartile Lower 95%
CL for Mean
Upper 95%
CL for Mean
t Value
Female 40 40 1221.00 1215.00 157.40 24.89 24773.33 1100.00 1315.00 1170.66 1271.34 49.06
Male 40 40 1160.25 1145.00 130.92 20.70 17140.96 1050.00 1240.00 1118.38 1202.12 56.05

Using proc sgplot, proc univariate


In [31]:
proc univariate data=statdata.testscores ALPHA=0.05;
   var SATScore;
   id idnumber;
   histogram SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis;
   cdfplot SATScore / normal(mu=est sigma=est);
     inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;


Out[31]:
SAS Output

Descriptive Statistics Using PROC UNIVARIATE

The UNIVARIATE Procedure

Variable: SATScore

Moments
N 80 Sum Weights 80
Mean 1190.625 Sum Observations 95250
Std Deviation 147.058447 Variance 21626.1867
Skewness 0.64202018 Kurtosis 0.42409987
Uncorrected SS 115115500 Corrected SS 1708468.75
Coeff Variation 12.3513656 Std Error Mean 16.4416342
Basic Statistical Measures
Location Variability
Mean 1190.625 Std Deviation 147.05845
Median 1170.000 Variance 21626
Mode 1050.000 Range 710.00000
    Interquartile Range 195.00000
Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 72.41525 Pr > |t| <.0001
Sign M 40 Pr >= |M| <.0001
Signed Rank S 1620 Pr >= |S| <.0001
Quantiles (Definition 5)
Level Quantile
100% Max 1600
99% 1600
95% 1505
90% 1375
75% Q3 1280
50% Median 1170
25% Q1 1085
10% 1020
5% 995
1% 890
0% Min 890
Extreme Observations
Lowest Highest
Value IDNumber Obs Value IDNumber Obs
890 76526697 69 1490 9589297 8
910 30834797 74 1520 73461797 42
970 60714297 6 1520 40177297 54
990 61728297 51 1590 23573597 70
1000 37070397 4 1600 39196697 25

Descriptive Statistics Using PROC UNIVARIATE

The UNIVARIATE Procedure


Descriptive Statistics Using PROC UNIVARIATE

The UNIVARIATE Procedure

Fitted Normal Distribution for SATScore

Parameters for Normal Distribution
Parameter Symbol Estimate
Mean Mu 1190.625
Std Dev Sigma 147.0584
Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.08382224 Pr > D >0.150
Cramer-von Mises W-Sq 0.09964577 Pr > W-Sq 0.114
Anderson-Darling A-Sq 0.70124822 Pr > A-Sq 0.068
Quantiles for Normal Distribution
Percent Quantile
Observed Estimated
1.0 890.000 848.516
5.0 995.000 948.735
10.0 1020.000 1002.162
25.0 1085.000 1091.436
50.0 1170.000 1190.625
75.0 1280.000 1289.814
90.0 1375.000 1379.088
95.0 1505.000 1432.515
99.0 1600.000 1532.734

Descriptive Statistics Using PROC UNIVARIATE

The UNIVARIATE Procedure


Descriptive Statistics Using PROC UNIVARIATE

The UNIVARIATE Procedure


In [32]:
proc sgplot data = statdata.testscores ;
    *dot SATScore;
    *vbar SATscore;
    *scatter y=SATScore x=Gender;
    refline 1200 / axis=x lineattrs =(color=blue);
    /*hbox SATScore / group=Gender datalabel = IDNumber*/
    hbox SATScore / category=Gender datalabel = IDNumber missing;
    format IDnUMBER 8.;
run;


Out[32]:
SAS Output

In [33]:
title "Using PROC SGPLOT to Produce a Scatter Plot";
proc sgplot data=store;
scatter x=Book_Sales y=Music_Sales;
run;
quit;


Out[33]:
SAS Output

In [34]:
title "Using PROC SGPLOT to Produce a Scatter Plot";
title2 "Adding Gender Information to the Plot";
proc sgplot data=store;
scatter x=Book_Sales y=Music_Sales / group=Gender;
run;
quit;


Out[34]:
SAS Output

In [35]:
title "Demonstrating the PLOT Statement of PROC SGSCATTER";
proc sgscatter data=store;
plot Book_Sales * Music_Sales Total_Sales * Electronics_Sales;
run;


Out[35]:
SAS Output

In [36]:
title "Switching Axes and Adding a GROUP= Option";
proc sgscatter data=store;
compare Y=Total_Sales
X=(Book_Sales Music_Sales Electronics_Sales) / group=Gender;
run;


Out[36]:
SAS Output

The left and right sides of the box represent the 1st and 3rd quartiles (sometimes abbreviated Q1 and Q3). The vertical bar inside the box is the median, and the diamond represents the mean. The lines extending from the left and right side of the box (called whiskers) represent data values that are less than 1.5 times the interquartile range from Q1 and Q3.

The option DATALABEL= lets you select a variable to identify specific outliers. If you use the DATALABEL option without naming a label variable, SGPLOT uses the numerical value of the response variable (SBP in this example) to label the outliers.

T test

PROC TTEST DATA=SAS-data-set <options eg sides = U>;
    CLASS variables;
    VAR variables;
    PARIED variable1*variable2;
RUN;

One-sample t-test


In [37]:
title "Conducting a One-Sample t-test Using PROC TTEST";
proc ttest data=exercise h0=50 sides=2 alpha=.05;
var Age;
run;


Out[37]:
SAS Output

Conducting a One-Sample t-test Using PROC TTEST

The TTEST Procedure

Variable: Age

N Mean Std Dev Std Err Minimum Maximum
50 51.3600 16.1419 2.2828 19.0000 84.0000
Mean 95% CL Mean Std Dev 95% CL Std Dev
51.3600 46.7725 55.9475 16.1419 13.4839 20.1150
DF t Value Pr > |t|
49 0.60 0.5541

In [38]:
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Central_Air
                    connect=mean;
   title "Sale Price Differences across Central Air";
run;
 
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Fireplaces
                    connect=mean;
   title "Sale Price Differences across Fireplaces";
run;
 
proc sgplot data=statdata.ameshousing3;
   vbox SalePrice / category=Heating_QC
                    connect=mean;
   title "Sale Price Differences across Heating Quality";
run;


Out[38]:
SAS Output


Two-sample t-test

The two-sample t-test is a hypothesis test for answering questions about the means of two populations. This test enables you to examine the differences between populations for one or more continuous variables. You can assess whether the means of the two populations are statistically different from each other. The null hypothesis for the two-sample t-test is that the means for the two groups are equal, or that $μ1 - μ2$ equals 0.

When you compare the means of two populations using a two-sample t-test, you make three assumptions:

  • The data contains independent observations,
  • The distributions of the two populations are normal, and
  • The variances in these normal distributions are equal.

To evaluate the assumption of equal variances in the two populations, you can use the F-test for equality of variances. To test the hypothesis, you calculate the F statistic, which is the ratio of the maximum sample variance of the two groups to the minimum sample variance of the two groups. By construction, the F statistic is always greater than or equal to 1.


In [39]:
ods graphics on;
proc ttest data=statdata.testscores 
           /*plots(shownull)=interval;*/
           plots(unpack shownull) = all;
           /*SIDES=U;*/
           /*SIDES=L;*/
   class Gender;
   var SATScore;
   title "Two-Sample t=Test Comparing Girls to Boys";
run;
ods graphics off;


Out[39]:
SAS Output

Two-Sample t=Test Comparing Girls to Boys

The TTEST Procedure

Variable: SATScore

Gender N Mean Std Dev Std Err Minimum Maximum
Female 40 1221.0 157.4 24.8864 910.0 1590.0
Male 40 1160.3 130.9 20.7008 890.0 1600.0
Diff (1-2)   60.7500 144.8 32.3706    
Gender Method Mean 95% CL Mean Std Dev 95% CL Std Dev
Female   1221.0 1170.7 1271.3 157.4 128.9 202.1
Male   1160.3 1118.4 1202.1 130.9 107.2 168.1
Diff (1-2) Pooled 60.7500 -3.6950 125.2 144.8 125.2 171.7
Diff (1-2) Satterthwaite 60.7500 -3.7286 125.2      
Method Variances DF t Value Pr > |t|
Pooled Equal 78 1.88 0.0643
Satterthwaite Unequal 75.497 1.88 0.0644
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 39 39 1.45 0.2545

In general, Pool variance is

$$s_{p}^{2}={\frac {\sum _{{i=1}}^{k}(n_{i}-1)s_{i}^{2}}{\sum _{{i=1}}^{k}(n_{i}-1)}}$$

Welch's t-test, or unequal variances t-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. Welch's t-test is an adaptation of Student's t-test,[1] that is more reliable when the two samples have unequal variances and unequal sample sizes.[2] These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.

$$t\quad =\quad {\;\overline {X}_{1}-\overline {X}_{2}\; \over {\sqrt {\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\quad }}}\,$$

where ${\displaystyle {\overline {X}}_{1}}$, $ s_{1}^{2}$ and $ N_{1}$ are the 1st sample mean, population standard deviation and sample size, respectively. Unlike in Student's t-test, the denominator is not based on a pooled variance estimate.

The degrees of freedom ${\displaystyle \nu }$ associated with this variance estimate is approximated using the Welch–Satterthwaite equation:

$${\displaystyle \nu \quad \approx \quad {{\left(\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\;\right)^{2}} \over {\quad {s_{1}^{4} \over N_{1}^{2}\nu _{1}}\;+\;{s_{2}^{4} \over N_{2}^{2}\nu _{2}}\quad }}} \nu \quad \approx \quad {{\left(\;{s_{1}^{2} \over N_{1}}\;+\;{s_{2}^{2} \over N_{2}}\;\right)^{2}} \over {\quad {s_{1}^{4} \over N_{1}^{2}\nu _{1}}\;+\;{s_{2}^{4} \over N_{2}^{2}\nu _{2}}\quad }}$$

Here ${\displaystyle \nu _{1}=N_{1}-1} {\displaystyle \nu _{1}=N_{1}-1}$, the degrees of freedom associated with the first variance estimate. ${\displaystyle \nu _{2}=N_{2}-1} {\displaystyle \nu _{2}=N_{2}-1}$, the degrees of freedom associated with the 2nd variance estimate.

In the above example,

$t=\frac{{\bar{X}}_{1}-{\bar{X}}_{2}}{s_{p}{\sqrt{2/n}}}$

$s_{p}={\sqrt{\frac {s_{X_{1}}^{2}+s_{X_{2}}^{2}}{2}}}$

$\sqrt{\frac{((157.4^{2}+130.9^{2})}{2}} = 144.76$

Aso,

${\text{SD}}_{\bar {x}}\ ={\frac {\sigma }{\sqrt {n}}}$

For female, ${\text{SD}}_{{Female}}\ = \frac{157.4}{\sqrt{40}}$

The SHOWNULL option displays a vertical reference line at the null hypothesis value (zero by default) on the plot of mean differences. That plot shows the value of the difference between the sample means and the confidence interval (95% by default) around the value. If the confidence interval includes the null hypothesis value, the implication is that the difference is not statistically significant at your chosen alpha level (ALPHA=0.05 by default).

The decision to choose the pooled or Satterthwaite values is somewhat controversial. Some statisticians recommend that you look at the p-value from the test for equality of variances to decide which t- and p-value to use. Others think that this is somewhat circular reasoning and that you should decide ahead of time which assumption is most reasonable.


In [40]:
title "Demonstrating a Paired T-test";
proc ttest data=reading; /*By default, it prints (summaryplot agreementplot profilesplot);*/
paired After*Before;
run;

/*You see that the reading program did indeed increase reading speeds (by an average of
9.375 words per minute), with a p-value of .0194.*/


Out[40]:
SAS Output

Demonstrating a Paired T-test

The TTEST Procedure

Difference: After - Before

N Mean Std Dev Std Err Minimum Maximum
8 9.3750 8.7821 3.1049 -1.0000 24.0000
Mean 95% CL Mean Std Dev 95% CL Std Dev
9.3750 2.0330 16.7170 8.7821 5.8065 17.8739
DF t Value Pr > |t|
7 3.02 0.0194

ANOVA

ANOVA is a type of regression where independent variables are nominal variables. Nominal variable is one that have two or more levels, but there is no intrinsic ordering for the levels. ANOVA stands for Analysis of Variance. It is used to compare more than two means. As the name suggest, it estimate an variance and based on the variance, it allow us to make a conclusion about the comparison of means. It is true that we can also use t-test to compare more than two means. But, t-test will increase the type-I-error when t-test do multiple comparison on the same data. Depends on the number of independent variable, we can classify ANOVA in to different types.

PROC GLM DATA=SAS-data-set<options>;
    CLASS variable(s);
    MODEL dependents=independents </options>;
    MEANS effects </options>;
    LSMEANS effects </options>;
    STORE <OUT=>item-store-name </ LABEL='label'>;
RUN;
QUIT;

For one-way ANOVA, one can use:

PROC ANOVA;
CLASS variable-list;
MODEL dependent = effects;

ANOVA cautions

Balanced experiments (those with an equal sample size for each treatment) are relatively easy to interpret; Unbalanced experiments offer more complexity. For single factor (one way) ANOVA, the adjustment for unbalanced data is easy, but the unbalanced analysis lacks both robustness and power. For more complex designs the lack of balance leads to further complications. "The orthogonality property of main effects and interactions present in balanced data does not carry over to the unbalanced case. This means that the usual analysis of variance techniques do not apply. Consequently, the analysis of unbalanced factorials is much more difficult than that for balanced designs." In the general case, "The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered."[43] The simplest techniques for handling unbalanced data restore balance by either throwing out data or by synthesizing missing data. More complex techniques use regression.

ANOVA can be thought of as linear regression on dummy variables.

It is only in the interpretation of the model that a distinction is made.

You calculate the variability between the means and the variability of observations within each group, and then calculate a ratio between these two measurements. If the between-group variability is significantly larger than the within-group variability, you reject the null that all of the group means are equal. So, you partition out the variability using sums of squares. For ANOVA, you calculate three types of sums of squares:

  • Between Group Variation (model sums of squares or SSM,
  • Within Group Variation, and
  • Total Variation
  • The first assumption is one of independent observations
  • The second assumption is that the error terms are normally distributed. You verify this assumption by examining diagnostic plots of the residuals.
  • The third assumption is that the error terms have equal variances across treatments.

In [41]:
proc print data=statdata.mggarlic (obs=10);
   title "Partial Listing of Garlic Data";
run;

proc means data=statdata.mggarlic printalltypes maxdec=3;
    var BulbWt;
    class Fertilizer;
    title "Descriptive Statistics of Garlic Weight";
run;

ods graphics on / width=700;
proc sgplot data=statdata.mggarlic;
    vbox BulbWt / category=Fertilizer connect= mean datalabel=BedID;
    format BedID 5.;
    title "Box Plots of Garlic Weight";
run;
title;


Out[41]:
SAS Output

Partial Listing of Garlic Data

Obs Fertilizer BulbWt Cloves BedID
1 4 0.20901 11.5062 30402
2 3 0.25792 12.2550 23423
3 2 0.21588 12.0982 20696
4 4 0.24754 12.9199 25412
5 1 0.24402 12.5793 10575
6 3 0.20150 10.6891 21466
7 1 0.20891 11.5416 14749
8 4 0.15173 14.0173 25342
9 2 0.24114 9.9072 20383
10 3 0.23350 11.2130 23306

Descriptive Statistics of Garlic Weight

The MEANS Procedure

Analysis Variable : BulbWt
N Obs N Mean Std Dev Minimum Maximum
32 32 0.219 0.029 0.152 0.278
Analysis Variable : BulbWt
Fertilizer N Obs N Mean Std Dev Minimum Maximum
1 9 9 0.225 0.025 0.188 0.254
2 8 8 0.209 0.026 0.159 0.241
3 11 11 0.230 0.026 0.189 0.278
4 4 4 0.196 0.041 0.152 0.248


In [42]:
ods graphics on / width=700;

proc glm data=statdata.mggarlic plots(only)=diagnostics ;
   class Fertilizer;
   model BulbWt=Fertilizer;  /****/
   means Fertilizer / hovtest;
   title "Testing for Equality of Means with PROC GLM";
run;
quit;
title;


Out[42]:
SAS Output

Testing for Equality of Means with PROC GLM

The GLM Procedure

Class Level Information
Class Levels Values
Fertilizer 4 1 2 3 4
Number of Observations Read 32
Number of Observations Used 32

Testing for Equality of Means with PROC GLM

The GLM Procedure

Dependent Variable: BulbWt

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 0.00457996 0.00152665 1.96 0.1432
Error 28 0.02183054 0.00077966    
Corrected Total 31 0.02641050      
R-Square Coeff Var Root MSE BulbWt Mean
0.173414 12.74520 0.027922 0.219082
Source DF Type I SS Mean Square F Value Pr > F
Fertilizer 3 0.00457996 0.00152665 1.96 0.1432
Source DF Type III SS Mean Square F Value Pr > F
Fertilizer 3 0.00457996 0.00152665 1.96 0.1432

Testing for Equality of Means with PROC GLM

The GLM Procedure

Levene's Test for Homogeneity of BulbWt Variance
ANOVA of Squared Deviations from Group Means
Source DF Sum of Squares Mean Square F Value Pr > F
Fertilizer 3 1.716E-6 5.719E-7 0.98 0.4173
Error 28 0.000016 5.849E-7    

Testing for Equality of Means with PROC GLM

The GLM Procedure

Level of
Fertilizer
N BulbWt
Mean Std Dev
1 9 0.22540667 0.02452242
2 8 0.20856500 0.02641979
3 11 0.22982091 0.02644365
4 4 0.19635250 0.04139664

In [43]:
proc anova data=statdata.mggarlic;
    class Fertilizer;
    model BulbWt=Fertilizer;
run;


Out[43]:
SAS Output

The ANOVA Procedure

Class Level Information
Class Levels Values
Fertilizer 4 1 2 3 4
Number of Observations Read 32
Number of Observations Used 32

The ANOVA Procedure

Dependent Variable: BulbWt

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 0.00457996 0.00152665 1.96 0.1432
Error 28 0.02183054 0.00077966    
Corrected Total 31 0.02641050      
R-Square Coeff Var Root MSE BulbWt Mean
0.173414 12.74520 0.027922 0.219082
Source DF Anova SS Mean Square F Value Pr > F
Fertilizer 3 0.00457996 0.00152665 1.96 0.1432
Interpretation of results

You want to see a random scatter of residuals above and below 0 for the four fertilizer groups. The plot looks good.

You use Levene's Test for Homogeneity to formally test the equal variance assumption. Because the p-value of 0.4173 is greater than 0.05, you fail to reject the null and conclude that the variances are equal. This is good. You verified the equal variance assumption.

Because the residuals follow the diagonal reference line fairly closely, you can say that they are approximately normal.

The mean square error is $0.00078$, which is an estimate of the population variance. SAS calculates MSE by dividing the error sum of squares by the error Degrees of Freedom ($0.00457996/3$), which gives you the average sum of squares for the error. SAS calculates the F-statistic by dividing the MSM by the MSE ($0.00078/0.00152665$). The F statistic is 1.96. Because the corresponding p-value of $0.1432$ is greater than 0.05, you can conclude that there is not a statistically significant difference between the mean bulb weights for the four fertilizers.

It's important for you to realize that the one-way ANOVA is an omnibus test statistic and cannot tell you which specific groups are significantly different from each other, only that at least two groups are different. To determine which specific groups differ from each other, you need to use a post-hoc test.

All in all, the PROC GLM output supports your conclusion that there's not a statistically significant difference between the mean bulb weights for the four fertilizers. (F-test);

Levene's test

The HOVTEST=BARTLETT option specifies Bartlett’s test (Bartlett; 1937), a modification of the normal-theory likelihood ratio test.

The HOVTEST=BF option specifies Brown and Forsythe’s variation of Levene’s test (Brown and Forsythe; 1974).

The HOVTEST=LEVENE option specifies Levene’s test (Levene; 1960), which is widely considered to be the standard homogeneity of variance test.

In statistics, Levene's test is an inferential statistic used to assess the equality of variances for a variable calculated for two or more groups. Some common statistical procedures assume that variances of the populations from which different samples are drawn are equal. Levene's test assesses this assumption. It tests the null hypothesis that the population variances are equal (called homogeneity of variance or homoscedasticity). If the resulting p-value of Levene's test is less than some significance level (typically 0.05), the obtained differences in sample variances are unlikely to have occurred based on random sampling from a population with equal variances. Thus, the null hypothesis of equal variances is rejected and it is concluded that there is a difference between the variances in the population.

The test statistic, $W$, is defined as follows:

$${\displaystyle W={\frac {(N-k)}{(k-1)}}{\frac {\sum _{i=1}^{k}N_{i}(Z_{i\cdot }-Z_{\cdot \cdot })^{2}}{\sum _{i=1}^{k}\sum _{j=1}^{N_{i}}(Z_{ij}-Z_{i\cdot })^{2}}},}$$

where

$k$ is the number of different groups to which the sampled cases belong,

$N_{i}$ is the number of cases in the $ith$ group,

$N$ is the total number of cases in all groups,

$Y_{{ij}}$ is the value of the measured variable for the $jth$ case from the $ith$ group,

${\displaystyle Z_{ij}={\begin{cases}|Y_{ij}-{\bar {Y}}_{i\cdot }|,&{\bar {Y}}_{i\cdot }{\text{ is a mean of the }}i{\text{-th group}},\\|Y_{ij}-{\tilde {Y}}_{i\cdot }|,&{\tilde {Y}}_{i\cdot }{\text{ is a median of the }}i{\text{-th group}}.\end{cases}}}$

ANOVA with Data from a Randomized Block Design

Along with the three original ANOVA assumptions of independent observations, normally distributed errors, and equal variances across treatments, you make two more assumptions when you include a blocking factor in the model. First, you assume that the treatments are randomly assigned within each block. In the T-cell count example, this means that you assume the three medications are randomly assigned to each of the three age groups. Next, you assume that the effects of the treatment factor are constant across the levels of the blocking factor, meaning that the effects of the treatment factor don't depend on the block they are in. When the effects of the treatment factor are not constant across the levels of another variable, it's called interaction. But when you use a randomized block design, you assume that the effects are the same within each block. In other words, you assume that there are no interactions with the blocking variable

The farmers divide the farm into eight sectors, each of which has four beds, and in each of the four beds, they randomly assign each of the four fertilizers. An experimental design like this is often referred to as a randomized block design. As you can see in this ANOVA model, Sector is the blocking variable.


In [44]:
proc print data=statdata.mggarlic_block (obs=10);
run;

proc glm data=statdata.mggarlic_block plots(only)=diagnostics;
     class Fertilizer Sector;
     model BulbWt=Fertilizer Sector;
     title "ANOVA for Randomized Block Design";
run;
quit;
title;


Out[44]:
SAS Output
Obs Sector Position Fertilizer BulbWt Cloves BedId
1 1 1 3 0.25881 11.6322 22961
2 1 2 4 0.20746 12.5837 23884
3 1 3 1 0.27453 12.0597 19642
4 1 4 2 0.24467 12.1001 20384
5 2 1 3 0.21454 11.5863 20303
6 2 2 4 0.16953 12.7132 21004
7 2 3 1 0.22504 12.0470 16117
8 2 4 2 0.16809 11.9071 19686
9 3 1 4 0.21720 12.3655 26527
10 3 2 3 0.22551 11.6864 23574

ANOVA for Randomized Block Design

The GLM Procedure

Class Level Information
Class Levels Values
Fertilizer 4 1 2 3 4
Sector 8 1 2 3 4 5 6 7 8
Number of Observations Read 32
Number of Observations Used 32

ANOVA for Randomized Block Design

The GLM Procedure

Dependent Variable: BulbWt

Source DF Sum of Squares Mean Square F Value Pr > F
Model 10 0.02307263 0.00230726 5.86 0.0003
Error 21 0.00826745 0.00039369    
Corrected Total 31 0.03134008      
R-Square Coeff Var Root MSE BulbWt Mean
0.736202 9.085064 0.019842 0.218398
Source DF Type I SS Mean Square F Value Pr > F
Fertilizer 3 0.00508630 0.00169543 4.31 0.0162
Sector 7 0.01798632 0.00256947 6.53 0.0004
Source DF Type III SS Mean Square F Value Pr > F
Fertilizer 3 0.00508630 0.00169543 4.31 0.0162
Sector 7 0.01798632 0.00256947 6.53 0.0004

In [45]:
proc glm data=statdata.ads1 plots(only)=diagnostics;
   class Ad Area;
   model Sales=Ad Area;
   title 'ANOVA for Randomized Block Design';
run;
quit;
title;


Out[45]:
SAS Output

ANOVA for Randomized Block Design

The GLM Procedure

Class Level Information
Class Levels Values
Ad 4 display paper people radio
Area 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Number of Observations Read 144
Number of Observations Used 144

ANOVA for Randomized Block Design

The GLM Procedure

Dependent Variable: Sales

Source DF Sum of Squares Mean Square F Value Pr > F
Model 20 15131.38889 756.56944 8.43 <.0001
Error 123 11037.91667 89.73916    
Corrected Total 143 26169.30556      
R-Square Coeff Var Root MSE Sales Mean
0.578211 14.17712 9.473076 66.81944
Source DF Type I SS Mean Square F Value Pr > F
Ad 3 5866.083333 1955.361111 21.79 <.0001
Area 17 9265.305556 545.017974 6.07 <.0001
Source DF Type III SS Mean Square F Value Pr > F
Ad 3 5866.083333 1955.361111 21.79 <.0001
Area 17 9265.305556 545.017974 6.07 <.0001

ANOVA Post Hoc Tests

A pairwise comparison examines the difference between two treatment means. If your ANOVA results suggest that you reject the null hypothesis that the means are equal across groups, you can conduct multiple pairwise comparisons in a post hoc analysis to learn which means differ.

The chance that you make a Type I error increases each time you conduct a statistical test. The comparisonwise error rate, or CER, is the probability of a Type I error on a single pairwise test. The experimentwise error rate, or EER, is the probability of making at least one Type I error when performing all of the pairwise comparisons. The EER increases as the number of pairwise comparisons increases.

With a statistically conservative multiple comparison method, such as the Tukey or Dunnett method, you control for the EER, so there's a tendency to find fewer significant differences than might otherwise be found. When you make no adjustments for multiple comparisons, you are likely to find more significant differences than might otherwise be found. If only the comparisonwise error rate is controlled, the overall risk of a Type I error across all the comparisons is increased (and therefore the risk of Type II error is decreased), so the test might find more significant differences than would otherwise be found.

The Tukey method compares all possible pairs of means, so it can only be used when you make pairwise comparisons (See Tukey's range test) Tukey's procedure is only applicable for pairwise comparisons. It assumes independence of the observations being tested, as well as equal variation across observations (homoscedasticity). The procedure calculates for each pair the studentized range statistic: ${\frac {Y_{{A}}-Y_{{B}}}{SE}}$ where $ Y_{{A}}$ is the larger of the two means being compared, ${\displaystyle Y_{B}} Y_{{B}}$ is the smaller, and $SE$ is the standard error of the data in question. Tukey's test is essentially a Student's t-test, except that it corrects for family-wise error-rate.

Dunnett's method is a specialized multiple comparison test that enables you to compare a single control group to all other groups.


In [46]:
ods graphics on / width=700;
ods trace on;

proc glm data=statdata.mggarlic_block;
    class Fertilizer Sector;
    model BulbWt=Fertilizer Sector;             /*****/
    lsmeans Fertilizer / pdiff=all adjust=tukey;
    lsmeans Fertilizer / pdiff=controlu('4') adjust=dunnett;
    lsmeans Fertilizer / pdiff=all adjust=t;
    title "Garlic Data: Multiple Comparisons";
run;
quit;
title;


Out[46]:
SAS Output

Garlic Data: Multiple Comparisons

The GLM Procedure

Class Level Information
Class Levels Values
Fertilizer 4 1 2 3 4
Sector 8 1 2 3 4 5 6 7 8
Number of Observations Read 32
Number of Observations Used 32

Garlic Data: Multiple Comparisons

The GLM Procedure

Dependent Variable: BulbWt

Source DF Sum of Squares Mean Square F Value Pr > F
Model 10 0.02307263 0.00230726 5.86 0.0003
Error 21 0.00826745 0.00039369    
Corrected Total 31 0.03134008      
R-Square Coeff Var Root MSE BulbWt Mean
0.736202 9.085064 0.019842 0.218398
Source DF Type I SS Mean Square F Value Pr > F
Fertilizer 3 0.00508630 0.00169543 4.31 0.0162
Sector 7 0.01798632 0.00256947 6.53 0.0004
Source DF Type III SS Mean Square F Value Pr > F
Fertilizer 3 0.00508630 0.00169543 4.31 0.0162
Sector 7 0.01798632 0.00256947 6.53 0.0004

Garlic Data: Multiple Comparisons

The GLM Procedure

Least Squares Means

Adjustment for Multiple Comparisons: Tukey

Fertilizer BulbWt LSMEAN LSMEAN Number
1 0.23625000 1
2 0.21115125 2
3 0.22330125 3
4 0.20288875 4
Least Squares Means for effect Fertilizer
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: BulbWt


i/j 1 2 3 4
1   0.0840 0.5699 0.0144
2 0.0840   0.6186 0.8383
3 0.5699 0.6186   0.1995
4 0.0144 0.8383 0.1995  

Garlic Data: Multiple Comparisons

The GLM Procedure

Least Squares Means

Adjustment for Multiple Comparisons: Dunnett

Fertilizer BulbWt LSMEAN H0:LSMean=Control
Pr > t
1 0.23625000 0.0040
2 0.21115125 0.3978
3 0.22330125 0.0638
4 0.20288875  

Garlic Data: Multiple Comparisons

The GLM Procedure

Least Squares Means

Fertilizer BulbWt LSMEAN LSMEAN Number
1 0.23625000 1
2 0.21115125 2
3 0.22330125 3
4 0.20288875 4
Least Squares Means for effect Fertilizer
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: BulbWt


i/j 1 2 3 4
1   0.0195 0.2059 0.0029
2 0.0195   0.2342 0.4143
3 0.2059 0.2342   0.0523
4 0.0029 0.4143 0.0523  

Note:To ensure overall protection level, only probabilities associated with pre-planned comparisons should be used.

You can use the ODS TRACE statement in your program. When you add the ODS TRACE statement, SAS writes a trace record to the log that includes information about each output object, such as the path for each object and the label for each object. When you check the log, you can see a list of the ODS output objects and information about each one. Now that you know the names, you can decide what output to specify in the ODS SELECT statement. Let's turn tracing off, and then specify only the output we want to see. Now let's submit this program.

The only significant pairwise difference is between fertilizer 1 and fertilizer 4. The p-value of 0.0144 is less than your alpha, meaning that the bulb weights of these two fertilizers are significantly different from one another.

Because you performed one-sided, upper-tailed hypothesis tests of each of the organic fertilizers versus the chemical fertilizer, you only see the upper shaded region with the UDL in your plot. Remember that the bottom horizontal line is the least squares mean of your control group. The vertical line for fertilizer 1 is the only line that extends past the UDL.

Finally, let's examine the output that corresponds to your third LSMEANS statement. These t-tests do not adjust for multiple comparisons, and are therefore more liberal than tests that do control for the EER. Take a moment to look at the p-values. You might notice that the p-values in this table are smaller than those in the Tukey table. In fact, which additional significant pairwise difference does this method show? It shows that fertilizer 1 is significantly different from fertilizer 2 with a p-value of 0.0195. This is in addition to fertilizers 1 and 4 being statistically different with a p-value of 0.0029. Notice also that the comparison between fertilizers 3 and 4 is nearly significant. So with this test, there's a tendency to find more significant pairwise differences than might actually exist.

Lastly, let's take a look at the diffogram. Again, this reinforces what you know about fertilizers 1 and 4 and fertilizers 1 and 2. Using these multiple comparison techniques gives you options. If you feel strongly about controlling the EER, you shouldn't use the pairwise t-test results and should instead use the Tukey or Dunnett results. You knew before you performed these multiple comparison techniques that fertilizer 1 produced the garlic with the heaviest overall mean bulb weight, so that would be your first choice if you are not considering other factors like cost or availability. But what if fertilizer 1 is very expensive or hard to obtain? With these multiple comparison techniques, you now know which fertilizers are not statistically different from fertilizer 1, so the Montana Gourmet Garlic farmers have options for the fertilizer to use that will produce equally heavy garlic bulbs.

Two-Way ANOVA with Interactions

When you have two categorical predictor variables and a continuous response variable, you can analyze your data using two-way ANOVA. With two-way ANOVA, you can examine the effects of the two predictor variables concurrently. You can also determine whether they interact with respect to their effect on the response variable. An interaction means that the effects on one variable depend on the value of another variable. If there is no interaction, you can interpret the test for the individual factor effects to determine their significance. If an interaction exists between any factors, the test for the individual factor effects might be misleading due to the masking of these effects by the interaction.

You can include interactions and more than one predictor variable in the ANOVA model.

You can graphically explore the relationship between the response variable and the effect of the interaction between the two predictor variables using PROC SGPLOT.

You can use PROC GLM to determine whether the effects of the predictor variables and the interaction between them are statistically significant.

When running PROC GLM, you can add a STORE statement to save your analysis results. By using the STORE statement, you can run postprocessing analyses on the stored results, even if you no longer have access to the original data set. The STORE statement requests that the procedure save the context and results of the statistical analysis into an item store. To perform post-fitting statistical analyses and plotting for the contents of the store item, you use the PLM procedure.


In [47]:
proc format;
   value dosef
   1="Placebo"
   2="100 mg"
   3="200mg"
   4="500mg";
run;

proc means data=statdata.drug
           mean var std printalltypes;
   class Disease DrugDose;
   var BloodP;
   output out=means mean=BloodP_Mean;
   format DrugDose dosef.;
   title "Selected Descriptive Statistics for Drug Data Set";
run;
title;


Out[47]:
SAS Output

Selected Descriptive Statistics for Drug Data Set

The MEANS Procedure

Analysis Variable : BloodP
N Obs Mean Variance Std Dev
170 -2.2941176 620.3745214 24.9073186
Analysis Variable : BloodP
DrugDose N Obs Mean Variance Std Dev
Placebo 41 -2.4390244 303.6024390 17.4241912
100 mg 44 -3.1136364 483.9170190 21.9981140
200mg 41 -1.6097561 790.6939024 28.1192799
500mg 44 -1.9772727 935.0924947 30.5792821
Analysis Variable : BloodP
Disease N Obs Mean Variance Std Dev
A 59 -15.0169492 434.4997078 20.8446566
B 57 10.6666667 629.9404762 25.0986150
C 54 -2.0740741 476.1830887 21.8216198
Analysis Variable : BloodP
Disease DrugDose N Obs Mean Variance Std Dev
A Placebo 12 1.3333333 183.1515152 13.5333483
  100 mg 16 -9.6875000 356.7625000 18.8881577
  200mg 13 -26.2307692 329.0256410 18.1390640
  500mg 18 -22.5555556 445.0849673 21.0970369
B Placebo 15 -8.1333333 285.9809524 16.9109714
  100 mg 15 5.4000000 479.1142857 21.8886794
  200mg 14 24.7857143 563.7197802 23.7427838
  500mg 13 23.2307692 556.3589744 23.5872630
C Placebo 14 0.4285714 411.8021978 20.2929100
  100 mg 13 -4.8461538 577.6410256 24.0341637
  200mg 14 -5.1428571 195.5164835 13.9827209
  500mg 13 1.3076923 828.5641026 28.7847894

The Means data set contains the variable TYPE, with values ranging from 0 to 3 to represent the four tables this PROC MEANS program generates. Type 0 gives you the mean blood pressure change of all observations, regardless of disease type or drug dose. Type 1 gives you the mean blood pressure for each drug dose, regardless of disease type. Type 2 gives you the mean blood pressure for each disease type, regardless of drug dose. And Type 3 gives you the mean blood pressure for each disease type and drug dose combination.


In [48]:
ods graphics on / width=800;

proc sgplot data=means;
   where _TYPE_=3;     
   scatter x=DrugDose y=BloodP_Mean / 
           group=Disease markerattrs=(size=10);
   series x=DrugDose y=BloodP_Mean / group=Disease        
          lineattrs=(thickness=2);
   xaxis integer;
   format DrugDose dosef.;  
   title "Plot of Stratified Means in Drug Data Set";
run;
title;

ods graphics on / width=800;

proc glm data=statdata.drug;
   class DrugDose Disease;
   model Bloodp=DrugDose Disease DrugDose*Disease / ss3;
   format DrugDose dosef.;  
   title "Analyze the Effects of DrugDose and Disease";
   title2 "Including Interaction";
run;
quit;
title;


Out[48]:
SAS Output

Analyze the Effects of DrugDose and Disease

Including Interaction

The GLM Procedure

Class Level Information
Class Levels Values
DrugDose 4 100 mg 200mg 500mg Placebo
Disease 3 A B C
Number of Observations Read 170
Number of Observations Used 170

Analyze the Effects of DrugDose and Disease

Including Interaction

The GLM Procedure

Dependent Variable: BloodP

Source DF Sum of Squares Mean Square F Value Pr > F
Model 11 36476.8353 3316.0759 7.66 <.0001
Error 158 68366.4589 432.6991    
Corrected Total 169 104843.2941      
R-Square Coeff Var Root MSE BloodP Mean
0.347918 -906.7286 20.80142 -2.294118
Source DF Type III SS Mean Square F Value Pr > F
DrugDose 3 335.73526 111.91175 0.26 0.8551
Disease 2 18742.62386 9371.31193 21.66 <.0001
DrugDose*Disease 6 17146.31698 2857.71950 6.60 <.0001

The p-value for the overall model is very small, so what does this tell you? You can reject the null hypothesis and conclude that at least one of the effects in the model is significant, in other words, there is at least one difference among the 12 group means, one for each drug dose and disease combination. Which factors explain this difference? You'll see in just a few moments.

The R square is 0.3479, so approximately 35% of the variation in blood pressure change can be explained by the model. The average blood pressure change of all the observations is –2.294, which is exactly what the PROC MEANS output showed.

The next tables show the breakdown of the main effects and interaction term in the model. Look at the Type I and Type III Sums of Squares values. Do you know why their values are not exactly the same? You don't have a balanced design in this experiment. In other words, you have a different number of observations in each drug dose and disease combination group. In most situations, you will want to use the Type III SS. The Type I, or sequential SS, are the sums of squares you obtain from fitting the effects in the order you specify in the model. The Type III, or marginal SS, are the sums of squares you obtain from fitting each effect after all the other terms in the model, that is, the sums of squares for each effect corrected for the other terms in the model. Type III SS does not depend upon the order you specify effects in the model. (Use the SS3 option on the MODEL statement, this portion of the output shows the results for Type III SS only.)

You want to look at the interaction term first. If it's significant, the main effects don't tell you the whole story. The p-value for DrugDose*Disease is 0.0001. Presuming an alpha of 0.05, you reject the null hypothesis. You have sufficient evidence to conclude that there is an interaction between the two factors, meaning that the effect of the level of drug dose on blood pressure changes for the different disease types. You don't need to worry all that much about the significance of the main effects at this point for two reasons: 1) Because the interaction term is significant, you know that the effect of the drug level changes for the different disease types. 2) Because the interaction term is significant, you want to include the main effects in the model, whether they are significant or not, to preserve model hierarchy.

Let's finally take a look at the interaction plot for blood pressure. SAS produces this plot by default when you have an interaction term in the model. This plot looks similar to the one you produced with PROC SGPLOT, except that this one plots each of the blood pressure change measurements, as well as the means for each drug dose and disease type combination. Well, you might be thinking, "Okay. I know the interaction is significant. What I really want to know is the effect of drug dose at each particular level of disease." You have to add the LSMEANS statement to your program to find the answer.

The PLM Procedure

The PLM procedure performs post-fitting statistical analyses and plotting for the contents of a SAS item store that were previously created with the STORE statement in some other SAS/STAT procedure.

The statements that are available in the PLM procedure are designed to reveal the contents of the source item store via the Output Delivery System (ODS) and to perform post-fitting tasks.

The use of item stores and PROC PLM enables you to separate common post-processing tasks, such as testing for treatment differences and predicting new observations under a fitted model, from the process of model building and fitting. A numerically expensive model fitting technique can be applied once to produce a source item store. The PLM procedure can then be called multiple times, and the results of the fitted model are analyzed without incurring the model fitting expenditure again.

Selected PROC PLM option:

  • RESTORE specifies the source item store for processing.
  • Selected PROC PLM procedure statements:
  • EFFECTPLOT produces a display of the fitted model and provides options for changing and enhancing the displays.
  • LSMEANS computes and compares least squares means (LS-means) of fixed effects.
  • LSMESTIMATE provides custom hypothesis tests among least squares means.
  • SHOW uses the Output Delivery System to display contents of the item store. This statement is useful for verifying that the contents of the item store apply to the analysis and for generating ODS tables.
  • SLICE provides a general mechanism for performing a partitioned analysis of the LS-means for an interaction. This analysis is also known as an analysis of simple effects. The SLICE statement uses the same options as the LSMEANS statement.
  • WHERE is used in the PLM procedure when the item store contains BY-variable information and you want to apply the PROC PLM statements to only a subset of the BY groups.
PROC PLM RESTORE=item-store-specification<options>;
    EFFECTPLOT <plot-type <(plot-definition options)>> 
          </ options>;
    LSMEANS <model-effects > </ options>;
    LSMESTIMATE model-effect <'label'> values 
      <divisor=n><,...<'label'> values
      <divisor=n>> </ options>;
SHOW options;
SLICE model-effect </ options>;
WHERE expression ;
RUN;

Saving Analysis Results with the STORE Statement

When running PROC GLM, you can add a STORE statement to save your analysis results. By using the STORE statement, you can run postprocessing analyses on the stored results, even if you no longer have access to the original data set.

The STORE statement requests that the procedure save the context and results of the statistical analysis into an item store. An item store is a binary file format that cannot be modified. You can process the contents of an item store with the PLM procedure.

For example, if you need to perform a time-consuming analysis, you can store the results by using the STORE statement.

At a later time, you can use PROC PLM to perform specific statistical analysis tasks based on the saved results of the previous analysis without having to fit the model again. This can be a great time saver!

Here is the syntax of the STORE statement. Following the keyword STORE and OUT= you specify the item store name and an optional label.

You can use the STORE statement in a number of SAS/STAT procedures. For more information about the STORE statement, click the Information button.


In [49]:
proc contents data = statdata.ameshousing3;
run;


Out[49]:
SAS Output

The CONTENTS Procedure

Data Set Name STATDATA.AMESHOUSING3 Observations 300
Member Type DATA Variables 32
Engine V9 Indexes 0
Created 05/18/2017 20:21:01 Observation Length 248
Last Modified 05/18/2017 20:21:01 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 2
First Data Page 1
Max Obs per Page 263
Obs in First Data Page 235
Number of Data Set Repairs 0
Filename /folders/myfolders/ECST131/ameshousing3.sas7bdat
Release Created 9.0401M4
Host Created Linux
Inode Number 372
Access Permission rwxrwx---
Owner Name root
File Size 192KB
File Size (bytes) 196608
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
21 Age_Sold Num 8 BEST12. BEST12. Age of house when sold, in years
16 Basement_Area Num 8 BEST12. BEST12. Basement area in square feet
10 Bedroom_AbvGr Num 8 BEST12. BEST12. Bedrooms above grade
31 Bonus Num 8 BEST12. BEST12. Sale Price > $175,000
8 Central_Air Char 1 $CHAR1. $CHAR1. Presence of central air conditioning
20 Deck_Porch_Area Num 8 BEST12. BEST12. Total area of decks and porches in square feet
11 Fireplaces Num 8 BEST12. BEST12. Number of fireplaces
24 Foundation_2 Char 16 $CHAR16. $CHAR16. Foundation Type
17 Full_Bathroom Num 8 BEST12. BEST12. Number of full bathrooms
12 Garage_Area Num 8 BEST12. BEST12. Size of garage in square feet
23 Garage_Type_2 Char 8 $CHAR8. $CHAR8. Garage attached or detached
9 Gr_Liv_Area Num 8 BEST12. BEST12. Above grade (ground) living area square feet
18 Half_Bathroom Num 8 BEST12. BEST12. Number of half bathrooms
7 Heating_QC Char 2 $CHAR2. $CHAR2. Heating quality and condition
3 House_Style Char 6 $CHAR6. $CHAR6. Style of dwelling
27 House_Style2 Char 6 $CHAR6. $CHAR6. Style of dwelling
30 Log_Price Num 8 BEST12. BEST12. Natural log of the sale price
2 Lot_Area Num 8 BEST12. BEST12. Lot size in square feet
26 Lot_Shape_2 Char 10 $CHAR10. $CHAR10. Regular or irregular lot shape
25 Masonry_Veneer Char 1 $CHAR1. $CHAR1. Masonry veneer or not
13 Mo_Sold Num 8 BEST12. BEST12. Month Sold (MM)
5 Overall_Cond Num 8 BEST12. BEST12. Overall condition of the house
29 Overall_Cond2 Num 8 BEST12. BEST12. Overall condition of the house
4 Overall_Qual Num 8 BEST12. BEST12. Overall material and finish of the house
28 Overall_Qual2 Num 8 BEST12. BEST12. Overall material and finish of the house
1 PID Char 10 $CHAR10. $CHAR10.  
15 SalePrice Num 8 BEST12. BEST12. Sale price in dollars
22 Season_Sold Num 8 BEST12. BEST12. Season when house sold
19 Total_Bathroom Num 8 BEST12. BEST12. Total number of bathrooms (half bathrooms counted 10%)
6 Year_Built Num 8 BEST12. BEST12. Original construction year
14 Yr_Sold Num 8 BEST12. BEST12. Year Sold (YYYY)
32 score Num 8      

In [50]:
proc glm data=statdata.ameshousing3 
         order=internal 
         plots(only)=intplot;
   class Season_Sold Heating_QC;
   model SalePrice=Heating_QC Season_Sold Heating_QC*Season_Sold / ss3;
   lsmeans Heating_QC*Season_Sold / diff slice=Heating_QC;
   format Season_Sold ;
   store out=interact;
   title "Model with Heating Quality and Season as Interacting "
         "Predictors";
run;


Out[50]:
SAS Output

Model with Heating Quality and Season as Interacting Predictors

The GLM Procedure

Class Level Information
Class Levels Values
Season_Sold 4 1 2 3 4
Heating_QC 4 Ex Fa Gd TA
Number of Observations Read 300
Number of Observations Used 300

Model with Heating Quality and Season as Interacting Predictors

The GLM Procedure

Dependent Variable: SalePrice Sale price in dollars

Source DF Sum of Squares Mean Square F Value Pr > F
Model 15 97609874155 6507324943.7 5.68 <.0001
Error 284 325613645356 1146526920.3    
Corrected Total 299 423223519511      
R-Square Coeff Var Root MSE SalePrice Mean
0.230634 24.62130 33860.40 137524.9
Source DF Type III SS Mean Square F Value Pr > F
Heating_QC 3 51116493768 17038831256 14.86 <.0001
Season_Sold 3 9318181844 3106060615 2.71 0.0455
Season_So*Heating_QC 9 24835058089 2759450899 2.41 0.0121

Model with Heating Quality and Season as Interacting Predictors

The GLM Procedure

Least Squares Means

Season_Sold Heating_QC SalePrice LSMEAN LSMEAN Number
1 Ex 145583.333 1
1 Fa 58100.000 2
1 Gd 124330.000 3
1 TA 121312.500 4
2 Ex 153765.244 5
2 Fa 98657.143 6
2 Gd 149619.833 7
2 TA 129404.412 8
3 Ex 154279.422 9
3 Fa 128800.000 10
3 Gd 113727.273 11
3 TA 134046.552 12
4 Ex 163726.933 13
4 Fa 45000.000 14
4 Gd 143812.500 15
4 TA 129345.455 16
Least Squares Means for effect Season_So*Heating_QC
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: SalePrice
i/j 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1   0.0003 0.2252 0.1354 0.5808 0.0133 0.8005 0.2815 0.5550 0.4137 0.0420 0.4276 0.2682 0.0063 0.9229 0.3455
2 0.0003   0.0032 0.0033 <.0001 0.0837 <.0001 0.0005 <.0001 0.0046 0.0080 0.0002 <.0001 0.7378 0.0002 0.0014
3 0.2252 0.0032   0.8252 0.0143 0.1250 0.0593 0.6773 0.0119 0.8097 0.4123 0.4027 0.0047 0.0263 0.2261 0.7349
4 0.1354 0.0033 0.8252   0.0013 0.1409 0.0156 0.4312 0.0009 0.6664 0.4959 0.1840 0.0006 0.0296 0.1260 0.5452
5 0.5808 <.0001 0.0143 0.0013   <.0001 0.6654 0.0021 0.9440 0.1207 <.0001 0.0046 0.3304 0.0017 0.4476 0.0345
6 0.0133 0.0837 0.1250 0.1409 <.0001   0.0008 0.0295 <.0001 0.1295 0.3059 0.0095 <.0001 0.1394 0.0105 0.0619
7 0.8005 <.0001 0.0593 0.0156 0.6654 0.0008   0.0415 0.6221 0.2249 0.0010 0.0894 0.2344 0.0029 0.6868 0.1188
8 0.2815 0.0005 0.6773 0.4312 0.0021 0.0295 0.0415   0.0014 0.9703 0.0917 0.5261 0.0012 0.0146 0.2798 0.9960
9 0.5550 <.0001 0.0119 0.0009 0.9440 <.0001 0.6221 0.0014   0.1115 <.0001 0.0029 0.3502 0.0016 0.4211 0.0294
10 0.4137 0.0046 0.8097 0.6664 0.1207 0.1295 0.2249 0.9703 0.1115   0.3697 0.7398 0.0467 0.0246 0.4374 0.9762
11 0.0420 0.0080 0.4123 0.4959 <.0001 0.3059 0.0010 0.0917 <.0001 0.3697   0.0172 <.0001 0.0481 0.0322 0.2127
12 0.4276 0.0002 0.4027 0.1840 0.0046 0.0095 0.0894 0.5261 0.0029 0.7398 0.0172   0.0027 0.0096 0.4451 0.6732
13 0.2682 <.0001 0.0047 0.0006 0.3304 <.0001 0.2344 0.0012 0.3502 0.0467 <.0001 0.0027   0.0008 0.1802 0.0110
14 0.0063 0.7378 0.0263 0.0296 0.0017 0.1394 0.0029 0.0146 0.0016 0.0246 0.0481 0.0096 0.0008   0.0063 0.0177
15 0.9229 0.0002 0.2261 0.1260 0.4476 0.0105 0.6868 0.2798 0.4211 0.4374 0.0322 0.4451 0.1802 0.0063   0.3586
16 0.3455 0.0014 0.7349 0.5452 0.0345 0.0619 0.1188 0.9960 0.0294 0.9762 0.2127 0.6732 0.0110 0.0177 0.3586  

Model with Heating Quality and Season as Interacting Predictors

The GLM Procedure

Least Squares Means

Season_So*Heating_QC Effect Sliced by Heating_QC for SalePrice
Heating_QC DF Sum of Squares Mean Square F Value Pr > F
Ex 3 1759608339 586536113 0.51 0.6746
Fa 3 12318827232 4106275744 3.58 0.0143
Gd 3 14560964166 4853654722 4.23 0.0060
TA 3 2134918196 711639399 0.62 0.6021

Note:To ensure overall protection level, only probabilities associated with pre-planned comparisons should be used.


In [51]:
proc plm restore=interact plots=all;
   slice Heating_QC*Season_Sold / sliceby=Heating_QC adjust=tukey;
   effectplot interaction(sliceby=Heating_QC) / clm;
run;


Out[51]:
SAS Output

Model with Heating Quality and Season as Interacting Predictors

The PLM Procedure

Store Information
Item Store WORK.INTERACT
Data Set Created From STATDATA.AMESHOUSING3
Created By PROC GLM
Date Created 03JUL17:23:17:39
Response Variable SalePrice
Class Variables Season_Sold Heating_QC
Model Effects Intercept Heating_QC Season_Sold Season_So*Heating_QC
Class Level Information
Class Levels Values
Season_Sold 4 1 2 3 4
Heating_QC 4 Ex Fa Gd TA
F Test for Season_So*Heating_QC Least Squares Means Slice
Slice Num DF Den DF F Value Pr > F
Heating_QC Ex 3 284 0.51 0.6746
Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
Slice Season when house sold Season when house sold Estimate Standard Error DF t Value Pr > |t| Adj P
Heating_QC Ex 1 2 -8181.91 14800 284 -0.55 0.5808 0.9457
Heating_QC Ex 1 3 -8696.09 14716 284 -0.59 0.5550 0.9348
Heating_QC Ex 1 4 -18144 16356 284 -1.11 0.2682 0.6841
Heating_QC Ex 2 3 -514.18 7310.43 284 -0.07 0.9440 0.9999
Heating_QC Ex 2 4 -9961.69 10218 284 -0.97 0.3304 0.7638
Heating_QC Ex 3 4 -9447.51 10095 284 -0.94 0.3502 0.7856
F Test for Season_So*Heating_QC Least Squares Means Slice
Slice Num DF Den DF F Value Pr > F
Heating_QC Fa 3 284 3.58 0.0143
Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
Slice Season when house sold Season when house sold Estimate Standard Error DF t Value Pr > |t| Adj P
Heating_QC Fa 1 2 -40557 23366 284 -1.74 0.0837 0.3071
Heating_QC Fa 1 3 -70700 24728 284 -2.86 0.0046 0.0235
Heating_QC Fa 1 4 13100 39099 284 0.34 0.7378 0.9870
Heating_QC Fa 2 3 -30143 19827 284 -1.52 0.1295 0.4267
Heating_QC Fa 2 4 53657 36198 284 1.48 0.1394 0.4495
Heating_QC Fa 3 4 83800 37092 284 2.26 0.0246 0.1102
F Test for Season_So*Heating_QC Least Squares Means Slice
Slice Num DF Den DF F Value Pr > F
Heating_QC Gd 3 284 4.23 0.0060
Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
Slice Season when house sold Season when house sold Estimate Standard Error DF t Value Pr > |t| Adj P
Heating_QC Gd 1 2 -25290 13355 284 -1.89 0.0593 0.2330
Heating_QC Gd 1 3 10603 12914 284 0.82 0.4123 0.8445
Heating_QC Gd 1 4 -19483 16061 284 -1.21 0.2261 0.6191
Heating_QC Gd 2 3 35893 10762 284 3.34 0.0010 0.0053
Heating_QC Gd 2 4 5807.33 14388 284 0.40 0.6868 0.9777
Heating_QC Gd 3 4 -30085 13980 284 -2.15 0.0322 0.1394
F Test for Season_So*Heating_QC Least Squares Means Slice
Slice Num DF Den DF F Value Pr > F
Heating_QC TA 3 284 0.62 0.6021
Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
Slice Season when house sold Season when house sold Estimate Standard Error DF t Value Pr > |t| Adj P
Heating_QC TA 1 2 -8091.91 10265 284 -0.79 0.4312 0.8598
Heating_QC TA 1 3 -12734 9561.68 284 -1.33 0.1840 0.5434
Heating_QC TA 1 4 -8032.95 13262 284 -0.61 0.5452 0.9302
Heating_QC TA 2 3 -4642.14 7313.62 284 -0.63 0.5261 0.9207
Heating_QC TA 2 4 58.9572 11745 284 0.01 0.9960 1.0000
Heating_QC TA 3 4 4701.10 11135 284 0.42 0.6732 0.9747

The option order=internal tells SAS to use the order of the variable values stored internally, rather than the order of the formatted values. The internal values for Season_Sold are 1 (formatted as Winter), 2 (formatted as Spring), 3 (for Summer), and 4 (for Fall). So, including order=internal tells SAS to display the seasons in the order Winter, Spring, Summer, and Fall, rather than in alphabetical order.

We specify plots(only)=intplot to request an interaction plot. You can request an interaction plot even if there isn’t an interaction in the model. In this case, there is an interaction, so we’ll be able to visualize the interaction in the plot. Because Season_Sold and Heating_QC are categorical variables, we need to include them in a CLASS statement.

In the MODEL statement, we specify the response variable SalePrice, equals, the main effects Heating_QC and Season_Sold, and then interaction effect. The interaction effect is represented as the two main effects separated by an asterisk. Here’s another way to represent the effects. Instead of listing the interaction effect, we could simply place a vertical bar between the main effects.

Recall that the LSMEANS option computes and compares least squares means of fixed effects. This LSMEANS statement specifies the interaction term Heating_QC by Season_Sold. By specifying slice=Heating_QC, we tell SAS to to slice the interaction effect by the different levels of Heating_QC.

This means that each slice will have one Heating_QC level and will show the Season_Sold effect across that slice. We'll format Season_Sold with a FORMAT statement. The statement store out=interact saves the analysis results as a SAS item store named interact in the Work library.

We want to produce more tables on Season_Sold by different levels of Heating_QC, so we add the SLICE statement slice Heating_QC by Season_Sold, sliceby=Heating_QC. Notice that the syntax sliceby= in the SLICE statement is different from slice= in the LSMEANS statement.

We include adjust=tukey in order to get the Tukey adjustment for multiple comparison tests.

By specifying plots=all, the output will include an effect plot. However, by adding an EFFECTPLOT statement, we can specify more options. We specify the interaction sliceby=Heating_QC, and we specify clm, which gives us confidence limits for the means.


In [52]:
*variables

Region
Advertising
Gender
Book_Sales
Music_Sales
Electronics_Sales
Total_Sales
;

proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
data Store;
   length Region $ 5;
   call streaminit(57676);
   do Transaction = 1 to 200;
      R = ceil(rand('uniform')*10);
      select(R);
         when(1) Region = 'East';
         when(2) Region = 'West';
         when(3) Region = 'North';
         when(4) Region = 'South';
         otherwise;
      end;
      Advertising = rand('bernouli',.6);
      if rand('uniform') lt .6 then Gender = 'Female';
         else Gender = 'Male';
      Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
                    + 30*Advertising,10)) ;
      Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
         + 30*(Region = 'East' and Gender = 'Male')
         - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
      Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
       + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South') 
       + 75*(Region = 'West'),10));
      Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
   output;
   end;
   drop R;
   format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
          Advertising yesno.;
run;
 
/*title "Listing of Store";*/
/*proc print data=store heading=h;*/
/*run;*/

/*proc univariate data=store;*/
/*   var Book_Sales -- Total_Sales;*/
/*   histogram;*/
/*run;*/
/**/
/*title "Scatter Matrix for Store Variables";*/
/*proc sgscatter data=store;*/
/*   matrix Book_Sales -- Total_Sales / group = Gender;*/
/*run;*/
/**/
/*proc sgplot data=store;*/
/*   scatter x=Book_Sales y=Total_Sales / group=Gender;*/
/*run;*/

proc rank data=store out=median_sales groups=2;
   var Total_Sales;
   ranks Sales_Group;
run;

proc format;
   value sales 0 = 'Low'
               1 = 'High';
run;

/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male');*/
/*   model Sales_Group = Gender;*/
/*   format Sales_Group sales.;*/
/*quit;*/
/**/
/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male')*/
/*         Advertising (param=ref ref='No');*/
/*   model Sales_Group = Gender Advertising;*/
/*   format Sales_Group sales.;*/
/*quit;*/

*Create test data set;
libname example 'c:\books\statistics by example';
data example.Blood_Pressure;
   call streaminit(37373);
   do Drug = 'Placebo','Drug A','Drug B';
      do i = 1 to 20;
         Subj + 1;
         if mod(Subj,2) then Gender = 'M';
         else Gender = 'F';
         SBP = rand('normal',130,10) +
               7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
         SBP = round(SBP,2);
         DBP = rand('normal',80,5) +
               3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
         DBP = round(DBP,2);
         if Subj in (5,15,25,55) then call missing(SBP, DBP);
         if Subj in (4,18) then call missing(Gender);
         output;
      end;
   end;
   drop i;
run;

/*title "Listing of the first 25 observations from Blood_Pressure";*/
/*proc print data=example.Blood_Pressure(obs=25) noobs;*/
/*   var Subj Drug SBP DBP;*/
/*run;*/

data exercise;
   call streaminit(7657657);
   do Subj = 1 to 50;
      Age = round(rand('normal',50,15));
      Pushups = abs(int(rand('normal',40,10) - .30*age));
      Rest_Pulse = round(rand('normal',50,8) + .35*age);
      Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
      Run_Pulse = round(max_pulse - rand('normal',3,3));
      output;
   end;
run;

*Data set for a paired t-test example;
data reading;
   input Subj Before After @@;
datalines;
1 100 110  2 120 121  3 130 140  4 90 110  5 85 92
6 133 137  7 210 209  8 155 179
;

/*title "Listing of Data Set READING";*/
/*proc print data=reading noobs;*/
/*run;*/

*Data set that violates assumptions for a t-test;
data salary;
   call streaminit(57575);
   do Subj = 1 to 50;
      do Gender = 'M','F';
         Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
         output;
      end;
   end;
run;
/*proc univariate data=salary;*/
/*   class Gender;*/
/*   id Subj;*/
/*   var Income;*/
/*   histogram Income;*/
/*run;*/

*Data set risk for logistic regression example;
proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
run;

data Risk;
   call streaminit(13579);
   length Age_Group $ 7;
   do i = 1 to 250;
      do Gender = 'F','M';
         Age = round(rand('uniform')*30 + 50);
         if missing(Age) then Age_Group = ' ';
         else if Age lt 60 then Age_Group = '1:< 60';
         else if Age le 70 then Age_Group = '2:60-70';
         else Age_Group = '3:71+';
         Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
         Chol = round(Chol);
         Score = .3*chol + age + 8*(Gender eq 'M');
         Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
         output;
       end;
   end;
   keep Gender Age Age_Group chol Heart_Attack;
   format Heart_Attack yesno.;
run;

/*title "Listing of first 100 observations from RISK";*/
/*proc print data=risk(obs=100);*/
/*run;*/


Out[52]:

911  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
912
913 *variables
914
915 Region
916 Advertising
917 Gender
918 Book_Sales
919 Music_Sales
920 Electronics_Sales
921 Total_Sales
922 ;
923
924 proc format;
925 value yesno 1 = 'Yes'
NOTE: Format YESNO is already on the library WORK.FORMATS.
NOTE: Format YESNO has been output.
926 0 = 'No';
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

927 data Store;
928 length Region $ 5;
929 call streaminit(57676);
930 do Transaction = 1 to 200;
931 R = ceil(rand('uniform')*10);
932 select(R);
933 when(1) Region = 'East';
934 when(2) Region = 'West';
935 when(3) Region = 'North';
936 when(4) Region = 'South';
937 otherwise;
938 end;
939 Advertising = rand('bernouli',.6);
940 if rand('uniform') lt .6 then Gender = 'Female';
941 else Gender = 'Male';
942 Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
943 + 30*Advertising,10)) ;
944 Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
945 + 30*(Region = 'East' and Gender = 'Male')
946 - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
947 Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
948 + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South')
949 + 75*(Region = 'West'),10));
950 Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
951 output;
952 end;
953 drop R;
954 format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
955 Advertising yesno.;
956 run;
NOTE: The data set WORK.STORE has 200 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds

957
958 /*title "Listing of Store";*/
959 /*proc print data=store heading=h;*/
960 /*run;*/
961
962 /*proc univariate data=store;*/
963 /* var Book_Sales -- Total_Sales;*/
964 /* histogram;*/
965 /*run;*/
966 /**/
967 /*title "Scatter Matrix for Store Variables";*/
968 /*proc sgscatter data=store;*/
969 /* matrix Book_Sales -- Total_Sales / group = Gender;*/
970 /*run;*/
971 /**/
972 /*proc sgplot data=store;*/
973 /* scatter x=Book_Sales y=Total_Sales / group=Gender;*/
974 /*run;*/
975
976 proc rank data=store out=median_sales groups=2;
977 var Total_Sales;
978 ranks Sales_Group;
979 run;
NOTE: The data set WORK.MEDIAN_SALES has 200 observations and 9 variables.
NOTE: PROCEDURE RANK used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

980
981 proc format;
982 value sales 0 = 'Low'
NOTE: Format SALES is already on the library WORK.FORMATS.
NOTE: Format SALES has been output.
983 1 = 'High';
984 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

985
986 /*proc logistic data=median_sales order=formatted;*/
987 /* class Gender(param=ref ref='Male');*/
988 /* model Sales_Group = Gender;*/
989 /* format Sales_Group sales.;*/
990 /*quit;*/
991 /**/
992 /*proc logistic data=median_sales order=formatted;*/
993 /* class Gender(param=ref ref='Male')*/
994 /* Advertising (param=ref ref='No');*/
995 /* model Sales_Group = Gender Advertising;*/
996 /* format Sales_Group sales.;*/
997 /*quit;*/
998
999 *Create test data set;
1000 libname example 'c:\books\statistics by example';
NOTE: Library EXAMPLE does not exist.
1001 data example.Blood_Pressure;
1002 call streaminit(37373);
1003 do Drug = 'Placebo','Drug A','Drug B';
1004 do i = 1 to 20;
1005 Subj + 1;
1006 if mod(Subj,2) then Gender = 'M';
1007 else Gender = 'F';
1008 SBP = rand('normal',130,10) +
1009 7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
1010 SBP = round(SBP,2);
1011 DBP = rand('normal',80,5) +
1012 3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
1013 DBP = round(DBP,2);
1014 if Subj in (5,15,25,55) then call missing(SBP, DBP);
1015 if Subj in (4,18) then call missing(Gender);
1016 output;
1017 end;
1018 end;
1019 drop i;
1020 run;
ERROR: Library EXAMPLE does not exist.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds

1021
1022 /*title "Listing of the first 25 observations from Blood_Pressure";*/
1023 /*proc print data=example.Blood_Pressure(obs=25) noobs;*/
1024 /* var Subj Drug SBP DBP;*/
1025 /*run;*/
1026
1027 data exercise;
1028 call streaminit(7657657);
1029 do Subj = 1 to 50;
1030 Age = round(rand('normal',50,15));
1031 Pushups = abs(int(rand('normal',40,10) - .30*age));
1032 Rest_Pulse = round(rand('normal',50,8) + .35*age);
1033 Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
1034 Run_Pulse = round(max_pulse - rand('normal',3,3));
1035 output;
1036 end;
1037 run;
NOTE: The data set WORK.EXERCISE has 50 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds

1038
1039 *Data set for a paired t-test example;
1040 data reading;
1041 input Subj Before After @@;
1042 datalines;
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.READING has 8 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

1045 ;
1046
1047 /*title "Listing of Data Set READING";*/
1048 /*proc print data=reading noobs;*/
1049 /*run;*/
1050
1051 *Data set that violates assumptions for a t-test;
1052 data salary;
1053 call streaminit(57575);
1054 do Subj = 1 to 50;
1055 do Gender = 'M','F';
1056 Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
1057 output;
1058 end;
1059 end;
1060 run;
NOTE: The data set WORK.SALARY has 100 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds

1061 /*proc univariate data=salary;*/
1062 /* class Gender;*/
1063 /* id Subj;*/
1064 /* var Income;*/
1065 /* histogram Income;*/
1066 /*run;*/
1067
1068 *Data set risk for logistic regression example;
1069 proc format;
1070 value yesno 1 = 'Yes'
NOTE: Format YESNO is already on the library WORK.FORMATS.
NOTE: Format YESNO has been output.
1071 0 = 'No';
1072 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

1073
1074 data Risk;
1075 call streaminit(13579);
1076 length Age_Group $ 7;
1077 do i = 1 to 250;
1078 do Gender = 'F','M';
1079 Age = round(rand('uniform')*30 + 50);
1080 if missing(Age) then Age_Group = ' ';
1081 else if Age lt 60 then Age_Group = '1:< 60';
1082 else if Age le 70 then Age_Group = '2:60-70';
1083 else Age_Group = '3:71+';
1084 Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
1085 Chol = round(Chol);
1086 Score = .3*chol + age + 8*(Gender eq 'M');
1087 Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
1088 output;
1089 end;
1090 end;
1091 keep Gender Age Age_Group chol Heart_Attack;
1092 format Heart_Attack yesno.;
1093 run;
NOTE: The data set WORK.RISK has 500 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds

1094
1095 /*title "Listing of first 100 observations from RISK";*/
1096 /*proc print data=risk(obs=100);*/
1097 /*run;*/
1098
1099
1100 ods html5 close;ods listing;

1101