SAS_ECST_Lesson_5

Categorical Data Analysis


In [ ]:
%let path=/folders/myfolders/ECST131;
libname statdata "&path";

In [ ]:
proc print data=statdata.sales_inc (obs = 10);
run;

There are various methods of parameterizing the classification variables. Two common parameterization methods, or coding schemes, are effect coding and reference cell coding. Different parameterization methods will produce the same results regarding the significance of the categorical predictors. But understanding the parameterization method will help you to interpret your results accurately. The CLASS statement specifies which parameterization method PROC LOGISTIC uses. By default, PROC LOGISTIC uses effect coding. If you specify only a variable in the CLASS statement, as in this example, PROC LOGISTIC uses effect coding. To specify a parameterization method other than the default, such as reference cell coding, you use the PARAM= option in the CLASS statement. The keyword for reference cell coding is either REFERENCE or REF, as in this example.

As the default reference level for either effect coding or reference cell coding, PROC LOGISTIC uses the level that has the highest ranked value—or, the last value—when the levels are sorted in ascending alphanumeric order. In this example, for Gender, Male is the default reference level. If you want to specify a reference level, you use the REF= variable option. You can specify the actual value (or level) in quotation marks or the keyword FIRST or LAST. In this example, for the classification variable Gender, the REF= option specifies Male as the reference level. In this example, Male is the default reference level, so it's not necessary to specify it. However, including the REF= option makes your code easier to understand at a glance.
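In practice, the two coding schemes described above differ only in the CLASS statement. Here is a minimal sketch on the sales_inc data; both fits give the same overall significance test for Gender:

In [ ]:
proc logistic data=statdata.sales_inc;
   class Gender;                        /* effect coding (the default) */
   model Purchase(event='1') = Gender;
run;

proc logistic data=statdata.sales_inc;
   class Gender (param=ref ref='Male'); /* reference cell coding; Male is the reference */
   model Purchase(event='1') = Gender;
run;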

Effect Coding

Deviation from the Mean Coding

Effect coding compares the effect of each level of the variable to the average effect of all levels. For effect coding, the p-values indicate whether each particular level is significantly different from the average effect of all levels. The p-value for IncLevel 1 is 0.1273, which indicates that the effect of low income is not significantly different from the average effect of low, medium, and high income.
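For example, with the three income levels and level 3 (high income) as the default reference, effect coding creates two design variables:

IncLevel       D1    D2
1 (Low)         1     0
2 (Medium)      0     1
3 (High)       -1    -1

Each coefficient then measures the deviation of that level's effect from the average effect across all levels.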

Reference Cell Coding

Reference cell coding compares the effect of each level of the predictor variable to the effect of a designated reference level. As the default reference level, PROC LOGISTIC uses the level that has the highest ranked value (the last value) when the levels are sorted in ascending alphanumeric order. The reference level is represented by the intercept. This formula $logit(p) = \beta_{0}+ \beta_{1}*D_{Low Income} + \beta_{2}*D_{Medium Income}$ shows how the betas (the estimates of the parameters) are interpreted when reference cell coding is the parameterization method. $\beta_{0}$ is the intercept, but not in terms of where you cross the Y axis. Instead, $\beta_{0}$ is the value of the logit of the probability when income is high (that is, at the reference level). $\beta_{1}$ is the difference between the logit of the probability for low and high income. $\beta_{2}$ is the difference between the logit of the probability for medium and high income.
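Under reference cell coding with high income as the reference level, the design variables become:

IncLevel       D_Low    D_Medium
1 (Low)           1         0
2 (Medium)        0         1
3 (High)          0         0

All design variables are 0 at the reference level, which is why $\beta_{0}$ is the logit for high income.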

Fitting a Binary Logistic Regression Model


In [ ]:
proc logistic data=statdata.sales_inc
              plots(only)=(effect);
   class Gender (param=ref ref='Male'); /*Female = 1; Male = 0*/
   model Purchase(event='1') = Gender / clodds=both; 
   title1 'LOGISTIC MODEL (1): Purchase=Gender';
run;
title;

AIC is Akaike's Information Criterion. SC is the Schwarz Criterion, which is also known as Schwarz's Bayesian Criterion. -2 Log L is -2 times the log likelihood. AIC and SC are goodness-of-fit measures that you can use to compare one model to another. Because -2 Log L always improves as terms are added, it cannot by itself compare models with different numbers of terms; AIC and SC adjust for model size. For both AIC and SC, lower values indicate a more desirable model. AIC adjusts for the number of predictor variables, and SC adjusts for the number of predictor variables and the number of observations. SC uses a bigger penalty for extra variables and therefore favors more parsimonious models. If you are trying to come up with the best predictive model, AIC is typically the better choice; if you are trying to identify the most parsimonious explanatory model, SC is the better choice. You can compare the AIC and SC columns for the model with the intercept only and the model with the intercept and the predictor variables.
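The penalties can be written explicitly. With $k$ parameters (including the intercept) and $n$ observations:

$$AIC = -2\log L + 2k \qquad\qquad SC = -2\log L + k\,\ln(n)$$

Because $\ln(n) > 2$ whenever $n > 7$, SC penalizes each additional parameter more heavily than AIC does.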

The significant p-value means that there is an association between Gender and Purchase, so females and males are statistically different from one another in terms of purchasing \$100 or more.

Comparing Pairs

To assess how well the predicted probabilities agree with the observed outcomes, PROC LOGISTIC first separates the observations into two groups: those with the outcome of interest and those without it. PROC LOGISTIC then selects pairs of observations, one from each group, until no more pairs can be selected, and determines whether each pair is concordant, discordant, or tied.

Suppose a pair consists of a man from the "less than \$100" group and a woman from the "\$100 or more" group. We know that customers have a different predicted probability of spending the target amount, based on their gender. A man has a lower predicted probability (0.32) and a woman has a higher predicted probability (0.42) of spending that amount. In this pair, the woman spent the higher amount and the man did not, as the model predicts. A pair is concordant if the model predicts it correctly. In other words, a pair is concordant if the observation with the desired outcome has a higher predicted probability, based on the model, than the observation without the outcome.

Suppose a woman is chosen from the "less than \$100" group and a man is chosen from the "\$100 or more" group. From our model, we know that a woman has a higher predicted probability (0.42) and a man has a lower predicted probability (0.32) of spending \$100 or more. However, in this pair, the man spent \$100 or more but the woman did not. Our model did not predict this pair correctly, so it is a discordant pair. A pair is discordant if the observation with the desired outcome has a lower predicted probability than the observation without the outcome.

Let's look at one more pair, which consists of a woman from each group. According to our model, both have a predicted probability of 0.42 of spending \$100 or more, so our model cannot distinguish between them. A pair is tied if it is neither concordant nor discordant—the probabilities are the same. In this example, a tied pair can consist of either two women or two men. Either way, the probabilities are the same and the model cannot distinguish between them.

This table summarizes the combinations that make up the three types of pairs in this analysis. When a pair consists of a female from the group of customers with the desired outcome and a male from the group of customers without the outcome, then the pair is concordant. The pair fits the model. When a pair consists of a male from the group of customers with the desired outcome and a female from the group of customers without the outcome, then the pair is discordant. The pair does not fit the model. When a pair consists of either two males or two females, they have the same predicted probability, and the pair is tied. The predictor variable Gender has only two levels, so there are only two predicted probabilities for the outcome of purchasing \$100 or more. More complex models have more than two predicted probabilities. However, regardless of the model's complexity, the same comparisons are made across all pairs of observations with different outcomes.
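In summary:

Pair (one with the outcome, one without)      Type
Female with outcome, male without             Concordant
Male with outcome, female without             Discordant
Two females or two males                      Tied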

Let's take another look at the Association of Predicted Probabilities and Observed Responses table. The left column lists the percentage of pairs of each type: concordant, discordant, and tied. At the bottom is the total number of observation pairs on which the percentages are based. In this example, there are 43,578 pairs of observations with different outcome values. The frequency table for Purchase that we generated in an earlier demonstration shows where the total number of pairs comes from. The data set contains 162 observations with an outcome of \$100 or more and 269 observations with an outcome of under \$100. We multiply 162 by 269 to get 43,578 pairs of observations that have different outcome values. You can use the percentages of concordant, discordant, and tied pairs as goodness-of-fit measures to compare one model to another. In general, higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.

This table also shows the four rank correlation indices that are computed from the numbers of concordant, discordant, and tied pairs of observations: Somers' D, Gamma, Tau-a, and c. In general, a model with higher values for these indices has better predictive ability than a model with lower values. Of these indices, c is the most commonly used. The concordance index, c, estimates the probability that an observation with the desired outcome has a higher predicted probability than an observation without the desired outcome. It is calculated as the number of concordant pairs, plus one-half the number of tied pairs, divided by the total number of pairs.

$$\text{Tau-a} = \frac{C-D}{0.5\,N(N-1)} \qquad \text{Gamma} = \frac{C-D}{C+D} \qquad \text{Somers' D} = \frac{C-D}{C+D+T} \qquad c = 0.5\,(1 + \text{Somers' D}) = \frac{C + 0.5\,T}{C+D+T}$$

Here $C$, $D$, and $T$ are the numbers of concordant, discordant, and tied pairs, $C+D+T$ is the total number of pairs with different outcomes, and $N$ is the total number of observations.

In [ ]:
proc print data=statdata.safety;
run;

proc logistic data=statdata.safety
              plots(only)=(effect);
   class Region (param=ref ref='Asia'); /*N America = 1; Asia = 0*/
   model Unsafe(event='1') = Region / clodds = both ;
   title 'LOGISTIC MODEL: Unsafe = Region';
run;
title;

/*The odds ratio for Region says that the odds of being unsafe (having 
a below-average safety rating) are 56.5% lower for cars produced in North
America, compared with those produced in Asia. The confidence interval 
includes 1, indicating that the odds ratio is not statistically 
significant. The difference between Asian and North American cars is shown
on the probability scale in the effect plot.*/

Fitting a Multiple Logistic Regression Model

Add the SELECTION= option to the MODEL statement. All possible values of the SELECTION= option are shown here: BACKWARD or B requests backward elimination. FORWARD or F requests forward selection. When you specify NONE, no selection method is used; instead, PROC LOGISTIC fits the complete model specified in the MODEL statement. STEPWISE requests stepwise selection. SCORE requests best subset selection. By default, SELECTION= is set to NONE.
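For example, a stepwise fit of the purchase model might look like this sketch; the SLENTRY= and SLSTAY= significance thresholds are illustrative choices, not values prescribed by this course:

In [ ]:
proc logistic data=statdata.sales_inc;
   class Gender (param=ref ref='Male') IncLevel (param=ref ref='1');
   model Purchase(event='1') = Gender Age IncLevel /
         selection=stepwise slentry=0.05 slstay=0.05;
run;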


In [ ]:
proc format;
    value incfmt 1 = 'Low Income'
                 2 = 'Medium Income'
                 3 = 'High Income';
run;                 

proc logistic data=statdata.sales_inc 
              plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male') 
         IncLevel (param=ref ref='Low Income');
   units Age=10;
   model Purchase(event='1')=Gender Age IncLevel / 
         selection=backward clodds=pl; 
   title1 'LOGISTIC MODEL (2):Purchase=Gender Age IncLevel';
   format IncLevel incfmt.;
run;

title;

Looking at the predicted probability graph above, the predicted probability for Female3 (a 60-year-old female at income level 3) is calculated as follows:

$$\text{Predicted probability for Female3} = \frac{e^{\beta_{0} + \beta_{Gender} + \beta_{IncLvl3} + \beta_{Age}\cdot Age}}{1 + e^{\beta_{0} + \beta_{Gender} + \beta_{IncLvl3} + \beta_{Age}\cdot Age}} = \frac{e^{-3.3071 + 0.5204 + 0.8186 + 0.056 \cdot 60}}{1 + e^{-3.3071 + 0.5204 + 0.8186 + 0.056 \cdot 60}} = 0.80$$

We expect to see a drastic decrease in the number of tied pairs because tied pairs aren't very common when there are continuous variables in the model.

Fitting a Multiple Logistic Regression Model with Interactions

When only significant interactions remain, PROC LOGISTIC turns its attention to the main effects. PROC LOGISTIC eliminates, one at a time, the nonsignificant main effects that are not involved in any significant interactions. Each time, PROC LOGISTIC eliminates the least significant of these main effects, the one with the largest p-value. When eliminating main effects, PROC LOGISTIC must preserve the model hierarchy. According to this requirement, for any interaction that's included in the model, all main effects that the interaction contains must also be in the model. For example, the interaction Gender by IncLevel is significant and has remained in the model, so both Gender and IncLevel must also remain in the model, whether or not they are significant. The only main effect that PROC LOGISTIC could possibly eliminate is Age, and only if Age is nonsignificant.

By default, PROC LOGISTIC does not produce or display odds ratios for terms that are involved in interactions. In the final model, Age is the only term that is not involved in an interaction, so its odds ratio is the only one displayed.


In [ ]:
proc logistic data=statdata.sales_inc 
              plots(only)=(effect oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;  /* You can also use a multiple of the standard deviation (n*SD) as the unit */
   model Purchase(event='1')=Gender | Age | IncLevel @2 /
   /* The | operator adds interactions. The @ sign followed by an integer limits the order
      of the interactions: @2 keeps the main effects and two-way interactions only. */
         selection=backward clodds=pl; 
   title1 'LOGISTIC MODEL (3): Main Effects and 2-Way Interactions';
   title2 '/ sel=backward';
   store out = salesout;
run;
title;

/*One note indicates that the CLODDS= option does not compute odds ratios for effects within interactions.*/

In [ ]:
proc plm restore=salesout;
run;

Fitting a Multiple Logistic Regression Model with All Odds Ratios


In [ ]:
ods graphics / width=700;
ods select OddsRatiosPL ORPlot;

proc logistic data=statdata.sales_inc 
              plots(only)=(oddsratio);
   class Gender (param=ref ref='Male')
         IncLevel (param=ref ref='1');
   units Age=10;
   model Purchase(event='1')=Gender | IncLevel Age;
   oddsratio Age / cl=pl;
   oddsratio Gender / diff=ref at (IncLevel=all) cl=pl;
   oddsratio IncLevel / diff=ref at (Gender=all) cl=pl;
   title1 'LOGISTIC MODEL (3a): Significant Terms and All Odds Ratios';
   title2 '/ sel=backward';
run;

title;

The plot below shows two slopes for IncLevel, one for males and one for females. If there were no interaction between the two variables, the slopes would be roughly parallel. However, the slopes shown here are not parallel. You can see how the logit of the predicted probability of spending \$100 or more is affected by low, medium, and high incomes for males and females. The probability of females spending \$100 or more barely increases as the income level increases. The probability of males spending \$100 or more barely increases from low to medium incomes, but increases sharply from medium to high incomes. So, the probability of spending \$100 or more is strongly related to income for men, but only weakly related to income for women.


In [ ]:
proc print data=statdata.sales_inc (obs=10);
run;

proc means data=statdata.sales_inc noprint nway;
   class IncLevel Gender;
   var Purchase;
   output out=bins sum(Purchase)=Purchase n(Purchase)=BinSize;
run;

proc print data = bins;
run;

data bins;
   set bins;
   /* Smoothed empirical logit: adding 1 to each count avoids log(0) in bins where
      everyone (or no one) made a purchase */
   Logit=log((Purchase+1)/(BinSize-Purchase+1)); 
run;

proc sgscatter data=bins;
   plot Logit*IncLevel /group=Gender markerattrs=(size=15)
                        join;
   format IncLevel incfmt.;
   label IncLevel='Income Level';
   title;
run;
quit;

Generating Predictions Using PROC PLM

The ILINK option requests that SAS provide the predictions as predicted probabilities (by applying the inverse link function) instead of as logits, the scale on which the covariate effects are additive.

The table produced by PROC PLM shows that the house with the highest predicted probability of being bonus eligible (0.306) has an irregular lot shape, 1 fireplace, and a basement area of 975 square feet. The house with the lowest predicted probability (0.0004) has a regular lot shape, 2 fireplaces, and a basement area of 775 square feet.

Be sure that you generate predictions only for new data records that fall within the range of the training data. If not, predictions could be invalid due to extrapolation. We assume that the modeled relationship between predictors and responses holds across the span of the observed data. We should not assume that this relationship holds everywhere.


In [ ]:
ods select none;
proc logistic data=statdata.ameshousing3;
   class Fireplaces (ref='0') Lot_Shape_2 (ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area|Lot_Shape_2 Fireplaces;
   units Basement_Area=100;
   store out=isbonus;
run;
ods select all;

data newhouses;
   length Lot_Shape_2 $9;
   input Fireplaces Lot_Shape_2 $ Basement_Area;
   datalines;
   0 Regular 1060
   2 Regular 775
   2 Irregular 1100
   1 Irregular 975
   1 Regular 800
   ;
run;

proc plm restore=isbonus;
   score data=newhouses out=scored_houses / ILINK;
   title 'Predictions using PROC PLM';
run;

proc print data=scored_houses;
run;

Practice: Fitting a Multiple Logistic Regression Model, Saving Analysis Results, and Generating Predictions

Fit a multiple logistic regression model with Unsafe as the outcome variable and Weight, Size, and Region as the predictor variables. Use the EVENT= option to model the probability of below-average safety scores. Apply the SIZEFMT. format to the variable Size. (This format is defined in the setup program for this course.) Specify Region and Size as classification variables using reference cell coding, and specify Asia as the reference level for Region and Small as the reference level for Size. Use a UNITS statement with -1 as the unit for weight so that you can see the odds ratio for lighter cars over heavier cars. Request any relevant plots. Add an appropriate title. Submit the code and view the log.


In [ ]:
proc format;
   value sizefmt 1 = 'Small'
                 2 = 'Medium'
                 3 = 'Large';
run;

ods graphics on;

proc logistic data=statdata.safety plots(only)=(effect oddsratio);
   class Region (param=ref ref='Asia')
         Size (param=ref ref='Small');
   model Unsafe(event='1') = Weight Region Size / clodds=pl selection=backward;
   units Weight = -1;
   store out=IsSafe;
   format Size sizefmt.;
   title 'LOGISTIC MODEL: Backwards Elimination';
run;
title;

data checkSafety;
   length Region $ 9;
   input Weight Size Region $ 5-13;
   datalines;
4 1 N America
3 1 Asia     
5 3 Asia     
5 2 N America
;
run;

proc plm restore = IsSafe;
    score data = checkSafety out = scored_cars / ILINK;
    title 'Predictions using PROC PLM';
run;

proc print data=scored_cars;
run;

Model Building and Scoring Prediction

Target marketing is a type of database marketing, which is marketing that uses customer data to improve sales promotions and product loyalty. In target marketing, a predictive model is based on historical customer data. The predictive model contains the inputs for customer attributes that are most likely to predict a customer's response to a promotion. Historical customer databases can also be used to predict who is likely to switch brands or cancel services (which is referred to as churn). Loyalty promotions can then be targeted at customers who are at risk of churn.

If data are partitioned:

PROC GLMSELECT DATA=training-data-set 
    VALDATA=validation-data-set;
    MODEL target(s)=input(s) </ options>;
RUN;

If not:

PROC GLMSELECT DATA=training-data-set 
                           <SEED=number>;
    MODEL target(s)=input(s) </ options>;
    PARTITION FRACTION(<TEST=fraction><VALIDATE=fraction>);
RUN;

The PARTITION statement requires a pseudo-random number generator to start the random selection process. The pseudo-random number generator requires a starting "seed," which must be an integer. If you need to be able to reproduce your results in the future, specify an integer greater than zero in the SEED= option. Then, whenever you run the PROC GLMSELECT step with that seed value, the pseudo-random selection process is replicated and you get the same results. If you specify an invalid value or omit the SEED= option, the seed is generated automatically from the time of day on the computer's clock. In most situations, it is recommended that you use the SEED= option and specify an integer greater than zero.

With GLM parameterization, one design variable is produced per level of each CLASS variable. We specify REF=FIRST so that, for each design variable, we can compare the output at the current level with the first level. By default, the reference level is the last level.
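For example, for a CLASS variable with the three levels A, B, and C, GLM parameterization creates three design variables:

Level    D_A    D_B    D_C
A         1      0      0
B         0      1      0
C         0      0      1

One of these is redundant given the intercept, which is why redundant design variables are subtracted from the parameter count below. With REF=FIRST, comparisons are made against level A rather than against the last level, C.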

CHOOSE=VALIDATE specifies that PROC GLMSELECT will select the best model based on the validation data. PROC GLMSELECT selects the model that has the smallest overall validation error.

Specifically, PROC GLMSELECT calculates the average squared error, which is the sum of the squared differences between the observed and predicted values, divided by the number of observations.
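In symbols, for $n$ observations with observed values $y_i$ and model predictions $\hat{y}_i$:

$$\text{ASE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$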

Training data set (model building) -> validation data set, the holdout sample (assessing model performance and selecting the model that generalizes best) -> test data set (scoring)


In [ ]:
proc print data = statdata.ameshousing3 (obs=10);
run;
proc print data = statdata.ameshousing4 (obs=10);
run;

Notice that the validation data set used one fewer observation for the analysis. This indicates that the validation data set has one more missing value than the training data set.

Total number of parameters = intercept + &interval variables + &categorical design variables = 1 + 8 + 34 = 43

Number of effects = intercept + 8 interval variables + 11 categorical variables = 1 + 8 + 11 = 20

Number of parameters in the model = total number of parameters - redundant design variables = 43 - 11 = 32

(Under GLM parameterization, each CLASS variable contributes one design variable per level, one of which is redundant given the intercept, so there are 11 redundant design variables, one per categorical variable.)

The SBC is assessed on the training data. In the Step 1 row, you can see why Season_Sold was removed: removing it reduced the SBC. Moving down the SBC column, we see that variables continue to be removed as long as the value of SBC continues to decrease. In the last row, for Step 8, the SBC value is followed by an asterisk, which marks the optimal value of the criterion.

Backward Selection Summary:

The next column reports the training average squared error (ASE), but this is not used to select the model. A smaller average squared error is better. Notice that as you remove variables from the model, the training average squared error tends to increase. However, this is not necessarily true for the validation data set, so the validation data set gives a better idea of how the model performs on data that were not used to build it. Remember that the model is chosen based on the validation average squared error.

Going down the column for the validation average squared error (the last column of Backward Selection Summary), we see that the validation average squared error continues to decrease until we get to Step 6, when Heating_QC is removed. At that point, the validation average squared error starts to increase. The model in Step 5, when Central_Air is removed, is marked with an asterisk. The asterisk indicates that this is the best model based on the validation data.

The validation average squared error values at Step 4 and Step 5 are relatively close to each other. However, the model at Step 5 has one fewer predictor variable, so it is a more parsimonious model. In the coefficient progression plot, the parameter estimates at the vertical line marking the selected step are the ones of interest.

Looking at the training data plot, notice that the ASE cannot go down as variables are removed; it can only go up, because backward elimination keeps removing terms from the model. For the training data, the ASE can go down only when variables are added.


In [ ]:
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area 
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;        /*8 variables*/
%let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces 
                 Season_Sold Garage_Type_2 Foundation_2 Heating_QC 
                 Masonry_Veneer Lot_Shape_2 Central_Air;              /*11 variables*/

ods graphics;

proc glmselect data=statdata.ameshousing3
               plots=all 
               valdata=statdata.ameshousing4;
   class &categorical / param=glm ref=first;
   model SalePrice=&categorical &interval / showpvalues
                  selection=backward
                  select=sbc 
                  choose=validate;    
   store out=work.amesstore;
   title "Selecting the Best Model using Honest Assessment";
run;

title;

In [ ]:
proc glmselect data = statdata.ameshousing3 seed = 1 ; *8675309;
    class &categorical / param=glm ref=first;
    model SalePrice=&categorical &interval / showpvalues
                  selection=stepwise
                  select=aic 
                  choose=validate;    
    partition fraction(validate=0.333333);
title "Selecting the Best Model using Honest Assessment";
run;
title;

When you score, you do not rerun the algorithm that was used to build the model. Instead, you apply the score code—that is, the equations obtained from the final model—to the scoring data.

Method 1 Use a SCORE statement in PROC GLMSELECT.

  • The first method is useful because you can build and score a model in one step. However, this method is inefficient if you want to score more than once or use a large data set to build a model. With this method, the model must be built from the training data each time the program is run.
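A minimal sketch of Method 1, reusing the Ames data sets and macro variables from the demonstration above (here the SCORE statement belongs to PROC GLMSELECT, so the model is refit every time the step runs):

In [ ]:
proc glmselect data=statdata.ameshousing3;
   class &categorical / param=glm ref=first;
   model SalePrice=&categorical &interval / selection=backward select=sbc;
   /* Build and score in one step */
   score data=statdata.ameshousing4 out=scored_ames;
run;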

Method 2 Use a STORE statement in PROC GLMSELECT. Use a SCORE statement in PROC PLM.

  • The second method enables you to build the model only once, along with an item store, using PROC GLMSELECT. You can then use PROC PLM to score new data using the item store. Separating the code for model building and model scoring is especially helpful if your model is based on a very large training data set or if you want to score more than once. Potential problems with this method are that others might not be able to use this code with earlier versions of SAS, or you might not want to share the entire item store.

Method 3 Use a STORE statement in PROC GLMSELECT. Use a CODE statement in PROC PLM to output SAS code. Use a DATA step for scoring.

  • The third method uses PROC PLM to write detailed scoring code, based on the item store, that is compatible with earlier versions of SAS. You can provide this code to others without having to share other information that is in the item store. The DATA step is then used for scoring.

In [ ]:
/*Note:*/
/*In your current SAS session, you must first run the code from the previous demonstration (shown above) before running the code below.
Replace my-file-path with the path to your course practice files.*/

/*Method 3*/
proc plm restore=work.amesstore; /*Usually in a permanent file*/
   score data=statdata.ameshousing4 out=scored;
   code file="/folders/myshortcuts/_myfolders/scoring.sas";
run;

/*Check log*/

data scored2;
   set statdata.ameshousing4;
   %include "/folders/myshortcuts/_myfolders/scoring.sas";
run;

proc compare base=scored compare=scored2 criterion=0.0001;
   var Predicted;
   with P_SalePrice;
run;

See SAS_Stat3 for the full code

SAS Statistics by Example


In [ ]:
title "Logistic Regression with One Categorical Predictor Variable";
proc logistic data=risk;
    class Gender (param=ref ref='F');
    model Heart_Attack (event='Yes')= Gender / clodds = pl;
run;
quit;

In [ ]:
title "Logistic Regression with One Continuous Predictor Variable";
proc logistic data=risk;
model Heart_Attack (event='Yes') = Chol / clodds = pl;
units Chol = 10;
run;
quit;

In [ ]:
proc format;
    value cholgrp low-200 = 'Low to Medium'
                  201-high = 'High';
run;
title "Using a Format to Create a Categorical Variable";

proc logistic data=risk;
    class Chol (param=ref ref='Low to Medium');
    model Heart_Attack (event='Yes') = Chol / clodds = pl;
    format Chol cholgrp.;
run;
quit;

In [ ]:
ods graphics on;
title "Using a Combination of Categorical and Continuous Variables";

proc logistic data=risk;
    class Age_Group (ref='1:< 60')
    Gender (ref='F') / param=ref;
    model Heart_Attack (event='Yes') = Gender Age_Group Chol /
    clodds = pl;
    units Chol=10;
run;
quit;
ods graphics off;

In [ ]:
ods graphics on;
title "Running a Logistic Model with Interactions";

proc logistic data=risk plots(only)=(roc oddsratio);
    class Gender (param=ref ref='F')
    Age_Group (param=ref ref='1:< 60');
    model Heart_Attack (event = 'Yes') = Gender | Age_Group |
    Chol @2 /
    selection=backward slstay=.10 clodds=pl;
    units Chol=10;
    oddsratio Chol;
    oddsratio Gender;
    oddsratio Age_Group;
run;
quit;

ods graphics off;

In [ ]:
*variables

Region
Advertising
Gender
Book_Sales
Music_Sales
Electronics_Sales
Total_Sales
;

proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
run;
data Store;
   length Region $ 5;
   call streaminit(57676);
   do Transaction = 1 to 200;
      R = ceil(rand('uniform')*10);
      select(R);
         when(1) Region = 'East';
         when(2) Region = 'West';
         when(3) Region = 'North';
         when(4) Region = 'South';
         otherwise;
      end;
      Advertising = rand('bernoulli',.6);
      if rand('uniform') lt .6 then Gender = 'Female';
         else Gender = 'Male';
      Book_Sales = abs(round(rand('normal',250,50) + 30*(Gender = 'Female')
                    + 30*Advertising,10)) ;
      Music_Sales = abs(round(rand('uniform')*40 + rand('normal',50,5)
         + 30*(Region = 'East' and Gender = 'Male')
         - 20*(Region = 'West' and Gender = 'Female'),5) + 10*Advertising);
      Electronics_Sales = abs(round(rand('normal',300,60) + 70*(Gender = 'Male')
       + 55*Advertising + 50*(Region = 'East') - 20*(Region = 'South') 
       + 75*(Region = 'West'),10));
      Total_Sales = sum(Book_Sales,Music_Sales,Electronics_Sales);
   output;
   end;
   drop R;
   format Book_Sales Music_Sales Electronics_Sales Total_Sales dollar9.
          Advertising yesno.;
run;
 
/*title "Listing of Store";*/
/*proc print data=store heading=h;*/
/*run;*/

/*proc univariate data=store;*/
/*   var Book_Sales -- Total_Sales;*/
/*   histogram;*/
/*run;*/
/**/
/*title "Scatter Matrix for Store Variables";*/
/*proc sgscatter data=store;*/
/*   matrix Book_Sales -- Total_Sales / group = Gender;*/
/*run;*/
/**/
/*proc sgplot data=store;*/
/*   scatter x=Book_Sales y=Total_Sales / group=Gender;*/
/*run;*/

proc rank data=store out=median_sales groups=2;
   var Total_Sales;
   ranks Sales_Group;
run;

proc format;
   value sales 0 = 'Low'
               1 = 'High';
run;

/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male');*/
/*   model Sales_Group = Gender;*/
/*   format Sales_Group sales.;*/
/*quit;*/
/**/
/*proc logistic data=median_sales order=formatted;*/
/*   class Gender(param=ref ref='Male')*/
/*         Advertising (param=ref ref='No');*/
/*   model Sales_Group = Gender Advertising;*/
/*   format Sales_Group sales.;*/
/*quit;*/

*Create test data set;
libname example 'c:\books\statistics by example';
data example.Blood_Pressure;
   call streaminit(37373);
   do Drug = 'Placebo','Drug A','Drug B';
      do i = 1 to 20;
         Subj + 1;
         if mod(Subj,2) then Gender = 'M';
         else Gender = 'F';
         SBP = rand('normal',130,10) +
               7*(Drug eq 'Placebo') - 6*(Drug eq 'Drug B');
         SBP = round(SBP,2);
         DBP = rand('normal',80,5) +
               3*(Drug eq 'Placebo') - 2*(Drug eq 'Drug B');
         DBP = round(DBP,2);
         if Subj in (5,15,25,55) then call missing(SBP, DBP);
         if Subj in (4,18) then call missing(Gender);
         output;
      end;
   end;
   drop i;
run;

/*title "Listing of the first 25 observations from Blood_Pressure";*/
/*proc print data=example.Blood_Pressure(obs=25) noobs;*/
/*   var Subj Drug SBP DBP;*/
/*run;*/

data exercise;
   call streaminit(7657657);
   do Subj = 1 to 50;
      Age = round(rand('normal',50,15));
      Pushups = abs(int(rand('normal',40,10) - .30*age));
      Rest_Pulse = round(rand('normal',50,8) + .35*age);
      Max_Pulse = round(rest_pulse + rand('normal',50,5) - .05*age);
      Run_Pulse = round(max_pulse - rand('normal',3,3));
      output;
   end;
run;

*Data set for a paired t-test example;
data reading;
   input Subj Before After @@;
datalines;
1 100 110  2 120 121  3 130 140  4 90 110  5 85 92
6 133 137  7 210 209  8 155 179
;

/*title "Listing of Data Set READING";*/
/*proc print data=reading noobs;*/
/*run;*/

*Data set that violates assumptions for a t-test;
data salary;
   call streaminit(57575);
   do Subj = 1 to 50;
      do Gender = 'M','F';
         Income = round(20000*rand('exponential') + rand('uniform')*7000*(Gender = 'M'));
         output;
      end;
   end;
run;
/*proc univariate data=salary;*/
/*   class Gender;*/
/*   id Subj;*/
/*   var Income;*/
/*   histogram Income;*/
/*run;*/

*Data set risk for logistic regression example;
proc format;
   value yesno 1 = 'Yes'
               0 = 'No';
run;

data Risk;
   call streaminit(13579);
   length Age_Group $ 7;
   do i = 1 to 250;
      do Gender = 'F','M';
         Age = round(rand('uniform')*30 + 50);
         if missing(Age) then Age_Group = ' ';
         else if Age lt 60 then Age_Group = '1:< 60';
         else if Age le 70 then Age_Group = '2:60-70';
         else Age_Group = '3:71+';
         Chol = rand('normal',200,30) + rand('uniform')*8*(Gender='M');
         Chol = round(Chol);
         Score = .3*chol + age + 8*(Gender eq 'M');
         Heart_Attack = (Score gt 130)*(rand('uniform') lt .2);
         output;
       end;
   end;
   keep Gender Age Age_Group chol Heart_Attack;
   format Heart_Attack yesno.;
run;

/*title "Listing of first 100 observations from RISK";*/
/*proc print data=risk(obs=100);*/
/*run;*/