## Principal Component Analysis in SAS

Many statistical procedures test specific hypotheses.  Principal Component Analysis (PCA), Factor analysis, Cluster Analysis, are examples of analyses that explore the data rather than answer a specific hypothesis.  PCA examines common components among data by fitting a correlation pattern among the variables.  Often used to reduce data from several variables to 2-3 components.

Before running a PCA, one of the first things you will need to do is to determine whether there is any relationship among the variables you want to include in a PCA.  If the variables are not related then there’s no reason to run a PCA.  The data that we will be working with is a sample dataset that contains the 1988 Olympic decathlon results for 33 athletes.  The variables are as follows:

athlete:  sequential ID number
r100m:  time it took to run 100m
longjump:  distance attained in the Long Jump event
shotput:  distance reached with ShotPut
highjump:  height reached in the High Jump event
r400m: time it took to run 400m
h110m:  time it took to run 110m of hurdles
discus:  distance reached with Discus
polevlt:  height reached in the Pole Vault event
javelin:  distance reached with the Javelin
r1500m:  time it took to run 1500m

Let’s start with a PROC CORR to review the relationships among the variables.

Proc corr data=olympic88;
Run;

By reviewing the output found here,  we can see that there are a number of significant relationships suggesting that a PCA will be a valuable method of reducing our data from 10 variables to 2 or 3 components.

There are a few different PROCedures available in SAS to conduct a PCA.  My preferred PROC is PRINCOMP – short for principal components.  Let’s start with the basic syntax:

Proc princomp data=olympic88;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin                    r1500m;
Run;

The output starts with the same correlation matrix we created using the PROC CORR.  Although you’ll notice that there are no p-values available here.  Although we see the correlations we do not know whether they are significantly different from 0 or not.  We also have the Simple Statistics available – Means and Standard Deviations.

Our next table is a table of the eigenvalues for each component.  So let’s step back and talk about what this analysis is really doing for us.

Imagine my cloud of data – throw all the data in the air – there is variation due to the different events and there is variation within each event attributed to the performance of the different athletes.  Our goal with PCA is to reduce our data from the 10 events down to 2 or 3 components that represent these 10 variables (events).  Back to my cloud of data – PCA will draw a line through all the data that explains the most variation possible with that one line – or arrow if you want to visualize this.  That will be the first component.  PCA analysis will then go back to the cloud of data and draw a second line through the data that explains the next “most” variation – 2nd component, and it will continue to do this until there is no variation left.  If you have 10 variables, you will have 10 components to explain all the variation.  Each component explains a different amount of variation.  The first will explain the most, the second lesser, and so on.  PCA will provide you with eigenvalues which are translated to an amount of variation.

Now, each component will be made up of bits and pieces of the 10 variables – we will see these as the weightings within each Component or Eigenvector.  The most challenging part of PCA, will be the definition of the components.  It’s fine to say we have 2 components, but there’s more value in trying to define what the components represent.  For this, you will use the weightings within each eigenvector or component.

Now that we have a better feeling for what is happening when we run this analysis, I said earlier that our goal is to cut down from 10 variables to 2 or 3 components.  How do we do this?  During the analysis, we will see a SCREE PLOT – we want to use this as a guide to help determine how many components we will use.  Where the elbow appears in the scree plot, there is where we cut off the number of components to use.  Yes, a subjective decision.  We will also use the % variation explained by the components as a guide and aid to support our decision.

If we look at our current output – let’s first scroll down to the scree plot.  notice it shows the principal component number on the x-axis and the eigenvalue on the y-axis.  Also note that the elbow in the curve happens around the 3rd component.  If you look up at the table that shows you the proportion of the variation explained by the components we see that Component #1 explains 34%, Component #2 explains 26%, and Component #3 explains 9%.  Given that drastic drop between Component 2 and 3, I would select working with only the first 2 components.  Subjective decision!!!  But be able to back it up.  In this case, the first two explain a total of 60% of the variation seen across the 10 events, and as I look down to component #4 it also explains 9%.  So rather than trying to explain why I didn’t include component #4 and keeping component #3 when they are so  similar, I decide to cut it at 2 components.

The next challenge is trying to “define” the 2 components.  To attempt this, look at the table with the weightings for each component.

For component #1 we see a nice split of events with all the running events holding a -ve weighting and all other events a +ve weighting.  This component could be viewed as representing the running ability of the decathletes.

For component #2 the events with the lowest values are longjump, highjump, h100m – 3 jumping or height events.  It has been suggested that this component may be represented as the strength or endurance ability of the decathletes.  Very subjective again!!

This is the basis of PCA – however, in our output there is one particular output that many associate with PCA that does not accompany the default settings for PRINCOMP.  We want to see the component plots.  In SAS we need to specify these plots.

What I recommend, run the analysis as we have above to determine how many components you want to work with first.  Once you’ve decided then you will specify this in the following code to obtain the plots of interest.  In older versions of SAS you will need to turn the ODS graphics option on and off to take advantage of the advanced graphing abilities of SAS.  Please note that I have NOT tested this in SAS Studio!!!

ods graphics on;
Proc princomp data=olympic88 n=2 plots(ncomp=3)=all;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
RUn;
ods graphics off;

The above code will result in the following output.  One of the first things you will notice is that by adding the n=2 option in the Proc statement we are telling SAS to only calculate the first 2 components.  The plots(ncomp=3)=all produces all the plots following the Scree Plot.

You can use the Component Pattern Profile plot and the component plots to help you define what the components represent.

This is a fun and very straightforward example of how to use PCA with your data.

## Reproducibility

But this isn’t always true in publications.  There are a number of thoughts as to why this may happen:

• word limitations – should the words be used in the materials and methods, or should we reserve them for our Results and Discussions?
• uncomfortable talking about the statistical analysis – lack of knowledge or too much knowledge?
• lack of confidence – so lets skip it, nobody is going to read it?

## Examples – can you replicate the following studies?

Read through the Materials and Methods section of these papers and decide whether you have enough information to replicate the study.

## Goals of writing:

Our readers should be able to determine:

1. that the analysis fits the objective stated for your experiment or trial
2. that your research methodology and data collection processes match the analyses
3. that your data management processes ensure data quality

## Checklist:

• Objectives clearly stated
• Materials and Methods:
• Identify how you selected your experimental units and your sample size
• State your experimental design, be sure to include number of replicates
• Are you using your raw data?
• Are you using derived variables?  If so, what are they?
• Are you using transformed variables?  If so, how and why were they transformed?
• State your method of analysis – if you say something like “model was produced using stepwise regression” – this is a flag for a statistical reviewer!!  Provide your reader with clear directions
• State the software that was used.

• If you have several trials, consider combining them rather than reporting several single trials.
• Remember that you must account for your error variances if they are heterogenous.
• If the data was transformed for your statistical analysis, back transform your results for presentation in the publication
• When your Null hypothesis is NOT rejected (in other words, when there are no differences observed), report the observed p-value.  As an example p=0.35
• No Significant difference is correct and accepted in publications.  Do NOT say the difference was non-significant
• Report means with their standard errors.

### Can you replicate the following trial?

Control of glyphosphate-resitant Canad fleabane in soybean with preplant herbicides (2017)

## Tackling an analysis using GLIMMIX

So, you have some data and you want to analyze it using Proc GLIMMIX.  You have some data which you’ve collected and have a few treatments which you’d like to compare.  So how do you start this?

My goal is to provide steps to tackle these types of analyses, whether you are working with weed data, or animal data, or yield data.  I suspect I’ll be updating this post as we clarify these steps.

## First Step – your experimental design

Ah yes!  Despite popular belief you DO have an experimental design!  Find it or figure it out now before you go any further.  Why?  Because your model depends on this!  Your analysis comes down to your experimental design.

## Second Step – build your MODEL statement

You know what your outcome variable is, you know what your experimental design is, which means you know what factors that you’ve measured and whether they are fixed or random.  So…  you now know the basis of your MODEL statement and your initial RANDOM statement.

## Third Step – expected distribution of your outcome variable

You already know whether your outcome variable comes from a normal distribution of not.  Chances are it is not, but what is it?  Check out the post on Non-Gaussian Distributions to get an idea of what distribution your outcome variable may be.  Think of it as the starting point.

Add this distribution and the appropriate LINK to the end of our MODEL statement.

## Fourth Step – run model and check residuals

Remember that when we run the Proc GLIMMIX – we need to check our assumptions – the residuals!  How do they look?  How’s the variation between your fixed effect levels?  Homogeneous or not?  Are the residuals evenly distributed?  Are the residuals normally distributed?

## Fifth Step – residuals NOT normally distributed

Is there another LINK for the DISTribution that you selected?  If so, please try it.

## Sixth Step – fixed treatment effects not homogeneous

Now the fun begins.  To fix this one, we need to add a second RANDOM statement – essentially telling SAS that we need to it to use the variation of the individual treatment levels rather than the residual variation.  As an example, a RANDOM statement, for a design that has a random block effect, would be as follows:

RANDOM _residual_ / subject = block*treatment group=treatment;

## Seventh Step – try another distribution

Now – we do NOT want you trying ALL the distributions possible – this just doesn’t make sense.  Remember you need to think back to the distribution possibilities for our outcome variable.  Please use the link provided in Step 3 as a guide.  However, one distribution I have discovered works for many situations is the lognormal distribution.  At the end of your model statement you would add / DIST=lognormal LINK=identity.

Another option is to transform the data in the GLIMMIX procedure.  The one transformation that researchers like is the arcsine square root transformation.  To try this one please use the following code.

Proc GLIMMIX data=first;
trans = arsin(sqrt(outcome));

model trans = …;

Run;

## Last Step – results will not always be perfect!

You will do the best that you can when analyzing your data.  But please recognize that you may not be able to match all the assumptions everytime.  Go back, review your data, review your experimental design, to ensure you have the correct proc GLIMMIX coding.

As I’ve noted earlier, as we continue to learn more about GLIMMIX this post will probably be updated to include and/or refine these steps.

## Ridgetown – Workshop data – November 6, 2017

We’ll continue to work with the data that Brittany provided me earlier this semester.  I’d like you all to try developing a model for the CONTROL_56D data.  Using GLIMMIX, try to find the best fitting model.

On Monday, November 6, we will work together and discuss how everyone approached this challenge.

Please download the data and program here.  Since I cannot link to .sas files, I have provided you with the PDF file.  You’ll need to copy and paste the contents into a SAS editor and work from there.

Also note that I will be available for one-on-one consultations in the afternoon of Monday, November 6, 2017.  To book a timeslot, please visit:  http://rt_oacstats.youcanbook.me

## Data and Project Background

For this workshop, we’ll work through a dataset from the very beginnings through to the analysis and interpretation.  Along the way I’ll highlight challenges and possible ways to overcome them.

The dataset was provided by a research team at Ridgetown College and contains some fictional data from a classic weed trial experimental design.  The data was collected in two years and across 6 different fields or environments.  The treatments consisted of 2 controls (a weedy control and a weed free control), 5 different pre-emergence herbicides, and 4 herbicide combinations, for a total of 11 treatments.  The data was collected in 4 different blocks of a field in each of the 6 different fields.  Variables collected include:

• % injury at 14 days of crop
• % injury at 28 days of crop
• % injury at 56 days of crop
• Density of crop
• Dry weight of crop
• Yield of crop

The objective of this research is to see if herbicides in a tank mix (combinations) are more efficacious than when they are applied alone.  The hypothesis is that the control of pests will increase with the addition of a second herbicide.

## Cleaning the Data

I often make the following statement during many of my consultations and during my workshops:  “The statistical analysis portion of any project takes very little time.  It is the data cleaning and the interpretation after the analysis that can be very time consuming.”  No matter how well you collect your data, there will always be some level of data cleaning involved.  Many of us use Excel to enter our data – which is great!  We create worksheets that are easy for us to enter data and that are easy for us to read.  We use symbols such as % and may add documentation to the worksheet that is important to the collection phase of the trial.  Information such as the dates collected, comments about the collection processes, units of measure for our variables, and coding information.  All of this information is ESSENTIAL!  However, we will need to remove it before moving our data into any statistical package for analysis.

I recommend you create a NEW worksheet for the data to be used in SAS, SPSS, Stata, R, etc…  First step is to copy the sheet where you collected the data.  Rename it to something like SAS_ready or just SAS_data.  Then remove ALL formatting from the sheet.  To do this, in Excel, Clear -> Clear formats.  Any colours, shading, bold, underlining, etc…  will disappear – this is what you want!  Please SAVE your file!

Next step is to remove any extra lines you may have at the top of your worksheet.

Last thing – save your Excel file!  Then save the SAS worksheet as a CSV file – comma separated value file.  For anyone working on a Mac and using the Free University SAS edition, please save the file as a Windows CSV.  Please note – this is another great habit to get into.  You now has a Master Data file in Excel along with a working data file in a CSV format.  Also note that a CSV format is non-proprietary and a great format for preserving and sharing your data at a later date.

We will be using the INFILE statement to bring our data into SAS.  Yes you can copy and paste the data into the program, however, I prefer using the INFILE.  WHY?  If at some point you need to make a change to your data, you will make that change in the original Excel file.  Then you’ll save that file, and save it as the CSV.  When you run SAS, it calls upon that file you just saved and changed.  You do not need to remember what SAS program you copied the data to, where the change needs to be made, etc…  You have a MASTER copy of your data, and a copy that you are using to run your analysis against.  It’s just a cleaner and easier way to work with your data.

Remember how you copied all the labels that you had in your original Excel file to a new worksheet?  Well, let’s bring all this information into SAS now.

### Labels

Labels in SAS are essentially longer titles or descriptions of your variable names.  For example, I may use the variable name injury_14d.  Many of you in this research field could take a pretty good guess as to what this means.  But, we should not be guessing, we should have access to this information as we work with the data.  The correct label or description of this variable would be:  Percent injury, 14 days after application.  This is fairly short and we could expand this to be more comprehensive, but we don’t want to clutter our output with extremely LONG labels, and this one will help the reader understand what we mean by injury_14d.

When I created the variable names that I chose to use for this workshop and data, I listed them all in a separate worksheet in my Excel file and listed their labels as well.  My worksheet looked like this:

 ID Field identifications within the years Trmt Treatment Block Block PLOT Plot injury_14d Percent injury, 14 days after application control_28d Percent control, 28 days after application control_56d Percent control, 56 days after application density Density, count per metre-squared drywt Dry weight, grams per metre-squared yield Yield, bushels per acre

To add this to my SAS dataset, the first thing I need to note, is that whenever I touch the data, it has to happen in a DATA step.

Data rt;
set rt;
label ID = “Field identifications within the years”
Trmt = “Treatment”
Block = “Block”
Plot = “Plot in the field”
injury_14d = “Percent injury, 14 days after application”
control_28d = “Percent control, 28 days after application”
control_56d = “Percent control, 56 days after application”
density = “Density, count per metre-squared”
drywt = “Dry weight, grams per metre-squared”
yield = “Yield, bushels per acre”;
Run;

This is a way to avoid a lot of typing or copying and pasting.  The other way you could accomplish the same thing is to use the following code:

Data rt;
set rt;
label ID = “Field identifications within the years”;
label Trmt = “Treatment”;
label Block = “Block”;
label Plot = “Plot in the field”;
label injury_14d = “Percent injury, 14 days after application”;
label control_28d = “Percent control, 28 days after application”;
label control_56d = “Percent control, 56 days after application”;
label density = “Density, count per metre-squared”;
label drywt = “Dry weight, grams per metre-squared”;
label yield = “Yield, bushels per acre”;
Run;

Did you notice the difference?  In the first situation you have 1 label statement that includes all the labels.  This means ONLY one ; at the end.  The second sample coding you have 10 label statements, one for each variable.  That also means 10 ; .  Either way will produce the same results.  It’s your choice!

### Formats

Formats?  What do you mean by adding formats?  SAS calls value labels – formats.  For instance, in this example we have 11 different treatments and we’ve coded them from 1 through to 11.  Ideally we would like to add labels to our codes.  This way we don’t have to remember how we coded our variables, and/or we don’t need to keep that piece of paper where we wrote them down with us all the time.

To accomplish this in SAS, it’s a two-step process.  The first step is to create a “box” or a “format” in SAS that contains the codes/values and their labels.  For the purposes of this public-facing blog post I will make-up labels for the treatments used in this example.  Again this is something that I would have started or created in Excel that accompanies my data.

 1 Weedy Control 2 Weed Free Control 3 Herbicide #1 4 Herbicide #2 5 Herbicide #3 6 Herbicide #4 7 Herbicide #5 8 Herbicide combination #1 9 Herbicide combination #2 10 Herbicide combination #3 11 Herbicide combination #4

Proc format;
value trmt_code
1 = “Weedy Control”
2 = “Weed Free Control”
3 = “Herbicide #1”
4 = “Herbicide #2”
5 = “Herbicide #3”
6 = “Herbicide #4”
7 = “Herbicide #5”
8 = “Herbicide Combination #1”
9 = “Herbicide Combination #2”
10 = “Herbicide Combination #3”
11 = “Herbicide Combination #4”;
Run;

This piece of code will need to be run BEFORE you read in your data.  It will create a SAS format called trmt_code.

The second step is to apply the formats to the variable.  Remember, now we are touching the data, so this second step must be done within a Data step.

Data rt;
set rt;
label ID = “Field identifications within the years”;
label Trmt = “Treatment”;
label Block = “Block”;
label Plot = “Plot in the field”;
label injury_14d = “Percent injury, 14 days after application”;
label control_28d = “Percent control, 28 days after application”;
label control_56d = “Percent control, 56 days after application”;
label density = “Density, count per metre-squared”;
label drywt = “Dry weight, grams per metre-squared”;
label yield = “Yield, bushels per acre”;

Format trmt trmt_code.;
Run;

NOTE:  the SAS format you created needs a “.” at the end of the name in order for SAS to recognize it as a format.

Once you run this Data step, you will now have a dataset that contains labels for your variables (you will see these in your Results) and labels for the coded values you have for your variables (you will also see these in your Results).

### Proc Contents

When you add labels and formats, you will not see the effects of these until you run a PROC or analysis on your data.  Some of us, would like to ensure that this worked before we run any analyses.  A Proc Print will not show you the variable labels, but will should your formats.

You can always use Proc Contents to examine the characteristics of your dataset.

Proc contents data=rt;
Run;

A great little piece of code that can be used to debug your data at times.

## Using GLIMMIX

Now we have our data in SAS and we can see our labels and formats, onto the fun stuff – the analysis!

### Looking at Yield

Ah yes!  The dreaded GLIMMIX!  First thing to note, we are no longer running ANOVAs!  We are now running GLMM, Generalized Linear Mixed Models – this is a couple generations after ANOVAs.  When you were using Proc MIXED, you may not have been running ANOVAs either 🙂  This is important when you write up your statistical analysis for theses and journal articles.

So where do we start?  Go back to the research question, the hypothesis, and the experimental design.  Use all these pieces of information to create your model.  You don’t need SAS to do this part.  In theory, you can create this (and the SAS code) before you collect your data and while you are collecting your data.

What is the experimental unit?  This is the KEY piece of all statistical models.  As I look at this data, it appears to me that the PLOT is the experimental unit.  Remember the unit to which the treatment is applied to.  This is an RCBD with 4 blocks in each field*year – the 11 treatments were randomly assigned to each block.

I read in the data and called it rt.

Proc glimmix data=rt;  <- I always use the “data=” option to ensure that SAS knows what dataset to use
class trmt block plot;
model yield = trmt;
random block ;
covtest “block=0” 0 .;  <- this code is used to provide statistics to determine whether the random effects are significantly different than 0
lsmeans trmt/pdiff adjust=tukey lines;  <- the lines option replaces the %pdmix800 macro
output out=rep_resid predicted=pred residual=resid residual(noblup)=mresid student=studentresid student(noblup)=smresid;  <- produces all the different types of residuals for residual analysis
title “Yield results – differences between treatments”;
Run;

Initial output – PDF

### Contrasts vs LSMeans

LSMeans – using a Tukey’s p-value adjustment – is a means comparison.  What this  provides are comparisons between all the treatments.  So each treatment vs each treatment.

We adjust the p-values to reduce our chances of seeing a difference where it may not exist because we are doing so many comparisons at the same time – reducing the comparison-wise error.  If you look at your results, sometimes you will see changes in the p-values.  Be assured that you should be using the Tukey adjusted p-values.

Should we use LSMeans or should we use Contrasts to answer our questions?  Always go back to your research question.  With Contrasts we can ask questions about means of treatments.  Whereas you cannot do this with the LSMeans.

In this particular trial we may be interested in the mean of all treatments that were 1 herbicide vs all treatments that were a combination of 2 treatments – to do this – we need to use Contrasts.

For a treatment vs treatment comparison – an LSMeans with a Tukey adjusted p-value would be best.

### Estimate vs Contrast Statement

The coding for both estimate and contrast statements are identical with the exception of “estimate” and “constrast”.

contrast “Treatment 4 vs mean of all Double applications” trmt 0 0 0 4 0 0 0 -1 -1 -1 -1;
estimate “Treatment 4 vs mean of all Double applications” trmt 0 0 0 4 0 0 0 -1 -1 -1 -1;

The estimate statement calculates the difference between the mean of Treatment 4 and the mean of all Double applications.  The output table provides you with the estimate of that difference, the standard error of that difference along with a t-statistic and an associated p-value.  The t-statistic and p-value tell you whether the calculated difference is significantly different than 0.

The contrast  statement answers the question: Is there a statistically significant difference between the mean of Treatment 4 and the mean of all Double applications.  The output table provides the F-statistic and the associated p-value.

Notice that both output tables provide the same conclusion.  The estimate table provides the difference whereas the contrast table does not.

### Looking at injury_14d

First and foremost, we are looking at a different variable from the same experimental design.  THEREFORE, you should be using the same model.  This does not necessarily mean using the exact same code, since we already have a gut feeling or realize that the distribution of our dependent variable, injury_14d will not be the same as yield.  So we will need to make appropriate changes.

Let’s start with the same code, check the residuals to see where we are at.

Initial SAS output for Percent Injury at 14 DAA

First thing to note – the program ran with NO errors!  So?  We should use the output right?  We know we have the right model, and all we did was change the dependent variable, so everything should be fine.

## STOP!!!!

We need to check our residuals first!  So let’s run those and see what happens.

Yup!  Definitely a few challenges!  But all in all not too bad!  We can try to correct for the heterogenous variation of the treatments, but we know we have percentage data and our dataset isn’t THAT big where we’d be comfortable with a normal distribution.

For more information on the Residual Analysis coding and meaning check out the SASsyfridays Blog (Coming soon!)

For more information on the covtest coding and meaning check out the SASsyfridays Blog (Coming soon!)

For more information on back-transforming lognormal data coding check out the SASsyfridays Blog