Principal Component Analysis in SAS

Many statistical procedures test specific hypotheses.  Principal Component Analysis (PCA), Factor Analysis, and Cluster Analysis are examples of analyses that explore the data rather than test a specific hypothesis.  PCA looks for common components among the variables by summarizing the pattern of correlations among them.  It is often used to reduce data from several variables to 2-3 components.

Before running a PCA, one of the first things you will need to do is to determine whether there is any relationship among the variables you want to include in a PCA.  If the variables are not related then there’s no reason to run a PCA.  The data that we will be working with is a sample dataset that contains the 1988 Olympic decathlon results for 33 athletes.  The variables are as follows:

athlete:  sequential ID number
r100m:  time it took to run 100m
longjump:  distance attained in the Long Jump event
shotput:  distance reached with ShotPut
highjump:  height reached in the High Jump event
r400m: time it took to run 400m
h110m:  time it took to run 110m of hurdles
discus:  distance reached with Discus
polevlt:  height reached in the Pole Vault event
javelin:  distance reached with the Javelin
r1500m:  time it took to run 1500m

Let’s start with a PROC CORR to review the relationships among the variables.

Proc corr data=olympic88;
Run;
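
Run without a VAR statement, PROC CORR correlates every numeric variable in the dataset, which would include athlete if the ID is stored as a number.  A minimal variant, assuming the variable names listed above, restricts the correlations to the ten events:

```sas
/* Sketch: correlate only the ten event variables.
   Assumes olympic88 is in the WORK library and uses the variable names listed above. */
proc corr data=olympic88;
  var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
run;
```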

Reviewing the output, we can see that there are a number of significant relationships, suggesting that a PCA will be a valuable method of reducing our data from 10 variables to 2 or 3 components.

There are a few different PROCedures available in SAS to conduct a PCA.  My preferred PROC is PRINCOMP – short for principal components.  Let’s start with the basic syntax:

Proc princomp data=olympic88;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

The output starts with the same correlation matrix we created using PROC CORR, although you’ll notice that there are no p-values available here: we see the correlations, but we do not know whether they are significantly different from 0.  We also have the Simple Statistics available: means and standard deviations.

Our next table is a table of the eigenvalues for each component.  So let’s step back and talk about what this analysis is really doing for us.

Imagine my cloud of data: throw all the data in the air.  There is variation due to the different events, and there is variation within each event attributed to the performance of the different athletes.  Our goal with PCA is to reduce our data from the 10 events down to 2 or 3 components that represent these 10 variables (events).  Back to my cloud of data: PCA will draw a line through all the data that explains the most variation possible with that one line (or arrow, if you want to visualize this).  That will be the first component.  PCA will then go back to the cloud of data and draw a second line, at a right angle to the first, that explains the next “most” variation: the 2nd component.  It will continue to do this until there is no variation left.  If you have 10 variables, you will have 10 components to explain all the variation.  Each component explains a different amount of variation: the first explains the most, the second less, and so on.  PCA will provide you with eigenvalues, which translate to the amount of variation each component explains.
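
To make the link between eigenvalues and variation concrete: PROC PRINCOMP analyzes the correlation matrix unless the COV option is specified, so each standardized variable contributes a variance of 1 and the eigenvalues add up to the number of variables.  As a sketch, with λ_k standing for the eigenvalue of component k and p for the number of variables (notation introduced here for illustration, not taken from the SAS output):

```latex
\[
\text{proportion explained by component } k
  = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}
  = \frac{\lambda_k}{p},
  \qquad p = 10 \text{ for the decathlon data}
\]
```

So an eigenvalue of 2, for instance, would correspond to 20% of the total variation across the 10 events.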

Now, each component will be made up of bits and pieces of the 10 variables.  We will see these as the weightings within each Component or Eigenvector.  The most challenging part of PCA will be defining the components.  It’s fine to say we have 2 components, but there’s more value in trying to define what the components represent.  For this, you will use the weightings within each eigenvector or component.
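
The weightings can also be applied to the data to give each athlete a score on each component.  A minimal sketch, assuming you want those scores saved: the OUT= option writes a copy of the input data with the component scores added, named Prin1, Prin2, and so on by default.  The dataset name pcscores is made up for this illustration.

```sas
/* Sketch: save the component scores for each athlete.
   pcscores is a hypothetical dataset name chosen for this example. */
proc princomp data=olympic88 out=pcscores;
  var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
run;

/* Each score is a weighted sum of the standardized event results,
   using the eigenvector weightings for that component. */
proc print data=pcscores;
  var athlete Prin1 Prin2;
run;
```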

Now that we have a better feeling for what is happening when we run this analysis, back to our goal of cutting down from 10 variables to 2 or 3 components.  How do we do this?  During the analysis, we will see a SCREE PLOT, which we use as a guide to help determine how many components to keep.  Where the elbow appears in the scree plot is where we cut off the number of components.  Yes, a subjective decision.  We will also use the % variation explained by the components as a guide to support our decision.

If we look at our current output, let’s first scroll down to the scree plot.  Notice it shows the principal component number on the x-axis and the eigenvalue on the y-axis.  Also note that the elbow in the curve happens around the 3rd component.  If you look up at the table that shows the proportion of the variation explained by the components, we see that Component #1 explains 34%, Component #2 explains 26%, and Component #3 explains 9%.  Given that drastic drop between Components 2 and 3, I would work with only the first 2 components.  Subjective decision!!!  But be able to back it up.  In this case, the first two explain a total of 60% of the variation seen across the 10 events, and as I look down to Component #4, it also explains 9%.  So rather than trying to explain why I kept Component #3 but not Component #4 when they are so similar, I decide to cut it at 2 components.
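
If you want just the pieces used for this decision, the following sketch requests only the scree plot and saves the eigenvalue table to a dataset.  It assumes ODS Graphics is available in your version of SAS; eigvals is a hypothetical dataset name, and Eigenvalues is, as far as I know, the ODS table name PROC PRINCOMP gives to the eigenvalue table.

```sas
ods graphics on;

/* Save the eigenvalue table (eigenvalues plus proportion and
   cumulative proportion of variation) to a dataset.
   eigvals is a hypothetical name chosen for this example. */
ods output Eigenvalues=eigvals;

/* plots(only)=scree asks for the scree plot and no other plots */
proc princomp data=olympic88 plots(only)=scree;
  var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
run;

ods graphics off;

/* Review the saved eigenvalue table */
proc print data=eigvals;
run;
```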

The next challenge is trying to “define” the 2 components.  To attempt this, look at the table with the weightings for each component.

For component #1 we see a nice split of events, with all the running events holding a negative weighting and all other events a positive weighting.  This component could be viewed as representing the running ability of the decathletes.

For component #2 the events with the lowest values are longjump, highjump, and h110m: 3 jumping or height events.  It has been suggested that this component may represent the strength or endurance ability of the decathletes.  Very subjective again!!

This is the basis of PCA.  However, there is one particular piece of output that many associate with PCA and that does not come with the default settings for PRINCOMP: the component plots.  In SAS we need to request these plots.

What I recommend: first run the analysis as we have above to determine how many components you want to work with.  Once you’ve decided, specify this in the following code to obtain the plots of interest.  In older versions of SAS you will need to turn the ODS graphics option on and off to take advantage of the advanced graphing abilities of SAS.  Please note that I have NOT tested this in SAS Studio!!!

ods graphics on;
Proc princomp data=olympic88 n=2 plots(ncomp=3)=all;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;
ods graphics off;

One of the first things you will notice in the output from the above code is that, by adding the n=2 option in the Proc statement, we are telling SAS to only calculate the first 2 components.  The plots(ncomp=3)=all option produces all the plots that follow the Scree Plot.

You can use the Component Pattern Profile plot and the component plots to help you define what the components represent.

This is a fun and very straightforward example of how to use PCA with your data.
