# Principal Component, Cluster, and Discriminant Analyses

The goal of this workshop and blog post is to review 3 different multivariate analyses.  We will use one common dataset to showcase the different purposes of the analyses and to showcase the different PROCedures available in SAS to conduct each analysis.

The dataset we will be using is the Fisher Iris dataset (1936), originally collected by Dr. E. Anderson and used by Dr. Ronald Fisher to derive discriminant analysis.  The dataset contains measures of petal length, petal width, sepal length, and sepal width on 50 plants from each of 3 species of iris.  The dataset is available within SAS Help.  To access it, use the dataset name sashelp.iris.

## Exploratory and Explanatory Analyses

When we think of statistics, most of us tend to think of our traditional hypothesis-driven analyses: the ANOVAs, regressions, means comparisons, and the list goes on.  These are types of Explanatory Analyses.  There is another world of statistics, referred to by some as Exploratory Analyses – those analyses that are not driven by a hypothesis.  Exploratory analyses are used more for describing relationships among variables or measures that were taken during a trial or in a dataset.  Principal Component Analysis (PCA) and Cluster Analysis are two examples of exploratory analyses, whereas discriminant analysis falls into the explanatory analysis bucket.

### Principal Component Analysis (PCA)

Please review the PCA blog post for more details regarding this analysis.  This post will not provide the same level of detail but will form the basis of using the same dataset across three different analyses.

PCA dates back to 1901, when it was developed by Karl Pearson.  Its primary role is to reduce the number of variables used to explain a dataset.  Factor analysis (FA) is a related process and has the same goal.  Many people confuse these two and use the terms factor analysis and PCA interchangeably, but the two analyses are similar, not interchangeable.  I’ve listed a few of the primary differences between PCA and Factor analysis:

1. Both analyses begin with a correlation matrix.  PCA maintains the diagonal of the correlation matrix as 1’s, whereas Factor analysis replaces the diagonal with communality estimates, which provide a measure of the relationship of each variable with the others.
2. PCA explains the total variance among the variables, whereas in FA the common (shared) variance is the basis of the analysis
3. PCA is less complex mathematically compared to FA
4. PCA is one procedure, whereas FA is a family of procedures
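
To make the first two differences a little more concrete, here is a minimal sketch that runs both analyses on the iris measurements.  It assumes that a principal factor analysis with squared multiple correlations as the prior communality estimates (PRIORS=SMC) is an acceptable stand-in for "factor analysis"; other FA methods exist, and the choice of 2 factors is purely for illustration.

```sas
/* PCA: works from the correlation matrix with 1's on the diagonal */
proc princomp data=sashelp.iris;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Factor analysis (illustrative choice of method):            */
/* PRIORS=SMC replaces the diagonal with squared multiple      */
/* correlations, so only the shared variance is analysed.      */
/* NFACTORS=2 is an assumption made for this sketch.           */
proc factor data=sashelp.iris method=principal priors=smc nfactors=2;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Comparing the eigenvalue tables from the two procedures is one way to see the effect of changing the diagonal.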

#### SAS code and output using the IRIS dataset

/* For this workshop we will use the IRIS dataset */
/* Fisher’s dataset can be found in the SASHELP */
/* library. The dataset name is sashelp.iris */
/* SASHELP is a permanent SAS library */

Proc print data=sashelp.iris;
Run;

/* Let’s get a sense of relationships that may exist */
/* in the dataset. We will use PROC SGPLOT to visualize */

Proc sgplot data=sashelp.iris;
scatter x=SepalLength y=PetalLength; /* optional: add "/ datalabel=species" before the semicolon to label the points */
Run;

/* We will run the PROC PRINCOMP as we did in the */
/* previous workshop. Options we are using include */
/* plots=all to show all plots available in the PROC */
/* n = 3 – we will start without this option */
/* and then add it back to see only the 3 components */

ods graphics on;
Proc princomp data=sashelp.Iris standard plots=all n=3;
var SepalLength PetalLength SepalWidth PetalWidth;
Run;
ods graphics off;

Run the code above to view the output.  The output explanations are the same as those reviewed in the last post, only with a different dataset.
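
Since we will be comparing groups of plants in the next two analyses, it can help to save the component scores and plot them by species.  The sketch below is one way to do that; the output dataset name pcscores is my own choice, and Prin1 and Prin2 are the default names PROC PRINCOMP gives to the first two component scores.

```sas
/* Save the component scores to a new dataset (name is illustrative) */
proc princomp data=sashelp.iris standard out=pcscores noprint;
  var SepalLength PetalLength SepalWidth PetalWidth;
run;

/* Plot the first two components, coloured by species */
proc sgplot data=pcscores;
  scatter x=Prin1 y=Prin2 / group=Species;
run;
```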

### Cluster Analysis

Cluster analysis is a multivariate analysis that does not have a hypothesis.  We are interested in seeing whether there are any natural clusters or groups in the data.  Clusters can be based on either the variables (measures) collected in the dataset or the observations within the dataset.

Clustering techniques will use two processes: distances and linkages.  Being familiar with these terms may help you to select the most appropriate clustering technique for your data.

Distance:  quantitative index defining the similarities of the clusters remaining in the analysis at each step.

Linkage: two clusters that have the smallest distance between them as determined by particular distance measures are then linked together to form a new cluster.
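
To see these two ideas separately, here is a minimal sketch that first computes the distances and then applies a linkage.  It assumes PROC DISTANCE is available in your SAS installation; Euclidean distance and average linkage are illustrative choices only, and the dataset name dist is my own.

```sas
/* Step 1 - Distance: Euclidean distances between all pairs of plants, */
/* computed on standardized variables                                  */
proc distance data=sashelp.iris method=euclid out=dist;
  var interval(SepalLength SepalWidth PetalLength PetalWidth / std=std);
run;

/* Step 2 - Linkage: join the closest clusters step by step. */
/* METHOD=AVERAGE is one of several linkage methods.         */
proc cluster data=dist method=average;
run;
```

Changing METHOD= (for example to SINGLE, COMPLETE, or WARD) changes the linkage while the distances stay the same.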

Standardizing the variables to be used in a cluster analysis is essential.  Clustering techniques use some measure of “distance”, so ensuring that all the variables are on the same scale will give a better clustering.

There are 2 broad types of clustering techniques used in Cluster Analysis:

1. Hierarchical clustering.
Cluster or group a small to moderate number of cases based on several quantitative attributes.  The groups are clustered on a given set of variables, so when we talk about these clusters we can only discuss their merits based on the variables used to create the groups.  Remember context!
2. K-means.
Create clusters from a relatively large number of cases based on a relatively small set of variables.  K-means uses an iterative technique: cases are assigned to a cluster during the analysis rather than at the end, which allows some cases to shift around before the analysis is complete.  You also need to specify how many clusters you want as a result with K-means clustering.

#### SAS code and output using the IRIS dataset

In SAS there are 2 PROCedures that are commonly used for Cluster Analysis:

PROC Cluster and PROC Fastclus:

Directly from the SAS Online documentation:
“The CLUSTER procedure is not practical for very large data sets because, with most methods, the CPU time is roughly proportional to the square or cube of the number of observations. The FASTCLUS procedure requires time proportional to the number of observations and can therefore be used with much larger data sets than PROC CLUSTER. If you want to hierarchically cluster a very large data set, you can use PROC FASTCLUS for a preliminary cluster analysis to produce a large number of clusters and then use PROC CLUSTER to hierarchically cluster the preliminary clusters.”

When you create clusters – in any package – it is handy to calculate the means of the clusters and to run them through a Frequency analysis – essentially we want to be able to review some descriptive statistics on our new groups or clusters.  PROC Fastclus saves all of this information for us by default or as part of the PROC coding, whereas with PROC Cluster you need to add in a few extra steps.  Part of the coding that is commonly used includes a SAS Macro with PROC Cluster to run these descriptive statistics on our output clusters.

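However, rest assured that the same kind of descriptive output is available whether you use PROC Fastclus or PROC Cluster with the macro, although the clusters themselves can differ because the two procedures use different algorithms.  For simplicity, we will only use the PROC Fastclus syntax for our example.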
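
For readers curious about the hierarchical route, the extra steps might look something like the sketch below.  This is not the macro mentioned above, just a plain version of the same idea; Ward's method, the choice of 3 clusters, and the dataset names tree and treeclus are all my own assumptions for illustration.

```sas
/* Hierarchical clustering on standardized variables (STD option) */
proc cluster data=sashelp.iris method=ward std outtree=tree noprint;
  var SepalLength SepalWidth PetalLength PetalWidth;
  copy Species;
run;

/* Extra step 1: cut the tree at 3 clusters and save the assignments */
proc tree data=tree nclusters=3 out=treeclus noprint;
  copy Species SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Extra step 2: descriptive statistics on the new clusters */
proc freq data=treeclus;
  tables cluster*Species;
run;

proc means data=treeclus;
  class cluster;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;
```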

/* Cluster Analysis */
/* Creating 2 clusters, saving the results in a new dataset called CLUS */
/* Try a Proc PRINT to see what is found in the new dataset CLUS */
Proc fastclus data=sashelp.iris maxc=2 maxiter=10 out=clus;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;

/* Using the resulting dataset to get a feel for who landed in what cluster */
Proc freq data=clus;
tables cluster*species;
Run;

/* Creating 3 clusters, saving the results in a new dataset called CLUS */
Proc fastclus data=sashelp.iris maxc=3 maxiter=10 out=clus;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;

/* Using the resulting dataset to get a feel for who landed in what cluster */
Proc freq data=clus;
tables cluster*Species;
Run;

/* To obtain a graphical presentation of the clusters we need to run the */
/* Proc CANDISC to get the information needed for the graphical output */

Proc candisc data=clus anova out=can;
class cluster;
var SepalLength SepalWidth PetalLength PetalWidth;
title2 'Canonical Discriminant Analysis of Iris Clusters';
Run;

Proc sgplot data=Can;
scatter y=Can2 x=Can1 /group=Cluster ;
title2 'Plot of Canonical Variables Identified by Cluster';
Run;

Run the code above to view the resulting output.

Extra piece of SAS code.  If you need to standardize your variables before putting them into a Cluster analysis here is a sample piece of code that you can use:

/* If you need to standardize your variables – this is how you would do it */
Proc standard data=sashelp.iris out=iris mean=0 std=1;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;

/* Run a Proc PRINT to see what happened to your data and what changes happened */
Proc print data=iris;
Run;

/* Run a Proc MEANS to check whether the standardization worked or not */
Proc means data=iris;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
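
Once the variables are standardized, the clustering can be rerun on the new dataset.  The sketch below assumes the standardized dataset iris created above is still in your WORK library; the output name clus_std is my own choice.

```sas
/* K-means on the standardized variables, 3 clusters */
proc fastclus data=iris maxc=3 maxiter=10 out=clus_std;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Compare with the earlier, unstandardized cluster*species table */
proc freq data=clus_std;
  tables cluster*Species;
run;
```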

### Discriminant Function Analysis

As noted earlier, this analysis is not an exploratory analysis but an explanatory analysis.  In fact it is very similar to a multivariate ANOVA or MANOVA.  It does, however, have 2 distinct but compatible purposes:

1. To determine whether the characteristics used to define the groups hold true or not
2. To classify or predict the group membership of new observations based on the discriminant function.

So what does a discriminant function do?  Essentially it creates a weighted linear combination of the variables used in the analysis, which is then used to differentiate or classify observations into groups.  Logistic regression comes to mind when you define discriminant analysis; however, with logistic regression the predictors can be quantitative or categorical and the fitted curve is sigmoidal in shape.  Discriminant analysis can only use quantitative variables, and all the assumptions of a general linear model must be met.  So yes, that means residual analysis – normality, homogeneity of variances, ….

One of the biggest challenges with discriminant analysis is sample size!  The smallest group in your dataset MUST exceed the number of predictor variables by a “lot”.  Papers have suggested at least 5 X or at least 10 X.
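
A quick way to check this for your own data is to count the observations in each group and compare the smallest count with the number of predictors.  A minimal sketch for the iris data:

```sas
/* Count the plants in each species */
proc freq data=sashelp.iris;
  tables Species;
run;

/* Each species has 50 plants and we use 4 predictors, */
/* so the smallest group is 50 / 4 = 12.5 times the    */
/* number of predictors.                               */
```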

So, in the end discriminant analysis will essentially create a regression-like equation from your data that will “discriminate” observations into the groups defined by a classification variable in your dataset (Species, in our example).  Let’s look at the example to get a better feel for this.

#### SAS code and output using the IRIS dataset

/* Discriminant Analysis – Fisher’s Iris Data */
Proc discrim data=sashelp.iris anova manova listerr crosslisterr;
class Species;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;

Run the code above to view the resulting output.
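
The second purpose listed earlier, classifying new observations, can be sketched with the TESTDATA= and TESTOUT= options of PROC DISCRIM.  The two "new" plants below are made-up values for illustration only (in millimetres, to match sashelp.iris), and the dataset names newplants and scored are my own.

```sas
/* Hypothetical new plants with no species recorded */
data newplants;
  input SepalLength SepalWidth PetalLength PetalWidth;
  datalines;
51 35 14 2
63 28 51 15
;
run;

/* Build the discriminant function on the iris data and */
/* use it to classify the new plants                    */
proc discrim data=sashelp.iris testdata=newplants testout=scored;
  class Species;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* _INTO_ holds the predicted species for each new plant */
proc print data=scored;
run;
```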

### Conclusion

This has been a quick review of 3 different types of multivariate analyses using SAS and the same dataset.  Each analysis has a different purpose.  Please ensure that you use the most appropriate analysis for your research question!