What is discriminant analysis?
Discriminant analysis is a multivariate tool for classifying observations into categories of a dependent variable based on a set of independent variables. Starting from observations whose categories are already known, a model – a set of linear functions of the independent variables – is built and then used to predict the category of a new observation. It’s similar to multiple regression, since both techniques use two or more independent variables and a single dependent variable. However, whereas the dependent variable in multiple regression is continuous, in discriminant analysis it is categorical.
A discriminant analysis has several key functions:
- classifying observations into two or more categories, based on a sample of observations with known categories
- investigating how variables contribute to categorization
- predicting classification of future observations
There are two popular types of discriminant analysis:
- You use a linear discriminant analysis when you assume the covariance matrices are equal for all categories. The Mahalanobis distance – the squared distance from an observation to a group center – is the measure used to classify observations. The category with the smallest squared distance has the largest linear discriminant function value, so you classify the observation into that category. Linear discriminant analysis is the simpler and more common of the two.
- In a quadratic discriminant analysis, you don’t assume the covariance matrices are equal for all categories. It’s similar to linear discriminant analysis in that an observation is classified into the category with the smallest squared distance. But in this case, because each category has its own covariance matrix, the squared distance doesn’t simplify into a linear function.
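The classification rule both variants share – assign the observation to the category with the smallest squared (Mahalanobis) distance – can be sketched in a few lines of Python. All the numbers here are illustrative: the group means, the pooled covariance matrix, and the observation are made-up values, and using one pooled covariance matrix for every group reflects the linear (equal-covariance) assumption.

```python
# Hypothetical two-feature, two-category example: classify an observation into
# the category whose group center is closest in squared (Mahalanobis) distance.
means = {"A": [2.0, 3.0], "B": [6.0, 1.0]}   # illustrative group centers
pooled_cov = [[2.0, 0.5],
              [0.5, 1.0]]   # one shared covariance matrix = LDA assumption

def inv2x2(m):
    """Invert a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[ m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det,  m[0][0] / det]]

def sq_mahalanobis(x, mu, cov_inv):
    """Squared distance from observation x to the group center mu."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    return (d[0] * (cov_inv[0][0] * d[0] + cov_inv[0][1] * d[1])
            + d[1] * (cov_inv[1][0] * d[0] + cov_inv[1][1] * d[1]))

cov_inv = inv2x2(pooled_cov)
x = [2.5, 2.5]                                # new observation to classify
distances = {g: sq_mahalanobis(x, mu, cov_inv) for g, mu in means.items()}
predicted = min(distances, key=distances.get)  # smallest squared distance wins
print(predicted)                               # -> A
```

A quadratic discriminant analysis would follow the same outline, except that each category would invert its own covariance matrix instead of sharing the pooled one.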
Performing a discriminant analysis
When you’re ready to perform a linear discriminant analysis, you’ll find the general steps are similar to the steps for other multivariate methods:
- collect and organize sample data
- conduct an analysis of the data using statistical software
- analyze the results
Consider an example of a Six Sigma team at a widget manufacturer that wants to classify observations into one of three quality categories, based on earlier observations with known categories. The team builds a linear discriminant function (or equation) for each category; the function with the highest numerical value determines which category a new observation belongs to.
The independent variables the team identifies include material quantity, machine temperature, material temperature, density, pressure, and cooling time. Product quality is the dependent variable.
The Six Sigma team completed 30 observations, each measuring the six independent variables and the dependent variable – product quality. Based on the resulting dependent variable value, each observation was assigned to one of three product quality categories – A, B, or C. The team entered all the sample data and ran the discriminant analysis program. Based on the resulting discriminant functions, team members then predicted the categories of subsequent new observations.
Typically, statistical software returns a linear discriminant function table, with one column of values per category of the dependent variable and one row per independent variable (plus a constant). A linear discriminant function (equation) is created for each category, and the category whose function yields the highest value is chosen for an observation.
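As a hedged sketch of how such a table is used – the coefficient values and the observation below are invented for illustration, not real software output – each category's function is a constant plus one coefficient per independent variable, and the highest-scoring category wins:

```python
# Illustrative linear discriminant function table for the widget example.
# Each row: [constant, material_qty, machine_temp, material_temp,
#            density, pressure, cooling_time]  (made-up coefficients)
functions = {
    "A": [-60.0, 1.2, 0.8, 0.5, 20.0, 0.3, 1.5],
    "B": [-55.0, 1.0, 0.9, 0.4, 18.0, 0.4, 1.3],
    "C": [-50.0, 0.9, 0.7, 0.6, 16.0, 0.2, 1.1],
}

def score(coeffs, observation):
    """Evaluate a linear discriminant function: constant + sum(coef * x)."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], observation))

new_obs = [10.0, 40.0, 25.0, 1.2, 30.0, 12.0]   # six measured variables
scores = {cat: score(c, new_obs) for cat, c in functions.items()}
predicted = max(scores, key=scores.get)          # highest value wins
print(predicted)
```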
The team found that, for some observations, the actual classifications didn’t match the predicted classifications. When the true category doesn’t match the predicted one, the observation is misclassified by the discriminant model. In this example, the true category for one observation was A, but the discriminant analysis placed it in category C. This is because the software calculated the squared distance and probability for each category, and allocated the observation to the category with the lowest squared distance and highest probability.
The probability of 69.3% that the software reported for category C is the posterior probability – the probability that the observation belongs to that category, given the data. In the future, these probabilities are used to predict categories for observations whose categories aren’t known. This is quite similar to developing a regression line from past observations and later using it to predict values for future observations.
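One common way to convert squared distances into posterior probabilities is to weight each category by exp(−½ × squared distance) and normalize so the weights sum to one. The sketch below uses hypothetical squared distances (real software computes them from the data), just to show the shape of the calculation: the smallest distance yields the highest probability.

```python
import math

# Hypothetical squared distances from one observation to each group center.
sq_dist = {"A": 4.8, "B": 6.1, "C": 3.2}

# Weight each category by exp(-0.5 * distance), then normalize to sum to 1.
weights = {g: math.exp(-0.5 * d) for g, d in sq_dist.items()}
total = sum(weights.values())
posterior = {g: w / total for g, w in weights.items()}

# The category with the smallest squared distance gets the highest posterior.
best = max(posterior, key=posterior.get)
print(best)   # -> C
```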
To perform a linear discriminant analysis, you first divide your observations into known categories. You then use statistical software to run the analysis and develop a discriminant function for each category. These functions are used to predict categories for new observations. To validate the discriminant model and explain variability, you compare the predicted classifications to the actual classifications; any observation whose true category doesn’t match its predicted one counts as a misclassification.
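The validation step can be sketched as a simple comparison of predicted and actual category labels – the labels below are made up for illustration – counting mismatches to get a misclassification rate:

```python
# Illustrative validation: compare the model's predicted categories against
# the known (true) categories and compute the misclassification rate.
actual    = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"]
predicted = ["A", "C", "B", "B", "C", "C", "A", "A", "C", "A"]

mismatches = sum(1 for a, p in zip(actual, predicted) if a != p)
error_rate = mismatches / len(actual)
print(f"{mismatches} misclassified, error rate {error_rate:.0%}")
```

A low misclassification rate on observations with known categories gives some confidence in the discriminant functions before they are used on new observations.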