A data collection task, whether in science, business or engineering, typically involves many measurements made on many samples. Such multivariate data has traditionally been analyzed using one or two variables at a time. However, this approach misses the point; to discover the relationships among all samples and variables efficiently, we must process all of the data simultaneously. Enter chemometrics. Chemometrics is the field of extracting information from multivariate chemical data using tools of statistics and mathematics. Chemometrics is typically used for one or more of three primary purposes:
- To explore patterns of association in data;
- To track properties of materials on a continuous basis; and
- To prepare and use multivariate classification models.
The algorithms in primary use in the field have demonstrated a significant capacity for analyzing and modeling a wide assortment of data types for an even more diverse set of applications.
Exploratory Data Analysis
Patterns of association exist in many data sets, but the relationships between samples can be difficult to discover when the data matrix has more than three features. Exploratory data analysis can reveal hidden patterns in complex data by reducing the information to a more comprehensible form. Such a chemometric analysis can expose possible outliers and indicate whether there are patterns or trends in the data. Exploratory algorithms such as principal component analysis (PCA) and hierarchical cluster analysis (HCA) are designed to reduce large, complex data sets into a series of optimized and interpretable views. These views emphasize the natural groupings in the data and show which variables most strongly influence those patterns.
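As a rough illustration of this kind of data reduction, the Python sketch below computes PCA scores and loadings using NumPy's singular value decomposition. The data are synthetic (two made-up groups of samples measured on five variables) and the function name `pca_scores` is hypothetical; it is a minimal sketch under those assumptions, not a reference implementation.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples onto the leading principal components (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = Vt[:n_components].T                    # variable contributions
    explained = (s ** 2) / np.sum(s ** 2)             # fraction of variance per component
    return scores, loadings, explained[:n_components]

# Synthetic data (an assumption for illustration): two groups, five variables
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.3, size=(10, 5))
group_b = rng.normal(loc=2.0, scale=0.3, size=(10, 5))
X = np.vstack([group_a, group_b])

scores, loadings, explained = pca_scores(X)
print("Variance explained by each component:", explained)
print("First-component scores:", scores[:, 0])
```

Plotting the first column of `scores` against the second would give the kind of view in which the two groups separate, while `loadings` indicates which variables contribute most to each component.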
Continuous Property Regression
In many applications, it is expensive, time consuming or difficult to measure a property of interest directly. In such cases the analyst must predict the property from related measurements that are easier to make. The goal of chemometric regression analysis is to develop a calibration model that correlates the information in a set of known measurements with the desired property. Chemometric regression algorithms, including partial least squares (PLS) and principal component regression (PCR), are designed to avoid problems associated with noise and correlations in the data. Because these algorithms are based on factor analysis, the entire group of known measurements is considered simultaneously, and information about correlations among the variables is automatically built into the calibration model. Chemometric regression lends itself well to on-line monitoring and process control, where fast and inexpensive systems are needed to test, predict and make decisions about product quality.
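To make the idea of a calibration model concrete, here is a minimal principal component regression sketch in Python with NumPy. Everything about the data is an assumption made for illustration: the "spectra" are simulated mixtures of two invented Gaussian peak profiles, and the helper names `fit_pcr` and `predict_pcr` are hypothetical. Note that the training set has fewer samples (30) than variables (50), a situation where regressing on a few factors is useful.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Fit a principal component regression model (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                                   # mean-center the measurements
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                           # loadings for the retained factors
    T = Xc @ V                                        # factor scores for each sample
    q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)  # regress property on scores
    coef = V @ q                                      # coefficients in original variables
    return x_mean, y_mean, coef

def predict_pcr(model, X):
    """Predict the property for new measurements."""
    x_mean, y_mean, coef = model
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean

# Simulated data (assumption): 40 samples, 50 channels, two overlapping components
rng = np.random.default_rng(1)
channels = np.linspace(0, 1, 50)
pure_a = np.exp(-((channels - 0.3) ** 2) / 0.01)
pure_b = np.exp(-((channels - 0.6) ** 2) / 0.02)
conc = rng.uniform(0, 1, size=(40, 2))
spectra = conc @ np.vstack([pure_a, pure_b]) + rng.normal(scale=0.01, size=(40, 50))
y = conc[:, 0]                                        # property of interest

model = fit_pcr(spectra[:30], y[:30], n_components=2)  # calibrate on 30 samples
pred = predict_pcr(model, spectra[30:])                # predict the 10 held-out samples
rmse = np.sqrt(np.mean((pred - y[30:]) ** 2))
print("Prediction error (RMSE) on held-out samples:", rmse)
```

In practice the number of factors would be chosen by validation rather than fixed at two; two is used here only because the simulated data were built from two components.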
Classification Modeling
Many applications require that samples be assigned to predefined categories, or "classes". This may involve determining whether a sample is good or bad, or assigning an unknown sample to one of several distinct groups. A classification model predicts a sample's class by comparing the sample to a previously analyzed experience set, in which the categories are already known. k-nearest neighbor (KNN) and soft independent modeling of class analogy (SIMCA) are the primary chemometric workhorses. When these techniques are used to create a classification model, the answers provided are more reliable than those obtained by examining one or two variables at a time, and the model can reveal unusual samples in the data. In this manner, a chemometric system can be built that is objective and thereby standardizes the data evaluation process.
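The comparison of an unknown sample to an experience set can be sketched with a bare-bones k-nearest neighbor classifier in Python. The measurements and the "good"/"bad" labels below are invented for illustration, and the function name `knn_predict` is hypothetical; SIMCA, which builds a separate factor model for each class, is not shown.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new to the majority class among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the unknown sample to every known sample
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]               # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)      # tally the neighbors' classes
    return votes.most_common(1)[0][0]

# Invented experience set (assumption): two measurements per sample, known classes
X_train = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
           [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
y_train = ["good", "good", "good", "bad", "bad", "bad"]

print(knn_predict(X_train, y_train, [1.1, 1.0]))  # prints "good"
print(knn_predict(X_train, y_train, [2.9, 3.1]))  # prints "bad"
```

An odd `k` is used so that a two-class vote cannot tie. A sample whose nearest neighbors are all far away could be treated as unusual, although this sketch does not implement such a check.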
A very nice introduction to the use of chemometrics with spectroscopic data was written by Steve Brown and is presented on the SpectroscopyNow web site.
Chemometrics research is occurring at many locations around the globe. Visit some of these chemometrics sites to get a feel for the type of work that can be accomplished through a multivariate approach; you may find ideas that will benefit your own work.