Spectral Preprocessing: Smoothing and Derivatives
A raw near-infrared or Raman spectrum is rarely clean enough to feed directly into a quantitative model. Baseline offsets caused by scattering particles, lamp drift, and sample presentation variability sit on top of the genuine chemical signal, inflating prediction errors and masking subtle spectral features. Spectral preprocessing — a collection of mathematical transformations applied to the raw spectrum before modelling — separates the analytical signal from these physical artefacts. The K LAB MRX N1 NIR analyser applies preprocessing steps such as Savitzky-Golay smoothing and derivative transformations as standard components of its chemometric workflow.
Why Raw Spectra Need Preprocessing
In diffuse-reflectance NIR measurements, the physical state of the sample (particle size, packing density, moisture on the surface) causes multiplicative and additive scatter effects that are unrelated to composition. Two samples with identical chemistry can produce spectra with different baselines and slopes purely because of how they scatter light. If these physical variations are not removed, the calibration model has to describe scatter as well as concentration, which typically degrades accuracy and robustness across the sample set.
Savitzky-Golay Smoothing
Savitzky-Golay (SG) smoothing fits a low-order polynomial to a moving window of adjacent spectral points and replaces the centre point with the value of the fitted polynomial. The result is a smoothed spectrum in which high-frequency noise is suppressed while genuine broad absorption bands are preserved. The key parameters are the window width (the number of adjacent points, normally odd so that the window has a centre point) and the polynomial order: a wider window gives stronger smoothing but can broaden and attenuate narrow peaks; a higher polynomial order preserves peak shape better but suppresses less noise. Typical settings for NIR spectra are a window of 9 to 15 points with a second- or third-order polynomial.
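The moving-window fit described above can be written out literally in a few lines. The following is a minimal sketch using NumPy only, not the MRX N1 implementation: the function name sg_smooth, the synthetic Gaussian band and the noise level are all assumptions made for illustration, and the edge points are simply left unsmoothed.

```python
import numpy as np

def sg_smooth(y, window=11, polyorder=2):
    """Smooth y by fitting a polynomial to each moving window (interior points only)."""
    if window % 2 == 0 or window <= polyorder:
        raise ValueError("window must be odd and larger than polyorder")
    y = np.asarray(y, dtype=float)
    half = window // 2
    offsets = np.arange(-half, half + 1)
    smoothed = y.copy()  # edge points are left unsmoothed in this sketch
    for i in range(half, len(y) - half):
        # Least-squares polynomial fit to the points in the current window
        coeffs = np.polyfit(offsets, y[i - half:i + half + 1], polyorder)
        # Replace the centre point with the fitted polynomial value at offset 0
        smoothed[i] = np.polyval(coeffs, 0.0)
    return smoothed

# Synthetic example (assumed data): one broad Gaussian band plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 201)
clean = np.exp(-((x - 0.5) ** 2) / (2 * 0.05 ** 2))
noisy = clean + rng.normal(0.0, 0.02, x.size)
smoothed = sg_smooth(noisy, window=11, polyorder=2)

# Compare the error against the noise-free band, ignoring the unsmoothed edges
inner = slice(5, -5)
rms_before = np.sqrt(np.mean((noisy[inner] - clean[inner]) ** 2))
rms_after = np.sqrt(np.mean((smoothed[inner] - clean[inner]) ** 2))
print(rms_before, rms_after)
```

Refitting the polynomial at every point is slow but mirrors the definition. Library routines such as scipy.signal.savgol_filter compute the same centre-point estimate as a single convolution and also handle the edges.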
Derivative Transformations
Computing the first or second derivative of a spectrum is one of the most powerful tools for baseline correction and peak resolution. The first derivative eliminates additive baseline offsets (constant shifts in absorbance) and converts absorption maxima to zero-crossings flanked by positive and negative peaks, making it easier to resolve overlapping bands. The second derivative removes both additive and linear (sloping) baselines, turning absorption peaks into sharp negative troughs; overlapping bands that are invisible in the raw spectrum often separate into distinct features in the second derivative.
In practice, differentiation amplifies high-frequency noise, which is why smoothing (e.g., Savitzky-Golay) is almost always applied simultaneously. The Savitzky-Golay algorithm can compute smoothed derivatives in a single step by differentiating the fitted polynomial, making it the standard implementation in chemometric software.
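As an illustration of the single-step smoothed derivative, the sketch below derives the Savitzky-Golay weights from a least-squares polynomial fit and applies them to a synthetic band. It assumes evenly spaced points, uses NumPy only, and returns interior points only; the function name sg_derivative and the test spectra are hypothetical, chosen to show that the first derivative cancels a constant offset and the second derivative cancels an offset plus a linear slope.

```python
import numpy as np
from math import factorial

def sg_derivative(y, window=11, polyorder=2, deriv=1, delta=1.0):
    """Smoothed derivative of y via Savitzky-Golay weights (interior points only)."""
    if window % 2 == 0 or window <= polyorder or deriv > polyorder:
        raise ValueError("need an odd window with window > polyorder >= deriv")
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Design matrix of the polynomial fit: column j holds offsets**j
    A = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse maps the window values to the fitted
    # coefficient c_deriv; the derivative at the centre is deriv! * c_deriv,
    # rescaled by the point spacing delta.
    weights = np.linalg.pinv(A)[deriv] * factorial(deriv) / delta ** deriv
    # Slide the weights along the spectrum; 'valid' drops `half` points per edge
    return np.correlate(np.asarray(y, dtype=float), weights, mode="valid")

# Synthetic example (assumed data): the same band with two different baselines
x = np.linspace(0.0, 1.0, 201)
dx = x[1] - x[0]
band = np.exp(-((x - 0.5) ** 2) / (2 * 0.05 ** 2))
with_offset = band + 0.3            # additive baseline offset
with_slope = band + 0.3 + 0.8 * x   # offset plus linear slope

d1_band = sg_derivative(band, deriv=1, delta=dx)
d1_offset = sg_derivative(with_offset, deriv=1, delta=dx)
d2_band = sg_derivative(band, deriv=2, delta=dx)
d2_slope = sg_derivative(with_slope, deriv=2, delta=dx)

print(np.allclose(d1_band, d1_offset))  # does the 1st derivative ignore the offset?
print(np.allclose(d2_band, d2_slope))   # does the 2nd derivative ignore offset and slope?
```

Because the weights come from the fitted polynomial, smoothing and differentiation happen in the same step, which is the property the text relies on. The output is shorter than the input by window - 1 points, so any wavelength axis needs to be trimmed to match.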
Normalisation and Scatter Correction
Beyond smoothing and derivatives, NIR chemometrics commonly uses normalisation methods to correct for multiplicative scatter effects. Standard Normal Variate (SNV) scaling subtracts the mean of each spectrum and divides by its standard deviation, removing both additive and multiplicative scatter in one step. Multiplicative Scatter Correction (MSC) regresses each spectrum against a reference (typically the mean spectrum of the calibration set) and corrects for the slope and intercept of that regression. The MRX N1 supports these preprocessing options as part of its PLS and OPLS calibration pipeline, helping the model to learn chemical variance rather than physical presentation variance.
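Both corrections are short enough to sketch directly. The code below is an illustrative NumPy version under stated assumptions, not the MRX N1 implementation: spectra are stored one per row, the function names snv and msc are hypothetical, and the three test spectra are the same synthetic band with different made-up offsets and scale factors.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        # Default reference: the mean spectrum of the supplied set
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Least-squares fit: row is approximately intercept + slope * reference
        slope, intercept = np.polyfit(reference, row, 1)
        # Undo the fitted offset and scale
        corrected[i] = (row - intercept) / slope
    return corrected

# Synthetic example (assumed data): identical chemistry, different scatter
axis = np.linspace(0.0, 1.0, 100)
chemistry = np.exp(-((axis - 0.4) ** 2) / (2 * 0.08 ** 2))
spectra = np.vstack([
    0.00 + 1.0 * chemistry,
    0.20 + 1.5 * chemistry,
    -0.10 + 0.7 * chemistry,
])

snv_out = snv(spectra)
msc_out = msc(spectra)

# Largest remaining spread between the three spectra at any wavelength
print((snv_out.max(axis=0) - snv_out.min(axis=0)).max())
print((msc_out.max(axis=0) - msc_out.min(axis=0)).max())
```

One design difference is worth noting: SNV uses only the spectrum itself, whereas MSC depends on a reference. When MSC is used in a calibration, the reference would normally be stored from the calibration set and reused for new samples.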
Selecting the Right Preprocessing Strategy
There is no universal best preprocessing combination. The correct choice depends on the dominant source of spectral variability in your sample set. As a starting point:
- Noisy spectra with flat baselines: SG smoothing alone.
- Additive baseline drift: first derivative (with SG smoothing).
- Sloping baselines: second derivative (with SG smoothing). Curved baselines are reduced by the second derivative but not removed entirely.
- Particle-size or surface scatter: SNV or MSC, optionally combined with a derivative.
Always validate the effect of preprocessing on model performance using cross-validation (e.g., leave-one-out or k-fold) rather than calibration-fit metrics alone. A preprocessing step that lowers RMSECV (root mean square error of cross-validation), and ideally also the prediction error on an independent test set, is likely to be genuinely beneficial; one that merely improves the calibration fit may be overfitting. The MRX N1 chemometric engine provides cross-validation statistics to guide this choice, so that preprocessing decisions rest on evidence rather than habit.
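The comparison can be sketched as a small k-fold loop that reports RMSECV for each candidate preprocessing. This is an illustration under explicit assumptions rather than the MRX N1 engine: ridge regression stands in for PLS to keep the example NumPy-only, the function name rmsecv and its parameters are hypothetical, the data are synthetic, and the preprocessing is assumed to act on each spectrum independently so that applying it before splitting leaks nothing between folds.

```python
import numpy as np

def rmsecv(X, y, preprocess=None, n_folds=5, ridge=1e-3, seed=0):
    """k-fold RMSECV using ridge regression as a stand-in for PLS.

    `preprocess` must act on each spectrum (row) independently.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if preprocess is not None:
        X = preprocess(X)
    order = np.random.default_rng(seed).permutation(len(y))
    squared_errors = []
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        # Centre with training statistics only
        x_mean = X[train_idx].mean(axis=0)
        y_mean = y[train_idx].mean()
        Xc = X[train_idx] - x_mean
        # Ridge solution of (Xc'Xc + ridge*I) b = Xc'(y - y_mean)
        b = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(X.shape[1]),
                            Xc.T @ (y[train_idx] - y_mean))
        # Predict the held-out samples and collect their squared errors
        pred = (X[test_idx] - x_mean) @ b + y_mean
        squared_errors.append((pred - y[test_idx]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))

# Synthetic example (assumed data): one band scaled by concentration,
# plus a random additive offset and a little noise per sample
rng = np.random.default_rng(1)
n_samples, n_points = 40, 30
axis = np.linspace(0.0, 1.0, n_points)
band = np.exp(-((axis - 0.5) ** 2) / (2 * 0.1 ** 2))
conc = rng.uniform(0.0, 1.0, n_samples)
offsets = rng.normal(0.0, 0.2, (n_samples, 1))
X = conc[:, None] * band + offsets + rng.normal(0.0, 0.01, (n_samples, n_points))

candidates = {
    "raw": None,
    "first difference": lambda S: np.diff(S, axis=1),  # crude derivative
}
results = {name: rmsecv(X, conc, preprocess=f) for name, f in candidates.items()}
for name, value in results.items():
    print(f"{name}: RMSECV = {value:.4f}")
```

Which candidate scores lower depends on the data, which is the point of running the comparison. Preprocessing that uses information from other samples, such as MSC with a mean-spectrum reference, would need to be fitted inside each fold on the training samples only.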
