Benchmark Datasets for Evaluating Robustness and Generalization in Machine Learning-Based Microstructure Analysis

Mueller, Martin

Machine learning (ML)-based microstructure analysis has gained significant attention in recent years. While AI methods successfully automate standard tasks, most published studies rely on carefully curated datasets acquired under controlled laboratory conditions with limited variability. As a result, the robustness and generalization capability of ML models under realistic experimental conditions remain largely unexplored.

In the project “Robustness of Machine Learning Models in Microstructure Analysis”, funded by the German Research Foundation (DFG reference number: Mu 959 57-1), we generated two benchmark datasets with systematically controlled experimental variability. Dataset 1 addresses the classification of martensite and bainite in light microscopy images of steel. Dataset 2 focuses on semantic segmentation of high-chromium cast irons based on scanning electron microscopy. Experimental variations include differences in sample preparation (e.g., etching), microscope type, and acquisition parameters. For light microscopy, parameters such as brightness, contrast, and aperture were varied; for scanning electron microscopy, detector type, acceleration voltage, dwell time, and other parameters were systematically varied. All parameters are provided as structured metadata linked to each image.
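As an illustration of how per-image metadata could support such studies, the following minimal sketch models one scanning electron microscopy record and groups records by an acquisition parameter. The field names, file names, and values are hypothetical placeholders; the actual metadata schema of the datasets is not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SemImageRecord:
    """Hypothetical metadata record for one SEM image (assumed schema)."""
    image_path: str
    detector: str                   # e.g. "SE" or "BSE" (placeholder values)
    acceleration_voltage_kv: float  # acceleration voltage in kV
    dwell_time_us: float            # dwell time in microseconds
    etching: str                    # sample preparation label (placeholder)


def group_by(records, field):
    """Group records by the value of one metadata field."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, field)].append(record)
    return dict(groups)


# Made-up example records, for illustration only.
records = [
    SemImageRecord("img_0001.tif", "SE", 5.0, 1.0, "etch_A"),
    SemImageRecord("img_0002.tif", "BSE", 15.0, 3.0, "etch_A"),
    SemImageRecord("img_0003.tif", "SE", 15.0, 1.0, "etch_B"),
]

by_detector = group_by(records, "detector")
for detector, items in sorted(by_detector.items()):
    print(detector, [r.image_path for r in items])
```

Grouping by a single field in this way is one simple route to building condition-specific subsets for training or evaluation.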

The datasets are currently undergoing final curation and will be made publicly available by the time of the conference. The presentation will introduce both datasets and demonstrate initial use cases enabling systematic robustness studies. Key research questions include: (i) whether a single generalized model can capture large experimental variability without a loss in performance, (ii) whether specialized models tailored to specific conditions outperform global models, (iii) to what extent experimental variability can be simulated via data augmentation, and (iv) whether incorporating metadata improves model robustness and performance. These datasets provide a foundation for reproducible and systematic investigation of robustness in ML-based microstructure analysis under realistic experimental conditions.
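One way questions (i) and (ii) might be approached is a leave-one-condition-out protocol combined with a per-condition accuracy breakdown. The sketch below is not the project's evaluation method; the sample dictionaries, the "microscope" key, and the predictions are made-up assumptions used only to show the bookkeeping.

```python
from collections import defaultdict


def leave_one_condition_out(samples, key):
    """Yield (held_out_value, train, test) for each value of a metadata key."""
    for held_out in sorted({s[key] for s in samples}):
        train = [s for s in samples if s[key] != held_out]
        test = [s for s in samples if s[key] == held_out]
        yield held_out, train, test


def accuracy_by_condition(samples, predictions, key):
    """Compute classification accuracy separately per metadata value."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        totals[sample[key]] += 1
        hits[sample[key]] += int(predicted == sample["label"])
    return {cond: hits[cond] / totals[cond] for cond in totals}


# Made-up samples and predictions, for illustration only.
samples = [
    {"microscope": "A", "label": "martensite"},
    {"microscope": "A", "label": "bainite"},
    {"microscope": "B", "label": "martensite"},
    {"microscope": "B", "label": "bainite"},
]
predictions = ["martensite", "bainite", "martensite", "martensite"]

per_condition = accuracy_by_condition(samples, predictions, "microscope")
splits = list(leave_one_condition_out(samples, "microscope"))
print(per_condition)
```

Comparing the held-out accuracy of a model trained on the remaining conditions with that of a model trained on the held-out condition itself would give a first indication of how well a global model generalizes.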