Degradation-Aware Detection Of Schema-Breaking Changes In Evolving Machine Learning Systems: A Mutation-Driven Evaluation Framework
Keywords:
Schema-Breaking Detection; Machine Learning Reliability; Performance Degradation Analysis; Degradation-Aware Classification; Model Monitoring.Abstract
Machine learning systems operating in dynamic data environments are vulnerable to structural schema changes that can silently degrade predictive performance. Existing monitoring approaches primarily focus on distributional drift or syntactic schema validation, but lack a principled, performance-based criterion to determine whether a structural change leads to an operational failure. This paper addresses the gap by framing schema-breaking detection as a degradation-aware binary classification problem, where a schema change is considered “breaking” only if it results in measurable loss in prediction accuracy.
To support this formulation, a controlled mutation engine is developed, generating 75 schema transformation scenarios across three heterogeneous datasets comprising over 160,000 samples and 1,855 features. Five mutation types are considered: column removal, column addition, type transformation, range scaling, and cardinality reduction. Ground truth labels are assigned based on a relative accuracy degradation threshold of 0.10, yielding 36 breaking and 39 non-breaking cases.
The study evaluates four heuristic detectors and a Logistic Regression Schema Detector (LRSD) using leave-one-out cross-validation, with precision, recall, and F1-score as evaluation metrics. Among heuristic approaches, type-change detection performs best (mean F1 = 0.643), while structural difference detection shows moderate effectiveness (mean F1 = 0.515). The LRSD model achieves a significantly higher mean F1-score of 0.818 under cross-validation, demonstrating superior performance over heuristic methods. Statistical testing using McNemar’s test indicates no significant difference in error patterns between LRSD and type-change detection, suggesting their complementary strengths.
The findings highlight the importance of degradation-based labeling, robust evaluation protocols, and supervised learning approaches for effectively detecting schema changes that have real operational impact.




