Sustainability-Driven Neural Network Compression For Efficient Large-Scale Model Serving
Keywords:
Neural Network Compression, Sustainable AI, Knowledge Distillation, Model Pruning, Quantization, Green Computing, Large-Scale Model Serving, Carbon Footprint.
Abstract
Large-scale deep learning models are proliferating rapidly, placing heavy computational and environmental loads on modern inference infrastructure. Training and serving billion-parameter models consume massive amounts of energy, which can account for a significant portion of total carbon emissions and makes it difficult for organizations to meet their sustainability goals. In this paper, we propose SuComp, a sustainability-driven neural network compression framework that minimizes the energy cost of large-scale model serving while keeping task accuracy close to the baseline. SuComp combines three compression methods (structured pruning, post-training quantization, and knowledge distillation) in one unified framework managed by a Sustainability-Aware Compression Scheduler (SACS), which trades off accuracy constraints against energy and carbon costs. Experiments on benchmark models (ResNet-50, BERT-base, and GPT-2) show that SuComp yields an average compression ratio of 9.7x, a 61.6% reduction in inference energy usage, and a 61.8% decrease in normalized CO₂ emissions, while retaining an average of 99.4% of baseline model accuracy. The proposed framework offers a systematic and pragmatic approach to responsible AI deployment that is aligned with environmental concerns.
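The trade-off that the scheduler is described as managing can be illustrated with a minimal sketch. It is not the SACS algorithm itself, whose details are not given here. It assumes a simple rule: among candidate compression configurations, pick the one with the lowest energy per query whose accuracy retention stays above a chosen floor. The names `Candidate`, `pick_candidate`, and `energy_reduction`, and all numbers in the usage lines, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One hypothetical compression configuration (illustrative only)."""
    name: str
    accuracy_retention: float   # fraction of baseline accuracy kept, e.g. 0.99
    energy_per_query_j: float   # measured inference energy per query, in joules


def pick_candidate(candidates: Sequence[Candidate],
                   min_retention: float) -> Optional[Candidate]:
    """Return the lowest-energy candidate that meets the accuracy floor.

    Assumed selection rule, not the paper's scheduler: filter by the
    accuracy constraint, then minimize energy. Returns None if no
    candidate satisfies the constraint.
    """
    feasible = [c for c in candidates if c.accuracy_retention >= min_retention]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.energy_per_query_j)


def energy_reduction(baseline_j: float, compressed_j: float) -> float:
    """Fractional energy reduction relative to the uncompressed baseline."""
    return 1.0 - compressed_j / baseline_j


# Made-up numbers, purely to exercise the functions above.
candidates = [
    Candidate("baseline", 1.000, 10.0),
    Candidate("pruned", 0.995, 6.0),
    Candidate("pruned+quantized", 0.990, 4.0),
    Candidate("aggressive", 0.950, 2.0),
]
choice = pick_candidate(candidates, min_retention=0.99)
if choice is not None:
    saving = energy_reduction(candidates[0].energy_per_query_j,
                              choice.energy_per_query_j)
    print(f"{choice.name}: {saving:.1%} less energy per query")
```

Under this rule, raising `min_retention` shrinks the feasible set and generally pushes the choice toward higher-energy configurations, which is the tension between accuracy constraints and energy or carbon cost described above.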




