Inference-Time Energy Minimization Through Learnable Numerical Precision In Activation Computation
Keywords:
Learnable Precision, Activation Quantization, Inference Energy, Per-Channel Precision, Binary Gating, Energy-Efficient Inference, Mixed-Precision Neural Networks.
Abstract
Inference energy in a neural network is dominated by arithmetic operations on activation tensors, and the energy cost of these operations depends non-linearly on the numerical precision employed. Fixed-precision quantization applies a single bit width to all activation operations, ignoring the fact that precision requirements vary spatially and across channels within a single layer. In this work, we propose LearnPrec, a framework that minimizes inference-time energy by making activation precision learnable. We introduce a per-channel activation precision selector, a small binary network trained end-to-end with a combined accuracy-energy objective, which chooses 8-bit or 4-bit computation for each activation channel independently of the others. The selector runs at inference time and makes one binary decision per activation channel for each input batch, enabling fine-grained energy savings while leaving the model weights unchanged. On MobileNetV3, EfficientNet-B2, and DeiT-Small, evaluated on ImageNet-1K, CIFAR-100, and Oxford Pets, LearnPrec cuts inference energy to 19% of the FP32 baseline while preserving 93.5% accuracy, compared with 48% energy and 93.1% accuracy for the fixed-precision INT8 baseline and 31% energy and 91.4% accuracy for the fixed-precision INT4 baseline.
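To make the per-channel decision concrete, the following is a minimal sketch in plain Python of how a binary gate could switch each activation channel between 8-bit and 4-bit computation. It is an illustration under stated assumptions, not the LearnPrec implementation: the function names (`fake_quantize`, `apply_channel_gates`, `relative_energy`), the symmetric uniform quantizer, and the cost constants `cost_8bit` and `cost_4bit` are all hypothetical, and the gates are passed in by hand rather than produced by a learned selector network.

```python
from typing import List, Sequence


def fake_quantize(values: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform quantization of one channel, returned as floats.

    Assumption: a simple max-abs scale per channel; the paper's actual
    quantizer is not specified in the abstract.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def apply_channel_gates(
    channels: Sequence[Sequence[float]], gates: Sequence[int]
) -> List[List[float]]:
    """Quantize each channel at 8 bits (gate 1) or 4 bits (gate 0)."""
    if len(channels) != len(gates):
        raise ValueError("one gate is required per channel")
    return [fake_quantize(ch, 8 if g else 4) for ch, g in zip(channels, gates)]


def relative_energy(
    gates: Sequence[int], cost_8bit: float = 1.0, cost_4bit: float = 0.5
) -> float:
    """Average per-channel cost relative to running every channel at 8 bits.

    Assumption: the two cost constants are placeholders, not measured values.
    """
    if not gates:
        raise ValueError("at least one gate is required")
    total = sum(cost_8bit if g else cost_4bit for g in gates)
    return total / (len(gates) * cost_8bit)


if __name__ == "__main__":
    activations = [[0.5, -1.0, 0.25], [0.5, -1.0, 0.25]]
    gates = [1, 0]  # channel 0 at 8 bits, channel 1 at 4 bits
    print(apply_channel_gates(activations, gates))
    print(relative_energy(gates))
```

In this sketch the two channels hold identical values, so the difference between the two output rows shows the extra rounding error that the 4-bit path introduces. A trained selector would aim to assign the low-precision gate only to channels where that error does not hurt accuracy.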




