Gradient Checkpointing And Recomputation Strategies For Deep Learning On Memory-Limited Devices

V. Sivasankari; B. Gayathri; Dr. Lincy Roy; Yuldasheva Maftuna Mamurjon kizi

Authors

V. Sivasankari, Assistant Professor, Department of Mathematics, Meenakshi College of Arts and Science, Meenakshi Academy of Higher Education and Research, Chennai, Tamil Nadu, India.
B. Gayathri, Assistant Professor, Department of Computer Science, Meenakshi College of Arts and Science, Meenakshi Academy of Higher Education and Research, Chennai, Tamil Nadu, India.
Dr. Lincy Roy, Assistant Professor, Kalinga University, Naya Raipur, Chhattisgarh, India.
Yuldasheva Maftuna Mamurjon kizi, Turan International University, Namangan, Uzbekistan.

Keywords:

Gradient Checkpointing, Activation Recomputation, Memory-Efficient Training, Deep Networks, Dynamic Programming, Mobile Training, Memory Optimization.

Abstract

The memory bottleneck in training very deep neural networks on memory-limited hardware (e.g., consumer GPUs with 8-16 GB of VRAM, mobile NPUs, and embedded accelerators) is due mainly to activation memory: the intermediate feature maps that must be kept for backpropagation. Gradient checkpointing addresses this problem by discarding most intermediate activations during the forward pass, keeping only a subset as checkpoints, and recomputing the discarded ones on demand during the backward pass. The technique trades extra compute for reduced peak memory. However, simple checkpointing schemes that recompute every discarded activation from the previous checkpoint incur at least 33% compute overhead. This paper proposes GradRecomp, a framework for optimized gradient checkpointing and recomputation. GradRecomp uses a dynamic-programming activation cost model to place checkpoints so as to maximize memory savings, and it selects partial recomputation paths that reuse cached intermediate values to reduce recomputation overhead. Our results show that GradRecomp saves up to 32% of the peak memory of full activation caching, with overheads of only 5%, 18%, and 8% on ResNet-152, BERT-Large, and ViT-B/16, respectively, compared to 51% memory and 18% overhead with no checkpointing, and 44% memory and 8% overhead with Checkpoint-2.
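To make the idea of dynamic-programming checkpoint placement concrete, the following is a minimal sketch in Python. It is not the GradRecomp algorithm, whose cost model is not given in the abstract. It assumes a simplified model of a linear chain of layers in which peak memory is the total size of the stored checkpoints plus the largest run of activations that must be recomputed and held between two checkpoints. The function names `modeled_peak` and `plan_checkpoints`, and the memory model itself, are illustrative assumptions.

```python
from itertools import accumulate


def modeled_peak(act_mem, checkpoints):
    """Peak memory under the assumed model: all checkpointed activations
    stay resident, plus the largest run of non-checkpointed activations
    that is rebuilt and held while backpropagating through one segment."""
    ckpt = set(checkpoints)
    stored = sum(act_mem[i] for i in ckpt)
    worst_run = run = 0
    for i, m in enumerate(act_mem):
        run = 0 if i in ckpt else run + m
        worst_run = max(worst_run, run)
    return stored + worst_run


def plan_checkpoints(act_mem):
    """Return (peak, checkpoints) minimizing modeled_peak on a linear chain.

    Boundary p in 0..n+1 stands for layer p-1: p=0 is the network input and
    p=n+1 is a free sentinel after the last layer. Intended for small chains,
    since the search is roughly O(n^4).
    """
    n = len(act_mem)
    prefix = [0, *accumulate(act_mem)]  # prefix[k] == sum(act_mem[:k])

    def gap(q, p):
        # Activation memory of the layers strictly between boundaries q and p.
        return prefix[p - 1] - prefix[q]

    # Every achievable segment size is a candidate cap on the largest segment.
    caps = sorted({gap(q, p) for q in range(n + 1) for p in range(q + 1, n + 2)})
    inf = float("inf")
    best_plan = None
    for cap in caps:
        # cost[p]: least checkpoint memory with a checkpoint at boundary p
        # and every earlier segment no larger than cap.
        cost = [0] + [inf] * (n + 1)
        parent = [None] * (n + 2)
        for p in range(1, n + 2):
            own = act_mem[p - 1] if p <= n else 0  # the sentinel is free
            for q in range(p):
                if gap(q, p) <= cap and cost[q] + own < cost[p]:
                    cost[p], parent[p] = cost[q] + own, q
        total = cost[n + 1] + cap
        if best_plan is None or total < best_plan[0]:
            ckpts, p = [], parent[n + 1]
            while p:  # walk back to the input boundary (p == 0)
                ckpts.append(p - 1)
                p = parent[p]
            best_plan = (total, sorted(ckpts))
    return best_plan
```

Under this assumed model, a chain of nine layers with activations of size 4 each has a modeled peak of 36 when everything is cached, while `plan_checkpoints([4] * 9)` finds a placement with a modeled peak of 20. A real planner would also weigh the recomputation cost of each segment, which this sketch leaves out.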
