PaReprop: Fast Parallelized Reversible Backpropagation

Abstract

The growing size of datasets and deep learning models has made faster and memory-efficient training crucial. Reversible transformers have recently been introduced as an exciting new method for extremely memory-efficient training, but they come with an additional computation overhead of activation re-computation in the backpropagation phase.

We present PaReprop, a fast Parallelized Reversible Backpropagation algorithm that parallelizes the additional activation re-computation overhead in reversible training with the gradient computation itself in the backpropagation phase. We demonstrate the effectiveness of the proposed PaReprop algorithm through extensive benchmarking across model families (ViT, MViT, Swin and RoBERTa), data modalities (Vision & NLP), model sizes (from small to giant), and training batch sizes. Our empirical results show that PaReprop achieves up to 20% higher training throughput than vanilla reversible training, largely mitigating the theoretical overhead of 25% lower throughput from activation recomputation in reversible training.

Problem

Reversible Vision Transformers are a recently proposed class of memory-efficient models that use a reversible transformation (left) to reduce the model's memory footprint. They have been shown to match the performance of comparable non-reversible counterparts.

Top throughputs of PaReprop on Swin Transformer.

Illustration of the reversible transformation in NICE flow, as proposed in Dinh et al. 2015 (left), and an example of using it to build a Reversible Vision Transformer (RevViT), as proposed in Mangalam et al. 2022 (right).
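To make the memory argument concrete, here is a minimal sketch of a reversible block. It assumes the common two-stream additive coupling (y1 = x1 + F(x2), y2 = x2 + G(y1)); the functions `f` and `g` are hypothetical stand-ins for the attention and MLP sub-blocks, and plain Python lists stand in for tensors. The point is only that the inputs can be rebuilt exactly from the outputs, so they need not be stored during the forward pass.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
SubBlock = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def rev_forward(x1: Vector, x2: Vector, f: SubBlock, g: SubBlock) -> Tuple[Vector, Vector]:
    """Additive coupling: each stream is updated using only the other stream."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def rev_inverse(y1: Vector, y2: Vector, f: SubBlock, g: SubBlock) -> Tuple[Vector, Vector]:
    """Rebuild the block inputs from its outputs by undoing the two updates in reverse order."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


# Hypothetical stand-ins for the attention (f) and MLP (g) sub-blocks.
def f(v: Vector) -> Vector:
    return [math.tanh(x) for x in v]


def g(v: Vector) -> Vector:
    return [0.5 * x * x for x in v]


x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)  # equal to (x1, x2) up to floating-point rounding
```

Note that `f` and `g` themselves do not have to be invertible; the inverse only ever subtracts their outputs. The price is that `f` and `g` are evaluated again during the backward pass, which is the recomputation overhead discussed in the next section.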

Solution: PaReprop

A closer look at backpropagation, however, reveals a key issue. The reversible transformation requires activations to be recomputed during the backpropagation phase, which is a significant overhead. Crucially, computing the gradients of one block does not depend on recomputing the activations of the next block to be processed. The two steps can therefore run in parallel, which theoretically makes reversible backprop (Reprop) almost as fast as normal backprop.

GPipe-style illustration of our PaReprop method compared to normal backpropagation and reversible backpropagation (Reprop). Notably, at the cost of a small memory increase, we are theoretically able to hide the overhead of the reversible recomputations.
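The scheduling argument above can be checked with a small back-of-the-envelope cost model. This is an idealized sketch, not a measurement: it assumes a per-block forward cost of 1 unit, a gradient cost of 2 units, a recomputation cost equal to one forward pass, and perfect overlap between the two parallel streams. The function names and cost values are illustrative assumptions, not taken from the paper's code.

```python
def step_time_backprop(n_blocks: int, fwd: float = 1.0, bwd: float = 2.0) -> float:
    """Normal backprop: one forward and one gradient computation per block."""
    return n_blocks * (fwd + bwd)


def step_time_reprop(n_blocks: int, fwd: float = 1.0, bwd: float = 2.0,
                     recompute: float = 1.0) -> float:
    """Reprop: every block also recomputes its activations, serially, before its gradients."""
    return n_blocks * (fwd + recompute + bwd)


def step_time_pareprop(n_blocks: int, fwd: float = 1.0, bwd: float = 2.0,
                       recompute: float = 1.0) -> float:
    """PaReprop (idealized): the gradient step of one block overlaps with the
    activation recomputation of the next block to be processed."""
    # Forward pass, plus the first recomputation, which has nothing to overlap with.
    total = n_blocks * fwd + recompute
    for i in range(n_blocks):
        has_next = i < n_blocks - 1
        # While gradients are computed here, the next block is recomputed in parallel.
        total += max(bwd, recompute) if has_next else bwd
    return total


n = 12  # hypothetical depth
base = step_time_backprop(n)
reprop_rel = base / step_time_reprop(n)      # relative throughput of Reprop
pareprop_rel = base / step_time_pareprop(n)  # relative throughput of PaReprop
```

Under these assumptions Reprop reaches 3/4 of the throughput of normal backprop, which matches the 25% theoretical overhead quoted in the abstract, while the idealized PaReprop schedule pays only for the single recomputation that cannot be overlapped. Real speedups are smaller because the two streams compete for the same hardware.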

Results

In practice, we find that our method works surprisingly well. PaReprop matches the throughput of Reprop for ViT models, while it greatly improves on Reprop for the hierarchical models. We hypothesize that their non-homogeneous nature makes them especially amenable to being sped up with parallelization.

We also compare the amount of memory used by all of the methods. Both Reprop and PaReprop are far more memory-efficient than traditional backprop, and the additional memory cost incurred by PaReprop is negligible relative to the overall savings.

Throughput vs. batch size for vision architectures. PaReprop does especially well for MViT and Swin.

Throughputs for vision architectures. PaReprop does especially better for MViT and Swin.

Plot of memory used by normal backprop, Reprop, and PaReprop. Both reversible methods use much less memory than normal backprop, and PaReprop's extra memory cost is negligible by comparison.

Transformers for Vision Workshop @ CVPR 2023 (Spotlight)

PaReprop achieves up to 20% training throughput gain without any change to the underlying computation and while still being extremely memory-efficient.