GPU memory capacity has not kept pace with model size growth, even in the latest devices such as the A100 and H100. This work enables larger model training by utilizing CPU-side DRAM through a layer-based offloading mechanism that adds negligible overhead. Compared with state-of-the-art offloading-based solutions, it trains models 1.9x to 6.5x larger with the same resources, while improving throughput by 1.2x to 3.7x on each solution's largest trainable model.
(1) Designed and implemented a GPU-side working window that holds the active parameters during forward and backward propagation, while inactive parameters reside in CPU DRAM. Offloading decisions are made dynamically from the layer interactions obtained by analyzing the neural architecture, and the corresponding transfer operations are inserted via hook functions (see the sketch after this item).
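A minimal PyTorch-style sketch of how such hooks could move a layer's parameters into and out of the GPU working window, assuming the layers execute sequentially. The function name, the simple per-layer schedule, and the hook bodies are illustrative only; the actual system derives its schedule from the architecture analysis and also handles tensors saved for backward.

```python
def attach_offload_hooks(layers, device="cuda"):
    """Keep only the currently active layer's parameters in the GPU working window."""

    def prefetch(module, inputs):
        # Bring this layer's parameters onto the GPU just before its forward runs.
        module.to(device, non_blocking=True)

    def evict(module, inputs, output):
        # Push the parameters back to CPU DRAM once the layer's forward is done;
        # backward-side hooks (omitted here) would prefetch them again before
        # the corresponding backward computation.
        module.to("cpu", non_blocking=True)

    for layer in layers:
        layer.register_forward_pre_hook(prefetch)
        layer.register_forward_hook(evict)

# Usage (hypothetical): attach_offload_hooks(list(model.children()))
```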
(2) Optimized tensor transmission between GPU and CPU: reusing GPU tensor storage avoids time-consuming memory allocation and deallocation, and the copies are issued in a way that releases the Python GIL during the transfer (see the sketch after this item).
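A sketch of the buffer-reuse idea, assuming a hypothetical ReusableGpuBuffer helper and a known maximum per-layer parameter size. Copying from pinned host memory with non_blocking=True lets the device copy run asynchronously while the GIL is not held by the copy itself; the class name and sizing policy are assumptions, not the actual implementation.

```python
import torch

class ReusableGpuBuffer:
    """Reuse one preallocated GPU buffer for parameter uploads."""

    def __init__(self, max_numel, dtype=torch.float16, device="cuda"):
        # Allocated once; later uploads copy into this storage instead of
        # paying an allocation/deallocation on every layer switch.
        self._buf = torch.empty(max_numel, dtype=dtype, device=device)

    def upload(self, cpu_tensor):
        # Pinned host memory enables an asynchronous host-to-device copy.
        src = cpu_tensor if cpu_tensor.is_pinned() else cpu_tensor.pin_memory()
        dst = self._buf[: src.numel()].view(src.shape)
        dst.copy_(src, non_blocking=True)
        return dst
```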
(3) Overlapped backward propagation with optimizer updates: instantiating a separate optimizer for each layer allows part of the parameters to be updated early, without waiting for the unrelated portion of the backward pass to finish (see the sketch after this item).
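A sketch of per-layer optimizers stepped from backward-side hooks, assuming PyTorch >= 2.1 for Tensor.register_post_accumulate_grad_hook. The optimizer choice, learning rate, and the simple gradient-ready counter are illustrative; the point is that a layer's update can start as soon as its own gradients exist.

```python
import torch

def attach_eager_optimizers(layers, lr=1e-3):
    """One optimizer per layer, stepped as soon as that layer's grads are ready."""
    for layer in layers:
        params = [p for p in layer.parameters() if p.requires_grad]
        opt = torch.optim.AdamW(params, lr=lr)
        pending = {"count": 0}

        def make_hook(opt=opt, params=params, pending=pending):
            def hook(param):
                pending["count"] += 1
                if pending["count"] == len(params):
                    # All grads for this layer have been accumulated: update now,
                    # overlapping with the backward pass of earlier layers.
                    opt.step()
                    opt.zero_grad(set_to_none=True)
                    pending["count"] = 0
            return hook

        for p in params:
            p.register_post_accumulate_grad_hook(make_hook())
```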