Acceleration and Optimization of Pre-trained Language Model Training

This project focused on advancing the efficiency and scalability of large-scale pre-trained language model (PLM) training under constrained hardware conditions. The goal was to improve training throughput substantially, reduce infrastructure costs, and enable the training of models with far larger parameter counts, while keeping the system transparent and usable for upstream algorithm development. Detailed performance profiling identified two major challenges in distributed pre-training: severe GPU memory limitations and inefficient utilization of heterogeneous computing resources.

To address these issues, the team proposed and implemented several key system-level optimizations: a multi-stream micro-batching strategy that improves Streaming Multiprocessor (SM) utilization, and a dynamic offloading mechanism that leverages heterogeneous memory resources such as CPU RAM to alleviate GPU memory bottlenecks. Combined with a window-based CUDA tensor reuse scheme, a dual-type tensor communication backend, and a decoupled hierarchical optimizer, these optimizations enabled training models with over 23× more parameters at minimal performance cost, outperforming existing offloading approaches. The resulting framework was fully compatible with mainstream parallelization strategies, including data, pipeline, and tensor parallelism, as well as ZeRO-based partitioning, providing a scalable and practical foundation for next-generation large-model training.
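To make the overlap idea concrete, the following is a minimal PyTorch sketch of multi-stream micro-batching combined with an illustrative CPU offload step. The function name, the `micro_batches` iterable, the two-stream layout, and the pinned-memory buffer are assumptions for illustration only; they are not the project's actual interfaces or implementation.

```python
import torch


def train_step(model, optimizer, micro_batches, loss_fn, device="cuda"):
    """Illustrative sketch (not the project's API): overlap host<->device
    copies with compute using separate CUDA streams, and park results in
    pinned CPU buffers to relieve GPU memory pressure."""
    compute_stream = torch.cuda.Stream(device)
    copy_stream = torch.cuda.Stream(device)
    offloaded = []  # pinned CPU buffers holding tensors moved off the GPU

    optimizer.zero_grad(set_to_none=True)
    for inputs, targets in micro_batches:
        with torch.cuda.stream(copy_stream):
            # Overlap host-to-device transfers with the previous micro-batch's compute.
            inputs = inputs.pin_memory().to(device, non_blocking=True)
            targets = targets.pin_memory().to(device, non_blocking=True)

        # Compute must not start before its inputs have arrived.
        compute_stream.wait_stream(copy_stream)
        # Prevent the caching allocator from recycling these buffers while
        # the compute stream is still reading them.
        inputs.record_stream(compute_stream)
        targets.record_stream(compute_stream)

        with torch.cuda.stream(compute_stream):
            loss = loss_fn(model(inputs), targets)
            loss.backward()  # gradients accumulate across micro-batches

        # Offload must not start before the values it copies are ready.
        copy_stream.wait_stream(compute_stream)
        with torch.cuda.stream(copy_stream):
            # Hypothetical device-to-host offload into pinned CPU memory.
            buf = torch.empty(loss.shape, dtype=loss.dtype, pin_memory=True)
            buf.copy_(loss.detach(), non_blocking=True)
            offloaded.append(buf)

    torch.cuda.synchronize()  # drain all streams before the parameter update
    optimizer.step()
    return offloaded
```

The sketch only shows the stream-ordering discipline that makes copy/compute overlap and CPU offloading safe; the framework described above presumably manages such scheduling automatically across the reuse window, the communication backend, and the hierarchical optimizer.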