PROJECTS
2021 – 2022 » Affordable Training of Billion-Scale Pre-trained NLP Models
RI at Alibaba DAMO Academy
Description
GPU on-board memory capacity has fallen behind the growth of model sizes, even in the latest devices such as the A100 and H100. This work enables larger model training by utilizing CPU-side DRAM through a layer-based offloading mechanism that introduces no extra overhead. Compared with state-of-the-art offloading-based solutions, it trains 1.9x~6.5x larger models with the same resources while improving throughput by 1.2x~3.7x on the respective largest trainable models.
Main Work
- Designed and implemented a working window on the GPU side to hold the active parameters during forward and backward propagation. Inactive parameters were kept in CPU DRAM, and offloading actions were scheduled dynamically according to inter-layer dependencies obtained by analysing the neural architecture; the relevant operations were inserted via hook functions (see the sketch after this list).
- Optimized tensor transmission between GPU and CPU: reusing GPU tensor storage avoids time-consuming memory allocation and deallocation, and the transfers were implemented so that they release the Python GIL.
- Overlapped backward propagation with optimizer updates: instantiating a separate optimizer for each layer allows part of the parameters to be updated early, without waiting for unrelated backward computations.
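A minimal sketch of the working-window idea, assuming PyTorch; the window size, the one-layer prefetch, and the hook placement are illustrative assumptions, and backward-side reloading is omitted for brevity.

```python
# Minimal sketch: keep only a small "working window" of layers on the GPU;
# park the rest in CPU DRAM and swap them in/out via forward hooks.
# Window size and prefetch policy are illustrative; backward-side reloading
# (via backward hooks) is omitted for brevity.
import torch
import torch.nn as nn

class OffloadedStack(nn.Module):
    def __init__(self, layers, window=2, device="cuda"):
        super().__init__()
        self.layers = nn.ModuleList(layers).to("cpu")  # inactive params live in DRAM
        self.window, self.device = window, device
        for i, layer in enumerate(self.layers):
            # Pre-forward hook: move this layer (and prefetch the next) to the GPU.
            layer.register_forward_pre_hook(self._make_load_hook(i))
            # Post-forward hook: evict the layer that falls out of the window.
            layer.register_forward_hook(self._make_evict_hook(i))

    def _make_load_hook(self, i):
        def hook(module, inputs):
            module.to(self.device, non_blocking=True)
            if i + 1 < len(self.layers):          # simple one-layer prefetch
                self.layers[i + 1].to(self.device, non_blocking=True)
        return hook

    def _make_evict_hook(self, i):
        def hook(module, inputs, output):
            victim = i - self.window + 1
            if victim >= 0:                        # oldest layer returns to CPU DRAM
                self.layers[victim].to("cpu")
        return hook

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

if __name__ == "__main__" and torch.cuda.is_available():
    model = OffloadedStack([nn.Linear(1024, 1024) for _ in range(8)])
    out = model(torch.randn(4, 1024, device="cuda"))
```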
2019 – 2021 » QoS Provision at Massive-Scale Data Centers
RA at University of Leeds
Description
Co-locating diverse workloads in one massive data center leads to performance interference: irregular fluctuations in resource demand cause contention for CPU cache, memory, network bandwidth, etc. This work aims to guarantee the QoS of long-running applications (LRAs) while balancing the performance of batch jobs and maintaining high utilization.
Main Work
- Implemented a container re-scheduling mechanism triggered when a QoS violation is likely to occur. The re-assignment process considers current system state, the number of retries, intermediate data locality, and job performance when choosing another server.
- Implemented a multi-dimensional resource controller on top of the cgroups library, with a pluggable task-preemption strategy; low-cost preemption was optimized across CPU, memory, runtime checkpoints, etc. (a minimal sketch follows this list).
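A minimal sketch of the controller idea on top of cgroups v2; the paths, knobs, group names, and the preemption-policy interface are assumptions for illustration, not the production code.

```python
# Minimal sketch of a multi-dimensional resource controller on top of cgroups v2.
# Paths, controller knobs, and the policy interface are illustrative assumptions.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumes cgroups v2 mounted here

class CgroupController:
    def __init__(self, group_name):
        self.group = CGROUP_ROOT / group_name
        self.group.mkdir(exist_ok=True)

    def _write(self, knob, value):
        (self.group / knob).write_text(str(value))

    def set_cpu_quota(self, quota_us, period_us=100_000):
        # Throttle CPU: "quota period" in microseconds (cpu.max, cgroups v2).
        self._write("cpu.max", f"{quota_us} {period_us}")

    def set_memory_limit(self, limit_bytes):
        self._write("memory.max", limit_bytes)

    def attach(self, pid):
        # Move a task into this cgroup so the limits above apply to it.
        self._write("cgroup.procs", pid)

def preempt(controller, policy):
    """Pluggable preemption: 'policy' decides the new (cpu, memory) caps."""
    cpu_us, mem_bytes = policy()
    controller.set_cpu_quota(cpu_us)
    controller.set_memory_limit(mem_bytes)

if __name__ == "__main__":
    batch = CgroupController("batch_jobs")         # hypothetical low-priority group
    preempt(batch, lambda: (20_000, 512 * 2**20))  # shrink to 20% CPU, 512 MiB
```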
2017 – 2018 » Efficient Resource Management
(China’s National Key R&D Program) R&D at Alibaba Cloud
Description
Low actual resource utilization in massive-scale clusters – where online services and offline batch tasks are co-located – wastes a large amount of investment, especially for cloud service providers. This work reuses allocated-but-idle, reserved, and fragmented resources through over-subscription to pursue efficient resource usage. The solution nearly doubles hardware utilization, from 36% to 65%, and shortens the makespan of offline batch tasks by over 30%.
Main Work
- Introduced a new task type – the speculative task – which utilizes idle resources by launching a low-priority container, regardless of whether those resources are already allocated. Such tasks are enqueued first and managed by the node manager.
- Supported hybrid resource managers, both centralized and distributed. The distributed managers assign speculative tasks and keep a resource view that is periodically updated by a global coordinator.
- Developed a multi-phase filter algorithm based on runtime load and health performance to avoid inter-task interference (see the sketch after this list).
- Developed a timestamp-ordering compression algorithm for synchronizing candidates without much extra overhead.
2016 – 2017 » Hybrid Cloud
(China’s National 863 Program) RA at Beihang University
Description
A hybrid cloud combines the advantages of private and public clouds; however, no mature solution existed at the time. This work aims to develop a hybrid cloud prototype and to explore related areas together with academia and industry.
Main Work
- Discussed and designed the cloud platform architecture, which is application-oriented and organized into a few modules.
- Implemented an overview portal that monitors and visualizes the runtime resources from different perspectives.
- Implemented a unified adaptor based on a reflection mechanism to support existing heterogeneous cloud platforms (see the sketch after this list).
- Implemented a few deployment strategies, covering cost, performance, data locality, user privacy, and other metrics.
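A minimal sketch of the reflection-based adaptor idea; the driver module names and the forwarded operations are hypothetical.

```python
# Minimal sketch of a reflection-based unified adaptor: each heterogeneous
# cloud platform ships a driver module, and the adaptor loads it by name and
# dispatches calls dynamically. Module and method names are hypothetical.
import importlib

class UnifiedAdaptor:
    def __init__(self, platform):
        # e.g. "drivers.openstack" or "drivers.aliyun" (hypothetical modules)
        self._driver = importlib.import_module(f"drivers.{platform}")

    def __getattr__(self, op):
        # Reflection: forward any operation (create_vm, list_images, ...)
        # to the platform driver if it implements it.
        impl = getattr(self._driver, op, None)
        if impl is None:
            raise NotImplementedError(f"{self._driver.__name__} lacks '{op}'")
        return impl

# Usage (assuming a drivers/openstack.py module exposing create_vm):
#   adaptor = UnifiedAdaptor("openstack")
#   adaptor.create_vm(image="ubuntu-22.04", flavor="m1.small")
```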
2015.06 – 2015.08 » Real-time Computing System
Development Intern at Alibaba Group Inc.
Data accuracy (or business logic accuracy) is essential to real-time computing platforms.
Main Work:
Implemented a detection mechanism that evaluates the data accuracy rate of the real-time system by comparing the same metrics against the offline system (a minimal sketch follows). Once an error occurred, the mechanism sent a detailed report to developers to help them debug or adjust the business logic accordingly. It also supported a stable data-query service that automatically switches to an available data channel once the current one crashes.
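A minimal sketch of the comparison step; the metric names and the tolerance are illustrative assumptions.

```python
# Minimal sketch of the accuracy-detection idea: compare the same metric
# computed by the real-time pipeline against the offline (batch) pipeline
# and report deviations. Metric names and the tolerance are illustrative.
def detect_deviations(realtime, offline, tolerance=0.01):
    """Both arguments: {metric_name: value}. Returns a report of mismatches."""
    report = []
    for metric, offline_value in offline.items():
        rt_value = realtime.get(metric)
        if rt_value is None:
            report.append(f"{metric}: missing from real-time output")
            continue
        rel_err = abs(rt_value - offline_value) / max(abs(offline_value), 1e-9)
        if rel_err > tolerance:
            report.append(f"{metric}: realtime={rt_value} offline={offline_value} "
                          f"(relative error {rel_err:.2%})")
    return report

if __name__ == "__main__":
    issues = detect_deviations({"gmv": 1005.0, "orders": 98},
                               {"gmv": 1000.0, "orders": 100})
    if issues:
        print("Accuracy report:", *issues, sep="\n  ")  # would be sent to developers
```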
2015.01 – 2015.05 » Chinese-English Syntax Tree for NLP
Research Assistant at Tsinghua University.
In NLP, manually annotating sentences (e.g. with syntax trees) to verify automatic labelling errors is a crucial but time-consuming procedure.
Main Work: Designed and implemented a graphical annotation tool – TreeEditor, based on jGraph.
It supported: 1) parsing and visualizing syntax trees in different formats (such as XML, bracket notation, etc.); 2) a convenient and efficient window for annotating sentences and merging multiple results, highlighting inconsistent parts and generating an accuracy report; 3) customizing and zooming in/out of the panel and exporting to various formats (a parsing sketch follows).
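A minimal sketch of the bracket-notation parsing step only (the original TreeEditor is a Java/jGraph GUI); the Penn-Treebank-style input format is an assumption.

```python
# Minimal sketch of bracket-notation parsing, e.g. Penn-Treebank-style input
# such as "(S (NP (PRP I)) (VP (VBP love) (NN NLP)))". Format is assumed.
def parse_brackets(text):
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "("; pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:                       # leaf word
                children.append(tokens[pos]); pos += 1
        pos += 1                        # consume ")"
        return {"label": label, "children": children}

    return parse_node()

if __name__ == "__main__":
    tree = parse_brackets("(S (NP (PRP I)) (VP (VBP love) (NN NLP)))")
    print(tree["label"], [c["label"] for c in tree["children"]])  # S ['NP', 'VP']
```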