Do not waste what we paid for.

Low actual resource utilization in massive-scale clusters, where online services and offline batch tasks are co-located, wastes much of the hardware investment, especially for cloud service providers. This work reuses allocated-but-idle, reserved, and fragmented resources through over-subscription to make resource usage more efficient. The solution can nearly double hardware utilization, from 36% to 65%, and shorten the makespan of offline batch tasks by over 30%.

Main Work

(1) Introduced a new task type, the speculative task, which uses idle resources by launching a low-priority container whether or not those resources are already allocated. Tasks of this type were enqueued first and then managed by the node manager.

(2) Supported hybrid resource management with both centralized and distributed managers. The distributed managers assigned speculative tasks, and each kept a resource view that a global coordinator updated periodically.

(3) Developed a multi-phase filtering algorithm, based on runtime load and health performance, to avoid inter-task interference.

(4) Developed a timestamp-ordering compression algorithm to synchronize candidates with little extra overhead.
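
The summary does not describe the phases or thresholds of the multi-phase filter in item (3). The following is only a minimal Python sketch of how such a filter might select candidate nodes for a speculative task. It assumes three phases (actually idle capacity, runtime load, node health). Every name and threshold here (NodeView, filter_candidates, max_load, max_failures) is hypothetical and not taken from the original system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeView:
    """Hypothetical snapshot of one node in a manager's resource view."""
    name: str
    cpu_capacity: float   # total cores on the node
    cpu_used: float       # cores actually in use, not merely allocated
    load_per_core: float  # assumed runtime load signal (0.0 = idle)
    recent_failures: int  # assumed health signal: recent task failures


def idle_cpu(node: NodeView) -> float:
    """Cores that are idle right now, whether or not they are allocated."""
    return node.cpu_capacity - node.cpu_used


def filter_candidates(nodes: List[NodeView], demand: float,
                      max_load: float = 0.7,
                      max_failures: int = 2) -> List[NodeView]:
    """Return nodes that pass every phase, most idle capacity first."""
    # Phase 1: keep nodes with enough idle capacity for the task.
    phase1 = [n for n in nodes if idle_cpu(n) >= demand]
    # Phase 2: drop nodes whose runtime load is already high, since a
    # speculative task there is more likely to disturb running tasks.
    phase2 = [n for n in phase1 if n.load_per_core <= max_load]
    # Phase 3: drop nodes with a poor recent health record.
    phase3 = [n for n in phase2 if n.recent_failures <= max_failures]
    return sorted(phase3, key=idle_cpu, reverse=True)
```

Running the cheap capacity check first keeps the later phases working on a smaller candidate set. A real filter would use the system's own load and health metrics in place of these placeholder fields.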