QoS is important.

Co-locating diverse workloads in one large data center causes performance interference: because resource requests fluctuate irregularly, workloads compete for shared resources such as CPU cache, memory, and network bandwidth. This work aims to ensure the QoS of long-running applications (LRAs) while balancing the performance of batch jobs and maintaining high utilization.

Main Work

(1) Implemented a container re-scheduling mechanism that is triggered when a QoS violation is likely to occur. The re-assignment process considered the current system state, the number of retries so far, intermediate data locality, and job performance when choosing another server.

(2) Implemented a multi-dimensional resource controller based on the cgroups library, with a pluggable task preemption strategy. To keep preemption cheap, optimizations were designed around CPU, memory, runtime checkpoints, etc.
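
As an illustration of item (1), the sketch below shows one way a re-assignment step could weigh those factors. It is not the project's actual code: the Server fields, the score and pick_server names, the weights, and the max_retries default are all assumptions made for the example, and job performance is approximated here by free CPU and memory headroom.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    # Hypothetical snapshot of one candidate server.
    name: str
    cpu_free: float        # fraction of CPU currently free, 0..1
    mem_free: float        # fraction of memory currently free, 0..1
    has_local_data: bool   # intermediate data of the job is stored here


def score(server: Server, retries: int, max_retries: int = 3) -> float:
    """Higher is better. The weights are illustrative assumptions."""
    # The tighter of the two resources bounds how well the job can run.
    headroom = min(server.cpu_free, server.mem_free)
    locality = 1.0 if server.has_local_data else 0.0
    # The more often a container has been moved, the more we favour
    # headroom over data locality, to avoid yet another move.
    urgency = min(retries / max_retries, 1.0)
    w_headroom = 0.5 + 0.5 * urgency
    w_locality = 0.5 - 0.5 * urgency
    return w_headroom * headroom + w_locality * locality


def pick_server(servers: List[Server], current: str,
                retries: int) -> Optional[Server]:
    """Choose the best server other than the one the container is on."""
    candidates = [s for s in servers if s.name != current]
    if not candidates:
        return None
    return max(candidates, key=lambda s: score(s, retries))
```

With these weights, a first re-assignment prefers a server that already holds the intermediate data, while a container that has been moved several times is sent to the server with the most headroom.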
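
As an illustration of item (2), the sketch below shows one way a pluggable preemption strategy could sit on top of cgroups. It is a sketch under assumptions, not the project's implementation: it assumes the cgroup v2 file layout (cpu.max, memory.high, cgroup.freeze) and writes those files directly rather than through any particular library, and the class names PreemptionStrategy, ThrottleCpu, LimitMemory, Freeze, and ResourceController are hypothetical.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class PreemptionStrategy(ABC):
    """A strategy maps to a set of cgroup v2 control-file writes."""

    @abstractmethod
    def settings(self) -> dict:
        """Return {control file name: value to write}."""


class ThrottleCpu(PreemptionStrategy):
    # Cap CPU time: at most quota_us of runtime per period_us.
    def __init__(self, quota_us: int, period_us: int = 100_000):
        self.quota_us = quota_us
        self.period_us = period_us

    def settings(self) -> dict:
        return {"cpu.max": f"{self.quota_us} {self.period_us}"}


class LimitMemory(PreemptionStrategy):
    # Throttle memory growth above high_bytes without killing the task.
    def __init__(self, high_bytes: int):
        self.high_bytes = high_bytes

    def settings(self) -> dict:
        return {"memory.high": str(self.high_bytes)}


class Freeze(PreemptionStrategy):
    # Suspend every process in the group; state stays in memory.
    def settings(self) -> dict:
        return {"cgroup.freeze": "1"}


class ResourceController:
    def __init__(self, strategy: PreemptionStrategy,
                 cgroup_root: str = "/sys/fs/cgroup"):
        self.strategy = strategy          # swappable at construction time
        self.root = Path(cgroup_root)

    def preempt(self, group: str) -> None:
        """Apply the configured strategy to one cgroup directory."""
        group_dir = self.root / group
        for filename, value in self.strategy.settings().items():
            (group_dir / filename).write_text(value)
```

On a real host this needs write access to the cgroup hierarchy, for example ResourceController(ThrottleCpu(20_000)).preempt("batch/job42"), where the group path is made up for the example. Passing a different cgroup_root makes it possible to try the controller against an ordinary directory first.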