This research focused on improving quality of service (QoS) for long-running applications (LRAs) in large-scale data centers, where diverse co-located workloads often contend for shared resources such as CPU caches, memory, and network bandwidth. The project aimed to keep QoS predictable for LRAs while preserving acceptable performance for batch jobs and maximizing overall hardware utilization.
To achieve these objectives, I designed and implemented a container-based re-scheduling mechanism that proactively reassigns tasks when potential QoS violations are detected, taking into account system state, retry history, data locality, and execution performance. In parallel, I developed a multidimensional resource controller based on Linux cgroups. The controller supports a pluggable preemption policy and reduces the cost of interrupting tasks through CPU and memory management and runtime checkpointing. Together, these techniques improved system responsiveness and resource efficiency, providing a scalable approach to QoS-aware workload orchestration in modern cloud data centers.
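As an illustration only, the way a re-scheduler might weigh the four factors named above (system state, retry history, data locality, and execution performance) can be sketched as a weighted scoring function over candidate nodes. This is a minimal sketch, not the project's implementation: the `Node` fields, the weights, and the names `score` and `pick_target` are all hypothetical and chosen for readability.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical per-node view used only for this illustration."""
    name: str
    cpu_free: float       # fraction of CPU currently unallocated, 0..1
    mem_free: float       # fraction of memory currently unallocated, 0..1
    has_local_data: bool  # True if the task's input data resides on this node
    past_failures: int    # earlier failed attempts of this task on this node
    avg_slowdown: float   # observed / expected runtime for similar tasks


def score(node: Node, w_state: float = 0.4, w_local: float = 0.2,
          w_retry: float = 0.2, w_perf: float = 0.2) -> float:
    """Combine the four factors into one value; higher is better.

    The weights are assumed values, not tuned or taken from the project.
    """
    state = min(node.cpu_free, node.mem_free)     # scarcest resource dominates
    locality = 1.0 if node.has_local_data else 0.0
    retry = 1.0 / (1.0 + node.past_failures)      # penalize repeated failures
    perf = 1.0 / max(node.avg_slowdown, 1.0)      # penalize slow nodes
    return w_state * state + w_local * locality + w_retry * retry + w_perf * perf


def pick_target(nodes: List[Node], current: str) -> Optional[Node]:
    """Return the best-scoring node other than the one the task runs on."""
    candidates = [n for n in nodes if n.name != current]
    if not candidates:
        return None
    return max(candidates, key=score)


if __name__ == "__main__":
    cluster = [
        Node("a", 0.1, 0.2, True, 0, 1.8),   # node where the task currently runs
        Node("b", 0.8, 0.6, True, 0, 1.0),
        Node("c", 0.9, 0.9, False, 2, 2.0),
    ]
    target = pick_target(cluster, current="a")
    print(target.name if target else "no candidate")
```

In this sketch, a node with ample free resources can still lose to a slightly busier one that holds the task's data and has no failure history, which is the kind of trade-off a multi-factor re-scheduling decision has to make.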