This project focused on improving resource utilization in large-scale cloud data centres where online services and offline batch workloads are co-located. Resources that are allocated but idle, reserved, or fragmented go unused, which imposes substantial costs on cloud providers. By leveraging over-subscription techniques, the project aimed to significantly enhance resource efficiency, nearly doubling hardware utilization from 36% to 65%, while reducing the makespan of batch workloads by over 30%.
To achieve these outcomes, I introduced a speculative task mechanism that opportunistically uses idle resources by launching low-priority containers, regardless of whether those resources are already allocated. I designed a hybrid resource management architecture that combines centralized coordination with distributed task assignment: distributed managers handle speculative scheduling, while a global coordinator maintains a periodically updated view of cluster resources. Further optimizations included a multi-phase filtering algorithm that uses runtime load and system health metrics to minimize task interference, and a timestamp-ordering compression algorithm that synchronizes scheduling candidates with minimal overhead. Together, these techniques provide a scalable, cost-efficient approach to resource management in modern cloud infrastructures.
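To make the filtering and speculative placement ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's implementation: the `Node` fields, the three phases, the `max_load` threshold of 0.7, and the function names `filter_candidates` and `place_speculative` are all hypothetical. The point it shows is that a candidate is judged on measured load and health rather than on how much of the node is already allocated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical per-node view, as a distributed manager might see it."""
    name: str
    healthy: bool          # assumed system health signal
    allocated_cpu: float   # cores reserved by regular tasks (not used for filtering)
    used_cpu: float        # cores actually in use (runtime load)
    total_cpu: float

    @property
    def idle_cpu(self) -> float:
        return self.total_cpu - self.used_cpu


def filter_candidates(nodes: List[Node], demand_cpu: float,
                      max_load: float = 0.7) -> List[Node]:
    """Assumed three-phase filter: health, projected load, then ranking."""
    # Phase 1: drop nodes that report poor health.
    healthy = [n for n in nodes if n.healthy]
    # Phase 2: drop nodes whose measured load would exceed the threshold
    # once the speculative task is added (limits interference).
    fits = [n for n in healthy
            if (n.used_cpu + demand_cpu) / n.total_cpu <= max_load]
    # Phase 3: prefer the nodes with the most idle capacity.
    return sorted(fits, key=lambda n: n.idle_cpu, reverse=True)


def place_speculative(nodes: List[Node], demand_cpu: float,
                      max_load: float = 0.7) -> Optional[str]:
    """Pick a node for a low-priority task, ignoring prior allocation."""
    candidates = filter_candidates(nodes, demand_cpu, max_load)
    if not candidates:
        return None
    chosen = candidates[0]
    chosen.used_cpu += demand_cpu  # account for the newly launched task
    return chosen.name


if __name__ == "__main__":
    cluster = [
        Node("a", True, allocated_cpu=14, used_cpu=4, total_cpu=16),
        Node("b", False, allocated_cpu=2, used_cpu=1, total_cpu=16),
        Node("c", True, allocated_cpu=8, used_cpu=11, total_cpu=16),
        Node("d", True, allocated_cpu=4, used_cpu=2, total_cpu=8),
    ]
    print(place_speculative(cluster, demand_cpu=2.0))
```

In this example node `a` is almost fully allocated (14 of 16 cores) but mostly idle, so it is still the preferred target. A scheduler that looked only at allocation would have skipped it, which is the gap over-subscription is meant to close.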