Morpheus: Towards Automated SLOs for Enterprise Clusters
Motivation
The main motivation behind this work is the tension between enterprise cluster manager and users. The cluster managers want to have maximum resource utilization while maintaining fairness among users, but the users want their jobs to finish within a given deadline. The problem here is that jobs often have unpredictable runtime and resource requirement, so users often have to overprovision the resources for their jobs in order to meet the deadline. However, this overprovisioning leads to resource underutilization which is bad for cluster managers.
In this paper, the authors proposed a system that can automatically derive job deadlines and resource constraints from history data. The evaluation shows that Morpheus can reduce SLO violations by 5-13x and also reduce cluster size by 14-28%.
Design
The design of Morpheus can be divided into 3 parts: the automatic inference module, the reservation mechanism, and the dynamic reprovisioning module.
Automatic Inference Module
The job of automatic inference module is to derive user requirements, more specifically the deadline SLO and resource requirements. For deadline SLO it uses a Provenance Graph to capture and extract dependencies. And for resource requirement it uses the telemetry history of the job to derive the resource model. After generating the SLO, the module will hand it to user for sign-off or override.
Reservation Mechanism
The reservation mechanism is an online packing algorithm for reservation placement. It compacts the jobs based on Least Common Multiple (LCM) of job period and uses a LowCost Packing Algorithm for placement.
Dynamic Reprovisioning Module
The dynamic reprovisioning module will monitor the progress of each job and dynamically adjust the provisioned resource if the progress is slow. This helps reduce the effect of unpredictability when running jobs.
Takeaway
The main contribution here in my opinion is the automated inference module which can automatically generate the deadline and resource constraints for each job.
It may also be possible to use some kind of machine learning algorithm in the inference module.
One possible problem is how they handle non-periodic jobs. For those jobs they can only rely on reservation mechanism for resource allocation and the paper lacks a proof that the mechanism will be fair in these situations.
A link to the presentation slides can be found here