Adaptive Workflow Scheduling in Heterogeneous GPU Clusters via Deep Reinforcement Learning

Zixuan Li; Yuefeng Chen; Thomas Gallagher

doi:10.71465/mrcis158

Authors

Zixuan Li, Department of Computer Science and Engineering, University of Colorado Denver, USA
Yuefeng Chen, Department of Computer Science and Engineering, University of Colorado Denver, USA
Thomas Gallagher, Department of Computer Science and Engineering, University of Colorado Denver, USA

DOI:

https://doi.org/10.71465/mrcis158

Keywords:

Workflow Scheduling, Deep Reinforcement Learning, Heterogeneous GPU Clusters, Deep Q-Network, Resource Management

Abstract

The proliferation of heterogeneous Graphics Processing Unit (GPU) clusters has introduced unprecedented computational capabilities for workflow execution across diverse scientific and industrial domains. However, the inherent heterogeneity of GPU resources, coupled with dynamic workload characteristics and complex workflow dependencies, presents substantial challenges for efficient scheduling. Traditional heuristic-based scheduling algorithms such as Heterogeneous Earliest Finish Time (HEFT) and First-In-First-Out with Duplication and Earliest Finish Time (FIFO-DEFT) often fail to adapt to rapidly changing cluster states and evolving workload patterns. This paper proposes an adaptive workflow scheduling framework that uses Deep Reinforcement Learning (DRL) to allocate workflow tasks to heterogeneous GPU resources. The approach employs a Deep Q-Network (DQN) architecture with prioritized experience replay to learn scheduling policies through continuous interaction with the cluster environment. The framework models workflow scheduling as a Markov Decision Process (MDP) in which the agent learns to minimize makespan, maximize resource utilization, and maintain quality-of-service guarantees. Extensive experimental evaluations demonstrate that the DRL-based scheduler achieves significant performance improvements over baseline algorithms, including HEFT, FIFO-DEFT, and other state-of-the-art schedulers. The proposed method adapts well to varying cluster configurations and workflow characteristics, maintaining robust performance across diverse execution scenarios while reducing average makespan and improving the scheduling length ratio (SLR).
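
The abstract names prioritized experience replay but does not say which variant is used or how it is parameterized. As an illustration only, the following is a minimal sketch of the commonly used proportional variant, in which a transition is sampled with probability proportional to its priority raised to a power `alpha` and each sample carries an importance-sampling weight controlled by `beta`. The class name, the `Transition` fields, and the default values of `alpha`, `beta`, and `eps` are assumptions made for this sketch, not details taken from the paper. Sampling here is a linear scan rather than a sum-tree, which keeps the code short at the cost of speed.

```python
import random
from collections import namedtuple

# Hypothetical transition layout; the paper's actual state/action encoding is not given.
Transition = namedtuple("Transition", "state action reward next_state done")


class PrioritizedReplayBuffer:
    """Proportional prioritized replay with O(n) sampling (no sum-tree)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha      # 0 gives uniform sampling, 1 gives fully proportional
        self.eps = eps          # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.pos = 0            # next slot to overwrite once the buffer is full

    def add(self, transition):
        # New transitions take the current maximum priority so they are likely
        # to be replayed before their TD error is known.
        p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4, rng=random):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        n = len(self.data)
        idxs = rng.choices(range(n), weights=probs, k=batch_size)
        # Importance-sampling weights, normalized by the largest weight in the batch.
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idxs, [self.data[i] for i in idxs], weights

    def update_priorities(self, idxs, td_errors):
        # Priority is the absolute TD error plus a small constant.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a DQN training loop, the returned weights would typically scale each sample's loss term, and `update_priorities` would be called with the fresh TD errors after each gradient step.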
