Elastic systems allow users to dynamically change the number of GPUs allocated to a training job. The goal of this SIG is to develop an adaptive distributed training system that can train neural networks on elastic clusters without affecting convergence. This working repo contains all the artifacts, materials, meeting notes, and proposals regarding Elastic Training and Adaptive Training. Feedback and contributions are welcome.
Luo Mai (University of Edinburgh)