MindSpore Adaptive Distributed Training System Special Interest Group (SIG)

Elastic systems allow users to dynamically change the number of GPUs allocated to a training job. The goal of this SIG is to develop an adaptive distributed training system that can train neural networks on elastic clusters without affecting convergence. This working repo contains all artifacts, materials, meeting notes, and proposals related to Elastic Training and Adaptive Training. Feedback and contributions are welcome.

Elastic Training: the number of GPUs can change without interrupting the training process.
Adaptive Training: training jobs can be reconfigured and rescheduled adaptively when the available training resources change, so that convergence speed is not affected.
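
To make the two definitions concrete, here is a minimal sketch of one common reconfiguration strategy: keeping the global batch size fixed while the number of workers changes, so that each optimizer step still sees the same number of samples. This illustrates the general idea only and is not necessarily the approach this SIG adopts. The function name `rebalance_batch` and its parameters are hypothetical and not part of MindSpore.

```python
from typing import List


def rebalance_batch(global_batch_size: int, num_workers: int) -> List[int]:
    """Split a fixed global batch across the current set of workers.

    Hypothetical helper for illustration only (not a MindSpore API).
    Keeping the global batch size constant means the optimizer sees the
    same number of samples per step regardless of cluster size.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if global_batch_size < num_workers:
        raise ValueError("global batch is too small to give every worker a sample")

    base, remainder = divmod(global_batch_size, num_workers)
    # The first `remainder` workers take one extra sample each so that the
    # per-worker sizes always sum to the global batch size.
    return [base + 1 if rank < remainder else base for rank in range(num_workers)]


# The cluster shrinks from 4 GPUs to 3 GPUs; the global batch stays at 256.
before = rebalance_batch(256, 4)
after = rebalance_batch(256, 3)
print(before, sum(before))
print(after, sum(after))
```

Other strategies, such as changing the global batch size and adjusting the learning rate to match, are equally possible; the sketch only shows the simplest case where the hyperparameters that affect convergence are left untouched.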

SIG Leads

Luo Mai (University of Edinburgh)

Logistics

SIG leads will drive the meetings.
Meeting announcements will be posted on our Gitee channel: https://gitee.com/mindspore/community/tree/master/sigs/adaptivetraining
Feedback and topic requests are welcome from everyone.

Discussion

Slack channel: https://app.slack.com/client/T018BLCMSGL/learning-slack
Documents and artifacts: https://gitee.com/mindspore/community/tree/master/sigs/adaptivetraining

Meeting notes

1.3 kB

Raw Blame History