diff --git a/design/meps/mep-parallel/MEP-PARALLEL.md b/design/meps/mep-parallel/MEP-PARALLEL.md
new file mode 100644
index 0000000..03b3776
--- /dev/null
+++ b/design/meps/mep-parallel/MEP-PARALLEL.md
@@ -0,0 +1,196 @@
+| title        | authors              | owning-sig | participating-sigs | status      | creation-date | reviewers | approvers | stage | milestone     |
+| ------------ | -------------------- | ---------- | ------------------ | ----------- | ------------- | --------- | --------- | ----- | ------------- |
+| MEP-PARALLEL | @stsuteng @xiaoda_zh | parallel   |                    | provisional | 2020-10-14    |           | TBD       | beta  | beta : "v0.5" |
+
+# MEP-PARALLEL: Auto-parallel
+
+## Table of Contents
+
+- [MEP-PARALLEL: Auto-parallel](#mep-parallel-auto-parallel)
+  - [Table of Contents](#table-of-contents)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+  - [Proposal](#proposal)
+    - [User Stories](#user-stories)
+      - [Parallelizing training for general DNNs](#parallelizing-training-for-general-dnns)
+  - [Design Details](#design-details)
+    - [Test Plan](#test-plan)
+  - [Implementation History](#implementation-history)
+  - [Drawbacks](#drawbacks)
+  - [Alternatives](#alternatives)
+  - [References](#references)
+
+## Summary
+
+Auto-parallel is a functionality built into MindSpore that automatically parallelizes the training of giant DNN models. While keeping DNN descriptions identical to their single-device counterparts, it algorithmically finds a good partitioning strategy for the given model.
+
+## Motivation
+
+It is increasingly important to find efficient ways to train giant DNN models in parallel. However, multiple factors affect the choice of parallel paradigm (Data-parallel, Model-parallel, and Hybrid-parallel), including the size of the training dataset, the size of the DNN model, the graph structure of the DNN model, the specification of the hardware accelerators, etc. Different combinations of these factors favor different parallel paradigms. It is therefore desirable to have a system that takes these factors into account and produces an efficient parallelization strategy for the given DNN model.
+
+### Goals
+
+- Ease of use: the desired system should provide user-friendly interfaces. Ideally, the parallel implementation is completely transparent to users.
+- Good parallel speedup: the desired system should consistently find an efficient parallelization strategy.
+
+### Non-Goals
+
+- None
+
+## Proposal
+
+Auto-parallel aims to automatically find an efficient parallelization strategy for any DNN model while keeping the DNN description the same as its single-device counterpart. To do so, it captures the data flow graph defined by a DNN model and partitions the graph after evaluating the cost of different partitioning strategies. The **Cost model** provides the mechanism to estimate the cost under a given strategy. **Tensor partitioning** and **Pipelined model-parallel** are two paradigms for implementing Model-parallel.
+
+- **Cost model.**
+  The choice among strategies depends on their associated costs. Cost here is defined as the training iteration time, which includes both computation and communication time; in most cases computation and communication can be overlapped. Instead of spending engineering effort to measure the cost of every operator under every possible strategy, the cost is estimated from the semantics of the operator and its inputs (a toy sketch of such an estimate is given after this list).
+
+- **Tensor partitioning.**
+  Tensor partitioning is one paradigm for implementing Model-parallel. In this paradigm, each tensor in the DNN model is partitioned into slices. The data flow graph obtained by each device is symmetric, meaning that the sequences of operators assigned to the devices are the same. The problem is to find an efficient partitioning strategy for each tensor.
+
+- **Pipelined model-parallel.** This is another paradigm for implementing Model-parallel. In this paradigm, operators are assigned to different devices, while each operator itself is not partitioned. To address the resulting low resource utilization, multiple training iterations can be active at the same time so that different batches are pipelined. The problem is to find a partitioning of the data flow graph.
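+
+To make the estimation idea above concrete, the following is a minimal, hypothetical sketch (not MindSpore code) of how a per-operator cost under a given partitioning strategy could be derived from the operator's semantics. All names (`DeviceSpec`, `matmul_cost`), the strategy encoding, and the alpha-beta communication term are illustrative assumptions.
+
+```python
+# Toy sketch only (not MindSpore code): estimate the iteration cost of one
+# operator under a partitioning strategy, using the operator's semantics and
+# a simple alpha-beta model for communication.
+from dataclasses import dataclass
+
+
+@dataclass
+class DeviceSpec:
+    flops: float      # peak FLOP/s of a single device
+    bandwidth: float  # inter-device bandwidth, bytes/s
+    latency: float    # per-message latency, seconds
+
+
+def matmul_cost(m, k, n, strategy, dev, bytes_per_elem=2):
+    """Cost of an (m x k) @ (k x n) matmul under strategy = (row_cut, col_cut, k_cut).
+
+    row_cut / col_cut / k_cut are the numbers of slices along the row dimension
+    of the first input, the column dimension of the second input, and the
+    shared reduction dimension, respectively.
+    """
+    row_cut, col_cut, k_cut = strategy
+    comp = 2.0 * m * k * n / (row_cut * col_cut * k_cut) / dev.flops
+    if k_cut == 1:
+        comm = 0.0  # reduction dimension intact: no all-reduce needed
+    else:
+        out_bytes = (m // row_cut) * (n // col_cut) * bytes_per_elem
+        comm = dev.latency + out_bytes / dev.bandwidth  # rough all-reduce term
+    # Computation and communication can usually be overlapped, so the cost is
+    # closer to the maximum of the two than to their sum.
+    return max(comp, comm)
+
+
+dev = DeviceSpec(flops=15e12, bandwidth=25e9, latency=5e-6)
+for strategy in [(8, 1, 1), (1, 8, 1), (2, 2, 2)]:
+    print(strategy, matmul_cost(4096, 4096, 4096, strategy, dev))
+```
+
+Comparing the printed values for the three assumed strategies also shows why the choice cannot be made per operator in isolation: the chosen strategy determines the layouts of the surrounding tensors and therefore the redistribution costs between adjacent operators.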
+
+### User Stories
+
+#### Parallelizing training for general DNNs
+
+Since the DNNs used in different areas differ significantly, it is challenging to design a universal system that always finds the best parallelization strategy for a given DNN model. The Cost model and the two paradigms above are the key components under consideration. Proposals concerning these components, as well as other designs that address the parallelization problem, are welcome.
+
+## Design Details
+
+Auto-parallel consists of four main components:
+
+- **Parallel model.** It provides the tensor layout (how each tensor is partitioned among devices; see the sketch after this list), the distributed operator (the distributed counterpart of an operator), the distributed auto-grad (how to automatically generate derivatives of the distributed operators), etc.
+- **Cost model.** It provides interfaces for estimating the cost of a distributed operator given a partitioning strategy, as well as the cost of a tensor redistribution.
+- **Parallel strategy search.** Given the data flow graph, it returns a parallelization strategy for each operator, using the evaluations of the **Cost model**.
+- **Parallel partition.** Given the graph in which every operator is marked with a strategy, it partitions the tensors involved in each operator and inserts the primitives necessary to guarantee the correctness of the partitioned operators.
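+
+As a rough illustration of the tensor layout mentioned in the **Parallel model** item, the sketch below shows one possible encoding based on a device matrix and a tensor map. The class name, fields, and encoding are assumptions for illustration only, not the actual MindSpore internals.
+
+```python
+# Toy sketch only (not the MindSpore data structures): a tensor layout records
+# how a tensor's dimensions are mapped onto a logical device matrix.
+from dataclasses import dataclass
+from typing import Tuple
+
+
+@dataclass(frozen=True)
+class TensorLayout:
+    device_matrix: Tuple[int, ...]  # logical device arrangement, e.g. (2, 4)
+    tensor_map: Tuple[int, ...]     # per tensor dim: index of the device-matrix
+                                    # dim it is split along, or -1 if not split
+    shape: Tuple[int, ...]          # global shape of the tensor
+
+    def slice_shape(self) -> Tuple[int, ...]:
+        """Shape of the slice held by a single device."""
+        out = []
+        for dim, mapped in zip(self.shape, self.tensor_map):
+            cut = 1 if mapped == -1 else self.device_matrix[mapped]
+            out.append(dim // cut)
+        return tuple(out)
+
+
+# A (1024, 512) weight on a 2 x 4 device matrix: rows split across the first
+# device dimension, columns replicated on every device.
+layout = TensorLayout(device_matrix=(2, 4), tensor_map=(0, -1), shape=(1024, 512))
+print(layout.slice_shape())  # (512, 512)
+```
+
+When two adjacent operators require different layouts for the same tensor, the **Parallel partition** step must insert a tensor redistribution, whose cost is in turn estimated by the **Cost model**.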
+
+### Test Plan
+
+There are two types of testing strategies in Auto-parallel:
+
+- **Unit test.** Every design for Auto-parallel should guarantee the correctness of each partitioned operator.
+
+- **System test.** Every effective design should be tested on at least one real DNN model to confirm that the searched strategy indeed leads to efficient performance. The Auto-parallel module provides some verification and performance tests.
+
+## Implementation History
+
+- Support a preliminary implementation of the **Parallel model** and **Parallel partition** modules.
+- Support an algorithm for the **Parallel strategy search** module.
+- Support a preliminary implementation of the **Cost model**.
+- Enhance the **Parallel partition** and **Parallel strategy search** modules to support searching for efficient strategies for a series of ResNet models.
+- Enhance the **Cost model** module to precisely characterize memory cost, computation cost, and communication cost.
+
+## Drawbacks
+
+- Currently, the strategy found by Auto-parallel may not always lead to the best iteration time. This is mainly because the **Cost model** may improperly estimate the execution cost of operators and entire models.
+
+## Alternatives
+
+- Mesh-TF[1] partitions the data flow graph to minimize the memory consumption of each device. However, Mesh-TF may be suboptimal in end-to-end iteration time, since it misses tensor redistribution strategies between adjacent operators. OptCNN[2] and Tofu[3] include tensor redistribution strategies, but they have problems when the data flow graph has a complex structure. The algorithm in the current Auto-parallel considers tensor redistributions and is able to deal with complex graph structures.
+- The above works follow the **Tensor partitioning** paradigm, while GPipe[4] and PipeDream[5] are two implementations of the **Pipelined model-parallel** paradigm. We are considering a hybrid design that combines these two paradigms (a toy illustration of a pipelined schedule follows this list).
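+
+As a toy illustration of the pipelined schedule referenced in the last bullet above (not the GPipe/PipeDream algorithms themselves, and not MindSpore code), the sketch below prints which micro-batch each pipeline stage processes at each time step; the stage and micro-batch counts are arbitrary assumptions.
+
+```python
+# Toy sketch only: a GPipe-style forward schedule showing how splitting a batch
+# into micro-batches keeps several pipeline stages busy at once.
+def pipeline_schedule(num_stages: int, num_microbatches: int):
+    """For each time step, return the micro-batch index each stage works on."""
+    steps = []
+    for t in range(num_stages + num_microbatches - 1):
+        steps.append([t - s if 0 <= t - s < num_microbatches else None
+                      for s in range(num_stages)])
+    return steps
+
+
+for t, row in enumerate(pipeline_schedule(num_stages=4, num_microbatches=4)):
+    print(f"t={t}: " + " | ".join("m%d" % mb if mb is not None else " . " for mb in row))
+# Only the middle steps keep all stages busy; using more micro-batches per
+# iteration shrinks the idle "bubble" at the start and end of the pipeline.
+```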
+
+## References
+
+- [1] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-TensorFlow: Deep Learning for Supercomputers. NeurIPS '18.
+- [2] Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks. ICML '18.
+- [3] Minjie Wang, Chien-chin Huang, and Jinyang Li. Supporting Very Large Models Using Automatic Dataflow Graph Partitioning. EuroSys '19.
+- [4] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. NeurIPS '19.
+- [5] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP '19.
\ No newline at end of file
diff --git a/design/meps/mep-parallel/auto-parallel-components.png b/design/meps/mep-parallel/auto-parallel-components.png
new file mode 100644
index 0000000..13a8bb5
Binary files /dev/null and b/design/meps/mep-parallel/auto-parallel-components.png differ
diff --git a/sigs/README.md b/sigs/README.md
index 4e47658..4263547 100644
--- a/sigs/README.md
+++ b/sigs/README.md
@@ -27,3 +27,4 @@ in the mailing list. SIG artifacts can be found in the current repository.
 | AKG | This SIG is responsible for the development of MindSpore auto kernel generator. |
 | MSLITE | This SIG is responsible for the development of MindSpore lite. |
 | MDP | This SIG is responsible for the development of MindSpore programming library for Bayesian deep learning. |
+| Parallel | This SIG is responsible for the development of MindSpore's functionality of automatically finding an efficient parallel strategy for DNN training and inference. |
diff --git a/sigs/parallel/README.md b/sigs/parallel/README.md
new file mode 100644
index 0000000..5e8f5f7
--- /dev/null
+++ b/sigs/parallel/README.md
@@ -0,0 +1,24 @@
+# MindSpore Parallel Special Interest Group (SIG)
+
+This is the working repository for the Parallel Special Interest Group (SIG). This repository contains all the artifacts, materials, meeting notes, and proposals regarding **Auto-parallel**, **Model-parallel**, **Pipelined model-parallel**, **Tensor partitioning**, and the **Cost model**. Feedback and contributions are welcome.
+
+1. **Auto-parallel**: The sizes of popular DNN models keep growing, so it is desirable to automatically find an efficient way to parallelize the execution (training and inference) of these giant DNNs. This is the ultimate goal of this SIG.
+2. **Model-parallel**: Unlike Data-parallel, in which each device holds the entire model during training, Model-parallel partitions the model across the available devices, so that each device holds a slice of the entire model. Model-parallel is a more suitable approach for training giant models.
+3. **Pipelined model-parallel**: This is one paradigm for implementing Model-parallel. It assigns the operators of a DNN model to different devices so that different training batches can be pipelined.
+4. **Tensor partitioning**: This is another paradigm for implementing Model-parallel. It partitions the tensors of each operator in a DNN model so that the devices obtain *symmetric* sequences of sliced operators.
+
+# SIG Leads
+
+* Cheng Li (University of Science and Technology of China)
+
+# Logistics
+
+* SIG leads will drive the meeting.
+* Meeting announcements will be posted in our Gitee channel: https://gitee.com/mindspore/community/tree/master/sigs/parallel
+* Feedback and topic requests are welcome from all.
+
+# Discussion
+
+* Slack channel: https://app.slack.com/client/TUKCY4QDR/CUZ3FESNS?cdn_fallback=2
+* Documents and artifacts: https://gitee.com/mindspore/community/tree/master/sigs/parallel
+
+# Meeting notes
\ No newline at end of file
diff --git a/sigs/parallel/docs/design-template.md b/sigs/parallel/docs/design-template.md
new file mode 100644
index 0000000..e69de29
diff --git a/sigs/parallel/meetings/meeting-template.md b/sigs/parallel/meetings/meeting-template.md
new file mode 100644
index 0000000..0c85d75
--- /dev/null
+++ b/sigs/parallel/meetings/meeting-template.md
@@ -0,0 +1,14 @@
+# Thursday Aug 20, 2020 at 21:30 GMT+8
+
+## Agenda
+
+## Conference links
+
+## Attendees
+* Tom (Huawei)
+
+## Notes
+* TODO
+
+## Action items
+* TODO