From 617b8f6d728a025242d2a8acde0036936bb1629d Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Wed, 29 Nov 2017 18:02:32 -0800 Subject: [PATCH 1/5] Add ExecutionPlan design. --- doc/design/program.md | 20 ++++++++++++++++++-- paddle/framework/framework.proto | 10 ++++++++++ 2 files changed, 28 insertions(+), 2 deletions(-) diff --git a/doc/design/program.md b/doc/design/program.md index bd2456787c4e3..571ad07cbd35a 100644 --- a/doc/design/program.md +++ b/doc/design/program.md @@ -2,7 +2,7 @@ ## Compile and Execution -A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`. +A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`. A simple example PaddlePaddle program can be found in [graph.md](./graph.md): @@ -15,7 +15,7 @@ optimize(cost) train(cost, reader=mnist.train()) ``` -The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line runs it. +The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line optimizes and runs it. ## Programs and Blocks @@ -120,6 +120,22 @@ message AttrDesc { } ``` +## ProgramDesc and ExecutionPlan + +The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it. + +For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). 
On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information. + +### Optimizer + +The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are: +1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`. +1. Optimizes the program according to the avaiable devices. + For example, add data parallelism by spliting the input mini-batches and replicating the OPs onto different GPUs. Note that even if the OPs are replicated on different GPUs, there is still only **one** execution plan. One executor runs and only runs one `ExecutionPlan`. +1. Place each OP onto available devices, the placement information is written in the `ExecutionPlan`. +1. In distributed training, split the `ExecutionPlan` into multiple `ExecutionPlans` and add send/recv OP between them. For local training, this step is not necessary since there is only one executor. +1. Send the `ExecutionPlan` to the executor for execution. + ## InferShape With this design, the InferShape function should take the following parameters: diff --git a/paddle/framework/framework.proto b/paddle/framework/framework.proto index f1fc4529e1550..4d892028770d5 100644 --- a/paddle/framework/framework.proto +++ b/paddle/framework/framework.proto @@ -143,3 +143,13 @@ message BlockDesc { // https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/program.md // for more details. 
message ProgramDesc { repeated BlockDesc blocks = 1; } + +message OpPlacement { + optional string name = 1; + optional string device = 2; +} + +message ExecutionPlan { + optional ProgramDesc program = 1; + repeated OpPlacement op_placement = 2; +} From ab3e54c2eb926d33f483bfbdd115138572c90b8f Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Sun, 3 Dec 2017 18:05:06 -0800 Subject: [PATCH 2/5] Update ExecutionPlan design doc. --- doc/design/program.md | 81 +++++++++++++++++++++++++++++++++---------- 1 file changed, 63 insertions(+), 18 deletions(-) diff --git a/doc/design/program.md b/doc/design/program.md index 571ad07cbd35a..174032455ea03 100644 --- a/doc/design/program.md +++ b/doc/design/program.md @@ -2,7 +2,7 @@ ## Compile and Execution -A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`. +A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second plans this message using a C++ class `Planner` and generates an `ExecutionPlan` protobuf message, and the third runs the message using a C++ class `Executor`. A simple example PaddlePaddle program can be found in [graph.md](./graph.md): @@ -15,7 +15,68 @@ optimize(cost) train(cost, reader=mnist.train()) ``` -The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line optimizes and runs it. +The first five lines of the above PaddlePaddle program generate, +or compile, the `ProgramDesc` message. The last line runs it by +generating the `ExecutionPlan` and sending it to the `Executor` for +execution.
+ + + + + + + +### ProgramDesc + +The `ProgramDesc` describes the computation specified by the user, with +the following requirements: + +1. It should be programming language agnostic. Currently we have a +Python API that generates the `ProgramDesc`, but we could add the +support for other languages later. + +1. It should **not** describe anything that is not specified by the + user. For example: + 1. The OPs for the backward pass added by PaddlePaddle + 1. Any optimizations to the program. + 1. OP placement information that is not specified by the user. + + +### ExecutionPlan + +The `ExecutionPlan` contains all the details of running the program, +including which device each OP is placed on. One `Executor` could have +mutilple devices (e.g, CPU, GPUs), but it runs only one +`ExecutionPlan`. In distributed training there will be `n` +`ExecutionPlan` for `n` `Executor`, which jointly complete the +`ProgramDesc` specified by the user. + + +### Planner + +The planner takes `ProgramDesc` as the input and outputs the +`ExecutionPlan`; the steps are: + +1. Add necessary OPs that are not specified by the user to the + `ProgramDesc`. E.g., add the backward pass. + +1. Prune the unnecessary computations from the `ProgramDesc`. + +1. Transform the `ProgramDesc` given the available devices. E.g., add + data parallelism by spliting the input mini-batches and replicating + the OPs onto different GPUs. + +1. Generate `ExecutionPlan` by placing each OP onto available devices; + the placement information is written in the `ExecutionPlan`. + +1. In distributed training, split the `ExecutionPlan` into multiple + `ExecutionPlans` and add send/recv OPs between them. For local + training, this step is not necessary since there is only one + executor. + +1. Send the `ExecutionPlan` to the executor for execution.
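The planner steps above can be sketched end to end in Python. Everything here is illustrative only: `ProgramDesc` and `ExecutionPlan` are modeled as plain dataclasses, and the `plan` function, the `_grad` suffix convention, and the round-robin placement are hypothetical stand-ins, not the actual PaddlePaddle C++ `Planner` API:

```python
# Illustrative sketch of the planner pipeline; all names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProgramDesc:
    ops: List[str] = field(default_factory=list)  # user-specified OPs only

@dataclass
class ExecutionPlan:
    ops: List[str] = field(default_factory=list)
    op_placement: Dict[str, str] = field(default_factory=dict)  # OP name -> device

def plan(program: ProgramDesc, devices: List[str]) -> List[ExecutionPlan]:
    # Step 1: add OPs the user did not specify, e.g. the backward pass.
    ops = program.ops + [op + "_grad" for op in reversed(program.ops)]
    # Step 2: prune unnecessary computations (no-op in this sketch).
    # Step 3: transform for the available devices, e.g. replicate OPs for
    # data parallelism (omitted here for brevity).
    # Step 4: place each OP onto a device, recording it in the plan.
    exec_plan = ExecutionPlan(ops=ops)
    for i, op in enumerate(ops):
        exec_plan.op_placement[op] = devices[i % len(devices)]
    # Step 5: for distributed training the plan would be split into one
    # plan per executor, with send/recv OPs inserted between them; local
    # training needs only the single plan returned below.
    return [exec_plan]
```

Steps 2, 3, and 5 are deliberately stubbed; the point is only the shape of the pipeline, i.e. one `ProgramDesc` in, one or more `ExecutionPlan`s with placement information out.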
+ ## Programs and Blocks @@ -120,22 +181,6 @@ message AttrDesc { } ``` -## ProgramDesc and ExecutionPlan - -The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it. - -For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information. - -### Optimizer - -The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are: -1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`. -1. Optimizes the program according to the avaiable devices. - For example, add data parallelism by spliting the input mini-batches and replicating the OPs onto different GPUs. Note that even if the OPs are replicated on different GPUs, there is still only **one** execution plan. One executor runs and only runs one `ExecutionPlan`. -1. Place each OP onto available devices, the placement information is written in the `ExecutionPlan`. -1. In distributed training, split the `ExecutionPlan` into multiple `ExecutionPlans` and add send/recv OP between them. For local training, this step is not necessary since there is only one executor. -1. Send the `ExecutionPlan` to the executor for execution. 
- ## InferShape With this design, the InferShape function should take the following parameters: From 15c1f4c82c1b341a0d59f2b647a85df960331e44 Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Sun, 3 Dec 2017 18:29:33 -0800 Subject: [PATCH 3/5] fix typo --- doc/design/program.md | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-) diff --git a/doc/design/program.md b/doc/design/program.md index 174032455ea03..158649bf16ae5 100644 --- a/doc/design/program.md +++ b/doc/design/program.md @@ -21,18 +21,14 @@ generating the `ExecutionPlan` and sending it to the `Executor` for execution. - - - - - - ### ProgramDesc -The `ProgramDesc` describes the computation specified by the user, with -the following requirements: +The `ProgramDesc` describes the computation specified by the user; it +will be the same regardless of which devices the program runs on +(CPU/single GPU/multiple GPU/multiple nodes), with the following +requirements: -1. It should be programming language agnostic. Currently we have a + 1. It should be programming language agnostic. Currently, we have a Python API that generates the `ProgramDesc`, but we could add the support for other languages later. @@ -47,7 +43,7 @@ support for other languages later. The `ExecutionPlan` contains all the details of running the program, including which device each OP is placed on. One `Executor` could have -mutilple devices (e.g, CPU, GPUs), but it runs only one +multiple devices (e.g., CPU, GPUs), but it runs only one `ExecutionPlan`. In distributed training there will be `n` `ExecutionPlan` for `n` `Executor`, which jointly complete the `ProgramDesc` specified by the user. @@ -64,8 +60,8 @@ The planner takes `ProgramDesc` as the input and outputs the 1. Prune the unnecessary computations from the `ProgramDesc`. 1. Transform the `ProgramDesc` given the available devices. E.g., add - data parallelism by spliting the input mini-batches and replicating - the OPs onto different GPUs.
data parallelism by splitting the input mini-batches and + replicating the OPs onto different GPUs. 1. Generate `ExecutionPlan` by placing each OP onto available devices; the placement information is written in the `ExecutionPlan`. From d11c4cd1707f7431b58eec35fbebe2ef8b6d2f4a Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Sun, 3 Dec 2017 19:48:47 -0800 Subject: [PATCH 4/5] Update style --- doc/design/program.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/program.md b/doc/design/program.md index 158649bf16ae5..00425765a2816 100644 --- a/doc/design/program.md +++ b/doc/design/program.md @@ -28,7 +28,7 @@ will be the same regardless of which devices the program runs on (CPU/single GPU/multiple GPU/multiple nodes), with the following requirements: - 1. It should be programming language agnostic. Currently, we have a +1. It should be programming language agnostic. Currently, we have a Python API that generates the `ProgramDesc`, but we could add the support for other languages later. From a27bac65e5601b7a1e85dec7b26860c8cdce7dc7 Mon Sep 17 00:00:00 2001 From: Helin Wang Date: Tue, 5 Dec 2017 13:44:19 -0800 Subject: [PATCH 5/5] Make ExecutionPlan no longer depend on ProgramDesc --- paddle/framework/framework.proto | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/paddle/framework/framework.proto b/paddle/framework/framework.proto index 4d892028770d5..d6f455b58129d 100644 --- a/paddle/framework/framework.proto +++ b/paddle/framework/framework.proto @@ -150,6 +150,6 @@ message OpPlacement { } message ExecutionPlan { - optional ProgramDesc program = 1; + repeated BlockDesc blocks = 1; repeated OpPlacement op_placement = 2; }
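After the last patch, `ExecutionPlan` carries `BlockDesc`s directly instead of embedding a `ProgramDesc`, so it no longer depends on that message. The resulting message shapes can be mirrored in a plain-Python sketch. The field names follow the proto above, but the dataclasses, the simplified `OpDesc`/`BlockDesc` stand-ins, and the `device_of` helper with its `"CPU"` default are assumptions for illustration, not generated protobuf code:

```python
# Plain-Python mirror of the final proto layout; names follow the proto
# above, but everything else here is an illustrative assumption.
from dataclasses import dataclass, field
from typing import List

@dataclass
class OpDesc:  # stand-in; the real OpDesc message has many more fields
    type: str = ""

@dataclass
class BlockDesc:  # simplified; the real message also has idx/parent/vars
    ops: List[OpDesc] = field(default_factory=list)

@dataclass
class OpPlacement:
    name: str = ""
    device: str = ""

@dataclass
class ExecutionPlan:
    # After PATCH 5/5 the plan holds blocks directly rather than a
    # nested ProgramDesc, decoupling the two messages.
    blocks: List[BlockDesc] = field(default_factory=list)
    op_placement: List[OpPlacement] = field(default_factory=list)

def device_of(exec_plan: ExecutionPlan, op_name: str) -> str:
    # Linear scan mirrors the repeated-field semantics of the proto.
    for p in exec_plan.op_placement:
        if p.name == op_name:
            return p.device
    return "CPU"  # assumed default when no placement is recorded
```

Trading the nested `ProgramDesc` for a repeated `blocks` field duplicates the block-list layout across the two messages, but it lets the executor consume an `ExecutionPlan` without depending on the `ProgramDesc` definition at all.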