Add ExecutionPlan design. #6078

helinwang · 2017-11-30T03:29:40Z

No description provided.

QiJune · 2017-11-30T05:47:27Z

paddle/framework/framework.proto

+message ExecutionPlan {
+  optional ProgramDesc program = 1;
+  repeated OpPlacement op_placement = 2;
+}


So, how do we find the correspondence between an OpPlacement in ExecutionPlan and an OpDesc in ProgramDesc?
Are the number and order of operators in ExecutionPlan and ProgramDesc the same?

The number will be the same: each OP will have one placement. The order does not have to be the same; otherwise the "name" field in OpPlacement would not be necessary.

The structure is Program{Block{Op}}: a Program has many blocks, and a block has many ops.

However, the Program has many operator placements, so we cannot get a one-to-one map from this data structure.

@reyoung

> However, the Program has many operator placements, so we cannot get a one-to-one map from this data structure.

Sorry, I don't fully get this point. I thought different OPs have different names?
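
To make the name-based matching discussed above concrete, here is a minimal sketch. It assumes every OP carries a program-wide unique `name` (the thread only says this would be added to the PR), and the dataclasses are hypothetical stand-ins for the protobuf messages, not the real generated classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OpDesc:
    name: str  # assumption: unique across all blocks of the program
    type: str


@dataclass
class BlockDesc:
    ops: List[OpDesc] = field(default_factory=list)


@dataclass
class ProgramDesc:
    blocks: List[BlockDesc] = field(default_factory=list)


@dataclass
class OpPlacement:
    name: str
    device: str


def match_placements(program: ProgramDesc,
                     placements: List[OpPlacement]) -> Dict[str, str]:
    """Return {op name: device}; raise if the mapping is not one-to-one."""
    # Flatten Program{Block{Op}} into a single list of OP names.
    op_names = [op.name for block in program.blocks for op in block.ops]
    if len(set(op_names)) != len(op_names):
        raise ValueError("OP names are not unique across blocks")
    device_of: Dict[str, str] = {}
    for p in placements:
        if p.name in device_of:
            raise ValueError(f"duplicate placement for {p.name}")
        device_of[p.name] = p.device
    if set(device_of) != set(op_names):
        raise ValueError("placements and OPs do not match one-to-one")
    return device_of


program = ProgramDesc(blocks=[
    BlockDesc(ops=[OpDesc("fc_0", "mul"), OpDesc("relu_0", "relu")]),
    BlockDesc(ops=[OpDesc("sgd_0", "sgd")]),
])
# The placement order differs from the OP order; the name field resolves it.
placements = [OpPlacement("sgd_0", "cpu"),
              OpPlacement("relu_0", "gpu0"),
              OpPlacement("fc_0", "gpu0")]
device_of = match_placements(program, placements)
```

With unique names, the flat placement list maps one-to-one onto the OPs regardless of which block they live in; without unique names the lookup is ambiguous, which is the concern raised above.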

QiJune · 2017-11-30T06:03:51Z

doc/design/program.md

@@ -2,7 +2,7 @@

 ## Compile and Execution

-A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
+A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.


ExecutionPlan does not depend on the optimizer. An inference ProgramDesc can also have an ExecutionPlan.

When do we decide the device on which an operator runs? In the current code, an operator has a CPU kernel and a GPU kernel, and the kernel is chosen by the place of the DeviceContext, so the device is actually decided at run time.
Since ExecutionPlan is a proto message that stores device information, the device must be decided at compile time.

In short, we still have two parts, compile time and run time. At compile time, we generate two proto messages: the first is ProgramDesc and the second is ExecutionPlan.
The ExecutionPlan is determined by the user's configuration and Paddle's own automatic device placement policy. If the user switches to another hardware environment and does not want to provide an ExecutionPlan, Paddle can generate one using its automatic device placement policy.

An interface could be:

void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);

The user_config could be null.
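
As a rough illustration of the proposed interface, the sketch below shows one way an optional user configuration could override an automatic placement policy. It is written in Python with plain dicts for brevity; `generate_execution_plan`, the device strings, and the "prefer the first GPU" policy are all assumptions made for this example, not Paddle's API or policy.

```python
from typing import Dict, List, Optional


def generate_execution_plan(op_names: List[str],
                            available_devices: List[str],
                            user_config: Optional[Dict[str, str]] = None
                            ) -> Dict[str, str]:
    """Map each OP name to a device; user_config entries take priority."""
    user_config = user_config or {}  # user_config may be None (null)
    # Toy automatic policy (an assumption): first GPU if any, else CPU.
    default = next((d for d in available_devices if d.startswith("gpu")),
                   "cpu")
    plan: Dict[str, str] = {}
    for name in op_names:
        device = user_config.get(name, default)
        if device not in available_devices:
            raise ValueError(f"{name}: device {device!r} is not available")
        plan[name] = device
    return plan


ops = ["fc_0", "relu_0", "sgd_0"]
# No user configuration: every OP follows the automatic policy.
auto_plan = generate_execution_plan(ops, ["cpu", "gpu0"])
# The user pins one OP; the rest still follow the automatic policy.
pinned_plan = generate_execution_plan(ops, ["cpu", "gpu0"], {"sgd_0": "cpu"})
```

Because placement is computed from the list of available devices, moving to different hardware only requires calling the function again with a new device list.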

> ExecutionPlan does not depend on the optimizer. An inference ProgramDesc can also have an ExecutionPlan.

Agreed, I will find a better name for the optimizer.

> the kernel is chosen by the place of the DeviceContext, so the device is actually decided at run time.

Understood, but I think deciding at run time leaves us no way to control where an OP is placed. Being able to control it is very important.

> In short, we still have two parts, compile time and run time. At compile time, we generate two proto messages: the first is ProgramDesc and the second is ExecutionPlan.

Agreed.

> An interface could be: void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);

The user_config should be part of ProgramDesc, since ProgramDesc describes what the user wants.

dzhwinter · 2017-11-30T06:52:16Z

doc/design/program.md

@@ -2,7 +2,7 @@

 ## Compile and Execution

-A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
+A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.


We'd better switch Optimizer to another term. Our Python API already has an Optimizer.

Same view as @dzhwinter. Just my personal opinion: this Optimizer does not actually optimize. Compare the four steps for running a C program: COMPILER -> ASSEMBLER -> LINKER -> LOADER. How about renaming optimizer to assembler?

Generating ExecutionPlan is exactly like gcc's -O option.

We probably do not need a separate C++ class to optimize the graph, since we can just add a member function OptimizeProgram to the Executor class. The ExecutionPlan object should also be a member of Executor, so that calling executor.run executes the optimized graph.

Also, ExecutionPlan does not need to be a protobuf message.

Please refer to #6141. I think ProgramDesc alone is not enough to run a network; we also need to provide the device type and data type for each operator. Exposing these interfaces to users is necessary, even though the Paddle framework could provide a solution.

@dzhwinter @Yancey1989

> We'd better switch Optimizer to another term. Our Python API already has an Optimizer.

Agreed, we need a better name. Thanks @Yancey1989! Assembler is a good candidate!

@typhoonzero I think whoever generates the ExecutionPlan from ProgramDesc should have the global view: the global program desc and the number of devices. Different ExecutionPlans are then sent to different nodes. The Executor, on the other hand, runs locally and does not know about the devices on other nodes.

@QiJune Thanks. I agree that we will need to enable manual placement configuration by the user, and that this configuration should be in ProgramDesc. At the same time, ExecutionPlan should have placement information too. ProgramDesc and ExecutionPlan are two different things with different focuses; it is fine for them to have similar fields, and it is not duplication.

@Yancey1989 @helinwang Sorry, I did not quite get the point of the name Assembler. If this name is used, what are the Compiler/Linker/Loader in PaddlePaddle?

@zealoct Thank you! I have changed the name to Planner. Do you think it conveys the meaning correctly?

dzhwinter · 2017-11-30T06:52:57Z

doc/design/program.md

@@ -15,7 +15,7 @@ optimize(cost)
 train(cost, reader=mnist.train())
 ```

-The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message.  The last line runs it.
+The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message.  The last line optimizes and runs it.


Maybe optimizes => transforms?

dzhwinter · 2017-11-30T06:53:51Z

doc/design/program.md

+
+### Optimizer
+
+The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:


typo ExcutionPlan

dzhwinter · 2017-11-30T06:56:09Z

doc/design/program.md

+
+The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it.
+
+For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information.


It may be clearer if we add:
ProgramDesc describes the device-independent computing process, while ExecutionPlan describes the device-related computing process.

> indicate which device the OP will run

Missing an "on" before "which" (=

dzhwinter · 2017-11-30T06:59:54Z

paddle/framework/framework.proto

+  optional string name = 1;
+  optional string device = 2;
+}
+


Maybe we can add a detailed example in a comment.

message OpPlacement {
  // pserver:gpu0
  optional string name = 1;
  optional string device = 2;
}

I think the "pserver" in "pserver:gpu0" is not necessary; the executor does not need to know what role (e.g., pserver) it takes. Maybe "gpu0" alone is sufficient.

I'm a bit confused about what the name and device values would be at runtime. Can you give an example?

name should be the name of the OP (every OP should have a name); I will add this to the PR.
device should be something like "gpu0" or "cpu".

dzhwinter · 2017-11-30T07:00:47Z

doc/design/program.md

+### Optimizer
+
+The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
+1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`.


typo prgram and avaiable

dzhwinter · 2017-11-30T07:04:43Z

doc/design/program.md

+
+The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
+1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`.
+1. Optimizes the program according to the avaiable devices.


I am a little confused by "the available devices".
Which part should own the Optimizer module: the cluster or the client program?

Especially in elastic deep learning, if the user requests between 5 and 10 nodes, how should we generate the ExecutionPlan?

For distributed training, the Optimizer module (not a good name; maybe assembler, transformer, or something else would be better) should be in a binary running in the cluster. For local training, the module should be compiled locally.

dzhwinter · 2017-11-30T07:08:01Z

doc/design/program.md

+    For example, add data parallelism by spliting the input mini-batches and replicating the OPs onto different GPUs. Note that even if the OPs are replicated on different GPUs, there is still only **one** execution plan. One executor runs and only runs one `ExecutionPlan`.
+1. Place each OP onto available devices, the placement information is written in the `ExecutionPlan`.
+1. In distributed training, split the `ExecutionPlan` into multiple `ExecutionPlans` and add send/recv OP between them. For local training, this step is not necessary since there is only one executor.
+1. Send the `ExecutionPlan` to the executor for execution.


Still the same question as above: on a local machine with multiple GPUs, which module should send the ExecutionPlan?

Please see https://github.com/PaddlePaddle/Paddle/pull/6078/files#r154270688 , does it answer your question?
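
The splitting step in the design text above (one plan split into several, with send/recv OPs added between them) could look roughly like the following. This is a toy sketch under stated assumptions: `Op`, `split_plan`, the node names, and the naming of the generated send/recv OPs are all hypothetical, and the sketch handles a single pass over an ordered OP list rather than a full training loop.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Op(NamedTuple):
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def split_plan(ops: List[Op], node_of: Dict[str, str]) -> Dict[str, List[Op]]:
    """Split an ordered OP list into one list per node, adding send/recv."""
    producer_node: Dict[str, str] = {}  # variable -> node that produced it
    sub_plans = defaultdict(list)       # node -> ordered OP list
    transferred = set()                 # (variable, destination) already sent
    for op in ops:
        node = node_of[op.name]
        for var in op.inputs:
            src = producer_node.get(var)
            crosses = src is not None and src != node
            if crosses and (var, node) not in transferred:
                # The variable lives on another node: pair a send with a recv.
                sub_plans[src].append(Op(f"send_{var}_to_{node}", (var,), ()))
                sub_plans[node].append(Op(f"recv_{var}_from_{src}", (), (var,)))
                transferred.add((var, node))
        sub_plans[node].append(op)
        for var in op.outputs:
            producer_node[var] = node
    return dict(sub_plans)


ops = [
    Op("fc_0", ("x", "w"), ("h",)),
    Op("grad_0", ("h",), ("w_grad",)),
    Op("sgd_0", ("w", "w_grad"), ("w",)),
]
node_of = {"fc_0": "node0", "grad_0": "node0", "sgd_0": "node1"}
sub_plans = split_plan(ops, node_of)
node0_ops = [op.name for op in sub_plans["node0"]]
node1_ops = [op.name for op in sub_plans["node1"]]
```

In this toy run the gradient crosses from node0 to node1, so node0 gains a send OP and node1 gains the matching recv OP. For local training every OP maps to the same node and no send/recv OPs are generated.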

QiJune · 2017-11-30T08:00:49Z

paddle/framework/framework.proto

+
+message OpPlacement {
+  optional string name = 1;
+  optional string device = 2;


I am also wondering whether device info for operators is enough.
In TensorFlow, tf.Variable is actually an operator, and tf.Tensor has an operator data member. TensorFlow is a graph of operators, so device info on operators is enough.
But we have both variables and operators. Do we need device info for variables? Do we need another VarPlacement?

In the most common cases (add/sub/relu...), the output variable's device is the same as the operator's device.

For control-related and LoD-related operators, the operator device is always the CPU, and the output variable is on the CPU too.

If we get training data or parameters using a load or initialize operator, the variable's device is the same as that operator's device.

If we get training data or parameters using a Python reader, the variable's device needs to be set manually.

So there is only one case in which we must set the device for a variable. In all other cases, the variable's device can be derived from the operator's device info.

@QiJune Thanks, great question. I guess we need VarPlacement only if we use an explicit OP for copying data from CPU to GPU?

> If we get training data or parameters using a Python reader, the variable's device needs to be set manually.

Isn't the data initially on the CPU and copied to the GPU implicitly when needed? Since we don't do explicit copies, maybe we don't need VarPlacement?
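
The cases listed in this thread suggest that variable devices can mostly be derived from operator devices. A minimal sketch of that derivation follows; `Op`, `infer_var_devices`, and the example OP and variable names are hypothetical, and the only manually supplied entries are the variables fed by a Python reader.

```python
from typing import Dict, List, NamedTuple, Tuple


class Op(NamedTuple):
    name: str
    outputs: Tuple[str, ...]


def infer_var_devices(ops: List[Op],
                      op_device: Dict[str, str],
                      fed_var_device: Dict[str, str]) -> Dict[str, str]:
    """Derive each variable's device from the OP that produces it."""
    # Variables fed from a Python reader have no producing OP, so their
    # device is the one case that must be given explicitly.
    var_device = dict(fed_var_device)
    for op in ops:
        for var in op.outputs:
            var_device[var] = op_device[op.name]
    return var_device


ops = [
    Op("load_0", ("w",)),      # load operator: output follows its device
    Op("fc_0", ("h",)),        # common compute operator
    Op("while_0", ("cond",)),  # control operator, placed on the CPU
]
op_device = {"load_0": "gpu0", "fc_0": "gpu0", "while_0": "cpu"}
var_device = infer_var_devices(ops, op_device, {"image": "cpu"})
```

If implicit CPU-to-GPU copies are kept, as the reply above suggests, even the reader-fed entry could default to the CPU and a separate VarPlacement message might not be needed.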

Yancey1989 · 2017-11-30T12:33:48Z

doc/design/program.md

@@ -2,7 +2,7 @@

 ## Compile and Execution

-A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
+A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.


Same view as @dzhwinter. Just my personal opinion: this Optimizer does not actually optimize. Compare the four steps for running a C program: COMPILER -> ASSEMBLER -> LINKER -> LOADER. How about renaming optimizer to assembler?

Yancey1989 · 2017-11-30T12:40:53Z

doc/design/program.md

+
+The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it.
+
+For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information.


Maybe we should also update the description of OP placement: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/refactor/parameter_server.md#graph-converter. In the newer design, OP placement includes both device and trainer/pserver information.

Thanks! Will do.

Do we have two independent concepts, trainer and pserver, or only one concept, worker, whose role is decided by the subgraph it receives?

Only one concept, worker, and its role is decided by the subgraph it receives.

reyoung · 2017-12-03T08:54:55Z

@helinwang @typhoonzero
I have two concerns about ExecutionPlan.

Is ExecutionPlan a general plan for every kind of executor, or just for the multi-node or cluster executor? I think there may be many kinds of ExecutionPlan, since we may have many kinds of executors. Each kind of executor may have its own kind of plan.

The other concern applies if we do not want a general plan for every kind of executor. In that case the execution plan might be an internal data structure of the executor, and whether to use protobuf depends on whether it is convenient. For example, if we implement our multi-node executor in Go, there is no need to create a protobuf message as the execution plan, because Go can serialize and deserialize its own structs.

helinwang · 2017-12-04T00:39:35Z

@reyoung Thanks for reviewing!

> since we may have many kinds of executors

The executor here is the C++ implementation, not the Python executor. (The Python executor is more like a TensorFlow session: it is a gateway to the C++ executor that runs the ProgramDesc. Sorry about the naming confusion.)

For the Python executor, we can probably have different kinds, such as a local executor and a remote executor.

I think we should have just one C++ executor implementation; multiple nodes should run the same executor implementation as a single node. Having multiple executors would probably make the code very hard to maintain and optimize (e.g., every executor needs updating whenever a fix or optimization is made), and I don't see much benefit.

The reason for using protobuf is just the convenience of serialization when sending the ExecutionPlan between nodes.

typhoonzero · 2017-12-04T03:20:23Z

@reyoung @helinwang

> Is ExecutionPlan a general plan for every kind of executor, or just for the multi-node or cluster executor? I think there may be many kinds of ExecutionPlan, since we may have many kinds of executors. Each kind of executor may have its own kind of plan.

If we keep one ExecutionPlan format, we can have multiple IR optimizers for different purposes and run a series of optimizations on the IR to get the final ExecutionPlan, e.g. IR->Transpliter->MultiGPU->KernelFusion->MemorySavior
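
A chain of optimizers over a single plan format can be sketched as function composition. The example below is illustrative only: the plan is a toy dict from device name to OP list, and `multi_gpu`, `fuse_mul_add`, and `run_passes` are hypothetical names standing in for stages such as MultiGPU and KernelFusion.

```python
from functools import reduce
from typing import Callable, Dict, List

Plan = Dict[str, List[str]]  # toy plan: device name -> ordered OP types
Pass = Callable[[Plan], Plan]


def multi_gpu(gpus: List[str]) -> Pass:
    """Build a pass that replicates the OPs under "cpu" onto every GPU."""
    def run(plan: Plan) -> Plan:
        return {gpu: list(plan["cpu"]) for gpu in gpus}
    return run


def fuse_mul_add(plan: Plan) -> Plan:
    """Replace each adjacent ("mul", "add") pair with a single "fc" OP."""
    fused: Plan = {}
    for device, ops in plan.items():
        out: List[str] = []
        for op in ops:
            if op == "add" and out and out[-1] == "mul":
                out[-1] = "fc"
            else:
                out.append(op)
        fused[device] = out
    return fused


def run_passes(plan: Plan, passes: List[Pass]) -> Plan:
    """Apply each pass in order; every pass takes and returns a plan."""
    return reduce(lambda current, apply: apply(current), passes, plan)


final_plan = run_passes({"cpu": ["mul", "add", "relu"]},
                        [multi_gpu(["gpu0", "gpu1"]), fuse_mul_add])
```

Because every pass consumes and produces the same format, passes can be reordered, added, or dropped without changing the executor.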

typhoonzero · 2017-12-04T03:38:01Z

paddle/framework/framework.proto

+}
+
+message ExecutionPlan {
+  optional ProgramDesc program = 1;


The pserver and trainer may use different ProgramDescs, so it seems one field is not enough?

In the new design, the pserver and trainer will be exactly the same binary (executor); the only thing that differs is the ExecutionPlan they run. The Planner will know the roles of the different executors (e.g., pserver role, trainer role) to help generate the ExecutionPlans.

typhoonzero · 2017-12-04T03:38:56Z

paddle/framework/framework.proto

+
+message OpPlacement {
+  optional string name = 1;
+  optional string device = 2;


Why not put the device field in ProgramDesc directly?

Maybe we also need to allow users to specify the device information in two ways:

A device ID such as CPU:0/GPU:0.

A maximal device count such as CPU:{5}.

ProgramDesc is used to specify information from the user. Since we currently don't have an API for that, we probably should not put this information into ProgramDesc.

In the future, when we have that API, we can add it to ProgramDesc.

It's fine to add it to ProgramDesc now: since we are not using this field yet, we then won't need further changes to the protobuf files.

@typhoonzero I think ProgramDesc and ExecutionPlan are used for different purposes. ProgramDesc is the output from Python, specifying what the user needs. ExecutionPlan is the input and output of the IR optimizers and the input of the executor. So they had better be two separate entities.

Since there are two entities, ProgramDesc and ExecutionPlan, and device placement is about optimization rather than about what the user specified, it probably belongs in ExecutionPlan, not ProgramDesc.

In the future, when we want to let the user configure which device an OP runs on, we can put a field indicating the device in ProgramDesc.

Maybe I need to change ExecutionPlan so that it no longer depends on ProgramDesc:

message ExecutionPlan {
  repeated BlockDesc blocks = 1;
  repeated OpPlacement op_placement = 2;
}

What do you think?

Yancey1989 · 2017-12-04T03:49:33Z

> If we keep one ExecutionPlan format, we can have multiple IR optimizers for different purposes and run a series of optimizations on the IR to get the final ExecutionPlan, e.g. IR->Transpliter->MultiGPU->KernelFusion->MemorySavior

A sequence of optimizers that generates the final ExecutionPlan is a good idea. We can also add a Copy OP to copy memory between CPU and GPU when we have both kinds of devices.

dzhwinter · 2017-12-04T04:07:54Z

There is one more concern.
Currently we only have the cluster design for multiple nodes. Should the multi-GPU design be the same as the cluster one? And what should it look like for multiple nodes that each have multiple GPUs?

helinwang · 2017-12-04T06:09:47Z

@dzhwinter Yes, I think we need a unified solution; otherwise there are too many code paths to develop and maintain. The ExecutionPlan should work for single-node multi-GPU too.

putcn · 2017-12-04T19:45:04Z

doc/design/program.md

+(CPU/single GPU/multiple GPU/multiple nodes), with the following
+requirements:
+
+1. It should be programming language agnostic. Currently, we have a


Should there be a way to export ProgramDesc so that users can share it, like export(cost, SAVE_TO_PATH)? How are we going to differentiate saving the algorithm (ProgramDesc) from saving the model?

I think a model should be saved as two separate parts, the ProgramDesc and the weights, so that the weights can be reused with different ProgramDescs.
Saving the model may not be strictly related to this PR; we can discuss it further in a separate issue if we wish :)

agree, thanks:)

putcn · 2017-12-04T19:50:59Z

doc/design/program.md

+The `ExecutionPlan` contains all the details of running the program,
+including which device each OP is placed on. One `Executor` could have
+multiple devices (e.g, CPU, GPUs), but it runs only one
+`ExecutionPlan`. In distributed training there will be `n`


The available devices for distributed training are dynamic. Should this plan be regenerated every time the available devices change (device added/removed/updated)? How are we going to deploy it efficiently?

Yes, it should be regenerated every time the available devices change. Currently, distributed training can use a constant number of trainers/pservers, which I think is a good starting point.

helinwang · 2018-01-03T19:58:16Z

After several discussions, we reached the conclusion that we no longer need the execution plan; the internal representation and the input for the executor will be ProgramDesc.

helinwang requested review from reyoung, Yancey1989, wangkuiyi, putcn, emailweixu, gongweibao, typhoonzero and QiJune November 30, 2017 03:29

Add ExecutionPlan design.

617b8f6

helinwang force-pushed the execution_plan branch from 9a038d1 to 617b8f6 Compare November 30, 2017 03:30

helinwang requested a review from dzhwinter November 30, 2017 04:42

QiJune reviewed Nov 30, 2017

dzhwinter reviewed Nov 30, 2017

QiJune reviewed Nov 30, 2017

Yancey1989 reviewed Nov 30, 2017

Update ExecutionPlan design doc.

ab3e54c

helinwang force-pushed the execution_plan branch 2 times, most recently from 22fc63c to ab3e54c Compare December 4, 2017 02:09

fix typo

15c1f4c

typhoonzero reviewed Dec 4, 2017

Update style

d11c4cd

typhoonzero mentioned this pull request Dec 4, 2017

Execute the program with multi threads #6223

Merged

putcn reviewed Dec 4, 2017

Make ExecutionPlan no longer depends on ProgramDesc

a27bac6

This was referenced Dec 6, 2017

Executor: need multiple thread support #6319

Closed

Compile time - runtime separation / single node multiple GPU milestones #6359

Closed

helinwang closed this Jan 3, 2018

QiJune mentioned this pull request Feb 28, 2018

new design of Parallel Do #8631

Closed

