Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ExecutionPlan design. #6078

Closed
wants to merge 5 commits into from

Conversation

helinwang
Copy link
Contributor

No description provided.

message ExecutionPlan {
optional ProgramDesc program = 1;
repeated OpPlacement op_placement = 2;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, how to find the correspondence between OpPlacement in ExecutionPlan and OpDesc in ProgramDesc?
Are the number and order of operators in ExecutionPlace and ProgramDesc the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number will be the same, each OP will have one placement. The order does not have to be the same, otherwise the "name" field in OpPlacement is not necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Program{Block{Op}}. A Program has many blocks. A block has many ops.

However, the Program has many operator placements. We cannot get a one-to-one map by this data structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyoung

However, the Program has many operator placements. We cannot get a one-to-one map by this data structure.

Sorry I don't fully get this point, I thought different OPs have different names?

@@ -2,7 +2,7 @@

## Compile and Execution

A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.
Copy link
Member

@QiJune QiJune Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • ExecutionPlan is not dependent on optimizer. In an inference ProgramDesc, we can also have a ExecutionPlan.
  • In which time we can decide the device where an operator runs? In current code, an operator has CPU kernel and GPU kernel. At running time, the kernel is decided by the place of DeviceContext. Actually, it's decided in running time.
    Since we have ExecutionPlan which is a proto message storing device information, the device must be decided at compile time.

In a word, we still have two parts, compile-time and run-time. At compile-time, we will generate two proto message, the first is ProgramDesc and the second is ExecutionPlan.
The ExecutionPlan is set by users' configuration and Paddle's own auto device placement policy. If user switch to another hardware environment, and he/she do not want to provide a ExecutionPlan, Paddle can generate a ExecutionPlan under Paddle's own auto device placement policy.

An interface could be:

void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);

The user_config could be null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExecutionPlan is not dependent on optimizer. In an inference ProgramDesc, we can also have a ExecutionPlan.

Agree, will find a better name for optimizer.

the kernel is decided by the place of DeviceContext. Actually, it's decided in running time.

Understand, but I think deciding at runtime make us no way to control where to place the OP. Being able to control it is very important.

In a word, we still have two parts, compile-time and run-time. At compile-time, we will generate two proto message, the first is ProgramDesc and the second is ExecutionPlan.

Agree.

An interface could be: void GenerateExecutionPlan(const ProgramDesc& input, OpDeviceMap* user_config, ExecutionPlan* output);

The user_config should be part of ProgramDesc, since ProgramDesc describes what the user wants.

@@ -2,7 +2,7 @@

## Compile and Execution

A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd better switch Optimizer to another term. Our python already have the Optimizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as @dzhwinter , just my personal view, this Optimizer does not do the optimize, like the four steps which run a C program, COMPILER -> ASSEMBLER -> LINKER -> LOADER. How about convert optimizer -> assembler?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generating ExecutionPlan is exactly like gcc's -O option.

We probably do not need a single C++ class to optimize the graph since we can just create a member function OptimizeProgram in the Executor class. ExecutionPlan object should also be the member of Executor, so that we call executor.run is executing optimized graph.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, ExecutionPlan is no need to be a protobuf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to #6141 , I think that ProgramDesc is not enough to run a network. We also need to provide Device Type and Data Type for each operator. Exposing these interface to users is necessary, even though paddle framework could provide a solution.

Copy link
Contributor Author

@helinwang helinwang Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dzhwinter @Yancey1989

We'd better switch Optimizer to another term. Our python already have the Optimizer.

Agree we need a better naming, thank @Yancey1989 ! Assembler is a good name candidate!

@typhoonzero I think whoever generates the ExecutionPlan from ProgramDesc should have the global view: the global program desc, and the number of devices. And different ExecutionPlans are sent to different nodes. On the other hand, Executor runs locally, it does not know the devices on other nodes.

@QiJune thanks, agree that we will need enable user's manual placement configuration, and that configuration should be in ProgramDesc. At the same time, ExecutionPlan should have placement information too. ProgramDesc and ExecutionPlan are two different things with different focus, it's fine for them to have similar fields, it's not duplication.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yancey1989 @helinwang Sorry that I did not quite get the point of the name Assembler, if this name is to be used, what is Compiler/Linker/Loader in PaddlePaddle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zealoct thank you! I have changed the name to Planner, do you think it conveys the means correctly?

@@ -15,7 +15,7 @@ optimize(cost)
train(cost, reader=mnist.train())
```

The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line runs it.
The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line optimizes and runs it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe optimizes => transform ?


### Optimizer

The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo ExcutionPlan


The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it.

For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be more clear if we add
ProgramDesc describe the device independent computing process, but the ExecutionPlan describe the device related computing process

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indicate which device the OP will run

missing an on before which (=

optional string name = 1;
optional string device = 2;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add a detail example in comment.

message OpPlacement {
   // pserver:gpu0
   optional string name = 1;
   optional string device = 2;
 }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "pserver" in "pserver:gpu0" is not necessary, the executor does not need to know what role (e.g., pserver) it takes. Maybe only "gpu0" is sufficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit confused how would name and device values be at runtime, can you give an example?

Copy link
Contributor Author

@helinwang helinwang Dec 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name should be the name of the OP (every OP should have a name), will add this into the PR.
device should be something like: "gpu0", "cpu".

### Optimizer

The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo prgram and avaiable


The optimizer takes `ProgramDesc` as the input and outputs the `ExcutionPlan`, the steps are:
1. Add the prgram in `ProgramDesc` and the coresponding backward pass program into the `ExecutionPlan`.
1. Optimizes the program according to the avaiable devices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused at the avaiable devices.
Which part should own the Optimizer module? The cluster or the client program?

Especially in the Elastic DeepLearning, if the user request for nodes in a range 5-10, how should we generate the ExecutionPlan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimizer module (not a good name, maybe assembler/transformer/? would be better) should be in a binary running in the cluster for distributed training. For local training, the module should be compiled locally.

For example, add data parallelism by spliting the input mini-batches and replicating the OPs onto different GPUs. Note that even if the OPs are replicated on different GPUs, there is still only **one** execution plan. One executor runs and only runs one `ExecutionPlan`.
1. Place each OP onto available devices, the placement information is written in the `ExecutionPlan`.
1. In distributed training, split the `ExecutionPlan` into multiple `ExecutionPlans` and add send/recv OP between them. For local training, this step is not necessary since there is only one executor.
1. Send the `ExecutionPlan` to the executor for execution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still the same question above. In a local machine with Multi-GPUs, which module should send the ExecutionPlan ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


message OpPlacement {
optional string name = 1;
optional string device = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also wondering if device info for Operator is enough.
In Tensorflow, tf.Variable is actually an operator, and tf.Tensor has a operator data member. Tensorflow is a graph of operator, so device info in operator is enough.
But we have both variable and operator. Do we need device info for Variable? Do we need another VarPlacement?

Copy link
Member

@QiJune QiJune Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • In most common case(add/sub/relu...), the output variable device is the same with operator device.
  • For control related operators and LoD related operators, the operator device is always CPU. And the output variable is in CPU too.
  • If we get training data/parameter using load operator/initialize operator , the variable device is the same with load operator/initialize operator.
  • If we get training data/parameter using python reader, the variable device need to be set manually.

So, this a only one case which we should set device for variable. For other cases, the variable device can be decided by operator's device info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@QiJune thanks, great question, I guess we need VarPlacement only if we will use explicit OP for copying data from CPU to GPU?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get training data/parameter using python reader, the variable device need to be set manually.

Isn't the data initially CPU, and copied to GPU implicitly when needed, since we don't do explicit copies, maybe we don't need VarPlacement?

@@ -2,7 +2,7 @@

## Compile and Execution

A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second optimizes this message using a C++ class `Optimizer` and generates an `ExecutionPlan` protobuf messages, and the third run the message using a C++ class `Executor`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as @dzhwinter , just my personal view, this Optimizer does not do the optimize, like the four steps which run a C program, COMPILER -> ASSEMBLER -> LINKER -> LOADER. How about convert optimizer -> assembler?


The goal of `ProgramDesc` is to describe **what** the user wants to calculate, and the goal of `ExecutionPlan` is to specify **how** to calculate it.

For example, the `ExecutionPlan` has OP placement information to indicate which device the OP will run, but the `ProgramDesc` does not have this information since currently our Python API does not support manually pinning an OP onto a type of device (e.g., GPU or FPGA). On the other hand, the `ProgramDesc` should have information about if an OP belongs to an optimizer, this information is provided by the user and helps to place the OPs onto the parameter servers, but the `ExecutionPlan` does not have this information.
Copy link
Contributor

@Yancey1989 Yancey1989 Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also updte the describe of Op placement: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/refactor/parameter_server.md#graph-converter, in the newer design, op placement includ device and trainer/pserver information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will do.

Copy link
Member

@jacquesqiao jacquesqiao Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have two independent concepts: trainer & pserver, or we only have one concept worker and the role is decided by the subgraph it receives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only have one concept worker and the role is decided by the subgraph it receives.

@reyoung
Copy link
Collaborator

reyoung commented Dec 3, 2017

@helinwang @typhoonzero
I just have two concerns about ExecutionPlan.

Is that ExecutionPlan a general plan for every kind executor? Or just multi-node executor or cluster executor? I think there may be many kinds of ExecutionPlan since we may have many kinds of executors. Each kind of the executor may have its own kind of plan.

Another concern is if we do not want a general plan for every kind executor. The execution plan might be the internal data structure of the executor. To use protobuf or not depends on whether it is convenient or not. For example, if we want to implement our multi-node executor in golang, it is no need to create a protobuf message as execution plan because golang can serialize & deserialize its own structs.

@helinwang
Copy link
Contributor Author

helinwang commented Dec 4, 2017

@reyoung Thanks for reviewing!

since we may have many kinds of executors

The executor here is the cpp implementation, not the Python executor (the Python executor is more like TensorFlow session, it's a gateway to the cpp executor that runs the ProgramDesc. Sorry about the naming confusion).

The Python executor we probably can have different kind executors, local executor and remote executor.

I think we should just have one cpp executor implementation, multiple nodes should run the same executor implementation as single node. Having multiple executor probably makes code very hard to maintain and optimize (e.g., need to update all executors when a fix/optimization is needed), and I don't see much benefit.

The reason for using protobuf is just for the convenience of serialization when sending the ExecutionPlan between nodes.

@helinwang helinwang force-pushed the execution_plan branch 2 times, most recently from 22fc63c to ab3e54c Compare December 4, 2017 02:09
@typhoonzero
Copy link
Contributor

@reyoung @helinwang

Is that ExecutionPlan a general plan for every kind executor? Or just multi-node executor or cluster executor? I think there may be many kinds of ExecutionPlan since we may have many kinds of executors. Each kind of the executor may have its own kind of plan.

If we keep one ExecutionPlan format, so we can have multiple IR optimizers for different purposes and run a series of optimization for the IR to get the final ExecutionPlan, e.g. IR->Transpliter->MultiGPU->KernelFusion->MemorySavior

}

message ExecutionPlan {
optional ProgramDesc program = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pserver and trainer may use different ProgramDesc, seems one field is not enough?

Copy link
Contributor Author

@helinwang helinwang Dec 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new design the pserver and trainer will be exactly same binary (executor), only thing that is different is the ExecutionPlan they run. The Planner will know the roles of different executors (e.g., pserver role, trainer role) to help generating the ExecutionPlans.


message OpPlacement {
optional string name = 1;
optional string device = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put device field in ProgramDesc directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also need to allow users the specify the device information by two approaches:

  • device ID such as CPU:0/GPU:0.
  • The maximal device count such as CPU:{5}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ProgramDesc is used to specify the information from the user, since currently we don't have API to do that, we probably should not put that information into ProgramDesc.

In the future when we have that API we can add it to ProgramDesc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to add it to ProgramDesc since we are not using this field for now, so then we don't need further changes to the protobuf files.

Copy link
Contributor Author

@helinwang helinwang Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@typhoonzero I think ProgramDesc and ExecutionPlan are used for different purposes, ProgramDesc is the output from Python, specifying what the user need. ExecutionPlan is the input and output for IR optimizers, and input for executor. So they better be two separate entities.

Since there are two entities: ProgramDesc and ExecutionPlan, and the device placement is about optimization, not about what the user specified, it probably should be in ExecutionPlan but not ProgramDesc.

In the future when we want enable the user to configure which device an OP runs, we can put the field indicating device in ProgramDesc.

Maybe I need to change ExecutionPlan to (not depend on ProgramDesc anymore):

message ExecutionPlan {
  repeated BlockDesc blocks = 1; 
  repeated OpPlacement op_placement = 2;
}

What do you think?

@Yancey1989
Copy link
Contributor

If we keep one ExecutionPlan format, so we can have multiple IR optimizers for different purposes and run a series of optimization for the IR to get the final ExecutionPlan, e.g. IR->Transpliter->MultiGPU->KernelFusion->MemorySavior

The sequence of optimizers to generate the final ExecutionPlan is a good idea, we can also add Copy Op to copy the memory between CPU and GPU when we have two kinds of device.

@dzhwinter
Copy link
Contributor

There is one more concern.
Currently, we only have the cluster design of multi-nodes, should the Multi-GPU be same with the cluster ones? What should it be in multi-nodes with Multi-GPU equipment?

@helinwang
Copy link
Contributor Author

@dzhwinter Yes, I think we need a unify solution, otherwise there are too much code path to develop / maintain. The ExecutionPlan should work on single node multiple-GPU too.

(CPU/single GPU/multiple GPU/multiple nodes), with the following
requirements:

1. It should be programming language agnostic. Currently, we have a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be a way of exporting ProgramDesc? so that user can share it, like export(cost, SAVE_TO_PATH)? how we are going to differentiate saving algorithm(ProgramDesc) from saving model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a model should be saved separately: ProgramDesc and the weights. So that the weights can be re-used for different ProgramDescs.
Maybe saving model is not strictly related to this PR, we can discuss more in a separate issue if we wish :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, thanks:)

The `ExecutionPlan` contains all the details of running the program,
including which device each OP is placed on. One `Executor` could have
multiple devices (e.g, CPU, GPUs), but it runs only one
`ExecutionPlan`. In distributed training there will be `n`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

available devices for distributed training are dynamic, should this plan be generated every time when available devices change (device added/removed/updated)? how are we going to efficiently deploy it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this should be generated every time when available devices change. Currently in distributed training we can have a constant number of trainers/pservers, I think it's a good starting point.

@helinwang
Copy link
Contributor Author

After several discussions, we reached conclusion that we no longer need execution plan, the internal representation and the input for the executor will be ProgramDesc.

@helinwang helinwang closed this Jan 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants