Add ExecutionPlan design. #6078

Closed
wants to merge 5 commits into from
61 changes: 59 additions & 2 deletions doc/design/program.md
@@ -2,7 +2,7 @@

## Compile and Execution

A PaddlePaddle program consists of two parts -- the first generates a `ProgramDesc` protobuf message that describes the program, and the second runs this message using a C++ class `Executor`.
A PaddlePaddle program consists of three parts -- the first generates a `ProgramDesc` protobuf message that describes the program, the second plans this message using a C++ class `Planner` and generates an `ExecutionPlan` protobuf message, and the third runs the `ExecutionPlan` using a C++ class `Executor`.

A simple example PaddlePaddle program can be found in [graph.md](./graph.md):

@@ -15,7 +15,64 @@ optimize(cost)
train(cost, reader=mnist.train())
```

The first five lines of the following PaddlePaddle program generates, or, compiles, the `ProgramDesc` message. The last line runs it.
The first five lines of the PaddlePaddle program above generate,
or compile, the `ProgramDesc` message. The last line runs it by
generating the `ExecutionPlan` and sending it to the `Executor` for
execution.


### ProgramDesc

The `ProgramDesc` describes the computation specified by the user. It
is the same regardless of which devices the program runs on
(CPU, single GPU, multiple GPUs, or multiple nodes), and it must meet
the following requirements:

1. It should be programming language agnostic. Currently, we have a
Contributor

Should there be a way of exporting the ProgramDesc so that users can share it, like export(cost, SAVE_TO_PATH)? How are we going to differentiate saving the algorithm (ProgramDesc) from saving the model?

Contributor Author

I think a model should be saved as two separate parts: the ProgramDesc and the weights, so that the weights can be re-used with different ProgramDescs.
Maybe saving the model is not strictly related to this PR; we can discuss it further in a separate issue if we wish :)

Contributor

agree, thanks:)

Python API that generates the `ProgramDesc`, but we could add the
support for other languages later.

1. It should **not** describe anything that is not specified by the
   user. For example:
   1. The OPs for the backward pass added by PaddlePaddle.
   1. Any optimizations to the program.
   1. OP placement information that is not specified by the user.


### ExecutionPlan

The `ExecutionPlan` contains all the details of running the program,
including which device each OP is placed on. One `Executor` can have
multiple devices (e.g., CPU, GPUs), but it runs only one
`ExecutionPlan`. In distributed training there will be `n`
Contributor

The available devices for distributed training are dynamic. Should this plan be regenerated every time the available devices change (device added/removed/updated)? How are we going to deploy it efficiently?

Contributor Author

Yes, this should be regenerated every time the available devices change. Currently in distributed training we can have a constant number of trainers/pservers, which I think is a good starting point.

`ExecutionPlan`s for `n` `Executor`s, which jointly complete the
`ProgramDesc` specified by the user.


### Planner

The planner takes the `ProgramDesc` as input and outputs the
`ExecutionPlan`. The steps are:

1. Add necessary OPs that are not specified by the user to the
   `ProgramDesc`. E.g., add the backward pass.

1. Prune the unnecessary computations from the `ProgramDesc`.

1. Transform the `ProgramDesc` given the available devices. E.g., add
   data parallelism by splitting the input mini-batches and
   replicating the OPs onto different GPUs.

1. Generate the `ExecutionPlan` by placing each OP onto an available
   device. The placement information is written into the
   `ExecutionPlan`.

1. In distributed training, split the `ExecutionPlan` into multiple
   `ExecutionPlan`s and add send/recv OPs between them. For local
   training, this step is not necessary since there is only one
   executor.

1. Send the `ExecutionPlan` to the executor for execution.
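
The planner steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PaddlePaddle implementation: `Op`, `add_backward`, `plan`, the `_grad` naming scheme, and the round-robin placement policy are all hypothetical, and the pruning, data-parallel transformation, and distributed splitting steps are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Op:
    name: str  # every OP is assumed to have a unique name
    type: str


@dataclass
class ExecutionPlan:
    ops: List[Op] = field(default_factory=list)
    # Maps OP name -> device string such as "cpu" or "gpu0".
    op_placement: Dict[str, str] = field(default_factory=dict)


def add_backward(ops: List[Op]) -> List[Op]:
    """Hypothetical step 1: append one gradient OP per forward OP, in reverse order."""
    grads = [Op(name=op.name + "_grad", type=op.type + "_grad")
             for op in reversed(ops)]
    return ops + grads


def plan(program_ops: List[Op], devices: List[str]) -> ExecutionPlan:
    """Turn the user-specified OPs into a plan with explicit placement."""
    ops = add_backward(program_ops)
    # Hypothetical placement policy: assign OPs round-robin over the devices.
    placement = {op.name: devices[i % len(devices)] for i, op in enumerate(ops)}
    return ExecutionPlan(ops=ops, op_placement=placement)


user_ops = [Op("fc_0", "fc"), Op("softmax_0", "softmax")]
execution_plan = plan(user_ops, devices=["gpu0", "gpu1"])
for op in execution_plan.ops:
    print(op.name, "->", execution_plan.op_placement[op.name])
```

Note that the user-facing list (`user_ops`) is left untouched: the backward OPs and the placement exist only in the plan, which matches the requirement that the `ProgramDesc` contain nothing the user did not specify.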


## Programs and Blocks

10 changes: 10 additions & 0 deletions paddle/framework/framework.proto
@@ -143,3 +143,13 @@ message BlockDesc {
// https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/program.md
// for more details.
message ProgramDesc { repeated BlockDesc blocks = 1; }

message OpPlacement {
  optional string name = 1;
  optional string device = 2;
Member

I am also wondering whether device info for the Operator is enough.
In TensorFlow, tf.Variable is actually an operator, and tf.Tensor has an operator data member. TensorFlow is a graph of operators, so device info on the operator is enough.
But we have both variables and operators. Do we need device info for Variable? Do we need another VarPlacement?

Member

@QiJune QiJune Nov 30, 2017

  • In the most common cases (add/sub/relu...), the output variable's device is the same as the operator's device.
  • For control-related and LoD-related operators, the operator's device is always the CPU, and the output variable is on the CPU too.
  • If we get training data/parameters using a load operator/initialize operator, the variable's device is the same as that operator's device.
  • If we get training data/parameters using a Python reader, the variable's device needs to be set manually.

So there is only one case in which we need to set the device for a variable. In the other cases, the variable's device can be derived from the operator's device info.
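
The rules in this comment could be expressed roughly as follows. This is only a sketch of the reasoning above: the function name `infer_var_device`, the contents of `CPU_ONLY_OP_TYPES`, and the plain-string device names are illustrative assumptions, not part of the PaddlePaddle API.

```python
from typing import Optional

# Hypothetical set of control-flow and LoD-related OP types that always run on the CPU.
CPU_ONLY_OP_TYPES = {"while", "cond", "lod_reset"}


def infer_var_device(producer_op_type: Optional[str],
                     producer_op_device: Optional[str],
                     manual_device: Optional[str] = None) -> str:
    """Return the device of a variable given the OP that produces it.

    producer_op_type is None when the variable is fed by a Python reader;
    that is the one case where the device has to be set manually.
    """
    if producer_op_type is None:
        if manual_device is None:
            raise ValueError("a variable fed by a Python reader needs an explicit device")
        return manual_device
    if producer_op_type in CPU_ONLY_OP_TYPES:
        return "cpu"
    if producer_op_device is None:
        raise ValueError("the producing OP has no device assigned")
    # Common case (add/sub/relu, load, initialize): same device as the OP.
    return producer_op_device


print(infer_var_device("relu", "gpu0"))
print(infer_var_device("while", "gpu0"))
print(infer_var_device(None, None, manual_device="cpu"))
```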

Contributor Author

@QiJune thanks, great question. I guess we need VarPlacement only if we use an explicit OP for copying data from CPU to GPU?

Contributor Author

> If we get training data/parameters using a Python reader, the variable's device needs to be set manually.

Isn't the data initially on the CPU and copied to the GPU implicitly when needed? Since we don't do explicit copies, maybe we don't need VarPlacement.

Contributor

Why not put device field in ProgramDesc directly?

Contributor

Maybe we also need to allow users to specify the device information, using one of two approaches:

  • A device ID such as CPU:0/GPU:0.
  • The maximal device count such as CPU:{5}.
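
As a rough illustration of the two forms proposed here, the sketch below parses both into one structure. The exact grammar is an assumption inferred from the two examples (`CPU:0`/`GPU:0` and `CPU:{5}`); `DeviceSpec` and `parse_device_spec` are hypothetical names, and no such API exists in the PR.

```python
import re
from typing import NamedTuple, Optional


class DeviceSpec(NamedTuple):
    kind: str                  # "CPU" or "GPU"
    device_id: Optional[int]   # set for "GPU:0"-style specs
    max_count: Optional[int]   # set for "CPU:{5}"-style specs


# Assumed grammar: KIND:<id> or KIND:{<max count>}.
_SPEC_RE = re.compile(r"^(CPU|GPU):(?:(\d+)|\{(\d+)\})$")


def parse_device_spec(spec: str) -> DeviceSpec:
    """Parse a device string in either of the two proposed forms."""
    match = _SPEC_RE.match(spec)
    if match is None:
        raise ValueError("unrecognized device spec: %r" % spec)
    kind, device_id, max_count = match.groups()
    return DeviceSpec(
        kind=kind,
        device_id=int(device_id) if device_id is not None else None,
        max_count=int(max_count) if max_count is not None else None,
    )


print(parse_device_spec("GPU:0"))
print(parse_device_spec("CPU:{5}"))
```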

Contributor Author

The ProgramDesc is used to specify the information from the user. Since we currently don't have an API for that, we probably should not put that information into ProgramDesc.

In the future when we have that API we can add it to ProgramDesc.

Contributor

It's fine to add it to ProgramDesc even though we are not using this field for now; that way we won't need further changes to the protobuf files later.

Contributor Author

@helinwang helinwang Dec 5, 2017

@typhoonzero I think ProgramDesc and ExecutionPlan serve different purposes. ProgramDesc is the output from Python, specifying what the user needs. ExecutionPlan is the input and output for IR optimizers, and the input for the executor. So they had better be two separate entities.

Since there are two entities, ProgramDesc and ExecutionPlan, and device placement is about optimization rather than about what the user specified, it probably belongs in ExecutionPlan, not ProgramDesc.

In the future, when we want to enable the user to configure which device an OP runs on, we can put a field indicating the device in ProgramDesc.

Maybe I need to change ExecutionPlan to the following (so that it no longer depends on ProgramDesc):

message ExecutionPlan {
  repeated BlockDesc blocks = 1; 
  repeated OpPlacement op_placement = 2;
}

What do you think?

}

Contributor

Maybe we can add a detailed example in a comment.

message OpPlacement {
  // pserver:gpu0
  optional string name = 1;
  optional string device = 2;
}

Contributor Author

I think the "pserver" in "pserver:gpu0" is not necessary: the executor does not need to know what role (e.g., pserver) it takes. Maybe "gpu0" alone is sufficient.

Contributor

I'm a bit confused about what the name and device values would be at runtime. Can you give an example?

Contributor Author

@helinwang helinwang Dec 4, 2017

name should be the name of the OP (every OP should have a name); I will add this to the PR.
device should be something like "gpu0" or "cpu".

message ExecutionPlan {
  repeated BlockDesc blocks = 1;
  repeated OpPlacement op_placement = 2;
}
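
To make the shape of these two messages concrete, here is a small Python mirror of them and a lookup an executor might build from `op_placement`. It is a sketch under assumptions: plain dataclasses stand in for the generated protobuf classes, `blocks` holds placeholder dicts instead of real `BlockDesc` messages, and `placement_index` is a hypothetical helper, not code from this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OpPlacement:
    name: str    # name of the OP
    device: str  # e.g. "gpu0" or "cpu"


@dataclass
class ExecutionPlan:
    blocks: List[dict] = field(default_factory=list)  # stand-in for BlockDesc
    op_placement: List[OpPlacement] = field(default_factory=list)


def placement_index(plan: ExecutionPlan) -> Dict[str, str]:
    """Build an OP-name -> device lookup, rejecting duplicate OP names."""
    index: Dict[str, str] = {}
    for placement in plan.op_placement:
        if placement.name in index:
            raise ValueError("duplicate placement for OP %r" % placement.name)
        index[placement.name] = placement.device
    return index


plan = ExecutionPlan(op_placement=[
    OpPlacement("fc_0", "gpu0"),
    OpPlacement("while_0", "cpu"),
])
index = placement_index(plan)
print(index["fc_0"])
```

Because `op_placement` is a repeated field keyed only by the `name` string, the lookup relies on OP names being unique within a plan, which is why the helper rejects duplicates.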