
Add paddle.device.cuda.get_device_properties #35661

Merged: 38 commits merged into PaddlePaddle:develop on Sep 28, 2021

Conversation

@Yanxing-Shi (Contributor) commented Sep 10, 2021

PR types

New features

PR changes

APIs

Describe

Add paddle.device.cuda.get_device_properties
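For context, a usage sketch of the new API (hedged: the printed attribute names follow the _gpuDeviceProperties fields discussed later in this thread and are illustrative, not a definitive reference):

    import paddle

    # Only meaningful on a GPU build of PaddlePaddle.
    if paddle.device.is_compiled_with_cuda():
        # Accepts an int, a 'gpu:x' string, or a paddle.CUDAPlace.
        props = paddle.device.cuda.get_device_properties(0)
        print(props.name, props.major, props.minor,
              props.total_memory, props.multi_processor_count)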

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@Yanxing-Shi Yanxing-Shi changed the title Add get device properties Add paddle.get_device_properties Sep 10, 2021
@Yanxing-Shi Yanxing-Shi changed the title Add paddle.get_device_properties 【WIP】Add paddle.get_device_properties Sep 12, 2021
@Yanxing-Shi Yanxing-Shi changed the title 【WIP】Add paddle.get_device_properties [WIP] Add paddle.get_device_properties Sep 12, 2021
@Yanxing-Shi Yanxing-Shi marked this pull request as draft September 12, 2021 10:01
@CLAassistant commented Sep 14, 2021

CLA assistant check
All committers have signed the CLA.

@Yanxing-Shi Yanxing-Shi marked this pull request as ready for review September 15, 2021 02:35
@Yanxing-Shi Yanxing-Shi changed the title [WIP] Add paddle.get_device_properties Add paddle.device.cuda.get_device_properties Sep 15, 2021
@@ -297,6 +314,64 @@ std::vector<int> GetSelectedDevices() {
return devices;
}

#ifdef PADDLE_WITH_CUDA
cudaDeviceProp *GetDeviceProperties(int id) {
std::call_once(init_flag, [&] {
Collaborator:

Too much duplicated code. I think we can unify the CUDA and HIP code using the kind of technique shown in paddle/fluid/platform/type_defs.h.

@@ -39,6 +44,18 @@ DECLARE_uint64(gpu_memory_limit_mb);

constexpr static float fraction_reserve_gpu_memory = 0.05f;

static std::once_flag init_flag;
static std::deque<std::once_flag> device_flags;
static int gpu_num = -1;
Collaborator:

I do not think you should add this global variable gpu_num. As I mentioned offline yesterday, you do not need to record it; you can use device_props.size() below.

static int gpu_num = -1;

#ifdef PADDLE_WITH_CUDA
static std::vector<cudaDeviceProp> device_props;
Collaborator:

device_props -> g_device_props. Here, the prefix g_ marks a global variable.

@@ -39,6 +44,18 @@ DECLARE_uint64(gpu_memory_limit_mb);

constexpr static float fraction_reserve_gpu_memory = 0.05f;

static std::once_flag init_flag;
static std::deque<std::once_flag> device_flags;
Collaborator:

It is too clever to rely on std::deque::resize for non-copyable, non-moveable types.

Although std::deque::resize can accept non-copyable, non-moveable types like std::once_flag (note that std::once_flag is neither copyable nor moveable, see https://en.cppreference.com/w/cpp/thread/once_flag), most of us expect a std::deque element type to be at least moveable. I do not think it is a good idea to use std::deque<std::once_flag>.

Alternatively, you can use std::vector<std::unique_ptr<std::once_flag>>.


std::call_once(device_flags[id], [&] {
cudaDeviceProp device_prop;
PADDLE_ENFORCE_CUDA_SUCCESS(cudaGetDeviceProperties(&device_prop, id));
Collaborator:

You do not need the temporary variable device_prop. You can write more direct code:

PADDLE_ENFORCE_CUDA_SUCCESS(cudaGetDeviceProperties(&device_props[id], id));

@@ -67,6 +67,15 @@ dim3 GetGpuMaxGridDimSize(int);
//! Get a list of device ids from environment variable or use all.
std::vector<int> GetSelectedDevices();

//! Get the properties of device.
#ifdef PADDLE_WITH_CUDA
cudaDeviceProp *GetDeviceProperties(int id);
Collaborator:

I have emphasized the Google coding style for a long time: use a const reference for values that must not change, and a non-const pointer for values that may change.

You should not return a non-const pointer here, because returning a non-const pointer suggests that callers may modify the returned value. However, I do not think cudaDeviceProp should be modified.

@@ -2267,6 +2267,71 @@ All parameter, weight, gradient are variables in Paddle.
#endif
#endif

#ifdef PADDLE_WITH_CUDA
m.def("get_device_properties",
[](int id) -> cudaDeviceProp * {
Collaborator:

Duplicated code for CUDA and HIP.

@@ -24,6 +24,7 @@
'synchronize',
'device_count',
'empty_cache',
'get_device_properties'
Collaborator:

Add a comma at the end.

else:
raise ValueError("device type must be int or paddle.CUDAPlace")

return core.get_device_properties(device_id)
@sneaxiy (Collaborator) commented Sep 15, 2021:

This produces an unfriendly error message. If a user installs the CPU version and calls paddle.device.cuda.get_device_properties(), they get an error like "core" has no attribute "get_device_properties", which is not helpful.
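A minimal sketch of the suggested guard (the message wording follows the reviewer's later suggestion in this thread; the placement is an assumption): check is_compiled_with_cuda() before touching core, so CPU-only installs fail with a clear message instead of an AttributeError:

    def get_device_properties(device=None):
        # Fail early and clearly on CPU-only builds.
        if not core.is_compiled_with_cuda():
            raise ValueError(
                "The API paddle.device.cuda.get_device_properties is not "
                "supported in CPU-only PaddlePaddle. Please reinstall "
                "PaddlePaddle with GPU support to call this API.")
        ...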

@@ -22,6 +22,11 @@ limitations under the License. */
#else
#include "paddle/fluid/platform/dynload/cudnn.h"
#endif

#include <deque>
Collaborator:

Remove the unused headers.

@@ -39,6 +44,10 @@ DECLARE_uint64(gpu_memory_limit_mb);

constexpr static float fraction_reserve_gpu_memory = 0.05f;

static std::once_flag g_init_flag;
static std::vector<std::unique_ptr<std::once_flag>> g_device_flags;
Collaborator:

The variable names g_init_flag and g_device_flags are not readable; nobody can tell what they mean without reading the code that follows.

Try:

  • g_init_flag -> g_device_props_size_init_flag
  • g_device_flags -> g_device_prop_init_flags

@@ -297,6 +306,43 @@ std::vector<int> GetSelectedDevices() {
return devices;
}

const gpuDeviceProp *GetDeviceProperties(int id) {
Collaborator:

Have you read my comment #35661 (comment) carefully? I mean const reference.

@@ -297,6 +306,43 @@ std::vector<int> GetSelectedDevices() {
return devices;
}

const gpuDeviceProp *GetDeviceProperties(int id) {
int gpu_num = 0;
Collaborator:

No need to put this variable outside the std::call_once; this line can be moved inside the following std::call_once.

g_device_flags.resize(gpu_num);
g_device_props.resize(gpu_num);
for (int i = 0; i < gpu_num; ++i) {
g_device_flags[i] = std::unique_ptr<std::once_flag>(new std::once_flag());
Collaborator:

g_device_flags[i] = std::make_unique<std::once_flag>() would be simpler.

"The device id: %d exceeds out of range [0, the number of devices: %d "
"on this machine). Because the device id should be greater than or "
"equal to zero and smaller than the number of gpus. Please input "
"appropriate device again!",
Collaborator:

Your error message would be like:

The device id: 8 exceeds out of range [0, the number of devices: 4 on this machine).

This is:

  • in the wrong mathematical format.
  • grammatically wrong in English: the word exceeds already means out of range!

I would prefer an error message like:

The device id 8 is out of range [0, 4), where 4 is the number of devices on this machine.

from paddle.fluid.wrapped_decorator import signature_safe_contextmanager
from paddle.device import is_compiled_with_cuda
Collaborator:

I guess this line would cause circular dependencies. Are you sure it is right? If you have checked, please tell me.

My thought on why there are circular dependencies:

  • When users import paddle.device.cuda, Python would import python/paddle/device/cuda/__init__.py and reach this line from paddle.device import is_compiled_with_cuda.
  • Then, Python would try to import paddle.device. Inside python/paddle/device/__init__.py, there is the code from . import cuda. So, Python would import python/paddle/device/cuda/__init__.py again. That is why I think the circular dependencies occur.
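One common fix for such a cycle (a sketch, not necessarily what this PR ended up doing) is to defer the import to call time, so both packages finish initializing before either resolves the other's attributes:

    def get_device_properties(device=None):
        # Deferred import: resolved only when the function is called,
        # after paddle.device and paddle.device.cuda are fully loaded.
        from paddle.device import is_compiled_with_cuda
        ...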

'''

place = framework._current_expected_place()
if not isinstance(place, core.CUDAPlace) or not is_compiled_with_cuda():
@sneaxiy (Collaborator) commented Sep 17, 2021:

I do not think that you should check isinstance(place, core.CUDAPlace).

The framework._current_expected_place() may return core.CPUPlace(). Even if users install the GPU version of Paddle, they may still run on CPU instead of GPU.

raise ValueError(
"The current device: {} is not expected. Because paddle.device.cuda."
"get_device_properties only support cuda device Please change device"
"and input device again!".format(place))
Collaborator:

I am confused by this sentence. What is the difference between device and input device?

Return the properties of given CUDA device.

Args:
device(paddle.CUDAPlace() or int or str): The device, the ID of the device
Collaborator:

paddle.CUDAPlace() -> paddle.CUDAPlace

Default: None.

Returns:
_CudaDeviceProperties: the properties of the device which include ASCII string
Collaborator:

_CudaDeviceProperties -> _gpuDeviceProperties? I guess you can fix this in the next PR with test=document_fix.


if not core.is_compiled_with_cuda():
raise ValueError(
"The current device {} is not expected. Because paddle.device.cuda."
Collaborator:

It seems that the error message is not right. It should be something like The API paddle.device.cuda.get_device_properties() is not supported in CPU-only PaddlePaddle. Please reinstall PaddlePaddle with GPU support to call this API.


device_id = -1

if device is not None:
Collaborator:

What if device is None?
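Given the device_id = -1 default in the hunk above, one plausible reading (a sketch; the get_device_id call and the sentinel semantics are assumptions here) is that -1 acts as a sentinel meaning "current device", resolved on the C++ side:

    device_id = -1  # sentinel: the C++ binding substitutes the current device

    if device is not None:
        if isinstance(device, int):
            device_id = device
        elif isinstance(device, core.CUDAPlace):
            device_id = device.get_device_id()

    return core.get_device_properties(device_id)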

raise ValueError(
"The current string {} is not expected. Because paddle.device."
"cuda.get_device_properties only support string which is like 'gpu:x'. "
"Please input appropriat string again!".format(device))
Collaborator:

Misspelled word: appropriat (should be appropriate).
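For reference, a sketch of the 'gpu:x' branch this message guards, with the typo fixed (the parsing details are assumptions, not necessarily the PR's exact code):

    if isinstance(device, str):
        if device.startswith('gpu:'):
            # Extract the integer index from strings like 'gpu:0'.
            device_id = int(device.split(':')[1])
        else:
            raise ValueError(
                "The string {} is not expected. paddle.device.cuda."
                "get_device_properties only supports strings like 'gpu:x'. "
                "Please input an appropriate string.".format(device))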

@sneaxiy (Collaborator) left a comment:

LGTM.

@zhhsplendid (Member) left a comment:

LGTM

@sneaxiy sneaxiy merged commit 4cbed9e into PaddlePaddle:develop Sep 28, 2021
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021
* Initial Commit

* add unittest and add error information

* modify doc

* fix some error

* fix some word

* fix bug cudaDeviceProp* and modify error explanation

* fix cudaDeviceProp* error and unnitest samples

* fix hip error and PADDLE_WITH_HIP

* update style

* fix error is_compiled_with_cuda

* fix paddle.device.cuda.get_device_properties

* fix error for multi thread safe

* update style

* merge conflict

* modify after mentor review

* update style

* delete word

* fix unittest error for windows

* support string input and modify some code

* modify doc to support string input

* fix error for express information

* fix error for express information

* fix unnitest for windows

* fix device.startswith('gpu:')

* format error and doc

* fix after review

* format code

* fix error for doc compile

* fix error for doc compile

* fix error for doc compile

* fix error for doc compile

* fix error for doc compile

* fix py2 error

* fix wrong words and doc

* fix _gpuDeviceProperties