Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UR][L0] Unify use of large allocation in L0 adapter #1099

Merged
merged 2 commits into from
Dec 6, 2023

Conversation

jandres742
Copy link

@jandres742 jandres742 commented Nov 21, 2023

Intel(R) GPUs have two modes of operation in terms of allocations:
Stateful and stateless mode.

Stateful optimizes memory accesses through pointer arithmetic.
This can be done as long as allocations used by the allocation
are smaller than 4GB.

Stateless disables such pointer-arithmetic optimization to
allow the kernel to use allocations larger than 4GB.

Currently, L0 adapter dynamically and automatically requests
the L0 driver large allocations if it detects an allocation size
is larger than 4GB. This creates a problem if a kernel has been
previously compiled for stateful access. This ultimately means
the adapter mixes stateful and stateless behavior, which is not
a user-friendly experience.

This patch aims at correcting this behavior by defining a default
one. On Intel(R) GPUs previous to Intel(R) Data Center GPU Max,
default behavior is now stateless, meaning all allocations are
only allowed by default. Users can opt-in for stateful mode setting
a new environment variable UR_L0_USE_OPTIMIZED_32BIT_ACCESS=1.

Addresses:
https://stackoverflow.com/questions/75621264/sycl-dot-product-code-gives-wrong-results

intel/llvm testing: intel/llvm#11958

@jandres742 jandres742 force-pushed the largeallocations branch 3 times, most recently from 31a5796 to 3482cdf Compare November 22, 2023 16:34
@jandres742
Copy link
Author

@smaslov-intel : please review

@jandres742 jandres742 marked this pull request as ready for review November 22, 2023 17:31
@jandres742 jandres742 requested a review from a team as a code owner November 22, 2023 17:31
static const bool UseLargeAllocations = [this] {
const char *UrRet = std::getenv("UR_L0_ALLOW_LARGE_ALLOCATIONS");
if (!UrRet)
return (this->isPVC() ? true : false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this isPVC check ?
It is not required on PVC.
PVC Level Zero driver does it by default.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichalMrozek : this is so the rest of the adapter has "large allocation behavior".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But PVC is already in large allocation behavior by default.
There is no need to do any special for this device.
There is no need to add compiler option and it is quite dangerous to bypass max memory allocation size limits by default.
By using those variables it indicates that PVC requires some special handling for large allocations which in fact is not needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @MichalMrozek . I have changed code to use defaults on PVC.

// On some Intel GPUs, this influences how kernels are compiled.
// If large allocations (>4GB) are requested, then kernels are
// compiled with stateless access.
// If small allocations (<4GB) are requested, then kernels are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not accurate, even if -ze-opt-greater-than-4GB-buffer-required is not specified, kernels may still be compiled in stateless mode.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I will remove the comment

// If small allocations (<4GB) are requested, then kernels are
// compiled with stateful access, with potential performance
// improvements.
// Some GPUs support only one mode, such us Intel(R) Data Center GPU Max,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel(R) Data Center GPU Max supports both stateful and stateless modes.
Level Zero implementation for Intel(R) Data Center GPU Max allows only stateless mode for this device.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix comment indicating is the driver that has only that uspport.

@jandres742
Copy link
Author

@nrspruit: please review.

Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
@kbenzie kbenzie added the v0.8.x Include in the v0.8.x release label Nov 28, 2023
Intel(R) GPUs have two modes of operation in terms of allocations:
Stateful and stateless mode.

Stateful optimizes memory accesses through pointer arithmetic.
This can be done as long as allocations used by the allocation
are smaller than 4GB.

Stateless disables such pointer-arithmetic optimization to
allow the kernel to use allocations larger than 4GB.

Currently, L0 adapter dynamically and automatically requests
the L0 driver large allocations if it detects an allocation size
is larger than 4GB. This creates a problem if a kernel has been
previously compiled for stateful access. This ultimately means
the adapter mixes stateful and stateless behavior, which is not
a user-friendly experience.

This patch aims at correcting this behavior by defining a default
one. On Intel(R) GPUs previous to Intel(R) Data Center GPU Max,
default behavior is now stateless, meaning all allocations are
only allowed by default. Users can opt-in for stateful mode setting
a new environment variable UR_L0_USE_OPTIMIZED_32BIT_ACCESS=1.

Addresses:
https://stackoverflow.com/questions/75621264/sycl-dot-product-code-gives-wrong-results

Signed-off-by: Jaime Arteaga <jaime.a.arteaga.molina@intel.com>
// ze-opt-greater-than-4GB-buffer-required to disable
// stateful optimizations and be able to use larger than
// 4GB allocations on these kernels.
if (Context->Devices[0]->useOptimized32bitAccess() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in the Exp function and have the Exp function have the Compile code now? That way you check the specific device being passed in and not just the Context->Devices[0] since you might be on a non-uniform system.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @nrspruit . Good idea. However, urProgramCompileExp at this moment is unimplemented, so adding implementation for urProgramCompileExp on top of these changes would make this PR too big. I think it is better we merge this patch, then we add the support for urProgramCompileExp, including using the functionality from this patch. what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that would be fine, a follow-up patch would be good improvement on this.

Copy link
Contributor

@nrspruit nrspruit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me

@jandres742
Copy link
Author

jandres742 commented Dec 1, 2023

@kbenzie : please merge when possible. This needs to be merged on top of #916

@kbenzie kbenzie added the ready to merge Added to PR's which are ready to merge label Dec 4, 2023
@fabiomestre fabiomestre changed the base branch from adapters to main December 5, 2023 16:48
@fabiomestre
Copy link
Contributor

I have updated the target branch of this PR from the adapters branch to the main branch.
Development in UR is moving back to main. The adapters branch will soon be deleted.

@kbenzie kbenzie mentioned this pull request Dec 6, 2023
13 tasks
@kbenzie kbenzie merged commit ce4acbc into oneapi-src:main Dec 6, 2023
52 checks passed
@kbenzie
Copy link
Contributor

kbenzie commented Dec 6, 2023

I'm going to create a combined intel/llvm PR which will include this and a few other PR's so we can get things merged quicker.

kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request Dec 6, 2023
[UR][L0] Unify use of large allocation in L0 adapter
kbenzie added a commit to kbenzie/llvm that referenced this pull request Dec 6, 2023
steffenlarsen pushed a commit to intel/llvm that referenced this pull request Dec 8, 2023
kbenzie added a commit to kbenzie/unified-runtime that referenced this pull request Dec 15, 2023
[UR][L0] Unify use of large allocation in L0 adapter
callumfare pushed a commit to kbenzie/llvm that referenced this pull request Dec 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready to merge Added to PR's which are ready to merge v0.8.x Include in the v0.8.x release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants