possible Android memory corruption in validation or SPIRV used by validation #8439

Open
lunarpapillo opened this issue Aug 22, 2024 · 8 comments
@lunarpapillo
Contributor

Environment:

  • OS: Android
  • GPU and driver version: N/A, crash appears on all tested Android devices
  • SDK or header version if building from repo: Android NDK 26.3
  • Options enabled (synchronization, best practices, etc.):

Describe the Issue

When building and testing a Debug build using Android NDK 26.3, tests crash on all devices in the same place in VkArmBestPracticesLayerTest.ComputeShaderBadSpatialLocalityTest, inside an allocator within SPIRV-Tools:

#00 libVkLayer_khronos_validation.so (void std::__ndk1::allocator<unsigned int>::construct[abi:v170000]<unsigned int, unsigned int const&>(unsigned int*, unsigned int const&)+28)
...
#04 libVkLayer_khronos_validation.so (std::__ndk1::__wrap_iter<unsigned int*> std::__ndk1::vector<unsigned int, std::__ndk1::allocator<unsigned int> >::insert<std::__ndk1::__wrap_iter<unsigned int const*>, 0>(std::__ndk1::__wrap_iter<unsigned int const*>, std::__ndk1::__wrap_iter<unsigned int const*>, std::__ndk1::__wrap_iter<unsigned int const*>)+344) 
#05 libVkLayer_khronos_validation.so (spvtools::val::ValidationState_t::RegisterUniqueTypeDeclaration(spvtools::val::Instruction const*)+416)
#06 libVkLayer_khronos_validation.so (spvtools::val::(anonymous namespace)::ValidateUniqueness(spvtools::val::ValidationState_t&, spvtools::val::Instruction const*)+172)
#07 libVkLayer_khronos_validation.so (spvtools::val::TypePass(spvtools::val::ValidationState_t&, spvtools::val::Instruction const*)+88) 
#08 libVkLayer_khronos_validation.so (spvtools::val::(anonymous namespace)::ValidateBinaryUsingContextAndValidationState(spv_context_t const&, unsigned int const*, unsigned long, spv_diagnostic_t**, spvtools::val::ValidationState_t*)+3824) 
#09 libVkLayer_khronos_validation.so (spvValidateWithOptions+164)
#10 libVkLayer_khronos_validation.so (CoreChecks::RunSpirvValidation(spv_const_binary_t&, Location const&, ValidationCache*) const+296)
#11 libVkLayer_khronos_validation.so (CoreChecks::ValidateShaderModuleCreateInfo(VkShaderModuleCreateInfo const&, Location const&) const+692)
#12 libVkLayer_khronos_validation.so (CoreChecks::PreCallValidateCreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**, ErrorObject const&) const+104) 
#13 libVkLayer_khronos_validation.so (vulkan_layer_chassis::CreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**)+248)
#14  /system/lib64/libvulkan.so (vulkan::api::(anonymous namespace)::CreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**)+160)
#15 libVulkanLayerValidationTests.so (vkt::ShaderModule::init(vkt::Device const&, VkShaderModuleCreateInfo const&)+168)
#16 libVulkanLayerValidationTests.so (VkShaderObj::InitFromGLSL(void const*)+224)
#17 libVulkanLayerValidationTests.so (VkShaderObj::VkShaderObj(VkRenderFramework*, char const*, VkShaderStageFlagBits, spv_target_env, SpvSourceType, VkSpecializationInfo const*, char const*, void const*)+268) 
#18 libVulkanLayerValidationTests.so (VkArmBestPracticesLayerTest_ComputeShaderBadSpatialLocalityTest_Test::TestBody()+296)
...

The full ndk-stack output is available:
008-ndk-stack-info.txt

The crash appears when using a Debug build with Android NDK 26.3. It does not appear when using a Release build with NDK 26.3, nor (using either a Release or a Debug build) with either NDK 25.2 or NDK 27.0.

Given that the code appears to run correctly in a Release build, that the crash is device-independent, and that the crash occurs during memory allocation, it's fairly likely that the compiler isn't the issue, and that something in validation or SPIRV is causing memory corruption that happens to cause a validation crash when memory is laid out "just right". If Address Sanitizer is supported on Android, it might be helpful in uncovering such a corruption.

It's possible, though IMHO unlikely, that this is an unknown compiler bug that appeared in NDK 26 and disappeared in NDK 27, as symptoms like this are not listed as known issues: https://github.com/android/ndk/releases

To reproduce the problem, run a manual-Vulkan-ValidationLayers build with: http://tcubuser.lunarg.localdomain:8080/view/Manual/job/manual-Vulkan-ValidationLayers/build

  • BUILD_MODE: Debug
  • ANDROID_ARGS: --android-ndk 26.3
  • NODE: tcubuand1
@lunarpapillo
Contributor Author

For reference, original chat is: https://chat.google.com/room/AAAAOXVAYGg/FL0Vh98x-gM/FL0Vh98x-gM?cls=10

@spencer-lunarg
Contributor

tests crash on all devices in the same place in VkArmBestPracticesLayerTest.ComputeShaderBadSpatialLocalityTest,

This is 99% because VkArm is alphabetically first and it will crash in any test

@mikes-lunarg
Contributor

I was working on a minimal repro case and got it down to this. Note that I'm not even creating a Vulkan instance:

TEST_F(PositiveTooling, Issue8439) {
    std::vector<uint32_t> spv = {
        0x07230203, 0x00010000, 0x0008000b, 0x00000019, 0x00000000, 0x00020011, 0x00000001, 0x0006000b, 
        0x00000001, 0x4c534c47, 0x6474732e, 0x3035342e, 0x00000000, 0x0003000e, 0x00000000, 0x00000001, 
        0x0005000f, 0x00000005, 0x00000004, 0x6e69616d, 0x00000000, 0x00060010, 0x00000004, 0x00000011, 
        0x00000008, 0x00000008, 0x00000001, 0x00030003, 0x00000002, 0x000001c2, 0x00040005, 0x00000004, 
        0x6e69616d, 0x00000000, 0x00040005, 0x00000009, 0x756c6176, 0x00000065, 0x00050005, 0x0000000d, 
        0x6d615375, 0x72656c70, 0x00000000, 0x00040047, 0x0000000d, 0x00000022, 0x00000000, 0x00040047, 
        0x0000000d, 0x00000021, 0x00000000, 0x00040047, 0x00000018, 0x0000000b, 0x00000019, 0x00020013, 
        0x00000002, 0x00030021, 0x00000003, 0x00000002, 0x00030016, 0x00000006, 0x00000020, 0x00040017, 
        0x00000007, 0x00000006, 0x00000004, 0x00040020, 0x00000008, 0x00000007, 0x00000007, 0x00090019, 
        0x0000000a, 0x00000006, 0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000001, 0x00000000, 
        0x0003001b, 0x0000000b, 0x0000000a, 0x00040020, 0x0000000c, 0x00000000, 0x0000000b, 0x0004003b, 
        0x0000000c, 0x0000000d, 0x00000000, 0x00040017, 0x0000000f, 0x00000006, 0x00000002, 0x0004002b, 
        0x00000006, 0x00000010, 0x3f000000, 0x0005002c, 0x0000000f, 0x00000011, 0x00000010, 0x00000010, 
        0x0004002b, 0x00000006, 0x00000012, 0x00000000, 0x00040015, 0x00000014, 0x00000020, 0x00000000, 
        0x00040017, 0x00000015, 0x00000014, 0x00000003, 0x0004002b, 0x00000014, 0x00000016, 0x00000008, 
        0x0004002b, 0x00000014, 0x00000017, 0x00000001, 0x0006002c, 0x00000015, 0x00000018, 0x00000016, 
        0x00000016, 0x00000017, 0x00050036, 0x00000002, 0x00000004, 0x00000000, 0x00000003, 0x000200f8, 
        0x00000005, 0x0004003b, 0x00000008, 0x00000009, 0x00000007, 0x0004003d, 0x0000000b, 0x0000000e, 
        0x0000000d, 0x00070058, 0x00000007, 0x00000013, 0x0000000e, 0x00000011, 0x00000002, 0x00000012, 
        0x0003003e, 0x00000009, 0x00000013, 0x000100fd, 0x00010038, 
    };

    spv_target_env spirv_environment = SPV_ENV_VULKAN_1_0;
    spv_context ctx = spvContextCreate(spirv_environment);
    spvtools::ValidatorOptions spirv_val_options;
    spv_const_binary_t binary{spv.data(), spv.size()};
    spv_diagnostic diag = nullptr;
    
    const spv_result_t spv_valid = spvValidateWithOptions(ctx, spirv_val_options, &binary, &diag);
    ASSERT_TRUE(spv_valid == SPV_SUCCESS);
   
    spvDiagnosticDestroy(diag);
    spvContextDestroy(ctx);
}

Weird thing is that if I add the same test to the SPIRV-Tools unit tests, it works fine! Same SPIRV-Tools commit, same CMake flags, same NDK.
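
For readers following the word array in the repro: the first five words of any SPIR-V module form its header (magic number, version, generator, ID bound, reserved schema word). The sketch below decodes those fields. `SpirvHeader` and `DecodeSpirvHeader` are hypothetical names made up for this illustration and are not part of SPIRV-Tools. Applied to the array above, it should report version 1.0, generator tool ID 8 (which, as far as I know, is the ID registered for the glslang front end) and an ID bound of 25.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fields of the five-word SPIR-V module header.
// Hypothetical struct for illustration; not a SPIRV-Tools type.
struct SpirvHeader {
    uint32_t version_major;      // byte 2 of word 1
    uint32_t version_minor;      // byte 1 of word 1
    uint32_t generator_id;       // upper 16 bits of word 2 (tool ID)
    uint32_t generator_version;  // lower 16 bits of word 2
    uint32_t id_bound;           // word 3: all result IDs are below this value
};

// Decodes the header from a word stream in host byte order.
// Returns false if the stream is too short or the magic number is wrong.
inline bool DecodeSpirvHeader(const uint32_t* words, size_t count, SpirvHeader* out) {
    assert(out != nullptr);
    constexpr uint32_t kSpirvMagic = 0x07230203;
    if (words == nullptr || count < 5 || words[0] != kSpirvMagic) {
        return false;
    }
    out->version_major = (words[1] >> 16) & 0xFF;
    out->version_minor = (words[1] >> 8) & 0xFF;
    out->generator_id = words[2] >> 16;
    out->generator_version = words[2] & 0xFFFF;
    out->id_bound = words[3];
    return true;
}
```

This only checks that the blob is a plausible SPIR-V module; it says nothing about why validation crashes.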

@lunarpapillo
Contributor Author

Weird thing is that if I add the same test to the SPIRV-Tools unit tests, it works fine! Same SPIRV-Tools commit, same CMake flags, same NDK.

Do the SPIRV-Tools unit tests also run on Android?

@mikes-lunarg
Contributor

mikes-lunarg commented Aug 26, 2024

By default, SPIRV-Tools tests do not run on Android. I was able to run them by commenting out these lines: https://github.com/KhronosGroup/SPIRV-Tools/blob/main/CMakeLists.txt#L315-L317 and then manually pushing and running the test executable using the adb shell.

@lunarpapillo
Contributor Author

Weird...

const spv_result_t spv_valid = spvValidateWithOptions(ctx, spirv_val_options, &binary, &diag);
ASSERT_TRUE(spv_valid == SPV_SUCCESS);

spvDiagnosticDestroy(diag);
spvContextDestroy(ctx);

I presume the crash occurs in spvValidateWithOptions(), as it seems to with the VVL tests, and the stack trace is otherwise similar; I presume you were also running the test in isolation via --gtest_filter, yes?

Since it works in SPIRV-Tools unit tests, do you have an hypothesis as to why it fails deterministically in VVL? I've got nothing...

@mikes-lunarg
Contributor

I presume the crash occurs in spvValidateWithOptions(), as it seems to with the VVL tests, and the stack trace is otherwise similar; I presume you were also running the test in isolation via --gtest_filter, yes?

Yes and yes. And just like your initial write-up, this only affects the Debug build. Release builds make it past the spvValidateWithOptions() call and pass the assert.

Since it works in SPIRV-Tools unit tests, do you have an hypothesis as to why it fails deterministically in VVL? I've got nothing...

No real hypothesis yet. The fact that the test code works in one build (SPIRV-Tools) and not the other (VVL) makes me suspect something about how we build/package libSPIRV-Tools.

@mikes-lunarg
Contributor

mikes-lunarg commented Aug 28, 2024

Similar issue: KhronosGroup/glslang#3534

That reporter traced it back to a specific constructor for std::vector and patched around it by constructing the vector using a different method.
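
To make that kind of workaround concrete, here is a minimal sketch of two equivalent ways to build a `std::vector<uint32_t>` from a word buffer. It assumes, purely for illustration, that the problematic form was the iterator-range constructor; this thread does not say which constructor the glslang reporter identified. Both function names are hypothetical. The stack trace in this issue goes through the range overload of `vector::insert`, so an element-by-element copy might sidestep a similar code path here, though that is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Form 1: iterator-range constructor (assumed, for illustration only,
// to be the kind of construction that was patched around).
inline std::vector<uint32_t> CopyWordsRangeCtor(const uint32_t* first, size_t count) {
    assert(first != nullptr || count == 0);
    return std::vector<uint32_t>(first, first + count);
}

// Form 2: default-construct, reserve once, then append element by element.
// Produces the same contents without using the range constructor or range insert.
inline std::vector<uint32_t> CopyWordsPushBack(const uint32_t* first, size_t count) {
    assert(first != nullptr || count == 0);
    std::vector<uint32_t> out;
    out.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        out.push_back(first[i]);
    }
    return out;
}
```

If swapping one form for the other makes the Debug crash go away, that would point at how the template code is instantiated or linked rather than at real heap corruption, but it would be a workaround, not a root cause.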
