Test STKDoc_tests_stk_mesh_doc_tests_MPI_4 unit test StkMeshHowTo.useAutomaticGeneratedAura randomly failing/segfaulting in PR build gnu-8.5.0-openmpi-4.1.6-openmp since 2024-06-26 #13244

Open
bartlettroscoe opened this issue Jul 16, 2024 · 6 comments
Labels
Framework tasks Framework tasks (used internally by Framework team) impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Framework Issues that fall under the Trilinos Framework Product Area pkg: STK type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Jul 16, 2024

CC: @alanw0, @sebrowne, @achauphan

Next Action Status

Description

As shown in this query (click "Shown Matching Output" in upper right) the test:

  • STKDoc_tests_stk_mesh_doc_tests_MPI_4

in the unique GenConfig build:

  • rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

started randomly failing/segfaulting on testing day 2024-06-26.

The specific set of CDash builds impacted were:

  • PR-13164-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-20
  • PR-13165-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-17
  • PR-13191-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-51
  • PR-13197-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-73
  • PR-13206-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-119
  • PR-13212-test-rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-98

When the test segfaults, the output looks like:

*** Starting test StkMeshHowTo.useNoAura
[       OK ] StkMeshHowTo.useNoAura (0 ms)
*** Starting test StkMeshHowTo.useAutomaticGeneratedAura
[       OK ] StkMeshHowTo.useAutomaticGeneratedAura (0 ms)
*** Starting test StkMeshHowTo.use_generate_new_ids
[ascic0194:3690832] *** Process received signal ***
[ascic0194:3690832] Signal: Segmentation fault (11)
[ascic0194:3690832] Signal code: Address not mapped (1)
[ascic0194:3690832] Failing at address: (nil)
[ascic0194:3690832] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7efe1cf64cf0]
[ascic0194:3690832] [ 1] /lib64/libc.so.6(__libc_malloc+0x146)[0x7efe1cc288c6]

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

See:

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: STK Framework tasks Framework tasks (used internally by Framework team) impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Framework Issues that fall under the Trilinos Framework Product Area labels Jul 16, 2024
@bartlettroscoe
Member Author

@achauphan and @sebrowne, did the frameworks monitoring of the randomly failing tests not pick up this random test failure?

I only decided to run this query after one of the PRs I was reviewing showed this failure. But this test had already failed/segfaulted randomly five other times since the end of last month (and no one bothered to post an issue for this?).

@alanw0
Contributor

alanw0 commented Jul 16, 2024

Thanks for the notification. That's troubling... I haven't seen that test fail in recent memory, and haven't known it to exhibit non-deterministic or random behavior. I'll look into it and try to resolve what's happening.

@sebrowne
Contributor

sebrowne commented Jul 22, 2024

@bartlettroscoe I looked back through the history of the tool’s messages and it has not flagged that test at all. Remember, all we’re currently flagging are tests that failed, then passed on the same SHA1.

EDIT: It flagged it from last week, but not prior to that.
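
The screening rule described above (flag a test only when it failed and then passed on the same SHA1) can be sketched roughly as follows. This is an illustration only, not the Framework team's actual tool: `TestRun` and `flag_fail_then_pass` are hypothetical names, and the runs are assumed to be supplied in chronological order.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one test execution (not a CDash data structure).
struct TestRun {
  std::string sha1;       // commit the build was run against
  std::string test_name;  // name of the test
  bool passed;            // true if the test passed in this run
};

// Returns the names of tests that failed and later passed on the same SHA1.
// Assumes `runs` is ordered from oldest to newest.
std::set<std::string> flag_fail_then_pass(const std::vector<TestRun>& runs) {
  std::set<std::pair<std::string, std::string>> seen_failure;
  std::set<std::string> flagged;
  for (const auto& run : runs) {
    const auto key = std::make_pair(run.sha1, run.test_name);
    if (!run.passed) {
      seen_failure.insert(key);
    } else if (seen_failure.count(key) > 0) {
      flagged.insert(run.test_name);
    }
  }
  return flagged;
}
```

Under this rule, a test that fails once on a SHA1 that is never retested is not flagged, which is one way a random failure can go unnoticed.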

@bartlettroscoe
Member Author

> @bartlettroscoe I looked back through the history of the tool’s messages and it has not flagged that test at all. Remember, all we’re currently flagging are tests that failed, then passed on the same SHA1.

Not surprising. The current screening approach will miss a lot of actual random failures.

> EDIT: It flagged it from last week, but not prior to that.

The next step is to run a query looking for that same test failure with similar output where that test is the only test failing in that build. That was the case with these particular test failures. You could write an automated tool to do this.
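
A minimal sketch of such a filter, assuming the build results have already been pulled into memory. `BuildResult` and `builds_with_isolated_failure` are hypothetical names, not part of CDash or any existing Trilinos tooling, and "similar output" is approximated here by a plain substring match:

```cpp
#include <string>
#include <vector>

// Hypothetical summary of one build (not a CDash data structure).
struct BuildResult {
  std::string build_name;                 // e.g. the PR build name
  std::vector<std::string> failed_tests;  // names of all failing tests
  std::string failure_output;             // captured output of the failing test
};

// Returns the names of builds where `test_name` is the only failing test
// and its output contains `signature` (e.g. "Segmentation fault").
std::vector<std::string> builds_with_isolated_failure(
    const std::vector<BuildResult>& builds,
    const std::string& test_name,
    const std::string& signature) {
  std::vector<std::string> matches;
  for (const auto& build : builds) {
    const bool only_failure = build.failed_tests.size() == 1 &&
                              build.failed_tests.front() == test_name;
    const bool similar_output =
        build.failure_output.find(signature) != std::string::npos;
    if (only_failure && similar_output) {
      matches.push_back(build.build_name);
    }
  }
  return matches;
}
```

A real tool would need a looser notion of similarity than a substring match, since addresses and host names in the output differ from run to run.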

@alanw0
Contributor

alanw0 commented Jul 23, 2024

I've identified some undefined behavior associated with using something similar to &vec[0] on an empty vector, which can dereference a null pointer. Disappointingly, that often doesn't cause a seg-fault, but it can.
In any case I will try to get a stk update into trilinos as soon as I can.
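
For readers unfamiliar with this class of bug, here is a small sketch of the pattern and one common way to avoid it. It is not the STK code in question; `sum_raw` and `sum_vector` are hypothetical stand-ins for any code that hands a vector's storage to a pointer-plus-length interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums `count` ints starting at `ptr`.
// Stands in for any C-style interface taking a pointer and a length.
int sum_raw(const int* ptr, std::size_t count) {
  assert(ptr != nullptr || count == 0);
  int total = 0;
  for (std::size_t i = 0; i < count; ++i) {
    total += ptr[i];
  }
  return total;
}

// Risky pattern: &values[0] applies operator[] to an element that does not
// exist when `values` is empty, which is undefined behavior:
//   return sum_raw(&values[0], values.size());

// Safer pattern: data() may be called on an empty vector. The returned
// pointer may or may not be null and must not be dereferenced when
// size() == 0, which the loop in sum_raw respects.
int sum_vector(const std::vector<int>& values) {
  return sum_raw(values.data(), values.size());
}
```

With `data()`, the empty case is well defined as long as the callee never reads through the pointer when the count is zero. An explicit `values.empty()` check before taking the address is another option.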

@alanw0
Contributor

alanw0 commented Aug 2, 2024

This should be addressed by #13288.
That pull request turned this test off. A coming-soon STK update will fix the actual undefined behavior that is causing that test to be flaky.


3 participants