Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-14532 gurt: Replace environment APIs hook #13483

Merged
merged 9 commits into from
Jan 8, 2024

Conversation

knard-intel
Copy link
Contributor

@knard-intel knard-intel commented Dec 12, 2023

Description

This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process.
This PR mainly add thread safe environment variables management functions.
It also remove and replace old non thread safe custom environment management functions.
Finally, it replace the setenv() function with d_setenv().

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Add thread safe environment variables managment functions.
Remove and replace non thread safe custom environment management function.
Replace setenv() function with d_setenv()

Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Copy link

github-actions bot commented Dec 12, 2023

Bug-tracker data:
Ticket title is 'frontera: segfault on mvapich init'
Status is 'In Review'
Labels: 'TACC,TACC_Frontera,triaged'
https://daosio.atlassian.net/browse/DAOS-14532

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@knard-intel knard-intel marked this pull request as ready for review December 12, 2023 10:49
@knard-intel knard-intel requested review from a team as code owners December 12, 2023 10:49
src/gurt/misc.c Outdated
Comment on lines 1206 to 1224
if (!dis_unsigned_str(env)) {
rc = -DER_INVAL;
goto out;
}

env_val = getenv(env);
if (!env_val)
return;
errno = 0;
val = strtoul(env, &endptr, 0);
if (errno != 0 || endptr == env || *endptr != '\0') {
rc = -DER_INVAL;
goto out;
}

if (!dis_integer_str(env_val)) {
D_ERROR("ENV %s is not integer.\n", env_val);
return;
#if UINT_MAX != ULONG_MAX
assert(sizeof(unsigned) < sizeof(unsigned long));
if (val > UINT_MAX) {
rc = -DER_INVAL;
goto out;
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] i think for somehow dis_unsigned_str() is duplicated as strtoull()?
we could remove dis_unsigned_str() and just use strtoull()?

Copy link
Contributor Author

@knard-intel knard-intel Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best solution from my point of view should be to remove all the unsigned and to replace them with uint32_t or uint64_t. However, this is beyond the scope of this PR.
For this PR, I would prefer to keep these three functions as the sizeof() of the unsigned is platform dependent: I would like to avoid any unexpected overflow. In a follow-up PR, I could replace the d_getenv_uint() with relevant d_getenv_uintXXt() and update the code accordingly.

To avoid duplicate code, I could factorize the code into d_getenv_uint64_t() function and just check the overflow in d_getenv_uint32t() and d_getenv_uint() functions.
@wangshilong, is this last solution seems acceptable for you ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with this, i thought dis_unsigned_str() is unnecessary and might be removed.

Copy link
Contributor Author

@knard-intel knard-intel Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Factorize unsigned fonctions.

Copy link
Contributor Author

@knard-intel knard-intel Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] i think for somehow dis_unsigned_str() is duplicated as strtoull()? we could remove dis_unsigned_str() and just use strtoull()?

My apologies @wangshilong , I did not well understood your question regarding the function dis_unsigned_str(). At the beginning, I also wanted to remove it. However, strtoull() is less restrictive and allow to have spaces before the first digit and even to have a negative integer. To not change the behavior of the d_getenv_uint() functions, I decided to kept the dis_unsigned_str() function.

Is this OK for you ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code of the getenv of unsigned integer function has been factorized in commit 193c24c

wangshilong
wangshilong previously approved these changes Dec 13, 2023
src/gurt/misc.c Outdated
{
int rc;

pthread_rwlock_wrlock(&d_env_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why this is safer than the hook version? I think the issue we have seen is due to threads calling getenv in other libraries, which in this case would not be locked with respect to calls in daos libraries. Whereas the hooked version would ensure the lock is acquired for all calls, no?

Copy link
Contributor Author

@knard-intel knard-intel Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hook version can have some side effect with other third party libraries such as mvapich and eventually lead to segfault. You could find in the following comment, a detailed description of such scenario https://daosio.atlassian.net/browse/DAOS-14532?focusedCommentId=121770

I tried to overcome this issue with updating hooks or playing with library constructors. At the end, I did not find an acceptable solution which could not have some side effects with other third party libraries.

After some discussion with @johannlombardi and @gnailzenh , it appeared that the least bad solution was to have our own thread safe functions for managing environment. From my side, the only disadvantage of this solution is to be much more intrusive in our code. Happily, we already have some dedicated non thread safe functions for managing the environment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we see an intermittent issue with fio that manifests as follows
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007d9faabc195d in getenv () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7d9fad4335c0 (LWP 60560))]
Missing separate debuginfos, use: yum debuginfo-install boost-atomic-1.66.0-13.el8.x86_64 boost-chrono-1.66.0-13.el8.x86_64 boost-date-time-1.66.0-13.el8.x86_64 boost-iostreams-1.66.0-13.el8.x86_64 boost-program-options-1.66.0-13.el8.x86_64 boost-random-1.66.0-13.el8.x86_64 boost-regex-1.66.0-13.el8.x86_64 boost-system-1.66.0-13.el8.x86_64 boost-thread-1.66.0-13.el8.x86_64 daxctl-libs-71.1-4.el8.x86_64 glibc-2.28-225.el8.x86_64 libaio-0.3.112-1.el8.x86_64 libblkid-2.32.1-42.el8_8.x86_64 libgcc-8.5.0-18.el8.x86_64 libibverbs-44.0-2.el8.1.x86_64 libicu-60.3-2.el8_1.x86_64 libmount-2.32.1-42.el8_8.x86_64 libnl3-3.7.0-1.el8.x86_64 libpmem-1.6.1-1.el8.x86_64 libpmemblk-1.6.1-1.el8.x86_64 librados2-12.2.7-9.el8.x86_64 librbd1-12.2.7-9.el8.x86_64 librdmacm-44.0-2.el8.1.x86_64 libselinux-2.9-8.el8.x86_64 libstdc++-8.5.0-18.el8.x86_64 libyaml-0.1.7-5.el8.x86_64 lz4-libs-1.8.3-3.el8_4.x86_64 ndctl-libs-71.1-4.el8.x86_64 nspr-4.34.0-3.el8_6.x86_64 nss-util-3.79.0-11.el8_7.x86_64 numactl-libs-2.0.12-13.el8.x86_64 openssl-libs-1.1.1k-9.el8_7.x86_64 pcre2-10.32-3.el8_6.x86_64 protobuf-c-1.3.0-6.el8.x86_64 systemd-libs-239-74.el8_8.x86_64 zlib-1.2.11-21.el8_7.x86_64
(gdb) bt
#0 0x00007d9faabc195d in getenv () from /lib64/libc.so.6
#1 0x00005b1f2183098c in check_status_file () at stat.c:2372
#2 0x00005b1f21839c25 in check_for_running_stats () at stat.c:2411
#3 0x00005b1f2186cf51 in do_usleep (usecs=10000) at backend.c:2460
#4 run_threads (sk_out=sk_out@entry=0x0) at backend.c:2460
#5 0x00005b1f2186d5b2 in fio_backend (sk_out=0x0) at backend.c:2513
#6 0x00005b1f2181c595 in main (argc=16, argv=0x7ffe8dcfb228, envp=) at fio.c:60

It doesn't show dlopen in the callstack. Do you think it's related to this issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a high probably that it could be related: however it is hard for me to be 100% sure without a more deep investigation in the core file and the related code source.
As for @daltonbohning issue with mvapich, you could use the full patch #13250 to check if it also solves your issue with fio.

jolivier23
jolivier23 previously approved these changes Dec 19, 2023
void
d_free_env(char **str_val);
int
d_getenv_bool(bool *bool_val, const char *env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing the API and ABI isn't it? Existing applications which use this interface will need code changes to work, but if they build with one version and run with another they could also segfault.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a similar thought but then kind of assumed that the only user of these APIs are DAOS libraries that will be shipped with the changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the danger would be things like mpifileutils and hdf5 etc. At the very least we should bump the SO version with this changes, but that only applies to releases from "before" and "after" and I'd assume that 2.4 and 2.6 will have different SO numbers anyway - perhaps worth checking it though and bumping it with this change if master is still using the same as 2.4.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this time, the only user of this new APIs are indeed the DAOS libraries and thus should not have any side effects for the client applications linked with DAOS. They will just use the standard non-thread safe getenv() functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the functions you've changed, not the ones you've added.

Integrate reviewers comments:
- Factorize d_getenv_uint() functions.

Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Fix clang format.

Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13483/3/display/redirect

wangshilong
wangshilong previously approved these changes Jan 4, 2024
@knard-intel knard-intel requested a review from a team January 4, 2024 07:24
src/include/gurt/common.h Show resolved Hide resolved
void
d_free_env(char **str_val);
int
d_getenv_bool(bool *bool_val, const char *env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the danger would be things like mpifileutils and hdf5 etc. At the very least we should bump the SO version with this changes, but that only applies to releases from "before" and "after" and I'd assume that 2.4 and 2.6 will have different SO numbers anyway - perhaps worth checking it though and bumping it with this change if master is still using the same as 2.4.

void
d_free_env(char **str_val);
int
d_getenv_bool(bool *bool_val, const char *env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the functions you've changed, not the ones you've added.

d_getenv_bool("DAOS_NVME_AUTO_FAULTY_ENABLED", &glb_criteria.fc_enabled);
d_getenv_int("DAOS_NVME_AUTO_FAULTY_IO", &glb_criteria.fc_max_io_errs);
d_getenv_int("DAOS_NVME_AUTO_FAULTY_CSUM", &glb_criteria.fc_max_csum_errs);
d_getenv_bool(&glb_criteria.fc_enabled, "DAOS_NVME_AUTO_FAULTY_ENABLED");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change these APIs. MPI_IO uses d_getenv_bool so we need to maintain the old interfaces.

Copy link
Contributor Author

@knard-intel knard-intel Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Revert interfaces update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By any chance, do you know if we have some functional tests using MPI_IO ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Features: mpiio

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past where this has come up then what functional testing we have uses a weird system where:

  1. PRs and landings to downstream components test against master daos.
  2. PRs and landings to daos test with the latest built downstream component.

The unfortunate implication here is that we have some testing but it's limited and if we change the API or ABI and it's incompatible then we don't notice until there's a new PR to the downstream component or it gets rebuilt for another reason. If we had test coverage here then it would fail at build time, as it is it'll fail (obscurely) at runtime, but only if the code is called.

It's an area we need to work on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Revert interfaces update

Fixed with commit ac764d1

Copy link
Contributor

@jolivier23 jolivier23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing my +1 since the API change needs to be reverted.

@knard-intel
Copy link
Contributor Author

knard-intel commented Jan 4, 2024

@ashleypittman , my apologies for the misunderstanding.
I was thinking that you were talking about the standard getenv() function and not the daos_getenv_xxx() functions already existing. I was expecting that these last ones were only used internaly with DAOS.
For sure, I will revert the interface modification to preserve the compatibility.

Integrate reviewers comments:
- Revert d_getenv_XXX interfaces update
- Rename d_free_env() -> d_free_env_str()

Features: mpiio
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Revert useless header file changes.

Features: mpiio
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
src/gurt/misc.c Outdated
{
char *env;

pthread_rwlock_rdlock(&d_env_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will be coverity warnings unless you either check the return code or use our macros.

Copy link
Contributor Author

@knard-intel knard-intel Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info.
Sadly, I will not be able to use the D_RWLOCK_RDLOCK macros as it use the DAOS logging facilities which could lead to infinite recursive calls. Thus I will check the return code is a similar way as this last macros.

  • Checks return code when using standard function calls such as pthread_rwlock_rdlock()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did wonder. Does the logging ever call getenv though?

I made the same change to the locking in the logging as that was causing recursion and found out about the coverity issue that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I remember, I experiment infinite recursion with the function d_logfac_is_enabled() from src/gurt/dlog.c. There is other functions from this file which could also probably lead to the same kind of issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info. Sadly, I will not be able to use the D_RWLOCK_RDLOCK macros as it use the DAOS logging facilities which could lead to infinite recursive calls. Thus I will check the return code is a similar way as this last macros.

  • Checks return code when using standard function calls such as pthread_rwlock_rdlock()

Fixed with commit 2791b93

@knard-intel knard-intel self-assigned this Jan 5, 2024
Integrate reviwers comments:
- Checks return code when using standard function calls such as pthread_rwlock_rdlock()

Features: mpiio
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Fix clang format coding standards.

Features: mpiio
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13483/8/execution/node/1453/log

@knard-intel
Copy link
Contributor Author

The only CI issue with the functional test erasurecode_online_rebuild_mdtest_py is a known issue DAOS-14845.

@mchaarawi mchaarawi merged commit 2a9e853 into master Jan 8, 2024
45 of 47 checks passed
@mchaarawi mchaarawi deleted the ckochhof/fix/master/daos-14532-gurt branch January 8, 2024 16:50
daltonbohning pushed a commit that referenced this pull request Jan 24, 2024
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process.
This PR mainly add thread safe environment variables management functions.
It also remove and replace old non thread safe custom environment management functions.
Finally, it replace the setenv() function with d_setenv().

Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
jolivier23 pushed a commit that referenced this pull request Feb 28, 2024
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process.
This PR mainly add thread safe environment variables management functions.
It also remove and replace old non thread safe custom environment management functions.
Finally, it replace the setenv() function with d_setenv().

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
jolivier23 pushed a commit that referenced this pull request Mar 12, 2024
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process.
This PR mainly add thread safe environment variables management functions.
It also remove and replace old non thread safe custom environment management functions.
Finally, it replace the setenv() function with d_setenv().

Required-githooks: true

Change-Id: Ife6690e2c63dd6c47279a2ac8c3c5a3da5cf8213
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
jolivier23 pushed a commit that referenced this pull request Apr 10, 2024
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process.
This PR mainly add thread safe environment variables management functions.
It also remove and replace old non thread safe custom environment management functions.
Finally, it replace the setenv() function with d_setenv().

Required-githooks: true

Change-Id: Ife6690e2c63dd6c47279a2ac8c3c5a3da5cf8213
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

8 participants