Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Inconsistent usage of file_client in CheckpointHook. The previous best checkpoints are not removed for file backends other than disk. #1084

Closed
2 tasks done
tyomj opened this issue Apr 18, 2023 · 1 comment · Fixed by #1086
Labels
bug Something isn't working

Comments

@tyomj
Copy link

tyomj commented Apr 18, 2023

Prerequisite

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.16 (default, Mar 23 2023, 17:40:49) [GCC 10.2.1 20210110]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0', 'Tesla V100-SXM2-16GB'), ('CUDA_HOME', None), ('GCC', 'gcc (Debian 10.2.1-6) 10.2.1 20210110'), ('PyTorch', '2.0.0+cu117'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.15.1+cu117'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.2')])

Reproduces the problem - code sample

Config example
The example uses aws backend implemeted in this PR. One may test with Petrel backend as well to reproduce.

# Client args
# ...
beargs = dict(
    backend="aws",
    path_mapping=dict(
        {
            "./work_dirs": f"aws://bucket_name/{run_id}",
            "/app/mmdetection3d/work_dirs": f"aws://bucket_name/{run_id}",
        }
    ),
)
# ...
# Define CheckpointHook
default_hooks = dict(
    checkpoint=dict(
        type="CheckpointHook",
        save_best='some_key',
        max_keep_ckpts=1,
        backend_args=beargs
    ),
)

Reproduces the problem - command or script

python tools/train.py ../config.py

Reproduces the problem - error message

Unexpected behaviour.

Additional information

  1. Expected behavior:
    The system should save only one best checkpoint at a time, and if a new checkpoint becomes better than the previous one, the old checkpoint should be removed.

  2. Current behavior:
    The system saves a new checkpoint in addition to the previous one, which results in multiple checkpoints being saved.

  3. The reason:
    The issue arises due to the wrong simultaneous definition of backend_args and file_client_args. If backend_args is defined, the file_client is set to default to disk. The file_backend is used throughout the code, except in _save_best_checkpoint. Hence, when the system tries to locate the previous checkpoint at this location, it searches on the local drive.

  4. A proposed solution:
    The simultaneous definition of backend_args and file_client_args at this location should be changed here
    to

        if self.file_client_args is None:
            self.file_backend = get_file_backend(
                self.out_dir, backend_args=self.backend_args)
            self.file_client = self.file_backend
        else:
            self.file_client = FileClient.infer_client(self.file_client_args,
                                                   self.out_dir)
            self.file_backend = self.file_client

Additionally, self.file_client should be replaced with self.file_backend at this location here to ensure that the system locates the previous checkpoint correctly.

@tyomj tyomj added the bug Something isn't working label Apr 18, 2023
@tyomj tyomj changed the title [Bug] Inconsistent usage of file_client in CheckpointHook [Bug] Inconsistent usage of file_client in CheckpointHook. The previous best checkpoints are not removed for file backends other than disk. Apr 18, 2023
@HAOCHENYE
Copy link
Collaborator

Thanks for your feedback, I've tried to fix this Bug in #1086, it could be helpful if you can review this PR 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants