_pickle.UnpicklingError: invalid load key, '\x00'. #7

Open
GuoJunfu-tech opened this issue Jun 27, 2024 · 2 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)


@GuoJunfu-tech

Dear Author,

I encountered an error when running bash scripts/train_and_eval_w_geo.sh ManiGaussian_BC 0,1,2,3,4,5 5678 ${try_without_tmux}.
I am running this code on six RTX 3090s under Ubuntu 20.04, with torch==2.0.0+cu117.

However, the following error is raised during training:

Error executing job with overrides: ['method=ManiGaussian_BC', 'rlbench.task_name=ManiGaussian_BC_20240627', 'rlbench.demo_path=/home/gjf/codes/ManiGaussian/data/train_data', 'replay.path=/home/gjf/codes/ManiGaussian/replay/ManiGaussian_BC_20240627', 'framework.start_seed=0', 'framework.use_wandb=False', 'method.use_wandb=False', 'framework.wandb_group=ManiGaussian_BC_20240627', 'framework.wandb_name=ManiGaussian_BC_20240627', 'ddp.num_devices=6', 'replay.batch_size=1', 'ddp.master_port=5678', 'rlbench.tasks=[close_jar,open_drawer,sweep_to_dustpan_of_size,meat_off_grill,turn_tap,slide_block_to_color_target,put_item_in_drawer,reach_and_drag,push_buttons,stack_blocks]', 'rlbench.demos=20', 'method.neural_renderer.render_freq=2000']
Traceback (most recent call last):
  File "/home/gjf/codes/ManiGaussian/train.py", line 96, in main
    run_seed_fn.run_seed(
  File "/home/gjf/codes/ManiGaussian/run_seed_fn.py", line 147, in run_seed
    train_runner.start()
  File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/runners/offline_train_runner.py", line 200, in start
    batch = self.preprocess_data(data_iter)
  File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/runners/offline_train_runner.py", line 121, in preprocess_data
    sampled_batch = next(data_iter) # may raise StopIteration
  File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/lightning/fabric/wrappers.py", line 178, in __iter__
    for item in self._dataloader:
  File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/wrappers/pytorch_replay_buffer.py", line 17, in _generator
    yield self._replay_buffer.sample_transition_batch(pack_in_dict=True)
  File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 722, in sample_transition_batch
    store = self._get_from_disk(
  File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 391, in _get_from_disk
    d = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\x00'.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
  0%|▎                                                                                                                                                                                                              | 134/100010 [03:10<35:31:44,  1.28s/it]/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

After that, one of the GPUs stops working and the whole program hangs at this point, even after I press Ctrl + C. This happens every time, shortly after training starts.

By the way, I did not use tmux or wandb; could that matter?

Could you please help me with this issue?

@GuanxingLu (Owner)

Yes, I've also encountered this issue multiple times, but I haven't found a definitive solution yet because it seems to occur randomly. I suspect it is caused by loading a broken file that was removed or overwritten by another process, since the data is cached in a shared directory ('/tmp/arm/replay' by default). I recommend trying the training with two GPUs one more time.
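A minimal workaround sketch, assuming the failure comes from reading a replay pickle that another worker is still writing or has just removed: wrap the pickle.load call (e.g. around the read in uniform_replay_buffer._get_from_disk) in a short retry loop. The helper name and retry policy below are illustrative assumptions, not part of YARR or ManiGaussian.

```python
# Hypothetical retry wrapper (an assumption, not the actual YARR code): re-read a
# replay-buffer pickle that may be mid-write or briefly missing when several
# dataloader workers share the same cache directory.
import pickle
import time


def load_replay_pickle(path, retries=5, delay=0.5):
    """Load a pickled replay transition, retrying on transient corruption."""
    for attempt in range(retries):
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except (pickle.UnpicklingError, EOFError, FileNotFoundError):
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)  # let the writer process finish the file
```

Another possible mitigation, if the shared '/tmp/arm/replay' cache is indeed the culprit, would be to give each run its own cache directory so that concurrent processes never read and delete the same files; whether the config exposes that path directly is something to check in the repository.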

@GuanxingLu added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Jun 27, 2024
@GuoJunfu-tech (Author)

Thanks for replying. I have not faced the same error when using 2 GPUs.
