
Features/1400 implement unfold operation similar to torch tensor unfold #1419

Open
FOsterfeld wants to merge 41 commits into main

Conversation

FOsterfeld (Collaborator) commented Apr 2, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

Add the function unfold to the available manipulations. For a DNDarray a, unfold(a, dimension, size, step) behaves like torch.Tensor.unfold.

Example:

>>> x = ht.arange(1., 8)
>>> x
DNDarray([1., 2., 3., 4., 5., 6., 7.], dtype=ht.float32, device=cpu:0, split=None)
>>> ht.unfold(x, 0, 2, 1)
DNDarray([[1., 2.],
          [2., 3.],
          [3., 4.],
          [4., 5.],
          [5., 6.],
          [6., 7.]], dtype=ht.float32, device=cpu:0, split=None)
>>> ht.unfold(x, 0, 2, 2)
DNDarray([[1., 2.],
          [3., 4.],
          [5., 6.]], dtype=ht.float32, device=cpu:0, split=None)
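
For comparison, the same windows produced directly with the PyTorch method this function mirrors (plain torch, independent of heat):

>>> import torch
>>> t = torch.arange(1., 8.)
>>> t.unfold(0, 2, 1)
tensor([[1., 2.],
        [2., 3.],
        [3., 4.],
        [4., 5.],
        [5., 6.],
        [6., 7.]])
>>> t.unfold(0, 2, 2)
tensor([[1., 2.],
        [3., 4.],
        [5., 6.]])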

Issue/s resolved: #1400

Changes proposed:

Type of change

  • New feature (non-breaking change which adds functionality)

Memory requirements

Performance

Does this change modify the behaviour of other functions? If so, which?

no

github-actions bot (Contributor) commented Apr 2, 2024

Thank you for the PR!


codecov bot commented Apr 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.93%. Comparing base (a774559) to head (825979c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1419      +/-   ##
==========================================
+ Coverage   91.91%   91.93%   +0.02%     
==========================================
  Files          80       80              
  Lines       11942    11973      +31     
==========================================
+ Hits        10976    11007      +31     
  Misses        966      966              
Flag   Coverage Δ
unit   91.93% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.


mrfh92 (Collaborator) commented Apr 3, 2024

The tests on the CUDA runner seem to hang at test_manipulations.py for 5 MPI processes.
This also happens locally on my machine, so there seems to be an error in unfold that results in hanging (most likely an MPI deadlock?).
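
For context, a minimal, purely illustrative mpi4py sketch of how such a halo exchange can hang. This is not taken from the heat code base; it only shows the classic pattern where one rank never posts the send its neighbour is waiting for (e.g. because its local chunk is too small to contribute a halo):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.arange(rank * 3, rank * 3 + 3, dtype=np.float32)
halo = np.empty(1, dtype=np.float32)

# Every rank except the last expects one halo element from its right neighbour.
if rank < size - 1:
    req = comm.Irecv(halo, source=rank + 1, tag=0)
    # If rank + 1 never posts the matching send, this Wait never returns and
    # the whole test run appears to hang.
    req.Wait()

# Matching send; skipping this branch on any rank > 0 reproduces such a hang.
if rank > 0:
    comm.Send(local[:1], dest=rank - 1, tag=0)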


mrfh92 (Collaborator) commented Apr 15, 2024

On the Terrabyte cluster, using 8 processes on 2 nodes with 4 GPUs each, I get the following error:

ERROR: test_unfold (heat.core.tests.test_manipulations.TestManipulations)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dss/dsshome1/03/di93zek/heat/heat/core/tests/test_manipulations.py", line 3775, in test_unfold
    ht.unfold(x, 0, min_chunk_size, min_chunk_size + 1)  # no fully local unfolds on some nodes
  File "/dss/dsshome1/03/di93zek/heat/heat/core/manipulations.py", line 4272, in unfold
    ret_larray = torch.cat((unfold_loc, unfold_halo), dimension)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)

----------------------------------------------------------------------
Ran 32 tests in 26.574s

On CPU, everything seems to work (at least in test_manipulations.py).
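
A minimal sketch of the kind of fix the traceback points at, assuming the halo arrives on the CPU while the local chunk already lives on the GPU; the helper name concat_with_halo is hypothetical and the variable names simply mirror the traceback, so the actual fix in heat may look different:

import torch

def concat_with_halo(unfold_loc: torch.Tensor, unfold_halo: torch.Tensor, dimension: int) -> torch.Tensor:
    # Move the halo onto the device of the local chunk before concatenating;
    # torch.cat raises the "Expected all tensors to be on the same device"
    # RuntimeError seen above when its operands live on different devices.
    unfold_halo = unfold_halo.to(unfold_loc.device)
    return torch.cat((unfold_loc, unfold_halo), dimension)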


…ided-halo: Support one-sided halo for DNDarrays

mrfh92 (Collaborator) commented Jun 21, 2024

@FOsterfeld there now seems to be an error on the CUDA runner. Since it fails in unfold, it is probably not a random CI error due to overloaded runners but really something in unfold.


@FOsterfeld (Collaborator, Author)

There seems to be something wrong with the communication in DNDarray.get_halo(). Sometimes the halo that is sent from the last rank to the rank before it is faulty. This happened irregularly in my tests without any randomization in the data, so it may depend on the order in which the non-blocking halo sends complete there.

In 825979c I tested get_halo(prev=False) with blocking sends instead; this eliminated all errors, but it is obviously not a final solution to the problem.
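
Purely as an illustration (not the heat implementation): one classic way a non-blocking halo send can deliver a faulty message is when the send buffer is modified or released before the request has completed. Keeping the buffer alive and waiting on the request makes the non-blocking variant as safe as the blocking sends tested above; a minimal mpi4py sketch:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Halo data this rank sends "backwards" to its left neighbour.
halo_buf = np.full(4, rank, dtype=np.float32)

if rank > 0:
    req = comm.Isend(halo_buf, dest=rank - 1, tag=1)
    # Unsafe: overwriting halo_buf before req completes can corrupt the
    # message in flight, which would show up as an occasionally faulty halo.
    # halo_buf[:] = -1.0
    req.Wait()  # only reuse or discard halo_buf after completion

if rank < size - 1:
    recv_buf = np.empty(4, dtype=np.float32)
    comm.Recv(recv_buf, source=rank + 1, tag=1)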

Successfully merging this pull request may close these issues: Implement unfold-operation similar to torch.Tensor.unfold

3 participants