Loss abruptly becomes 'nan' during Self-Supervised training #2

Open
hrishi508 opened this issue Jan 29, 2022 · 0 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@hrishi508 (Owner)
In the ResNet9_Barlow_Twins.ipynb notebook, the model and the training function compiled and trained successfully for ~13 epochs, but the loss then abruptly becomes nan. This in turn was occurring because the gradients themselves were becoming nan.
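
For reference, a minimal sketch of the kind of guard that exposes the point where the loss flips to nan. This is not the notebook's actual code; `model`, `optimizer`, `loader`, `num_epochs`, and `barlow_twins_loss` are placeholder names:

```python
import torch

# Surface the exact backward op that produces nan (see step 1 of the debugging list below).
torch.autograd.set_detect_anomaly(True)

for epoch in range(num_epochs):
    for x1, x2 in loader:                      # two augmented views of the same batch
        loss = barlow_twins_loss(model(x1), model(x2))
        if torch.isnan(loss):                  # stop as soon as the loss degenerates
            raise RuntimeError(f"loss became nan at epoch {epoch}")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```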

Debugging:

  1. torch.autograd.set_detect_anomaly(True) was used to trace which part of the code was producing the nan values.

  2. The error was traced to: RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

  3. A simpler model (AlexNet) was tried in the AlexNet_Barlow_Twins.ipynb notebook, but the error persisted.

  4. Gradient clipping was applied to ensure that the gradients do not explode, and division by zero was prevented at all stages by adding a small positive constant wherever required (see the sketch after this list).

  5. We also tried Facebook Research's implementation of the Barlow Twins loss function and the LARS optimizer.
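
To make steps 4 and 5 concrete, below is a minimal sketch of an epsilon-stabilised Barlow Twins loss. It is not the notebook's code: EPS, LAMBDA, and the off_diagonal helper are illustrative. One common source of a nan from 'PowBackward0' is a fractional power (for example, the square root inside a standard deviation) being differentiated at zero, which is why the small constant is added inside the variance before the square root:

```python
import torch

EPS = 1e-6     # small positive constant guarding divisions and square roots
LAMBDA = 5e-3  # weight on the redundancy-reduction (off-diagonal) term

def off_diagonal(m: torch.Tensor) -> torch.Tensor:
    """Return a flattened view of all off-diagonal elements of a square matrix."""
    n, _ = m.shape
    return m.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    batch_size = z1.shape[0]
    # Standardise each feature; adding EPS inside the variance keeps the
    # gradient of the square root finite when a feature has zero variance.
    z1 = (z1 - z1.mean(dim=0)) / (z1.var(dim=0) + EPS).sqrt()
    z2 = (z2 - z2.mean(dim=0)) / (z2.var(dim=0) + EPS).sqrt()

    # Empirical cross-correlation matrix between the two embeddings.
    c = (z1.T @ z2) / batch_size

    on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # drive diagonal towards 1
    off_diag = off_diagonal(c).pow(2).sum()         # drive off-diagonal towards 0
    return on_diag + LAMBDA * off_diag
```

A corresponding training step with gradient clipping (again with placeholder names) might look like:

```python
loss = barlow_twins_loss(model(x1), model(x2))
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # bound the gradient norm
optimizer.step()
```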

@hrishi508 added the bug and help wanted labels Jan 29, 2022
@hrishi508 pinned this issue Jan 29, 2022