fix some bugs for latest TE #160

Merged

tocean merged 2 commits into main from yuxiang/te_bugfix on Feb 22, 2024

Conversation

@tocean tocean (Contributor) commented on Feb 21, 2024

Description
Fix some bugs for latest TE and add UT for it.

  1. TE allocates the fp8 weight only for the first micro batch. MS-AMP allocates a zero-size tensor for the fp8 weight because tex.fp8_cast_transpose_fused will allocate the memory for it. However, the latest TE introduces a Float8Tensor data structure that stores the underlying fp8 tensor in _data. When comparing shapes in set_fp8_weights, we should therefore compare against the shape of _data; otherwise TE allocates a zero-size tensor for every non-first micro batch (see the sketch after this list).
  2. It seems that Megatron-LM can't converge when using the latest TE (tested with GPT-345M). The newest TE version that converges is v1.1, so pin TE back to v1.1.
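
Below is a minimal sketch of the intended shape check. Only the Float8Tensor/_data detail comes from TE; the helper names, attribute handling, and dtype are illustrative assumptions, not MS-AMP's actual code.

```python
import torch


def _fp8_weight_needs_realloc(existing, expected_shape):
    """Return True if the cached fp8 weight buffer must be (re)allocated.

    Hypothetical helper: only the Float8Tensor/_data detail is from TE.
    """
    if existing is None:
        return True
    # The latest TE wraps the fp8 buffer in a Float8Tensor whose real storage
    # lives in `_data`; compare that shape, not the wrapper's, so a buffer
    # filled by the first micro batch is recognized and reused.
    shape = existing._data.shape if hasattr(existing, "_data") else existing.shape
    return shape != torch.Size(expected_shape)


def set_fp8_weight(module, attr_name, expected_shape, device="cuda"):
    """Allocate a zero-size placeholder only when no usable buffer exists.

    tex.fp8_cast_transpose_fused fills in the real storage on the first micro
    batch; later micro batches must keep that buffer instead of replacing it.
    """
    existing = getattr(module, attr_name, None)
    if _fp8_weight_needs_realloc(existing, expected_shape):
        setattr(module, attr_name, torch.empty(0, dtype=torch.uint8, device=device))
```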

@tocean tocean enabled auto-merge (squash) February 21, 2024 09:02
@tocean tocean disabled auto-merge February 22, 2024 03:53
@tocean tocean changed the title from "fix bug for non first microbatch in TE" to "fix some bugs for latest TE" on Feb 22, 2024
@tocean tocean merged commit 9ac98df into main Feb 22, 2024
9 checks passed
@tocean tocean deleted the yuxiang/te_bugfix branch February 22, 2024 08:25