
Combine matmuls for GRU gates #188

Merged
robertknight merged 2 commits into main from gru-combine-gemms on May 20, 2024

Conversation

robertknight (Owner) commented May 19, 2024

Apply a change similar to 2f0327c by combining the separate matrix multiplications for the 3 gates into a single matmul. This reduces the total number of matmuls from `num_directions * seq_length * 6` to `num_directions * seq_length * 2`.
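As a rough illustration of the combining trick (a minimal numpy sketch, not rten code; names and sizes are made up), concatenating the per-gate weight matrices along the output dimension lets a single matmul replace the three per-gate matmuls, with the per-gate results recovered by slicing:

```python
import numpy as np

batch, input_size, hidden = 2, 8, 16
x = np.random.rand(batch, input_size).astype(np.float32)

# Separate weight matrices for the update (z), reset (r) and hidden (h) gates.
w_z, w_r, w_h = (np.random.rand(input_size, hidden).astype(np.float32) for _ in range(3))

# Before: one matmul per gate, i.e. 3 matmuls for the input projection.
separate = [x @ w_z, x @ w_r, x @ w_h]

# After: concatenate the gate weights and do a single matmul, then slice.
w_zrh = np.concatenate([w_z, w_r, w_h], axis=1)  # [input_size, 3 * hidden]
combined = x @ w_zrh                             # one matmul
z, r, h = np.split(combined, 3, axis=1)

assert all(np.allclose(a, b) for a, b in zip(separate, (z, r, h)))
```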

Using only two matmuls is possible only when `linear_before_reset=true`, as is the case for all models I have worked with so far, since that is what PyTorch and cuDNN support and what TensorFlow uses by default. With the GRU formulation in the original paper, an additional matmul is required since the reset gate has to be computed before the `hidden_state @ hidden_weights[hidden_gate]` matmul can be done.
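To make the `linear_before_reset` distinction concrete, here is a hedged numpy sketch of the hidden-gate candidate in the two formulations (biases omitted; variable names are illustrative, not rten's):

```python
import numpy as np

def hidden_gate(x, h_prev, w_h, r_h, r, linear_before_reset):
    """Hidden-gate candidate for the two GRU variants (biases omitted)."""
    if linear_before_reset:
        # `h_prev @ r_h` does not depend on the reset gate `r`, so it can be
        # folded into the single combined recurrent matmul computed up front
        # for all three gates; `r` is applied to the matmul *result*.
        return np.tanh(x @ w_h + r * (h_prev @ r_h))
    else:
        # Original formulation: `r` scales `h_prev` *before* the recurrent
        # matmul, so this matmul cannot be combined with the others -- it has
        # to wait until `r` is known, costing an extra matmul per step.
        return np.tanh(x @ w_h + (r * h_prev) @ r_h)
```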

Fixes #85

TODO:

  • Fix efficiency of binary ops for non-contiguous inputs. This affects accumulating / scaling subsets of gates.
  • Fix efficiency of parallel unary ops for non-contiguous inputs. This affects applying activations to subsets of gates. Update: this mostly affected tanh on the hidden gate. In this PR a workaround has been implemented that copies the tensor to a contiguous buffer, using the pool to minimize allocation/de-allocation overhead (see the sketch after this list). An issue has been filed (Make unary ops more efficient with non-contiguous inputs, #192) for a more general solution. The issue also affects the sigmoid op, but the overhead there is much smaller.
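
A rough numpy illustration of the workaround mentioned in the second item (illustrative only; rten's actual code operates on its own tensor types and draws the temporary buffer from a pool):

```python
import numpy as np

batch, hidden = 32, 256

# Combined gate outputs laid out as [batch, 3 * hidden]. The hidden gate
# occupies the last `hidden` columns, so slicing it yields a strided,
# non-contiguous view into the combined buffer.
gates = np.random.rand(batch, 3 * hidden).astype(np.float32)
hidden_view = gates[:, 2 * hidden:]

# Workaround: copy the strided view into a contiguous buffer, apply tanh
# there (a fast, contiguous elementwise op), then write the result back.
tmp = np.ascontiguousarray(hidden_view)
np.tanh(tmp, out=tmp)
hidden_view[...] = tmp
```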

robertknight force-pushed the gru-combine-gemms branch 3 times, most recently from 5d67a3e to f0d7b9c on May 20, 2024 at 06:25
robertknight marked this pull request as ready for review on May 20, 2024 at 06:28
One of the two commits adds a workaround needed because `tanh_in_place` is very slow with non-contiguous inputs. See #192.
robertknight merged commit 2851fed into main on May 20, 2024
2 checks passed
robertknight deleted the gru-combine-gemms branch on May 20, 2024 at 06:36
Successfully merging this pull request may close these issues.

Optimize LSTM / GRU operations to do fewer matmuls