
Combine matmuls for GRU gates #188

Merged
robertknight merged 2 commits into main from gru-combine-gemms on May 20, 2024

Conversation

robertknight (Owner) commented May 19, 2024

Apply a change similar to 2f0327c by combining the separate matrix multiplications for the 3 gates into a single matmul. This reduces the total number of matmuls from `num_directions * seq_length * 6` to `num_directions * seq_length * 2`.
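As a rough illustration of the combining trick (a minimal numpy sketch, not rten code; names and sizes are made up), concatenating the per-gate weight matrices along the output dimension lets a single matmul replace the three per-gate matmuls, with the per-gate results recovered by slicing:

```python
import numpy as np

batch, input_size, hidden = 2, 8, 16
x = np.random.rand(batch, input_size).astype(np.float32)

# Separate weight matrices for the update (z), reset (r) and hidden (h) gates.
w_z, w_r, w_h = (np.random.rand(input_size, hidden).astype(np.float32) for _ in range(3))

# Before: one matmul per gate, i.e. 3 matmuls for the input projection.
separate = [x @ w_z, x @ w_r, x @ w_h]

# After: concatenate the gate weights and do a single matmul, then slice.
w_zrh = np.concatenate([w_z, w_r, w_h], axis=1)  # [input_size, 3 * hidden]
combined = x @ w_zrh                             # one matmul
z, r, h = np.split(combined, 3, axis=1)

assert all(np.allclose(a, b) for a, b in zip(separate, (z, r, h)))
```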

Using only two matmuls is possible only when `linear_before_reset=true`, as is the case for all models I have worked with so far, since that is what PyTorch and cuDNN support and what TensorFlow uses by default. With the GRU formulation in the original paper, an additional matmul is required since the reset gate has to be computed before the `hidden_state @ hidden_weights[hidden_gate]` matmul can be done.
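To make the `linear_before_reset` distinction concrete, here is a hedged numpy sketch of the hidden-gate candidate in the two formulations (biases omitted; variable names are illustrative, not rten's):

```python
import numpy as np

def hidden_gate(x, h_prev, w_h, r_h, r, linear_before_reset):
    """Hidden-gate candidate for the two GRU variants (biases omitted)."""
    if linear_before_reset:
        # `h_prev @ r_h` does not depend on the reset gate `r`, so it can be
        # folded into the single combined recurrent matmul computed up front
        # for all three gates; `r` is applied to the matmul *result*.
        return np.tanh(x @ w_h + r * (h_prev @ r_h))
    else:
        # Original formulation: `r` scales `h_prev` *before* the recurrent
        # matmul, so this matmul cannot be combined with the others -- it has
        # to wait until `r` is known, costing an extra matmul per step.
        return np.tanh(x @ w_h + (r * h_prev) @ r_h)
```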

Fixes #85

TODO:

  • Fix efficiency of binary ops for non-contiguous inputs. This affects accumulating / scaling subsets of gates.
  • Fix efficiency of parallel unary ops for non-contiguous inputs. This affects applying activations to subsets of gates. Update: this mostly affected tanh on the hidden gate. In this PR a workaround has been implemented that copies the tensor to a contiguous buffer, using the pool to minimize allocation/de-allocation overhead (see the sketch after this list). An issue has been filed (Make unary ops more efficient with non-contiguous inputs, #192) for a more general solution. The issue also affects the sigmoid op, but the overhead there is much smaller.
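
A rough numpy illustration of the workaround mentioned in the second item (illustrative only; rten's actual code operates on its own tensor types and draws the temporary buffer from a pool):

```python
import numpy as np

batch, hidden = 32, 256

# Combined gate outputs laid out as [batch, 3 * hidden]. The hidden gate
# occupies the last `hidden` columns, so slicing it yields a strided,
# non-contiguous view into the combined buffer.
gates = np.random.rand(batch, 3 * hidden).astype(np.float32)
hidden_view = gates[:, 2 * hidden:]

# Workaround: copy the strided view into a contiguous buffer, apply tanh
# there (a fast, contiguous elementwise op), then write the result back.
tmp = np.ascontiguousarray(hidden_view)
np.tanh(tmp, out=tmp)
hidden_view[...] = tmp
```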

robertknight force-pushed the gru-combine-gemms branch 3 times, most recently from 5d67a3e to f0d7b9c on May 20, 2024 at 06:25
robertknight marked this pull request as ready for review on May 20, 2024 at 06:28
One of the two commits adds a workaround needed because `tanh_in_place` is very slow with non-contiguous inputs. See #192.
robertknight merged commit 2851fed into main on May 20, 2024
2 checks passed
robertknight deleted the gru-combine-gemms branch on May 20, 2024 at 06:36
Successfully merging this pull request may close these issues.

Optimize LSTM / GRU operations to do fewer matmuls