
[bug report] Some bugs or inconvenience in bidirectional_sequence_lstm and lstmunit_activation #401

Open
gdh1995 opened this issue May 23, 2022 · 4 comments

@gdh1995
Contributor

gdh1995 commented May 23, 2022

When I tried to use the bidirectional LSTM operator, I found some bugs and fixed part of them. I've created gdh1995@9ae1832 to show my patch; the ones I can't fix are also described in this issue.

Fixed issues

Below are the detailed problems in https://github.com/VeriSilicon/TIM-VX/blob/f8741b4704ab9caffd49f6fc15f72616962ba1d1/src/tim/vx/internal/src/ops/vsi_nn_op_bidirectional_sequence_lstm.c :

  1. when the passed inputs[BI_LSTM_BW_INPUT_H_STATE] is empty, it copies the shape from outputs[BI_LSTM_BW_OUTPUT_OUTPUT], but the latter may also be empty when only curr_param->merge_outputs is set
  2. it doesn't support empty c_state inputs, while unidirectional LSTM does
  3. when it creates new c_state outputs while iterating forwards, it takes the dtype from outputs[BI_LSTM_FW_OUTPUT_OUTPUT], but I think a better source would be inputs[BI_LSTM_FW_INPUT_C_STATE]
    1. the same applies while iterating backwards
  4. it sets lstm_ovxlib.internal_dtype, which should have been lstmunit_ovxlib.internal_dtype
  5. the order of tensors in lstmcell_reshape_output_tensors_bw is D'C'B'A' when the input words are A, B, C, D
    1. in PyTorch, the order is still A'B'C'D', so a merged output tensor is [ [A, A'], [B, B'], [C, C'], [D, D'] ]
    2. this judgment was tested on torch==1.9.0+cu111 with Python 3.7 x64 for Windows; see the sketch after this list
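
To double-check the PyTorch claim above, here is a minimal sketch (it only assumes torch is installed; none of the names below come from TIM-VX) that verifies the backward half of a bidirectional nn.LSTM output is stored in input order A'B'C'D':

```python
# Minimal sketch: check that PyTorch stores the backward direction of a
# bidirectional LSTM in input order (A'B'C'D'), not reversed (D'C'B'A').
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden = 4, 1, 3, 5   # four input steps: A, B, C, D

bi = nn.LSTM(input_size, hidden, bidirectional=True)

# A plain unidirectional LSTM that reuses the backward-direction weights.
bw = nn.LSTM(input_size, hidden)
with torch.no_grad():
    bw.weight_ih_l0.copy_(bi.weight_ih_l0_reverse)
    bw.weight_hh_l0.copy_(bi.weight_hh_l0_reverse)
    bw.bias_ih_l0.copy_(bi.bias_ih_l0_reverse)
    bw.bias_hh_l0.copy_(bi.bias_hh_l0_reverse)

x = torch.randn(seq_len, batch, input_size)

out_bi, _ = bi(x)                        # (seq_len, batch, 2 * hidden)
out_bw, _ = bw(torch.flip(x, dims=[0]))  # run the backward pass on D, C, B, A

# The backward half of out_bi is already re-ordered to match the input steps,
# so it equals the manual backward run flipped back to input order.
assert torch.allclose(out_bi[:, :, hidden:], torch.flip(out_bw, dims=[0]), atol=1e-5)
print("backward outputs follow input order, so the merged tensor is [[A, A'], [B, B'], ...]")
```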

Left bugs

When testing the operator, I noticed that some kernel functions of lstmunit_activation don't work under certain combinations of I/O data types. As far as I've found, they are:

On the main branch of TIM-VX with the A311D-6.4.10.2 package (https://github.com/VeriSilicon/TIM-VX/releases/tag/v1.1.42), the evis kernel (src/tim/vx/internal/src/kernel/evis/lstmunit_activation_evis.c) never works:

  1. the kernel never writes its output tensors, so the tensor will still contain old data from a previous graph run (see the sketch after this list)
  2. the failed kernels are:
    • GEN_LSTMUNIT_STRUCT_ITEMS(0, 0, 0, 0, 0, F16, F16, F16, SIGMOID, S)
    • GEN_LSTMUNIT_STRUCT_ITEMS(0, 0, 0, 0, 0, F16, U8, F16, SIGMOID, S)
  3. for example, after I created a graph with only 1 bidi-LSTM layer using the cl kernel, ran it, and then re-created the graph using the evis kernel, the new result (for an input tensor of the same shape) was exactly the same as the cl kernel's previous output
  4. meanwhile, the version in x86_64_linux works well
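
For reference, this is roughly how I check whether a kernel wrote its output at all. It is only a sketch: run_lstm_graph is a hypothetical placeholder for whatever code builds and runs the graph, not a real TIM-VX API.

```python
# Sketch of a "did the kernel write anything?" check. `run_lstm_graph` is a
# hypothetical callback that runs the graph and fills the given output buffer.
import numpy as np

SENTINEL = 0x7C  # arbitrary byte pattern, unlikely to appear as a real result

def output_was_written(run_lstm_graph, out_shape, dtype=np.uint8):
    out = np.full(out_shape, SENTINEL, dtype=dtype)  # pre-fill the output buffer
    run_lstm_graph(out)                              # the graph should overwrite it
    return not np.array_equal(out, np.full(out_shape, SENTINEL, dtype=dtype))
```

If the kernel silently skips writing, this returns False even though the graph appears to run.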

The cl kernel (src/tim/vx/internal/src/kernel/cl/lstmunit_activation_cl.c) doesn't support F32toU8_F32_SIGMOID:

  1. the failed kernel is GEN_LSTMUNIT_STRUCT_ITEMS(0, 0, 0, 0, 0, F32, U8, F32, SIGMOID, S)
  2. the kernel only outputs very small numbers (all items in the uint8-quantized tensor are either 0 or 1; see the sketch after this list)
  3. here's my test case:
    1. by default the unidirectional LSTM layer creates float16 tensors as the outputs of the FullConnect nodes
    2. I want to quantize the weights and the LSTM input+output tensors into uint8, so the hidden state is also uint8-quantized
    3. therefore lstmunit_activation receives several float16 inputs and is expected to generate a uint8-quantized tensor
  4. both the A311D-6.4.10.2 version and the x86_64_linux one give wrong outputs
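
To show why "only 0 or 1" can't be right, here is a small sketch with a hypothetical scale and zero point (the real quantization parameters of my tensor are not listed here): sigmoid outputs lie in (0, 1), so they should quantize to a wide spread of uint8 codes.

```python
# Sketch: what a uint8-quantized sigmoid output is expected to look like.
# The scale / zero_point below are hypothetical, not the values from my graph.
import numpy as np

def quantize_u8(x, scale, zero_point):
    # standard asymmetric affine quantization: q = round(x / scale) + zero_point
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

logits = np.linspace(-4.0, 4.0, 9)       # example pre-activation values
act = 1.0 / (1.0 + np.exp(-logits))      # sigmoid outputs, all in (0, 1)

print(quantize_u8(act, scale=1.0 / 256.0, zero_point=0))
# expected: a spread of codes like [  5  12  31  69 128 187 225 244 251],
# not a tensor that is almost entirely 0s and 1s
```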

Here's a summary of the two kernel bugs (uint8+fp16 means only the hidden states are uint8; the activation inputs and c_states are still float16):

                   | simulator (x86_64_linux)    | A311D
                   | evis    cl         cpu      | evis       cl         cpu
---------------------------------------------------------------------------
unidi | fp16       | ok      ok         ok       | not-write  ok         ok
      | uint8+fp16 | ok      too-small  ok       | not-write  too-small  ok
---------------------------------------------------------------------------
bidi  | fp16       | ok      ok         ok       | not-write  ok         ok
      | uint8+fp16 | ok      too-small  ok       | not-write  too-small  ok

A311D + evis: never writes the tensor at all
A311D + cl:

  • fp16+fp16 (S_F32toF32_F32_SIGMOID): ok
  • uint8+fp16 (S_F32toU8_F32_SIGMOID): value error: all are too small
@gdh1995 gdh1995 changed the title [bug report] Some bugs or inconvenience in vsi_nn_op_bidirectional_sequence_lstm.c [bug report] Some bugs or inconvenience in bidirectional_sequence_lstm and lstmunit_activation May 23, 2022
@sunshinemyson
Contributor

@gdh1995

Thanks for the details. We will check it ASAP.

@sunshinemyson
Contributor

@gdh1995

Sorry it has taken so long to get an update back to you. I discussed this with the team; we think the implementation inside "internal" is not robust and we don't want to maintain it internally anymore.

For bidirectional_sequence_lstm support, we would like to compose it in tim-vx. If that is OK with you, I can share more details about unidirectional_sequence_lstm on GitHub.

Thanks

@gdh1995
Contributor Author

gdh1995 commented Jun 16, 2022

Sorry, my English is not so good and I don't quite get what you mean.

If vsi_nn_op_bidirectional_sequence_lstm.c will not be updated, that doesn't matter to me - I can maintain a fork of TIM-VX in my private workspace and pull new changes periodically.

I still hope the buggy behavior of the EVIS and CL kernels can be fixed - because, according to their names, they should run faster than the CPU versions.

I know little about LSTM variants, so I guess the info about unidirectional LSTM would be too hard for me. If you want to add a new version of bidi-LSTM, I'm afraid I can't understand it easily, but I'm glad to try it.

@sunshinemyson sunshinemyson added this to the 22Q4_rel milestone Sep 28, 2022
@sunshinemyson
Contributor

11fd278 We have another implementation instead of reusing the original op inside the "internal" implementation.
