
Add TrOCR example #304

Merged

robertknight merged 3 commits from trocr-example into main on Aug 23, 2024

Conversation

robertknight (Owner) commented Aug 13, 2024

This is a vision-to-text example with a very similar structure to the DistilViT example.
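
For orientation, the overall flow is sketched below: the encoder runs once per image, then the decoder extends the token sequence one step at a time until it emits EOS. The types, token ids, and method names here are illustrative stand-ins, not the example's actual rten / rten-generate API.

```rust
// Sketch of the TrOCR vision-to-text flow. All names are illustrative.

struct Encoder;
struct Decoder;

impl Encoder {
    /// Map the preprocessed image to a sequence of hidden states.
    fn run(&self, _pixels: &[f32]) -> Vec<f32> {
        vec![0.0; 768] // placeholder hidden states
    }
}

impl Decoder {
    /// Given encoder hidden states and the tokens generated so far,
    /// return the next token id.
    fn next_token(&self, _encoder_states: &[f32], _tokens: &[u32]) -> u32 {
        2 // placeholder: pretend the model always emits EOS
    }
}

fn generate(encoder: &Encoder, decoder: &Decoder, pixels: &[f32]) -> Vec<u32> {
    const BOS: u32 = 0; // placeholder special-token ids
    const EOS: u32 = 2;

    // The encoder runs exactly once per image...
    let states = encoder.run(pixels);

    // ...then the decoder loop extends the token sequence one id at a time.
    let mut tokens = vec![BOS];
    loop {
        let next = decoder.next_token(&states, &tokens);
        tokens.push(next);
        if next == EOS || tokens.len() > 100 {
            break;
        }
    }
    tokens
}

fn main() {
    let ids = generate(&Encoder, &Decoder, &vec![0.0; 384 * 384 * 3]);
    println!("generated {} tokens", ids.len());
}
```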

Different TrOCR model sizes use different tokenizers. This example works with the base model, which uses a BPE tokenizer, but not the small model, which uses a unigram tokenizer.
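
A quick way to check which tokenizer family a checkpoint uses is to read the `model.type` field of its `tokenizer.json`. A minimal sketch using `serde_json`, assuming the checkpoint ships a standard Hugging Face `tokenizer.json`:

```rust
use serde_json::Value;

// Return the tokenizer family declared in a tokenizer.json document,
// e.g. "BPE" for trocr-base or "Unigram" for trocr-small.
fn tokenizer_type(json: &str) -> Option<String> {
    let doc: Value = serde_json::from_str(json).ok()?;
    doc.get("model")?
        .get("type")?
        .as_str()
        .map(|s| s.to_string())
}

fn main() {
    let json = r#"{"model": {"type": "BPE", "vocab": {}, "merges": []}}"#;
    assert_eq!(tokenizer_type(json).as_deref(), Some("BPE"));
}
```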

Compared to Ocrs, the TrOCR models are much larger and thus slower to execute. However, the larger models also have more capacity.

TODO:

  • Implement the If operator. This will allow using the "merged" decoder model from Optimum (decoder_model_merged.onnx), which is faster than using the cache-less model (decoder_model.onnx) alone and more size-efficient than using separate models for the initial run and subsequent runs. See Implement If operator #306, and the first sketch after this list.
  • Support cross-attention KV-caches in rten-generate. These are the past_key_values.{layer}.encoder.{key,value} inputs that Optimum uses. Unlike self-attention KV-caches, these are generated once when the encoder is first run and are skipped in subsequent runs (Support cross-attention key-value caches in rten-generate #318). The second sketch after this list illustrates the difference.
  • Investigate why the LayerNormalization op is not fused in the decoder.
    • The problem is that the "shift and scale" pattern in fuse_layer_norm doesn't match because it expects the arguments to the Add and Mul operators to be constants. In this model they are value nodes that capture constants defined in the parent graph. The third sketch after this list shows the mismatch.
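
First, a conceptual view of what the merged decoder's If operator does: a boolean input selects between a cache-less subgraph and a with-cache subgraph, so one file serves both the first decoding step and later steps. The subgraph names and the plain `Vec<f32>` tensors are illustrative, not rten's graph API.

```rust
// Conceptual model of the ONNX `If` operator used by
// decoder_model_merged.onnx: one boolean input picks a branch subgraph.

struct Subgraph {
    name: &'static str,
}

impl Subgraph {
    fn run(&self, inputs: &[Vec<f32>]) -> Vec<f32> {
        println!("running {} with {} inputs", self.name, inputs.len());
        Vec::new() // placeholder outputs
    }
}

/// `If` evaluates exactly one of its two branch subgraphs.
fn if_op(
    cond: bool,
    then_branch: &Subgraph,
    else_branch: &Subgraph,
    inputs: &[Vec<f32>],
) -> Vec<f32> {
    if cond {
        then_branch.run(inputs)
    } else {
        else_branch.run(inputs)
    }
}

fn main() {
    let with_past = Subgraph { name: "decoder_with_past" };
    let no_past = Subgraph { name: "decoder_no_past" };

    // First step: no cache exists yet, so take the cache-less branch.
    if_op(false, &with_past, &no_past, &[]);
    // Later steps: the cache-selection input is true, so reuse the KV-cache.
    if_op(true, &with_past, &no_past, &[vec![0.0; 4]]);
}
```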
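
Second, a sketch contrasting the two cache kinds: self-attention entries are appended every step, while cross-attention entries are derived from the encoder output once and then reused unchanged. Struct and field names are hypothetical, and the "projections" are stand-ins.

```rust
// Per-layer KV-cache, simplified. Self-attention caches grow by one entry per
// generated token; cross-attention caches are filled once and never updated.

#[derive(Default)]
struct LayerCache {
    // past_key_values.{layer}.decoder.{key,value}: extended every step.
    self_attn: Vec<(Vec<f32>, Vec<f32>)>,
    // past_key_values.{layer}.encoder.{key,value}: filled on the first step.
    cross_attn: Option<(Vec<f32>, Vec<f32>)>,
}

impl LayerCache {
    fn step(&mut self, encoder_states: &[f32], new_key: Vec<f32>, new_value: Vec<f32>) {
        // Cross-attention K/V depend only on the encoder output, so compute
        // them once and skip the projection on later steps.
        if self.cross_attn.is_none() {
            let k = encoder_states.to_vec(); // stand-in for a K projection
            let v = encoder_states.to_vec(); // stand-in for a V projection
            self.cross_attn = Some((k, v));
        }
        // Self-attention K/V depend on the tokens generated so far, so append
        // this step's entries.
        self.self_attn.push((new_key, new_value));
    }
}

fn main() {
    let mut cache = LayerCache::default();
    let enc = vec![0.1_f32; 8];
    cache.step(&enc, vec![1.0], vec![2.0]); // first step: fills cross_attn
    cache.step(&enc, vec![3.0], vec![4.0]); // later step: cross_attn reused
    assert_eq!(cache.self_attn.len(), 2);
}
```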
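
Third, the fusion problem reduced to its core: a matcher that accepts only constant operands rejects operands that are value nodes capturing parent-graph constants. The `Operand` enum is illustrative, not rten's actual graph representation.

```rust
// Why the layer-norm fusion fails, in miniature.

enum Operand {
    Constant(Vec<f32>),
    // A value node capturing a constant defined in the parent graph.
    CapturedValue { name: String },
}

/// Strict check, as in the current fuse_layer_norm "shift and scale" pattern.
fn is_constant(op: &Operand) -> bool {
    matches!(op, Operand::Constant(_))
}

/// A fix would also need to resolve captures back to parent-graph constants
/// instead of giving up on them.
fn is_constant_or_captured(op: &Operand) -> bool {
    matches!(op, Operand::Constant(_) | Operand::CapturedValue { .. })
}

fn main() {
    let scale = Operand::CapturedValue { name: "ln.weight".into() };
    assert!(!is_constant(&scale)); // pattern match fails today
    assert!(is_constant_or_captured(&scale)); // what a fix needs to accept
}
```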

Support loading tokenizers which contain entries in the `vocab` map that do not
appear in either `merges` or `added_tokens`. The TrOCR base model on Hugging
Face (https://huggingface.co/microsoft/trocr-base-printed) has an
`<|endoftext|>` token in the vocab which does not appear in the `merges` or
`added_tokens` fields.
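
A minimal sketch of the behavior this change allows, with simplified data shapes (the real loader in rten-text differs): vocab entries missing from both `merges` and `added_tokens` are kept rather than rejected, so ids like `<|endoftext|>` resolve correctly.

```rust
use std::collections::HashMap;

// Build the token-to-id map from a BPE tokenizer definition, keeping entries
// that appear only in `vocab`. Shapes and names here are hypothetical.
fn build_token_map(
    vocab: &HashMap<String, u32>,
    merge_tokens: &[String],
    added_tokens: &[String],
) -> HashMap<String, u32> {
    let mut ids = HashMap::new();
    for (token, &id) in vocab {
        let known = merge_tokens.contains(token) || added_tokens.contains(token);
        if !known {
            // These entries are now accepted instead of failing the load.
            eprintln!("vocab-only token: {token}");
        }
        ids.insert(token.clone(), id);
    }
    ids
}

fn main() {
    let vocab = HashMap::from([
        ("he".to_string(), 0),
        ("llo".to_string(), 1),
        ("<|endoftext|>".to_string(), 2),
    ]);
    let merges = vec!["he".to_string(), "llo".to_string()];
    let map = build_token_map(&vocab, &merges, &[]);
    assert_eq!(map["<|endoftext|>"], 2);
}
```
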
@robertknight robertknight marked this pull request as ready for review August 23, 2024 08:05
@robertknight robertknight merged commit 09e8d19 into main Aug 23, 2024
2 checks passed
@robertknight robertknight deleted the trocr-example branch August 23, 2024 08:05