Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new version of icu_pattern crate #4579

Closed
wants to merge 6 commits into from
Closed

Conversation

sffc
Copy link
Member

@sffc sffc commented Feb 3, 2024

We need this in so many places that I decided to build it.

I added a new type NumericPlaceholderPattern which, as the name suggests, supports only numeric placeholders. It stores all of its data in a string, so it can be stored easily in a data provider and interpolated zero-copy. I took some inspiration from the current icu_pattern when designing the API, but my proposed API has fewer generics and instead specialized on this one common use case.

Two things we don't support, and the path to add them:

  1. String placeholders. This is easy enough to add as a new type like StringPlaceholderPattern.
  2. Placeholders with additional context (like datetime patterns). It seems like this is fairly specific to the DateTime use case so I don't know if we want to support them here, but if we do, they can be added as another type like ContextualPlaceholderPattern.

The only code I copied from the existing crate was the Parser impl, which turns out to be more lines of code than the fresh code that I wrote.

@zbraniecki, if you like this direction, I can just move this code into the existing path in the file tree.

@sffc sffc requested a review from zbraniecki February 3, 2024 08:00
@sffc sffc requested a review from a team as a code owner February 3, 2024 08:00
@sffc
Copy link
Member Author

sffc commented Feb 4, 2024

Comment on lines +114 to +115
/// If set to `false`, ASCII letters can only appear in quoted literals,
/// like "{0} 'days'".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: how is "ASCII letters" defined? Space doesn't seem to be part of it, what about all the other non a-z ASCII characters?

#[derive(PartialEq, Debug, Clone)]
pub(crate) enum PatternToken<'s, P> {
Placeholder(P),
Literal { content: Cow<'s, str>, quoted: bool },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the quoted output used for? Do any consumers behave differently whether a literal was quoted or not?

/// Zero copy parsing is a model which allows the parser to produce tokens that are de-facto
/// slices of the input without ever having to modify the input or copy from it.
///
/// In case of ICU patterns that decision brings a trade-off around handling of quoted literals.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are "ICU patterns"?

@sffc
Copy link
Member Author

sffc commented Feb 9, 2024

@robertbastian The parser/mod.rs file is copied verbatim from utils/pattern. Does the rest of the PR look good to you?

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

design seems fine overall. didn't look too closely at the parser

@sffc
Copy link
Member Author

sffc commented Feb 13, 2024

@zbraniecki Are you OK if I replace /utils/pattern with this new code? I will keep your parser but replace the rest of the crate with this code.

@sffc
Copy link
Member Author

sffc commented Feb 13, 2024

Discussed with @robertbastian and agreed to make a version that includes a generic for the number of placeholders with potential data model optimizations.

@sffc sffc marked this pull request as draft February 13, 2024 17:25
sffc added a commit to sffc/omnicu that referenced this pull request Feb 14, 2024
sffc added a commit to sffc/omnicu that referenced this pull request Feb 14, 2024
@sffc
Copy link
Member Author

sffc commented Feb 14, 2024

Based on feedback from @robertbastian I opened a new PR with a different data model, #4610. I think both data models have pros and cons.

For reference, here is the data model for this PR:

https://icu4x-pr-artifacts.storage.googleapis.com/gha/eb310a34dc12d569ab2cff196256b8088551b9ee/rustdoc/icu_pattern_2/struct.NumericPlaceholderPattern.html#pattern-string-encoding-details

I have not run performance benchmarks and can do so if that is the deciding factor. As far as data size, I expect the model implemented in this PR to be slightly smaller than the one implemented in #4610.

@sffc
Copy link
Member Author

sffc commented Feb 21, 2024

I'm quite happy now with #4610 so I will close this PR.

Archived as https://github.com/sffc/omnicu/tree/archive/NumericPlaceholderPattern

@sffc sffc closed this Feb 21, 2024
@sffc sffc deleted the new-pattern branch February 21, 2024 00:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants