Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of word boundaries, fix #146 #179

Merged
merged 1 commit into from
Nov 14, 2021

Conversation

ericmarkmartin
Copy link
Contributor

What changed

In addition to the minimal fix suggested by Vouillon in #146, this PR
allows Category.inexistant characters (i.e., bos and eos) in the
following places:

  1. before bow
  2. after eow
  3. on both sides of not_boundary (but not one side alone)

Justification of change

That minimal fix would still give an unintuitive and unconventional
interpretation of word boundaries where bos and eos...

  1. are accepted on both sides of a word boundary (bow or eow)
  2. are not accepted on either side of a not_boundary

With the minimal fix, bow and eow matches would not need to have a
word between them. E.g.,

  1. in "", both bow and eow would match at position 0 with
    nothing between them
  2. in "#", bow would match before the # and eow after the #,
    even though # is not a word character.

In contrast, the convention in other regex libraries is that bos and
eos are not words:

  1. Ruby
  2. Python
  3. re2 syntax reference.
  4. PCRE: can't permalink, but you can check on https://regexpal.com .
  5. JavaScript: ditto, https://regexpal.com .

This PR implements this more conventional behavior.

Use case

Jane Street is migrating from Re2 to Re by writing a wrapper that
implements the Re2 interface on top of Re.

In most cases where Re2 disagrees with Re, we either worked around
it in the wrapper or let the wrapper diverge from real Re2.

This seems like a case where we can't work around it, and where Re2
has the more intuitive and conventional behavior.

@ericmarkmartin
Copy link
Contributor Author

My internship with Jane Street ends today, so future communication on this PR should be redirected to @dwang20151005.

@Drup
Copy link
Collaborator

Drup commented Aug 18, 2020

Sounds good to me! Could you add tests to exercise it ? Also, I suggest squashing the commits (I'll do it when merging otherwise). (cc @dwang20151005 ?)

@rgrinberg
Copy link
Member

Could you add tests to exercise it

And a CHANGES entry as well please.

In addition to the minimal fix suggested by Vouillon in ocaml#146, this PR
allows `Category.inexistant` characters (i.e., `bos` and `eos`) in the
following places:

1. before `bow`
2. after `eow`
3. on both sides of `not_boundary` (but not one side alone)

That minimal fix would still give an unintuitive and unconventional
interpretation of word boundaries where `bos` and `eos`...

1. are accepted on both sides of a word boundary (`bow` or `eow`)
2. are not accepted on either side of a `not_boundary`

With the minimal fix, `bow` and `eow` matches would not need to have a
word between them. E.g.,

1. in `""`, both `bow` and `eow` would match at position 0 with
   nothing between them
2. in `"#"`, `bow` would match before the `#` and `eow` after the `#`,
   even though `#` is not a word character.

In contrast, the convention in other regex libraries is that `bos` and
`eos` are not words:

1. [Ruby](https://rubular.com/r/7jdsGlU3LOPrX2)
2. [Python](https://pythex.org/?regex=%5Cb&test_string=%23&ignorecase=0&multiline=0&dotall=0&verbose=0)
3. [re2 syntax reference](https://github.com/google/re2/wiki/Syntax).
4. PCRE: can't permalink, but you can check on https://regexpal.com .
5. JavaScript: ditto, https://regexpal.com .

This PR implements this more conventional behavior.

Jane Street is migrating from `Re2` to `Re` by writing a wrapper that
implements the `Re2` interface on top of `Re`.

In most cases where `Re2` disagrees with `Re`, we either worked around
it in the wrapper or let the wrapper diverge from real `Re2`.

This seems like a case where we can't work around it, and where `Re2`
has the more intuitive and conventional behavior.

test(test_re): add regression tests for bow, eow, word, not_boundary

docs(CHANGES): add CHANGES entry
@ericmarkmartin
Copy link
Contributor Author

I've added regression tests and a CHANGES entry, and squashed the commits.

@rgrinberg rgrinberg merged commit a844d29 into ocaml:master Nov 14, 2021
@rgrinberg
Copy link
Member

Thanks

rgrinberg added a commit to rgrinberg/opam-repository that referenced this pull request Apr 27, 2022
CHANGES:

* Improve handling of word boundaries (ocaml/ocaml-re#179)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants