Improve handling of word boundaries, fix #146 #179
Merged
What changed
In addition to the minimal fix suggested by Vouillon in #146, this PR allows `Category.inexistant` characters (i.e., `bos` and `eos`) in the following places:
- `bow`
- `eow`
- `not_boundary` (but not one side alone)

Justification of change
That minimal fix would still give an unintuitive and unconventional interpretation of word boundaries where `bos` and `eos` ... `bow` or `eow`) `not_boundary`
With the minimal fix, `bow` and `eow` matches would not need to have a word between them. E.g.,
- in `""`, both `bow` and `eow` would match at position 0 with nothing between them
- in `"#"`, `bow` would match before the `#` and `eow` after the `#`, even though `#` is not a word character.

In contrast, the convention in other regex libraries is that `bos` and `eos` are not words.

This PR implements this more conventional behavior.
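As a rough illustration of that convention, here is a minimal sketch using Python's standard `re` module as one example of another regex library. It does not use `Re` or `Re2`. Python has no dedicated start-of-word or end-of-word tokens, so `BOW` and `EOW` below are assumed stand-ins built from `\b` plus a lookaround, and `positions` is a hypothetical helper introduced only for this example.

```python
import re

# Assumed stand-ins for bow/eow (Python's re has no dedicated tokens for them):
BOW = re.compile(r"\b(?=\w)")   # a boundary followed by a word character
EOW = re.compile(r"\b(?<=\w)")  # a boundary preceded by a word character


def positions(pattern, text):
    """Return every index at which the zero-width pattern matches in text."""
    return [m.start() for m in pattern.finditer(text)]


for text in ["", "#", "a#b"]:
    print(repr(text), "bow:", positions(BOW, text), "eow:", positions(EOW, text))
```

Under these assumptions, neither pattern matches anywhere in `""` or `"#"`, because the start and end of the string do not count as word characters. In `"a#b"` the two patterns only match around the actual words `a` and `b`.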
Use case
Jane Street is migrating from `Re2` to `Re` by writing a wrapper that implements the `Re2` interface on top of `Re`. In most cases where `Re2` disagrees with `Re`, we either worked around it in the wrapper or let the wrapper diverge from real `Re2`. This seems like a case where we can't work around it, and where `Re2` has the more intuitive and conventional behavior.