

[RFC] Add Unicode support #48

Closed · wants to merge 20 commits

Conversation

@nojb (Contributor) commented Dec 20, 2014

This is a first try at adding Unicode support to `ocaml-re`. The approach is the one suggested at the end of #24: translate *Unicode* regular expressions into *byte*-oriented regular expressions that match UTF-8-encoded strings.

This patchset adds a new module, `Re_unicode`, which defines the type of Unicode regular expressions. A new type is needed because it is not safe to mix Unicode and byte-oriented regular expressions. The interface is a slightly modified version of the existing interface in `Re`. There is also a corresponding findlib library, `re.unicode`. The module `Re_unicode` uses `uutf` to decode UTF-8 strings, but this is not a hard dependency and could easily be swapped out later to keep dependencies to a minimum.

On the implementation side, `Re_unicode` uses the same implementation as `Re`, but overloads the `Set` variant to mean sets of *Unicode* code points. Before compilation, the Unicode regular expression is traversed (see `handle_unicode` in `re_unicode.ml`) to translate Unicode character sets into suitable byte-oriented regular expressions. After this step, everything goes through as before.
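To make the translation concrete, here is a minimal sketch of the idea (the helper names are hypothetical; the patch's actual `handle_unicode` works on code-point ranges and merges byte ranges rather than enumerating points):

```ocaml
(* UTF-8 encode a Unicode scalar value as a list of byte values. *)
let utf8_bytes cp =
  if cp < 0x80 then [ cp ]
  else if cp < 0x800 then
    [ 0xC0 lor (cp lsr 6); 0x80 lor (cp land 0x3F) ]
  else if cp < 0x10000 then
    [ 0xE0 lor (cp lsr 12);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]
  else
    [ 0xF0 lor (cp lsr 18);
      0x80 lor ((cp lsr 12) land 0x3F);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]

(* A code point matches iff the exact sequence of its UTF-8 bytes matches. *)
let re_of_code_point cp =
  Re.seq (List.map (fun b -> Re.char (Char.chr b)) (utf8_bytes cp))

(* A set of code points becomes an alternation of byte sequences. *)
let re_of_code_points cps = Re.alt (List.map re_of_code_point cps)
```

For example, `re_of_code_points [0x61; 0xE9]` matches either `a` or `é` (encoded as the bytes `0xC3 0xA9`) in a UTF-8 string.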

The same POSIX character classes (`alnum`, `digit`, etc.) present in `Re` are offered by this module, but implemented in terms of Unicode as suggested by [UTS #18](http://www.unicode.org/reports/tr18/#Compatibility_Properties). More fine-grained Unicode character sets can be added easily. The definition of Unicode character sets depends on static data contained in the module `Unicode_groups`. This module can be regenerated from the Unicode Character Database by a tool called `gen_unicode_groups` found in the `tools/` directory. The tool itself depends on the [`uucd`](http://erratique.ch/software/uucd) library; no extra dependency is required for `re`.

There are some difficulties in integrating Unicode elegantly into the current code base, arising from the fact that this library makes some assumptions about the character set when compiling and interpreting REs:

- The `bol`, `eol`, `eow`, `bow` combinators are ASCII-specific and handled specially in the RE engine. These are not currently present in the `Re_unicode` interface.

- (**BREAKING CHANGE**) The `case`, `no_case` combinators have different semantics for Unicode because a single Unicode character can change case to a *sequence* of Unicode characters, so it is no longer true that applying one of these combinators to a character set gives a character set. In the present patchset, the result of applying one of these combinators to a character set is no longer considered a character set (which is the right behaviour for Unicode, I think). This can be reverted to the old behaviour for the `Re` module with a little refactoring. **UPDATE**: `case` and `no_case` now work with Unicode, using the `simple_case_folding` property of Unicode characters, which is a 1:1 mapping and so does not suffer from the above problem (see the sketch after this list).

- The `~pos` and `~len` arguments in the `exec*`, `all*`, `split*` functions require indexing into a UTF-8 string, which takes time linear in the length of the string. These arguments are not present in the current changeset, but could easily be added.
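To illustrate the 1:1 simple case folding mentioned in the second point, here is a minimal sketch (the `fold_case` stand-in covers only ASCII; the patch draws the real table, `Unicode_groups.foldcase`, from the Unicode Character Database):

```ocaml
(* Stand-in for the simple_case_folding table; only ASCII letters here. *)
let fold_case cp =
  if cp >= 0x41 && cp <= 0x5A then cp + 0x20  (* 'A'..'Z' -> 'a'..'z' *)
  else cp

(* cp matches the set case-insensitively iff its folded form is in the
   folded image of the set.  Because folding is 1:1, the result is still
   a character set, avoiding the sequence problem described above. *)
let matches_nocase set cp =
  List.mem (fold_case cp) (List.map fold_case set)
```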

In summary, I think that, long-term, the best course of action would be to refactor the library so that the core is completely independent of the character set used and deals only with arbitrary bytes. Both ASCII and Unicode engines would then be added on top. Right now, there is some special handling of the `bol`, `eol`, `eow`, `bow` combinators in the RE engine that would have to be factored out, but someone more knowledgeable about the internals needs to explain what is involved in this.

Lastly, this changeset has only been tested very lightly and more thorough testing is required, but I wanted to put it out now in order to receive feedback from those interested.

Any and all comments welcome!

Commit messages from the patchset:

- It is not currently needed.
- Matches any UTF-8 character except `'\n'`. This is what re2 does.
- `any_byte` => `any`
- This is in preparation for implementing `case`, `no_case` for Unicode using `simple_case_folding` (which is a 1:1 mapping between Unicode characters).
- The case folding table still takes up too much space, but it can be reduced further by using the same tricks as in re2 (see https://github.com/nelhage/re2/blob/master/re2/unicode_casefold.h).
- Untranslated Unicode code points were being passed to `compile_1`. Also, changed the signature of `case_insens` back to `Cset.t -> Cset.t` in preparation for implementing `case`, `no_case` for Unicode.
- `case` and `no_case` now work with Unicode. A more compact encoding of the `Unicode_groups.foldcase` table will be introduced later.
- For some reason removing them causes some tests to fail. It is also not clear whether removing or leaving them is the better option.
- They are implemented using `uutf`. Later they could be rewritten directly so that we do not depend on any external library.
@vouillon (Member) commented:
This is great! Thanks a lot!

I agree the library will need to be refactored.

Regarding the `~pos` and `~len` arguments, I think we can use byte values for them. We could experiment with a substring algebra to abstract them (see *A Subsequence Algebra: First Class Values for Substrings*, Wilfred J. Hansen).
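For context, here is a sketch (assuming well-formed UTF-8) of why code-point-valued positions would take linear time, which is what makes byte values attractive; `byte_offset_of_cp_index` is a hypothetical helper, not part of the patch:

```ocaml
(* Converting a code-point index to a byte offset requires scanning the
   prefix, because UTF-8 is variable-width: O(n) per call.  Byte-valued
   ~pos/~len avoid this scan entirely. *)
let byte_offset_of_cp_index s n =
  let rec go byte cp =
    if cp = n || byte >= String.length s then byte
    else
      let width =
        let c = Char.code s.[byte] in
        if c < 0x80 then 1        (* ASCII *)
        else if c < 0xE0 then 2   (* 2-byte sequence (lead 0xC0..0xDF) *)
        else if c < 0xF0 then 3   (* 3-byte sequence *)
        else 4                    (* 4-byte sequence *)
      in
      go (byte + width) (cp + 1)
  in
  go 0 0
```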

The `bol`, `eol`, `eow` and `bow` combinators are going to be tricky to implement. I need to think about that.

@nojb (Contributor, Author) commented Dec 28, 2014

Great! Regarding the problematic combinators (`bol`, `eol`, `eow`, `bow`), it looks like they are special cases of lookahead/lookbehind. How hard would it be to add general lookahead/lookbehind to the RE engine?

@vouillon (Member) commented Jan 5, 2015

Indeed, these combinators are special cases of lookahead and lookbehind, and I think that's how they should be implemented.

The way to implement a lookbehind subexpression `(?<=r)` is by matching with `.*(r)` in parallel. Then, the subexpression `(?<=r)` matches at a given position in the string if we have a match for `.*(r)` ending at this position.
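The equivalence can be tested naively with the existing byte-level API (a DFA engine would instead run both automata in parallel rather than re-scan the prefix); `lookbehind_matches_at` is a hypothetical illustration, not part of the patch:

```ocaml
(* (?<=r) matches at byte position i iff the prefix s[0..i) is in the
   language of .*r, i.e. some match of r ends exactly at i.  bol is then
   the special case where r is a newline (or i is the start of string). *)
let lookbehind_matches_at r s i =
  let prefix = String.sub s 0 i in
  Re.execp (Re.compile Re.(seq [ bos; rep any; r; eos ])) prefix
```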

The implementation of ocaml-re is DFA-based. Intuitively, each state of the DFA is the disjunction of the positions we may have reached in the regular expression. In fact, we have a tree structure rather than simply a disjunction to deal with the matching semantics (longest, shortest or first match). To deal with lookahead, we will need something like a disjunction of conjunctions (both the main regular expression and the lookahead expression(s) must match).

We need a couple of restrictions on lookbehind and lookahead expressions to make this work:

- the lengths of the strings matched by a lookbehind expression must be bounded, so that one knows how many bytes before the start position one should start parsing to get enough context (this is a standard restriction);
- we cannot have lookahead expressions inside lookbehind expressions (as we must be able to decide whether a lookbehind expression matches right before a given position without looking beyond this position).

And it's going to be much simpler to implement if we do not allow group matching inside lookahead and lookbehind expressions.

@m2ym commented Mar 16, 2015

👍

@nojb (Contributor, Author) commented May 15, 2020

A fun hack, but no more than that.

@nojb closed this on May 15, 2020.