

[RFC] Add Unicode support #48

Closed · wants to merge 20 commits

Conversation

@nojb (Contributor) commented Dec 20, 2014

This is a first try at adding Unicode support to `ocaml-re`. The approach is the one suggested at the end of #24: translate *Unicode* regular expressions into *byte*-oriented regular expressions that match UTF-8-encoded strings.

This patchset adds a new module, `Re_unicode`, which defines the type of Unicode regular expressions. A new type is needed because it is not safe to mix Unicode and byte-oriented regular expressions. The interface is a slightly modified version of the existing interface in `Re`. There is also a corresponding findlib library, `re.unicode`. The module `Re_unicode` uses `uutf` to decode UTF-8 strings, but this is not a hard dependency and could easily be swapped out later to keep dependencies to a minimum.

On the implementation side, `Re_unicode` uses the same implementation as `Re`, but overloads the `Set` variant to mean sets of *Unicode* code points. Before compilation, the Unicode regular expression is traversed (see `handle_unicode` in `re_unicode.ml`) to translate Unicode character sets into suitable byte-oriented regular expressions. After this step, everything goes through as before.
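To make the translation concrete, here is a minimal sketch of the idea (the helper names are hypothetical; the patch's actual `handle_unicode` works on code-point ranges and merges byte ranges rather than enumerating points):

```ocaml
(* UTF-8 encode a Unicode scalar value as a list of byte values. *)
let utf8_bytes cp =
  if cp < 0x80 then [ cp ]
  else if cp < 0x800 then
    [ 0xC0 lor (cp lsr 6); 0x80 lor (cp land 0x3F) ]
  else if cp < 0x10000 then
    [ 0xE0 lor (cp lsr 12);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]
  else
    [ 0xF0 lor (cp lsr 18);
      0x80 lor ((cp lsr 12) land 0x3F);
      0x80 lor ((cp lsr 6) land 0x3F);
      0x80 lor (cp land 0x3F) ]

(* A code point matches iff the exact sequence of its UTF-8 bytes matches. *)
let re_of_code_point cp =
  Re.seq (List.map (fun b -> Re.char (Char.chr b)) (utf8_bytes cp))

(* A set of code points becomes an alternation of byte sequences. *)
let re_of_code_points cps = Re.alt (List.map re_of_code_point cps)
```

For example, `re_of_code_points [0x61; 0xE9]` matches either `a` or `é` (encoded as the bytes `0xC3 0xA9`) in a UTF-8 string.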

The same POSIX character classes (`alnum`, `digit`, etc.) present in `Re` are offered by this module, but implemented in terms of Unicode as suggested by [UTS #18](http://www.unicode.org/reports/tr18/#Compatibility_Properties). More fine-grained Unicode character sets can be added easily. The definition of Unicode character sets depends on static data contained in the module `Unicode_groups`. This module can be regenerated from the Unicode Character Database by a tool called `gen_unicode_groups` found in the `tools/` directory. The tool itself depends on the [`uucd`](http://erratique.ch/software/uucd) library; no extra dependency is required for `re`.

There are some difficulties in integrating Unicode elegantly into the current code base, arising from the fact that this library makes some assumptions about the character set when compiling and interpreting REs:

- The `bol`, `eol`, `eow`, `bow` combinators are ASCII-specific and handled specially in the RE engine. These are not currently present in the `Re_unicode` interface.

- (**BREAKING CHANGE**) The `case`, `no_case` combinators have different semantics for Unicode because a single Unicode character can change case to a *sequence* of Unicode characters, so it is no longer true that applying one of these combinators to a character set gives a character set. In the present patchset, the result of applying one of these combinators to a character set is no longer considered a character set (which is the right behaviour for Unicode, I think). This can be reverted to the old behaviour for the `Re` module with a little refactoring. **UPDATE**: `case` and `no_case` now work with Unicode, using the `simple_case_folding` property of Unicode characters, which is a 1:1 mapping and so does not suffer from the above problem (see the sketch after this list).

- The `~pos` and `~len` arguments in the `exec*`, `all*`, `split*` functions require indexing into a UTF-8 string, which takes time linear in the length of the string. These arguments are not present in the current changeset, but could easily be added.
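To illustrate the 1:1 simple case folding mentioned in the second point, here is a minimal sketch (the `fold_case` stand-in covers only ASCII; the patch draws the real table, `Unicode_groups.foldcase`, from the Unicode Character Database):

```ocaml
(* Stand-in for the simple_case_folding table; only ASCII letters here. *)
let fold_case cp =
  if cp >= 0x41 && cp <= 0x5A then cp + 0x20  (* 'A'..'Z' -> 'a'..'z' *)
  else cp

(* cp matches the set case-insensitively iff its folded form is in the
   folded image of the set.  Because folding is 1:1, the result is still
   a character set, avoiding the sequence problem described above. *)
let matches_nocase set cp =
  List.mem (fold_case cp) (List.map fold_case set)
```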

In summary, I think that, long-term, the best course of action would be to refactor the library so that the core is completely independent of the character set used and deals only with arbitrary bytes. Both ASCII and Unicode engines would then be added on top. Right now, there is some special handling of the `bol`, `eol`, `eow`, `bow` combinators in the RE engine that would have to be factored out, but someone more knowledgeable about the internals needs to explain what is involved in this.

Lastly, this changeset has only been tested very lightly and more thorough testing is required, but I wanted to put it out now in order to receive feedback from those interested.

Any and all comments welcome!

Commit messages from the patchset:

- It is not currently needed.
- Matches any UTF-8 character except `'\n'`. This is what re2 does.
- `any_byte` => `any`
- This is in preparation for implementing `case`, `no_case` for Unicode using `simple_case_folding` (which is a 1:1 mapping between Unicode characters).
- The case folding table still takes up too much space, but it can be reduced further by using the same tricks as in re2 (see https://github.com/nelhage/re2/blob/master/re2/unicode_casefold.h).
- Untranslated Unicode code points were being passed to `compile_1`. Also, changed the signature of `case_insens` back to `Cset.t -> Cset.t` in preparation for implementing `case`, `no_case` for Unicode.
- `case` and `no_case` now work with Unicode. A more compact encoding of the `Unicode_groups.foldcase` table will be introduced later.
- For some reason removing them causes some tests to fail. It is also not clear whether removing or leaving them is the better option.
- They are implemented using `uutf`. Later they could be rewritten directly so that we do not depend on any external library.
@vouillon (Member) commented:
This is great! Thanks a lot!

I agree the library will need to be refactored.

Regarding the `~pos` and `~len` arguments, I think we can use byte values for them. We could experiment with a substring algebra to abstract them (see *A Subsequence Algebra: First Class Values for Substrings*, Wilfred J. Hansen).
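For context, here is a sketch (assuming well-formed UTF-8) of why code-point-valued positions would take linear time, which is what makes byte values attractive; `byte_offset_of_cp_index` is a hypothetical helper, not part of the patch:

```ocaml
(* Converting a code-point index to a byte offset requires scanning the
   prefix, because UTF-8 is variable-width: O(n) per call.  Byte-valued
   ~pos/~len avoid this scan entirely. *)
let byte_offset_of_cp_index s n =
  let rec go byte cp =
    if cp = n || byte >= String.length s then byte
    else
      let width =
        let c = Char.code s.[byte] in
        if c < 0x80 then 1        (* ASCII *)
        else if c < 0xE0 then 2   (* 2-byte sequence (lead 0xC0..0xDF) *)
        else if c < 0xF0 then 3   (* 3-byte sequence *)
        else 4                    (* 4-byte sequence *)
      in
      go (byte + width) (cp + 1)
  in
  go 0 0
```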

The `bol`, `eol`, `eow` and `bow` combinators are going to be tricky to implement. I need to think about that.

@nojb (Contributor, Author) commented Dec 28, 2014

Great! Regarding the problematic combinators (`bol`, `eol`, `eow`, `bow`), it looks like they are special cases of lookahead/lookbehind. How hard would it be to add general lookahead/lookbehind to the RE engine?

@vouillon (Member) commented Jan 5, 2015

Indeed, these combinators are special cases of lookahead and lookbehind, and I think that's how they should be implemented.

The way to implement a lookbehind subexpression `(?<=r)` is by matching with `.*(r)` in parallel. Then, the subexpression `(?<=r)` matches at a given position in the string if we have a match for `.*(r)` ending at this position.
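The equivalence can be tested naively with the existing byte-level API (a DFA engine would instead run both automata in parallel rather than re-scan the prefix); `lookbehind_matches_at` is a hypothetical illustration, not part of the patch:

```ocaml
(* (?<=r) matches at byte position i iff the prefix s[0..i) is in the
   language of .*r, i.e. some match of r ends exactly at i.  bol is then
   the special case where r is a newline (or i is the start of string). *)
let lookbehind_matches_at r s i =
  let prefix = String.sub s 0 i in
  Re.execp (Re.compile Re.(seq [ bos; rep any; r; eos ])) prefix
```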

The implementation of ocaml-re is DFA-based. Intuitively, each state of the DFA is the disjunction of the positions we may have reached in the regular expression. In fact, we have a tree structure rather than simply a disjunction to deal with the matching semantics (longest, shortest or first match). To deal with lookahead, we will need something like a disjunction of conjunctions (both the main regular expression and the lookahead expression(s) must match).

We need a couple of restrictions on lookbehind and lookahead expressions to make this work:

- the lengths of the strings matched by a lookbehind expression must be bounded, so that one knows how many bytes before the start position one should start parsing to get enough context (this is a standard restriction);
- we cannot have lookahead expressions inside lookbehind expressions (as we must be able to decide whether a lookbehind expression matches right before a given position without looking beyond this position).

And it's going to be much simpler to implement if we do not allow group matching inside lookahead and lookbehind expressions.

@m2ym commented Mar 16, 2015

👍

@nojb (Contributor, Author) commented May 15, 2020

A fun hack, but no more than that.

@nojb closed this on May 15, 2020.