Add core:text/regex #3962

Merged: 24 commits into odin-lang:master, Aug 21, 2024

Feoramund
Contributor

Regular Expressions

I've been testing, benchmarking, and optimizing this over the past couple of weeks, and now I'm finally ready to share it with everyone. I realize a package like this opens a Pandora's box of sorts, given the reputation regular expressions have for being misused, but I've included some helpful tips in the documentation on when and when not to use RegEx. With that said, let's get into it.

First of all, this is a complete implementation of a multi-threaded virtual machine based on the writings of Russ Cox, with designs attributed to Ken Thompson and Rob Pike. The "threads" here are the VM's own lightweight execution states, not OS threads. The VM dispatches runes one at a time, running each thread in lock step with no look-behind, which guarantees linear-time execution.
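To make the lock-step idea concrete, here is a minimal sketch of a VM in that style. It assumes a made-up four-opcode instruction set (`Rune`, `Jump`, `Split`, `Match`) and a hand-assembled program; the names `Inst`, `add_thread`, and `run` are illustrative and are not the package's real opcodes, encoding, or API. Captures and flags are left out.

```odin
package main

import "core:fmt"

// Hypothetical instruction set, for illustration only.
Opcode :: enum u8 { Rune, Jump, Split, Match }

Inst :: struct {
	op:   Opcode,
	r:    rune, // rune to consume, for .Rune
	x, y: int,  // target of .Jump (x); both targets of .Split (x, y)
}

// Adds a thread at `pc`, following Jump and Split immediately so the list
// only ever holds Rune and Match instructions. `seen` keeps one thread per PC.
add_thread :: proc(prog: []Inst, list: ^[dynamic]int, seen: []bool, pc: int) {
	if seen[pc] { return }
	seen[pc] = true
	switch prog[pc].op {
	case .Jump:
		add_thread(prog, list, seen, prog[pc].x)
	case .Split:
		add_thread(prog, list, seen, prog[pc].x)
		add_thread(prog, list, seen, prog[pc].y)
	case .Rune, .Match:
		append(list, pc)
	}
}

// Reports whether `prog` matches all of `input`.
run :: proc(prog: []Inst, input: string) -> bool {
	current, next: [dynamic]int
	defer delete(current)
	defer delete(next)
	seen := make([]bool, len(prog))
	defer delete(seen)

	add_thread(prog, &current, seen, 0)
	for r in input {
		clear(&next)
		for i in 0..<len(seen) { seen[i] = false }
		// Every live thread sees the same rune before any thread advances.
		for pc in current {
			if prog[pc].op == .Rune && prog[pc].r == r {
				add_thread(prog, &next, seen, pc + 1)
			}
		}
		current, next = next, current
	}
	for pc in current {
		if prog[pc].op == .Match { return true }
	}
	return false
}

main :: proc() {
	// Hand-assembled program for `a+b`.
	prog := []Inst{
		{op = .Rune, r = 'a'},
		{op = .Split, x = 0, y = 2},
		{op = .Rune, r = 'b'},
		{op = .Match},
	}
	fmt.println(run(prog, "aaab")) // true
	fmt.println(run(prog, "aac"))  // false
}
```

Because `seen` admits at most one thread per program counter, each input rune costs at most one visit per instruction, which is where the linear-time bound comes from.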

Feature Set

Feature               Example
Classes               [0-9]
Wildcards             .
Alternation           apple|cherry
Repeat Zero           a*
Repeat One            a+
Optional              a?
Grouping              (a)
Non-Capture Grouping  (?:a)
Anchors               ^$
Word Boundaries       \b\B

Compile-Time Flags

Flag  Name               Description
g     Global             Attempts to match the expression anywhere in the string.
m     Multiline          Changes ^ and $ to behave as if they also match newlines.
i     Case Insensitive   Ignores case when matching.
x     Ignore Whitespace  Ignores all non-escaped whitespace in a pattern.
u     Unicode            Explicitly compiles and matches for Unicode. ASCII is the default.
n     No Capture         Compiles the expression without save groups; it will only return true or false for a match.
-     No Optimization    Disables the optimizer; for debugging.

Construction

Regular expressions may be compiled either by specifying the pattern and a bit_set of flags, or a delimited string.

The following three ways are equivalent.

// Standard
rex, err := regex.create("hellope", { .Global })

// Delimiter (standard slash)
rex, err := regex.create_by_user("/hellope/g")

// Delimiter (custom rune)
rex, err := regex.create_by_user("#hellope#g")

For create_by_user, the delimiter is determined by the first rune in the string and has the only requirement that it not be \.
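As a rough illustration of that rule, the sketch below splits a delimited string into its pattern and flag parts. `split_delimited` is a hypothetical helper written for this example and is not how the package parses its input; in particular, treating a backslash-escaped delimiter as part of the pattern is an assumption.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

// Hypothetical helper, not the package's code: splits "/pattern/flags" into
// its two parts. The first rune is the delimiter; `\` is rejected.
split_delimited :: proc(s: string) -> (pattern, flags: string, ok: bool) {
	delim, width := utf8.decode_rune_in_string(s)
	if width == 0 || delim == '\\' {
		return
	}
	rest := s[width:]
	escaped := false
	for r, i in rest {
		if escaped {
			escaped = false // assumption: a rune after `\` never closes the pattern
		} else if r == '\\' {
			escaped = true
		} else if r == delim {
			// `r` equals `delim`, so it occupies `width` bytes as well.
			return rest[:i], rest[i + width:], true
		}
	}
	return // no closing delimiter found
}

main :: proc() {
	pattern, flags, ok := split_delimited("/hellope/g")
	fmt.println(pattern, flags, ok) // hellope g true

	pattern, flags, ok = split_delimited("#hellope#g")
	fmt.println(pattern, flags, ok) // hellope g true
}
```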

Matching

Regular expressions are matched against strings after compilation. No API is provided to match an arbitrary pattern; they must be compiled first.

// One-time Match
capture, success := regex.match(rex, "hellope")

// Re-using a Capture struct
my_capture := regex.preallocate_capture()
num_groups, success := regex.match(rex, "hellope", &my_capture)
num_groups, success = regex.match(rex, "hellope2", &my_capture)

Benchmark Results

Odin:    dev-2024-07:68550cf91
OS:      Arch Linux, Linux 6.9.7-arch1-1
CPU:     12th Gen Intel(R) Core(TM) i7-12700K
RAM:     31913 MiB
Backend: LLVM 17.0.6

Here are the results from the included benchmark, output modified only for readability.

Command: odin test . -define:ODIN_TEST_THREADS=1 -o:speed -disable-assert -no-bounds-check

Matching /a(?:bb|cc|dd|ee|ff)/gn over a text block of only `a`s.
[2.00KiB : 78.64µs : 24.84MiB/s]
[32.00KiB : 816.593µs : 38.27MiB/s]
[64.00KiB : 1.45482ms : 42.96MiB/s]
[256.00KiB : 6.05229ms : 41.31MiB/s]
[512.00KiB : 11.594802ms : 43.12MiB/s]
[1.00MiB : 23.196333ms : 43.11MiB/s]
[2.00MiB : 47.743698ms : 41.89MiB/s]

Matching /[\w\d]+/g over a string of spaces with "0123456789abcdef" at the end.
[2.00KiB : 13.015µs : 150.07MiB/s]
[32.00KiB : 157.191µs : 198.80MiB/s]
[64.00KiB : 228.776µs : 273.19MiB/s]
[256.00KiB : 936.322µs : 267.00MiB/s]
[512.00KiB : 1.887321ms : 264.93MiB/s]
[1.00MiB : 3.752987ms : 266.45MiB/s]
[2.00MiB : 7.501526ms : 266.61MiB/s]


[8 : 1.459µs : 2.61MiB/s] Matched `a?^8a^8` against `a^8`.
[16 : 2.232µs : 3.42MiB/s] Matched `a?^16a^16` against `a^16`.
[32 : 5.538µs : 2.76MiB/s] Matched `a?^32a^32` against `a^32`.
[64 : 19.385µs : 1.57MiB/s] Matched `a?^64a^64` against `a^64`.

Matching /Hellope World!/g over a block of random ASCII text.
[2.00KiB : 6.263µs : 311.85MiB/s]
[32.00KiB : 91.766µs : 340.54MiB/s]
[64.00KiB : 194.674µs : 321.05MiB/s]
[256.00KiB : 794.582µs : 314.63MiB/s]
[512.00KiB : 1.567796ms : 318.92MiB/s]
[1.00MiB : 3.137128ms : 318.76MiB/s]
[2.00MiB : 6.310817ms : 316.92MiB/s]

Matching /こにちは/gu over a block of random Unicode text.
[2.00KiB : 5.903µs : 330.87MiB/s]
[32.00KiB : 78.118µs : 400.04MiB/s]
[64.00KiB : 157.795µs : 396.08MiB/s]
[256.00KiB : 661.053µs : 378.18MiB/s]
[512.00KiB : 1.285897ms : 388.83MiB/s]
[1.00MiB : 2.578081ms : 387.89MiB/s]
[2.00MiB : 5.188262ms : 385.49MiB/s]

Implementation Details

The construction stage starts with a tokenizer for the pattern string, passes the tokens to a Pratt parser to resolve operator precedence, then hands the resulting AST to a bytecode-based compiler for maximum compactness. The original implementation of the compiler and virtual machine used an array of unions to structs. The current bytecode design is much smaller and even slightly faster.

All opcodes are 8 bits, and no operand is greater than 32 bits. Most opcodes take no operands.

The compiler also comes with an expression optimizer that works on the AST given to it by the parser. It can rewrite simple constructs into more performant equivalents.

Optimizer Feature Summary

Name                                         Example
Class Simplification                         [aab] => [ab]
Class Reduction                              [a] => a
Range Construction                           [abc] => [a-c]
Rune Merging into Range                      [aa-c] => [a-c]
Range Merging                                [a-cc-e] => [a-e]
Alternation to Optional                      a| => a?
Alternation to Optional Non-Greedy           |a => a??
Alternation Reduction                        a|a => a
Alternation to Class                         a|b => [ab]
Class Union                                  [a0]|[b1] => [a0b1]
Wildcard Reduction                           a|. => .
Common Suffix Elimination                    blueberry|strawberry => (?:blue|straw)berry
Common Prefix Elimination                    abi|abe => ab(?:i|e)
Composition: Consume All to Anchored End     .*$ => <special opcode>

The optimizer runs an indefinite number of passes until it sees no more changes in the AST.

There are also a few post-compilation optimizations done at the end, such as turning Jumps to Jumps into Jumps to their final calculated destination and resolving relative Jumps into absolute Jumps.
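The Jump-to-Jump rewrite can be sketched as follows. This is only an illustration under simplifying assumptions: instructions are structs rather than packed bytecode, targets are already absolute, and `Inst` and `thread_jumps` are hypothetical names. The relative-to-absolute resolution step is not shown.

```odin
package main

import "core:fmt"

// Hypothetical instruction layout, for illustration only.
Opcode :: enum u8 { Rune, Jump, Match }

Inst :: struct {
	op:     Opcode,
	target: int, // absolute destination, used by .Jump only
}

// Rewrites each Jump that lands on another Jump to point at the end of the chain.
thread_jumps :: proc(prog: []Inst) {
	for i in 0..<len(prog) {
		if prog[i].op != .Jump { continue }
		target := prog[i].target
		// Bounding the walk by len(prog) keeps a cycle of Jumps from looping forever.
		for hops := 0; hops < len(prog) && prog[target].op == .Jump; hops += 1 {
			target = prog[target].target
		}
		prog[i].target = target
	}
}

main :: proc() {
	prog := []Inst{
		{op = .Jump, target = 2}, // 0: lands on another Jump
		{op = .Rune},             // 1
		{op = .Jump, target = 4}, // 2: lands on yet another Jump
		{op = .Rune},             // 3
		{op = .Jump, target = 5}, // 4
		{op = .Match},            // 5
	}
	thread_jumps(prog)
	fmt.println(prog[0].target, prog[2].target, prog[4].target) // 5 5 5
}
```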

Of note, I've used a bitmap to keep track of which threads occupy which PCs. No method was explicitly described in Russ's articles of how to do this, but I found this to be the simplest way.
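One way such a bitmap might look is sketched below: one bit per program counter, packed into 64-bit words. The helper `test_and_set` and the program size are hypothetical, and the package's actual layout may differ.

```odin
package main

import "core:fmt"

// Marks `pc` as occupied and reports whether a thread was already there.
test_and_set :: proc(bitmap: []u64, pc: int) -> (occupied: bool) {
	word := pc / 64
	mask := u64(1) << uint(pc % 64)
	occupied = (bitmap[word] & mask) != 0
	bitmap[word] |= mask
	return
}

main :: proc() {
	program_size :: 200 // hypothetical program length
	bitmap := make([]u64, (program_size + 63) / 64)
	defer delete(bitmap)

	fmt.println(test_and_set(bitmap, 130)) // false: first thread to reach PC 130
	fmt.println(test_and_set(bitmap, 130)) // true: a second thread would be redundant
	fmt.println(test_and_set(bitmap, 131)) // false
}
```

A second thread arriving at an occupied PC can be dropped, since it would do exactly what the first one does from that point on.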

Limitations

Due to the size of the Jump and Split operands (16 bits), the VM is limited to a program size no larger than 32,767 bytes. This should be enough for any sane pattern.

Due to the size of the Rune_Class operands, no more than 256 [abc]-type class specifiers can be stored in a Regular Expression. Note that both the negated and regular Rune_Class share which Rune_Class_Data specifiers they refer to, so [a-c] and [^a-c] are stored the same but evaluated separately. The optimizer is also able to reduce how many classes are used, to a limited extent, by collapsing fundamentally identical classes that are written differently, e.g. [0-3] versus [0123].
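To illustrate how two differently written classes can turn out to be the same, the sketch below normalizes a class into sorted, merged ranges, after which [0123] and [0-3] compare equal. `Range` and `normalize` are hypothetical names for this example; the optimizer's real data structures are not shown in this description.

```odin
package main

import "core:fmt"
import "core:slice"

Range :: struct { lo, hi: rune } // a lone rune `a` is the range a-a

// Sorts by lower bound, then merges ranges that overlap or sit side by side.
normalize :: proc(ranges: []Range) -> [dynamic]Range {
	sorted: [dynamic]Range
	defer delete(sorted)
	append(&sorted, ..ranges)
	slice.sort_by(sorted[:], proc(a, b: Range) -> bool { return a.lo < b.lo })

	out: [dynamic]Range
	for r in sorted {
		n := len(out)
		if n > 0 && r.lo <= out[n - 1].hi + 1 {
			// Overlapping or adjacent: extend the previous range if needed.
			if r.hi > out[n - 1].hi { out[n - 1].hi = r.hi }
		} else {
			append(&out, r)
		}
	}
	return out
}

main :: proc() {
	as_runes := normalize([]Range{{'0', '0'}, {'1', '1'}, {'2', '2'}, {'3', '3'}}) // [0123]
	as_range := normalize([]Range{{'0', '3'}})                                     // [0-3]
	defer delete(as_runes)
	defer delete(as_range)

	same := len(as_runes) == 1 && len(as_range) == 1 && as_runes[0] == as_range[0]
	fmt.println(same) // true
}
```

The same merge step also covers the range rows of the optimizer table: [aa-c] and [a-cc-e] both collapse to a single range.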

For arbitrary reasons, it is not possible to save more than 10 capture groups, including the implicit expression-wide capture group. Thus an expression can only have up to 9 distinct (groups).

Testing and Debugging

A complete test suite of 64 test cases is included, covering standard regular expressions, erroneous expressions, invalid expressions, and a few edge cases with the optimizer.

There is also an ODIN_DEBUG_REGEX config that enables output to STDERR of differences between the unoptimized and optimized AST and each thread's execution in the virtual machine.

Final Notes

This is a sizeable package with much going on inside. I've reviewed it myself a few times now, but that doesn't mean I haven't missed something. I intend to let this sit as a draft to gather feedback for a few days. Otherwise, it is feature-complete and fully functional.

If you happen to give the package a go, I'm on the lookout for bugs, particularly misconstructed patterns (so a pattern that shouldn't compile but does, or a pattern that matches in an unexpected way). I'd also like to hear about API ideas. I have a note to consider possibly removing groups from the Capture struct and relying only on pos instead (or alternatively turning them into multi-pointers to conserve space), but this is more of a matter of taste, to see what the community might like.

With the variety of parsers out there, there's bound to be a feature I haven't implemented. If it's simple enough, I can see to it, but for now, this is a rather basic implementation of Regular Expressions with serviceable speed.

I just hope someone doesn't use this package a few years in the future to write an HTML parser...

class_data := vm.class_data[operand]
next_rune := vm.next_rune

check: {
Member

LOVELY use of a named block!

Simplified error checking while I was at it, too.

The `original_pattern` introduced a tenuous dependency to the expression
value as a whole, and after some consideration, I decided that it would
be better for the developer to manage their own pattern strings.

In the event you need to print the text representation of a pattern,
it's usually better that you manage the memory of it as well.

This should hopefully avoid any issues with loading operands greater
than 8 bits on alignment-sensitive platforms.
Contributor

@flysand7 flysand7 left a comment

If it's not too much, can you provide some documentation to public procedures and types before merging? I see that you did a great job documenting the packages themselves. A lot of procedures are obvious as to what they do, but other procedures would like to have some clarifications in usage.

@flysand7
Contributor

Also some of the features that might be wanted:

  • Repeat pattern n times {n}
  • Match start of string/end of string (unaffected by multiline mode) \A and \Z
  • Any whitespace character \s or \w depending on the dialect, and other similar classes, like the ones to match any digit, any letter etc. Due to overlap with classes, it's better if these work on all unicode according to its character classification.

I'm not sure if any of these are implemented, but I'm seeing you didn't mention them on your PR. I haven't had the time to check the code yet beyond some light skimming

@Feoramund
Contributor Author

> If it's not too much, can you provide some documentation to public procedures and types before merging?

They've all been documented already.

> A lot of procedures are obvious as to what they do, but other procedures would like to have some clarifications in usage.

Which ones did you have trouble understanding?

> • Repeat pattern n times {n}

Already implemented. {n,m} {n} {n,} {,m}

> • Match start of string/end of string (unaffected by multiline mode) \A and \Z

Could be implemented.

> • Any whitespace character \s or \w depending on the dialect, and other similar classes, like the ones to match any digit, any letter etc. Due to overlap with classes, it's better if these work on all unicode according to its character classification.

Already implemented (\w\d\s\W\D\S), but they do not have any notion of Unicode, i.e. for \s no Unicode spaces are checked, just ASCII.

@Kelimion
Member

> Already implemented (\w\d\s\W\D\S), but they do not have any notion of Unicode, i.e. for \s no Unicode spaces are checked, just ASCII.

We do have unicode.is_space, which could be turned into its own bytecode that you emit if Unicode mode is enabled? The rest could be documented as doing the same as ASCII mode.

@Feoramund
Contributor Author

> We do have unicode.is_space, which could be turned into its own bytecode that you emit if Unicode mode is enabled? The rest could be documented as doing the same as ASCII mode.

That's certainly doable with the individual shorthands, in isolation: that we could use special opcodes for the various Unicode classes. The real issue arises when someone wants to have a class with one of those shorthands and some additional characters. I.e. \s => Unicode_Space in .Unicode mode is simple, but [_\s] will require duplicating all the Unicode spaces into the Rune_Class_Data along with the extra characters.

Any special class-based opcodes will also require an extra step in the optimizer to check if a user has supplied a pattern that matches any of what these opcodes represent, i.e. if I wrote some class that had every character checked by unicode.is_space.

Character class shorthands are one of the more maintenance-heavy parts of this implementation, as can be seen at the three @MetaCharacter note points in the comments.

@Kelimion
Member

>> We do have unicode.is_space, which could be turned into its own bytecode that you emit if Unicode mode is enabled? The rest could be documented as doing the same as ASCII mode.

> That's certainly doable with the individual shorthands, in isolation: that we could use special opcodes for the various Unicode classes. The real issue arises when someone wants to have a class with one of those shorthands and some additional characters. I.e. \s => Unicode_Space in .Unicode mode is simple, but [_\s] will require duplicating all the Unicode spaces into the Rune_Class_Data along with the extra characters.

> Any special class-based opcodes will also require an extra step in the optimizer to check if a user has supplied a pattern that matches any of what these opcodes represent, i.e. if I wrote some class that had every character checked by unicode.is_space.

> Character class shorthands are one of the more maintenance-heavy parts of this implementation, as can be seen at the three @MetaCharacter note points in the comments.

Good points.

We don't directly support printing these.

To prevent future issues being raised about the pattern being missing if
someone tries to print one, hide everything.
@Feoramund marked this pull request as ready for review on August 4, 2024, 23:31
@Feoramund
Contributor Author

I thought about this for a few days, and I've decided to leave the shorthand classes as they are: ASCII only. It's simpler this way. I've written my rationale in the documentation along with a workaround if someone wants their own shorthand. If there's a great need or provable utility for expanding them, we can look at that another time.

Also added some more documentation in general, along with an extra test. No major changes since I last checked in.

Ready to go.

@DamienPetrilli

@Feoramund I have been using your regex package for 2 weeks now. It's very solid, thanks!

@gingerBill
Member

So this PR is amazing, but I am just contemplating whether this package should be in core or not simply because:

Should people be using regex in the first place when other better alternatives exist? If it exists in core, people will use and abuse it, and think because it is in core, it is the recommended way without looking for things like core:text/scanner or core:text/match.

@Feoramund
Contributor Author

This is a very reasonable cautionary consideration, given the usage history of regular expressions. If you end up not wanting it in Odin, I can move it to a repo of my own, but I'm glad to make any more technical changes you might like if that ends up not being the case.

However, I'll say that I started on this work because I saw it as a candidate listed in #978 and thoroughly read through the writings of Russ Cox you linked back then. I know that discussion is almost 4 years old now, but it looked like the most interesting target at the time, both for my own learning and for its utility to others.

That said, I'm not arguing for or against its inclusion into Odin.

@gingerBill merged commit 58e811e into odin-lang:master on Aug 21, 2024
6 checks passed