Add `core:text/regex` #3962

Feoramund · 2024-07-22T19:47:18Z

Regular Expressions

I've been testing, benchmarking, and optimizing this over the past couple weeks, and now I'm finally ready to share it with everyone. I realize a package like this is going to be opening a Pandora's box of sorts, given the reputation that regular expressions have for being misused, but I've included some helpful tips on when and when not to use RegEx in the documentation. With that said, let's get into it.

First of all, this is a complete implementation of a multi-threaded virtual machine based off of the writings of Russ Cox with designs attributed to Ken Thompson and Rob Pike. The VM dispatches runes, one at a time, running each thread in lock step with no look-behind, guaranteeing linear time execution.
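To make the lock-step idea concrete, here is a minimal sketch of a Thompson/Pike-style simulation in Odin. It is not the package's bytecode or VM: `Opcode`, `Inst`, `add_thread`, and `run` are hypothetical names, instructions are plain structs, and the sketch only reports whether the whole input matches. The "threads" are VM threads (program counters), not OS threads.

```odin
package main

import "core:fmt"

// Hypothetical instruction set for illustration only.
Opcode :: enum u8 {
	Rune,  // consume one rune equal to `r`
	Split, // continue at both `x` and `y`
	Jump,  // continue at `x`
	Match, // accept
}

Inst :: struct {
	op:   Opcode,
	r:    rune,
	x, y: int,
}

// Follows Jump/Split without consuming input. Each PC is recorded at most
// once per step, so the thread list never exceeds the program size.
add_thread :: proc(prog: []Inst, list: ^[dynamic]int, seen: []bool, pc: int) {
	if seen[pc] {
		return
	}
	seen[pc] = true
	switch prog[pc].op {
	case .Jump:
		add_thread(prog, list, seen, prog[pc].x)
	case .Split:
		add_thread(prog, list, seen, prog[pc].x)
		add_thread(prog, list, seen, prog[pc].y)
	case .Rune, .Match:
		append(list, pc)
	}
}

// Reports whether `prog` accepts the whole of `input`.
// Assumes the program ends with a Match instruction.
run :: proc(prog: []Inst, input: string) -> bool {
	current, next: [dynamic]int
	defer delete(current)
	defer delete(next)
	seen := make([]bool, len(prog))
	defer delete(seen)

	add_thread(prog, &current, seen, 0)
	for r in input {
		for i in 0..<len(seen) {
			seen[i] = false
		}
		// Every live thread sees the same rune before any thread advances.
		for pc in current {
			inst := prog[pc]
			if inst.op == .Rune && inst.r == r {
				add_thread(prog, &next, seen, pc + 1)
			}
		}
		current, next = next, current
		clear(&next)
	}
	for pc in current {
		if prog[pc].op == .Match {
			return true
		}
	}
	return false
}

main :: proc() {
	// Hand-assembled program for `a+b`.
	prog := []Inst{
		{op = .Rune, r = 'a'},
		{op = .Split, x = 0, y = 2},
		{op = .Rune, r = 'b'},
		{op = .Match},
	}
	fmt.println(run(prog, "aaab"), run(prog, "aaba"))
}
```

Because each step visits every PC at most once, the work per rune is bounded by the program size, which is where the linear-time guarantee comes from.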

Feature Set

| Feature | Example |
| --- | --- |
| Classes | `[0-9]` |
| Wildcards | `.` |
| Alternation | `apple\|cherry` |
| Repeat Zero | `a*` |
| Repeat One | `a+` |
| Optional | `a?` |
| Grouping | `(a)` |
| Non-Capture Grouping | `(?:a)` |
| Anchors | `^$` |
| Word Boundaries | `\b\B` |

Compile-Time Flags

| Flag | Name | Description |
| --- | --- | --- |
| `g` | Global | Attempts to match the expression anywhere in the string. |
| `m` | Multiline | Changes `^` and `$` to behave as if they also match newlines. |
| `i` | Case Insensitive | Ignores case when matching. |
| `x` | Ignore Whitespace | Ignores all non-escaped whitespace in a pattern. |
| `u` | Unicode | Explicitly compiles and matches for Unicode. ASCII is the default. |
| `n` | No Capture | Compiles the expression without save groups; it will only return true or false for a match. |
| `-` | No Optimization | Disables the optimizer; for debugging. |

Construction

Regular expressions may be compiled either by specifying the pattern and a bit_set of flags, or a delimited string.

The following three ways are equivalent.

// Standard
rex, err := regex.create("hellope", { .Global })

// Delimiter (standard slash)
rex, err := regex.create_by_user("/hellope/g")

// Delimiter (custom rune)
rex, err := regex.create_by_user("#hellope#g")

For `create_by_user`, the delimiter is the first rune in the string; the only requirement is that it not be `\`.
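As a rough illustration of the delimiter rule, the sketch below splits a delimited string into its pattern and flag parts. `split_delimited` is a hypothetical helper, not part of the package, and it simplifies by treating the last occurrence of the delimiter as the closing one; the real parsing may differ, for example in how it treats escaped delimiters.

```odin
package main

import "core:fmt"
import "core:strings"
import "core:unicode/utf8"

// Hypothetical helper: splits "/pattern/flags" style input.
// The first rune is taken as the delimiter; `\` is rejected.
split_delimited :: proc(s: string) -> (pattern: string, flags: string, ok: bool) {
	delim, width := utf8.decode_rune_in_string(s)
	if width == 0 || delim == '\\' {
		return
	}
	rest := s[width:]
	// Simplification: the closing delimiter is the last occurrence.
	end := strings.last_index(rest, s[:width])
	if end < 0 {
		return
	}
	return rest[:end], rest[end + width:], true
}

main :: proc() {
	for input in ([]string{"/hellope/g", "#hellope#g", "\\hellope\\g"}) {
		pattern, flags, ok := split_delimited(input)
		fmt.println(input, pattern, flags, ok)
	}
}
```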

Matching

Regular expressions are matched against strings after compilation. No API is provided to match an arbitrary pattern; they must be compiled first.

// One-time Match
capture, success := regex.match(rex, "hellope")

// Re-using a Capture struct
my_capture := regex.preallocate_capture()
num_groups, success := regex.match(rex, "hellope", &my_capture)
num_groups, success = regex.match(rex, "hellope2", &my_capture)
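Putting construction and matching together, a complete program might look like the sketch below. It uses only the calls shown in this description plus `regex.destroy`, which appears in a later commit in this PR; the `groups` and `pos` fields are the ones mentioned in the Final Notes. `has_match` is a hypothetical convenience wrapper, and the exact signatures are assumed from the snippets above.

```odin
package main

import "core:fmt"
import "core:text/regex"

// Hypothetical wrapper: compiles, matches once, and cleans up.
// Compiling on every call is wasteful; it is done here only for brevity.
has_match :: proc(pattern, text: string) -> bool {
	rex, err := regex.create(pattern, {.Global})
	if err != nil {
		return false
	}
	defer regex.destroy(rex)

	capture, ok := regex.match(rex, text)
	defer regex.destroy(capture)
	return ok
}

main :: proc() {
	rex, err := regex.create("hell(ope)", {.Global})
	if err != nil {
		fmt.eprintln("failed to compile pattern:", err)
		return
	}
	defer regex.destroy(rex)

	capture, ok := regex.match(rex, "well, hellope there")
	defer regex.destroy(capture)
	if !ok {
		fmt.println("no match")
		return
	}
	// Assumption: index 0 is the implicit expression-wide group.
	for group, i in capture.groups {
		fmt.printf("group %d: %q at %v\n", i, group, capture.pos[i])
	}
}
```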

Benchmark Results

Odin:    dev-2024-07:68550cf91
OS:      Arch Linux, Linux 6.9.7-arch1-1
CPU:     12th Gen Intel(R) Core(TM) i7-12700K
RAM:     31913 MiB
Backend: LLVM 17.0.6

Here are the results from the included benchmark, output modified only for readability.

Command: odin test . -define:ODIN_TEST_THREADS=1 -o:speed -disable-assert -no-bounds-check

Matching /a(?:bb|cc|dd|ee|ff)/gn over a text block of only `a`s.
[2.00KiB : 78.64µs : 24.84MiB/s]
[32.00KiB : 816.593µs : 38.27MiB/s]
[64.00KiB : 1.45482ms : 42.96MiB/s]
[256.00KiB : 6.05229ms : 41.31MiB/s]
[512.00KiB : 11.594802ms : 43.12MiB/s]
[1.00MiB : 23.196333ms : 43.11MiB/s]
[2.00MiB : 47.743698ms : 41.89MiB/s]

Matching /[\w\d]+/g over a string of spaces with "0123456789abcdef" at the end.
[2.00KiB : 13.015µs : 150.07MiB/s]
[32.00KiB : 157.191µs : 198.80MiB/s]
[64.00KiB : 228.776µs : 273.19MiB/s]
[256.00KiB : 936.322µs : 267.00MiB/s]
[512.00KiB : 1.887321ms : 264.93MiB/s]
[1.00MiB : 3.752987ms : 266.45MiB/s]
[2.00MiB : 7.501526ms : 266.61MiB/s]


[8 : 1.459µs : 2.61MiB/s] Matched `a?^8a^8` against `a^8`.
[16 : 2.232µs : 3.42MiB/s] Matched `a?^16a^16` against `a^16`.
[32 : 5.538µs : 2.76MiB/s] Matched `a?^32a^32` against `a^32`.
[64 : 19.385µs : 1.57MiB/s] Matched `a?^64a^64` against `a^64`.

Matching /Hellope World!/g over a block of random ASCII text.
[2.00KiB : 6.263µs : 311.85MiB/s]
[32.00KiB : 91.766µs : 340.54MiB/s]
[64.00KiB : 194.674µs : 321.05MiB/s]
[256.00KiB : 794.582µs : 314.63MiB/s]
[512.00KiB : 1.567796ms : 318.92MiB/s]
[1.00MiB : 3.137128ms : 318.76MiB/s]
[2.00MiB : 6.310817ms : 316.92MiB/s]

Matching /こにちは/gu over a block of random Unicode text.
[2.00KiB : 5.903µs : 330.87MiB/s]
[32.00KiB : 78.118µs : 400.04MiB/s]
[64.00KiB : 157.795µs : 396.08MiB/s]
[256.00KiB : 661.053µs : 378.18MiB/s]
[512.00KiB : 1.285897ms : 388.83MiB/s]
[1.00MiB : 2.578081ms : 387.89MiB/s]
[2.00MiB : 5.188262ms : 385.49MiB/s]

Implementation Details

The construction stage starts with a tokenizer for the pattern string, passes the tokens to a Pratt parser to resolve operator precedence, then hands the AST to a bytecode compiler for maximum compactness. The original implementation of the compiler and virtual machine used an array of unions of structs. The current bytecode design is much smaller and even slightly faster.

All opcodes are 8 bits, and no operand is greater than 32 bits. Most opcodes take no operands.
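A compact layout like this puts multi-byte operands at arbitrary byte offsets, which is why the later commits in this PR switch to unaligned loads and stores. The sketch below shows the idea for a one-byte opcode followed by a 16-bit operand. The `Opcode` enum, its values, and both procedures are hypothetical; only `intrinsics.unaligned_load` and `intrinsics.unaligned_store` are real Odin intrinsics.

```odin
package main

import "base:intrinsics"
import "core:fmt"

// Hypothetical opcode values; the package defines its own.
Opcode :: enum u8 {
	Match = 0,
	Jump  = 1,
}

// Encodes a Jump as 1 opcode byte followed by a 16-bit operand.
// The operand starts at offset 1, so it is not 2-byte aligned.
encode_jump :: proc(target: u16) -> (inst: [3]u8) {
	inst[0] = u8(Opcode.Jump)
	intrinsics.unaligned_store((^u16)(&inst[1]), target)
	return
}

// Reads the 16-bit operand that follows the opcode at `pc`.
decode_operand :: proc(code: []u8, pc: int) -> u16 {
	return intrinsics.unaligned_load((^u16)(&code[pc + 1]))
}

main :: proc() {
	inst := encode_jump(0x1234)
	fmt.println(len(inst), decode_operand(inst[:], 0) == 0x1234)
}
```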

The compiler also comes with an expression optimizer that works on the AST given to it by the parser. It can turn simple conjunctions into more performant constructions.

Optimizer Feature Summary

| Name | Example |
| --- | --- |
| Class Simplification | `[aab]` => `[ab]` |
| Class Reduction | `[a]` => `a` |
| Range Construction | `[abc]` => `[a-c]` |
| Rune Merging into Range | `[aa-c]` => `[a-c]` |
| Range Merging | `[a-cc-e]` => `[a-e]` |
| Alternation to Optional | `a\|` => `a?` |
| Alternation to Optional Non-Greedy | `\|a` => `a??` |
| Alternation Reduction | `a\|a` => `a` |
| Alternation to Class | `a\|b` => `[ab]` |
| Class Union | `[a0]\|[b1]` => `[a0b1]` |
| Wildcard Reduction | `a\|.` => `.` |
| Common Suffix Elimination | `blueberry\|strawberry` => `(?:blue\|straw)berry` |
| Common Prefix Elimination | `abi\|abe` => `ab(?:i\|e)` |
| Composition: Consume All to Anchored End | `.*$` => `<special opcode>` |

The optimizer runs an indefinite number of passes until it sees no more changes in the AST.
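As an example of what one of these rewrites involves, here is a sketch of range merging (`[a-cc-e]` => `[a-e]`) done as a sort followed by a single sweep. `Rune_Range` and `merge_ranges` are hypothetical names, and the package's optimizer works on its own AST and may go about this differently. Treating single runes as one-element ranges makes the same sweep cover range construction (`[abc]` => `[a-c]`).

```odin
package main

import "core:fmt"
import "core:slice"

// Hypothetical inclusive rune range.
Rune_Range :: struct {
	lower, upper: rune,
}

// Sorts `ranges` in place by lower bound, then merges ranges that
// overlap or sit directly next to each other. The caller owns the result.
merge_ranges :: proc(ranges: []Rune_Range) -> [dynamic]Rune_Range {
	slice.sort_by(ranges, proc(a, b: Rune_Range) -> bool {
		return a.lower < b.lower
	})
	merged: [dynamic]Rune_Range
	for r in ranges {
		n := len(merged)
		if n > 0 && r.lower <= merged[n - 1].upper + 1 {
			// Overlapping or adjacent: extend the previous range.
			if r.upper > merged[n - 1].upper {
				merged[n - 1].upper = r.upper
			}
		} else {
			append(&merged, r)
		}
	}
	return merged
}

main :: proc() {
	ranges := [?]Rune_Range{{'c', 'e'}, {'a', 'c'}, {'x', 'x'}}
	merged := merge_ranges(ranges[:])
	defer delete(merged)
	for r in merged {
		fmt.printf("%c-%c\n", r.lower, r.upper)
	}
}
```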

There are also a few post-compilation optimizations done at the end, such as turning Jumps to Jumps into Jumps to their final calculated destination and resolving relative Jumps into absolute Jumps.
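The jump-to-jump rewrite can be sketched as follows, using a simplified struct-per-instruction program rather than the package's bytecode. `Inst`, `final_target`, and `thread_jumps` are hypothetical names. The hop counter is a defensive guard against a cycle of jumps, which a well-formed program should not contain.

```odin
package main

import "core:fmt"

// Hypothetical simplified instruction: either a jump or anything else.
Inst :: struct {
	is_jump: bool,
	target:  int,
}

// Follows a chain of jumps starting at `pc` until it reaches a
// non-jump instruction (or gives up after len(prog) hops).
final_target :: proc(prog: []Inst, pc: int) -> int {
	dest := pc
	for hops := 0; hops < len(prog) && prog[dest].is_jump; hops += 1 {
		dest = prog[dest].target
	}
	return dest
}

// Rewrites every jump so it lands directly on its final destination.
thread_jumps :: proc(prog: []Inst) {
	for i in 0..<len(prog) {
		if prog[i].is_jump {
			prog[i].target = final_target(prog, prog[i].target)
		}
	}
}

main :: proc() {
	prog := [?]Inst{
		{is_jump = true, target = 2},
		{},
		{is_jump = true, target = 3},
		{},
	}
	thread_jumps(prog[:])
	fmt.println(prog[0].target, prog[2].target)
}
```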

Of note, I've used a bitmap to keep track of which threads occupy which PCs. No method was explicitly described in Russ's articles of how to do this, but I found this to be the simplest way.
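A bitmap of this kind can be sketched in a few lines. The version below is an assumption about the general shape, not the package's code: `PC_Bitmap`, `bitmap_make`, `try_claim`, and `bitmap_reset` are hypothetical names, with one bit per program counter packed into `u64` words.

```odin
package main

import "core:fmt"

// Hypothetical bitmap: bit N is set when a thread already occupies PC N.
PC_Bitmap :: struct {
	words: []u64,
}

bitmap_make :: proc(program_size: int) -> PC_Bitmap {
	return PC_Bitmap{words = make([]u64, (program_size + 63) / 64)}
}

bitmap_destroy :: proc(bm: ^PC_Bitmap) {
	delete(bm.words)
}

// Returns true if `pc` was free and is now claimed, false if a thread
// was already there (the newcomer can be dropped as a duplicate).
try_claim :: proc(bm: ^PC_Bitmap, pc: int) -> bool {
	word := pc / 64
	bit := u64(1) << uint(pc % 64)
	if (bm.words[word] & bit) != 0 {
		return false
	}
	bm.words[word] |= bit
	return true
}

// Clears all bits; called once per input rune in a lock-step VM.
bitmap_reset :: proc(bm: ^PC_Bitmap) {
	for i in 0..<len(bm.words) {
		bm.words[i] = 0
	}
}

main :: proc() {
	bm := bitmap_make(100)
	defer bitmap_destroy(&bm)
	fmt.println(try_claim(&bm, 70), try_claim(&bm, 70))
}
```

One bit per PC keeps the structure at an eighth of the size of a byte-per-PC table, and resetting it between runes touches only a handful of words.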

Limitations

Due to the size of the Jump and Split operands (16 bits), the VM is limited to a program size no larger than 32,767 bytes. This should be enough for any sane pattern.

Due to the size of the Rune_Class operands, no more than 256 [abc]-type class specifiers can be stored in a Regular Expression. Note that both the negated and regular Rune_Class share which Rune_Class_Data specifiers they refer to, so [a-c] and [^a-c] are stored the same but evaluated separately. The optimizer is also able to reduce how many classes are used, to a limited extent, by collapsing fundamentally identical classes that are written differently, i.e. [0-3] versus [0123].

For arbitrary reasons, it is not possible to save more than 10 capture groups, including the implicit expression-wide capture group. Thus an expression can only have up to 9 distinct (groups).

Testing and Debugging

A complete test suite of 64 test cases is included, covering standard regular expressions, erroneous expressions, invalid expressions, and a few edge cases with the optimizer.

There is also an ODIN_DEBUG_REGEX config that enables output to STDERR of differences between the unoptimized and optimized AST and each thread's execution in the virtual machine.

Final Notes

This is a sizeable package with much going on inside. I've reviewed it myself a few times now, but that doesn't mean I haven't missed something. I intend to let this sit as a draft to gather feedback for a few days. Otherwise, it is feature-complete and fully functional.

If you happen to give the package a go, I'm on the lookout for bugs, particularly misconstructed patterns (so a pattern that shouldn't compile but does, or a pattern that matches in an unexpected way). I'd also like to hear about API ideas. I have a note to consider possibly removing groups from the Capture struct and relying only on pos instead (or alternatively turning them into multi-pointers to conserve space), but this is more of a matter of taste, to see what the community might like.

With the variety of parsers out there, there's bound to be a feature I haven't implemented. If it's simple enough, I can see to it, but for now, this is a rather basic implementation of Regular Expressions with serviceable speed.

(I just hope someone doesn't use this package a few years in the future to write an HTML parser...)

gingerBill · 2024-07-22T20:29:54Z

core/text/regex/virtual_machine/virtual_machine.odin

+				class_data := vm.class_data[operand]
+				next_rune := vm.next_rune
+
+				check: {


LOVELY use of a named block!

The `original_pattern` introduced a tenuous dependency to the expression value as a whole, and after some consideration, I decided that it would be better for the developer to manage their own pattern strings. In the event you need to print the text representation of a pattern, it's usually better that you manage the memory of it as well.

flysand7

If it's not too much, can you provide some documentation to public procedures and types before merging? I see that you did a great job documenting the packages themselves. A lot of procedures are obvious as to what they do, but other procedures would like to have some clarifications in usage.

flysand7 · 2024-07-25T03:29:13Z

Also some of the features that might be wanted:

Repeat pattern n times {n}
Match start of string/end of string (unaffected by multiline mode) \A and \Z
Any whitespace character \s or \w depending on the dialect, and other similar classes, like the ones to match any digit, any letter etc. Due to overlap with classes, it's better if these work on all unicode according to its character classification.

I'm not sure if any of these are implemented, but I'm seeing you didn't mention them on your PR. I haven't had the time to check the code yet beyond some light skimming

Feoramund · 2024-07-25T09:34:15Z

> If it's not too much, can you provide some documentation to public procedures and types before merging?

They've all been documented already.

> A lot of procedures are obvious as to what they do, but other procedures would like to have some clarifications in usage.

Which ones did you have trouble understanding?

> Repeat pattern n times {n}

Already implemented. {n,m} {n} {n,} {,m}

> Match start of string/end of string (unaffected by multiline mode) \A and \Z

Could be implemented.

> Any whitespace character \s or \w depending on the dialect, and other similar classes, like the ones to match any digit, any letter etc. Due to overlap with classes, it's better if these work on all unicode according to its character classification.

Already implemented (\w\d\s\W\D\S), but they do not have any notion of Unicode, i.e. for \s no Unicode spaces are checked, just ASCII.

Kelimion · 2024-07-25T12:48:07Z

> Already implemented (\w\d\s\W\D\S), but they do not have any notion of Unicode, i.e. for \s no Unicode spaces are checked, just ASCII.

We do have unicode.is_space which could be turned into its own bytecode that you emit if Unicode mode is enabled? The rest could be documented as doing the same as ASCII mode.

Feoramund · 2024-07-25T17:47:19Z

> We do have unicode.is_space which could be turned into its own bytecode that you emit if Unicode mode is enabled? The rest could be documented as doing the same as ASCII mode.

That's certainly doable with the individual shorthands, in isolation: that we could use special opcodes for the various Unicode classes. The real issue arises when someone wants to have a class with one of those shorthands and some additional characters. I.e. \s => Unicode_Space in .Unicode mode is simple, but [_\s] will require duplicating all the Unicode spaces into the Rune_Class_Data along with the extra characters.

Any special class-based opcodes will also require an extra step in the optimizer to check if a user has supplied a pattern that matches any of what these opcodes represent, i.e. if I wrote some class that had every character checked by unicode.is_space.

Character class shorthands are one of the more maintenance-heavy parts of this implementation, as can be seen at the three @MetaCharacter note points in the comments.

Kelimion · 2024-07-25T19:28:28Z

> > We do have unicode.is_space which could be turned into its own bytecode that you emit if Unicode mode is enabled? The rest could be documented as doing the same as ASCII mode.
>
> That's certainly doable with the individual shorthands, in isolation: that we could use special opcodes for the various Unicode classes. The real issue arises when someone wants to have a class with one of those shorthands and some additional characters. I.e. \s => Unicode_Space in .Unicode mode is simple, but [_\s] will require duplicating all the Unicode spaces into the Rune_Class_Data along with the extra characters.
>
> Any special class-based opcodes will also require an extra step in the optimizer to check if a user has supplied a pattern that matches any of what these opcodes represent, i.e. if I wrote some class that had every character checked by unicode.is_space.
>
> Character class shorthands are one of the more maintenance-heavy parts of this implementation, as can be seen at the three @MetaCharacter note points in the comments.

Good points.

Feoramund · 2024-08-04T23:31:41Z

I thought about this for a few days, and I've decided to leave the shorthand classes as they are: ASCII only. It's simpler this way. I've written my rationale in the documentation along with a workaround if someone wants their own shorthand. If there's a great need or provable utility for expanding them, we can look at that another time.

Also added some more documentation in general, along with an extra test. No major changes since I last checked in.

Ready to go.

DamienPetrilli · 2024-08-17T09:57:36Z

@Feoramund I have been using your regex package for 2 weeks now, it's very solid, thanks!

gingerBill · 2024-08-19T01:17:38Z

So this PR is amazing, but I am just contemplating whether this package should be in core or not simply because:

Should people be using regex in the first place when other better alternatives exist? If it exists in core, people will use and abuse it, and think because it is in core, it is the recommended way without looking for things like core:text/scanner or core:text/match.

Feoramund · 2024-08-19T02:07:52Z

This is a very reasonable cautionary consideration, given the usage history of regular expressions. If you end up not wanting it in Odin, I can move it to a repo of my own, but I'm glad to make any more technical changes you might like if that ends up not being the case.

However, I'll say that I started on this work because I saw it as a candidate listed in #978 and thoroughly read through the writings of Russ Cox you linked back then. I know that discussion is almost 4 years old now, but it looked like the most interesting target at the time, both for my own learning and for its utility to others.

That said, I'm not arguing for or against its inclusion into Odin.

Feoramund added 5 commits July 22, 2024 14:25

Add core:text/regex

cb0704d

Support printing Regular_Expression in fmt

730e10b

Add tests for core:text/regex

3e49ceb

Add benchmarks for core:text/regex

be38ba6

Add core:text/regex to examples/all

b8f3d0f

gingerBill reviewed Jul 22, 2024: core/text/regex/virtual_machine/virtual_machine.odin (outdated, resolved)

gingerBill reviewed Jul 22, 2024: core/text/regex/virtual_machine/virtual_machine.odin (outdated, resolved)

gingerBill reviewed Jul 22, 2024: core/fmt/fmt.odin (outdated, resolved)

gingerBill reviewed Jul 22, 2024: core/text/regex/common/common.odin (outdated, resolved)

Kelimion reviewed Jul 23, 2024: core/text/regex/parser/parser.odin (resolved)

Kelimion reviewed Jul 23, 2024: core/text/regex/parser/parser.odin (resolved)

Feoramund added 7 commits July 24, 2024 15:17

Fix handling of unclosed regex classes and repetitions

e642be8

Add test cases for unclosed classes and repetition

e8537a3

Simplified error checking while I was at it, too.

Use slice.zero instead

16b644a

Allow configuring of MAX_CAPTURE_GROUPS for n > 10

c52a8a5

Use unaligned_load for regex virtual machine

ff492e6

This should hopefully avoid any issues with loading operands greater than 8 bits on alignment-sensitive platforms.

Use unaligned_store in regex too

90f1f7f

flysand7 reviewed Jul 25, 2024

Feoramund added 5 commits August 4, 2024 13:21

Add missing features to regex package documentation

6252712

Test that a RegEx Capture pos corresponds to its groups

cd82725

Hide Regular_Expression values

d3a51e2

We don't directly support printing these. To prevent future issues being raised about the pattern being missing if someone tries to print one, hide everything.

Move Flag_To_Letter to core:text/regex/common

babdc43

Remove unused code

1ccb0b2

Feoramund added 5 commits August 4, 2024 18:56

Use regex.destroy for test captures

743480b

Add explicit test case for Capture pos

ca7e46d

Add more documentation for core:text/regex API

dde42f0

Document rationale behind RegEx shorthand classes

e17fc82

Add explicit license info to core:text/regex

1485830

Feoramund marked this pull request as ready for review August 4, 2024 23:31

Feoramund added 2 commits August 5, 2024 03:49

Review manual for loops in core:text/regex

8f5b838

Remove debug line from test

d0d4f19

gingerBill merged commit 58e811e into odin-lang:master Aug 21, 2024
6 checks passed
