Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend RegexCharClass.Canonicalize range inversion optimization #61562

Merged
merged 3 commits into from
Nov 17, 2021

Conversation

stephentoub
Copy link
Member

There's a simple optimization in RegexCharClass.Canonicalize that was added in .NET 5, with the goal of taking a set that's made up of exactly two ranges and seeing whether those ranges were leaving out exactly one character. If they were, the set can instead be rewritten as that character negated, which is a normalized form used downstream and optimized. We can extend this normalization ever so slightly to be for two ranges separated not just be a single character but by more than that as well. That in turn lights up existing optimizations in both RegexCompiler and the source generator that are looking for a single range, e.g. for the pattern @"\P{IsGreek}", which is trying to match every character other than one that's Greek, FindFirstChar would previously have been emitted as:

for (int i = 0; i < span.Length; i++)
{
    if (((ch = span[i]) < 128 ? ("\uffff\uffff\uffff\uffff\uffff\uffff\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0003\0\0ͰЀ")))
    {
        base.runtextpos = runtextpos + i;
        return true;
    }
}

and is now emitted as:

for (int i = 0; i < span.Length; i++)
{
    if ((((uint)span[i]) - 'Ͱ' > (uint)('Ͽ' - 'Ͱ')))
    {
        base.runtextpos = runtextpos + i;
        return true;
    }
}

cc: @joperezr

ps Separately the above example highlights how after enumerating all of ASCII to compute the bit map lookup for a set being emitted, we could use the resulting knowledge to e.g. not emit the lookup at all if we find that every ASCII character matches.

@ghost
Copy link

ghost commented Nov 14, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

There's a simple optimization in RegexCharClass.Canonicalize that was added in .NET 5, with the goal of taking a set that's made up of exactly two ranges and seeing whether those ranges were leaving out exactly one character. If they were, the set can instead be rewritten as that character negated, which is a normalized form used downstream and optimized. We can extend this normalization ever so slightly to be for two ranges separated not just be a single character but by more than that as well. That in turn lights up existing optimizations in both RegexCompiler and the source generator that are looking for a single range, e.g. for the pattern @"\P{IsGreek}", which is trying to match every character other than one that's Greek, FindFirstChar would previously have been emitted as:

for (int i = 0; i < span.Length; i++)
{
    if (((ch = span[i]) < 128 ? ("\uffff\uffff\uffff\uffff\uffff\uffff\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0003\0\0ͰЀ")))
    {
        base.runtextpos = runtextpos + i;
        return true;
    }
}

and is now emitted as:

for (int i = 0; i < span.Length; i++)
{
    if ((((uint)span[i]) - 'Ͱ' > (uint)('Ͽ' - 'Ͱ')))
    {
        base.runtextpos = runtextpos + i;
        return true;
    }
}

cc: @joperezr

ps Separately the above example highlights how after enumerating all of ASCII to compute the bit map lookup for a set being emitted, we could use the resulting knowledge to e.g. not emit the lookup at all if we find that every ASCII character matches.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Changes look good to me. I'm not yet familiar with the source generator tests, but would it be possible to add some regression coverage here by ensuring that the Regex expression [\0-AC-\uFFFF] is translated into [^B] and specially to ensure that [\0-AE-\uFFFF] will now be [^B-D]?

There's a simple optimization in RegexCharClass.Canonicalize that was added in .NET 5, with the goal of taking a set that's made up of exactly two ranges and seeing whether those ranges were leaving out exactly one character.  If they were, the set can instead be rewritten as that character negated, which is a normalized form used downstream and optimized.  We can extend this normalization ever so slightly to be for two ranges separated not just be a single character but by more than that as well.
@stephentoub
Copy link
Member Author

would it be possible to add some regression coverage

Done, thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants