Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: [RegexGenerator(...)] attribute #58880

Closed
stephentoub opened this issue Sep 9, 2021 · 21 comments · Fixed by #59186
Closed

[API Proposal]: [RegexGenerator(...)] attribute #58880

stephentoub opened this issue Sep 9, 2021 · 21 comments · Fixed by #59186
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Sep 9, 2021

Background and motivation

Attribute to trigger regex source generation:
#44676

API Proposal

namespace System.Text.RegularExpressions
{
    [AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
    public sealed class RegexGeneratorAttribute : Attribute
    {
        // Same three constructors as on Regex, except the timeout is an int (ms) instead of a TimeSpan
        public RegexGeneratorAttribute(string pattern);
        public RegexGeneratorAttribute(string pattern, RegexOptions options);
        public RegexGeneratorAttribute(string pattern, RegexOptions options, int matchTimeout);

        public string Pattern { get; }
        public RegexOptions Options { get; }
        public int MatchTimeout { get; }
    }
}

Alternate names:

  • RegexAttribute
  • GeneratedRegexAttribute
  • RegexSourceGeneratorAttribute
  • SourceGeneratedRegexAttribute
  • ...

(Note there's already a System.ComponentModel.DataAnnotations.RegularExpressionAttribute that's very different.)

API Usage

[RegexGenerator(@"\b[t]\w+", RegexOptions.IgnoreCase | RegexOptions.InvariantCulture)]
private static partial Regex GetRegexToFindWordsBeginningWithT();
...
Regex r = GetRegexToFindWordsBeginningWithT();
foreach (Match m in r.Matches(input))
{
    ...
}

Risks

n/a

@stephentoub stephentoub added area-System.Text.RegularExpressions api-ready-for-review API is ready for review, it is NOT ready for implementation labels Sep 9, 2021
@stephentoub stephentoub added this to the 7.0.0 milestone Sep 9, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 9, 2021
@ghost
Copy link

ghost commented Sep 9, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Attribute to trigger regex source generation:
#44676

API Proposal

namespace System.Text.RegularExpressions
{
    [AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
    public sealed class RegexGeneratorAttribute : Attribute
    {
        // Same three constructors as on Regex, except the timeout is an int (ms) instead of a TimeSpan
        public RegexGeneratorAttribute(string pattern);
        public RegexGeneratorAttribute(string pattern, RegexOptions options);
        public RegexGeneratorAttribute(string pattern, RegexOptions options, int matchTimeout);

        public string Pattern { get; }
        public RegexOptions Options { get; }
        public int MatchTimeout { get; }
    }
}

API Usage

[RegexGenerator(@"\b[t]\w+", RegexOptions.IgnoreCase | RegexOptions.InvariantCulture)]
private static partial Regex GetRegexToFindWordsBeginningWithT();
...
Regex r = GetRegexToFindWordsBeginningWithT();
foreach (Match m in r.Matches(input))
{
    ...
}

Risks

n/a

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, api-ready-for-review

Milestone: 7.0.0

@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Sep 9, 2021
@davidfowl
Copy link
Member

Is the instance cached or do you get a new regex every call?

@stephentoub
Copy link
Member Author

It hasn't been implemented yet, so, neither :) But my intention is that it gets cached as a singleton.

@GREsau
Copy link

GREsau commented Sep 12, 2021

If the method were to return a cached singleton or would otherwise be cheap to call, could/should this be done via a field/property rather than a method?
e.g.

[RegexGenerator(@"\b[t]\w+", RegexOptions.IgnoreCase | RegexOptions.InvariantCulture)]
private static readonly Regex RegexToFindWordsBeginningWithT;

// or

[RegexGenerator(@"\b[t]\w+", RegexOptions.IgnoreCase | RegexOptions.InvariantCulture)]
private static Regex RegexToFindWordsBeginningWithT { get; }

I'm still not very familiar with what source generators can do, but I assume they would be able to support this by outputting a static constructor to set the field/property? Although this would be a slight footgun as you could technically include this attribute for source generation, and also initialise the field yourself (which would be a silly thing to do), probably resulting in strange behaviour...

Edit: Ah, I see a class can only have a single static constructor, so my earlier idea of having a source generator output a static constructor would also preclude users from including their own static constructors, so it seems like a no-go

@stephentoub
Copy link
Member Author

could/should this be done via a field/property rather than a method?

If C# adds support for partial properties in the future, it could support that as well.

@davidfowl
Copy link
Member

This is a good use case for it. It also make me wonder if it would be better to support regex literals in the compiler (that's essentially what this does).

@stephentoub
Copy link
Member Author

It also make me wonder if it would be better to support regex literals in the compiler

Why?

@davidfowl
Copy link
Member

It would feel more ergonomic and could potentially optimize more than a source generator could (it can rewrite the expression).

@davidfowl
Copy link
Member

But one thing I'm overlooking is the regex options. I'm not sure what that would look like with literal syntax.

@stephentoub
Copy link
Member Author

it can rewrite the expression

So can the source generator. We already do optimizing / rewriting in Regex.

one thing I'm overlooking is the regex options

And timeout.

@davidfowl
Copy link
Member

So can the source generator. We already do optimizing / rewriting in Regex.

It's about it feeling more natural and ergonomic. The attribute on top of a partial method doesn't feel very nice. It's working around language limitations.

@stephentoub
Copy link
Member Author

The attribute on top of a partial method doesn't feel very nice. It's working around language limitations.

Which language limitations? If you mean lack of partial properties, the source generator can easily be augmented when that's in the language. We shouldn't work around the work around by jumping straight to adding an entirely new sublanguage to C#.

@Joe4evr
Copy link
Contributor

Joe4evr commented Sep 13, 2021

Side-note: If this is done, it'd be great to have it followed-up in Roslyn by having the RegEx highlighting feature recognize this attribute.

@davidfowl
Copy link
Member

Which language limitations? If you mean lack of partial properties, the source generator can easily be augmented when that's in the language. We shouldn't work around the work around by jumping straight to adding an entirely new sublanguage to C#.

I was thinking of this more like a generalization of string literals. We had a similar discussion with utf8 literals, now we regex we have another potential pattern for the same thing.

The other benefit is that you would be able to use it outside of field declarations so it could be used for locals as well.

Regex re = /^hello+/;

Console.WriteLine(re.IsMatch(args[0]));

@am11
Copy link
Member

am11 commented Sep 13, 2021

it'd be great to have it followed-up in Roslyn by having the RegEx highlighting feature recognize this attribute.

That should already be supported via lang=regex comment:

[RegexGenerator(/* lang=regex */ @"\b[t]\w+", RegexOptions.IgnoreCase | RegexOptions.InvariantCulture)]

string foo = /* lang=regex */ "^hey joe!$";

// lang=regex
string bar = "^hey joe!$";

CallFoo(/* lang=regex */ "^hey joe!$");

@terrajobst
Copy link
Member

terrajobst commented Sep 14, 2021

Video

  • Looks good as proposed
    • We should rename matchTimeout to matchTimeoutMilliseconds as we can't use a TimeSpan
namespace System.Text.RegularExpressions
{
    [AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
    public sealed class RegexGeneratorAttribute : Attribute
    {
        public RegexGeneratorAttribute(string pattern);
        public RegexGeneratorAttribute(string pattern, RegexOptions options);
        public RegexGeneratorAttribute(string pattern, RegexOptions options, int matchTimeoutMilliseconds);

        public string Pattern { get; }
        public RegexOptions Options { get; }
        public int MatchTimeoutMilliseconds { get; }
    }
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Sep 14, 2021
@danmoseley
Copy link
Member

@stephentoub as you pointed out offline, our current regex cache uses the actual culture name in the key (ie not just Invariant vs. Current as is in the RegexOptions). Either we have to accept that's information baked into the regex at the time it is compiled, or that's a bug we should fix. In the former case, this attribute will need to have a CultureInfo as well (the string form thereof at least)

thoughts?

@stephentoub
Copy link
Member Author

thoughts?

We need to revisit the expected behavior here around culture all-up, not just with regards to the source generator, in particular does "current culture" (when RegexOptions.CultureInvariant isn't specified) apply at construction time or at match time? Right now, for both the interpreter and RegexOptions.Compiled, it's in a weird middle world where data from the culture that current at the time of ctor is factored in and then culture that's current at the time of match is factored in, and the implementations don't respond well to those being different.

From my perspective it's a bug that we factor it in from both. If we choose to only factor it in at match time, the generator will match whatever we choose to do for RegexOptions.Compiled (which will likely have a perf impact). If we choose to only factor it in from ctor time, we may want to do something special for the source generator, since (as we discussed in the API review) it's currently going to bake in the culture current at the time of build. Options there could include just saying "yeah, you get what you get", it could include adding another ctor that takes a culture name, it could include throwing if you don't specify InvariantCulture but did specify either IgnoreCase or a pattern that employs (?i), etc.

@danmoseley
Copy link
Member

Agreed. What would be nice here is some data on what customers rely on. I'm guessing changing to "you get what you get" or "always construction time" would be breaking. We at least should be consistent if we can.

@stephentoub
Copy link
Member Author

I'm guessing changing to "you get what you get" or "always construction time" would be breaking.

Any change here is likely to break someone somewhere. We'll need to determine what we believe is a) the "right" behavior we desire in an ideal world, b) the perf impact of that decision, and c) how widespread we expect breaks to be from that change, and make a decision. We shouldn't do that here, though. I'll be opening a handful of issues related to the source generator as follow-ups to address after it's initially in.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 15, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 22, 2021
@Joe4evr
Copy link
Contributor

Joe4evr commented Oct 4, 2021

That should already be supported via lang=regex comment:

Sure, but it'd be nice if this attribute is recognized so the magic comment isn't needed.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants