Implement lint for regex::Regex compilation inside a loop #13412

Open. GnomedDev wants to merge 1 commit into master.

GnomedDev

Closes #598.

This seems like a pretty simple one. I'm not sure I sorted out all the lint plumbing correctly, because I was adding it to the existing regex pass, but it seems to work. The name is a bit jank and I'm very open to suggestions for changing it.

changelog: [regex_compile_in_loop]: Added lint for Regex compilation inside loops.

@rustbot
Collaborator

rustbot commented Sep 18, 2024

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @y21 (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

  • @rustbot author: the review is finished, PR author should check the comments and take action accordingly
  • @rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

rustbot added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Sep 18, 2024
GnomedDev force-pushed the regex-comp-in-loop branch 2 times, most recently from e44de97 to 63c6dac, on September 20, 2024 13:05
Member

y21 left a comment

I started reviewing this 2 days ago but then forgot to continue 😅 Overall looks great aside from a few things

if let Some((_, fun, arg)) = extract_regex_call(self.definitions, self.cx, expr)
    && (matches!(arg.kind, ExprKind::Lit(_)) || const_str(self.cx, arg).is_some())
{
    span_lint_and_help(
Member

As is, this needs to use span_lint_hir_and_then so that #[allow]/#[expect] attributes on the Regex::new call work correctly, since the warning is emitted on a different node than the one being checked. It would also be good to have a test case showing that allowing the lint works.

definitions: &self.definitions,
};

visitor.visit_block(block);
Member

y21 commented Sep 20, 2024

Right now this visits all expressions in loops twice (once in the visitor and once in the lint pass). We could instead keep a loop_stack: Vec<Span> in the lint pass, pushing a span in check_expr whenever a loop is entered and popping it in check_expr_post.

That would get rid of the visitor and the need to use span_lint_hir_and_then. If this ends up being much more complicated than before, it's probably not worth it and we could just leave it as is. But it also seems like it could end up simpler: all we would need is a loop_stack.last() call in check_expr to get the enclosing loop span, if there is one, and the rest of the existing Regex::new() matching could be reused.
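
For illustration, the loop stack idea could be sketched roughly like this. It is a toy model and not Clippy's actual code: Expr, RegexPass, walk, lint, and the u32 "spans" are hypothetical stand-ins for rustc's HIR, the lint pass, the lint driver, and Span. It only assumes the pre/post visiting order that check_expr and check_expr_post provide.

```rust
/// Toy expression tree standing in for rustc's HIR (hypothetical).
enum Expr {
    /// A loop, identified by a fake span, with a body.
    Loop { span: u32, body: Vec<Expr> },
    /// A call like `Regex::new("...")`, identified by a fake span.
    RegexNew { span: u32 },
    /// Any other expression that merely contains children.
    Block(Vec<Expr>),
}

/// Toy lint pass that tracks the spans of the enclosing loops.
#[derive(Default)]
struct RegexPass {
    loop_stack: Vec<u32>,
    /// (regex call span, innermost enclosing loop span) pairs that would be linted.
    findings: Vec<(u32, u32)>,
}

impl RegexPass {
    /// Runs before the children of `expr` are visited.
    fn check_expr(&mut self, expr: &Expr) {
        match expr {
            Expr::Loop { span, .. } => self.loop_stack.push(*span),
            Expr::RegexNew { span } => {
                // Only report when there is an enclosing loop.
                if let Some(&loop_span) = self.loop_stack.last() {
                    self.findings.push((*span, loop_span));
                }
            }
            Expr::Block(_) => {}
        }
    }

    /// Runs after all children of `expr` have been visited.
    fn check_expr_post(&mut self, expr: &Expr) {
        if matches!(expr, Expr::Loop { .. }) {
            self.loop_stack.pop();
        }
    }
}

/// Stand-in for the driver: check_expr, then the children, then check_expr_post.
fn walk(pass: &mut RegexPass, expr: &Expr) {
    pass.check_expr(expr);
    match expr {
        Expr::Loop { body, .. } | Expr::Block(body) => {
            for child in body {
                walk(pass, child);
            }
        }
        Expr::RegexNew { .. } => {}
    }
    pass.check_expr_post(expr);
}

/// Runs the toy pass over a tree and returns what it would report.
fn lint(tree: &Expr) -> Vec<(u32, u32)> {
    let mut pass = RegexPass::default();
    walk(&mut pass, tree);
    pass.findings
}
```

Because each expression is handled exactly once and the lint would be emitted while the Regex::new expression itself is being checked, no second traversal is needed in this model.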

Author

I implemented it this way because having state in the lint itself made me worry that Something could go wrong, such as expressions not being visited in the expected order... but the loop stack idea makes me want to give it a go.

Member

A lot of things in Clippy rely on the fact that the iteration order is, at the very least, check_node → <everything contained in node> → check_node_post. Grepping for check_.+_post shows a bunch of cases that keep similar state, like here.

/// ```
///
#[clippy::version = "1.83.0"]
pub REGEX_COMPILE_IN_LOOP,
Member

Maybe regex_creation_in_loops? Ideally lint names should be pluralized and make sense when read as a sentence together with #[allow] (https://rust-lang.github.io/rfcs/0344-conventions-galore.html#lints)

/// ```no_run
/// # let haystacks = [""];
/// # const MY_REGEX: &str = "a.b";
/// let regex = regex::Regex::new(MY_REGEX).unwrap();
Member

I could imagine a situation where the regex creation is in an unlikely (error) path in a loop, and the suggested change of simply moving the Regex::new() call outside the loop would go from almost never compiling it to always compiling it once.

In those cases one can still move it outside the loop (or even into a static), wrapped in a Lazy{Cell,Lock} so that it's only compiled when accessed. The lint message and description don't contradict this or prescribe how the call should be moved out of the loop, but I think it would still be useful to mention it somewhere, because it might not be obvious that this is an option.
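
A minimal sketch of the LazyCell option, assuming a Rust version where std::cell::LazyCell is stable (1.80 or later). To keep it dependency-free, FakeRegex is a hypothetical stand-in for regex::Regex that only counts how often it is "compiled", and count_error_matches is an invented example function.

```rust
use std::cell::{Cell, LazyCell};

/// Hypothetical stand-in for `regex::Regex`; it only records compilations.
struct FakeRegex {
    needle: &'static str,
}

impl FakeRegex {
    /// Pretends to compile a pattern and bumps the shared counter.
    fn new(needle: &'static str, compiles: &Cell<u32>) -> Self {
        compiles.set(compiles.get() + 1);
        FakeRegex { needle }
    }

    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(self.needle)
    }
}

/// Returns (number of error lines matching the pattern, number of compilations).
fn count_error_matches(lines: &[&str]) -> (usize, u32) {
    let compiles = Cell::new(0);
    // Declared outside the loop, but the closure only runs on first dereference.
    let regex = LazyCell::new(|| FakeRegex::new("timeout", &compiles));

    let mut matches = 0;
    for &line in lines {
        // Unlikely path: the regex is only touched for error lines.
        if line.starts_with("ERROR") && regex.is_match(line) {
            matches += 1;
        }
    }
    (matches, compiles.get())
}
```

With input that never hits the error path the counter stays at zero, and with any number of error lines it is compiled once.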

Member

It would also be good to link to regex's own description of the antipattern: https://docs.rs/regex/latest/regex/#avoid-re-compiling-regexes-especially-in-a-loop

@y21
Member

y21 commented Sep 20, 2024

Given regex's own advice, I wonder if we could just have a lint for any regex creation from a literal outside of a static, since putting it in a static with LazyLock would avoid ever recompiling it, even across function calls. But I can see how that's more... controversial than just this specific pattern.
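
As a rough sketch of the static pattern, assuming std::sync::LazyLock (stable since Rust 1.80): FakePattern, COMPILES, DATE_HINT, and mentions_2024 are hypothetical names, and FakePattern stands in for regex::Regex so the example has no external dependencies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the stand-in "compile" step has run.
static COMPILES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled `regex::Regex`.
struct FakePattern(&'static str);

impl FakePattern {
    /// Pretends to compile a pattern and records that it happened.
    fn compile(src: &'static str) -> Self {
        COMPILES.fetch_add(1, Ordering::SeqCst);
        FakePattern(src)
    }

    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(self.0)
    }
}

/// Compiled on first use, then shared by every later call from any function.
static DATE_HINT: LazyLock<FakePattern> = LazyLock::new(|| FakePattern::compile("2024-"));

/// Each call reuses the same pattern instead of recompiling it.
fn mentions_2024(text: &str) -> bool {
    DATE_HINT.is_match(text)
}
```

Calling mentions_2024 any number of times runs the compile step once for the whole process.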

@GnomedDev
Author

Yeah, that would be a separate (pedantic) lint and should probably have an issue opened for it, but it's definitely not this one.

Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties
Development

Successfully merging this pull request may close these issues.

Warn when compiling regexes in a loop
3 participants