Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new codemod API #213

Merged
merged 2 commits into from
Jan 17, 2024
Merged

Implement new codemod API #213

merged 2 commits into from
Jan 17, 2024

Conversation

drdavella
Copy link
Member

@drdavella drdavella commented Jan 15, 2024

Overview

Implement new codemod API

Description

  • This refactoring was driven by the following goals:
    • Simplified codemod class hierarchy
    • Cleaner metadata validation and more consistent metadata representation
    • More generality for other detectors (e.g. Sonar input) and transformers (e.g. LLMs or other AST/CST implementations)
    • Decoupling of base codemod implementation from LibCST constructs
    • Enable transformers to be shared between different codemods
    • Enable multiple transformations to be applied in sequence by the same codemod
  • Codemods now consist of the following attributes:
    • Metadata (e.g. name, review guidance, references, etc.)
    • Detector (optional) which either performs its own analysis, possibly by running a tool like Semgrep, or parses files generated by an external tool
    • A pipeline of one or more transformations
  • Some notable changes:
    • All codemods (except for one) use the same (simplified) base class. The one exception is simply for the sake of example
    • No codemods use external Semgrep YAML files

How to review this PR

  • Apologies in advance for the huge PR. This change touched a huge number of files but most of the changes are not that complicated.
  • Start by looking at the new codemod implementations. This should give you a sense of the general philosophy behind the changes here.
  • Next look at base_codemod.py and libcst_transformer.py
    • There's not a lot of new stuff here. Most of it has just been rearranged and moved to a new location
  • Next look at the new api.py file. This contains the simplified API implementation which is intended for most of our current use cases
  • Next look at tests. The updates here are pretty uniform and not too complicated

Copy link

codecov bot commented Jan 15, 2024

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (ac74626) 96.46% compared to head (ebb14fe) 96.43%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   96.46%   96.43%   -0.04%     
==========================================
  Files          91       95       +4     
  Lines        4277     4065     -212     
==========================================
- Hits         4126     3920     -206     
+ Misses        151      145       -6     
Files Coverage Δ
src/codemodder/codemods/api.py 100.00% <100.00%> (ø)
src/codemodder/codemods/base_transformer.py 100.00% <100.00%> (ø)
src/codemodder/codemods/base_visitor.py 100.00% <ø> (ø)
src/codemodder/codemods/semgrep.py 100.00% <100.00%> (ø)
...odemodder/dependency_management/setup_py_writer.py 100.00% <100.00%> (ø)
src/core_codemods/add_requests_timeouts.py 100.00% <100.00%> (ø)
src/core_codemods/api/__init__.py 100.00% <100.00%> (ø)
src/core_codemods/api/core_codemod.py 100.00% <100.00%> (ø)
src/core_codemods/combine_startswith_endswith.py 100.00% <100.00%> (ø)
src/core_codemods/django_debug_flag_on.py 100.00% <100.00%> (ø)
... and 49 more

@drdavella drdavella marked this pull request as ready for review January 15, 2024 20:24
from codemodder.codemods.utils_mixin import NameResolutionMixin


class RefactorNewApi(SimpleCodemod, NameResolutionMixin):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this clearly took some effort so I can feel the reason to commit it, but do you think we will put it to use later on? maybe add a docstring explaining this was used during this transition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a docstring. It is one-time use but it's also a nice demonstration of the capabilities of codemods. I intend to blog about it later. It also genuinely saved me some time with refactoring ~35 of the codemods.

Copy link
Contributor

@clavedeluna clavedeluna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor non-blocking questions. I'm super impressed how easy this looks on the reviewer side, awesome job Dan!


assert output_code == dedent(expected)

def run_and_assert_filepath( # pylint: disable=too-many-arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could just add this disable to the top of the file instead of for every method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I actually thought about just disabling it globally since it's not my favorite pylint rule. Any thoughts?

@@ -37,4 +36,3 @@ def test_combine(self, tmpdir, func):
)
def test_no_change(self, tmpdir, code):
self.run_and_assert(tmpdir, code, code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing num_changes=0 being passed so just want to make sure that removal of assert len(self.file_context.codemod_changes) == 0 is indeed covered in new tests refactor

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's covered. The case where expected == result explicitly checks for that condition.

Copy link
Contributor

@andrecsilva andrecsilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are good ideas that came with the new API, but I don't think the current implementation reflects them.

src/codemodder/codemods/base_codemod.py Outdated Show resolved Hide resolved
src/codemodder/codemods/libcst_transformer.py Outdated Show resolved Hide resolved
tree = source_tree
with file_context.timer.measure("transform"):
for transformer in self.transformers:
tree = transformer.transform(tree, results, file_context)
Copy link
Contributor

@andrecsilva andrecsilva Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Results will be stale after the first transform. You either generate the results (with a detector) before applying each transformer or ignore the results. I suggest making a pipeline class for each case.

If you'd still want to make a pipeline that reuses the results for each transform, then my suggestion is to create a detector that always returns the same results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fair. That does make sense if the detector belongs to the transformer pipeline.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise a really good point here, and you're absolutely right about the potential for the results to become stale in general. However, I still think that architecturally the codemod itself is the right place for detector. This makes easier to reason about codemods that might share the same transformations but use different detectors.

To be clear: the cases I had in mind for multiple transformers fell into one of two categories:

  • Cases where the transformers themselves perform detection (which is perfectly valid)
  • Cases involving LLM-based transformations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the burden of composing transformations should be with the codemod itself.

My suggestion is to change the Codemod class to accept a list of transformations, and move this logic into the apply function of the codemod. This way you could use the detector to guarantee results before each transformation. It also means the Pipeline classes of transformations wouldn't be needed.

Copy link
Member Author

@drdavella drdavella Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with that suggestion is that there is a phase mismatch between detectors and transformers. Detectors apply to a list of files whereas a transformer applies to a single file. This is necessarily the case due the fact that a tool like Semgrep or Sonar returns results that apply to a collection of files whereas LibCST transformations are applied to a single file at a time. Similarly we would expect LLM-based transformations to apply to one file at a time as well.

from codemodder.codemods.semgrep import SemgrepRuleDetector


class SimpleCodemod(LibcstResultTransformer, metaclass=ABCMeta):
Copy link
Contributor

@andrecsilva andrecsilva Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You went through the trouble of separating codemods into three components: metadata, transformer, and detector, only to muddle them all together again in this class. It's a bit telling it doesn't use the new BaseCodemod class at all (and must, instead, rely on an attribute for that).

Not sure why it the codemod class should be concerned with details like detector_pattern or on_result_found. Those should be the responsibility of both the detector and transformer, respectively.

Maybe this exists as a convenient way to define everything in the same class, however not only that defeats the purpose of separation to begin with, but it's also not really any different from the old BaseCodemod class.

I feel like any codemod class should be simple composition of those three components. So it should look like:

SandboxUrlSemgrepDetector = SemgrepDetector(rule = ...)
SandboxUrlTransformer = LibcstResultTransformer(callback = on_result_found)
SandboxUrlCodemod = BaseCodemod(SandBoxUrlSemgrepDetector, SandboxUrlTransformer, metadata = ...)
LimitReadlineCodemod = SemgrepCodemod(LimitReadlineTransformer, metadata = ..., semgrep_rule = ...)

or

class LimitReadline(SemgrepCodemod):
         metadata = ...
         transformer = LimitReadlineTransformer
         semgrep_rule = ...         

My suggestion is to first think how you want the codemods to be expressed (e.g. the two examples above) and then work backwards from there.

Copy link
Member Author

@drdavella drdavella Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this exists as a convenient way to define everything in the same class, however not only that defeats the purpose of separation to begin with, but it's also not really any different from the old BaseCodemod class.

This is exactly the point of this API and I believe that this is an unfair criticism. Making the simple cases simple to express is necessarily in tension with providing the full generality that we want from the BaseCodemod API. We did in fact work backwards from the API that we wanted, which was already reflected in most of these codemod implementations. However as we were all aware, the original implementation was both not powerful enough and also quite convoluted in terms of the underlying class hierarchy. So adding an API layer to make the simple cases simple seems both worthwhile and like a good design.

I feel like any codemod class should be simple composition of those three components

I am in agreement with you. The BaseCodemod API makes this explicit, except that the composition is expressed in terms of a class definition instead of a function call.

If you would like to propose an alternative API for the simple codemod cases, I am open to considering it. In truth I'm not very happy with the underlying implementation of SimpleCodemod since it involves some metaprogramming magic. But I think the API should be judged by its use cases which are clean and simple. I would encourage you to rewrite those examples in terms of your proposed API and see whether that leads to cleaner code and we can evaluate it from there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an addendum: I am not trying to hide the fact that this simplified API was explicitly designed to minimize the changes required from the previous simplified API and the new API. I don't see that as an unworthy goal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly the point of this API and I believe that this is an unfair criticism. Making the simple cases simple to express is necessarily in tension with providing the full generality that we want from the BaseCodemod API.

If that's the intention, then it's perfectly fine. It does have the benefits of being easy and convenient to throw everything into a class with a few methods/attributes.

Having a clear (and enforced) separation between detection/transformation/metadata does have other benefits like modularity and reusability. You can't really reuse any transformation defined within a SimpleCodemod.

I am in agreement with you. The BaseCodemod API makes this explicit, except that the composition is expressed in terms of a class definition instead of a function call.

The point is that the composition is muddled with the SimpleCodemod class. It is a Codemod and Transformation at the same time. For a more clear separation, it should "have" a transformation and not be one. If the intent is clear separation, I don't think it should have any logic pertaining the transformation itself.

If you would like to propose an alternative API for the simple codemod cases, I am open to considering it. In truth I'm not very happy with the underlying implementation of SimpleCodemod since it involves some metaprogramming magic. But I think the API should be judged by its use cases which are clean and simple. I would encourage you to rewrite those examples in terms of your proposed API and see whether that leads to cleaner code and we can evaluate it from there.

I'm happy with how it turned out. In fact, you may recall that I've suggested something close to this before. The point of the aforementioned comment is to point that the introduction of the SimpleCodemod introduces a new way to express codemods and that makes you lose some of the benefits of the new design. Again, if you were already aware of that, then that's fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't really reuse any transformation defined within a SimpleCodemod.

This is okay with me, but that's good point. I think that cases where the transformation is shared between multiple codemods will need to be refactored.

The point is that the composition is muddled with the SimpleCodemod class. It is a Codemod and Transformation at the same time. For a more clear separation, it should "have" a transformation and not be one. If the intent is clear separation, I don't think it should have any logic pertaining the transformation itself.

I understand the criticism here and I don't disagree but it is also fundamentally at odds with providing a "concise" simplified API which is another explicit design goal here.

The point of the aforementioned comment is to point that the introduction of the SimpleCodemod introduces a new way to express codemods and that makes you lose some of the benefits of the new design.

Again, you are not wrong, but I don't think it's so problematic. SimpleCodemod is just sugar to make the easy cases easy. It reduces a lot of boilerplate (and refactoring) that would otherwise be required. It just so happens that most of our existing codemods fit into the "simple" category but I expect that there will be many other codemods that will require the lower-level API.

@drdavella drdavella force-pushed the new-codemod-hierarchy branch 2 times, most recently from a3b996f to ae183c0 Compare January 16, 2024 21:01
Copy link
Contributor

@andrecsilva andrecsilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fix the pipeline issue before merging this. It will most likely need some structural changes in the code.
However, since no codemod use it for now, I'm willing to accept this. Just make sure to add a comment raising this issue.

Copy link

sonarcloud bot commented Jan 17, 2024

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

5 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@drdavella drdavella added this pull request to the merge queue Jan 17, 2024
Merged via the queue into main with commit d078cda Jan 17, 2024
14 checks passed
@drdavella drdavella deleted the new-codemod-hierarchy branch January 17, 2024 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants