Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental Parsing #49

Open
ilonachan opened this issue Sep 24, 2023 · 1 comment
Open

Incremental Parsing #49

ilonachan opened this issue Sep 24, 2023 · 1 comment

Comments

@ilonachan
Copy link
Contributor

ilonachan commented Sep 24, 2023

I know the original blogpost mentions this, so I'm surprised there's not also a tracking issue for this enhancement. I've been playing with this a little bit, and there are a few issues that I think might need to be solved:

  • Tree-Sitter obviously supports the whole edit thing to speed up generating the CST; this isn't hard to implement, but requires exposing some kind of handle to the user, who can then use this to perform the edits.
    In my version this is just a wrapper around the internal tree, but maybe we don't want that to be exposed to the user? Because of macro sanitation stuff this might overlap with Deduplicate field logic into runtime utility #14, because moving the logic of the generated parser function into the runtime means the internals of the EditHandle can be private to that crate.
    Either way, this would definitely be a breaking API change.
  • While speeding up the parsing itself is already great, Rust-Sitter specifically introduces the Extraction step where the CST is turned into the Rust data structure. It'd be great if this step could also be sped up by reusing previous parse results, but there are a few challenges to this:
    • How do we know what can be reused? The new tree doesn't keep track of what nodes are now different from the previous tree... the old one does though. Possibly the extraction step might want to walk along both trees at the same time, and use the old one to check if something changed in the new one. Alternatively we can also get a list of byte ranges that have changed, and maybe check each node for overlap as we encounter it (I don't even think this would worsen the time complexity if done right).
    • Where do we get the data to reuse from? Because of the whole ownership thing, once we've given the user their structure, we don't have it anymore. Unless we enforce any grammar structure to be fully Clone (so we can keep a clone in our possession), this can't really be overcome.
      The solution would be to require the user to pass ownership to their object back for reparsing. Which they could then choose to avoid by making everything cloneable, or they could just deal with the consequences.
      As a compromise, maybe there is a way to allow the user to give only the cloneable parts back? Like, a PseudoClone wrapper around non-clonable values, which (like Box and Spanned and such) would be ignored by the grammar, but could be filled with a default value for the edit pass? But this would mean we need to keep track of the branches that have these wrappers in them, as those definitely need to be visited on every extraction.
    • Can we even trust the user to give the data back? The problem is that if the user has ownership of the data structure, mutably, then they could change its contents and our reparsing step wouldn't notice.
      The most trivial, but also most annoying solution would be to never give the user full ownership: instead have the structure wrapped inside the EditHandle, and only give out immutable references.
      But perhaps even more trivial: maybe we can just ignore this. Changing the contents of a syntax tree like this doesn't seem like a very common use case, and if the user already decides to keep it up to date with edits then they probably wouldn't want to do that anyway. It's just important to make them aware of the consequences: "All nodes that are involved in an edit will have their changes overwritten, but the remaining nodes aren't automatically reset." This could be slightly unpredictable, but a risk they'd have to take.
    • Does the user even want this optimization? Maybe a full re-extraction is desired anyway! In that case we could just have the previous parse result be fully optional, even if an edit tree is provided, and only do the reuse if the user did provide something. The PseudoClone nodes could then also have the role of allowing the user to explicitly control what parts of the tree should be reused.
@ilonachan
Copy link
Contributor Author

ilonachan commented Sep 25, 2023

Update: I tried implementing the PseudoClone trait I described, and realized that unfortunately what I'm trying to do is impossible in Rust. Because for this to make sense, I'd have to do a passthrough implementation on Box, Spanned etc... but these also have a passthrough implementation for Clone. I can't write a PseudoClone impl block for these types that wouldn't overlap with the Clone=>PseudoClone default impl if the inner type is Clone. If I could write sth like impl <T: !Clone> to exclude any T that is Clone, everything would work (I think that's being discussed in this RFC). And I think this RFC could also help, but for now the PartialClone solution is probably not happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant