Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignore: solve re.error on group name redefinition in pathspec 0.10.x #8663

Closed
wants to merge 1 commit into from

Conversation

hiroto7
Copy link

@hiroto7 hiroto7 commented Dec 6, 2022

Fixes #8217 by getting rid of regex concatenation that causes ERROR: unexpected error - redefinition of group name 'ps_d' as group 2; was group 1 at position 46 when latest pathspec (0.10.x) is installed.

Instead of concatenating regexes with |, run matches() for each regular expression and take any().

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@codecov
Copy link

codecov bot commented Dec 6, 2022

Codecov Report

Base: 93.54% // Head: 93.55% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (18eab2e) compared to base (9169d27).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8663      +/-   ##
==========================================
+ Coverage   93.54%   93.55%   +0.01%     
==========================================
  Files         457      457              
  Lines       36249    36249              
  Branches     5232     5232              
==========================================
+ Hits        33908    33914       +6     
+ Misses       1837     1834       -3     
+ Partials      504      501       -3     
Impacted Files Coverage Δ
dvc/ignore.py 90.07% <100.00%> (ø)
tests/func/experiments/test_experiments.py 99.72% <0.00%> (+0.54%) ⬆️
dvc/repo/experiments/queue/celery.py 85.20% <0.00%> (+0.79%) ⬆️
dvc/repo/experiments/executor/local.py 90.79% <0.00%> (+1.22%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@dtrifiro
Copy link
Contributor

dtrifiro commented Dec 6, 2022

Related: #8553

Copy link
Contributor

@dtrifiro dtrifiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. pathspec should also be bumped

@hiroto7
Copy link
Author

hiroto7 commented Dec 13, 2022

@dtrifiro Thank you for reviewing!

When I ran tests with the latest pathspec, I noticed that the interpretation of dir/* in .dvcignore may be slightly changed after bumping.
For example, when we write dir/* in .dvcignore, currently dir/a.txt is ignored but dir/b/c.txt is not ignored.
In contrast, with the latest pathspec, both dir/a.txt and dir/b/c.txt are ignored.

Sorry that I submitted this PR earlier, but we may need to consider that breaking change in bumping.

(cpburnz/python-pathspec#19 was closed this August.)

for ignore, pattern in self.ignore_spec[::-1]:
if matches(pattern, path, is_dir):
for ignore, patterns in self.ignore_spec[::-1]:
if any(matches(pattern, path, is_dir) for pattern in patterns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @hiroto7 I have a problem here, previously we concatenate these reg expressions mainly for the performance. And actually, in the very beginning, we use pathspaces's API to do these checks but it's really slow and the inner implementation is like this kind of nested for loop. Any better methods to solve this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 Oh, I didn't know that the regex concatenation is deliberate, and I have noticed you have previously mentioned a bottleneck in pathspec's API in #3869 (comment).

Although pathspec seems to try to check all regexes even if any pattern matches soonly, the any() function does short-circuit evaluation. So I think my PR never produces unnecessary regex matchings.

Copy link
Contributor

@karajan1001 karajan1001 Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 Oh, I didn't know that the regex concatenation is deliberate, and I have noticed you have previously mentioned a bottleneck in pathspec's API in #3869 (comment).

Although pathspec seems to try to check all regexes even if any pattern matches soonly, the any() function does short-circuit evaluation. So I think my PR never produces unnecessary regex matchings.

Yes, but your algorithm is still O(m) for m reg expressions. even if it would be faster than the pathspec's solution.
While the regex built a state machine inside of it. For example, you have both 1 and 10 patterns, in a long regex if 1 doesn't been matched, we also know 10 doesn't, while in the for loop it will still be checked.

Here is a simple test on my computer, I create 100 patterns for the dvc status benchmark.

$ cat tests/benchmarks/cli/commands/test_status_ignore.py                                                                                               [ins][18:34:34]
def test_status(bench_dvc, tmp_dir, dvc, make_dataset):
    dataset = make_dataset(files=True, dvcfile=True, cache=True)
    ignore_list =  '\n'.join([str(n) for n in range(100)])
    (tmp_dir/".dvcignore").write_text(ignore_list)

    bench_dvc("status")
    bench_dvc("status", name="noop")

    (dataset / "new").write_text("new")
    bench_dvc("status", name="changed")
    bench_dvc("status", name="changed-noop")

And the final benchmark shows that time cost after this PR increased from 280ms to 310ms

image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the benchmark result.

If that's the case, it seems that simply looping regexes one by one, as I did, is not a good idea.
Concatenating regexes is likely to be effective, unfortunately, but I have no idea about any other way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had two suggestions here,

  1. use some kind of hacky way to override the regex pattern created by pathspec. (replace <ps_d> with <ps_d1>', <ps_d2>', ... `<ps_dn>' )
  2. Another more radical choice is to contribute back to pathspec to improve its performance and then directly use the API of it.

@daavoo daavoo added awaiting response we are waiting for your reply, please respond! :) and removed awaiting response we are waiting for your reply, please respond! :) labels Jan 3, 2023
@karajan1001
Copy link
Contributor

close as #8767

@karajan1001 karajan1001 closed this Jan 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

dvc pull: re.error: redefinition of group name 'ps_d' as group 2; was group 1 at position 46
4 participants