-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRAFT] Explore skipping graph checks #10425
base: main
Are you sure you want to change the base?
Conversation
@dberenbaum You mentioned I would want to skip repo.find_outs_by_path in dvc/repo/add.py under def get_or_create_stage(
repo: "Repo",
target: str,
out: Optional[str] = None,
to_remote: bool = False,
force: bool = False,
) -> StageInfo:
if out:
target = resolve_output(target, out, force=force)
path, wdir, out = resolve_paths(repo, target, always_local=to_remote and not out)
try:
(out_obj,) = repo.find_outs_by_path(target, strict=False)
stage = out_obj.stage
if not stage.is_data_source:
raise DvcException(...omit...)
return StageInfo(stage, output_exists=True)
except OutputNotFoundError:
stage = repo.stage.create(...omit...)
return StageInfo(stage, output_exists=False) I don't have a strong understanding of what a "Stage" is, so its unclear what the correct mechanism for skipping that line is, as it defines the "stage" object used in the subsequent lines. I could just raise a I found some docs describing stages. |
dvc/repo/add.py
Outdated
try: | ||
# How best to disable this line? With Skip Graph Checks Flag? | ||
repo._skip_graph_checks = True | ||
if getattr(repo, "_skip_graph_checks", False): | ||
print("todo: skip graph checks") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong understanding of what a "Stage" is.
Internally, everything is a Stage
in dvc.
What you want to do here is similar to following:
if not skip_graph_checks:
try:
... # existing inner block inside try
except OutputNotFoundError:
continue
... # existing block from except
We might want to add a warning in case of skip_graph_checks=True
saying that:
partial or virtual add does not work when --skip-graph-checks are enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to pass skip_graph_checks: bool
to this function so to depend less on global state as possible (with default set to False
unless --skip-graph-checks
is used).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree minimizing global state is a good idea. There seem to be existing tests for it, so I'm going to start with a global state implementation and then "fix it" once I get it working.
Is the "graph" referring to a graph of stages where edges represent dependencies between stages? Is the essence of what I want an optimization for the case where the graph is a disconnected set of nodes? EDIT: I see the docs in dvc.repo.graph.build_graph, so I think I'm correct here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still unsure what the right thing to do in get_or_create_stage
when this optimization is enabled. From what I understand we are attempting to lookup a "Stage" in repo.find_outs_by_path
and that corresponds to a ".dvc" file? Can I use .dvc file and Stage interchangeably? I've been referring to them as "Sidecars", but I would like to understand the architecture of the repo better.
In find_outs_by_path
, it calls self.index.outs_graph
, which is one of the big time sinks, and that's what I'd like to avoid if possible. Is there an alternative way I can create the dvc.output.Output
object when the "path" refers to a directory or file that corresponds to an existing or soon-to-be-created ".dvc" file?
As it stands my current implementation falls back to the case where it assumes the file is always a new .dvc file. But this optimization should apply in the case where the entire directory or file is tracked by the .dvc file. Guidance on what to do here would be helpful.
I'm also thinking that if there is any way to only crawl part of the graph without creating the entire thing, then perhaps every use-case can be made more efficient without needing to add a special case? (Although in the short-term I do just want to add the special case and move on).
Does it make sense to add docstrings to some of these functions? It would be nice there were docs explaining a bit about each argument and in which cases it might be specified, and perhaps points to a reference if it uses terminology the user is expected to be familiar with?
Something else that would be incredibly useful to have are doctests that would give me a quick entrypoint into any function. However, for things like DVC, that would require some concise helper function to setup a demo repo so the doctests don't get ridiculous. I happen to have some code that does this. I've been working on a DVC helper module called simple_dvc that improves my quality of life. An example that illustrates can be seen in the "request" function. Is there any interest in porting any of that work here?
I've pushed up a new commit that adds some of these docstrings, with questions where I don't know the answer. If someone wants to take a look and comment on what I've written / had questions about, perhaps we can consolidate that knowledge into the docstrings and make this codebase more approachable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it stands my current implementation falls back to the case where it assumes the file is always a new .dvc file. But this optimization should apply in the case where the entire directory or file is tracked by the .dvc file. Guidance on what to do here would be helpful.
I think the approach suggested by @skshetry above seems reasonable. Yes, it will fall back to creatin a new .dvc file. I think this is why we will have to add warnings. I don't see how we can do better without traversing the graph.
dvc/repo/__init__.py
Outdated
@@ -284,6 +284,7 @@ def index(self) -> "Index": | |||
def check_graph( | |||
self, stages: Iterable["Stage"], callback: Optional[Callable] = None | |||
) -> None: | |||
self._skip_graph_checks = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done inside Repo.add()
API (and only when skip_graph_checks=True
is passed to Repo.add()
).
We also need to think how to contain the modification within Repo.add()
, i.e. rollback the value before exiting Repo.add()
.
for more information, see https://pre-commit.ci
@Erotemic Do you plan to get back to this? |
@dberenbaum Unfortunately, I've lost funding on the project that was letting me contribute to open source projects, so my ability to contribute is currently limited and anything I do will basically be on my free time. When last I left off, I still had questions regarding the DVC architecture and I added documentation to some arguments indicating what I thought they did or outright asking what they did if I was confused. Reviews of these docs would help me move forward. |
target : an expression that resolves to a ... | ||
out : if specified, what does this to? | ||
to_remote : if True, what does this to? | ||
force : what does this to? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the cli options in https://dvc.org/doc/command-reference/add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I add explicit documentation here? Alternatively, can I include docs that point to references on where these concepts are enumerated explicitly? Or would it be better to have code with no docs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion but maybe @skshetry can weigh in
@@ -127,6 +128,16 @@ def add_parser(subparsers, parent_parser): | |||
help="Don't recreate links from cache to workspace.", | |||
) | |||
parser.set_defaults(relink=True) | |||
parser.add_argument( | |||
# Do we want a short code here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine if an option like this is long since we want people to be very explicit in choosing this option.
A PR where I'm going to experiment with solutions to the issue discussed in #10415
The basic idea is that running
dvc add
(and other commands) walk the entire non-dvc indexed repo to look for DVC files. This is to handle granular modifications, but handling that use-case may not always be necessary, and it would be nice to disable these checks, which would allow the runtime ofdvc add
and other similar operations to have speed independent of other files indexed in the repo.Not much in this PR yet, but I'm pushing it up as a DRAFT while I work on it.