
metadata resolve workstream #12921

Open
1 of 10 tasks
cosmicexplorer opened this issue Aug 17, 2024 · 9 comments
Labels
C: dependency resolution About choosing which dependencies to install type: performance Commands take too long to run type: refactor Refactoring code

Comments


cosmicexplorer commented Aug 17, 2024

What's the problem this feature will solve?

The 2020 resolver separated the resolution logic from the rest of pip, and made the resolver much easier to read and extend. With --use-feature=fast-deps, we began to investigate improved pip performance by avoiding the download of dependencies until we reach a complete resolution. With install --report, we enabled pip users to read the output of the resolver without needing to download dependencies. However, there remain a few blockers to achieving performance improvements across multiple use cases:

Uncached resolves

When pip is executed entirely from scratch (without an existing ~/.cache/pip directory), as is often the case in CI, we are unlikely to get too much faster than we are now (and notably, it's extremely unlikely a non-pip tool could go faster in this case without relying on some sort of remote resolve index). However, there are a couple improvements we can still make here:

  • Make --use-feature=fast-deps the default, to cover wheels that don't have PEP 658 metadata backfilled yet.
    • Importantly, this is often the case for internal corporate CI indices, which are unlikely to have performed the process of backfilling PEP 658 metadata, and may even be a --find-links repo instead of serving the PyPI simple repository API.
    • fast-deps is not as fast as it could be, which will be addressed further below.
  • Establish a separate metadata cache directory, so that CI runs in e.g. github actions can retain the result of metadata resolution run-over-run, without having to store large binary wheel output.
    • We will discuss metadata caching further below--this currently does not exist in pip.
    • This should achieve similar performance to partially-cached resolves, discussed next.
  • Batch downloading of wheels all at once after a shallow metadata resolve.
    • Pip already has the infrastructure to do this; it's just not implemented yet! (See the sketch after this list.)
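
To sketch the batch-downloading shape (the helpers here are hypothetical stand-ins, not pip internals):

```python
# Minimal sketch of the two-phase shape, assuming hypothetical helpers
# resolve_metadata_only() and download_wheel() (these are not pip APIs).
from concurrent.futures import ThreadPoolExecutor
from typing import List

def resolve_metadata_only(requirements: List[str]) -> List[str]:
    """Stand-in for a resolve that touches only metadata, never wheel bodies."""
    return ["https://example.org/wheels/" + r + ".whl" for r in requirements]

def download_wheel(url: str) -> str:
    """Stand-in for fetching one wheel into the local cache."""
    return "/tmp/cache/" + url.rsplit("/", 1)[-1]

def install(requirements: List[str]) -> List[str]:
    # Phase 1: a shallow resolve over metadata alone pins the full dist set.
    pinned = resolve_metadata_only(requirements)
    # Phase 2: batch-download every pinned wheel at once, rather than
    # interleaving downloads with resolution.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(download_wheel, pinned))
```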

Partially-cached resolves with downloading

When pip is executed with a persistent ~/.cache/pip directory, we can take advantage of much more caching, and this is the bulk of the work here. In e.g. #11111 and other work, we (mostly) separated metadata resolution from downloading, and this has allowed us to consider how to cache not just downloaded artifacts, but parts of the resolution process itself. This is directly enabled by the clean separation and design of the 2020 resolver. We can cache the following:

  • PEP 658 metadata for a particular wheel (this saves us a small download).
  • fast-deps metadata for a particular wheel (this saves us a few HTTP range requests).
  • Metadata extracted from an sdist (this saves us a medium-size download and a build process).

These alone may not seem like much, but over the course of an entire resolve, not having to make potentially multiple network requests per dependency, and instead staying within pip's in-memory resolve logic, adds up to a very significant performance improvement. These caches also reduce the number of requests we make against PyPI.
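
As a sketch of what the idempotent metadata cache could look like (a hypothetical layout, not pip's actual cache structure):

```python
# Sketch of a metadata cache keyed by wheel hash. Because the same wheel
# hash always maps to the same metadata, entries never need invalidation.
# The directory name here is hypothetical, not pip's real cache layout.
import os
from typing import Optional

CACHE_ROOT = os.path.expanduser("~/.cache/pip/metadata-sketch")

def cached_metadata(wheel_sha256: str) -> Optional[bytes]:
    path = os.path.join(CACHE_ROOT, wheel_sha256 + ".metadata")
    try:
        with open(path, "rb") as f:
            return f.read()  # METADATA bytes saved by a prior run: no network
    except FileNotFoundError:
        return None

def store_metadata(wheel_sha256: str, metadata: bytes) -> None:
    os.makedirs(CACHE_ROOT, exist_ok=True)
    with open(os.path.join(CACHE_ROOT, wheel_sha256 + ".metadata"), "wb") as f:
        f.write(metadata)
```

Since hashes are served alongside links in the simple repository API, a cache hit here means a dist's metadata is known before any request is made for it.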

But wait! We can go even faster! Because in addition to the metadata cache (which is idempotent and not time-varying--the same wheel hash always maps to the same metadata), we can also cache the result of querying the simple repository API for the list of dists available for a given dependency name! This additional caching requires messing around with HTTP caching headers to see if a given page has changed, but it lets us cache:

  • The raw content of a simple repository index page, if unchanged (this saves us a medium-size download).
  • The result of parsing a simple repository index page into Links, if unchanged (this saves us an HTML parser invocation).
  • The result of filtering Links by interpreter compatibility, if unchanged (this saves us having to calculate interpreter compatibility using the tags logic).
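
As an illustration of the caching-header dance (a sketch using requests directly; pip actually goes through its own network layer and vendored CacheControl):

```python
# Sketch of revalidating a simple-index page with an ETag. A 304 response
# means the cached page body, and anything derived from it (parsed Links,
# interpreter-compatibility filtering), can be reused as-is.
from typing import Tuple
import requests

def fetch_index_page(
    url: str, cached_body: bytes, cached_etag: str
) -> Tuple[bytes, str, bool]:
    resp = requests.get(url, headers={"If-None-Match": cached_etag})
    if resp.status_code == 304:
        # Unchanged: skip the download, the HTML parse, and the tags logic.
        return cached_body, cached_etag, False
    resp.raise_for_status()
    # Changed: downstream caches for this page must be recomputed.
    return resp.content, resp.headers.get("ETag", ""), True
```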

Resolves without downloading

With install --report out.json --dry-run and the metadata resolve+caching discussed above, we should be able to avoid downloading the files associated with the resolved dists, enabling users to download those dependencies in a later phase (as I achieved at Twitter with pantsbuild/pants#8793). However, we currently don't do so (see #11512), because of a mistake I made in previous implementation work (sorry!). So for this, we just need:

  • Avoid downloading dists when invoked from a command which only requests metadata (such as install --dry-run).

Describe the solution you'd like

I have created several PRs which achieve all of the above:

Batch downloading [0/2]

For batch downloading of metadata-only dists, we have two phases:

fast-deps fixes [0/1]

Formalize "concrete" vs metadata-only dists [0/3]

To avoid downloading dists for metadata-only commands, we have several phases:

Metadata caching [0/1]

Caching index pages [0/2]

To optimize the process of obtaining Links to resolve against, we have at least two phases:

Each of these PRs demonstrates some nontrivial performance improvement in its description. Altogether, the result is quite significant, and never produces a slowdown.

Alternative Solutions

  • None of these changes modify any external APIs. They do introduce new subdirectories to ~/.cache/pip, which can be tracked separately from the wheel cache to speed up resolution without ballooning in size.
  • I expect the Link parsing and interpreter compatibility caching in #12258 to involve more discussion, as those caches take up more cache space than the idempotent metadata cache, produce less of a performance improvement, and are more complex to implement. However, nothing else depends on them to work, and they can safely be discussed later, after the preceding caching work is done.

Additional context

In writing this, I realized we may be able to modify the approach of #12257 to work with --find-links repos as well. Those are expected to change much more frequently than index pages using the simple repository API, but may be worth considering after the other work is done.



pelson commented Aug 19, 2024

I personally am excited to see metadata being used (without downloading the full dist) to optimise the --dry-run case. This looks like a really comprehensive suite of improvements 👍.

When it comes to testing, I wanted to make you aware of simple-repository-server, which can give you a very functional package index (PEP-503) complete with PEP-658 support. All you need to provide are the wheels, and it can compute the metadata on the fly (perhaps the problem with this, though, would be that there are no wheels without metadata for testing...).

More generally, I gave a presentation last year at EuroPython 2023 on the simple-repository infrastructure that we have developed. Whilst not perfect yet, I am convinced that a common simple-repository interface can serve both client and server usage. Imagine, on the pip side, being able to request package metadata from a single interface; behind that interface, it can choose to compute the metadata and cache the dist if there is no metadata on the index being used. It would normalize the functionality within pip (you could assume that there is always a metadata file), and move the (http/dist) caching logic into its own layer. I don't want to derail this meta-issue, so I can open a discussion elsewhere if you are interested to explore the idea more.
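A minimal sketch of the kind of interface I mean (hypothetical names; this is not simple-repository's actual API):

```python
# Hypothetical sketch of a normalized repository interface for an
# installer. Names are illustrative, not taken from simple-repository.
from typing import Protocol, Sequence

class PackageRepository(Protocol):
    def list_files(self, project: str) -> Sequence[str]:
        """URLs of all dists available for a project name."""
        ...

    def get_metadata(self, file_url: str) -> bytes:
        """Core metadata for one dist. An implementation may serve PEP-658
        metadata directly, or fall back to downloading the dist and
        extracting METADATA itself; the client never needs to know which."""
        ...
```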

Back on the topic of metadata resolve optimisation, the big one which would speed up the no-cache resolve is parallel requests. Is this something that you have explored as a possibility? (FWIW, there is an issue for parallel downloads in #825, which mostly pre-dates PEP-658 metadata.)

@cosmicexplorer

@pelson responding to the optimization part first as that's what I have more paged in right now:

Back on the topic of metadata resolve optimisation, the big one which would speed-up the no-cache resolve is parallel requests.

Do you have a benchmark of sorts to demonstrate this? Parallel downloads of dists are actually something we have already established a foundation for; see the use of BatchDownloader in the preparer:

```python
batch_download = self._batch_download(
    links_to_fully_download.keys(),
    temp_dir,
)
for link, (filepath, _) in batch_download:
    logger.debug("Downloading link %s to %s", link, filepath)
    req = links_to_fully_download[link]
    # Record the downloaded file path so wheel reqs can extract a Distribution
    # in .get_dist().
    req.local_file_path = filepath
    # Record that the file is downloaded so we don't do it again in
    # _prepare_linked_requirement().
    self._downloaded[req.link.url] = filepath
    # If this is an sdist, we need to unpack it after downloading, but the
    # .source_dir won't be set up until we are in _prepare_linked_requirement().
    # Add the downloaded archive to the install requirement to unpack after
    # preparing the source dir.
    if not req.is_wheel:
        req.needs_unpacked_archive(Path(filepath))
```

I avoided addressing that because it would require figuring out how to represent download progress for parallel downloads, which I wasn't sure about, and I also didn't see an incredible performance improvement when I tested it a few years ago in #7819 (but I'm not sure I was doing the right thing).

I was hesitant at first, but now that I recall we already have a basis for parallel downloads, it would probably make perfect sense to investigate it. Thanks so much for mentioning it! I'll add it to this issue.

@cosmicexplorer

When it comes to testing, I wanted to make you aware of simple-repository-server

Thanks so much for linking this!! ^_^ For e.g. #12208 and #12256, a lot of the testing we're doing is meant to cover the variety of ways an index may respond to pip requests: instead of the conformant behavior it looks like simple-repository-server provides, we're trying to test a whole bunch of arbitrary response types so pip can handle them all. See the very, very wonky local server logic for HTTP range requests in #12208:

pip/tests/conftest.py

Lines 1175 to 1305 in b06d73b

```python
class ContentRangeDownloadHandler(
    http.server.SimpleHTTPRequestHandler, metaclass=abc.ABCMeta
):
    """Extend the basic ``http.server`` to support content ranges."""

    @abc.abstractproperty
    def range_handler(self) -> RangeHandler: ...

    # NB: Needs to be set on per-function subclasses.
    get_request_counts: ClassVar[Dict[str, int]] = {}
    positive_range_request_paths: ClassVar[Set[str]] = set()
    negative_range_request_paths: ClassVar[Set[str]] = set()
    head_request_paths: ClassVar[Set[str]] = set()
    ok_response_counts: ClassVar[Dict[str, int]] = {}

    @contextmanager
    def _translate_path(self) -> Iterator[Optional[Tuple[BinaryIO, str, int]]]:
        # Only test fast-deps, not PEP 658.
        if self.path.endswith(".metadata"):
            self.send_error(http.HTTPStatus.NOT_FOUND, "File not found")
            yield None
            return
        path = self.translate_path(self.path)
        if os.path.isdir(path):
            path = os.path.join(path, "index.html")
        ctype = self.guess_type(path)
        try:
            with open(path, "rb") as f:
                fs = os.fstat(f.fileno())
                full_file_length = fs[6]
                yield (f, ctype, full_file_length)
        except OSError:
            self.send_error(http.HTTPStatus.NOT_FOUND, "File not found")
            yield None
            return

    def _send_basic_headers(self, ctype: str) -> None:
        self.send_header("Content-Type", ctype)
        if self.range_handler.supports_range():
            self.send_header("Accept-Ranges", "bytes")
        # NB: callers must call self.end_headers()!

    def _send_full_file_headers(self, ctype: str, full_file_length: int) -> None:
        self.send_response(http.HTTPStatus.OK)
        self.ok_response_counts.setdefault(self.path, 0)
        self.ok_response_counts[self.path] += 1
        self._send_basic_headers(ctype)
        self.send_header("Content-Length", str(full_file_length))
        self.end_headers()

    def do_HEAD(self) -> None:
        self.head_request_paths.add(self.path)
        with self._translate_path() as x:
            if x is None:
                return
            (_, ctype, full_file_length) = x
            self._send_full_file_headers(ctype, full_file_length)

    def do_GET(self) -> None:
        self.get_request_counts.setdefault(self.path, 0)
        self.get_request_counts[self.path] += 1
        with self._translate_path() as x:
            if x is None:
                return
            (f, ctype, full_file_length) = x
            range_arg = self.headers.get("Range", None)
            if range_arg is not None:
                m = re.match(r"bytes=([0-9]+)?-([0-9]+)", range_arg)
                if m is not None:
                    if m.group(1) is None:
                        self.negative_range_request_paths.add(self.path)
                    else:
                        self.positive_range_request_paths.add(self.path)
            # If no range given, return the whole file.
            if range_arg is None or not self.range_handler.supports_range():
                self._send_full_file_headers(ctype, full_file_length)
                self.copyfile(f, self.wfile)  # type: ignore[misc]
                return
            # Otherwise, return the requested contents.
            assert m is not None
            # This is a "start-end" range.
            if m.group(1) is not None:
                start = int(m.group(1))
                end = int(m.group(2))
                assert start <= end
                was_out_of_bounds = (end + 1) > full_file_length
            else:
                # This is a "-end" range.
                if self.range_handler.sneakily_coerces_negative_range():
                    end = int(m.group(2))
                    self.send_response(http.HTTPStatus.PARTIAL_CONTENT)
                    self._send_basic_headers(ctype)
                    self.send_header("Content-Length", str(end + 1))
                    self.send_header(
                        "Content-Range", f"bytes 0-{end}/{full_file_length}"
                    )
                    self.end_headers()
                    f.seek(0)
                    self.wfile.write(f.read(end + 1))
                    return
                if not self.range_handler.supports_negative_range():
                    self.send_response(http.HTTPStatus.NOT_IMPLEMENTED)
                    self._send_basic_headers(ctype)
                    self.end_headers()
                    return
                end = full_file_length - 1
                start = end - int(m.group(2)) + 1
                was_out_of_bounds = start < 0
            if was_out_of_bounds:
                if self.range_handler.overflows_negative_range():
                    self._send_full_file_headers(ctype, full_file_length)
                    self.copyfile(f, self.wfile)  # type: ignore[misc]
                    return
                self.send_response(http.HTTPStatus.REQUESTED_RANGE_NOT_SATISFIABLE)
                self._send_basic_headers(ctype)
                self.send_header("Content-Range", f"bytes */{full_file_length}")
                self.end_headers()
                return
            sent_length = end - start + 1
            self.send_response(http.HTTPStatus.PARTIAL_CONTENT)
            self._send_basic_headers(ctype)
            self.send_header("Content-Length", str(sent_length))
            self.send_header(
                "Content-Range", f"bytes {start}-{end}/{full_file_length}"
            )
            self.end_headers()
            f.seek(start)
            self.wfile.write(f.read(sent_length))
```

However, it was absolutely appropriate for you to mention this, especially in response to e.g. this bit of the issue:

Importantly, this is often the case for internal corporate CI indices, which are unlikely to have performed the process of backfilling PEP 658 metadata, and may even be a --find-links repo instead of serving the PyPI simple repository API.

I was speaking from my knowledge of Twitter from 2017-2020, which initially served a --find-links repo from SVN (hence my change in #7729), and later generated a simple repository index, but for many reasons I suspect it wouldn't have caught up to PEP 658 metadata yet. Lazily generating metadata like simple-repository-server does sounds like absolutely the right move! While I think pip's own testing requires the ability to generate nonstandard/erroneous responses so we can be sure to handle them correctly, it would definitely be useful to advertise the use of something like simple-repository-server, so that e.g. internal corporate repos don't have to rely on workarounds like fast-deps (which is actually very impressively robust now thanks to @dholth's work in #12208, but is still a workaround which PEP 658 was created specifically to address).

@cosmicexplorer

@pelson thanks so much for mentioning parallel downloads! It ended up being much less effort than I expected: see #12923. It's still a draft because I whipped it up very quickly, but please take a look at it and ask others to review! One part I'm not sure how to implement is the parallel progress bar, especially because I believe we may not know Content-Length in advance. However, for servers which do provide Content-Length, we should be able to compute the full number of bytes we need to download, and produce a single aggregate progress bar from that.
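
For instance (a rough sketch using requests and rich directly, rather than pip's own download machinery and vendored progress bars; it assumes every server sends Content-Length):

```python
# Sketch: one aggregate progress bar across parallel downloads, sized by
# summing Content-Length up front. Assumes all servers report it.
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable

import requests
from rich.progress import Progress

def download_all(urls: Iterable[str]) -> None:
    urls = list(urls)
    # HEAD each file first so the bar's total is the full byte count.
    totals = {
        u: int(requests.head(u, allow_redirects=True).headers["Content-Length"])
        for u in urls
    }
    with Progress() as progress:  # rich's Progress can be updated from threads
        task = progress.add_task("Downloading", total=sum(totals.values()))

        def fetch(url: str) -> None:
            with requests.get(url, stream=True) as resp:
                for chunk in resp.iter_content(chunk_size=65536):
                    progress.update(task, advance=len(chunk))

        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(fetch, urls))
```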


pelson commented Aug 20, 2024

I think pip's own testing requires the ability to generate nonstandard/erroneous responses so we can be sure to handle them correctly

This was kind of my point ☝️ about using a normalized interface for package repositories within pip... there is indeed a lot of detail about making it robust (given the wide variety of indices out there), but conceptually that can be the responsibility of somebody other than pip to deal with. I'm not saying that simple-repository is ready to be that interface (and it certainly isn't as robust as pip is), but might be an interesting avenue to explore for the future (and I would certainly be looking at that kind of abstraction if I was writing an installer from scratch).

Do you have a benchmark of sorts to demonstrate this?

Sadly not. Empirically, I expect the depth of a dependency tree to be much smaller than the total number of dependencies, suggesting there is a benefit from parallelism (the same being true for dists, so I don't see why there wasn't a good improvement experienced in the past...). From simple-repository-server's perspective, a metadata request can be non-trivial in response time (e.g. if it has to actually download the full dist in order to extract the metadata), so for sure making the requests in parallel would result in an improvement for that (not that this should justify making pip more complex if it doesn't help in the PyPI case).

thanks so much for mentioning parallel downloads! It ended up being much less effort than I expected: see #12923

Wow! Cool! If I understand correctly, this isn't for fetching metadata in parallel though, right? Still, I would expect the parallel download of dists to be a major win overall! 🚀

@cosmicexplorer

Very long post working out the logic in my head. Basically I think metadata caching is a strictly superior approach to parallelism within the metadata resolve process (and I list the assumptions I'm leaning on to justify that), but I do think the 2020 resolver has a space for introducing parallelism in its get_dependencies() method, and I think that parallelism would be able to plug into the framework we've introduced for metadata caching.

Please feel free not to read, I really appreciated your post here as it was quite thought-provoking, and it gave me an opportunity to justify this approach further.

Wow! Cool! If I understand correctly, this isn't for fetching metadata in parallel though, right? Still, I would expect the parallel download of dists to be a major win overall! 🚀

That's correct--the current resolver algorithm is backtracking, but still serial. However, with the metadata caching from #12256 and #12257 especially, I would be incredibly surprised if any parallel algorithm could achieve a significant performance improvement. I believe parallel metadata fetching would also require modifying resolution semantics in a backwards-incompatible way, which is something non-pip resolvers are very glad to do but which I would like to avoid if at all possible. Using metadata caching lets us avoid making requests in the first place, which is significantly less complex and performance-intensive than modifying resolution logic to enable parallelism. See #12258 where sufficient caching lets us get down to 3.9 seconds to resolve a very large dependency tree.

I would be very interested in prior art that demonstrates a process to fetch metadata in parallel, and the 2020 resolver via resolvelib definitely has opportunities to do so. In particular, the get_dependencies() method is the part which my work is using caching to speed up, but could also be optimistically executed in parallel. However (having worked on parallel resolver systems like the pants build tool), I'm not convinced that this is worth the complexity/performance tradeoff for pip, or even that it will be able to achieve meaningfully improved performance.

A few underlying assumptions of the problem space here:

  1. a dist with the same name and version (and features, if an sdist) always maps to the same metadata,
  2. metadata is much smaller than the dist itself,
  3. resolves are generally going to return approximately the same set of dists most of the time.

(1) means we can cache the result of extracting metadata persistently and avoid pinging the server ever again for that metadata (this is also why fast-deps and PEP 658 improve performance). It also means we can transfer the cached metadata across machines safely. (2) means we can store that metadata persistently without taking up significant cache space (metadata is also highly compressible text data). (3) means we are largely going to be able to stay within the space of cached results, so if we preserve the metadata cache, we will largely avoid pinging the server for new requests (this is in contrast to the fully-uncached resolve, where you could imagine using some sort of remote server to resolve metadata--and that's largely what PEP 658 does). The particular setup of the simple repository API means we can go even further like #12257 and #12258, and cache not just the dist -> metadata mapping, but also the project name -> [list of dists] mapping.

My impression is that parallelism in the resolution process would let us overcome slow network i/o with multiple concurrent requests, the way the batched download parallelism in #12923 does. My thought process is that rather than overwhelming PyPI with even more parallel requests, we could instead focus on avoiding those requests in the first place. While it's true that these techniques are orthogonal--we could perform parallel resolution along with caching--caching does not require us to modify our resolver logic (which may e.g. introduce nondeterministic errors), and it also achieves a secondary goal: not pinging PyPI so much! See #12256 for a discussion of PyPI bandwidth costs.

Does that argument make sense? I'm also thinking about your framing here:

From simple-repository-server's perspective, a metadata request can be non-trivial in response time (e.g. if it has to actually download the full dist in order to extract the metadata), so for sure making the requests in parallel would result in an improvement for that (not that this should justify making pip more complex if it doesn't help in the PyPI case).

The caching work I'm doing here is all about avoiding those metadata requests in the first place, which I believe is safe and correct given the problem-space assumptions I listed above. While the get_dependencies() method of resolvelib could be sped up with parallelism without introducing nondeterministic errors (I'm actually warming up to that a little; a sketch of what I mean follows), I think those assumptions mean that anywhere we might get an improvement from parallelism, we could get an even better improvement from simply caching the metadata and avoiding the request entirely. And again, it also achieves the secondary goal of reducing requests against PyPI!
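
To make that concrete, here is a hypothetical sketch (stand-in names, not pip's or resolvelib's API) of prefetching candidate metadata in parallel while the backtracking resolve itself stays serial and deterministic:

```python
# Hypothetical sketch: warm a metadata cache concurrently without changing
# resolution semantics. fetch_metadata() is a stand-in for the network.
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List

def fetch_metadata(candidate: str) -> List[str]:
    """Stand-in for a (slow) fetch of one candidate's Requires-Dist."""
    return []

_pool = ThreadPoolExecutor(max_workers=8)
_futures: Dict[str, "Future[List[str]]"] = {}

def prefetch(candidates: List[str]) -> None:
    # Optimistically start fetches for candidates the resolver *might*
    # visit next; a wrong guess wastes one request but cannot change
    # which resolution is found.
    for c in candidates:
        if c not in _futures:
            _futures[c] = _pool.submit(fetch_metadata, c)

def get_dependencies(candidate: str) -> List[str]:
    # The backtracking loop stays serial: it blocks here only if the
    # prefetch for this exact candidate hasn't finished yet.
    fut = _futures.setdefault(candidate, _pool.submit(fetch_metadata, candidate))
    return fut.result()
```

Cached results from a prior run would slot into the same lookup, which is why I think this kind of parallelism could plug into the metadata caching framework rather than compete with it.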

@cosmicexplorer

@pelson regarding your other point:

This was kind of my point ☝️ about using a normalized interface for package repositories within pip... there is indeed a lot of detail about making it robust (given the wide variety of indices out there), but conceptually that can be the responsibility of somebody other than pip to deal with. I'm not saying that simple-repository is ready to be that interface (and it certainly isn't as robust as pip is), but might be an interesting avenue to explore for the future (and I would certainly be looking at that kind of abstraction if I was writing an installer from scratch).

I'm sorry! It looks like I misunderstood the purpose of simple-repository-server -- I thought it was supposed to be a way to set up a professional-quality simple repository index with PEP 658 support. It's absolutely the kind of thing I really wish had existed within Twitter--@McSinyx and I wouldn't have needed to develop fast-deps at all then! xD I get that PyPI itself has vastly different performance/scale requirements than literally any other repository, but having a standard tool like simple-repository-server should definitely make it easier to test pip against: see #12208 where @dholth and I banged our heads against the wall figuring out that PyPI's switch to Fastly caching broke the ability to use negative range requests, making the error handling logic about 3x as long. This isn't PyPI's fault, as every one of their engineers has been incredibly helpful and let us quickly debug the issue, but it's why I emphasized that pip's own testing needs to handle really bizarre stuff that I would prefer a tool like simple-repository-server not have to support.

Since pip doesn't need to vendor things it uses for testing, I can see simple-repository-server being useful to integrate into pip testing, especially something like performance benchmarks. I think a lot of the testing for metadata caching is still going to require manually hacking server responses (there is a lot of testing already added for the PRs linked in this issue; they are actually mostly "done" and just need to go through review). My impression is that simple-repository-server is less about generating all possible types of failures (which is what pip testing generally needs) and more about performing the idiomatic behavior that makes pip's job easiest.

One thing I would definitely love to collaborate on is how these caching changes might interact with simple-repository-server. In general, the caching work does not actually impose any specific requirements on the server, but here is what it makes use of:

The additional caching from #12258 is independent of the server response.

Was this response useful to you? I am looking to understand further how we can work together to move forward to more efficient and effective standardized package resolution ^_^! Please let me know if I'm missing something.

@cosmicexplorer

Also cc @alanbato, who I believe has contributed to PyPI; most of the performance gains from these caching techniques are related to pip's internal bookkeeping as opposed to its interaction with PyPI, but please do let me know if you have any comments on the effectiveness of e.g. #12256 and #12257 at reducing PyPI bandwidth costs as well.

@pelson: one way we may be able to use simple-repository-server is to verify the reduction in bandwidth usage when executing pip with these caching improvements applied. Selfishly, that would be a great way to motivate these changes (although I'll still have to do the work to make them mergeable).

@pelson I would definitely be interested in looking at developing a benchmark harness for pip too -- you can see in all of these PRs I've been using a specific resolve with tensorflow and a few others (that's what we were doing at Twitter for the Cortex ML team), but I would absolutely like to have a more rounded basis for comparison.

@cosmicexplorer

I've converted all open PRs to drafts except for the ones that are ready for review and have no dependencies. If reviewers are interested:
