Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrote protobuf generation scripts in Python #12527

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Avasam
Copy link
Sponsor Collaborator

@Avasam Avasam commented Aug 15, 2024

Closes #12511

This is so much faster on Windows, takes a few seconds to run (including downloads and pre-commit). Compared to the over 1m it used to take me on WSL.

I have two design questions:

  1. Which usage do we prefer:
# A)
python scripts/sync_proto/tensorflow.py

# B)
python scripts/sync_proto.py tensorflow
# Which I could make work with the following for shell autocomplete
python scripts/sync_proto.py ./stubs/tensorflow
  1. Should I add a parametrized main function (something like sync_stubs_with_proto) that takes all of a scripts' special needs as parameters (including a "post-run" Callable). Or keep the 3 scripts separate with shared helper functions.

I'm also open to name changes suggestions.

@Avasam Avasam marked this pull request as ready for review August 16, 2024 19:17
@@ -13,6 +13,8 @@ ruff==0.5.4 # must match .pre-commit-config.yaml

# Libraries used by our various scripts.
aiohttp==3.10.2
grpcio-tools
Copy link
Sponsor Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure there's a minimal version of protoc that is shipped with grpcio-tools that we should be using, but I can't recall off the top of my head and would have to search through past PRs to find what it was.

from http.client import HTTPResponse
from pathlib import Path
from typing import TYPE_CHECKING, Iterable
from urllib.request import urlopen
Copy link
Sponsor Collaborator Author

@Avasam Avasam Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm purposefully avoiding requests here, as to not add requests and types-requests in requirements-tests.txt

def extract_python_version(file_path: Path) -> str:
"""Extract the Python version from https://github.com/protocolbuffers/protobuf/blob/main/version.json"""
with open(file_path) as file:
data: dict[str, dict[str, dict[str, str]]] = json.load(file)
Copy link
Sponsor Collaborator Author

@Avasam Avasam Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically the type is dict[str, dict[str, str | dict[str, str]]] so I'm being a bit cheesy here, but I think it's fine over a Any. This is the type for the tree of key-values we traverse, everything else doesn't matter. A TypedDict felt overkill just to prevent a typo in data.values().

@Avasam Avasam changed the title [WIP] Rewrote protobuf generation scripts in Python Rewrote protobuf generation scripts in Python Aug 16, 2024
Comment on lines +16 to +18
# grpc install only fails on Windows, but let's avoid building sdist on other platforms
# https://github.com/grpc/grpc/issues/36201
grpcio-tools; python_version < "3.13"
Copy link
Sponsor Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Avasam Avasam added the project: infrastructure typeshed build, test, documentation, or distribution related label Aug 18, 2024
Copy link
Collaborator

@srittau srittau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm not really familiar with protobuf or protobuf stubs generation so my comments are limited to general issues.

In general, there are a few places that use / as path separator, so I'd expect problems on Windows, but I'm fine with that for now.

Comment on lines +25 to +29
resp: HTTPResponse = urlopen(url)
if resp.getcode() != 200:
raise RuntimeError(f"Error downloading {url}")
with open(destination, "wb") as file:
file.write(resp.read())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
resp: HTTPResponse = urlopen(url)
if resp.getcode() != 200:
raise RuntimeError(f"Error downloading {url}")
with open(destination, "wb") as file:
file.write(resp.read())
resp: HTTPResponse
with urlopen(url) as resp:
if resp.getcode() != 200:
raise RuntimeError(f"Error downloading {url}")
with open(destination, "wb") as file:
file.write(resp.read())

Comment on lines +34 to +36
data: dict[str, dict[str, dict[str, str]]] = json.load(file)
# The root key will be the protobuf source code version
return next(iter(data.values()))["languages"]["python"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see some validation of the version, considering its coming from an outside source. Something like:

Suggested change
data: dict[str, dict[str, dict[str, str]]] = json.load(file)
# The root key will be the protobuf source code version
return next(iter(data.values()))["languages"]["python"]
data = json.load(file)
# The root key will be the protobuf source code version
version = next(iter(data.values()))["languages"]["python"]
assert isinstance(version, str)
assert re.fullmatch(r"...", version) # proper re here
return version

This way we're also sure (at runtime) that version has the correct type and format.

https://github.com/protocolbuffers/protobuf/blob/main/python/dist/BUILD.bazel
"""
with open(temp_dir / EXTRACTED_PACKAGE_DIR / "python" / "dist" / "BUILD.bazel") as file:
matched_lines = filter(None, (re.search(PROTO_FILE_PATTERN, line) for line in file.readlines()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
matched_lines = filter(None, (re.search(PROTO_FILE_PATTERN, line) for line in file.readlines()))
matched_lines = filter(None, (re.search(PROTO_FILE_PATTERN, line) for line in file))

Comment on lines +55 to +75
temp_dir = Path(tempfile.mkdtemp())
# Fetch s2clientprotocol (which contains all the .proto files)
archive_path = temp_dir / ARCHIVE_FILENAME
download_file(ARCHIVE_URL, archive_path)
extract_archive(archive_path, temp_dir)

# Remove existing pyi
for old_stub in STUBS_FOLDER.rglob("*_pb2.pyi"):
old_stub.unlink()

PROTOC_VERSION = run_protoc(
proto_paths=(f"{EXTRACTED_PACKAGE_DIR}/src",),
mypy_out=STUBS_FOLDER,
proto_globs=extract_proto_file_paths(temp_dir),
cwd=temp_dir,
)

PYTHON_PROTOBUF_VERSION = extract_python_version(temp_dir / EXTRACTED_PACKAGE_DIR / "version.json")

# Cleanup after ourselves, this is a temp dir, but it can still grow fast if run multiple times
shutil.rmtree(temp_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure the temp directory is always cleaned up:

Suggested change
temp_dir = Path(tempfile.mkdtemp())
# Fetch s2clientprotocol (which contains all the .proto files)
archive_path = temp_dir / ARCHIVE_FILENAME
download_file(ARCHIVE_URL, archive_path)
extract_archive(archive_path, temp_dir)
# Remove existing pyi
for old_stub in STUBS_FOLDER.rglob("*_pb2.pyi"):
old_stub.unlink()
PROTOC_VERSION = run_protoc(
proto_paths=(f"{EXTRACTED_PACKAGE_DIR}/src",),
mypy_out=STUBS_FOLDER,
proto_globs=extract_proto_file_paths(temp_dir),
cwd=temp_dir,
)
PYTHON_PROTOBUF_VERSION = extract_python_version(temp_dir / EXTRACTED_PACKAGE_DIR / "version.json")
# Cleanup after ourselves, this is a temp dir, but it can still grow fast if run multiple times
shutil.rmtree(temp_dir)
with tempfile.TemporaryDirectory() as td:
temp_dir = Path(td)
# Fetch s2clientprotocol (which contains all the .proto files)
archive_path = temp_dir / ARCHIVE_FILENAME
download_file(ARCHIVE_URL, archive_path)
extract_archive(archive_path, temp_dir)
# Remove existing pyi
for old_stub in STUBS_FOLDER.rglob("*_pb2.pyi"):
old_stub.unlink()
PROTOC_VERSION = run_protoc(
proto_paths=(f"{EXTRACTED_PACKAGE_DIR}/src",),
mypy_out=STUBS_FOLDER,
proto_globs=extract_proto_file_paths(temp_dir),
cwd=temp_dir,
)
PYTHON_PROTOBUF_VERSION = extract_python_version(temp_dir / EXTRACTED_PACKAGE_DIR / "version.json")



def main() -> None:
temp_dir = Path(tempfile.mkdtemp())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.



def main() -> None:
temp_dir = Path(tempfile.mkdtemp())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@Avasam Avasam added the status: needs decision Needs a final decision by the typeshed maintainers label Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
project: infrastructure typeshed build, test, documentation, or distribution related status: needs decision Needs a final decision by the typeshed maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewriting proto scripts in Python
2 participants