
Future deserialization without available client #7580

Merged
merged 11 commits into dask:main from fjetter:future_serialization
Mar 21, 2023

Conversation

fjetter
Member

@fjetter fjetter commented Feb 23, 2023

This is similar to

This grew a bit more complicated because I stumbled over #7498 again, took a deeper look, and tried to preserve the "pass futures in collections" feature, but that feature is fundamentally flawed. While I could bypass most of the accidental client creations in this PR (which is good), the fundamental flaw remains: a future can be released before it is deserialized, and that is hard to avoid without a more fundamental approach.

Two notable changes:

  • The get_worker function now actually does what it claims: it returns the worker of the currently running task. Most notably, it will raise if executed outside of the thread pool. This is inconvenient to a certain degree, since deserialization typically does not happen inside the task thread, but I consider it important given that our entire test suite runs asynchronously (see the sketch after this list).
  • Futures etc. now initialize a client lazily only when a client (or worker) isn't immediately available. This sidesteps questions of when, where, and how we deserialize a task: e.g. we can deserialize a Future or Queue object on the event loop, but since we only interact with it in the thread pool, the initialization is deferred. To a certain degree this is also nicer for diagnostics, but that's not the motivating change.
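
A minimal sketch of the new get_worker contract (illustrative only; assumes a local cluster):

```python
from distributed import Client, get_worker

def where_am_i():
    # Inside a task, get_worker() returns the Worker executing that task.
    return get_worker().address

if __name__ == "__main__":
    client = Client()  # spins up a local cluster, for illustration
    print(client.submit(where_am_i).result())  # e.g. 'tcp://127.0.0.1:34567'

    # Outside of a task there is no "worker of the currently running task",
    # so get_worker() raises instead of grabbing an arbitrary in-process
    # Worker instance as it used to.
    try:
        get_worker()
    except ValueError:
        print("get_worker() raises outside of a task")
```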

This is again a supporting PR for #7564

thread_state.execution_state = execution_state
thread_state.key = key
thread_state.actor = True

result = function(*args, **kwargs)
Member Author

State was actually leaking all over the place, making get_worker in our test suite extremely unreliable.

Comment on lines -2722 to -2710
try:
    return first(w for w in Worker._instances if w.status in WORKER_ANY_RUNNING)
except StopIteration:
    raise ValueError("No workers found")
Member Author

I get the appeal of this, but it can throw off some of our test logic, and tests then suggest that certain functionality works when in reality it doesn't.

Member Author

i.e. due to our async tests, we are likely to get a worker just because an instance already exists in the same process. In reality, however, there is no valid worker in the context; if the test example were executed on a "real" cluster, it would fail.

Comment on lines +59 to +61
@gen_cluster(client=True)
async def test_timeout_wake_waiter(c, s, a, b):
Member Author

@fjetter fjetter Feb 23, 2023

The change to get_worker causes these things to break. On main, this test should not actually be allowed to pass, since there is no worker and no client in the context of the multi lock. However, due to the "pick the first worker in the process" logic, it would pick a random worker and implicitly create a worker client. The worker doesn't even have to be alive for this; as long as it hasn't been GCed yet, it would work. A sketch of the failure mode follows.
In most cases this is not a big deal, but implicit, surprising client creations can cause things like #7498
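
An illustrative sketch of that failure mode (hypothetical usage; the lock names are made up):

```python
from distributed import MultiLock

def critical_section():
    # MultiLock needs a client to talk to the scheduler. On main, calling
    # this outside of any task silently latches onto an arbitrary Worker
    # instance still alive in the process and creates a worker client from
    # it; with this PR it raises because neither a worker nor a client is
    # in context.
    with MultiLock(names=["resource-a", "resource-b"]):
        ...
```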

@github-actions
Contributor

github-actions bot commented Feb 23, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files ±0    26 suites ±0    11h 44m 40s ⏱️ −41m 0s

 3 534 tests  +2     3 426 ✔️ passed  −1      104 💤 skipped  +1    4 ❌ failed +2
44 682 runs  +26    42 589 ✔️ passed +10    2 089 💤 skipped +14    4 ❌ failed +2

For more details on these failures, see this check.

Results for commit f7786ee. ± Comparison against base commit 89d5ad4.

This pull request removes 1 and adds 3 tests. Note that renamed tests count towards both.

Removed:
distributed.tests.test_worker ‑ test_get_worker_name

Added:
distributed.protocol.tests.test_protocol ‑ test_deeply_nested_structures
distributed.tests.test_client ‑ test_serialize_future_without_client
distributed.tests.test_utils_test ‑ test_ensure_no_new_clients

This pull request skips 2 and un-skips 1 tests.

Skipped:
distributed.tests.test_worker ‑ test_get_client_coroutine_sync
distributed.tests.test_worker_client ‑ test_submit_different_names

Un-skipped:
distributed.shuffle.tests.test_shuffle ‑ test_minimal_version

♻️ This comment has been updated with latest results.

@fjetter
Member Author

fjetter commented Feb 24, 2023

Interesting. I seem to hit a related recursion error in test_profile_server... 🤯

@fjetter
Member Author

fjetter commented Mar 8, 2023

OK, so I'm not entirely sure whether the profiling thing is indeed related, but apparently with this change we're more likely to generate a very deeply nested profile message. The profile collection itself works out fine, but as soon as one tries to submit this profile message, we hit a recursion error during serialization.

We can't truly protect ourselves from this during serialization, since the serialization code doesn't know what is OK to cut out and what isn't. The only way I see to deal with this is to be more conservative with the cutoff in profiling. Local testing found ~250 stack levels to be the magical limit for getting test_profile_server to pass.
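
The failure mode can be reproduced with plain pickle (a standalone illustration; distributed's actual serializers differ in detail):

```python
import pickle

# Build a deeply nested structure, similar in shape to a profile message
# with thousands of stacked frames (illustrative only).
msg = {"identifier": "root", "children": {}}
node = msg
for depth in range(50_000):
    child = {"identifier": f"frame-{depth}", "children": {}}
    node["children"]["only"] = child
    node = child

try:
    pickle.dumps(msg)  # the serializer recurses through the nesting
except RecursionError:
    print("recursion limit hit while serializing the nested message")
```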

Comment on lines -159 to -160
if depth <= 0:
    return None
Member Author

this check caused us to not collect any information in these cases. I think it's still valuable to get a snapshot through, even if it doesn't reach the lowest frame. Moving the depth check further down achieves this (see test_profile.py).
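
Schematically, the idea is to record the current frame first and only then stop descending. A simplified sketch (my paraphrase, not the exact distributed.profile code; the frame identifier and state layout are made up):

```python
import sys

def record(frame, state, depth):
    # Before: `if depth <= 0: return` sat here, so deep stacks produced
    # no snapshot at all.
    ident = f"{frame.f_code.co_name};{frame.f_code.co_filename}"
    child = state.setdefault(ident, {"count": 0, "children": {}})
    child["count"] += 1
    # After: the depth check runs only once this frame has been recorded,
    # so a truncated snapshot survives instead of nothing.
    if depth <= 0:
        return
    if frame.f_back is not None:
        record(frame.f_back, child["children"], depth - 1)

state = {}
record(sys._getframe(), state, depth=250)
```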

Member

@hendrikmakait hendrikmakait left a comment

The code generally looks good to me. I have a few minor nits, and a few edge cases are untested. I suggest marking them with pragma: nocover if we don't want to test them.

    return first(w for w in Worker._instances if w.status in WORKER_ANY_RUNNING)
except StopIteration:
-    raise ValueError("No workers found")
+    raise ValueError("No workers found") from None
Member

nit:

Suggested change
-    raise ValueError("No workers found") from None
+    raise ValueError("No worker found") from None

Comment on lines 1095 to 1099
with pytest.raises(AssertionError):
    with ensure_no_new_clients():
        async with Client(s.address, asynchronous=True):
            with ensure_no_new_clients():
                pass
Member

If I understand the intention of this correctly, I'd split it up into two distinct checks:

Suggested change
-    with pytest.raises(AssertionError):
-        with ensure_no_new_clients():
-            async with Client(s.address, asynchronous=True):
-                with ensure_no_new_clients():
-                    pass
+    with pytest.raises(AssertionError):
+        with ensure_no_new_clients():
+            async with Client(s.address, asynchronous=True):
+                pass
+    async with Client(s.address, asynchronous=True):
+        with ensure_no_new_clients():
+            pass
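
For context, a helper like ensure_no_new_clients can be sketched roughly as follows (an assumption about its mechanics, not the PR's exact implementation):

```python
from contextlib import contextmanager

from distributed import Client

@contextmanager
def ensure_no_new_clients_sketch():
    # Snapshot the live Client instances (distributed tracks them in a
    # WeakSet), run the block, then assert that nothing new appeared.
    before = set(Client._instances)
    yield
    assert set(Client._instances) <= before, "a new Client was created"
```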

Comment on lines 77 to 80
# @pytest.mark.skipif(
#     pa is not None,
#     reason="We don't have a CI job that is installing a very old pyarrow version",
# )
Member

Suggested change
-    # @pytest.mark.skipif(
-    #     pa is not None,
-    #     reason="We don't have a CI job that is installing a very old pyarrow version",
-    # )

Member

@jrbourbeau jrbourbeau Mar 21, 2023

It's not clear to me why this change was made to this test. My impression is that what's currently on main is still what we want. I've reverted back to what's on main (and confirmed the test passes locally when pyarrow isn't installed). Happy to submit a follow-up PR if needed.

@@ -86,8 +86,9 @@ async def test_minimal_version(c, s, a, b):
         dtypes={"x": float, "y": float},
         freq="10 s",
     )
-    with pytest.raises(RuntimeError, match="requires pyarrow"):
         await c.compute(dd.shuffle.shuffle(df, "x", shuffle="p2p"))
+    # with pytest.raises(RuntimeError, match="requires pyarrow"):
Member

Suggested change
-    # with pytest.raises(RuntimeError, match="requires pyarrow"):

@jrbourbeau
Member

Noting that the test failures are unrelated to the changes in this PR (xref #7688).

@jrbourbeau jrbourbeau merged commit 1b34a5b into dask:main Mar 21, 2023
@fjetter fjetter deleted the future_serialization branch March 21, 2023 10:44
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Mar 22, 2023
In dask/distributed#7580 `get_worker` was modified to return the worker of a task, thus it cannot be used by `client.run`, and we must now use `dask_worker` as the first argument to `client.run` to obtain the worker.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1141
rapids-bot bot pushed a commit to rapidsai/raft that referenced this pull request Mar 22, 2023
In dask/distributed#7580 get_worker was modified to return the worker of a task, thus it cannot be used by client.run, and we must now use dask_worker as the first argument to client.run to obtain the worker.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #1365
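
The replacement pattern these commit messages describe looks like this (a minimal sketch):

```python
from distributed import Client

client = Client()

def report(dask_worker):
    # client.run() detects the `dask_worker` keyword in the signature and
    # injects the Worker instance the function runs on.
    return dask_worker.address

print(client.run(report))  # {worker_address: worker_address, ...}
```
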
@martindurant
Member

Since this PR, the following code fails surprisingly:

>>> import distributed
>>> client = distributed.Client()
>>> client.run(distributed.worker.get_worker)
ValueError: No worker found

This got noticed because it was causing fsspec's dask implementation to hang in tests.

@martindurant
Member

I did this to work around it, since my filesystem only needs to know whether this is a process housing a worker, not whether it happens to be running in a task.
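
The workaround itself isn't quoted here; a minimal sketch of answering "does this process house a worker" (my assumption of the approach, not necessarily fsspec's exact code):

```python
from distributed.worker import Worker

def in_worker_process() -> bool:
    # True if any Worker instance lives in this process, regardless of
    # whether the calling code is currently running inside a task.
    return len(Worker._instances) > 0
```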

@jrbourbeau
Member

Thanks @martindurant, this looks related to #7696. There's a good conversation in that issue, but #7696 (comment) is the specific comment that relates to the snippet you posted.

@martindurant
Member

Thanks for the pointer. From fsspec's point of view, the question is "am I on a worker machine", independent of serialisation or task state, so I'll use the simple approach.
