Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_asyncio: test_create_server_ssl_over_ssl() failed on ARM64 macOS 3.x #95573

Closed
vstinner opened this issue Aug 2, 2022 · 15 comments
Closed
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

vstinner commented Aug 2, 2022

test_create_server_ssl_over_ssl() of test_asyncio failed on ARM64 macOS 3.x: https://buildbot.python.org/all/#/builders/725/builds/2392

ERROR: test_create_server_ssl_over_ssl (test.test_asyncio.test_ssl.TestSSL.test_create_server_ssl_over_ssl)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_asyncio/test_ssl.py", line 1160, in test_create_server_ssl_over_ssl
    self.loop.run_until_complete(start_server())
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/asyncio/base_events.py", line 650, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_asyncio/test_ssl.py", line 1153, in start_server
    await asyncio.wait_for(asyncio.gather(*tasks), TIMEOUT)
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/asyncio/tasks.py", line 488, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_asyncio/test_ssl.py", line 1131, in test_client
    await fut
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_asyncio/test_ssl.py", line 1109, in prog
    do(sslobj.write, B_DATA)
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_asyncio/test_ssl.py", line 1100, in do
    sock.send(outgoing.read())
  File "/Users/buildbot/buildarea/3.x.pablogsal-macos-m1.macos-with-brew/build/Lib/ssl.py", line 1148, in send
    return self._sslobj.write(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2390)

The same test failed when test_asyncio was re-run.

@vstinner vstinner added type-bug An unexpected behavior, bug, or error topic-asyncio labels Aug 2, 2022
@tiran
Copy link
Member

tiran commented Aug 3, 2022

#95613 adds a debug callback for BIO (both mem and socket type). It prints to stderr. Could somebody use it to capture a failing test run?

@ambv
Copy link
Contributor

ambv commented Aug 3, 2022

Findings so far:

  • I'm able to reproduce this on my M1 Max laptop running macOS Monterey 12.5 on an installed framework build (i.e. ./configure --enable-framework=)
  • The test failure is intermittent, related to some resource congestion, easiest way to repro is to run ./python.exe -m test test_asyncio -m test_create_server_ssl_over_ssl -F -j8
  • The test failure happens with both --pydebug and release builds
  • The test failure happens both on OpenSSL 3.0.0.1.0 as well as on 1.1.1.14.15 (as reported by ssl._OPENSSL_API_VERSION)
  • I ran with @tiran's gh-95573: Add debug callback for SSL BIO #95613, debug output from a failure attached on the PR (beware, it's 1.6 MB of text)
  • The test is failing for as long as it exists, i.e., since 13c10bf which is definitely related: bpo-44011: New asyncio ssl implementation #31275

Now the open question is: did the new test surface a pre-existing flakiness or was the flakiness introduced by the SSL rewrite in GH-31275?

@ambv
Copy link
Contributor

ambv commented Aug 3, 2022

Crash log from 13c10bf:

Thu Aug  4 01:05:08 CEST 2022
Python 3.11.0a5+
0:00:00 load avg: 7.99 Run tests in parallel using 8 child processes
0:00:01 load avg: 10.71 [  1/1] test_asyncio failed (1 error)
test test_asyncio failed -- Traceback (most recent call last):
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/test/test_asyncio/test_ssl.py", line 155, in _silence_eof_received_warning
    yield
    ^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/test/test_asyncio/test_ssl.py", line 1157, in test_create_server_ssl_over_ssl
    self.loop.run_until_complete(start_server())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/test/test_asyncio/test_ssl.py", line 1150, in start_server
    await asyncio.wait_for(asyncio.gather(*tasks), TIMEOUT)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/test/test_asyncio/test_ssl.py", line 1128, in test_client
    await fut
    ^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/test/test_asyncio/test_ssl.py", line 1102, in prog
    do(sslobj.write, A_DATA)
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/test/test_asyncio/test_ssl.py", line 1097, in do
    sock.send(outgoing.read())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/RAMDisk/installed-13c10bfb77/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1209, in send
    return self._sslobj.write(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2386)

== Tests result: FAILURE ==

10 slowest tests:
- test_asyncio: 958 ms

1 test failed:
    test_asyncio

@ned-deily
Copy link
Member

Progress! Using a variation of the "easiest way" from @ambv, I am now able to reliably produce the test failure on (some) Intel Macs, Intel Macs on a VM as well as multiple Apple Silicon (M1) Macs using either OpenSSL 1.1 or 3, with or without pydebug, with or without Mac framework build, top of 3.11 source build or with python.org 3.11.0b5 macOS installer, and on multiple macOS versions (in particular, 11.6.8, 12.5, and 13.0 developer beta).

/path/to/python -m test test_asyncio -m test_create_server_ssl_over_ssl -F -j0

The key is setting the -j value. In many case, -j0, which computes a default based on the machine configuration is sufficient, but not always. For whatever reason, the failure appears to be much more difficult to reproduce on the Intel Macs I have access to. Rather than using -j0, in some cases, I've had to use a larger value like -j24 and even with that it could take 25 or more iterations before a failure on one Intel Mac, whereas on an M1 MacBookPro, the test would fail fairly reliably in one iteration with -j0 -> 12 processes.

I'm going to step away from this now but ping me if you need help reproducing. I'll also ping @ronaldoussoren here in case this behavior might ring a bell.

@erlend-aasland
Copy link
Contributor

I haven't had the time to read though all the conversations about this on Discord, but this reminds me of #26893. I'll see if I can put aside some time tonight for debugging this.

@ambv
Copy link
Contributor

ambv commented Aug 4, 2022

We essentially need to decide today if we're reverting #31275.

@pablogsal
Copy link
Member

We have decided to revert to unblock the release of 3.11.0rc1

@vstinner
Copy link
Member Author

vstinner commented Aug 4, 2022

In May 2021, Andrew Svetlov copied the asyncio SSL implementation from uvloop to Python (asyncio). Sadly, Pablo had to revert it (GH-88177) in Python 3.10, due to test_asyncio failures on buildbots.

In Feburary 2022, Kumar Aditya basically copied it again to Python upstream. Sadly, the first release candidate of Python 3.11 is scheduled for tomorrow and there is no asyncio expert available to debug this tricky issue. Moreover, the issue only occurs on macOS which makes is harder to debug (for example, I don't have access to any macOS machine).

By the way, the commit message lacks author information from source repository, and the motivation behind the change is missing. Most of the code was written by Yury and Fantix.

We have decided to revert to unblock the release of 3.11.0rc1

The discussion occurred on Discord #python-dev. For example, Pablo wrote there:

We also have the possibility to revert that change. If the problem is in 3.10 as well that's a pity, but is a safer option given that the RC is tomorrow.


The failing asyncio test_create_server_ssl_over_ssl() test is testing "TLS over TLS".

@fantix
Copy link
Contributor

fantix commented Aug 4, 2022

I found the root reason by temporarily disabling this suppression:

with self._silence_eof_received_warning():
self.loop.run_until_complete(start_server())

Task exception was never retrieved
future: <Task finished name='Task-79' coro=<TestSSL.test_create_server_ssl_over_ssl.<locals>.handle_client() done, defined at cpython/Lib/test/test_asyncio/test_ssl.py:1039> exception=OSError(55, 'No buffer space available')>
Traceback (most recent call last):
  File "cpython/Lib/test/test_asyncio/test_ssl.py", line 1042, in handle_client
    data = await reader.readexactly(len(A_DATA))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "cpython/Lib/asyncio/streams.py", line 728, in readexactly
    await self._wait_for_data('readexactly')
  File "cpython/Lib/asyncio/streams.py", line 522, in _wait_for_data
    await self._waiter
  File "cpython/Lib/asyncio/selector_events.py", line 1049, in write
    n = self._sock.send(data)
        ^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 55] No buffer space available

I think a reasonable fix is to reduce the test data size:

A_DATA = b'A' * 1024 * 1024
B_DATA = b'B' * 1024 * 1024

Reducing from 1MB to 64KB seems to work fine on an M1 macOS with -j64 (thanks to @ned-deily and @ambv for reproducing!).

@pablogsal
Copy link
Member

pablogsal commented Aug 4, 2022

I am confused, the original error is not

OSError: [Errno 55] No buffer space available

but

ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2386)

Could you provide a bit more context on how that error is making the other one appear?

@ambv
Copy link
Contributor

ambv commented Aug 4, 2022

@fantix, thanks for your successful investigation. Exciting find! However, I agree with Pablo that it's not quite clear how silencing a warning might have lead to silencing of the OSError that in turn lead to the SSLEOFError. It's a bit of a head scratcher.

Additionally, speaking of the proposed workaround (I can't make myself call it a fix), don't you think a 1MB blob is realistic enough that it should be supported even in cases of raised concurrency? 64 processes still isn't that much, honestly, but we were able to easily overwhelm the test with just -j8.

Am I right thinking that this suggests that tunneling SSL in SSL with this sslproto implementation isn't feasible for production use? I mean, I've seen JavaScript files over 1MB 😆 Let me know what you think.

@fantix
Copy link
Contributor

fantix commented Aug 4, 2022

Thanks for checking! Those are good questions. Let me try to answer one of them here first:

@contextlib.contextmanager
def _silence_eof_received_warning(self):
# TODO This warning has to be fixed in asyncio.
logger = logging.getLogger('asyncio')
filter = logging.Filter('has no effect when using ssl')
logger.addFilter(filter)
try:
yield
finally:
logger.removeFilter(filter)

I think this Filter is wrong (originally introduced here, I'll create a fix) - it effectively suppressed all asyncio logging because no logger would have a name has no effect when using ssl (which should be a part of the message to filter out).

In our case here, StreamReaderProtocol creates server-side asyncio tasks of handle_client() that are not awaited. Their errors went into the asyncio logger and got suppressed. When the server-side is failing unexpectedly, the client-side SSLSocket would encounter an unexpected disconnection and raise ssl.SSLEOFError: EOF occurred in violation of protocol. As we were (by mistake) only showing client-side errors in this case, that became the first thing in sight.

For the second question of why the No buffer space available was raised in the first place, I'll need some more time to investigate. Will follow up here soon!

@ambv
Copy link
Contributor

ambv commented Aug 4, 2022

@fantix's discovery led @ronaldoussoren to create a C-only repoducer for the problem, proving it's not an issue with asyncio's new SSL implementation. He filed FB11063974 about this issue in Apple's tracker. I could reproduce the issue using Ronald's code.

Instead of reverting the new asyncio.sslproto, we should instead special-case macOS to decrease the buffer size as @fantix is suggesting. Let's only decrease the buffer for the Mac.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 5, 2022
…gs (pythonGH-95687)

(cherry picked from commit e1d68b3)

Co-authored-by: Fantix King <fantix.king@gmail.com>
ambv pushed a commit that referenced this issue Aug 5, 2022
…-95687) (GH-95699)

(cherry picked from commit e1d68b3)

Co-authored-by: Fantix King <fantix.king@gmail.com>
ambv added a commit that referenced this issue Aug 5, 2022
Co-authored-by: Łukasz Langa <lukasz@langa.pl>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 5, 2022
…thonGH-95668)

Co-authored-by: Łukasz Langa <lukasz@langa.pl>
(cherry picked from commit 3a9e1fd)

Co-authored-by: Fantix King <fantix.king@gmail.com>
ambv pushed a commit that referenced this issue Aug 5, 2022
GH-95705)

Co-authored-by: Łukasz Langa <lukasz@langa.pl>
(cherry picked from commit 3a9e1fd)

Co-authored-by: Fantix King <fantix.king@gmail.com>
@ambv
Copy link
Contributor

ambv commented Aug 5, 2022

This can now be closed and the 3.11 release is unblocked. Thanks for investigating @tiran and @ned-deily, thanks for identifying the root cause and issuing PRs @fantix, and thanks @ronaldoussoren for ensuring the problem is not Python-specific and following up with Apple.

@ambv ambv closed this as completed Aug 5, 2022
@vstinner
Copy link
Member Author

vstinner commented Aug 5, 2022

Fixed by:

Would it be possible to add a comment explaining BUF_MULTIPLIER = 1024 if not MACOS else 64 in Lib/test/test_asyncio/test_ssl.py? Comment mentioning ENOBUFS and the macOS bug FB11063974.

iritkatriel pushed a commit to iritkatriel/cpython that referenced this issue Aug 11, 2022
iritkatriel pushed a commit to iritkatriel/cpython that referenced this issue Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

7 participants