Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asyncio tcp transport on Windows reads bytearray instead of bytes #99941

Closed
DarioDaF opened this issue Dec 2, 2022 · 14 comments
Closed

asyncio tcp transport on Windows reads bytearray instead of bytes #99941

DarioDaF opened this issue Dec 2, 2022 · 14 comments
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@DarioDaF
Copy link
Contributor

DarioDaF commented Dec 2, 2022

Bug report

asyncio.Protocol.data_received prototype not respected: when using create_connection to create a tcp transport data_received is being called with a bytearray object instead of bytes.
If this is expected behaviour libraries like httpx and such should be warned or the prototype modified, although I doubt it's intended because it would suppose the generator doesn't hold reference to it otherwise data could change while stored in buffered stuff.

import sys
print(f'Python: {sys.version}\n')

import asyncio

class MyProto(asyncio.Protocol):
    def data_received(self, data: bytes) -> None:
        print('@@@@@@@@@@ ', data)
    def eof_received(self):
        print('##########')

async def main():
    t, proto = await asyncio.get_running_loop().create_connection(MyProto, 'example.com', 80)
    t.write(b'SITE BE MAD\n\n')
    await asyncio.sleep(1)
    t.close()

asyncio.run(main())

Correct output: @@@@@@@@@@ b'HTTP/1.0 ...WHATEVER THE SERVER ANSWERS...'
Faulty output: @@@@@@@@@@ bytearray(b'HTTP/1.0 ...WHATEVER THE SERVER ANSWERS...')

Your environment

On Windows 11 x64

Tested on:

  • 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]
  • 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)]

Working as expected in:

  • 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)]

Related problems

Link to the thread where I found a conflicting thing (it's being solved there as a workaround)
encode/httpx#2305 (comment)

EDIT: simplified the minimal example

Linked PRs

@DarioDaF DarioDaF added the type-bug An unexpected behavior, bug, or error label Dec 2, 2022
@gvanrossum
Copy link
Member

Is it really only happening on Windows?

I'm not sure what you mean by asyncio.Protocol.data_received prototype -- are you referring to the type stubs used by static type checkers in typeshed perhaps? Those are not considered normative (they are often created using reverse engineering and bugs may be fixed when users complain). What does the doc say? That would be considered normative (in most cases).

Fixing this might require copying the data which would adversely affect performance, effectively copying all data received an extra time.

@DarioDaF
Copy link
Contributor Author

DarioDaF commented Dec 3, 2022

I've not tested it on linux, the docs clearly says https://docs.python.org/3/library/asyncio-protocol.html#asyncio.Protocol.data_received non empty bytes object although could make sense for perfomrance to return a bytes-like object (alhough non hashable that is the main problem, and slices of bytearrays are still byte arrays even tho they don't share memory and are a copy) https://github1s.com/python/cpython/blob/HEAD/Lib/_pyio.py#L621 as you can see searching for read functions it's always either casted to bytes or calling other reads that finally lead to os.read which correctly returns bytes, will test on linux probably on monday and see if I can setup a test env to delve deeper in where the bytearray comes from

@pochmann
Copy link
Contributor

pochmann commented Dec 3, 2022

the docs clearly says non empty bytes object

Hmm, actually it says "bytes object", not "bytes object".

That said, the doc for bytes and bytearray often says "bytes object" and "bytearray object", referring to bytes and bytearray, respectively.

@gvanrossum
Copy link
Member

I agree that the phrase "bytes object" in the docs refers to bytes, not bytearray.

The cause of the regression (if it is that) is presumably GH-21442, which switches to bytearray for performance reasons. It does this only for the proactor event loop, so Windows only.

Now, I am reluctant to switch back to bytes, because that would lose some of the performance gain. I'd rather update the docs. If a particular protocol class really want bytes, it's easy to put data = bytes(data) at the top of the data_received method.

@DarioDaF
Copy link
Contributor Author

DarioDaF commented Dec 4, 2022

Fair as long as the bytearray is not reused and is released by the source, otherwise if any protocol buffers stuff storing the object it will get corrupted, although I see self._data[:length] in the deferred call, so it's already copying, making it a bytes shouldn't have speed downsides cause slices for bytearrays are copies

@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Dec 4, 2022

+1 for updating the docs to something like data is a bytes-like object. An optimized protocol may choose to return memoryview of the data to avoid excessive copying so I would like to keep the docs a bit generic for future improvements.

@gvanrossum
Copy link
Member

Hm, memoryview is probably too different from bytes and bytearray to be acceptable. (In the typeshed world there's been discussion of dropping its inclusion in the bytes-like concept there, though this is not so simple either.)

It is indeed the case that the bytesarray passed to data_received() is a fresh slice. Unfortunately I think there's no API that I know of that creates a slice of a bytesarray and turns it into a bytes without an extra copy. (Unless that's possible using memoryview as an intermediate?)

I think it's fine if the docs state bytes-like.

@DarioDaF
Copy link
Contributor Author

DarioDaF commented Dec 4, 2022

Totally agree, should be possible (ofc requiring testing) to wrap a memoryview, slice that which is a noop and bytes it instead with almost no performance drop

@gvanrossum
Copy link
Member

So instead of self._data[:length] we could write bytes(memoryview(self._data)[:length]) and that doesn't copy bytes redundantly. It does create two memoryview objects (one for self._data and one for the slice) and each memoryview object costs 184 bytes on my machine, so maybe we could do something like bytes(memoryview(self._data)[:length]) if length >= 500 else bytes(self._data[:length]) (the 500 is a hunch, we could tune it using a micro-benchmark if you feel like it).

But since the current behavior is already out in 3.10 and 3.11 maybe it's not worth fixing and we should just document that it's "bytes-like". Thoughts?

@DarioDaF
Copy link
Contributor Author

DarioDaF commented Dec 4, 2022

In my opinion I'd keep bytes, cause passing mutable objects in the stream stack means that any protocol in between could mess up and break reading stuff (same with write in fact there was a discussion to disallow passing bytearray in writes, I think they settled in copying it with bytes on function call). Also all the reads (file like objects and everything) documentations must be changed or al least checked if we change this protocol. I'm not in the dev team for python tho, so I might not be the best person to ask this

@gvanrossum
Copy link
Member

Do you feel comfortable submitting a PR? I think there are only two or three places that need to be changed, plus a unittest for each.

@DarioDaF
Copy link
Contributor Author

DarioDaF commented Dec 6, 2022

Sure no problem, I'd see how to test performance with the example, I've also found that a similar thing is done in streams.py to uniform that we should decide what to do, imo unless the interpreter can figure out bytes(my_bytearray[:n]) as a simplified slice (since the slice reference is never stored) it's always best to go with the memoryview intermediary (if someone is reading with a very small buffer they are clearly not going for performance and adding a conditional will slow stuff down) so I'd change to that also in https://github1s.com/python/cpython/blob/HEAD/Lib/asyncio/streams.py#L691 and https://github1s.com/python/cpython/blob/HEAD/Lib/asyncio/streams.py#L733

@msoxzw
Copy link
Contributor

msoxzw commented Jan 28, 2023

Another PR, #21446, efficiently implementing BufferedProtocol for proactor event loop, addresses this issue as well.

@DarioDaF
Copy link
Contributor Author

I'll give a look at the PR, it seems more broad then the actual problem. Now there are test cases for this so if it passes should be fine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

No branches or pull requests

5 participants