Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A temporary quick fix for dataclass serialization (#500) #503

Merged
merged 4 commits into from
Jul 21, 2022

Conversation

anivegesana
Copy link
Contributor

This quick fix will be removed when proper dataclass serialization support is added to dill. This is just here to allow for better support, at least for now. dataclasses pickled with this PR will be unpicklable by future versions of dill, but the future versions of dill will be able to be automatically use the newer features in dataclasses.py that were not available in older versions of Python. That forward compatibility feature is not present in this PR.

This quick fix will be removed when proper dataclass serialization 
support is added to dill. This is just here to allow for better support, 
at least for now. dataclasses pickled with this PR will be unpicklable 
by future versions of dill, but the future versions of dill will be able 
to be automatically use the newer features in dataclasses.py that were 
not available in older versions of Python. That forward compatibility 
features is not present in this PR.
@mmckerns mmckerns added this to the dill-0.3.6 milestone Jun 8, 2022
@mmckerns mmckerns self-requested a review June 8, 2022 20:59
@mmckerns
Copy link
Member

mmckerns commented Jun 9, 2022

I'll review this shortly. Note that PyPy-3.7 is failing.

@anivegesana
Copy link
Contributor Author

I think that is an unrelated bug. Creating a new PR (#506) to fix that to make it easier to index later in case there are other changes in PyPy 3.7+.

Copy link
Member

@mmckerns mmckerns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable. (1) In the future, if your intent is to replace the newly registered "save" functions with "better" save functions, how would you deal with old stored pickles that use these definitions once a "better" dataclass serialization has been implemented? I assume with _shims. (2) Should the newly registered types be added to dill._objects? I should be able to approve this once you've addressed the above question.

@anivegesana
Copy link
Contributor Author

anivegesana commented Jun 18, 2022

The idea is that the current temporary fix does not modify the _create_* functions. Because of this, it will be able to be unpickled in every version of dill that has the same set of create functions. Later, a _create_dataclass function will be created and we will pickle to that. Let's say that is done by dill 0.3.7. Then pickles containing dataclasses created in dill 0.3.7 will not be able to be loaded in dill 0.3.6. However, the older pickles using _create_type will be able to be loaded in all versions of dill that have a _create_type, which is pretty much all of them. This does not let users load pickles from 0.3.7 in older versions of dill, so I will create a shim for _create_dataclass that calls dataclasses.dataclass. This way, we can change _create_dataclass, but older versions will still be able to open the pickle. This is not ideal because every pickle that has a dataclass will be a little bit more bloated with essentially an if statement to choose the right function, but there isn't a more convenient way.

I think the newly registered types should be added to dill._objects only if they are part of the standard library. dataclass is just a decorator that modifies the class in-place, so it is a user-defined class, not a standard library class.

In summary, as long as we are careful to not lose compatibility in the _create_* functions and we use shims for all new _create_* functions, everything will work out. This will add some bloat, but the bloat will be in cases that are relatively rare, considering that they were not supported before and people are using dill.

@mmckerns
Copy link
Member

mmckerns commented Jun 18, 2022

I think the newly registered types should be added to dill._objects only if they are part of the standard library. dataclass is just a decorator that modifies the class in-place, so it is a user-defined class, not a standard library class.

I'm not sure I agree with you.

>>> import dataclasses
>>> type(dataclasses.MISSING)
<class 'dataclasses._MISSING_TYPE'>
>>> 

You are specifically registering a _MISSING_TYPE object, so why should we treat that differently from any of the other classes defined in the python STL? We don't treat numpy classes differently (they are also in dill._objects).

Also... (3) shouldn't these have tags for log tracing?

The idea is that the current temporary fix...

What if we don't remove these registered types when a we have a _create_dataclass function? I assume that would be better for backward compatibility, unless they are limiting in some way. On quick inspection, I don't see them as being limiting.

Also, FYI, unless a high-priority bug fix comes up, I'm done with new issues for the pending release (although i may, given time, work on #487 or other higher-priority bug fix issues). It seems that with this PR, you are also done with PRs tagged for 0.3.6. @leogama has a few that are still being worked on. There's a soft release deadline for changes on 27 June, so we will assess readiness at that time.

@anivegesana
Copy link
Contributor Author

anivegesana commented Jun 19, 2022

Ah. Forgot another similarity in all of the dill.objects types. They are all builtin types written in C, not Python. The Python class should be pickleable anyway by the pickle module, unless the author of the Python class goofed in writing their __reduce__ function or they wanted to explicitly bar others from pickling it. I think this is the first case. They intended for __reduce__ to hint to the pickler that the object is a singleton by returning "MISSING_TYPE" (the global name of the singleton object). A PR to cPython should be written to fix it. If we add all of the Python types, dill.objects would be unmaintainable because many many classes would be added and it would be very inconvenient to maintain it. In addition, the test_objects test case would get unbearably long pickling objects that are obviously pickable.

In my opinion, there is no need to add every class in Python into dill.objects. Just the ones that are implemented in C, not Python.

@mmckerns
Copy link
Member

mmckerns commented Jul 1, 2022

@anivegesana: let's resolve this. A new release should happen in a few days, at most.

@anivegesana
Copy link
Contributor Author

The last two weeks have been chaotic, so I didn't get to do much. What did we end up deciding?

Add the types of dataclass package singletons to dill.objects because they are in the standard library or omit them because they are private and are not builtins (written in C)?

@mmckerns
Copy link
Member

mmckerns commented Jul 2, 2022

(2) Should the newly registered types be added to dill._objects?

Please address this. You should produce the behavior as in (#518 and) #519.

(3) shouldn't these have tags for log tracing?

Please address this.

@mmckerns mmckerns merged commit 74e0fd4 into uqfoundation:master Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants