Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_model_row does not work for schema extensions of ModelV1 #24

Closed
kleinschmidt opened this issue Feb 21, 2023 · 15 comments
Closed

read_model_row does not work for schema extensions of ModelV1 #24

kleinschmidt opened this issue Feb 21, 2023 · 15 comments

Comments

@kleinschmidt
Copy link
Member

read_model_row always returns a ModelV1 which makes it pretty useless for anything that requires fields from a schema extensions. In Legolas <0.5, this was fine because extra fields were preserved but now they are discarded.

Instead, could LegolasFlux get the schema from the table metadata, use Legolas.record_type to get the appropriate record type, and try to construct that instead?

@ericphanson
Copy link
Member

ericphanson commented Feb 21, 2023

hmm, I guess? I am not sure, that feels kind of like trying to work around the legolas change rather than embracing it. To me it seems like either that function should be dropped and folks should add it themselves for their own schemas if it is useful*, or it should have a required schema argument to say what schema you want to deserialize.

*we could have a macro to generate this function, but that seems somewhat indirect/confusing. Maybe better is to just write that function out as an example somewhere to suggest writing your own if it is useful.

@hannahilea
Copy link
Contributor

a required schema argument to say what schema you want to deserialize

What about an optional schema argument for this, so that the existing behavior (return ModelV1) is still the default and doesn't break in the absence of a specified schema?

@kleinschmidt
Copy link
Member Author

Isn't this what Legolas.read does already (auto-detect the schema)? Maybe there should be a Legolas.read_row?

@hannahilea
Copy link
Contributor

Good point. Or update read to take an optional schema specification

read(io_or_path; validate::Bool=true, schema=missing)

@kleinschmidt
Copy link
Member Author

I think that read by design does not take a schema argument

@hannahilea
Copy link
Contributor

ah, that makes sense.

@kleinschmidt
Copy link
Member Author

oh yikes I missed that write_model_row always imposes the ModelV1 schema though...

@kleinschmidt
Copy link
Member Author

So as things currently stand, if we follow the recommendations of the readme (use write_model_row/read_model_row, use schema extensions for your own model), things will not work: the written model row may contain the extra fields, but the deserialized one definitely won't.

@kleinschmidt
Copy link
Member Author

I think we shoudl follow the legolas template of:

  • write takes a schema argument (can default to ModelV1SchemaVersion)
  • read extracts teh schema from the Arrow table metadata.

@ericphanson
Copy link
Member

Yeah, I don't think folks should actually be using write_model_row or read_model_row as-is (even with Legolas v0.4), but rather defining their own version with their schema downstream. They were included to kind of have a minimal interface so the README could have an example showing you can :just: read and write, but that was probably a bad idea, because in practice you want to be defining your own schema extension and your own functions to write/read that schema if needed. This is def a docs/explanation failure but also an interface one it sounds like.

@kleinschmidt
Copy link
Member Author

in practice you want to be defining your own schema extension and your own functions to write/read that schema if needed

why though? what would those extensions be doing other than what these functions are doing, modulo the schema difference? I think it's pretty reasonable to provide these functions in a way that's generally useful. Right now they're just a footgun, since a user will see that they are used in teh examples adn exported and think they are meant for users to use, and then wonder why their code is breaking in mysterious ways when their extension columns go missing. If they're really meant just as examples/templates, we should remove them from the package and define them in-line in the README

@ericphanson
Copy link
Member

If they're really meant just as examples/templates, we should remove them from the package and define them in-line in the README

Yeah, that's what I was saying I should've done/should do.

Right now they're just a footgun, since a user will see that they are used in teh examples adn exported and think they are meant for users to use, and then wonder why their code is breaking in mysterious ways when their extension columns go missing

Agreed, that's what I meant by interface failure.

why though? what would those extensions be doing other than what these functions are doing, modulo the schema difference?

True... OK maybe they can be useful then. In my usage we have modified them for backwards compat reasons (e.g. if we infer an old schema, we can upgrade to the new one on-the-fly), but OK I agree they can be useful except for the specific schema being hardcoded.

@kleinschmidt
Copy link
Member Author

we have modified them for backwards compat reasons (e.g. if we infer an old schema, we can upgrade to the new one on-the-fly)

Yeah, I feel like this is the class of situations where in vanilla legolas land you are encouraged to use Arrow.read directly; since that fallback is always available, I think there isn't much downside to making the convenience functions we do provide useful

@kleinschmidt
Copy link
Member Author

Okay I'll make a PR then with my proposed changes and some tests!

@kleinschmidt
Copy link
Member Author

fixed by #26

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants