Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested schema deserialization depends on column order #94

Closed
ericphanson opened this issue Jun 11, 2023 · 4 comments
Closed

Nested schema deserialization depends on column order #94

ericphanson opened this issue Jun 11, 2023 · 4 comments

Comments

@ericphanson
Copy link
Member

using Legolas
using Legolas: @schema, @version

@schema "test.issue" Issue

@version IssueV1 begin
    a::Int
    b::Int
end


@schema "test.parent" Parent
@version ParentV1 begin
    x::IssueV1
end


table = [ParentV1(; x=IssueV1(; a=1, b=2))]
Legolas.write("issue.arrow",table,  ParentV1SchemaVersion())

# New Julia session

using Legolas
using Legolas: @schema, @version

@schema "test.issue" Issue

@version IssueV1 begin
    b::Int
    a::Int
end


@schema "test.parent" Parent
@version ParentV1 begin
    x::IssueV1
end

Legolas.read("issue.arrow")

gives

julia> Legolas.read("issue.arrow").x[1]
IssueV1: (b = 1, a = 2)

which is incorrect, as b has value 1. This is because

$Arrow.ArrowTypes.fromarrow(::Type{<:$R}, $(keys(record_fields)...)) = $R(; $(keys(record_fields)...))
assumes the order in the serialized file matches the existing one.

I am not sure if this is a bug or just a foot-gun (depending on what guarantees Legolas is supposed to have about column order not mattering).

@jrevels
Copy link
Member

jrevels commented Jun 12, 2023

I am not sure if this is a bug or just a foot-gun

This is a (pretty bad 😨) bug

@jrevels
Copy link
Member

jrevels commented Dec 3, 2023

EDIT: in an older version of this comment, i forgot we're packing these records into arrow structs (basically tuples, where the key info is stored in the table's schema), not maps (basically dicts where key info is stored inline). updated to reflect the former


The root of the problem here is that Arrow.fromarrow (and more specifically, the Arrow.jl indexing code that invokes fromarrow) splats accessed field values and does not pass the corresponding field keys through to the downstream constructor where they might be utilized; from https://docs.juliahub.com/ArrowTypes/3gg9u/2.2.2/autodocs/#ArrowTypes.fromarrow:

StructKind: must overload fromarrow(::Type{T}, x...) where individual fields are passed as separate positional arguments; so if my custom type Interval has two fields first and last, then I'd overload like ArrowTypes.fromarrow(::Type{Interval}, first, last) = .... Note the default implementation is ArrowTypes.fromarrow(::Type{T}, x...) = T(x...), so if your type already accepts all arguments in a constructor no additional fromarrow method should be necessary (default struct constructors have this behavior).

So even though the Arrow-formatted table does encode the necessary field information to enable deserialization that is agnostic to the field ordering employed at serialization time, the interface that Arrow.jl provides to downstream type authors prevents that information from being utilized.

It'd be nice if the relevant getindex method could instead simply invoke fromarrow the whole NamedTuple that's actually encoded in the Arrow struct, instead of a splatted list of field values.


the actual layout info I'd want to pass through from getindex to fromarrow lives in the column's surrounding Arrow.Table, in the table's schema field. unfortunately, the column itself does not carry its own schema - it only carries the Julia type computed via Arrow.juliaeltype:

help?> Arrow.juliaeltype
Given a flatbuffers metadata type definition (a Field instance from Schema.fbs), translate to the
appropriate Julia storage eltype

so one non-breaking approach to add a fnames type parameter to the Arrow.Struct type, then define:

struct StructElement{T<:NamedTuple}
    fields::T
end

fromarrow(T::Type, x::StructElement) = fromarrow(T, values(x.fields)...)

Then, update the relevant getindex method to pass a StructElement instance to fromarrow, using the fnames type parameter to construct the NamedTuple. Then, downstream type authors could overload fromarrow(T::Type, x::StructElement) if they'd like to hook into this transformation at this earlier stage.

ericphanson added a commit to apache/arrow-julia that referenced this issue Dec 5, 2023
…#493)

Motivated by
beacon-biosignals/Legolas.jl#94 (comment)

Still requires:

- [x] docs
- [x] a test
- [x] a bit more due diligence benchmarking-wise. `@benchmark`ing the
access in the test case from
beacon-biosignals/Legolas.jl#94 didn't reveal
any perf difference, which seems like a good sign

---------

Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
@jrevels
Copy link
Member

jrevels commented Dec 5, 2023

will be resolved by #106 once it is merged/released (and once apache/arrow-julia#493 is released)

@jrevels
Copy link
Member

jrevels commented Dec 13, 2023

resolved by #106

@jrevels jrevels closed this as completed Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants