Nested schema deserialization depends on column order #94

ericphanson · 2023-06-11T17:19:18Z

using Legolas
using Legolas: @schema, @version

@schema "test.issue" Issue

@version IssueV1 begin
    a::Int
    b::Int
end


@schema "test.parent" Parent
@version ParentV1 begin
    x::IssueV1
end


table = [ParentV1(; x=IssueV1(; a=1, b=2))]
Legolas.write("issue.arrow",table,  ParentV1SchemaVersion())

# New Julia session

using Legolas
using Legolas: @schema, @version

@schema "test.issue" Issue

@version IssueV1 begin
    b::Int
    a::Int
end


@schema "test.parent" Parent
@version ParentV1 begin
    x::IssueV1
end

Legolas.read("issue.arrow")

gives

julia> Legolas.read("issue.arrow").x[1]
IssueV1: (b = 1, a = 2)

which is incorrect, as b has value 1. This is because

Legolas.jl/src/schemas.jl

Line 649 in b86444c

    
           $Arrow.ArrowTypes.fromarrow(::Type{<:$R}, $(keys(record_fields)...)) = $R(; $(keys(record_fields)...))

assumes the order in the serialized file matches the existing one.

I am not sure if this is a bug or just a foot-gun (depending on what guarantees Legolas is supposed to have about column order not mattering).

The text was updated successfully, but these errors were encountered:

jrevels · 2023-06-12T11:35:31Z

I am not sure if this is a bug or just a foot-gun

This is a (pretty bad 😨) bug

jrevels · 2023-12-03T04:38:49Z

EDIT: in an older version of this comment, i forgot we're packing these records into arrow structs (basically tuples, where the key info is stored in the table's schema), not maps (basically dicts where key info is stored inline). updated to reflect the former

The root of the problem here is that Arrow.fromarrow (and more specifically, the Arrow.jl indexing code that invokes fromarrow) splats accessed field values and does not pass the corresponding field keys through to the downstream constructor where they might be utilized; from https://docs.juliahub.com/ArrowTypes/3gg9u/2.2.2/autodocs/#ArrowTypes.fromarrow:

StructKind: must overload fromarrow(::Type{T}, x...) where individual fields are passed as separate positional arguments; so if my custom type Interval has two fields first and last, then I'd overload like ArrowTypes.fromarrow(::Type{Interval}, first, last) = .... Note the default implementation is ArrowTypes.fromarrow(::Type{T}, x...) = T(x...), so if your type already accepts all arguments in a constructor no additional fromarrow method should be necessary (default struct constructors have this behavior).

So even though the Arrow-formatted table does encode the necessary field information to enable deserialization that is agnostic to the field ordering employed at serialization time, the interface that Arrow.jl provides to downstream type authors prevents that information from being utilized.

It'd be nice if the relevant getindex method could instead simply invoke fromarrow the whole NamedTuple that's actually encoded in the Arrow struct, instead of a splatted list of field values.

the actual layout info I'd want to pass through from getindex to fromarrow lives in the column's surrounding Arrow.Table, in the table's schema field. unfortunately, the column itself does not carry its own schema - it only carries the Julia type computed via Arrow.juliaeltype:

help?> Arrow.juliaeltype
Given a flatbuffers metadata type definition (a Field instance from Schema.fbs), translate to the
appropriate Julia storage eltype

so one non-breaking approach to add a fnames type parameter to the Arrow.Struct type, then define:

struct StructElement{T<:NamedTuple}
    fields::T
end

fromarrow(T::Type, x::StructElement) = fromarrow(T, values(x.fields)...)

Then, update the relevant getindex method to pass a StructElement instance to fromarrow, using the fnames type parameter to construct the NamedTuple. Then, downstream type authors could overload fromarrow(T::Type, x::StructElement) if they'd like to hook into this transformation at this earlier stage.

…#493) Motivated by beacon-biosignals/Legolas.jl#94 (comment) Still requires: - [x] docs - [x] a test - [x] a bit more due diligence benchmarking-wise. `@benchmark`ing the access in the test case from beacon-biosignals/Legolas.jl#94 didn't reveal any perf difference, which seems like a good sign --------- Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>

jrevels · 2023-12-05T17:56:41Z

will be resolved by #106 once it is merged/released (and once apache/arrow-julia#493 is released)

jrevels · 2023-12-13T21:52:04Z

resolved by #106

ericphanson mentioned this issue Jun 11, 2023

Cannot deserialize nested schema after adding optional field #95

Open

jrevels mentioned this issue Dec 3, 2023

enable field-order-agnostic overloads of fromarrow for struct types apache/arrow-julia#493

Merged

3 tasks

jrevels closed this as completed Dec 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nested schema deserialization depends on column order #94

Nested schema deserialization depends on column order #94

ericphanson commented Jun 11, 2023

jrevels commented Jun 12, 2023

jrevels commented Dec 3, 2023 •

edited

Loading

jrevels commented Dec 5, 2023

jrevels commented Dec 13, 2023

Nested schema deserialization depends on column order #94

Nested schema deserialization depends on column order #94

Comments

ericphanson commented Jun 11, 2023

jrevels commented Jun 12, 2023

jrevels commented Dec 3, 2023 • edited Loading

jrevels commented Dec 5, 2023

jrevels commented Dec 13, 2023

jrevels commented Dec 3, 2023 •

edited

Loading