
refactor package to remove sharp edges for schema authors/users (particularly @row / Row) and improve API #54

Merged (29 commits, Oct 27, 2022)

Conversation

jrevels
Member

@jrevels jrevels commented Oct 3, 2022

I pulled on the thread of Legolas.@row to resolve #49, and accidentally unravelled the knitted sweater that was Legolas' API.

To elaborate: the overwhelming number of changes included in this single PR result from a development process that kind of resembles "x should be added/removed; hmm, adding/removing x immediately renders it obvious that y needs to be added/removed, or else the API will be inconsistent or have an obvious hole". Now that I know what the final puzzle looks like all put together, it'd be technically possible to break this up into more incremental PRs, but many of them would still each be breaking and I think it'd actually worsen the churn...but even if it didn't, we don't really have the bandwidth to draw out landing these changes 😅

A lot of this PR's changes stem from the design mini-investigation undertaken here.

Note that this PR supersedes #52, but is based atop @OTDE's commits from that branch to ensure they also receive contribution assignment for this PR once it lands on main.

changelog

I think the best way to onboard into this PR is probably to go through its new tour.jl, but here is a summary of the changes:

  • rm'd Legolas.Row in favor of the record types described in this document

  • Legolas.Schema -> Legolas.SchemaVersion, and associated name changes to more properly refer to "schema versions" where such changes are warranted. Also, the schema_ prefix has been dropped from many of the methods that were defined against this type.

  • rm'd methods against ::Type{<:Schema} in favor of methods against ::SchemaVersion (for similar reasons to why Base methods generally prefer ::Val{n} over ::Type{Val{n}}).

  • More disciplined/uniform API for handling schema identifiers, including replacing Legolas.Schema(id::AbstractString) with first(parse_identifier(id))

  • Legolas.@row has been replaced with Legolas.@schema (which simply declares a schema) and Legolas.@version (which declares a new version of a schema, and generates all associated record type definitions)

  • General overhaul of tour.jl and docs to both reflect the new changes and to hopefully provide a smoother / more thorough on-ramp for new users.

  • More comprehensive introspection/validation utilities:

    • required_fields
    • declaration
    • declared
    • find_violation
    • complies_with
    • accepted_field_type
  • validate(table, schema) --> validate(Tables.schema(table), schema). This works better/more generically with find_violation/complies_with, and just yields more control to the caller in general IMO. Furthermore, this makes it harder for callers to accidentally incorrectly assume that Legolas.validate validates row values.

  • Better-tailored error messages for Legolas.read/Legolas.write now that those methods don't need to punt to the removed validate method

  • The new Legolas-generated record types are a bit "dumber" than the former Legolas.Row, and it's more likely that row tables constructed via these record types will have directly inferrable Tables.Schemas. Thus, we've removed a small bit of Legolas' schema-guessing code in a manner that removes a bit of responsibility from Legolas.read/Legolas.write and makes them more efficient. The tradeoff is that these functions have a higher expectation that callers will provide them with tables whose Tables.Schemas are inferrable via Tables.schema.

  • expected properties of schema version declarations are now better validated/enforced:

    • schema version declarations may only contain one statement per required field
    • schema version declarations must include at least one required field
    • resolves "attempting to define a child schema of a non-existent parent schema should induce an UnknownSchemaError" (#51)
    • resolves "attempting to define a schema that violates its parent schema types should induce an error" (#53)
    • Previously, it was technically possible to alter a preexisting schema version declaration, and for such changes to be automatically reflected in the behavior of that schema version's children within the same Julia session. For example, you could redeclare a preexisting schema version foo@1 with an additional required field compared to its original declaration, which would de facto cause Legolas.validate to require the field for foo@1's child schemas. Since there was little enforcement of actual parent/child field compatibility in the first place, it was possible to redefine schema versions in such a manner that rendered them incompatible with their preexisting children. Now that we seek to better enforce parent/child schema version compatibility, we should strive to disallow this kind of accidental invalidation. We achieve this by bluntly disallowing the alteration of preexisting schema version declarations via redeclaration entirely, because a) there's not currently a mechanism for validating parent declarations against preexisting child declarations and b) the ability to alter preexisting schema version declarations is of questionable utility in the first place.
    • Enforcing the previous points also de facto enforces that a child can't accidentally be declared as its own parent
  • resolves "Legolas.Row type aliases impede ergonomic deprecation cycles" (#49)

  • obviates "throw better error messages for defensive implicit AbstractRecord convert failures" (#50)

  • resolves "Add function that returns a Schema's cols" (#41)

  • resolves "Columns of extension schemas should go after columns of parent schemas" (#12)

  • more comprehensive unit tests (though I'm sure there are more test cases we should add over time)

  • bumps minimum supported Julia version to 1.6, which is the latest LTS minor version, just to avoid maintenance overhead of dealing with Julia 1.3 compatibility. If any user actually requires Julia 1.3 support, please speak up and I'll reconsider.
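To make the changelog above concrete, here is a rough sketch of the new declaration/introspection surface. This is a hypothetical illustration, not runnable without the Legolas v0.5 proposed in this PR: the `example.foo` schema, the `FooV1` name, its fields, and the `table` value are all placeholders, and the exact signatures are defined by this PR's tour.jl (the macro spelling here follows the name-first form adopted later in this thread).

```julia
# Hypothetical illustration of the new surface; assumes Legolas v0.5 as proposed.
using Legolas, Tables

Legolas.@schema "example.foo" Foo       # declares the schema "example.foo"

Legolas.@version FooV1 begin            # declares version 1 of the schema and
    a::Real                             # generates the FooV1 record type
    b::String
end

sv = FooV1SchemaVersion()               # the SchemaVersion for FooV1
Legolas.identifier(sv)                  # the schema-qualified identifier string
Legolas.required_fields(sv)             # introspection utilities from the list above
Legolas.complies_with(Tables.schema(table), sv)  # `table` is a placeholder value
Legolas.validate(Tables.schema(table), sv)       # validates the Tables.Schema, not row values
```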

Breaking Change Summary

  • The removal/replacement of Legolas.Row in general. Note that, unless we add a deprecation path, the new Legolas version will not be able to automatically deserialize values that were previously serialized via the former Legolas.Row ArrowTypes overloads. Note though that this probably doesn't affect most Legolas-serialized data, unless e.g. you had a nested Legolas.Row value inside of a column of a table.
  • Legolas.@row -> Legolas.@schema + Legolas.@version
  • Legolas.Schema -> Legolas.SchemaVersion, and associated method name changes
  • methods against Type{<:Schema} -> methods against ::SchemaVersion
  • Schema(x::AbstractString) -> first(parse_identifier(x)) (the previous method allowed, but did not return, parent identifiers - and was buggy to boot, allowing invalid parent identifiers)
  • general removal of the schema_ prefix from SchemaVersion accessor methods
  • schema_qualified_string -> identifier
  • validate(table, schema) -> validate(Tables.schema(table), schema)
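A minimal before/after migration sketch for the items above. The `x@1` schema is the hypothetical one used elsewhere in this thread; the commented-out lines show the pre-v0.5 forms, and `table` is a placeholder. Exact details are defined by this PR's tour.jl.

```julia
# Before (pre-v0.5), hypothetically:
#   const XV1 = @row("x@1", x::Int)
#   s = Legolas.Schema("x@1")
#   Legolas.validate(table, s)

# After (this PR), same hypothetical schema:
using Legolas, Tables

Legolas.@schema "x" X
Legolas.@version XV1 begin
    x::Int
end

sv = first(Legolas.parse_identifier("x@1"))  # replaces Legolas.Schema("x@1")
Legolas.validate(Tables.schema(table), sv)   # replaces validate(table, s)
```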

Where to go from here

Note this PR bumps the version to v0.5, not v1.0, because there are still some obvious breaking changes to be made later:

My plan is to merge this PR without implementing automated deprecation paths. I think the breakage in this release (esp. w.r.t. Legolas.Row removal) is sufficiently nontrivial that any user who'd like to upgrade will probably need to actually manually upgrade, rather than fully rely on automated deprecation pathways. If it's worthwhile, we can add in depwarns for more trivial cases (e.g. simple method renames) in follow-up patch releases, but I'm reluctant to attempt more complex deprecation pathways that would have to handle a lot of edge cases in order to be correct (and thus have a higher risk of being subtly buggy).

Note that Beaconeers aren't expected to upgrade to v0.5 internally without dedicated support from the relevant team members (if you have a question about this, feel free to ping me in Beacon's internal Slack). So feel free to upgrade "early" if you want to for your use case, but keep in mind that it's not a requirement.

@hannahilea
Contributor

More comprehensive introspection/validation utilities [...]

<3

@jrevels
Member Author

jrevels commented Oct 6, 2022

I'm currently working on a branch to upgrade Onda to the intended Legolas v0.5 and running into an area of Legolas' design that is really tempting to try to improve before merging this.

I don't have time to write up a full explanation of the problem space now (will do so later) but suffice to say I'm tempted to reintroduce a struct-generation mechanism that is kind of like Legolas.Row but that doesn't allow "extension w/o declaration", i.e. the full field structure is statically known and forcibly encoded in the type definition. either that, or some schema-specialized "table builder" pattern

still unclear on how/whether the field types for such a mechanism should be allowed to be parameterized, or whether or not this notion is actually 1:1 w/ schema declaration

will think on it + follow up with a better breakdown of the problem space / motivations later

@jrevels
Member Author

jrevels commented Oct 16, 2022

will think on it + follow up with a better breakdown of the problem space / motivations later

Alrighty then! I jotted down a bunch of thoughts in an attempt to break down the problem/motivations as promised. It's a bit too much content for a single comment so I've dumped the write up in a gist here: https://gist.github.com/jrevels/fdfe939109bee23566d425440b7c759e

I'm going to forge ahead and refocus this PR towards implementing the proposal put forward in that gist. Will flip this into draft mode in the meantime. My hope is to get everything updated and mergeable by EOW.

@jrevels jrevels marked this pull request as draft October 16, 2022 23:06
@jrevels
Member Author

jrevels commented Oct 26, 2022

My hope is to get everything updated and mergeable by EOW.

Took a little bit longer than expected, but I've now merged #55 into this PR, which implements the aforementioned proposal.

I'm going to try to get CI green, then I'll flip this PR back out of draft mode. My hope is to get this merged/tagged on Thursday, modulo any major concerns brought up in review.

@jrevels jrevels marked this pull request as ready for review October 26, 2022 02:20
Member

@ericphanson ericphanson left a comment


I think this is great! I think this is a better system than the already-quite-useful @row one. I do have fears about a long tail of upgrading old tables, and more specific fears about arrow serialization still not being tested thoroughly (e.g. w/ multiple type parameters, w/ union type constraints, etc), and some more specific comments.

Dev ergonomicity 1: Revise

Redefining a schema was actually super nice when developing. It made Legolas Rows like Revisable structs, where you could do

@row("x@1", x::Int)

and then later realize you actually need a y field too, so you just put in the REPL

@row("x@1", x::Int, y::Float64)

and everything "just works" (since the type is the same and all the info is in the methods, which get invalidated).

I get that we don't want to be redefining schemas like this in package code, and don't want to be overwriting existing definitions etc, but... if there's some way to preserve the development ergonomicity, it would be super nice. However I'm not sure how that could actually be implemented now that we are really defining individual structs...

Dev ergonomicity 2: what was created by this macro

I think one issue with @version and @schema is that it's not really visible what happened when you run one of those, like what did it generate. @row was similar but it returned something, so you had some feedback that OK now I have this object and I can use it. But these return nothing, so you don't get a nudge as to what was created and what you can use. (Obvs it's in the docs/tour, but still it's kind of nice to have that automatic feedback).

In particular, the connection between @version("example.foo@1", ...) and FooV1 being created feels too loose (in terms of code clarity, not actual semantics).

One option is @version could return say FooV1, the type. But I worry this would encourage folks to do const MyFoo = @version(...) like we do with @row, and that would be kinda an anti-pattern here (since probably it's better to use the FooV1 name rather than introduce another one, unless there is a good reason to).

Another semi-weird option would be to try to introspect whether the macro is running in the REPL (vs. a package/precompiling/etc.) and, if so, print something like "FooV1 type has been created". I think that could go a long way toward making the package more approachable, even if it's a bit odd.

Or we could somehow try to get FooV1 into the written code itself, like writing the macro as @version(FooV1, ...) and using @schema to reverse the connection, meaning that @version(FooV1, ...) errors unless an @schema has been declared so that FooV1 can be associated to the example.foo@1 schema. In other words, you wouldn't have any more freedom in choosing how to spell the type (which is good), but you would have to write the type name there, so then you know it exists and when it shows up later in the code you know where it came from.

So that would look like:

@schema "example.foo" Foo

@version FooV1 begin
    a::Real
    b::String
end

@jrevels
Member Author

jrevels commented Oct 26, 2022

Or we could somehow try to get FooV1 into the written code itself, like writing the macro as @version(FooV1, ...) and using @schema to reverse the connection, meaning that @version(FooV1, ...) errors unless an @schema has been declared so that FooV1 can be associated to the example.foo@1 schema. In other words, you wouldn't have any more freedom in choosing how to spell the type (which is good), but you would have to write the type name there, so then you know it exists and when it shows up later in the code you know where it came from.

So that would look like:

Yeah, this does look prettier and clearer from a certain angle

The cons AFAICT:

  • it kind of emphasizes the record type over the schema version type
  • it obfuscates the schema version identifier itself a bit
  • it would require me to reimplement some stuff 😅

i think the huge pro, though, is the greppability improvement - even if you don't know what's going on with FooV1 at a callsite, you can find the FooV1 definition via text search. i think this is actually pretty important, and i think tips me in favor of this change

i think as part of this change we should also maybe just change FooSchemaV1 (which reads to me as "foo schema version 1") to FooV1SchemaVersion (which reads to me as "the SchemaVersion for FooV1"). Keeping the V1 next to the Foo i think preserves some of the greppability we're getting from the change. FooV1Schema would be shorter, but I don't like that it kinda conflates "schema" and "schema version", which this PR works pretty hard to cleanly delineate
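For concreteness, the proposed renaming would read like this (hypothetical `Foo` schema; names follow the convention described above):

```julia
# Hypothetical naming under the proposed convention:
FooV1               # the record type for version 1 of the schema
FooV1SchemaVersion  # "the SchemaVersion for FooV1" (previously FooSchemaV1)
```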

let the repaint of the bikeshed begin 😁

@jrevels
Member Author

jrevels commented Oct 26, 2022

I get that we don't want to be redefining schemas like this in package code, and don't want to be overwriting existing definitions etc, but... if there's some way to preserve the development ergonomicity, it would be super nice. However I'm not sure how that could actually be implemented now that we are really defining individual structs...

Yeah, makes sense. I'm not sure how to cleanly support this, but would be down if somebody wants to make a suggestion on how to do so.

Also worth noting that AFAICT it likely wouldn't be a breaking change if we figured out how to solve this in a subsequent release

jrevels and others added 3 commits October 26, 2022 11:18
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
@hannahilea
Contributor

if we figured out how to solve this

...could always poke the relevant upstream issue (Revise+struct) folks for this one? JuliaLang/julia#40399 via timholy/Revise.jl#18 (comment)

@jrevels
Member Author

jrevels commented Oct 27, 2022

let the repaint of the bikeshed begin

Okay, bikeshed repainted. Retagging @ericphanson for one last review pass 🙏

If there are no major concerns my plan is to merge by EOD tomorrow and tag the v0.5 release as stated in the OP.

Member

@ericphanson ericphanson left a comment


Looks good to me. I would suggest we also document some update guidance & best practices, maybe in other PRs (I know you said everyone will get help upgrading but I think it still helps to have some written guidance). E.g., questions I have would be:

  • when should we be making a schema extension vs using custom columns (given constructors drop custom cols)?
    • Should custom cols only be used when it's entirely dynamic what the column is, so it's not really possible to write a schema unless we generate it on-the-fly (which sounds like a very bad idea), or are custom columns more broadly applicable?
  • since we can only check type-constraints (not value constraints) without the constructors, whose responsibility is it to do that construction? the writer or the reader or both?
    • I am guessing the reader, or maybe more specifically, the caller of a function requiring that vetting, since you shouldn't really trust the writer did it, since who knows what they did
    • if it is the reader/caller, then this also means it's possible to serialize out custom columns, otherwise they would be dropped by constructing-before-writing, and you can read them in as long as you don't do that constructing
  • is it more OK to constrain function signatures by type now than before? E.g. if you want to express that you need a constructed-therefore-validated type as input
  • is it more OK to dispatch on record types now than before? Since your package "owns" the type rather than Legolas now, and they seem to have greater semantic meaning now
    • this is kinda the same as the previous question, but imo relying on dispatch for functionality is different semantically than e.g. having a function w/ a single method w/ a type constraint

also, unrelatedly, if we want table-level validation, and we are in the validate-by-construction framework, does that mean we need tabular types, not just row types?

Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
@jrevels
Member Author

jrevels commented Oct 27, 2022

when should we be making a schema extension vs using custom columns (given constructors drop custom cols)?

i'm not sure there's a hard rule, so i'd just say "when it's useful to do so" (which i know is annoying 😁 but maybe we can figure out a more thoughtful rule at some point). one's use of schema-aware tools/systems may incentivize this in different directions

since we can only check type-constraints (not value constraints) without the constructors, whose responsibility is it to do that construction? the writer or the reader or both?

For the reader, it's all about trust; for the writer, it's all about goals. A big part of this is whether you're reading from / writing to a schema-on-read system or a (trusted) schema-on-write system. For example, if I'm a reader that is reading from a trusted schema-on-write system, i'd probably want to avoid revalidating unless i'm forced to by e.g. using a tool that is unaware/agnostic of the provenance of its inputs, and thus forces its own validation

i have some design notes written down on this topic that i'd like to put into a design docs section of the docs

is it more OK to constrain function signatures by type now than before? E.g. if you want to express that you need a constructed-therefore-validated type as input

Yup, it is; but with the caveat that doing so also means biting the bullet that your code is now non-generic, and thus could "poison" otherwise generic code that callers might want to compose it with. which, in the worst case, could result in costly/burdensome conversion gymnastics (but in some cases, that might be okay)

is it more OK to dispatch on record types now than before? Since your package "owns" the type rather than Legolas now, and they seem to have greater semantic meaning now

Yeah, but maybe less so than the former one lol.

Basically it's okay to do these things if you understand/accept the consequences. i personally think a more duck-typed style is more useful/less frustrating in a lot of scenarios, given the plethora of types that a valid table/row value can take on. IMO the holy grail would be true structural typing support, but that might be a bit of a lofty non-goal in julia land

also, unrelatedly, if we want table-level validation, and we are in the validate-by-construction framework, does that mean we need tabular types, not just row types?

either that, or have tabular validation be treated separately. i actually explored introducing a table type in this PR a little bit before going with the record type approach

update guidance & best practices

agreed, i'll file an issue for this
