
refactor package to remove sharp edges for schema authors/users (particularly @row / Row) and improve API #54

Merged (29 commits, Oct 27, 2022)

Conversation

jrevels
Member

@jrevels jrevels commented Oct 3, 2022

I pulled on the thread of Legolas.@row to resolve #49, and accidentally unravelled the knitted sweater that was Legolas' API.

To elaborate: the overwhelming number of changes included in this single PR result from a development process that kind of resembles "x should be added/removed; hmm, adding/removing x immediately renders it obvious that y needs to be added/removed, or else the API will be inconsistent or have an obvious hole". Now that I know what the final puzzle looks like all put together, it'd be technically possible to break this up into more incremental PRs, but many of them would still each be breaking and I think it'd actually worsen the churn...but even if it didn't, we don't really have the bandwidth to draw out landing these changes 😅

A lot of this PR's changes stem from the design mini-investigation undertaken here.

Note that this PR supersedes #52, but is based atop @OTDE's commits from that branch to ensure they also receive contribution assignment for this PR once it lands on main.

changelog

I think the best way to onboard into this PR is probably to go through its new tour.jl, but here is a summary of the changes:

  • rm'd Legolas.Row in favor of the record types described in this document

  • Legolas.Schema -> Legolas.SchemaVersion, and associated name changes to more properly refer to "schema versions" where such changes are warranted. Also, the schema_ prefix has been dropped from many of the methods that were defined against this type.

  • rm'd methods against ::Type{<:Schema} in favor of methods against ::SchemaVersion (for similar reasons to why Base methods generally prefer ::Val{n} over ::Type{Val{n}}).

  • More disciplined/uniform API for handling schema identifiers, including replacing Legolas.Schema(id::AbstractString) with first(parse_identifier(id))

  • Legolas.@row has been replaced with Legolas.@schema (which simply declares a schema) and Legolas.@version (which declares a new version of a schema, and generates all associated record type definitions)

  • General overhaul of tour.jl and docs to both reflect the new changes and to hopefully provide a smoother / more thorough on-ramp for new users.

  • More comprehensive introspection/validation utilities:

    • required_fields
    • declaration
    • declared
    • find_violation
    • complies_with
    • accepted_field_type
  • validate(table, schema) --> validate(Tables.schema(table), schema). This works better/more generically with find_violation/complies_with, and just yields more control to the caller in general IMO. Furthermore, this makes it harder for callers to accidentally incorrectly assume that Legolas.validate validates row values.

  • Better-tailored error messages for Legolas.read/Legolas.write now that those methods don't need to punt to the removed validate method

  • The new Legolas-generated record types are a bit "dumber" than the former Legolas.Row, and it's more likely that row tables constructed via these record types will have directly inferrable Tables.Schemas. Thus, we've removed a small bit of Legolas' schema-guessing code in a manner that removes a bit of responsibility from Legolas.read/Legolas.write and makes them more efficient. The tradeoff is that these functions have a higher expectation that callers will provide them with tables whose Tables.Schemas are inferrable via Tables.schema.

  • expected properties of schema version declarations are now better validated/enforced:

    • schema version declarations may only contain one statement per required field
    • schema version declarations must include at least one required field
    • resolves "attempting to define a child schema of a non-existent parent schema should induce an UnknownSchemaError" (#51)
    • resolves "attempting to define a schema that violates its parent schema types should induce an error" (#53)
    • Previously, it was technically possible to alter a preexisting schema version declaration, and for such changes to be automatically reflected in the behavior of that schema version's children within the same Julia session. For example, you could redeclare a preexisting schema version foo@1 with an additional required field compared to its original declaration, which would de facto cause Legolas.validate to require the field for foo@1's child schemas. Since there was little enforcement of actual parent/child field compatibility in the first place, it was possible to redefine schema versions in such a manner that rendered them incompatible with their preexisting children. Now that we seek to better enforce parent/child schema version compatibility, we should strive to disallow this kind of accidental invalidation. We achieve this by bluntly disallowing the alteration of preexisting schema version declarations via redeclaration entirely, because a) there's not currently a mechanism for validating parent declarations against preexisting child declarations and b) the ability to alter preexisting schema version declarations is of questionable utility in the first place.
    • Enforcing the previous points also de facto enforces that a child can't accidentally be declared as its own parent
  • resolves "Legolas.Row type aliases impede ergonomic deprecation cycles" (#49)

  • obviates "throw better error messages for defensive implicit AbstractRecord convert failures" (#50)

  • resolves "Add function that returns a Schema's cols" (#41)

  • resolves "Columns of extension schemas should go after columns of parent schemas" (#12)

  • more comprehensive unit tests (though I'm sure there are more test cases we should add over time)

  • bumps minimum supported Julia version to 1.6, which is the latest LTS minor version, just to avoid maintenance overhead of dealing with Julia 1.3 compatibility. If any user actually requires Julia 1.3 support, please speak up and I'll reconsider.
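To make the changelog above concrete, here is a rough sketch of the new declaration/introspection surface. This is a hypothetical illustration, not runnable without the Legolas v0.5 proposed in this PR: the `example.foo` schema, the `FooV1` name, its fields, and the `table` value are all placeholders, and the exact signatures are defined by this PR's tour.jl (the macro spelling here follows the name-first form adopted later in this thread).

```julia
# Hypothetical illustration of the new surface; assumes Legolas v0.5 as proposed.
using Legolas, Tables

Legolas.@schema "example.foo" Foo       # declares the schema "example.foo"

Legolas.@version FooV1 begin            # declares version 1 of the schema and
    a::Real                             # generates the FooV1 record type
    b::String
end

sv = FooV1SchemaVersion()               # the SchemaVersion for FooV1
Legolas.identifier(sv)                  # the schema-qualified identifier string
Legolas.required_fields(sv)             # introspection utilities from the list above
Legolas.complies_with(Tables.schema(table), sv)  # `table` is a placeholder value
Legolas.validate(Tables.schema(table), sv)       # validates the Tables.Schema, not row values
```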

Breaking Change Summary

  • The removal/replacement of Legolas.Row in general. Note that, unless we add a deprecation path, the new Legolas version will not be able to automatically deserialize values that were previously serialized via the former Legolas.Row ArrowTypes overloads. Note though that this probably doesn't affect most Legolas-serialized data, unless e.g. you had a nested Legolas.Row value inside of a column of a table.
  • Legolas.@row -> Legolas.@schema + Legolas.@version
  • Legolas.Schema -> Legolas.SchemaVersion, and associated method name changes
  • methods against Type{<:Schema} -> methods against ::SchemaVersion
  • Schema(x::AbstractString) -> first(parse_identifier(x)) (the previous method allowed, but did not return, parent identifiers - and was buggy to boot, allowing invalid parent identifiers)
  • general removal of the schema_ prefix from SchemaVersion accessor methods
  • schema_qualified_string -> identifier
  • validate(table, schema) -> validate(Tables.schema(table), schema)
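A minimal before/after migration sketch for the items above. The `x@1` schema is the hypothetical one used elsewhere in this thread; the commented-out lines show the pre-v0.5 forms, and `table` is a placeholder. Exact details are defined by this PR's tour.jl.

```julia
# Before (pre-v0.5), hypothetically:
#   const XV1 = @row("x@1", x::Int)
#   s = Legolas.Schema("x@1")
#   Legolas.validate(table, s)

# After (this PR), same hypothetical schema:
using Legolas, Tables

Legolas.@schema "x" X
Legolas.@version XV1 begin
    x::Int
end

sv = first(Legolas.parse_identifier("x@1"))  # replaces Legolas.Schema("x@1")
Legolas.validate(Tables.schema(table), sv)   # replaces validate(table, s)
```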

Where to go from here

Note this PR bumps the version to v0.5, not v1.0, because there are still some obvious breaking changes to be made later:

My plan is to merge this PR without implementing automated deprecation paths. I think the breakage in this release (esp. w.r.t. Legolas.Row removal) is sufficiently nontrivial that any user who'd like to upgrade will probably need to actually manually upgrade, rather than fully rely on automated deprecation pathways. If it's worthwhile, we can add in depwarns for more trivial cases (e.g. simple method renames) in follow-up patch releases, but I'm reluctant to attempt more complex deprecation pathways that would have to handle a lot of edge cases in order to be correct (and thus have a higher risk of being subtly buggy).

Note that Beaconeers aren't expected to upgrade to v0.5 internally without dedicated support from the relevant team members (if you have a question about this, feel free to ping me in Beacon's internal Slack). So feel free to upgrade "early" if you want to for your use case, but keep in mind that it's not a requirement.

@hannahilea
Contributor

More comprehensive introspection/validation utilities [...]

<3

@jrevels
Member Author

jrevels commented Oct 6, 2022

I'm currently working on a branch to upgrade Onda to the intended Legolas v0.5 and running into an area of Legolas' design that is really tempting to try to improve before merging this.

I don't have time to write up a full explanation of the problem space now (will do so later) but suffice to say I'm tempted to reintroduce a struct-generation mechanism that is kind of like Legolas.Row but that doesn't allow "extension w/o declaration", i.e. the full field structure is statically known and forcibly encoded in the type definition. either that, or some schema-specialized "table builder" pattern

still unclear on how/whether the field types for such a mechanism should be allowed to be parameterized, or whether or not this notion is actually 1:1 w/ schema declaration

will think on it + follow up with a better breakdown of the problem space / motivations later

@jrevels
Member Author

jrevels commented Oct 16, 2022

will think on it + follow up with a better breakdown of the problem space / motivations later

Alrighty then! I jotted down a bunch of thoughts in an attempt to break down the problem/motivations as promised. It's a bit too much content for a single comment so I've dumped the write up in a gist here: https://gist.github.com/jrevels/fdfe939109bee23566d425440b7c759e

I'm going to forge ahead and refocus this PR towards implementing the proposal put forward in that gist. Will flip this into draft mode in the meantime. My hope is to get everything updated and mergeable by EOW.

@jrevels jrevels marked this pull request as draft October 16, 2022 23:06
@jrevels
Member Author

jrevels commented Oct 26, 2022

My hope is to get everything updated and mergeable by EOW.

Took a little bit longer than expected, but I've now merged #55 into this PR, which implements the aforementioned proposal.

I'm going to try to get CI green, then I'll flip this PR back out of draft mode. My hope is to get this merged/tagged on Thursday, modulo any major concerns brought up in review.

@jrevels jrevels marked this pull request as ready for review October 26, 2022 02:20
Member

@ericphanson ericphanson left a comment


I think this is great! I think this is a better system than the already-quite-useful @row one. I do have fears about a long tail of upgrading old tables, and more specific fears about arrow serialization still not being tested thoroughly (e.g. w/ multiple type parameters, w/ union type constraints, etc), and some more specific comments.

Dev ergonomicity 1: Revise

Redefining a schema was actually super nice when developing. It made Legolas Rows like Revisable structs, where you could do

@row("x@1", x::Int)

and then later realize you actually need a y field too, so you just put in the REPL

@row("x@1", x::Int, y::Float64)

and everything "just works" (since the type is the same and all the info is in the methods, which get invalidated).

I get that we don't want to be redefining schemas like this in package code, and don't want to be overwriting existing definitions etc, but... if there's some way to preserve the development ergonomicity, it would be super nice. However I'm not sure how that could actually be implemented now that we are really defining individual structs...

Dev ergonomicity 2: what was created by this macro

I think one issue with @version and @schema is that it's not really visible what happened when you run one of those, like what did it generate. @row was similar but it returned something, so you had some feedback that OK now I have this object and I can use it. But these return nothing, so you don't get a nudge as to what was created and what you can use. (Obvs it's in the docs/tour, but still it's kind of nice to have that automatic feedback).

In particular, the connection between @version("example.foo@1", ...) and FooV1 being created feels too loose (in terms of code clarity, not actual semantics).

One option is @version could return say FooV1, the type. But I worry this would encourage folks to do const MyFoo = @version(...) like we do with @row, and that would be kinda an anti-pattern here (since probably it's better to use the FooV1 name rather than introduce another one, unless there is a good reason to).

Another semi-weird option would be to try to introspect whether the macro is running in the REPL (vs. a package/precompiling/etc.) and, if so, print something like "FooV1 type has been created". I think that could go a long way toward making the package more approachable, even if it's a bit odd.

Or we could somehow try to get FooV1 into the written code itself, like writing the macro as @version(FooV1, ...) and using @schema to reverse the connection, meaning that @version(FooV1, ...) errors unless an @schema has been declared so that FooV1 can be associated to the example.foo@1 schema. In other words, you wouldn't have any more freedom in choosing how to spell the type (which is good), but you would have to write the type name there, so then you know it exists and when it shows up later in the code you know where it came from.

So that would look like:

@schema "example.foo" Foo

@version FooV1 begin
    a::Real
    b::String
end

@jrevels
Member Author

jrevels commented Oct 26, 2022

Or we could somehow try to get FooV1 into the written code itself, like writing the macro as @version(FooV1, ...) and using @schema to reverse the connection, meaning that @version(FooV1, ...) errors unless an @schema has been declared so that FooV1 can be associated to the example.foo@1 schema. In other words, you wouldn't have any more freedom in choosing how to spell the type (which is good), but you would have to write the type name there, so then you know it exists and when it shows up later in the code you know where it came from.

So that would look like:

Yeah, this does look prettier and clearer from a certain angle

The cons AFAICT:

  • it kind of emphasizes the record type over the schema version type
  • it obfuscates the schema version identifier itself a bit
  • it would require me to reimplement some stuff 😅

i think the huge pro, though, is the greppability improvement - even if you don't know what's going on with FooV1 at a callsite, you can find the FooV1 definition via text search. i think this is actually pretty important, and i think tips me in favor of this change

i think as part of this change we should also maybe just change FooSchemaV1 (which reads to me as "foo schema version 1") to FooV1SchemaVersion (which reads to me as "the SchemaVersion for FooV1"). Keeping the V1 next to the Foo i think preserves some of the greppability we're getting from the change. FooV1Schema would be shorter, but I don't like that it kinda conflates "schema" and "schema version", which this PR works pretty hard to cleanly delineate
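For concreteness, the proposed renaming would read like this (hypothetical `Foo` schema; names follow the convention described above):

```julia
# Hypothetical naming under the proposed convention:
FooV1               # the record type for version 1 of the schema
FooV1SchemaVersion  # "the SchemaVersion for FooV1" (previously FooSchemaV1)
```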

let the repaint of the bikeshed begin 😁

@jrevels
Member Author

jrevels commented Oct 26, 2022

I get that we don't want to be redefining schemas like this in package code, and don't want to be overwriting existing definitions etc, but... if there's some way to preserve the development ergonomicity, it would be super nice. However I'm not sure how that could actually be implemented now that we are really defining individual structs...

Yeah, makes sense. I'm not sure how to cleanly support this, but would be down if somebody wants to make a suggestion on how to do so.

Also worth noting that AFAICT it likely wouldn't be a breaking change if we figured out how to solve this in a subsequent release

jrevels and others added 3 commits October 26, 2022 11:18
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
@hannahilea
Contributor

if we figured out how to solve this

...could always poke the relevant upstream issue (Revise+struct) folks for this one? JuliaLang/julia#40399 via timholy/Revise.jl#18 (comment)

@jrevels
Member Author

jrevels commented Oct 27, 2022

let the repaint of the bikeshed begin

Okay, bikeshed repainted. Retagging @ericphanson for one last review pass 🙏

If there are no major concerns my plan is to merge by EOD tomorrow and tag the v0.5 release as stated in the OP.

Member

@ericphanson ericphanson left a comment


Looks good to me. I would suggest we also document some update guidance & best practices, maybe in other PRs (I know you said everyone will get help upgrading but I think it still helps to have some written guidance). E.g., questions I have would be:

  • when should we be making a schema extension vs using custom columns (given constructors drop custom cols)?
    • Should custom cols only be used when it's entirely dynamic what the column is, so it's not really possible to write a schema unless we generate it on-the-fly (which sounds like a very bad idea), or are custom columns more broadly applicable?
  • since we can only check type-constraints (not value constraints) without the constructors, whose responsibility is it to do that construction? the writer or the reader or both?
    • I am guessing the reader, or maybe more specifically, the caller of a function requiring that vetting, since you shouldn't really trust the writer did it, since who knows what they did
    • if it is the reader/caller, then this also means it's possible to serialize out custom columns, otherwise they would be dropped by constructing-before-writing, and you can read them in as long as you don't do that constructing
  • is it more OK to constrain function signatures by type now than before? E.g. if you want to express that you need a constructed-therefore-validated type as input
  • is it more OK to dispatch on record types now than before? Since your package "owns" the type rather than Legolas now, and they seem to have greater semantic meaning now
    • this is kinda the same as the previous question, but imo relying on dispatch for functionality is different semantically than e.g. having a function w/ a single method w/ a type constraint

also, unrelatedly, if we want table-level validation, and we are in the validate-by-construction framework, does that mean we need tabular types, not just row types?

Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
@jrevels
Member Author

jrevels commented Oct 27, 2022

when should we be making a schema extension vs using custom columns (given constructors drop custom cols)?

i'm not sure there's a hard rule, so i'd just say "when it's useful to do so" (which i know is annoying 😁 but maybe we can figure out a more thoughtful rule at some point). one's use of schema-aware tools/systems may incentivize this in different directions

since we can only check type-constraints (not value constraints) without the constructors, whose responsibility is it to do that construction? the writer or the reader or both?

For the reader, it's all about trust; for the writer, it's all about goals. A big part of this is whether you're reading from / writing to a schema-on-read system or a (trusted) schema-on-write system. For example, if I'm a reader that is reading from a trusted schema-on-write system, i'd probably want to avoid revalidating unless i'm forced to by e.g. using a tool that is unaware/agnostic of the provenance of its inputs, and thus forces its own validation

i have some design notes written down on this topic that i'd like to put into a design docs section of the docs

is it more OK to constrain function signatures by type now than before? E.g. if you want to express that you need a constructed-therefore-validated type as input

Yup, it is; but with the caveat that doing so also means biting the bullet that your code is now non-generic, and thus could "poison" otherwise generic code that callers might want to compose it with. which, in the worst case, could result in costly/burdensome conversion gymnastics (but in some cases, that might be okay)

is it more OK to dispatch on record types now than before? Since your package "owns" the type rather than Legolas now, and they seem to have greater semantic meaning now

Yeah, but maybe less so than the former one lol.

Basically it's okay to do these things if you understand/accept the consequences. i personally think a more duck-typed style is more useful/less frustrating in a lot of scenarios, given the plethora of types that a valid table/row value can take on. IMO the holy grail would be true structural typing support, but that might be a bit of a lofty non-goal in julia land

also, unrelatedly, if we want table-level validation, and we are in the validate-by-construction framework, does that mean we need tabular types, not just row types?

either that, or have tabular validation be treated separately. i actually explored introducing a table type in this PR a little bit before going with the record type approach

update guidance & best practices

agreed, i'll file an issue for this
