Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metadata #48

Merged
merged 44 commits into from
Sep 19, 2022
Merged

add metadata #48

merged 44 commits into from
Sep 19, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented May 8, 2022

Following the discussion in JuliaData/DataFrames.jl#2961 I propose to have getmetadata be a function defined on DataAPI.jl level. In this way in particular:

  • Arrow.jl take any table and if it supports getmetadata write appropriate metadata to arrow file;
  • DataFrames.jl DataFrame constructor taking a table can check if it supports metadata and if it does automatically attach this metadata to a DataFrame;

@bkamins bkamins requested review from nalimilan and quinnj May 8, 2022 06:19
@codecov
Copy link

codecov bot commented May 8, 2022

Codecov Report

Base: 95.12% // Head: 96.55% // Increases project coverage by +1.42% 🎉

Coverage data is based on head (c5f699e) compared to base (e8f7842).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #48      +/-   ##
==========================================
+ Coverage   95.12%   96.55%   +1.42%     
==========================================
  Files           1        1              
  Lines          41       58      +17     
==========================================
+ Hits           39       56      +17     
  Misses          2        2              
Impacted Files Coverage Δ
src/DataAPI.jl 96.55% <100.00%> (+1.42%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow review here; (getting ramped up in the new job and trying to get caught up a on a lot of Julia stuff + prioritize for the future). I like the name change to just metadata and getting this merged. I can make the required changes in Arrow.jl.

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits May 22, 2022 13:04
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/DataAPI.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Though we may want to hold this until we have an implementation ready in DataFrames.jl and/or in Tables.jl just in case we discover unanticipated needs.

@bkamins
Copy link
Member Author

bkamins commented May 22, 2022

I will do implementation in DataFrames.jl then and I understand that @quinnj will do implementation in Arrow.jl - right?
I guess these would be "pending PRs" waiting for this to be merged and released, but at least we would see if there were no implementation issues.

@nalimilan - What would you want implemented in Tables.jl?

@bkamins bkamins changed the title add getmetadata add metadata May 22, 2022
@bkamins
Copy link
Member Author

bkamins commented May 22, 2022

I have started the implementation and see one problem. I propose to discuss it here (although it is DataFrames.jl specific).

Assume that I want to add metadata to some data frame that does not have metadata yet. I run medatada(df) and I get nothing. Now there is no way to add metadata to nothing. So, while I am OK to return nothing by default, I think that in DataFrames.jl we should always return Dict{String}. The only thing I would do is that creation of this dictionary would be lazy - i.e. by default it would be nothing, but if user called metadata then the dictionary would be created on a first call. Technically then this should be metadata! as it would mutate the source data frame, but I think it is OK, to call this function metadata and hide the fact that the creation of the dictionary is delayed.

@nalimilan - is this OK for you?

@nalimilan
Copy link
Member

Yeah it sounds fine to create the dict on the first call to avoid returning nothing.

BTW, instead of recommending users mutate the Dict directly we could have a metadata! function that could give a somewhat simpler API for basic operations, with e.g. metadata!(df, "month" => "January") instead of metadata(df)["month"] = "January" (which looks awfully like setting dimension names in R). But probably better start with a minimal API for now.

@bkamins
Copy link
Member Author

bkamins commented May 22, 2022

I started implementing it and it seems that we need metadata! the way you proposed it. The reason is that we need to return nothing if there is no metadata in a data frame. Otherwise there would be no way to check that some object does not have metadata (i.e. checking for metadata would always create an empty dictionary which could be unwanted).

The metadata! signature should be:

metadata!(table, pair; mode::Symbol=:overwrite)

where mode would be:

  • :overwrite -> new metadata overwrites old one if it exists
  • :leave -> if metadata existed it is unchanged (i.e. new value is ignored)
  • :error -> throw an error if existing key is tried to be assigned

Side note: to be able to add metadata to DataFrame we have to make it mutable struct (now it is struct so we could not change the value from nothing to dictionary).

@nalimilan - do you have any additional thoughts on these two points?

@bkamins
Copy link
Member Author

bkamins commented May 22, 2022

In JuliaData/DataFrames.jl#3055 I have drafted the implementation so you can see the important aspects of the design.
(I have made the draft minimal - including only essential parts so that it is easy to review at this stage)

@nalimilan
Copy link
Member

Indeed I also realized that. :-/

Now that you list the requirements, metadata! isn't super appealing as a general solution. It's nice for simple things like metadata!(df, "month" => "January"), but having to add a mode argument amounts to reinventing the Dict API, forcing users to learn a new way to what is basically modifying a Dict. And that doesn't even include a syntax to remove a key or remove all metadata...

One alternative would be to have an hasmetadata function like in Metadata.jl that could be used to check whether there's any metadata instead of calling metadata.

Cc: @Tokazama

@bkamins
Copy link
Member Author

bkamins commented Aug 2, 2022

Thank you for the feedback. I think we have settled the design. If there will be no additional comments tomorrow I will start updating JuliaData/DataFrames.jl#3055.

@bkamins
Copy link
Member Author

bkamins commented Aug 3, 2022

OK - I am starting to implement the API in DataFrames.jl 😄.

@bkamins
Copy link
Member Author

bkamins commented Aug 3, 2022

I have started working on DataFrames.jl and I already have a decision to be made.
I will focus on table-level metadata as for column level we have the same issue.

In this PR we have metadata! for adding metadata. But we will also need a method to drop metadata. There are two issues:

  • what would be the best method name for a method that drops metadata for key for table x. I would propose deletemetadata!(x, key).
  • should we define metadata! and deletemetadata! in DataAPI.jl or it is enough to define accessor methods in DataAPI.jl and leave metadata setting/deleting to be package specific. Thinking of it I would propose to define it in DataAPI.jl, to have shared function names across packages.

@bkamins
Copy link
Member Author

bkamins commented Aug 3, 2022

@nalimilan - I have added deletion to the API so that we can evaluate if we like it.

@nalimilan
Copy link
Member

Adding this to DataAPI sounds better than having a separate package, anyway empty definitions are cheap. The deletemetadata!(x, key) API sounds good, but I have an hesitation about deletemetadata!(x): shouldn't it be emptymetadata!(x) for similarity with the collections API? That's one more function, but the same argument would apply to collections where delete!(x) could have been used instead of empty!(x).

@bkamins
Copy link
Member Author

bkamins commented Aug 3, 2022

I am OK with emptymetadata!.

But then for columns we will have:

  • deletecolmetadata!(x, col, key)
  • emptycolmetadata!(x, col)
  • emptycolmetadata!(x)

?

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated
One of the uses of the metadata `style` is decision
how the metadata should be propagated when `x` is transformed. This interface
defines the `:none` style that indicates that metadata should not be propagated
under transformations. All types supporting metadata allow at least this style.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - maybe we should make this description more precise? Currently in DataFrames.jl I needed to make a decision when :none metadata is kept and it is kept only in two cases:

  • DataFrame constructor;
  • copy;

all other operations drop all :none metadata. So, essentially both table level and column level :none metadata are attached to a concrete instance of a table or its copies (this is a safest approach, i.e. making sure that indeed when metadata could be invalidated it is dropped). Are we in agreement here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given no response I will make the definition more precise.

metadatakeys(::Any) = ()

"""
metadata!(x, key::AbstractString, value; style)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, maybe the syntax would be more natural as metadata!(x, key => value)? That would allow extending this in the future to pass multiple pairs if it appears to be convenient.

A counter-argument is that setindex! doesn't use that syntax, but it's almost never called that way since x[key] = value is nicer. Of course both syntaxes could be allowed as they are not ambiguous (we will probably never allow keys to be pairs).

Copy link
Member Author

@bkamins bkamins Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a first reaction it makes sense.

My only reservation was that in DataFrames.jl => is used for operation specification language, so we would have yet a third way to interpret => there.

The question is how would it look for colmetadata!?

colmetadata!(x, col, key => value; style=style)

(which does not look that nice)

Also note that metadata!(x, key => value) would not be allowed, you would need to write metadata!(x, key => value; style=style).

In summary - I would keep the things as they are here and consider what we design here a low-level API.

I assume that the extra new package planned (tentatively named TableMetadataTools.jl) will provide convenient high-level functions. In practice I even expect that if we define there:

caption!(table, str) = metadata!(table, "caption", str, style=:note)
caption(table) = metadata(table, "caption")
label!(table, col, str) = colmetadata!(table, col, "label", str, style=:note)
label(table, col) = colmetadata(table, col, "caption")

this will cover 95% of use cases of metadata in practice.

In summary - I propose to discuss a convenience high-level API in TableMetadataTools.jl, as I expect that in that package we will drop the requirement to specify style which we have in low-level API, as in high level API all styles will be :note.

@bkamins
Copy link
Member Author

bkamins commented Sep 17, 2022

I changed :none to :default style that is default without propagation.

@nalimilan - can you please recheck and comment if it can be merged (I guess in JuliaData/DataFrames.jl#3055 we have converged with the implementation). Thank you!

@bkamins bkamins merged commit 32ef840 into main Sep 19, 2022
@bkamins bkamins deleted the bkamins-patch-2 branch September 19, 2022 18:34
@bkamins
Copy link
Member Author

bkamins commented Sep 19, 2022

Thank you! The ball starts rolling (the metadata discussion is the most complex addition we made ever in the ecosystem)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

metadata method
7 participants