Why isn't df.col .= v in-place? #3200

Closed
gustafsson opened this issue Oct 16, 2022 · 22 comments · Fixed by #3201

@gustafsson
Contributor

gustafsson commented Oct 16, 2022

This issue is a question on the most recent release notes

On Julia 1.7 or newer broadcasting assignment into an existing column of a data frame replaces it. Under Julia 1.6 or older it is an in place operation. (#3022)

I expected df.col .= v to broadcast and do in-place assignment. But I see that's no longer the case in DataFrames.jl 1.4.

The recent release broke some code of mine (I must have missed any deprecation warnings). A simple workaround was coltemp = df.col; coltemp .= v but I don't understand the reason for the new behaviour. To me this seems to make DataFrame inconsistent with other containers in Julia and left me wondering why this inconsistency would be a wanted one.

This issue equally applies to df[!, :x] .= v.

Compare:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> x .= 1.5
ERROR: InexactError: Int64(1.5)

Whereas

julia> df = DataFrame(x=[1, 2, 3])
3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> x = df.x;

julia> df.x .= 1.5
3-element Vector{Float64}:
 1.5
 1.5
 1.5
 
julia> x === df.x
 false

As advertised df.x .= 1.5 does not work in-place but replaces the column, even with a new type.

If I put the vector in any other container, say, a Dict, NamedTuple or struct

julia> dt = Dict(:x=>[1, 2, 3])
Dict{Symbol, Vector{Int64}} with 1 entry:
  :x => [1, 2, 3]

julia> dt[:x] .= 1.5
ERROR: InexactError: Int64(1.5)

julia> nt = (x = [1, 2, 3],)
(x = [1, 2, 3],)

julia> nt.x .= 1.5
ERROR: InexactError: Int64(1.5)

julia> struct S
       x
       end

julia> s = S([1, 2, 3])
S([1, 2, 3])

julia> s.x .= 1.5
ERROR: InexactError: Int64(1.5)

They all behave the same. But a DataFrame behaves differently. Why is that?


The docs state "Since df[!, :col] does not make a copy" which to me makes it unexpected that it would create a new column rather than modifying the existing one.


For the use case of "create/replace column" we have df.x = v (akin to s.x = v or dict[:x] = v). Would there be any adverse side-effects of letting = broadcast scalars into new/replaced columns?


I understand there was a decision a year ago (#2804) to make df.x .= v work like df[!,:x] .= v but wouldn't a change to instead make df[!,:x] .= v work like df.x .= v have been more consistent with how containers in Julia typically work?

@bkamins
Member

bkamins commented Oct 16, 2022

There are three parts of the explanation of the current behavior (unfortunately this is a bit complicated).

Part 1. As opposed to other containers in Julia, DataFrames.jl assumes it gets ownership of the columns stored in it. For example, if you store a vector in a Dict, this Dict does not claim ownership of the vector. With DataFrame it is different. Columns stored in a DataFrame are owned by it. The user can only ask for them to be exposed in a copying or non-copying way. This decision was made to improve code safety in general.

In general if you say that

The recent release broke some code of mine

This is unfortunate, and sorry for this. As you noticed, we have had this deprecation for a long time, and in the previous release we announced that this change would be made.

However, this means that your code was unsafe. This does not mean that writing unsafe code is not allowed, but it is recommended not to write it unless you have some performance-critical parts. The only scenario when this change would break code is when you store column col in a DataFrame but at the same time keep it somewhere else (and rely on the fact that this vector stored elsewhere is identical to the vector in a DataFrame). Since DataFrame owns the column you should not do this. If you want a column from a DataFrame you should always query it to get it (as only the owner of some object gives you a single version of truth of the value of this object).

Part 2. The docs say "Since df[!, :col] does not make a copy" indeed, but this is for getting data from a data frame, not for setting data in a data frame. All the rules for indexing are described here. In general you can have three kinds of operations related to indexing:

  • indexing access (getindex)
  • indexing write (setindex!)
  • broadcasted assignment (broadcast!)

Each of them requires a bit different treatment.

Would there be any adverse side-effects of letting = broadcast scalars into new/replaced columns?

It would be near impossible to implement it correctly + it would be inconsistent with the whole design of Julia that requires explicit broadcasting.

Part 3. Now let me discuss broadcasted assignment solely. We need two behaviors. One that updates the column in-place. It is currently:

df[:, :col] .= v

and the other that allocates new column. This is currently:

df[!, :col] .= v
df.col .= v

If we applied your recommendation there would be no way to do the second operation (all three operations would have the same effect).

However, users need both. I.e. the second behavior (allocating new column) is a must have and quite often needed. See e.g.:

julia> df = DataFrame(col=[-1, 1])
2×1 DataFrame
 Row │ col
     │ Int64
─────┼───────
   1 │    -1
   2 │     1

julia> df[:, :col] .+= 0.5
ERROR: InexactError: Int64(-0.5)

julia> df[!, :col] .+= 0.5
2-element Vector{Float64}:
 -0.5
  1.5

julia> df.col .+= 0.5
2-element Vector{Float64}:
 0.0
 2.0

Users just want such operations to work. And we had to give them a way to do it. Also, note (you probably saw this example already) that sometimes instead of an error you get just an unexpected result:

julia> df = DataFrame(a=[0])
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     0

julia> df[:, :a] .= 'a'
1-element view(::Vector{Int64}, :) with eltype Int64:
 97

julia> df
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │    97

julia> df = DataFrame(a=[0])
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     0

julia> df.a .= 'a'
1-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> df
1×1 DataFrame
 Row │ a
     │ Char
─────┼──────
   1 │ a

(the second behavior is what people typically expect)

Now, what is the difference at its core? In general, if you write ! or use getproperty, DataFrames.jl indexing for assignment always behaves as if the column to which you write were not present in a data frame yet. So you do not have any side effects, regardless of whether it is present in a data frame or not. Most often this is something that users want (if you write generic code, you often want to write an expression that has a fully predictable effect). On the other hand, if you use : in assignment, DataFrames.jl behaves like Base Julia behaves if the column is present (i.e. it updates things in-place).

So in short, from the other perspective: now ! and getproperty always work the same way - which is also a desirable property.
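The distinction between the `:` and `!` selectors can be checked by comparing object identity before and after an assignment. The following is a minimal sketch, assuming DataFrames.jl 1.4 or newer on Julia 1.7 or newer (which matters for the `df.col .= v` step); the variable names are only illustrative.

```julia
using DataFrames

df = DataFrame(col=[1, 2, 3])
old = df.col                     # non-copying access: `old` is the stored vector

df[:, :col] .= 10                # `:` writes into the existing vector
inplace_same = df.col === old    # the column is still the same object

df[!, :col] .= 1.5               # `!` allocates a fresh column; eltype may change
replaced_same = df.col === old   # the column is now a different object

prev = df.col
df.col .= 2.5                    # assumed DataFrames.jl >= 1.4 on Julia >= 1.7:
getproperty_same = df.col === prev   # behaves like `!`, so a new column again

println((inplace_same, replaced_same, getproperty_same, eltype(df.col), old))
```

The vector bound to `old` keeps the values written through `:`, but it is no longer connected to the data frame after the `!` assignment.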

Now you might ask why it was implemented only in DataFrames.jl 1.4. The fact is that we wanted to have these rules already in DataFrames.jl 1.0 release, but it was not possible, because of limitations of Base Julia. Making this change was only possible after Julia 1.7 was released. That is why we had a long deprecation period. We announced that this change would be made in earlier releases, e.g. if you check DataFrames.jl 1.3 release notes you can read:

Planned changes
In DataFrames.jl 1.4 release on Julia 1.7 or newer broadcasting assignment into an existing column of a data frame will replace it. Under Julia 1.6 or older it will be an in place operation.

@gustafsson
Contributor Author

gustafsson commented Oct 16, 2022

Thanks a bunch for writing out this explanation for me @bkamins. Really appreciate it.

your code was unsafe.

No argument there. Using DataFrames in the first place is inherently type-unstable.

It would be near impossible to implement it correctly

Is that a challenge? ;)
I would attempt it, but not unless it's wanted.

it would be inconsistent with the whole design of Julia that requires explicit broadcasting.

Maybe the "whole design of Julia" is subjective? In my mental model of Julia it's clearly breaking the "whole design of Julia" to deviate from how all other containers work and let .= allocate something new instead of doing in-place broadcasted assignment. In my mental model of DataFrames as a collection of named columns with equal length it doesn't take any stretch of imagination to (re)create a column by giving it some initial value that needs to be materialized (such as with df.col = 1:3).

I also recognize that there are individual differences in mental models of what a language or library represents, and that mine is not necessarily a typical (or even "correct") one.

Users just want such operations to work. And we had to give them a way to do it
what people typically expect

Fair enough. Personally I would expect DataFrame to behave like other containers in Julia, and columns in a DataFrame to behave like Vectors (or AbstractVectors in general) behave in Julia. But I guess what people typically expect is more formed from pandas or R dataframes rather than Julia. It seems to me that's what this is really about? As a long time user of Julia (and typed languages) I understand that I'm not a typical user and it's important to get more people on board.


Thanks again! DataFrames.jl is a great library that I use daily :)

@bkamins
Member

bkamins commented Oct 16, 2022

Using DataFrames in the first place is inherently type-unstable.

The issue is not type stability, but object ownership. Julia does not have a well defined concept of ownership of a value (like Rust does), so it is more of a convention.

Let me start with explanation of the origin of the ownership concept related to a most common bug in code using DataFrames.jl.

Since you can extract a column as a vector from a data frame you can resize it. If you resize such a vector you invalidate a data frame (as every column of a data frame must have the same number of elements; note that this is a key difference from other containers - e.g. Dict of vectors does not have a restriction that every vector stored in it has the same number of rows).

It might seem an innocent issue and not very problematic, but in practice it is the opposite. Users, in the past, tended e.g. to create one data frame from another data frame without copying columns and then they were surprised that when they mutated one then the other was mutated. The most problematic case is when rows were added/removed as it invalidated the code. This was the most common problem in the past.

To resolve it we introduced a rule of column ownership and copycols kwarg. By default copycols=true in many functions so when you work on columns of a data frame a newly created object does not alias columns with the source.

However, we allow for copycols=false as in some performance critical code this is needed. However, users should be aware that this is an unsafe operation and should be super careful when working with such objects.

Similarly df.col and df[!, :col] operations are non-copying when getting data and therefore they are unsafe. If you extract a column from a data frame this way it is recommended that this is only done for short-lived objects and when you are sure that you will not break anything in the parent (a most common case when this is needed is when using function barriers).
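The aliasing problem described above can be reproduced in a few lines. This is a minimal sketch using the `copycols` keyword of the `DataFrame` constructor; the variable names are only illustrative.

```julia
using DataFrames

x = [1, 2, 3]
df_alias = DataFrame(x=x; copycols=false)  # column is the very same vector as x
df_copy = DataFrame(x=x)                   # default copycols=true: independent copy

x[1] = 100                                 # mutate the vector held outside

println(df_alias.x === x)   # the alias shares storage with x
println(df_alias.x[1])      # so the data frame sees the mutation
println(df_copy.x[1])       # the copied column is unaffected
```

Resizing `x` at this point (for example with `push!`) would leave `df_alias` with a column whose length no longer matches the data frame, which is the invalidation discussed above; the sketch deliberately does not do that.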

Now, why is all this done? Most of the DataFrames.jl users are not Computer Science experts. As you have said - many of them are entry level programmers, not necessarily even understanding all the details how things work underneath. Therefore we design DataFrames.jl to be as simple to use as possible by them. For example, the operation df.col .+= 0.5 when :col is an integer column was one of the most common cases where entry-level users did not understand why it fails.

Is that a challenge? ;) I would attempt it, but not unless it's wanted.

The DataFrame constructor, insertcols!, and combine and friends define a concept of pseudo-broadcasting. This is an approximation of broadcasting used in some of the functions in DataFrames.jl (you can read about how it is defined in the documentation). However, it is quite hard to cover every case of broadcasting fully correctly (especially since broadcasting assignment is defined in Base Julia for objects that exist, while in DataFrames.jl broadcasting can also be used for a non-existing object when you create a new column). For example, you write "scalar", but the concept of a scalar is not well defined in Julia. Users can define many strange types with complex expected behavior. For example, it would be unclear what should be done if the user passes an array of dimension higher than 1 (should it error, or should such an object be treated as a scalar?). For these reasons, except for the functions supporting pseudo-broadcasting mentioned above, we try to just rely on broadcasting rules that Base Julia defines.
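To make the pseudo-broadcasting idea concrete, here is a minimal sketch of two of the functions named above accepting a scalar, next to plain assignment rejecting one. It assumes DataFrames.jl 1.x; the rejection is expected to be the ArgumentError reported elsewhere in this thread.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=0)   # the constructor expands scalar 0 to 3 rows
insertcols!(df, :c => 1.5)         # insertcols! also expands a scalar

# Plain (non-broadcasted) assignment of a scalar is not pseudo-broadcast.
err = try
    df.d = 1
catch e
    e
end

println(df)
println(typeof(err))
```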

to deviate from how all other containers work and let .= allocate something new instead of doing in-place broadcasted assignment.

To some extent you are right.
However, if we wanted to be consistent with how other containers work with .= we should error if you wanted to add a column using broadcasting. Base Julia does not allow broadcasting assignment into a non-existing object. However, this is something that we needed to define in DataFrames.jl as adding a column to a data frame is one of the most common operations users want.

As a long time user of Julia (and typed languages) I understand that I'm not a typical user and it's important to get more people on board.

I think this is a good summary of the problem here. As I have said above, we need to make DataFrames.jl relatively friendly to users that are not used to strongly-typed languages (and this is the majority - as you have commented - coming from R/Python and wanting things to mostly "just work").

In fact, initially the DataFrame object design was what you propose, "a collection of columns", but over the years we needed to add and twist many features to make life easier for newbie users.

@bkamins
Member

bkamins commented Oct 16, 2022

By the way - in some style guides of corporate users there is a rule that using df.col and df[!, :col] is discouraged. The preference is given to df[:, :col] (which works fully consistently with Base Julia matrices, except that it allows adding columns for convenience). The non-copying column access is recommended only for exceptional cases of performance-critical code.

@bkamins
Member

bkamins commented Oct 16, 2022

Finally - your feedback is much appreciated. I am planning to write a blog post on this change (as it was one of the most painful decisions we needed to make in DataFrames.jl design in general) and the discussion with you is an important part of the preparation process.

Also note that historically df[:col] was supported (in times when data frame was considered to be a collection of columns) but this support was dropped. From the puristic perspective we ideally should drop df.col support also with the current design (as this is the last remaining element from the "collection of columns" thinking about DataFrame), but the df.col syntax is just too convenient for users + it has autocompletion support so we keep it.

@gustafsson
Contributor Author

Forgive me if I digress at times, but I find the conversation interesting:

The issue is not type stability, but object ownership.

Right. The issue I encountered arose from getting a different type under the new behaviour but the issue we're focusing on here is rather one of ownership.

I also think object ownership is a larger issue in Julia, not specific to DataFrames. Akin to how f!(x) is expected to modify x but f(x) might also change x, intentionally or not. I wouldn't mind having the option to specify some sort of constness in general in Julia. I often like to use type annotations to communicate intent (together with documentation and naming conventions). One could envision a ConstArray or FixedLengthArray that wraps a normal array but implements none or only a few of the mutable functions. I'm imagining a struct Column <: AbstractVector may similarly skip the functions that change the size but still allow setindex! and thus might help in making DataFrame code more safe?

copycols=true

I think this was a great decision. I appreciate a simple and clear option to be able to not get a copy when I don't want one.

in some style guides of corporate users there is a rule that using df.col and df[!, :col] is discouraged

I'm personally an advocate for only using DataFrames while manipulating data and then to stop using the DataFrame as soon as you can. I.e., to not let the DataFrame be some generic container with unknown columns that you pass around. This is related to the lack of type stability / lack of schema, not necessarily for performance but for correctness. (Note: I don't think DataFrames need a schema.) Which brings me to:

function barriers

Apart from the simple case of f(d.col) I've also found it really useful to convert a whole DataFrame to a NamedTuple in order to introduce type stability before passing a DataFrame through a function barrier:

NamedTuple(d::AbstractDataFrame) = (;zip(Tables.columnnames(d), Tables.columns(d))...)

This lets me work with data from the table in a dynamic way while writing performant code. And I can compute gradients with Zygote or ForwardDiff. Exceptional? Performance-critical? Maybe both. But it can also be used to effectively ensure a table schema for the purpose of correctness rather than performance.

For example, the operation df.col .+= 0.5 when :col is an integer column was one of the most common cases where entry-level users did not understand why it fails.

v .+= 0.5 with v::Vector{Int} also fails. If this is an issue it looks to me like more of an issue with the Julia language rather than an issue related to DataFrames.jl.

However, if we wanted to be consistent with how other containers work with .= we should error if you wanted to add a column using broadcasting.

Makes sense to me. That would be the expected behaviour of in-place assignment to something that doesn't exist. Like elementwise assignment to a Matrix outside of its dimensions (and I gather that consistency with matrices is expected and wanted?)

However, this is something that we needed to define in DataFrames.jl as adding a column to a data frame is one of the most common operations users want.

julia> df.col = 1
ERROR: ArgumentError: It is only allowed to pass a vector as a column of a DataFrame.
julia> df[!, :col] = 1
ERROR: MethodError: no method matching setindex!

I think users would also accept if this created a new/replaced column and filled it with the scalar (per the pseudo-broadcasting logic) instead of the error message even if they didn't use broadcast assignment. In much the same way the constructor accepts scalars (i.e. _preprocess_column(v, nrow(df), copycols)). I even think there's a case for supporting assignment of scalars without broadcasting for consistency with the constructor alone.

we try to just rely on broadcasting rules that Base Julia defines.

Yes! That's really awesome.

Also note that historically df[:col] was supported (in times when data frame was considered to be a collection of columns) but this support was dropped.

Those were the days. I might still be stuck in some ancient mindset.


Looking forward to that blog post!

@bkamins
Member

bkamins commented Oct 17, 2022

I've also found it really useful to convert a whole DataFrame to a NamedTuple

There is a predefined function to do this: Tables.columntable. Indeed this is often useful if data frame is not super wide.
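As an illustration of the function-barrier pattern with `Tables.columntable`, here is a minimal sketch. It assumes Tables.jl is available in the environment next to DataFrames.jl; `weighted_sum` is a hypothetical kernel invented for this example.

```julia
using DataFrames, Tables

# Hypothetical kernel: it receives a NamedTuple of vectors, so the column
# types are known to the compiler inside the function body.
function weighted_sum(cols)
    total = 0.0
    for i in eachindex(cols.a, cols.b)
        total += cols.a[i] * cols.b[i]
    end
    return total
end

df = DataFrame(a=[1.0, 2.0, 3.0], b=[4.0, 5.0, 6.0])
cols = Tables.columntable(df)   # NamedTuple with one vector per column
result = weighted_sum(cols)     # the call is the function barrier

println(result)
```

Because the column names and types become part of the NamedTuple type, this approach fits narrow tables best, in line with the remark above about data frames that are not super wide.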

@CameronBieganek

I haven't been following this change very closely, so take what I say with a grain of salt. However, I find this design choice a little befuddling.

Julia does not have a well defined concept of ownership of a value (like Rust does), so it is more of a convention.

Exactly. This was a design choice. It's quite natural to have variables aliased with DataFrame columns. To take one simple case:

x = [1, 2];
df = DataFrame(x=x; copycols=false)

Now x and df.x both refer to exactly the same vector: x === df.x. So it is quite odd that the same operation behaves differently depending on whether we reference the vector via x or df.x. I don't think this has anything to do with what containers should or should not do. The crux of the issue is that df.x is an AbstractVector. Every DataFrame column is an AbstractVector. So what you've done is change the behavior of in-place broadcasting on AbstractVector in one specific context. If you had a Column type you could do whatever you want, but don't change the broadcasting behavior for AbstractVector.

I think the fact that a change to Base Julia was needed in order to implement this design is a hint that this might not have been the best design choice.

@CameronBieganek

So it is quite odd that the same operation behaves differently depending on whether we reference the vector via x or df.x.

To elaborate on this, foo(x) and foo(df.x) should behave identically for any function foo. Why should broadcasting be magical?

@CameronBieganek

CameronBieganek commented Oct 19, 2022

Here's one final variation of my argument against this. (Sorry.)

The . property access operator is the highest-precedence operator in the language. Assignment (which includes .=) is the lowest-precedence operator in the language. So in the expression df.col .= v, I expect df.col to be evaluated to an AbstractVector before the broadcasted assignment occurs. Thus, I expect the broadcasting behavior to follow the usual broadcasting behavior for AbstractVector.

I admit it's not a perfect line of logic. The expression a.b = c evaluates to setproperty!(a, :b, c), but it's not really documented what a.b .= c evaluates to. But that's where my intuition lies, based on operator precedence rules.
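The intuition can be checked on plain Base Julia types without DataFrames. The sketch below uses a hypothetical `Recorder` type to show that `a.b = c` goes through `setproperty!`, and a NamedTuple to show that for an ordinary container `a.b .= c` fetches the vector and writes into it in place.

```julia
# Hypothetical toy type that records every property assignment.
struct Recorder
    log::Vector{Symbol}
end

function Base.setproperty!(r::Recorder, name::Symbol, value)
    push!(getfield(r, :log), name)   # remember which property was assigned
    return value
end

r = Recorder(Symbol[])
r.anything = 42                      # dispatches to the setproperty! method above

# For an ordinary container the left-hand side is fetched first,
# then the broadcast writes into that same vector.
nt = (x = [1, 2, 3],)
v = nt.x
nt.x .= 0

println(getfield(r, :log))
println(v === nt.x)
println(v)
```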

@bkamins
Member

bkamins commented Oct 20, 2022

We have thought about all these concerns. However, if we followed them the following:

df = DataFrame(a=1:3)
df.b .= 1

would have to error as well as:

df.c = 1:3

would have to error.

Now there was a design choice:

  • do we want getproperty to work exactly like in Base Julia;
  • do we want getproperty to work exactly like getindex with !.

Having this choice we opted for the second variant and everything else is a consequence. Note that this does not apply only to broadcasting, it also applies to non-broadcasted assignment.

Now, why have we not opted for the first variant? The reason is that getproperty in Base Julia assumes that the container you are querying has a fixed schema. This is not the case for the DataFrame type.

In summary - the current design is that df.a and df[!, :a] always work in the same way. We chose it because for non-experts it is a simpler rule to remember as then they have to learn only two rules: normal row indexing vs ! row indexing.

The alternative would be to have three sets of rules: for normal row indexing, ! row indexing and getproperty row indexing and the getproperty row indexing would error in cases where there is no column.

Indeed the choice breaks some assumptions advanced users might make about how getproperty works. However, as I have commented, most of our users are not advanced. The alternative would be to explain to them on their very first session with DataFrames.jl why the following code does not work (to repeat the example):

df = DataFrame(a=1:3)
df.c = 1:3

and everyone coming from R or Python will assume it should work.


As a side note, for completeness of the discussion (and further reference). Probably advanced users are surprised to see this:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.c = 1:3
1:3

julia> df.c
3-element Vector{Int64}:
 1
 2
 3

and would argue that column c should be a range, not a vector. Again, we have consciously decided to always perform this conversion, although it is expensive. However, most novice users expect that after such an operation the data frame will be resizable, which is possible with a Vector but not with a range.
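The practical effect of this conversion is that rows can be added afterwards. A minimal sketch, assuming DataFrames.jl 1.x and using illustrative names:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
df.c = 1:3                               # the range is materialized on assignment
col_is_vector = df.c isa Vector{Int}     # stored column is a Vector, not a range

push!(df, (a=4, c=4))                    # adding a row needs resizable columns

println(col_is_vector)
println(df)
```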

Here I do not want to justify breaking the rule a second time because we already did this some time ago. Rather I want to emphasize that the design choices we make are geared towards novice users. It is much easier for an expert to learn one exception than to teach a novice that something does not work as this novice would expect given their previous background in R/Python.

@gustafsson
Contributor Author

would have to error.

I disagree. Conversion upon assignment is not controversial. A range could become a vector if that's what the lefthand side can hold.

julia> d=Dict{Symbol,Vector}()
Dict{Symbol, Vector}()

julia> d[:col] = 1:3
1:3

julia> d
Dict{Symbol, Vector} with 1 entry:
  :col => [1, 2, 3]

Something not iterable or of length 1 could be "fill"ed.

Conversion of the left hand side upon broadcasted assignment is controversial though.

Wouldn't a novice be more likely to try df.b = 1 rather than df.b .= 1 upon their first session with DataFrames?

I'd argue that the current approach is to explain to them to use a new advanced syntax instead of something straightforward and simple that would align well with the rest of the language.

julia> df = DataFrame();

julia> df.b = 1
ERROR: ArgumentError: It is only allowed to pass a vector as a column of a DataFrame. Instead use `df[!, col_ind] .= v` if you want to use broadcasting.

@gustafsson
Contributor Author

Now there was a design choice

Sorry for being obtuse as you tried to explain this already. But I don't follow how that design choice is related to this thread.

df.b and df[!, :b] could be used interchangeably in all examples in this thread from what I can see.

@gustafsson
Contributor Author

Upon broadcasted assignment it is also controversial to create a new copy of the left hand side

Like df.b .= [1,2,3] allocates memory even if it doesn't need to? Surely that's a bug?

@bkamins
Member

bkamins commented Oct 20, 2022

Sorry for being obtuse as you tried to explain this already. But I don't follow how that design choice is related to this thread.
df.b and df[!, :b] could be used interchangeably in all examples in this thread from what I can see.

This is the point. They could be used interchangeably and this was a design choice.

An alternative would be to make the behavior of df.b and df[!, :b] not identical in all cases (and this is what we wanted to avoid).

Like df.b .= [1,2,3] allocates memory even if it doesn't need to? Surely that's a bug?

This is intentional and a design choice. Actually it is quite important that it behaves this way.
For example if you write:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.b = df.a
3-element Vector{Int64}:
 1
 2
 3

julia> df.c .= df.a
3-element Vector{Int64}:
 1
 2
 3

julia> df.b === df.a
true

julia> df.c === df.a
false

then you see that b and a are aliases, but c is not an alias of a.

The contract of broadcasted assignment for df[!, :b] .= v and df.b .= v is that a fresh column is always allocated, so you are sure no aliases are created. The reason is that most likely users do not want aliased columns (aliased columns are, unfortunately, another common source of bugs).

A range could become a vector if that's what the lefthand side can hold.

The point is that in general a column can be any vector type (not only Vector), except for ranges. So ranges are an exception.

Wouldn't a novice be more likely to try df.b = 1 rather than df.b .= 1 upon their first session with DataFrames?

Indeed this is what a user would try to do. And then, as you pointed out, they would be instructed to do df[!, :b] .= 1. However, the next thing that the user would try (at least I would) is checking what df.b .= 1 would do (since as a user I started with df.b = 1).

@CameronBieganek

CameronBieganek commented Oct 21, 2022

df = DataFrame(a=1:3)
df.c = 1:3

and everyone coming from R or Python will assume it should work.

Actually, that does not work in pandas:

In [1]: import pandas

In [2]: pandas.__version__
Out[2]: '1.4.2'

In [3]: df = pandas.DataFrame({"a": [1, 2]})

In [4]: df.b = [3, 4]
<ipython-input-4-ee8d286e409b>:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  df.b = [3, 4]

In [5]: df
Out[5]: 
   a
0  1
1  2

It does not bother me that df.newcol = 1:3 works and collects the range into a vector, because it is well documented that a.b = c evaluates to setproperty!(a, :b, c) and that setproperty! can be overloaded. However, Base.dotgetproperty is not documented.

This is my understanding of what happened:

  1. DataFrames overloaded Base.dotview so that df[!, :newcol] .= 1 works.
    • Base.dotview is undocumented and is not a part of the public API for Base Julia. Matt Bauman even said that "The reason-for-being for dotview wasn't package extension...".
    • Since Base.dotview is not public API, a minor release of Julia could break DataFrames. So the current compat bound for Julia is incorrect. It should be changed from julia = "1.6" to julia = "~1.6".
    • Let me emphasize this. One can read the entire Julia manual and not be able to explain why df[!, :newcol] .= 1 does not throw an error. There is no Julia interface that allows this behavior. It should throw an error.
  2. Now df.newcol .= 1 was inconsistent with df[!, :newcol] .= 1, so DataFrames requested that Base Julia add another undocumented function called Base.dotgetproperty.
  3. Julia added the function Base.dotgetproperty and DataFrames used it to allow df.newcol .= 1.
    • Base.dotgetproperty is undocumented and is not a part of the public API for Base Julia.
    • Since Base.dotgetproperty is not public API, a minor release of Julia could break DataFrames. So the current compat bound for Julia is incorrect. It should be changed from julia = "1.6" to julia = "~1.6".
    • Let me emphasize this. One can read the entire Julia manual and not be able to explain why df.newcol .= 1 does not throw an error. There is no Julia interface that allows this behavior. It should throw an error.
    • DataFrames is now the only place in the entire ecosystem where the meaning of a.b .= c has been changed.

I think the old saying "two wrongs do not make a right" applies here. If the Julia language had a way to enforce public versus private API, then these sorts of shenanigans would not even be possible.

In my opinion, if you want to change the meaning of a well established syntax, then you should use a macro. A better way of expressing the concept "set a column to a new vector containing a repeated value, whether or not that column already exists" would be something like this:

@fill df.c 1

which of course would evaluate to this:

df.c = fill(1, nrow(df))
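For illustration only, such a macro could be sketched as below. `@fill` is the hypothetical name proposed in this comment, not part of DataFrames.jl, and the sketch only handles targets of the form `obj.name`.

```julia
using DataFrames

# Hypothetical macro from the proposal above; not a DataFrames.jl API.
macro fill(target, value)
    if !(target isa Expr && target.head === :. && length(target.args) == 2)
        error("usage: @fill df.col value")
    end
    obj = esc(target.args[1])   # the data frame expression, in caller scope
    name = target.args[2]       # QuoteNode holding the column name
    val = esc(value)            # the fill value, in caller scope
    return quote
        let d = $obj
            # Equivalent to `d.name = fill(value, nrow(d))`.
            setproperty!(d, $name, fill($val, nrow(d)))
        end
    end
end

df = DataFrame(a=[1, 2, 3])
@fill df.c 1      # creates column c
@fill df.a 0.5    # replaces column a, changing its element type

println(df)
```

Since the expansion is an ordinary property assignment, it relies only on `setproperty!`, which is the documented overload point mentioned earlier in this comment.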

Anyhow, that's probably all I have to say about this. It is what it is, and I doubt that this design will be reverted.


*One point of clarification. My comment that "It should throw an error," is based on the fact that df[!, :newcol] and df.newcol both throw an ArgumentError. If they returned a LazyNewColDataFrame or something like that, then df[!, :newcol] .= 1 and df.newcol .= 1 need not throw errors. However, that's obviously not currently the case for the DataFrames getindex and getproperty methods. It makes sense that getindex and getproperty throw errors on non-existent columns; it would probably be strange if they returned a LazyNewColDataFrame instead.

@bkamins
Member

bkamins commented Oct 21, 2022

@gustafsson + @CameronBieganek : it is truly a pity we did not have this discussion 2 years ago.
Currently I am extremely hesitant to change things here. We have announced a long time ago that this change would be made.
Then we waited a long time before making it. And now we released it.

The clean up of indexing, that was done around 2019 and 2020 (if I recall correctly) was mainly to simplify rules and make them consistent (what we had previously was very complex and full of special cases). And making ! selector + making getproperty mean the same as ! was part of this simplification. This was a main intention.

I am not sure where we would end up with the decision, but you raised issues that were clearly previously not considered enough.

As I have written earlier, while this syntax is debatable, in my experience (fortunately) the change we made should not affect much user code (although, unfortunately, it will affect some).

What I can do is:

  • first, make sure that if anything changes in Base Julia internals we properly reflect it in DataFrames.jl in a Julia version aware way (I will take care of this; tests are in place to catch such a situation)
  • improve documentation; this should be done in "improve manual entry of assignment to a data frame" (#3201)
  • clearly explain the feature in a blog post

Side note.
Investigating what @CameronBieganek commented about pandas led me to check R:

> df = data.frame(a=1:2)
> df$b = 11:12
> df
  a  b
1 1 11
2 2 12

but

> df = data.frame()
> df$b = 1
Error in `$<-.data.frame`(`*tmp*`, b, value = 1) : 
  replacement has 1 row, data has 0

which is different from what we do in Julia (this is a comment showing how complex corner cases of things are).

@CameronBieganek

CameronBieganek commented Oct 21, 2022

Perhaps you could open a Julia issue requesting that dotview and dotgetproperty be documented and made part of the Base API. (Though maybe they could have better names.) I'm not thrilled about the idea of allowing packages yet another layer of customization of the behavior of assignment and broadcasting, but perhaps it wouldn't be too bad. 😅

If the request was approved and added to a future Julia version, technically you should probably lower bound DataFrames to the Julia version that releases that new API. In theory older versions of Julia could have new patch releases that change the behavior of dotview and dotgetproperty. However, I would understand if in practice you did not lower bound to the new Julia version and assumed that patch releases of older versions won't change dotview or dotgetproperty.

There seems to be a fair number of packages that have overloaded Base.dotview, so maybe there are some compelling use cases for exposing dotview and dotgetproperty. (I haven't actually looked into how any of those packages are making use of that overload.)

@nalimilan
Member

Let's not make the issue look more serious than it is. There's no reason to think that dotview or dotgetproperty are going to be removed in a future Julia version, as their addition was validated by Julia core devs with full knowledge of the intended use in DataFrames. There are other internal Julia functions which are de facto public API because many packages rely on them. Given that DataFrames is one of the most popular packages, it won't be broken without careful consideration. And the probability that they would be dropped in a patch release is null.

Anyway, this behavior has been released, so removing it now would be breaking. The most radical change we could do is deprecate it so that it can be removed in 2.0 (which isn't planned at all at this point).

However it would seem like a good idea to document this in Julia so that it's clearly considered as part of the public API.

@CameronBieganek

CameronBieganek commented Oct 21, 2022

There are other internal Julia functions which are de facto public API because many packages rely on them.

That's scary. 😂

Let's not make the issue look more serious than it is.

I agree that breakage is not likely, but with the current situation there is still a serious issue with logical consistency. As I mentioned above, one simply cannot deduce from the defined Julia interfaces that df.newcol .= 1 will not throw an error, given that on its own df.newcol does throw an error. My first reaction upon seeing that behavior was, "How can you do in-place assignment on a column that doesn't exist yet?" Any semi-knowledgeable Julia programmer who stops to think about it will likely get confused.

@nalimilan
Member

Maybe they will be intrigued the first time they see it, but I don't see a particular risk of confusion leading to bugs, which is what we generally try to avoid when designing the API.

gustafsson added a commit to gustafsson/DataFrames.jl that referenced this issue Oct 25, 2022
@gustafsson
Contributor Author

I for one was very confused and couldn't get my head around what to expect from the behaviour of df.col. I mean I could look up the rules in the docs, but I didn't have a mental model from which I could intuitively derive its properties. So learning that some organizations advise against using either df.col or df[!, :col] made sense to me, because these things didn't do what I expected either. It's unfortunate, though, if the simplest way of using the API is advised against because of common pitfalls.


I'll try to summarize my own understanding of the question that started this thread:

Why isn't df.col .= v in-place?

Base Julia states that broadcast! should "store the result in the dest array". And although undocumented, df[!, :col] .= ... seems to mean dotview followed by an implicit call to copyto!, which I would translate to "in-place copy to a view". But neither of these concepts applies to df.col .= v or df[!, :col] .= v.

Why?

It's a design choice to appease newcomers. The reasoning (from what I gather) goes something like this:

Newcomers would want df.col = 1 but that isn't allowed. So a user is encouraged to instead use df.col .= 1

A new user probably hasn't seen .= before but they have probably used R. It is then assumed that the user more or less expects <- from R when they use .= in Julia.

In essence: df.col .= v isn't in-place because df$col <- v in R isn't in-place.

I note that df.col = v in pandas seems to be in-place if df.col and v have the same effective dtype and not in-place otherwise. Pandas requires that col already exists.

If it's not in-place, what is it then?

df.col is a special thing that behaves like nothing else. On the RHS, df.col provides non-copying access; on the LHS of =, it replaces the column. But on the LHS of .= it also replaces any existing column. On the LHS an alias is sometimes created and sometimes a new column is allocated. For instance, it allocates when you're explicit about not allocating, i.e. when using .=. Whether df.col = v creates an alias or not depends on which subtype of AbstractDataFrame we're working with.

How can I do what .= normally does?

In-place broadcasted assignment is instead supported through df[:, :col] .= v and df.col[:] .= v
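
A small sketch of the difference, assuming DataFrames.jl 1.4 or newer on Julia 1.7 or newer (on older versions the last step behaves differently, as the release notes quoted at the top say):

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])
col = df.x                       # non-copying access to the stored vector

df[:, :x] .= 10                  # in place: writes into the existing vector
inplace_rowsel = (col === df.x) && col == [10, 10, 10]

df.x[:] .= 20                    # in place: ordinary broadcast into a view of the vector
inplace_view = (col === df.x) && col == [20, 20, 20]

# In place also means the element type cannot change.
threw_inexact = try
    df[:, :x] .= 1.5
    false
catch err
    err isa InexactError
end

# By contrast, this replaces the column under the versions assumed above.
df.x .= 1.5
replaced = col !== df.x
```

With the assumed versions, `replaced` should come out `true` and `eltype(df.x)` should be `Float64`, while `col` still holds the old `Int64` values.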


I am extremely hesitant to change things here

And rightfully so. All the effort you put into this library is super appreciated. Being able to improve existing functionality and add new features while maintaining a backwards-compatible API is no easy feat. Necessarily breaking changes are years apart. Kudos for that!

Anyway, this behavior has been released, so removing it now would be breaking. The most radical change we could do is deprecate it so that it can be removed in 2.0 (which isn't planned at all at this point).

I'll give my 2¢ for 2.0 in a PR to make the suggestion concrete.


It would be near impossible to implement it correctly + it would be inconsistent with the whole design of Julia that requires explicit broadcasting.

I don't interpret my proposal as inconsistent with the design of Julia. My interpretation of the design of Julia would allow DataFrames.jl to freely decide how to convert values such that they can be assigned to the properties of its types. Notably, = does not always mean "create an alias" in Julia. On the contrary, an implicit call to convert is often involved.
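
To illustrate the point about convert with plain Base Julia (the `Holder` type is made up for this example):

```julia
# Hypothetical container with a typed field.
mutable struct Holder
    v::Vector{Float64}
end

h = Holder(Float64[])

ints = [1, 2, 3]
h.v = ints                        # default setproperty! converts to Vector{Float64}
aliased_ints = (h.v === ints)     # the converted vector is a new object

floats = [1.0, 2.0]
h.v = floats                      # already the right type, so convert is a no-op
aliased_floats = (h.v === floats)
```

So the same `h.v = value` syntax sometimes stores an alias and sometimes stores a converted copy, depending on the type of the right-hand side.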

I gave it a go to implement this. Was it correct? It relies on the existing pseudo-broadcasting logic from the constructor. I got the unit tests to work under this new behaviour, that's not nothing.
