Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC/WIP: creation of mapdict method to modify Dict values #31223

Merged
merged 18 commits into from
Mar 14, 2019
Merged

RFC/WIP: creation of mapdict method to modify Dict values #31223

merged 18 commits into from
Mar 14, 2019

Conversation

ndinsmore
Copy link
Contributor

This is the initial pass at the inclusion of methods mapdict & mapdict! which allows the modification of Dict values without the overhead of using the Dict keys to access & store the values.

The need for this method was discuss at:
https://discourse.julialang.org/t/fast-dict-value-modification-by-accessing-dict-vals
At the recommendation of @tpapp this was built out for base.

An example of the use would be as follows:

julia> D=Dict(:a=>1,:b=>2)
Dict{Symbol,Int64} with 2 entries:
  :a => 1
  :b => 2

julia> mapdict!(v->v-1,D)
Dict{Symbol,Int64} with 2 entries:
  :a => 0
  :b => 1

julia> D=mapdict!(v->Float64(v),D) #For the type mutating case you must use an =
Dict{Symbol,Float64} with 2 entries:
  :a => 0.0
  :b => 1.0

Currently this implementation handles most of the function cases that were brought up:
1.) Functions which return the same type as the input
2.) Functions that are mutate the value type but are typesafe
3.) Functions which are not type safe.

While the function is currently only implemented for Dict, it would be simple to add a naive fallback for AbstractDict using keys.

If there is interest in including this in Base, test would also be built out.

Benchmarking results for essentially mapdict!(v-> v > row ? v-1 : v,D) where row is set to change half the values.

image

@chethega
Copy link
Contributor

chethega commented Mar 1, 2019

API question: Should we really return a different dict if the types don't fit?

I'd prefer setindex! / map!(f, a, a) semantics: convert the new value to V for dict::Dict{K,V} where {K,V}.

Can we improve on flexibility by an additional (keyword? Valtype?) argument passkey::Bool, that calls f(key, value) instead of f(value)?

Can we improve flexibility by passing and additional (keyword? Valtype?) argument delkeys::Bool, that optionally deletes entries? I.e. when set, the loop would use newval, del = f(value) resp newval, del = f(key, value), and delete the entry instead of updating if del is true.

Constant propagation should be able to handle that, i.e. one would need no source-code duplication and get no runtime overhead inside the loop.

@kmsquire
Copy link
Member

kmsquire commented Mar 1, 2019

I like this. I'm wondering if a more specific name would be better? I can imagine a different function which maps (renames) keys instead of values, so maybe call this one mapvalues, so that one could be called mapkeys?

@chethega
Copy link
Contributor

chethega commented Mar 1, 2019

I can imagine a different function which maps (renames) keys instead of values

Renaming/changing keys cannot be reasonably done inplace without temporaries (because of the possibility of needing a rehash).

However, the function f could get access to key without any cost, and we can delete entries without any cost (because that cannot ever trigger a rehash). Hence mapfilter! or something. The only downside is that end-users would need to write their function f in a way that accepts the key (e.g. by ignoring it, if unneeded during the computation) and returning (newval, false) instead of newval if they don't need filtering.

That is potentially a heavy downside, so one could do something like

@inline _mapfilterdict!(fun, dict::Dict{K,V}) where {K,V}
    #the actual implementation, not exported
    #always mutating. Never returns a different dict.
    return dict
end

@noinline mapfilterdict!(fun::F, d::Dict) where {F} =  _mapfilterdict!(fun, d)
@noinline mapdict!(fun::F, d::Dict) where {F} =  _mapfilterdict!((k,v)->(fun(v), false), d)
@noinline mapkeydict!(fun::F, d::Dict) where {F} =  _mapfilterdict!((k,v)->(fun(k,v), false), d)

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Mar 3, 2019

I agree with liking the functionality but preferring a name that indicates mapping over values. The names mapvalues! and mapvals! seem like obvious options.

@kmsquire
Copy link
Member

kmsquire commented Mar 4, 2019

Alternatively, since map is over the values of an array, one could argue that this could simply be map. I believe @andyferris had proposed something along these lines previously.

@ndinsmore
Copy link
Contributor Author

I like mapvalues over map as it would be a better fit for map to act on the (key=>value) pairs. I also like the reduction of ambiguity in making it clear this is operating on the values.

Is everyone okay with the mutating and type widening behavior?

@chethega
Copy link
Contributor

chethega commented Mar 4, 2019

Is everyone okay with the mutating and type widening behavior?

No. mapdict! should never create and return a different dictionary. Instead it should convert the return value of f. This behavior of maybe creating a new object is completely unprecedented, from push!, map!, broadcast!, etc.

@StefanKarpinski
Copy link
Sponsor Member

It's not actually possible to widen the type of an existing Dict, so type-widening isn't really a question.

@ndinsmore
Copy link
Contributor Author

I might be in the minority but mapdict!(v->Float64(v),Dict(:a=>1,b=>2)) feels like a completely valid use case. Maybe @chethega is right and that simple case should be done with convert, but there are certainly simple mutating examples that couldn't be covered.

I think that the difference for something like a Dict to an Vector is that the size of the index is in general much large than the size of the value. So a paradigm where the user is saying take this container and just changed the values not the indexing could save alot of memory in big problems.

The other datatype that can also fit that bill are SparseArrays, and I would argue that something like a mutating mapvalues! could be useful on those as well.

@StefanKarpinski as for type widening, is we end up not allowing mapdict! to mutate I would agree that there is no reason to do type widening, but if the mutation is allowed since relatively simple first element inference is used widening could make it a little more robust.

@StefanKarpinski
Copy link
Sponsor Member

I might be in the minority but mapdict!(v->Float64(v),Dict(:a=>1,b=>2)) feels like a completely valid use case.

The type system doesn't care that you think this is a valid use case. It cannot be done.

@ndinsmore
Copy link
Contributor Author

So no mutation then? I am okay with that.

But on the copying version should we not allow the type to be changed?

@tpapp
Copy link
Contributor

tpapp commented Mar 4, 2019

@ndinsmore: The question is not about valid use cases (there are many of those), but what should/could be specialized for Dict to can take advantage of the internals. Eg the naive implementation

mapvalues(f, dict) = Dict(zip(keys(dict), map(f, values(dict))))

can be improved by not rebuilding the table, and similarly mapvalues! can do this in place.

The point of providing these methods is that they can be faster and allocate less, not to support some use case (since they are easy to do already, just suboptimal).

The in-place version cannot change the type (it should error when trying to do so, and this should be tested), while the "functional" version should work generally, and figure out the result type like collect or Dict does, eg

julia> d = Dict(string(i) => i for i in 1:5)
Dict{String,Int64} with 5 entries:
  "4" => 4
  "1" => 1
  "5" => 5
  "2" => 2
  "3" => 3

julia> mapvalues(v -> Symbol(string(v)), d)
Dict{String,Symbol} with 5 entries:
  "4" => Symbol("4")
  "1" => Symbol("1")
  "5" => Symbol("5")
  "2" => Symbol("2")
  "3" => Symbol("3")

@StefanKarpinski
Copy link
Sponsor Member

So no mutation then? I am okay with that.

That's not what I said. The values in a dict can change but the type of a dict cannot. You can write mapdict!(Float64, Dict(:a=>1,b=>2)) if you want to but the type of the dictionary is Dict{Symbol,Int} and that cannot change so it going to end up being a very expensive no-op.

@andyferris
Copy link
Member

andyferris commented Mar 5, 2019

Alternatively, since map is over the values of an array, one could argue that this could simply be map. I believe @andyferris had proposed something along these lines previously.

Yes, my point of view is that it would be easiest to do data manipulation if dictionaries and arrays followed the same interface.

However at least for the moment I wouldn't like to break the link between iteration and map. (My earlier contribution was basically: "let's change iteration of Dict before v1.0 so that map, mapreduce, broadcast, filter and so-on for dictionaries become more useful, and faster, for data manipulation... at the same time users can still use something like map(f, pairs(dict)) for many common usage patterns of the v0.6 behavior where the keys are unchanged, and even that would still have become faster"). But, I would be pretty conservative for changes during v1.x development. For the moment at least, a function like mapvalues would seem both useful and clear enough.

Other related data manipulation functions could include reducevalues, mapreducevalues, sumvalues and filtervalues and so-on. I hope at that point that people can see this list continues to grow... I don't think it makes sense to implement foldlvalues and foldrvalues and mapfoldlvalues and mapfoldlvalues and SplitApplyCombine.groupvalues and SplitApplyCombine.innerjoinvalues and so on, just so we can support dictionaries. Ultimately to enable code composibility and reusability it is the container behavior that should change rather than introducing a variety of new functions - potentially even by patterns like map(f, values(dict)).

I wonder if it might be more productive then to implement a more "interesting" container for values(dict), pairs(dict) and even keys(dict) that follow semantics suitable for our data manipulation functions? The way each of these iterate is completely obvious, but these containers could have an efficient method specialization for map similar to this PR.

@chethega
Copy link
Contributor

chethega commented Mar 5, 2019

I wonder if it might be more productive ...

If I understood you right, then I really like your idea: All higher-order functions on dictionaries would operate on (key=>value) pairs and follow the usual inplace/widening conventions. All code for the higher-order functions would have a fast path (until isequal on the keys fails the first time) and a slow path (first time key changes: Cannot operate inplace any longer, must allocate a temporary of some sort). Constant folding should detect and eliminate that branch, for functions that return the key unchanged.

If people want to write their kernels in a way that does not get to operate on the key-value pair, we define corresponding higher-order functions on keys(dict) and values(dict) that hide the unneeded component from the user-supplied kernel, and people would write e.g. map!(fun, values(dict)) or newdict = map(fun, values(dict)), or map!(fun, dst, src = dst) (resize and empty destination, attempt to copy without new hash-evals until the first time key changes, afterwards use usual dictionary insertion that pays for extra hash evals).

Out-of-place filter and map that does not change keys should still be done with zero hash evaluations (unless the dict could shrink by a very large factor). There could be an argument for returning a Base.ValueIterator wrapped around a new dict, instead of returning a new dict: That would be consistent. But imo, it would be more pragmatic to return a new dict.

@stevengj
Copy link
Member

stevengj commented Mar 5, 2019

I agree that map!(fun, values(dict)) seems like a better name for this functionality, and doesn't seem like it would require any change to the Base.ValueIterator type.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Mar 6, 2019

That is nice but how would you write the non-mutating version?

@ndinsmore
Copy link
Contributor Author

It would be great if we could come up with a solution that makes sense for SparseMatrixCSC & SparseVector as well. On multiple occasions I would have loved to have both 'map!(fun,values(sparse))' and the equivalent to the version that changes that value type. Going even further for the sparse arrays it would be great for map!(fun, pairs(sparse)) to pass in ((i,j) => v) similar to '(key => value)'

@tpapp
Copy link
Contributor

tpapp commented Mar 6, 2019

I think it would be better to leave sparse arrays out of this PR and focus on Dict. Mapping sparse structures raises a lot of other questions.

@nalimilan
Copy link
Member

That is nice but how would you write the non-mutating version?

Indeed we can't change the behavior of map(f, values(d)), which currently returns a vector. Maybe something to mark for 2.0?

@stevengj
Copy link
Member

stevengj commented Mar 6, 2019

I think the current semantics of map(f, values(dict)) producing a new array of values is fine and is what I would expect from mapping the values iterator, consistent with other iterators. We already have a non-mutating dictionary map via Dict(k => f(v) for (k,v) in d), and you could also do map!(f, values(copy(dict))) to avoid re-hashing the keys, so I'm not sure there is a need for a new function.

In contrast, map!(f, values(dict)), which is currently an error, has only one possible reasonable meaning to me: modify the values of dict in-place.

The main value of this function is as a performance optimization in the in-place case, anyway; in a typical context where you are willing to make a copy of the dictionary, I'm guessing that the performance cost of re-hashing the keys is not such a big deal.

@ndinsmore
Copy link
Contributor Author

ndinsmore commented Mar 6, 2019

What about map(f, values(dict), create_dict=true) or map(f, values(dict), create_new_dict=true) which would be the explicit call to do the non-hashing copy. I am not sure if that is exactly equivalent to Dict(k => f(v) for (k,v) in d)

if we used create_new_dict=true we could also allow new_dict = map!(f, values(dict), create_new_dict = true) which would allow value type modification without copying the keys and hash.

@tpapp
Copy link
Contributor

tpapp commented Mar 6, 2019

allow new_dict = map!(f, values(dict), create_new_dict = true) which would allow

A ! method which creates a new object may be misleading.

It is also unclear what is meant by

value type modification without copying the keys and hash.

, would the two dicts share structure? I can't imagine that being useful.

In any case, merely copying keys, tables, and associated information should be very cheap. It is building the hash table that is expensive.

StefanKarpinski and others added 3 commits March 13, 2019 12:17
Co-Authored-By: ndinsmore <45537276+ndinsmore@users.noreply.github.com>
Co-Authored-By: ndinsmore <45537276+ndinsmore@users.noreply.github.com>
Co-Authored-By: ndinsmore <45537276+ndinsmore@users.noreply.github.com>
@StefanKarpinski
Copy link
Sponsor Member

Looks good to me. Should be squashed when merging.

base/dict.jl Outdated Show resolved Hide resolved
@StefanKarpinski StefanKarpinski merged commit 2567c3a into JuliaLang:master Mar 14, 2019
@fredrikekre
Copy link
Member

I believe this was your first PR to Julia? Great work, and welcome as a contributor.

@StefanKarpinski
Copy link
Sponsor Member

Yes, this is a really nice first PR. Thank you!

raphbacher pushed a commit to raphbacher/julia that referenced this pull request Mar 19, 2019
@fredrikekre fredrikekre added needs compat annotation Add !!! compat "Julia x.y" to the docstring needs news A NEWS entry is required for this change labels May 3, 2019
@rfourquet rfourquet added the domain:collections Data structures holding multiple items, e.g. sets label Mar 20, 2020
rfourquet added a commit that referenced this pull request Mar 20, 2020
KristofferC pushed a commit that referenced this pull request Mar 23, 2020
@KristofferC KristofferC mentioned this pull request Mar 23, 2020
27 tasks
KristofferC pushed a commit that referenced this pull request Mar 23, 2020
@garrison garrison mentioned this pull request Oct 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:collections Data structures holding multiple items, e.g. sets needs compat annotation Add !!! compat "Julia x.y" to the docstring needs news A NEWS entry is required for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants