Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add groupby to the standard lib #32331

Open
al6x opened this issue Jun 15, 2019 · 56 comments
Open

Add groupby to the standard lib #32331

al6x opened this issue Jun 15, 2019 · 56 comments
Labels
kind:feature Indicates new feature / enhancement requests

Comments

@al6x
Copy link

al6x commented Jun 15, 2019

Was surprised it's missing, really useful function, something like

groupby(f, list::Array) = begin
  foldl(list; init = Dict()) do dict, v
    push!(get!(dict, f(v), []), v)
    dict
  end
end
@nalimilan
Copy link
Member

See https://github.com/JuliaData/SplitApplyCombine.jl. I'm not sure it adds anything to have it in Base: better have more complete packages with a series of related functions.

@KristofferC
Copy link
Sponsor Member

Yes, this is unlikely to be added to Julia itself.

@al6x
Copy link
Author

al6x commented Jun 17, 2019

The natural languages and the computer languages too comply to power law, big head - long tail (20/80 or Pareto or Exponential etc) distribution.

80% of use cases covered by just 20% of features, probably even more extreme in case of software languages.

groupby in my opinion belongs to big head or standard library. Ruby, Kotlin, Elixir, Java have it. I was really surprised that Julia - a functional language with focus on data processing doesn't have that essential function for data manipulation in its standard library. Even SQL - with its really strict and compact set of built-in functions has GROUP BY.

better have more complete packages with a series of related functions

That's why there's no need in "more complete package". Because those "series of related functions" - are the long tail that you never need. And 36 stars on that package SplitApplyCombine.jl confirm that it's not that popular.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Jun 17, 2019

What would you propose the Base groupby function do? From your code example (I'm a little unclear on why it's using fold instead of just a loop), it seems like you'd want it to apply a group key function to an iterable collection of items and then construct a dict mapping each group key value to the vector of values which produced that key. Does that sound correct?

A major concern is that you want to use the name groupby in data packages like DataFrames and having it in Base with a different meaning makes that particularly annoying. So unless the two meanings are compatible, we wouldn't want to do that.

@al6x
Copy link
Author

al6x commented Jun 17, 2019

why it's using fold instead of just a loop

Do you mean for loop like that? Or there's some shorter version?

groupby(f, list::Array) = begin
  groups = Dict()
  for v in list
    push!(get!(groups, f(v), []), v)
  end
  groups
end

What would you propose the Base groupby function do? it seems like you'd want it to apply a group key function to an iterable collection of items and then construct a dict mapping each group key value to the vector of values which produced that key.

Yes, that, or some similar data structure like [[(key_a, v1), (key_a, v2)], [(key_b, v3)], ...] that would make sense for representing grouped data.

A major concern is that you want to use the name groupby in data packages like DataFrames and having it in Base with a different meaning makes that particularly annoying. So unless the two meanings are compatible, we wouldn't want to do that.

How it actually looked like. I needed to group the list, and typed ?> groupby - found none. Wow, strange. Typed julia groupby in Google - it gave me DataFrame.groupby - no, I definitely don't need some advanced DataFrame library just to group simple collection. Tried searching further - found nothing and had to write the groupby myself.

There are 2 things that I don't understand with that reasoning:

  1. I'm not sure why I would need DataFrame. There's a reason for DataFrame in Python because Python is slow and DataFrame allows to express data computation in a general way so it will be understood by its C-DataFrame-engine and executed fast. But Julia is fast - so why should I bother learning some special DataFrame API when I can just use plain Arrays and Collections?

  2. You saying - that you don't want to add some function to standard library because it may conflict with third-party package using the same function? There are 2 problems here in my opinion: 1) I see priorities exactly the opposite - the core is more important than third-party packages. 2) There shouldn't be problem because (in my understanding, maybe I'm missing something) the whole point of multiple dispatch is to solve such cases and allow to exist same function names operating on different arguments.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Jun 18, 2019

Reopening for discussion. The first point about Python doesn’t make sense to me: there’s absolutely no reason Python needs data frames to implement an operation that takes a collection and returns a dictionary of arrays (or lists as Python calls them). Searching “Python groupby” gives “pandas.DataFrame.groupby” and there is no built-in group by operation. So we are in exactly the same state of affairs as Python, which suggests that it’s not that unreasonable.

That said, groupby is a pretty handy operation and slightly annoying to write out, so maybe it would be reasonable to have it.

  1. I see priorities exactly the opposite - the core is more important than third-party packages.

Not at all. The core should implement what is necessary to have in the core and hard or awkward to have in packages. There must be a good reason to be in the core language. Any complex functionality that can be removed into a package should be. Being stuck in the core language is where things go to fossilize. However, the proposed groupby functionality is clear and simple enough that it might be acceptable to fossilize it.

  1. There shouldn't be problem because (in my understanding, maybe I'm missing something) the whole point of multiple dispatch is to solve such cases and allow to exist same function names operating on different arguments.

That depends on if they implement compatible signatures and concepts that are truly the same. If the DataFrames.groupby operation ever wanted to have a method that took an arbitrary collection and produced DataFrames, then that would directly conflict with the proposed groupby function. So multiple dispatch doesn’t necessarily solve this problem and one does have to think—generally pretty carefully—about what a generic function really means.

@max-sixty
Copy link

Searching “Python groupby” gives “pandas.DataFrame.groupby” and there is no built-in group by operation

FWIW groupby is in the python stdlib: https://docs.python.org/3/library/itertools.html#itertools.groupby. Though not sure the python stdlib is the muse

@al6x
Copy link
Author

al6x commented Jun 18, 2019

I see your point, makes sense. I guess we have slightly different goals. I prefer unified solution ready out of the box. And you prefer something more like a constructor to assemble custom solution you want.

@nalimilan
Copy link
Member

I think it would be OK for DataFrames to import and override Base.groupby since we only need to handle AbstractDataFrame arguments. But I'd rather not commit now to a particular API for groupby. As the Python example shows, it can make sense to return an iterator, but the OP suggested returning a dict instead, and yet a recent comment proposed returning an array. Let's wait until a clear solution emerges in packages. Though maybe the iterator version could go to Iterators.

Cc: @andyferris

@oxinabox
Copy link
Contributor

IterTools.jl has it
https://juliacollections.github.io/IterTools.jl/latest/#groupby(f,-xs)-1

In general Base.Iterators contains only the bare minimum set of iterator processing that Base itself needs.
Other iterator processing functions go in IterTools.jl.
Just like Base contains minimal data structures, the rest being in DataStructures.jl

@JeffreySarnoff
Copy link
Contributor

Julia's way of placing arguments follows the reading of the function's name:
group __ by ___
groupby
group by
groupby(items, fn)

@davidanthoff
Copy link
Contributor

Query.jl also has it.

I agree, I don't think this belongs in base. IterTools.jl and and other packages can handle this perfectly well.

@mkborregaard
Copy link
Contributor

mkborregaard commented Jun 18, 2019

FWIW I've wanted this quite often. An argument for having it in Base is that this would emphasize having a common syntax and intuition. I think exactly the argument that it is challenging to come up with a general API is a good reason for trying to do so, aligning the API across the many (IterTools, Query, DataFrames, SplitApplyCombine) that have it.
I also tend to feel that when there is a functionality that gets reimplemented again and again across the ecosystem, that's a sign that there is a need to consolidate. And it should be clear and straightforward enough to fossilize.

@andyferris
Copy link
Member

Yes, I think having a single obvious way of grouping data would be better than having a smattering of implementations with slightly different semantics.

But what should that shared semantic be? understand the semantics in e.g. the OP and SplitApplyCombine,group is a bit different than what might seem obvious/intuitive for a tabular structure. To that end I got sucked into a vortex of dictionaries and tables that I haven’t worked my way through yet (I’ve been distracted by other things, unfortunately).

I did think there may be a path to combine the semantics of groupby by (a) tweaking AbstractDict and (b) dealing with partitioned tables as nested structures and (c) thinking of tables with a known unique key column as behaving more like a dictionary from the key value to the row than as an “array”-like thing mapping consecutive integers to rows (typical DataFrame).

But ignoring completely tables for a minute, I think obvious and common operations on/between arrays and dictionaries totally belong in Base. Put another way - consider how well fleshed out the AbstractArray interface is - batteries included, even ignoring LinearAlgebra. I don’t see why a lesser level of sophistication should apply to AbstractDict (or various obvious transformations involving dictionaries and arrays).

@davidanthoff
Copy link
Contributor

The groupby operation in Query.jl returns an iterator of Grouping items, where Grouping <: AbstractArray. That works well for both the non-tabular and tabular space (in fact, I think it ends up with a much more natural story than resorting to dicts). The design is just copied from LINQ.

@nalimilan
Copy link
Member

It would be great to find a common interface, but I think it should first live in a package, and only when everybody is relatively OK with it should it be moved to Base (if at all).

@o314
Copy link
Contributor

o314 commented Oct 22, 2019

groupby in others languages (rust, python, ruby, etc.) is one of the less harmonized definition among high-order functions. From datastore to datastore, there are some mismatches too. Not so simple.

Edited: after a bit of investigation as suggested by oxinabox, a consensus may have emerged in favor of sql. good god.

@o314
Copy link
Contributor

o314 commented Nov 21, 2019

IterTools.jl has it

Itertools groupby only group consecutive values eg. is rather useless.

@oxinabox
Copy link
Contributor

oxinabox commented Nov 21, 2019

Itertools groupby only group consecutive values eg. is rather useless.

I wouldn't say it is useless
This is the exact behavour of:

idk about ruby's.

As you said, this is one of the least harmonized defintions.
This is a reason not to have it in the stdlib in my mind.

@o314
Copy link
Contributor

o314 commented Nov 22, 2019

I vote +1 and -1 to your post :)

First, let's forget itertools, not a stdlib IMHO.
We've got now those teams:

Consecutive grouping caps

  • python
  • rust

Total grouping caps

According to this sample, a consensus seems to emerge in favor of grouping caps.
(My previous post may have been told a little too quickly, i will now downvote it).

The problem is however that group by apply materialization of data even solliciting intermediary collections. There will always be someone that need another version of array.
But sticking to iterator, vector, dictionary could do the things for a base version...

@StefanKarpinski
Copy link
Sponsor Member

The main issue at this point is harmonizing with the DataFrames definition in such a way that we don't block DataFrames from using the same name. In other words, we need a generic meaning for groupby that: (a) can be implemented in base, operating on generic iterables, returning a dict of vectors; and (b) agrees with the DataFrames definition of groupby, so that it's legitimate for DataFrames to extend base's groupby function—i.e. they fundamentally do the same thing.

@davidanthoff
Copy link
Contributor

It is not clear to me that the version that operates on iterables should return a dict. I think the Query/LINQ approach that returns a Grouping is actually more composable for the iterable and table case.

@StefanKarpinski
Copy link
Sponsor Member

That's good feedback. In that case this should probably not be in Base.

@nalimilan
Copy link
Member

Yes, DataFrames and Query both return an iterable of groups, which has the advantage that you can aggregate it without necessarily allocating a copy of all grouped rows. For simple reductions like summing, this lazy operation can improve performance a lot as you can even avoid computing the permutation that sorts entries by groups, and instead compute the group sums in one pass over the data.

I wonder to what extent that approach makes sense for single vectors, as the cost of allocating copies is lower compared to a multi-column table -- and the vector of group indices takes as much space as the original data.

@tkf
Copy link
Member

tkf commented Dec 1, 2019

I commented it in JuliaData/DataFrames.jl#1256 (comment) but I find quite useful sometimes that pandas.DataFrame.groupby is an iterable of key-group pairs rather than groups. This would be equivalent to returning a Dict.


Also, I think using non-lazy group vectors is necessary for unindexable iterators. In particular, the input iterator may be stateful (e.g., eachline).

I wonder to what extent that approach makes sense for single vectors, as the cost of allocating copies is lower compared to a multi-column table -- and the vector of group indices takes as much space as the original data.

I suppose this may not be the case for vectors of big named tuples or structs?

Due to these points, I think Base needs to have at least two methods of groupby: one for generic iterable (e.g., returning Dict{_, <:Vector}) and one for AbstractVector (e.g., returning Dict{_, <:SubArray}).

@ssfrr
Copy link
Contributor

ssfrr commented Feb 18, 2020

FWIW I think it would be great to have a groupby in Base. Currently you can get really far just using Vector{NamedTuple} as a table type, and map and filter for bread-and-butter data processing with no dependencies. groupby is the one that pops up frequently that I need an external package for. As far as what type it should return - I get that using a Dict internally is a reasonable way to implement this, but that seems like an implementation detail and not a fundamental property of the implementation. An iterable of Pairs seems pretty reasonable for this (which currently is what Dict is, but that might not be forever).

Here's one implementation I've been playing around with. It's heavily inspired by @andyferris's SplitApplyCombine, but returns a Vector. It's designed to be maximally flexible, taking user-provided functions. It might make sense to give it a more sort-like API with keyword arguments for the functions.

It's also implemented in terms of IterTools.groupby, which only groups consecutively, so that would have to be re-implemented or copied to include this in Base. It could also be tweaked to use generators to reduce allocation, but so far that's been a net performance loss in my tests.

using IterTools: IterTools
"""
    groupby(by=identity, mapf=identity, groupf=Pair, x)
Group `x` by `by.(x)`, applying `mapf` to each element within the group, and then
calling `groupf(key, values)` on each group.

Example:

julia> data = [
    (foo=1, bar=2, baz=3),
    (foo=2, bar=1, baz=2),
    (foo=1, bar=10, baz=10)]

julia> gs = groupby(r->r.foo, r->r.baz, data)
2-element Array{Pair{Int64,Array{Int64,1}},1}:
 1 => [3, 10]
 2 => [2]
"""
@inline groupby(x) = groupby(identity, x)
@inline groupby(by, x) = groupby(by, identity, x)
@inline groupby(by, mapf, x) = groupby(by, mapf, Pair, x)
function groupby(by, mapf, groupf, x)
    groups = IterTools.groupby(by, sort(x; by=by))
    map(g->groupf(by(g[1]), map(mapf, g)), groups)
end

@JeffreySarnoff
Copy link
Contributor

JeffreySarnoff commented Feb 18, 2020

@ssfrr I have relied on groupby or its equivalent when working with databases (large and small), are there other large domains of use where groupby is frequently called?

I see good reason to add groupby to a standard lib, that would simplify, promulgate, and, over time, enhance the ease with which members of the community approach their tabular processing.

To put something into Base means that Base has been .. in that specific manner .. lacking or wrongly unsupportive of _ and is only best served by the inclusion of _. Is groupby more appropriate to Base than dot (which is in the LinearAlgebra standard lib).

@ssfrr
Copy link
Contributor

ssfrr commented Feb 18, 2020

Yeah, you're right, a stdlib is better, I didn't think that through very carefully. In my mind it's closely related to map, and filter, but is definitely much less widely-used than they are. Base.Iterators seems like a reasonable home (which should perhaps also be a stdlib, but that seems like a separate issue).

@tkf
Copy link
Member

tkf commented Feb 19, 2020

It's also implemented in terms of IterTools.groupby, which only groups consecutively

By the way, if we were to add groupby in stdlib or Base, I think the first step would be to add an iterator transformation (say) partitionby in Base.Iterators with this semantics. I borrowed the name from Clojure's partition-by. We already have Iterators.partition so I think it's a reasonable name. Also, if we have Iterators.partitionby, groupby can be done with one-linear (see below).

returns a Vector

If not returning a dictionary, I suggest returning a lazy iterator and make the lifetime and ownership of each element clear in the API to leave opportunity of optimizations (i.e., we can return a view or reuse the buffer/output vector in the partitionby implementation). If we return a lazy iterator, we don't need mapf and groupf. A reference implementation will be something like:

const partitionby = IterTools.groupby
groupby(by, x) = (by(g[1]) => g for g in partitionby(by, sort(x, by = by)))

Even though this is one-linear, it can be optimized more (e.g., specialize partitionby on dense array, use sortperm if eltype(x) is big, etc.) so I think it makes sense to have it in a library function.

Alternatively, maybe it's better to use pairs(groupby(by, x)) to get pairs:

struct Groups
    by
    partitions
end

iterate(gs::Groups, ...) = ...  # forward to partitions
groupby(by, x) = Groups(by, partitionby(by, sort(x, by = by)))
pairs(gs::Groups) = (by(g[1]) => g for g in gs.partitions)

Base.Iterators seems like a reasonable home

I think Base.Iterators should only contain (mostly) non-allocating lazy iterator transformations that is almost free to construct, to keep its focus clear. Group-by in the sense discussed here typically needs sorting so I don't think it fits in this scope.

@nalimilan
Copy link
Member

I think it would be more reasonable to start shipping an implementation in a package. Then if everybody agrees it's a good API, it could be moved to Base or the stdlib.

@mkborregaard
Copy link
Contributor

I think it would be more reasonable

Given that this is the topic of the entire thread, with many good arguments above, this seems quite thin. Could you phrase your point so it adresses the arguments above?

@nalimilan
Copy link
Member

It's just that there's already one implementation in SplitApplyCombine, and that other implementations are being proposed here, so it's already not clear which API is best nor whether it will suit all needs. And even @tkf's nice proposal above seems uncertain about the exact type to return.

Anyway, if people want to use that function soon, it would better live in a package, otherwise you won't be able to use it until Julia 1.5 at best.

@andyferris
Copy link
Member

I agree that clojure's partition-by is worth a look, and there's no reason I see not to put that into IterTools.

It seems that a non-sorted (or otherwise indexed) collection can't be grouped with no memory overhead so should be dealt with outside of IterTools. Sorting and partitioning seems like a lower-level implementation of the operation you really want - splitting the elements of a collection into groups. We can definitely recommend that as a "pattern" for people to use from Base - but I really think we should provide a generic function with well-defined semantics that people can (a) use and (b) overload with different implementation details.

And in case this wasn't clear - everyone is welcome to treat SplitApplyCombine as a place to prototype interfaces and implementations that might be suitable for inclusion in Base or a stdlib - in a way that was what it was created for, and really summed up by @ssfrr with "you can get really far just using Vector{NamedTuple} as a table type, and map and filter for bread-and-butter data processing" and trying to fill in the remaining gaps. The functions I use most are definitely the group* family (I am surprised how often I use groupcount, actually) - and not necessarily on tabular data, they are great for arrays and other collections too.

As for the discussion above about whether returning a dictionary or an array or something simpler is best, or something that iterates groups or key-group pairs - that seems very interesting to me! I always felt we should double-down on AbstractDict so it's as easy to use and extend as AbstractArray so that choosing a dictionary output to grouping wouldn't feel like any kind of implementation burden; just an interface guarantee that lets you deal with the output.

the advantage that you can aggregate it without necessarily allocating a copy of all grouped rows

Yes, I strongly agree, it seems essential that whatever we do that there is a way of doing reductions over the groups without sorting or excess allocations, etc.

@o314
Copy link
Contributor

o314 commented Feb 28, 2020

The problem there is that could be 1/ very enviable first 2/ then very badly implemented in the long run. One example : the (10 differents) filter semantics of panda vs the (one and universal) where semantic of sql.

All this stuff is a matter of relational calculus / relational algebra, not data structure.
If you want to see a good example of a how generic implementation can be provided and specialized Apache Calcite is a very good start.

Particulary class Aggregate implemented by ElasticsearchAggregate, EnumerableAggregate, GeodeAggregate, JdbcRules.JdbcAggregate, LogicalAggregate, MongoAggregate, PigAggregate

Missing to rely on those strong foundational work, and too much time (eg. years) may be and will be spent examing cosmetic and/or dysfunctional options.
And at this stage, putting this first in stdlib does not seem a good point IMHO

@ThoAppelsin
Copy link

I have been looking for this for the past hour. I went through the same steps as @al6x:

  1. Tried ?group (mind you that I'm not even familiar with groupby or anything of sorts in other languages).
  2. Tried ?partition.
  3. Tried the internet for the same, encountered some packages, some that feels like an overkill (e.g. DataFrames).
  4. Tried ?unique because it seemed conceptually similar. We sure do have unique(f, itr), which probably does exactly what group(by)/partition(by) would do, only that it discards the duplicates it finds rather than returning/yielding them.

Much has been said already but (unless I've missed it) this one's new on the table: We already have unique in Base. Is group(by) so much different from it to be left out?

@nalimilan
Copy link
Member

unique already existed in Base when this issue was filed, so AFAICT the discussion above is still relevant (and it shows why it's there's considerably more variation in how groupby could work).

@ThoAppelsin
Copy link

I didn't imply that it wasn't; and in fact (or at least my opinion), longer the unique(f, itr) has been in Base (and you say it has), more odd becomes the group(by)'s absence given their similarity in operation.

@StefanKarpinski
Copy link
Sponsor Member

Saying that it's odd that it doesn't exist in Base doesn't really get us anywhere without addressing any of the questions about what its definition should be further up in the issue, which still aren't addressed. What should groupby do in the absence of a data frame type for it to produce?

@ThoAppelsin
Copy link

ThoAppelsin commented Apr 22, 2021

Sorry, I thought it would help you out on deciding where to put it (=Base, contrary to your comment from ~1.5 years ago) once the rest is settled. I also thought that it could give this issue a kick, present a motivation to actually proceed with it.

As for your question, I don't have any _should_s, but just personal _opinion_s: It could mirror the unique, and instead of returning array of singular "unique" elements, it could return arrays of "same" elements in the order they appear in the original iterable. See the following example:

julia> unique(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
3-element Vector{Int64}:
  3
 10
  2

julia> groupby(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
3-element Vector{Vector{Int64}}:
 [3, 9, 6, 3, 9]
 [10, 4]
 [2, 5, 8]

As a consequence, getindex.(groupby(f, itr), 1) == unique(f, itr) is always true.

@StefanKarpinski
Copy link
Sponsor Member

That seems like a sane behavior for groupby on a vector. Then the question becomes: does it match what DataFrames.groupby means? If so, then we could merge them into a single generic function. If not, then exporting a function by the same name from Base would be extremely disruptive and annoying. Any DataFrames folks care to comment on whether this definition would be conceptually compatible or not?

@goretkin
Copy link
Contributor

goretkin commented Apr 22, 2021

As far as I can tell, there isn't a DataFrames.groupby that takes a function to be applied to a row. You have to group by a column. Here is what the analogy would be:

julia> using DataFrames

julia> df = DataFrame((;a=mod(b, 0:2), b) for b=[3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
10×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   10      3
   21     10
   30      9
   41      4
   50      6
   60      3
   72      2
   80      9
   92      8
  102      5

julia> map(df -> df[:, :b], collect(groupby(df, :a)))
3-element Vector{Vector{Int64}}:
 [3, 9, 6, 3, 9]
 [10, 4]
 [2, 8, 5]

or perhaps

julia> map(df -> df.b, collect(groupby(df, :a)))
3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{Vector{Int64}}, false}}:
 [3, 9, 6, 3, 9]
 [10, 4]
 [2, 8, 5]

I think the definition in #32331 (comment) is what I would like, but I do not see a good way to reconcile it with DataFrames.groupby.
[EDIT]: Upon more careful thought, I do think it's more valuable to allow indexing by the group key, as the comment below indicates.

@andyferris
Copy link
Member

What should groupby do in the absence of a data frame type for it to produce?

Grouping is something that makes sense - and is extremely useful - in the absence of dataframes. I think it should be in Base alongside filter and map.

To me it's important that you can look up the groups by key, like in some kind of dictionary or dictionary-like container. Here's a quick-and-dirty implementation for the earlier example:

julia> function groupby(f, itr)
           out = Dict{Core.Compiler.return_type(f, Tuple{eltype(itr)}), Vector{eltype(itr)}}()
           for x in itr
               key = f(x)
               group = get!(Vector{eltype(itr)}, out, key)
               push!(group, x)
           end
           return out
       end
groupby (generic function with 1 method)

julia> groupby(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
Dict{Int64, Vector{Int64}} with 3 entries:
  0 => [3, 9, 6, 3, 9]
  2 => [2, 8, 5]
  1 => [10, 4]

(To be clear I am specifically proposing we add a cleaned-up version of the above to Base - the idea has been validated in SplitApplyCombine.jl for some time now and it's useful and convenient in many situtations).

I'm not sure that the interface should require that you return an actual AbstractDict, but something similar that you can extract the groups (and their keys) from. I can see this being extended for when a dataframe is the input, returning a grouped dataframes of some kind (note: there's an interesting distinction between nested and flattened grouped dataframes that you need to be careful of - because SQL doesn't allow nesting we generally think of the flattened version but that's not really the underlying operation. In DataFrames.jl this is playing out as the question of "What does GroupedDataFrame iterate - a row or a group's dataframe?". My answer here says dataframe. There's another distinction on whether the group key is a miniture row or can be just a single value. But the point is I think such issues can be resolved in DataFrames.jl without being unnatural, and for other data structures besides).

@andyferris
Copy link
Member

andyferris commented Apr 22, 2021

Part of the problem with @ThoAppelsin's version is efficiency. If you happen to want to look up groups by keys, you need to call unique and groupby and then add it to a Dict. You are creating the exact same hash-map 3 times, which isn't great for many small groups. It's more efficient to create the dictionary once and collect the keys and values if necessary (that's what happens in unique already).

There's lots of other interesting efficiency-related stuff in this space (fast group reductions, speedups for when the input is already sorted by the grouping key, etc).

@nalimilan
Copy link
Member

In DataFrames.jl this is playing out as the question of "What does GroupedDataFrame iterate - a row or a group's dataframe?". My answer here says dataframe.

GroupedDataFrame objects indeed return SubDataFrame views pointing to the relevant rows when iterating, so that's not just "your answer". :-)

There's another distinction on whether the group key is a miniture row or can be just a single value. But the point is I think such issues can be resolved in DataFrames.jl without being unnatural, and for other data structures besides

In DataFrames, keys are GroupKey objects which behave almost identical to named tuples. So it seems to be consistent with the implementation proposed by @andyferris if we consider that DataFrame is equivalent to a Vector{<:NamedTuple}.

DataFrames doesn't currently allow specifying a transformation to apply to rows before grouping, but that should be a relatively natural extension using the incol => fun syntax we support elsewhere. I guess we could also imagine groupby(fun, df) even if type-stability would make it inefficient (just like we have filter(fun, df)).

Regarding the implementation, I'm not sure returning a Dict is ideal as I noted above. Using a custom iterator would allow implementing major optimizations for things like reduce(+, groupby(x -> x > 0, vec)) or more complex cases like mapreduce(last, +, groupby(first, [("a", 1), ("b", "2"), ...]), just like we do in DataFrames. Without this ability, you're forced to have additional functions like groupreduce, making the API less composable.

@ThoAppelsin
Copy link

ThoAppelsin commented Apr 23, 2021

@andyferris I think you can do the following to obtain what you want with my solution and without a performance hit: My bad.

I also think returning a Dict with keys would be inconsistent with how unique(f, itr) works, which discards the keys used for finding the unique elements. I.e. it doesn't do this:

julia> unique(x -> mod(x, 0:2), [3, 10, 9, 4, 6, 3, 2, 9, 8, 5])
Dict{Int64, Int64} with 3 entries:
  0 => 3
  2 => 2
  1 => 10

I personally am fine with it, though. They are mostly interchangeable anyway.

@nalimilan
Copy link
Member

@ThoAppelsin Dict(mod3(x[1]) => x for x in a) requires calling mod3 again on each value, hashing the result and doing a hash table lookup. This is slow, and prohibitively so if the number of groups is large (e.g. 2 values per group).

@StefanKarpinski
Copy link
Sponsor Member

I also think returning a Dict with keys would be inconsistent with how unique(f, itr) works, which discards the keys used for finding the unique elements. I.e. it doesn't do this:

There's nothing that says that groupby needs to behave the same way as unique though. All you need to do is change the relationship between them to say that this is the identity that always holds:

[ first(vals) for (key, vals) in groupby(f, itr) ] == unique(f, itr)

The only argument that needs to be made here is that one typically wants the key when doing a group by whereas one doesn't need it when finding unique values by some transformer.

@bkamins
Copy link
Member

bkamins commented Apr 23, 2021

A request from DataFrames.jl is. If this method is added to Base with groupby name to use a definition that will not cause dispatch ambiguity with what we define, i.e.:

groupby(df::AbstractDataFrame, cols; sort, skipmissing)

(this is probably easy to ensure, but I wanted to be explicit)

To make it what @nalimilan said complete: we return GroupedDataFrame which supports iteration, and is read-only indexable both by group number and by group key. Here is an example for a reference:

julia> df = DataFrame(x=[1,1,2], y=["A", "A", "B"])
3×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  A
   2 │     1  A
   3 │     2  B

julia> gdf = groupby(df, [:x, :y])
GroupedDataFrame with 2 groups based on keys: x, y
First Group (2 rows): x = 1, y = "A"
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  A
   2 │     1  A
⋮
Last Group (1 row): x = 2, y = "B"
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     2  B

julia> gdf[1]
2×2 SubDataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  A
   2 │     1  A

julia> gdf[(x=2, y="B")]
1×2 SubDataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     2  B

julia> keys(gdf)
2-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (x = 1, y = "A")
 GroupKey: (x = 2, y = "B")

@andyferris
Copy link
Member

Regarding the implementation, I'm not sure returning a Dict is ideal as I noted above. Using a custom iterator would allow implementing major optimizations for things like reduce(+, groupby(x -> x > 0, vec)) or more complex cases like mapreduce(last, +, groupby(first, [("a", 1), ("b", "2"), ...]), just like we do in DataFrames. Without this ability, you're forced to have additional functions like groupreduce, making the API less composable.

This is true - though we have the same problem with composing map and reduce currently. It seems consistent to me if the functions like map, filter and groupby were all eager or all lazy - so I would suggest also creating an Iterators.groupby function to match Iterators.map and Iterators.filter (as an aside, I wonder if we an eager partition has a use?).

I'm not aware of a lazy groupby iterator that supports both fast lookup by group key and ideal performance for group reductions, though. In SplitApplyCombine terms that could be stated like: groupsum(f, itr) is faster than sum.(group(f, itr)) and also faster than sum.(groupview(f, itr)). Maybe there is an approach I'm not aware of - does anyone know? Perhaps the tranducer approach has a solution? Otherwise you might want groupreduce regardless?

@o314
Copy link
Contributor

o314 commented Apr 25, 2021

Serious work will require to begin by specifying first the expectation about sort and uniquess of your data. Better, then to define index on them.

That may be too much for a first implementation in Base.
But it will be hard too to talk about performance analysis without that coming first.

+1 for an eager family and a lazy one in Iterators.

@bkamins
Copy link
Member

bkamins commented Apr 25, 2021

But it will be hard too to talk about performance analysis without that coming first.

Agreed. Also column type and element type should be taken into account. However this can be done by the packages. E.g. PooledArrays.jl could opt-in to optimally handle grouping by it.

Practice shows that sortedness has a relatively greater impact on performance than uniqueness. Fortunately in typical cases checking issorted to decide the algorithm should be fast enough (as issorted fails fast)

@nalimilan
Copy link
Member

This is true - though we have the same problem with composing map and reduce currently. It seems consistent to me if the functions like map, filter and groupby were all eager or all lazy - so I would suggest also creating an Iterators.groupby function to match Iterators.map and Iterators.filter (as an aside, I wonder if we an eager partition has a use?).

I'm not completely sure that what I'm describing qualifies as "lazy" in the Iterators sense, as it allocates a lot of data. AFAICT existing functions in Iterators do not allocate anything, with the downside that repeated accesses to the same entry require repeating computations (which are usually cheap). Transposing this behavior to groupby would be quite slow as the whole data would have to be scanned for each group. Python and Rust (which live in their itertools modules) avoid this by requiring data to already be sorted by groups, but that's not super useful IMHO.

I'm not aware of a lazy groupby iterator that supports both fast lookup by group key and ideal performance for group reductions, though. In SplitApplyCombine terms that could be stated like: groupsum(f, itr) is faster than sum.(group(f, itr)) and also faster than sum.(groupview(f, itr)). Maybe there is an approach I'm not aware of - does anyone know? Perhaps the tranducer approach has a solution? Otherwise you might want groupreduce regardless?

What DataFrames does is to compute eagerly only the index of the group to which each observation belongs. Then depending on what you do, it populates more fields lazily when they are accessed (that includes a Dict for fast key lookup and a permutation of rows that sorts them into groups).

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Apr 30, 2021

Python and Rust (which live in their itertools modules) avoid this by requiring data to already be sorted by groups, but that's not super useful IMHO.

Indeed, in order to use that API you have to sort by unique key first, which turns an O(n) hash based algorithm into an O(n log n) sorting based algorithm. Oops.

@bkamins
Copy link
Member

bkamins commented Apr 30, 2021

And in general keys do not have to be sortable.

@andyferris
Copy link
Member

andyferris commented May 1, 2021

And in general keys do not have to be sortable.

Yeah I was going to bring up this point, too - hash and isequal works for more types and combinations of types than isless. So I don't think the basic generic algorithm should be sort based.

Python and Rust (which live in their itertools modules) avoid this by requiring data to already be sorted by groups, but that's not super useful IMHO.

Haha, I was familiar with these and Clojure's partition, and I always assumed this was exactly what our Iterators.partition did - but it seems I'm wrong!

julia> Iterators.partition
partition (generic function with 1 method)

help?> Iterators.partition
  partition(collection, n)

  Iterate over a collection n elements at a time.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> collect(Iterators.partition([1,2,3,4,5], 2))
  3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}:
   [1, 2]
   [3, 4]
   [5]

So, should we also consider a Iterators.partition(f, iter) method which checks whether consecutive x1 in iter, x2 in iter are isequal(f(x1), f(x2))? (Which will naturally fall out for appropriately sorted collections, and sometimes you already have data in this form so it can be nice to take advantage of the efficiency). Something that does something like:

julia> collect.(Iterators.partition(iseven, [1,3,5,2,4]))
  3-element Vector{Vector{Int64}}:
   [1, 3, 5]
   [2, 4]

@bergel
Copy link

bergel commented Oct 11, 2022

As a newcomer to Julia, I was quite surprised to not see this function directly available. It is as useful as filter, findall, ...

@brenhinkeller brenhinkeller added the kind:feature Indicates new feature / enhancement requests label Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:feature Indicates new feature / enhancement requests
Projects
None yet
Development

No branches or pull requests