Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stdlib compat set #1006

Merged
merged 13 commits into from
Jan 22, 2021
Merged

Conversation

jabot2
Copy link
Contributor

@jabot2 jabot2 commented Jan 19, 2021

BatSet is missing several methods from Stdlib.Set.
Also, several existing methods in Stdlib.Set guarantee physical equality of resulting sets if possible.
This PR introduces the missing methods into BatSet, and adds the physical equality guarantees.
It also adds BatSet to the compattest.

@UnixJunkie
Copy link
Member

Thanks for this contribution.
Note that all new code must come with unit tests.

@jabot2
Copy link
Contributor Author

jabot2 commented Jan 20, 2021

@UnixJunkie I thought i added unit tests for everything? I will check again...

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is nice work (again).

Note: for functorized modules (Set and Map) we also have non-inline tests in testsuite/test_<module>.ml, so that the tests can be performed on all implementations (for functions that are available everywhere). I have mixed thoughts about adding inline tests for some functions (if they could be tested this way): inline tests are more convenient, but they either force us to add redundancy (copying the tests for several interfaces) or they test less. (But there probably should have been a comment in the .ml file pointing at where the tests are.) I would have a slight preference for moving the tests to testsuite/test_set.ml when it makes sense, but I'm interested in what you think about this.

src/batSet.ml Outdated Show resolved Hide resolved
src/batSet.ml Outdated
then join l v r
else union cmp l (add cmp v r)

let rec map_stdlib cmp f = function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decide to keep the older, less-efficient implementations of map and filter_map around, and name those *_stdlib to avoid name conflicts? Would it be correctly to completely replace the previous functions with the new implementations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the map and filter_map functions from stdlib take a function elt -> elt, while the functions in Batteries take a function elt -> 'b and produce a 'b set (for the polymorphic maps)
So it seems we must keep the old versions or break compatibility with people that currently rely on map/filter_map to create sets of a different element type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, thanks. I would propose to call them map_endo or endo_map (from "endo{morphism,function}", a function from a set to itself) or homogeneous_map rather than map_stdlib.

The implementation of map_stdlib and filter_map_stdlib that you propose is more efficient than the existing map and filter_map functions in the case where the mapping function preserves the order, or "nearly" preserves the order. Would you possibly be interested in using the same implementation with try_join for the existing/heterogeneous map functions? Note that the == trick cannot be reused, as it assumes that the input and output types are equal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give credit where it is due: the map_stdlib and filter_map_stdlib methods are copied (mostly) verbatim from the stdlib.

I rewrote the map and filter_map functions to use try_join. I'm not convinced that makes them faster though...

src/batSet.ml Outdated
match m with
| Empty -> Seq.empty
| Node(l, v, r, _) ->
Seq.append (to_seq l) (Seq.cons v (to_seq r))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two remarks:

  • This implementation does not do what I would expect due to the use of the strict function Seq.cons: it will traverse the whole set before returning the first element. This is not technically wrong as there is no order-of-evaluation difference (and probably not much of a performance difference, although it is worth measuring), but I would find it more natural to have a version that lazily enumerates the list.
  • We have a proper datatype to represent partial enumerations of sets, 'a iter and cons_iter.

I would expect something looking roughly like (untested):

let rec iter_to_seq = function
| E -> Seq.empty
| C (e, r, t) -> fun () -> Seq.Cons (e, iter_to_seq (cons_iter r t))

let to_seq s = iter_to_seq (cons_iter t E)

(to_seq_from also needs an adaptation.)

I would be curious to know what is the performance difference between your version, a version using iter as above, and also a modified version of your code that avoids Seq.cons to avoid forcing the tail (but otherwise traverses the set recursively in the direct way).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified to_seq and to_seq_from to be more lazy.
They will not do anything on invocation (except returning a closure).
Furthermore, once that closure is called, it will only go down the leftmost path in the tree to determine the first element; recursing into the right subtrees is again done in a closure.

I hope that satisfies your expectations of these functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do, thanks! I would still be curious to know whether the 'a iter version is faster, but I don't have the time to look at this right now and maybe you are not so curious yourself, so the new versions are fine with me.

src/batSet.ml Outdated Show resolved Hide resolved
@jabot2
Copy link
Contributor Author

jabot2 commented Jan 20, 2021

I started moving some testcases to the testsuite. I'm not done yet, sorry...

@UnixJunkie
Copy link
Member

If some tests are in another file, it would be nice to have a comment
just after each function tested in this way, saying
where the test is, so that people are not surprised that there is
no unit test in comments just after an implementation (and they know right away
where to look for it).

@jabot2
Copy link
Contributor Author

jabot2 commented Jan 22, 2021

@UnixJunkie After checking again: I think i have unit tests for every function that i touched.
@gasche I renamed the _stdlib functions to _endo

Do you think that there is anything else to do?

@gasche
Copy link
Member

gasche commented Jan 22, 2021

Let's go ahead and merge. Thanks for the nice work.

@gasche gasche merged commit 46bc18c into ocaml-batteries-team:master Jan 22, 2021
gasche added a commit to gasche/batteries-included that referenced this pull request Jan 30, 2021
…batteries-team#1006

Results on my machine:

    enumeration (89.71 us) is 77.5% faster than
    batseq (399.39 us) which is 3.8% faster than
    too strict (415.29 us) which is probably (alpha=40.06%) same speed as
    simple (416.74 us)
    Saving times to times.flat

This suggests that using an enumeration indeed provides a large
performance advantage over the simple approach.
gasche added a commit to gasche/batteries-included that referenced this pull request Jan 30, 2021
…batteries-team#1006

Results on my machine:

    enumeration (89.71 us) is 77.5% faster than
    batseq (399.39 us) which is 3.8% faster than
    too strict (415.29 us) which is probably (alpha=40.06%) same speed as
    simple (416.74 us)
    Saving times to times.flat

This suggests that using an enumeration indeed provides a large
performance advantage over the simple approach.
gasche added a commit that referenced this pull request Jan 31, 2021
benchmark various implementations of Bench.to_seq discussed in #1006
@UnixJunkie
Copy link
Member

@jabot2 could you contribute the to_rev_seq versions?
I don't understand the code for to_seq (in batteries) and to_rev_seq (in the stdlib).

@UnixJunkie
Copy link
Member

@jabot2 this is required in batMap, batSplay and batSet for 'make test-compat' to pass.

@gasche
Copy link
Member

gasche commented Feb 26, 2021

I can write to_rev_seq versions, in exchange of someone else doing the release work.

@gasche
Copy link
Member

gasche commented Feb 26, 2021

(Ah, I had missed a recent email from @jabot2 indicated that he's interested in contributing. Of course I'm happy to let @jabot2 continue his nice contributions! It's better to broaden the pool of active contributors.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants