Stdlib compat set #1006

jabot2 · 2021-01-19T20:48:44Z

BatSet is missing several methods from Stdlib.Set.
Also, several existing methods in Stdlib.Set guarantee physical equality of resulting sets if possible.
This PR introduces the missing methods into BatSet, and adds the physical equality guarantees.
It also adds BatSet to the compattest.
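
To illustrate what a physical-equality guarantee means in practice, here is a minimal sketch against Stdlib.Set rather than BatSet. The module name `IntSet` and the sample values are made up for illustration. The behaviour shown is what the Stdlib documentation promises for `add`, `remove`, `filter` and `map` (as far as I know, on OCaml 4.04 or later).

```ocaml
(* Hypothetical example module; not part of the PR. *)
module IntSet = Set.Make (struct
  type t = int
  let compare = compare
end)

let s = IntSet.of_list [1; 2; 3]

(* Each operation below leaves the set unchanged, so Stdlib.Set
   returns the very same value (==) instead of rebuilding the tree. *)
let same_after_add = IntSet.add 2 s == s
let same_after_remove = IntSet.remove 42 s == s
let same_after_filter = IntSet.filter (fun x -> x > 0) s == s
let same_after_map = IntSet.map (fun x -> x) s == s

(* Adding a new element has to build a new set. *)
let same_after_real_add = IntSet.add 4 s == s
```

Callers can use this to detect "nothing changed" cheaply with `==`, without comparing the sets structurally.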

UnixJunkie · 2021-01-20T00:47:37Z

Thanks for this contribution.
Note that all new code must come with unit tests.

jabot2 · 2021-01-20T06:59:26Z

@UnixJunkie I thought I had added unit tests for everything? I will check again...

gasche

Thanks, this is nice work (again).

Note: for functorized modules (Set and Map) we also have non-inline tests in testsuite/test_<module>.ml, so that the tests can be performed on all implementations (for functions that are available everywhere). I have mixed thoughts about adding inline tests for some functions (if they could be tested this way): inline tests are more convenient, but they either force us to add redundancy (copying the tests for several interfaces) or they test less. (But there probably should have been a comment in the .ml file pointing at where the tests are.) I would have a slight preference for moving the tests to testsuite/test_set.ml when it makes sense, but I'm interested in what you think about this.
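
The trade-off described above can be sketched as follows: a test written once as a functor over a shared signature, then applied to several implementations. This is only an illustration of the idea. The names `INT_SET`, `TestSet`, `TreeSet` and `ListSet` are hypothetical, and the real testsuite/test_set.ml may be organised quite differently.

```ocaml
(* The subset of the interface that every implementation provides. *)
module type INT_SET = sig
  type t
  val empty : t
  val add : int -> t -> t
  val mem : int -> t -> bool
  val cardinal : t -> int
end

(* Tests written once, against the shared signature only. *)
module TestSet (S : INT_SET) = struct
  let run () =
    let s = S.add 2 (S.add 1 (S.add 2 S.empty)) in
    assert (S.mem 1 s);
    assert (not (S.mem 3 s));
    assert (S.cardinal s = 2)
end

(* Two stand-in implementations: a Stdlib tree set and a naive list set. *)
module TreeSet = Set.Make (struct
  type t = int
  let compare = compare
end)

module ListSet = struct
  type t = int list
  let empty = []
  let mem = List.mem
  let add x s = if mem x s then s else x :: s
  let cardinal = List.length
end

module TestTree = TestSet (TreeSet)
module TestList = TestSet (ListSet)

let () =
  TestTree.run ();
  TestList.run ()
```

An inline test placed next to one implementation would only exercise that implementation, which is the redundancy-or-less-coverage choice mentioned above.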

gasche · 2021-01-20T07:30:11Z

src/batSet.ml

+    then join l v r
+    else union cmp l (add cmp v r)
+
+  let rec map_stdlib cmp f = function


Why did you decide to keep the older, less efficient implementations of map and filter_map around, and name the new ones *_stdlib to avoid name conflicts? Would it be correct to completely replace the previous functions with the new implementations?

The map and filter_map functions from the stdlib take a function elt -> elt, while the functions in Batteries take a function elt -> 'b and produce a 'b set (for the polymorphic sets).
So it seems we must either keep the old versions or break compatibility for people who currently rely on map/filter_map to create sets of a different element type.

Ah yes, thanks. I would propose to call them map_endo or endo_map (from "endo{morphism,function}", a function from a set to itself) or homogeneous_map rather than map_stdlib.

The implementation of map_stdlib and filter_map_stdlib that you propose is more efficient than the existing map and filter_map functions in the case where the mapping function preserves the order, or "nearly" preserves the order. Would you possibly be interested in using the same implementation with try_join for the existing/heterogeneous map functions? Note that the == trick cannot be reused, as it assumes that the input and output types are equal.
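
As a rough illustration of why the == trick needs equal input and output types, here is a minimal sketch on a plain binary tree. It is not the BatSet code: the type `tree` and this `map_endo` are hypothetical, there is no balancing, and the sketch rebuilds nodes in place instead of calling try_join, so it is only valid when `f` preserves the ordering.

```ocaml
(* Hypothetical, simplified tree; the real BatSet nodes also store a height. *)
type 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* Endomorphic map: because of the [v == v'] and [l == l'] comparisons,
   the type checker forces [f : 'a -> 'a]. If nothing changed, the
   original node [t] is returned instead of a fresh copy. *)
let rec map_endo f t =
  match t with
  | Empty -> Empty
  | Node (l, v, r) ->
    let l' = map_endo f l in
    let v' = f v in
    let r' = map_endo f r in
    if l == l' && v == v' && r == r' then t
    else Node (l', v', r') (* the real code would use try_join here *)

let t = Node (Node (Empty, 1, Empty), 2, Node (Empty, 3, Empty))
let unchanged = map_endo (fun x -> x) t == t
let shifted = map_endo (fun x -> x + 1) t
```

With a function of type 'a -> 'b the comparison `v == v'` would not even typecheck, which is why the heterogeneous map cannot share the input set this way.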

To give credit where it is due: the map_stdlib and filter_map_stdlib methods are copied (mostly) verbatim from the stdlib.

I rewrote the map and filter_map functions to use try_join. I'm not convinced that makes them faster though...

gasche · 2021-01-20T09:05:03Z

src/batSet.ml

+    match m with
+    | Empty -> Seq.empty
+    | Node(l, v, r, _) -> 
+       Seq.append (to_seq l) (Seq.cons v (to_seq r))


Two remarks:

This implementation does not do what I would expect due to the use of the strict function Seq.cons: it will traverse the whole set before returning the first element. This is not technically wrong as there is no order-of-evaluation difference (and probably not much of a performance difference, although it is worth measuring), but I would find it more natural to have a version that lazily enumerates the set.

We have a proper datatype to represent partial enumerations of sets, 'a iter and cons_iter.

I would expect something looking roughly like (untested):

    let rec iter_to_seq = function
      | E -> Seq.empty
      | C (e, r, t) -> fun () -> Seq.Cons (e, iter_to_seq (cons_iter r t))

    let to_seq s = iter_to_seq (cons_iter s E)

(to_seq_from also needs an adaptation.)

I would be curious to know what is the performance difference between your version, a version using iter as above, and also a modified version of your code that avoids Seq.cons to avoid forcing the tail (but otherwise traverses the set recursively in the direct way).
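
For readers without the BatSet source at hand, the iterator-based suggestion can be made self-contained roughly as follows. The `tree`, `iter` and `cons_iter` definitions below are assumptions modelled on the names used in this comment (the real BatSet node also carries a height field), so treat this as a sketch rather than the library code.

```ocaml
(* Assumed, simplified definitions; not copied from batSet.ml. *)
type 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* A partial enumeration: next element, its right subtree, and the rest. *)
type 'a iter = E | C of 'a * 'a tree * 'a iter

(* Push the leftmost path of [s] onto the enumeration [t]. *)
let rec cons_iter s t =
  match s with
  | Empty -> t
  | Node (l, e, r) -> cons_iter l (C (e, r, t))

(* Turn an enumeration into a sequence; the right subtree of an element
   is only unfolded when that element's node is forced. *)
let rec iter_to_seq it : 'a Seq.t =
  fun () ->
    match it with
    | E -> Seq.Nil
    | C (e, r, t) -> Seq.Cons (e, iter_to_seq (cons_iter r t))

let to_seq s = iter_to_seq (cons_iter s E)

let example =
  Node (Node (Empty, 1, Empty), 2, Node (Node (Empty, 3, Empty), 4, Empty))
let example_list = List.of_seq (to_seq example)
```

Note that `to_seq` here still walks the leftmost path once when it is called; only the rest of the traversal is deferred.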

I modified to_seq and to_seq_from to be more lazy.
They will not do anything on invocation (except returning a closure).
Furthermore, once that closure is called, it will only go down the leftmost path in the tree to determine the first element; recursing into the right subtrees is again done in a closure.

I hope that satisfies your expectations of these functions.
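
The behaviour described above can be sketched on a simplified tree as follows. This is not the code from the PR: the type `tree` and the helper `to_seq_k` are hypothetical names, and the sketch uses Stdlib's Seq (OCaml 4.07 or later) instead of BatSeq.

```ocaml
(* Hypothetical, simplified tree; the real BatSet nodes also store a height. *)
type 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* [to_seq_k t k] enumerates [t] in increasing order, then continues
   with [k]. Applying it only builds a closure; the leftmost path is
   walked when that closure is called, and each right subtree is
   wrapped in another closure. *)
let rec to_seq_k t (k : 'a Seq.t) : 'a Seq.t =
  fun () ->
    match t with
    | Empty -> k ()
    | Node (l, v, r) ->
      to_seq_k l (fun () -> Seq.Cons (v, to_seq_k r k)) ()

let to_seq t = to_seq_k t Seq.empty

let example = Node (Node (Empty, 1, Empty), 2, Node (Empty, 3, Empty))
let example_list = List.of_seq (to_seq example)
```

Calling `to_seq example` does no traversal at all; forcing the first node visits only the nodes on the path down to `1`.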

They do, thanks! I would still be curious to know whether the 'a iter version is faster, but I don't have the time to look at this right now and maybe you are not so curious yourself, so the new versions are fine with me.

jabot2 · 2021-01-20T20:10:53Z

I started moving some testcases to the testsuite. I'm not done yet, sorry...

UnixJunkie · 2021-01-22T00:44:54Z

If some tests are in another file, it would be nice to have a comment
just after each function tested in this way, saying
where the test is, so that people are not surprised that there is
no unit test in comments just after an implementation (and they know right away
where to look for it).

jabot2 · 2021-01-22T20:29:24Z

@UnixJunkie After checking again: I think I have unit tests for every function that I touched.
@gasche I renamed the _stdlib functions to _endo.

Do you think that there is anything else to do?

gasche · 2021-01-22T21:08:37Z

Let's go ahead and merge. Thanks for the nice work.

…batteries-team#1006

Results on my machine:

    enumeration (89.71 us) is 77.5% faster than
    batseq (399.39 us) which is 3.8% faster than
    too strict (415.29 us) which is probably (alpha=40.06%) same speed as
    simple (416.74 us)
    Saving times to times.flat

This suggests that using an enumeration indeed provides a large performance advantage over the simple approach.

benchmark various implementations of Bench.to_seq discussed in #1006

UnixJunkie · 2021-02-25T02:43:36Z

@jabot2 could you contribute the to_rev_seq versions?
I don't understand the code for to_seq (in batteries) and to_rev_seq (in the stdlib).

UnixJunkie · 2021-02-25T02:48:44Z

@jabot2 this is required in batMap, batSplay and batSet for 'make test-compat' to pass.

gasche · 2021-02-26T07:53:35Z

I can write the to_rev_seq versions, in exchange for someone else doing the release work.

gasche · 2021-02-26T07:54:27Z

(Ah, I had missed a recent email from @jabot2 indicating that he's interested in contributing. Of course I'm happy to let @jabot2 continue his nice contributions! It's better to broaden the pool of active contributors.)

jakob krainz added 5 commits January 18, 2021 22:42

started compat changes to BatSet

74e3394

bugfix

939e4e9

added entry to changelog; mentioned authors of original implementations

40a7bfa

remaining changes

5b92d49

merge master

8e3a668

UnixJunkie added the changes requested label Jan 20, 2021

gasche reviewed Jan 20, 2021

Jakob Krainz added 2 commits January 20, 2021 19:39

cleanup

89ca5c5

moved some testcases to testsuite

16ab5ac

jakob krainz added 3 commits January 21, 2021 19:57

replace Seq with BatSeq for compat with <= 4.10, broke dependency cha…

5c08dda

…in BatSeq -> BatList -> BatSet -> BatSeq

moved some tests to testsuite, added more tests

b50457f

rewrote map and filter_map

8c2e059

Jakob Krainz and others added 3 commits January 22, 2021 08:29

change Seq to batSeq for compat with <4.07

7814e2e

rename map_stdlib, filter_map_stdlib to map_endo, filter_map_endo

bd75b85

fix documentation

60e5006

gasche merged commit 46bc18c into ocaml-batteries-team:master Jan 22, 2021

gasche added a commit that referenced this pull request Jan 31, 2021

Merge pull request #1010 from gasche/benchmark-Set.to_seq

f723cff

benchmark various implementations of Bench.to_seq discussed in #1006

gasche mentioned this pull request Jan 31, 2021

BatSet: implement to_seq{,_from} using the iteration structure #1011

Closed
