
Make list diff reuse _obj_diff results instead of making duplicate calls #74

Open
wants to merge 1 commit into master
Conversation

bfrobin446

The backtracking pass of the list diff algorithm in `_list_diff_0` was calling `_obj_diff` on pairs of objects that had already been diffed in the forward pass earlier in the `_list_diff` method.

When I create an array of `_obj_diff` results in the forward pass and reuse that array in the backtracking pass, I see about a 5x speedup on JSON files that contain lists of a few dozen large objects.

The backtracking pass of the list diff algorithm in `_list_diff_0` was
calling `_obj_diff` on pairs of objects that had already been diffed in
the forward pass earlier in the `_list_diff` method. This commit saves
the `_obj_diff` results from the forward pass and reuses them in the
backtracking pass so we don't duplicate the recursive calls.
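The idea behind the commit can be sketched as follows. This is a hedged simplification, not json-delta's actual code: `obj_diff` is a stand-in for the library's `_obj_diff`, and the table-filling pass here is a plain edit distance rather than the real `_list_diff` logic. The point is that every pairwise diff computed in the forward pass is saved and reused during backtracking instead of being recomputed:

```python
def obj_diff(a, b):
    """Hypothetical stand-in for json-delta's _obj_diff.

    Returns an empty diff when the values are equal, a non-empty
    one otherwise.
    """
    return [] if a == b else [(a, b)]


def list_diff(left, right):
    """Edit-distance list diff that memoizes pairwise diff results."""
    m, n = len(left), len(right)
    # Forward pass: fill the DP cost table AND a table of diff results.
    diffs = [[None] * n for _ in range(m)]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        cost[i][0] = i
    for j in range(n + 1):
        cost[0][j] = j
    for i in range(m):
        for j in range(n):
            diffs[i][j] = obj_diff(left[i], right[j])  # computed exactly once
            step = 0 if not diffs[i][j] else 1
            cost[i + 1][j + 1] = min(cost[i][j] + step,
                                     cost[i][j + 1] + 1,
                                     cost[i + 1][j] + 1)
    # Backtracking pass: reuse diffs[i][j] instead of re-calling obj_diff.
    ops = []
    i, j = m, n
    while i > 0 and j > 0:
        step = 0 if not diffs[i - 1][j - 1] else 1
        if cost[i][j] == cost[i - 1][j - 1] + step:
            if step:
                ops.append(('change', i - 1, diffs[i - 1][j - 1]))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            ops.append(('delete', i - 1))
            i -= 1
        else:
            ops.append(('insert', j - 1, right[j - 1]))
            j -= 1
    while i > 0:
        ops.append(('delete', i - 1))
        i -= 1
    while j > 0:
        ops.append(('insert', j - 1, right[j - 1]))
        j -= 1
    ops.reverse()
    return ops
```

When `obj_diff` is expensive (deeply nested objects), eliminating the duplicate calls in the backtracking pass is the whole win; the memo table trades O(m·n) extra memory for it.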
@corytodd
Collaborator

I'm running some benchmarks to see how this affects performance, and it looks like this change is noticeably slower, which is unexpected. Do you happen to have any benchmarks that you've run? I'd be interested in results for both memory usage and execution time.

I picked a random test that touches this code path, `test_long_arrays`, and bumped the size up to 10K. I see a very consistent 237 seconds for this PR versus a very consistent 161 seconds for the current master branch.

At first glance I suspect this is caused by preallocating the entire m × n 2D array, which requires touching every memory address in the array. The current algorithm is lazier: it only calls `_obj_diff` while it still has both a row and a column, and the two indices are allowed to decrement separately, so there is an opportunity for fewer than m × n iterations.
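A back-of-the-envelope way to see the asymmetry (an illustrative sketch of the simplified walk, not a measurement of json-delta itself): eager preallocation diffs every cell of the grid, while a lazy backtracking walk decrements the row index, the column index, or both on every step, so it can visit at most m + n cells.

```python
def count_eager_calls(m, n):
    # Preallocating the full table means one pairwise diff per cell.
    return m * n


def count_lazy_backtrack_calls(m, n):
    # Each backtracking step decrements the row index, the column index,
    # or both, so the walk visits at most m + n cells of the grid.
    i, j, calls = m, n, 0
    while i > 0 and j > 0:
        calls += 1           # one on-demand diff per visited cell
        i, j = i - 1, j - 1  # diagonal path shown; any path stays <= m + n
    return calls             # remaining pure inserts/deletes need no diff
```

So caching every cell only pays off when each individual diff is expensive enough to outweigh filling the whole table.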

@corytodd
Collaborator

Also, thanks for taking the time to make this PR @bfrobin446 :)

@bfrobin446
Author

I’d expect the difference to be the size and structure of the array entries. If the child objects are expensive to diff (perhaps especially if there are arrays at several different levels of nesting), the balance will shift in favor of avoiding duplicate work.

I can’t share the files I was profiling on (employer internal), but I’ll see if I can generate a set of files that shows the behavior I was seeing.

@bfrobin446
Author

Another option would be to replace the preallocated array with a dictionary that stores only the row/column combinations that the forward pass touched. Once I have a set of reproduction files that I can share, I'll see if that gets us a better compromise between performance in the normal case and performance in the pathological case.
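That compromise could look roughly like this (a hypothetical sketch: `make_memoized_diff` and `tracking_diff` are invented names, not json-delta APIs). Pairwise diffs are computed on demand and cached by `(i, j)`, so the backtracking pass reuses whatever the forward pass touched without preallocating the full m × n table:

```python
def make_memoized_diff(left, right, diff_fn):
    """Return a diff-at-index closure plus its sparse (i, j) -> diff cache."""
    cache = {}

    def diff_at(i, j):
        # Compute each pairwise diff at most once, only for touched cells.
        key = (i, j)
        if key not in cache:
            cache[key] = diff_fn(left[i], right[j])
        return cache[key]

    return diff_at, cache


# Usage: the forward pass and the backtracking pass share one closure.
calls = []

def tracking_diff(a, b):
    calls.append((a, b))  # record every real diff computation
    return [] if a == b else [(a, b)]

diff_at, cache = make_memoized_diff([1, 2], [1, 3], tracking_diff)
diff_at(0, 0)   # forward pass touches (0, 0) ...
diff_at(1, 1)   # ... and (1, 1)
diff_at(1, 1)   # backtracking reuses the cached (1, 1) result
```

After this sequence only two real diff computations have run, and the cache holds only the two touched cells rather than all four.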
