First-class random access API for KnnVectorValues #13779

msokolov · 2024-09-12T18:43:23Z

addresses #13778

Key things in this PR:

Introduces abstract KnnVectorValues from which ByteVectorValues and FloatVectorValues derive.
Folds RandomAccessVectorValues into KnnVectorValues, thus eliminating some casts.
RandomAccessVectorValues.Floats becomes FloatVectorValues and RandomAccessVectorValues.Bytes becomes ByteVectorValues. RandomAccessQuantizedByteVectorValues is folded into QuantizedByteVectorValues.
IndexInput getSlice() moved to a new HasIndexSlice interface.
Introduces VectorEncoding KnnVectorValues.getEncoding() to enable type-specific branches in a few places where we are dealing with abstract KnnVectorValues (tests only IIRC). Also used that to provide a default getVectorByteLength().
KnnVectorValues no longer extends DocIdSetIterator; rather it provides a tightly-coupled iterator(). This enables refactoring common iteration patterns that were repeated many times in the code base. This new iterator, DocIndexIterator, provides an additional method index() analogous to IndexedDISI.
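
For readers skimming the description, here is a minimal sketch of the iteration pattern the last item describes: advance by doc, then fetch the vector by the iterator's index(). The classes SketchDocIndexIterator, SketchFloatVectors and IterationSketch are hypothetical stand-ins written for this illustration, not the classes in this PR, and they assume a dense in-memory float[][] as backing storage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the iterator described above: walks doc IDs and
// exposes the ordinal (index) of the vector at the current position.
final class SketchDocIndexIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] ordToDoc; // increasing doc IDs, one per vector ordinal
  private int ord = -1;

  SketchDocIndexIterator(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  int nextDoc() {
    ord++;
    return ord < ordToDoc.length ? ordToDoc[ord] : NO_MORE_DOCS;
  }

  int index() {
    return ord;
  }
}

// Hypothetical stand-in for a vector source with random access by ordinal.
final class SketchFloatVectors {
  private final float[][] vectors;
  private final int[] ordToDoc;

  SketchFloatVectors(float[][] vectors, int[] ordToDoc) {
    this.vectors = vectors;
    this.ordToDoc = ordToDoc;
  }

  float[] vectorValue(int ord) {
    return vectors[ord];
  }

  // A fresh iterator on every call; no state is shared between callers.
  SketchDocIndexIterator iterator() {
    return new SketchDocIndexIterator(ordToDoc);
  }
}

final class IterationSketch {
  // The common loop: iterate docs, read each vector through iterator.index().
  static List<String> describe(SketchFloatVectors values) {
    List<String> out = new ArrayList<>();
    SketchDocIndexIterator it = values.iterator();
    for (int doc = it.nextDoc(); doc != SketchDocIndexIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      float[] v = values.vectorValue(it.index());
      out.add("doc=" + doc + " ord=" + it.index() + " first=" + v[0]);
    }
    return out;
  }
}
```

The point of the sketch is only that the doc ID and the ordinal travel together on the iterator, so the values object itself needs no iteration state.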

Some of the methods on KnnVectorValues have default impls that throw UnsupportedOperationException, enabling subclasses to provide partial implementations and relying on testing to catch missing required methods. I'd like feedback on this. Should we provide implementations we never use, just to make these classes complete? That didn't make sense to me. But the previous alternative of attempting to provide strict adherence to declarative contracts was becoming, in my view, overly restrictive and leading to hard-to-maintain code. Some of these readers would only ever be used iteratively. Random access is required for search, but not used when merging the values themselves, and when we merge we do search, but using a temporary file so that searching is always done over a file-based value. Random access also gets used during merging when the index is sorted; again, this is provided by specialized readers, so not every reader needs to implement random access. But the API maintenance is greatly simplified if we allow partial implementation. Anyway, that is the idea I am trying out here. Can we live with a little less API purity and gain some simplicity and less boilerplate?
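
To make the trade-off concrete, a minimal sketch of "partial implementation" follows. All class names here (PartialVectorSource, InMemoryVectorSource, CountOnlyVectorSource, PartialImplSketch) are hypothetical and exist only for this illustration; the sketch assumes random access is the optional capability.

```java
// Hypothetical base class: random access is optional and throws by default.
abstract class PartialVectorSource {
  abstract int size();

  // Subclasses that support random access override this method.
  float[] vectorValue(int ord) {
    throw new UnsupportedOperationException(
        "random access not supported by " + getClass().getSimpleName());
  }
}

// Supports random access, so it overrides vectorValue.
final class InMemoryVectorSource extends PartialVectorSource {
  private final float[][] vectors;

  InMemoryVectorSource(float[][] vectors) {
    this.vectors = vectors;
  }

  @Override
  int size() {
    return vectors.length;
  }

  @Override
  float[] vectorValue(int ord) {
    return vectors[ord];
  }
}

// Only ever asked for its size, so it leaves vectorValue at the default.
final class CountOnlyVectorSource extends PartialVectorSource {
  private final int size;

  CountOnlyVectorSource(int size) {
    this.size = size;
  }

  @Override
  int size() {
    return size;
  }
}

final class PartialImplSketch {
  // The kind of probe a test might use to catch a missing required method.
  // Assumes the source holds at least one vector.
  static boolean supportsRandomAccess(PartialVectorSource source) {
    try {
      source.vectorValue(0);
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }
}
```

The cost of this design is that a missing override surfaces at run time rather than at compile time, which is why the question about relying on tests matters.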

Notes for reviewers:

There is a lot of code change here, but much of it is repetitive. I recommend starting with KnnVectorValues and checking its DocIndexIterator inner class. The rest of the changes are basically consequences of introducing those abstractions in place of the Random*Values we removed.

msokolov · 2024-09-12T18:51:10Z

another concern I have is how this would impact ongoing work to enable multiple vectors per doc/field. There would almost certainly be conflicts with that PR on the surface, but I hope this could actually simplify things in that the new DocIndexIterator class could be enhanced / extended to provide access to a series of values (maybe a list or array?) instead of (or in addition to?) a single one, possibly centralizing the required changes (since we have many fewer iterator implementations after this change).

benwtrent · 2024-09-12T19:41:59Z

but I hope this could actually simplify things

That is my intuition as well.

jpountz

I left a few thoughts/questions. In general, I see how such a random-access API change could help with e.g. your BP reordering work and be valuable in general. I was wondering if this API may be too tailored to HNSW and prevent us from supporting other interesting algorithms, but actually I don't think that this is the case?

jpountz · 2024-09-12T20:23:57Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access
+   * different values at once, to avoid overwriting the underlying vector returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;


I wonder if we could make the API a bit nicer by removing this copy() and instead have something like a FloatVectorDictionary { float[] vectorValue(int ord); } and a method here that can return a new FloatVectorDictionary (a bit like SortedDocValues and TermsEnum).

The way SortedDocValuesTermsEnum is, calling its next method will overwrite the internal buffer of the SortedDocValues on which it is built, defeating the purpose of copy(), which is to provide two completely independent sources. Another thing we could do is to add vectorValue(int ord, float[] scratch), allowing the caller to provide the memory to write to. If we had that, we wouldn't need copy(). Maybe we could manage to squeeze that into 10.0 too, but I'd rather do it in a separate PR.

But if you call SortedDocValues#termsEnum twice, this would give you two independent sources of terms?

I always found copy very strange, but I get why it is there. I'd be tempted to leave it as is in this PR, changing the access model and cache of 1 float[] will be a bit tricky.
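
A small sketch of the hazard this thread is about, assuming a reader that reuses one internal buffer for every returned vector. SharedBufferVectors is a hypothetical class made up for this illustration; its two-argument vectorValue mirrors the vectorValue(int ord, float[] scratch) variant floated above, and its copy() shows why an independent instance avoids the overwrite.

```java
// Hypothetical reader that returns the same internal buffer on every call.
// Assumes at least one stored vector, all of equal length.
final class SharedBufferVectors {
  private final float[][] stored;
  private final float[] buffer;

  SharedBufferVectors(float[][] stored) {
    this.stored = stored;
    this.buffer = new float[stored[0].length];
  }

  // Returns the internal buffer: the next call overwrites its contents.
  float[] vectorValue(int ord) {
    System.arraycopy(stored[ord], 0, buffer, 0, buffer.length);
    return buffer;
  }

  // Caller-supplied storage: nothing is shared, at the price of a copy.
  float[] vectorValue(int ord, float[] scratch) {
    System.arraycopy(stored[ord], 0, scratch, 0, scratch.length);
    return scratch;
  }

  // An independent instance with its own buffer over the same stored data.
  SharedBufferVectors copy() {
    return new SharedBufferVectors(stored);
  }
}
```

With a single instance, holding the result of vectorValue(0) while calling vectorValue(1) leaves both references pointing at vector 1; either copy() or caller-supplied scratch storage keeps the two values apart.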

jpountz · 2024-09-12T20:26:54Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+    if (iterator == null) {
+      iterator = createIterator();
+    }
+    return iterator;


Could we make this return a new iterator every time to make the API a bit nicer? From a quick look, it seems that call sites could easily be adjusted to not rely on this method returning a shared instance?

Let me try - I was also a bit unhappy about this, but at one point along this journey I was relying on being able to recover the shared state - maybe I finally was able to get rid of that and just didn't notice!

a new iterator would be cleaner, if the use sites allow for it.

jpountz · 2024-09-12T20:30:19Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+   * Creates an iterator from this instance's ordinal-to-docid mapping which must be monotonic
+   * (docid increases when ordinal does).
+   */
+  protected DocIndexIterator fromOrdToDoc() {


nit: could we make it look a bit more like DocIdSetIterator#all by moving it to DocIndexIterator#all?

ah, you mean rename this method to all? sure, makes sense

jpountz · 2024-09-12T20:32:24Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+    @Override
+    public int advance(int target) throws IOException {
+      return slowAdvance(target);
+    }


This looks like it could be a performance trap, which is why DocIdSetIterator offers this helper method without making it the default impl. Should we leave it without a default impl here too?

yes, I don't think anything relies on this, makes sense
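
To illustrate why a linear-scan default can be a trap, here is a minimal sketch comparing a scan-based advance with a binary-search one over a sorted array of doc IDs. AdvanceSketch and both methods are hypothetical helpers for this illustration; they return a position in the array rather than a doc ID, and assume the doc IDs are sorted and unique.

```java
import java.util.Arrays;

final class AdvanceSketch {
  // Linear scan, in the spirit of a slowAdvance-style helper: the cost grows
  // with the distance between the current position and the target.
  static int linearAdvance(int[] docs, int fromIndex, int target) {
    int i = fromIndex;
    while (i < docs.length && docs[i] < target) {
      i++;
    }
    return i; // docs.length means exhausted
  }

  // Binary search over the same sorted doc IDs: logarithmic in the range size.
  static int binaryAdvance(int[] docs, int fromIndex, int target) {
    int pos = Arrays.binarySearch(docs, fromIndex, docs.length, target);
    // A negative result encodes the insertion point as -(point) - 1.
    return pos >= 0 ? pos : -pos - 1;
  }
}
```

Both methods land on the first position whose doc ID is at least the target; only the amount of work differs, which is what an implementer has to decide on when no default is provided.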

jpountz · 2024-09-12T20:33:04Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+
+    @Override
+    public long cost() {
+      throw new UnsupportedOperationException();


Likewise here, I'd rather leave it unimplemented to force implementers to decide if having cost() throw an exception is fine. Presumably, most of the time it's not.

hmm I think cost() is rarely used in the vector reader/writers which instead are concerned with KnnVectorValues.size() -- they typically want to know how many vector values there are and to the extent they care about the number of docs it's only when they must iterate through all of them and have no use for an estimate. These iterators aren't really used during searching?

If we default cost() to returning size(), that would work for me. But I'm not comfortable with having implementations of DocIdSetIterator#cost that may throw, which means e.g. that they cannot be put in a ConjunctionDISI (which will want to sort its clauses by cost).

+1. Even in FloatVectorValues cost() is returning size() only. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java#L46-L48

I agree here. Either it should default to size() via some provided dependency or it shouldn't implement at all and force sub-classes.
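
The concern about a throwing cost() can be sketched as follows. CostedIterator, SizeBackedIterator, ThrowingCostIterator and CostSketch are hypothetical classes for this illustration; the sort stands in for any consumer that orders clauses by cost and is not the actual conjunction code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical minimal iterator contract: only cost() matters for this sketch.
abstract class CostedIterator {
  abstract long cost();
}

// cost() backed by the number of vectors, as suggested in the thread.
final class SizeBackedIterator extends CostedIterator {
  private final int size;

  SizeBackedIterator(int size) {
    this.size = size;
  }

  @Override
  long cost() {
    return size;
  }
}

// cost() that throws, as in the default under discussion.
final class ThrowingCostIterator extends CostedIterator {
  @Override
  long cost() {
    throw new UnsupportedOperationException();
  }
}

final class CostSketch {
  // Stand-in for a consumer that orders its clauses cheapest first.
  static List<CostedIterator> cheapestFirst(List<CostedIterator> clauses) {
    List<CostedIterator> sorted = new ArrayList<>(clauses);
    sorted.sort(Comparator.comparingLong(CostedIterator::cost));
    return sorted;
  }
}
```

Sorting two size-backed iterators works as expected, while a list that contains a throwing iterator fails as soon as the comparator asks for its cost.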

jpountz · 2024-09-12T20:45:35Z

Am I guessing correctly that you're targeting 10.0 for this change?

msokolov · 2024-09-12T21:00:57Z

Thanks for the quick review! I will get started on addressing. As for timeline for this change, it would definitely be convenient to get in to 10.0 release. I think you had said 9/22 would be a feature freeze date; it seems we could possibly meet that timeline. I will be traveling starting tomorrow for a week, but I should be able to put in some quality time on the plane LOL

jpountz · 2024-09-13T05:50:22Z

lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java

-      public byte[] vectorValue() throws IOException {
-        return current.values.vectorValue();
+      public byte[] vectorValue(int ord) throws IOException {
+        return current.values.vectorValue(current.values.iterator().index());


This part feels a bit hacky, could we instead merge the ord->vector mappings of the various vector values by concatenating them?

Maybe we can enhance DocIDMerger by adding random access to it

jpountz · 2024-09-13T10:43:39Z

think you had said 9/22 would be a feature freeze date

I was thinking of doing it next week, but we can backport this PR even though the branch has been cut if it looks ready/safe.

ChrisHegarty

I really like this change. I see a lot of refactoring similar to what I half started at one point or the other, but never finished. There are some specific comments to be addressed, but otherwise the approach LGTM.

ChrisHegarty · 2024-09-13T15:47:18Z

lucene/core/src/java/org/apache/lucene/codecs/lucene95/HasIndexSlice.java

-  @Override
-  RandomAccessQuantizedByteVectorValues copy() throws IOException;
+  /** Returns an IndexInput from which to read this instance's values. */
+  IndexInput getSlice();


I very much like this, and had something similar in a past unmarked PR. 👍

ChrisHegarty · 2024-09-13T15:49:25Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access
+   * different values at once, to avoid overwriting the underlying vector returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;


I always found copy very strange, but I get why it is there. I'd be tempted to leave it as is in this PR, changing the access model and cache of 1 float[] will be a bit tricky.

ChrisHegarty · 2024-09-13T15:52:31Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+    if (iterator == null) {
+      iterator = createIterator();
+    }
+    return iterator;


a new iterator would be cleaner, if the use sites allow for it.

msokolov · 2024-09-16T14:09:13Z

I pushed a new revision here addressing some of the major comments:

KnnVectorValues.iterator() now generally provides a new iterator; no caching is done. I removed createIterator(). Main impact was on VectorScorer (and in tests) where we now create iterators and store them locally. This is much better; thanks for the feedback.
I added implementations for advance() and got rid of the default impl.
I removed impls of cost() and added a default impl that throws UOE. This method is only ever used during search() and most of these values sources will never be searched. The exceptions are those that can be used by the ValueSource API: basically the indexed values returned by a reader. We have lots and lots of other values impls that are used during indexing for which we don't need cost. I briefly considered separating these new iterators from DISI, but that ended up in some trouble.
re: getVectorByteLength() @ChrisHegarty is correct that this is needed as it is today. We could in theory make it final (or inline it whatever) if we added some more VectorEncodings to represent the compressed cases, but I'm inclined to leave it as is. This way we could in theory support a variable size encoding? And anyway it isn't clear we want to mix up the "encoding" with compression?

I didn't have a chance to look seriously at removing copy() API. I don't think we ought to create a simple wrapper though since afaict it would require an additional memory copy of every vector value.

msokolov · 2024-09-16T14:32:13Z

OK there seem to be some test failures ... I did a complete run, but randomized testing always seems to ferret out something interesting!

Actually those really should have failed on any test run -- not sure how I missed them, oops

msokolov · 2024-09-16T15:16:26Z

Regarding the rename of fromOrdToDoc to all, I think it was not helpful and plan to revert or maybe come up with some other name. The problem is we also have createDenseIterator, which is also all. Essentially we have Sparse and Dense all-iterators. Maybe instead of fromOrdToDoc we can say createSparseIterator?

jpountz · 2024-09-16T16:22:41Z

FWIW I started playing with removing copy() by replacing it with a factory method for a dictionary: msokolov@ae7aca3. Not sure how far I'll go. :)

msokolov · 2024-09-16T16:34:03Z

I also just started trying to replace copy() with the approach of adding vectorValue(int ord, float[] outValue) although this does add a copy operation in some cases where previously we would expose internal storage so I'm not sure it's great

benwtrent

I like the simplicity we gain here.

benwtrent · 2024-09-18T13:21:59Z

lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java

+        // FIXME what can we assert here?
+        // assert ord == iterator.index();
+        return current.values.vectorValue(current.index());


This is the biggest "gotcha" in this whole thing I think. Does DocIdMerger allow random access at all?

It seems like vectorValue(ord) should also be able to jump between current sub iterators and move backwards and forwards. But, I don't think DocIdMerger allows backwards movement at all.

I think unless we can fix DocIdMerger, we should throw an error here indicating that only forward iteration is allowed.

I had thought we'd fix it by concatenating the vector dictionaries of each segment. Then based on the requested vector ordinal, you could compute the segment that the vector belongs to via ReaderUtil#subIndex.

There is something similar here in SlowCompositeCodecReaderWrapper - binary search across sub-values
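
The concatenation idea can be sketched like this: keep the first global ordinal of each segment in an array and binary-search it to find the segment that owns a requested ordinal. ConcatenatedOrdinalsSketch and its methods are hypothetical; subIndex here is a hand-written helper in the spirit of the ReaderUtil#subIndex lookup mentioned above, not that method itself, and segments are modelled as plain float[][] arrays.

```java
final class ConcatenatedOrdinalsSketch {
  // starts[i] is the first global ordinal of segment i; the last entry is the total.
  static int[] starts(int[] segmentSizes) {
    int[] starts = new int[segmentSizes.length + 1];
    for (int i = 0; i < segmentSizes.length; i++) {
      starts[i + 1] = starts[i] + segmentSizes[i];
    }
    return starts;
  }

  // Binary search for the last segment whose start is <= ord, so that empty
  // segments sharing a start with a non-empty one are skipped.
  static int subIndex(int ord, int[] starts) {
    int lo = 0;
    int hi = starts.length - 2;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (starts[mid] <= ord) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }

  // Random access across segments: resolve the segment, then the local ordinal.
  static float[] vectorValue(int ord, float[][][] segments, int[] starts) {
    int seg = subIndex(ord, starts);
    return segments[seg][ord - starts[seg]];
  }
}
```

Because the lookup depends only on the requested ordinal, it can move backwards as well as forwards, which is the property the forward-only merger lacks.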

benwtrent · 2024-09-18T13:25:35Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+
+    @Override
+    public long cost() {
+      throw new UnsupportedOperationException();


I agree here. Either it should default to size() via some provided dependency or it shouldn't implement at all and force sub-classes.

benwtrent · 2024-09-18T13:27:40Z

lucene/core/src/java/org/apache/lucene/util/quantization/QuantizedByteVectorValues.java

+  public QuantizedByteVectorValues copy() throws IOException {
+    return this;
  }


defaulting copy to this is dangerous. I would recommend against it.

benwtrent · 2024-09-18T13:34:30Z

lucene/core/src/java/org/apache/lucene/util/quantization/ScalarQuantizer.java

@@ -269,11 +270,12 @@ static ScalarQuantizer fromVectors(
    if (totalVectorCount == 0) {
      return new ScalarQuantizer(0f, 0f, bits);
    }
+    KnnVectorValues.DocIndexIterator iterator = floatVectorValues.iterator();


I think this is ok for now, but this quantization code can be made much simpler if indeed we can randomly access even across various merged doc sub iterators.

benwtrent · 2024-09-18T13:39:31Z

lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java

+  public FloatVectorValues copy() throws IOException {
+    return this;


We shouldn't default to this; that is dangerous. If an off-heap implementation that assumes caching and doesn't allow multi-threaded access extends this class but forgets to override copy(), we are in a bad place.

benwtrent · 2024-09-18T13:43:38Z

lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java

+  public ByteVectorValues copy() throws IOException {
+    return this;


again, this default is dangerous to me.

jpountz · 2024-09-18T15:00:55Z

I iterated a bit on my branch, so that there is no more call site for FloatVectorValues#copy: https://github.com/msokolov/lucene/compare/knn-vector-random...jpountz:lucene:knn-vector-random?expand=1. The API looks better to me this way, but I'm keen on getting feedback before pursuing as this is quite a tedious refactoring.

msokolov · 2024-09-18T15:04:31Z

FWIW I tried removing copy() and using caller-supplied storage in vectorValue. In many ways this looks nicer, but it leads to substantial slowdown in indexing/merging because of the additional copy required, so I think it's not really viable. We could provide both vectorValue(int) and vectorValue(int, float[]) and avoid copy() that way. Maybe that's best

msokolov · 2024-09-18T21:27:29Z

ooh I just saw the Dictionary branch - that looks like a nice approach, I don't think I really understood what you were proposing before. One Q I have is can we remove copy()? Do we need to deprecate it first -- and if so, could we deprecate in 9x branch? Or perhaps it is too late for that

benwtrent · 2024-09-18T21:36:02Z

One Q I have is can we remove copy()? Do we need to deprecate it first -- and if so, could we deprecate in 9x branch?

You are already removing RandomAccessVectorValues, which is the thing that used copy(), so, I am not sure what there is to deprecate 😂

msokolov · 2024-09-19T00:23:09Z

I'll post one more iteration here addressing the concerns about dangerous default impls that adds back impls of copy() and cost(). I also added a test-and-throw ensuring that the vectorValues impls that require forward-iteration enforce it. We can fully implement random access later without breaking any APIs.
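
A minimal sketch of what a test-and-throw guard for forward-only access could look like. ForwardOnlyVectors is a hypothetical class, and both the exception type and the message are assumptions made for this illustration; the actual check in the PR may differ.

```java
// Hypothetical wrapper that only permits non-decreasing ordinals.
final class ForwardOnlyVectors {
  private final float[][] vectors;
  private int lastOrd = -1;

  ForwardOnlyVectors(float[][] vectors) {
    this.vectors = vectors;
  }

  float[] vectorValue(int ord) {
    // Test-and-throw: reject any attempt to move backwards.
    if (ord < lastOrd) {
      throw new IllegalStateException(
          "only forward iteration is supported: ord=" + ord + " < lastOrd=" + lastOrd);
    }
    lastOrd = ord;
    return vectors[ord];
  }
}
```

Failing loudly keeps the method signature compatible with full random access later, since callers that already move only forward are unaffected when the restriction is lifted.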

I also think we should go ahead with Adrien's Dictionary idea, but do this in two steps because there is a lot going on here already.

benwtrent · 2024-09-19T11:20:31Z

The dictionary idea is OK, but I still don't see how it removes copy(). Besides the caching of values, copy gives us multi-threaded safety by copying the underlying index readers. Otherwise we are using the same reader between threads. For concurrent merging of graphs, this is important.

I agree, any further refactoring should be done in another PR.

msokolov · 2024-09-19T23:11:57Z

I think the idea w/Dictionary is that callers, instead of calling copy().vectorValue(int ord) would call dictionary().vectorValue(int ord). So then the scratch vector storage (if needed) would be in the Dictionary not in the VectorValues, and thus not shared by multiple users of the same values instance. In some sense it's not very different, but in the sense that the Dictionary has a much more limited API than the source it came from, it is different.

jpountz · 2024-09-20T09:37:05Z

Exactly. I tried to model it similarly to what doc values do, where SortedDocValues#termsEnum() returns a dictionary with a different backing IndexInput clone on every call.
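
A rough sketch of the dictionary idea as described in the last two comments: each call to dictionary() hands out a narrow view with its own scratch storage, so two users of the same values instance never share a buffer. SketchVectorDictionary and DictionaryBackedVectors are hypothetical names for this illustration, and the in-memory float[][] stands in for whatever cloned input a real implementation would read from.

```java
// Hypothetical narrow API: lookup by ordinal and nothing else.
interface SketchVectorDictionary {
  float[] vectorValue(int ord);
}

// Hypothetical values source. Assumes at least one stored vector, all of equal length.
final class DictionaryBackedVectors {
  private final float[][] stored;

  DictionaryBackedVectors(float[][] stored) {
    this.stored = stored;
  }

  // Every call returns a new dictionary that owns its scratch buffer.
  SketchVectorDictionary dictionary() {
    final float[] scratch = new float[stored[0].length];
    return ord -> {
      System.arraycopy(stored[ord], 0, scratch, 0, scratch.length);
      return scratch;
    };
  }
}
```

Within one dictionary, a later lookup still overwrites the earlier result; independence comes from asking for a second dictionary, much as it comes from copy() today.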

msokolov · 2024-09-20T11:24:05Z

OK I think we've addressed the blocking concerns that have been raised here and I plan to push later today if nothing else comes up. Regarding removing copy() in favor of dictionary() I'll open a separate issue. If Adrien finishes it up, great, but I may also see if I can find time to take that forward soon; it would be good to get it done for 10 since it would be a breaking change and ideally we don't want copy() to linger as deprecated. As for implementing better random access in merged values I think we can take that up at a more relaxed pace since it doesn't require any API change.

msokolov · 2024-09-20T12:05:20Z

hm interesting there was an EOFException in there - I'll dig

msokolov · 2024-09-20T18:56:23Z

OK, I found an off-by-one error plus a problem with lazy iterator creation that slipped in when we got rid of createIterator(). It makes me a little nervous these didn't show up in earlier testing. I'm now running with tests.iter=20

Michael Sokolov added 12 commits September 12, 2024 14:19

compiles!

cd9c486

adding some ordToDoc

2bbf8f1

restore vector count argument to scalarquantizer methods

a451fdb

remove docToOrd; mostly can use iterator.index()

8152b9d

Make KnnVectorValues primarily a random access API

dce766c

HasIndexSlice

2f0cc8c

remove RandomAccessVectorValues

327b930

tests pass

98ab0a6

fixing up javadocs and making iterator methods instance methods

1450b44

rename DocIterator to DocIndexIterator

8d087e2

clean up some comments

c2ae86b

fix case where index is reordered

ff7a317

jpountz reviewed Sep 12, 2024

jpountz reviewed Sep 13, 2024

ChrisHegarty reviewed Sep 13, 2024

Michael Sokolov added 4 commits September 15, 2024 15:52

rename 'fromOrdToDoc' to 'all'; move fromIndexedDISI to codecs/lucene90

9e5b9f9

no default advance(); default cost() unsupported

d43785d

make iterator() API sane

787e89c

Merge branch 'main' into knn-vector-random

1873955

Rename IteratorSupplier->SortingIteratorSupplier and add javadoc

4feecf8

cache vector values iterators in VectorFieldSources

abc1713

rename KnnvectorValues.all() to createSparseIterator()

3f6091c

benwtrent reviewed Sep 18, 2024

implement cost(); enforce forward iteration in KnnVectorsWriter

d8ab1ec

add implementations of KnnVectorValues.copy()

a2ca172

Merge remote-tracking branch 'origin/main' into knn-vector-random

274859f

fix SlowCOmpositeCodecReaderWrapper; off-by-one AND lazy iterator access

2b21668

Michael Sokolov added 2 commits September 20, 2024 18:56

Merge remote-tracking branch 'origin/main' into knn-vector-random

2a284f2

resolve merge conflicts

cb62025
