Custom Codecs Upgrade to Lucene99 Codec #95

sarthakaggarwal97 · 2024-01-17T15:55:21Z

Description

This change would upgrade the underlying Lucene95 codec of the custom codecs to Lucene99 codec.

Issues Resolved

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

sarthakaggarwal97 · 2024-01-17T16:16:23Z

@reta @backslasht @mgodwan please take a look. Working simultaneously on the upgrade ITs to ensure correctness.

reta · 2024-01-18T15:34:16Z

src/main/java/org/opensearch/index/codec/customcodecs/Lucene99CustomCodec.java

+        /**
+         * ZStandard mode with dictionary
+         */
+        ZSTD("ZSTD99", Set.of("zstd")),


@sarthakaggarwal97 I don't think we should approach codec backward compatibility this way - it will cause an explosion of compression codecs. Ideally, we should be able:

read / write indices that use older (9.5) codec

create new indices using new (9.9) codec

read / write new indices that use (9.9) codec

The codec name stays the same zstd / zstdnodict. @msfroh does it make sense (from Apache Lucene perspective)?

@reta I was initially not planning to introduce new codecs. I was looking to extend the current ZSTD and ZSTD_NO_DICT codecs from Lucene99 instead of Lucene95, and finally read the stored fields using the same Lucene95CustomStoredFieldsFormat, but I started to run into problems.

This initial approach worked well to ensure that we are now indexing and reading the newly created segments with Lucene99 codec in the background, but upon reading the older segments (created with zstandard in <OSv2.12), things started to fail.

We would see the shards are getting unassigned upon recovery with the exception

{ "index": "index-26", "shard": 0, "primary": true, "current_state": "unassigned", "unassigned_info": { "reason": "CLUSTER_RECOVERED", "at": "2024-01-17T11:11:00.285Z", "last_allocation_status": "no_valid_shard_copy" }, "can_allocate": "no_valid_shard_copy", "allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt", "node_allocation_decisions": [ { "node_id": "GK-KmiQoQEWlE44bvkt2Rg", "node_name": "bcd0743e920a.ant.amazon.com", "transport_address": "127.0.0.1:9300", "node_attributes": { "shard_indexing_pressure_enabled": "true" }, "node_decision": "no", "store": { "in_sync": true, "allocation_id": "CnmW07YyQtqjAI0G4xSOMQ", "store_exception": { "type": "corrupt_index_exception", "reason": "checksum status indeterminate: remaining=0; please run checkindex for more details (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/Users/sarthagg/workplace/actual/OpenSearch/build/distribution/local/opensearch-3.0.0-SNAPSHOT/data/nodes/0/indices/DthI0nwRTqafGvWDltUf1g/0/index/_f.si\")))", "suppressed": [ { "type": "e_o_f_exception", "reason": "read past EOF: NIOFSIndexInput(path=\"/Users/sarthagg/workplace/actual/OpenSearch/build/distribution/local/opensearch-3.0.0-SNAPSHOT/data/nodes/0/indices/DthI0nwRTqafGvWDltUf1g/0/index/_f.si\")" }, { "type": "corrupt_index_exception", "reason": "checksum passed (94fdd3c1). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/Users/sarthagg/workplace/actual/OpenSearch/build/distribution/local/opensearch-3.0.0-SNAPSHOT/data/nodes/0/indices/DthI0nwRTqafGvWDltUf1g/0/index/segments_7\")))" } ] } } } ] }

Even though the codecs name were same, we would still see the recovery to fail. The only difference was that instead of Lucene95 as the delegate codec, we were using Lucene99 as the delegate codec. It should have worked since even with Lucene99, lucene is still relying on the Lucene90StoredFieldsFormat.

So what changed?

There was a slight change in the way we parse the segments in Lucene95 and Lucene99.
This comparison can be viewed over here: https://editor.mergely.com/Mb1lKm8z

There is an additional call to readbyte in the Lucene99SegmentInfoFormat, post which we see EOF exception if we try to read the old segments with Lucene99SegmentInfoFormat.

It was in this commit where this change in format and how we parse the segments was introduced: apache/lucene@6677109

Got it, it definitely makes sense, I was confused a bit why we have to duplicate Mode for each codec, but looking into code it is clear - we use it as codec name, so what you've implemented makes perfect sense. Thanks @sarthakaggarwal97 !

src/main/java/org/opensearch/index/codec/customcodecs/Lucene95CustomCodec.java

reta · 2024-01-22T13:59:23Z

I think we would need to have a BWC tests here: the codecs created by previous versions of the plugin could be picked up by the new versions.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

sarthakaggarwal97 · 2024-01-23T12:57:59Z

@reta makes sense. Have moved the lucene95 codecs to a backward_codecs directory.
While I'm looking to introduce the bwc tests in this plugin, do you think we should take the tests in a separate PR or in the same one?

reta · 2024-01-23T13:59:29Z

While I'm looking to introduce the bwc tests in this plugin, do you think we should take the tests in a separate PR or in the same one?

I would feel more confident having the change and the tests together, the is a risk (sadly, always) to break on upgrades, thanks @sarthakaggarwal97

msfroh · 2024-01-24T17:58:42Z

src/main/resources/META-INF/services/org.apache.lucene.codecs.Codec

-org.opensearch.index.codec.customcodecs.ZstdCodec
-org.opensearch.index.codec.customcodecs.ZstdNoDictCodec
-org.opensearch.index.codec.customcodecs.ZstdDeprecatedCodec
+org.opensearch.index.codec.customcodecs.backward_codecs.Zstd95Codec
+org.opensearch.index.codec.customcodecs.backward_codecs.ZstdNoDict95Codec
+org.opensearch.index.codec.customcodecs.backward_codecs.Zstd95DeprecatedCodec
+org.opensearch.index.codec.customcodecs.Zstd99Codec
+org.opensearch.index.codec.customcodecs.ZstdNoDict99Codec


I don't think we can safely change these.

Codecs are loaded via reflection. The segment infos holds the full class name.

I think you should be able to test this manually by writing a little tool in this package that writes an index. (I would just use IndexWriter directly with an IndexWriterConfig that uses your custom codec. Here's an example.)

Write an index without this commit. That will simulate the "2.11 behavior".

Apply this commit. Open an IndexReader over the 2.11 index.

I believe that you *must keep the old fully-qualified class name to maintain backward compatibility (which unfortunately doesn't mention the version). Going forward, you can+should include version numbers.

Codecs are loaded via reflection. The segment infos holds the full class name.

Hm ... shouldn't it be code name (not class name)? Lucene moves codec under backward_codecs package every time, how does it work there?

My understanding as well is that the provided name within the constructor for codec class implementation is being used (based on how NamedSPILoader is being used in Lucene).

https://github.com/apache/lucene/blob/3a205feb27de8450c3bac7242f3c976ce5d21436/lucene/core/src/java/org/apache/lucene/codecs/Codec.java#L66-L68

I tested this by restoring the data indexed with zstd codecs in OSv2.11 to OSv3.0. Was able to ingest, and search after moving the codecs to the backward_codecs folder.
Here is part of the output when I hit GET <index>/_segments locally, was able to see both old and new segments (before and after the lucene upgrade).

"segments": { "_f": { "generation": 15, "num_docs": 1767, "deleted_docs": 0, "size_in_bytes": 53849, "memory_in_bytes": 0, "committed": false, "search": true, "version": "9.7.0", "compound": false, "attributes": { "Lucene95CustomStoredFieldsFormat.mode": "ZSTD_NO_DICT" } }, "_v": { "generation": 31, "num_docs": 432, "deleted_docs": 0, "size_in_bytes": 14885, "memory_in_bytes": 0, "committed": false, "search": true, "version": "9.9.1", "compound": true, "attributes": { "Lucene99CustomStoredFieldsFormat.mode": "ZSTD_NO_DICT" } },

Ahh... got it. Thanks @mgodwan for the link to the codec comment -- I had mistakenly assumed the FQCN got written to SegmentInfos.

In that case, we should be good to go once we have the BWC test. 👍

sarthakaggarwal97 · 2024-01-25T13:11:37Z

@reta @msfroh @mgodwan
I'm trying to introduce the support for bwc tests in custom-codecs plugin, based on the implementation of security plugin, but looks like I'm running into the same issue for some time now.

It would be great if I can get some inputs on this draft change, since we need to close on this before v2.12 release. Here is the link for the failed bwc task for reference.

Thank you!

mgodwan · 2024-01-25T15:00:55Z

I understand that currently 2.12 build is failing as we don't have the required custom codecs change in.
Due to this, the bwc tests are being run against 2.11 which is not a wire compatible version for main/3.0

This would cause any mixed mode to fail in your tests today.

This is a raw PR which runs against 2.12 as the current version and verify its bwc (at least from a cluster run perspective) with 2.11: #98

In order to have a path forward, I think we can try following approaches:

Raise a PR directly against 2.x to fix the 2.12 branch. Raise the same PR with required bwc versions in main
Raise a PR in main with failing bwc tests. Backport to 2.x and then fix the main branch code.

@reta @sarthakaggarwal97 Thoughts?

sarthakaggarwal97 · 2024-01-25T19:10:04Z

@mgodwan thank you for your inputs. It looks like the minimum_wire_compatibility_version was the issue all along.
Option 2 looks good to me (with monitoring that, post merge to 2.x, things still look okay), not sure we follow the practice of merging in 2.x prior to main when it comes to major code changes.

@reta what would you suggest?

reta · 2024-01-25T19:38:08Z

@reta what would you suggest?

Sorry @sarthakaggarwal97 , just got to that, looking into the options

reta · 2024-01-25T20:56:23Z

@reta what would you suggest?

Sorry @sarthakaggarwal97 , just got to that, looking into the options

@sarthakaggarwal97 I think we may go 2 step processes, @mgodwan @msfroh please validate me here:

Update 2.x (2.12.0-SNAPSHOT) to reflect Apache Lucene 9.9.1 update: at this moment we would use backward_codecs for Lucene 9.5, that would bring 2.x back on track.
Move on with BWC tests as planned

Alternatively, we could do 2nd option (as @mgodwan suggested) and have 2 pull requests, I think it would be better and less cycles wasted:

Raise a PR directly against 2.x to fix the 2.12 branch. Raise the same PR with required bwc versions in main

What do you think?

msfroh · 2024-01-25T21:18:47Z

What do you think?

I think @mgodwan's suggestion is probably smoother (less of a BWC dance). While we don't normally like to commit directly to 2.x, I think it's reasonable to unblock us here.

custom codecs upgrade to lucene99 codec

678be65

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

sarthakaggarwal97 requested review from andrross, reta and nknize as code owners January 17, 2024 15:55

reta reviewed Jan 18, 2024

View reviewed changes

reta reviewed Jan 22, 2024

View reviewed changes

src/main/java/org/opensearch/index/codec/customcodecs/Lucene95CustomCodec.java Outdated Show resolved Hide resolved

moving lucene95 extended codecs to backward_codecs

eae6261

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

msfroh reviewed Jan 24, 2024

View reviewed changes

sarthakaggarwal97 mentioned this pull request Jan 26, 2024

[2.x] Lucene 9.9.1 Upgrade #99

Merged

sarthakaggarwal97 closed this Jan 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Custom Codecs Upgrade to Lucene99 Codec #95

Custom Codecs Upgrade to Lucene99 Codec #95

sarthakaggarwal97 commented Jan 17, 2024

sarthakaggarwal97 commented Jan 17, 2024 •

edited

Loading

reta Jan 18, 2024 •

edited

Loading

sarthakaggarwal97 Jan 19, 2024 •

edited

Loading

reta Jan 22, 2024

reta commented Jan 22, 2024

sarthakaggarwal97 commented Jan 23, 2024

reta commented Jan 23, 2024

msfroh Jan 24, 2024

reta Jan 24, 2024

mgodwan Jan 24, 2024 •

edited

Loading

sarthakaggarwal97 Jan 24, 2024

msfroh Jan 25, 2024

sarthakaggarwal97 commented Jan 25, 2024 •

edited

Loading

mgodwan commented Jan 25, 2024

sarthakaggarwal97 commented Jan 25, 2024

reta commented Jan 25, 2024

reta commented Jan 25, 2024 •

edited

Loading

msfroh commented Jan 25, 2024

Custom Codecs Upgrade to Lucene99 Codec #95

Custom Codecs Upgrade to Lucene99 Codec #95

Conversation

sarthakaggarwal97 commented Jan 17, 2024

Description

Issues Resolved

sarthakaggarwal97 commented Jan 17, 2024 • edited Loading

reta Jan 18, 2024 • edited Loading

Choose a reason for hiding this comment

sarthakaggarwal97 Jan 19, 2024 • edited Loading

Choose a reason for hiding this comment

reta Jan 22, 2024

Choose a reason for hiding this comment

reta commented Jan 22, 2024

sarthakaggarwal97 commented Jan 23, 2024

reta commented Jan 23, 2024

msfroh Jan 24, 2024

Choose a reason for hiding this comment

reta Jan 24, 2024

Choose a reason for hiding this comment

mgodwan Jan 24, 2024 • edited Loading

Choose a reason for hiding this comment

sarthakaggarwal97 Jan 24, 2024

Choose a reason for hiding this comment

msfroh Jan 25, 2024

Choose a reason for hiding this comment

sarthakaggarwal97 commented Jan 25, 2024 • edited Loading

mgodwan commented Jan 25, 2024

sarthakaggarwal97 commented Jan 25, 2024

reta commented Jan 25, 2024

reta commented Jan 25, 2024 • edited Loading

msfroh commented Jan 25, 2024

sarthakaggarwal97 commented Jan 17, 2024 •

edited

Loading

reta Jan 18, 2024 •

edited

Loading

sarthakaggarwal97 Jan 19, 2024 •

edited

Loading

mgodwan Jan 24, 2024 •

edited

Loading

sarthakaggarwal97 commented Jan 25, 2024 •

edited

Loading

reta commented Jan 25, 2024 •

edited

Loading