Implement BloomFilter query pushdown optimization #271

dai-chen · 2024-03-05T19:39:07Z

Description

This PR introduces BloomFilter query pushdown optimization in OpenSearch by using a Painless script. To avoid a dependency on OpenSearch SQL or a new mapper plugin, the BloomFilter deserialization and membership check code shown below is encoded in a Painless script: https://github.com/dai-chen/opensearch-spark/blob/implement-bloom-filter-query-pushdown/flint-spark-integration/src/main/resources/bloom_filter_query.script.

  public static BloomFilter readFrom(InputStream in) {
    DataInputStream dis = new DataInputStream(in);
    int version = dis.readInt();
    int numHashFunctions = dis.readInt();
    BitArray bits = BitArray.readFrom(dis);
    ...
  }

  public boolean mightContainLong(long item) {
    int h1 = Murmur3_x86_32.hashLong(item, 0);
    int h2 = Murmur3_x86_32.hashLong(item, h1);

    long bitSize = bits.bitSize();
    for (int i = 1; i <= numHashFunctions; i++) {
      int combinedHash = h1 + (i * h2);
      // Flip all the bits if it's negative (guaranteed positive number)
      if (combinedHash < 0) {
        combinedHash = ~combinedHash;
      }
      if (!bits.get(combinedHash % bitSize)) {
        return false;
      }
    }
    return true;
  }
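For readers who want to run the membership check in isolation, here is a minimal self-contained sketch. The class name `BloomFilterSketch`, the `putLong` method, and the `long[]` bit layout are assumptions made for illustration; the hash is written after the Murmur3 x86_32 scheme and has not been checked for bit-compatibility with Spark's `Murmur3_x86_32` or with the script in this PR.

```java
// Illustrative sketch only; names and layout are assumptions, not this PR's code.
final class BloomFilterSketch {
    private final long[] words;          // bit array, 64 bits per word
    private final int numHashFunctions;

    BloomFilterSketch(int numWords, int numHashFunctions) {
        this.words = new long[numWords];
        this.numHashFunctions = numHashFunctions;
    }

    long bitSize() {
        return (long) words.length * Long.SIZE;
    }

    // Murmur3 x86_32 style hash of a long, inlined so no library is needed.
    static int hashLong(long input, int seed) {
        int h1 = mixH1(seed, mixK1((int) input));
        h1 = mixH1(h1, mixK1((int) (input >>> 32)));
        return fmix(h1, 8);
    }

    private static int mixK1(int k1) {
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        return k1 * 0x1b873593;
    }

    private static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    private static int fmix(int h1, int length) {
        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        return h1 ^ (h1 >>> 16);
    }

    // Bit index for the i-th derived hash (same combination as in the excerpt above).
    private long bitIndex(int h1, int h2, int i) {
        int combinedHash = h1 + (i * h2);
        if (combinedHash < 0) {
            combinedHash = ~combinedHash;   // flip bits to get a non-negative value
        }
        return combinedHash % bitSize();
    }

    void putLong(long item) {
        int h1 = hashLong(item, 0);
        int h2 = hashLong(item, h1);
        for (int i = 1; i <= numHashFunctions; i++) {
            long index = bitIndex(h1, h2, i);
            words[(int) (index >>> 6)] |= 1L << index;   // shift uses the low 6 bits
        }
    }

    boolean mightContainLong(long item) {
        int h1 = hashLong(item, 0);
        int h2 = hashLong(item, h1);
        for (int i = 1; i <= numHashFunctions; i++) {
            long index = bitIndex(h1, h2, i);
            if ((words[(int) (index >>> 6)] & (1L << index)) == 0) {
                return false;   // one unset bit proves the item was never added
            }
        }
        return true;            // may be a false positive
    }
}
```

An item that was added always tests positive; an item that was not added usually tests negative but can produce a false positive, which is why the filter is only used to skip data, not to answer the query.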

Because the Painless language supports only a basic subset of the JDK API, the logic from Murmur3, DataInputStream, and ByteArrayInputStream needs to be inlined in the script. The resulting Painless script is 15.4k long, while the OpenSearch default limit is 64k.
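To illustrate what inlining the stream classes involves, the sketch below reads big-endian ints and longs directly from a byte array, which is the byte order `DataInputStream.readInt` and `readLong` use. The class and method names are hypothetical, and the layout (version, number of hash functions, word count, then the words) is an assumption based on the `readFrom` excerpt above, not a confirmed description of the script in this PR. The `serialize` helper exists only to produce demo input.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; names and byte layout are assumptions.
final class InlineReaderSketch {

    // Equivalent of DataInputStream.readInt() on buf at offset pos (big-endian).
    static int readInt(byte[] buf, int pos) {
        return ((buf[pos] & 0xFF) << 24)
             | ((buf[pos + 1] & 0xFF) << 16)
             | ((buf[pos + 2] & 0xFF) << 8)
             | (buf[pos + 3] & 0xFF);
    }

    // Equivalent of DataInputStream.readLong() on buf at offset pos (big-endian).
    static long readLong(byte[] buf, int pos) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (buf[pos + i] & 0xFF);
        }
        return value;
    }

    // Reads the bit array words, assuming: int version, int numHashFunctions,
    // int numWords, then numWords longs.
    static long[] readWords(byte[] buf) {
        int numWords = readInt(buf, 8);   // bytes 0..7 hold version and numHashFunctions
        long[] words = new long[numWords];
        for (int i = 0; i < numWords; i++) {
            words[i] = readLong(buf, 12 + 8 * i);
        }
        return words;
    }

    // Demo helper: writes the assumed layout with the standard stream classes.
    static byte[] serialize(int version, int numHashFunctions, long[] words) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(version);
            out.writeInt(numHashFunctions);
            out.writeInt(words.length);
            for (long word : words) {
                out.writeLong(word);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Working on a plain byte array with explicit offsets avoids any stream object, which is the kind of rewrite needed when the stream classes are unavailable in the scripting environment.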

TODO

If we run into a size limitation or performance issue, we can either store the script using the stored script API (https://opensearch.org/docs/latest/api-reference/script-apis/create-stored-script/) or add a dependency on SQL or a new plugin.
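As a rough sketch of the stored-script option, the code below builds (but does not send) a `PUT _scripts/<id>` request whose body wraps the Painless source. The script id `bloom_filter_query`, the base URL `http://localhost:9200`, and the class and method names are placeholders chosen for illustration; authentication and error handling are omitted.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative sketch only; ids, URLs, and names are placeholders.
final class StoredScriptRequestSketch {

    // Minimal JSON string escaping for the script source.
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Request body for the create-stored-script API.
    static String storedScriptBody(String source) {
        return "{\"script\":{\"lang\":\"painless\",\"source\":\""
            + jsonEscape(source) + "\"}}";
    }

    // Builds the PUT request; the caller decides when and how to send it.
    static HttpRequest putStoredScript(String baseUrl, String scriptId, String source) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_scripts/" + scriptId))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(storedScriptBody(source)))
            .build();
    }
}
```

Once a script is stored, a query can refer to it by id and pass per-query values as parameters instead of shipping the full source with every request.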

PR Planned

Implement BloomFilter skipping index building logic #242
Implement BloomFilter query rewrite (without pushdown optimization) #248
Implement BloomFilter query pushdown optimization #271 [Current]
Implement adaptive BloomFilter algorithm #251
Support bloom filter type in Flint SQL

Issues Resolved

#206

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>

vamsi-amazon · 2024-03-08T23:27:51Z

flint-spark-integration/src/main/resources/bloom_filter_query.script

@@ -0,0 +1,109 @@
+int hashLong(long input, int seed) {


We can probably minify this by removing whitespace and shortening variable names, although that reduces readability. Not sure if we can add a step in the final build before bundling.

Ignore this if it is over-optimization and we are well below the limit.

Good point. I recall that for C there are tools to compress or obfuscate code. Painless is not that popular, so I'm not sure any tool exists for it other than doing this manually. Will take another look. Thanks!

Since Painless is mostly Java, I just used this: https://www.aspect-ratios.com/minify-java/

I'm thinking of doing something at runtime, when loading the script or sending it to OpenSearch. Otherwise, as you pointed out, the code will become hard to maintain. Will make a note and address it if any issue is found later. Thanks!
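One way the runtime idea from this thread could look is sketched below: the readable script stays in the repository and a small function strips line comments and collapses whitespace just before the script is sent. This is a naive illustration, not something this PR implements; `ScriptMinifierSketch` is a hypothetical name, and the code assumes the script contains no block comments and no string literals containing `//` or runs of whitespace.

```java
// Illustrative sketch only; assumes no block comments and no "//" or
// whitespace runs inside string literals.
final class ScriptMinifierSketch {

    static String minify(String source) {
        StringBuilder out = new StringBuilder();
        for (String line : source.split("\n")) {
            // Drop everything after a line comment marker.
            int comment = line.indexOf("//");
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            // Trim the line and collapse inner whitespace runs to one space.
            String compact = line.trim().replaceAll("\\s+", " ");
            if (!compact.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(compact);
            }
        }
        return out.toString();
    }
}
```

Variable renaming is deliberately left out, since it needs a real parser to do safely.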

...spark-integration/src/main/scala/org/apache/spark/sql/flint/storage/FlintQueryCompiler.scala

...park-integration/src/test/scala/org/apache/spark/sql/flint/datatype/FlintDataTypeSuite.scala

dai-chen added the enhancement and 0.3 labels Mar 5, 2024

dai-chen self-assigned this Mar 5, 2024

dai-chen added 3 commits March 7, 2024 11:49

Add pushdown optimization by painless script

72a4f32

Signed-off-by: Chen Dai <daichen@amazon.com>

Update UT and indent script code

91a0ece

Signed-off-by: Chen Dai <daichen@amazon.com>

Minor refactor painless script

9888af8

Signed-off-by: Chen Dai <daichen@amazon.com>

dai-chen force-pushed the implement-bloom-filter-query-pushdown branch from 48659d0 to 9888af8 Compare March 7, 2024 20:01

Fix broken IT

0d60837

Signed-off-by: Chen Dai <daichen@amazon.com>

dai-chen marked this pull request as ready for review March 8, 2024 00:42

dai-chen requested review from rupal-bq, vamsi-amazon, penghuo, anirudha, kaituo and YANG-DB as code owners March 8, 2024 00:42

dai-chen mentioned this pull request Mar 8, 2024

Implement adaptive BloomFilter algorithm #251

Merged

5 tasks

Merge branch 'main' into implement-bloom-filter-query-pushdown

8c93f63

vamsi-amazon reviewed Mar 8, 2024

vamsi-amazon reviewed Mar 9, 2024

...spark-integration/src/main/scala/org/apache/spark/sql/flint/storage/FlintQueryCompiler.scala (resolved)

vamsi-amazon reviewed Mar 9, 2024

...park-integration/src/test/scala/org/apache/spark/sql/flint/datatype/FlintDataTypeSuite.scala (resolved)

vamsi-amazon approved these changes Mar 9, 2024

penghuo approved these changes Mar 9, 2024

dai-chen merged commit 7abb208 into opensearch-project:main Mar 11, 2024
4 checks passed

dai-chen deleted the implement-bloom-filter-query-pushdown branch March 11, 2024 18:35

This was referenced Mar 13, 2024

Add BloomFilter skipping index SQL support #283

Merged

[Feature] OpenSearch and Apache Spark Integration #3

Closed


