Implement BloomFilter query pushdown optimization #271

@dai-chen dai-chen commented Mar 5, 2024

Description

This PR introduces BloomFilter query pushdown optimization in OpenSearch by using a Painless script. To avoid a dependency on OpenSearch SQL or a new mapper plugin, the BloomFilter deserialization and membership check code shown below is ported into a Painless script: https://github.com/dai-chen/opensearch-spark/blob/implement-bloom-filter-query-pushdown/flint-spark-integration/src/main/resources/bloom_filter_query.script.

  // Serialized layout: version, number of hash functions, then the bit array
  public static BloomFilter readFrom(InputStream in) throws IOException {
    DataInputStream dis = new DataInputStream(in);
    int version = dis.readInt();
    int numHashFunctions = dis.readInt();
    BitArray bits = BitArray.readFrom(dis);
    ...
  }

  public boolean mightContainLong(long item) {
    // Two base hashes; the second is seeded with the first
    int h1 = Murmur3_x86_32.hashLong(item, 0);
    int h2 = Murmur3_x86_32.hashLong(item, h1);

    long bitSize = bits.bitSize();
    for (int i = 1; i <= numHashFunctions; i++) {
      // Derive the i-th hash from the two base hashes
      int combinedHash = h1 + (i * h2);
      // Flip all the bits if it's negative (guaranteed positive number)
      if (combinedHash < 0) {
        combinedHash = ~combinedHash;
      }
      // A single unset bit proves the item was never added
      if (!bits.get(combinedHash % bitSize)) {
        return false;
      }
    }
    return true;
  }
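
For reviewers who want to try the membership check outside Spark, here is a minimal self-contained sketch of the same double-hashing scheme. The class name `BloomFilterSketch`, the `long[]` bit array and the `putLong` method are illustrative assumptions, and the Murmur3-style mixing is written from memory, so treat it as an approximation rather than a byte-compatible copy of the real implementation.

```java
// Illustrative sketch only: names and layout are assumptions, not the PR's code.
class BloomFilterSketch {
    private final long[] words;          // bit array, 64 bits per word
    private final int numHashFunctions;

    BloomFilterSketch(int numWords, int numHashFunctions) {
        this.words = new long[numWords];
        this.numHashFunctions = numHashFunctions;
    }

    // Murmur3-style 32-bit hash of a long, mixing the low and high halves.
    static int hashLong(long input, int seed) {
        int h1 = mixH1(seed, mixK1((int) input));
        h1 = mixH1(h1, mixK1((int) (input >>> 32)));
        return fmix(h1, 8);
    }

    private static int mixK1(int k1) {
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= 0x1b873593;
        return k1;
    }

    private static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    private static int fmix(int h1, int length) {
        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    private long bitSize() {
        return (long) words.length * 64;
    }

    // i-th bit position derived from the two base hashes, as in mightContainLong.
    private long bitIndex(int h1, int h2, int i) {
        int combinedHash = h1 + (i * h2);
        if (combinedHash < 0) {
            combinedHash = ~combinedHash; // flip bits so the value is non-negative
        }
        return combinedHash % bitSize();
    }

    void putLong(long item) {
        int h1 = hashLong(item, 0);
        int h2 = hashLong(item, h1);
        for (int i = 1; i <= numHashFunctions; i++) {
            long index = bitIndex(h1, h2, i);
            words[(int) (index >>> 6)] |= 1L << index;
        }
    }

    boolean mightContainLong(long item) {
        int h1 = hashLong(item, 0);
        int h2 = hashLong(item, h1);
        for (int i = 1; i <= numHashFunctions; i++) {
            long index = bitIndex(h1, h2, i);
            if ((words[(int) (index >>> 6)] & (1L << index)) == 0) {
                return false; // one unset bit means the item was never added
            }
        }
        return true;
    }
}
```

Every value passed to `putLong` is reported as present afterwards; values that were never added may still return true, which is the usual Bloom filter false positive.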

Because Painless supports only a basic subset of the JDK API, the required logic from Murmur3, DataInputStream and ByteArrayInputStream has to be inlined. The Painless script is 15.4k long, while the OpenSearch default limit is 64k.
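
To illustrate what inlining the stream classes amounts to, the sketch below decodes the header fields with plain array indexing instead of `DataInputStream`. It is written in Java rather than Painless, the names `InlineReaderSketch`, `Decoded`, `decode` and `encode` are hypothetical, and it assumes the bit array is serialized as a word count followed by 64-bit words, which should be checked against the actual format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only: shows big-endian decoding without stream classes.
class InlineReaderSketch {
    // Hypothetical holder for the decoded fields.
    static final class Decoded {
        int version;
        int numHashFunctions;
        long[] words;
    }

    // Same result as DataInputStream.readInt(): four bytes, big-endian.
    static int readInt(byte[] buf, int pos) {
        return ((buf[pos] & 0xFF) << 24)
            | ((buf[pos + 1] & 0xFF) << 16)
            | ((buf[pos + 2] & 0xFF) << 8)
            | (buf[pos + 3] & 0xFF);
    }

    // Same result as DataInputStream.readLong(): eight bytes, big-endian.
    static long readLong(byte[] buf, int pos) {
        return ((long) readInt(buf, pos) << 32) | (readInt(buf, pos + 4) & 0xFFFFFFFFL);
    }

    // Assumed layout: version, numHashFunctions, word count, then the words.
    static Decoded decode(byte[] buf) {
        Decoded d = new Decoded();
        int pos = 0;
        d.version = readInt(buf, pos);
        pos += 4;
        d.numHashFunctions = readInt(buf, pos);
        pos += 4;
        int numWords = readInt(buf, pos);
        pos += 4;
        d.words = new long[numWords];
        for (int i = 0; i < numWords; i++) {
            d.words[i] = readLong(buf, pos);
            pos += 8;
        }
        return d;
    }

    // Reference encoder using the JDK stream classes, for round-trip checks.
    static byte[] encode(int version, int numHashFunctions, long[] words) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(version);
            out.writeInt(numHashFunctions);
            out.writeInt(words.length);
            for (long w : words) {
                out.writeLong(w);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Round-tripping a buffer through `encode` and `decode` is a quick way to confirm that the hand-written reads agree with the JDK stream classes before porting them to Painless.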

TODO

If we hit any limitation or performance issue, we can either store the script via the stored script API (https://opensearch.org/docs/latest/api-reference/script-apis/create-stored-script/) or add a dependency on SQL or a new plugin.
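
As a rough illustration of the stored script option, the sketch below builds the JSON body for a `PUT _scripts/<script-id>` request and a script query clause that refers to the script by ID. The class name, method names and the hand-rolled JSON escaping are assumptions made for this example; real code would more likely use a JSON library and the OpenSearch client.

```java
// Illustrative sketch only: builds request bodies as strings, sends nothing.
class StoredScriptSketch {
    // Minimal JSON string escaping for the script source.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Body for creating a stored Painless script.
    static String storedScriptBody(String painlessSource) {
        return "{\"script\":{\"lang\":\"painless\",\"source\":\""
            + jsonEscape(painlessSource) + "\"}}";
    }

    // Script query clause that references the stored script by ID.
    // paramsJson must already be a valid JSON object.
    static String scriptQueryById(String scriptId, String paramsJson) {
        return "{\"script\":{\"script\":{\"id\":\"" + jsonEscape(scriptId)
            + "\",\"params\":" + paramsJson + "}}}";
    }
}
```

With a stored script, each query carries only the script ID and its parameters instead of the full source.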

PR Planned

Issues Resolved

#206

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added enhancement New feature or request 0.3 labels Mar 5, 2024
@dai-chen dai-chen self-assigned this Mar 5, 2024
Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen force-pushed the implement-bloom-filter-query-pushdown branch from 48659d0 to 9888af8 Compare March 7, 2024 20:01
@dai-chen dai-chen marked this pull request as ready for review March 8, 2024 00:42
@@ -0,0 +1,109 @@
int hashLong(long input, int seed) {
Member

We can probably minify this by removing spaces and shortening variable names, though that reduces readability. Not sure if we can add a step in the final build before bundling.

Member

Ignore this if it is too much optimization and we are well below the limits.

Collaborator Author

Good point. I recall there are tools for C that compress or obfuscate code. Painless is not popular enough for me to be sure such a tool exists, so this may have to be done manually. Will take another look. Thanks!

Member

Since Painless is mostly Java, I just used this: https://www.aspect-ratios.com/minify-java/

Collaborator Author

@dai-chen dai-chen Mar 11, 2024

I'm thinking of doing something at runtime, when loading the script or sending it to OpenSearch. Otherwise, as you pointed out, it would make the code hard to maintain. Will make a note and address it if any issue is found later. Thanks!
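
A runtime step along these lines could look roughly like the sketch below, which strips `//` line comments and collapses whitespace when the script is loaded, so the checked-in source stays readable. The class and method names are hypothetical, and the sketch assumes the script has no block comments and no string literals containing `//` or significant whitespace. It does not rename variables.

```java
// Illustrative sketch only: a naive load-time minifier for a Java-like script.
class ScriptMinifySketch {
    static String minify(String source) {
        StringBuilder out = new StringBuilder();
        for (String line : source.split("\n")) {
            // Drop a trailing line comment (assumes no "//" inside string literals).
            int comment = line.indexOf("//");
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            // Trim the line and collapse runs of whitespace to one space.
            String trimmed = line.trim().replaceAll("\\s+", " ");
            if (!trimmed.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(trimmed);
            }
        }
        return out.toString();
    }
}
```

Because the transformation runs on load, the size sent to OpenSearch shrinks without changing the file under source control.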

@dai-chen dai-chen merged commit 7abb208 into opensearch-project:main Mar 11, 2024
4 checks passed
@dai-chen dai-chen deleted the implement-bloom-filter-query-pushdown branch March 11, 2024 18:35