Implement BloomFilter query pushdown optimization #271
Signed-off-by: Chen Dai <daichen@amazon.com>
Force-pushed from 48659d0 to 9888af8
@@ -0,0 +1,109 @@
int hashLong(long input, int seed) {
We can probably minify this by removing spaces and shortening variable names. That reduces readability, though. Not sure if we can add a step in the final build before bundling.
Ignore this if it is too much optimization and we are well below the limit.
Good point. I recall that in C there are tools to compress/obfuscate the code. Painless is not that popular, so I'm not sure there is any tool for this other than doing it manually. Will take another look. Thanks!
Since Painless is mainly Java, I just used this: https://www.aspect-ratios.com/minify-java/
I'm thinking of doing something at runtime, when loading the script or sending it to OpenSearch. Otherwise, as you pointed out, it will result in code that is hard to maintain. Will make a note and address it if any issue is found later. Thanks!
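As a rough illustration of the runtime idea in this thread, the script could stay readable in the repository and be shrunk only when it is loaded. The sketch below is an assumption, not code from this PR: the class and method names are hypothetical, and it only strips indentation, blank lines, and whole-line `//` comments, which is conservative enough not to touch string literals or trailing comments.

```java
import java.util.stream.Collectors;

// Hypothetical helper, not part of this PR: shrinks a script at load time
// so the checked-in source can keep its formatting and comments.
final class ScriptMinifier {

    static String minify(String script) {
        return script.lines()
            .map(String::strip)                      // drop indentation and trailing spaces
            .filter(line -> !line.isEmpty())         // drop blank lines
            .filter(line -> !line.startsWith("//"))  // drop whole-line comments only
            .collect(Collectors.joining("\n"));      // keep newlines so trailing comments stay harmless
    }
}
```

Renaming variables would save more bytes but needs a real parser, so it is left out of this sketch.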
Description
This PR introduces BloomFilter query pushdown optimization in OpenSearch by using a Painless script. To avoid a dependency on OpenSearch SQL or a new mapper plugin, the BloomFilter deserialization and membership check code is encoded in a Painless script: https://github.com/dai-chen/opensearch-spark/blob/implement-bloom-filter-query-pushdown/flint-spark-integration/src/main/resources/bloom_filter_query.script.
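For readers who have not opened the script, the deserialization and membership check mentioned above can be sketched in plain Java roughly as follows. This is not the script from this PR. It assumes a serialized layout of version, number of hash functions, word count, then 64-bit words (modeled on Spark's sketch BloomFilter), and it uses `ByteBuffer` for brevity where a Painless script would have to read bytes by hand. The class name and the `build` helper are hypothetical and exist only to make the example self-contained.

```java
import java.nio.ByteBuffer;

// Illustrative only: assumed layout is [int version][int numHashFunctions][int numWords][long words...].
final class BloomFilterSketch {

    // Murmur3 x86_32 hash of a 64-bit value: low word first, then high word.
    static int hashLong(long input, int seed) {
        int h1 = mixH1(seed, mixK1((int) input));
        h1 = mixH1(h1, mixK1((int) (input >>> 32)));
        return fmix(h1 ^ 8); // 8 = input length in bytes
    }

    static int mixK1(int k1) {
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        return k1 * 0x1b873593;
    }

    static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    static int fmix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        return h ^ (h >>> 16);
    }

    // Bit position probed by the i-th hash function (i starts at 1).
    static long bitIndex(long item, int i, long bitSize) {
        int h1 = hashLong(item, 0);
        int h2 = hashLong(item, h1);
        int combined = h1 + i * h2;
        if (combined < 0) combined = ~combined; // flip to non-negative
        return combined % bitSize;
    }

    // Deserialize the filter and test membership; false means definitely absent.
    static boolean mightContain(byte[] serialized, long item) {
        ByteBuffer in = ByteBuffer.wrap(serialized); // big-endian by default
        in.getInt();                                 // version, ignored in this sketch
        int numHashFunctions = in.getInt();
        long[] words = new long[in.getInt()];
        for (int w = 0; w < words.length; w++) words[w] = in.getLong();
        long bitSize = (long) words.length * 64;
        for (int i = 1; i <= numHashFunctions; i++) {
            long idx = bitIndex(item, i, bitSize);
            if ((words[(int) (idx >>> 6)] & (1L << idx)) == 0) return false;
        }
        return true;
    }

    // Hypothetical helper that builds a serialized filter for the usage example.
    static byte[] build(int numHashFunctions, int numWords, long... items) {
        long[] words = new long[numWords];
        for (long item : items) {
            for (int i = 1; i <= numHashFunctions; i++) {
                long idx = bitIndex(item, i, (long) numWords * 64);
                words[(int) (idx >>> 6)] |= 1L << idx;
            }
        }
        ByteBuffer out = ByteBuffer.allocate(12 + 8 * numWords);
        out.putInt(1).putInt(numHashFunctions).putInt(numWords);
        for (long w : words) out.putLong(w);
        return out.array();
    }
}
```

For example, `mightContain(build(3, 16, 1L, 42L), 42L)` returns true, while any lookup against `build(3, 16)` (an empty filter) returns false.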
Because the Painless language only supports a basic JDK API, the APIs in Murmur3, DataInputStream and ByteArrayInputStream need to be inlined accordingly. The Painless script length is 15.4k while the OpenSearch default limit is 64k.
TODO
If any limitation or performance issue arises, we can either store the script via https://opensearch.org/docs/latest/api-reference/script-apis/create-stored-script/ or add a dependency on SQL or a new plugin.
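For the stored-script option, a minimal sketch of the request that would be sent is shown below. It assumes the create-stored-script API linked above, which takes a `PUT` to `_scripts/<script-name>` with the script wrapped in a `script` object. The class name and the script ID used in the usage note are hypothetical, and the JSON escaping is hand-rolled only to keep the example dependency-free.

```java
// Hypothetical helper, not part of this PR: builds the path and JSON body
// for registering a Painless script as a stored script.
final class StoredScriptRequest {

    // Escape a string so it can be embedded in a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Request path for the stored script with the given ID.
    static String path(String scriptId) {
        return "/_scripts/" + scriptId;
    }

    // Request body: {"script": {"lang": "painless", "source": "..."}}
    static String body(String painlessSource) {
        return "{\"script\":{\"lang\":\"painless\",\"source\":\""
            + jsonEscape(painlessSource) + "\"}}";
    }
}
```

With a hypothetical ID such as `bloom_filter_query`, the script would be sent once with `PUT /_scripts/bloom_filter_query`, and later queries could reference it by ID instead of carrying the full source each time.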
PR Planned
Issues Resolved
#206
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.