[BUG] Indexing Performance Degraded in OpenSearch 1.3.+ #2916

Closed

mattweber opened this issue Apr 15, 2022 · 20 comments

@mattweber
Contributor

Describe the bug
I recently migrated a large project from elasticsearch 7.16.2 to opensearch 1.3.0 and have noticed a large drop in indexing performance. Both clusters have the same size and specs, the same index mapping, and the same set of docs. Other than checking the lucene versions (both are on 8.10.1), I have not dug in much yet due to other priorities related to the migration.

Other users are reporting similar performance issues on upgrade from opensearch 1.2.4 to 1.3.1 and have potentially narrowed it down to JDK11.

I will attempt to debug this further, but I wanted to open this issue in case anyone else has any ideas as to what might be the issue.

To Reproduce
Compare indexing performance of elasticsearch 7.16.2 or opensearch 1.2.4 to opensearch 1.3.+.

Expected behavior
Indexing performance should be roughly the same.

Plugins
I do use a custom tokenizer which is not open source. This same tokenizer is used in elasticsearch and is unlikely to be the cause.

Screenshots
n/a

Host/Environment (please complete the following information):
Running the multiarch docker image on aarch64 (AWS r6gd.4xlarge, 30g heaps).

Additional context
Unfortunately this is part of a very large migration from elasticsearch to opensearch, so there are a lot of variables to consider.

mattweber added the bug and untriaged labels on Apr 15, 2022
@mattweber
Contributor Author

Upgraded to JDK17 and saw an immediate 2x improvement in indexing performance. I'm pretty positive it is the GC settings for JDK11 vs. JDK17.
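
For anyone trying to reproduce the comparison, the distribution can be pointed at a JDK other than the bundled one through the OPENSEARCH_JAVA_HOME environment variable; a minimal sketch for a tarball install (the JDK path is illustrative, and a docker-based setup like the one described above would need the variable set inside the container instead):

    # run against an external JDK 17 instead of the bundled JDK 11 (path is an example)
    export OPENSEARCH_JAVA_HOME=/opt/jdk-17.0.2
    ./opensearch-1.3.0/bin/opensearch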

@mattweber
Contributor Author

ES 7.10.2 shipped with JDK15, which would have been using G1GC by default. The switch in opensearch to JDK11 changes that default garbage collector, which is a pretty significant change IMO. I am testing G1GC on JDK11 now.

@mattweber
Contributor Author

JDK11 w/ G1GC is perfectly fine and results in the indexing performance I was expecting. I guess this is not a bug, but maybe we should consider switching to G1GC by default for JDK11+? What do you think @reta @dblock ?
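
As a side note for anyone verifying which collector a node actually ended up with, the nodes info API exposes it; a minimal sketch, assuming a node on localhost:9200 with no security layer in front:

    curl -s 'localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.version,nodes.*.jvm.gc_collectors&pretty'
    # with G1 active, gc_collectors should list the G1 generations
    # (e.g. "G1 Young Generation" / "G1 Old Generation") rather than the CMS ones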

@mattweber
Contributor Author

cc @nknize

@reta
Collaborator

reta commented Apr 15, 2022

@mattweber thanks a lot for digging in. I agree with you, it is definitely worth reevaluating the GC recommendations. We have this long-standing issue [1] to run the benchmarks on different JVMs, but it has not been finalized yet ([2] is also a very helpful source). AFAIK we ran into issues with the security plugin on JDK-17 (please see [3]); that is one of the reasons the previous LTS (JDK-11) was set as the bundled one for 1.3.x.

[1] #1276
[2] https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html
[3] opensearch-project/security#1653

@reta
Collaborator

reta commented Apr 18, 2022

@nknize @dblock a bit of context on the issue: 1.2.4 was shipped with JDK-15:

./opensearch-1.2.4/jdk/bin/java -version
openjdk version "15.0.1" 2020-10-20
OpenJDK Runtime Environment AdoptOpenJDK (build 15.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 15.0.1+9, mixed mode, sharing)

Whereas 1.3.0 uses LTS (JDK-11):

./opensearch-1.3.0/jdk/bin/java -version
openjdk version "11.0.14.1" 2022-02-08
OpenJDK Runtime Environment Temurin-11.0.14.1+1 (build 11.0.14.1+1)
OpenJDK 64-Bit Server VM Temurin-11.0.14.1+1 (build 11.0.14.1+1, mixed mode)

Using the latest patched LTS JDK is surely a good idea, but our default configuration uses CMS up to JDK-13 and only switches to G1 from JDK-14 onwards.

## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30

We probably have 3 options (at least):

  1. Use G1 for JDK-11 and above ([2] has convincing points), but AFAIK we did not use G1 in the benchmarks here [1].
  2. Bring JDK-15 back (assuming we won't address the blockers for JDK-17 in the 1.x release line)
  3. Do nothing

I think going with the 1st option would make sense, any opinions?

[1] #1276
[2] https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html
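
To illustrate, the 1st option would roughly amount to lowering the version prefixes in the bundled config/jvm.options so the G1 lines apply from JDK-11 onwards and CMS is only kept for the pre-11 JDKs; a sketch of the idea (not necessarily the exact change that would ship):

    ## GC configuration (CMS only for JDK 8-10)
    8-10:-XX:+UseConcMarkSweepGC
    8-10:-XX:CMSInitiatingOccupancyFraction=75
    8-10:-XX:+UseCMSInitiatingOccupancyOnly

    ## G1GC Configuration (default for JDK 11 and later)
    11-:-XX:+UseG1GC
    11-:-XX:G1ReservePercent=25
    11-:-XX:InitiatingHeapOccupancyPercent=30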

@nknize
Collaborator

nknize commented Apr 18, 2022

hrm.. this is quite the mess.

The 1.2.4 bundled jdk was inherited at JDK 15. It looks like a downgrade to JDK 11 was pushed directly to 1.x when problems were discovered w/ the jdk 17 upgrade?

We should've simply reverted back to 15 for the inherited bundled jdk. Now we have 1.3 on an older bundled jdk than 1.2? Probably not the best idea.

What's the justification for going back to 11 as the bundled jdk when we started w/ 15?

I think we should bundle 15 (especially given all the issues w/ different 11 versions) and can still limit min runtime as 11 (required by lucene 9).

@reta
Collaborator

reta commented Apr 18, 2022

The 1.2.4 bundled jdk was inherited at JDK 15. It looks like a downgrade to JDK 11 was pushed directly to 1.x when problems were discovered w/ the jdk 17 upgrade?

In core, there were no issues with JDK-17 (afaik)

We should've simply reverted back to 15 for the inherited bundled jdk. Now we have 1.3 on an older bundled jdk than 1.2? Probably not the best idea.

Yes, but JDK-15 has not received any security patches for ages (technically, the version is higher but not safer). It was supposed to be replaced by JDK-17 in the first place.

What's the justification for going back to 11 as the bundled jdk when we started w/ 15?

I think opensearch-project/security#1653 was the reason.

I think we should bundle 15 (especially given all the issues w/ different 11 versions) and can still limit min runtime as 11 (required by lucene 9).

That's an option for 1.x; for 2.x we are on JDK-17.

@mattweber
Contributor Author

I am personally running my clusters with option 1 right now, but would be fine with running JDK15+ as well. I 100% think the default should be G1GC; the difference between CMS and G1GC was massive.

@nknize
Collaborator

nknize commented Apr 18, 2022

That's an option for 1.x

Yes... I should've been clearer that I was talking about a 1.3.x bug fix. JDK17 for 2.0+ is definitely where we want to stay.

I 100% think the default should be G1GC

+1 Let's do this as a PR on main and backport all the way to 1.3 for a 1.3 bugfix release?

Separately I'm fine keeping 1.3 on 11 (especially since we're moving forward w/ 2.0+ on 17). 1.3 is set to 11.0.14.1+1 anyway which isn't suffering from the same OOM issues described in #2791.

@mattweber
Contributor Author

@nknize @reta I can open a PR.

@reta
Collaborator

reta commented Apr 18, 2022

@nknize @reta I can open a PR.

Please go ahead, thank you

mattweber mentioned this issue Apr 18, 2022
@CEHENKLE
Member

@anasalkouz @kotwanikunal @bbarani Ugh. I'd really like to understand how we missed this in performance testing, and how we can be confident we won't miss something like this in the future.

@dblock
Member

dblock commented Apr 19, 2022

Looks like all the backports succeeded and were merged. Thanks for your help @mattweber. Closing.

@kotwanikunal
Member

@anasalkouz @kotwanikunal @bbarani Ugh. I'd really like to understand how we missed this in performance testing, and how we can be confident we won't miss something like this in the future.

@CEHENKLE I have opened an issue here - #2985
I will add results and findings from my deep dive into the performance test stack.

@HugoKuo

HugoKuo commented Apr 25, 2022

@reta I think we are encountering the same issue with OpenSearch 1.3.0, and we'd like to go with option 1.
May I know how to apply G1GC to a cluster deployed with the helm chart? (We tried, but it does not seem to work.)

@reta
Collaborator

reta commented Apr 25, 2022

@HugoKuo good question, the Helm charts have opensearchJavaOpts [1] and I believe you should be able to override the GC settings there by adding -XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC. Another option would be to override config/jvm.options [2]. @peterzhuamazon what would be your advice here?

[1] https://github.com/opensearch-project/helm-charts/blob/1495a00017ca54b43857173dd901553eba32f16f/charts/opensearch/README.md#configuration
[2] https://github.com/opensearch-project/OpenSearch/blob/1.3/distribution/src/config/jvm.options

@HugoKuo

HugoKuo commented Apr 25, 2022

Updated the yml, but it failed with this log output:

OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Error occurred during initialization of VM
Multiple garbage collectors selected

The change in the yaml:

    - name: OPENSEARCH_JAVA_OPTS
      value: -Xmx16g -Xms16g -XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly
        -XX:+UseG1GC

@reta
Collaborator

reta commented Apr 25, 2022

@HugoKuo probably the second option, overriding config/jvm.options, would be the fallback.
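
For reference, one way to do that on k8s is to ship a custom jvm.options through a ConfigMap and mount it over the default file; a rough sketch, assuming the chart exposes extraVolumes/extraVolumeMounts as listed in its README [1] (all names below are illustrative, and the jvm.options content is trimmed to the heap and GC lines):

    # ConfigMap carrying the custom jvm.options
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: opensearch-jvm-options
    data:
      jvm.options: |
        -Xms16g
        -Xmx16g
        11-:-XX:+UseG1GC
        11-:-XX:G1ReservePercent=25
        11-:-XX:InitiatingHeapOccupancyPercent=30

    # values.yaml for the opensearch chart
    extraVolumes:
      - name: jvm-options
        configMap:
          name: opensearch-jvm-options
    extraVolumeMounts:
      - name: jvm-options
        mountPath: /usr/share/opensearch/config/jvm.options
        subPath: jvm.options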

@HugoKuo

HugoKuo commented Apr 25, 2022

The change is made and we are seeing a huge difference.
The OpenSearch cluster we have is deployed with Helm Chart and running on top of k8s (managed by Rancher).

OpenSearch 1.3 on k8s by Helm Chart

Applying the G1GC collector is a bit tricky with the helm chart; adding a config map may or may not work. We updated the helm chart to override config/jvm.options. I'm not a Java person, so it took me a while to apply the change. One key point is the number at the front of each line, which specifies the JDK versions that line applies to. The default for G1GC is JDK 14 and later, but OpenSearch 1.3 uses JDK 11, so you need to change the version range on the G1GC lines; otherwise it'll go SerialGC, which has very poor performance.

    ## GC configuration
    8-13:-XX:+UseConcMarkSweepGC
    8-13:-XX:CMSInitiatingOccupancyFraction=75
    8-13:-XX:+UseCMSInitiatingOccupancyOnly
    ## G1GC Configuration
    # NOTE: G1 GC is only supported on JDK version 10 or later
    # to use G1GC, uncomment the next two lines and update the version on the
    # following three lines to your version of the JDK
    10-13:-XX:-UseConcMarkSweepGC
    10-13:-XX:-UseCMSInitiatingOccupancyOnly
    10-14:-XX:+UseG1GC
    10-14:-XX:G1ReservePercent=25
    10-14:-XX:InitiatingHeapOccupancyPercent=30

Updated the same on https://discuss.opendistrocommunity.dev/t/slow-indexing-performance/9366/6?u=hugok

Hope this helps.

Thanks @reta
