the dev of [FEATURE]Auto reload model when cluster rebooted/node rejoin #711

wujunshen · 2023-01-25T09:32:27Z

Signed-off-by: JunShen Wu wjunshen@amazon.com

Description

New feature:
Auto reload models when the cluster is rebooted or a node rejoins

When an ML node in an OpenSearch cluster goes down for an unknown reason, the models loaded on that node become unavailable, which breaks inference or reduces performance. This PR adds a new feature: when an ML node that went down is rebooted, OpenSearch on that node automatically reloads all the models that were loaded on it, so users do not have to reload the models manually. In the extreme case where the reload is still unsuccessful, OpenSearch reports the failure to the user in the logs.

Issues Resolved

please see: #577

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov-commenter · 2023-01-25T15:36:33Z

Codecov Report

Merging #711 (467250e) into 2.x (ffb8a4e) will increase coverage by 0.32%.
The diff coverage is 90.22%.

@@             Coverage Diff              @@
##                2.x     #711      +/-   ##
============================================
+ Coverage     84.95%   85.27%   +0.32%     
- Complexity     1076     1105      +29     
============================================
  Files           100      101       +1     
  Lines          3922     4055     +133     
  Branches        370      378       +8     
============================================
+ Hits           3332     3458     +126     
- Misses          433      440       +7     
  Partials        157      157

Flag	Coverage Δ
ml-commons	`85.27% <90.22%> (+0.32%)`	⬆️

Impacted Files	Coverage Δ
...a/org/opensearch/ml/model/MLModelAutoReloader.java	`89.76% <89.76%> (ø)`
...rg/opensearch/ml/plugin/MachineLearningPlugin.java	`98.85% <100.00%> (+0.02%)`	⬆️
.../org/opensearch/ml/settings/MLCommonsSettings.java	`100.00% <100.00%> (ø)`
...earch/ml/action/load/TransportLoadModelAction.java	`85.84% <0.00%> (+1.76%)`	⬆️
.../cluster/MLCommonsClusterManagerEventListener.java	`79.41% <0.00%> (+11.76%)`	⬆️

ylwu-amzn · 2023-02-03T06:07:11Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+     * the main method: model auto reloading
+     */
+    public void autoReLoadModel() {
+        log.info("enableAutoReLoadModel: {} ", enableAutoReLoadModel);


The log message is not very readable. How about changing it to "Auto reload model enabled: {}"?

ylwu-amzn · 2023-02-03T06:10:36Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+
+        String localNodeId = clusterService.localNode().getId();
+        // auto reload all models of this local ml node
+        threadPool.generic().submit(() -> {


Why not use the load model thread pool? https://github.com/opensearch-project/ml-commons/blob/2.5/plugin/src/main/java/org/opensearch/ml/plugin/MachineLearningPlugin.java#L448

I think this is a separate function. If I use the load model thread pool, it may affect the original model loading process.

threadPool.generic() is not dedicated to ML. Using this thread pool may impact other OpenSearch tasks. I suggest changing to the ML-dedicated load model thread pool.

OK, I now use threadPool.executor(LOAD_THREAD_POOL).
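
The thread pool concern in this thread can be illustrated with plain `java.util.concurrent`. This is only a minimal sketch, not the ml-commons code: the class `DedicatedPoolSketch`, the thread name prefix `ml-load-model-`, and the pool size are assumptions made up for illustration. The idea is that reload work runs on a pool reserved for model loading, so it cannot starve unrelated tasks that share a general-purpose pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class DedicatedPoolSketch {
    // Hypothetical stand-in for an ML-dedicated pool; name prefix and size are assumptions.
    static ExecutorService newLoadModelPool(int size) {
        AtomicInteger counter = new AtomicInteger();
        return Executors.newFixedThreadPool(size, runnable -> {
            Thread thread = new Thread(runnable, "ml-load-model-" + counter.incrementAndGet());
            thread.setDaemon(true);
            return thread;
        });
    }

    // Submits a placeholder reload task and returns the name of the thread that ran it.
    static String runReloadOn(ExecutorService pool) {
        Callable<String> reloadTask = () -> Thread.currentThread().getName();
        try {
            return pool.submit(reloadTask).get(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the reload task", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("reload task did not complete", e);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = newLoadModelPool(2);
        try {
            System.out.println("reload ran on " + runReloadOn(pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

In the PR itself the equivalent choice is between `threadPool.generic()` and `threadPool.executor(LOAD_THREAD_POOL)`, as quoted in the comments above.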

ylwu-amzn · 2023-02-03T06:10:56Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+            try {
+                autoReLoadModelByNodeId(localNodeId);
+            } catch (ExecutionException | InterruptedException e) {
+                throw new RuntimeException(e);


Add error log here?

OK, added it.

ylwu-amzn · 2023-02-03T06:15:59Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+            indexName
+        );
+
+        indicesExistsRequestBuilder.execute(ActionListener.wrap(actionListener::onResponse, actionListener::onFailure));


How about just checking whether the index exists in the cluster metadata? Refer to PR #717.

OK, that is a very useful suggestion. It reduces the complexity of the implementation, thanks.

ylwu-amzn · 2023-02-03T06:18:09Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+                indexResponseActionListener.onResponse(indexResponse);
+                return;
+            }
+            indexResponseActionListener.onFailure(new RuntimeException("node id:" + localNodeId + " insert retry times unsuccessfully"));


Change to MLException?

ylwu-amzn · 2023-02-03T06:24:26Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+                    log.error("Can't auto reload model in node id {} ,has tried {} times\nThe reason is:{}", localNodeId, reTryTimes, e);
+                }
+
+                // Store the latest value of the reTryTimes and node id under the index ".plugins-ml-model-reload"


Is there any reason we have to persist retryTimes to an index? Is it OK to just cache the retry times in memory?

Because when the ML node reboots, the retry count would be lost if we kept it only in an in-memory cache.

ylwu-amzn · 2023-02-03T06:36:40Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+                    autoReLoadModelByNodeAndModelId(localNodeId, mlTask.getModelId());
+
+                    // if reload the model successfully,the number of unsuccessful reload should be reset to zero.
+                    result.setReTryTimes(reTryTimes);


Load model is async. I don't think we can be sure the model loaded successfully even if line 188, autoReLoadModelByNodeAndModelId(localNodeId, mlTask.getModelId());, doesn't throw an exception.

Do you need to update the retry times in the ML_MODEL_RELOAD_INDEX to 0 after loading successfully? For instance, if the reload succeeded on the 3rd attempt, do you need to reset the retry value in the index from 2 to 0?

Yes, the code already has the "if successful, reset to zero" logic.
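
The point about asynchronous loading can be sketched with `CompletableFuture`: the retry counter is reset or incremented only when the load actually completes, not when the call that starts it returns. This is a hedged illustration, not the PR's implementation. The class `AsyncRetrySketch`, its method names, and the in-memory map (standing in for the retry count that the PR persists in an index) are all assumptions.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

final class AsyncRetrySketch {
    // In-memory stand-in for the per-node retry count; the PR stores this value in an index.
    private final Map<String, Integer> retryTimesByNode = new ConcurrentHashMap<>();

    int retryTimes(String nodeId) {
        return retryTimesByNode.getOrDefault(nodeId, 0);
    }

    // Updates the counter only once the asynchronous load has finished, successfully or not.
    <T> CompletableFuture<Void> trackReload(String nodeId, CompletableFuture<T> loadFuture) {
        return loadFuture.handle((ignoredResult, error) -> {
            if (error == null) {
                retryTimesByNode.put(nodeId, 0); // success: reset to zero
            } else {
                retryTimesByNode.merge(nodeId, 1, Integer::sum); // failure: count this attempt
            }
            return null;
        });
    }

    public static void main(String[] args) {
        AsyncRetrySketch sketch = new AsyncRetrySketch();
        CompletableFuture<String> load = new CompletableFuture<>();
        sketch.trackReload("node-1", load);
        // The call above returned without an exception, yet nothing is known about the load.
        System.out.println("before completion: " + sketch.retryTimes("node-1"));
        load.completeExceptionally(new RuntimeException("load failed"));
        System.out.println("after failure: " + sketch.retryTimes("node-1"));
    }
}
```

With this shape, a call that returns without throwing says nothing about success, which is the concern raised about line 188.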

ylwu-amzn · 2023-02-03T06:40:04Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+                mlLoadModelRequest,
+                ActionListener
+                    .wrap(response -> log.info("the model {} is auto reloading under the node {} ", modelId, localNodeId), exception -> {
+                        log.error("fail to reload model " + modelId + " under the node " + localNodeId + "\nthe reason is: " + exception);


I don't see retryTimes + 1 here. Is that correct? I think we should count this failure in the retry times.

The method autoReLoadModelByNodeAndModelId throws the exception to the calling method, where the call on line 188 (mentioned in the previous comment) is wrapped in a try/catch. So retryTimes + 1 is done in the catch block (line 195).

ylwu-amzn · 2023-02-03T06:43:22Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+                }
+
+                int reTryTimes = 0;
+                try (XContentParser parser = createXContentParserFromRegistry(xContentRegistry, result.getHits()[0].getSourceRef())) {


From result.getHits()[0], it seems this will only reload the first model?

retryTimes is tracked per node, not per node + model.
In other words, there is only one retryTimes value for each ML node.

ylwu-amzn · 2023-02-03T06:47:26Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+        QueryBuilder queryBuilder = QueryBuilders
+            .boolQuery()
+            .must(QueryBuilders.matchPhraseQuery("task_type", "LOAD_MODEL"))
+            .must(QueryBuilders.matchPhraseQuery("state", "COMPLETED"))


I think we should also consider COMPLETED_WITH_ERROR, which means the model was not loaded on all worker nodes but was loaded successfully on some of them.

OK. This is probably a timing issue: when I wrote the code, the MLTaskState enum did not yet contain COMPLETED_WITH_ERROR. I have added it to my code.
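
The filter discussed above can be sketched without any OpenSearch dependency. This is an illustration only: `ReloadableStateSketch` and its `TaskState` enum are hypothetical, and of the enum values only `COMPLETED` and `COMPLETED_WITH_ERROR` (and the `LOAD_MODEL` task type) are taken from this thread. The other states are placeholders.

```java
import java.util.EnumSet;
import java.util.Set;

final class ReloadableStateSketch {
    // Hypothetical subset of task states; only COMPLETED and COMPLETED_WITH_ERROR come from the review.
    enum TaskState { CREATED, RUNNING, COMPLETED, COMPLETED_WITH_ERROR, FAILED }

    // A model counts as previously loaded if its load task finished on all or on some worker nodes.
    private static final Set<TaskState> RELOADABLE =
        EnumSet.of(TaskState.COMPLETED, TaskState.COMPLETED_WITH_ERROR);

    static boolean shouldReload(String taskType, TaskState state) {
        return "LOAD_MODEL".equals(taskType) && RELOADABLE.contains(state);
    }

    public static void main(String[] args) {
        System.out.println(shouldReload("LOAD_MODEL", TaskState.COMPLETED_WITH_ERROR));
        System.out.println(shouldReload("LOAD_MODEL", TaskState.FAILED));
    }
}
```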

ylwu-amzn · 2023-02-03T06:49:30Z

common/src/main/java/org/opensearch/ml/common/CommonValue.java

+    public static final String ML_MODEL_RELOAD_INDEX = ".plugins-ml-model-reload";
+    public static final String NODE_ID_FIELD = "node_id";
+    public static final String MODEL_LOAD_RETRY_TIMES_FIELD = "retry_times";
+    public static final Integer ML_MODEL_RELOAD_MAX_RETRY_TIMES = 2;


Do we still need this constant now that we have the setting MLCommonsSettings.ML_MODEL_RELOAD_MAX_RETRY_TIMES?

Yes, you are right. We don't need it in the CommonValue class. But we do need it in opensearch.yml, so that users can define the value themselves.

Zhangxunmt

Is it possible to test this auto reload in a local cluster? For example, by creating a 2-node cluster and killing the OS process on one of the nodes?

Zhangxunmt · 2023-02-08T19:57:03Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+    }
+
+    @Data
+    static class Result {


The class name "Result" looks too general and confusing. Let's rename it to something more meaningful, e.g. ModelsToRestore?

After referring to PR #717, I found that we no longer need this Result class, so I refactored it away.

dhrubo-os · 2023-02-08T20:13:51Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+    private final NamedXContentRegistry xContentRegistry;
+    private final DiscoveryNodeHelper nodeHelper;
+    private final ThreadPool threadPool;
+    private volatile Boolean enableAutoReLoadModel;


"reload" is a single word, so we don't need camel casing inside it. enableAutoReloadModel?

Same comment for "retry".

OK, I will search for "reload" and "retry" and rename both everywhere.

dhrubo-os · 2023-02-08T20:23:19Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+                // According to the node id to get retry times, if more than the max retry times, don't need to retry
+                // that the number of unsuccessful reload has reached the maximum number of times, do not need to reload
+                if (result.getReTryTimes() > autoReLoadMaxReTryTimes) {
+                    log.info("have exceeded max retry times, always failure");


How about: log.info("Node: {} has reached to the max retry limit, failed to load models", localNodeId)

that's cool

have committed the code

Zhangxunmt · 2023-02-08T21:07:26Z

common/src/main/java/org/opensearch/ml/common/CommonValue.java

@@ -32,6 +32,11 @@ public class CommonValue {
    public static final String ML_TASK_INDEX = ".plugins-ml-task";
    public static final Integer ML_MODEL_INDEX_SCHEMA_VERSION = 3;
    public static final Integer ML_TASK_INDEX_SCHEMA_VERSION = 1;
+
+    public static final String ML_MODEL_RELOAD_INDEX = ".plugins-ml-model-reload";


A general question: Is it possible to avoid using a new index to achieve auto reload? Can we just query the task index, find all the models loaded on the current node, and reload them all after OS has started? I may have missed some earlier discussion, but it looks like the retry number and search results can be stored locally in memory.

If this ML node has been restarted for some unknown reason, I can still use the persisted retryTimes value to know how many times the models on this node have been auto-reloaded before, and then decide whether to auto-reload this time. If the value is kept in a cache, I can't get this information and have to auto-reload again. Comparing the two, the former may bring some performance improvement.

Yes, but this is a trade-off: a performance improvement at the cost of a lot more resources. Is it possible to define this auto_reload as an ml_task and reuse the ml_task index to store the retry_times? Adding 2 new fields to ml_task may be much cheaper than using a new index. Thoughts?

We will elaborate after discussing this with you and Charlie.

…rameter (#714) * Enhance profile API to add model centric result controled by view paramter Signed-off-by: Zan Niu <zaniu@amazon.com> * Enhance profile API to add model centric result controled by view parameter Signed-off-by: Zan Niu <zaniu@amazon.com> * Enhance profile API to add model centric result controled by view parameter Signed-off-by: Zan Niu <zaniu@amazon.com> --------- Signed-off-by: Zan Niu <zaniu@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

* refactor: add DL model class Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix model url in example doc Signed-off-by: Yaliang Wu <ylwu@amazon.com> * address comments Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix failed ut Signed-off-by: Yaliang Wu <ylwu@amazon.com> --------- Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

ylwu-amzn · 2023-02-13T21:33:34Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+                log
+                    .error(
+                        "the model auto-reloading has exception,and the root cause message is: {}",
+                        ExceptionUtils.getRootCauseMessage(e)


How about printing the full exception stack trace here? Printing only the root cause makes it hard to debug.

Can I use ExceptionUtils.getMessage(e)?

I thought about it for a while and finally changed it to ExceptionUtils.getStackTrace(e).

Printing the entire exception stack is useful and convenient for locating and debugging issues. You can change it to log.error("the model auto-reloading has exception,and the root cause message is: {}", e)

Great, I will modify it as you suggested.
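
The difference between logging a message string and logging the exception object can be sketched as follows. To stay dependency-free, this example uses `java.util.logging`, which is not the logger used in the PR, and the class `StackTraceLoggingSketch` with its capturing handler is an assumption for illustration. The point is that passing the exception itself as the throwable argument keeps the stack trace and the whole cause chain available to the log output.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class StackTraceLoggingSketch {
    // Test helper that keeps the last published record so its throwable can be inspected.
    static final class CapturingHandler extends Handler {
        volatile LogRecord last;

        @Override
        public void publish(LogRecord record) {
            last = record;
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    // Logs the failure with the exception attached and returns the captured record.
    static LogRecord logAndCapture(Exception e) {
        Logger logger = Logger.getLogger("auto-reload-sketch");
        logger.setUseParentHandlers(false);
        CapturingHandler handler = new CapturingHandler();
        logger.addHandler(handler);
        try {
            // The throwable is passed as its own argument, not formatted into the message text.
            logger.log(Level.SEVERE, "model auto-reloading failed", e);
        } finally {
            logger.removeHandler(handler);
        }
        return handler.last;
    }

    public static void main(String[] args) {
        Exception cause = new IllegalStateException("disk full");
        LogRecord record = logAndCapture(new RuntimeException("reload failed", cause));
        System.out.println(record.getThrown().getCause().getMessage());
    }
}
```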

ylwu-amzn · 2023-02-13T21:52:10Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+     */
+    @VisibleForTesting
+    void autoReloadModelByNodeAndModelId(String localNodeId, String modelId) throws MLException {
+        String[] allNodeIds = nodeHelper.getAllNodeIds();


nodeHelper.getAllNodeIds() returns all nodes, not just ML nodes. Should we really reload the model on all nodes?

OK, I will modify it so that the collection only contains the IDs of ML nodes.
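
A minimal sketch of the fix agreed on here, assuming a simplified node model: the `Node` class, the `"ml"` role name, and `mlNodeIds` are hypothetical and only illustrate filtering node IDs by role. They are not the `DiscoveryNodeHelper` API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MlNodeFilterSketch {
    // Hypothetical node model; the "ml" role name is an assumption for illustration.
    static final class Node {
        final String id;
        final Set<String> roles;

        Node(String id, String... roles) {
            this.id = id;
            this.roles = new HashSet<>(Arrays.asList(roles));
        }
    }

    // Keeps only the IDs of nodes that carry the ML role, preserving input order.
    static String[] mlNodeIds(List<Node> allNodes) {
        List<String> ids = new ArrayList<>();
        for (Node node : allNodes) {
            if (node.roles.contains("ml")) {
                ids.add(node.id);
            }
        }
        return ids.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(new Node("n1", "data"), new Node("n2", "ml"), new Node("n3", "data", "ml"));
        System.out.println(Arrays.toString(mlNodeIds(nodes)));
    }
}
```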

Zhangxunmt · 2023-02-16T18:57:31Z

common/src/main/java/org/opensearch/ml/common/CommonValue.java

@@ -32,6 +32,11 @@ public class CommonValue {
    public static final String ML_TASK_INDEX = ".plugins-ml-task";
    public static final Integer ML_MODEL_INDEX_SCHEMA_VERSION = 3;
    public static final Integer ML_TASK_INDEX_SCHEMA_VERSION = 1;
+
+    public static final String ML_MODEL_RELOAD_INDEX = ".plugins-ml-model-reload";


Yes, but this is a trade-off: a performance improvement at the cost of a lot more resources. Is it possible to define this auto_reload as an ml_task and reuse the ml_task index to store the retry_times? Adding 2 new fields to ml_task may be much cheaper than using a new index. Thoughts?

Zhangxunmt · 2023-02-16T19:15:12Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+                // that the number of unsuccessful reload has reached the maximum number of times, do not need to reload
+                if (retryTimes > autoReloadMaxRetryTimes) {
+                    log.info("Node: {} has reached to the max retry limit, failed to load models", localNodeId);
+                    return;


Before returning, should we check how long the node has been in the max retry status and reset it to 0 after a substantial time? It looks to me like the node will never reload again once it has reached the maximum retry times.

Regarding your first comment: I found that the ml_task index and the ml_model index are both defined in the ml_task index, so if I add 2 new fields to ml_task, the ml_model index will have these 2 fields too. That would give the ml_model index redundant attributes.
Regarding your second comment: when we discussed the design earlier, we agreed that once the maximum retry times is reached, the model needs to be loaded manually instead of being reloaded automatically.

Cool. Let's keep this logic then. But we should try to define a new type of ML Task for auto reload, and reuse MLTask to store the max_retry field, etc.

zane-neo · 2023-02-17T03:29:27Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+
+        searchRequestBuilder.execute(ActionListener.wrap(searchResponseActionListener::onResponse, exception -> {
+            log.error("index {} not found, the reason is {}", ML_TASK_INDEX, exception);
+            throw new IndexNotFoundException("index " + ML_TASK_INDEX + " not found");


We can't confirm that this is an IndexNotFoundException. Please throw an MLException instead, and wrap the original exception in the MLException like this: throw new MLException(exception)

OK, I have changed it.
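
Why wrapping matters can be shown with a small sketch. `SketchMLException` is a hypothetical stand-in for the project's MLException, and `onSearchFailure` is an invented helper. Only the pattern of passing the original exception as the cause is taken from the review comment.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

final class WrapCauseSketch {
    // Hypothetical stand-in for MLException; only the cause-wrapping constructor is modeled.
    static final class SketchMLException extends RuntimeException {
        SketchMLException(Throwable cause) {
            super(cause);
        }
    }

    // Forwards a failure without guessing its type: the original exception stays reachable as the cause.
    static void onSearchFailure(Exception original, Consumer<Exception> onFailure) {
        onFailure.accept(new SketchMLException(original));
    }

    public static void main(String[] args) {
        AtomicReference<Exception> received = new AtomicReference<>();
        onSearchFailure(new IllegalStateException("shard unavailable"), received::set);
        System.out.println(received.get().getCause().getMessage());
    }
}
```

Reporting a fresh IndexNotFoundException instead would discard the real reason, which might have nothing to do with a missing index.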

zane-neo · 2023-02-17T03:43:49Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+                return;
+            }
+            indexResponseActionListener.onFailure(new MLException("node id:" + localNodeId + " insert retry times unsuccessfully"));
+        }, indexResponseActionListener::onFailure));


Please add logs here when receiving indexRequestBuilder.execute exception.

zane-neo · 2023-02-17T03:47:20Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+
+        String localNodeId = clusterService.localNode().getId();
+        // auto reload all models of this local ml node
+        threadPool.generic().submit(() -> {


Please change this threadpool to UPLOAD_THREAD_POOL in MachineLearningPlugin since this is dedicated for uploading models.

Yes, Yaliang already mentioned this. I have committed the latest code.

ylwu-amzn · 2023-02-20T22:32:48Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+     */
+    @VisibleForTesting
+    void queryTask(String localNodeId, ActionListener<SearchResponse> searchResponseActionListener) {
+        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder().from(0).size(1);


This query only returns the latest load model task. If a user has 3 models, it returns just the 1 latest load model task for 1 model, and the tasks for the other 2 models are not returned. So we can't reload all 3 models, only 1. Is that correct?
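
One way to address this concern, sketched under assumptions: fetch all matching load tasks for the node and keep the latest task per model, instead of limiting the result to a single hit. The class `LatestTaskPerModelSketch`, the `LoadTask` type, and its `modelId` and `createTime` fields are hypothetical. The sketch works on an in-memory list and does not show the actual search request.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LatestTaskPerModelSketch {
    // Hypothetical view of a load task; field names are assumptions for illustration.
    static final class LoadTask {
        final String modelId;
        final long createTime;

        LoadTask(String modelId, long createTime) {
            this.modelId = modelId;
            this.createTime = createTime;
        }
    }

    // Groups tasks by model ID and keeps the most recently created task for each model.
    static Map<String, LoadTask> latestTaskPerModel(List<LoadTask> tasks) {
        Map<String, LoadTask> latest = new LinkedHashMap<>();
        for (LoadTask task : tasks) {
            latest.merge(task.modelId, task, (kept, candidate) -> candidate.createTime > kept.createTime ? candidate : kept);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<LoadTask> tasks = Arrays.asList(
            new LoadTask("model-a", 10), new LoadTask("model-b", 20),
            new LoadTask("model-a", 30), new LoadTask("model-c", 15));
        // Three models are found, not just the single most recent task.
        System.out.println(latestTaskPerModel(tasks).keySet());
    }
}
```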

ylwu-amzn · 2023-02-21T17:59:13Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReloader.java

+    private volatile Integer autoReloadMaxRetryTimes;
+
+    /**
+     * constructor method， init all the params necessary for model auto reloading


The ， (full-width comma) after "constructor method" is not US-ASCII. That will cause the infra team's CI workflow to fail. #736
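
A check for such characters can be sketched with the standard `CharsetEncoder` API. The class and method names are hypothetical, and this is not the check that the CI workflow actually runs. It only shows how a non-US-ASCII character in a comment string could be located.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class AsciiCheckSketch {
    // Returns the index of the first character US-ASCII cannot encode, or -1 if every character is encodable.
    static int firstNonAsciiIndex(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The full-width comma is written as a Unicode escape so that this source file stays pure ASCII.
        String javadocLine = "constructor method\uFF0C init all the params";
        System.out.println(firstNonAsciiIndex(javadocLine));
        System.out.println(firstNonAsciiIndex("constructor method, init all the params"));
    }
}
```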

* fix unmappable character for encoding US-ASCII Signed-off-by: Yaliang Wu <ylwu@amazon.com> * Revert "the dev of [FEATURE]Auto reload model when cluster rebooted/node rejoin (#711)" This reverts commit 7a51dcc. Signed-off-by: Yaliang Wu <ylwu@amazon.com> --------- Signed-off-by: Yaliang Wu <ylwu@amazon.com>

wujunshen requested review from a team, zane-neo and ylwu-amzn January 25, 2023 09:32

ylwu-amzn reviewed Feb 3, 2023

Zhangxunmt reviewed Feb 8, 2023

dhrubo-os reviewed Feb 8, 2023

Zhangxunmt reviewed Feb 8, 2023

wujunshen and others added 21 commits February 10, 2023 00:20

[wjunshen] #N/A feat: fix after the latest rebase

77a1ffb

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: fix after rebase

9f50458

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: fix after rebase

ddfb117

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: fix after rebase

0c235a6

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: fix after the latest rebase

35a3703

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

Increment version to 2.6.0-SNAPSHOT (#671)

350faed

Signed-off-by: opensearch-ci-bot <opensearch-infra@amazon.com> Signed-off-by: opensearch-ci-bot <opensearch-infra@amazon.com> Co-authored-by: opensearch-ci-bot <opensearch-infra@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

fix profile API in example doc (#712)

5126aba

Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

change model url to public repo in text embedding model example doc (#…

c22980c

…713) Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

add planning work nodes to model (#715)

f62ad71

* add planning work nodes to model Signed-off-by: Yaliang Wu <ylwu@amazon.com> * add test Signed-off-by: Yaliang Wu <ylwu@amazon.com> --------- Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

skip running syncup job if no model index (#717)

eaf794d

Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

tune model config: change pooling mode to optional (#724)

da42086

Signed-off-by: Yaliang Wu <ylwu@amazon.com> Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: make the log readable

699e06a

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: add error log

0365674

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: Refer to PR #717,just checking if index exists …

d779f8c

…in cluster metadata Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: change RunTimeException to MLException

9fa1025

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: also consider COMPLETED_WITH_ERROR

beef20f

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: remove ML_MODEL_RELOAD_MAX_RETRY_TIMES in Commo…

5f3c2cc

…nValue.java Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: remove Result class

facc4a1

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: change "reload" and "retry" to a full word

7356a82

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

wujunshen added 3 commits February 10, 2023 00:20

[wjunshen] #N/A feat: change log info sentence

ddab41c

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[wjunshen] #N/A feat: code format

76fb7f0

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

Merge branch 'opensearch-project:2.x' into 2.x

c0b575e

ylwu-amzn reviewed Feb 13, 2023

[Signed-off-by: wjunshen<wjunshen@amazon.com>] #N/A feat:

78ae922

1. Let the collection just have all ids of ml node 2. print out full exception stack trace Signed-off-by: wujunshen <frank_wjs@hotmail.com>

Zhangxunmt reviewed Feb 16, 2023

Zhangxunmt previously approved these changes Feb 17, 2023

zane-neo reviewed Feb 17, 2023

wujunshen added 4 commits February 17, 2023 15:31

[Signed-off-by: wjunshen<wjunshen@amazon.com>] #N/A feat:

f169ec1

use LOAD_THREAD_POOL to replace generic Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[Signed-off-by: wjunshen<wjunshen@amazon.com>] #N/A feat:

5436e6d

print the whole exception stack Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[Signed-off-by: wjunshen<wjunshen@amazon.com>] #N/A feat:

71e5645

change the IndexNotFoundException to MLException Signed-off-by: wujunshen <frank_wjs@hotmail.com>

[Signed-off-by: wjunshen<wjunshen@amazon.com>] #N/A feat:

38bf342

add logs when receiving indexRequestBuilder.execute exception Signed-off-by: wujunshen <frank_wjs@hotmail.com>

wujunshen dismissed Zhangxunmt’s stale review via 38bf342 February 17, 2023 07:52

[Signed-off-by: wjunshen<wjunshen@amazon.com>] #N/A feat:

467250e

change the test code after code review Signed-off-by: wujunshen <frank_wjs@hotmail.com>

zane-neo approved these changes Feb 17, 2023

Zhangxunmt approved these changes Feb 19, 2023

model-collapse self-assigned this Feb 20, 2023

model-collapse merged commit 7a51dcc into opensearch-project:2.x Feb 20, 2023

ylwu-amzn reviewed Feb 20, 2023

ylwu-amzn reviewed Feb 21, 2023

ylwu-amzn added a commit to ylwu-amzn/ml-commons that referenced this pull request Feb 21, 2023

Revert "the dev of [FEATURE]Auto reload model when cluster rebooted/n…

812a2a3

…ode rejoin (opensearch-project#711)" This reverts commit 7a51dcc.

ylwu-amzn added a commit to ylwu-amzn/ml-commons that referenced this pull request Feb 21, 2023

Revert "the dev of [FEATURE]Auto reload model when cluster rebooted/n…

1fc5ba4

…ode rejoin (opensearch-project#711)" This reverts commit 7a51dcc. Signed-off-by: Yaliang Wu <ylwu@amazon.com>

ylwu-amzn added a commit to ylwu-amzn/ml-commons that referenced this pull request Feb 21, 2023

Revert "the dev of [FEATURE]Auto reload model when cluster rebooted/n…

7e05b7c

…ode rejoin (opensearch-project#711)" This reverts commit 7a51dcc. Signed-off-by: Yaliang Wu <ylwu@amazon.com>
