
Fix restart HCAD detector bug #460

Merged: 2 commits, Mar 23, 2022

Conversation

kaituo (Collaborator) commented Mar 23, 2022

Description

To avoid repeatedly cold starting a model when data is sparse, HCAD keeps a cache that records which models have already gone through cold start. A second cold start attempt for the same model must wait 60 detector intervals. Previously, stopping a detector did not clear this cache, so after a restart the cache still remembered the model and cold start was not retried until the wait had passed. This PR fixes the bug by clearing the cache when a detector is stopped.
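
The behavior described above can be illustrated with a minimal sketch. This is not the plugin's code: the class `ColdStartThrottle`, its methods, and the map layout are hypothetical names chosen for illustration, and only the 60-interval wait and the "clear on stop" step come from the description.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a cold start throttle; names are not taken from the plugin. */
class ColdStartThrottle {
    // Detector intervals to wait before a second cold start attempt (from the PR description).
    private static final int COOL_DOWN_INTERVALS = 60;

    // detectorId -> (modelId -> time of the last cold start attempt)
    private final Map<String, Map<String, Instant>> lastAttempt = new ConcurrentHashMap<>();

    /** Returns true if cold start may run now, and records the attempt. Not atomic; a sketch only. */
    boolean tryColdStart(String detectorId, String modelId, Duration interval, Instant now) {
        Map<String, Instant> models = lastAttempt.computeIfAbsent(detectorId, k -> new ConcurrentHashMap<>());
        Instant previous = models.get(modelId);
        if (previous != null && now.isBefore(previous.plus(interval.multipliedBy(COOL_DOWN_INTERVALS)))) {
            return false; // still inside the waiting period
        }
        models.put(modelId, now);
        return true;
    }

    /** The step the fix adds conceptually: forget all entries of a detector when it is stopped. */
    void clear(String detectorId) {
        lastAttempt.remove(detectorId);
    }

    public static void main(String[] args) {
        ColdStartThrottle throttle = new ColdStartThrottle();
        Instant start = Instant.parse("2022-03-23T00:00:00Z");
        Duration interval = Duration.ofMinutes(1);

        System.out.println(throttle.tryColdStart("detector", "model", interval, start));                  // true
        System.out.println(throttle.tryColdStart("detector", "model", interval, start.plusSeconds(600))); // false
        throttle.clear("detector"); // detector stopped
        System.out.println(throttle.tryColdStart("detector", "model", interval, start.plusSeconds(900))); // true
    }
}
```

Without the `clear` call, the third attempt in `main` would also be rejected, which mirrors the restart bug.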

Testing done:

  1. added unit and integration tests.
  2. manually reproduced the issue and verified the fix.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

Issues Resolved

#400

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ylwu-amzn previously approved these changes Mar 23, 2022
waitAllSyncheticDataIngested(data.size(), datasetName, client);
}

private void waitAllSyncheticDataIngested(int expectedSize, String datasetName, RestClient client) throws Exception {
amitgalitz (Member) commented Mar 23, 2022:

Not for this PR, but could we do something similar for historical tests that are flaky?

kaituo (Collaborator, Author) replied:

yeah, we could
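
The body of `waitAllSyncheticDataIngested` is not shown in this thread. As a rough sketch of how a similar wait could be reused in other flaky tests, assuming it polls a document count until the expected size is reached, a generic helper might look like the following. `IngestionWaiter`, `waitUntilIngested`, and the `IntSupplier` parameter are hypothetical names; a real test would supply a count obtained through the `RestClient`.

```java
import java.util.function.IntSupplier;

/** Hypothetical polling helper; not the plugin's implementation. */
class IngestionWaiter {
    /**
     * Polls countSupplier until it reports at least expectedSize documents
     * or maxAttempts polls have been made. Returns true on success.
     */
    static boolean waitUntilIngested(int expectedSize, IntSupplier countSupplier, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (countSupplier.getAsInt() >= expectedSize) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] ingested = { 0 };
        // Simulated index that gains one document per poll.
        boolean done = waitUntilIngested(3, () -> ++ingested[0], 10, 10L);
        System.out.println(done);
    }
}
```

Bounding the number of attempts keeps a test from hanging when ingestion never completes; the caller can fail the test with a clear message when the helper returns false.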

amitgalitz (Member) previously approved these changes Mar 23, 2022 and left a comment:

Thanks for fixing this and adding tests

Signed-off-by: Kaituo Li <kaituo@amazon.com>
codecov-commenter commented Mar 23, 2022

Codecov Report

Merging #460 (8226be1) into main (3bdb4f6) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main     #460      +/-   ##
============================================
+ Coverage     78.33%   78.35%   +0.02%     
- Complexity     4172     4176       +4     
============================================
  Files           296      296              
  Lines         17657    17661       +4     
  Branches       1879     1879              
============================================
+ Hits          13832    13839       +7     
+ Misses         2945     2940       -5     
- Partials        880      882       +2     
| Flag | Coverage Δ |
|------|------------|
| plugin | 78.35% <100.00%> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
|----------------|------------|
| ...n/java/org/opensearch/ad/ml/EntityColdStarter.java | 83.73% <100.00%> (+2.59%) ⬆️ |
| ...earch/ad/transport/DeleteModelTransportAction.java | 96.15% <100.00%> (+0.32%) ⬆️ |
| ...rch/ad/transport/ForwardADTaskTransportAction.java | 94.06% <0.00%> (-3.39%) ⬇️ |
| ...ansport/handler/AnomalyResultBulkIndexHandler.java | 67.74% <0.00%> (-3.23%) ⬇️ |
| ...port/SearchAnomalyDetectorInfoTransportAction.java | 66.66% <0.00%> (-2.23%) ⬇️ |
| ...opensearch/ad/indices/AnomalyDetectionIndices.java | 72.28% <0.00%> (ø) |
| .../main/java/org/opensearch/ad/NodeStateManager.java | 71.89% <0.00%> (+0.65%) ⬆️ |
| ...rch/ad/transport/AnomalyResultTransportAction.java | 80.82% <0.00%> (+0.68%) ⬆️ |

@kaituo kaituo merged commit 9dd9718 into opensearch-project:main Mar 23, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 23, 2022
* Fix restart HCAD detector bug

(cherry picked from commit 9dd9718)
opensearch-trigger-bot commented:

The backport to 1.2 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.2 1.2
# Navigate to the new working tree
cd .worktrees/backport-1.2
# Create a new branch
git switch --create backport/backport-460-to-1.2
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9dd9718748cd8d6917b10d66f56ca9e8ed117d7e
# Push it to GitHub
git push --set-upstream origin backport/backport-460-to-1.2
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.2

Then, create a pull request where the base branch is 1.2 and the compare/head branch is backport/backport-460-to-1.2.

opensearch-trigger-bot commented:

The backport to 1.1 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.1 1.1
# Navigate to the new working tree
cd .worktrees/backport-1.1
# Create a new branch
git switch --create backport/backport-460-to-1.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9dd9718748cd8d6917b10d66f56ca9e8ed117d7e
# Push it to GitHub
git push --set-upstream origin backport/backport-460-to-1.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.1

Then, create a pull request where the base branch is 1.1 and the compare/head branch is backport/backport-460-to-1.1.

opensearch-trigger-bot commented:

The backport to 1.0 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.0 1.0
# Navigate to the new working tree
cd .worktrees/backport-1.0
# Create a new branch
git switch --create backport/backport-460-to-1.0
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9dd9718748cd8d6917b10d66f56ca9e8ed117d7e
# Push it to GitHub
git push --set-upstream origin backport/backport-460-to-1.0
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.0

Then, create a pull request where the base branch is 1.0 and the compare/head branch is backport/backport-460-to-1.0.

opensearch-trigger-bot commented:

The backport to 1.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.x 1.x
# Navigate to the new working tree
cd .worktrees/backport-1.x
# Create a new branch
git switch --create backport/backport-460-to-1.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9dd9718748cd8d6917b10d66f56ca9e8ed117d7e
# Push it to GitHub
git push --set-upstream origin backport/backport-460-to-1.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.x

Then, create a pull request where the base branch is 1.x and the compare/head branch is backport/backport-460-to-1.x.

kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request Apr 8, 2022
* Fix restart HCAD detector bug

kaituo mentioned this pull request Apr 8, 2022
amitgalitz added the bug label Apr 20, 2022
kaituo added a commit that referenced this pull request Apr 22, 2022
* Fix restart HCAD detector bug (#460)

* Fix restart HCAD detector bug

* Adding test-retry plugin (#456)

* backport cve fix and improve restart IT
