Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

add AD task cache #337

Merged
merged 5 commits into from
Dec 23, 2020
Merged

Conversation

ylwu-amzn
Copy link
Contributor

@ylwu-amzn ylwu-amzn commented Dec 20, 2020

Issue #, if available:

Description of changes:

Add AD task cache. We will put RCF&threshold model, shingle data, threshold model training data in cache.

  1. ./gradlew build
  2. ./gradlew integTest -PnumNodes=3

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@codecov
Copy link

codecov bot commented Dec 20, 2020

Codecov Report

Merging #337 (9088012) into master (2ae77ed) will increase coverage by 0.27%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #337      +/-   ##
============================================
+ Coverage     75.50%   75.77%   +0.27%     
- Complexity     2160     2207      +47     
============================================
  Files           207      209       +2     
  Lines         10030    10144     +114     
  Branches        898      902       +4     
============================================
+ Hits           7573     7687     +114     
  Misses         2035     2035              
  Partials        422      422              
Flag Coverage Δ Complexity Δ
cli 79.27% <ø> (ø) 0.00 <ø> (ø)
plugin 75.51% <100.00%> (+0.29%) 0.00 <46.00> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...stroforelasticsearch/ad/AnomalyDetectorPlugin.java 95.04% <ø> (ø) 11.00 <0.00> (ø)
...arch/ad/transport/ADStatsNodesTransportAction.java 100.00% <ø> (+5.55%) 9.00 <0.00> (+1.00)
...n/opendistroforelasticsearch/ad/MemoryTracker.java 77.33% <100.00%> (+0.30%) 21.00 <0.00> (ø)
...ch/ad/common/exception/LimitExceededException.java 100.00% <100.00%> (ø) 3.00 <1.00> (+1.00)
...ticsearch/ad/settings/AnomalyDetectorSettings.java 100.00% <100.00%> (ø) 1.00 <0.00> (ø)
...stroforelasticsearch/ad/task/ADBatchTaskCache.java 100.00% <100.00%> (ø) 15.00 <15.00> (?)
...roforelasticsearch/ad/task/ADTaskCacheManager.java 100.00% <100.00%> (ø) 30.00 <30.00> (?)
...asticsearch/ad/cluster/ADClusterEventListener.java 88.00% <0.00%> (-4.00%) 13.00% <0.00%> (-1.00%)
... and 2 more

Copy link
Contributor

@weicongs-amazon weicongs-amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how this task will be used.

Copy link
Contributor

@weicongs-amazon weicongs-amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* @param taskId task id
* @param trained threshold model trained or not
*/
protected void setThresholdModelTrained(String taskId, boolean trained) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when will you call this method? Threshold model can emit results even if it has only seen 1 rcf score. We use rcf's total updates to measure whether the models are ready or not. Simply put, threshold model is always trained. I wonder why we need a flag to set it trained or not.

Copy link
Contributor Author

@ylwu-amzn ylwu-amzn Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method will be called when historical detector's threshold model finishes training/cold start.

For historical detector, we have all of the data, so we should use these data as much as possible to train and predict better. In realtime detector, we have read some(512) historical sampled data to train model. In historical detector, we use more(1000) data points to train model. For both realtime and historical detectors, the RCF&Threshold model will be continuously trained with following data after cold start. Threshold model will be trained first, then start to output reliable results. So we need to know whether the threshold model is trained or not.

From ML team's suggestion that we should use enough data to train RCF model and they suggest 1000 should be good enough. We may tune this value after performance test and more data quality checking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. 1000 is a trade-off between accuracy and usability. In real time, 128 is the threshold to emit results.

}
checkRunningTaskLimit();
long neededCacheSize = calculateADTaskCacheSize(adTask);
if (!memoryTracker.canAllocate(neededCacheSize)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you missed my comment

if (!memoryTracker.canAllocate(neededCacheSize)) {
throw new LimitExceededException("No enough memory to run detector");
}
memoryTracker.consumeMemory(neededCacheSize, false, HISTORICAL_SINGLE_ENTITY_DETECTOR);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you missed my comment

* {@link java.lang.IllegalArgumentException}
* We throw exception rather than return {@code Optional.empty} or null
* here, so don't need to check task existence by writing duplicate null
* checking code. All AD task exceptions will be handled in AD task manager.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still suggest we remove the exceptions and use Optional or null. In Java, the convention is to always check conditions whenever possible and not to use exceptions for flow control. Conditional check is a jump in the byte code while the exception handling is much more complex. When an exception occurs inside a Java method, the method creates an Exception object and passes the Exception object to the JVM (in Java term, the method "throw" an Exception). The Exception object contains the type of the exception, and the state of the program when the exception occurs. The JVM is responsible for finding an exception handler to process the Exception object. It searches backward through the call stack until it finds a matching exception handler for that particular class of Exception object (in Java term, it is called "catch" the Exception). If the JVM cannot find a matching exception handler in all the methods in the call stack, it terminates the program.

If you don't want to do it, I am fine as well. Just a note this is not a good practice.

Ref: https://stackoverflow.com/questions/8161042/why-use-an-exception-instead-of-if-else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree to not use exception for flow control. Here the exception is not to control the flow, but rather terminate the task run flow, just like any AD result action to terminate detector job run by throwing exceptions. In next PR I may change the method name to reduce the confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. Please change the method name or add comment to make it clearer.

* @param taskId task id
* @param trained threshold model trained or not
*/
protected void setThresholdModelTrained(String taskId, boolean trained) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. 1000 is a trade-off between accuracy and usability. In real time, 128 is the threshold to emit results.

@ylwu-amzn ylwu-amzn merged commit d5683f6 into opendistro-for-elasticsearch:master Dec 23, 2020
@ohltyler ohltyler added the enhancement New feature or request label Feb 2, 2021
@ylwu-amzn ylwu-amzn added feature new feature and removed enhancement New feature or request labels Feb 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants