Add clustering function - RCFSummarize #355
@@ -36,7 +36,7 @@ public static byte[] serialize(Object model) {
objectOutputStream.flush();
return byteArrayOutputStream.toByteArray();
} catch (IOException e) {
- throw new ModelSerDeSerException("Failed to serialize model.", e.getCause());
+ throw new ModelSerDeSerException("Failed to serialize model." + e.getMessage(), e.getCause());
Minor: we don't return the error message by design to avoid exposing implementation details.
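The point above can be illustrated with a minimal sketch: the exception message stays fixed and generic, while the underlying exception is only chained as the cause. The names SafeSerializer and SerDeException are hypothetical stand-ins, not the project's ModelSerDeSerException, and chaining the full exception (rather than e.getCause()) is a choice made for this sketch only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical stand-in for the project's serializer; for illustration only.
class SafeSerializer {

    // Hypothetical exception type with a generic, caller-facing message.
    static class SerDeException extends RuntimeException {
        SerDeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static byte[] serialize(Object model) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            // Fixed message: nothing from e.getMessage() is concatenated in,
            // so class names or other internals do not leak into the message.
            throw new SerDeException("Failed to serialize model.", e);
        }
    }
}
```

Whether the chained cause is ever shown to an end user depends on how the caller renders the exception, so this only keeps the top-level message free of implementation details.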
import java.util.function.BiFunction;
public class MathUtil {
public static Integer iamconst = 1;
Where is this constant being used? Could you add a comment?
Will delete; it was copied from somewhere else.
Iterable<float[]> centroidsLst = Arrays.asList(summary.summaryPoints);
List<Integer> predictions = new ArrayList<>();
Arrays.stream(featureNamesValues.v2()).forEach(e->predictions.add(MathUtil.findNearest(e, centroidsLst, distance)));
Just curious: I guess RCF has already calculated the nearest centroid for all data points. Why doesn't RCF return the centroids for all data points directly (or at least provide an option to let the user decide whether to output them)? If it did, we could avoid recalculating the nearest centroids here.
Let me dig into the code; I am not sure whether the library supports exporting these.
Oops, the library doesn't export these, but it is a good idea to follow up on.
Thanks for confirming. I created an issue to track this: aws/random-cut-forest-by-aws#338
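For context, the assignment step discussed in this thread (labeling each point with its nearest summary point) can be sketched as follows. This is a self-contained illustration, not the PR's MathUtil.findNearest: the class name NearestCentroidSketch, the assign helper, and the squared-Euclidean distance are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch of nearest-centroid assignment; names are illustrative.
class NearestCentroidSketch {

    // Squared Euclidean distance; assumes both arrays have the same length.
    static double squaredEuclidean(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the index of the closest centroid, or -1 if the list is empty.
    static int findNearest(float[] point, List<float[]> centroids,
                           BiFunction<float[], float[], Double> distance) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance.apply(point, centroids.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    // Assigns every point to its nearest centroid: one distance call per
    // (point, centroid) pair.
    static List<Integer> assign(float[][] points, List<float[]> centroids,
                                BiFunction<float[], float[], Double> distance) {
        List<Integer> predictions = new ArrayList<>(points.length);
        for (float[] point : points) {
            predictions.add(findNearest(point, centroids, distance));
        }
        return predictions;
    }
}
```

The extra pass costs one distance evaluation per point per centroid, which is the recomputation the reviewer hoped to avoid if the library exposed its own assignments.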
parameters.getInitialK(),
parameters.getPhase1Reassign(),
distance,
new Random().nextLong(),
Minor: how about reusing a singleton Random object? Is there any reason to create a new Random object for each request?
Agree
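One way to follow this suggestion is a small holder around a single shared java.util.Random, sketched below. The class name SeedSource and its methods are hypothetical and not part of the PR; the injectable constructor exists only so the behavior can be checked with a fixed seed.

```java
import java.util.Random;

// Hypothetical holder for a process-wide Random; names are illustrative.
final class SeedSource {
    // Created once and reused instead of calling new Random() per request.
    private static final SeedSource SHARED = new SeedSource(new Random());

    private final Random random;

    // Package-private so tests can inject a Random with a fixed seed.
    SeedSource(Random random) {
        this.random = random;
    }

    static SeedSource shared() {
        return SHARED;
    }

    // java.util.Random is thread-safe, so concurrent callers may share it.
    long nextSeed() {
        return random.nextLong();
    }
}
```

A call site would then pass SeedSource.shared().nextSeed() where the diff currently passes new Random().nextLong(). Under heavy concurrency, ThreadLocalRandom.current().nextLong() is a common alternative that avoids contention on one shared instance.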
* Copyright OpenSearch Contributors
* SPDX-License-Identifier: Apache-2.0
*/
package org.opensearch.ml.engine.utils;
It seems this class is only used by the RCF summarizer. How about moving it to the .../algorithms/clustering package?
Ok, will
DCO check failed, you can add |
Add "-s" to your "git commit" command so that it automatically adds the Signed-off-by: xxx line.
Codecov Report
@@ Coverage Diff @@
## main #355 +/- ##
=========================================
Coverage 92.09% 92.09%
Complexity 519 519
=========================================
Files 58 58
Lines 1481 1481
Branches 116 116
=========================================
Hits 1364 1364
Misses 78 78
Partials 39 39
// TODO: expose seed?

@Builder(toBuilder = true)
<<<<<<< HEAD
Need to resolve merge conflict.
public void writeTo(StreamOutput out) throws IOException {
out.writeOptionalInt(maxK);

if (initialK != null) {
Lines 128-133 can be replaced with out.writeOptionalInt(initialK);
public RCFSummarizeParams(StreamInput in) throws IOException {
this.maxK = in.readOptionalInt();

if (in.readBoolean()) {
Lines 65-67 can be replaced with this.initialK = in.readOptionalInt();
this.initialK = in.readOptionalInt();
}

if (in.readBoolean()) {
Use in.readOptionalBoolean()?
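The manual pattern these comments ask to collapse (a presence flag followed by the value) is sketched below with plain java.io streams. This is not the OpenSearch StreamOutput/StreamInput implementation; OptionalIntCodec and its methods are hypothetical helpers that only mirror the "if (initialK != null)" and "if (in.readBoolean())" shape visible in the diff.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper illustrating the presence-flag pattern; not OpenSearch code.
class OptionalIntCodec {

    // Writes a boolean presence flag, then the int only when a value exists.
    static void writeOptionalInt(DataOutputStream out, Integer value) throws IOException {
        if (value == null) {
            out.writeBoolean(false);
        } else {
            out.writeBoolean(true);
            out.writeInt(value);
        }
    }

    // Reads the flag first and consumes an int only when the flag is true.
    static Integer readOptionalInt(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readInt() : null;
    }

    // Convenience round trip used to check that write and read stay symmetric.
    static Integer roundTrip(Integer value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                writeOptionalInt(out, value);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readOptionalInt(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the flag handling inside one write helper and one read helper means the two sides cannot drift apart, which is the practical benefit of using the optional read/write calls over hand-written null checks.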
parallel = parser.booleanValue();
break;
case DISTANCE_TYPE_FIELD:
distanceType = DistanceType.from(parser.text());
How about transforming parser.text() to upper case?
Why upper case?
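One plausible motivation, assuming DistanceType.from resolves names through Enum.valueOf (the thread does not confirm this): valueOf is case-sensitive, so a request containing "l2" would fail unless the text is normalized first. The enum below is a hypothetical stand-in with made-up constants, not the PR's DistanceType.

```java
import java.util.Locale;

// Hypothetical stand-in for the PR's DistanceType; constants are illustrative.
enum DistanceTypeSketch {
    L1,
    L2,
    LINFINITY;

    // Normalizes case before the case-sensitive Enum.valueOf lookup.
    // Locale.ROOT keeps the result independent of the JVM's default locale.
    static DistanceTypeSketch from(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

With this normalization, "l2", "L2", and " l2 " all resolve to the same constant, while an unknown name still raises IllegalArgumentException.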
*
* @return array of discovery node
*/
protected DiscoveryNode[] getEligibleNodes() {
I think this change is not part of your PR. Could you try rebasing onto the main branch?
Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>
@@ -31,16 +31,25 @@
import java.util.Map;
import java.util.Optional;

/**
Remove this comment; the issue is fixed.
Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>
LGTM. Thanks for adding this algorithm!
LGTM, thanks.
* squash everything again Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com> * removing the comment Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com> Co-authored-by: Yang <yych@5c52309d57bd.ant.amazon.com> Signed-off-by: Jing Zhang <jngz@amazon.com>
* squash everything again Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com> * removing the comment Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com> Co-authored-by: Yang <yych@5c52309d57bd.ant.amazon.com> (cherry picked from commit 3a86a78)
* squash everything again Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com> * removing the comment Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com> Co-authored-by: Yang <yych@5c52309d57bd.ant.amazon.com> (cherry picked from commit 3a86a78) Co-authored-by: model-collapse <50862890+model-collapse@users.noreply.github.com>
Description
Adds the RCFSummarize clustering algorithm to ml-commons.
Issue Link here: #356
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.