Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add clustering function - RCFSummarize #355

Merged
merged 2 commits into from
Aug 1, 2022
Merged

Add clustering function - RCFSummarize #355

merged 2 commits into from
Aug 1, 2022

Conversation

model-collapse
Copy link
Collaborator

@model-collapse model-collapse commented Jun 27, 2022

Description

Adding RCFSummarize clustering algo into ml-commons
Issue Link here: #356

Issues Resolved

[List any issues this PR will resolve]

Check List

  • [* ] New functionality includes testing.
    • [* ] All tests pass
  • [* ] New functionality has been documented.
    • [* ] New functionality has javadoc added
  • [* ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@model-collapse model-collapse requested a review from a team June 27, 2022 14:07
@@ -36,7 +36,7 @@ public static byte[] serialize(Object model) {
objectOutputStream.flush();
return byteArrayOutputStream.toByteArray();
} catch (IOException e) {
throw new ModelSerDeSerException("Failed to serialize model.", e.getCause());
throw new ModelSerDeSerException("Failed to serialize model." + e.getMessage(), e.getCause());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: we don't return the error message by design to avoid exposing implementation details.


import java.util.function.BiFunction;
public class MathUtil {
public static Integer iamconst = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where this const being used? Add some comments ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will delete, copied from where else.


Iterable<float[]> centroidsLst = Arrays.asList(summary.summaryPoints);
List<Integer> predictions = new ArrayList<>();
Arrays.stream(featureNamesValues.v2()).forEach(e->predictions.add(MathUtil.findNearest(e, centroidsLst, distance)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, I guess RCF has already calculated nearest centroid for all data points. Why doesn't RCF return the centroids for all data points directly (at least provide some option to let user decide whether output or not)? If so, we can avoid recalculating nearest centroids here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me dig into the code, I am not sure if the library supports exporting these things.

Copy link
Collaborator Author

@model-collapse model-collapse Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opps, they didn't export these things, but it is a good idea to follow.

Copy link
Collaborator

@ylwu-amzn ylwu-amzn Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming, I created an issue to track this aws/random-cut-forest-by-aws#338

parameters.getInitialK(),
parameters.getPhase1Reassign(),
distance,
new Random().nextLong(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: how about reusing some singleton Random object? Any reason to create a new Random object for each request?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

* Copyright OpenSearch Contributors
* SPDX-License-Identifier: Apache-2.0
*/
package org.opensearch.ml.engine.utils;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this class only used by RCF summarizer. How about move it to .../algorithms/clustering package?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will

ylwu-amzn
ylwu-amzn previously approved these changes Jul 1, 2022
@ylwu-amzn
Copy link
Collaborator

DCO check failed, you can add --signoff when commit change

Copy link
Collaborator

@Zhangxunmt Zhangxunmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add "-s" after your "git commit" so it will auto adds the Signed-off-by: xxx .

@codecov-commenter
Copy link

codecov-commenter commented Jul 19, 2022

Codecov Report

Merging #355 (29fbcf0) into main (e51211b) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #355   +/-   ##
=========================================
  Coverage     92.09%   92.09%           
  Complexity      519      519           
=========================================
  Files            58       58           
  Lines          1481     1481           
  Branches        116      116           
=========================================
  Hits           1364     1364           
  Misses           78       78           
  Partials         39       39           
Flag Coverage Δ
ml-commons 92.09% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rg/opensearch/ml/plugin/MachineLearningPlugin.java 100.00% <ø> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us.

// TODO: expose seed?

@Builder(toBuilder = true)
<<<<<<< HEAD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to resolve merge conflict.

public void writeTo(StreamOutput out) throws IOException {
out.writeOptionalInt(maxK);

if (initialK != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 128-133 can be replaced with out.writeOptionalInt(initialK);

public RCFSummarizeParams(StreamInput in) throws IOException {
this.maxK = in.readOptionalInt();

if (in.readBoolean()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 65-67 can be replaced with this.initialK = in.readOptionalInt();

this.initialK = in.readOptionalInt();
}

if (in.readBoolean()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use in.readOptionalBoolean()?

parallel = parser.booleanValue();
break;
case DISTANCE_TYPE_FIELD:
distanceType = DistanceType.from(parser.text());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about transforming parser.text() to upper case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why upper case?

*
* @return array of discovery node
*/
protected DiscoveryNode[] getEligibleNodes() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is not part of your PR. Try to rebase with main branch?

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>
@@ -31,16 +31,25 @@
import java.util.Map;
import java.util.Optional;

/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment, the issue fixed.

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>
Copy link
Collaborator

@ylwu-amzn ylwu-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding this algorithm!

Copy link
Collaborator

@jngz-es jngz-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@ylwu-amzn ylwu-amzn merged commit 3a86a78 into opensearch-project:main Aug 1, 2022
jngz-es pushed a commit to jngz-es/ml-commons that referenced this pull request Aug 3, 2022
* squash everything again

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>

* removing the comment

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>

Co-authored-by: Yang <yych@5c52309d57bd.ant.amazon.com>
Signed-off-by: Jing Zhang <jngz@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 4, 2022
* squash everything again

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>

* removing the comment

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>

Co-authored-by: Yang <yych@5c52309d57bd.ant.amazon.com>
(cherry picked from commit 3a86a78)
Zhangxunmt pushed a commit that referenced this pull request Aug 5, 2022
* squash everything again

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>

* removing the comment

Signed-off-by: Yang <yych@5c52309d57bd.ant.amazon.com>

Co-authored-by: Yang <yych@5c52309d57bd.ant.amazon.com>
(cherry picked from commit 3a86a78)

Co-authored-by: model-collapse <50862890+model-collapse@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants