
Add Logistic Regression algorithm #383

Merged 4 commits into opensearch-project:main on Aug 4, 2022

Conversation

@Zhangxunmt Zhangxunmt (Collaborator) commented Aug 2, 2022

Signed-off-by: Xun Zhang <xunzh@amazon.com>

Description

Add the Tribuo Logistic Regression algorithm to ml-commons.
Unit tests (UTs) and integration tests (ITs) for logistic regression are included in the PR. Comments from the previous PR have been addressed.

Issues Resolved

#318

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Zhangxunmt Zhangxunmt requested review from a team, jngz-es and ylwu-amzn August 2, 2022 00:21
@codecov-commenter commented Aug 2, 2022

Codecov Report

Merging #383 (869d70d) into main (0618b93) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #383   +/-   ##
=========================================
  Coverage     92.09%   92.09%           
  Complexity      519      519           
=========================================
  Files            58       58           
  Lines          1481     1481           
  Branches        116      116           
=========================================
  Hits           1364     1364           
  Misses           78       78           
  Partials         39       39           
Flag         Coverage Δ
ml-commons   92.09% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
...rg/opensearch/ml/plugin/MachineLearningPlugin.java   100.00% <ø> (ø)


@@ -95,6 +134,9 @@ public static <T extends Output<T>> MutableDataset<T> generateDataset(DataFrame
// TODO: support anomaly labels to evaluate prediction result
example = new ArrayExample<>((T) new Event(defaultEventType), featureNamesValues.v1(), featureNamesValues.v2()[i]);
break;
case LABEL:
example = new ArrayExample<>((T) new Label(""), featureNamesValues.v1(), featureNamesValues.v2()[i]);
@Craigacp

You can use UNKNOWN_LABEL for this if you want. It reduces the amount of allocation, and is fine to be shared among examples at prediction time. Or outputFactory.getUnknownOutput() as you already have an output factory reference.

@Zhangxunmt Zhangxunmt (Collaborator, Author)

Replaced with outputFactory.getUnknownOutput().
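
For reference, a minimal sketch (assumed helper name, not the PR's actual code) of building a prediction-time example from the factory's shared unknown output instead of allocating a fresh Label(""):

import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.impl.ArrayExample;

public class UnknownOutputSketch {
    // Hypothetical helper: builds an unlabeled example for prediction time.
    public static ArrayExample<Label> unlabeledExample(LabelFactory outputFactory,
                                                       String[] featureNames,
                                                       double[] featureValues) {
        // getUnknownOutput() returns a shared sentinel Label, so nothing new
        // is allocated per example.
        return new ArrayExample<>(outputFactory.getUnknownOutput(), featureNames, featureValues);
    }
}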

"Logistic regression training data from OpenSearch", TribuoOutputType.LABEL, parameters.getTarget());
// Integer epochs = Optional.ofNullable(parameters.getEpochs()).orElse(DEFAULT_EPOCHS);
// LinearSGDTrainer(objective=LogMulticlass,optimiser=AdaGrad(initialLearningRate=1.0,epsilon=0.1,initialValue=0.0),epochs=5,minibatchSize=1,seed=12345)
Trainer<Label> logisticRegressionTrainer = new LogisticRegressionTrainer();
@Craigacp

Our LogisticRegressionTrainer is designed to be very simple for onboarding users & demos. I recommend you use LinearSGDTrainer and configure it, given that you're wrapping the whole interface anyway.

@ylwu-amzn ylwu-amzn (Collaborator) commented Aug 2, 2022

@Craigacp Thanks for your suggestion. We have integrated LinearSGDTrainer to provide linear regression: https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/regression/LinearRegression.java#L203.

Do you mean we should wrap LinearSGDTrainer rather than LogisticRegressionTrainer to provide logistic regression? We'd appreciate it if you could provide an example.

@Craigacp

Well, this is slightly unfortunate naming on our part. We have three classes all called LinearSGDTrainer: org.tribuo.classification.sgd.linear.LinearSGDTrainer, org.tribuo.multilabel.sgd.linear.LinearSGDTrainer, and org.tribuo.regression.sgd.linear.LinearSGDTrainer. They have basically the same interface, but are constrained by the type system to accept the appropriate kind of objective function, and have the right generic types. LogisticRegressionTrainer is just an org.tribuo.classification.sgd.linear.LinearSGDTrainer subclass with a constructor that has fairly sensible defaults for running quick tests & demos. So for your use case I'd recommend org.tribuo.classification.sgd.linear.LinearSGDTrainer, since then you can expose basically the same set of options as you do for linear regression (aside from the difference in objective functions).
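
As a rough sketch of that suggestion (parameter names here are illustrative, not the PR's actual fields), the classification LinearSGDTrainer reproduces LogisticRegressionTrainer's defaults while leaving the knobs configurable:

import org.tribuo.Trainer;
import org.tribuo.classification.Label;
import org.tribuo.classification.sgd.linear.LinearSGDTrainer;
import org.tribuo.classification.sgd.objectives.LogMulticlass;
import org.tribuo.math.optimisers.AdaGrad;

public class LogisticRegressionTrainerSketch {
    public static Trainer<Label> buildTrainer(int epochs, int loggingInterval, int minibatchSize, long seed) {
        // LogMulticlass + a linear model is logistic regression; epochs, logging
        // interval, minibatch size and seed stay configurable from the request.
        return new LinearSGDTrainer(
            new LogMulticlass(),
            new AdaGrad(1.0, 0.1),   // LogisticRegressionTrainer's default optimiser settings
            epochs,
            loggingInterval,
            minibatchSize,
            seed);
    }
}

With epochs=5, minibatch size 1, and seed 12345, this mirrors the defaults quoted in the commented-out line above.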

@ylwu-amzn ylwu-amzn (Collaborator) commented Aug 2, 2022

Thanks for the details. @Zhangxunmt I think we should change to org.tribuo.classification.sgd.linear.LinearSGDTrainer.

BTW, @Craigacp, for the regression org.tribuo.regression.sgd.linear.LinearSGDTrainer and this classification org.tribuo.classification.sgd.linear.LinearSGDTrainer, do you have any suggestions for default parameter values such as the objective function, epochs, batch size, etc.? Is there any way or API to tell whether the model is converging or good enough?

@Craigacp

For classification/logistic regression, the defaults we use in the constructor are fine - https://github.com/oracle/tribuo/blob/main/Classification/SGD/src/main/java/org/tribuo/classification/sgd/linear/LogisticRegressionTrainer.java#L44. You could use them for linear regression too, substituting SquaredLoss for the LogMulticlass objective. For linear models a batch size of 1 tends to work just fine, and there isn't really any performance benefit to larger ones given the way the linear algebra system is set up (as it's not multithreaded yet).

Unfortunately we don't have any convergence metrics like validation error built into Tribuo's trainers, and we'd need to modify the trainer API to enable such things.
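
On the regression side, a comparable sketch (again with illustrative parameter names) just swaps SquaredLoss in for LogMulticlass:

import org.tribuo.Trainer;
import org.tribuo.math.optimisers.AdaGrad;
import org.tribuo.regression.Regressor;
import org.tribuo.regression.sgd.linear.LinearSGDTrainer;
import org.tribuo.regression.sgd.objectives.SquaredLoss;

public class LinearRegressionTrainerSketch {
    public static Trainer<Regressor> buildTrainer(int epochs, long seed) {
        // Same shape as the classification trainer, with SquaredLoss as the objective;
        // this constructor defaults to a minibatch size of 1, which is fine for linear models.
        return new LinearSGDTrainer(new SquaredLoss(), new AdaGrad(1.0, 0.1), epochs, seed);
    }
}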

@Zhangxunmt Zhangxunmt (Collaborator, Author) commented Aug 2, 2022

I was using LinearSGDTrainer with the default parameters in the first place (see the commented-out line); then I realized it's almost equivalent to the provided LogisticRegressionTrainer, since that is basically LinearSGDTrainer under the hood. We can change it to LinearSGDTrainer with the default setup and let customers configure parameters through the training requests.

@@ -131,6 +131,24 @@ public DataFrame select(int[] columns) {
return new DefaultDataFrame(newColumnMetas, rows.stream().map(row-> row.select(columns)).collect(Collectors.toList()));
}

@Override
public int getColumnIndex(String target) {
List<String> featureNames = Arrays.stream(this.columnMetas()).map(ColumnMeta::getName).collect(Collectors.toList());
Collaborator

Minor: rename featureNames to columnNames?
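
A sketch of how the method might read with that rename; everything past the quoted lines (the index lookup and the error case) is assumed rather than taken from the diff:

@Override
public int getColumnIndex(String target) {
    // Renamed from featureNames: these are all column names, not just features.
    List<String> columnNames = Arrays.stream(this.columnMetas())
        .map(ColumnMeta::getName)
        .collect(Collectors.toList());
    int index = columnNames.indexOf(target);
    if (index < 0) {
        // Assumed behavior: fail fast on an unknown target column.
        throw new IllegalArgumentException("No column found for target: " + target);
    }
    return index;
}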

SIMPLE_SGD,
LINEAR_DECAY_SGD,
SQRT_DECAY_SGD,
ADA_GRAD,
Collaborator

@Craigacp could you help review whether all of these are supported for logistic regression?

@Craigacp

These are all supported.

}

if (parameters.getLoggingInterval() != null && parameters.getLoggingInterval() < 0) {
throw new IllegalArgumentException("Logging intervals should not be negative.");
Collaborator

From the javadoc of org.tribuo.classification.sgd.linear.LinearSGDTrainer:

Log the loss after this many iterations. If -1 don't log anything.

@Zhangxunmt Zhangxunmt (Collaborator, Author)

Good catch!
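
A sketch of the adjusted check, assuming the parameter keeps Tribuo's convention that -1 disables logging (the message text is illustrative):

Integer loggingInterval = parameters.getLoggingInterval();
// Tribuo treats -1 as "don't log"; otherwise the interval should be a positive iteration count.
if (loggingInterval != null && loggingInterval != -1 && loggingInterval <= 0) {
    throw new IllegalArgumentException("Logging interval should be a positive integer, or -1 to disable logging.");
}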

Double epsilon = Optional.ofNullable(parameters.getEpsilon()).orElse(DEFAULT_EPSILON);

switch (optimizerType) {
// ToDo: Add more possible optimizer. Tribuo only provides AdaGrad for logistic regression.
Collaborator

@Craigacp can you confirm whether this is true? Does org.tribuo.classification.sgd.linear.LinearSGDTrainer support other optimizers?

@Craigacp

Yes, it supports everything that implements org.tribuo.math.StochasticGradientOptimiser, which is everything in this folder - https://github.com/oracle/tribuo/tree/main/Math/src/main/java/org/tribuo/math/optimisers. I wouldn't recommend Pegasos though; it's fussy to tune. AdaGrad is a sensible default for convex problems like logistic regression, but all of them should work. Note that ParameterAveraging requires a nested gradient optimiser to average over, so you might not want to expose that because it'll be more complicated.

@Zhangxunmt Zhangxunmt (Collaborator, Author)

Thanks for the clarification. These optimizers are all supported in the PR now.
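
A rough sketch of mapping those enum values onto Tribuo optimisers (the enum mirrors the snippet quoted above; the exact wiring in the PR is assumed):

import org.tribuo.math.StochasticGradientOptimiser;
import org.tribuo.math.optimisers.AdaGrad;
import org.tribuo.math.optimisers.SGD;

public class OptimiserSketch {
    public enum OptimizerType { SIMPLE_SGD, LINEAR_DECAY_SGD, SQRT_DECAY_SGD, ADA_GRAD }

    public static StochasticGradientOptimiser build(OptimizerType type, double learningRate, double epsilon) {
        switch (type) {
            case SIMPLE_SGD:
                return SGD.getSimpleSGD(learningRate);       // constant learning rate
            case LINEAR_DECAY_SGD:
                return SGD.getLinearDecaySGD(learningRate);  // learning rate decays linearly over iterations
            case SQRT_DECAY_SGD:
                return SGD.getSqrtDecaySGD(learningRate);    // learning rate decays with the square root of the iteration
            case ADA_GRAD:
            default:
                return new AdaGrad(learningRate, epsilon);   // sensible default for convex problems
        }
    }
}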

Signed-off-by: Xun Zhang <xunzh@amazon.com>
Signed-off-by: Xun Zhang <xunzh@amazon.com>
Signed-off-by: Xun Zhang <xunzh@amazon.com>
Signed-off-by: Xun Zhang <xunzh@amazon.com>
@Zhangxunmt Zhangxunmt merged commit 23cbbb4 into opensearch-project:main Aug 4, 2022
@opensearch-trigger-bot (Contributor)

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-383-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 23cbbb4abee7641d1ea236e10361853156f19ecd
# Push it to GitHub
git push --set-upstream origin backport/backport-383-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-383-to-2.x.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 4, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 23cbbb4)
Zhangxunmt added a commit that referenced this pull request Aug 4, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 23cbbb4)

Co-authored-by: Xun Zhang <xunzh@amazon.com>
Zhangxunmt added a commit to Zhangxunmt/ml-commons that referenced this pull request Aug 5, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>
Zhangxunmt added a commit that referenced this pull request Aug 5, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>