
Add Logistic Regression algorithm #383

Merged 4 commits into opensearch-project:main on Aug 4, 2022

Conversation

@Zhangxunmt Zhangxunmt (Collaborator) commented Aug 2, 2022

Signed-off-by: Xun Zhang <xunzh@amazon.com>

Description

Add the Tribuo Logistic Regression algorithm to ml-commons.
Unit tests (UTs) and integration tests (ITs) for logistic regression are included in the PR. Comments from the previous PR have been addressed.

Issues Resolved

#318

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Zhangxunmt Zhangxunmt requested review from a team, jngz-es and ylwu-amzn August 2, 2022 00:21
@codecov-commenter commented Aug 2, 2022

Codecov Report

Merging #383 (869d70d) into main (0618b93) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #383   +/-   ##
=========================================
  Coverage     92.09%   92.09%           
  Complexity      519      519           
=========================================
  Files            58       58           
  Lines          1481     1481           
  Branches        116      116           
=========================================
  Hits           1364     1364           
  Misses           78       78           
  Partials         39       39           
Flag         Coverage Δ
ml-commons   92.09% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
...rg/opensearch/ml/plugin/MachineLearningPlugin.java   100.00% <ø> (ø)


@@ -95,6 +134,9 @@ public static <T extends Output<T>> MutableDataset<T> generateDataset(DataFrame
// TODO: support anomaly labels to evaluate prediction result
example = new ArrayExample<>((T) new Event(defaultEventType), featureNamesValues.v1(), featureNamesValues.v2()[i]);
break;
case LABEL:
example = new ArrayExample<>((T) new Label(""), featureNamesValues.v1(), featureNamesValues.v2()[i]);
@Craigacp

You can use UNKNOWN_LABEL for this if you want. It reduces the amount of allocation, and is fine to be shared among examples at prediction time. Or outputFactory.getUnknownOutput() as you already have an output factory reference.

@Zhangxunmt Zhangxunmt (Collaborator, Author)

Replaced with outputFactory.getUnknownOutput().
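
For reference, a minimal sketch (assumed helper name, not the PR's actual code) of building a prediction-time example from the factory's shared unknown output instead of allocating a fresh Label(""):

import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.impl.ArrayExample;

public class UnknownOutputSketch {
    // Hypothetical helper: builds an unlabeled example for prediction time.
    public static ArrayExample<Label> unlabeledExample(LabelFactory outputFactory,
                                                       String[] featureNames,
                                                       double[] featureValues) {
        // getUnknownOutput() returns a shared sentinel Label, so nothing new
        // is allocated per example.
        return new ArrayExample<>(outputFactory.getUnknownOutput(), featureNames, featureValues);
    }
}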

"Logistic regression training data from OpenSearch", TribuoOutputType.LABEL, parameters.getTarget());
// Integer epochs = Optional.ofNullable(parameters.getEpochs()).orElse(DEFAULT_EPOCHS);
// LinearSGDTrainer(objective=LogMulticlass,optimiser=AdaGrad(initialLearningRate=1.0,epsilon=0.1,initialValue=0.0),epochs=5,minibatchSize=1,seed=12345)
Trainer<Label> logisticRegressionTrainer = new LogisticRegressionTrainer();
@Craigacp

Our LogisticRegressionTrainer is designed to be very simple for onboarding users & demos. I recommend you use LinearSGDTrainer and configure it, given that you're wrapping the whole interface anyway.

@ylwu-amzn ylwu-amzn (Collaborator) commented Aug 2, 2022

@Craigacp Thanks for your suggestion. We have integrated LinearSGDTrainer to provide linear regression: https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/regression/LinearRegression.java#L203.

Do you mean we should wrap LinearSGDTrainer rather than LogisticRegressionTrainer to provide logistic regression? We'd appreciate it if you could provide an example.

@Craigacp

Well, this is slightly unfortunate naming on our part. We have three classes all called LinearSGDTrainer: org.tribuo.classification.sgd.linear.LinearSGDTrainer, org.tribuo.multilabel.sgd.linear.LinearSGDTrainer, and org.tribuo.regression.sgd.linear.LinearSGDTrainer. They have basically the same interface, but are constrained by the type system to accept the appropriate kind of objective function, and have the right generic types. LogisticRegressionTrainer is just an org.tribuo.classification.sgd.linear.LinearSGDTrainer subclass with a constructor that has fairly sensible defaults for running quick tests & demos. So for your use case I'd recommend org.tribuo.classification.sgd.linear.LinearSGDTrainer, since then you can expose basically the same set of options as you do for linear regression (aside from the difference in objective functions).
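
As a rough sketch of that suggestion (parameter names here are illustrative, not the PR's actual fields), the classification LinearSGDTrainer reproduces LogisticRegressionTrainer's defaults while leaving the knobs configurable:

import org.tribuo.Trainer;
import org.tribuo.classification.Label;
import org.tribuo.classification.sgd.linear.LinearSGDTrainer;
import org.tribuo.classification.sgd.objectives.LogMulticlass;
import org.tribuo.math.optimisers.AdaGrad;

public class LogisticRegressionTrainerSketch {
    public static Trainer<Label> buildTrainer(int epochs, int loggingInterval, int minibatchSize, long seed) {
        // LogMulticlass + a linear model is logistic regression; epochs, logging
        // interval, minibatch size and seed stay configurable from the request.
        return new LinearSGDTrainer(
            new LogMulticlass(),
            new AdaGrad(1.0, 0.1),   // LogisticRegressionTrainer's default optimiser settings
            epochs,
            loggingInterval,
            minibatchSize,
            seed);
    }
}

With epochs=5, minibatch size 1, and seed 12345, this mirrors the defaults quoted in the commented-out line above.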

@ylwu-amzn ylwu-amzn (Collaborator) commented Aug 2, 2022

Thanks for the details. @Zhangxunmt I think we should change to org.tribuo.classification.sgd.linear.LinearSGDTrainer.

BTW, @Craigacp, for the regression org.tribuo.regression.sgd.linear.LinearSGDTrainer and this classification org.tribuo.classification.sgd.linear.LinearSGDTrainer, do you have any suggestions for default parameter values such as the objective function, epochs, batch size, etc.? Is there any way or API to tell whether the model is converging or good enough?

@Craigacp

For classification/logistic regression, the defaults we use in the constructor are fine - https://github.com/oracle/tribuo/blob/main/Classification/SGD/src/main/java/org/tribuo/classification/sgd/linear/LogisticRegressionTrainer.java#L44. You could use them for linear regression too, substituting SquaredLoss for the LogMulticlass objective. For linear models a batch size of 1 tends to work just fine, and there isn't really any performance benefit to larger ones given the way the linear algebra system is set up (as it's not multithreaded yet).

Unfortunately we don't have any convergence metrics like validation error built into Tribuo's trainers, and we'd need to modify the trainer API to enable such things.
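
On the regression side, a comparable sketch (again with illustrative parameter names) just swaps SquaredLoss in for LogMulticlass:

import org.tribuo.Trainer;
import org.tribuo.math.optimisers.AdaGrad;
import org.tribuo.regression.Regressor;
import org.tribuo.regression.sgd.linear.LinearSGDTrainer;
import org.tribuo.regression.sgd.objectives.SquaredLoss;

public class LinearRegressionTrainerSketch {
    public static Trainer<Regressor> buildTrainer(int epochs, long seed) {
        // Same shape as the classification trainer, with SquaredLoss as the objective;
        // this constructor defaults to a minibatch size of 1, which is fine for linear models.
        return new LinearSGDTrainer(new SquaredLoss(), new AdaGrad(1.0, 0.1), epochs, seed);
    }
}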

@Zhangxunmt Zhangxunmt (Collaborator, Author) commented Aug 2, 2022

I was using LinearSGDTrainer with the default parameters in the first place (see the commented-out line); then I realized it's almost equivalent to the provided LogisticRegressionTrainer, since that is basically LinearSGDTrainer under the hood. We can change it to LinearSGDTrainer with the default setup and let customers configure parameters through the training requests.

@@ -131,6 +131,24 @@ public DataFrame select(int[] columns) {
return new DefaultDataFrame(newColumnMetas, rows.stream().map(row-> row.select(columns)).collect(Collectors.toList()));
}

@Override
public int getColumnIndex(String target) {
List<String> featureNames = Arrays.stream(this.columnMetas()).map(ColumnMeta::getName).collect(Collectors.toList());
Collaborator

Minor: rename featureNames to columnNames?
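
A sketch of how the method might read with that rename; everything past the quoted lines (the index lookup and the error case) is assumed rather than taken from the diff:

@Override
public int getColumnIndex(String target) {
    // Renamed from featureNames: these are all column names, not just features.
    List<String> columnNames = Arrays.stream(this.columnMetas())
        .map(ColumnMeta::getName)
        .collect(Collectors.toList());
    int index = columnNames.indexOf(target);
    if (index < 0) {
        // Assumed behavior: fail fast on an unknown target column.
        throw new IllegalArgumentException("No column found for target: " + target);
    }
    return index;
}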

SIMPLE_SGD,
LINEAR_DECAY_SGD,
SQRT_DECAY_SGD,
ADA_GRAD,
Collaborator

@Craigacp could you help review whether all of these are supported for logistic regression?

@Craigacp

These are all supported.

}

if (parameters.getLoggingInterval() != null && parameters.getLoggingInterval() < 0) {
throw new IllegalArgumentException("Logging intervals should not be negative.");
Collaborator

From the javadoc of org.tribuo.classification.sgd.linear.LinearSGDTrainer:

Log the loss after this many iterations. If -1 don't log anything.

@Zhangxunmt Zhangxunmt (Collaborator, Author)

Good catch!
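
A sketch of the adjusted check, assuming the parameter keeps Tribuo's convention that -1 disables logging (the message text is illustrative):

Integer loggingInterval = parameters.getLoggingInterval();
// Tribuo treats -1 as "don't log"; otherwise the interval should be a positive iteration count.
if (loggingInterval != null && loggingInterval != -1 && loggingInterval <= 0) {
    throw new IllegalArgumentException("Logging interval should be a positive integer, or -1 to disable logging.");
}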

Double epsilon = Optional.ofNullable(parameters.getEpsilon()).orElse(DEFAULT_EPSILON);

switch (optimizerType) {
// ToDo: Add more possible optimizer. Tribuo only provides AdaGrad for logistic regression.
Collaborator

@Craigacp can you confirm whether this is true? Does org.tribuo.classification.sgd.linear.LinearSGDTrainer support other optimizers?

@Craigacp

Yes, it supports everything that implements org.tribuo.math.StochasticGradientOptimiser, which is everything in this folder - https://github.com/oracle/tribuo/tree/main/Math/src/main/java/org/tribuo/math/optimisers. I wouldn't recommend Pegasos though; it's fussy to tune. AdaGrad is a sensible default for convex problems like logistic regression, but all of them should work. Note that ParameterAveraging requires a nested gradient optimiser to average over, so you might not want to expose that because it'll be more complicated.

@Zhangxunmt Zhangxunmt (Collaborator, Author)

Thanks for the clarification. These optimizers are all supported in the PR now.
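
A rough sketch of mapping those enum values onto Tribuo optimisers (the enum mirrors the snippet quoted above; the exact wiring in the PR is assumed):

import org.tribuo.math.StochasticGradientOptimiser;
import org.tribuo.math.optimisers.AdaGrad;
import org.tribuo.math.optimisers.SGD;

public class OptimiserSketch {
    public enum OptimizerType { SIMPLE_SGD, LINEAR_DECAY_SGD, SQRT_DECAY_SGD, ADA_GRAD }

    public static StochasticGradientOptimiser build(OptimizerType type, double learningRate, double epsilon) {
        switch (type) {
            case SIMPLE_SGD:
                return SGD.getSimpleSGD(learningRate);       // constant learning rate
            case LINEAR_DECAY_SGD:
                return SGD.getLinearDecaySGD(learningRate);  // learning rate decays linearly over iterations
            case SQRT_DECAY_SGD:
                return SGD.getSqrtDecaySGD(learningRate);    // learning rate decays with the square root of the iteration
            case ADA_GRAD:
            default:
                return new AdaGrad(learningRate, epsilon);   // sensible default for convex problems
        }
    }
}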

Signed-off-by: Xun Zhang <xunzh@amazon.com>
Signed-off-by: Xun Zhang <xunzh@amazon.com>
Signed-off-by: Xun Zhang <xunzh@amazon.com>
Signed-off-by: Xun Zhang <xunzh@amazon.com>
@Zhangxunmt Zhangxunmt merged commit 23cbbb4 into opensearch-project:main Aug 4, 2022
@opensearch-trigger-bot (Contributor)

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-383-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 23cbbb4abee7641d1ea236e10361853156f19ecd
# Push it to GitHub
git push --set-upstream origin backport/backport-383-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-383-to-2.x.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 4, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 23cbbb4)
Zhangxunmt added a commit that referenced this pull request Aug 4, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 23cbbb4)

Co-authored-by: Xun Zhang <xunzh@amazon.com>
Zhangxunmt added a commit to Zhangxunmt/ml-commons that referenced this pull request Aug 5, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>
Zhangxunmt added a commit that referenced this pull request Aug 5, 2022
* Add Logistic Regression algorithm

Signed-off-by: Xun Zhang <xunzh@amazon.com>