Add Logistic Regression algorithm #383
Conversation
Codecov Report
@@ Coverage Diff @@
## main #383 +/- ##
=========================================
Coverage 92.09% 92.09%
Complexity 519 519
=========================================
Files 58 58
Lines 1481 1481
Branches 116 116
=========================================
Hits 1364 1364
Misses 78 78
Partials 39 39
@@ -95,6 +134,9 @@ public static <T extends Output<T>> MutableDataset<T> generateDataset(DataFrame
                // TODO: support anomaly labels to evaluate prediction result
                example = new ArrayExample<>((T) new Event(defaultEventType), featureNamesValues.v1(), featureNamesValues.v2()[i]);
                break;
            case LABEL:
                example = new ArrayExample<>((T) new Label(""), featureNamesValues.v1(), featureNamesValues.v2()[i]);
You can use UNKNOWN_LABEL for this if you want. It reduces the amount of allocation, and is fine to be shared among examples at prediction time. Or use outputFactory.getUnknownOutput(), as you already have an output factory reference.
Replaced with outputFactory.getUnknownOutput().
"Logistic regression training data from OpenSearch", TribuoOutputType.LABEL, parameters.getTarget());
        // Integer epochs = Optional.ofNullable(parameters.getEpochs()).orElse(DEFAULT_EPOCHS);
        // LinearSGDTrainer(objective=LogMulticlass,optimiser=AdaGrad(initialLearningRate=1.0,epsilon=0.1,initialValue=0.0),epochs=5,minibatchSize=1,seed=12345)
        Trainer<Label> logisticRegressionTrainer = new LogisticRegressionTrainer();
Our LogisticRegressionTrainer is designed to be very simple for onboarding users and demos. I recommend you use LinearSGDTrainer and configure it, given that you're wrapping the full interface.
@Craigacp Thanks for your suggestion. We have integrated LinearSGDTrainer to provide linear regression: https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/algorithms/regression/LinearRegression.java#L203. Do you mean we should wrap LinearSGDTrainer rather than LogisticRegressionTrainer to provide logistic regression? We'd appreciate it if you could provide an example.
Well, this is slightly unfortunate naming on our part. We have three classes all called LinearSGDTrainer: org.tribuo.classification.sgd.linear.LinearSGDTrainer, org.tribuo.multilabel.sgd.linear.LinearSGDTrainer, and org.tribuo.regression.sgd.linear.LinearSGDTrainer. They have basically the same interface, but are constrained by the type system to accept the appropriate kind of objective function, and have the right generic types. LogisticRegressionTrainer is just an org.tribuo.classification.sgd.linear.LinearSGDTrainer subclass with a constructor that has fairly sensible defaults for running quick tests and demos. So for your use case I'd recommend org.tribuo.classification.sgd.linear.LinearSGDTrainer, as then you can expose basically the same set of options as you do for linear regression (aside from the difference in objective functions).
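For illustration, a configured org.tribuo.classification.sgd.linear.LinearSGDTrainer matching the defaults in the commented-out line earlier in this diff might look like the sketch below. This is not code from the PR; constructor overloads vary across Tribuo versions, and the loggingInterval value here is an assumption (the trainer's toString above does not show it).

```java
import org.tribuo.Trainer;
import org.tribuo.classification.Label;
import org.tribuo.classification.sgd.linear.LinearSGDTrainer;
import org.tribuo.classification.sgd.objectives.LogMulticlass;
import org.tribuo.math.optimisers.AdaGrad;

public class LogisticRegressionTrainerSketch {
    // Sketch only: replicates LogisticRegressionTrainer's defaults
    // (LogMulticlass objective, AdaGrad(initialLearningRate=1.0, epsilon=0.1),
    // epochs=5, minibatchSize=1, seed=12345) while leaving every knob
    // free to be configured from the training request instead.
    public static Trainer<Label> build() {
        return new LinearSGDTrainer(
                new LogMulticlass(),   // logistic regression objective
                new AdaGrad(1.0, 0.1), // initialLearningRate, epsilon
                5,                     // epochs
                1000,                  // loggingInterval (assumed value)
                1,                     // minibatchSize
                Trainer.DEFAULT_SEED); // 12345
    }
}
```

Swapping LogMulticlass for a regression objective would give the analogous configuration for the regression-package trainer.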
Thanks for the details. @Zhangxunmt I think we should change to org.tribuo.classification.sgd.linear.LinearSGDTrainer.
BTW @Craigacp, for linear regression's org.tribuo.regression.sgd.linear.LinearSGDTrainer and this org.tribuo.classification.sgd.linear.LinearSGDTrainer, do you have any suggestions for the default parameter values, such as the objective function, epochs, and batch size? Is there any way or API to tell whether the model is converging or good enough?
For the classification/logistic regression trainer, the defaults we use in the constructor are fine - https://github.com/oracle/tribuo/blob/main/Classification/SGD/src/main/java/org/tribuo/classification/sgd/linear/LogisticRegressionTrainer.java#L44. You could use them for the linear regression too, substituting the LogMulticlass objective for SquaredLoss. For the linear models a batch size of 1 tends to work just fine, and there isn't really any performance benefit to larger ones given the way the linear algebra system is set up (it's not multithreaded yet).
Unfortunately we don't have any convergence metrics like validation error built into Tribuo's trainers, and we'd need to modify the trainer API to enable such things.
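To make the AdaGrad defaults above (initialLearningRate=1.0, epsilon=0.1) concrete, here is a minimal, self-contained sketch of the standard per-coordinate AdaGrad update. This illustrates the update rule only; it is not Tribuo's actual implementation, and the class name is hypothetical.

```java
// Standalone sketch of the standard AdaGrad update rule: each
// coordinate's effective step size shrinks as its squared gradients
// accumulate, which is why it is a robust default for convex problems
// like logistic regression.
public class AdaGradSketch {
    private final double learningRate;
    private final double epsilon;
    private final double[] sumSquaredGrads; // running sum of g_i^2

    public AdaGradSketch(double learningRate, double epsilon, int dim) {
        this.learningRate = learningRate;
        this.epsilon = epsilon;
        this.sumSquaredGrads = new double[dim];
    }

    /** Applies one AdaGrad step in place and returns the parameters. */
    public double[] step(double[] params, double[] grads) {
        for (int i = 0; i < params.length; i++) {
            sumSquaredGrads[i] += grads[i] * grads[i];
            params[i] -= learningRate * grads[i]
                    / (epsilon + Math.sqrt(sumSquaredGrads[i]));
        }
        return params;
    }
}
```

With learningRate=1.0 and epsilon=0.1, a first step with gradient g moves each coordinate by g / (0.1 + |g|), so early updates are bounded near the learning rate regardless of gradient scale.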
I was using LinearSGDTrainer with the default parameters in the first place (see the commented-out line), then I realized it's almost equivalent to the provided LogisticRegressionTrainer, since that basically uses LinearSGDTrainer under the hood. We can change it to LinearSGDTrainer with the default setup, and let customers configure parameters through the training requests.
@@ -131,6 +131,24 @@ public DataFrame select(int[] columns) {
        return new DefaultDataFrame(newColumnMetas, rows.stream().map(row -> row.select(columns)).collect(Collectors.toList()));
    }

    @Override
    public int getColumnIndex(String target) {
        List<String> featureNames = Arrays.stream(this.columnMetas()).map(ColumnMeta::getName).collect(Collectors.toList());
Minor: rename featureNames to columnNames?
    SIMPLE_SGD,
    LINEAR_DECAY_SGD,
    SQRT_DECAY_SGD,
    ADA_GRAD,
@Craigacp could you help review whether all of these are supported for logistic regression?
These are all supported.
    }

    if (parameters.getLoggingInterval() != null && parameters.getLoggingInterval() < 0) {
        throw new IllegalArgumentException("Logging intervals should not be negative.");
From the Javadoc of org.tribuo.classification.sgd.linear.LinearSGDTrainer: "Log the loss after this many iterations. If -1, don't log anything."
Good catch!
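The fix implied by this exchange is to allow -1 as a sentinel (disable logging) and reject only values below it. A minimal sketch of that validation, with hypothetical names rather than the PR's actual code:

```java
// Hypothetical corrected validation for the logging interval:
// Tribuo's LinearSGDTrainer treats -1 as "don't log anything",
// so only values below -1 are invalid.
public class LoggingIntervalCheck {
    public static int validate(int interval) {
        if (interval < -1) {
            throw new IllegalArgumentException(
                "Logging interval must be -1 (disable logging) or non-negative.");
        }
        return interval;
    }
}
```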
    Double epsilon = Optional.ofNullable(parameters.getEpsilon()).orElse(DEFAULT_EPSILON);

    switch (optimizerType) {
        // ToDo: Add more possible optimizer. Tribuo only provides AdaGrad for logistic regression.
@Craigacp can you confirm whether this is true? Does org.tribuo.classification.sgd.linear.LinearSGDTrainer support other optimizers?
Yes, it supports everything that implements org.tribuo.math.StochasticGradientOptimiser, which is everything in this folder - https://github.com/oracle/tribuo/tree/main/Math/src/main/java/org/tribuo/math/optimisers. I wouldn't recommend Pegasos though; it's fussy to tune, and AdaGrad is a sensible default for convex problems like logistic regression, but all of them should work. Note that ParameterAveraging requires a nested gradient optimiser to average over, so you might not want to expose that because it'll be more complicated.
Thanks for the clarification. These optimizers are all supported in the PR now.
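One way the PR's optimizer enum could map onto Tribuo's org.tribuo.math.optimisers classes is sketched below. This is an illustrative configuration sketch, not the PR's code; the factory-method and constructor signatures are assumptions against the Tribuo 4.x API.

```java
import org.tribuo.math.StochasticGradientOptimiser;
import org.tribuo.math.optimisers.AdaGrad;
import org.tribuo.math.optimisers.SGD;

public class OptimiserFactory {
    public enum OptimizerType { SIMPLE_SGD, LINEAR_DECAY_SGD, SQRT_DECAY_SGD, ADA_GRAD }

    // Maps the enum onto a Tribuo optimiser; learningRate and epsilon
    // would come from the training request parameters.
    public static StochasticGradientOptimiser build(OptimizerType type,
                                                    double learningRate,
                                                    double epsilon) {
        switch (type) {
            case SIMPLE_SGD:
                return SGD.getSimpleSGD(learningRate);
            case LINEAR_DECAY_SGD:
                return SGD.getLinearDecaySGD(learningRate);
            case SQRT_DECAY_SGD:
                return SGD.getSqrtDecaySGD(learningRate);
            case ADA_GRAD:
            default:
                return new AdaGrad(learningRate, epsilon);
        }
    }
}
```

Each resulting StochasticGradientOptimiser can then be passed straight into the LinearSGDTrainer constructor.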
Signed-off-by: Xun Zhang <xunzh@amazon.com>
The backport to 2.x could not be completed automatically. To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-383-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 23cbbb4abee7641d1ea236e10361853156f19ecd
# Push it to GitHub
git push --set-upstream origin backport/backport-383-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x.
* Add Logistic Regression algorithm Signed-off-by: Xun Zhang <xunzh@amazon.com> (cherry picked from commit 23cbbb4)
Description
Add Tribuo's Logistic Regression algorithm to ml-commons.
Unit tests and integration tests for logistic regression are included in this PR. Comments from the previous PR have been addressed.
Issues Resolved
#318
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.