ENH: Set random seeds #101
Conversation
Hi @gwarmstrong , this is awesome.
What do you think about only having 1 seed passed in? I'm assuming that two seeds are defined due to numpy / TF differences. But you can imagine the following should work
```python
def multinomial(..., seed):
    np.random.seed(seed)
    tf.set_random_seed(seed)
```
It'll be more ideal to only specify 1 seed to try to minimize the number of available options to the users.
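A runnable sketch of the single-seed idea, using numpy only (`set_seeds` is a hypothetical helper name; in the actual script it would also make the `tf.set_random_seed` call):

```python
import numpy as np

def set_seeds(seed):
    # Hypothetical helper: one seed drives every RNG used in training.
    np.random.seed(seed)
    # In the actual script this would also call tf.set_random_seed(seed).

set_seeds(42)
first = np.random.rand(3)
set_seeds(42)
second = np.random.rand(3)
assert np.allclose(first, second)  # same seed, identical draws
```

This keeps the user-facing surface to a single option while still seeding every source of randomness.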
scripts/songbird
```diff
@@ -157,6 +181,10 @@ def multinomial(
         save_path=summary_dir,
     )
     with tf.Graph().as_default(), tf.Session() as session:
+        # set the tf random seed
+        if tf_seed is not None:
+            tf.set_random_seed(int(tf_seed))
```
I'm not sure the int conversion is necessary if the type is specified in click.
I think I avoided providing a type originally since it broke convention in the existing decorators. Since you're fine with it, I'll do that instead.
Sure, I thought it might be nice to be able to adjust them independently. This could be helpful for, e.g., verifying that you get a somewhat consistent error minimum across different random initializations, but with the same train/test split. Given that the loss here is convex with respect to the parameters, my idea is probably less necessary and we can change to a single seed.
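The two-seed rationale can be sketched with numpy stand-ins (`make_split` and `init_params` are hypothetical names): holding the split seed fixed while varying the init seed keeps the train/test split identical but changes the starting parameters.

```python
import numpy as np

def make_split(n, split_seed):
    # The train/test split gets its own RNG stream.
    rng = np.random.RandomState(split_seed)
    idx = rng.permutation(n)
    return idx[: n // 2], idx[n // 2:]

def init_params(shape, init_seed):
    # Parameter initialization gets an independent stream.
    rng = np.random.RandomState(init_seed)
    return rng.normal(size=shape)

# Same split seed -> identical split, even while init seeds vary.
train_a, test_a = make_split(10, split_seed=0)
train_b, test_b = make_split(10, split_seed=0)
assert (train_a == train_b).all() and (test_a == test_b).all()
# Different init seeds -> different starting parameters.
assert not np.allclose(init_params((3,), 1), init_params((3,), 2))
```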
@mortonjt this should reflect the changes you suggested now.
@mortonjt just a polite bump on this PR! Also noting that this issue has since come up for multiple people within the Knight Lab.
There are a couple of typos.
If you run this command with a non-zero seed and post your tensorboard results on the example dataset (i.e. redsea), I can double-check on my machine to make sure those results are reproducible.
CHANGELOG.md
```diff
@@ -1,6 +1,8 @@
 # songbird changelog

+## Version 1.0.2-dev
+Added ability to set random seed for CLI and sets fixed random seeds for qiime2 []()
```
want to fix this PR name and number?
done!
songbird/q2/_method.py
```diff
     )

     model = MultRegression(learning_rate=learning_rate, clipnorm=clipnorm,
                            beta_mean=differential_prior,
                            batch_size=batch_size,
                            save_path=None)
     with tf.Graph().as_default(), tf.Session() as session:
+        tf.set_random_seed(0)
```
Suggested change:
```diff
-tf.set_random_seed(0)
+tf.set_random_seed(seed)
```
this won't work - the seed will always be set to zero otherwise.
Probably will want to have the `seed=None` option here as well.
For the QIIME plugin, I thought it might be better to make the seed always 0, so as to avoid exposing a seed argument to the QIIME user. Given this assumption, the code is fine as it stands (`tf.set_random_seed(seed)` would actually throw a `NameError`). The thought being that this could cut down on instances of comparing a model run with many different seeds to a baseline that was run only with the default seed, or something like that. It basically enforces the "right" behavior.
If you want me to expose the seed to the user via QIIME, that is fine too, and I can make the corresponding change.
We don't want the seed to be always set to zero in the qiime2 side. Thanks.
Okay. I will rewrite so that default seed for QIIME is 0, but can be set via a parameter.
I'm not sure I understand where `seed=None` factors into this interface. I think we want the default behavior of the QIIME plugin to be that the results of two commands on the same data are directly comparable by default. Then you can change the seed if you need to. That shouldn't require `seed=None` anywhere, should it?
We'll still want to have an option to not specify a seed, namely:

```python
if seed is not None:
    tf.set_random_seed(seed)
```
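The `seed=None` guard can be sketched end-to-end with numpy standing in for tensorflow (`fit` is a hypothetical stand-in for the actual model fit):

```python
import numpy as np

def fit(data, seed=None):
    # seed=None leaves the RNG unseeded, so repeated runs can differ;
    # an explicit int makes the run reproducible.
    if seed is not None:
        np.random.seed(seed)
    noise = np.random.rand(len(data))  # stand-in for stochastic training
    return float(np.mean(data + noise))

data = np.arange(5.0)
assert fit(data, seed=3) == fit(data, seed=3)  # seeded runs are identical
```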
Does the third tensorboard output in the original description work? The command and tensorboard screenshot are included. Or do you need additional information?
Yes, the output in the original tensorboard would work - but the interface changed, right?
Ah right, that was when the seeds could be set separately. Sure, I will re-run and upload.
Overall, I'm happy with this PR.
However, I'm still not sure if having a default seed of 0 is suitable in the qiime2 version. The option of not choosing a random seed is already available in the standalone script (so I guess that is addressed).
In theory, I'm ok with these changes, but additional input would be nice. @lisa55asil do you have any thoughts on this?
```diff
@@ -60,4 +64,5 @@
     "checkpoint-interval": 3600,
     "summary-interval": 10,
     "summary-dir": "summarydir",
+    "random-seed": 0,
```
Suggested change:
```diff
-"random-seed": 0,
+"random-seed": None,
```
I'm still leaning towards having the default be None, otherwise the option to not specify a random seed will no longer exist. For example, if you have multiple random runs, you do things like take averages, or take the best fit among the runs. If this option is not available, the user won't be able to leave the seed unspecified.
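The multi-run workflow being discussed could look like this sketch (`train_once` is a hypothetical stand-in for a full model fit), here shown with explicit seeds so each individual run stays reproducible:

```python
import numpy as np

def train_once(seed=None):
    # Stand-in for one model fit; returns a final "loss".
    if seed is not None:
        np.random.seed(seed)
    return float(np.random.rand())

# Run several seeded fits and keep the best (lowest-loss) one.
losses = {seed: train_once(seed) for seed in range(5)}
best_seed = min(losses, key=losses.get)
# Because each run was seeded, the best run can be reproduced exactly.
assert train_once(best_seed) == losses[best_seed]
```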
Thanks! And I appreciate the feedback on the PR.
I think it is important to mandate a seed for Q2. For software with a mission of reproducibility, you should not be able to create artifacts that you cannot reproduce. If someone were to give me an artifact created by songbird with the seed set to None, I'd never be able to reproduce it exactly.
I'm also not sure why you can't achieve your use case by setting the random seed differently multiple times? This use case is actually the reason I had two different seeds originally: so you could keep the same train/test split but change the model fitting. That behavior can still be achieved with this PR by setting a training column and then varying the seed.
Also, as you mentioned, setting a None seed is still available via the CLI.
```diff
     )

     model = MultRegression(learning_rate=learning_rate, clipnorm=clipnorm,
                            beta_mean=differential_prior,
                            batch_size=batch_size,
                            save_path=None)
     with tf.Graph().as_default(), tf.Session() as session:
+        tf.set_random_seed(random_seed)
```
Suggested change:
```diff
-tf.set_random_seed(random_seed)
+if random_seed is not None:
+    tf.set_random_seed(random_seed)
```
In light of the previous comment, it'll be preferable to have this instead.
I think these changes are great and strongly agree that we should have a default seed set for the q2 plugin. As you mentioned, if people want to change or remove the seed they can use the standalone version, but it is critical that q2 produces reproducible results. I'm going to update the readme to make this more clear.
Thanks for the awesome updates George!!
Thanks @lisa55asil. I'm going to merge this in. @gwarmstrong thanks!
Songbird random seeds
Current behavior:
When I run:
I get something like the following, where the cv-error converges to different values and the training error is updated differently:
Further exploration:
So I provided an option to fix the seed for making train/test splits:
And now the models converge to approximately the same value.
However, there are still variations in their training updates, so I provided a way to fix the seed that tensorflow uses as well.
And now it is identical between runs:
Proposed behavior:
When the above commands are run, I should at least get the same CV score, but furthermore, it should be possible to fix the entire training process.
In order to fix this I propose the following solution:
Summary of changes:
- Added `--split-seed` and `--tf-seed` options to the songbird CLI (and updated parameter info accordingly)
- Added a test in `scripts.test_songbird_cli` that verifies that different combinations of these flags still run on the CLI