Discuss log-ratio generalizability / training and test splits in a tutorial? #317

Open
fedarko opened this issue Jul 17, 2021 · 0 comments
Labels: docs (README, tutorials, demos, etc.), question (Further information is requested)

Comments

Collaborator

fedarko commented Jul 17, 2021

Not a big deal or anything, but it might be nice to add some extra context (e.g. another tutorial notebook) about using Qurro in a more ML-ish style, where differential abundance is run on only a subset of the samples (the training samples). These samples are then used to select a log-ratio in Qurro, which can be tested against the held-out testing samples.
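To make the idea concrete, here's a rough sketch of that workflow in plain Python. Nothing in it is part of Qurro: the helper names (`log_ratio`, `split_samples`, `group_means`), the toy counts, and the `case`/`control` metadata are all made up for illustration. It uses the natural log and treats a sample as undefined when its numerator or denominator sums to zero, which are assumptions of this sketch.

```python
import math
import random
from statistics import mean


def log_ratio(counts, numerator, denominator):
    """Natural log of (summed numerator counts / summed denominator counts).

    Returns None when either sum is zero, where the log-ratio is undefined.
    """
    top = sum(counts.get(f, 0) for f in numerator)
    bottom = sum(counts.get(f, 0) for f in denominator)
    if top <= 0 or bottom <= 0:
        return None
    return math.log(top / bottom)


def split_samples(sample_ids, test_fraction=0.25, seed=0):
    """Shuffle sample IDs reproducibly and return (train_ids, test_ids)."""
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def group_means(table, metadata, sample_ids, numerator, denominator):
    """Mean log-ratio per metadata group, skipping undefined samples."""
    by_group = {}
    for sid in sample_ids:
        value = log_ratio(table[sid], numerator, denominator)
        if value is not None:
            by_group.setdefault(metadata[sid], []).append(value)
    return {group: mean(values) for group, values in by_group.items()}


# Made-up toy data: feature counts per sample and one two-level metadata field.
table = {
    "s1": {"A": 40, "B": 10}, "s2": {"A": 35, "B": 12},
    "s3": {"A": 50, "B": 9}, "s4": {"A": 42, "B": 0},
    "s5": {"A": 12, "B": 30}, "s6": {"A": 9, "B": 41},
    "s7": {"A": 15, "B": 38}, "s8": {"A": 11, "B": 29},
}
metadata = {
    sid: ("case" if i < 4 else "control")
    for i, sid in enumerate(sorted(table))
}

train_ids, test_ids = split_samples(table)

# In a real analysis, the numerator and denominator would be picked from
# rankings computed on the training samples only. Here they are hard-coded.
numerator, denominator = ["A"], ["B"]

train_means = group_means(table, metadata, train_ids, numerator, denominator)
test_means = group_means(table, metadata, test_ids, numerator, denominator)
print("train:", train_means)
print("test: ", test_means)
```

If the group separation seen in the training samples also shows up in the testing samples, that's the "extra round of validation". With a random split this small, the testing set may end up with only one group, so a real tutorial would probably want a stratified split.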

The advantage of this approach is that it provides a stronger argument for how reliable a log-ratio's association with some metadata field is, since the log-ratio has held up under an extra round of validation. This is kind of a philosophical difference and not really Qurro's problem (you could argue that many differential abundance approaches don't really account for this by default, and/or that Songbird's train/test setup already accounts for this), but it may be worth mentioning somewhere at least.

One way to support this in Qurro's codebase would be to add a parameter that takes a TrainTest metadata column as input (analogous to Songbird's --training-column / --p-training-column parameter) and then generates two separate Qurro visualizations (one for the training samples, one for the testing samples). That might get kind of clunky, though! An extension to this would be adding an "Import selected features" button (analogous to the "Export currently selected features" button) so that the user can easily test the same log-ratios in multiple visualizations.
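As a sketch of what the sample-splitting half of that parameter could look like (this is hypothetical and not existing Qurro code): the function below partitions sample IDs using a TrainTest column in a tab-separated metadata file. The function name, the `sample-id` header, and the `Train`/`Test` labels (meant to mirror Songbird's training column convention) are assumptions here.

```python
import csv
import io


def split_by_training_column(metadata_file, column="TrainTest",
                             id_column="sample-id"):
    """Return (train_ids, test_ids) from a tab-separated metadata file."""
    groups = {"Train": [], "Test": []}
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        label = row[column]
        if label not in groups:
            raise ValueError(
                f"Unexpected {column} value {label!r} "
                f"for sample {row[id_column]!r}"
            )
        groups[label].append(row[id_column])
    return groups["Train"], groups["Test"]


# Made-up metadata, inlined so the example is self-contained.
toy_metadata = io.StringIO(
    "sample-id\tbody-site\tTrainTest\n"
    "s1\tgut\tTrain\n"
    "s2\tgut\tTrain\n"
    "s3\ttongue\tTest\n"
    "s4\ttongue\tTrain\n"
    "s5\tgut\tTest\n"
)
train_ids, test_ids = split_by_training_column(toy_metadata)
print("train:", train_ids)
print("test: ", test_ids)
```

Each list of IDs could then be used to subset the feature table and metadata before generating its own visualization. Raising on unexpected labels (rather than silently dropping those samples) is a design choice of this sketch.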

...That all being said, once the user tries more than one log-ratio on both the training and testing datasets, this kind of loses its effectiveness, since the testing samples have then influenced which log-ratio gets picked! The exploratory data analysis approach Qurro uses might be somewhat at odds with this idea of validation, since it's inherently susceptible to the multiple-comparisons problem.
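A back-of-the-envelope illustration of why this matters: if each log-ratio checked against the testing samples is treated as a separate test at significance level alpha, the chance of at least one spurious "hit" is 1 - (1 - alpha)^n after n tries. This assumes the tests are independent, which log-ratios sharing features generally won't be, so treat it as a rough illustration only. The function name is made up.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 10, 20):
    rate = family_wise_error_rate(n)
    print(f"{n:>2} log-ratios tried: {rate:.3f}")
```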

Anyway, I figured I should write this up somewhere, if nothing else to document that this might be worth thinking about more at some point. Partially inspired by going through this preprint :)

fedarko added the question (Further information is requested) and docs (README, tutorials, demos, etc.) labels Jul 17, 2021