bug fixes in ordering and fraction total variance explained #54
Conversation
I don't think this pull request is a good idea.
The proportion of explained variance here assumes that we are able to explain all of the variance in the data (which we can't, due to missing values). There are a couple of approaches worth considering:
- Employ the extended OptSpace algorithm to automatically search for the optimal rank. If we can estimate the rank of the matrix, we can use that to approximate the explained variance in the top 3 observed dimensions. This is described in part C in this paper.
- Perform cross validation as I suggested previously. Mask some additional entries of the matrix and treat those as missing when training (fitting OptSpace). With those held-out entries, predict their respective proportions from the imputation performed by OptSpace. You'll have to think about how to report an interpretable error metric, but this is another informative statistic.
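The cross-validation idea can be sketched roughly like this, using a plain truncated SVD as a stand-in for OptSpace — the function name, masking scheme, and RMSE metric here are illustrative assumptions, not DEICODE's API:

```python
import numpy as np

def holdout_rmse(X, rank=3, frac=0.1, seed=0):
    # Hold out a fraction of the observed (non-zero) entries, re-fit a
    # rank-`rank` truncated SVD on the masked matrix (a stand-in for
    # OptSpace), then score the imputed values against the held-out truth.
    rng = np.random.default_rng(seed)
    idx = np.argwhere(X != 0)
    held = idx[rng.choice(len(idx), size=max(1, int(frac * len(idx))),
                          replace=False)]
    X_train = X.copy()
    X_train[held[:, 0], held[:, 1]] = 0  # treat held-out entries as missing
    u, s, vt = np.linalg.svd(X_train, full_matrices=False)
    X_hat = (u[:, :rank] * s[:rank]) @ vt[:rank]
    resid = X[held[:, 0], held[:, 1]] - X_hat[held[:, 0], held[:, 1]]
    return np.sqrt(np.mean(resid ** 2))
```

A lower held-out RMSE at a given rank would suggest the low-rank model generalizes to the masked entries.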
deicode/rpca.py
Outdated
```python
projection = u * s[:n_components]
projection_var = np.var(projection, axis=0)
# get the total approximated variance of the initial table
tot_var = np.nansum(np.nanvar(rclr(table), axis=1))
```
I'm not convinced that this is the total variance that you actually want. However, I'm not entirely sure what the best approach here is (or even sure if there is a correct way to do this).
The surefire way to do this is to run SVD with D-1 components, where D is the number of species. You'll get the full matrix factorization, from which you can compute the proportion explained. But clearly this is dumb and defeats the purpose of assuming low rank.
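For reference, that brute-force baseline might look like the sketch below. It assumes a dense transformed table with no NaNs — which is exactly the assumption that fails with missing values — and the function name is illustrative:

```python
import numpy as np

def full_proportion_explained(X, n_components=3):
    # Brute-force baseline: factor X completely with SVD and report the
    # proportion of total variance captured by the top components.
    # Only valid when X is fully observed (no missing entries).
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return var[:n_components].sum() / var.sum()
```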
@mortonjt I agree. I think that the rank estimation is a better idea (i.e. option one). I added it as an option by setting
Another option would be to have a completely new command (i.e.
I think the code overall is good.
But I'm quite confused about the unit test -- I don't see why the proposed simulation should return a rank of 2. I added another unit test that is a more direct test. If this works, feel free to add that test in.
deicode/optspace.py
Outdated
```python
    # estimate the rank of the matrix
    self.n_components = rank_estimate(obs, eps)
else:
    raise ValueError("n-components must be an"
```
```diff
- raise ValueError("n-components must be an"
+ raise ValueError("n-components must be an "
```
Need dat space
done. thanks
```diff
@@ -40,10 +41,12 @@ def __init__(
     N = Features (i.e. OTUs, metabolites)
     M = Samples

-    n_components: int
+    n_components: int or {"optspace"}, optional
```
Not sure about this - it's generally not good practice to allow for different data types in the same parameter.
It's also misleading - this is to enable automated estimation of the rank, and both options utilize OptSpace.
I'd recommend having a separate command, but it is up to you.
I agree. I split the functions into two separate commands.
deicode/tests/test_optspace.py
Outdated
```python
total_nonzeros = np.count_nonzero(mask)
eps = total_nonzeros / np.sqrt(m * n)
# estimate rank
self.assertEqual(2, rank_estimate(obs, eps))
```
I don't understand this test ...
If I were to do this, I would do something like:

```python
N = 100
D = 50
k = 3
U = np.random.normal(0, 1, size=(N, k))
V = np.random.normal(0, 1, size=(k, D))
Y = U @ V
# randomly zero out roughly half of the entries of Y
mask = np.random.random(size=(N, D)) < 0.5
Y[mask] = 0
eps = np.count_nonzero(Y) / np.sqrt(N * D)
self.assertEqual(k, rank_estimate(Y, eps))
```
Agree, I changed the test accordingly.
A few minor typos are still left, but overall I think this is good to go once the flake8 errors are resolved.
deicode/_rpca_defaults.py
Outdated
```python
             "(suggested: 1 < rank < 10) [minimum 2]."
             " Note: as the rank increases the runtime"
             " will increase dramatically.")
DESC_MSC = ("Minimum sum cutoff of sample across all features"
```
```diff
- DESC_MSC = ("Minimum sum cutoff of sample across all features"
+ DESC_MSC = ("Minimum sum cutoff of sample across all features. "
```
The space will look weird here otherwise.
Done, thanks!
deicode/_rpca_defaults.py
Outdated
```python
            "The value can be at minimum zero and must be an whole"
            " integer. It is suggested to be greater than or equal"
            " to 500.")
DESC_MFC = ("Minimum sum cutoff of features across all samples."
```
```diff
- DESC_MFC = ("Minimum sum cutoff of features across all samples."
+ DESC_MFC = ("Minimum sum cutoff of features across all samples. "
```
same here
Done, thanks!
deicode/q2/plugin_setup.py
Outdated
```python
output_descriptions={
    'biplot': ('A biplot of the (Robust Aitchison) RPCA feature loadings'),
    'distance_matrix': ('The Aitchison distance of'
                        'the sample loadings from RPCA.')
```
```diff
- 'the sample loadings from RPCA.')
+ ' the sample loadings from RPCA.')
```
same here
Done, thanks!
```diff
 from scipy.linalg import svd


 def rpca(table: biom.Table,
-         n_components: int = DEFAULT_RANK,
+         n_components: Union[int, str] = DEFAULT_RANK,
```
is this complex type casting still necessary?
The way I have it set up, to prevent too much copy-pasta, is that `auto_rpca` calls `rpca` but with `n_components` flagged as `"auto"`. However, the types are restricted in the CLI, so you couldn't run `rpca` with the input `"auto"` in QIIME or in the standalone. So technically that `rpca` function still accepts both types.
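A minimal sketch of that dispatch, with placeholder names (`_rank_estimate` always returns 2 here, and no OptSpace is actually run — these are assumptions for illustration, not DEICODE's real signatures):

```python
from typing import Union

def _rank_estimate(table) -> int:
    # stand-in for the real rank_estimate; always returns 2 here
    return 2

def rpca(table, n_components: Union[int, str] = 3) -> int:
    # accepts either an int or the "auto" flag, mirroring the discussion;
    # the CLI layer would only ever pass the integer form
    if n_components == "auto":
        n_components = _rank_estimate(table)
    if not isinstance(n_components, int) or n_components < 2:
        raise ValueError("n-components must be an integer of at least 2")
    return n_components  # the real function would run OptSpace at this rank

def auto_rpca(table) -> int:
    # thin wrapper: same pipeline, rank chosen automatically
    return rpca(table, n_components="auto")
```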
Ahh ok. That makes sense. Ok, this is good to merge then
@mortonjt - good to merge! thanks.
There are two main bug fixes in this PR, addressing issues #53 and #52. For issue #52, the bug was in the sorting of the singular values in the OptSpace function itself. This was sometimes corrected by the SVD in the re-centering and sometimes was not, hence the periodic axis switch raised in #52. Examples of this fix on simulated:
and real data:
For issue #53, which is a common complaint, the solution involved comparing the variance across the projection (i.e. U * s) and the original data. To calculate the total variance of the original data, I first took the rclr and then the sum of the variance w.r.t. only the observed values. In practice, this seems to estimate the fraction of variance explained much better in a partial-rank than in a full-rank decomposition, especially when contrasted with the previous method of using the sum of squared partial-rank singular values. This can be seen on both simulated:
and real data:
All the analysis/figures for the simulated and real-data can be found here and here.
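In numpy terms, the variance computation described above amounts to something like this sketch, where `u` and `s` come from the low-rank factorization and `rclr_table` is the rclr-transformed input with NaNs marking missing values (the function name is illustrative):

```python
import numpy as np

def proportion_explained(u, s, rclr_table, n_components):
    # variance captured by each projected axis (U * s)
    projection = u[:, :n_components] * s[:n_components]
    projection_var = np.var(projection, axis=0)
    # observed total variance of the rclr table, ignoring NaNs
    tot_var = np.nansum(np.nanvar(rclr_table, axis=1))
    return projection_var / tot_var
```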
I also added a feature raised in #45 for feature presence-frequency filtering. In practice, this seems better than sum filtering, so that downstream log-ratios without pseudocounts don't have large sample dropouts.
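Presence-frequency filtering could be sketched as follows — keep only features observed (non-zero) in at least some percentage of samples. The function name and threshold are illustrative, not the PR's exact implementation:

```python
import numpy as np

def filter_by_frequency(counts, min_feature_frequency=10.0):
    # counts: samples x features matrix of raw counts.
    # Keep features whose presence (non-zero) frequency across samples
    # is at least `min_feature_frequency` percent.
    presence = (counts > 0).mean(axis=0) * 100.0
    return counts[:, presence >= min_feature_frequency]
```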
The changelog has been updated and the version has been bumped.