
bug fixes in ordering and fraction total variance explained #54

Merged
merged 14 commits into from
Jan 10, 2020

Conversation

cameronmartino
Collaborator

There are two main bug fixes in this PR, which address issues #53 and #52. For issue #52, the bug was in the sorting of the singular values in the OptSpace function itself. This was sometimes corrected by the SVD in the re-centering step and sometimes was not, hence the periodic axis switching reported in #52. Examples of this fix on simulated:

[image: simulated data]

and real data:

[image: real data]

For issue #53, which is a common complaint, the solution involved comparing the variance across the projection (i.e. U * s) to the variance of the original data. To calculate the total variance of the original data, I first applied the rclr and then summed the variances with respect to only the observed values. In practice, this seems to estimate the fraction of variance explained much better for a partial-rank decomposition relative to a full-rank one, especially when contrasted with the previous method of using the sum of squares of the partial-rank singular values. This can be seen on both simulated:

[image: simulated data]

and real data:

[image: real data]

All the analysis/figures for the simulated and real-data can be found here and here.
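To make the comparison concrete, here is a minimal sketch of the variance calculation described above. The simplified `rclr` below is my own stand-in for DEICODE's implementation, and I assume rows are samples and columns are features:

```python
import numpy as np

def rclr(table):
    """Robust clr: log-transform, then center each sample over its
    observed (nonzero) values only; zeros become NaN."""
    mat = np.array(table, dtype=float)
    mat[mat == 0] = np.nan
    logmat = np.log(mat)
    return logmat - np.nanmean(logmat, axis=1, keepdims=True)

def fraction_variance_explained(u, s, table, n_components):
    # variance of the projection U * s, per component
    projection = u[:, :n_components] * s[:n_components]
    projection_var = np.var(projection, axis=0)
    # total variance of the rclr-transformed table,
    # w.r.t. only the observed values
    tot_var = np.nansum(np.nanvar(rclr(table), axis=1))
    return projection_var / tot_var
```

The key difference from the previous behavior is the denominator: the total variance comes from the rclr-transformed data rather than from the partial-rank singular values.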

I also added a feature, raised in #45, for filtering by feature presence frequency. In practice, this seems better than sum filtering because downstream log-ratios without pseudocounts don't suffer large sample dropouts.
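For illustration, a presence-frequency filter along these lines might look like the following. This is a hypothetical helper, not DEICODE's actual code, and it assumes rows are samples and columns are features:

```python
import numpy as np

def filter_by_presence_frequency(table, min_frac=0.1):
    """Keep features observed (nonzero) in at least `min_frac`
    of the samples; contrast with filtering on total feature sum."""
    presence_frac = (table > 0).mean(axis=0)
    return table[:, presence_frac >= min_frac]
```

Unlike a sum cutoff, this removes features that are abundant in only a handful of samples, which are exactly the features that cause sample dropouts in pseudocount-free log-ratios.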

The changelog has been updated and the version has been bumped.

Collaborator

@mortonjt mortonjt left a comment

I don't think this pull request is a good idea.

The proportion of explained variance assumes that we are able to explain all of the variance in the data (which we can't, due to missing values). There are a couple of approaches worth considering:

  1. Employ the extended OptSpace algorithm to automatically search for the optimal rank. If we can estimate the rank of the matrix, we can use it to approximate the explained variance in the top 3 observed dimensions. This is described in part C of this paper.
  2. Perform cross-validation, as I suggested previously. Mask some additional entries of the matrix and treat them as missing when training (fitting OptSpace). Then predict those held-out entries from the imputation performed by OptSpace. You'll have to think about how to report an interpretable error metric, but this is another informative statistic.
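Option 2 could be sketched roughly as follows, with a simple iterative-SVD imputer standing in for OptSpace; the held-out fraction, iteration count, and RMSE metric are all illustrative choices, not anything DEICODE implements:

```python
import numpy as np

def heldout_rmse(Y, rank, holdout_frac=0.1, n_iter=50, seed=0):
    """Mask a fraction of the observed entries, fit a low-rank
    approximation on the rest, and score the held-out entries."""
    rng = np.random.default_rng(seed)
    observed = Y != 0
    held = observed & (rng.random(Y.shape) < holdout_frac)
    # iterative hard-imputation: fill held/missing entries with the
    # current rank-`rank` SVD approximation, keep training entries fixed
    X = np.where(held, 0.0, Y)
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(X, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        X = np.where(observed & ~held, Y, approx)
    return np.sqrt(np.mean((approx[held] - Y[held]) ** 2))
```

Comparing this error across candidate ranks (or against a trivial baseline) gives the kind of informative statistic described above.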

deicode/rpca.py Outdated
projection = u * s[:n_components]
projection_var = np.var(projection, axis=0)
# get the total approximated variance of the initial table
tot_var = np.nansum(np.nanvar(rclr(table), axis=1))
Collaborator

I'm not convinced that this is the total variance that you actually want. However, I'm not entirely sure what the best approach here is (or even sure if there is a correct way to do this).

The surefire way to do this is to run SVD with D-1 components, where D is the number of species. You'll get the full matrix factorization, from which you can compute the proportion explained. But clearly this is dumb and defeats the purpose of assuming low rank.
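For reference, that full-factorization baseline is just the classic PCA-style calculation (here `M` is whatever completed/transformed matrix one would factorize):

```python
import numpy as np

def proportion_explained(M):
    # with a full SVD, each component's share of the variance is
    # its squared singular value over the sum of all of them
    s = np.linalg.svd(M, compute_uv=False)
    return s ** 2 / np.sum(s ** 2)
```

The point above stands: this only sums to 1 because the full factorization explains everything, which is exactly what a low-rank, missing-data setting cannot assume.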

@cameronmartino
Collaborator Author

@mortonjt I agree. I think that rank estimation is the better idea (i.e. option one). I added it as an option by setting --n-components to optspace and reverted the fraction of variance explained. I included it as an option so that if more rank-estimation methods are implemented later, they can be chosen. I also did not change the default --n-components from 3 to optspace in this version, to avoid user confusion. However, I did add a FutureWarning for hard-set ranks saying that the next version of DEICODE will use OptSpace-based rank estimation as the default.

@mortonjt
Collaborator

mortonjt commented Jan 8, 2020

Another option would be to have a completely new command (i.e. auto-rpca) that includes the rpca procedure but automatically estimates the rank. That way this will be backwards compatible.

Collaborator

@mortonjt mortonjt left a comment

I think the code overall is good.

But I'm quite confused about the unit test -- I don't see why the proposed simulation should return a rank of 2. I added another unit test that is a more direct test. If this works, feel free to add it in.

# estimate the rank of the matrix
self.n_components = rank_estimate(obs, eps)
else:
raise ValueError("n-components must be an"
Collaborator

Suggested change
raise ValueError("n-components must be an"
raise ValueError("n-components must be an "

Need dat space

Collaborator Author

done. thanks

@@ -40,10 +41,12 @@ def __init__(
N = Features (i.e. OTUs, metabolites)
M = Samples

n_components: int
n_components: int or {"optspace"}, optional
Collaborator

Not sure about this - it's generally not good practice to allow different data types for the same parameter.
It's also misleading - the point is to enable automated estimation of the rank; both options utilize OptSpace.

I'd recommend having a separate command, but it's up to you.

Collaborator Author

I agree. I split the functions into two separate commands.

total_nonzeros = np.count_nonzero(mask)
eps = total_nonzeros / np.sqrt(m * n)
# estimate rank
self.assertEqual(2, rank_estimate(obs, eps))
Collaborator

I don't understand this test ...

If I were to do this, I would do something like

N = 100
D = 50
k = 3
U = np.random.normal(0, 1, size=(N, k))
V = np.random.normal(0, 1, size=(k, D))
Y = U @ V
# randomly mask ~10% of the entries of Y
mask = np.random.random(size=(N, D)) < 0.1
Y[mask] = 0
eps = np.count_nonzero(Y) / np.sqrt(N * D)
self.assertEqual(k, rank_estimate(Y, eps))
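For intuition about why such a test should pass, here is a toy singular-value-gap heuristic. This is not DEICODE's actual `rank_estimate` (which follows the OptSpace paper); it is only sensible when the spectrum has a clear drop, e.g. exactly low-rank data:

```python
import numpy as np

def rank_by_largest_gap(Y, rel_floor=1e-6):
    """Pick the rank at the largest relative gap between
    consecutive singular values."""
    s = np.linalg.svd(Y, compute_uv=False)
    floor = rel_floor * s[0]
    ratios = s[:-1] / np.maximum(s[1:], floor)
    return int(np.argmax(ratios)) + 1
```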

Collaborator Author

Agree, I changed the test accordingly.

Collaborator

@mortonjt mortonjt left a comment

Minor typos are still left. But overall, I think this is good to go once the flake8 errors are resolved.

"(suggested: 1 < rank < 10) [minimum 2]."
" Note: as the rank increases the runtime"
" will increase dramatically.")
DESC_MSC = ("Minimum sum cutoff of sample across all features"
Collaborator

Suggested change
DESC_MSC = ("Minimum sum cutoff of sample across all features"
DESC_MSC = ("Minimum sum cutoff of sample across all features. "

The space will look weird here otherwise.

Collaborator Author

Done, thanks!

"The value can be at minimum zero and must be an whole"
" integer. It is suggested to be greater than or equal"
" to 500.")
DESC_MFC = ("Minimum sum cutoff of features across all samples."
Collaborator

Suggested change
DESC_MFC = ("Minimum sum cutoff of features across all samples."
DESC_MFC = ("Minimum sum cutoff of features across all samples. "

same here

Collaborator Author

Done, thanks!

output_descriptions={
'biplot': ('A biplot of the (Robust Aitchison) RPCA feature loadings'),
'distance_matrix': ('The Aitchison distance of'
'the sample loadings from RPCA.')
Collaborator

Suggested change
'the sample loadings from RPCA.')
' the sample loadings from RPCA.')

same here

Collaborator Author

Done, thanks!

from scipy.linalg import svd


def rpca(table: biom.Table,
n_components: int = DEFAULT_RANK,
n_components: Union[int, str] = DEFAULT_RANK,
Collaborator

is this complex type casting still necessary?

Collaborator Author

The way I have it set up, to prevent too much copy-pasta, is that auto_rpca calls rpca with n_components flagged as "auto". However, the types are restricted in the CLI, so you couldn't run rpca with the input "auto" in QIIME or in the standalone version. So technically the rpca function still accepts both types.
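The setup described could look roughly like this. This is a hypothetical sketch, not DEICODE's actual signatures, and `rank_estimate` is stubbed out:

```python
from typing import Union

def rank_estimate(table) -> int:
    # stand-in for the real OptSpace-based estimator
    return 3

def rpca(table, n_components: Union[int, str] = 3):
    # internal entry point: accepts an int or the flag "auto"
    if n_components == "auto":
        n_components = rank_estimate(table)
    elif not isinstance(n_components, int) or n_components < 2:
        raise ValueError("n-components must be an integer >= 2 "
                         "or the string 'auto'.")
    return {"rank": n_components}  # placeholder for the real result

def auto_rpca(table):
    # public command with no rank parameter; the CLI only ever
    # exposes the int type on rpca, so "auto" stays internal
    return rpca(table, n_components="auto")
```

This keeps the duplicated logic in one place while the CLI surfaces only type-restricted parameters.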

@mortonjt
Collaborator

mortonjt commented Jan 10, 2020 via email

@cameronmartino
Collaborator Author

Thanks, @mortonjt! I addressed those typos and updated the README/tutorial(s) to be consistent with the new command. I also added a Python API tutorial and a standalone tutorial that follow the QIIME 2 tutorial. This should address issue #40.

@cameronmartino
Collaborator Author

@mortonjt - good to merge! thanks.
