
bug fixes in ordering and fraction total variance explained #54

Merged
merged 14 commits into from
Jan 10, 2020

Conversation

cameronmartino
Collaborator

There are two main bug fixes in this PR, which address issues #53 and #52. For issue #52, the bug was in the sorting of the singular values in the OptSpace function itself. This was sometimes corrected by the SVD in the re-centering step and sometimes was not, hence the periodic axis switching reported in #52. Examples of this fix on simulated:

[image: simulated data]

and real data:

[image: real data]

For issue #53, which is a common complaint, the solution involved comparing the variance across the projection (i.e. U * s) to the variance of the original data. To calculate the total variance of the original data, I first applied the rclr and then summed the variances with respect to only the observed values. In practice, this seems to estimate the fraction of variance explained much better for a partial-rank decomposition relative to a full-rank one, especially when contrasted with the previous method of using the sum of squares of the partial-rank singular values. This can be seen on both simulated:

[image: simulated data]

and real data:

[image: real data]

All the analysis/figures for the simulated and real-data can be found here and here.
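To make the comparison concrete, here is a minimal sketch of the variance calculation described above. The simplified `rclr` below is my own stand-in for DEICODE's implementation, and I assume rows are samples and columns are features:

```python
import numpy as np

def rclr(table):
    """Robust clr: log-transform, then center each sample over its
    observed (nonzero) values only; zeros become NaN."""
    mat = np.array(table, dtype=float)
    mat[mat == 0] = np.nan
    logmat = np.log(mat)
    return logmat - np.nanmean(logmat, axis=1, keepdims=True)

def fraction_variance_explained(u, s, table, n_components):
    # variance of the projection U * s, per component
    projection = u[:, :n_components] * s[:n_components]
    projection_var = np.var(projection, axis=0)
    # total variance of the rclr-transformed table,
    # w.r.t. only the observed values
    tot_var = np.nansum(np.nanvar(rclr(table), axis=1))
    return projection_var / tot_var
```

The key difference from the previous behavior is the denominator: the total variance comes from the rclr-transformed data rather than from the partial-rank singular values.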

I also added a feature, raised in #45, for filtering by feature presence frequency. In practice, this seems better than sum filtering because downstream log-ratios without pseudocounts don't suffer large sample dropouts.
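For illustration, a presence-frequency filter along these lines might look like the following. This is a hypothetical helper, not DEICODE's actual code, and it assumes rows are samples and columns are features:

```python
import numpy as np

def filter_by_presence_frequency(table, min_frac=0.1):
    """Keep features observed (nonzero) in at least `min_frac`
    of the samples; contrast with filtering on total feature sum."""
    presence_frac = (table > 0).mean(axis=0)
    return table[:, presence_frac >= min_frac]
```

Unlike a sum cutoff, this removes features that are abundant in only a handful of samples, which are exactly the features that cause sample dropouts in pseudocount-free log-ratios.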

The changelog has been updated and the version has been bumped.

Collaborator

@mortonjt mortonjt left a comment

I don't think this pull request is a good idea.

The proportion of explained variance assumes that we are able to explain all of the variance in the data (which we can't, due to missing values). There are a couple of approaches worth considering:

  1. Employ the extended OptSpace algorithm to automatically search for the optimal rank. If we can estimate the rank of the matrix, we can use it to approximate the explained variance in the top 3 observed dimensions. This is described in part C of this paper.
  2. Perform cross-validation, as I suggested previously. Mask some additional entries of the matrix and treat them as missing when training (fitting OptSpace). Then predict those held-out entries from the imputation performed by OptSpace. You'll have to think about how to report an interpretable error metric, but this is another informative statistic.
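Option 2 could be sketched roughly as follows, with a simple iterative-SVD imputer standing in for OptSpace; the held-out fraction, iteration count, and RMSE metric are all illustrative choices, not anything DEICODE implements:

```python
import numpy as np

def heldout_rmse(Y, rank, holdout_frac=0.1, n_iter=50, seed=0):
    """Mask a fraction of the observed entries, fit a low-rank
    approximation on the rest, and score the held-out entries."""
    rng = np.random.default_rng(seed)
    observed = Y != 0
    held = observed & (rng.random(Y.shape) < holdout_frac)
    # iterative hard-imputation: fill held/missing entries with the
    # current rank-`rank` SVD approximation, keep training entries fixed
    X = np.where(held, 0.0, Y)
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(X, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        X = np.where(observed & ~held, Y, approx)
    return np.sqrt(np.mean((approx[held] - Y[held]) ** 2))
```

Comparing this error across candidate ranks (or against a trivial baseline) gives the kind of informative statistic described above.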

deicode/rpca.py Outdated
projection = u * s[:n_components]
projection_var = np.var(projection, axis=0)
# get the total approximated variance of the initial table
tot_var = np.nansum(np.nanvar(rclr(table), axis=1))
Collaborator

I'm not convinced that this is the total variance that you actually want. However, I'm not entirely sure what the best approach here is (or even sure if there is a correct way to do this).

The surefire way to do this is to run SVD with D-1 components, where D is the number of species. You'll get the full matrix factorization, from which you can compute the proportion explained. But clearly this is dumb and defeats the purpose of assuming low rank.
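For reference, that full-factorization baseline is just the classic PCA-style calculation (here `M` is whatever completed/transformed matrix one would factorize):

```python
import numpy as np

def proportion_explained(M):
    # with a full SVD, each component's share of the variance is
    # its squared singular value over the sum of all of them
    s = np.linalg.svd(M, compute_uv=False)
    return s ** 2 / np.sum(s ** 2)
```

The point above stands: this only sums to 1 because the full factorization explains everything, which is exactly what a low-rank, missing-data setting cannot assume.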

@cameronmartino
Collaborator Author

@mortonjt I agree. I think that rank estimation is the better idea (i.e. option one). I added it as an option by setting --n-components to optspace and reverted the fraction of variance explained. I included it as an option so that if more rank-estimation methods are implemented later, they can be chosen. I also did not change the default --n-components from 3 to optspace in this version, to avoid user confusion. However, I did add a FutureWarning for hard-set ranks saying that the next version of DEICODE will use OptSpace-based rank estimation as the default.

@mortonjt
Collaborator

mortonjt commented Jan 8, 2020

Another option would be to have a completely new command (i.e. auto-rpca) that includes the rpca procedure but automatically estimates the rank. That way this will be backwards compatible.

Collaborator

@mortonjt mortonjt left a comment

I think the code overall is good.

But I'm quite confused about the unit test -- I don't see why the proposed simulation should return a rank of 2. I added another unit test that is a more direct test. If this works, feel free to add it in.

# estimate the rank of the matrix
self.n_components = rank_estimate(obs, eps)
else:
raise ValueError("n-components must be an"
Collaborator

Suggested change
raise ValueError("n-components must be an"
raise ValueError("n-components must be an "

Need dat space

Collaborator Author

done. thanks

@@ -40,10 +41,12 @@ def __init__(
N = Features (i.e. OTUs, metabolites)
M = Samples

n_components: int
n_components: int or {"optspace"}, optional
Collaborator

Not sure about this - it's generally not good practice to allow different data types for the same parameter.
It's also misleading - the point is to enable automated estimation of the rank; both options utilize OptSpace.

I'd recommend having a separate command, but it's up to you.

Collaborator Author

I agree. I split the functions into two separate commands.

total_nonzeros = np.count_nonzero(mask)
eps = total_nonzeros / np.sqrt(m * n)
# estimate rank
self.assertEqual(2, rank_estimate(obs, eps))
Collaborator

I don't understand this test ...

If I were to do this, I would do something like

N = 100
D = 50
k = 3
U = np.random.normal(0, 1, size=(N, k))
V = np.random.normal(0, 1, size=(k, D))
Y = U @ V
# randomly mask ~10% of the entries of Y
mask = np.random.random(size=(N, D)) < 0.1
Y[mask] = 0
eps = np.count_nonzero(Y) / np.sqrt(N * D)
self.assertEqual(k, rank_estimate(Y, eps))
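For intuition about why such a test should pass, here is a toy singular-value-gap heuristic. This is not DEICODE's actual `rank_estimate` (which follows the OptSpace paper); it is only sensible when the spectrum has a clear drop, e.g. exactly low-rank data:

```python
import numpy as np

def rank_by_largest_gap(Y, rel_floor=1e-6):
    """Pick the rank at the largest relative gap between
    consecutive singular values."""
    s = np.linalg.svd(Y, compute_uv=False)
    floor = rel_floor * s[0]
    ratios = s[:-1] / np.maximum(s[1:], floor)
    return int(np.argmax(ratios)) + 1
```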

Collaborator Author

Agree, I changed the test accordingly.

Collaborator

@mortonjt mortonjt left a comment

Minor typos are still left. But overall, I think this is good to go once the flake8 errors are resolved.

"(suggested: 1 < rank < 10) [minimum 2]."
" Note: as the rank increases the runtime"
" will increase dramatically.")
DESC_MSC = ("Minimum sum cutoff of sample across all features"
Collaborator

Suggested change
DESC_MSC = ("Minimum sum cutoff of sample across all features"
DESC_MSC = ("Minimum sum cutoff of sample across all features. "

The space will look weird here otherwise.

Collaborator Author

Done, thanks!

"The value can be at minimum zero and must be an whole"
" integer. It is suggested to be greater than or equal"
" to 500.")
DESC_MFC = ("Minimum sum cutoff of features across all samples."
Collaborator

Suggested change
DESC_MFC = ("Minimum sum cutoff of features across all samples."
DESC_MFC = ("Minimum sum cutoff of features across all samples. "

same here

Collaborator Author

Done, thanks!

output_descriptions={
'biplot': ('A biplot of the (Robust Aitchison) RPCA feature loadings'),
'distance_matrix': ('The Aitchison distance of'
'the sample loadings from RPCA.')
Collaborator

Suggested change
'the sample loadings from RPCA.')
' the sample loadings from RPCA.')

same here

Collaborator Author

Done, thanks!

from scipy.linalg import svd


def rpca(table: biom.Table,
n_components: int = DEFAULT_RANK,
n_components: Union[int, str] = DEFAULT_RANK,
Collaborator

is this complex type casting still necessary?

Collaborator Author

The way I have it set up, to prevent too much copy-pasta, is that auto_rpca calls rpca with n_components flagged as "auto". However, the types are restricted in the CLI, so you couldn't run rpca with the input "auto" in QIIME or in the standalone version. So technically the rpca function still accepts both types.
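The setup described could look roughly like this. This is a hypothetical sketch, not DEICODE's actual signatures, and `rank_estimate` is stubbed out:

```python
from typing import Union

def rank_estimate(table) -> int:
    # stand-in for the real OptSpace-based estimator
    return 3

def rpca(table, n_components: Union[int, str] = 3):
    # internal entry point: accepts an int or the flag "auto"
    if n_components == "auto":
        n_components = rank_estimate(table)
    elif not isinstance(n_components, int) or n_components < 2:
        raise ValueError("n-components must be an integer >= 2 "
                         "or the string 'auto'.")
    return {"rank": n_components}  # placeholder for the real result

def auto_rpca(table):
    # public command with no rank parameter; the CLI only ever
    # exposes the int type on rpca, so "auto" stays internal
    return rpca(table, n_components="auto")
```

This keeps the duplicated logic in one place while the CLI surfaces only type-restricted parameters.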

@mortonjt
Collaborator

mortonjt commented Jan 10, 2020 via email

@cameronmartino
Collaborator Author

Thanks, @mortonjt! I addressed those typos and updated the README/tutorial(s) to be consistent with the new command. I also added a Python API tutorial and a standalone tutorial that follow the QIIME 2 tutorial. This should address issue #40.

@cameronmartino
Collaborator Author

@mortonjt - good to merge! thanks.
