
How to choose the number of principal components in GLM-PCA? #32

Open
YushaLiu opened this issue Jun 10, 2021 · 7 comments


@YushaLiu

Hi Will,
Do you have any suggestions on how to choose the number of principal components in GLM-PCA? Is there a way to quantify the contributions of each PC similar to the proportion of variance explained in PCA? Thanks!

@willtownes
Owner

Hi Yusha, thanks for your question. In PCA this is commonly done by plotting the "variance explained" of each component (i.e., a scree plot). Because of the link function in GLM-PCA we can't exactly call the variance of each component "variance explained", but you can still use it to examine the components' importance as a function of dimensionality. Since both PCA and GLM-PCA return the components in decreasing order of variance, you could make a similar plot (x-axis: dimension index; y-axis: standard deviation of the corresponding column of the factors matrix). Note that this works because GLM-PCA automatically post-processes the model fit to make the loadings orthonormal; without that step, the variance of the factors is not interpretable.
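A minimal numpy sketch of this scree-style diagnostic, assuming `factors` stands in for the N×L factors matrix returned by a GLM-PCA fit (here it's simulated data, not an actual model fit):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the N x L factors matrix from a GLM-PCA fit
# (hypothetical simulated data with decreasing per-column scale).
factors = rng.normal(scale=[3.0, 2.0, 1.0, 0.5], size=(500, 4))

# Per-dimension standard deviation: the y-axis of the scree-like plot.
sds = factors.std(axis=0)
for k, s in enumerate(sds, start=1):
    print(f"dimension {k}: sd = {s:.3f}")
```

Plotting `sds` against the dimension index (e.g., with matplotlib) then gives the elbow-style view familiar from PCA scree plots.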

@YushaLiu
Author

Hi Will, thanks for your response -- very helpful! I think the column-wise variances of the factors matrix quantify the relative importance of the factors, but is there a way to quantify the contribution of each factor on an absolute scale (similar to the PVE of each factor, i.e., the proportion of variance in the observed data matrix that it explains, in the Gaussian case)? More specifically, if I choose L=2 and run GLM-PCA, how do I know whether these 2 factors actually capture the variation in the single-cell count data? Thanks!

@willtownes
Owner

Yes, that is a great idea, but not something I have figured out; perhaps it's an open topic for a research paper. The closest thing I have heard of is "deviance explained" from pipecomp. You could compare the deviance of a fitted GLM-PCA model to the deviance of a null model with only an intercept term (which has a closed-form solution) as an absolute goodness-of-fit metric. The difficulty is that I don't think you can add or drop individual factors, because the optimal GLM-PCA solution for L=2 is unlikely to equal the optimal solution for L=3 with the third factor dropped. Rather, you would have to re-fit the model for each value of L. As an approximate alternative, I suppose you could also try just doing PCA on residuals, as we implemented in the scry package.
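To make the "deviance explained" idea concrete, here is a hedged numpy sketch using the Poisson deviance on hypothetical count data. The `poisson_deviance` helper and the fitted-means stand-in are my own illustration, not pipecomp's or glmpca's actual code; in practice the fitted means would come from the GLM-PCA model fit:

```python
import numpy as np

def poisson_deviance(y, mu):
    # D = 2 * sum( y*log(y/mu) - (y - mu) ), with y*log(y/mu) = 0 when y = 0
    term = np.where(y > 0, y * np.log(np.where(y > 0, y / mu, 1.0)), 0.0)
    return 2.0 * np.sum(term - (y - mu))

# Hypothetical genes x cells count matrix.
rng = np.random.default_rng(1)
y = rng.poisson(5.0, size=(100, 50)).astype(float)

# Null model: intercept only, i.e., each gene fit by its own mean (closed form).
mu_null = y.mean(axis=1, keepdims=True) * np.ones_like(y)
d_null = poisson_deviance(y, mu_null)

# Stand-in for the fitted means of a GLM-PCA model (near-saturated here,
# purely for illustration); substitute the real model's predicted means.
d_fit = poisson_deviance(y, y.clip(min=0.5))

# Absolute goodness-of-fit: fraction of null deviance explained by the model.
dev_explained = 1.0 - d_fit / d_null
print(f"deviance explained: {dev_explained:.3f}")
```

Re-fitting for each L and plotting `dev_explained` against L would give an absolute-scale analogue of a PVE curve.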

@YushaLiu
Author

I see. Thanks very much for your explanations and suggestions!

@YushaLiu
Author

YushaLiu commented Jun 17, 2021

Hi Will, I have a follow-up question. In the "Deviance residuals provide fast approximation to GLM-PCA" section of your paper, I saw that you propose running plain PCA on the multinomial residuals under the null model as a fast approximation to GLM-PCA. Is that implemented in the scry package? If so, does it also allow adjustment for covariates (e.g., batches, cell cycle) in the calculation of the multinomial residuals? Thanks so much!

@willtownes
Owner

Yes, the null residuals are fully implemented in the scry package and should also work for disk-based (HDF5) or sparse matrices. It only handles categorical covariates, though, since anything more complex would not have a closed-form solution (you could easily implement it yourself: just run a separate Poisson regression for each gene). Here's a related comment from the scry GitHub. Once you have the residuals matrix, you can just pass it to your favorite PCA implementation (e.g., prcomp for a smaller dataset, or BiocSingular for larger ones).
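For readers outside R, a rough numpy sketch of the same recipe (this is not scry's code; `null_residuals` is a hypothetical helper computing Poisson deviance residuals under a closed-form null fit, and the within-batch loop illustrates the categorical-covariate adjustment):

```python
import numpy as np

def null_residuals(y):
    # Poisson deviance residuals under the null model
    # mu_ij = (row total_i) * (column total_j) / grand total.
    mu = np.outer(y.sum(axis=1), y.sum(axis=0)) / y.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        ll = np.where(y > 0, y * np.log(y / mu), 0.0)
    d = 2.0 * (ll - (y - mu))
    return np.sign(y - mu) * np.sqrt(np.maximum(d, 0.0))

# Hypothetical genes x cells counts with a categorical batch covariate.
rng = np.random.default_rng(2)
y = rng.poisson(3.0, size=(200, 80)).astype(float)
batch = np.repeat([0, 1], 40)

# Adjust for the covariate by computing residuals within each batch.
resid = np.empty_like(y)
for b in np.unique(batch):
    resid[:, batch == b] = null_residuals(y[:, batch == b])

# PCA on the residuals via SVD (center each gene first).
centered = resid - resid.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
factors = vt[:2].T  # cells x 2 approximate GLM-PCA factors
```

The per-batch loop works because, for a purely categorical covariate, the null fit decomposes into independent closed-form fits within each level.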

@YushaLiu
Author

Thanks very much! That should work since I'm just trying to adjust for categorical covariates :)
