Earlier crash for NA values in the SEX column #568

Open
Jakob37 opened this issue Aug 15, 2024 · 4 comments

Comments

@Jakob37

Jakob37 commented Aug 15, 2024

Hello, and thanks for this cool pipeline!

I have been spending some time debugging DROP. The latest error came from the aberrant expression workflow, in the Summary.R script, at this line:

lda <- lda(SEX ~ log2(XIST+1) + log2(UTY+1), data = train_dt)

It took me a while to figure out, but in the end I realized that the root cause was a SEX column I had accidentally filled with "NA".

As a user, I would find it useful if this were caught upstream in the pipeline, or alternatively if the Summary.R script handled this case the same way as when there is no gender information.
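
One way such an upstream check could look, as a minimal sketch rather than DROP's actual code: the helper `check_sex_column` below is hypothetical, and it assumes the sample annotation has already been read into a data frame. It treats real NAs, empty strings, and the literal text "NA" as missing, then reports why the column cannot be used.

```r
# Hypothetical helper, not part of DROP: classify the SEX column of a
# sample annotation table before any model is fitted on it.
check_sex_column <- function(sample_annotation) {
  if (!"SEX" %in% colnames(sample_annotation)) {
    return("absent")
  }
  sex <- trimws(as.character(sample_annotation$SEX))
  # Treat real NAs, empty strings and the literal text "NA" as missing.
  sex[is.na(sex) | sex == "" | toupper(sex) == "NA"] <- NA
  n_classes <- length(unique(sex[!is.na(sex)]))
  if (n_classes == 0) {
    return("all_missing")
  }
  if (n_classes == 1) {
    return("single_class")
  }
  "ok"
}

# Illustrative annotation table with a SEX column filled with the text "NA".
sa <- data.frame(RNA_ID = c("s1", "s2", "s3"), SEX = c("NA", "NA", "NA"))
status <- check_sex_column(sa)
if (status != "ok") {
  message("SEX column not usable for sex prediction: ", status)
}
```

A caller could either stop with that message early in the workflow or fall back to the code path used when no SEX column exists.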

@vyepez88
Collaborator

Hi Jakob, thanks for reporting. Do you mean the column was empty or it had the string "NA"? As DROP takes many sources of input, it's quite difficult to catch every possible error. We'll try to add some ways to deal with this.

@Jakob37
Author

Jakob37 commented Aug 15, 2024

> Hi Jakob, thanks for reporting. Do you mean the column was empty or it had the string "NA"? As DROP takes many sources of input, it's quite difficult to catch every possible error. We'll try to add some ways to deal with this.

The column had the string "NA" as value for all samples.

Yes, I realize that it isn't easy to cover all input issues!

@Jakob37
Author

Jakob37 commented Aug 15, 2024

A follow-up: I added a dummy column with all 'M'. Apparently this did not pass this step either. I checked the behaviour of lda, and it does indeed crash if there is only one level in y:

> library(MASS)  # lda() comes from the MASS package
> df <- data.frame(y=c('M', 'M', 'M', 'M'), x=c(1,2,3,4))
> lda(y ~ x, df)
Error in svd(X, nu = 0L) : infinite or missing values in 'x'

(c('M', 'M', 'F', 'F') on the other hand worked).

I will try running again with no SEX column (which is more correct, as I at the moment don't have access to that info).

Finally, on the topic of NA values: I ran into a separate issue where GENE_COUNTS_FILE had been assigned NA in my all non-reference run (by an upstream pipeline I use to run DROP). This crashed Snakemake in the DAG assembly step. I resolved it in the end by removing that column.
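
For anyone hitting the same thing, here is a rough sketch of how placeholder-only columns could be spotted before the annotation is handed to DROP. The helper name `placeholder_only_columns` and the example table are made up for illustration, and this is not something DROP provides.

```r
# Hypothetical helper, not part of DROP: find annotation columns in which
# every value is missing, empty, or the literal text "NA".
placeholder_only_columns <- function(sample_annotation) {
  is_placeholder <- vapply(sample_annotation, function(column) {
    values <- trimws(as.character(column))
    all(is.na(values) | values == "" | values == "NA")
  }, logical(1))
  names(sample_annotation)[is_placeholder]
}

# Illustrative annotation table written by an upstream tool.
sa <- data.frame(
  RNA_ID = c("s1", "s2"),
  GENE_COUNTS_FILE = c("NA", "NA"),
  stringsAsFactors = FALSE
)
to_drop <- placeholder_only_columns(sa)
# Remove the flagged columns, keeping the result a data frame.
sa_clean <- sa[, !(names(sa) %in% to_drop), drop = FALSE]
print(to_drop)
```

Dropping the column mirrors the manual fix described above; whether a missing column is acceptable for a given DROP module still has to be checked against the DROP documentation.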

These issues are not blocking me, and I realize that it must be a challenge to maintain such a varied code base. I just wanted to let you know, so that future travellers might have a shorter route ⛰️

@vyepez88
Collaborator

Yes, the problem is that lda needs at least two distinct SEX values to build a classifier. That's why a column with all 'M' doesn't work.
Thanks for reporting the issue with GENE_COUNTS_FILE.
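
The two-class requirement could be guarded with something like the sketch below. It is only an illustration under assumptions: `fit_lda_if_possible` is a hypothetical wrapper, not what Summary.R does, and it simply skips the fit when the response has fewer than two classes.

```r
library(MASS)  # provides lda()

# Hypothetical wrapper, not DROP's implementation: fit the classifier only
# when the response has at least two classes, otherwise return NULL.
fit_lda_if_possible <- function(formula, data) {
  # model.frame() drops rows with missing values before the classes are counted.
  response <- model.response(model.frame(formula, data))
  if (length(unique(response)) < 2) {
    return(NULL)
  }
  lda(formula, data = data)
}

one_class <- data.frame(y = c("M", "M", "M", "M"), x = c(1, 2, 3, 4))
two_class <- data.frame(y = c("M", "M", "F", "F"), x = c(1, 2, 3, 4))

skipped <- fit_lda_if_possible(y ~ x, one_class)  # NULL instead of an svd error
fitted <- fit_lda_if_possible(y ~ x, two_class)   # regular lda fit
print(is.null(skipped))
print(class(fitted))
```

Returning NULL lets the caller decide whether to skip the sex-prediction part of the summary or to report a clearer error.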
