Earlier crash for NA values in the SEX column #568

Jakob37 · 2024-08-15T07:38:21Z

Hello, and thanks for this cool pipeline!

I have been spending some time debugging DROP. The latest error came from the aberrant expression workflow, in the Summary.R script, at this line:

lda <- lda(SEX ~ log2(XIST+1) + log2(UTY+1), data = train_dt)

It took me some time to figure out, but in the end I realized that the root cause was that I had accidentally supplied a SEX column filled with "NA".

For me as a user, it would be useful if this were caught upstream in the pipeline, or alternatively if the Summary.R script could handle this case the same way as when no sex information is provided.
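To make the request concrete, here is a minimal sketch of what such an upstream check could look like. It is not DROP's code: `normalize_sex` and `has_sex_info` are hypothetical helper names, the list of "missing" spellings is an assumption, and the sample table is toy data.

```r
# Hypothetical helpers, not part of DROP.
# Map common spellings of "missing" (an assumed list) to a real NA.
normalize_sex <- function(sex) {
  sex <- trimws(as.character(sex))
  sex[sex %in% c("", "NA", "na", "N/A")] <- NA_character_
  sex
}

# TRUE when at least one sample carries a usable SEX value.
has_sex_info <- function(sex) {
  any(!is.na(normalize_sex(sex)))
}

# Toy sample annotation where every SEX entry is the string "NA".
sample_anno <- data.frame(
  RNA_ID = c("S1", "S2", "S3"),
  SEX = c("NA", "NA", "NA"),
  stringsAsFactors = FALSE
)

if (!has_sex_info(sample_anno$SEX)) {
  message("SEX column has no usable values; treating it as absent.")
}
```

With a check like this, a column of "NA" strings could be reported early or treated like a missing SEX column, instead of failing later inside the model fit.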

vyepez88 · 2024-08-15T08:31:50Z

Hi Jakob, thanks for reporting. Do you mean the column was empty or it had the string "NA"? As DROP takes many sources of input, it's quite difficult to catch every possible error. We'll try to add some ways to deal with this.

Jakob37 · 2024-08-15T08:40:42Z

The column had the string "NA" as value for all samples.

Yes, I realize that it isn't easy to cover all input issues!

Jakob37 · 2024-08-15T12:55:23Z

A follow-up here. I added a dummy SEX column set to 'M' for all samples. Apparently this did not pass this step either. I checked the behaviour of lda, and it does indeed crash if there is only one level in y:

> library(MASS)
> df <- data.frame(y=c('M', 'M', 'M', 'M'), x=c(1,2,3,4))
> lda(y ~ x, df)
Error in svd(X, nu = 0L) : infinite or missing values in 'x'

(c('M', 'M', 'F', 'F') on the other hand worked).

I will try running again with no SEX column (which is more correct, as I at the moment don't have access to that info).

Finally, on the topic of NA values: I ran into a separate NA issue, where GENE_COUNTS_FILE had been set to NA in a run where all samples were non-reference samples (by an upstream pipeline I am using to run DROP). This crashed Snakemake in the DAG assembly step. I resolved it in the end by removing that column.
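The workaround of removing the column can be scripted before handing the sample annotation to DROP. This is only a sketch under assumptions: `drop_all_na_columns` is a hypothetical helper, the set of strings treated as missing is a guess, and the table is toy data.

```r
# Hypothetical pre-processing helper, not part of DROP.
# Drops every column whose values are all missing (real NA, "" or "NA").
drop_all_na_columns <- function(anno, na_strings = c("", "NA")) {
  is_all_na <- vapply(anno, function(col) {
    all(is.na(col) | trimws(as.character(col)) %in% na_strings)
  }, logical(1))
  anno[, !is_all_na, drop = FALSE]
}

# Toy annotation: GENE_COUNTS_FILE is NA for every sample.
anno <- data.frame(
  RNA_ID = c("S1", "S2"),
  GENE_COUNTS_FILE = c(NA, NA),
  stringsAsFactors = FALSE
)

cleaned <- drop_all_na_columns(anno)
names(cleaned)
```

Columns that contain at least one real value are kept untouched, so the helper only removes columns that carry no information at all.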

These issues are not blocking me, and I realize that it must be a challenge to maintain such a varied code base. I just wanted to let you know, so that some future travellers might have a shorter route ⛰️

vyepez88 · 2024-08-15T16:50:20Z

Yes, the problem is that lda needs at least 2 distinct SEX values in the training data to be able to classify samples. That's why all 'M' doesn't work.
Thanks for reporting the issue with the GENE_COUNTS_FILE.
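The two-class requirement could be checked before the model is fitted. The sketch below is an illustration, not the actual Summary.R code: `n_sex_classes` and `fit_sex_lda` are hypothetical names, the expression values are made up, and it assumes missing SEX entries are already real NA values rather than "NA" strings.

```r
# Hypothetical guard, not DROP's implementation.
# Counts distinct non-missing SEX values (assumes missing entries are real NA).
n_sex_classes <- function(sex) {
  length(unique(sex[!is.na(sex)]))
}

# Fit the sex classifier only when at least two classes are present;
# otherwise return NULL so the caller can skip sex prediction.
fit_sex_lda <- function(train_dt) {
  if (n_sex_classes(train_dt$SEX) < 2) {
    message("Fewer than two SEX classes; skipping sex prediction.")
    return(NULL)
  }
  MASS::lda(SEX ~ log2(XIST + 1) + log2(UTY + 1), data = train_dt)
}

# Toy training data with a single class, as in the all-'M' case above.
train_dt <- data.frame(
  SEX = c("M", "M", "M", "M"),
  XIST = c(3, 5, 2, 4),
  UTY = c(120, 98, 143, 110)
)

model <- fit_sex_lda(train_dt)
is.null(model)
```

Returning NULL rather than stopping lets the rest of the summary continue without a sex prediction, which is the behaviour requested at the top of this thread.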

This was referenced Aug 16, 2024

Allow running DROP without reference samples genomic-medicine-sweden/tomte#152

Closed

Adds possibility to make drop database genomic-medicine-sweden/tomte#147

Draft
