more incorrect number of column errors #51

adamfreedman · 2020-01-29T17:47:56Z

I just installed the latest portcullis with bioconda, so all versioning issues should be managed within that install, correct? Even so, I am getting:

src/junction.cc(1242): Throw in function static std::shared_ptr<portcullis::Junction> portcullis::Junction::parse(const string&)
Dynamic exception type: boost::wrapexcept<portcullis::JunctionException>
std::exception::what: std::exception
[portcullis::JunctionError*] = Could not parse line due to incorrect number of columns. This is probably a version mismatch. Check file and portcullis versions. Expected 75 columns. Found 74. Line:
37950851391Sca51543818022904231742712279623201-?-AAN0006560604061.002401.79247999999999990.0102272709900011002043000000166666666666666664444

lucventurini · 2020-02-03T11:34:53Z

Dear @adamfreedman , I will have a look and let you know.

Apologies for the late reply but I was on leave last week.

Kind regards

swarbred · 2020-02-03T12:33:53Z

Dear @adamfreedman

While we fix this, try cleaning the portcullis_all.junctions.tab output file of the junc stage:

cat portcullis_all.junctions.tab |awk '!(($11=="?" && $14=="NA") || ($11=="?" && $13=="NA"))' > portcullis_all.junctions.cleaned.tab

Then run the filt stage, passing in the cleaned file.

On a similar example that was failing for me, the cleanup removed 44 lines and the filt stage then completed.
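
For anyone who prefers Python to awk, the same cleanup can be sketched roughly as below. This is only an illustration: the function names are made up for this example, and it assumes the file is whitespace-delimited in the same way awk sees it by default.

```python
def field(fields, n):
    """Return the n-th (1-based) field, or "" if the line is too short (as awk does)."""
    return fields[n - 1] if n <= len(fields) else ""


def keep_line(line):
    """Mirror the awk condition: drop lines where $11 is "?" and $13 or $14 is "NA"."""
    fields = line.split()  # same splitting as awk's default field separator
    col11_is_question = field(fields, 11) == "?"
    has_na = field(fields, 13) == "NA" or field(fields, 14) == "NA"
    return not (col11_is_question and has_na)


def clean_junctions(lines):
    """Return only the lines that the awk one-liner would have printed."""
    return [line for line in lines if keep_line(line)]
```

To use it on a real file, read portcullis_all.junctions.tab line by line, pass the lines to clean_junctions, and write the result to portcullis_all.junctions.cleaned.tab.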

lucventurini · 2020-02-03T13:51:13Z

Hi @swarbred,
Would you be able to send over the location of the offending file (or even attach it here or send it by email)? I might try to see what's happening.

Best

swarbred · 2020-02-03T14:45:15Z

Hi @lucventurini

I will copy the portcullis output to a location you have access to and send you the details.

lucventurini · 2020-02-03T16:50:45Z

Hi @adamfreedman , @swarbred ,
I found the problem. The script rule_filter.py (called during the filtering stage) does the necessary filtering before the self-training procedure, using pandas. When it writes out the final lines, though, it writes the "NA" values as empty fields rather than as an explicit value.
This subsequently breaks the parsing of the portcullis C++ library that should load the junctions into the self-training procedure.
I will endeavour to find a fix ASAP.
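
A quick way to check whether a filtered file is affected is to look for empty tab-separated fields. The helper below is a minimal sketch; its name is hypothetical and it is not part of portcullis.

```python
def lines_with_empty_fields(lines, sep="\t"):
    """Return the 1-based numbers of lines that contain at least one empty field."""
    hits = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\r\n")
        if not stripped:
            continue  # ignore completely blank lines
        if "" in stripped.split(sep):
            hits.append(number)
    return hits
```

Passing an open file handle for the filtered junctions file as `lines` returns the line numbers worth inspecting.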

Kind regards

lucventurini · 2020-02-03T17:31:33Z

Hi all, I think that 9f86ebe could fix this issue. I basically force rule_filter.py to output all the NAs as "NA" strings in the filtered files, and this should ensure compatibility with the C++ parser. I have not tested it extensively.
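
To illustrate the behaviour in question (a toy example, not the code from the commit): by default pandas writes missing values as empty fields, while the `na_rep` argument of `DataFrame.to_csv` makes it write an explicit string instead. The column names here are invented.

```python
import pandas as pd

# Toy frame with one missing value; the column names are made up.
df = pd.DataFrame({"id": [0, 1], "score": [1.5, float("nan")]})

# Default behaviour: the missing value becomes an empty field.
default_out = df.to_csv(sep="\t", index=False)

# With na_rep, the missing value is written as the literal string "NA".
fixed_out = df.to_csv(sep="\t", index=False, na_rep="NA")
```

In `default_out` the last row ends with a bare tab, whereas in `fixed_out` it ends with "NA", so the column count stays visible to a downstream parser.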

@swarbred if you could install and test on the problematic dataset, we could confirm the fix.

Best

lucventurini · 2020-02-03T17:50:17Z

Also: the bug is triggered by the fact that occasionally there will be a splicing junction with a donor or an acceptor with dinucleotide "NA".
This gets interpreted by pandas as a NaN value, which is not the intended behaviour! I hopefully fixed this in ebe42fb by removing "NA" from the valid list of NaN values for pandas.read_csv.
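
A small demonstration of the effect, assuming hypothetical column names `ss1` and `ss2` for the two dinucleotides. One way to keep "NA" as a literal string is to pass `keep_default_na=False` together with an explicit `na_values` list to `pandas.read_csv`; the actual commit may do this differently.

```python
import io

import pandas as pd

# Toy junction table; "NA" in the ss1 column is meant as a dinucleotide, not a missing value.
text = "id\tss1\tss2\n0\tGT\tAG\n1\tNA\tAG\n"

# Default parsing: "NA" is in pandas' default list of missing-value markers.
default_df = pd.read_csv(io.StringIO(text), sep="\t")

# Only treat empty fields as missing, so "NA" survives as a string.
literal_df = pd.read_csv(
    io.StringIO(text), sep="\t", keep_default_na=False, na_values=[""]
)
```

With the default settings the second row's `ss1` is read as NaN, while in `literal_df` it remains the string "NA".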

lucventurini self-assigned this Feb 3, 2020

lucventurini added the bug label Feb 3, 2020

lucventurini added a commit to lucventurini/portcullis that referenced this issue Feb 3, 2020

Potential fix for EI-CoreBioinformatics#51

9f86ebe

lucventurini added a commit that referenced this issue Feb 3, 2020

Another fix for #51: do not read 'NA' splice junctions as 'NaN' values!

ebe42b8
