more incorrect number of column errors #51

Open
adamfreedman opened this issue Jan 29, 2020 · 7 comments

@adamfreedman

I just installed the latest portcullis with bioconda, so all versioning issues should be managed within that install, correct? Even so, I am getting:

src/junction.cc(1242): Throw in function static std::shared_ptr<portcullis::Junction> portcullis::Junction::parse(const string&)
Dynamic exception type: boost::wrapexcept<portcullis::JunctionException>
std::exception::what: std::exception
[portcullis::JunctionError*] = Could not parse line due to incorrect number of columns. This is probably a version mismatch. Check file and portcullis versions. Expected 75 columns. Found 74. Line:
37950851391Sca51543818022904231742712279623201-?-AAN0006560604061.002401.79247999999999990.0102272709900011002043000000166666666666666664444

@lucventurini lucventurini self-assigned this Feb 3, 2020
@lucventurini
Collaborator

Dear @adamfreedman , I will have a look and let you know.

Apologies for the late reply but I was on leave last week.

Kind regards

@swarbred

swarbred commented Feb 3, 2020

Dear @adamfreedman

While we fix this, try cleaning the portcullis_all.junctions.tab output file from the junc stage:

cat portcullis_all.junctions.tab |awk '!(($11=="?" && $14=="NA") || ($11=="?" && $13=="NA"))' > portcullis_all.junctions.cleaned.tab

Then run the filt stage, passing in the cleaned file.

On a similar example that was failing for me, the cleanup removed 44 lines and the filt stage then completed.

@lucventurini
Collaborator

Hi @swarbred,
Would you be able to send over the location of the offending file (or even attach it here or send it by email)? I might try to see what's happening.

Best

@swarbred

swarbred commented Feb 3, 2020

Hi @lucventurini

I will copy the portcullis output to a location you have access to and send you the details.

@lucventurini
Collaborator

Hi @adamfreedman , @swarbred ,
I found the problem. The script rule_filter.py (called during the filtering stage) uses pandas to do the necessary filtering before the self-training procedure. When it writes out the final lines, though, it writes the "NA" values as empty fields rather than as a specific value.
This subsequently breaks the parsing of the portcullis C++ library that should load the junctions into the self-training procedure.
I will endeavour to find a fix ASAP.
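
To check whether a given file is affected, something like the following could be used. It is only a minimal sketch: it assumes the junction file is tab-delimited and that a damaged line shows up as an empty field between two tabs. The function name `find_empty_fields` and the toy columns are made up for illustration and are not part of portcullis.

```python
import io


def find_empty_fields(handle, sep="\t"):
    """Yield (line_number, empty_column_numbers) for every line with empty fields.

    Line and column numbers are 1-based. Assumes a delimited text file.
    """
    for lineno, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split(sep)
        empty = [col for col, value in enumerate(fields, start=1) if value == ""]
        if empty:
            yield lineno, empty


# Toy data (hypothetical columns): the second data line has an empty second field.
sample = io.StringIO("index\tss1\tss2\n0\tGT\tAG\n1\t\tAG\n")
for lineno, columns in find_empty_fields(sample):
    print(f"line {lineno}: empty column(s) {columns}")
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.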

Kind regards

lucventurini added a commit to lucventurini/portcullis that referenced this issue Feb 3, 2020
@lucventurini
Collaborator

Hi all, I think that 9f86ebe could fix this issue. I basically force rule_filter.py to output all the NAs as "NA" strings in the filtered files, and this should ensure compatibility with the C++ parser. I have not tested it extensively.
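
For readers unfamiliar with the pandas side, here is a small sketch of the behaviour described above. It does not reproduce the commit itself; it only assumes that the missing values are written with `DataFrame.to_csv`, whose `na_rep` parameter controls how they are rendered. The column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns) with one missing value.
df = pd.DataFrame({"index": [0, 1], "ss1": ["GT", np.nan], "ss2": ["AG", "AG"]})

# Default: missing values become empty fields.
default_out = df.to_csv(sep="\t", index=False)

# With na_rep, missing values are written as the literal string "NA".
fixed_out = df.to_csv(sep="\t", index=False, na_rep="NA")

print(default_out.splitlines()[2].split("\t"))
print(fixed_out.splitlines()[2].split("\t"))
```

The first printed row has an empty string in the second position, while the second row has "NA" there.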

@swarbred if you could install and test on the problematic dataset, we could confirm the fix.

Best

@lucventurini
Collaborator

Also: the bug is triggered by the fact that occasionally there will be a splicing junction with a donor or an acceptor with dinucleotide "NA".
This gets interpreted by pandas as a NaN value, which is not the intended behaviour! I hopefully fixed this in ebe42fb by removing "NA" from the valid list of NaN values for pandas.read_csv.
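
A short sketch of the effect, for illustration only: it assumes the table is read with `pandas.read_csv` and uses `keep_default_na=False` together with an explicit `na_values` list, which is one way to stop "NA" being treated as missing. The actual list used in ebe42fb may differ, and the column names here are invented.

```python
import io

import pandas as pd

# Toy table (hypothetical columns) where "NA" is a real dinucleotide, not a missing value.
text = "index\tss1\tss2\n0\tGT\tAG\n1\tNA\tAG\n"

# Default behaviour: "NA" is in pandas' default list of missing-value strings.
default_df = pd.read_csv(io.StringIO(text), sep="\t")

# Disable the default list and treat only empty fields as missing.
literal_df = pd.read_csv(
    io.StringIO(text), sep="\t", keep_default_na=False, na_values=[""]
)

print(pd.isna(default_df.loc[1, "ss1"]))
print(literal_df.loc[1, "ss1"])
```

In the first frame the dinucleotide is lost as a missing value; in the second it survives as the string "NA".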
