
consider setNumericRounding(2) to make DT[v1==0.6] works #1642

Open
archenemies opened this issue Apr 11, 2016 · 11 comments
@archenemies

I was asked by Arunkumar Srinivasan to submit a github issue for this:

http://lists.r-forge.r-project.org/pipermail/datatable-help/2016-April/003164.html

The problem is that data.table produces out-of-order floating point
values when using setkey, order, etc.:

> x=data.table(rnorm(1e6,1,1e-3)); setkey(x,V1); sum(diff(x$V1)<0)
[1] 772

The result should be zero, but as you can see 772 of the 1e6 random
values were actually sorted out of order. I can't remember where I ran
into this problem, but I can imagine wanting properly sorted floats
when computing densities or splines, and when such unexpected behavior
arises then it can be quite difficult to debug. The problem is due to
"numeric rounding", as documented under setNumericRounding. However,
neither the documentation nor Arunkumar provides any reason why it is
helpful to sort values out of order. Arunkumar gives the following
example to explain why setNumericRounding(2) is helpful in 65535/65536
cases, for looking up an exact floating point value:

DT = data.table(a=seq(0,1,by=0.2),b=1:2, key="a")
DT
setNumericRounding(0)   # turn off rounding
DT[.(0.4)]   # works
DT[.(0.6)]   # no match, confusing since 0.6 is clearly there in DT

This behavior is fine with me (although I'm not sure how often it
comes up that we want a program to work in only 65535/65536 cases). I
just don't see why it is necessary to break sorting to achieve that
behavior. I'm not familiar with the data.table internals, but it seems
like it should be possible to do approximate lookups while preserving
exact sorting.
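To make the trade-off concrete, here is a rough sketch of what "rounding off the last two bytes" of a double means. This is not data.table's actual implementation; `round2` is a made-up helper that zeroes the two least-significant mantissa bytes, and it assumes a little-endian platform:

```r
# Hypothetical helper (NOT data.table internals): zero the two
# least-significant bytes of a double's IEEE 754 representation,
# approximating the effect of setNumericRounding(2).
# Assumes a little-endian machine (byte 1 is least significant).
round2 <- function(x) {
  b <- writeBin(x, raw())   # 8 raw bytes of the double
  b[1:2] <- as.raw(0)       # discard the last two bytes of precision
  readBin(b, "double")
}

a <- 0.6          # the literal typed at the prompt
b <- 0.4 + 0.2    # computed value; differs from 0.6 in the final bits
a == b                  # FALSE: exact comparison misses, hence DT[.(0.6)] fails
round2(a) == round2(b)  # TRUE: rounded comparison matches, hence the lookup works
```

The flip side is that sorting on the rounded values turns distinct doubles into ties, and the relative order within a tie need not follow the unrounded bits, which is consistent with setkey leaving some adjacent values out of order.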

I'm not quite sure I understand Arunkumar's "two options" in the above
message; I'm also not sure whether he understood my argument. I
think he proposes something like doing one of the following:

  1. use different values of setNumericRounding for different
    operations, e.g. 2 for look-ups and 0 for sorting. This is what I had
    meant to suggest in my message
  2. get rid of setNumericRounding and just use full comparisons in both
    look-ups and sorting

I would be happy either way. But I don't think I'm really qualified to
argue for one or the other. Thanks.

@arunsrinivasan
Member

That is one heck of a rude post. But thanks.

@archenemies
Author

Haha sorry it came off that way! Thanks for the feedback...

@archenemies
Author

Arun: By the way, if you read the issue then it would be nice to have some feedback on the substance as well. For instance, did I describe your "two options" correctly? Do you understand my proposal, and is it feasible that key lookups and sorting could be given different numeric rounding options? Thanks.

@archenemies
Author

Does somebody want to get back to me on this? Since Data.table is a popular package in R, since sorting is a central part of the Data.table functionality, and since floating point numbers come up a lot in numeric computing, I would think that a bug in which Data.table needlessly sorts floats out of order would be important to R users. But before anyone can fix this I think it makes sense to have feedback from the maintainers as to what kind of fix would be accepted. Am I wrong in thinking that?

I don't mean to be rude, but please note that my first message about this to the list was on 27 January 2016. Arun and I had some back-and-forth that day and the next, and then things stalled until I re-sent my message on 7 April. Then we had some discussion on Github but it seems to have been shelved or forgotten again. I guess people are busy, but this leaves me in the position of wondering how often I should keep pinging the maintainers here - even just to get a response to simple requests for clarification in the middle of a discussion. I'm not even sure why my original post was considered "rude", so forgive me if I'm stepping on someone's toes by saying this.

I see the CRAN data.table maintainer is @mattdowle - Matt, any feedback?

@franknarf1
Contributor

@archenemies (I am not a developer, but...) This seems to be documented in ?setkey already:

a very fast 6-pass radix order for columns of type double is also implemented [...]
Note that columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by default, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely.

I agree that it would be nice to have a way to sort by a floating point number reliably by reference (probably with setorder), but that sounds more like a feature request than an urgent cataclysmic bug. If you need reliable sorting right now, try

x[{order(V1)}, sq := seq_len(.N)]; setorder(x, sq); sum(diff(x$V1) < 0) 
# 0

The {} are needed to circumvent options(datatable.optimize) which drops in the radix order on seeing x[order(v), j] instead of using the base order. If you're after more than sorting and need V1 to be a key, I think you can fake it with the setattr function after sorting correctly.
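For completeness, here is the workaround above assembled into one runnable sketch. The `setattr(x, "sorted", ...)` step is the "fake it" idea mentioned above; treating the `"sorted"` attribute as the key marker is an assumption about data.table internals, not documented API:

```r
library(data.table)

set.seed(1)
x <- data.table(V1 = rnorm(1e6, 1, 1e-3))

# Sort by reference using base R's exact order(); the {} prevent
# datatable.optimize from swapping in the rounded radix order.
x[{order(V1)}, sq := seq_len(.N)]
setorder(x, sq)
x[, sq := NULL]

sum(diff(x$V1) < 0)   # 0: exact ordering, no inversions

# (Assumption, unverified) mark V1 as the key without re-sorting:
setattr(x, "sorted", "V1")
```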

@archenemies
Author

Thanks for the reply @franknarf1. It sounds like your workaround is more complicated than the one which was already discussed, with setNumericRounding(0). I'm not sure I understand the part about the optimizer.

I don't think it is a cataclysmic bug, but when I communicated about this with Arun on the "datatable-help" list, he said "We do this so we can get predictable behavior" and I said "That reasoning seems to apply to look-ups, but not sorting" and he said "Can you suggest anything better?" and I said "Yes, don't use rounding when you sort", and then he didn't reply so I waited two months and re-sent my message, and he asked me to post it here, and I got no real response. So I'm observing a pattern of soliciting feedback and then ignoring it, which is somewhat frustrating (I haven't really contributed anything to the project at this point so perhaps I have no cause for grievance, but I think it might be useful to point this out).

I could be mistaken, but it seems that the current design creates an unnecessary "gotcha" for new users, and an unnecessary annoyance for advanced users, which could be fixed without breaking any existing code. Since I don't know anything about Data.table internals, I thought someone else could help at least to confirm whether this is true or not. Basically I would like to be able to recommend this package to others, but if I would always have to say, "By the way, watch out because sorting is broken" then I would come across as a buffoon.

On a side note, my original comment mentioned rounding being useful (for look-ups) in 65535/65536 cases, based on my theoretical understanding of how it works, but I just did an experiment which shows it to be much more reliable than that:

> setNumericRounding(2)
> x=seq(0,1,1.0e-7); which(is.na(DT[.((x+0.2)-x)]$b))
integer(0)

So right now I'm just trying to understand why rounding is being used for sorting, given that both Arun and the documentation say that rounding is not used for speed, but to "avoid surprising results", which it in fact seems to create when used in sorting. I realize that I should probably refer to Arun as @arunsrinivasan so that he can get notified if he wants to join in.

@franknarf1
Contributor

I'm sure they've looked at your feedback. They probably just didn't respond because there's nothing of substance to say until they've made a change.

I agree that it is surprising since ?setkey says it sorts, but you just have to read further down to see the bit I quoted before to understand that the real goal of setkey is to key the table for merges, which does not require perfect sorting.

Floats are generally a bad choice for keys, but if you want to use one as such, you need to read the notes to understand how it's handled differently. Floats are also bad as grouping columns. In your example, many floats are conflated in by= thanks to rounding. I see x[, .N, by=V1][N>1, .N] # 1525. My recommendation to new users would be, not "By the way, watch out because sorting is broken", but rather

"Typical keys are IDs (as you'd have in SQL tables) and integer times (IDate, ITime). While you can use floats and factors, it's almost always unnecessary and an invitation to trouble."

If they asked for details, I'd point them to threads on binary search and radix sorting of floats.

Fyi, there is a which argument you can turn on in lookups, so sum(is.na(DT[.((x+0.2)-x), which=TRUE])) # 0.
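Spelled out, the `which=TRUE` check above looks like this (the `DT` definition and `setNumericRounding(2)` state are taken from the earlier examples in this thread, and the expected count of 0 is the result reported above):

```r
library(data.table)

setNumericRounding(2)   # round off the last two bytes in joins
DT <- data.table(a = seq(0, 1, by = 0.2), b = 1:2, key = "a")

x <- seq(0, 1, 1.0e-7)
# which=TRUE returns matching row numbers (NA where a lookup fails)
# rather than a joined table, so failed matches are easy to count:
w <- DT[.((x + 0.2) - x), which = TRUE]
sum(is.na(w))   # 0: every perturbed value matched with rounding on
```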

@archenemies
Author

It would be good to have input from the maintainers too, but thank you to @franknarf1 for another reply...

I agree that it is surprising since ?setkey says it sorts, but you just have to read further down to see the bit I quoted before to understand that the real goal of setkey is to key the table for merges, which does not require perfect sorting.

If you're talking about this paragraph in setkey

Note that columns of ‘numeric’ types (i.e., ‘double’) have their
last two bytes rounded off while computing order, by default, to
avoid any unexpected behaviour due to limitations in representing
floating point numbers precisely. Have a look at
‘setNumericRounding’ to learn more.

then that's not clear at all. Giving an invalid reason makes it unclear to the reader, and besides it should be more explicit about something so important. It should say,

Note that numeric columns, which are 8-byte doubles, are only
sorted according to the first six bytes by default - the last two
bytes are ignored. This means that numbers which are very close
together may be sorted out of order. The reason for this behavior
is to make our benchmarks 25% faster. [or whatever the real reason
is, certainly not "avoiding unexpected behavior"]

Additionally, since you can make good use of Data.table column-sorting without ever reading the 'setkey' page, the warning should be put in other places as well. But why fix the documentation when we could fix the code? Maybe it'll be a 1-line change, I don't know.

And why invalidate users when it's the library that's buggy? Should I tell new users not to implement convex-hull algorithms, because that's a "bad choice"? No, I'll tell them that they shouldn't use Data.table, because their convex-hull or spline or ROC curve will have inexplicable behavior that will take them hours to debug - unless they're willing to read all the documentation first, and memorize some odd caveats that even the maintainers can't explain.

@arunsrinivasan
Member

Default is not to do rounding for now.

@arunsrinivasan arunsrinivasan added this to the v2.0.0 milestone Jul 21, 2016
@archenemies
Author

Hey thank you for fixing this Arun. I was surprised to hear back from you.

I hope that your change doesn't break anyone's code. I thought it would be possible to keep rounding as a default for equality comparisons in key look-ups, and turn it off only for sorting. But I never looked at the data.table source to verify the feasibility of this separation. If you're willing to change the default for both operations, that works for me; I just wanted to be sure you know it's not the solution I was trying to propose.

@mattdowle
Member

Reopening just to come back to it as we want to do the appropriate rounding for the context. Changing the default to 0 was a good move for now. Milestone was already 1.9.10.

@mattdowle mattdowle reopened this Nov 23, 2016
@mattdowle mattdowle self-assigned this Nov 23, 2016
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@jangorecki jangorecki changed the title data.table produces out-of-order floating point columns, with no rationale consider setNumericRounding(2) to make DT[v1==0.6] works Mar 3, 2019