Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarifications on comparisons with NaN #21936

Closed
kgpai opened this issue Feb 15, 2024 · 13 comments
Closed

Clarifications on comparisons with NaN #21936

kgpai opened this issue Feb 15, 2024 · 13 comments
Labels

Comments

@kgpai
Copy link
Contributor

kgpai commented Feb 15, 2024

Current Behavior

Currently in Presto NaN() seems to behave like its > infinity,

select ARRAY_SORT(a) from  (values array[NAN(), INFINITY(), -INFINITY(), 1]) as T(a); 
—  [ "-Infinity", 1,  "Infinity",  "NaN" ]

select ARRAY_SORT_Desc(a) from  (values array[NAN(), INFINITY(), -INFINITY(), 1]) as T(a); 
—  ["NaN", "Infinity", 1, "-Infinity"]

This is not just limited to array_sort and you can see this behavior in order by too:

SELECT x, y from (values (1, 1.0), (2, NAN()), (3, -NAN()), (4, INFINITY()), (5, -INFINITY())) as T(x, y) order by y;

— 
5	-Infinity
1	1
4	Infinity
2	NaN
3	NaN

However running

SELECT NAN() > INFINITY()

-- False

gives a contradictory answer.

Expected Behavior

Comparison with NaN is not defined so we should confirm to https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN.

Context

Velox is trying to ensure correctness with Presto Java but we are seeing inconsistencies in behavior and wanted clarifications on these.

@kgpai kgpai added the bug label Feb 15, 2024
@kgpai
Copy link
Contributor Author

kgpai commented Feb 15, 2024

@mbasmanova
Copy link
Contributor

CC: @kaikalur @feilong-liu

@tdcmeehan
Copy link
Contributor

As discussed in #21877, I am not sure if it's most appropriate to follow the IEEE standard, or to follow the behavior of most engines. It appears that behavior of NaN is not defined in the spec, so we are actually free to follow whatever behavior is most intuitive to users--in my opinion, this should be consistent with that of other engines. The real requirement is to make them consistent, and document the behavior well for functions.

@rschlussel
Copy link
Contributor

I've been looking into the NaN inconsistencies, and wanted to give an update/summary of my thinking. If we can get alignment on the solution we can start tackling the problem

reference document of presto’s treatment of NaN for comparison and ordering across all functions and operators: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498

What does IEEE-754 say about NaN

IEEE specifies that operations on NaN return false, and that comparison operations return false except for inequality which returns true. (I found a section online here after seeing references to total order: https://gist.github.com/leto/176888/e1bf1cf76d56a3da34b6be5a0f6b95139a747299)
It also specifies a total order of doubles as follows

  • -quietNan
  • -signalingNan
  • regular expected ordering of doubles
  • +signalingNaN
  • +quietNan

what people usually expect should be true about numbers

There are a couple concepts that people intuitively expect to be consistent:

  1. equality/inequality and distinctness, groupings, etc.
  2. <, >, <=, >= and sort order

Presto current behavior

  1. comparison - basic operations follow IEEE754 (nan <> nan). joins do not match NaN keys, but distinct and group bys do consider NaN to be equal (deduplicates, considered a single group, etc.). Many map and array function do not consider NaN to be equal to itself, (will show up multiple times in maps, array distincts, unions, etc.), but the exact behavior is inconsistent.
  2. ordering - basic operations follow IEEE754 (NaN < or > anything is false). Order by puts NaNs as biggest. min and max have inconsistent behavior. greatest and least throw an error. topn functions on arrays and maps are inconsistent, but generally sort NaN as biggest. File format min/max stats differ by file format and may also differ based on who wrote the file.

A lot of the Presto inconsistency comes from the following:
In Java, basic comparison operations on double type (<, >, =, !=, >-, <=) behave according to the IEEE754 definition. However, Double the object class has equals, compare() and hashcode all defined so that NaN are equal to each other and greater than anything else. Some of our functions and operators use the Object comparison or structures while others use the primitive definition (and some probably use a combo). There are still other operators that have special handling for nulls. Additionally, the primitive definition itself can lead to bugs when used for sorting if nan is not handled separately (e.g. #22040).

what to do:

How should we decide what the behavior of NaN should be?

  1. useful and reasonably intuitive behavior for users
  2. reasonably standard
  3. reasonable to implement and maintain as new functions and features are added.

Options

  1. IEEE definition without total order - Any basic comparisons with NaN return false, NaN<> NaN is true. NaN are not de-distincted. min, max etc. return nan. Would throw an error for sorting.
    Pros:
    a) IEEE754 compliant
    cons:
    a) not useful behavior for users if we throw an error for order by queries when input contains nan. (and possibly not sql compliant?)
    b) implementation might be tricky. and keeping consistency might be tricky.
  2. IEEE compliant with total order - Any basic comparisons with NaN return false, NaN<> NaN is true. NaN are not de-distincted. unclear what min/max should do. For sorting uses total order
    Pros:
    a) IEEE754 compliant
    cons:
    a) not intuitive that basic comparison operations don’t match sort order. operations like min/max or topn don't obviously fall into distinctions between sort and comparison
    b) might be surprising to users that nans wouldn’t be deduplicated for distinct, group by, etc.
    c) having comparison operations behave differently than sort makes this prone to inconsistency/bugs.
  3. NaN=NaN and NaN is biggest.
    Pros:
    a) many other DBs use this definition
    b) easy to maintain consistency across different kinds of functions and operations because of consistency with expectations that <> and distinct are the same and > is the same as > for the purposes of sorting.
    c) behavior is reasonably useful for users
    Cons:
    a) not IEEE compliant

Conclusion

I think option 3 (nan=nan and sorts largest) will be simplest to implement and maintain consistently, is reasonable behavior for users, and is aligned with many other dbs, and therefore reasonably standard even though it's not IEEE compliant.

@Yuhta
Copy link
Contributor

Yuhta commented Feb 29, 2024

@rschlussel Not sure what users we have in mind but from a numerical computing background, treating NaN as a normal number is very counter intuitive (making it larger than Inf is even insane, how could something be larger than Inf?!). And it definitely will introduce more bugs and harder to maintain than option 1 (throwing is always the safest way; not alway the most friendly though).

Assume we want to go with option 3, do we want to fix the comparison operations to reflect this new definition of NaN?

@rschlussel
Copy link
Contributor

@rschlussel Not sure what users we have in mind but from a numerical computing background, treating NaN as a normal number is very counter intuitive (making it larger than Inf is even insane, how could something be larger than Inf?!). And it definitely will introduce more bugs and harder to maintain than option 1 (throwing is always the safest way; not alway the most friendly though).

Assume we want to go with option 3, do we want to fix the comparison operations to reflect this new definition of NaN?

That's a great point. I guess I was assuming that most users are not using the db for very mathematical use cases, but mostly nan are bad data and they are doing basic things like getting the top x by price. But it would be a good idea to ask some customers their preferences as well.

I am very hesitant about option 1. I think throwing an exception would likely be a bad user experience in most cases, especially given that there is nothing mandating we do that, and it seems like most dbs don't.
I tried to see what the sql spec had to say about it. It says

If PVi and QVi are not the null value and the result of “PVi QVi” is Unknown, then the relative ordering of PVi and QVi is implementation-dependent."

That to me suggests that for NaN values (where by IEEE comparisons are considered unordered), the sort order would be implementation dependent, but it doesn't seem like it should throw an exception.

For option 3, yes, we would redefine comparison operations to reflect the new definition of NaN (what makes it simple is that notions of distinctnesss, equality/inequality, ordering comparisons, and sorting are all behaving the same. The definition isn't standard to what you'd expect in mathematical computing, but it is internally consistent and easy to understand).

@Yuhta
Copy link
Contributor

Yuhta commented Mar 1, 2024

@rschlussel OK then at least we can have some consistent design in this space. The implementation would be messy though, because you will need to normalize all the different NaNs to the same place as +NaN on every newly generated floating point blocks if you want it to be efficient, and need to be careful whenever doing floating number comparison, we need to use the custom comparator instead of the built in operators.

@mbasmanova
Copy link
Contributor

mbasmanova commented Mar 5, 2024

@czentgr proposed the following behavior for Velox: facebookincubator/velox#7237

@rschlussel
Copy link
Contributor

@czentgr proposed the following behavior for Velox: facebookincubator/velox#7237

Thank you! It sounds like we are in agreement then. Let me collect approvals so I can move forward with this as well.

@tdcmeehan
Copy link
Contributor

I think option 3 (nan=nan and sorts largest) will be simplest to implement and maintain consistently, is reasonable behavior for users, and is aligned with many other dbs, and therefore reasonably standard even though it's not IEEE compliant.

Agreed on this--this also seems most consistent with other engines, and it's easy to document as we see above.

@czentgr
Copy link
Contributor

czentgr commented Mar 5, 2024

Thanks for the comments on PR facebookincubator/velox#7237.
I will add the requested information and check.

@kagamiori
Copy link
Contributor

kagamiori commented Mar 19, 2024

Hi @rschlussel, not sure if you're already aware of this, but I found another "inconsistent" handling of NaN in Presto: distinct aggregation considers two NaN to be duplicates, while set_agg() function considers two NaN to be distinct.

SELECT
    SET_AGG(c0)
FROM (
    VALUES
        (1.1),
        (NAN()),
        (NAN())
) t(c0); --- result is [1.1, NaN, NaN]
SELECT DISTINCT
    c0
FROM (
    VALUES
        (1.1),
        (NAN()),
        (NAN())
) t(c0); --- result are two rows: 1.1 and NaN

Since the Presto documentation of set_agg says "Returns an array created from the distinct input x elements", I expect these two queries to treat NaN in the same way.

@rschlussel
Copy link
Contributor

Thank you, yes, i did know about this difference. I made a wiki about how we handle nans across all of our functions that do comparison or ordering on doubles https://github.com/prestodb/presto/wiki/Presto-NaN-behavior

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

No branches or pull requests

8 participants