Skip to content

Presto NaN behavior

Elliotte Rusty Harold edited this page Jul 1, 2024 · 13 revisions

We are moving to a consistent semantics with regards to NaN across all functions and operators. This will bring internal consistency to Presto, as well as align us with Velox, as we move towards CPP workers in Presto. The new semantics are enabled by default and can be disabled by setting the configuration property use-new-nan-definition=false.

According to the new definition:

  • NaN = NaN is true for all equality and distinctness purposes
  • NaN is the largest value for all comparison and sorting purposes.

In addition to the NaN changes, we have updated our semantics regarding +0 and -0 to consistently treat +0 and -0 as equal to each other for all equality and distinctness purposes.

You can find more information below about where this definition differs from the previous behavior, and how you can handle it for your queries.

What is nan?

NaN, short for "not a number", is a value in floating point types (DOUBLE and REAL) that indicates an invalid result of a mathematical operation. For example 0.0/0.0 returns NaN, as does sqrt(-1). NaN is different from NULL, which indicates an absent value.

How will these changes affect my queries

The new NaN semantics may cause your queries to produce different results by returning different values than previously or returning a different number of rows. Below are places where the new Nan semantics differ from the old.

Comparison operators

Comparison operators are by far the most common cause of results differences between the old and new semantics, especially comparisons such as double_column > value.

> and < operators

Filters such as x > 0.5 previously would not include NaNs. Now, since NaNs are considered larger than all other numbers, it will include NaN values. This is the most common cause of results differences we see with the new NaN changes. Sometimes this is from NaN values stored in tables. Other times it comes from comparison after division operations where you end up with 0.0 / 0.0. e.g. WHERE (col1 / col2) > 0.5, but col1 and col2 are both 0.0.

SELECT x > 0.5 FROM (VALUES nan()) t(x)

Previous results: 
 _col0                
-------
 false 
(1 row)

New results:
 _col0                
-------
 true 
(1 row)
SELECT x/y > 0.5 FROM (VALUES (0.0, 0.0))t(x, y)

Previous results: 
 _col0                
-------
 false 
(1 row)

New results:
 _col0                
-------
 true 
(1 row)

= and IN operators

Filters such as x = y now return true if both values are NaN, whereas previously they returned false.

SELECT x = x FROM (values nan())t(x)
Previous results:
 _col0                
-------
 false 
(1 row)

New results:
 _col0                
-------
 true 
(1 row)

<> and NOT IN operators

x <> y now returns false if both values are NaN. Previously it returned true.

SELECT x <> x FROM (values nan())t(x)
Previous results:
 _col0                
-------
 true 
(1 row)

New results:
 _col0                
-------
 false 
(1 row)

Functions

min, max, min_by, max_by

These functions now consider NaNs largest. Previously they returned NaN if it was the first value encountered and otherwise they returned the min or max non-NaN value.

array_min

Previously array_min returned NaN if there was any NaN in the input. Now it will consider NaN as the largest value.

SELECT array_min(ARRAY[1, nan()]);

Previous results:
 _col0                
-------
   NaN 
(1 row)

New results:
 _col0                
-------
     1 
(1 row)

greatest, least

Previously these functions threw an error if there was NaN in the input. Now they will consider NaN largest.

SELECT greatest (1, nan());
Previous results:
Query 20240627_211145_97754_m2atc failed: Invalid argument to greatest(): NaN

New results:
 _col0                
-------
   NaN 
(1 row)

set_agg, set_union, map_agg, multimap_agg, map_union and other map and array functions that are supposed to return a distinct set of elements in an array or map

Previously these did not deduplicate NaN values and maps could also contain multiple NaN keys. Now they will deduplicate NaN values and maps can only contain one NaN key.

SELECT set_agg(x) from (values nan(), nan()) t(x);

Previous results:
   _col0    
------------
 [NaN, NaN] 
(1 row)
New results:
   _col0    
------------
 [NaN] 

contains, element_at, arrays_overlap and other functions that find or match an element in a map or array

Previously these functions would not match NaN values in the array or map. Now these functions will match NaN to itself.

SELECT contains(ARRAY[nan()], nan())

Previous results:
 _col0                
-------
 false 
(1 row)

New results:
 _col0                
-------
 true 
(1 row)

map_top_n

map_top_n() previously returned wrong results in the presence of NaNs. Now it will consider NaN largest.

select map_top_n(map(array['a', 'b', 'c'], array[3, nan(), 2]),1)
Previous results (bug!):
  _col0               
---------
 {c=2.0} 
(1 row)

New results:
  _col0               
---------
 {b=NaN} 
(1 row)

User defined SQL functions

A user defined SQL function may have behavior differences with the new NaN semantics if it calls a function that has differences with the new NaN semantics

Joins

NaN join keys now match. Previously they would not. Example:

with nan_input as (SELECT * FROM (values nan())t(x)) SELECT t1.x FROM nan_input t1 JOIN nan_input t2 on t1.x = t2.x;

Previous results:
 x 
---
(0 rows)

New results:
 x 
---
NaN
(1 row)

GROUP BY and DISTINCT for REAL types

Previously we did not deduplicate NaNs when performing grouping operations on REAL types. Now we deduplicate NaNs and consider them all part of the same group for a "GROUP BY". There is no change for DOUBLE types which already had this behavior.

SELECT col, count(*) FROM (VALUES cast(nan() AS REAL), CAST(nan() AS REAL))t(col) GROUP BY col;

Previous results:
 col | _col1 
-----+-------
 NaN |     1 
 NaN |     1 
(2 rows)

New results:
 col | _col1 
-----+-------
 NaN |     2 
(1 row)

Casts

In addition to the comparison and ordering differences, a bug was fixed regarding casting NaN and infinity from DOUBLE or REAL to BIGINT, INT, TINYINT, or SMALLINT. These casts now throw an error, as NaN and infinity cannot be represented in integer types. Previously all these casts other than DOUBLE -> BIGINT would incorrectly return 0.

+0 and -0

Previously joins, group bys, distincts, and many functions treated +0 and -0 as distinct from each other. Now they will be treated as equal to each other. They will match as join keys, be grouped to a single group, and be considered equals and not distinct for all purposes.

SELECT DISTINCT x FROM (VALUES (DOUBLE '0.0'), (DOUBLE '-0.0')) t(x)

Previous results:
  x                   
------
  0.0 
 -0.0 
(2 rows)

New results:
  x                   
------
  0.0 
(1 row)
SELECT * FROM (VALUES DOUBLE '0.0')t1(x) JOIN (VALUES DOUBLE '-0.0') t2(x) ON t1.x = t2.x

Previous results:
 x | x 
---+---
(0 rows)

New results:
 x  |  x             
-----+-----
 0.0 | 0.0 
(1 row)

How can I filter out NaNs in my data?

In most cases where NaNs show up, people do not realize that they may be producing or processing NaNs.

If you would like to filter out NaNs from your data in a way that is compatible with both the old and new semantics, you can use the is_nan() function. For example: SELECT * FROM my_table WHERE NOT is_nan(my_double_column). You can also use is_finite() to filter out both NaNs and infinities. For example: SELECT * FROM my_table WHERE is_finite(my_double_column).

You can also prevent yourself from creating NaN by filtering inputs to certain functions that could produce NaNs. For example, you might check that the denominator is not 0 before performing a division operation. SELECT col1/col2 as ratio FROM my_table WHERE col2 <> 0.

Clone this wiki locally