Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map_top_n returns wrong results if NaN appears in the input #22040

Closed
rschlussel opened this issue Feb 28, 2024 · 3 comments · Fixed by #22386
Closed

map_top_n returns wrong results if NaN appears in the input #22040

rschlussel opened this issue Feb 28, 2024 · 3 comments · Fixed by #22386
Assignees
Labels

Comments

@rschlussel
Copy link
Contributor

This issue was discovered as part of an audit of all the comparison and ordering behaviors for NaN across Presto functions (related to #21936 and #21877).

While there are a lot of inconsistencies in how NaN are handled that need to be addressed, map_top_n can produce definite wrong results when NaN values shows up in the map.

According to the documentaion, map_top_n "Truncates map items. Keeps only the top N elements by value. n must be a non-negative integer"

In the presence of NaN values, NaN seems to "reset" the search for topn entries

 
select map_top_n(map(array['a', 'b', 'c'], array[nan(), 3, 2]),1);
  _col0               
---------
 {b=3.0} 
(1 row)

-- BUG! regardless of interpretation of NaN, 2 is always less than 3. So this result is definitely incorrect
 select map_top_n(map(array['a', 'b', 'c'], array[3, nan(), 2]),1);
  _col0               
---------
 {c=2.0} 
(1 row)

select map_top_n(map(array['a', 'b', 'c'], array[3, 2, nan()]),1);
  _col0               
---------
 {c=NaN} 
(1 row)
@rschlussel rschlussel added the bug label Feb 28, 2024
@tdcmeehan
Copy link
Contributor

Note that map_top_n is just a SQL function, so likely the bug or unexpected behavior is in one of its constituent functions (such as array_sort).

@rschlussel
Copy link
Contributor Author

rschlussel commented Feb 28, 2024

It's not a bug in array_sort itself, but in the lambda function that we are passing to it. Array_sort without a function argument sorts with NaN as the largest. This passes a function that does element comparison for the values, but < and = comparison with NaN always return false.

The relevant part of the sql function is here:

`array_sort(map_entries(map_filter(input, (k, v) -> v is not null)), (x, y) -> IF(x[2] < y[2], 1, IF(x[2] = y[2], IF(x[1] < y[1], 1, -1), -1)))`

@mbasmanova
Copy link
Contributor

This is a great catch!

A side note, array_sort with a lambda is quite dangerous. It is very easy to write lambda that doesn't match implicit requirements (like in this case): https://velox-lib.io/blog/array-sort

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants