Fix equality check for simple floating types in RowContainer #7780

Closed

@czentgr czentgr commented Nov 29, 2023

The row container implementation of equalsNoNulls and equalsWithNulls contained a bug. This PR:

  1. Fixes the incorrect equality check for floating point types when NaN values are used.
  2. Refactors RowContainer and ContainerRowSerde to use SimpleVector::comparePrimitiveAsc as a common comparison function.
  3. Changes static SimpleVector::comparePrimitiveAsc to static inline to reduce function call overhead in this expanded usage.

This is a continuation of PR #5833 which addressed floating point comparisons for complex types.

Affected operators:
FilterProject, TopN, TopNRowNumber, OrderBy, MergeExchange, LocalMerge, HashProbe, NestedLoopJoinProbe

The list may not be complete.
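
To illustrate the bug in item 1: under IEEE 754, `==` reports NaN as unequal to itself, so a plain equality check treats two rows that both hold NaN as different. The following is a minimal sketch, not the Velox implementation. `compareAscNanAware` and `equalsNanAware` are hypothetical names, and the ordering used here (NaN equal to NaN and greater than every other value) is an assumption chosen for illustration.

```cpp
#include <cmath>

// Hypothetical helper (not the Velox API): three-way ascending comparison
// that assumes NaN == NaN and NaN > every non-NaN value.
template <typename T>
inline int compareAscNanAware(T left, T right) {
  const bool leftNan = std::isnan(left);
  const bool rightNan = std::isnan(right);
  if (leftNan) {
    return rightNan ? 0 : 1;
  }
  if (rightNan) {
    return -1;
  }
  return left < right ? -1 : (left == right ? 0 : 1);
}

// Equality derived from the three-way comparison above.
template <typename T>
inline bool equalsNanAware(T left, T right) {
  return compareAscNanAware(left, right) == 0;
}
```

Deriving equality from a single comparison function keeps sorting and equality checks consistent with each other, which appears to be the motivation for item 2.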


@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 29, 2023
@czentgr czentgr marked this pull request as ready for review December 4, 2023 15:20
@czentgr czentgr force-pushed the cz_fix_equal_nan branch 2 times, most recently from bfb6fe7 to 99a2445 Compare December 11, 2023 20:11
auto rowContainer = makeRowContainer({type}, {type}, false);
int numRows = values->size();
DecodedVector decodedWithNulls(*values);
auto rows = storeRows(decodedWithNulls, numRows, *rowContainer);
Collaborator

It looks like the test loops are very similar. Can you abstract them into a common function?

czentgr commented Jan 29, 2024

@aditi-pandit @mbasmanova Please review. I addressed @aditi-pandit's earlier comment. Thanks!

@@ -466,8 +466,40 @@ class RowContainerTest : public exec::test::RowContainerTestBase {
}
}

template <typename T, bool mayHaveNulls>
Collaborator

Nit: "mayHaveNulls" sounds ambiguous. Could you use just hasNulls?

Collaborator Author

I used the same name as the one used for the equals function. The template parameter doesn't indicate whether the input vector has nulls; the input actually always has nulls because I didn't want to generate two vectors in the caller. Instead, the parameter is passed to the equals function to tell it whether to expect nulls. Thus, if it is false, null values must be removed from the input vector.

Perhaps it should be named equalsCanHandleNulls or something?
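
The role of the boolean template parameter can be sketched as follows. This is a simplified illustration, not RowContainer's code: `Cell` and `equalsCell` are hypothetical names, and a `std::optional` stands in for a nullable value. With the flag set to false, the caller promises that no nulls are present and the null checks are compiled out.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for a nullable value; std::nullopt represents NULL.
using Cell = std::optional<double>;

// Sketch of an equality check whose null handling is selected at compile time.
// With mayHaveNulls == false the caller must guarantee both inputs are non-null.
template <bool mayHaveNulls>
bool equalsCell(const Cell& left, const Cell& right) {
  if constexpr (mayHaveNulls) {
    if (!left.has_value() || !right.has_value()) {
      // Assumption for this sketch: two NULLs compare equal.
      return left.has_value() == right.has_value();
    }
  } else {
    assert(left.has_value() && right.has_value());
  }
  return *left == *right;
}
```

This is why the test has to strip nulls from its input before calling the `false` instantiation.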

@@ -466,8 +466,40 @@ class RowContainerTest : public exec::test::RowContainerTestBase {
}
}

template <typename T, bool mayHaveNulls>
Collaborator

I don't see template type 'T' used in the method. Is something missing?

Collaborator Author

No, you are right. It is not needed. Which also means I can remove it from the other test as well. It is used in callers to distinguish the double/float type but it is not needed at this level anymore.


int32_t index{0};
for (auto row : rows) {
ASSERT_TRUE(rowContainer->equals<canHandleNulls>(
Collaborator

We should add some tests for rowContainer->equals returning false as well.

Collaborator Author

What kind of values do you have in mind for testing the false case?
This test uses the floating point edge values, so should they be compared against each other? Or against some random floating point values?

Collaborator

Yeah, I was thinking of one edge value against another, say NaN with max/min, and also with some regular random floating point values.

Collaborator Author

Ok, I added some more tests to test random and edge values against NaN values.

}

auto numRows = values->size();
auto valuesSlice = values->slice(numNulls, numRows - numNulls);
Contributor

I want to make sure I don't misunderstand this: say you have 10 rows and 1 null, then you take the slice from the 2nd row to the last. Are the nulls always going to be at positions 0 to numNulls?

Collaborator Author

Yes, the test uses a specific input vector of edge values where the initial row is a NULL. It then uses a subset to specifically test the rowContainer->equals<false> where no NULL values can be part of the input.

Practically, that means numNulls == 1.

An example for an input vector is:

auto values = makeNullableFlatVector<T>(

There are tests for the various row types and they always have a specific input to test the edge cases.
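
The slicing step described above can be sketched with standard containers. This is an illustration only: `countLeadingNulls` and `sliceAfterNulls` are hypothetical helpers, and a `std::vector<std::optional<T>>` stands in for the nullable test vector.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: counts NULL entries at the front of the input.
template <typename T>
std::size_t countLeadingNulls(const std::vector<std::optional<T>>& values) {
  std::size_t numNulls = 0;
  while (numNulls < values.size() && !values[numNulls].has_value()) {
    ++numNulls;
  }
  return numNulls;
}

// Returns the non-null tail, mirroring slice(numNulls, numRows - numNulls).
// Assumes all NULLs sit at the front, as in the test input described above.
template <typename T>
std::vector<T> sliceAfterNulls(const std::vector<std::optional<T>>& values) {
  const std::size_t numNulls = countLeadingNulls(values);
  std::vector<T> result;
  result.reserve(values.size() - numNulls);
  for (std::size_t i = numNulls; i < values.size(); ++i) {
    // value() throws if a NULL appears after the leading run.
    result.push_back(values[i].value());
  }
  return result;
}
```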

@aditi-pandit aditi-pandit left a comment

Tests look good. Thanks @czentgr

template <typename T, typename V>
void testOrderAndNullsFirstVariations(
template <bool canHandleNulls>
void testRowContainerEqualAPI(
Collaborator

Nit : Spelling "Equals"

@aditi-pandit
Collaborator

@mbasmanova : PTAL.

@czentgr czentgr left a comment

@bikramSingh91 Thank you for your detailed review and comments!
I addressed them. Please take a look.

for (auto row : rows) {
ASSERT_EQ(
expected->asFlatVector<bool>()->valueAt(index),
rowContainer->equals<canHandleNulls>(
Collaborator Author

That's a good point.

while (numRows) {
auto value = folly::Random::randDouble(min, max, gen);
rawData.push_back(value);
if (value == min || value == lowest || value == max) {
Collaborator Author

You are correct. We always compare against NaN, so the comparison with any generated value or edge value should be false.

@facebook-github-bot
Contributor

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kagamiori
Contributor

Hi @czentgr, this change looks good to me. But I wonder why the PR summary says that this change affects functions such as array_distinct and array_max? I saw the change affects RowContainer, but array_distinct and array_max do not use RowContainer, right?

czentgr commented Feb 29, 2024

@kagamiori You are correct. The functions don't operate on the row container and the statement in the commit/PR is incorrect. I took a closer look at the functions and it turns out there are more problems with NaN that aren't addressed by this fix.

For example, results of array_sort with Prestissimo (including the fix in this PR)

presto> select array_sort(a) from (values ARRAY[nan(), 0.0e0, 1.0e0]) t(a);
      _col0
-----------------
 [0.0, NaN, 1.0]
(1 row)

Java

presto>  select array_sort(a) from (values ARRAY[nan(), 0.0e0, 1.0e0]) t(a);
      _col0
-----------------
 [0.0, 1.0, NaN]
(1 row)

The same problem exists for array_sort_desc.
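
The Java result above corresponds to an ordering in which NaN sorts after every other value. A minimal sketch of such a sort using the standard library follows. `sortAscNanLast` is a hypothetical name, and this is not how either engine implements array_sort.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: sorts ascending and places NaN values last.
// The comparator is a strict weak ordering even when NaN is present,
// which plain operator< is not.
inline void sortAscNanLast(std::vector<double>& values) {
  std::sort(values.begin(), values.end(), [](double left, double right) {
    if (std::isnan(left)) {
      return false; // NaN is never less than anything.
    }
    if (std::isnan(right)) {
      return true; // Any non-NaN value is less than NaN.
    }
    return left < right;
  });
}
```

A descending variant would need its own decision on where NaN goes, which is presumably why array_sort_desc shows the same discrepancy.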

I also think there is a problem with set_union but both Java and Prestissimo return the same result. I expect NaN to be in the result once (as we consider NaN == NaN)

presto> SELECT set_union(a) FROM ( VALUES ARRAY[1.0, nan(), 3.0], ARRAY[nan(), 3.0, 4.0]) AS t(a);
           _col0
---------------------------
 [1.0, NaN, 3.0, NaN, 4.0]
(1 row)
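
For comparison, the expected result (NaN appearing once) would follow from a union that deduplicates with a NaN-aware equality. The sketch below is illustrative only: `sameValue` and `unionDistinct` are hypothetical names and do not reflect how set_union is implemented in either engine.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: NaN-aware equality, assuming NaN == NaN.
inline bool sameValue(double left, double right) {
  return (std::isnan(left) && std::isnan(right)) || left == right;
}

// Sketch of a union that keeps the first occurrence of each value.
// Quadratic on purpose to keep the example short.
inline std::vector<double> unionDistinct(
    const std::vector<std::vector<double>>& arrays) {
  std::vector<double> result;
  for (const auto& array : arrays) {
    for (double value : array) {
      bool seen = false;
      for (double existing : result) {
        if (sameValue(existing, value)) {
          seen = true;
          break;
        }
      }
      if (!seen) {
        result.push_back(value);
      }
    }
  }
  return result;
}
```

With the two input arrays from the query above, this sketch keeps a single NaN and yields four elements instead of five.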

kagamiori commented Feb 29, 2024

Hi @czentgr, thank you for updating the PR summary. We have recently noticed the NaN comparison issue in Velox functions too. (See #8738 and #8690) In fact, tracing back the behavior of NaN handling, we found that Presto has inconsistent NaN behaviors within and among functions too. There are two GitHub issues in the Presto repo related to this problem: prestodb/presto#21877 and prestodb/presto#21936. Our current plan is to let Presto clarify (or fix) and document the NaN behavior of functions and then follow Presto in Velox. Please take a look at those GitHub issues if you're interested.

czentgr commented Mar 4, 2024

@bikramSingh91 Could you please tell me about the linter warnings? I would like to fix them and also rebase the PR again. Did I add new warnings in this PR?

czentgr commented Mar 5, 2024

@kagamiori Thank you for pointing out these issues. I have a documentation PR for Velox to clarify the NaN behavior there. It is adopted from the Spark behavior, which makes the most sense and provides consistency. Presumably, Presto should follow this as well: #7237.

@kagamiori
Contributor

Hi @czentgr, @bikramSingh91 is out of office this week. The internal linter warnings come from the fact that the newly added methods in RowContainerTest.cpp define a set of variables through TEST_FLOATING_TYPE_LIMIT_VARIABLES, but not all of the defined variables are used in these methods (i.e., unused-variable warnings).

Comment on lines +584 to +600
while (numRows) {
auto value = folly::Random::randDouble(min, max, gen);
// Intersperse nan values.
if (static_cast<int64_t>(std::fmod(value, 3.0)) == 0) {
rawData.push_back(nan);
rawExpected.push_back(true);
} else {
rawData.push_back(value);
rawExpected.push_back(false);
}
--numRows;
}
Contributor

It looks like this piece of code implicitly assumes that the other vector to be compared against the result of this method is an all-NaN vector. But this assumption is not stated in the name of the method or in any comment. Could we add some comments to this method explaining what it does?

const VectorPtr& lhs,
const VectorPtr& rhs,
const std::vector<bool>& rawExpected) {
TEST_FLOATING_TYPE_LIMIT_VARIABLES;
Contributor

Is this macro not needed in this method?

int32_t numRows,
std::vector<std::optional<T>>& rawData,
std::vector<bool>& rawExpected) {
TEST_FLOATING_TYPE_LIMIT_VARIABLES;
Contributor

This macro results in a lot of unused-variable warnings. Can you replace it with something like: auto [max, min, _ , , , _] = FLOATING_TYPE_LIMIT_VARIABLES();
where the function returns a const std::tuple and you only give names to the ones you need?

Collaborator Author

I tried to use the tuple approach. The issue is that using more than one _ fails to compile because the identifier repeats; _ is not treated as a placeholder. From posts I read, unused-variable warnings would also apply to _, since it is just an ordinary name for one of the returned entries.

Example:

/Users/czentgr/gitspace/velox/velox/exec/tests/RowContainerTest.cpp:630:25: error: redefinition of '_'
    const auto [nan, _, _, _, _] = getTestFloatingTypeLimitVariables<T>();
                        ^

I turned on -Wunused-variable to see what the compiler does: on macOS (clang) it doesn't trigger the error if the entries are named but not used. However, the linter might still complain.

In most functions all the special values should be used except in a few where only NaN is used. So I will just not use the macro there. I experimented with a templated struct instead of the macro but this also causes more changes.

@czentgr czentgr force-pushed the cz_fix_equal_nan branch 2 times, most recently from 2441348 to cbe3d03 Compare March 6, 2024 18:10
@facebook-github-bot
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kagamiori kagamiori left a comment

LGTM. Thank you for helping address the linter warning.

@kagamiori
Contributor

Hi @czentgr, could you rebase this PR onto the latest main? It's required for committing the code. Thanks!

czentgr commented Mar 12, 2024

@kagamiori done. Thanks!

@facebook-github-bot
Contributor

@kagamiori merged this pull request in 451db90.

@czentgr czentgr deleted the cz_fix_equal_nan branch March 19, 2024 13:10
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…kincubator#7780)

Pull Request resolved: facebookincubator#7780

Reviewed By: Yuhta

Differential Revision: D54141907

Pulled By: kagamiori

fbshipit-source-id: 0306cfaffd4d486a0b72f6e6b659b40b2d66688f