Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Leaking" of _expr_* columns into result relations #4719

Open
2 tasks done
snth opened this issue Jul 11, 2024 · 3 comments
Open
2 tasks done

"Leaking" of _expr_* columns into result relations #4719

snth opened this issue Jul 11, 2024 · 3 comments
Labels
bug Invalid compiler output or panic

Comments

@snth
Copy link
Member

snth commented Jul 11, 2024

What happened?

prqlc produces SQL which uses _expr_* columns internally for internal intermediate calculation results. These should never end up in the resulting output relation.

The current situation is understandable since not all SQL dialects support the SELECT EXCLUDE _expr_* syntax and without schema information about the input relations, the compiler cannot explicitly enumerate the output columns.

However if we think about PRQL from first principles as "a language for Relational Algebra" then I hazard the guess that we all agree that such internal implementation details should not form part of the output.

As pointed out in #4633 (comment), such extraneous columns can also cause errors in downstream calcuations.

One of my personal ambitions for PRQL is to extend it beyond the current SQL backends to also include other "relational" languages and query engines such as Pandas, Polars, (Power Query) M Language, etc... to name just a few. In those cases the _expr_* columns should also not show up and ideally PRQL should produce consistent results between different backends.

PRQL input

from [{a=1}]
derive a=2

SQL output

WITH table_0 AS (
  SELECT
    1 AS a
)
SELECT
  a AS _expr_0,
  2 AS a
FROM
  table_0

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

Expected SQL output

WITH table_0 AS (
  SELECT
    1 AS a
)
SELECT
  2 AS a
FROM
  table_0

MVCE confirmation

  • Minimal example
  • New issue

Anything else?

I'm not sure what the solution would be so just raising this as a tracking issue and discussion point.

There are some more examples in #3130 (comment).

@snth snth added the bug Invalid compiler output or panic label Jul 11, 2024
@aljazerzen
Copy link
Member

I think the current result is correct.

from [{a = 3}] has type [{a = int}] (array of tuples with a single field named a)

derive a = 2 will add a new field and because name a already exists in the tuple, it will remove that name, but not the whole field. What remains is this: [{int, a = int}] (an array of tuples with two fields: an unnamed int and an int named a).

We express that in SQL as SELECT ... AS _expr_0, ... AS a.

@kgutwin
Copy link
Collaborator

kgutwin commented Jul 11, 2024

derive a = 2 will add a new field and because name a already exists in the tuple, it will remove that name, but not the whole field. What remains is this: [{int, a = int}] (an array of tuples with two fields: an unnamed int and an int named a).

This reasoning makes sense to me, but might be a bit unexpected to a user, as it seems the most obvious result of derive a = 2 in this case would be to overwrite the existing a field in the tuple. Is it necessary to keep the unnamed field around?

Alternatively, it also makes sense that unnamed fields could be (by default?) filtered out of relations at the end of pipelines.

@aljazerzen
Copy link
Member

Yes, valid points. I don't remember when we decided on this behavior, but I was a discussed decision, not just something that happened accidentally.

Before opening up discussion about these questions, it would be wise to find the issue where we decided on current semantics of derive. Maybe @snth rememeber where was that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic
Projects
None yet
Development

No branches or pull requests

3 participants