Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong output when tables are used and columns have same name #4752

Open
2 tasks done
cottrell opened this issue Jul 17, 2024 · 10 comments
Open
2 tasks done

Wrong output when tables are used and columns have same name #4752

cottrell opened this issue Jul 17, 2024 · 10 comments
Labels
bug Invalid compiler output or panic

Comments

@cottrell
Copy link

What happened?

Deriving columns with same name as existing results in confusing or wrong output.

image
image

PRQL input

from invoices
derive {
  total = invoices.total * 0,
  a = invoices.total * 0,
}
into data

from data
select {
total,
data.total,
a
}

SQL output

WITH data AS (
  SELECT
    *,
    total * 0 AS total,
    total * 0 AS a
  FROM
    invoices
)
SELECT
  total,
  total,
  a
FROM
  data

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

Expected SQL output

No response

MVCE confirmation

  • Minimal example
  • New issue

Anything else?

No response

@cottrell cottrell added the bug Invalid compiler output or panic label Jul 17, 2024
@max-sixty
Copy link
Member

max-sixty commented Jul 18, 2024

Our semantics should be a bit clearer here, but what would you want the output to be? Currently PRQL returns:

SELECT
  total,
  total,
  a

...based on

total,
data.total,
a

What would we expect? ("Expected SQL Output" is useful there!)

@cottrell
Copy link
Author

Actually, is this a problem with DuckDB? Or with PRQL?

@max-sixty
Copy link
Member

What should the SQL be?

@cottrell
Copy link
Author

I think the problem is with the select * and, at least in DuckDB, it appears to be order dependent what you get as an answer.

image

The following (putting the '*' last seems to achieve the correct results.

WITH data AS (
  SELECT
    total * 0 AS total,
    total * 0 AS a,
   *
  FROM
    invoices
)
SELECT
  total,
  total,
  a
FROM
  data

Additionally, this PRQL works properly

from invoices
select {
total
}
derive {
  total = invoices.total * 0,
  a = invoices.total * 0,
}
into data

from data
select {
total,
data.total,
a
}

with sql

WITH data AS (
  SELECT
    total AS _expr_0,
    total * 0 AS total,
    total * 0 AS a
  FROM
    invoices
)
SELECT
  total,
  total,
  a
FROM
  data

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

I think the select * should not be there as you know the columns you need if you crawl the DAG. At minimum the * should appear last I think but I'm not sure if that behaviour is the same across SQL dialects.

@aljazerzen
Copy link
Member

This is a bug.

(I hate wildcard columns)

@max-sixty
Copy link
Member

We discussed this on the call. It's not an easy problem from the compiler perspective since it doesn't know whether it's creating duplicate columns, we'll need to think more.

For a workaround — replacing derive with select there should work. (Doesn't solve the underlying issue in PRQL, but should fix the specific case)

@aljazerzen
Copy link
Member

Minimal example:

from invoices
derive {
  total = total * 0,
}
into data

from data
select {
  total,
}

... produces:

WITH data AS (
  SELECT
    *,
    total * 0 AS total
  FROM
    invoices
)
SELECT
  total
FROM
  data;

The problem is that CTE data will two columns named total: one from invoices and one defined in the SELECT clause. When using total in the main query:

  • DuckDB will use the first column named total and ignore the second one,
  • PostgreSQL will raise "ambiguous name" error.

@aljazerzen
Copy link
Member

A correct SQL output to the above input would be:

WITH data AS (
  SELECT
    * EXCLUDE (total),  -- exclude total from wildcard
    total as _expr_0,   -- include total under a different name
    total * 0 AS total  -- this column will now be the only column named total
  FROM
    invoices
)
SELECT
  total
FROM
  data;

Unfortunately this only works for SQL dialects that support a way of excluding columns from a wildcard.

With say PostgreSQL, this is not possible to express because:

  • we don't know all of the column of invoice, so we need to use wildcard select, which will include column total,
  • we need to also include the computed column aliased to total,
  • this produces a relation with multiple columns named total so we cannot refer to them properly.

@cottrell
Copy link
Author

cottrell commented Jul 22, 2024

I think for everyone's benefit, it would be good to lean hard into "please never use wildcards in anything that matters". It really isn't that hard to use strict column naming but everyone needs to be reminded to do it.

If most things work in wildcard mode that is convenient for quick tests and development, but really with PRQL and what it offers, you are going to suffer if you forget to remove wildcards.

Like "strict mode" or something might be reasonable branding.

I would even perhaps enjoy the ability to have some kind of flavour or option I could add to disallow wildcards in the query as this would remind me and throw if I break the rule.

@aljazerzen
Copy link
Member

aljazerzen commented Jul 22, 2024

That's a good idea. A strict mode for PRQL where it will not infer table types and deal with "unknown columns", but instead require that the table types be specified by the user.

This can already be done in current versions of PRQL:

module default_db {
  let invoices <[{
    invoice_id = int,
    customer_id = int,
    invoice_date = date,
    total = float,
  }]>
}

from default_db.invoices
derive {
  total = total * 0,
}
into data

from data
select {
  total,
}

... this will produce correct SQL, that will return total of all 0, as it should.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic
Projects
None yet
Development

No branches or pull requests

3 participants