Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column renaming for the duration of a function call #4718

Open
aljazerzen opened this issue Jul 10, 2024 · 9 comments
Open

Column renaming for the duration of a function call #4718

aljazerzen opened this issue Jul 10, 2024 · 9 comments
Labels
language-design Changes to PRQL-the-language

Comments

@aljazerzen
Copy link
Member

aljazerzen commented Jul 10, 2024

@snth has recently raised an issue that I remember we have already talked about, but cannot find now. I do think is a pain point and something we should do do better.

While on the topic of functions, one thing that I think is really needed is a way to alias columns in a relation passed to functions. I've thought a bit about this and I don't really know of an elegant way to do that.

Take for example a simple take_smallest function I use in my presentations:

let take_smallest = func n tbl -> (
  from tbl
  sort bytes
  take n
)

from tracks
group {album_id} (
  take_smallest 3
)

This works but bytes is hardcoded and you would really want that to be a parameter of take_smallest.

One workaround would be the following

let take_smallest = func n tbl -> (
  # Requires a _sort_by column to be set
  from tbl
  sort _sort_by
  take n
)

from tracks
group {album_id} (
  derive _sort_by=milliseconds
  take_smallest 3
  select !{_sort_by}
)

That works but you kinda want that derive _sort_by= ... select !{_sort_by} to happen implicitly inside the function definition.

@aljazerzen aljazerzen added the language-design Changes to PRQL-the-language label Jul 10, 2024
@aljazerzen
Copy link
Member Author

I think that a functional language would solve it like this:

let take_smallest = func n key_getter tbl -> (
  from tbl
  sort (key_getter this)
  take n
)

from tracks
group {album_id} (
  take_smallest 3 (func this -> this.milliseconds)
)

This currently fails miserably as functions are not really first-class citizens of PRQL.

@kgutwin
Copy link
Collaborator

kgutwin commented Jul 10, 2024

What about just passing the column as an argument to the function? This seems to work fine for me...

let take_smallest = func n <int> c <scalar> tbl <relation> -> (
  from tbl
  sort c
  take n
)

from tracks
group {album_id} (
  take_smallest 3 bytes
)

Compiled SQL:

WITH table_0 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY album_id
      ORDER BY
        bytes
    ) AS _expr_0
  FROM
    tracks
)
SELECT
  *
FROM
  table_0
WHERE
  _expr_0 <= 3

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

Interestingly, if you omit the <relation> type specification from the function definition, you get a weird "Unknown name" error...

@snth
Copy link
Member

snth commented Jul 11, 2024

Thanks @kgutwin. I didn't know that could work. What does the <scalar> type represent, here or in the compiler?

Interestingly, I found that the <scalar> type annotation isn't necessary but the <relation> one is.

My minimal working example is:

let take_smallest = func n c tbl<relation> -> (
  from tbl
  sort c
  take n
)

from tracks
group {album_id} (
  take_smallest 3 bytes
)

🚀

@snth
Copy link
Member

snth commented Jul 11, 2024

This is very cool! There seem to be quite a few corner cases though.

First a fun little example to play around with:

let f = func n c tbl<relation> -> (from tbl | sort c | take n | select c)

let t = (from [{a=1},{a=2},{a=3}] | derive b=-a)

from t
f 1 a     # works

produces

[
  {"a": 1}
]

We can also rename the column, including to c with

let f = func n c tbl<relation> -> (from tbl | sort c | take n | select {c=c})

from t
f 1 b

producing

[
  {"c": -3},
  {"c": -2}
]

So far, so good.

However

let f = func n c tbl<relation> -> (from tbl | sort c | take n | select {c, c=c})

from t
f 1 b

produces

[
  {"c": -3, "c": -3}
]

and SQL

WITH table_0 AS (
  SELECT
    1 AS a
  UNION
  ALL
  SELECT
    2 AS a
  UNION
  ALL
  SELECT
    3 AS a
),
t AS (
  SELECT
    a,
    - a AS b
  FROM
    table_0
)
SELECT
  b AS c,
  b AS c
FROM
  t
ORDER BY
  c
LIMIT
  1

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

I think the expected output would be:

[
  {"b": -3, "c": -3}
]

and SQL

...
SELECT
  b,
  b AS c
FROM
  t
ORDER BY
  c
LIMIT
  1

@snth
Copy link
Member

snth commented Jul 11, 2024

I tried to see how far I can push this, for example with joins.

Given the following simple setup, would it be possible to parameterise the relation and column being joined?

let t = (from [{a=1},{a=2},{a=3}] | derive b=-a)
let u = (from t | select {a=a+1, b=b+1})

from t
join u (==b)

Unfortunately none of the following worked for me:

Simple:

let j = func s<relation> c r<relation> -> (from r | join s (==c))

from t
j u a

One level of indirection:

let j = func s<relation> c r<relation> -> (
  from r
  derive {_i=c}
  join (from s | derive {_i=c}) (==_i)
)

from t
j u a

Explicit:

let j = func s<relation> c r<relation> -> (
  from r
  derive {_i=c}
  join (from s | derive {_i=c}) (this._i==that._i)
)

from t
j u a

They all produce the following error (which is why I tried that last example):

Error: 
    ╭─[:28:5]
    │
 28 │ j u a
    │     ┬  
    │     ╰── Ambiguous name
    │ 
    │ Help: could be any of: that.u.a, this.t.a
    │ 
    │ Note: available columns: t.a, t.b, u.a, u.b
────╯

@kgutwin
Copy link
Collaborator

kgutwin commented Jul 11, 2024

Here's an interesting behavior (I was experimenting a bit with some of your examples):

from t
select {b, d=b}

produces:

SELECT
  b,
  b AS d
FROM
  t

But:

let g = func c tbl <relation> -> (
  from tbl
  select {c, d=c}
)
from t
g b

produces:

SELECT
  b AS d,
  b AS d
FROM
  t

That seems like it would be a bug. I would expect the second example to produce the same output as the first.

@aljazerzen
Copy link
Member Author

Oh god, what have we created!

The "passing column into the function" is not something that was designed purposefully - it's an artifact of how resolver is implemented, so it might change in the future. And also, it is not even fully correct as you two have found on many examples :D

@kgutwin
Copy link
Collaborator

kgutwin commented Jul 11, 2024

I'd argue in favor of keeping this sort of behavior (well, fixing the bugs and inconsistencies, of course). It feels really straightforward to pass a column reference to a user-created function, just like you pass column references to standard library functions.

@aljazerzen
Copy link
Member Author

That's the thing: we don't have a clear idea what are references in PRQL; std functions are taking arguments by value (at least in my understanding). We'd need to formalize this notion of column references (that don't have a type), but are just names that the function is supposed to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language
Projects
None yet
Development

No branches or pull requests

3 participants