Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bring EXTRACT into alignment with PostgreSQL v14 #10027

Merged
merged 23 commits into from
Feb 2, 2022

Conversation

jkosh44
Copy link
Contributor

@jkosh44 jkosh44 commented Jan 13, 2022

Before this commit, EXTRACT was treated as an alias to date_part with slightly
different syntax. Both of these functions returned a float data type. PostgreSQL
v14 updated EXTRACT to return a numeric data type, which makes it compliant with
the SQL standard. This commit updates the EXTRACT function so that it return a
numeric data type so that it matches PostgreSQL. date_part still returns a float.

Additionally PostgreSQL v14 implemented EXTRACT explicitly for the DATE data
type. Previously DATEs were casted to TIMESTAMPs before extracting fields from
them. This commit also explicitly implements EXTRACT for DATE types, so they
aren't cast to a TIMESTAMP first. A consequence of this is that time related
fields (e.g. SECOND) are now rejected when trying to EXTRACT a DATE type.
However, The implementation for date_part still casts DATE types to TIMESTAMP
types. This means that date_part does not reject time related fields for DATE
types. This implementation matches PostgreSQL.

The postgres commit that implements these changes can be found here:
postgres/postgres@a2da77c

This commit also implements extracting EPOCHs from TIME types, which wasn't
previously implemented.

Fixes #9853, #9870, #10027

Motivation

This PR adds a known-desirable feature: #9853 and #9870

Tips for reviewer

The previous implementation has the following structure. Each type (Interval, TimestampLike, etc) implemented methods to extract out the various parts from the type and return an f64. For example Interval had fn extract_quarter(&self) -> f64. These methods can be further broken down into two categories:

  1. Methods that involve floating point arithmetic. These are epoch, second, millisecond, microsecond.
  2. Methods that really output a signed/unsigned integer which is cast to an f64 before returning. These are all the other date/time parts.

The date_part_X_inner functions (ex: date_part_timestamp_inner) would simply call these functions and wrap the result in a Result<Datum<'a>, EvalError>.

This PR changes it to the following structure. Type 1 methods are converted into generic methods that let them do either floating point arithmetic or Numeric arithmetic. Type 2 methods are converted to returning signed/unsigned integers directly. The date_part_X_inner methods are also made generic so that they can be called with either f64 or Numeric. When a date_part_X_inner method calls a type 1 method, is uses the supplied generic type. When a date_part_X_inner method calls a type 2 method, it casts the result to the supplied generic type.

The goal was to try and avoid duplicating every method twice for the different return types. Though after finishing it feels slightly over-engineered so I'm curious to hear if anyone has thoughts/suggestions on the approach.

Also I'm very open to stylistic comments since I'm new to both Rust and the code base.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR adds a release note for any user-facing behavior changes.


/// Used to do potentially lossy value-to-value conversions while consuming the input value. Useful
/// for interoperability between Numeric and f64.
pub trait LossyFrom<T>: Sized {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't super happy with this approach. The issue is that you can't do f64::from(i) if i is type i64 since the conversion is potentially lossy. Which means I can't use the From<i64 trait on generic methods. So if anyone has a better approach please let me know. Also if there's a better place to put this please let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if you have a better name, since it's technically only lossy when used with f64 and not Numeric, let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think d155340 cleans this up slightly with the DecimalLike trait.

Before this commit, EXTRACT was treated as an alias to date_part with slightly
different syntax. Both of these functions returned a float data type. PostgreSQL
v14 updated EXTRACT to return a numeric data type, which makes it compliant with
the SQL standard. This commit updates the EXTRACT function so that it return a
numeric data type so that it matches PostgreSQL. date_part still returns a float.

Additionally PostgreSQL v14 implemented EXTRACT explicitly for the DATE data
type. Previously DATEs were casted to TIMESTAMPs before extracting fields from
them. This commit also explicitly implements EXTRACT for DATE types, so they
aren't cast to a TIMESTAMP first. A consequence of this is that time related
fields (e.g. SECOND) are now rejected when trying to EXTRACT a DATE type.
However, The implementation for date_part still casts DATE types to TIMESTAMP
types. This means that date_part does not reject time related fields for DATE
types. This implementation matches PostgreSQL.

The postgres commit that implements these changes can be found here:
postgres/postgres@a2da77c

This commit also implements extracting EPOCHs from TIME types, which wasn't
previously implemented.

Fixes MaterializeInc#9853, MaterializeInc#9870
@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 13, 2022

Also, in the discussion and commit for postgres they talk about making an optimized numeric division function:

Well, I had an idea that I put to work. In most of these cases where we
need division, we divide an integer by a power of 10. That can be done
with numeric very quickly by just shifting the weight and scale around.
So I wrote a function that does that specifically (look for
int64_div_fast_to_numeric()).

I haven't looked into this in our code base, but if there's interest than I can see if we can make a similar optimization.

{
/// Used to do value-to-value conversions while consuming the input value. Depending on the
/// implementation it may be potentially lossy.
fn lossy_from(i: i64) -> Self;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently I just found out that there's a similar RFC for Rust: rust-lang/rfcs#2484

…sh44/extract

� Conflicts:
�	doc/user/content/release-notes.md
…sh44/extract

� Conflicts:
�	doc/user/content/release-notes.md
…sh44/extract

� Conflicts:
�	doc/user/content/release-notes.md
@benesch
Copy link
Member

benesch commented Jan 24, 2022

So sorry for the delay here, @jkosh44! Looks like this slipped through the cracks. I've pinged a few relevant folks.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 24, 2022

So sorry for the delay here, @jkosh44! Looks like this slipped through the cracks. I've pinged a few relevant folks.

No worries and no rush.

Copy link
Contributor

@sploiselle sploiselle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a soft LGTM on this right now, modulo the comments; I need to take another look at it this evening.

Feel free to hold off on implementing anything mentioned here until I (or someone else) comes back with firmer suggestions. I agree this feels a little over-engineered, but I think that's maybe just an artifact of the trait.

doc/user/content/sql/functions/extract.md Outdated Show resolved Hide resolved
src/expr/src/scalar/func.rs Outdated Show resolved Hide resolved
DatePartTimestamp(units) => date_part_timestamp_inner(*units, a.unwrap_timestamp()),
DatePartTimestampTz(units) => date_part_timestamp_inner(*units, a.unwrap_timestamptz()),
ExtractInterval(units) => {
date_part_interval_inner::<Numeric>(*units, a.unwrap_interval())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not your doing, but all of these unwraps could just get moved inside this inner func:

diff --git a/src/expr/src/scalar/func.rs b/src/expr/src/scalar/func.rs
index 331e1acdb..683d941d6 100644
--- a/src/expr/src/scalar/func.rs
+++ b/src/expr/src/scalar/func.rs
@@ -1735,13 +1735,11 @@ fn date_part_interval(a: Datum, interval: Interval) -> Result<Datum, EvalError>
     }
 }
 
-fn date_part_interval_inner<T>(
-    units: DateTimeUnits,
-    interval: Interval,
-) -> Result<Datum<'static>, EvalError>
+fn date_part_interval_inner<T>(units: DateTimeUnits, a: Datum) -> Result<Datum<'static>, EvalError>
 where
     T: DecimalLike + Into<Datum<'static>>,
 {
+    let interval = a.unwrap_interval();
     match units {
         DateTimeUnits::Epoch => Ok(interval.as_seconds::<T>().into()),
         DateTimeUnits::Millennium => Ok(T::from(interval.millennia()).into()),
@@ -3752,9 +3750,7 @@ impl UnaryFunc {
             CharLength => char_length(a),
             IsRegexpMatch(regex) => Ok(is_regexp_match_static(a, &regex)),
             RegexpMatch(regex) => regexp_match_static(a, temp_storage, &regex),
-            ExtractInterval(units) => {
-                date_part_interval_inner::<Numeric>(*units, a.unwrap_interval())
-            }
+            ExtractInterval(units) => date_part_interval_inner::<Numeric>(*units, a),
             ExtractTime(units) => date_part_time_inner::<_, Numeric>(*units, a.unwrap_time()),
             ExtractTimestamp(units) => {
                 date_part_timestamp_inner::<_, Numeric>(*units, a.unwrap_timestamp())
@@ -3763,7 +3759,7 @@ impl UnaryFunc {
                 date_part_timestamp_inner::<_, Numeric>(*units, a.unwrap_timestamptz())
             }
             ExtractDate(units) => extract_date_inner(*units, a.unwrap_date()),
-            DatePartInterval(units) => date_part_interval_inner::<f64>(*units, a.unwrap_interval()),
+            DatePartInterval(units) => date_part_interval_inner::<f64>(*units, a),
             DatePartTime(units) => date_part_time_inner::<_, f64>(*units, a.unwrap_time()),
             DatePartTimestamp(units) => {
                 date_part_timestamp_inner::<_, f64>(*units, a.unwrap_timestamp())

note that I don't think this blithely works with some of the other suggestions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed this change for interval, time, and date.

doc/user/content/sql/functions/date-part.md Outdated Show resolved Hide resolved
@@ -8,13 +8,15 @@ menu:

`EXTRACT` returns some time component from a time-based value, such as the year from a Timestamp.

See [`date_part`](../date-part) for the traditional Ingres equivalent function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to point people from date_part to extract because extract is the new preferred way. Does it make sense to not even link people from extract to date_part, and instead remove this sentence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this sentence.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 25, 2022

@sploiselle @mjibson Just pushed some changes that address your comments and fix merge conflicts. Thanks for the review!

Copy link
Contributor

@maddyblue maddyblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs changes LGTM. Other folks might have opinions about the Rust, I don't.

doc/user/content/sql/functions/date-part.md Outdated Show resolved Hide resolved
…sh44/extract

� Conflicts:
�	doc/user/content/release-notes.md
@sploiselle
Copy link
Contributor

@jkosh44 Sorry for the delay here; I saw there were some failing tests and forgot to circle back here. Overall this approach LGTM; if there's some simplification that becomes obvious, can always refactor. When I came back to this today, though, I had an easy time reasoning about the changes, so I think it's in a good state.

LMK if you'd like any help chasing down these failures.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 31, 2022

@sploiselle Sorry about that, I didn't notice I had failed tests. I just pushed some changes that should hopefully fix them, but I think I need someone to starts the build for me (I'm not a member of the build-authorized team).

@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 31, 2022

@sploiselle Double apologies, I definitely should have confirmed locally that I had fixed everything. I fixed the remaining errors and confirmed locally.

Lint Docs:

$ ./bin/ci-builder run stable ci/test/lint-docs.sh 
Start building sites … 
hugo v0.87.0-B0C541E4+extended linux/amd64 BuildDate=2021-08-03T10:57:28Z VendorInfo=gohugoio
WARN 2022/01/31 20:53:23 "/home/joe/IdeaProjects/materialize/doc/user/content/demos/_index.md:1:1": duplicate menu entry with identifier "Try Materialize" in menu "main"
WARN 2022/01/31 20:53:23 "/home/joe/IdeaProjects/materialize/doc/user/content/sql/_index.md:1:1": duplicate menu entry with identifier "SQL Overview" in menu "main"

                   | EN   
-------------------+------
  Pages            | 199  
  Paginator pages  |   0  
  Non-page files   |   0  
  Static files     |  27  
  Processed images |   0  
  Aliases          |  63  
  Sitemaps         |   1  
  Cleaned          |   0  

Total in 489 ms
Skipping the checking of external links.
htmltest started at 08:53:23 on ci/www/public
========================================================================
✔✔✔ passed in 212.259593ms
tested 240 documents

Testdrive

$ cargo run --bin testdrive --release -- test/testdrive/date_func.td 
    Finished release [optimized + debuginfo] target(s) in 0.19s
     Running `target/release/testdrive test/testdrive/date_func.td`
Configuration parameters:
    Kafka address: localhost:9092
    Schema registry URL: http://localhost:8081/
    materialized host: Tcp("localhost")
    error limit: 10
--- test/testdrive/date_func.td
Running test 1 time(s) ... 
Run 1 ...
> DROP DATABASE materialize
> SELECT EXTRACT(NULL FROM CAST('2011-11-11' AS DATE));
query error matches; continuing
> SELECT EXTRACT(NULL FROM CAST('11:11:11' AS TIME));
query error matches; continuing
> SELECT EXTRACT(NULL FROM CAST('2011-11-11' AS TIMESTAMP));
query error matches; continuing
> SELECT EXTRACT(NULL FROM CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE));
query error matches; continuing
> SELECT DATE_PART(NULL, CAST('2011-11-11' AS DATE)) IS NULL;
rows match; continuing at ts 1643663789.6457105
> SELECT DATE_PART(NULL, CAST('11:11:11' AS TIME)) IS NULL;
rows match; continuing at ts 1643663789.6461234
> SELECT DATE_PART(NULL, CAST('2011-11-11' AS TIMESTAMP)) IS NULL;
rows match; continuing at ts 1643663789.646539
> SELECT DATE_PART(NULL, CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE)) IS NULL;
rows match; continuing at ts 1643663789.6470332
> SELECT EXTRACT('foo' FROM CAST('2011-11-11' AS DATE));
query error matches; continuing
> SELECT EXTRACT('foo' FROM CAST('11:11:11' AS TIME));
query error matches; continuing
> SELECT EXTRACT('foo' FROM CAST('2011-11-11' AS TIMESTAMP));
query error matches; continuing
> SELECT EXTRACT('foo' FROM CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE));
query error matches; continuing
> SELECT DATE_PART('foo', CAST('2011-11-11' AS DATE));
query error matches; continuing
> SELECT DATE_PART('foo', CAST('11:11:11' AS TIME));
query error matches; continuing
> SELECT DATE_PART('foo', CAST('2011-11-11' AS TIMESTAMP));
query error matches; continuing
> SELECT DATE_PART('foo', CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE));
query error matches; continuing
> SELECT EXTRACT('' FROM CAST('2011-11-11' AS DATE));
query error matches; continuing
> SELECT EXTRACT('' FROM CAST('11:11:11' AS TIME));
query error matches; continuing
> SELECT EXTRACT('' FROM CAST('2011-11-11' AS TIMESTAMP));
query error matches; continuing
> SELECT EXTRACT('' FROM CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE));
query error matches; continuing
> SELECT DATE_PART('', CAST('2011-11-11' AS DATE));
query error matches; continuing
> SELECT DATE_PART('', CAST('11:11:11' AS TIME));
query error matches; continuing
> SELECT DATE_PART('', CAST('2011-11-11' AS TIMESTAMP));
query error matches; continuing
> SELECT DATE_PART('', CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE));
query error matches; continuing
> SELECT EXTRACT('second' FROM CAST('2011-11-11' AS DATE));
query error matches; continuing
> SELECT EXTRACT('second' FROM CAST('2011-11-11' AS TIMESTAMP));
rows match; continuing at ts 1643663789.6538699
> SELECT EXTRACT('second' FROM CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE));
rows match; continuing at ts 1643663789.6543183
> SELECT DATE_PART('second', CAST('2011-11-11' AS DATE));
rows match; continuing at ts 1643663789.6547601
> SELECT DATE_PART('second', CAST('2011-11-11' AS TIMESTAMP));
rows match; continuing at ts 1643663789.6551979
> SELECT DATE_PART('second', CAST('2011-11-11' AS TIMESTAMP WITH TIME ZONE));
rows match; continuing at ts 1643663789.6555967
> SELECT EXTRACT('day' FROM CAST(NULL AS DATE)) IS NULL;
rows match; continuing at ts 1643663789.6559987
> SELECT EXTRACT('day' FROM CAST(NULL AS TIME)) IS NULL;
rows match; continuing at ts 1643663789.6563857
> SELECT EXTRACT('day' FROM CAST(NULL AS TIMESTAMP)) IS NULL;
rows match; continuing at ts 1643663789.656778
> SELECT EXTRACT('day' FROM CAST(NULL AS TIMESTAMP WITH TIME ZONE)) IS NULL;
rows match; continuing at ts 1643663789.6571867
> SELECT DATE_PART('day', CAST(NULL AS DATE)) IS NULL;
rows match; continuing at ts 1643663789.6576157
> SELECT DATE_PART('day', CAST(NULL AS TIME)) IS NULL;
rows match; continuing at ts 1643663789.6580007
> SELECT DATE_PART('day', CAST(NULL AS TIMESTAMP)) IS NULL;
rows match; continuing at ts 1643663789.6583924
> SELECT DATE_PART('day', CAST(NULL AS TIMESTAMP WITH TIME ZONE)) IS NULL;
rows match; continuing at ts 1643663789.6588137
+++ testdrive completed successfully.

@sploiselle
Copy link
Contributor

No apologies necessary! Testing is iterative

@jkosh44 jkosh44 merged commit 9e8a594 into MaterializeInc:main Feb 2, 2022
@jkosh44 jkosh44 deleted the jkosh44/extract branch February 2, 2022 18:07
jkosh44 added a commit to jkosh44/materialize that referenced this pull request Feb 7, 2022
Breaking changes are supposed to be listed first in the release notes.
PR MaterializeInc#10027 mistakenly did not list it's breaking changes first, so this
commit fixes that.
@jkosh44 jkosh44 mentioned this pull request Feb 7, 2022
2 tasks
jkosh44 added a commit that referenced this pull request Feb 7, 2022
Breaking changes are supposed to be listed first in the release notes.
PR #10027 mistakenly did not list it's breaking changes first, so this
commit fixes that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: bring EXTRACT into alignment with PostgreSQL v14
4 participants