Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface a Request Unit metric per statement #74441

Closed
kevin-v-ngo opened this issue Jan 5, 2022 · 10 comments
Closed

Surface a Request Unit metric per statement #74441

kevin-v-ngo opened this issue Jan 5, 2022 · 10 comments
Assignees
Labels
A-sql-cli-observability Issues related to surfacing SQL observability in SHOW, CRDB_INTERNAL, SYSTEM, etc. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@kevin-v-ngo
Copy link

kevin-v-ngo commented Jan 5, 2022

This issue tracks surfacing how many RUs are consumed per statement for CockroachDB Serverless.

  1. RU consumed is surfaced via EXPLAIN ANALYZE

Epic: CRDB-12080

@kevin-v-ngo kevin-v-ngo added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-observability labels Jan 5, 2022
@kevin-v-ngo kevin-v-ngo added the A-sql-cli-observability Issues related to surfacing SQL observability in SHOW, CRDB_INTERNAL, SYSTEM, etc. label Jan 5, 2022
@ajwerner
Copy link
Contributor

ajwerner commented Jan 5, 2022

This seems very hard given the current architecture of how we compute RUs. Potentially depends on #60589, see also golang/go#41554. Even that wouldn't necessarily be enough, but without something like that, it's hard to imagine a path.

@andy-kimball
Copy link
Contributor

I've been doing a lot of work in this area. Most queries are dominated by Storage Layer requests, not by CPU. We can also estimate CPU pretty well, especially when there isn't background work going on. I think as long as we made it clear that the RUs reported by EXPLAIN ANALYZE are "approximate", I think we can do this work without #60589.

@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Sep 1, 2022
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Nov 17, 2022
This commit adds a top-level field to the output of `EXPLAIN ANALYZE`
that shows the estimated number of RUs that would be consumed due to
network egress to the client. The estimate is obtained by buffering
each value from the query result in text format and then measuring the
size of the buffer before resetting it. The result is used to get the
RU consumption with the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs
consumed by a query due to CPU usage. This is accomplished by keeping a
moving average of the CPU usage for the entire tenant process, then using
that to obtain an estimate for what the CPU usage *would* be if the query
wasn't running. This is then compared against the actual measured CPU usage
during the query's execution to get the estimate. For local flows this is
done at the `connExecutor` level; for remote flows this is handled by the
last outbox on the node (which gathers and sends the flow's metadata).
The resulting RU estimate is added to the existing estimate from network
egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations
for all execution operators that perform KV operations. The corresponding
RU count is recorded in the span and later aggregated with the RU consumption
due to network egress and CPU usage. The resulting query RU consumption
estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running
queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query
several times with `EXPLAIN ANALYZE`, then runs all test queries without
`EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the
aggregated estimates. For now, this test is disabled during builds because
it is flaky in the presence of background activity. For this reason it
should only be used as a manual sanity test.

Informs cockroachdb#74441

Release note (sql change): Added an estimate for the number of request units
consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.
craig bot pushed a commit that referenced this issue Nov 18, 2022
89256: sql: output RU estimate for EXPLAIN ANALYZE on tenants r=DrewKimball a=DrewKimball

**sql: surface query request units consumed by network egress**
This commit adds a top-level field to the output of `EXPLAIN ANALYZE`
that shows the estimated number of RUs that would be consumed due to
network egress to the client. The estimate is obtained by measuring
the in-memory size of the query result, and passing that to the
tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**
This commit adds the ability for clients to estimate the number of RUs
consumed by a query due to CPU usage. This is accomplished by keeping a
moving average of the CPU usage for the entire tenant process, then using
that to obtain an estimate for what the CPU usage *would* be if the query
wasn't running. This is then compared against the actual measured CPU usage
during the query's execution to get the estimate. For local flows this is
done at the `connExecutor` level; for remote flows this is handled by the
last outbox on the node (which gathers and sends the flow's metadata).
The resulting RU estimate is added to the existing estimate from network
egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**
This commit adds tracking for request units consumed by IO operations
for all execution operators that perform KV operations. The corresponding
RU count is recorded in the span and later aggregated with the RU consumption
due to network egress and CPU usage. The resulting query RU consumption
estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**
This commit adds a sanity test for the RU estimates produced by running
queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query
several times with `EXPLAIN ANALYZE`, then runs all test queries without
`EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the
aggregated estimates.

Informs #74441

Release note (sql change): Added an estimate for the number of request units
consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.

Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 3, 2022
This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`,
which is used to determine whether tenants collect an RU estimate for
queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the
RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 6, 2022
This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`,
which is used to determine whether tenants collect an RU estimate for
queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the
RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None
craig bot pushed a commit that referenced this issue Dec 6, 2022
92952: sql: fix statement bundle creation when memo isn't detached  r=cucaroach a=cucaroach

When a memo is deemed not resuable we don't detach it from the factory
and this causes problems later when we execute SetIndexRecommendations
which resets the optimizer context which will reset the memo.  This
causes the schema.sql and opt*.txt files to be empty.

Fixes: #92920

Release note: None

92968: sql: add cluster setting to disable RU estimation r=DrewKimball a=DrewKimball

This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`, which is used to determine whether tenants collect an RU estimate for queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the RU estimation logic can be more safely backported.

Informs #74441

Release note: None

93131: rttanalysis: don't fail when benchmarking, do skip r=ajwerner a=ajwerner

Fixes: #92770

Release note: None

Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 7, 2022
This commit adds a top-level field to the output of `EXPLAIN ANALYZE`
that shows the estimated number of RUs that would be consumed due to
network egress to the client. The estimate is obtained by buffering
each value from the query result in text format and then measuring the
size of the buffer before resetting it. The result is used to get the
RU consumption with the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs
consumed by a query due to CPU usage. This is accomplished by keeping a
moving average of the CPU usage for the entire tenant process, then using
that to obtain an estimate for what the CPU usage *would* be if the query
wasn't running. This is then compared against the actual measured CPU usage
during the query's execution to get the estimate. For local flows this is
done at the `connExecutor` level; for remote flows this is handled by the
last outbox on the node (which gathers and sends the flow's metadata).
The resulting RU estimate is added to the existing estimate from network
egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations
for all execution operators that perform KV operations. The corresponding
RU count is recorded in the span and later aggregated with the RU consumption
due to network egress and CPU usage. The resulting query RU consumption
estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running
queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query
several times with `EXPLAIN ANALYZE`, then runs all test queries without
`EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the
aggregated estimates. For now, this test is disabled during builds because
it is flaky in the presence of background activity. For this reason it
should only be used as a manual sanity test.

Informs cockroachdb#74441

Release note (sql change): Added an estimate for the number of request units
consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 7, 2022
This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`,
which is used to determine whether tenants collect an RU estimate for
queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the
RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 8, 2022
This commit adds a top-level field to the output of `EXPLAIN ANALYZE`
that shows the estimated number of RUs that would be consumed due to
network egress to the client. The estimate is obtained by buffering
each value from the query result in text format and then measuring the
size of the buffer before resetting it. The result is used to get the
RU consumption with the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs
consumed by a query due to CPU usage. This is accomplished by keeping a
moving average of the CPU usage for the entire tenant process, then using
that to obtain an estimate for what the CPU usage *would* be if the query
wasn't running. This is then compared against the actual measured CPU usage
during the query's execution to get the estimate. For local flows this is
done at the `connExecutor` level; for remote flows this is handled by the
last outbox on the node (which gathers and sends the flow's metadata).
The resulting RU estimate is added to the existing estimate from network
egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations
for all execution operators that perform KV operations. The corresponding
RU count is recorded in the span and later aggregated with the RU consumption
due to network egress and CPU usage. The resulting query RU consumption
estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running
queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query
several times with `EXPLAIN ANALYZE`, then runs all test queries without
`EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the
aggregated estimates. For now, this test is disabled during builds because
it is flaky in the presence of background activity. For this reason it
should only be used as a manual sanity test.

Informs cockroachdb#74441

Release note (sql change): Added an estimate for the number of request units
consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 8, 2022
This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`,
which is used to determine whether tenants collect an RU estimate for
queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the
RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 9, 2022
This commit adds a top-level field to the output of `EXPLAIN ANALYZE`
that shows the estimated number of RUs that would be consumed due to
network egress to the client. The estimate is obtained by buffering
each value from the query result in text format and then measuring the
size of the buffer before resetting it. The result is used to get the
RU consumption with the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs
consumed by a query due to CPU usage. This is accomplished by keeping a
moving average of the CPU usage for the entire tenant process, then using
that to obtain an estimate for what the CPU usage *would* be if the query
wasn't running. This is then compared against the actual measured CPU usage
during the query's execution to get the estimate. For local flows this is
done at the `connExecutor` level; for remote flows this is handled by the
last outbox on the node (which gathers and sends the flow's metadata).
The resulting RU estimate is added to the existing estimate from network
egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations
for all execution operators that perform KV operations. The corresponding
RU count is recorded in the span and later aggregated with the RU consumption
due to network egress and CPU usage. The resulting query RU consumption
estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running
queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query
several times with `EXPLAIN ANALYZE`, then runs all test queries without
`EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the
aggregated estimates. For now, this test is disabled during builds because
it is flaky in the presence of background activity. For this reason it
should only be used as a manual sanity test.

Informs cockroachdb#74441

Release note (sql change): Added an estimate for the number of request units
consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 9, 2022
This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`,
which is used to determine whether tenants collect an RU estimate for
queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the
RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 12, 2022
This commit adds a top-level field to the output of `EXPLAIN ANALYZE`
that shows the estimated number of RUs that would be consumed due to
network egress to the client. The estimate is obtained by buffering
each value from the query result in text format and then measuring the
size of the buffer before resetting it. The result is used to get the
RU consumption with the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs
consumed by a query due to CPU usage. This is accomplished by keeping a
moving average of the CPU usage for the entire tenant process, then using
that to obtain an estimate for what the CPU usage *would* be if the query
wasn't running. This is then compared against the actual measured CPU usage
during the query's execution to get the estimate. For local flows this is
done at the `connExecutor` level; for remote flows this is handled by the
last outbox on the node (which gathers and sends the flow's metadata).
The resulting RU estimate is added to the existing estimate from network
egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations
for all execution operators that perform KV operations. The corresponding
RU count is recorded in the span and later aggregated with the RU consumption
due to network egress and CPU usage. The resulting query RU consumption
estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running
queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query
several times with `EXPLAIN ANALYZE`, then runs all test queries without
`EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the
aggregated estimates. For now, this test is disabled during builds because
it is flaky in the presence of background activity. For this reason it
should only be used as a manual sanity test.

Informs cockroachdb#74441

Release note (sql change): Added an estimate for the number of request units
consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Dec 12, 2022
This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`,
which is used to determine whether tenants collect an RU estimate for
queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the
RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None
@mgartner
Copy link
Collaborator

@DrewKimball What work remains here?

@andy-kimball
Copy link
Contributor

One thing we realized after rolling out the RU estimation - it's being measured in int64 units rather than float64 units. This means that RUs are not being reported very accurately for small statements, like a point read (it's just reporting 0 for small statements, which is confusing).

@DrewKimball
Copy link
Collaborator

The first two bullet points are still left:

  1. Like rowCount, RU consumed is returned as part of the result for each statement execution
  2. RU consumed is surfaced as part of crdb_internal.statement_statistics

There are also some smaller things to do like reporting RUs in float64 instead of int64 as mentioned above and using grunning to measure CPU for the estimate, but I'll open separate issues for those.

@andy-kimball
Copy link
Contributor

How come we'd want to return RU consumed as part of the result of statement execution? I thought we only compute RUs consumed for EXPLAIN ANALYZE statements?

@DrewKimball
Copy link
Collaborator

I thought we only compute RUs consumed for EXPLAIN ANALYZE statements?

We do, but that doesn't necessarily have to be the case. Do you think we shouldn't measure it for each statement?

@andy-kimball
Copy link
Contributor

I'd be worried about extra overhead to gather information that will probably almost never be used. Also, we can only get an estimated amount right now, not an exact amount. @kevin-v-ngo, do you see important use cases for returning RUs from every statement execution, as opposed to just from EXPLAIN ANALYZE?

@kevin-v-ngo
Copy link
Author

I'd be worried about extra overhead to gather information that will probably almost never be used. Also, we can only get an estimated amount right now, not an exact amount. @kevin-v-ngo, do you see important use cases for returning RUs from every statement execution, as opposed to just from EXPLAIN ANALYZE?

This issue was not synced from JIRA and was scoped to only EXPLAIN ANALYZE. It should be closed as Done.

There are use cases for understanding workloads (e.g., statements and transactions) that take up the most RUs over time; however, I haven't validate or yet seen explicit use cases expressed from users. It'd take it incrementally for now and to also understand how accurate and useful our existing RU estimation before capturing this in our other observability interfaces.

@DrewKimball
Copy link
Collaborator

Thanks for the extra context, I'll close this out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-cli-observability Issues related to surfacing SQL observability in SHOW, CRDB_INTERNAL, SYSTEM, etc. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Archived in project
Development

No branches or pull requests

5 participants