Surface a Request Unit metric per statement #74441

kevin-v-ngo · 2022-01-05T02:34:24Z

This issue tracks surfacing how many RUs are consumed per statement for CockroachDB Serverless.

The RUs consumed are surfaced via `EXPLAIN ANALYZE`.

Epic: CRDB-12080

ajwerner · 2022-01-05T02:45:47Z

This seems very hard given the current architecture of how we compute RUs. Potentially depends on #60589, see also golang/go#41554. Even that wouldn't necessarily be enough, but without something like that, it's hard to imagine a path.

andy-kimball · 2022-06-01T21:06:07Z

I've been doing a lot of work in this area. Most queries are dominated by Storage Layer requests, not by CPU. We can also estimate CPU pretty well, especially when there isn't background work going on. As long as we make it clear that the RUs reported by EXPLAIN ANALYZE are "approximate", I think we can do this work without #60589.

This commit adds a top-level field to the output of `EXPLAIN ANALYZE` that shows the estimated number of RUs that would be consumed due to network egress to the client. The estimate is obtained by buffering each value from the query result in text format and then measuring the size of the buffer before resetting it. The result is used to get the RU consumption with the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs consumed by a query due to CPU usage. This is accomplished by keeping a moving average of the CPU usage for the entire tenant process, then using that to obtain an estimate for what the CPU usage *would* be if the query wasn't running. This is then compared against the actual measured CPU usage during the query's execution to get the estimate. For local flows this is done at the `connExecutor` level; for remote flows this is handled by the last outbox on the node (which gathers and sends the flow's metadata). The resulting RU estimate is added to the existing estimate from network egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations for all execution operators that perform KV operations. The corresponding RU count is recorded in the span and later aggregated with the RU consumption due to network egress and CPU usage. The resulting query RU consumption estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query several times with `EXPLAIN ANALYZE`, then runs all test queries without `EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the aggregated estimates. For now, this test is disabled during builds because it is flaky in the presence of background activity. For this reason it should only be used as a manual sanity test.

Informs cockroachdb#74441

Release note (sql change): Added an estimate for the number of request units consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.
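
The buffering approach to egress estimation described in this commit message can be illustrated roughly as follows. This is only a sketch, not the actual CockroachDB code: `egressEstimator`, `pgWireEgressCost`, and the `ruPerEgressByte` rate are hypothetical names and values made up for illustration, and `fmt.Fprint` stands in for the real text-format encoding of result values.

```go
package main

import (
	"bytes"
	"fmt"
)

// ruPerEgressByte is an assumed rate (1 RU per KiB) chosen for illustration.
// The real rate comes from the tenant cost config and is not given in the thread.
const ruPerEgressByte = 1.0 / 1024

// pgWireEgressCost is a hypothetical stand-in for the tenant cost config's
// PGWireEgressCost method: it converts a byte count into request units.
func pgWireEgressCost(n int64) float64 {
	return float64(n) * ruPerEgressByte
}

// egressEstimator accumulates the text-format size of a query result.
type egressEstimator struct {
	buf   bytes.Buffer // scratch buffer, reused for every value
	total int64        // total text-format bytes seen so far
}

// addRow writes each value into the scratch buffer in text form, records the
// buffer's size, and resets the buffer so it never holds more than one value.
func (e *egressEstimator) addRow(row []interface{}) {
	for _, v := range row {
		fmt.Fprint(&e.buf, v)
		e.total += int64(e.buf.Len())
		e.buf.Reset()
	}
}

// estimateRU converts the accumulated byte count into an RU estimate.
func (e *egressEstimator) estimateRU() float64 {
	return pgWireEgressCost(e.total)
}

func main() {
	var e egressEstimator
	e.addRow([]interface{}{1, "alice", true})
	e.addRow([]interface{}{2, "bob", false})
	fmt.Printf("egress bytes=%d estimated RU=%.4f\n", e.total, e.estimateRU())
}
```

Resetting the buffer after every value keeps memory use bounded by the largest single value rather than by the size of the whole result set.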

89256: sql: output RU estimate for EXPLAIN ANALYZE on tenants r=DrewKimball a=DrewKimball

**sql: surface query request units consumed by network egress**

This commit adds a top-level field to the output of `EXPLAIN ANALYZE` that shows the estimated number of RUs that would be consumed due to network egress to the client. The estimate is obtained by measuring the in-memory size of the query result, and passing that to the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs consumed by a query due to CPU usage. This is accomplished by keeping a moving average of the CPU usage for the entire tenant process, then using that to obtain an estimate for what the CPU usage *would* be if the query wasn't running. This is then compared against the actual measured CPU usage during the query's execution to get the estimate. For local flows this is done at the `connExecutor` level; for remote flows this is handled by the last outbox on the node (which gathers and sends the flow's metadata). The resulting RU estimate is added to the existing estimate from network egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations for all execution operators that perform KV operations. The corresponding RU count is recorded in the span and later aggregated with the RU consumption due to network egress and CPU usage. The resulting query RU consumption estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query several times with `EXPLAIN ANALYZE`, then runs all test queries without `EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the aggregated estimates.

Informs #74441

Release note (sql change): Added an estimate for the number of request units consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.

Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
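
The CPU portion of the estimate described above (a moving average of process-wide CPU usage used as a baseline, then subtracted from the CPU measured while the query ran) might look something like the sketch below. It is not the actual implementation: the `cpuBaseline` type, its methods, and the `ruPerCPUSecond` rate are hypothetical, and the use of an exponentially weighted moving average is an assumption, since the thread only says "moving average".

```go
package main

import "fmt"

// ruPerCPUSecond is an assumed conversion rate, for illustration only.
const ruPerCPUSecond = 1000.0

// cpuBaseline tracks a moving average of the process's CPU usage rate
// (CPU-seconds consumed per wall-clock second).
type cpuBaseline struct {
	alpha float64 // weight given to the newest sample (assumed EWMA)
	rate  float64 // current moving average of the CPU usage rate
	init  bool    // whether at least one sample has been observed
}

// observe folds one process-wide sample into the moving average.
func (b *cpuBaseline) observe(cpuSeconds, wallSeconds float64) {
	sample := cpuSeconds / wallSeconds
	if !b.init {
		b.rate, b.init = sample, true
		return
	}
	b.rate = b.alpha*sample + (1-b.alpha)*b.rate
}

// estimateQueryCPU returns the CPU-seconds attributed to a query: the CPU
// measured while it ran, minus what the baseline predicts the process would
// have used over the same wall time without the query. Clamped at zero.
func (b *cpuBaseline) estimateQueryCPU(measuredCPU, wallSeconds float64) float64 {
	expected := b.rate * wallSeconds
	if measuredCPU <= expected {
		return 0
	}
	return measuredCPU - expected
}

func main() {
	b := &cpuBaseline{alpha: 0.5}
	b.observe(0.25, 1) // background samples taken before the query
	b.observe(0.75, 1)
	queryCPU := b.estimateQueryCPU(2.0, 2.0) // 2 CPU-seconds over 2 wall-seconds
	fmt.Printf("query CPU=%.2fs estimated RU=%.0f\n", queryCPU, queryCPU*ruPerCPUSecond)
}
```

Because the baseline covers the whole tenant process, background work that starts or stops during the query skews the result, which is consistent with the estimate being described as approximate.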

This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`, which is used to determine whether tenants collect an RU estimate for queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None

92952: sql: fix statement bundle creation when memo isn't detached r=cucaroach a=cucaroach

When a memo is deemed not reusable, we don't detach it from the factory. This causes problems later when we execute SetIndexRecommendations, which resets the optimizer context, which in turn resets the memo. As a result, the schema.sql and opt*.txt files are empty.

Fixes: #92920

Release note: None

92968: sql: add cluster setting to disable RU estimation r=DrewKimball a=DrewKimball

This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`, which is used to determine whether tenants collect an RU estimate for queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the RU estimation logic can be more safely backported.

Informs #74441

Release note: None

93131: rttanalysis: don't fail when benchmarking, do skip r=ajwerner a=ajwerner

Fixes: #92770

Release note: None

Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>