- Feature name: External introspection
- Associated: (Insert list of associated epics, issues, or PRs)
In this document we will propose a few features related to "external" introspection; that is, introspection by end-users. These features will be exposed in the web UI.
It is currently difficult for all but the most sophisticated users to debug issues with compute-maintained features: to understand the health and stability of their replicas, the performance of their queries, and so on.
We will explain each proposed feature separately.
This will provide the functionality available now in the `memory` and `/hierarchical-memory` visualizations: visualizing the graph of operators in a dataflow, including the transfer of data along channels and the number of records in arrangements.
This will differ from the presently-available GUIs in the following ways:
- It will be externally visible
- The user will be able to dynamically refine the zoom level by clicking (rather than having to scroll down as in the current hierarchical UI)
- We will include scheduling durations alongside the other per-node information (arrangement sizes / channel traffic)
- If possible (TBD) we should figure out how to filter out the error paths.
The global frontier lag visualizer will allow users to see at a glance whether dataflows are falling behind. It will display the controller's view of each source and export frontier, and color nodes if their outputs lag significantly behind their inputs.
There may be multiple values corresponding to a single export, if the cluster on which the source dataflow is running has more than one replica. Each edge between dataflows will be labeled with the maximum frontier across all replicas.
The view will incorporate all possible inputs and outputs of a dataflow: indexes, MVs, sources, and sinks.
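A minimal sketch of how the edge labels and node coloring described above might be computed in the UI. The row shapes, field names, and threshold are hypothetical; the actual columns of `mz_cluster_replica_frontiers` and `mz_object_dependencies` may differ. It also assumes that "lag" is measured against the least advanced input, which is one possible definition rather than a settled one.

```typescript
// Hypothetical row shape: one reported frontier per (object, replica).
interface ReplicaFrontier {
  objectId: string;
  replicaId: string;
  frontier: number; // timestamp in milliseconds
}

// Hypothetical dependency edge: `objectId` reads from `dependsOnId`.
interface Dependency {
  objectId: string;
  dependsOnId: string;
}

// Edge label for each object: the maximum frontier across all replicas.
function maxFrontierByObject(rows: ReplicaFrontier[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const row of rows) {
    const current = result.get(row.objectId);
    if (current === undefined || row.frontier > current) {
      result.set(row.objectId, row.frontier);
    }
  }
  return result;
}

// Objects whose output frontier trails their least advanced input
// by more than `thresholdMs` (assumed definition of "lagging").
function laggingObjects(
  frontiers: Map<string, number>,
  deps: Dependency[],
  thresholdMs: number,
): Set<string> {
  const minInput = new Map<string, number>();
  for (const dep of deps) {
    const input = frontiers.get(dep.dependsOnId);
    if (input === undefined) continue;
    const current = minInput.get(dep.objectId);
    if (current === undefined || input < current) {
      minInput.set(dep.objectId, input);
    }
  }
  const lagging = new Set<string>();
  minInput.forEach((input, objectId) => {
    const output = frontiers.get(objectId);
    if (output !== undefined && input - output > thresholdMs) {
      lagging.add(objectId);
    }
  });
  return lagging;
}
```

Objects with no known inputs are never flagged here; whether sources should be compared against wall-clock time instead is left open.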
By "resource usage information" we mean the same things that are displayed per-node in the hierarchical dataflow visualizer.
The scope structure should let us compute various values (time spent, records contained) and display them in a tree-like visualization, as shown below.
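One way the scope structure could be turned into a tree is sketched here. It assumes each operator comes with an address path (as in `mz_dataflow_addresses`) and a value that excludes its descendants; the `OperatorStat` shape and the function names are hypothetical. The nested `name` / `value` / `children` output is intended to match the input shape that d3-flame-graph accepts.

```typescript
// Hypothetical input: one entry per operator, with its address path
// (for example [1, 2, 1]) and a self value such as elapsed time.
interface OperatorStat {
  name: string;
  address: number[];
  value: number;
}

// Nested node; after building, `value` includes all descendants.
interface FlameNode {
  name: string;
  value: number;
  children: FlameNode[];
}

// Add each child's total to its parent, bottom-up.
function addDescendantValues(node: FlameNode): number {
  for (const child of node.children) {
    node.value += addDescendantValues(child);
  }
  return node.value;
}

function buildFlameTree(stats: OperatorStat[]): FlameNode {
  const root: FlameNode = { name: "root", value: 0, children: [] };
  const byAddress = new Map<string, FlameNode>();

  // Visit shorter addresses first so that parents exist before children.
  const sorted = stats.slice().sort((a, b) => a.address.length - b.address.length);
  for (const stat of sorted) {
    const node: FlameNode = { name: stat.name, value: stat.value, children: [] };
    byAddress.set(stat.address.join("."), node);

    // Attach to the nearest known ancestor, falling back to the root.
    let parent = root;
    for (let len = stat.address.length - 1; len > 0; len--) {
      const candidate = byAddress.get(stat.address.slice(0, len).join("."));
      if (candidate !== undefined) {
        parent = candidate;
        break;
      }
    }
    parent.children.push(node);
  }

  addDescendantValues(root);
  return root;
}
```

If the collected values already include descendants (as may be the case for scheduling durations of enclosing scopes), the summation step would need to be skipped to avoid double counting.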
We should collect the following metadata for each query:
- The timestamp at which the query executed
- The frontiers of all dependencies
- The optimized and physical query plans
- The ID of the dataflow (if it is not a simple peek) used to service the query
- The SQL text of the query
We can then save all this information in a table with non-trivial retention, and surface it from the web UI.
Currently we have text-only `EXPLAIN PLAN` output. We should render this data in visual form, as a graph. We should also (whenever possible) flow column names from the source relations through the nodes, so that we can show something more useful than `#0`, etc.
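As a rough illustration of flowing column names through plan nodes, the sketch below replaces positional references with names where they are known and carries names through a projection. Both helper functions are hypothetical, and real plans contain many more node types than the single projection handled here.

```typescript
// Replace positional references such as "#0" with column names when known.
// Unknown positions are left as-is.
function resolveColumnRefs(
  expr: string,
  columnNames: Array<string | undefined>,
): string {
  return expr.replace(/#(\d+)/g, (ref: string, index: string) => {
    const name = columnNames[Number(index)];
    return name !== undefined ? name : ref;
  });
}

// Column names after a projection that keeps the given input positions,
// in the given order.
function projectColumnNames(
  inputNames: Array<string | undefined>,
  kept: number[],
): Array<string | undefined> {
  return kept.map((position) => inputNames[position]);
}
```

For example, with source columns `id`, `name`, and `price`, a projection keeping positions 2 and 0 yields `price` and `id`, so a downstream predicate `#0 > 10` would be displayed as `price > 10`.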
All features will be implemented in the web UI using React, querying the `mz_internal` relations for the necessary data. Relations that will be used include:
- For the hierarchical dataflow visualizer:
  - `mz_dataflow_operators`
  - `mz_dataflow_channels`
  - `mz_arrangement_sizes`
  - `mz_dataflow_addresses`
- For the global frontier lag visualizer:
  - `mz_object_dependencies`
  - `mz_cluster_replica_frontiers`
- For the flamegraph views: same as the hierarchical dataflow visualizer
- For the per-query metadata: New table to be created (`mz_query_metadata`).
- For the user-friendly plan rendering: parsed `EXPLAIN PLAN` output (possibly from `mz_query_metadata`, or entered manually by the user).
We will use the d3-graphviz library for layout and rendering of the hierarchical dataflow visualizer, global frontier lag visualizer, and user-friendly plan rendering. We will use d3-flame-graph -- which we are already using on the internal side -- for the flamegraph visualizer.
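Since d3-graphviz renders graphs from DOT source, each visualizer would need to turn query results into a DOT string. The following is a minimal sketch under assumed input shapes; `OperatorNode`, `Channel`, and `toDot` are hypothetical names, not part of any existing code.

```typescript
// Hypothetical, simplified shapes for operators and channels.
interface OperatorNode {
  id: number;
  name: string;
}

interface Channel {
  fromId: number;
  toId: number;
  records: number; // records sent along the channel
}

// Escape backslashes and double quotes for use inside a quoted DOT label.
function escapeLabel(text: string): string {
  return text.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Build a DOT digraph with one node per operator and one labeled edge
// per channel.
function toDot(operators: OperatorNode[], channels: Channel[]): string {
  const lines: string[] = ["digraph dataflow {"];
  for (const op of operators) {
    lines.push(`  n${op.id} [label="${escapeLabel(op.name)}"];`);
  }
  for (const ch of channels) {
    lines.push(`  n${ch.fromId} -> n${ch.toId} [label="${ch.records}"];`);
  }
  lines.push("}");
  return lines.join("\n");
}
```

The resulting string could then be handed to d3-graphviz, for example via `d3.select("#graph").graphviz().renderDot(dot)`, where `#graph` is a hypothetical container element.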
We will consider this an experimental feature until we can involve professional product designers and front-end engineers. Until then, the UX may not be as polished as the rest of the site, so we will require users to click a link labeled "Advanced Features" or similar in order to access the tools.
We will also use LaunchDarkly to gate access to the feature, and not launch it at all to the public until we have gotten some internal feedback from support and DevEx that it is useful.
We will have a separate LaunchDarkly flag just for the `mz_query_metadata` table, since this will have potentially large cost in high-QPS scenarios.
(TBD -- will cover this in our meeting with Robin tomorrow and get an overview from him of the testing strategy for this kind of feature)
- If any of the features don't prove useful, we are cluttering our UX unnecessarily.
- Maintaining the `mz_query_metadata` table will introduce overhead on persist, especially in high-QPS scenarios. We should measure this before implementing.
- I am unaware of any other possible designs
- What will be the overhead of the `mz_query_metadata` table, and is it acceptable?
- How can we communicate query IDs back to the user for looking up the query in the per-query metadata view? Is it acceptable to use `NOTICE` messages in this case?
Not sure of any. We should launch an MVP, get feedback and then iterate.