monster issue list #2

wkalt · 2024-03-12T16:05:30Z

bugs

~~bug: versionstore is not hooked up properly. See nodestore.nextVersion or something - needs to come from versionstore.~~
~~queries for topics/producers that are not stored produce 500 errors, including multi-topic requests when some topics are present but not others~~
~~duplicate uploads don't duplicate. If I upload the same file twice, I don't get duplicate messages in GetMessages. On one hand this is desired, on the other hand I don't remember implementing it...~~ Add write-ahead log #5
~~output MCAP writer is misassociating schemas when merging across different files, resulting in nil-valued schemas for one of the files~~ Fixes bug in mcap merge coordinator #6
~~Tree merge is erroneously merging areas of the tree that did not change~~ 3e16a2e
~~Statistics mishandles NaN and probably infinite valued floats. Probably the thing to do is skip these for sum/min/max/mean accounting, and maintain nan/inf counts as a separate statistic.~~ 1883231
~~Tree iterator is building a full list of leaves up front, which is extremely slow on huge resultsets. Needs to be incremental.~~ 60685f5
~~nondeterministic import failure: nondeterministic failure when querying concurrently with big import #11~~ Ensure inner node children in merge are fully cloned #18
service does not accept read requests during recovery (in startup), but it seems like it could
~~service does not currently crash on a port conflict~~ c0440db
~~semicolon termination should be in the grammar not enforced in client - to enable batched queries.~~ cfc95d8
~~local disk storage implementation should write to a tmpfile + rename~~ Ensure local disk directory store does atomic put #14
statistics storage on inner nodes is currently stored per-schema, but there are some kinds of statistics that are difficult for us to get per-schema, such as the compressed size of leaf data. We need to restructure the statistics representation a bit to handle this.
the executor defers initialization of the output writer until a message is successfully pulled, to allow schema conflicts to be surfaced as 400 errors. It needs to write an empty file if the resultset is simply empty. 82dbf9c
tables response format needs structural cleanup

tree methods

~~delete~~
~~return message diff between versions~~
get statistics since (version) maybe? does that make sense?

testing

service-level integration tests. Test service restarts don't corrupt database.
~~tests for concurrent inserts into tree~~
~~duplicate data must be deduplicated (on timestamp and message byte)~~
~~testing for storage with minio~~
test a huge rootmap

performance analysis

establish metrics of interest: memory usage during querying and ingestion, tree node sizes, records per second @ various input file dimensions, read throughput, time to first record. These need to be easy to observe through prometheus or something. When we get to evaluation it should focus on minio-backed deployments and be run against both local and remote minio deployments.
effectiveness of write batching

design questions

leaf nodes are sized by time, not byte size. There is probably an ideal write size for our storage writes. If we get a sample of messages we can attempt to size a tree to hit that, but if the sample is bad we will produce sub-ideal writes, through no fault of the user. We need to implement that sampling mechanism and also think harder about the general problem. It would be ideal if user write patterns influenced physical storage layout as little as possible (it won't be totally avoidable).
We are starting with ros1msg support because there are a lot of bag files available on the internet that we can use to test with, but ros1msg is not the only recording format used in ros2, or possibly even the most common. To support ros2 we will need to add parsing/statistics support for protobuf, CDR, and flatbuffers. We will need to survey the mcap community to figure out what's highest priority - my guess is CDR will be.
Multiple schemas may be used for a single topic name, particularly over long periods of time as schemas evolve. Is it OK to have multiple schemas in one tree, or do we really need to make trees unique by schema? Nothing in playback breaks due to multiple schemas, but search/statistics features could be complicated (they will be complicated whether there are multiple trees or one). 70905e5
bidirectional playback protocol - currently the user makes a request and gets a dump of MCAP response data back. If frontend tools could define a contract, would they choose this or something with finer-grained bidirectional controls? If a spec were defined would FE tooling implement support? This issue seems worth a read/consideration: Allow play/pause/seek control when streaming recorded data through web socket foxglove/ws-protocol#261
It would similarly be useful to spike out a variant with parquet (or a columnar format of some kind) in the leaves. We would need to transcode both on the way in and out, which would be a pain, but having some sense of how much would be lost or gained in row-oriented throughput would be useful.
If we want to support the ability to dump messages on a topic irrespective of producer, we are going to have a problem with the current arch doing an in-memory merge join if the number of producers is very large. Very large numbers of producers can happen easily in simulation, if each run ID is treated as a different producer. For heavy analytics usecases we have an answer, which is use spark or something, but for usecases like viewing logs in a webpage that doesn't work. Probably to support this we will need to spill to disk in the executor. I think as long as we can see a path we can defer this for a while.
we use data files numbered with unpadded decimal numbers, but padded numbers could be helpful for lexicographical sorting. Maybe we should be padding our object IDs.
Today our data files are a concatenation of node serializations from leaf to root. If this could itself be packaged as a single MCAP file (maybe with attachments for the inner nodes) this would be a big usability win.

features needed

Currently we flush WAL synchronously with inserts. WAL flushing needs to get moved to a background thread that intelligently flushes after periods of inactivity on a topic or when size limits get reached. Add write-ahead log #5
~~can we ditch the nodestore staging map if inserts flush to WAL?~~ Add write-ahead log #5
~~data files should be segregated by tree in storage, with a meaningful name~~ 20c80fe
warm node cache prior to accepting requests?
~~statrange command should not require start/end~~ Don't require start/end for statrange command #7
~~export command should not require topics. When no topics are supplied it should return all topics.~~ be61a9b
inner node serialization should change from JSON to compact binary format. Waiting on full statistics support to gather a better picture of what we will need.
inner nodes should be cached in serialized form, not deserialized
statrange queries currently have a minimum granularity of 60s (I think), below which a request produces a 400. We need to extend it to actually look at the message data and produce a correct result.
~~Switch to 64-bit offsets and lengths in IDs. This costs 8 bytes per ID but will insulate us against gigantic messages.~~ d6ad718
dump command: it should be possible to dump the database to a hive-partitioned directory of MCAP files.
~~WAL doesn't garbage collect yet~~ Add WAL garbage collection #8
export command shouldn't require a producer. Should be able to return data across all producers.
statistics must be extended to variable-length arrays
currently we route topics to ingestion workers based on a hash of producer/topic to ensure two workers don't process the same topic, but this causes underutilization when a worker gets assigned multiple slow-to-process topics. We should switch to a semaphore based strategy to allow other workers to pick up the extra work in this case.
track original import request ID through WAL, and log completion
need API for looking up message definitions by hash, which implies storing them somewhere by hash.
~~multiple database support. it should be possible to have sim and real-world data segregated on one instance.~~ 54707ef
Statistics should degrade gracefully for encoding formats we don't understand, i.e., still keep message count and byte count, just not field-level statistics. This will let non-ros1msg MCAP users get some benefit before we implement full parsers.
~~Die immediately on second sigint~~ 3e35173
export query results into parquet files
can we detect if a CLI user is in vim mode and have vim mode in the client?
make number of concurrent wal files configurable. Today we use just one. We don't want one per producer/topic. But one probably isn't optimal.
playback needs to support a mode where the first message returned on each merged stream is the last prior to the requested time, within an adjustable time bound, to allow visualizations to avoid data gradually filtering in.
it should be possible for a database to span different storage buckets. This enables the user to configure retention policies at the bucket level instead of based on an object prefix, which is usually frowned on. It will not be possible for individual trees to span buckets, without storing a bit more state in the node IDs (one byte for 128 allowed buckets within a database seems like it would be sufficient and leave us plenty of range for length).
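
For the playback item above (start each merged stream at the last message prior to the requested time, within a bound), a rough sketch of the per-stream start selection. It assumes the stream's timestamps are available as a sorted slice; `startIndex` and `lookback` are hypothetical names, and treating a message exactly at the start time as "good enough" is also an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// startIndex returns the index at which playback of one stream should begin.
// timestamps must be sorted ascending. If no message sits exactly at start and
// the last earlier message is within lookback of start, playback begins at
// that earlier message. Otherwise it begins at the first message >= start.
func startIndex(timestamps []uint64, start, lookback uint64) int {
	// First index whose timestamp is >= start.
	i := sort.Search(len(timestamps), func(k int) bool {
		return timestamps[k] >= start
	})
	if i < len(timestamps) && timestamps[i] == start {
		return i
	}
	if i > 0 && start-timestamps[i-1] <= lookback {
		return i - 1
	}
	return i
}

func main() {
	ts := []uint64{10, 20, 30, 40}
	fmt.Println(startIndex(ts, 25, 10)) // prior message at 20 is within the bound
	fmt.Println(startIndex(ts, 25, 3))  // prior message is too old
}
```

A return value equal to `len(timestamps)` means the stream has nothing to play for this request.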

catalog introspection

from within the client,

~~what producers do I have?~~
~~what topics exist for a producer?~~
~~what are the message-level stats for each table's root nodes?~~
~~what previous versions of a table do I have, dated and numbered?~~
~~what schema(s) are associated with a topic?~~
~~what fields on a topic can be queried?~~
~~eventually - what databases do I have?~~

community

~~present @ foxglove community meetup~~
~~present @ foxglove community meetup 1 mo followup~~
project logo

performance evaluation

is there MCAP or bag data at berkeley we could load up for an evaluation at the end?
establish benchmark metrics

client

interfaces - stick with REST? Use gRPC? Yes: gRPC. Maybe keep REST, but if the CLI tool is good we don't need REST.
switch to string-format time params in APIs, or js clients will struggle
~~fun CLI features. Like psql "session" interface, plotting of statistical ranges, displaying images? playing video?~~ f688894
web interface - just to display functionality. Maybe coverage (ranges of data coverage at a given granularity).
autocomplete based on producer/table listing
autocomplete grammar
MCAP library in Java or Scala
Python and java iterator implementations that access a root directly.

clustering

versionstore, wal, rootmap are currently sqlite-based. Both versionstore and rootmap need to move out of sqlite because multiple nodes need to hit them. WAL can stay sqlite for now. Let's go with postgres for now.
~~storage needs an S3-compatible implementation. Use minio libraries.~~
~~inserts need to shard across replicas based on producer + topic. What manages the shards? Probably goes in postgres.~~
on the read side, it would be best if we could merge reads with WAL. The "problem" is this would require distributed WAL storage IF we also want any node to be able to serve reads. We can solve this with distributed WAL storage but that's more complicated and slower.
Expanded in clustering #10
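
For reference, hashing on producer + topic to pick a shard (or an ingestion worker) can be sketched as below. `shardFor` is a hypothetical helper and FNV-1a is just a convenient stdlib hash, not necessarily what the service uses. It does not address who manages the shard assignments.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a producer/topic pair to one of `shards` buckets.
// shards must be greater than zero.
func shardFor(producer, topic string, shards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(producer))
	// Separator byte so that ("ab", "c") and ("a", "bc") feed different input.
	h.Write([]byte{0})
	h.Write([]byte(topic))
	return h.Sum32() % shards
}

func main() {
	fmt.Println(shardFor("robot-1", "/camera/front", 4))
	fmt.Println(shardFor("robot-1", "/imu", 4))
}
```

The same pair always lands on the same shard, which is what keeps two writers off one tree. It is also what causes the imbalance described under "features needed" when several slow topics hash together.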

monitoring

metrics instrumentation & scraping support
function tracing
~~pprof debugging endpoint~~

deployment

kubernetes manifests (probably helm?)
guidance for how to deploy

retention policies

I think the way retention will work is to store a retention policy on the root's record in the rootmap, and guard readers against reading data older than the policy dictates. Once that is in place retention can be managed with regular object lifecycle policies supported by the cloud provider.
Targeted exemption from GC is still outstanding
We will probably need to stick insertion times on inner nodes (in the children probably?) in order to implement the guard.

search & query language

~~statistics: field-level~~
~~SQL or not SQL?~~
~~SQL: better 3rd party compatibility, maybe chatgpt can answer queries for us~~
Not SQL: SQL is crappy for expressing complex as-of joins, which are a common kind of query. Maybe we can do a lot better. Ideally end users would be able to express queries themselves. Queries might be something like "show me all times in last 6 months when it was raining and we were taking an unprotected left and there were dogs in the intersection". That is hard to write in SQL if you aren't a SQL expert. We don't want customers to need to hire teams of SQL experts to translate. Also chatgpt is far from reliably translating English to SQL for arbitrary business contexts - not clear it will ever work.
~~Expanded in query language #9.~~
descending keyword to reverse sort order
variable-length array support for where clauses
~~statistics acceleration for scans~~ Leverage statistics in querying #25
support within (bidirectional precedes/succeeds)
we should be able to accelerate as-of joins using the MCAP message index. Prior to decompressing anything, consult the indexes to see if messages on the relevant topic are within the threshold of each other.
Once we have UDF support, it would be really useful to have materialized views
"neighbors" remains unimplemented in the query language
statistics acceleration can be applied at a higher level than the scan level, so that scans on different tables can restrict otherwise unrestricted scans on other tables by time. This would be helpful to improve performance in join scenarios but will require more sophistication.
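
For the as-of join acceleration idea above, a sketch of the pre-check: given the timestamps for two topics (assumed here to have already been read from the message indexes and sorted ascending), decide whether any pair falls within the threshold before decompressing anything. `anyWithin` is a hypothetical helper.

```go
package main

import "fmt"

// anyWithin reports whether some timestamp in a is within threshold of some
// timestamp in b. Both slices must be sorted ascending. It walks both slices
// once, always advancing the side with the smaller timestamp.
func anyWithin(a, b []uint64, threshold uint64) bool {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		var diff uint64
		if a[i] >= b[j] {
			diff = a[i] - b[j]
		} else {
			diff = b[j] - a[i]
		}
		if diff <= threshold {
			return true
		}
		if a[i] < b[j] {
			i++
		} else {
			j++
		}
	}
	return false
}

func main() {
	left := []uint64{100, 200, 300}
	right := []uint64{150, 260}
	fmt.Println(anyWithin(left, right, 40))
	fmt.Println(anyWithin(left, right, 39))
}
```

If the check returns false for a pair of chunks, the join can skip them entirely. If it returns true, the chunks still need to be decoded to evaluate the actual join condition.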

maintenance

Custom golangci-lint rule enforcing capitalization of log lines
Custom assert/require lib with better pretty-printing and representation of unsigned numerics
tree pretty printer for better test diffs

weirdnesses

versions are assigned unnecessarily while staging writes to WAL. Each write to WAL gets a version, then we merge them and create one big commit with a final version. I think the version assignment can just be deferred until the big commit. Add write-ahead log #5
tree insert over existing data currently clones all nodes down to the leaf. Pretty sure it only needs to clone the root for tree dimensions, and then all the other copying happens at time of merge from WAL. No indication so far that this is a bottleneck but it probably will be if it isn't yet. Add write-ahead log #5
~~cgo sqlite stuff is hard to inspect with pprof. Need a solution or perhaps switch to golang embedded db.~~ Add write-ahead log #5
Usage of the word granularity is weird and we may want to revise. Our granularity is an interval in seconds that caps the stats bucket width: buckets must be no wider than it. This means low "granularity" is "highly granular". Maybe we are misusing the word or should pick a better one.

beta release blockers

document versioning strategy
document versioning strategy for physical tree nodes
graceful statistics degradation for non-ros1msg format messages
document data deletion strategy (based on object lifecycle policies) and implement feature support in the server to mask deleted data.
whole-tree delete command
swagger API docs

wkalt · 2024-03-13T03:49:36Z

spent some time hacking on the meaningful names in storage, namely paths including topic and producer name. It makes the API that merges messages from a list of tree roots inconvenient, since a list of prefixes must also be specified. I think we should stash that idea and maybe think about solving the problem with better introspection APIs in the database. Ideally users don't care about the data file layout. 20c80fe
