
Derived dataset auto refresh benchmark #38

penghuo opened this issue Sep 20, 2023 · 0 comments
Labels: enhancement (New feature or request)


Goals

  • Size the cluster needed to auto refresh a derived dataset
  • Measure the latency of auto refreshing a derived dataset
  • Measure the cost of auto refreshing a derived dataset

Test Plan

Dimensions

  • Queries
    • Auto refresh Skipping Index
    • Auto refresh Covering Index
    • Auto refresh Materialized View
  • Dataset: the http_logs dataset
  • Ingestion
    • Ingestion traffic patterns
      • StreamingIngestion: 10 files per 10 seconds, each file 1MB (24MB uncompressed) - streaming ingestion case
      • BatchIngestion: 10 files per 30 minutes, each file 100MB - log case
  • Configuration
    • Streaming source mode
      • Pull mode: MicroBatch pulls files from S3.
      • Push mode: MicroBatch reads from SQS/SNS. (TBD)
    • Job with dynamicAllocation enabled / disabled
    • Job with different executor configurations
      • executors = 3 (default)
      • executors = 10
      • executors = 30
    • Spark streaming job trigger interval
      • default
      • 10 mins
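The configuration dimensions above map to standard Spark submission properties. A rough sketch of how one combination might be expressed (the script name is a placeholder; the issue does not specify the actual submission command):

```shell
# One cell of the configuration matrix: dynamic allocation on,
# 3 initial executors, capped at the largest tested size (30).
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  --conf spark.executor.instances=3 \
  auto_refresh_job.py
```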

Measurement

  • Latency: p90 and p75 latency. Latency is the time difference between the moment data is produced at the source (PUT object on S3) and the moment the data has produced an output.
  • Cost: p90 and p75 of billed resource utilization.
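The p90/p75 summaries above can be computed with a simple nearest-rank percentile; this helper and the sample latencies are illustrative, not from the issue:

```python
def percentile(values, p):
    """Nearest-rank percentile (simple illustrative helper)."""
    s = sorted(values)
    k = max(0, round(p / 100 * len(s)) - 1)
    return s[k]

# Hypothetical end-to-end latencies (ms) collected during one benchmark run.
latencies_ms = [1200, 950, 3100, 780, 2400, 1600, 880, 2050, 1400, 990]
print("p75:", percentile(latencies_ms, 75))  # p75: 2050
print("p90:", percentile(latencies_ms, 90))  # p90: 2400
```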

Latency

We measure event-time latency.

For the skipping index, we define event-time latency to be the interval between a file’s event-time and its emission time from the output operator.

  1. The generator appends eventTime to the filename.
  2. The streaming system calculates latency = processTime - extractTimeFromFileName.
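The two steps above might look like the following; the filename format (`part-<epochMillis>.json`) is an assumption, since the issue only says the generator embeds eventTime in the filename:

```python
from datetime import datetime, timezone

def latency_from_filename(filename: str, process_time: datetime) -> float:
    """Extract the generator-embedded eventTime (assumed format:
    part-<epochMillis>.json) and return latency in seconds."""
    epoch_millis = int(filename.rsplit("-", 1)[1].split(".")[0])
    event_time = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return (process_time - event_time).total_seconds()

now = datetime.fromtimestamp(1695230000, tz=timezone.utc)
print(latency_from_filename("part-1695229990000.json", now))  # 10.0
```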

For the covering index, we define event-time latency to be the interval between a tuple’s event-time and its emission time from the output operator.

  1. The generator appends eventTime to each tuple.
  2. The streaming system calculates latency = processTime - eventTime.
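For the per-tuple case, a minimal sketch (the `eventTime` field name matches the issue; representing tuples as dicts is an assumption):

```python
def tuple_latencies(batch, process_time_ms):
    """Covering-index case: latency = processTime - eventTime per tuple."""
    return [process_time_ms - t["eventTime"] for t in batch]

batch = [{"eventTime": 1000}, {"eventTime": 1500}]
print(tuple_latencies(batch, 2000))  # [1000, 500]
```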

For the MV, we define event-time latency to be the interval between a tuple’s event-time and its emission time from the output operator.

  1. The generator appends eventTime to each tuple.
  2. The streaming system re-calculates eventTime = max(eventTime of the tuples contributing to the window).
  3. latency = processingTime - eventTime
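The MV steps can be sketched as follows (tuple representation is again an assumption):

```python
def window_latency(window_tuples, process_time_ms):
    """MV case: re-derive the window's eventTime as the max eventTime
    contributing to the window, then latency = processingTime - eventTime."""
    event_time = max(t["eventTime"] for t in window_tuples)
    return process_time_ms - event_time

window = [{"eventTime": 1000}, {"eventTime": 1800}, {"eventTime": 1200}]
print(window_latency(window, 2500))  # 700
```

Taking the max of contributing event times makes the measurement conservative: the reported latency is that of the freshest tuple in the window, so older tuples in the same window have waited at least that long.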