
Derived dataset auto refresh benchmark #38

penghuo opened this issue Sep 20, 2023 · 0 comments
Labels: enhancement (New feature or request)


Goals

  • Size the cluster needed to auto refresh a derived dataset
  • Measure the latency of auto refreshing a derived dataset
  • Measure the cost of auto refreshing a derived dataset

Test Plan

Dimensions

  • Queries
    • Auto refresh Skipping Index
    • Auto refresh Covering Index
    • Auto refresh Materialized View
  • Dataset: the http_logs dataset
  • Ingestion
    • Ingestion traffic patterns
      • StreamingIngestion: 10 files per 10 seconds, each file 1MB (24MB uncompressed) - streaming ingestion case
      • BatchIngestion: 10 files per 30 minutes, each file 100MB - log case
  • Configuration
    • Streaming source mode
      • Pull mode: MicroBatch pulls files from S3.
      • Push mode: MicroBatch reads from SQS/SNS. (TBD)
    • Job with dynamicAllocation enabled / disabled
    • Job with different executor configurations
      • executors = 3 (default)
      • executors = 10
      • executors = 30
    • Spark streaming job trigger interval
      • default
      • 10 mins
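The configuration dimensions above map to standard Spark submission properties. A rough sketch of how one combination might be expressed (the script name is a placeholder; the issue does not specify the actual submission command):

```shell
# One cell of the configuration matrix: dynamic allocation on,
# 3 initial executors, capped at the largest tested size (30).
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  --conf spark.executor.instances=3 \
  auto_refresh_job.py
```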

Measurement

  • Latency: p90 and p75 latency. Latency is the time difference between the moment data is produced at the source (PUT object on S3) and the moment the data has produced an output.
  • Cost: p90 and p75 of billed resource utilization.
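The p90/p75 summaries above can be computed with a simple nearest-rank percentile; this helper and the sample latencies are illustrative, not from the issue:

```python
def percentile(values, p):
    """Nearest-rank percentile (simple illustrative helper)."""
    s = sorted(values)
    k = max(0, round(p / 100 * len(s)) - 1)
    return s[k]

# Hypothetical end-to-end latencies (ms) collected during one benchmark run.
latencies_ms = [1200, 950, 3100, 780, 2400, 1600, 880, 2050, 1400, 990]
print("p75:", percentile(latencies_ms, 75))  # p75: 2050
print("p90:", percentile(latencies_ms, 90))  # p90: 2400
```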

Latency

We measure event-time latency.

For the skipping index, we define event-time latency to be the interval between a file’s event-time and its emission time from the output operator.

  1. The generator appends eventTime to the filename.
  2. The streaming system calculates latency = processTime - extractTimeFromFileName.
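The two steps above might look like the following; the filename format (`part-<epochMillis>.json`) is an assumption, since the issue only says the generator embeds eventTime in the filename:

```python
from datetime import datetime, timezone

def latency_from_filename(filename: str, process_time: datetime) -> float:
    """Extract the generator-embedded eventTime (assumed format:
    part-<epochMillis>.json) and return latency in seconds."""
    epoch_millis = int(filename.rsplit("-", 1)[1].split(".")[0])
    event_time = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return (process_time - event_time).total_seconds()

now = datetime.fromtimestamp(1695230000, tz=timezone.utc)
print(latency_from_filename("part-1695229990000.json", now))  # 10.0
```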

For the covering index, we define event-time latency to be the interval between a tuple’s event-time and its emission time from the output operator.

  1. The generator appends eventTime to each tuple.
  2. The streaming system calculates latency = processTime - eventTime.
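For the per-tuple case, a minimal sketch (the `eventTime` field name matches the issue; representing tuples as dicts is an assumption):

```python
def tuple_latencies(batch, process_time_ms):
    """Covering-index case: latency = processTime - eventTime per tuple."""
    return [process_time_ms - t["eventTime"] for t in batch]

batch = [{"eventTime": 1000}, {"eventTime": 1500}]
print(tuple_latencies(batch, 2000))  # [1000, 500]
```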

For the MV, we define event-time latency to be the interval between a tuple’s event-time and its emission time from the output operator.

  1. The generator appends eventTime to each tuple.
  2. The streaming system re-calculates eventTime = max(eventTime of the tuples contributing to the window).
  3. latency = processingTime - eventTime
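The MV steps can be sketched as follows (tuple representation is again an assumption):

```python
def window_latency(window_tuples, process_time_ms):
    """MV case: re-derive the window's eventTime as the max eventTime
    contributing to the window, then latency = processingTime - eventTime."""
    event_time = max(t["eventTime"] for t in window_tuples)
    return process_time_ms - event_time

window = [{"eventTime": 1000}, {"eventTime": 1800}, {"eventTime": 1200}]
print(window_latency(window, 2500))  # 700
```

Taking the max of contributing event times makes the measurement conservative: the reported latency is that of the freshest tuple in the window, so older tuples in the same window have waited at least that long.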