Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stackdriver: Add latency metric for write log entries HTTP request. #9182

Merged
merged 1 commit into from
Aug 13, 2024

Conversation

shuaich
Copy link
Contributor

@shuaich shuaich commented Aug 9, 2024

Add metric to measure the round trip latency for write entries HTTP request to Cloud Logging.

This is an observability improvement.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
New metric is exposed to v2 metrics endpoint.

curl -s http://127.0.0.1:2020/api/v2/metrics/prometheus

# HELP fluentbit_stackdriver_write_entries_latency Latency of Cloud Logging WriteLogEntries HTTP request.
# TYPE fluentbit_stackdriver_write_entries_latency histogram
fluentbit_stackdriver_write_entries_latency_bucket{le="1.0",name="stackdriver-cp-v3-container"} 2326
fluentbit_stackdriver_write_entries_latency_bucket{le="2.0",name="stackdriver-cp-v3-container"} 2329
fluentbit_stackdriver_write_entries_latency_bucket{le="4.0",name="stackdriver-cp-v3-container"} 2329
fluentbit_stackdriver_write_entries_latency_bucket{le="8.0",name="stackdriver-cp-v3-container"} 2329
fluentbit_stackdriver_write_entries_latency_bucket{le="16.0",name="stackdriver-cp-v3-container"} 2329
fluentbit_stackdriver_write_entries_latency_bucket{le="32.0",name="stackdriver-cp-v3-container"} 2329
fluentbit_stackdriver_write_entries_latency_bucket{le="64.0",name="stackdriver-cp-v3-container"} 2329
fluentbit_stackdriver_write_entries_latency_bucket{le="+Inf",name="stackdriver-cp-v3-container"} 0
fluentbit_stackdriver_write_entries_latency_sum{name="stackdriver-cp-v3-container"} 261.11756087094545
fluentbit_stackdriver_write_entries_latency_count{name="stackdriver-cp-v3-container"} 2329
# HELP fluentbit_stackdriver_write_entries_latency Latency of Cloud Logging WriteLogEntries HTTP request.
# TYPE fluentbit_stackdriver_write_entries_latency histogram
fluentbit_stackdriver_write_entries_latency_bucket{le="1.0",name="stackdriver-cp-v3-gce_instance"} 762
fluentbit_stackdriver_write_entries_latency_bucket{le="2.0",name="stackdriver-cp-v3-gce_instance"} 762
fluentbit_stackdriver_write_entries_latency_bucket{le="4.0",name="stackdriver-cp-v3-gce_instance"} 762
fluentbit_stackdriver_write_entries_latency_bucket{le="8.0",name="stackdriver-cp-v3-gce_instance"} 762
fluentbit_stackdriver_write_entries_latency_bucket{le="16.0",name="stackdriver-cp-v3-gce_instance"} 762
fluentbit_stackdriver_write_entries_latency_bucket{le="32.0",name="stackdriver-cp-v3-gce_instance"} 762
fluentbit_stackdriver_write_entries_latency_bucket{le="64.0",name="stackdriver-cp-v3-gce_instance"} 762
fluentbit_stackdriver_write_entries_latency_bucket{le="+Inf",name="stackdriver-cp-v3-gce_instance"} 0
fluentbit_stackdriver_write_entries_latency_sum{name="stackdriver-cp-v3-gce_instance"} 84.148241396993399
fluentbit_stackdriver_write_entries_latency_count{name="stackdriver-cp-v3-gce_instance"} 762
  • Example configuration file for the change
    [N/A]
  • Debug log output from testing the change
    [N/A]
  • Attached Valgrind output that shows no leaks or memory corruption was found
cmake -DFLB_DEV=On -DFLB_TESTS_RUNTIME=On -DFLB_TESTS_INTERNAL=On -DFLB_VALGRIND=On ../ && make -j 8

valgrind ./bin/flb-rt-out_stackdriver

SUCCESS: All unit tests have passed.
==1394023==
==1394023== HEAP SUMMARY:
==1394023==     in use at exit: 56 bytes in 1 blocks
==1394023==   total heap usage: 290,010 allocs, 290,009 frees, 126,380,731 bytes allocated
==1394023==
==1394023== LEAK SUMMARY:
==1394023==    definitely lost: 0 bytes in 0 blocks
==1394023==    indirectly lost: 0 bytes in 0 blocks
==1394023==      possibly lost: 0 bytes in 0 blocks
==1394023==    still reachable: 56 bytes in 1 blocks
==1394023==         suppressed: 0 bytes in 0 blocks
==1394023== Rerun with --leak-check=full to see details of leaked memory
==1394023==
==1394023== For lists of detected and suppressed errors, rerun with: -s
==1394023== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
    [N/A]
  • Set ok-package-test label to test for all targets (requires maintainer to do).
    [N/A]

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper
Copy link
Member

edsiper commented Aug 10, 2024

@braydonk from the Google team, who should review this one ?

@edsiper edsiper added this to the Fluent Bit v3.1.6 milestone Aug 10, 2024
@braydonk
Copy link
Contributor

I am back from vacation now and will review it today.

@shuaich
Copy link
Contributor Author

shuaich commented Aug 12, 2024

Discussed with @braydonk on adding an end to end latency metric that measures processing latency from the time when records are read by the input plugin. One gap is there is no read timestamp in the chunk metadata.

I believe the capability of measuring end to end(input plugin -> output plugin) processing latency is very helpful and welcome. I would like to start a separate discussing thread on this topic.

What do you think? @edsiper @braydonk

@@ -34,6 +34,8 @@
#include <fluent-bit/flb_log_event_decoder.h>
#include <fluent-bit/flb_gzip.h>

#include <cmetrics/cmt_histogram.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this needs to be included in stackdriver.c and stackdriver_conf.c, let's include it in stackdriver.h since that's already included in both spots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -2994,6 +3006,7 @@ static void cb_stackdriver_flush(struct flb_event_chunk *event_chunk,
#ifdef FLB_HAVE_METRICS
if (ret_code == FLB_OK) {
cmt_counter_inc(ctx->cmt_successful_requests, ts, 1, (char *[]) {name});
cmt_histogram_observe(ctx->cmt_write_entries_latency, ts, write_entries_latency, 1, (char *[]) {name});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is a chance that write_entries_latency can be 0.0, due to the if statement on line 2937, perhaps this should also be guarded by a check that write_entries_latency is greater than 0.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@braydonk
Copy link
Contributor

I would like to start a separate discussing thread on this topic.

@shuaich Yes I think you should open an issue to discuss this feature request.

@braydonk
Copy link
Contributor

You will also need to address the DCO check. Please ensure that all of your commits are signed. This is the guide I always refer to when I forget and need to resign old commits: https://hackmd.io/@James-Ebert/HyYOcRAXo

Signed-off-by: shuaichen <shuaichen@google.com>
@braydonk braydonk merged commit c7c0c09 into fluent:master Aug 13, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants