in_emitter: Fix single record chunks and respect mem_buf_limit pause #8473

Merged (6 commits) on Apr 12, 2024

Conversation

@drbugfinder-work (Contributor) commented Feb 9, 2024

This PR covers two issues:

  1. This PR corrects the behavior when mem_buf_limit is reached. The plugin used to accept further records even after being paused, which leads to "Potential log loss during high load at Multiline & Rewrite Tag Filter (in_emitter)" #8198.

  2. The current in_emitter implementation creates only one record per chunk, which is suboptimal. This PR fixes the collector handling.

Please refer to line:

return do_in_emitter_add_record(ec, in);
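
For orientation, here is a minimal sketch of the collector-based batching that this direct append bypasses. The struct, the queue layout, and the 50 ms interval are assumptions for illustration, not the plugin's actual code; in the real plugin the append work is done by do_in_emitter_add_record().

    /* Sketch only: queue records per tag and flush the queue from a time
     * collector, so one chunk can carry many records. */
    #include <fluent-bit/flb_input_plugin.h>
    #include <fluent-bit/flb_sds.h>

    struct emitter_msg {
        flb_sds_t tag;                 /* destination tag */
        char *buf;                     /* accumulated msgpack payload */
        size_t buf_size;
        struct mk_list _head;
    };

    static int cb_flush_queue(struct flb_input_instance *in,
                              struct flb_config *config, void *in_context)
    {
        struct mk_list *queue = in_context;   /* hypothetical queue in ctx */
        struct mk_list *head;
        struct mk_list *tmp;
        struct emitter_msg *msg;

        mk_list_foreach_safe(head, tmp, queue) {
            msg = mk_list_entry(head, struct emitter_msg, _head);
            /* this is where the accumulated payload would be appended,
             * e.g. by do_in_emitter_add_record() in the real plugin */
            mk_list_del(&msg->_head);
            flb_sds_destroy(msg->tag);
            flb_free(msg->buf);
            flb_free(msg);
        }
        return 0;
    }

    /* at init time: flush the queue periodically (interval illustrative) */
    /* flb_input_set_collector_time(in, cb_flush_queue, 0, 50000000, config); */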

To validate this observation, please use:

Check the output:

[2024/02/09 15:58:22] [trace] [filter:multiline:multiline.0 at /Users/root/Development/fluent-bit-in_emitter/fluent-bit/plugins/filter_multiline/ml.c:177] emitting from dummy.0 to dummy.0
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] did not use ring-buffer
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] Number of existing chunks: 0
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] no candidate chunk found, create a new one for tag: dummy.0
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] new chunk created for tag: dummy.0
[2024/02/09 15:58:22] [trace] [filter:multiline:multiline.0 at /Users/root/Development/fluent-bit-in_emitter/fluent-bit/plugins/filter_multiline/ml.c:740] not processing records from the emitter
[2024/02/09 15:58:22] [debug] [input chunk] update output instances with new chunk size diff=205, records=1, input=emitter_for_multiline.0
[2024/02/09 15:58:22] [trace] [filter:multiline:multiline.0 at /Users/root/Development/fluent-bit-in_emitter/fluent-bit/plugins/filter_multiline/ml.c:177] emitting from dummy.0 to dummy.0
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] did not use ring-buffer
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] Number of existing chunks: 0
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] no candidate chunk found, create a new one for tag: dummy.0
[2024/02/09 15:58:22] [debug] [input:emitter:emitter_for_multiline.0] new chunk created for tag: dummy.0
[2024/02/09 15:58:22] [trace] [filter:multiline:multiline.0 at /Users/root/Development/fluent-bit-in_emitter/fluent-bit/plugins/filter_multiline/ml.c:740] not processing records from the emitter

Full logs here:

I'm uncertain whether this modification is the correct approach to resolve the issue, as it appears that the ring buffer is unused; this PR partially reverts some parts to an older version of the code.
However, the change does address the behavior of only one record being generated per chunk.

Please also see the valgrind output:
https://gist.github.com/drbugfinder-work/83352c59799659db9741f33a22083eaa


Other comments, not directly related to this PR:

Additionally, the entire file appears to need restructuring: some comments are inaccurate, and DEFAULT_EMITTER_RING_BUFFER_FLUSH_FREQUENCY, which I think should represent a time duration, is instead used as a buffer size:

ctx->ring_buffer_size = DEFAULT_EMITTER_RING_BUFFER_FLUSH_FREQUENCY;

The scheduler seems to be empty at this point, so the ring buffer will never be started (as confirmed by another debug output):

if (scheduler != config->sched &&
    scheduler != NULL &&
    ctx->ring_buffer_size == 0) {
    ctx->ring_buffer_size = DEFAULT_EMITTER_RING_BUFFER_FLUSH_FREQUENCY;
    flb_plg_debug(in,
                  "threaded emitter instances require ring_buffer_size"
                  " being set, using default value of %u",
                  ctx->ring_buffer_size);
}
if (ctx->ring_buffer_size > 0) {
    ret = in_emitter_start_ring_buffer(in, ctx);
    if (ret == -1) {
        flb_free(ctx);
        return -1;
    }
}
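
For contrast, here is a hedged sketch of what the constant would look like if it really were a flush frequency, i.e. a time interval fed to a collector timer rather than a ring-buffer capacity. The interval value and the helper name are assumptions, not a proposed patch.

    /* Illustration only: treat a "flush frequency" as time handed to a
     * collector timer, instead of using it as a buffer size. */
    #define EMITTER_FLUSH_INTERVAL_MS  250    /* hypothetical value */

    static int emitter_register_flush(struct flb_input_instance *in,
                                      struct flb_config *config,
                                      int (*cb)(struct flb_input_instance *,
                                                struct flb_config *, void *))
    {
        time_t sec  = EMITTER_FLUSH_INTERVAL_MS / 1000;
        long   nsec = (EMITTER_FLUSH_INTERVAL_MS % 1000) * 1000000L;

        /* schedule cb() to run every sec + nsec */
        return flb_input_set_collector_time(in, cb, sec, nsec, config);
    }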


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@drbugfinder-work (Contributor, Author) commented:

@edsiper @nokute78 @leonardo-albertovich @pwhelan
Can you please look at this?

@drbugfinder-work drbugfinder-work changed the title in_emitter: Rollback collector use to fix single record chunks in_emitter: Rollback collector use to fix single record chunks and respect mem_buf_limit pause Feb 12, 2024
…uf_limit

The current code creates a situation where only one record per chunk
is created. In case of a non-existing ring buffer, the old mechanism is used.

Also, the in_emitter plugin continued to accept records even after the
configured emitter_mem_buf_limit was reached. This commit implements a
check whether the plugin is paused and returns accordingly.

Signed-off-by: Richard Treu <richard.treu@sap.com>
@drbugfinder-work (Contributor, Author) commented Feb 12, 2024

@edsiper @nokute78 @agup006

I further noticed that in_emitter continues to accept records even after reaching the mem_buf_limit and being paused. I added a check whether the plugin is paused.

This PR should now fix:
#8198
#4049
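
As a rough illustration of that check, the sketch below shows the idea: the context keeps a paused flag that the pause/resume callbacks toggle, and the append path refuses new records while it is set. The context layout and function names are hypothetical, not the exact patch.

    /* Sketch with hypothetical names: honor the paused state on append. */
    struct flb_emitter_ctx {
        int paused;                        /* toggled by the callbacks below */
        struct flb_input_instance *ins;
    };

    static void cb_emitter_pause(void *data, struct flb_config *config)
    {
        struct flb_emitter_ctx *ctx = data;
        ctx->paused = FLB_TRUE;
    }

    static void cb_emitter_resume(void *data, struct flb_config *config)
    {
        struct flb_emitter_ctx *ctx = data;
        ctx->paused = FLB_FALSE;
    }

    static int emitter_add_record(struct flb_emitter_ctx *ctx,
                                  const char *tag, size_t tag_len,
                                  const void *buf, size_t buf_size)
    {
        if (ctx->paused == FLB_TRUE) {
            /* mem_buf_limit reached: reject instead of buffering more */
            return -1;
        }
        /* ... hand the record to the emitter's normal append routine ... */
        return 0;
    }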

@bdeetz commented Mar 27, 2024

While rolling Fluent Bit out in my environment, we ran into this bug. We have built and run Fluent Bit off this PR with no sign of failure. It has been moving 10+ TB/day of logs for 12 days across 500-1500 collectors. We would very much like to see a fix, if not this one, introduced in order to avoid log loss.

@edsiper (Member) commented Apr 11, 2024

@drbugfinder-work thanks for triaging and fixing this. In order to merge it, would you please split the commits per component? e.g.:

  • filter_multiline: ...
  • filter_rewrite_tag: ...
  • ..

thanks

This commit will pause the inputs (sending to multiline)
so as not to lose any in-flight records.

Signed-off-by: Richard Treu <richard.treu@sap.com>
This commit will pause the inputs (sending to rewrite_tag)
so as not to lose any in-flight records.

Signed-off-by: Richard Treu <richard.treu@sap.com>
This commit will pause all known inputs (sending to multiline)
so as not to lose any in-flight records. in_emitter will keep track
of all sending input plugins and actively pause/resume them
whenever in_emitter itself is paused/resumed.

Signed-off-by: Richard Treu <richard.treu@sap.com>
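
A rough sketch of the mechanism this commit describes is shown below. The struct, the helper names, and the commented pause/resume calls are placeholders standing in for whatever core API the plugin actually uses; the point is only that in_emitter remembers its senders and forwards its own pause/resume state to them.

    /* Sketch with placeholder names: track senders, propagate pause state. */
    struct emitter_sender {
        struct flb_input_instance *ins;   /* an input that emits into us */
        struct mk_list _head;
    };

    static int emitter_track_sender(struct mk_list *senders,
                                    struct flb_input_instance *ins)
    {
        struct emitter_sender *s;

        s = flb_malloc(sizeof(struct emitter_sender));
        if (!s) {
            return -1;
        }
        s->ins = ins;
        mk_list_add(&s->_head, senders);
        return 0;
    }

    /* called from in_emitter's own pause/resume callbacks */
    static void emitter_set_senders_state(struct mk_list *senders, int pause)
    {
        struct mk_list *head;
        struct emitter_sender *s;

        mk_list_foreach(head, senders) {
            s = mk_list_entry(head, struct emitter_sender, _head);
            if (pause) {
                /* pause s->ins (stand-in for the core pause call) */
            }
            else {
                /* resume s->ins (stand-in for the core resume call) */
            }
        }
    }
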
This commit will add a resume message when a paused
input plugin is resumed.

Signed-off-by: Richard Treu <richard.treu@sap.com>
…line

This commit will add a test for the pause functionality of in_emitter. The test
uses a small emitter buffer size, so in_emitter is guaranteed to be paused.

Signed-off-by: Richard Treu <richard.treu@sap.com>
@drbugfinder-work (Contributor, Author) commented:

@drbugfinder-work thanks for triaging and fixing this. In order to merge it, would you please split the commits per component? e.g.:

  • filter_multiline: ...
  • filter_rewrite_tag: ...
  • ..

thanks

@edsiper Done! I've split the commits according to components.
