
Scoping issue using Spring Cloud Stream and Micrometer 1.10.9 #4002

Closed
davidmelia opened this issue Jul 27, 2023 · 11 comments
Labels: for: external-project (For an external project and not something we can fix), invalid (An issue that we don't feel is valid)

Comments

davidmelia commented Jul 27, 2023

Describe the bug
I was advised in spring-projects/spring-boot#36592 to raise this issue here. In a nutshell, I upgraded Micrometer from 1.10.8 to 1.10.9 and found that trace information disappears when using a Kafka consumer and a Reactor thread pool.

To Reproduce
https://github.com/davidmelia/spring-boot-kafka-consumer-tracing/tree/boot_3.0.9_trace_issue illustrates this issue (you will need Kafka and a topic called 'test')

Using <micrometer.version>1.10.8</micrometer.version> and hitting http://localhost:8080/sendMessage you get:

...  INFO [aid=spring-boot-kafka-consumer-tracing,tid=64c14942455d1a27813d690aedf14ed8,sid=6efe80eaff5b8002,cusname=] ... Kafka Binder: This log statement has the trace id
...  INFO [aid=spring-boot-kafka-consumer-tracing,tid=64c14942455d1a27813d690aedf14ed8,sid=6efe80eaff5b8002,cusname=] ... Kafka Binder: This log statement also has the trace id in boot 3.0.8 but NOT in boot 3.0.9

which is great.

If you remove <micrometer.version>1.10.8</micrometer.version> (so the default 1.10.9 is used), the final tid disappears and tracing is lost:

... INFO [aid=spring-boot-kafka-consumer-tracing,tid=64c149bb4911ade46ea6c4069bacf4a8,sid=d2c49fa012ff0dcc,cusname=] ... Kafka Binder: This log statement also has the trace id in micrometer 1.10.8 but NOT in micrometer 1.10.9
... INFO [aid=spring-boot-kafka-consumer-tracing,tid=,sid=,cusname=] ... Kafka Binder: This log statement also has the trace id in micrometer 1.10.8 but NOT in micrometer 1.10.9

The consumer is quite simple:

@Bean
Function<Flux<String>, Mono<Void>> testMessagesKafkaBinder() {
  return events -> events.flatMapSequential(event -> {
    log.info("Kafka Binder: This log statement has the trace id");
    return Mono.just("OK")
        .delayElement(Duration.ofMillis(10))
        .doOnSuccess(r -> log.info("Kafka Binder: This log statement also has the trace id in micrometer 1.10.8 but NOT in micrometer 1.10.9"));
  }, 1).onErrorResume(ex -> {
    log.info("Error", ex);
    return Mono.empty();
  }).then();
}

Note: if delayElement is removed from the Kafka consumer, everything works fine.

Expected behavior
The log context behaviour (trace id in the MDC) is preserved.

Additional context
N.B. upgrading to <spring-cloud.version>2022.0.3</spring-cloud.version> also breaks this.


maslailleron commented Jul 31, 2023

I have a similar problem with version 1.10.9, described in spring-projects/spring-boot#36652. I wonder why these kinds of breaking changes are made without changing the minor version number.

jonatan-ivanov (Member) commented Jul 31, 2023

If I understand this correctly, this is not a breaking change in the source or on the API level but seems to be a bug/regression, right?
Could you please try the following things?

  1. Could you please try setting ObservationRegistry on the ObservationThreadLocalAccessor? (A sketch is shown after this list.)
  2. Could you please try to upgrade to 1.10.10-SNAPSHOT? See: Removes the need to set OR on OTLA #3986
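
For suggestion 1, wiring the registry into the accessor might look roughly like this in a Spring Boot app (a sketch, not a confirmed fix: the ObservationConfig class and the InitializingBean wiring are illustrative, and it assumes ObservationThreadLocalAccessor.getInstance().setObservationRegistry(...) is available in the Micrometer version in use):

import io.micrometer.observation.ObservationRegistry;
import io.micrometer.observation.contextpropagation.ObservationThreadLocalAccessor;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class ObservationConfig {

  // Point the thread-local accessor at the application's ObservationRegistry so that
  // context propagation restores Observation scopes against the right registry.
  @Bean
  InitializingBean observationThreadLocalAccessorInitializer(ObservationRegistry observationRegistry) {
    return () -> ObservationThreadLocalAccessor.getInstance().setObservationRegistry(observationRegistry);
  }
}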

jonatan-ivanov (Member):

I tried out your reproducer with the latest snapshot of 1.10.10 and it seems that one is also broken; we need to investigate what could have caused this.

jonatan-ivanov added the regression label (A regression from a previous release) and removed the waiting-for-triage label on Jul 31, 2023
jonatan-ivanov added this to the 1.10.x milestone on Jul 31, 2023
jonatan-ivanov (Member):

As a workaround, I was able to make this work by forcing context propagation, i.e. creating a new Observation for the Mono with .tap:

<dependency>
    <groupId>io.projectreactor</groupId>
    <artifactId>reactor-core-micrometer</artifactId>
</dependency>
.tap(Micrometer.observation(observationRegistry))
.doOnSuccess(...)

/cc @chemicL
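
Applied to the reproducer's consumer, the workaround might look roughly like this (a sketch, assuming the Spring-managed ObservationRegistry bean can be injected into the @Bean method and that log is the class's logger as in the reproducer):

import java.time.Duration;
import java.util.function.Function;

import io.micrometer.observation.ObservationRegistry;
import org.springframework.context.annotation.Bean;
import reactor.core.observability.micrometer.Micrometer;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Bean
Function<Flux<String>, Mono<Void>> testMessagesKafkaBinder(ObservationRegistry observationRegistry) {
  return events -> events.flatMapSequential(event -> {
    log.info("Kafka Binder: This log statement has the trace id");
    return Mono.just("OK")
        .delayElement(Duration.ofMillis(10))
        // Wrap the Mono in its own Observation so the tracing context is restored
        // on the scheduler thread that runs the downstream doOnSuccess.
        .tap(Micrometer.observation(observationRegistry))
        .doOnSuccess(r -> log.info("Kafka Binder: this log statement has the trace id again"));
  }, 1).onErrorResume(ex -> {
    log.info("Error", ex);
    return Mono.empty();
  }).then();
}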

jonatan-ivanov (Member) commented Jul 31, 2023

I think we might have an ordering issue (it might be the instrumentation). If I add an ObservationTextPublisher to check what happens under the hood, I see that the scope is closed (and the observation is stopped) before doOnSuccess executes (checked with a debugger too).

2023-07-31T13:42:05.975-07:00  INFO [aid=spring-boot-kafka-consumer-tracing,tid=64c81c9d920868bc87f160b2f1ec8f44,sid=9e38d2571dc2aa2f,cusname=] 92165 --- [container-0-C-1] c.e.demo.consumer.TestMessagesConsumer   : Kafka Binder: This log statement has the trace id
CLOSE - name='spring.kafka.listener', contextualName='null', error='null', lowCardinalityKeyValues=[spring.kafka.listener.id='KafkaConsumerDestination{consumerDestinationName='test', partitions=0, dlqName='null'}.container-0'], highCardinalityKeyValues=[], map=[class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=26.87142415, duration(nanos)=2.687142415E10, startTimeNanos=161101910417832}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@7a4f1d08', class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c81c9d920868bc87f160b2f1ec8f44/9e38d2571dc2aa2f}'], parentObservation=null
 STOP - name='spring.kafka.listener', contextualName='test receive', error='null', lowCardinalityKeyValues=[spring.kafka.listener.id='KafkaConsumerDestination{consumerDestinationName='test', partitions=0, dlqName='null'}.container-0'], highCardinalityKeyValues=[], map=[class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=161101910417832}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@7a4f1d08', class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c81c9d920868bc87f160b2f1ec8f44/9e38d2571dc2aa2f}'], parentObservation=null
2023-07-31T13:42:32.849-07:00  INFO [aid=spring-boot-kafka-consumer-tracing,tid=,sid=,cusname=] 92165 --- [     parallel-2] c.e.demo.consumer.TestMessagesConsumer   : Kafka Binder: This log statement also has the trace id in micrometer 1.10.8 but NOT in micrometer 1.10.9
(No other log statements after this.)

This is a problem: not being in a Scope means the MDC values are not set (that is where log statements get the TraceId/SpanId).

But if I .tap into the Mono, I see that the scope is closed (and the observation is stopped) after executing doOnSuccess (checked with a debugger too).

2023-07-31T13:44:25.228-07:00  INFO [aid=spring-boot-kafka-consumer-tracing,tid=64c81d07987ca0b82972ea82f9afe27c,sid=fc6a7a857d9f4245,cusname=] 92815 --- [     parallel-2] c.e.demo.consumer.TestMessagesConsumer   : Kafka Binder: This log statement also has the trace id in micrometer 1.10.8 but NOT in micrometer 1.10.9
CLOSE - name='spring.kafka.listener', contextualName='null', error='null', lowCardinalityKeyValues=[spring.kafka.listener.id='KafkaConsumerDestination{consumerDestinationName='test', partitions=0, dlqName='null'}.container-0'], highCardinalityKeyValues=[], map=[class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@7363130a', class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c81d07987ca0b82972ea82f9afe27c/e6fa8fde0fba123b}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=38.644287248, duration(nanos)=3.8644287248E10, startTimeNanos=161207459476651}'], parentObservation=null
 STOP - name='spring.kafka.listener', contextualName='test receive', error='null', lowCardinalityKeyValues=[spring.kafka.listener.id='KafkaConsumerDestination{consumerDestinationName='test', partitions=0, dlqName='null'}.container-0'], highCardinalityKeyValues=[], map=[class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@7363130a', class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c81d07987ca0b82972ea82f9afe27c/e6fa8fde0fba123b}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=161207459476651}'], parentObservation=null
CLOSE - name='http.server.requests', contextualName='http get /sendMessage', error='null', lowCardinalityKeyValues=[exception='none', method='GET', outcome='SUCCESS', status='200', uri='/sendMessage'], highCardinalityKeyValues=[http.url='/sendMessage'], map=[class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@93ee68', class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c81d07987ca0b82972ea82f9afe27c/2972ea82f9afe27c}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=161207450636472}'], parentObservation=null
CLOSE - name='reactor.observation', contextualName='reactor.observation', error='null', lowCardinalityKeyValues=[reactor.status='completed', reactor.type='Mono'], highCardinalityKeyValues=[], map=[class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@9c7807d', class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c81d07987ca0b82972ea82f9afe27c/fc6a7a857d9f4245}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=161221335239708}'], parentObservation=io.micrometer.observation.NoopObservation@19ab864a

This seems right: the scope should be closed after the user-provided function has finished its job.
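
For reference, the ObservationTextPublisher that produced the OPEN/STOP/CLOSE lines above can be registered roughly like this (a sketch; it relies on Spring Boot 3 auto-detecting ObservationHandler beans, and the ObservationDebugConfig class name is illustrative):

import io.micrometer.observation.ObservationTextPublisher;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class ObservationDebugConfig {

  // Logs every Observation lifecycle event (START/OPEN/CLOSE/STOP/ERROR), which makes it
  // easy to see on which thread, and in which order, scopes are opened and closed.
  @Bean
  ObservationTextPublisher observationTextPublisher() {
    return new ObservationTextPublisher();
  }
}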

jonatan-ivanov (Member):

@olegz Were there any changes in Spring Cloud Stream that could prevent the user-provided function from being in the scope of the Observation (spring.kafka.listener) that SC-Stream creates (see the reproducer)?

jonatan-ivanov (Member):

With Micrometer 1.10.8:

 STOP - name='spring.kafka.listener', contextualName='test receive', error='null', lowCardinalityKeyValues=[spring.kafka.listener.id='KafkaConsumerDestination{consumerDestinationName='test', partitions=0, dlqName='null'}.container-0'], highCardinalityKeyValues=[], map=[class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c8226e55d211b5b8c5b88ea6170d3a/71f01629fc75c0ce}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=162590562485605}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@eb9619b'], parentObservation=null
 OPEN - name='spring.kafka.listener', contextualName='test receive', error='null', lowCardinalityKeyValues=[spring.kafka.listener.id='KafkaConsumerDestination{consumerDestinationName='test', partitions=0, dlqName='null'}.container-0'], highCardinalityKeyValues=[], map=[class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c8226e55d211b5b8c5b88ea6170d3a/71f01629fc75c0ce}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=162590562485605}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@eb9619b'], parentObservation=null
2023-07-31T14:06:54.668-07:00  INFO [aid=spring-boot-kafka-consumer-tracing,tid=64c8226e55d211b5b8c5b88ea6170d3a,sid=71f01629fc75c0ce,cusname=] 1978 --- [     parallel-1] c.e.demo.consumer.TestMessagesConsumer   : Kafka Binder: This log statement also has the trace id in micrometer 1.10.8 but NOT in micrometer 1.10.9
CLOSE - name='spring.kafka.listener', contextualName='test receive', error='null', lowCardinalityKeyValues=[spring.kafka.listener.id='KafkaConsumerDestination{consumerDestinationName='test', partitions=0, dlqName='null'}.container-0'], highCardinalityKeyValues=[], map=[class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=64c8226e55d211b5b8c5b88ea6170d3a/71f01629fc75c0ce}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=162590562485605}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@eb9619b'], parentObservation=null
2023-07-31T14:06:55.303-07:00  WARN [aid=spring-boot-kafka-consumer-tracing,tid=,sid=,cusname=] 1978 --- [ender@1d6eb6aa}] i.m.o.c.ObservationThreadLocalAccessor   : Observation <{name=null(null), error=null, context=name='null', contextualName='null', error='null', lowCardinalityKeyValues=[], highCardinalityKeyValues=[], map=[], parentObservation=null}> to which we're restoring is not the same as the one set as this scope's parent observation <null> . Most likely a manually created Observation has a scope opened that was never closed. This may lead to thread polluting and memory leaks
2023-07-31T14:06:55.469-07:00  WARN [aid=spring-boot-kafka-consumer-tracing,tid=,sid=,cusname=] 1978 --- [ender@1d6eb6aa}] i.m.o.c.ObservationThreadLocalAccessor   : Observation <{name=null(null), error=null, context=name='null', contextualName='null', error='null', lowCardinalityKeyValues=[], highCardinalityKeyValues=[], map=[], parentObservation=null}> to which we're restoring is not the same as the one set as this scope's parent observation <null> . Most likely a manually created Observation has a scope opened that was never closed. This may lead to thread polluting and memory leaks

The scope is closed after the user-provided function finishes its job, but by then the Observation is also stopped, which I think should not happen. On the other hand, if this is intentional and the user-provided function is not part of the Observation, then not having the trace id in the logs is the right behavior and having it was a bug. Also, please see the two WARN messages at the end; I suspect this is an instrumentation issue/question in Spring Cloud Stream.

jonatan-ivanov (Member):

According to @garyrussell:

Reactive workloads (or any async handoff) are not supported by the underlying spring-kafka listener container. The observation is closed when the listener exits.

I think this means that this works as expected and the previous behavior was a bug (see the WARN messages). Could you please open a new issue for Spring Cloud Stream if you want this scenario to be supported? In the meantime, I think you can use my workaround with .tap.

jonatan-ivanov closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 31, 2023
jonatan-ivanov removed the regression label (A regression from a previous release) on Jul 31, 2023
jonatan-ivanov removed this from the 1.10.x milestone on Jul 31, 2023
jonatan-ivanov added the invalid (An issue that we don't feel is valid) and for: external-project (For an external project and not something we can fix) labels on Jul 31, 2023

maslailleron commented Jul 31, 2023

I have a similar problem with version 1.10.9 described in spring-projects/spring-boot#36652 without using any reactive stack. I don't understand "the previous behaviour was a bug" - for you maybe it is a bug but for me it works and now it doesn't.

Unfortunately, it is common that updating the Micrometer libraries brings problems whose cause is unclear. The functionality works, I update the library, which supposedly introduces only a small change (the last digit in the version number), and it stops working.

jonatan-ivanov (Member) commented Jul 31, 2023

I have a similar problem with version 1.10.9 described in spring-projects/spring-boot#36652 without using any reactive

I would say that if you are not using any reactive workload or async handoff, it is not similar; please open a new issue for it with a minimal reproducer.

I don't understand "the previous behaviour was a bug" - for you maybe it is a bug but for me it works and now it doesn't.

The scope-closing mechanism had an issue and the scope stayed open when it should not have, i.e. Spring Cloud Stream closed the scope but, because of a bug, it remained open. That issue is now fixed and the scope is really closed when Spring Cloud Stream closes it.
I understand that you were depending on this behavior (bug), but if you want it to be changed, please open a new issue in Spring Cloud Stream and ask them to include the user-provided function in the scope of the Observation that they create.

Unfortunately, it is common that updating the Micrometer libraries brings problems whose cause is unclear. The functionality works, I update the library, which supposedly introduces only a small change (the last digit in the version number), and it stops working.

Instrumenting Reactor is very tricky, so we needed some time to figure it out. I also feel you misunderstood semver: Semantic Versioning says that it is allowed to make backward-compatible bug fixes (no incompatible API changes) in the PATCH version (the last digit):

Bug fixes not affecting the API increment the patch version, backward compatible API additions/changes increment the minor version, and backward incompatible API changes increment the major version.

If we followed your logic, we would not be able to make any fixes in Micrometer, since any fix can change behavior that somebody depends on (even if that behavior is a bug). We try our best not to break source, binary, and behavioral compatibility, but we must change the behavior if it is a bug. In this case, see the WARN messages; they say this:

Most likely a manually created Observation has a scope opened that was never closed. This may lead to thread polluting and memory leaks

This indicates that a scope was never closed. This is the bug we fixed, and now you actually see the intentional behavior of Spring Cloud Stream, which is not to include the user-provided function in the scope of its Observation.
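
To illustrate what that warning is about: a manually created Observation should close its scope deterministically, roughly like this (a generic sketch of the Observation API, not Spring Cloud Stream's actual instrumentation; doWork() is a hypothetical placeholder):

import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;

void process(ObservationRegistry registry) {
  Observation observation = Observation.start("my.operation", registry);
  // openScope() populates the thread locals (and therefore the MDC trace/span ids);
  // try-with-resources guarantees the scope is closed on the same thread that opened it.
  try (Observation.Scope scope = observation.openScope()) {
    doWork(); // hypothetical placeholder for the instrumented work
  } catch (RuntimeException ex) {
    observation.error(ex);
    throw ex;
  } finally {
    observation.stop();
  }
}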

maslailleron:

@jonatan-ivanov thanks for your explanation.
