
Fix histogram consistency in PrometheusMeterRegistry #5193

Closed
rhatlapa opened this issue Jun 7, 2024 · 7 comments
Labels
bug (A general bug) · registry: prometheus (A Prometheus Registry related issue)
Milestone

Comments

@rhatlapa

rhatlapa commented Jun 7, 2024

As part of upgrading to Spring Boot 3.3 with Micrometer 1.13.0, we are seeing an issue similar to #4988; see the stack trace below. I wasn't able to create a separate reproducer, but I can confirm the issue is in Micrometer: when using Spring Boot 3.3 with Micrometer downgraded to 1.12.6, the issue is no longer visible.

Stacktrace:

j.l.IllegalArgumentException: Counts in ClassicHistogramBuckets cannot be negative.
	at i.p.m.m.s.ClassicHistogramBuckets.validate(ClassicHistogramBuckets.java:147)
	at i.p.m.m.s.ClassicHistogramBuckets.sortAndValidate(ClassicHistogramBuckets.java:108)
	at i.p.m.m.s.ClassicHistogramBuckets.of(ClassicHistogramBuckets.java:53)
	at i.m.p.PrometheusMeterRegistry.lambda$addDistributionStatisticSamples$29(PrometheusMeterRegistry.java:504)
	at i.m.p.MicrometerCollector.collect(MicrometerCollector.java:77)
	at i.p.m.m.r.PrometheusRegistry.scrape(PrometheusRegistry.java:72)
	at i.p.m.m.r.PrometheusRegistry.scrape(PrometheusRegistry.java:57)
	at o.s.b.a.m.e.p.PrometheusScrapeEndpoint.scrape(PrometheusScrapeEndpoint.java:59)
	at j.i.r.GeneratedMethodAccessor128.invoke(Unknown Source)
	at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at j.l.reflect.Method.invoke(Method.java:568)
	at o.s.u.ReflectionUtils.invokeMethod(ReflectionUtils.java:281)
	at o.s.b.a.e.i.r.ReflectiveOperationInvoker.invoke(ReflectiveOperationInvoker.java:74)
	at o.s.b.a.e.a.AbstractDiscoveredOperation.invoke(AbstractDiscoveredOperation.java:60)
	at o.s.b.a.e.w.s.AbstractWebMvcEndpointHandlerMapping$ServletWebOperationAdapter.handle(AbstractWebMvcEndpointHandlerMapping.java:327)
	at o.s.b.a.e.w.s.AbstractWebMvcEndpointHandlerMapping$OperationHandler.handle(AbstractWebMvcEndpointHandlerMapping.java:434)
	at j.i.r.GeneratedMethodAccessor110.invoke(Unknown Source)
	at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at j.l.reflect.Method.invoke(Method.java:568)
	at o.s.w.m.s.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:255)
	at o.s.w.m.s.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:188)
	at o.s.w.s.m.m.a.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
	at o.s.w.s.m.m.a.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:926)
	at o.s.w.s.m.m.a.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:831)
	at o.s.w.s.m.m.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
	at o.s.b.a.a.w.s.CompositeHandlerAdapter.handle(CompositeHandlerAdapter.java:58)
	at o.s.w.s.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
	at o.s.w.s.DispatcherServlet.doService(DispatcherServlet.java:979)
	at o.s.w.s.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
	at o.s.w.s.FrameworkServlet.doGet(FrameworkServlet.java:903)
	... 29 frames truncated

Note: our application is a standard REST application using Mongo & Feign clients to make requests against external endpoints. It also creates custom metrics with percentile histograms like this:

		meterRegistry.counter(counter_metric_name, tags.stream().collect(Collectors.toList())).increment();
		Timer.builder(timer_metric_name)
				.tags(tags.stream().collect(Collectors.toList()))
				.publishPercentileHistogram()
				.register(meterRegistry).record(duration);

Note: we use Micrometer with the Prometheus registry, running on Java 17.0.11.

I wasn't able to reproduce the issue outside of the production environment. Any suggestions for a fix or what could have caused it?

@knoobie

knoobie commented Jun 7, 2024

I just had the exact same problem. Interesting side note: the exception only occurred once while scraping for us, meaning the next scrape 15 seconds later did not have the problem anymore. Unlike my other reported issue, it does not break scraping completely.

@nadina-f

Hello,
After upgrading my project from Spring Boot 3.2.5 to 3.3.0 and Micrometer from 1.12.5 to 1.13.0, I have the same issue.
Theoretically, this issue should have been fixed in 1.13.0 as described in
d7b9d24, but it's not.

Do you have any plans to release soon, covering this bug as well?
Thank you!

@jonatan-ivanov
Member

Workaround

It seems this is only an issue with the Prometheus 1.x client; the 0.x client does not seem to have this issue. As a workaround, you can downgrade to the 0.x client by using micrometer-registry-prometheus-simpleclient instead of micrometer-registry-prometheus.
Please see the release notes of Micrometer 1.13.0 as well as the Migration Guide. In a nutshell: we upgraded micrometer-registry-prometheus from Prometheus Client 0.x to 1.x, but we also kept the old client as-is under a different artifact: micrometer-registry-prometheus-simpleclient.
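
For anyone instantiating the registry manually (outside Spring Boot auto-configuration), the swap looks roughly like the sketch below; the package names reflect my reading of the 1.13 migration notes and should be treated as assumptions, the Migration Guide is authoritative. Spring Boot users typically only need to swap the dependency.

// Sketch of the workaround: use the simpleclient-based module instead of the Prometheus-1.x-based one.
// Package names are assumptions -- verify against the Migration Guide.
//
// New module (micrometer-registry-prometheus, Prometheus Client 1.x):
//   io.micrometer.prometheusmetrics.PrometheusMeterRegistry
// Old module (micrometer-registry-prometheus-simpleclient, Prometheus Client 0.x):
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class SimpleclientWorkaround {
    public static void main(String[] args) {
        // Same Micrometer API surface; only the dependency and package change.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        System.out.println(registry.scrape());
    }
}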

Investigation Notes

Based on the additional data in #5223 (thanks to @VladimirZaitsev21), I think the issue is similar to #4988 but definitely a different one: in #4988, the issue was limited to LongTaskTimer, while this one happens for Timer (and I guess also for DistributionSummary).

Based on the data from #5223, it seems the issue is caused by an inconsistency between count and the last bucket:

HistogramSnapshot{
   count=575660,
   ...
   histogramCounts=[
       ...
       (575661.0 at 5.0E9)
   ]
}

As you can see here, count is 575660 while the last bucket is 575661.0. This should not happen: count should never be less than the last bucket. I'm not sure whether this situation can also occur with the Prometheus 0.x client (I think it can), but there it does not cause a problem since we don't need to calculate the +Inf bucket; with the Prometheus 1.x client we do, see:

if (Double.isFinite(histogramCounts[histogramCounts.length - 1].bucket())) {
    // ClassicHistogramBuckets is not cumulative
    buckets.add(Double.POSITIVE_INFINITY);
    counts.add(count - histogramCounts[histogramCounts.length - 1].count());
}
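
Plugging the numbers from the snapshot above into that calculation shows how the +Inf bucket goes negative. A minimal arithmetic sketch (not the actual registry code, class name is mine):

// Minimal sketch with the values from #5223; it only reproduces the +Inf bucket arithmetic.
public class InfBucketMath {
    public static void main(String[] args) {
        long count = 575660L;               // HistogramSnapshot count
        double lastFiniteBucket = 575661.0; // cumulative count of the last finite bucket (5.0E9)
        double infBucket = count - lastFiniteBucket;
        // Prints -1.0, which the 1.x client rejects with
        // "Counts in ClassicHistogramBuckets cannot be negative."
        System.out.println(infBucket);
    }
}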

If I had to guess, this is a timing/concurrency issue in the histogram implementation we have for Prometheus: count and the last bucket are sometimes inconsistent (e.g. a scrape happens during a recording, when the bucket has already been incremented but the counter has not).

@jonatan-ivanov
Member

I'm still investigating, but the issue seems to be what I described above (an inconsistency between count and buckets that was not surfaced with the 0.x client). Here's a reproducer if you want to play with it:

// Imports assume the Prometheus-1.x-based module (micrometer-registry-prometheus in 1.13+).
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HistogramDemo {
    static PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
    static Timer timer = Timer.builder("test").publishPercentileHistogram().register(registry);

    static ExecutorService scrapeExecutor = Executors.newFixedThreadPool(16);
    static ExecutorService recordExecutor = Executors.newFixedThreadPool(16);
    static CountDownLatch latch = new CountDownLatch(1);

    public static void main(String[] args) {
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 32; i++) {
            futures.add(scrapeExecutor.submit(HistogramDemo::scrape));
            futures.add(recordExecutor.submit(HistogramDemo::record));
        }
        System.out.println("Tasks submitted, releasing the Kraken...");
        latch.countDown();
        waitForFutures(futures);
        scrapeExecutor.shutdown();
        recordExecutor.shutdown();
        System.out.println(registry.scrape());
    }

    static void record() {
        waitForLatch();
        timer.record(Duration.ofMillis(100));
    }

    static void scrape() {
        waitForLatch();
        registry.scrape();
    }

    static void waitForLatch() {
        try {
            latch.await();
        }
        catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    static void waitForFutures(List<Future<?>> futures) {
        for (Future<?> future : futures) {
            try {
                future.get();
            }
            catch (Exception e) {
                future.cancel(true);
                System.out.println(e.getMessage());
//                e.printStackTrace();
            }
        }
    }
}

If I also print out the count and the buckets, I get negative values for the +Inf bucket.
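
For reference, here is a minimal sketch of how to read those values via Micrometer's HistogramSnapshot API (the SnapshotPrinter class and its setup are mine, not part of the reproducer; package names as in the imports above):

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;

import java.time.Duration;

public class SnapshotPrinter {
    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        Timer timer = Timer.builder("test").publishPercentileHistogram().register(registry);
        timer.record(Duration.ofMillis(100));

        HistogramSnapshot snapshot = timer.takeSnapshot();
        CountAtBucket[] buckets = snapshot.histogramCounts();
        CountAtBucket last = buckets[buckets.length - 1];
        // The +Inf bucket the 1.x client derives is count minus the last finite bucket;
        // under the race above this difference can go negative.
        System.out.println("count=" + snapshot.count()
                + " lastBucket=" + last.count()
                + " infBucket=" + (snapshot.count() - last.count()));
    }
}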

@jonatan-ivanov changed the title from "Regression causign j.l.IllegalArgumentException: Counts in ClassicHistogramBuckets cannot be negative. after upgrade to micrometer 1.13.0" to "Counts in ClassicHistogramBuckets cannot be negative" Jun 13, 2024
@jonatan-ivanov added the bug (A general bug) label and removed the waiting-for-triage label Jun 13, 2024
@jonatan-ivanov added this to the 1.14.0-M1 milestone Jun 13, 2024
@jonatan-ivanov modified the milestones: 1.14.0-M1, 1.13.2 Jun 13, 2024
@jonatan-ivanov changed the title from "Counts in ClassicHistogramBuckets cannot be negative" to "Fix histogram consistency in PrometheusMeterRegistry" Jun 13, 2024
@jonatan-ivanov
Member

I think I fixed this in 796c1e5 (please see the commit message for details). I also added some jcstress tests.

1.13.2-SNAPSHOT is published in our snapshot repo (see the README); could you please try it out and check whether you can still see the issue?

@anetamoravcikova

Hello, I see that the issue is fixed. Can you please tell us when the fix will be published for production use?

@knoobie

knoobie commented Jun 27, 2024

If you click on the milestone (top right), you can find an estimated release date.

kenglxn added a commit to navikt/min-side-arbeidsgiver-api that referenced this issue Aug 5, 2024
…ring boot

error: j.l.IllegalArgumentException: Counts in ClassicHistogramBuckets cannot be negative.

refs: micrometer-metrics/micrometer#5193