
Add Observability Vision document #188

Merged

Conversation

tigrannajaryan (Member)

This is a guidance document that developers can consult when working on improving the observability of the OpenTelemetry Service itself.

codecov-io commented Jul 23, 2019

Codecov Report

Merging #188 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #188   +/-   ##
=======================================
  Coverage   73.29%   73.29%           
=======================================
  Files         108      108           
  Lines        6329     6329           
=======================================
  Hits         4639     4639           
  Misses       1456     1456           
  Partials      234      234

Powered by Codecov. Last update c7eacad...fc0ea4f.

- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).
- Latency broken down by pipeline elements (including exporter network roundtrip latency for request/response protocols).

“Rate” values must reflect the average rate over the last 10 seconds. Rates must be exposed in both bytes/sec and units/sec (e.g. spans/sec).
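For illustration only (not part of the PR), a minimal sketch of how such a 10-second rolling average could be derived from a cumulative counter in Go; the `RateTracker` type and its method names are hypothetical. One instance would be kept per metric, e.g. one for bytes and one for spans.

```go
// Hypothetical sketch: turn a cumulative counter into a rate averaged over
// the last window (e.g. 10 seconds). Not taken from the actual code base.
package observability

import (
	"sync"
	"time"
)

type RateTracker struct {
	mu       sync.Mutex
	window   time.Duration
	lastTime time.Time
	lastVal  uint64
	rate     float64 // units per second, averaged over the last window
}

func NewRateTracker(window time.Duration) *RateTracker {
	return &RateTracker{window: window, lastTime: time.Now()}
}

// Update is called with the current cumulative value (e.g. total spans
// received). Once per window it recomputes the average rate for that window.
func (r *RateTracker) Update(cumulative uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := time.Now()
	elapsed := now.Sub(r.lastTime)
	if elapsed < r.window {
		return
	}
	r.rate = float64(cumulative-r.lastVal) / elapsed.Seconds()
	r.lastVal = cumulative
	r.lastTime = now
}

// Rate returns the most recently computed rate in units/sec.
func (r *RateTracker) Rate() float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.rate
}
```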
Member

I think we can defer the calculation of rates to the metrics backend. The service observability can simply record and export cumulative values.
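For context, a minimal sketch of that approach, assuming the OpenCensus Go stats API (which the service used for self-metrics at the time); the measure and view names are illustrative, not taken from the code base. The service records only a cumulative sum, and the backend (e.g. Prometheus `rate()`) derives spans/sec at query time.

```go
// Sketch only: record cumulative counts and let the metrics backend compute
// rates. Measure/view names are illustrative.
package observability

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var receivedSpans = stats.Int64(
	"otelsvc/receiver/received_spans", "Number of spans received.", stats.UnitDimensionless)

func init() {
	// A cumulative-sum view; the backend derives spans/sec at query time.
	_ = view.Register(&view.View{
		Name:        "otelsvc/receiver/received_spans",
		Description: "Cumulative number of spans received.",
		Measure:     receivedSpans,
		Aggregation: view.Sum(),
	})
}

func recordReceivedSpans(ctx context.Context, count int64) {
	stats.Record(ctx, receivedSpans.M(count))
}
```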

Member Author

Yes, that's also what is suggested in the text as a possible option / open question:

Note: some of the current values may be calculated as derivatives of cumulative values in the backend, so it is an open question whether we want to expose them separately or not.

Member Author

One reason it may be desirable to calculate rates in the service itself is a deployment where the service is not monitored by a metrics monitoring system at all. The service's own calculations may not be as good as what a full-blown metrics backend can provide, but periodically dumping the rates to the service's log file is better than nothing.
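A hypothetical sketch of that "better than nothing" option: periodically derive a rate from a cumulative counter and write it to the log. None of these names exist in the actual code base.

```go
package observability

import (
	"context"
	"log"
	"time"
)

// logRatesPeriodically computes a rate from a cumulative counter and dumps it
// to the log, for deployments with no metrics backend monitoring the service.
func logRatesPeriodically(ctx context.Context, interval time.Duration, spansReceived func() uint64) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var last uint64
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cur := spansReceived()
			rate := float64(cur-last) / interval.Seconds()
			last = cur
			log.Printf("spans/sec over last %v: %.1f", interval, rate)
		}
	}
}
```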

Member Author

@songy23 let me know if you think the text is unclear and you want changes there.

Member

The current statement looks good :)

docs/observability.md — resolved
docs/observability.md — outdated, resolved
- Incoming connection rate (new connections per second), broken down by receiver.
- In-memory queue size (in bytes and in units).
- Persistent queue size (when supported).
- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).
Contributor

Another challenge here is batching: we could record it for each element of the batch, but that increases the overall tracking cost.
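A hypothetical sketch of the cheaper option: record end-to-end latency once per batch (using the oldest item's receive time) instead of once per element, so the tracking cost does not grow with batch size. The measure name and types are illustrative only, and a matching view would still have to be registered, as in the earlier sketch.

```go
// Sketch only: one latency measurement per batch rather than per span.
package observability

import (
	"context"
	"time"

	"go.opencensus.io/stats"
)

var endToEndLatencyMs = stats.Float64(
	"otelsvc/latency_ms", "Receiver-to-exporter latency in milliseconds.", stats.UnitMilliseconds)

type spanBatch struct {
	receivedAt time.Time // receive time of the oldest span in the batch
	spanCount  int
}

// recordBatchLatency makes one measurement per batch instead of spanCount
// measurements, keeping tracking cost independent of batch size.
func recordBatchLatency(ctx context.Context, b spanBatch) {
	latencyMs := float64(time.Since(b.receivedAt)) / float64(time.Millisecond)
	stats.Record(ctx, endToEndLatencyMs.M(latencyMs))
}
```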


### Aggregate Values

For the following metrics we want to track min, average and max since start: TBD.
Contributor

We likely want some histograms for some metrics.

Contributor

The point is more to have a consistent set of histograms applicable to specific kinds of metrics.

Member Author

Histogram calculation can also be more expensive. Do we want it to be done in the service? With histograms we are really moving into the realm of full-blown metrics backends, perhaps crossing the line of what should be done in-house by the service itself.

Member

Agree with @pjanotti. IMO calculating histogram aggregations is not essentially different from calculating cumulative counts/bytes/etc. It may be more difficult to figure out the rate of histograms, but I think that's a case where we can defer it to the backend.
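A small sketch supporting this point, again assuming the OpenCensus stats API: the recording call is identical whether the metric is a cumulative counter or a histogram; only the view's `Aggregation` differs. Names and bucket bounds are illustrative.

```go
// Sketch only: the same measure backs both a cumulative-sum view and a
// histogram (distribution) view; recording code does not change.
package observability

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var batchSize = stats.Int64(
	"otelsvc/batch_size", "Number of spans per exported batch.", stats.UnitDimensionless)

var (
	// Cumulative counter: total spans exported.
	batchSizeSumView = &view.View{
		Name:        "otelsvc/batch_size_sum",
		Measure:     batchSize,
		Aggregation: view.Sum(),
	}
	// Histogram of batch sizes; the extra per-record cost is only a bucket lookup.
	batchSizeDistView = &view.View{
		Name:        "otelsvc/batch_size_distribution",
		Measure:     batchSize,
		Aggregation: view.Distribution(16, 64, 256, 1024, 4096),
	}
)

func init() {
	_ = view.Register(batchSizeSumView, batchSizeDistView)
}
```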

Member Author

I removed this section for now since it is incomplete. We can add it in a future PR.

docs/observability.md — outdated, resolved
docs/observability.md — outdated, resolved
pjanotti (Contributor) left a comment

LGTM. We can take the discussion about histograms separately.

Tigran Najaryan added 2 commits July 31, 2019 16:42
This is a guidance document that developers can consult when working on improving the observability of the OpenTelemetry Service itself.
tigrannajaryan (Member Author)

@songy23 PTAL.

songy23 (Member) left a comment

LGTM

pjanotti merged commit 3a985da into open-telemetry:master Aug 1, 2019
pjanotti pushed a commit that referenced this pull request Aug 7, 2019
* Migrate updates to Prometheus receiver from opencensus-service

* Remove unnecessary configuration for adjusting metrics

* Update exporters README (#210)

Remove stale content, put in place a simple skeleton, and put information for the Zipkin exporter.

* Added Jaeger gRPC receiver (#197)

* Added Jaeger gRPC receiver

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Addressed first review round

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Changes based on the feedback from the second review

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Fixed import order, check for same ip:port on endpoints

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Reverted the port-check

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Fix README - remove broken links and commands (#211)

Signed-off-by: Hui Kang <kangh@us.ibm.com>

* Refactor opencensus exporter to make it easily extensible (#212)

Refactored oc exporter code to make it easier to import in derived
builds without copying all the code.

Example derived exporter: https://github.com/owais/example-derived-oc-exporter

- Moved internal/compression to root
- Split opencensusexporter factory's `CreateTraceExporter` method into
  - `OCAgentOptions` and `CreateOCAgent`

* Use sha256 instead of md5 in node batcher processor (#202)

MD5 is an insecure cryptographic hash function and is therefore unsuitable for security applications.

* Add goimports check and fix import order for all files (#209)

* Add goimports check and fix import order for all files

Updates #155.

* Try specifying a version for goimports

* Show the diff instead of files and fix import

* Fix default in Makefile

* Add factory and update config for tail sampling processor (#200)

* Add factory and update config for tail sampling processor

Updates #146

* Move to package processor

* Remove an unnecessary check

* Move CreateDefaultConfig to factory and add unit tests

* Fix test failure

* Remove commented code

* Add misspell check and fix all typos (#214)

* Add misspell check and fix all typos

Updates #155.

* Format

* Include yaml files

* Move tail sampling config to its own package and remove stale configs (#217)

Follow up of #200. Second step of #146.

* Add Observability Vision document (#188)

* Add Observability Vision document

This is a guidance document that developers can consult when
working on improving the observability of the OpenTelemetry Service.

* PR fixes

* Add Zipkin exporter factory (#207)

* Add Zipkin exporter factory

Add the Zipkin exporter configuration factory using the new configuration format. This does not bring many of the settings from the old configuration since they won't be used. In a subsequent PR the code of the exporter itself will be changed.

* PR Feedback: add 1 test plus some comments

* Rename endpoint to http-url and related field

* Fix goimports issue

* Remove TODO commented code

Replaced the TODO commented code with an issue.

* Rename the field used to specify destination

* Update README to address reviewer comments.

* Update module dependencies and import order for metricfamily

* Improve OC Receiver Documentation (#203)

* First round of receiver and opencensus receiver documentation.

* Undo go mod/sum changes

* Address initial set of comments.

* Address next set of comments.

* Address next set of comments.

* Fix use of server instead of receiver in comment and explain settings can be mix and matched.

* Merged master and fixed misspell errors caught with new tools

* Add static check and fix all errors (#218)

* Add static check

Fixes #155.

* Fix most staticcheck errors

* More fixes

* Fix id_batcher

* Rename exporterhelp parameter (#220)

The parameter was named exportFormat but it is actually used as the exporter name.

* Fix build: lower casing error message (#224)

* Add jaeger grpc exporter (#219)

* Add Jaeger gRPC exporter

Adds a Jaeger gRPC exporter.

* Update exporters/README.md

* Fixes on the exporter/README.md

* Add doc.go with package description.

* Fix import order on config_test.go

* Migrate updates to Prometheus receiver from opencensus-service

* Remove unnecessary configuration for adjusting metrics

* Update README to address reviewer comments.

* Update module dependencies and import order for metricfamily

* Fix goimports

* Fix staticcheck issues
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021
Since these methods are about the span itself rather than specifically
*events* on the span, it makes sense to drop "events" from their names.

[Closes open-telemetry#33]
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this pull request Apr 27, 2023