
Add Observability Vision document #188

Merged

Conversation

tigrannajaryan (Member)

This is a guidance document that developers can consult when working on improving the observability of the OpenTelemetry Service itself.

codecov-io commented Jul 23, 2019

Codecov Report

Merging #188 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #188   +/-   ##
=======================================
  Coverage   73.29%   73.29%           
=======================================
  Files         108      108           
  Lines        6329     6329           
=======================================
  Hits         4639     4639           
  Misses       1456     1456           
  Partials      234      234

Powered by Codecov. Last update c7eacad...fc0ea4f.

- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).
- Latency broken down by pipeline elements (including exporter network roundtrip latency for request/response protocols).

“Rate” values must reflect the average rate over the last 10 seconds. Rates must be exposed in both bytes/sec and units/sec (e.g. spans/sec).
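For illustration only (not part of the PR), a minimal sketch of how such a 10-second rolling average could be derived from a cumulative counter in Go; the `RateTracker` type and its method names are hypothetical. One instance would be kept per metric, e.g. one for bytes and one for spans.

```go
// Hypothetical sketch: turn a cumulative counter into a rate averaged over
// the last window (e.g. 10 seconds). Not taken from the actual code base.
package observability

import (
	"sync"
	"time"
)

type RateTracker struct {
	mu       sync.Mutex
	window   time.Duration
	lastTime time.Time
	lastVal  uint64
	rate     float64 // units per second, averaged over the last window
}

func NewRateTracker(window time.Duration) *RateTracker {
	return &RateTracker{window: window, lastTime: time.Now()}
}

// Update is called with the current cumulative value (e.g. total spans
// received). Once per window it recomputes the average rate for that window.
func (r *RateTracker) Update(cumulative uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := time.Now()
	elapsed := now.Sub(r.lastTime)
	if elapsed < r.window {
		return
	}
	r.rate = float64(cumulative-r.lastVal) / elapsed.Seconds()
	r.lastVal = cumulative
	r.lastTime = now
}

// Rate returns the most recently computed rate in units/sec.
func (r *RateTracker) Rate() float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.rate
}
```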
Member

I think we can defer the calculation of rates to the metrics backend. The service observability can simply record and export cumulative values.
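For context, a minimal sketch of that approach, assuming the OpenCensus Go stats API (which the service used for self-metrics at the time); the measure and view names are illustrative, not taken from the code base. The service records only a cumulative sum, and the backend (e.g. Prometheus `rate()`) derives spans/sec at query time.

```go
// Sketch only: record cumulative counts and let the metrics backend compute
// rates. Measure/view names are illustrative.
package observability

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var receivedSpans = stats.Int64(
	"otelsvc/receiver/received_spans", "Number of spans received.", stats.UnitDimensionless)

func init() {
	// A cumulative-sum view; the backend derives spans/sec at query time.
	_ = view.Register(&view.View{
		Name:        "otelsvc/receiver/received_spans",
		Description: "Cumulative number of spans received.",
		Measure:     receivedSpans,
		Aggregation: view.Sum(),
	})
}

func recordReceivedSpans(ctx context.Context, count int64) {
	stats.Record(ctx, receivedSpans.M(count))
}
```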

Member Author

Yes, that's also what is suggested in the text as a possible option / open question:

Note: some of the current values may be calculated as derivatives of cumulative values in the backend, so it is an open question whether we want to expose them separately or not.

Member Author

One reason it may be desirable to calculate rates in the service itself is a deployment where the service is not monitored by a metrics monitoring system at all. The service's own calculations may not be as good as what a full-blown metrics backend can provide, but periodically dumping the rates to the service's log file is better than nothing.
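A hypothetical sketch of that "better than nothing" option: periodically derive a rate from a cumulative counter and write it to the log. None of these names exist in the actual code base.

```go
package observability

import (
	"context"
	"log"
	"time"
)

// logRatesPeriodically computes a rate from a cumulative counter and dumps it
// to the log, for deployments with no metrics backend monitoring the service.
func logRatesPeriodically(ctx context.Context, interval time.Duration, spansReceived func() uint64) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var last uint64
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cur := spansReceived()
			rate := float64(cur-last) / interval.Seconds()
			last = cur
			log.Printf("spans/sec over last %v: %.1f", interval, rate)
		}
	}
}
```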

Member Author

@songy23 let me know if you think the text is unclear and you want changes there.

Member

The current statement looks good :)

docs/observability.md — resolved
docs/observability.md — outdated, resolved
- Incoming connection rate (new connections per second), broken down by receiver.
- In-memory queue size (in bytes and in units).
- Persistent queue size (when supported).
- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).
Contributor

Another challenge here is batching: we could record it for each element of the batch, but that increases the overall tracking cost.
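A hypothetical sketch of the cheaper option: record end-to-end latency once per batch (using the oldest item's receive time) instead of once per element, so the tracking cost does not grow with batch size. The measure name and types are illustrative only, and a matching view would still have to be registered, as in the earlier sketch.

```go
// Sketch only: one latency measurement per batch rather than per span.
package observability

import (
	"context"
	"time"

	"go.opencensus.io/stats"
)

var endToEndLatencyMs = stats.Float64(
	"otelsvc/latency_ms", "Receiver-to-exporter latency in milliseconds.", stats.UnitMilliseconds)

type spanBatch struct {
	receivedAt time.Time // receive time of the oldest span in the batch
	spanCount  int
}

// recordBatchLatency makes one measurement per batch instead of spanCount
// measurements, keeping tracking cost independent of batch size.
func recordBatchLatency(ctx context.Context, b spanBatch) {
	latencyMs := float64(time.Since(b.receivedAt)) / float64(time.Millisecond)
	stats.Record(ctx, endToEndLatencyMs.M(latencyMs))
}
```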


### Aggregate Values

For the following metrics we want to track min, average and max since start: TBD.
Contributor

We likely want some histograms for some metrics.

Contributor

The point is more to have a consistent set of histograms applicable to specific kinds of metrics.

Member Author

Histogram calculation can also be more expensive. Do we want it to be done in the service? With histograms we are really moving into the realm of full-blown metrics backends, perhaps crossing the line of what should be done in-house by the service itself.

Member

Agree with @pjanotti. IMO calculating histogram aggregations is not essentially different from calculating cumulative counts/bytes/etc. It may be more difficult to figure out the rate of histograms, but I think that's a case where we can defer it to the backend.
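A small sketch supporting this point, again assuming the OpenCensus stats API: the recording call is identical whether the metric is a cumulative counter or a histogram; only the view's `Aggregation` differs. Names and bucket bounds are illustrative.

```go
// Sketch only: the same measure backs both a cumulative-sum view and a
// histogram (distribution) view; recording code does not change.
package observability

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var batchSize = stats.Int64(
	"otelsvc/batch_size", "Number of spans per exported batch.", stats.UnitDimensionless)

var (
	// Cumulative counter: total spans exported.
	batchSizeSumView = &view.View{
		Name:        "otelsvc/batch_size_sum",
		Measure:     batchSize,
		Aggregation: view.Sum(),
	}
	// Histogram of batch sizes; the extra per-record cost is only a bucket lookup.
	batchSizeDistView = &view.View{
		Name:        "otelsvc/batch_size_distribution",
		Measure:     batchSize,
		Aggregation: view.Distribution(16, 64, 256, 1024, 4096),
	}
)

func init() {
	_ = view.Register(batchSizeSumView, batchSizeDistView)
}
```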

Member Author

I removed this section for now since it is incomplete. We can add it in a future PR.

docs/observability.md — outdated, resolved
docs/observability.md — outdated, resolved
pjanotti (Contributor) left a comment

LGTM. We can take the discussion about histograms separately.

Tigran Najaryan added 2 commits July 31, 2019 16:42
This is a guidance document that developers can consult when working on improving the observability of the OpenTelemetry Service itself.
tigrannajaryan (Member Author)

@songy23 PTAL.

songy23 (Member) left a comment

LGTM

pjanotti merged commit 3a985da into open-telemetry:master Aug 1, 2019
pjanotti pushed a commit that referenced this pull request Aug 7, 2019
* Migrate updates to Prometheus receiver from opencensus-service

* Remove unnecessary configuration for adjusting metrics

* Update exporters README (#210)

Remove stale content, put in place a simple skeleton, and put information for the Zipkin exporter.

* Added Jaeger gRPC receiver (#197)

* Added Jaeger gRPC receiver

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Addressed first review round

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Changes based on the feedback from the second review

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Fixed import order, check for same ip:port on endpoints

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Reverted the port-check

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

* Fix README - remove broken links and commands (#211)

Signed-off-by: Hui Kang <kangh@us.ibm.com>

* Refactor opencensus exporter to make it easily extensible (#212)

Refactored oc exporter code to make it easier to import in derived
builds without copying all the code.

Example derived exporter: https://github.com/owais/example-derived-oc-exporter

- Moved internal/compression to root
- Split opencensusexporter factory's `CreateTraceExporter` method into
  - `OCAgentOptions` and `CreateOCAgent`

* Use sha256 instead of md5 in node batcher processor (#202)

MD5 is an insecure cryptographic hash function and is therefore unsuitable for security applications.

* Add goimports check and fix import order for all files (#209)

* Add goimports check and fix import order for all files

Updates #155.

* Try specifying a version for goimports

* Show the diff instead of files and fix import

* Fix default in Makefile

* Add factory and update config for tail sampling processor (#200)

* Add factory and update config for tail sampling processor

Updates #146

* Move to package processor

* Remove an unnecessary check

* Move CreateDefaultConfig to factory and add unit tests

* Fix test failure

* Remove commented code

* Add misspell check and fix all typos (#214)

* Add misspell check and fix all typos

Updates #155.

* Format

* Include yaml files

* Move tail sampling config to its own package and remove stale configs (#217)

Follow up of #200. Second step of #146.

* Add Observability Vision document (#188)

* Add Observability Vision document

This is a guidance document that developers can consult when
working on improving the observability of the OpenTelemetry Service.

* PR fixes

* Add Zipkin exporter factory (#207)

* Add Zipkin exporter factory

Add the Zipkin exporter configuration factory using the new configuration format. This does not bring many of the settings from the old configuration since they won't be used. In a subsequent PR the code of the exporter itself will be changed.

* PR Feedback: add 1 test plus some comments

* Rename endpoint to http-url and related field

* Fix goimports issue

* Remove TODO commented code

Replaced the TODO commented code with an issue.

* Rename the field used to specify destination

* Update README to address reviewer comments.

* Update module dependencies and import order for metricfamily

* Improve OC Receiver Documentation (#203)

* First round of receiver and opencensus receiver documentation.

* Undo go mod/sum changes

* Address initial set of comments.

* Address next set of comments.

* Address next set of comments.

* Fix use of server instead of receiver in comment and explain settings can be mix and matched.

* Merged master and fixed misspell errors caught with new tools

* Add static check and fix all errors (#218)

* Add static check

Fixes #155.

* Fix most staticcheck errors

* More fixes

* Fix id_batcher

* Rename exporterhelp parameter (#220)

The parameter was named exportFormat but it is actually used as the exporter name.

* Fix build: lower casing error message (#224)

* Add jaeger grpc exporter (#219)

* Add Jaeger gRPC exporter

Adds a Jaeger gRPC exporter.

* Update exporters/README.md

* Fixes on the exporter/README.md

* Add doc.go with package description.

* Fix import order on config_test.go

* Migrate updates to Prometheus receiver from opencensus-service

* Remove unnecessary configuration for adjusting metrics

* Update README to address reviewer comments.

* Update module dependencies and import order for metricfamily

* Fix goimports

* Fix staticcheck issues
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021
Since these methods are about the span itself rather than specifically
*events* on the span, it makes sense to drop "events" from their names.

[Closes open-telemetry#33]
hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this pull request Apr 27, 2023