Use Protobuf and gRPC for internal communications #773

Open
7 of 24 tasks
yurishkuro opened this issue Apr 15, 2018 · 53 comments
Labels
meta-issue A tracking issue that requires work in other repos

Comments

@yurishkuro
Member

yurishkuro commented Apr 15, 2018

Problem

At the moment we have the following data models handled by the backend, often with two-way transformers between them:

  • Domain model in /model used by all internal APIs and business logic
    • JSON serialization of the domain model is used in many unit and integration tests
  • jaeger.thrift as the main on-the-wire representation
  • /model/json is JSON used for communicating with the UI
  • /model/json with embedded Process used by ElasticSearch storage plugin
  • zipkin.thrift and Zipkin JSON accepted as input data models as well

We also have limitations on routing traffic between agents and collectors because they use TChannel as the communication protocol.

Proposed Solution

Converge on a Protobuf 3 Model

By converging on a single Protobuf model we can consolidate some of these models, reduce the number of converters, and unblock / enable several other work streams:

Use gRPC with Protobuf model

User-Facing Deliverables

Phase 1 - MVP

Goal: deprecate TChannel.

  • Support gRPC/protobuf endpoints for core backend communications
    • Collector service (receiving spans)
    • Query service
    • Configuration services in the collector (sampling, throttling, baggage restrictions)
  • All agent to collector communications using gRPC
    • Spans forwarding
    • Configuration services
  • Lock down protobuf-based API 2.x

Phase 2

  • UI uses gRPC-based REST endpoint instead of current bespoke JSON
  • Publish Swagger (nice to have)

Phase 3

  • Agent exposes gRPC endpoints, to enable work on the clients

Implementation notes

We've talked about this a lot, but there are many moving pieces that must come together, so this is the meta issue to list them all out, in random order.

  • We want to have a single protobuf data model on the server
  • We want that data model to be representable in JSON so that query service API can be expressed with it instead of the current JSON model (issue Define canonical JSON model #706)
  • We probably want to use gogoprotobuf instead of the default protobuf because gogo allows generation of the data model without too many pointers, e.g. []model.KeyValue instead of []*model.KeyValue (gist); see the sketch after this list
  • We need to decide if we want to use oneof for KeyValue or keep sending the Type field explicitly as we do today
  • We need to decide if we want to keep the existing UI model's optimization where the Process object in the Span is deduped and replaced with a reference to an entry in the map of Process objects attached to the Trace
    • The current UI model supports both embedded Process and ProcessID reference, so we could keep this dual representation
    • The deduping of the Process objects is more important when reporting spans from clients to the backend (which uses the Batch type) than it is for the query service API. The UI does not really rely on the deduping, it's primarily done for the purpose of reducing the JSON message size when returning the whole trace.
  • My proposal is to keep both Batch and Trace types. Batch doesn't need any changes.
  • Need to investigate how to expose JSON API for the UI as a facade on top of a gRPC API implemented by the query service
  • The data model generated with gogoprotobuf is actually very close to the existing domain model model.Span. We need to decide if we want to keep them separate and do transformation, or merge them into one by generating the core skeletons and then adding the additional business methods via companion hand-written source files. I am in favor of merging them to minimize data copying. The risk here is that future changes in proto/gogo could result in a differently generated model, but that seems unlikely.
  • Current comms between agents and collectors are done with TChannel, which supports client-side load balancing and integration with discovery. We'd need an equivalent of that with gRPC, e.g. by using yarpc.
  • There are a few differences in the JSON generated from IDL in the gist (some could be addressed via custom renderers):
    • dates are rendered in UTC: "2018-04-02T02:17:45.171667100Z"
    • durations are rendered as string: "0.013s"
    • trace ID encoded as two integer fields instead of one string (some int64s cannot be parsed by JS)
    • span ID encoded as integer instead of one string (some int64s cannot be parsed by JS)
    • KeyValue.VType=STRING and SpanRef.Type=CHILD_OF are omitted by default (as zero values)
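
To make the gogoprotobuf point above concrete, here is a rough hand-written sketch (not actual generated code) of the shape such a model could take, with value-typed slices and with both the embedded Process and the ProcessID reference kept; all names are illustrative:

    package model

    import "time"

    // Hypothetical shapes approximating what gogoprotobuf could generate with
    // (gogoproto.nullable) = false; field names and types are illustrative only.

    type ValueType int32

    const ValueType_STRING ValueType = 0 // zero value, omitted by default in JSON

    type KeyValue struct {
        Key   string
        VType ValueType
        VStr  string
    }

    type TraceID struct {
        High, Low uint64 // two integers on the wire; JS cannot parse some int64s
    }

    type Process struct {
        ServiceName string
        Tags        []KeyValue
    }

    type Span struct {
        TraceID   TraceID
        SpanID    uint64
        StartTime time.Time
        Duration  time.Duration
        Tags      []KeyValue // value slice, not []*KeyValue, so fewer allocations
        Process   *Process   // embedded Process (reporting path)
        ProcessID string     // reference into a Trace-level map (query/UI path)
    }

    // Batch keeps today's client-to-collector optimization:
    // one Process shared by all spans in the batch.
    type Batch struct {
        Process Process
        Spans   []*Span
    }

    // Trace is the query-side type returned to the UI.
    type Trace struct {
        Spans []*Span
    }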

Implementation plan

@isaachier
Contributor

I'm wondering if you need the JSON at all. You might be able to use JavaScript Protobuf in the browser too. Not saying JSON isn't simpler; just an idea in case it isn't straightforward.

@yurishkuro
Member Author

For Jaeger UI we might be able to use https://github.com/grpc/grpc-web. But I still think we need native JSON support, if only for jaeger-client-javascript to be able to report traces from the browser. I don't think it would be a good idea to insist on having grpc-web in the browser applications as a dependency, given that JSON is a lot more natural and easily available (possibly for other languages as well).

@jpkrohling
Contributor

There's gRPC Gateway, which can take care of providing transparent REST+JSON support: https://github.com/grpc-ecosystem/grpc-gateway
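
A rough sketch of how such a facade could look in Go, assuming a hypothetical QueryService proto for which protoc-gen-grpc-gateway has generated bindings (the package path, handler name, and ports are illustrative, not actual Jaeger APIs):

    package main

    import (
        "context"
        "log"
        "net/http"

        "github.com/grpc-ecosystem/grpc-gateway/runtime"
        "google.golang.org/grpc"

        // Hypothetical generated package; the real name depends on our proto layout.
        gw "github.com/jaegertracing/jaeger/proto-gen/api_v2"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        // The gateway translates REST+JSON requests into gRPC calls to the query service.
        mux := runtime.NewServeMux()
        opts := []grpc.DialOption{grpc.WithInsecure()}
        if err := gw.RegisterQueryServiceHandlerFromEndpoint(ctx, mux, "localhost:16685", opts); err != nil {
            log.Fatal(err)
        }
        log.Fatal(http.ListenAndServe(":8080", mux))
    }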

@tiffon
Member

tiffon commented Apr 16, 2018

Re jaeger-client-javascript, using protobufs instead of JSON is an option:

https://github.com/dcodeIO/protobuf.js

Has broad browser compatibility: http://dcode.io/protobuf.js/#compatibility

@isaachier
Contributor

So I briefly looked into this last night and the consensus was that gzipped JSON will almost always beat Javascript protobuf.

@tiffon
Member

tiffon commented Apr 17, 2018

consensus was that gzipped JSON will almost always beat Javascript protobuf.

@isaachier Can you refer to relevant material? And, beat in what sense, speed, size, etc?

Also, some perf benchmarks from dcodeIO/protobuf.js (in node, not the browser). There is a very significant difference between the Google implementation and dcodeIO's. Which implementation was used in your investigation?

benchmarking combined performance ...

protobuf.js (reflect) x 275,900 ops/sec ±0.78% (90 runs sampled)
protobuf.js (static) x 290,096 ops/sec ±0.96% (90 runs sampled)
JSON (string) x 129,381 ops/sec ±0.77% (90 runs sampled)
JSON (buffer) x 91,051 ops/sec ±0.94% (90 runs sampled)
google-protobuf x 42,050 ops/sec ±0.85% (91 runs sampled)

   protobuf.js (static) was fastest
  protobuf.js (reflect) was 4.7% ops/sec slower (factor 1.0)
          JSON (string) was 55.3% ops/sec slower (factor 2.2)
          JSON (buffer) was 68.6% ops/sec slower (factor 3.2)
        google-protobuf was 85.5% ops/sec slower (factor 6.9)

@isaachier
Contributor

OK consensus might be an overstatement. See this article: https://blog.wearewizards.io/using-protobuf-instead-of-json-to-communicate-with-a-frontend.

@isaachier
Contributor

For the record, I wouldn't use node as a good indicator of frontend performance. Node C++ extensions can easily add speed ups that aren't available in the browser.

@yurishkuro
Member Author

I believe we need to support JSON format for the following reasons:

  • Submitting JSON from clients (e.g. from browser) usually requires fewer (if any) dependencies than protobuf
  • ElasticSearch storage uses JSON as the data model on the wire
  • UI developer experience is a lot better (I assume) when the server exposes JSON API than with protobuf, e.g. you can see the payloads in browser dev tools
  • External analytical tools would probably have easier time consuming JSON than protobuf
  • Even our internal unit tests represent traces in JSON, which is readable & easier to author than protobuf

@yurishkuro
Member Author

yurishkuro commented Apr 18, 2018

Regarding Go rendering of duration in JSON (e.g. 1666h40m, 0.016000013s) - another alternative is to record an end timestamp instead of a duration.
Pros:

  • no format issues
  • semantically works better when we support partial spans

Cons:

  • more data on the wire: +4 bytes in protobuf, and full UTC timestamp in JSON.
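
For reference, the string form above is just Go's time.Duration formatting; a small example of both options, with values picked to match the examples in this thread:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // A large duration renders as a compound string rather than a plain number.
        d := time.Duration(6000000016000013) // nanoseconds
        fmt.Println(d)                       // 1666h40m0.016000013s

        // The alternative discussed above: record an absolute end timestamp instead.
        start := time.Date(2018, 4, 2, 2, 17, 45, 171667100, time.UTC)
        end := start.Add(16 * time.Millisecond)
        fmt.Println(end.Format(time.RFC3339Nano)) // full UTC timestamp on the wire
    }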

@isaachier
Contributor

I wouldn't worry too much about the size of the span on the wire. As long as it doesn't hurt performance somewhere, people generally don't care.

@isaachier
Contributor

isaachier commented Apr 18, 2018

Fun fact:

Proto3 supports a canonical encoding in JSON, making it easier to share data between systems. The encoding is described on a type-by-type basis in the table below.

https://developers.google.com/protocol-buffers/docs/proto3#json

@tiffon
Member

tiffon commented Apr 19, 2018

For duration, why not just use nanoseconds?

@yurishkuro
Member Author

Nanoseconds are fine in 64 bits, but JavaScript only understands 53 bits, and that's about 100 days in nanos, an uncomfortably low number. The trade-off is we either render the 64-bit duration to JSON as a string, e.g. 1666h40m0.016000013s, or we replace duration with end_timestamp.
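
A quick sanity check of that limit (2^53 - 1 is Number.MAX_SAFE_INTEGER in JavaScript):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const maxSafeInteger = 1<<53 - 1   // JS Number.MAX_SAFE_INTEGER
        d := time.Duration(maxSafeInteger) // interpreted as nanoseconds
        fmt.Println(d.Hours() / 24)        // ≈ 104 days, the ceiling mentioned above
    }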

@yurishkuro
Member Author

I suppose another option would be to render int64 as a string in JSON, to avoid precision loss in Javascript.

@tiffon
Member

tiffon commented Apr 19, 2018

If this is our canonical storage model, too, end timestamps would make it more difficult to do duration based queries.

What about storing the duration as milliseconds as a float?

@tiffon
Member

tiffon commented Apr 19, 2018

FYI, milliseconds as a float is the approach taken by the Performance API.

@isaachier
Contributor

I'm wondering if we'd care about the precision for that. Otherwise, struct timespec is probably the underlying implementation for all of these languages so we can copy it to Protobuf.

@yurishkuro
Member Author

That struct is a regular nanos timestamp. It's already supported by protobuf.

@yurishkuro
Member Author

Fwiw, average span size in our deployment is between 400 and 500 bytes (in Thrift), so we'd be adding about 1% overhead on the data size by switching from duration to timestamp.

@simonpasquier

My 2 cents from Prometheus experience.

We probably want to use gogoprotobuf instead of the default protobuf because gogo allows generation of the data model without too many pointers, e.g. []model.KeyValue instead of []*model.KeyValue (gist)

Prometheus uses gogoprotobuf with no issue.

https://github.com/grpc-ecosystem/grpc-gateway seems to work, apparently can even generate swagger

Prometheus also uses grpc-gateway for its v2 API. The API is experimental and provided without stability guarantees, but it is working fine, and indeed there is even Swagger documentation generated from the Protobuf definition. More details in this Google doc.

@tiffon
Member

tiffon commented Apr 20, 2018

Seems the main benefits of using floats to store the duration in milliseconds are simplicity, ease of supporting duration-based queries, being a bit more aligned, semantically, with the property we're interested in storing, and not requiring string parsing in any context.

@yurishkuro yurishkuro changed the title Use Protobuf for internal communications Use Protobuf and gRPC for internal communications May 5, 2018
This was referenced May 5, 2018
@johanbrandhorst

KeyValue.VType=STRING and SpanRef.Type=CHILD_OF are omitted by default (as zero values)

This is configurable in the marshaler. I would recommend using OrigName as well for snake_case variable names.

Note that committing to a JSON representation limits your options for backwards compatible changes (no field renaming). This is a subtle difference from usual proto3 forward/backward compatibility benefits.
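
For example, with the golang/protobuf jsonpb marshaler (gogo has an equivalent type), both behaviors are a matter of configuration; empty.Empty below is only a stand-in for any generated message:

    package main

    import (
        "os"

        "github.com/golang/protobuf/jsonpb"
        "github.com/golang/protobuf/ptypes/empty"
    )

    func main() {
        m := jsonpb.Marshaler{
            EmitDefaults: true, // emit zero values such as VType=STRING or Type=CHILD_OF
            OrigName:     true, // keep the snake_case field names from the .proto file
        }
        // Any generated proto message would be marshaled the same way.
        _ = m.Marshal(os.Stdout, &empty.Empty{})
    }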

@Falco20019

Btw, gRPC allows client-side load balancing by default. Just use GRPCLB. The default discovery uses DNS naming and points to the load balancer.
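
For reference, a minimal sketch of plain client-side load balancing with the stock DNS resolver and round_robin policy (a simpler alternative to GRPCLB; the target name and port are illustrative):

    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        // The dns:/// scheme makes the resolver return every address behind the name,
        // and round_robin spreads requests across the resolved collector instances.
        conn, err := grpc.Dial(
            "dns:///jaeger-collector.example.local:14250",
            grpc.WithInsecure(),
            grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        // conn would now back the generated collector client in the agent.
    }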

isaachier pushed a commit to isaachier/jaeger that referenced this issue Sep 3, 2018
isaachier pushed a commit to isaachier/jaeger that referenced this issue Sep 3, 2018
pavolloffay pushed a commit to pavolloffay/jaeger that referenced this issue Nov 7, 2018
@pavolloffay
Member

pavolloffay commented Nov 7, 2018

I am working on this now. I want to implement a minimal working solution using gRPC communication between agent and collector and get it merged to master as experimental.

So far I have rebased @isaachier's PR #1042 and added proto for sampling. I will be submitting a PR shortly.

I will post my notes here on what should be done afterward:

@yurishkuro
Member Author

added proto for sampling

Would it move this along better if we start merging smaller pieces into master, e.g. my branch & Isaac's PR? I think the tests were outstanding.

@pavolloffay
Member

I think it would break my work right now. I will submit a PR to master based on those two. Now fixing the coverage.

@isaachier
Contributor

Hey @pavolloffay. Glad to see this finally happening! Please let me know if you need any help here.

pavolloffay pushed a commit to pavolloffay/jaeger that referenced this issue Nov 8, 2018
@pavolloffay
Member

pavolloffay commented Nov 26, 2018

I am wondering whether we should rename --collector.grpc-port to something else like --collector.grpc.port, or even scope it with the service name, e.g. batches (though that seems odd). The question is whether we want to expose more services on the same port or use different gRPC ports for different services. The first comment also includes

Combine gRPC and HTTP handlers on the same port

cc @yurishkuro @jpkrohling

@yurishkuro
Member Author

why do we need "host"?

@pavolloffay
Member

My bad, there should not be a host. The most important thing is scoping it under .grpc.

@yurishkuro
Member Author

For now I suggest keeping it simple without scoping, and serving multiple services on the same port.

@annanay25
Member

For the query-service, would it be better to multiplex the GRPC and HTTP handlers on the same port as compared to hosting two endpoints?

@yurishkuro
Member Author

yes, it is mentioned in the implementation plan

@annanay25
Member

Had a discussion with @pavolloffay and he mentioned it would be better from a security perspective to have different ports for the two endpoints. It would make sense if we wanted to expose the gRPC port to public traffic and keep TChannel private.

@yurishkuro
Member Author

Why is it better for security? I can see the argument for endpoints with different functionality, such as querying vs. submitting spans, but not so much for the same functionality using different formats.

Anyway, we can certainly start with different ports. My preference is, in the future, to have the flexibility of the main() function recognizing that the same port is given for different protocols and automatically sharing a single server.
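
One possible shape for that, sketched with the third-party soheilhy/cmux library (a sketch only; the address is a placeholder and handler registration is elided):

    package main

    import (
        "log"
        "net"
        "net/http"

        "github.com/soheilhy/cmux"
        "google.golang.org/grpc"
    )

    func main() {
        lis, err := net.Listen("tcp", ":16686")
        if err != nil {
            log.Fatal(err)
        }

        m := cmux.New(lis)
        // gRPC traffic is HTTP/2 with content-type application/grpc; everything else is plain HTTP.
        grpcL := m.MatchWithWriters(cmux.HTTP2MatchHeaderFieldSendSettings("content-type", "application/grpc"))
        httpL := m.Match(cmux.Any())

        grpcServer := grpc.NewServer() // query service handlers would be registered here
        httpServer := &http.Server{}   // existing JSON/UI handlers would be registered here

        go grpcServer.Serve(grpcL)
        go httpServer.Serve(httpL)
        log.Fatal(m.Serve())
    }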

@annanay25
Member

I meant that if the HTTP endpoint of the query service does not support SSL+authorisation, we would not be able to expose the multiplexed port publicly (unless of course we run it behind a reverse proxy), whereas gRPC would have HTTPS support. :)
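
For reference, a minimal sketch of serving a gRPC endpoint with TLS so it could face public traffic (certificate paths and port are placeholders):

    package main

    import (
        "log"
        "net"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials"
    )

    func main() {
        // Placeholder certificate/key paths; in practice these come from configuration.
        creds, err := credentials.NewServerTLSFromFile("query.crt", "query.key")
        if err != nil {
            log.Fatal(err)
        }
        s := grpc.NewServer(grpc.Creds(creds))

        lis, err := net.Listen("tcp", ":16685")
        if err != nil {
            log.Fatal(err)
        }
        log.Fatal(s.Serve(lis)) // service implementations would be registered before serving
    }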
