This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[incubator/jaeger] Improve the Jaeger chart #3109

Merged
merged 8 commits into helm:master from improve-jaeger
Jan 23, 2018

Conversation

pavelnikolov
Collaborator

This PR adds the following improvements to the Jaeger chart:

  • Use ConfigMap for configurable parameters
  • Add the Hotrod example app
  • Allow only some of the components to be installed
  • Add support for the spark dependencies job (as a k8s cronjob)
  • Use the provisionDataStore key in the values.yaml file instead of tags to configure data store provisioning (see the values sketch after this list)
  • Refactor chart to remove unnecessary quotes
  • Remove the command overrides of the docker images and use environment variables configuration instead
  • Fix hard-coded replica count
  • Collector service works both as NodePort and ClusterIP service types
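
A rough sketch of the resulting values.yaml toggles (provisionDataStore, storage.type and collector.enabled appear later in this thread; the service type and spark keys are illustrative assumptions, and exact defaults may differ in the final chart):

provisionDataStore:
  cassandra: true
  elasticsearch: false
storage:
  # allowed values (cassandra, elasticsearch)
  type: cassandra
collector:
  enabled: true
  service:
    type: ClusterIP  # NodePort is also supported
spark:
  enabled: false  # the spark dependencies cronjob is off by default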

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 19, 2017
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Dec 19, 2017
@pavelnikolov pavelnikolov force-pushed the improve-jaeger branch 2 times, most recently from 3fc92b6 to dac7a59 Compare December 19, 2017 23:33
@pavelnikolov
Collaborator Author

Thanks to @mikelorant for helping me with this PR.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Dec 19, 2017
@pavelnikolov
Collaborator Author

/assign @foxish

@pavelnikolov
Collaborator Author

/assign @unguiculus

@pavelnikolov
Collaborator Author

pavelnikolov commented Dec 20, 2017

@foxish
Member

foxish commented Dec 20, 2017

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 20, 2017
@foxish
Member

foxish commented Dec 20, 2017

Unassigning myself - don't know enough about Jaeger to approve/review.

@foxish foxish removed their assignment Dec 20, 2017
@pavelnikolov pavelnikolov changed the title [jaeger] Improve the Jaeger chart [incubator/jaeger] Improve the Jaeger chart Dec 20, 2017
image: jaegertracing/spark-dependencies
tag: latest
pullPolicy: Always
schedule: "*/12 * * * *"


It should run just before the end of the day. Let's say 5-10 min for startup.

Collaborator Author

I configured it to run every day @ 23:49
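
For reference, a daily run at 23:49 corresponds to a cron schedule roughly like this in values.yaml (a sketch, under the same key shown in the snippet above):

schedule: "49 23 * * *"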

@pavolloffay

does it expose zipkin service by default (9411 port to collect spans)?

@pavelnikolov
Collaborator Author

pavelnikolov commented Dec 21, 2017

@pavolloffay
TL;DR yes, it does.

Take a look at the values.yaml file, line #L128 - this is where the default port is set. Then, if you look at the collector-deploy.yaml file at #L83, you will notice that the same container port is used. Also, on #L73, the COLLECTOR_ZIPKIN_HTTP_PORT environment variable is configured, which in turn gets its value from the collector.zipkin.http-port config map key. Finally, on #20 in the common-cm.yaml file you can see that the key is populated from the default in values.yaml (i.e. 9411).
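
A condensed sketch of that chain, using the file and key names described above (the ConfigMap reference name is an assumption; the actual chart templates may be laid out slightly differently):

# values.yaml
collector:
  service:
    zipkinPort: 9411

# common-cm.yaml
collector.zipkin.http-port: {{ .Values.collector.service.zipkinPort | quote }}

# collector-deploy.yaml
env:
- name: COLLECTOR_ZIPKIN_HTTP_PORT
  valueFrom:
    configMapKeyRef:
      name: {{ .Release.Name }}-jaeger   # assumed ConfigMap name
      key: collector.zipkin.http-port
ports:
- containerPort: {{ .Values.collector.service.zipkinPort }}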

@pavelnikolov pavelnikolov force-pushed the improve-jaeger branch 2 times, most recently from 6e125e9 to cec83a7 Compare December 21, 2017 20:30
@dvonthenen
Collaborator

Will take a look this afternoon as well. Been out of the office.

@dvonthenen
Collaborator

dvonthenen commented Jan 2, 2018

Giving this a test drive today...

Test Cases:

  • New Cassandra instance
  • New ES instance
  • Existing Cassandra instance
  • Existing ES instance

Will be adding a few more...

@dvonthenen
Collaborator

dvonthenen commented Jan 3, 2018

@pavolloffay is there any reason why we need both sets of parameters?

provisionDataStore:
  cassandra: true
  elasticsearch: false

storage:
  # allowed values (cassandra, elasticsearch)
  type: cassandra

Is it possible to replace .Values.storage.type with .Values.provisionDataStore.cassandra in the if statements instead? The reason I ask is that I ran into a split brain (two different parameters controlling the storage deployment and configuration) where I didn't know storage.type existed/needed to be changed, but I had enabled provisionDataStore.elasticsearch=true. Having only a single parameter would be a little more straightforward. The default is currently Cassandra, but that can be changed to ES. My understanding is (I could be wrong) that ES is the preferred storage type per the Jaeger docs/maintainers.

@pavelnikolov
Collaborator Author

@dvonthenen The only reason there are two sets of elasticsearch and cassandra params is for clarity in the requirements conditions. provisionDataStore could probably be turned into a boolean, and then we could get rid of the duplication.

@dvonthenen
Collaborator

@pavelnikolov I think it would go a long way in terms of usability to consolidate. Minimally, the README needs an update to the example command line in the Installing the Chart using a New ElasticSearch Cluster section, adding the --set storage.type=elasticsearch parameter. That is what led to this discussion. Will finish off the verification today.
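
For example, that section's install command would need something along these lines (a sketch; the release name is illustrative):

helm install incubator/jaeger --name myrel \
  --set provisionDataStore.cassandra=false \
  --set provisionDataStore.elasticsearch=true \
  --set storage.type=elasticsearch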

Collaborator
@dvonthenen dvonthenen left a comment

Found a couple of issues when trying to deploy the chart using previously deployed cassandra and elasticsearch stores.

cassandra.datacenter.name: {{ .Values.cassandra.config.dc_name | quote }}
cassandra.keyspace: {{ printf "%s_%s" "jaeger_v1" .Values.cassandra.config.dc_name | quote }}
cassandra.schema.mode: {{ .Values.schema.mode | quote }}
cassandra.servers: {{ template "cassandra.fullname" . }}
Collaborator

This forces the external or previously deployed cassandra host to be -cassandra. This should be set to .Values.storage.cassandra.host.

Collaborator

Seems like we might need another define in helpers like you did for elasticsearch to toggle between using the template "cassandra.fullname" and .Values.storage.cassandra.host.
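
Something along these lines, mirroring the elasticsearch helper quoted further down in this thread (the define name is illustrative, not the final implementation):

{{- define "cassandra.host" -}}
{{- if .Values.provisionDataStore.cassandra -}}
{{- template "cassandra.fullname" . -}}
{{- else -}}
{{- .Values.storage.cassandra.host -}}
{{- end -}}
{{- end -}}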

Collaborator Author
@pavelnikolov pavelnikolov Jan 5, 2018

Good catch! I've tested provisioned cassandra, external ES and provisioned ES. This was the only scenario that I haven't tested. I'll fix it right away. Thank you very much for catching this!

collector.http-port: {{ .Values.collector.service.httpPort | quote }}
collector.port: {{ .Values.collector.service.tchannelPort | quote }}
collector.zipkin.http-port: {{ .Values.collector.service.zipkinPort | quote }}
es.server-urls: {{ template "elasticsearch.client.url" . }}
Collaborator

This forces the external or previously deployed elasticsearch host to be -elasticsearch. This should be set to .Values.storage.elasticsearch.host.

Collaborator Author

What values are you using it with? I have tested this with both external and provisioned ES and it is working fine for me. Take a closer look at this helper:

{{- define "elasticsearch.client.url" -}}
{{- $port := .Values.storage.elasticsearch.port | toString -}}
{{- if .Values.provisionDataStore.elasticsearch -}}
{{- $host := printf "%s-%s-%s" .Release.Name "elasticsearch" "client" | trunc 63 | trimSuffix "-" -}}
{{- printf "%s://%s:%s" .Values.storage.elasticsearch.scheme $host $port }}
{{- else }}
{{- printf "%s://%s:%s" .Values.storage.elasticsearch.scheme .Values.storage.elasticsearch.host $port }}
{{- end -}}
{{- end -}}

It doesn't force anything, right?

Collaborator

Sorry.... small goof. I got hit by the storage.type thing again by not setting it. I corrected that now but it looks like the collector and query components are having problems connecting to elasticsearch.

These are the command line parameters I am running with:

--name myrel --set provisionDataStore.cassandra=false --set provisionDataStore.elasticsearch=false --set storage.type=elasticsearch --set storage.elasticsearch.host=elasticsearch --set storage.elasticsearch.port=9200 --set storage.elasticsearch.user=elastic --set storage.elasticsearch.password=changeme

Get pods output:

[dev@k8controller ~]$ kubectl get pods
NAME                                     READY     STATUS             RESTARTS   AGE
elasticsearch-0                          1/1       Running            0          24m
elasticsearch-1                          1/1       Running            0          24m
elasticsearch-2                          1/1       Running            0          24m
myrel-jaeger-agent-31jxv                 1/1       Running            0          8m
myrel-jaeger-agent-89kxf                 1/1       Running            0          8m
myrel-jaeger-agent-gldbw                 1/1       Running            0          8m
myrel-jaeger-agent-lh8tx                 1/1       Running            0          8m
myrel-jaeger-agent-t7c0h                 1/1       Running            0          8m
myrel-jaeger-agent-wgvrm                 1/1       Running            0          8m
myrel-jaeger-collector-623630575-b32fr   0/1       CrashLoopBackOff   6          8m
myrel-jaeger-query-3159756439-nk7c6      0/1       CrashLoopBackOff   6          8m

Error:

{"level":"fatal","ts":1515108228.394109,"caller":"collector/main.go:99","msg":"Unable to set up builder","error":"health check timeout: no Elasticsearch node available"

Verified the configmap has the correct value:

Name:		myrel-jaeger
Namespace:	default
Labels:		app=jaeger
		chart=jaeger-0.3.0
		heritage=Tiller
		jaeger-infra=common-configmap
		release=myrel
Annotations:	<none>

Data
====
...
es.server-urls:
----
http://elasticsearch:9200
...

Collaborator Author

Oh, good catch. I've tested with ES without username and password. I'll add those two if they are set. As you can see in this file on lines 60 and 62 they haven't been implemented.

Collaborator

@pavelnikolov I would also recommend setting the default username and password to elastic/changeme since those are the defaults.
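
A rough sketch of the conditional ConfigMap entries being discussed (the es.username/es.password key names are assumptions, modelled on the existing es.server-urls key):

{{- if .Values.storage.elasticsearch.user }}
es.username: {{ .Values.storage.elasticsearch.user | quote }}
{{- end }}
{{- if .Values.storage.elasticsearch.password }}
es.password: {{ .Values.storage.elasticsearch.password | quote }}
{{- end }}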

@ledor473
Contributor

ledor473 commented Jan 11, 2018

@pavelnikolov For me, it's good ✅

My PR #3100 was merged so you will probably have to rebase/update your branch

@pavelnikolov
Collaborator Author

Just rebased.

name: http
protocol: TCP
- containerPort: 9411
- containerPort: {{ .Values.collector.service.zipkinPort }}
Collaborator Author
@pavelnikolov pavelnikolov Jan 11, 2018

@ledor473 I used the named ports from your PR, but the values are no longer hard-coded. Now they come from the values.yaml and are mapped in the config map.

Contributor

@pavelnikolov yes that is even better considering the Collector Pod will listen on that port instead of doing port-forwarding at the Service level 👍

@pavelnikolov
Collaborator Author

/assign @unguiculus

@pavelnikolov
Collaborator Author

@unguiculus any changes required? Did you want me to assign this PR to someone else? Or break it into smaller pieces?

@pavelnikolov
Collaborator Author

/assign @mattfarina

Return the appropriate apiVersion for cronjob APIs.
*/}}
{{- define "cronjob.apiVersion" -}}
{{- if ge .Capabilities.KubeVersion.Minor "8" -}}
Member

This won't work once 1.10 is out. Also, it expects alpha features to be enabled for < 1.8. Better to use Capabilities.APIVersions.Has.
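
A sketch of what that check could look like (the exact API group string is an assumption; batch/v1beta1 is the CronJob API that graduated to beta in 1.8):

{{- define "cronjob.apiVersion" -}}
{{- if .Capabilities.APIVersions.Has "batch/v1beta1" -}}
batch/v1beta1
{{- else -}}
batch/v2alpha1
{{- end -}}
{{- end -}}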

Collaborator Author

Done. I didn't know this was possible! I just learned something new! 🎉

Collaborator Author

As for:

Also, this expects alpha features to be enabled for...

This has been documented in the Prerequisites section of the README.md file and the spark cronjob is disabled by default.

@pavelnikolov
Collaborator Author

/assign @unguiculus

@mattfarina
Contributor

@dvonthenen there have been some small changes to it since you said LGTM. Still LGTM?

@dvonthenen
Collaborator

Will check it out

@dvonthenen
Collaborator

LGTM

@@ -1,28 +1,29 @@
{{- if .Values.collector.enabled -}}
apiVersion: extensions/v1beta1
Contributor

A quick note: Deployments under extensions/v1beta1 will be removed from Kubernetes soon. Since then there have been apps/v1beta1, apps/v1beta2, and apps/v1 API versions. For broad compatibility we'd like to see charts use apps/v1beta2 for the moment. In k8s 1.10 the API versions prior to that can be removed.

This is just a note. Not going to have it hold up this pull request.
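
i.e. the Deployment manifests would switch to something like this (a sketch of the suggested change):

{{- if .Values.collector.enabled -}}
apiVersion: apps/v1beta2
kind: Deployment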

Collaborator Author

Thank you! I'll make sure I update it when 1.10 is out.

Collaborator Author

@mattfarina should I remove extensions/v1beta1 now that k8s 1.10 is out?

@mattfarina
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattfarina, pavelnikolov

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2018
@k8s-ci-robot k8s-ci-robot merged commit f8fd3fb into helm:master Jan 23, 2018
@pavelnikolov pavelnikolov deleted the improve-jaeger branch January 24, 2018 11:42