[extension/ecsobserver] Write discovered targets as Prometheus file sd #3785

Merged: 9 commits, Jun 30, 2021

pingleig
Contributor

@pingleig pingleig requested a review from jrcamp as a code owner June 14, 2021 18:20
@pingleig pingleig requested a review from a team June 14, 2021 18:20
@pingleig
Contributor Author

cc @anuraaga @Aneurysm9

@pingleig pingleig changed the title [extension/ecsobserver] Write discovered targets as file sd [extension/ecsobserver] Write discovered targets as Prometheus file sd Jun 14, 2021
@pingleig
Contributor Author

The load test failed because some UDP messages were lost: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/3785/checks?check_run_id=2822779561

2021/06/14 18:45:44 Agent process stopped, exit code=0
    validator.go:42: 
        	Error Trace:	validator.go:42
        	            				test_case.go:279
        	            				scenarios.go:190
        	            				log_test.go:177
        	Error:      	Not equal: 
        	            	expected: 139900
        	            	actual  : 139867
        	Test:       	TestLog10kDPS/udp
        	Messages:   	Received and sent counters do not match.

Member

@mxiamxia mxiamxia left a comment

Thanks!

extension/observer/ecsobserver/factory.go (outdated, resolved)
extension/observer/ecsobserver/sd.go (outdated, resolved)
extension/observer/ecsobserver/target.go (outdated, resolved)
@pingleig
Contributor Author

The Windows test failed: https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector-contrib/15444/workflows/90c5815a-0b99-4ab5-ab15-d93e2aeb0575/jobs/134087

--- FAIL: Test_statsdreceiver_EndToEnd (0.00s)
    --- FAIL: Test_statsdreceiver_EndToEnd/default_config_with_9s_interval (0.00s)
        receiver_test.go:132: 
            	Error Trace:	receiver_test.go:132
            	Error:      	Received unexpected error:
            	            	listen udp 127.0.0.1:50835: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
            	Test:       	Test_statsdreceiver_EndToEnd/default_config_with_9s_interval

@pingleig
Contributor Author

I rebased; the unit tests passed but the load test failed: https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector-contrib/15652/workflows/b630b1c3-7505-47a1-aade-878eb52a8f2a/jobs/136105

Failed
2021/06/22 23:11:09 Starting mock backend...
2021/06/22 23:11:09 Starting Agent (/home/circleci/project/bin/otelcontribcol_linux_amd64)
2021/06/22 23:11:09 Writing Agent log to /home/circleci/project/testbed/tests/results/TestLog10kDPS/OTLP/agent.log
2021/06/22 23:11:09 Agent running, pid=34225
2021/06/22 23:11:10 Starting load generator at 10000 items/sec.
2021/06/22 23:11:12 Agent RAM (RES):   0 MiB, CPU: 0.0% | Sent:     17109 items | Received:    16,100 items (5,364/sec)
2021/06/22 23:11:12 Performance error: CPU consumption is 37.3%, max expected is 35%
2021/06/22 23:11:12 Gracefully terminating Agent pid=34225, sending SIGTEM...
2021/06/22 23:11:12 Cannot send logs: failed to push log data via OTLP exporter: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:37133: connect: connection refused"
2021/06/22 23:11:12 Agent process stopped, exit code=0
2021/06/22 23:11:12 CPU consumption is 37.3%, max expected is 35%
    test_case.go:314: CPU consumption is 37.3%, max expected is 35%
2021/06/22 23:11:12 Stopped generator. Sent:     18500 items
2021/06/22 23:11:12 Stopping mock backend...
2021/06/22 23:11:12 Stopped backend. Received:    17,400 items (5,556/sec)

By the way, GitHub Actions still seems to be running, so are we testing on both GitHub Actions and CircleCI?

@bogdandrutu
Member

@alolita It would be good to have a small design doc about this; I'm not sure why we decided to write this in Prometheus format.

@pingleig
Contributor Author

@bogdandrutu It's mentioned in the README: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/observer/ecsobserver/README.md#notify-prometheus-receiver-of-discovered-targets The main reason is that if we use the receiver creator to create new receivers (i.e. pass discovered targets in-process and create a new receiver for each target), performance is poor; see #1395 (comment). Eventually the extension can move away from writing a file in file sd format and support other receivers (e.g. a batch mode that reconfigures a receiver within the receiver creator framework).
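
For readers unfamiliar with Prometheus file-based service discovery, here is a minimal sketch of the idea described above: discovered targets are serialized as a list of target groups and written to a file that a Prometheus scraper watches. This is not the extension's actual code. The type `targetGroup`, the functions `marshalFileSD` and `writeFileSD`, the file name, the address, and the `job` label value are all hypothetical, and JSON is used only because the Go standard library supports it (file sd accepts JSON or YAML).

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// targetGroup mirrors one entry of a Prometheus file sd file: a list of
// "host:port" targets plus labels applied to every target in the group.
// The type name is hypothetical, not taken from the ecsobserver source.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels,omitempty"`
}

// marshalFileSD encodes target groups as file sd JSON.
func marshalFileSD(groups []targetGroup) ([]byte, error) {
	if groups == nil {
		groups = []targetGroup{} // encode "no targets" as [] rather than null
	}
	return json.Marshal(groups)
}

// writeFileSD writes the groups to a temp file in the same directory and
// then renames it over path, so a reader never sees a half-written file.
func writeFileSD(path string, groups []targetGroup) error {
	data, err := marshalFileSD(groups)
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".targets-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	dir, err := os.MkdirTemp("", "ecs-sd")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Hypothetical discovered target; real values would come from the ECS API.
	path := filepath.Join(dir, "ecs_sd_targets.json")
	groups := []targetGroup{{
		Targets: []string{"10.0.0.12:9090"},
		Labels:  map[string]string{"job": "ecs-demo"},
	}}
	if err := writeFileSD(path, groups); err != nil {
		panic(err)
	}
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

The trade-off discussed in this thread is visible here: one file write updates every target at once, instead of one receiver being created per target.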

@bogdandrutu
Member

/cc @jrcamp I think the performance problem was discussed; is this something we should fix?

@mxiamxia
Member

mxiamxia commented Jun 28, 2021

Hi @bogdandrutu, @jrcamp. We discussed this performance issue before, and @kohrapha explored a few options, including implementing a simple native Prom receiver as Jay proposed. But writing ECS scrape targets to a static file is still the better solution we have at the moment, and it unblocks AWS support for ECS Prom metrics auto-discovery in OTel.

This is the last PR we need to merge to enable ECS Prom metrics auto-discovery for OTel; we plan to launch the feature as a Preview in AWS shortly. Could you please help merge this PR so we are unblocked? We'll continue to work on possible better solutions and fixes.

cc/ @alolita

@jrcamp
Contributor

jrcamp commented Jun 28, 2021

Yeah I'm going to look into the performance issues of many scrapers with receivercreator. I think this is a decent workaround in the meantime. Will look into it in #3977.

@alolita
Member

alolita commented Jun 28, 2021

@bogdandrutu I think it would be useful to do a design on alternatives with pros and cons, especially regarding perf issues.

@mxiamxia can you and the team put together a short design proposal outlining the top alternatives?

@alolita alolita added the ready to merge Code review completed; ready to merge by maintainers label Jun 28, 2021
@mxiamxia
Member

mxiamxia commented Jun 28, 2021

@bogdandrutu I think it would be useful to do a design on alternatives with pros and cons, especially regarding perf issues.

@mxiamxia can you and the team put together a short design proposal outlining the top alternatives?

Hi @alolita, the current implementation is a good alternative for us, based on the previous discussion in #1395, and we have the high-level design in the README. I think Jay will take a first stab at the perf issue.

We need to get this PR merged ASAP.

@alolita
Member

alolita commented Jun 28, 2021

I've marked this PR ready-to-merge. @bogdandrutu please take a look and flag any other concerns you may have.

- Stop the collector process from the extension using `host.ReportFatalError`;
  otherwise a failure of the extension is only logged.

The Prometheus receiver expects a metric's job name to be the same as the one in the config.
To keep behaviour similar to the CloudWatch agent's discovery implementation,
we support getting the job name from a Docker label, but doing so breaks the metric
type.

For a long-term solution, see open-telemetry/opentelemetry-collector#575 (comment)
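
The first commit note above, escalating an extension failure through `host.ReportFatalError` instead of only logging it, can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `fatalReporter` is a hypothetical one-method stand-in for the collector host, and `runDiscovery`, `recordingHost`, and `errDiscovery` are invented names for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// fatalReporter is a hypothetical stand-in for the single host method the
// commit note refers to; the real host interface lives in the collector.
type fatalReporter interface {
	ReportFatalError(err error)
}

// errDiscovery is a made-up failure used only for this illustration.
var errDiscovery = errors.New("list tasks: access denied")

// runDiscovery runs one discovery pass. On failure it reports the error to
// the host rather than just logging it, so the host can stop the process.
func runDiscovery(host fatalReporter, discover func() error) {
	if err := discover(); err != nil {
		host.ReportFatalError(fmt.Errorf("ecsobserver: %w", err))
	}
}

// recordingHost is a test double that records reported errors instead of
// shutting anything down.
type recordingHost struct {
	fatal []error
}

func (h *recordingHost) ReportFatalError(err error) {
	h.fatal = append(h.fatal, err)
}

func main() {
	host := &recordingHost{}
	runDiscovery(host, func() error { return nil })          // success: nothing reported
	runDiscovery(host, func() error { return errDiscovery }) // failure: reported to host
	fmt.Println(len(host.fatal), host.fatal[0])
}
```

Depending on a narrow interface rather than the full host keeps the failure path easy to exercise in a unit test.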
@pingleig
Contributor Author

The CircleCI unit test failed: https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector-contrib/16013/workflows/41d3192a-5bfb-4867-b721-7d46f891d1da/jobs/139341

running go unit test ./... + coverage in /home/circleci/project/exporter/sumologicexporter
--- FAIL: TestPushFailedBatch (21.62s)
    exporter_test.go:292: 
        	Error Trace:	exporter_test.go:292
        	Error:      	Error message not equal:
        	            	expected: "error during sending data: 500 Internal Server Error"
        	            	actual  : "Post \"http://127.0.0.1:33123\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
        	Test:       	TestPushFailedBatch
FAIL
coverage: 95.0% of statements
FAIL	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/sumologicexporter	25.675s

Labels
ready to merge Code review completed; ready to merge by maintainers
7 participants