Agent fails to start if log and trace configurations are omitted. #1320

Open
commiterate opened this issue Aug 28, 2024 · 3 comments

Describe the bug

The CloudWatch Agent fails to start if both log and trace configurations are omitted. The startup path assumes that at least one of the two exists.

Details

When config-translator is given an amazon-cloudwatch-agent.json without any tracing configurations, it first generates an amazon-cloudwatch-agent.yaml that contains only null and then deletes that file.

https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/cmd/config-translator/translator.go#L130

https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/translator/cmdutil/translatorutil.go#L237

start-amazon-cloudwatch-agent does not check if amazon-cloudwatch-agent.yaml is deleted before calling amazon-cloudwatch-agent ... -otelconfig {...}/amazon-cloudwatch-agent.yaml.

https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/cmd/start-amazon-cloudwatch-agent/path.go#L68-L74

When the CloudWatch Agent attempts to read the various config files, it assumes amazon-cloudwatch-agent.yaml will always exist if no logging configurations are specified.

https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/cmd/amazon-cloudwatch-agent/amazon-cloudwatch-agent.go#L309-L332

The path for amazon-cloudwatch-agent.yaml is then passed to an OpenTelemetry configuration provider. When the provider attempts to read the file, it fails with a file-not-found error:

2024-08-28T06:06:41Z E! [telegraf] Error running agent: cannot resolve the configuration: cannot retrieve the configuration: unable to read the file file:/run/amazon-cloudwatch-agent/amazon-cloudwatch-agent.yaml: open /run/amazon-cloudwatch-agent/amazon-cloudwatch-agent.yaml: no such file or directory

Steps to reproduce

Start the CloudWatch Agent with log and trace configurations omitted.

What did you expect to see?

The agent doesn't crash.

What did you see instead?

The agent crashes.

What version did you use?

v1.300045.0

What config did you use?

amazon-cloudwatch-agent.json

{
   "agent": {
      "debug": true,
      "logfile": "/var/log/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log",
      "region": "ap-northeast-1"
   }
}

Environment

OS: NixOS

Additional context

NixOS/nixpkgs#337212 (comment)

We're currently trying to add amazon-cloudwatch-agent to the Nix package manager and a systemd unit to NixOS.

This currently involves rewriting the systemd configuration provided in this repository. The provided unit can't be used on NixOS because it calls start-amazon-cloudwatch-agent, which hardcodes the agent installation directory.

https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/packaging/dependencies/amazon-cloudwatch-agent.service

#1319

The resulting systemd configuration looks approximately like this:

[Unit]
Description=Amazon CloudWatch Agent
After=network.target

[Service]
Type=simple
RuntimeDirectory=amazon-cloudwatch-agent
LogsDirectory=amazon-cloudwatch-agent
ExecStartPre={install directory}/bin/config-translator \
  -config {...}/common-config.toml \
  -input {...}/amazon-cloudwatch-agent.json \
  -input-dir {...}/amazon-cloudwatch-agent.d \
  -output ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.toml
ExecStart={install directory}/bin/amazon-cloudwatch-agent \
  -config ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.toml \
  -envconfig ${RUNTIME_DIRECTORY}/env-config.json \
  -otelconfig ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.yaml \
  -pidfile ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.pid
KillMode=process
Restart=on-failure
RestartSec=60s

[Install]
WantedBy=multi-user.target

This effectively does the same thing as start-amazon-cloudwatch-agent but without the path hardcoding.

Like start-amazon-cloudwatch-agent, this will always pass the -otelconfig option to amazon-cloudwatch-agent even if config-translator deletes the expected amazon-cloudwatch-agent.yaml file.

This was uncovered when running a NixOS test for this systemd unit. The test:

  1. Starts a VM running NixOS with the agent as a systemd service. The agent is in onPremise mode without any log, metric, or trace configurations.
  2. Waits for the agent service to be active.
  3. Checks for the configuration files generated by config-translator and the PID file generated by the agent.

We noticed the agent was repeatedly crashing right after systemd started it. Checking the agent logs revealed the file-not-found error quoted above.

@commiterate changed the title from "Agent fails to start if log and trace configurations are omitted" to "Agent fails to start if log and trace configurations are omitted." on Aug 29, 2024