[Stack Monitoring] [Test Scenario] Out of the box alerting #104440

Closed
28 of 35 tasks
neptunian opened this issue Jul 6, 2021 · 12 comments
Assignees
Labels
Feature:Stack Monitoring Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services test-plan

Comments

@neptunian
Contributor

neptunian commented Jul 6, 2021

Summary

Stack Monitoring provides a set of out-of-the-box alerts, which are created when the user chooses to create them from the menu. The default action for each alert is a server log, and the action messaging is controlled directly by the Stack Monitoring UI code.

7.14 PRs

#102544
#101565
#103718
#101941

Testing

Create Rules

  • The user should see the modal the first time they visit Stack Monitoring. (Note: currently, the modal is only shown on the cluster overview page. After [Monitoring] Add rules modal to listing page #104328, the modal could also appear on the listing page.)
  • When the user doesn't have existing rules and selects "Yes" on the rules modal, rules should be created and the modal shouldn't appear on following visits
  • When the user doesn't have existing rules and selects "No" on the rules modal, rules shouldn't be created and the user shouldn't see the modal anymore
  • When the user doesn't have existing rules and selects "Remind me later" or closes the modal with the X button, rules shouldn't be created and the user should see the modal in a new page session
  • When the user selects "Create default rules" from the dropdown in the navigation bar, rules should be created if they don't exist
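
The expected modal behavior in the scenarios above can be summarized as a small state model. This is only a sketch for reasoning about the test cases: the type and function names (`ModalState`, `shouldShowModal`, `applyChoice`, `startNewSession`) are hypothetical and are not taken from the Kibana code, and the split between a permanent flag and a per-session flag is an assumption about how the dismissal is stored.

```typescript
// Hypothetical model of the rules modal; not Kibana's actual implementation.
type ModalChoice = 'yes' | 'no' | 'remindLater' | 'closed';

interface ModalState {
  rulesExist: boolean; // default rules have been created
  dismissedPermanently: boolean; // assumed to persist across page sessions
  dismissedThisSession: boolean; // assumed to reset in a new page session
}

// The modal is shown only when no rules exist and it has not been dismissed.
function shouldShowModal(state: ModalState): boolean {
  return !state.rulesExist && !state.dismissedPermanently && !state.dismissedThisSession;
}

// Apply the user's answer to the modal and return the resulting state.
function applyChoice(state: ModalState, choice: ModalChoice): ModalState {
  if (choice === 'yes') {
    // "Yes": rules are created, so the modal never needs to appear again.
    return { ...state, rulesExist: true };
  }
  if (choice === 'no') {
    // "No": no rules are created and the modal is not shown anymore.
    return { ...state, dismissedPermanently: true };
  }
  // "Remind me later" or the X button: hide only for the current page session.
  return { ...state, dismissedThisSession: true };
}

// Simulate opening Stack Monitoring in a new page session.
function startNewSession(state: ModalState): ModalState {
  return { ...state, dismissedThisSession: false };
}
```

Each bullet in the list maps to one call: for example, `startNewSession(applyChoice(state, 'remindLater'))` should leave the modal visible, while the same sequence with `'no'` should not.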

Specific rules trigger alerts

Setup and directions

  • Create a "production" deployment in Cloud with this release version
  • Create a "monitoring" deployment in Cloud with this release version
  • Ship metrics from Production cluster to Monitoring cluster by going to "Logs and Metrics" in the Production deployment settings

  • Sign in via Okta to Elastic Cloud Admin (staging) in order to view Kibana Cluster logs. Click the Monitoring Deployment to find the Kibana Cluster Logs link

  • Create all of the default rules by following one of the scenarios in "Create Rules" above
  • In the Cluster Overview view, click the Enter setup mode button in the top left corner of the page to view the rules (grey badge)
  • In the Cluster Overview view, click the Exit setup mode button in the bottom corner of the page to view the alerts (orange badge)

Alert Per Node Rules

  • Disk Usage rule can fire alerts
    • While in setup mode, click the "x rules" badge in the Nodes Panel -> Resource Utilization -> Disk Usage -> Edit Rule
    • Change Notify to "Every time Alert is Active"
    • Change disk usage to something very low which should trigger an alert for each instance, save and exit setup mode.
    • After a minute or two, multiple alerts should be shown in the Stack Monitoring UI in the orange badge - one alert per node. Clicking on the badge, the dropdown should show each alert by the Node Name and how many minutes ago it occurred
    • Click into one of the alerts and then click the link to the node with the alert. This will take you to the Node view where you should again see a description of the alert, here in red.
    • Alerts show in Stack Management
      • Navigate to Stack Management -> Rules and Connectors -> Click Disk Usage rule
      • Disk Usage status is Active
      • Clicking on the rule should list an alert for each node with a name that looks like the node Uuid eg: 6tX5ghz7Q9uw330E6Zt6PQ
    • server log action displays alert description in Kibana server log (see directions above to sign in via Okta)
  • CPU Usage rule can fire alerts (same as Disk Usage except "CPU Usage")
  • JVM Memory Usage rule can fire alerts (same as Disk Usage except "JVM Memory Usage")
  • Missing Monitoring Data rule can fire alerts
    • While in setup mode, click the "x rules" badge in the Nodes Panel -> Errors and Exceptions -> Missing Monitoring Data -> Edit Rule
    • Change "Notify" to "Every time Alert is Active"
    • Change "Notify if monitoring data is missing for the last" to "5 seconds"
    • After a minute or two, multiple alerts should be shown in the Stack Monitoring UI in the orange badge - one alert per node. Clicking on the badge, the dropdown should show each alert by the Node Name and how many minutes ago it occurred
    • Click into one of the alerts and then click the link to the node with the alert. This will take you to the Node view where you should again see a description of the alert, here in red.
    • Alerts show in Stack Management
      • Navigate to Stack Management -> Rules and Connectors -> Click Missing Monitoring Data rule
      • Missing Monitoring Data status is Active
      • Clicking on the rule should list an alert for each node with a name that looks like the node Uuid eg: 6tX5ghz7Q9uw330E6Zt6PQ
    • server log action displays alert description in Kibana server log (see directions above to sign in via Okta)
  • Thread Pool Search Rejection (skip for now)
  • Thread Pool Write Rejection (skip for now)
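
The per-node steps above rely on one idea: lowering the threshold far enough makes every node exceed it, so one alert fires per node. A minimal sketch of that evaluation follows. The names (`NodeDiskStat`, `evaluateDiskUsage`) and the strict `>` comparison are assumptions made for illustration and do not reflect the actual rule implementation; only the "alert named by node UUID" detail comes from the steps above.

```typescript
// Hypothetical per-node threshold check; not the actual Stack Monitoring rule code.
interface NodeDiskStat {
  nodeId: string; // node UUID, e.g. 6tX5ghz7Q9uw330E6Zt6PQ
  nodeName: string;
  diskUsedPercent: number;
}

interface NodeAlert {
  alertId: string; // assumed to be the node UUID, as listed in Stack Management
  nodeName: string;
  message: string;
}

// Produce one alert for each node whose usage exceeds the threshold.
function evaluateDiskUsage(stats: NodeDiskStat[], thresholdPercent: number): NodeAlert[] {
  return stats
    .filter((stat) => stat.diskUsedPercent > thresholdPercent)
    .map((stat) => ({
      alertId: stat.nodeId,
      nodeName: stat.nodeName,
      message: `Disk usage on ${stat.nodeName} is ${stat.diskUsedPercent}% (threshold ${thresholdPercent}%)`,
    }));
}
```

With a very low threshold the result has one entry per node, which is what the orange badge is expected to show; with a realistic threshold the list is usually empty.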

Alert Per Index Rules

  • Shard Size rule can fire alerts
    • While in setup mode, click the "x rules" badge in the Indices Panel -> Resource Utilization -> Shard size -> Edit Rule
    • Change "Notify" to "Every time Alert is Active"
    • Change "Notify when average shard size exceeds this value" to "0.000000000000001"
    • After a minute or two, several alerts should be shown in the Stack Monitoring UI in the orange badge - one alert per index. Clicking on the badge, the dropdown should show each alert by the Index Name and how many minutes ago it occurred
    • Click into one of the alerts and then click the link to the index with the alert. This will take you to the Index view where you should again see a description of the alert, here in red.
    • Alerts show in Stack Management
      • Navigate to Stack Management -> Rules and Connectors -> Click Shard Size rule
      • Shard Size status is Active
      • Clicking on the rule should list an alert for each index with a name that looks like the clusterId:indexName eg: NGIZVsfNS_aN4WgcM4v_iA:apm-7.14.0-error-000001
    • server log action displays alert description in Kibana server log (see directions above to sign in via Okta)
  • CCR Read Rejections rule can fire alerts (skip for now)

Alert Per Cluster Rules

  • Nodes Changed rule can fire alerts
    • While in setup mode, click the "x rules" badge in the Nodes Panel -> Cluster Health -> Nodes Changed -> Edit Rule
    • Change "Notify" to "On a custom action interval" and "Every" to "1 minute"
    • Go to your Monitoring deployment -> Manage -> Edit Deployment -> Coordinating Nodes -> Change "Availability Zones" to 3 -> Save
    • After a minute or two, and after any configuration changes are complete, an alert should be shown in the Stack Monitoring UI in the orange badge - one alert per node added, changed, or removed. In this scenario 1 node should have been added. Clicking on the badge, the dropdown should show each alert and how many minutes ago it occurred
    • Click into one of the alerts and it should say that "x node was added"
    • Alerts show in Stack Management
      • Navigate to Stack Management -> Rules and Connectors -> Cluster Health rule
      • Cluster Health status is Active
      • Clicking on the rule should list an alert for the cluster with the clusterId as the name eg: NGIZVsfNS_aN4WgcM4v_iA
    • server log action displays alert description in Kibana server log (see directions above to sign in via Okta)
  • Elasticsearch Version Mismatch rule can fire alerts (skip for now)
  • Kibana Version Mismatch rule can fire alerts (skip for now)
  • Logstash Version Mismatch rule can fire alerts (skip for now)
  • License Expiration alert rule can fire alerts (skip for now)
  • Cluster Health rule can fire alerts (skip for now)
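
The three rule groups above differ in how the alert is named in Stack Management: by node UUID, by `clusterId:indexName`, or by cluster ID. The sketch below captures that naming so the expected values can be checked against what the UI lists. `RuleScope`, `AlertSubject`, and `expectedAlertName` are hypothetical names used only for this illustration.

```typescript
// Hypothetical helper describing the alert names expected in Stack Management.
type RuleScope = 'node' | 'index' | 'cluster';

interface AlertSubject {
  clusterId: string;
  nodeId?: string; // required for per-node rules
  indexName?: string; // required for per-index rules
}

function expectedAlertName(scope: RuleScope, subject: AlertSubject): string {
  if (scope === 'node') {
    if (!subject.nodeId) throw new Error('per-node alerts need a nodeId');
    return subject.nodeId; // e.g. 6tX5ghz7Q9uw330E6Zt6PQ
  }
  if (scope === 'index') {
    if (!subject.indexName) throw new Error('per-index alerts need an indexName');
    return `${subject.clusterId}:${subject.indexName}`; // clusterId:indexName
  }
  return subject.clusterId; // per-cluster alerts use the cluster ID
}
```

For example, `expectedAlertName('index', { clusterId: 'NGIZVsfNS_aN4WgcM4v_iA', indexName: 'apm-7.14.0-error-000001' })` yields the same string as the Shard Size example in the list.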
@neptunian neptunian added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services test-plan Feature:Stack Monitoring labels Jul 6, 2021
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@neptunian neptunian self-assigned this Jul 6, 2021
@neptunian
Contributor Author

neptunian commented Jul 6, 2021

Tested "Create Rules" in 7.14 Cloud environment and works. I'll let someone else test the alerts :)

@neptunian neptunian removed their assignment Jul 7, 2021
@mgiota mgiota self-assigned this Jul 8, 2021
@mgiota
Contributor

mgiota commented Jul 8, 2021

@neptunian I will test the alerts

@mgiota
Contributor

mgiota commented Jul 12, 2021

@neptunian Can you clarify in which environment I need to do the testing? (staging vs cloud with this release version)

@neptunian
Contributor Author

@mgiota You can do it on staging cloud: https://staging.found.no/

@mgiota
Contributor

mgiota commented Jul 12, 2021

@neptunian Thanks! For some reason staging was miserably failing, so I already did the testing on cloud with 7.14 and here are my findings:

Apart from some small UI issues I found (#105220, #105205, #105249, #105259), everything I tested so far from the list above worked fine. I am not sure which labels I should add to the tickets I created, so feel free to update them if needed.

Here are a few observations/comments:

  • I noticed that when entering setup mode and setting up a rule, that rule was applied to all my clusters and not just the cluster I was editing. I would expect to be able to define different rule thresholds for the different deployments in my clusters list. Is this by design?

  • This PR [Monitoring] Add rules modal to listing page #104328 is merged, but I didn't see any modal in the clusters list page. Is this expected? Since PR is merged, I would assume the code is already there and should work. I can try to create a few deployments on staging and see if the modal appears.

  • I didn't manage to see the Alert Per Cluster Rules, since deployment fails to update for whatever reason. I will try it on staging as well.

@mgiota
Contributor

mgiota commented Jul 12, 2021

@neptunian I tried to do the testing on staging, but staging is quite broken. I cannot even enable Logs and Metrics; I keep getting Internal Server Error.

I did my testing one more time on Cloud with 7.14 and I still didn't get the modal on the listing page (I had no default rules enabled and I had cleared my localStorage and sessionStorage before).

Regarding the Alert Per Cluster Rules, I managed to change the Coordinating Nodes from 2 to 3, but I didn't see any alerts in the Kibana server log, in Stack Management, or in the Stack Monitoring UI.

Could you double check the above issues?

@estermv
Contributor

estermv commented Jul 13, 2021

I did my testing one more time on Cloud with 7.14 and I still didn't get the modal on the listing page (I had no default rules enabled and I had cleared my localStorage and sessionStorage before).

The code is not available yet. According to the release schedule, BC2 was done on July 8, and the PR to add the modal on the listing page was merged on July 9. We need to wait for BC3, which is planned for July 14, to be able to test this. The code is not automatically updated; we need to create a new deployment with the latest build, BC3.

@estermv
Contributor

estermv commented Jul 13, 2021

Regarding the Alert Per Cluster Rules, I managed to change the Coordinating Nodes from 2 to 3, but I didn't see any alerts in the Kibana server log, in Stack Management, or in the Stack Monitoring UI.

I tried this too. I was able to see the alert in the Kibana server log, but not in the Stack Monitoring UI or Stack Management. @neptunian did you manage to see the alert in the UI?
I guess that we only see the alert in the UI when it is in the active state. Could it be that the alert is in the active state for such a short period of time that we don't see anything in the UI?

@mgiota
Contributor

mgiota commented Jul 13, 2021

@estermv

Code is not available yet.

Thanks for clarifying!

Could it be that the alert is in active state for such a short period of time that we don't see anything in the UI?

That was also my guess.

@neptunian
Contributor Author

neptunian commented Jul 14, 2021

@estermv @neptunian Thanks for testing! I was able to reproduce the Nodes Changed alert, but similarly it appeared only in a very short window. So I think if it was set to only check once a day or something, you'd see the alert for longer, but then I'm not sure how long you'd have to wait to see the alert for the first time. I'll take another look at it.
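
One way to picture the short window discussed above: if the rule compares the node list between consecutive checks (an assumption about the behavior, not something confirmed in this thread), the alert is active only for the single check that sees the change and is resolved at the next one. The sketch below simulates that; `nodesChangedTimeline` is a hypothetical helper, not part of Kibana.

```typescript
// Hypothetical simulation: a Nodes Changed alert is active on a check only if
// the node list differs from the one seen on the previous check.
function nodesChangedTimeline(snapshots: string[][]): boolean[] {
  const activePerCheck: boolean[] = [];
  for (let i = 1; i < snapshots.length; i++) {
    const previous = snapshots[i - 1];
    const current = snapshots[i];
    const added = current.filter((node) => previous.indexOf(node) === -1);
    const removed = previous.filter((node) => current.indexOf(node) === -1);
    activePerCheck.push(added.length > 0 || removed.length > 0);
  }
  return activePerCheck;
}
```

Under this assumption, with a 1 minute check interval the alert would be visible for roughly one minute after a node is added, while a daily check would keep it visible for about a day but could also take up to a day to fire.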

@neptunian
Contributor Author

This is tricky to produce in Cloud for whatever reason, but I was able to get cluster alerts working with BC3 of 7.14 pointed to the observability release cluster.
