Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Diagnostics] Fix AnalyzeLogs to provide more clear debug message #18654

Merged
merged 4 commits into from
Feb 21, 2018
Merged

Conversation

nak3
Copy link
Contributor

@nak3 nak3 commented Feb 17, 2018

When we run oc adm diagnostics AnalyzeLogs, diagnostics command
often misses error messsages in journal logs. So, as an admin we are
trying to find out the reason, but debug message gives us unclear
message:

$ oc adm diagnostics AnalyzeLogs -l=0
   ...
debug: Stopped reading docker log: timestamp 1518835107917828 too old
   ...

So, this patch changes regarding AnayzeLogs as:

  • Output field of struct of discovered systemd unit
  • Produce correct error message if journal log has invalid timestamp
  • Output why diagnostics stopped reading logs with readable timestamp.

Here is the message after applied this patch:

debug: Stopped reading docker log: timestamp 2018-02-17 11:33:58 +0900 JST more than 1 hour ago

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 17, 2018
@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) {

func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult {
r := types.NewDiagnosticResult(MasterNodeName)
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips))
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug message currently shows:

debug: map[origin-node:{origin-node false false false 0} openvswitch:{openvswitch true false true 0} iptables:{iptables true true true 0} docker:{docker true true true 0} etcd:{etcd true true true 0} kubernetes:{kubernetes false false false 0} origin-master:{origin-master false false false 0} atomic-openshift-master:{atomic-openshift-master false false false 0} atomic-openshift-node:{atomic-openshift-node true true true 0}]

Actually I wanted to delete it, but if I deleted it, the number DClu3005 will be absent in master_node.go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no special reason to keep an identifier.

But, I think you are talking about the debug statement in locate_units.go

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to keep the message just to preserve DClu3005 tag. If the tag is present then it should point to unique error/debug but if the error/debug is not relevant any more then we should delete the error/debug. Don't keep stale code just to maintain these tags (we have done the same thing for networkcheck and aggregatedlogging diagnostics).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always pinpoint the error/debug/warning without these tags, we just need ose release (oc version) and filepath:line-number which are available. I think these tags are more confusing than useful. May be we should get rid of them?
@sosiouxme

Copy link
Member

@sosiouxme sosiouxme Feb 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravisankar The tags certainly serve no purpose for debug/info level at this time. The purpose in warn/error is to have something stable and semantically meaningless that's safe to grep in automation, as there is no machine-readable output at this time. The tags were originally intended for machine-readable output, which is a requested and long-neglected feature.

I doubt anyone's using the tags for that purpose now. I suppose overall it may be better to get rid of tags entirely. It would not be hard to retrofit them when fleshing out the machine-readable feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I made a comment under wrong line. As @sosiouxme, I meant logger.Debug("DS1003", fmt.Sprintf("%v", systemdUnits)). (And I removed it now.)

The message DClu3005 is really necessary until this bug was fixed https://bugzilla.redhat.com/show_bug.cgi?id=1368865

@nak3
Copy link
Contributor Author

nak3 commented Feb 20, 2018

cc @sosiouxme

Copy link

@pravisankar pravisankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit otherwise LGTM

@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) {

func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult {
r := types.NewDiagnosticResult(MasterNodeName)
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips))
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to keep the message just to preserve DClu3005 tag. If the tag is present then it should point to unique error/debug but if the error/debug is not relevant any more then we should delete the error/debug. Don't keep stale code just to maintain these tags (we have done the same thing for networkcheck and aggregatedlogging diagnostics).

@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) {

func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult {
r := types.NewDiagnosticResult(MasterNodeName)
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips))
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always pinpoint the error/debug/warning without these tags, we just need ose release (oc version) and filepath:line-number which are available. I think these tags are more confusing than useful. May be we should get rid of them?
@sosiouxme

Copy link
Member

@sosiouxme sosiouxme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this up. I have some requests and then we can merge this.

BTW, the log analysis doesn't check many patterns yet -- if there are useful log entries you'd like this to find for you, please suggest those or add a PR...

@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) {

func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult {
r := types.NewDiagnosticResult(MasterNodeName)
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips))
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no special reason to keep an identifier.

But, I think you are talking about the debug statement in locate_units.go

@@ -99,8 +99,13 @@ func (d AnalyzeLogs) Check() types.DiagnosticResult {
if err := json.Unmarshal(bytes, &entry); err != nil {
r.Debug("DS0003", fmt.Sprintf("Couldn't read the JSON for this log message:\n%s\nGot error %s", string(bytes), errStr(err)))
} else {
if lineCount > 500 && stampTooOld(entry.TimeStamp, timeLimit) {
r.Debug("DS0004", fmt.Sprintf("Stopped reading %s log: timestamp %s too old", unitName, entry.TimeStamp))
old, unixTimestamp, err := stampTooOld(entry.TimeStamp, timeLimit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few suggestions to make here.

stampTooOld is only ever used here. All it really accomplishes is making this logic easy to understand. Here you've turned the method into trinary logic (true/false/error) plus return value and then you have to process that trinary again upon receiving it. I think it would be simpler just to bring the method inline and use loop controls rather than return.

old, unixTimestamp, err := stampTooOld(entry.TimeStamp, timeLimit)
if err != nil {
r.Error("DS0004", err, fmt.Sprintf("Stopped reading %s log, timestamp %s is invalid", unitName, entry.TimeStamp))
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtle difference here. This will end the loop the first time it sees a malformed log line (if that ever happens! not sure I've ever seen it). Previously it would only end the loop if there was an error and it had seen more than 500 lines.

It would make more sense to me to skip the log line with a continue here and keep looking.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 21, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nak3, sosiouxme

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2018
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 37a4ec3 into openshift:master Feb 21, 2018
@nak3 nak3 deleted the diagnostics branch February 22, 2018 00:18
@nak3
Copy link
Contributor Author

nak3 commented Feb 22, 2018

Thank you. I think I should have squashed these commits, before merge. (I'm sorry!)

@sosiouxme
Copy link
Member

That does make life easier when doing a git bisect, but otherwise isn't really that important, so don't worry about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants