[e2e][flexible-ipam] Fix TestPrometheus failed #3868

Merged 1 commit on Jun 9, 2022.
73 changes: 52 additions & 21 deletions test/e2e/prometheus_test.go
@@ -30,6 +30,7 @@ import (
 	"github.com/prometheus/common/expfmt"
 	v1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/util/wait"
 )

 	// Agent metrics to validate
@@ -140,16 +141,31 @@ func getMetricsFromAPIServer(t *testing.T, url string, token string) string {
 		req.Header.Add("Authorization", "Bearer "+token)
 	}

-	// Query metrics via HTTPS from Pod
-	resp, err := client.Do(req)
-	if err != nil {
-		t.Fatalf("Error fetching metrics from %s: %v", url, err)
-	}
-	defer resp.Body.Close()
+	var body []byte
+	err = wait.PollImmediate(defaultInterval, defaultTimeout, func() (bool, error) {
Reviewer (Contributor): How long is defaultTimeout? Does the transient failure happen only in flexible IPAM tests? Do we know why?

Reviewer (Contributor): Please add some comments to explain why we retry here.

Author (Contributor): defaultTimeout is 90s, but 10s is actually enough for this case; we just use the default value as in the other cases. This failure first appeared in flexible-ipam-e2e two weeks ago. We still have not located the change that caused it, since the issue occurs intermittently. I added comments for the 3 cases I want to fix in this PR.

Reviewer (Contributor): Do you believe we should root-cause the failure? If so, please add comments.

Author (Contributor): Added comments.

+		// Query metrics via HTTPS from Pod
+		resp, err := client.Do(req)
+		if err != nil {
+			t.Fatalf("Error fetching metrics from %s: %v", url, err)
+		}
+		defer resp.Body.Close()
+
+		body = []byte{}
+		body, err = ioutil.ReadAll(resp.Body)
+		if err != nil {
+			t.Fatalf("Error retrieving metrics from %s. response: %v", url, err)
+		}
+
-	body, err := ioutil.ReadAll(resp.Body)
+		if resp.StatusCode >= 300 {
+			// Handle unexpected StatusCode returned when prometheus is not ready
+			// TODO: RCA the reason of resp.StatusCode=401
+			t.Logf("Response StatusCode: %d, Body: %s", resp.StatusCode, string(body))
+			return false, nil
+		}
+		return true, nil
+	})
 	if err != nil {
-		t.Fatalf("Error retrieving metrics from %s. response: %v", url, err)
+		t.Fatalf("Wrong StatusCode from Prometheus: %v", err)
 	}

 	return string(body)
@@ -268,22 +284,37 @@ func testMetricsFromPrometheusServer(t *testing.T, data *TestData, prometheusJob
 	queryURL := fmt.Sprintf("http://%s/api/v1/targets/metadata?%s", address, path)

 	client := &http.Client{}
-	resp, err := client.Get(queryURL)
-	if err != nil {
-		t.Fatalf("Error fetching metrics from %s: %v", queryURL, err)
-	}
-	defer resp.Body.Close()
+	var output prometheusServerOutput
+	err := wait.PollImmediate(defaultInterval, defaultTimeout, func() (bool, error) {
+		resp, err := client.Get(queryURL)
+		if err != nil {
+			// Retry when accessing prometheus failed for flexible-ipam
+			t.Logf("Error fetching metrics from %s: %v", queryURL, err)
+			return false, nil
+		}
+		defer resp.Body.Close()

-	body, err := ioutil.ReadAll(resp.Body)
-	if err != nil {
-		t.Fatalf("Failed to retrieve JSON data from Prometheus: %v", err)
-	}
+		body, err := ioutil.ReadAll(resp.Body)
+		if err != nil {
+			t.Fatalf("Failed to retrieve JSON data from Prometheus: %v", err)
+		}

-	// Parse JSON results
-	var output prometheusServerOutput
-	err = json.Unmarshal(body, &output)
+		// Parse JSON results
+		output = prometheusServerOutput{}
+		err = json.Unmarshal(body, &output)
+		if err != nil {
+			t.Fatalf("Failed to parse JSON data from Prometheus: %v", err)
+		}
+		if len(output.Data) == 0 {
+			// Handle empty output data returned when prometheus is not ready
+			// TODO: RCA the reason of empty result
+			t.Logf("No output data from Prometheus: %v", err)
+			return false, nil
+		}
+		return true, nil
+	})
 	if err != nil {
-		t.Fatalf("Failed to parse JSON data from Prometheus: %v", err)
+		t.Fatalf("No output data from Prometheus: %v", err)
 	}

// Create a map of all the metrics which were found on the server