
Track persistent volume in k8s environment #4965

Conversation

@RafalKorepta (Contributor) commented May 30, 2022

Cover letter

Change the Redpanda ID calculation to be based on the Redpanda data
folder mounted to the POD. If a user chooses faster local storage, the
Redpanda ID is regenerated every time the local disk is wiped due to a
node change.

This PR also changes the seed server list calculation: the seed server
list is cleared only when there is no data and no other seed server
node is running.

Release notes

  • The init container in the Kubernetes deployment will track the Redpanda ID
    using persistent volumes

Improvements

  • The seed server list is cleared only in the POD with ordinal 0, when the
    cluster has more than 1 replica and no other replica is running.

First FP

Fix linter issues with file permissions.

REF:
#333

Rafal Korepta added 5 commits May 30, 2022 09:50
For newly created Redpanda pods, the Redpanda ID will be randomly generated.
For Redpanda pods that have data in the persistent volume, the Redpanda ID
will be the statefulset ordinal number, so as not to introduce a breaking change.
For Redpanda pods that have a .redpanda_id file in the Redpanda data persistent
volume, the content of that file will be used for the node.ID configuration
(a sketch of this precedence follows the commit list below).
If the Redpanda data folder doesn't have any content and the pod ordinal
number is 0, then initialize the Redpanda seed server list to an empty list. In
any other case the seed server list should point to all brokers.
Because the configurator is used in a Kubernetes environment,
the host name reflects the POD name. As the Redpanda operator uses a
statefulset, all PODs will have an ordinal number as a suffix.
As the node with pod ordinal 0 might lose its data folder while the cluster
is bootstrapping, the seed server list should not be empty if the cluster was
already formed. There might be a race condition where the pod with ordinal 0
forms a one-node cluster and, as it becomes a leader, overwrites the other
nodes' ring0 raft group history. To prevent this, the seed server list will
be cleared only if no other node in the cluster is working (pod in the Running
phase) and none reports as live (v1/brokers list where is_alive is set to true).
In rare cases a random int64 Redpanda ID might collide with an already existing
ID. Currently an ID can not be reused.
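Taken together, the commits above describe a three-step ID-resolution precedence: reuse the .redpanda_id file if present, fall back to the statefulset ordinal when the data volume already has content, and otherwise generate a random ID. A minimal sketch of that precedence, assuming hypothetical helper names and the default /var/lib/redpanda/data path (the actual configurator code may differ):

package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// podOrdinal extracts the trailing ordinal from a statefulset pod
// hostname such as "redpanda-2".
func podOrdinal(hostname string) (int, error) {
	idx := strings.LastIndex(hostname, "-")
	if idx < 0 {
		return 0, fmt.Errorf("hostname %q has no ordinal suffix", hostname)
	}
	return strconv.Atoi(hostname[idx+1:])
}

func isDirEmpty(dir string) (bool, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return false, err
	}
	return len(entries) == 0, nil
}

// resolveRedpandaID applies the precedence from the commits:
// 1. reuse the ID stored in .redpanda_id inside the data volume,
// 2. fall back to the statefulset ordinal when the volume already has data,
// 3. otherwise generate a random ID for a brand-new pod.
func resolveRedpandaID(dataDir string, ordinal int) (int, error) {
	idFile := filepath.Join(dataDir, ".redpanda_id")
	if b, err := os.ReadFile(idFile); err == nil {
		return strconv.Atoi(strings.TrimSpace(string(b)))
	}
	empty, err := isDirEmpty(dataDir)
	if err != nil {
		return 0, err
	}
	if !empty {
		// Existing data but no ID file: keep the legacy ordinal-based ID
		// so the change is not breaking.
		return ordinal, nil
	}
	return int(rand.New(rand.NewSource(time.Now().UnixNano())).Int63()), nil
}

func main() {
	host, _ := os.Hostname()
	ord, err := podOrdinal(host)
	if err != nil {
		ord = 0
	}
	id, err := resolveRedpandaID("/var/lib/redpanda/data", ord)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("resolved Redpanda ID:", id)
}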
@RafalKorepta force-pushed the rk/gh-2096/track-persistent-volume branch from ede0168 to f7fc467 on May 30, 2022 07:52
@alenkacz (Contributor) left a comment:

just left some initial questions

@@ -42,6 +56,9 @@ const (
externalConnectivityAddressTypeEnvVar = "EXTERNAL_CONNECTIVITY_ADDRESS_TYPE"
hostPortEnvVar = "HOST_PORT"
proxyHostPortEnvVar = "PROXY_HOST_PORT"
dataDirPath = "DATA_DIR_PATH"

redpandaIDs = "redoanda-ids"
Contributor:

nit: typo

}

err = calculateRedpandaID(func() int {
return int(rand.NewSource(time.Now().Unix()).Int63())
Contributor:

So it's random and not incrementing, as we discussed somewhere on Slack? I would strongly prefer to do that :)

return nil
}

empty, err := IsRedpandaDataFolderEmpty(c.dataDirPath)
Contributor:

Could you add a comment explaining this change? So if the data folder is empty it gets an empty seed server list? So that it forms a new cluster, or ...? Also I think this would not work on 2-node clusters.

@nicolaferraro (Member) left a comment:

Great start! I'm a bit unsure about the last 2 commits that introduce calls to the k8s API and admin API from the configurator. Maybe this works, but there may be dragons (like deadlocks) if we go with this approach.

As far as I understand, those changes are needed to fix the following problems:

  • Node uniqueness: for which we could also consider stronger UIDs, like mixing ordinal + timestamp + entropy-based rng
  • Seed server on startup: for which we could try to find a solution with the core team

@@ -336,6 +340,10 @@ func checkEnvVars() (configuratorConfig, error) {
value: &hostPort,
name: hostPortEnvVar,
},
{
Member:

Nit: indentation seems wrong

@@ -153,6 +159,62 @@ func main() {
log.Printf("Configuration saved to: %s", c.configDestination)
}

func calculateRedpandaID(
cfg *config.Config, c configuratorConfig, initialID brokerID,
Member:

Nit: I'd call the argument podOrdinal for more clarity

return nil
}

redpandaID := int(rand.NewSource(time.Now().Unix()).Int63())
Member:

I'm not sure about this. .Unix() gives seconds elapsed since 1970, and there's a high chance that two pods will start within the same second when a fresh cluster with 3 or more instances is created. I'd also remove the randomization part if the uniqueness is based on the seed: IIRC the same seed leads to the same result, also on different machines. Maybe including the hostIndex helps with uniqueness, like newID := time.Now().UnixMilli() * 1e4 + podOrdinal

Contributor Author:

Agree with your proposal.
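A self-contained sketch of the scheme the reviewer proposes, under the assumption that the ordinal stays below 10000 so the timestamp and ordinal parts never overlap; two pods would then only collide if they started within the same millisecond with the same ordinal, which a statefulset rules out:

package main

import (
	"fmt"
	"time"
)

// proposedID combines the millisecond timestamp with the pod ordinal,
// as suggested in the review. Assumes podOrdinal < 10000.
func proposedID(podOrdinal int) int64 {
	return time.Now().UnixMilli()*1e4 + int64(podOrdinal)
}

func main() {
	fmt.Println(proposedID(0))
}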

return fmt.Errorf("checking Redpanda data folder content (%s): %w", c.dataDirPath, err)
}

if index == 0 && empty {
Member:

I'm thinking of this scenario:

  • Init container starts, configurator writes the .redpanda_id file
  • Main pod starts and crashes
  • The configurator is now restarted and finds .redpanda_id, so it uses the full seed_server list

Is the cluster able to form a quorum anyway?

Contributor Author:

If Redpanda crashes, only the container will be restarted, not the whole POD.

Your concern is still valid: someone could delete the pod before it formed a cluster.

I need to think more about it.

if pod.Name == c.hostName {
continue
}
if pod.Status.Phase == corev1.PodRunning {
Member:

I understand this is a problem we need to avoid, but I wonder if we can find a different way. I've seen e.g. cases where all nodes crash during an upgrade, which can still lead the first one to think it's alone. Isn't Redpanda supposed to handle these situations?

Contributor Author:

apparently Redpanda might form 2 clusters -> #333
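For context, a minimal sketch of the pod-phase half of the guard being discussed, using the same controller-runtime client the diff uses elsewhere; the function name and the namespace-only selector are assumptions for illustration:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// anyOtherPodRunning reports whether any pod other than the one the
// configurator runs in is in the Running phase, in which case the seed
// server list must not be cleared.
func anyOtherPodRunning(ctx context.Context, c client.Client, ns, self string) (bool, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns)); err != nil {
		return false, err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Name == self {
			continue // skip ourselves
		}
		if pod.Status.Phase == corev1.PodRunning {
			return true, nil // another replica is up: keep the seed list
		}
	}
	return false, nil
}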

}

for _, br := range brokers {
if br.NodeId == 0 {
Member:

nodeId is no longer the podOrdinal, so this check may never hold

Contributor Author:

You are right. Fixed in next push.
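One way the check could be corrected, since NodeId no longer matches the pod ordinal: treat the cluster as already formed when any broker reports itself alive. The Broker struct below mirrors the v1/brokers payload mentioned in the commits and is an assumption for illustration, not the actual admin API types:

// Broker is a hypothetical stand-in for the v1/brokers response entry.
type Broker struct {
	NodeID  int  `json:"node_id"`
	IsAlive bool `json:"is_alive"`
}

// clusterAlreadyFormed returns true if at least one broker is live,
// meaning the seed server list must never be cleared.
func clusterAlreadyFormed(brokers []Broker) bool {
	for _, br := range brokers {
		if br.IsAlive {
			return true
		}
	}
	return false
}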

@@ -56,6 +57,8 @@ const (
hostPortEnvVar = "HOST_PORT"
proxyHostPortEnvVar = "PROXY_HOST_PORT"
dataDirPath = "DATA_DIR_PATH"

redpandaIDs = "redoanda-ids"
Member:

typo

cm.Data = map[string]string{
redpandaIDs: "",
}
err = k8sClient.Create(ctx, cm)
Member:

Have you considered reversing the roles here, i.e. managing the -ids configmap from the operator and reading it in the configurator?

The IDs would be created uniquely by the operator itself, and the configurator could also use that cm to get info on other replicas. The nodes could still treat it as a "suggestion" and report a different nodeId in the node_config...

Just thinking out loud...

Contributor Author:

The operator would have to track every PV that was bound to each POD and map each Redpanda ID to a PV.

We don't have this information before the statefulset controller schedules a new POD. Only then could we update the configmap, and the configurator would need to wait for this ID.
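A loose sketch of the reversed flow suggested above, on the operator side: the operator owns the -ids ConfigMap and records one ID per pod, and the configurator only reads its own entry. The function name, key layout, and cmName are illustrative assumptions:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// assignID records an ID for podName in the operator-managed ConfigMap,
// leaving any existing assignment untouched so IDs stay stable across
// restarts.
func assignID(ctx context.Context, c client.Client, ns, cmName, podName, id string) error {
	var cm corev1.ConfigMap
	if err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: cmName}, &cm); err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	if _, ok := cm.Data[podName]; ok {
		return nil // ID already assigned; keep it
	}
	cm.Data[podName] = id
	return c.Update(ctx, &cm) // optimistic concurrency via resourceVersion
}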

@RafalKorepta (Contributor Author):

This PR isn't solving the seed server problem, and the Redpanda ID can be reused even if the data folder is lost.
