Track persistent volume in k8s environment #4965
Conversation
For Redpanda pods that are newly created, the Redpanda ID will be randomly generated. For Redpanda pods that already have data in the persistent volume, the Redpanda ID will be the statefulset ordinal number, so as not to introduce a breaking change. For Redpanda pods that have a .redpanda_id file in the Redpanda data persistent volume, the content of that file will be used for the node.ID configuration.
If the Redpanda data folder doesn't have any content and the pod ordinal number is 0, the Redpanda seed server list is initialized to an empty list. In any other case the seed server list should point to all brokers.
Because the configurator runs in a Kubernetes environment, the host name reflects the pod name. As the Redpanda operator uses a statefulset, all pods have an ordinal number as a suffix.
While the cluster is bootstrapping, the node with pod ordinal 0 might lose its data folder, so the seed server list must not be empty if the cluster was already formed. There could be a race condition where the pod with ordinal 0 forms a one-node cluster and, as it becomes the leader, overwrites the other nodes' ring0 raft group history. To prevent this, the seed server list will be cleared only if no other node in the cluster is working (pod in running phase) and reports as live (v1/brokers list where is_alive is set to true).
In a rare case, the random int64 Redpanda ID might collide with an already existing ID. Currently an ID can not be reused.
ede0168 to f7fc467
just left some initial questions
@@ -42,6 +56,9 @@ const (
	externalConnectivityAddressTypeEnvVar = "EXTERNAL_CONNECTIVITY_ADDRESS_TYPE"
	hostPortEnvVar = "HOST_PORT"
	proxyHostPortEnvVar = "PROXY_HOST_PORT"
	dataDirPath = "DATA_DIR_PATH"

	redpandaIDs = "redoanda-ids"
nit: typo
	}

	err = calculateRedpandaID(func() int {
		return int(rand.NewSource(time.Now().Unix()).Int63())
so it's random and not incrementing, as we discussed somewhere on Slack? I would strongly prefer to do that :)
	return nil
}

empty, err := IsRedpandaDataFolderEmpty(c.dataDirPath)
Could you add a comment explaining this change? So if the data folder is empty, it has an empty seed server list, so that it forms a new cluster, or ... ? Also I think this would not work on 2-node clusters.
Great start! I'm a bit unsure about the last 2 commits that introduce calls to the k8s API and admin API from the configurator. Maybe this works, but there may be dragons (like deadlocks) if we go with this approach...
As far as I understood, those changes are needed to fix the following problems:
- Node uniqueness: for which we could also consider stronger UIDs, like mixing ordinal + timestamp + entropy-based RNG
- Seed server on startup: for which we could try to find a solution with the core team
@@ -336,6 +340,10 @@ func checkEnvVars() (configuratorConfig, error) {
		value: &hostPort,
		name:  hostPortEnvVar,
	},
	{
Nit: indentation seems wrong
@@ -153,6 +159,62 @@ func main() {
	log.Printf("Configuration saved to: %s", c.configDestination)
}

func calculateRedpandaID(
	cfg *config.Config, c configuratorConfig, initialID brokerID,
Nit: I'd call the argument podOrdinal for more clarity
	return nil
}

redpandaID := int(rand.NewSource(time.Now().Unix()).Int63())
I'm not sure about this. .Unix() gives seconds elapsed since 1970, and there's a high chance that two pods will start within the same second when a fresh cluster with 3 or more instances is created. I'd also remove the randomization part if the uniqueness is based on the seed... IIRC the same seed leads to the same result, also on different machines. Maybe including the hostIndex helps with uniqueness, like newID := time.Now().UnixMilli() * 1e4 + podOrdinal
Agree with your proposal.
	return fmt.Errorf("checking Redpanda data folder content (%s): %w", c.dataDirPath, err)
}

if index == 0 && empty {
I'm thinking of this scenario:
- Init container starts, configurator writes the .redpanda_id file
- Main pod starts and crashes
- Configurator is now restarted and finds .redpanda_id, so it uses the full seed_server list
Is the cluster able to form a quorum anyway?
If Redpanda crashes, only the container will be restarted, not the whole pod.
Your concern is still valid that someone could delete the pod before it formed a cluster.
I need to think more about it.
	if pod.Name == c.hostName {
		continue
	}
	if pod.Status.Phase == corev1.PodRunning {
I understand this is a problem we need to avoid, but I wonder if we can find a different way. I've seen e.g. cases where all nodes crash during an upgrade, which can still lead the first one to think it's alone. Isn't Redpanda supposed to handle these situations?
apparently Redpanda might form 2 clusters -> #333
}

for _, br := range brokers {
	if br.NodeId == 0 {
nodeId is no longer the podOrdinal, so this check may never hold
You are right. Fixed in next push.
@@ -56,6 +57,8 @@ const (
	hostPortEnvVar = "HOST_PORT"
	proxyHostPortEnvVar = "PROXY_HOST_PORT"
	dataDirPath = "DATA_DIR_PATH"

	redpandaIDs = "redoanda-ids"
typo
cm.Data = map[string]string{
	redpandaIDs: "",
}
err = k8sClient.Create(ctx, cm)
Have you considered reversing the roles here, i.e. managing the -ids configmap from the operator and reading it in the configurator?
The IDs would be created uniquely by the operator itself, and the configurator could also use that cm to get info on other replicas. The nodes could still treat it as a "suggestion" and still report a different nodeId in the node_config...
Just thinking out loud...
The operator should track every PV that was bound to each pod and map each Redpanda ID to a PV.
We don't have this information before the statefulset controller schedules a new pod. Only then can we update the configmap, and the configurator needs to wait for this ID.
This PR isn't solving the seed server problem, and a Redpanda ID can be reused even if the data folder is lost.
Cover letter
Change the Redpanda ID calculation to be based on the Redpanda data
folder mounted to the pod. If the user chooses faster local storage,
then every time the local disk is wiped due to a node change, the
Redpanda ID will be regenerated.
This PR also changes the seed server list calculation: the seed server
list will be cleared only when there is no data and no other seed
server nodes are running.
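The clearing rule in the cover letter reduces to a single predicate, sketched here with illustrative names; otherNodeRunningAndAlive stands for the combined pod-phase and is_alive check described in the conversation:

```go
package main

// shouldClearSeedServers captures the rule: only the pod with ordinal 0 may
// bootstrap with an empty seed server list, and only when its data folder
// is empty and no other node is both in running phase and reported alive
// by the admin API (v1/brokers, is_alive == true).
func shouldClearSeedServers(podOrdinal int, dataEmpty, otherNodeRunningAndAlive bool) bool {
	return podOrdinal == 0 && dataEmpty && !otherNodeRunningAndAlive
}
```

The last condition is what guards against the race discussed above, where a wiped ordinal-0 pod could otherwise form a one-node cluster next to a still-live one.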
Release notes
persistent volumes
Improvements
has more than 1 replica and no other replica is running.
First FP
Fix linter issues with file permission.
REF: #333