redpanda: cluster will not form without a node with an empty seed server list #333

Closed
rkruze opened this issue Dec 22, 2020 · 13 comments · Fixed by #6744, #7079 or #7204
Assignees
Labels
area/raft area/rpk good first issue Good for newcomers kind/bug Something isn't working

Comments

@rkruze
Contributor

rkruze commented Dec 22, 2020

When setting up a cluster, you want to make sure all nodes have the same seed servers. This includes the initial node, since if it were to come back with an empty data directory you would want it to be able to rejoin the cluster automatically without user intervention. This does not work today: if you set up a three-node cluster with each node having all three nodes in its seed list, it will never form a cluster. You see the following from the node:

Dec 22 19:26:25 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:25,254 [shard 0] cluster - members_manager.cc:309 - Processing node '1' join request
Dec 22 19:26:25 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:25,503 [shard 0] cluster - members_manager.cc:309 - Processing node '0' join request
Dec 22 19:26:25 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:25,503 [shard 0] cluster - members_manager.cc:274 - Error joining cluster using 0 seed server
Dec 22 19:26:25 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:25,503 [shard 0] cluster - members_manager.cc:211 - Sending join request to 1 @ {host: 172.31.53.238, port: 33145}
Dec 22 19:26:25 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:25,503 [shard 0] cluster - members_manager.cc:198 - Next cluster join attempt in 6614 milliseconds
Dec 22 19:26:29 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:29,782 [shard 0] cluster - members_manager.cc:309 - Processing node '2' join request
Dec 22 19:26:32 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:32,117 [shard 0] cluster - members_manager.cc:309 - Processing node '0' join request
Dec 22 19:26:32 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:32,118 [shard 0] cluster - members_manager.cc:274 - Error joining cluster using 0 seed server
Dec 22 19:26:32 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:32,118 [shard 0] cluster - members_manager.cc:211 - Sending join request to 1 @ {host: 172.31.53.238, port: 33145}
Dec 22 19:26:32 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:32,118 [shard 0] cluster - members_manager.cc:198 - Next cluster join attempt in 6393 milliseconds
Dec 22 19:26:32 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:32,376 [shard 0] cluster - members_manager.cc:309 - Processing node '1' join request
Dec 22 19:26:36 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:36,329 [shard 0] cluster - members_manager.cc:309 - Processing node '2' join request
Dec 22 19:26:38 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:38,449 [shard 0] cluster - members_manager.cc:309 - Processing node '1' join request
Dec 22 19:26:38 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:38,511 [shard 0] cluster - members_manager.cc:309 - Processing node '0' join request
Dec 22 19:26:38 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:38,511 [shard 0] cluster - members_manager.cc:274 - Error joining cluster using 0 seed server
Dec 22 19:26:38 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:38,511 [shard 0] cluster - members_manager.cc:211 - Sending join request to 1 @ {host: 172.31.53.238, port: 33145}
Dec 22 19:26:38 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:38,512 [shard 0] cluster - members_manager.cc:198 - Next cluster join attempt in 6418 milliseconds
Dec 22 19:26:42 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:42,537 [shard 0] cluster - members_manager.cc:309 - Processing node '2' join request
Dec 22 19:26:44 ip-172-31-61-36 rpk[17533]: INFO  2020-12-22 19:26:44,553 [shard 0] cluster - members_manager.cc:309 - Processing node '1' join request

It seems no node knows who should be the bootstrap server, and thus the cluster never forms.
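For reference, a minimal sketch of the symmetric configuration described above (the exact `redpanda.yaml` field layout can vary between versions, and the third address here is purely illustrative):

```yaml
# redpanda.yaml: every node carries the same three-entry seed list
redpanda:
  seed_servers:
    - host:
        address: 172.31.61.36     # the node emitting the log above
        port: 33145
    - host:
        address: 172.31.53.238    # "node 1" in the log above
        port: 33145
    - host:
        address: 172.31.48.10     # third node, illustrative address
        port: 33145
```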

@rkruze rkruze changed the title redpanda: cannot handle when seed server list includes itself redpanda: cluster will not form without a node with an empty seed server list Dec 22, 2020
@BenPope
Member

BenPope commented Dec 22, 2020

Probably related/prerequisite: #245

@emaxerrno
Contributor

I remember @mmaslankaprv mentioning we had a restriction to form the raft groups from one node, i.e. to bootstrap from a single node. We needed to differentiate node joining vs. node bootstrap. I think we have more metadata tracking now, so we can make that distinction, basically based on whether the node is already in the set.

@emaxerrno emaxerrno added area/raft good first issue Good for newcomers kind/bug Something isn't working labels Dec 23, 2020
@mmaslankaprv
Member

Currently we operate with the following assumptions (see the sketch after this list):

  • a node without seed servers is designated as the cluster root; it is the only node that will initiate cluster formation, and when no other nodes are present, the node with the empty seed server list will form a single-node cluster
  • a node with a seed server list will use the seed servers to contact the cluster and join it
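A rough sketch of what those assumptions mean for a three-node cluster today, with illustrative addresses (field layout may differ by version):

```yaml
# redpanda.yaml on node 0 (the cluster root): empty seed list; this is the
# only node that initiates cluster formation, and it forms a single-node
# cluster if no other nodes show up.
redpanda:
  node_id: 0
  seed_servers: []
---
# redpanda.yaml on nodes 1 and 2: a non-empty seed list pointing at the
# root, used only to contact the cluster and join it.
redpanda:
  node_id: 1                     # 2 on the third node
  seed_servers:
    - host:
        address: 172.31.61.36    # illustrative address of node 0
        port: 33145
```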

@BenPope
Member

BenPope commented Jan 4, 2021

How does one restart the cluster root? Should it have seeds? What if it lost its data dir?

Would it make sense to have a two-phase initialisation, where a tool, perhaps RPK, triggers cluster formation?

@mmaslankaprv
Member

Restarting the node is not a problem. The problem is when it loses the data directory; then we have to change the configuration to point it at different nodes so it can join the cluster. I am wondering how this is solved in CockroachDB, since they take a similar approach to seed servers.
We were targeting the easiest way of bootstrapping the cluster. A two-phase approach is an option, but certainly a more complicated one. Maybe we can use an incoming connection as a trigger to change node behavior?

@BenPope
Member

BenPope commented Jan 4, 2021

I think CockroachDB does two-phase. The concern is that during a network partition, bootstrapping must happen within the majority partition. An unusual situation for sure, but it comes up if the cluster root loses its data and is restarted.

The context here is within Kubernetes. It's not so easy with a StatefulSet to have a node behave differently depending on whether it is the cluster root and whether it has lost its data. It could be punted to an operator, but it might make sense to have Redpanda perform the magic.

@rkruze
Contributor Author

rkruze commented Jan 4, 2021

Yes, CockroachDB uses a two-phase init. Each server is brought up with the same "join" list. Once those nodes are up, if a cluster hasn't been formed in the past, they go into standby until a node gets an "init" command. https://www.cockroachlabs.com/docs/v20.2/cockroach-init.html

@mmaslankaprv
Member

I think we can make the operation two-step without implementing centralized configuration. We can introduce centralized configuration as a follow-up.

@dotnwat dotnwat removed their assignment Sep 22, 2021
@jcsp
Contributor

jcsp commented Oct 27, 2021

I think the two-phase init makes sense -- that should probably be hidden behind an rpk setup command that runs it for the user after daemons start.

We should also retain the current behaviour that writing a config with seed_servers=[] causes a node to auto-init, so that a single node cluster init is still a trivial case of just running a binary.

@jcsp
Contributor

jcsp commented Oct 27, 2021

Related to #2793 -- once both are done, a cluster could realistically use the same redpanda.yml on all nodes.

@nicolaferraro
Member

Leaving this here as it may affect the solution to this issue.
I've been working on a two-phase initialization in the operator, and I was expecting that Redpanda node 0 could start alone if the seed server list contained a single entry (i.e. the root node 0 itself).

It turns out this is the case, except when the cluster has TLS and mutual authentication on the Kafka API endpoint.
In that specific case, the cluster is never formed:

cluster-tls-0 redpanda DEBUG 2022-06-13 13:32:50,014 [shard 0] cluster - health_monitor_backend.cc:284 - unable to refresh health metadata, no leader controller
cluster-tls-0 redpanda INFO  2022-06-13 13:32:50,014 [shard 0] cluster - health_monitor_backend.cc:426 - error refreshing cluster health state - Currently there is no leader controller elected in the cluster
cluster-tls-0 redpanda INFO  2022-06-13 13:32:50,014 [shard 0] cluster - metadata_dissemination_service.cc:357 - unable to retrieve cluster health report - Currently there is no leader controller elected in the cluster
cluster-tls-0 redpanda DEBUG 2022-06-13 13:32:52,845 [shard 0] cluster - members_manager.cc:435 - Using current node as a seed server
cluster-tls-0 redpanda INFO  2022-06-13 13:32:52,845 [shard 0] cluster - members_manager.cc:499 - Processing node '0' join request (version 3)
cluster-tls-0 redpanda INFO  2022-06-13 13:32:52,845 [shard 0] cluster - members_manager.cc:370 - Next cluster join attempt in 5996 milliseconds
cluster-tls-0 redpanda DEBUG 2022-06-13 13:32:53,013 [shard 0] compaction_ctrl - backlog_controller.cc:129 - updating shares 10
cluster-tls-0 redpanda INFO  2022-06-13 13:32:53,014 [shard 0] group-metadata-migration - group_metadata_migration.cc:710 - kafka_internal/group topic does not exists, activating consumer_offsets feature
cluster-tls-0 redpanda DEBUG 2022-06-13 13:32:53,014 [shard 0] cluster - health_monitor_backend.cc:400 - requesing cluster state report with filter: {per_node_filter: {include_partitions: true, ntp_filters: {}}, nodes: {}}, force refresh: false

So, the seed server list needs to be completely empty for the initial cluster to be created.
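For clarity, the layout that starts fine without mTLS but never forms a cluster with it is roughly the following (the DNS name is illustrative, following the StatefulSet naming pattern; field layout may differ by version):

```yaml
# redpanda.yaml on node 0: the seed list contains only the node itself
redpanda:
  node_id: 0
  seed_servers:
    - host:
        address: cluster-tls-0.cluster-tls.redpanda.svc.cluster.local  # illustrative
        port: 33145
```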

nicolaferraro added a commit to nicolaferraro/redpanda that referenced this issue Jun 14, 2022
…nitial raft group

This tries to solve the problem with empty seed_servers on node 0. With this change, all fresh clusters will be initially set to 1 replica (via `status.currentReplicas`), until a cluster is created and the operator can verify it via admin API. Then the cluster is scaled to the number of instances desired by the user.

After the cluster is initialized, and for the entire lifetime of the cluster, the `seed_servers` property will be populated with the full list of available servers, in every node of the cluster.

This overcomes redpanda-data#333. Previously, node 0 was always forced to have an empty seed_servers property, but this caused problems when it lost the data dir, as it tried to create a brand-new cluster. With this change, even if node 0 loses the data dir, the seed_servers property will always point to other nodes, so it will try to join the existing cluster.
nicolaferraro added a commit to nicolaferraro/redpanda that referenced this issue Jun 15, 2022
nicolaferraro added a commit to nicolaferraro/redpanda that referenced this issue Jun 16, 2022
@twmb twmb added the area/rpk label Jul 7, 2022
@twmb
Contributor

twmb commented Aug 12, 2022

@jcsp am I reading the thread above correctly: we need the code to exist in redpanda first, and once redpanda handles empty seed servers, we can change rpk to emit no seed servers? I'll move this to our own "awaiting other team" queue. cc @piyushredpanda

@jcsp
Contributor

jcsp commented Aug 12, 2022

There will be a bit more to it than that. We haven't nailed this down yet, but probably:

  • The default will still be to auto-form a cluster if one of the nodes has an empty seed_servers list, as it does today
  • A new configuration property called something like "cluster_await_initialize" (false by default)
  • If cluster_await_initialize is true, then redpanda will not form a cluster (i.e. will not write the controller log or fully come up) until one of the nodes receives an admin API call asking it to initialize (this call can carry a limited set of parameters, like initial superuser account credentials or an initial license file)
  • seed_servers can then be set to the same value on all nodes (e.g. leave it out if you have one node, or set it symmetrically to the full list of nodes on all the nodes); sketched below
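A hedged sketch of what that proposal could look like; `cluster_await_initialize` is only the provisional name mentioned above, the addresses are illustrative, and the shape of the admin API call is not settled here:

```yaml
# Identical redpanda.yaml on every node
redpanda:
  cluster_await_initialize: true   # provisional property name, false by default
  seed_servers:
    - host: {address: 172.31.61.36, port: 33145}
    - host: {address: 172.31.53.238, port: 33145}
    - host: {address: 172.31.48.10, port: 33145}
# After all daemons are running, a single admin API "initialize" call
# (possibly wrapped by a future rpk command) would actually form the cluster
# and could carry initial superuser credentials or a license file.
```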

Auto-selection of node_id is a separate but complementary thing: it enables orchestrators to avoid picking node IDs for redpanda nodes; just leave it out of the config file and redpanda will make one up.

@piyushredpanda piyushredpanda assigned andrwng and dlex and unassigned mmaslankaprv Aug 17, 2022
theRealWardo added a commit to theRealWardo/documentation that referenced this issue Nov 9, 2022
initializing a single node cluster should not set seeds in order to trigger auto-init per redpanda-data/redpanda#333 (comment)

also remove apparently invalid `empty_seed_starts_cluster` flag per:

```
INFO  2022-11-09 02:19:46,745 [shard 0] redpanda::main - application.cc:255 - Failure during startup: std::invalid_argument (Unknown property empty_seed_starts_cluster)
```
joejulian pushed a commit to joejulian/redpanda that referenced this issue Mar 10, 2023
joejulian pushed a commit to joejulian/redpanda that referenced this issue Mar 24, 2023