Etcd corruption detection triggers during in-place cluster recovery #15548

Closed
serathius opened this issue Mar 22, 2023 · 14 comments · Fixed by #15924

Comments

@serathius
Member

serathius commented Mar 22, 2023

What happened?

During in-place cluster recovery etcd started reporting data corruption problems even though none existed. The problem disappears after all members are recovered.

By in-place cluster recovery I mean we restored an etcd snapshot but didn't change the IPs of the members. In a multi-member cluster such recovery is done as a rolling update, which allows old and new members to keep communicating. Recovered members get a new cluster ID, preventing them from forming a quorum with old members. However, the cluster ID is not checked during data corruption detection, and potentially not in other operations done via non-raft communication.

During cluster recovery the list of members doesn't change, so with in-place IPs a member belonging to the new cluster might still be talking to members from the old one, and vice versa. The recovered cluster usually restores from an older snapshot and then branches out its raft log, resulting in different hashes for the same revision. So when a member compares its hash to another member's, it might mistake the difference for corruption.

This affects all types of data corruption detection. With the initial corruption check, the recovered member will crashloop. With the periodic corruption check, the leader might mark the whole cluster as corrupted.
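
For illustration, a minimal sketch of the comparison that misfires (this is not etcd's code; peerHash and checkPeers are made-up names):

```go
// Minimal illustrative sketch, not etcd's implementation: the checker compares
// KV hashes at the same revision but never asks which cluster the peer belongs to.
package main

import "fmt"

type peerHash struct {
    memberID  string
    clusterID string // present on the wire, but ignored by the comparison below
    revision  int64
    hash      uint32
}

func checkPeers(localClusterID string, localRev int64, localHash uint32, peers []peerHash) {
    for _, p := range peers {
        // Missing guard: p.clusterID is never compared to localClusterID, so a
        // member of the old (or freshly restored) cluster on the same IP is
        // treated as one of ours.
        if p.revision == localRev && p.hash != localHash {
            fmt.Printf("alarm: member %s hash mismatch at revision %d\n", p.memberID, p.revision)
        }
    }
}

func main() {
    // Same revision, different raft history after an in-place restore, hence a
    // different hash: the checker reports corruption that does not exist.
    checkPeers("new-cluster", 100, 0xAAAA, []peerHash{
        {memberID: "m2", clusterID: "old-cluster", revision: 100, hash: 0xBBBB},
    })
}
```

Because the restored member answers on the same IP, nothing in this loop tells the checker it is comparing hashes across two different clusters.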

What did you expect to happen?

Non-raft operations should give the same safety as raft ones with regard to cluster recovery, meaning they should always check the cluster ID.
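
As a rough sketch of that expectation, assuming the peer-facing hash endpoint can reuse the X-Etcd-Cluster-ID header that the raft transport already sends; the handler name, response shape, and URL path below are illustrative, not etcd's actual API:

```go
// Hypothetical sketch of a cluster-ID guard on the hash endpoint used by
// corruption detection. Only the X-Etcd-Cluster-ID header name is taken from
// etcd's raft transport; everything else is made up for illustration.
package main

import (
    "encoding/json"
    "net/http"
)

type hashKVResponse struct {
    Hash     uint32 `json:"hash"`
    Revision int64  `json:"revision"`
}

func newHashKVHandler(localClusterID string, localHash func() hashKVResponse) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Same isolation the raft transport enforces: never answer a member of
        // a different cluster, e.g. an old member after an in-place restore.
        if gcid := r.Header.Get("X-Etcd-Cluster-ID"); gcid != localClusterID {
            http.Error(w, "cluster ID mismatch", http.StatusPreconditionFailed)
            return
        }
        _ = json.NewEncoder(w).Encode(localHash())
    }
}

func main() {
    mux := http.NewServeMux()
    // The path is illustrative; register the guarded handler on the peer mux.
    mux.Handle("/members/hashkv", newHashKVHandler("new-cluster", func() hashKVResponse {
        return hashKVResponse{Hash: 0xAAAA, Revision: 100}
    }))
    _ = http.ListenAndServe("127.0.0.1:2380", mux)
}
```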

How can we reproduce it (as minimally and precisely as possible)?

TODO

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.4.21

Etcd configuration (command line flags or environment variables)

N/A

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

No response

@serathius serathius changed the title from "Etcd corruption detection triggers during in-pace in place cluster recovery" to "Etcd corruption detection triggers during in-place cluster recovery" on Mar 22, 2023
@ahrtr
Member

ahrtr commented Mar 23, 2023

By in-place cluster recovery I mean we restored an etcd snapshot but didn't change the IPs of the members. In a multi-member cluster such recovery is done as a rolling update.

This doesn't look correct to me. During a disaster recovery:

  1. users should first restore all members' data directories offline from the same snapshot,
  2. and then start the members. Please read op-guide/recovery.

The in-place cluster recovery makes no sense to me. If the cluster is already running, why do you need to perform a recovery? Usually a slow follower might recover from a snapshot, but there is no cluster-level recovery unless you follow the steps above.

So I don't think this is a bug.

@serathius
Member Author

As mentioned before, etcd already supports in-place restore. The problem occurs only when data corruption detection is enabled. Restoring a member changes its cluster ID specifically to avoid restored members forming a quorum with non-restored ones.

@chaochn47
Member

chaochn47 commented Mar 24, 2023

As mentioned before, etcd already supports in-place restore.

I don't think etcd supports restarting a member from a db snapshot with the configuration --cluster-state=new. A different cluster ID means there are 2 clusters.

Taking a look at the raft http code path, it rejects any raft communication in:

  • streamHandler
  • snapshotHandler
  • ...

```go
if gcid := header.Get("X-Etcd-Cluster-ID"); gcid != cid.String() {
    lg.Warn(
        "request cluster ID mismatch",
        zap.String("local-member-id", localID.String()),
        zap.String("local-member-cluster-id", cid.String()),
        zap.String("local-member-server-version", localVs),
        zap.String("local-member-server-minimum-cluster-version", localMinClusterVs),
        zap.String("remote-peer-server-name", remoteName),
        zap.String("remote-peer-server-version", remoteVs),
        zap.String("remote-peer-server-minimum-cluster-version", remoteMinClusterVs),
        zap.String("remote-peer-cluster-id", gcid),
    )
    return errClusterIDMismatch
}
```

Even if you somehow recover from the db snapshot with the configuration --cluster-state=existing using the same member ID, etcd will panic with a committed index regression #10166 (comment).

@ahrtr
Member

ahrtr commented Mar 24, 2023

Restoring a member changes its cluster ID specifically to avoid restored members forming a quorum with non-restored ones.

Note that restoring a cluster from a snapshot actually means creating/starting a new cluster, rather than "in-place restoring a cluster".

@ahrtr ahrtr removed the type/bug label Mar 24, 2023
@ptabor
Contributor

ptabor commented Mar 24, 2023

If I understand correctly we have 2 orthogonal discussions here:

  1. Is it good practice to restore a new variant of a cluster (with a different cluster ID, but the same peer IPs) side by side with an already existing one?

I agree it carries risks, so from an operational perspective it's safer to turn off one cluster and create a new one afterwards.

But even if we assume it carries risks, etcd should set a high bar against doing anything 'stupid' in such a situation. So we can think about this as a test case for the isolation aspect (2).

  2. The reason for clusters to have a cluster_id is to be able to distinguish and isolate different clusters. And a restored cluster is by design a DIFFERENT cluster from the original one.
    What Marek is reporting is that there are cases where we missed implementing the isolation aspect and don't look at cluster_id when we perform cross-node communication.

@ahrtr @chaochn47 I assume you agree that checking cluster_id for cross-node communication is a good thing to have,
but you challenge the issue's priority: there exists a workaround if the best practices are followed. Is that a fair summary?

@chaochn47
Member

chaochn47 commented Mar 24, 2023

I assume you agree that checking cluster_id for cross-node communication is a good thing to have,
but you challenge the issue's priority: there exists a workaround if the best practices are followed. Is that a fair summary?

Thanks @ptabor for the clarification. It makes more sense to me now. I suggest renaming the issue title to align with the summary.

I skimmed through the code; etcdhttp/peer.go should have such protection, checking that the cluster_id is the same for peer communication. Thanks @serathius for reporting.
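
For illustration, a sketch of that kind of protection applied once for all non-raft peer endpoints (withClusterIDCheck and the registered paths are hypothetical names, not etcd's API):

```go
// Hypothetical middleware sketch: enforce the cluster-ID check once for all
// peer-facing HTTP handlers, mirroring the rafthttp check quoted above.
package main

import (
    "log"
    "net/http"
)

func withClusterIDCheck(localClusterID string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if gcid := r.Header.Get("X-Etcd-Cluster-ID"); gcid != localClusterID {
            // Same situation as the rafthttp warning: a peer from a different
            // (e.g. pre-restore) cluster is knocking on the same IP and port.
            log.Printf("rejecting peer request: cluster ID mismatch (got %q, want %q)", gcid, localClusterID)
            http.Error(w, "cluster ID mismatch", http.StatusPreconditionFailed)
            return
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    localClusterID := "new-cluster"
    ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        _, _ = w.Write([]byte("ok")) // placeholder for the real peer handler
    })
    mux := http.NewServeMux()
    // Illustrative paths only: every non-raft peer endpoint gets the same
    // isolation guarantee the raft transport already provides.
    mux.Handle("/members", withClusterIDCheck(localClusterID, ok))
    mux.Handle("/version", withClusterIDCheck(localClusterID, ok))
    _ = http.ListenAndServe("127.0.0.1:2380", mux)
}
```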

@ahrtr
Member

ahrtr commented Mar 24, 2023

  1. Is it good practice to restore a new variant of a cluster (with a different cluster ID, but the same peer IPs) side by side with an already existing one?

It is not about good/bad practice. It isn't supported by etcd at all. Of course, I agree we should add some protection for such a case.

I assume you agree that checking cluster_id for cross-node communication is a good thing to have,

Agreed. Anyone, please feel free to deliver a PR for this. thx

@lbllol365

I think I can try to do this. @ahrtr

@ahrtr
Member

ahrtr commented Apr 24, 2023

@lbllol365 assigned to you, thx

@vianamjr
Contributor

Is this issue open?

@jmhbnz
Member

jmhbnz commented Jul 18, 2023

Is this issue open?

Hey @vianamjr - This was assigned out in late April and I don't recall seeing any updates since, so I think you would be welcome to start working on it.

Perhaps let's check in with @lbllol365 - did you make any progress on this or is it ok to reassign?

@CaojiamingAlan
Contributor

@jmhbnz @vianamjr This is already solved by #15924. Someone with permission can close this.

@jmhbnz
Member

jmhbnz commented Jul 18, 2023

Thanks for the clarification @CaojiamingAlan - closing.

@serathius
Member Author

Should this be backported to v3.4?

@serathius serathius mentioned this issue Oct 27, 2023