Etcd corruption detection triggers during in-place cluster recovery #15548

Closed
serathius opened this issue Mar 22, 2023 · 14 comments · Fixed by #15924

Comments

@serathius
Member

serathius commented Mar 22, 2023

What happened?

During in-place cluster recovery etcd started reporting data corruption problems even though none existed. The problem disappears after all members are recovered.

By in-place cluster recovery I mean we restored an etcd snapshot but didn't change the IPs of the members. In a multi-member cluster such recovery is done as a rolling update, which allows old and new members to keep communicating. Recovered members get a new cluster ID, preventing them from forming a quorum with old members. However, the cluster ID is not checked during data corruption detection, and potentially not in other operations done via non-raft communication.

During cluster recovery the list of members doesn't change, so with in-place IPs a member belonging to the new cluster might still be talking to members from the old one, and vice versa. The recovered cluster usually restores from an older snapshot and then branches out its raft log, resulting in different hashes for the same revision. So when a member compares its hash to another member's, it might mistake the difference for corruption.

This affects all types of data corruption detection. With the initial corruption check, the recovered member will crashloop. With the periodic corruption check, the leader might mark the whole cluster as corrupted.
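
For illustration, a minimal sketch of the comparison that misfires (this is not etcd's code; peerHash and checkPeers are made-up names):

```go
// Minimal illustrative sketch, not etcd's implementation: the checker compares
// KV hashes at the same revision but never asks which cluster the peer belongs to.
package main

import "fmt"

type peerHash struct {
    memberID  string
    clusterID string // present on the wire, but ignored by the comparison below
    revision  int64
    hash      uint32
}

func checkPeers(localClusterID string, localRev int64, localHash uint32, peers []peerHash) {
    for _, p := range peers {
        // Missing guard: p.clusterID is never compared to localClusterID, so a
        // member of the old (or freshly restored) cluster on the same IP is
        // treated as one of ours.
        if p.revision == localRev && p.hash != localHash {
            fmt.Printf("alarm: member %s hash mismatch at revision %d\n", p.memberID, p.revision)
        }
    }
}

func main() {
    // Same revision, different raft history after an in-place restore, hence a
    // different hash: the checker reports corruption that does not exist.
    checkPeers("new-cluster", 100, 0xAAAA, []peerHash{
        {memberID: "m2", clusterID: "old-cluster", revision: 100, hash: 0xBBBB},
    })
}
```

Because the restored member answers on the same IP, nothing in this loop tells the checker it is comparing hashes across two different clusters.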

What did you expect to happen?

Non-raft operations should give the same safety as raft ones with regard to cluster recovery, meaning they should always check the cluster ID.
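
As a rough sketch of that expectation, assuming the peer-facing hash endpoint can reuse the X-Etcd-Cluster-ID header that the raft transport already sends; the handler name, response shape, and URL path below are illustrative, not etcd's actual API:

```go
// Hypothetical sketch of a cluster-ID guard on the hash endpoint used by
// corruption detection. Only the X-Etcd-Cluster-ID header name is taken from
// etcd's raft transport; everything else is made up for illustration.
package main

import (
    "encoding/json"
    "net/http"
)

type hashKVResponse struct {
    Hash     uint32 `json:"hash"`
    Revision int64  `json:"revision"`
}

func newHashKVHandler(localClusterID string, localHash func() hashKVResponse) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Same isolation the raft transport enforces: never answer a member of
        // a different cluster, e.g. an old member after an in-place restore.
        if gcid := r.Header.Get("X-Etcd-Cluster-ID"); gcid != localClusterID {
            http.Error(w, "cluster ID mismatch", http.StatusPreconditionFailed)
            return
        }
        _ = json.NewEncoder(w).Encode(localHash())
    }
}

func main() {
    mux := http.NewServeMux()
    // The path is illustrative; register the guarded handler on the peer mux.
    mux.Handle("/members/hashkv", newHashKVHandler("new-cluster", func() hashKVResponse {
        return hashKVResponse{Hash: 0xAAAA, Revision: 100}
    }))
    _ = http.ListenAndServe("127.0.0.1:2380", mux)
}
```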

How can we reproduce it (as minimally and precisely as possible)?

TODO

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.4.21

Etcd configuration (command line flags or environment variables)

N/A

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

No response

@serathius serathius changed the title from "Etcd corruption detection triggers during in-pace in place cluster recovery" to "Etcd corruption detection triggers during in-place cluster recovery" on Mar 22, 2023
@ahrtr
Member

ahrtr commented Mar 23, 2023

By in-place cluster recovery I mean we restored an etcd snapshot but didn't change the IPs of the members. In a multi-member cluster such recovery is done as a rolling update.

This doesn't look correct to me. During a disaster recovery:

  1. users should first restore all members' data directories offline from the same snapshot,
  2. and then start the members. Please read op-guide/recovery.

The in-place cluster recovery makes no sense to me. If the cluster is already running, why do you need to perform a recovery? Usually a slow follower might recover from a snapshot, but there is no cluster-level recovery unless you follow the steps above.

So I don't think this is a bug.

@serathius
Member Author

As mentioned before, etcd already supports in-place restore. The problem occurs only when data corruption detection is enabled. Restoring a member changes its cluster ID specifically to avoid restored members forming a quorum with non-restored ones.

@chaochn47
Member

chaochn47 commented Mar 24, 2023

As mentioned before, etcd already supports in-place restore.

I don't think etcd supports restarting a member from a db snapshot with the configuration --cluster-state=new. A different cluster ID means there are 2 clusters.

Taking a look at the raft http code path, it rejects any raft communication in:

  • streamHandler
  • snapshotHandler
  • ...

```go
if gcid := header.Get("X-Etcd-Cluster-ID"); gcid != cid.String() {
    lg.Warn(
        "request cluster ID mismatch",
        zap.String("local-member-id", localID.String()),
        zap.String("local-member-cluster-id", cid.String()),
        zap.String("local-member-server-version", localVs),
        zap.String("local-member-server-minimum-cluster-version", localMinClusterVs),
        zap.String("remote-peer-server-name", remoteName),
        zap.String("remote-peer-server-version", remoteVs),
        zap.String("remote-peer-server-minimum-cluster-version", remoteMinClusterVs),
        zap.String("remote-peer-cluster-id", gcid),
    )
    return errClusterIDMismatch
}
```

Even if you somehow recover from the db snapshot with the configuration --cluster-state=existing using the same member ID, etcd will panic with a committed index regression #10166 (comment).

@ahrtr
Member

ahrtr commented Mar 24, 2023

Restoring a member changes its cluster ID specifically to avoid restored members forming a quorum with non-restored ones.

Note that restoring a cluster from a snapshot actually means creating/starting a new cluster, rather than "in-place restoring a cluster".

@ahrtr ahrtr removed the type/bug label Mar 24, 2023
@ptabor
Contributor

ptabor commented Mar 24, 2023

If I understand correctly we have 2 orthogonal discussions here:

  1. Is it good practice to restore a new variant of a cluster (with a different cluster ID, but the same peer IPs) side by side with an already existing one?

I agree it carries risks, so from an operational perspective it's safer to turn off one cluster and create a new one afterwards.

But even if we assume it carries risks, etcd should set a high bar against doing anything 'stupid' in such a situation. So we can think about this as a test case for the isolation aspect (2).

  2. The reason for clusters to have a cluster_id is to be able to distinguish and isolate different clusters. And a restored cluster is by design a DIFFERENT cluster from the original one.
    What Marek is reporting is that there are cases where we missed implementing the isolation aspect and don't look at cluster_id when we perform cross-node communication.

@ahrtr @chaochn47 I assume you agree that checking cluster_id for cross-node communication is a good thing to have,
but you challenge the issue's priority: there exists a workaround if the best practices are followed. Is that a fair summary?

@chaochn47
Member

chaochn47 commented Mar 24, 2023

I assume you agree that checking cluster_id for cross-node communication is a good thing to have,
but you challenge the issue's priority: there exists a workaround if the best practices are followed. Is that a fair summary?

Thanks @ptabor for the clarification. It makes more sense to me now. I suggest renaming the issue title to align with the summary.

I skimmed through the code; etcdhttp/peer.go should have such protection, checking that the cluster_id is the same for peer communication. Thanks @serathius for reporting.
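
For illustration, a sketch of that kind of protection applied once for all non-raft peer endpoints (withClusterIDCheck and the registered paths are hypothetical names, not etcd's API):

```go
// Hypothetical middleware sketch: enforce the cluster-ID check once for all
// peer-facing HTTP handlers, mirroring the rafthttp check quoted above.
package main

import (
    "log"
    "net/http"
)

func withClusterIDCheck(localClusterID string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if gcid := r.Header.Get("X-Etcd-Cluster-ID"); gcid != localClusterID {
            // Same situation as the rafthttp warning: a peer from a different
            // (e.g. pre-restore) cluster is knocking on the same IP and port.
            log.Printf("rejecting peer request: cluster ID mismatch (got %q, want %q)", gcid, localClusterID)
            http.Error(w, "cluster ID mismatch", http.StatusPreconditionFailed)
            return
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    localClusterID := "new-cluster"
    ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        _, _ = w.Write([]byte("ok")) // placeholder for the real peer handler
    })
    mux := http.NewServeMux()
    // Illustrative paths only: every non-raft peer endpoint gets the same
    // isolation guarantee the raft transport already provides.
    mux.Handle("/members", withClusterIDCheck(localClusterID, ok))
    mux.Handle("/version", withClusterIDCheck(localClusterID, ok))
    _ = http.ListenAndServe("127.0.0.1:2380", mux)
}
```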

@ahrtr
Member

ahrtr commented Mar 24, 2023

  1. Is it good practice to restore a new variant of a cluster (with a different cluster ID, but the same peer IPs) side by side with an already existing one?

It is not about good/bad practice. It isn't supported by etcd at all. Of course, I agree we should add some protection for such a case.

I assume you agree that checking cluster_id for cross-node communication is a good thing to have,

Agreed. Anyone, please feel free to deliver a PR for this. thx

@lbllol365

I think I can try to do this. @ahrtr

@ahrtr
Member

ahrtr commented Apr 24, 2023

@lbllol365 assigned to you, thx

@vianamjr
Contributor

Is this issue open?

@jmhbnz
Member

jmhbnz commented Jul 18, 2023

Is this issue open?

Hey @vianamjr - This was assigned out in late April and I don't recall seeing any updates since, so I think you would be welcome to start working on it.

Perhaps let's check in with @lbllol365 - did you make any progress on this or is it ok to reassign?

@CaojiamingAlan
Contributor

@jmhbnz @vianamjr This is already solved by #15924. Someone with permission can close this.

@jmhbnz
Member

jmhbnz commented Jul 18, 2023

Thanks for the clarification @CaojiamingAlan - closing.

@serathius
Member Author

Should this be backported to v3.4?

@serathius serathius mentioned this issue Oct 27, 2023