From 07eea2d9abd736b9a2bb3120e787ecc14e77c9f8 Mon Sep 17 00:00:00 2001 From: Langleu Date: Thu, 22 Aug 2024 15:43:22 +0200 Subject: [PATCH 01/15] docs(multi-region): rework simplified 8.6 procedure --- .../multi-region/dual-region-ops.md | 27 ++++++++++++------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 5ad2e1b0b5..01f4613eca 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -141,6 +141,12 @@ desired={} #### Current state +One of the regions is lost, meaning Zeebe: + +- Is unable to process new requests due to losing the quorum +- Stops exporting new data to Elasticsearch in the lost region +- Stops exporting new data to Elasticsearch in the survived region + You have previously ensured that the lost region cannot reconnect during the failover procedure. Due to the Zeebe data replication, no data has been lost. @@ -151,7 +157,7 @@ You have removed the lost brokers from the Zeebe cluster. This will allow us to #### How to get there -You will port-forward the `Zeebe Gateway` in the surviving region to the local host to interact with the Gateway. +You are going to port-forward the `Zeebe Gateway` in the surviving region to the local host to interact with the Gateway. The following alternatives to port-forwarding are possible: @@ -159,7 +165,7 @@ The following alternatives to port-forwarding are possible: - one can [`exec`](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_exec/) into an existing pod (such as Elasticsearch), and `curl` from there - or temporarily [`run`](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_run/) an Ubuntu pod in the cluster to `curl` from there -In our example, we went with port-forwarding to a local host, but other alternatives can also be used. +In our case we went with port-forwarding to a local host, but the alternatives can be used as well. 1. Use the [zbctl client](../../../apis-tools/cli-client/index.md) to retrieve list of remaining brokers @@ -222,7 +228,7 @@ curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?force=true' -H 'Cont #### Verification -Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that the cluster size has decreased to 4, partitions have been redistributed over the remaining brokers, and new leaders have been elected. +Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal the cluster size has decreased to 4, as well as that partitions have redistributed over the remaining brokers and elected new leaders. ```bash kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING @@ -268,10 +274,10 @@ Brokers: -You can also use the Zeebe Gateway's REST API to ensure the scaling progress has been completed. For better readability of the output, it is recommended to use [jq](https://jqlang.github.io/jq/). +You can also use the REST API of the Zeebe Gateway to ensure the scaling progress has completed. It is recommended to use [jq](https://jqlang.github.io/jq/) for better readability of the output. 
```bash -kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange ``` @@ -283,8 +289,8 @@ curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange { "id": 2, "status": "COMPLETED", - "startedAt": "2024-08-23T11:33:08.355681311Z", - "completedAt": "2024-08-23T11:33:09.170531963Z" + "startedAt": "2024-06-19T08:22:50.8585239Z", + "completedAt": "2024-06-19T08:22:51.062397727Z" } ``` @@ -342,8 +348,7 @@ curl -XGET 'http://localhost:9600/actuator/exporters' -2. Based on the Exporter API you will send a request to the Zeebe Gateway to disable the Elasticsearch exporter to the lost region. - +2. Based on the [Exporter APIs](../../zeebe-deployment/operations/cluster-scaling.md) you will send a request to the Zeebe Gateway to disable the Elasticsearch exporter to the lost region. ```bash curl -XPOST 'http://localhost:9600/actuator/exporters/elasticsearchregion1/disable' ``` @@ -414,7 +419,7 @@ You have a standalone region with a working Camunda 8 setup, including Zeebe, Op #### Desired state -You want to restore the dual-region functionality and deploy Camunda 8, consisting of Zeebe and Elasticsearch, to the newly restored region. Operate and Tasklist need to stay disabled to prevent interference with the database backup and restore. +You want to restore the dual-region functionality again and deploy Camunda 8 consisting of Zeebe and Elasticsearch to the newly restored region. Operate and Tasklist need to stay disabled to not interfere with the database backup and restore. #### How to get there @@ -573,6 +578,8 @@ kubectl --context $CLUSTER_SURVIVING get deployments $HELM_RELEASE_NAME-operate For the Zeebe Elasticsearch exporters, there's currently no API available to confirm this. Only the response code of `204` indicates a successful disabling. This is a synchronous operation. +TODO: double check that this is still the case. + From 0ebe33fd88632fdf6e2580518dbc0a44168a3e79 Mon Sep 17 00:00:00 2001 From: Langleu Date: Fri, 23 Aug 2024 16:35:34 +0200 Subject: [PATCH 02/15] docs(multi-region): add new operational procedure output --- .../multi-region/dual-region-ops.md | 26 ++++++++++++------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 01f4613eca..35a54a63e1 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -48,6 +48,14 @@ Running dual-region setups requires the users to be able to detect any regional ::: +:::info + +The operational procedure was reworked in Camunda 8.6 to be more straightforward and reliable. + +The old procedure of 8.5 [as outlined](../../../../versioned_docs/version-8.5/self-managed/operational-guides/multi-region/dual-region-ops.md) is still compatible with Camunda 8.6 but compatibility will be removed in 8.7. + +::: + ## Prerequisites - A dual-region Camunda 8 setup installed in two different regions, preferably derived from our [AWS dual-region guide](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md). @@ -157,7 +165,7 @@ You have removed the lost brokers from the Zeebe cluster. 
This will allow us to #### How to get there -You are going to port-forward the `Zeebe Gateway` in the surviving region to the local host to interact with the Gateway. +You will port-forward the `Zeebe Gateway` in the surviving region to the local host to interact with the Gateway. The following alternatives to port-forwarding are possible: @@ -165,7 +173,7 @@ The following alternatives to port-forwarding are possible: - one can [`exec`](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_exec/) into an existing pod (such as Elasticsearch), and `curl` from there - or temporarily [`run`](https://kubernetes.io/docs/reference/kubectl/generated/kubectl_run/) an Ubuntu pod in the cluster to `curl` from there -In our case we went with port-forwarding to a local host, but the alternatives can be used as well. +In our example, we went with port-forwarding to a local host, but other alternatives can also be used. 1. Use the [zbctl client](../../../apis-tools/cli-client/index.md) to retrieve list of remaining brokers @@ -228,7 +236,7 @@ curl -XPOST 'http://localhost:9600/actuator/cluster/brokers?force=true' -H 'Cont #### Verification -Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal the cluster size has decreased to 4, as well as that partitions have redistributed over the remaining brokers and elected new leaders. +Port-forwarding the Zeebe Gateway via `kubectl` and printing the topology should reveal that the cluster size has decreased to 4, partitions have been redistributed over the remaining brokers, and new leaders have been elected. ```bash kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING @@ -274,10 +282,10 @@ Brokers: -You can also use the REST API of the Zeebe Gateway to ensure the scaling progress has completed. It is recommended to use [jq](https://jqlang.github.io/jq/) for better readability of the output. +You can also use the Zeebe Gateway's REST API to ensure the scaling progress has been completed. For better readability of the output, it is recommended to use [jq](https://jqlang.github.io/jq/). ```bash -kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 26500:26500 -n $CAMUNDA_NAMESPACE_SURVIVING +kubectl --context $CLUSTER_SURVIVING port-forward services/$HELM_RELEASE_NAME-zeebe-gateway 9600:9600 -n $CAMUNDA_NAMESPACE_SURVIVING curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange ``` @@ -289,8 +297,8 @@ curl -XGET 'http://localhost:9600/actuator/cluster' | jq .lastChange { "id": 2, "status": "COMPLETED", - "startedAt": "2024-06-19T08:22:50.8585239Z", - "completedAt": "2024-06-19T08:22:51.062397727Z" + "startedAt": "2024-08-23T11:33:08.355681311Z", + "completedAt": "2024-08-23T11:33:09.170531963Z" } ``` @@ -419,7 +427,7 @@ You have a standalone region with a working Camunda 8 setup, including Zeebe, Op #### Desired state -You want to restore the dual-region functionality again and deploy Camunda 8 consisting of Zeebe and Elasticsearch to the newly restored region. Operate and Tasklist need to stay disabled to not interfere with the database backup and restore. +You want to restore the dual-region functionality and deploy Camunda 8, consisting of Zeebe and Elasticsearch, to the newly restored region. Operate and Tasklist need to stay disabled to prevent interference with the database backup and restore. 
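After deploying Camunda 8 to the restored region as described below, you can quickly confirm that neither Operate nor Tasklist is running there. The following is only a sketch and assumes that, analogous to the surviving region variables, the kubectl context and namespace of the recreated region are exported as `$CLUSTER_RECREATED` and `$CAMUNDA_NAMESPACE_RECREATED`:

```bash
# Assumption: $CLUSTER_RECREATED and $CAMUNDA_NAMESPACE_RECREATED point to the
# newly restored region, mirroring the surviving region variables used in this guide.
# Lists any Operate or Tasklist deployments; none should be found.
kubectl --context $CLUSTER_RECREATED get deployments -n $CAMUNDA_NAMESPACE_RECREATED \
  | grep -E 'operate|tasklist' \
  || echo "Operate and Tasklist are not deployed in the recreated region"
```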
#### How to get there @@ -578,8 +586,6 @@ kubectl --context $CLUSTER_SURVIVING get deployments $HELM_RELEASE_NAME-operate For the Zeebe Elasticsearch exporters, there's currently no API available to confirm this. Only the response code of `204` indicates a successful disabling. This is a synchronous operation. -TODO: double check that this is still the case. - From fa1c8fb1b6e95eb27df345ec28900d3b8b59d51a Mon Sep 17 00:00:00 2001 From: Langleu Date: Thu, 29 Aug 2024 11:02:19 +0200 Subject: [PATCH 03/15] docs(dual-region): address review feedback --- .../operational-guides/multi-region/dual-region-ops.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 35a54a63e1..92485e750e 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -48,14 +48,6 @@ Running dual-region setups requires the users to be able to detect any regional ::: -:::info - -The operational procedure was reworked in Camunda 8.6 to be more straightforward and reliable. - -The old procedure of 8.5 [as outlined](../../../../versioned_docs/version-8.5/self-managed/operational-guides/multi-region/dual-region-ops.md) is still compatible with Camunda 8.6 but compatibility will be removed in 8.7. - -::: - ## Prerequisites - A dual-region Camunda 8 setup installed in two different regions, preferably derived from our [AWS dual-region guide](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md). From 63f1740888eec2537c44bac1452f8753b0bf6482 Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Fri, 30 Aug 2024 15:16:12 +0200 Subject: [PATCH 04/15] doc: multiregion --- .../concepts/multi-region/dual-region.md | 127 ++++++++++------ .../multi-region/dual-region-ops.md | 135 +++++++----------- 2 files changed, 137 insertions(+), 125 deletions(-) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index e2500d0225..efbe6345d6 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -9,11 +9,11 @@ description: "A dual-region setup allows you to run Camunda in two regions synch import DualRegion from "./img/dual-region.svg"; -Camunda 8 is compatible with a dual-region setup under certain [limitations](#limitations). This allows Camunda 8 to run in a mix of active-active and active-passive setups, resulting in an overall **active-passive** setup. The following will explore the concept, limitations, and considerations. +Camunda 8 is compatible with a dual-region setup under certain [limitations](#camunda-8-dual-region-limitations). This allows Camunda 8 to run in a mix of active-active and active-passive setups, resulting in an overall **active-passive** setup. The following will explore the concept, limitations, and considerations. -:::warning +:::caution -You should get familiar with the topic, the [limitations](#limitations) of the dual-region setup, and the general [considerations](#considerations) on operating a dual-region setup. +You should get familiar with the topic, the [limitations](#camunda-8-dual-region-limitations) of the dual-region setup, and the general [considerations](#platform-considerations) on operating a dual-region setup. 
::: @@ -86,7 +86,7 @@ In the event of a total active region loss, the following data will be lost: - Task assignments -## Requirements +## Requirements and Limitations - Camunda 8 - Minimum [Helm chart version](https://github.com/camunda/camunda-platform-helm) **9.3+** @@ -118,28 +118,62 @@ In the event of a total active region loss, the following data will be lost: - For further information and visualization of the partition distribution, consider consulting the documentation on [partitions](../../../components/zeebe/technical-concepts/partitions.md). - The customers operating their Camunda 8 setup are responsible for detecting a regional failure and executing the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md). -## Limitations - -- We recommend using a Kubernetes dual-region setup, with [Camunda Helm chart](/self-managed/setup/install.md) installed in two Kubernetes clusters. - - Using alternative installation methods (for example, with docker-compose) is not covered in our documentation. -- Looking at the whole Camunda platform, it's **active-passive**, while some key components are active-active. - - There's always one active and one passive region for serving active user traffic. - - Serving traffic to both regions will result in a detachment of the components and users potentially observing different data in Operate and Tasklist. -- Identity is not supported. - - Multi-tenancy does not work. - - Role Based Access Control (RBAC) does not work. -- Optimize is not supported. - - This is due to Optimize depending on Identity to work. -- Connectors can be deployed alongside but ensure to understand idempotency based on [the described documentation](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event). - - in a dual-region setup, you'll have two connector deployments and using message idempotency is of importance to not duplicate events. -- Zeebe cluster scaling is not supported. -- Web-Modeler is a standalone component and is not covered in this guide. - - Modeling applications can operate independently outside of the automation clusters. - -## Considerations - -Multi-region setups in itself bring their own complexity. The following items are such complexities and are not considered in our guides. -You should familiarize yourself with those before deciding to go for a dual-region setup. +#### Minimum Versions + +| **Component** | **Elasticsearch** | **Operate** | **Tasklist** | **Zeebe** | **Zeebe Gateway** | **Camunda Helm Chart** | +| ------------------- | ----------------- | ----------- | ------------ | --------- | ----------------- | -------------------------------------------------------- | +| **Minimum Version** | 8.9+\* | 8.5+ | 8.5+ | 8.5+ | 8.5+ | [9.3+](https://github.com/camunda/camunda-platform-helm) | + +**Notes:** \*OpenSearch (both managed and self-managed) is not supported + +#### Installation Environment + +##### Kubernetes Setup + +- Two Kubernetes clusters are required for the Helm chart installation. +- OpenShift is not supported. + +##### Network Requirements + +- The regions (e.g., two Kubernetes clusters) must be able to connect to each other (e.g., via VPC peering). See [example implementation](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md) for AWS EKS. +- Maximum network round trip time (**RTT**) between regions should not exceed **100 ms**. 
+- Required open ports between the two regions: + - **9200** for Elasticsearch (for cross-region data push by Zeebe) + - **26500** for communication to the Zeebe Gateway from clients/workers + - **26501** and **26502** for communication between Zeebe brokers and Zeebe Gateway + +#### Zeebe Cluster Configuration + +Supported combinations for Zeebe broker counts and replication factors: + +- `clusterSize` must be a multiple of **2** and at least **4** to evenly distribute brokers across the two regions. +- `replicationFactor` must be **4** to ensure even partition distribution across regions. +- `partitionCount` is unrestricted but should be chosen based on workload requirements. See [understanding sizing and scalability behavior](../../../components/best-practices/architecture/sizing-your-environment.md#understanding-sizing-and-scalability-behavior). +- For more details on partition distribution, refer to the [documentation on partitions](../../../components/zeebe/technical-concepts/partitions.md). + +#### Regional Failure Management + +- Customers are responsible for detecting regional failures and executing the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md). + +### Camunda 8 dual-region limitations + +| **Aspect** | **Details** | +| ---------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Installation methods** | For **kubernetes** we recommended to use a dual-region Kubernetes setup with the [Camunda Helm chart](/self-managed/setup/install.md) installed in two Kubernetes clusters.
For **other platforms**, using alternative installation methods (for example, with docker-compose) is not supported. | +| **Camunda Platform Configuration** | The overall Camunda platform is **active-passive**, although some key components are active-active.
**Active-Passive Traffic Handling:** There is always one active and one passive region, and active user traffic is served by the active region.<br/>
**Traffic to Both Regions:** Serving traffic to both regions will cause component detachment, potentially resulting in different data visibility in Operate and Tasklist. | +| **Identity Support** | Identity is not supported, multi-Tenancy and Role-Based Access Control (RBAC) does not work. | +| **Optimize Support** | Not supported because it depends on Identity. | +| **Connectors Deployment** | Connectors can be deployed in a dual-region setup, but attention to [idempotency](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event) is required to avoid event duplication. In a dual-region setup, you'll have two connector deployments and using message idempotency is of importance to not duplicate events. | +| **Zeebe Cluster Scaling** | Not supported. | +| **Web-Modeler** | Is a standalone component not covered in this guide. Modeling applications can operate independently outside of the automation clusters. | + +### Platform considerations + +:::caution +Multi-region setups in itself bring their own complexity. You should familiarize yourself with those before deciding to go for a dual-region setup. +::: + +The following items are such complexities and are not considered in our guides: - Managing multiple Kubernetes clusters and their deployments across regions - Monitoring and alerting @@ -156,33 +190,40 @@ This means the Zeebe stretch cluster will not have a quorum when half of its bro The [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md) looks in detail at short-term recovery from a region loss and how to long-term fully re-establish the lost region. The procedure works the same way for active or passive region loss since we don't consider traffic routing (DNS) in the scenario. -### Active region loss +### Active Region Loss -The loss of the active region means: +The loss of the active region results in: -- The loss of previously mentioned data in Operate and Tasklist. -- Traffic is routed to the active region, which now can't be served anymore. -- The workflow engine will stop processing due to the loss of the quorum. +- **Loss of Data**: Data previously available in Operate and Tasklist is no longer accessible. +- **Service Disruption**: Traffic routed to the active region can no longer be served. +- **Workflow Engine Failure**: The workflow engine stops processing due to quorum loss. -The following high-level steps need to be taken in case of the active region loss: +#### Steps to Take in Case of Active Region Loss -1. Follow the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md#failover) to temporarily recover from the region loss and unblock the workflow engine. -2. Reroute traffic to the passive region that will now become the new active region. -3. Due to the loss of data in Operate and Tasklist, you'll have to: - 1. Reassign uncompleted tasks in Tasklist. +1. **Temporary Recovery:** follow the [operational procedure for temporary recovery](./../../operational-guides/multi-region/dual-region-ops.md#failover) to restore functionality and unblock the workflow engine. + +2. **Traffic Rerouting:** reroute traffic to the passive region, which will now become the new active region. + +3. **Data and Task Management:** due to the loss of data in Operate and Tasklist: + + 1. Reassign any uncompleted tasks in Tasklist. 2. Recreate batch operations in Operate. -4. 
Follow the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md#failback) to recreate a new permanent region that will become your new passive region. -### Passive region loss +4. **Permanent Region Setup:** follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. + +### Passive Region Loss + +The loss of the passive region means: + +- **Workflow Engine Impact**: The workflow engine will stop processing due to the loss of quorum. -The loss of the passive region means the workflow engine will stop processing due to the loss of the quorum. +#### Steps to Take in Case of Passive Region Loss -The following high-level steps need to be taken in case of passive region loss: +1. **Temporary Recovery:** follow the [operational procedure to temporarily recover](./../../operational-guides/multi-region/dual-region-ops.md#failover) from the loss and unblock the workflow engine. -- Follow the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md#failover) to temporarily recover from the region loss and unblock the workflow engine. -- Follow the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md#failback) to recreate a new permanent region that will become your new passive region. +2. **Permanent Region Setup:** follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. -Unlike the active region loss, no data will be lost, nor will any traffic require rerouting. +**Note:** Unlike an active region loss, no data will be lost and no traffic rerouting is necessary. ### Disaster Recovery diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 92485e750e..5055adf787 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -68,17 +68,25 @@ Running dual-region setups requires the users to be able to detect any regional ## Procedure -We don't differ between active and passive regions as the procedure is the same for either loss. We will focus on losing the passive region while still having the active region. +We handle the loss of both active and passive regions using the same procedure. For clarity, this section focuses on the scenario where the passive region is lost while the active region remains operational. -You'll need to reroute the traffic to the surviving region with the help of DNS (details on how to do that depend on your DNS setup and are not covered in this guide.) +#### Key Steps to Handle Passive Region Loss -After you've identified a region loss and before beginning the region restoration procedure, ensure the lost region cannot reconnect as this will hinder a successful recovery during failover and failback execution. +1. **Traffic Rerouting** -In case the region is only lost temporarily (for example, due to network hiccups), Zeebe can survive a region loss but will stop processing due to the loss in quorum and ultimately fill up the persistent disk before running out of volume, resulting in the loss of data. + - Reroute traffic to the surviving active region using DNS. 
(Details on how to manage DNS rerouting depend on your specific DNS setup and are not covered in this guide.) -The **failover** phase of the procedure temporarily restores Camunda 8 functionality by removing the lost brokers and the export to the unreachable Elasticsearch instance. +2. **Prevent Reconnection** -The **failback** phase of the procedure results in completely restoring the failed region to its full functionality. It requires you to have the lost region ready again for the redeployment of Camunda 8. + - Ensure that the lost region cannot reconnect before starting the restoration procedure. Reconnection could interfere with a successful recovery during failover and failback. + +3. **Temporary Loss Scenario** + + - If the region loss is temporary (e.g., due to network issues), Zeebe can survive this loss but may stop processing due to quorum loss. This could lead to persistent disk filling up before data is lost. + +4. **Procedure Phases** + - **Failover Phase:** Temporarily restores Camunda 8 functionality by removing the lost brokers and handling the export to the unreachable Elasticsearch instance. + - **Failback Phase:** Fully restores the failed region to its original functionality. This phase requires the region to be ready for the redeployment of Camunda 8. :::warning @@ -139,7 +147,9 @@ desired={}
-#### Current state +| **Current State** | **Desired State** | +| --------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| The lost region has been ensured not to reconnect during the failover procedure.

No data has been lost due to Zeebe data replication. | The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention. | One of the regions is lost, meaning Zeebe: @@ -311,17 +321,11 @@ desired={}
-#### Current state - -Zeebe is not yet be able to continue exporting data since the Zeebe brokers in the surviving region are configured to point to the Elasticsearch instance of the lost region. - -#### Desired state - -You have disabled the Elasticsearch exporter to the failed region in the Zeebe cluster. - -The Zeebe cluster is then unblocked and can export data to Elasticsearch again. - -Completing this step will restore regular interaction with Camunda 8 for your users, marking the conclusion of the temporary recovery. +| **Details** | **Current State** | **Desired State** | +| ----------------------- | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | +| **Zeebe Configuration** | Zeebe brokers in the surviving region are still configured to point to the Elasticsearch instance of the lost region. | Elasticsearch exporter to the failed region has been disabled in the Zeebe cluster. | +| **Export Capability** | Zeebe cannot continue exporting data. | Zeebe can export data to Elasticsearch again. | +| **User Interaction** | Regular interaction with Camunda 8 is not restored. | Regular interaction with Camunda 8 is restored, marking the conclusion of the temporary recovery. | #### How to get there @@ -413,13 +417,10 @@ desired={}
-#### Current state - -You have a standalone region with a working Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch. - -#### Desired state - -You want to restore the dual-region functionality and deploy Camunda 8, consisting of Zeebe and Elasticsearch, to the newly restored region. Operate and Tasklist need to stay disabled to prevent interference with the database backup and restore. +| **Details** | **Current State** | **Desired State** | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | +| **Camunda 8 Setup** | A standalone region with a fully functional Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch. | Restore dual-region functionality by deploying Camunda 8 (Zeebe and Elasticsearch) to the newly restored region. | +| **Operate and Tasklist** | Operate and Tasklist are operational in the standalone region. | Operate and Tasklist should remain disabled to avoid interference during the database backup and restore process. | #### How to get there @@ -429,7 +430,7 @@ In particular, the values `ZEEBE_BROKER_EXPORTERS_ELASTICSEARCHREGION0_ARGS_URL` In addition, the following Helm command will disable Operate and Tasklist since those will only be enabled at the end of the full region restore. It's required to keep them disabled in the newly created region due to their Elasticsearch importers. -1. From the terminal context of `aws/dual-region/kubernetes` execute: +From the terminal context of `aws/dual-region/kubernetes` execute: ```bash helm install $HELM_RELEASE_NAME camunda/camunda-platform \ @@ -525,21 +526,11 @@ desired={}
-#### Current state - -You currently have the following setup: - -- Functioning Zeebe cluster (within a single region): - - working Camunda 8 installation in the surviving region - - non-participating Camunda 8 installation in the recreated region - -#### Desired state - -You are preparing everything for the newly created region to take over again to restore the functioning dual-region setup. - -For this, stop the Zeebe exporters from exporting any new data to Elasticsearch so you can create an Elasticsearch backup. - -Additionally, temporarily scale down Operate and Tasklist to zero replicas. This will result in users not being able to interact with Camunda 8 anymore and is required to guarantee no new data is imported to Elasticsearch. +| **Details** | **Current State** | **Desired State** | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Zeebe Cluster Setup** | Functioning Zeebe cluster within a single region:
Working Camunda 8 installation in the surviving region
Non-participating Camunda 8 installation in the recreated region | Preparing the newly created region to take over and restore the dual-region setup. | +| **Elasticsearch Export** | Currently exporting data to Elasticsearch from the surviving region. | Stop Zeebe exporters to prevent new data from being exported to Elasticsearch, allowing for the creation of an Elasticsearch backup. | +| **Operate and Tasklist** | Operate and Tasklist are operational in the surviving region. | Temporarily scale down Operate and Tasklist to zero replicas, preventing user interaction with Camunda 8 and ensuring no new data is imported to Elasticsearch. | :::note @@ -591,13 +582,10 @@ desired={}
-#### Current state - -The Camunda components are currently not reachable by end-users and will not process any new process instances. This allows creating a backup of Elasticsearch without losing any data. - -#### Desired state - -You are creating a backup of the main Elasticsearch instance in the surviving region and restore it in the recreated region. This Elasticsearch backup contains all the data and may take some time to be finished. +| **Details** | **Current State** | **Desired State** | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Camunda Components** | Not reachable by end-users and not processing any new process instances. This state allows for data backup without loss. | Creating a backup of the main Elasticsearch instance in the surviving region and restoring it in the recreated region. Backup process may take time to complete. | +| **Elasticsearch Backup** | No backup is in progress. | Backup of Elasticsearch in the surviving region is initiated and being restored in the recreated region, containing all necessary data. | #### How to get there @@ -822,15 +810,10 @@ desired={}
-#### Current state - -The backup of Elasticsearch has been created and restored to the recreated region. - -The Camunda components remain unreachable by end-users as you proceed to restore functionality. - -#### Desired state - -You can enable Operate and Tasklist again both in the surviving and recreated region. This will allow users to interact with Camunda 8 again. +| **Details** | **Current State** | **Desired State** | +| ------------------------ | -------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | +| **Elasticsearch Backup** | Backup has been created and restored to the recreated region. | N/A | +| **Camunda Components** | Remain unreachable by end-users while restoring functionality. | Enable Operate and Tasklist in both the surviving and recreated regions to allow user interaction with Camunda 8 again. | #### How to get there @@ -883,15 +866,11 @@ desired={}
-#### Current state - -Camunda 8 is reachable to the end-user but not yet exporting any data. - -#### Desired state - -You are initializing a new exporter to the recreated region. This will ensure that both Elasticsearch instances are populated, resulting in data redundancy. - -Separating this step from resuming the exporters is essential as the initialization is an asynchronous procedure, and you must ensure it's finished before resuming the exporters. +| **Details** | **Current State** | **Desired State** | +| --------------------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | +| **Camunda 8 Accessibility** | Reachable to end-users, but not exporting any data. | Start a new exporter to the recreated region. | +| **Data Exporting** | No data export currently in progress. | Ensure that both Elasticsearch instances are populated for data redundancy. | +| **Procedure Step** | Data export is paused. | Separate the initialization step (asynchronous) and confirm completion before resuming the exporters. | #### How to get there @@ -959,15 +938,10 @@ desired={}
-#### Current state - -Camunda 8 is reachable to the end-user but not yet exporting any data. - -Elasticsearch exporters are enabled for both regions, and it's ensured that the operation has finished. - -#### Desired state - -You are reactivating the existing exporters. This will allow Zeebe to export data to Elasticsearch again. +| **Details** | **Current State** | **Desired State** | +| --------------------------- | ----------------------------------------------------------------------- | -------------------------------------------------- | +| **Camunda 8 Accessibility** | Reachable to end-users, but currently not exporting any data. | Reactivate existing exporters. | +| **Elasticsearch Exporters** | Enabled for both regions, with the operation confirmed to be completed. | Allow Zeebe to export data to Elasticsearch again. | #### How to get there @@ -997,13 +971,10 @@ desired={}
-#### Current state - -Camunda 8 is running in two regions but not yet utilizing all Zeebe brokers. You have redeployed Operate and Tasklist and enabled the Elasticsearch exporters again. This will allow users to interact with Camunda 8 again. - -#### Desired state - -You have a functioning Camunda 8 setup in two regions and utilizing both regions. This will fully recover the dual-region benefits. +| **Details** | **Current State** | **Desired State** | +| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | +| **Camunda 8 Deployment** | Running in two regions, but not yet utilizing all Zeebe brokers. Operate and Tasklist redeployed, Elasticsearch exporters enabled. | Fully functional Camunda 8 setup utilizing both regions, recovering all dual-region benefits. | +| **User Interaction** | Users can interact with Camunda 8 again. | Dual-region functionality is restored, maximizing reliability and performance benefits. | #### How to get there From ea5c0f397e8aa58c6e82618ba71bf18419333562 Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Mon, 9 Sep 2024 11:52:10 +0200 Subject: [PATCH 05/15] re-integrate multiregion --- .../concepts/multi-region/dual-region.md | 56 +++++-------------- 1 file changed, 14 insertions(+), 42 deletions(-) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index efbe6345d6..f3a1381808 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -88,37 +88,7 @@ In the event of a total active region loss, the following data will be lost: ## Requirements and Limitations -- Camunda 8 - - Minimum [Helm chart version](https://github.com/camunda/camunda-platform-helm) **9.3+** - - Minimum component images - - Elasticsearch **8.9+** - - OpenSearch (both managed and self-managed) is not supported - - Operate **8.5+** - - Tasklist **8.5+** - - Zeebe **8.5+** - - Zeebe Gateway **8.5+** -- For the Helm chart installation method, two Kubernetes clusters are required -- Network - - The regions (for example, two Kubernetes clusters) need to be able to connect to each other (for example, via VPC peering) - - See an [example implementation](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md) of two VPC peered Kubernetes clusters based on AWS EKS. - - Maximum network round trip time (**RTT**) between the regions should not exceed **100 ms** - - Open ports between the two regions: - - **9200** for Elasticsearch for Zeebe to push data cross-region - - **26500** for communication to the Zeebe Gateway from client/workers - - **26501** for the Zeebe brokers and Zeebe Gateway communication - - **26502** for the Zeebe brokers and Zeebe Gateway communication - - Cluster communication - - Kubernetes services in one cluster must be resolvable and reachable from the other cluster and vice-versa. This is essential for proper communication and functionality across regions: - - For AWS EKS setups, ensure DNS chaining is configured. Refer to the [Amazon Elastic Kubernetes Service (EKS) setup guide](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md). 
- - For OpenShift, [Submariner](https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/networking/networking#submariner) is recommended for handling multi-cluster networking. Specific implementation guides are not yet available. -- Only specific combinations of Zeebe broker counts and replication factors are supported - - `clusterSize` must be a multiple of **2** and a minimum of **4** to evenly distribute the brokers across the two regions. - - `replicationFactor` must be **4** to ensure that the partitions are evenly distributed across the two regions. - - `partitionCount` is not restricted and depends on your workload requirements, consider having a look at [understanding sizing and scalability behavior](../../../components/best-practices/architecture/sizing-your-environment.md#understanding-sizing-and-scalability-behavior). - - For further information and visualization of the partition distribution, consider consulting the documentation on [partitions](../../../components/zeebe/technical-concepts/partitions.md). -- The customers operating their Camunda 8 setup are responsible for detecting a regional failure and executing the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md). - -#### Minimum Versions +### Minimum Versions | **Component** | **Elasticsearch** | **Operate** | **Tasklist** | **Zeebe** | **Zeebe Gateway** | **Camunda Helm Chart** | | ------------------- | ----------------- | ----------- | ------------ | --------- | ----------------- | -------------------------------------------------------- | @@ -126,32 +96,34 @@ In the event of a total active region loss, the following data will be lost: **Notes:** \*OpenSearch (both managed and self-managed) is not supported -#### Installation Environment +### Installation Environment -##### Kubernetes Setup +#### Kubernetes Setup - Two Kubernetes clusters are required for the Helm chart installation. -- OpenShift is not supported. -##### Network Requirements +#### Network Requirements - The regions (e.g., two Kubernetes clusters) must be able to connect to each other (e.g., via VPC peering). See [example implementation](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md) for AWS EKS. + - Kubernetes services in one cluster must be resolvable and reachable from the other cluster and vice-versa. This is essential for proper communication and functionality across regions: + - For AWS EKS setups, ensure DNS chaining is configured. Refer to the [Amazon Elastic Kubernetes Service (EKS) setup guide](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md). + - For OpenShift, [Submariner](https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/networking/networking#submariner) is recommended for handling multi-cluster networking. Specific implementation guides are not yet available. - Maximum network round trip time (**RTT**) between regions should not exceed **100 ms**. 
- Required open ports between the two regions: - **9200** for Elasticsearch (for cross-region data push by Zeebe) - **26500** for communication to the Zeebe Gateway from clients/workers - **26501** and **26502** for communication between Zeebe brokers and Zeebe Gateway -#### Zeebe Cluster Configuration +### Zeebe Cluster Configuration -Supported combinations for Zeebe broker counts and replication factors: +Only a combinations for Zeebe broker counts and replication factors are supported: - `clusterSize` must be a multiple of **2** and at least **4** to evenly distribute brokers across the two regions. - `replicationFactor` must be **4** to ensure even partition distribution across regions. - `partitionCount` is unrestricted but should be chosen based on workload requirements. See [understanding sizing and scalability behavior](../../../components/best-practices/architecture/sizing-your-environment.md#understanding-sizing-and-scalability-behavior). - For more details on partition distribution, refer to the [documentation on partitions](../../../components/zeebe/technical-concepts/partitions.md). -#### Regional Failure Management +### Regional Failure Management - Customers are responsible for detecting regional failures and executing the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md). @@ -190,7 +162,7 @@ This means the Zeebe stretch cluster will not have a quorum when half of its bro The [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md) looks in detail at short-term recovery from a region loss and how to long-term fully re-establish the lost region. The procedure works the same way for active or passive region loss since we don't consider traffic routing (DNS) in the scenario. -### Active Region Loss +### Active region loss The loss of the active region results in: @@ -198,7 +170,7 @@ The loss of the active region results in: - **Service Disruption**: Traffic routed to the active region can no longer be served. - **Workflow Engine Failure**: The workflow engine stops processing due to quorum loss. -#### Steps to Take in Case of Active Region Loss +#### Steps to take in case of active region loss 1. **Temporary Recovery:** follow the [operational procedure for temporary recovery](./../../operational-guides/multi-region/dual-region-ops.md#failover) to restore functionality and unblock the workflow engine. @@ -211,13 +183,13 @@ The loss of the active region results in: 4. **Permanent Region Setup:** follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. -### Passive Region Loss +### Passive region loss The loss of the passive region means: - **Workflow Engine Impact**: The workflow engine will stop processing due to the loss of quorum. -#### Steps to Take in Case of Passive Region Loss +#### Steps to take in case of passive region loss 1. **Temporary Recovery:** follow the [operational procedure to temporarily recover](./../../operational-guides/multi-region/dual-region-ops.md#failover) from the loss and unblock the workflow engine. From a595ddd43047f887b65c080ac24a712dae187aad Mon Sep 17 00:00:00 2001 From: "Leo J." 
<153937047+leiicamundi@users.noreply.github.com> Date: Mon, 9 Sep 2024 11:59:13 +0200 Subject: [PATCH 06/15] re-integrate multiregion ops --- .../concepts/multi-region/dual-region.md | 9 ++------- .../multi-region/dual-region-ops.md | 20 ++++--------------- 2 files changed, 6 insertions(+), 23 deletions(-) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index f3a1381808..d6a13ca751 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -100,7 +100,7 @@ In the event of a total active region loss, the following data will be lost: #### Kubernetes Setup -- Two Kubernetes clusters are required for the Helm chart installation. +Two Kubernetes clusters are required for the Helm chart installation. #### Network Requirements @@ -125,7 +125,7 @@ Only a combinations for Zeebe broker counts and replication factors are supporte ### Regional Failure Management -- Customers are responsible for detecting regional failures and executing the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md). +Customers are responsible for detecting regional failures and executing the [operational procedure](./../../operational-guides/multi-region/dual-region-ops.md). ### Camunda 8 dual-region limitations @@ -173,14 +173,10 @@ The loss of the active region results in: #### Steps to take in case of active region loss 1. **Temporary Recovery:** follow the [operational procedure for temporary recovery](./../../operational-guides/multi-region/dual-region-ops.md#failover) to restore functionality and unblock the workflow engine. - 2. **Traffic Rerouting:** reroute traffic to the passive region, which will now become the new active region. - 3. **Data and Task Management:** due to the loss of data in Operate and Tasklist: - 1. Reassign any uncompleted tasks in Tasklist. 2. Recreate batch operations in Operate. - 4. **Permanent Region Setup:** follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. ### Passive region loss @@ -192,7 +188,6 @@ The loss of the passive region means: #### Steps to take in case of passive region loss 1. **Temporary Recovery:** follow the [operational procedure to temporarily recover](./../../operational-guides/multi-region/dual-region-ops.md#failover) from the loss and unblock the workflow engine. - 2. **Permanent Region Setup:** follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. **Note:** Unlike an active region loss, no data will be lost and no traffic rerouting is necessary. diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 5055adf787..c0a23c64cf 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -72,18 +72,9 @@ We handle the loss of both active and passive regions using the same procedure. #### Key Steps to Handle Passive Region Loss -1. **Traffic Rerouting** - - - Reroute traffic to the surviving active region using DNS. (Details on how to manage DNS rerouting depend on your specific DNS setup and are not covered in this guide.) - -2. 
**Prevent Reconnection** - - - Ensure that the lost region cannot reconnect before starting the restoration procedure. Reconnection could interfere with a successful recovery during failover and failback. - -3. **Temporary Loss Scenario** - - - If the region loss is temporary (e.g., due to network issues), Zeebe can survive this loss but may stop processing due to quorum loss. This could lead to persistent disk filling up before data is lost. - +1. **Traffic Rerouting:** reroute traffic to the surviving active region using DNS. (Details on how to manage DNS rerouting depend on your specific DNS setup and are not covered in this guide.) +2. **Prevent Reconnection:** ensure that the lost region cannot reconnect before starting the restoration procedure. Reconnection could interfere with a successful recovery during failover and failback. +3. **Temporary Loss Scenario:** if the region loss is temporary (e.g., due to network issues), Zeebe can survive this loss but may stop processing due to quorum loss. This could lead to persistent disk filling up before data is lost. 4. **Procedure Phases** - **Failover Phase:** Temporarily restores Camunda 8 functionality by removing the lost brokers and handling the export to the unreachable Elasticsearch instance. - **Failback Phase:** Fully restores the failed region to its original functionality. This phase requires the region to be ready for the redeployment of Camunda 8. @@ -161,10 +152,6 @@ You have previously ensured that the lost region cannot reconnect during the fai Due to the Zeebe data replication, no data has been lost. -#### Desired state - -You have removed the lost brokers from the Zeebe cluster. This will allow us to continue processing after the next step and ensure that the new brokers in the failback procedure will only join the cluster with our intervention. - #### How to get there You will port-forward the `Zeebe Gateway` in the surviving region to the local host to interact with the Gateway. @@ -353,6 +340,7 @@ curl -XGET 'http://localhost:9600/actuator/exporters' 2. Based on the [Exporter APIs](../../zeebe-deployment/operations/cluster-scaling.md) you will send a request to the Zeebe Gateway to disable the Elasticsearch exporter to the lost region. + ```bash curl -XPOST 'http://localhost:9600/actuator/exporters/elasticsearchregion1/disable' ``` From a0610165d40ba1f23f96293d5630890b4b00feca Mon Sep 17 00:00:00 2001 From: mesellings Date: Mon, 9 Sep 2024 12:00:41 +0100 Subject: [PATCH 07/15] TW edits: dual-region setup --- .../concepts/multi-region/dual-region.md | 72 +++++++++---------- 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index d6a13ca751..53e3d3e61f 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -9,11 +9,11 @@ description: "A dual-region setup allows you to run Camunda in two regions synch import DualRegion from "./img/dual-region.svg"; -Camunda 8 is compatible with a dual-region setup under certain [limitations](#camunda-8-dual-region-limitations). This allows Camunda 8 to run in a mix of active-active and active-passive setups, resulting in an overall **active-passive** setup. The following will explore the concept, limitations, and considerations. +Camunda 8 is compatible with a dual-region setup under certain [limitations](#camunda-8-dual-region-limitations). 
This allows Camunda 8 to run in a mix of active-active and active-passive setups, resulting in an overall **active-passive** setup. :::caution -You should get familiar with the topic, the [limitations](#camunda-8-dual-region-limitations) of the dual-region setup, and the general [considerations](#platform-considerations) on operating a dual-region setup. +Before implementing a dual-region setup, ensure you understand the topic, the [limitations](#camunda-8-dual-region-limitations) of dual-region setup, and the general [considerations](#platform-considerations) of operating a dual-region setup. ::: @@ -21,9 +21,9 @@ You should get familiar with the topic, the [limitations](#camunda-8-dual-region **Active-active** and **active-passive** are standard setups used in dual-region configurations to ensure that applications remain available and operational in case of failures. -In an **active-active** setup, multiple application instances run simultaneously in different regions, actively handling user requests. This allows for better load balancing and fault tolerance, as traffic can spread across regions. If one region fails, the workload can shift to another without causing disruptions. +- In an **active-active** setup, multiple application instances run simultaneously in different regions, actively handling user requests. This allows for better load balancing and fault tolerance, as traffic can spread across regions. If one region fails, the workload can shift to another without causing disruptions. -By contrast, an **active-passive** setup designates one region as the main or active region where all user requests are processed. The other region remains on standby until needed, only becoming active if the previously active region fails. This setup is easier to manage but may result in higher delays during failover events. +- By contrast, an **active-passive** setup designates one region as the main or active region where all user requests are processed. The other region remains on standby until needed, only becoming active if the previously active region fails. This setup is easier to manage but may result in higher delays during failover events. ## Disclaimer @@ -94,7 +94,7 @@ In the event of a total active region loss, the following data will be lost: | ------------------- | ----------------- | ----------- | ------------ | --------- | ----------------- | -------------------------------------------------------- | | **Minimum Version** | 8.9+\* | 8.5+ | 8.5+ | 8.5+ | 8.5+ | [9.3+](https://github.com/camunda/camunda-platform-helm) | -**Notes:** \*OpenSearch (both managed and self-managed) is not supported +\* OpenSearch (both managed and self-managed) is not supported for a dual-region setup. ### Installation Environment @@ -104,24 +104,24 @@ Two Kubernetes clusters are required for the Helm chart installation. #### Network Requirements -- The regions (e.g., two Kubernetes clusters) must be able to connect to each other (e.g., via VPC peering). See [example implementation](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md) for AWS EKS. +- The regions (for example, two Kubernetes clusters) must be able to connect to each other (for example, via VPC peering). See [example implementation](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md) for AWS EKS. - Kubernetes services in one cluster must be resolvable and reachable from the other cluster and vice-versa. 
This is essential for proper communication and functionality across regions: - For AWS EKS setups, ensure DNS chaining is configured. Refer to the [Amazon Elastic Kubernetes Service (EKS) setup guide](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md). - For OpenShift, [Submariner](https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/networking/networking#submariner) is recommended for handling multi-cluster networking. Specific implementation guides are not yet available. - Maximum network round trip time (**RTT**) between regions should not exceed **100 ms**. - Required open ports between the two regions: - - **9200** for Elasticsearch (for cross-region data push by Zeebe) - - **26500** for communication to the Zeebe Gateway from clients/workers - - **26501** and **26502** for communication between Zeebe brokers and Zeebe Gateway + - **9200** for Elasticsearch (for cross-region data push by Zeebe). + - **26500** for communication to the Zeebe Gateway from clients/workers. + - **26501** and **26502** for communication between Zeebe brokers and Zeebe Gateway. ### Zeebe Cluster Configuration -Only a combinations for Zeebe broker counts and replication factors are supported: +Only a combination for Zeebe broker counts and replication factors is supported: - `clusterSize` must be a multiple of **2** and at least **4** to evenly distribute brokers across the two regions. - `replicationFactor` must be **4** to ensure even partition distribution across regions. - `partitionCount` is unrestricted but should be chosen based on workload requirements. See [understanding sizing and scalability behavior](../../../components/best-practices/architecture/sizing-your-environment.md#understanding-sizing-and-scalability-behavior). -- For more details on partition distribution, refer to the [documentation on partitions](../../../components/zeebe/technical-concepts/partitions.md). +- For more details on partition distribution, see [documentation on partitions](../../../components/zeebe/technical-concepts/partitions.md). ### Regional Failure Management @@ -129,30 +129,30 @@ Customers are responsible for detecting regional failures and executing the [ope ### Camunda 8 dual-region limitations -| **Aspect** | **Details** | -| ---------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Installation methods** | For **kubernetes** we recommended to use a dual-region Kubernetes setup with the [Camunda Helm chart](/self-managed/setup/install.md) installed in two Kubernetes clusters.
For **other platforms**, using alternative installation methods (for example, with docker-compose) is not supported. |
-| **Camunda Platform Configuration** | The overall Camunda platform is **active-passive**, although some key components are active-active.
**Active-Passive Traffic Handling:** One active and one passive region serve active user traffic.
**Traffic to Both Regions:** Serving traffic to both regions will cause component detachment, potentially resulting in different data visibility in Operate and Tasklist. |
-| **Identity Support** | Identity is not supported, multi-Tenancy and Role-Based Access Control (RBAC) does not work. |
-| **Optimize Support** | Not supported because it depends on Identity. |
-| **Connectors Deployment** | Connectors can be deployed in a dual-region setup, but attention to [idempotency](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event) is required to avoid event duplication. In a dual-region setup, you'll have two connector deployments and using message idempotency is of importance to not duplicate events. |
-| **Zeebe Cluster Scaling** | Not supported. |
-| **Web-Modeler** | Is a standalone component not covered in this guide. Modeling applications can operate independently outside of the automation clusters. |
+| **Aspect** | **Details** |
+| :----------------------------- | :---------- |
+| Installation methods |

  • For **Kubernetes** we recommend using a dual-region Kubernetes setup with the [Camunda Helm chart](/self-managed/setup/install.md) installed in two Kubernetes clusters.
  • For **other platforms**, using alternative installation methods (for example, with docker-compose) is not supported.

| +| Camunda Platform Configuration |

The overall Camunda platform is **active-passive**, although some key components are active-active.

  • **Active-Passive Traffic Handling:** One active and one passive region serve active user traffic.
  • **Traffic to Both Regions:** Serving traffic to both regions will cause component detachment, potentially resulting in different data visibility in Operate and Tasklist.

| +| Identity Support | Identity is not supported, multi-Tenancy and Role-Based Access Control (RBAC) does not work. | +| Optimize Support | Not supported (requires Identity). | +| Connectors Deployment | Connectors can be deployed in a dual-region setup, but attention to [idempotency](../../../components/connectors/use-connectors/inbound.md#creating-the-connector-event) is required to avoid event duplication. In a dual-region setup, you'll have two connector deployments and using message idempotency is of importance to not duplicate events. | +| Zeebe Cluster Scaling | Not supported. | +| Web-Modeler | Is a standalone component not covered in this guide. Modeling applications can operate independently outside of the automation clusters. | ### Platform considerations :::caution -Multi-region setups in itself bring their own complexity. You should familiarize yourself with those before deciding to go for a dual-region setup. +Multi-region setups have inherent complexities you should familiarize yourself with before choosing a dual-region setup. ::: -The following items are such complexities and are not considered in our guides: +For example, consider the following complexities (not covered in our guides): -- Managing multiple Kubernetes clusters and their deployments across regions -- Monitoring and alerting -- Increased costs of multiple clusters and cross-region traffic -- Data consistency and synchronization challenges (for example, brought in by the increased latency) - - Bursts of increased latency can already have an impact -- Managing DNS and incoming traffic +- Managing multiple Kubernetes clusters and their deployments across regions. +- Monitoring and alerting. +- Increased costs of multiple clusters and cross-region traffic. +- Data consistency and synchronization challenges (for example, brought in by the increased latency). + - Bursts of increased latency can already have an impact. +- Managing DNS and incoming traffic. ## Region loss @@ -164,7 +164,7 @@ The [operational procedure](./../../operational-guides/multi-region/dual-region- ### Active region loss -The loss of the active region results in: +The loss of the active region results in the following: - **Loss of Data**: Data previously available in Operate and Tasklist is no longer accessible. - **Service Disruption**: Traffic routed to the active region can no longer be served. @@ -172,23 +172,23 @@ The loss of the active region results in: #### Steps to take in case of active region loss -1. **Temporary Recovery:** follow the [operational procedure for temporary recovery](./../../operational-guides/multi-region/dual-region-ops.md#failover) to restore functionality and unblock the workflow engine. -2. **Traffic Rerouting:** reroute traffic to the passive region, which will now become the new active region. -3. **Data and Task Management:** due to the loss of data in Operate and Tasklist: +1. **Temporary Recovery:** Follow the [operational procedure for temporary recovery](./../../operational-guides/multi-region/dual-region-ops.md#failover) to restore functionality and unblock the workflow engine. +2. **Traffic Rerouting:** Reroute traffic to the passive region, which will now become the new active region. +3. **Data and Task Management:** Due to the loss of data in Operate and Tasklist: 1. Reassign any uncompleted tasks in Tasklist. 2. Recreate batch operations in Operate. -4. 
**Permanent Region Setup:** follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. +4. **Permanent Region Setup:** Follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. ### Passive region loss -The loss of the passive region means: +The loss of the passive region results in the following: - **Workflow Engine Impact**: The workflow engine will stop processing due to the loss of quorum. #### Steps to take in case of passive region loss -1. **Temporary Recovery:** follow the [operational procedure to temporarily recover](./../../operational-guides/multi-region/dual-region-ops.md#failover) from the loss and unblock the workflow engine. -2. **Permanent Region Setup:** follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. +1. **Temporary Recovery:** Follow the [operational procedure to temporarily recover](./../../operational-guides/multi-region/dual-region-ops.md#failover) from the loss and unblock the workflow engine. +2. **Permanent Region Setup:** Follow the [operational procedure to create a new permanent region](./../../operational-guides/multi-region/dual-region-ops.md#failback) that will become your new passive region. **Note:** Unlike an active region loss, no data will be lost and no traffic rerouting is necessary. From a1db94a2959dd4480ae778f11a2743a8bd8ade96 Mon Sep 17 00:00:00 2001 From: mesellings Date: Mon, 9 Sep 2024 12:05:35 +0100 Subject: [PATCH 08/15] TW edits: dual-region ops --- .../operational-guides/multi-region/dual-region-ops.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index c0a23c64cf..35535672da 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -72,9 +72,9 @@ We handle the loss of both active and passive regions using the same procedure. #### Key Steps to Handle Passive Region Loss -1. **Traffic Rerouting:** reroute traffic to the surviving active region using DNS. (Details on how to manage DNS rerouting depend on your specific DNS setup and are not covered in this guide.) -2. **Prevent Reconnection:** ensure that the lost region cannot reconnect before starting the restoration procedure. Reconnection could interfere with a successful recovery during failover and failback. -3. **Temporary Loss Scenario:** if the region loss is temporary (e.g., due to network issues), Zeebe can survive this loss but may stop processing due to quorum loss. This could lead to persistent disk filling up before data is lost. +1. **Traffic Rerouting:** Reroute traffic to the surviving active region using DNS. (Details on how to manage DNS rerouting depend on your specific DNS setup and are not covered in this guide.) +2. **Prevent Reconnection:** Ensure that the lost region cannot reconnect before starting the restoration procedure. Reconnection could interfere with a successful recovery during failover and failback. +3. 
**Temporary Loss Scenario:** If the region loss is temporary (for example, due to network issues), Zeebe can survive this loss but may stop processing due to quorum loss. This could lead to persistent disk filling up before data is lost. 4. **Procedure Phases** - **Failover Phase:** Temporarily restores Camunda 8 functionality by removing the lost brokers and handling the export to the unreachable Elasticsearch instance. - **Failback Phase:** Fully restores the failed region to its original functionality. This phase requires the region to be ready for the redeployment of Camunda 8. From ade0b3770376544790b43b7b8192f6ee0adbc01f Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Mon, 9 Sep 2024 15:49:03 +0200 Subject: [PATCH 09/15] WIP review --- .../concepts/multi-region/dual-region.md | 11 ++------ .../multi-region/dual-region-ops.md | 27 ++++++++++--------- 2 files changed, 16 insertions(+), 22 deletions(-) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index 53e3d3e61f..9cd8cb3380 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -88,19 +88,12 @@ In the event of a total active region loss, the following data will be lost: ## Requirements and Limitations -### Minimum Versions - -| **Component** | **Elasticsearch** | **Operate** | **Tasklist** | **Zeebe** | **Zeebe Gateway** | **Camunda Helm Chart** | -| ------------------- | ----------------- | ----------- | ------------ | --------- | ----------------- | -------------------------------------------------------- | -| **Minimum Version** | 8.9+\* | 8.5+ | 8.5+ | 8.5+ | 8.5+ | [9.3+](https://github.com/camunda/camunda-platform-helm) | - -\* OpenSearch (both managed and self-managed) is not supported for a dual-region setup. - ### Installation Environment #### Kubernetes Setup -Two Kubernetes clusters are required for the Helm chart installation. +- Two Kubernetes clusters are required for the Helm chart installation. +- OpenSearch (both managed and self-managed) is not supported for a dual-region setup. #### Network Requirements diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 35535672da..ec9e2b0fa4 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -73,9 +73,8 @@ We handle the loss of both active and passive regions using the same procedure. #### Key Steps to Handle Passive Region Loss 1. **Traffic Rerouting:** Reroute traffic to the surviving active region using DNS. (Details on how to manage DNS rerouting depend on your specific DNS setup and are not covered in this guide.) -2. **Prevent Reconnection:** Ensure that the lost region cannot reconnect before starting the restoration procedure. Reconnection could interfere with a successful recovery during failover and failback. -3. **Temporary Loss Scenario:** If the region loss is temporary (for example, due to network issues), Zeebe can survive this loss but may stop processing due to quorum loss. This could lead to persistent disk filling up before data is lost. -4. **Procedure Phases** +2. **Temporary Loss Scenario:** If the region loss is temporary (for example, due to network issues), Zeebe can survive this loss but may stop processing due to quorum loss. 
This could lead to persistent disk filling up before data is lost. +3. **Procedure Phases** - **Failover Phase:** Temporarily restores Camunda 8 functionality by removing the lost brokers and handling the export to the unreachable Elasticsearch instance. - **Failback Phase:** Fully restores the failed region to its original functionality. This phase requires the region to be ready for the redeployment of Camunda 8. @@ -148,6 +147,8 @@ One of the regions is lost, meaning Zeebe: - Stops exporting new data to Elasticsearch in the lost region - Stops exporting new data to Elasticsearch in the survived region +# TODO: rephrase + You have previously ensured that the lost region cannot reconnect during the failover procedure. Due to the Zeebe data replication, no data has been lost. @@ -339,7 +340,7 @@ curl -XGET 'http://localhost:9600/actuator/exporters' -2. Based on the [Exporter APIs](../../zeebe-deployment/operations/cluster-scaling.md) you will send a request to the Zeebe Gateway to disable the Elasticsearch exporter to the lost region. +2. Based on the Exporter APIs you will send a request to the Zeebe Gateway to disable the Elasticsearch exporter to the lost region. ```bash curl -XPOST 'http://localhost:9600/actuator/exporters/elasticsearchregion1/disable' @@ -408,7 +409,7 @@ desired={} | **Details** | **Current State** | **Desired State** | | ------------------------ | ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | | **Camunda 8 Setup** | A standalone region with a fully functional Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch. | Restore dual-region functionality by deploying Camunda 8 (Zeebe and Elasticsearch) to the newly restored region. | -| **Operate and Tasklist** | Operate and Tasklist are operational in the standalone region. | Operate and Tasklist should remain disabled to avoid interference during the database backup and restore process. | +| **Operate and Tasklist** | Operate and Tasklist are operational in the standalone region. | Operate and Tasklist need to stay disabled to avoid interference during the database backup and restore process. | #### How to get there @@ -516,8 +517,8 @@ desired={} | **Details** | **Current State** | **Desired State** | | ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Zeebe Cluster Setup** | Functioning Zeebe cluster within a single region:
Working Camunda 8 installation in the surviving region
Non-participating Camunda 8 installation in the recreated region | Preparing the newly created region to take over and restore the dual-region setup. | -| **Elasticsearch Export** | Currently exporting data to Elasticsearch from the surviving region. | Stop Zeebe exporters to prevent new data from being exported to Elasticsearch, allowing for the creation of an Elasticsearch backup. | +| **Camunda 8 Cluster Setup** | Functioning Zeebe cluster within a single region:
Working Camunda 8 installation in the surviving region
Non-participating Camunda 8 installation in the recreated region | Preparing the newly created region to take over and restore the dual-region setup. | +| **Export Capability** | Currently exporting data to Elasticsearch from the surviving region. | Stop Zeebe exporters to prevent new data from being exported to Elasticsearch, allowing for the creation of an Elasticsearch backup. | | **Operate and Tasklist** | Operate and Tasklist are operational in the surviving region. | Temporarily scale down Operate and Tasklist to zero replicas, preventing user interaction with Camunda 8 and ensuring no new data is imported to Elasticsearch. | :::note @@ -572,8 +573,8 @@ desired={} | **Details** | **Current State** | **Desired State** | | ------------------------ | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Camunda Components** | Not reachable by end-users and not processing any new process instances. This state allows for data backup without loss. | Creating a backup of the main Elasticsearch instance in the surviving region and restoring it in the recreated region. Backup process may take time to complete. | -| **Elasticsearch Backup** | No backup is in progress. | Backup of Elasticsearch in the surviving region is initiated and being restored in the recreated region, containing all necessary data. | +| **Camunda Components** | Not reachable by end-users and not processing any new process instances. This state allows for data backup without loss. | Remain unreachable by end-users and not processing any new instances. | +| **Elasticsearch Backup** | No backup is in progress. | Backup of Elasticsearch in the surviving region is initiated and being restored in the recreated region, containing all necessary data. Backup process may take time to complete. | #### How to get there @@ -800,8 +801,8 @@ desired={} | **Details** | **Current State** | **Desired State** | | ------------------------ | -------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | -| **Elasticsearch Backup** | Backup has been created and restored to the recreated region. | N/A | | **Camunda Components** | Remain unreachable by end-users while restoring functionality. | Enable Operate and Tasklist in both the surviving and recreated regions to allow user interaction with Camunda 8 again. | +| **Elasticsearch Backup** | Backup has been created and restored to the recreated region. | N/A | #### How to get there @@ -856,7 +857,7 @@ desired={} | **Details** | **Current State** | **Desired State** | | --------------------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -| **Camunda 8 Accessibility** | Reachable to end-users, but not exporting any data. | Start a new exporter to the recreated region. | +| **Camunda Components** | Reachable to end-users, but not exporting any data. | Start a new exporter to the recreated region. | | **Data Exporting** | No data export currently in progress. | Ensure that both Elasticsearch instances are populated for data redundancy. | | **Procedure Step** | Data export is paused. 
| Separate the initialization step (asynchronous) and confirm completion before resuming the exporters. | @@ -928,7 +929,7 @@ desired={} | **Details** | **Current State** | **Desired State** | | --------------------------- | ----------------------------------------------------------------------- | -------------------------------------------------- | -| **Camunda 8 Accessibility** | Reachable to end-users, but currently not exporting any data. | Reactivate existing exporters. | +| **Camunda 8** | Reachable to end-users, but currently not exporting any data. | Reactivate existing exporters. | | **Elasticsearch Exporters** | Enabled for both regions, with the operation confirmed to be completed. | Allow Zeebe to export data to Elasticsearch again. | #### How to get there @@ -961,7 +962,7 @@ desired={} | **Details** | **Current State** | **Desired State** | | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | -| **Camunda 8 Deployment** | Running in two regions, but not yet utilizing all Zeebe brokers. Operate and Tasklist redeployed, Elasticsearch exporters enabled. | Fully functional Camunda 8 setup utilizing both regions, recovering all dual-region benefits. | +| **Camunda 8** | Running in two regions, but not yet utilizing all Zeebe brokers. Operate and Tasklist redeployed, Elasticsearch exporters enabled. | Fully functional Camunda 8 setup utilizing both regions, recovering all dual-region benefits. | | **User Interaction** | Users can interact with Camunda 8 again. | Dual-region functionality is restored, maximizing reliability and performance benefits. | #### How to get there From 88bac57d1194a4a0dc623296f1541b95f851b5c1 Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:05:36 +0200 Subject: [PATCH 10/15] update doc based on feedback --- .../multi-region/dual-region-ops.md | 75 ++++++++----------- 1 file changed, 32 insertions(+), 43 deletions(-) diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index ec9e2b0fa4..f03644d30a 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -137,21 +137,15 @@ desired={}
-| **Current State** | **Desired State** | -| --------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| The lost region has been ensured not to reconnect during the failover procedure.

No data has been lost due to Zeebe data replication. | The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention. | +| **Current State** | **Desired State** | +| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| You have ensured that you fully lost a region and want to start the temporary recovery.

One of the regions is lost, meaning Zeebe:
- No data has been lost thanks to Zeebe data replication.
- Is unable to process new requests due to losing the quorum
- Stops exporting new data to Elasticsearch in the lost region
- Stops exporting new data to Elasticsearch in the survived region | The lost region has been ensured not to reconnect during the failover procedure. You should be sure it is lost, and if so, look into measures to prevent it from reconnecting. For example, by utilizing the suggested solution below to isolate your active environment.

The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention. | -One of the regions is lost, meaning Zeebe: - -- Is unable to process new requests due to losing the quorum -- Stops exporting new data to Elasticsearch in the lost region -- Stops exporting new data to Elasticsearch in the survived region - -# TODO: rephrase +:::warning -You have previously ensured that the lost region cannot reconnect during the failover procedure. +It's crucial to ensure the isolation of the environments because, during the operational procedure, we will have duplicate Zeebe broker IDs, which would collide if not correctly isolated and if the other region came accidentally on again. -Due to the Zeebe data replication, no data has been lost. +::: #### How to get there @@ -309,11 +303,10 @@ desired={}
-| **Details** | **Current State** | **Desired State** | -| ----------------------- | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | -| **Zeebe Configuration** | Zeebe brokers in the surviving region are still configured to point to the Elasticsearch instance of the lost region. | Elasticsearch exporter to the failed region has been disabled in the Zeebe cluster. | -| **Export Capability** | Zeebe cannot continue exporting data. | Zeebe can export data to Elasticsearch again. | -| **User Interaction** | Regular interaction with Camunda 8 is not restored. | Regular interaction with Camunda 8 is restored, marking the conclusion of the temporary recovery. | +| **Details** | **Current State** | **Desired State** | +| ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | +| **Zeebe Configuration** | Zeebe brokers in the surviving region are still configured to point to the Elasticsearch instance of the lost region. Zeebe cannot continue exporting data. | Elasticsearch exporter to the failed region has been disabled in the Zeebe cluster. Zeebe can export data to Elasticsearch again. | +| **User Interaction** | Regular interaction with Camunda 8 is not restored. | Regular interaction with Camunda 8 is restored, marking the conclusion of the temporary recovery. | #### How to get there @@ -406,9 +399,9 @@ desired={}
-| **Details** | **Current State** | **Desired State** | -| ------------------------ | ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | -| **Camunda 8 Setup** | A standalone region with a fully functional Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch. | Restore dual-region functionality by deploying Camunda 8 (Zeebe and Elasticsearch) to the newly restored region. | +| **Details** | **Current State** | **Desired State** | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | +| **Camunda 8** | A standalone region with a fully functional Camunda 8 setup, including Zeebe, Operate, Tasklist, and Elasticsearch. | Restore dual-region functionality by deploying Camunda 8 (Zeebe and Elasticsearch) to the newly restored region. | | **Operate and Tasklist** | Operate and Tasklist are operational in the standalone region. | Operate and Tasklist need to stay disabled to avoid interference during the database backup and restore process. | #### How to get there @@ -515,11 +508,10 @@ desired={}
-| **Details** | **Current State** | **Desired State** | -| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Camunda 8 Cluster Setup** | Functioning Zeebe cluster within a single region:
Working Camunda 8 installation in the surviving region
Non-participating Camunda 8 installation in the recreated region | Preparing the newly created region to take over and restore the dual-region setup. | -| **Export Capability** | Currently exporting data to Elasticsearch from the surviving region. | Stop Zeebe exporters to prevent new data from being exported to Elasticsearch, allowing for the creation of an Elasticsearch backup. | -| **Operate and Tasklist** | Operate and Tasklist are operational in the surviving region. | Temporarily scale down Operate and Tasklist to zero replicas, preventing user interaction with Camunda 8 and ensuring no new data is imported to Elasticsearch. | +| **Details** | **Current State** | **Desired State** | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Camunda 8** | Functioning Zeebe cluster within a single region:
Working Camunda 8 installation in the surviving region
Non-participating Camunda 8 installation in the recreated region.
Currently exporting data to Elasticsearch from the surviving region. | Preparing the newly created region to take over and restore the dual-region setup. Stop Zeebe exporters to prevent new data from being exported to Elasticsearch, allowing for the creation of an Elasticsearch backup. | +| **Operate and Tasklist** | Operate and Tasklist are operational in the surviving region. | Temporarily scale down Operate and Tasklist to zero replicas, preventing user interaction with Camunda 8 and ensuring no new data is imported to Elasticsearch. | :::note @@ -571,10 +563,10 @@ desired={}
-| **Details** | **Current State** | **Desired State** | -| ------------------------ | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Camunda Components** | Not reachable by end-users and not processing any new process instances. This state allows for data backup without loss. | Remain unreachable by end-users and not processing any new instances. | -| **Elasticsearch Backup** | No backup is in progress. | Backup of Elasticsearch in the surviving region is initiated and being restored in the recreated region, containing all necessary data. Backup process may take time to complete. | +| **Details** | **Current State** | **Desired State** | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Camunda 8** | Not reachable by end-users and not processing any new process instances. This state allows for data backup without loss. | Remain unreachable by end-users and not processing any new instances. | +| **Elasticsearch Backup** | No backup is in progress. | Backup of Elasticsearch in the surviving region is initiated and being restored in the recreated region, containing all necessary data. Backup process may take time to complete. | #### How to get there @@ -801,7 +793,7 @@ desired={} | **Details** | **Current State** | **Desired State** | | ------------------------ | -------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | -| **Camunda Components** | Remain unreachable by end-users while restoring functionality. | Enable Operate and Tasklist in both the surviving and recreated regions to allow user interaction with Camunda 8 again. | +| **Camunda 8** | Remain unreachable by end-users while restoring functionality. | Enable Operate and Tasklist in both the surviving and recreated regions to allow user interaction with Camunda 8 again. | | **Elasticsearch Backup** | Backup has been created and restored to the recreated region. | N/A | #### How to get there @@ -855,11 +847,9 @@ desired={}
-| **Details** | **Current State** | **Desired State** | -| --------------------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | -| **Camunda Components** | Reachable to end-users, but not exporting any data. | Start a new exporter to the recreated region. | -| **Data Exporting** | No data export currently in progress. | Ensure that both Elasticsearch instances are populated for data redundancy. | -| **Procedure Step** | Data export is paused. | Separate the initialization step (asynchronous) and confirm completion before resuming the exporters. | +| **Details** | **Current State** | **Desired State** | +| ------------- | --------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Camunda 8** | Reachable to end-users, but not exporting any data. | Start a new exporter to the recreated region.
Ensure that both Elasticsearch instances are populated for data redundancy.
Separate the initialization step (asynchronous) and confirm completion before resuming the exporters. | #### How to get there @@ -927,10 +917,9 @@ desired={}
-| **Details** | **Current State** | **Desired State** | -| --------------------------- | ----------------------------------------------------------------------- | -------------------------------------------------- | -| **Camunda 8** | Reachable to end-users, but currently not exporting any data. | Reactivate existing exporters. | -| **Elasticsearch Exporters** | Enabled for both regions, with the operation confirmed to be completed. | Allow Zeebe to export data to Elasticsearch again. | +| **Details** | **Current State** | **Desired State** | +| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ | +| **Camunda 8** | Reachable to end-users, but currently not exporting any data. Exporters are enabled for both regions, with the operation confirmed to be completed | Reactivate existing exporters that will allow Zeebe to export data to Elasticsearch again. | #### How to get there @@ -960,10 +949,10 @@ desired={}
-| **Details** | **Current State** | **Desired State** | -| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | -| **Camunda 8** | Running in two regions, but not yet utilizing all Zeebe brokers. Operate and Tasklist redeployed, Elasticsearch exporters enabled. | Fully functional Camunda 8 setup utilizing both regions, recovering all dual-region benefits. | -| **User Interaction** | Users can interact with Camunda 8 again. | Dual-region functionality is restored, maximizing reliability and performance benefits. | +| **Details** | **Current State** | **Desired State** | +| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | +| **Camunda 8** | Running in two regions, but not yet utilizing all Zeebe brokers. Operate and Tasklist redeployed, Elasticsearch exporters enabled. | Fully functional Camunda 8 setup utilizing both regions, recovering all dual-region benefits. | +| **User Interaction** | Users can interact with Camunda 8 again. | Dual-region functionality is restored, maximizing reliability and performance benefits. | #### How to get there From 702d144cbd3d6364a67931e5bc7610da458ca6d9 Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Tue, 10 Sep 2024 11:13:59 +0200 Subject: [PATCH 11/15] up --- .../operational-guides/multi-region/dual-region-ops.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index f03644d30a..939554a9b1 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -137,9 +137,9 @@ desired={}
-| **Current State** | **Desired State** | -| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| You have ensured that you fully lost a region and want to start the temporary recovery.

One of the regions is lost, meaning Zeebe:
- No data has been lost thanks to Zeebe data replication.
- Is unable to process new requests due to losing the quorum
- Stops exporting new data to Elasticsearch in the lost region
- Stops exporting new data to Elasticsearch in the survived region | The lost region has been ensured not to reconnect during the failover procedure. You should be sure it is lost, and if so, look into measures to prevent it from reconnecting. For example, by utilizing the suggested solution below to isolate your active environment.

The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention. | +| **Current State** | **Desired State** | +| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| You have ensured that you fully lost a region and want to start the temporary recovery.

One of the regions is lost, meaning Zeebe:
- No data has been lost thanks to Zeebe data replication.
- Is unable to process new requests due to losing the quorum
- Stops exporting new data to Elasticsearch in the lost region
- Stops exporting new data to Elasticsearch in the survived region | You should be sure it is lost, and if so, look into measures to prevent it from reconnecting. For example, by utilizing the suggested solution below to isolate your active environment.

The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention. | :::warning From a49a20c6064db09f522e519d70db7bd7f6a5bbe3 Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Tue, 10 Sep 2024 13:28:05 +0200 Subject: [PATCH 12/15] add a mention regarding cidrs overlap --- docs/self-managed/concepts/multi-region/dual-region.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index 9cd8cb3380..558541a5c5 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -97,6 +97,7 @@ In the event of a total active region loss, the following data will be lost: #### Network Requirements +- Kubernetes clusters, services, and pods must not have overlapping CIDRs. Each cluster must use distinct CIDRs that do not conflict or overlap with those of any other cluster; otherwise, you will encounter routing issues. - The regions (for example, two Kubernetes clusters) must be able to connect to each other (for example, via VPC peering). See [example implementation](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md) for AWS EKS. - Kubernetes services in one cluster must be resolvable and reachable from the other cluster and vice-versa. This is essential for proper communication and functionality across regions: - For AWS EKS setups, ensure DNS chaining is configured. Refer to the [Amazon Elastic Kubernetes Service (EKS) setup guide](/self-managed/setup/deploy/amazon/amazon-eks/dual-region.md). From fd133a40334101c1e9c1d8edb2818d6f373731dd Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Tue, 10 Sep 2024 13:32:12 +0200 Subject: [PATCH 13/15] remove prevent reconnection --- .../multi-region/dual-region-ops.md | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md index 939554a9b1..e3bb550778 100644 --- a/docs/self-managed/operational-guides/multi-region/dual-region-ops.md +++ b/docs/self-managed/operational-guides/multi-region/dual-region-ops.md @@ -137,15 +137,9 @@ desired={}
-| **Current State** | **Desired State** | -| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| You have ensured that you fully lost a region and want to start the temporary recovery.

One of the regions is lost, meaning Zeebe:
- No data has been lost thanks to Zeebe data replication.
- Is unable to process new requests due to losing the quorum
- Stops exporting new data to Elasticsearch in the lost region
- Stops exporting new data to Elasticsearch in the survived region | You should be sure it is lost, and if so, look into measures to prevent it from reconnecting. For example, by utilizing the suggested solution below to isolate your active environment.

The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention. | - -:::warning - -It's crucial to ensure the isolation of the environments because, during the operational procedure, we will have duplicate Zeebe broker IDs, which would collide if not correctly isolated and if the other region came accidentally on again. - -::: +| **Current State** | **Desired State** | +| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| You have ensured that you fully lost a region and want to start the temporary recovery.

One of the regions is lost, meaning Zeebe:
- No data has been lost thanks to Zeebe data replication.
- Is unable to process new requests due to losing the quorum
- Stops exporting new data to Elasticsearch in the lost region
- Stops exporting new data to Elasticsearch in the survived region | The lost brokers have been removed from the Zeebe cluster.

Continued processing is enabled, and new brokers in the failback procedure will only join the cluster with our intervention. | #### How to get there From d863487ab958979bb2b817bdee40d3786b27b960 Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Tue, 10 Sep 2024 13:36:19 +0200 Subject: [PATCH 14/15] unsupported vs undocumented --- docs/self-managed/concepts/multi-region/dual-region.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index 558541a5c5..f87d078709 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -93,7 +93,7 @@ In the event of a total active region loss, the following data will be lost: #### Kubernetes Setup - Two Kubernetes clusters are required for the Helm chart installation. -- OpenSearch (both managed and self-managed) is not supported for a dual-region setup. +- OpenSearch (both managed and self-managed) is not covered in this dual-region setup. #### Network Requirements @@ -125,7 +125,7 @@ Customers are responsible for detecting regional failures and executing the [ope | **Aspect** | **Details** | | :----------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Installation methods |

  • For **Kubernetes** we recommend using a dual-region Kubernetes setup with the [Camunda Helm chart](/self-managed/setup/install.md) installed in two Kubernetes clusters.
  • For **other platforms**, using alternative installation methods (for example, with docker-compose) is not supported.

| +| Installation methods |

  • For **Kubernetes** we recommend using a dual-region Kubernetes setup with the [Camunda Helm chart](/self-managed/setup/install.md) installed in two Kubernetes clusters.
  • For **other platforms**, using alternative installation methods (for example, with docker-compose) is not covered in this guide.

| | Camunda Platform Configuration |

The overall Camunda platform is **active-passive**, although some key components are active-active.

  • **Active-Passive Traffic Handling:** One active and one passive region serve active user traffic.
  • **Traffic to Both Regions:** Serving traffic to both regions will cause component detachment, potentially resulting in different data visibility in Operate and Tasklist.

| | Identity Support | Identity is not supported, multi-Tenancy and Role-Based Access Control (RBAC) does not work. | | Optimize Support | Not supported (requires Identity). | From 8ea06b27050f201150e351e21370e4f186b2f3e5 Mon Sep 17 00:00:00 2001 From: "Leo J." <153937047+leiicamundi@users.noreply.github.com> Date: Tue, 10 Sep 2024 13:42:25 +0200 Subject: [PATCH 15/15] remove support of OpenSearch --- docs/self-managed/concepts/multi-region/dual-region.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/self-managed/concepts/multi-region/dual-region.md b/docs/self-managed/concepts/multi-region/dual-region.md index f87d078709..7e1d9097ae 100644 --- a/docs/self-managed/concepts/multi-region/dual-region.md +++ b/docs/self-managed/concepts/multi-region/dual-region.md @@ -93,7 +93,7 @@ In the event of a total active region loss, the following data will be lost: #### Kubernetes Setup - Two Kubernetes clusters are required for the Helm chart installation. -- OpenSearch (both managed and self-managed) is not covered in this dual-region setup. +- OpenSearch (both managed and self-managed) is not supported in this dual-region setup. #### Network Requirements