Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design documentation for the NonAdminBackup Status and Conditions #23

Merged
merged 5 commits into from
May 17, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 87 additions & 0 deletions design/archive/nab_status_update-alternative.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR imo.

Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Developer Workflow: NonAdminBackup Phase and Conditions within NonAdminBackup Status

## Version

This is `alternative` of the design document. Simplifier approach is described in the [active design] doc.

## Overview

This document outlines the design around updating NonAdminBackup objects Phase and Conditions within Status.

### Phase

A NonAdminBackup's `status` field has a `phase` field, which is updated by NAC controller.

The `phase` is a simple one high-level summary of the lifecycle of an NonAdminBackup.

It is always a one well defined value, that is intended to be a comprehensive state of a NonAdminBackup.

Those are are the possible values for phase:

| **Value** | **Description** |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a table where we map the Phase and Conditions as well as VeleroBackupPhase with each other for easier understanding

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why we have diagram which is integral part of that .md file with Phase/Condition within workflow. Do we need a table to support that diagram as well? Attaching screenshot of the diagram in case you don's see it in the review 'view'

image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the diagram :)
ok lets at least tabulate velero phases with NAB phases ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the gray squares, should not ALL of them have Queued populated ❓

|-----------|--------------------------------|
| New | *NonAdminBackup resource was accepted by the OpenShift cluster, but it has not yet been processed by the NonAdminController* |
| BackingOff | Velero *Backup* object was not created due to NonAdminBackup error (configuration or similar) |
| Created | Velero *Backup* was created, but it has not yet fully ran or started |
| Running | Velero *Backup* is currently running |
| Succeeded | Velero *Backup* was performed with Success state |
| Failed | Velero *Backup* did not succeed |

### Conditions

The `conditions` is also a part of the NonAdminBackup's `status` field. One NAB object may have multiple conditions. It is more granular knowledge of the NonAdminBackup object and represents the array of the conditions through which the NonAdminBackup has or has not passed. Each `NonAdminCondition` has one of the following `type`:

| **Condition** | **Description** |
|-----------|--------------------------------|
| BackupAccepted | The Backup object was accepted by the reconcile loop, but the Velero Backup may have not yet been created |
| BackupQueued | The Velero Backup was created succesfully and did not return any known errors. It's in the queue for Backup. At this stage errors may still occur during backup procedure. |
| BackupCompleted | The Velero Backup succeeded. |

The `condition` data is also accomapied with the following:

| **Field name** | **Description** |
|-----------|--------------------------------|
| type | The `Type` of the condition |
| status | represents the state of individual condition. The resulting `phase` value should report `Success` only when all of the conditions are met and the backup succeded. One of `True`, `False` or `Unknown`. |
| lastProbeTime | Timestamp of when the NonAdminBackup condition was last probed. |
| lastTransitionTime | Timestamp for when the NonAdminBackup last transitioned from one status to another. |
| reason | Machine-readable, UpperCamelCase text indicating the reason for the condition's last transition. |
| message | Human-readable message indicating details about the last status transition. |

### BackupStatus

`BackupStatus` which is also part of the `NonAdminBackupStatus` object is a `BackupStatus` that is taken directly from the Velero Backup Status and copied over.

## Phase Update scenarios

*** Questions ***

- BackupQueued stays true at the end,
should we remove that from conditions?
I don't think we should set it to false.

```mermaid
%%{init: {'theme':'forest'}}%%
graph
AA[Phase: New] -->A
A{BackupAccepted: Unknown\n BackupQueued: False\n BackupCompleted: False} -- NAC processes --> B[Non Admin Backup Accepted]
A -- NAC hits an error --> E[Phase: BackingOff]
B -- Create Velero Backup --> C[Phase: Created]
B -.-> BA{BackupAccepted: True\nBackupQueued: False\nBackupCompleted: False}
C -- Backup runs --> D[Phase: Running]
C -.-> BS{BackupAccepted: True\nBackupQueued: True\nBackupCompleted: False}
D -- Backup succeeds --> F[Phase: Succeeded]
F -.-> FF{BackupAccepted: True\nBackupQueued: True\nBackupCompleted: True}

D -- Backup fails --> G[Phase: Failed]
E -.-> EE{BackupAccepted: False\nBackupQueued: False\nBackupCompleted: False}

classDef conditions fill:#ccc,stroke:#ccc,stroke-width:2px;
class A,BA,BS,BC,EE,FF conditions;
classDef phases fill:#777,stroke:#ccc,stroke-width:2px;
class AA,C,D,E,G,F phases;

```


[active design]: ./nab_status_update.md
82 changes: 82 additions & 0 deletions design/nab_status_update.md
mateusoliveira43 marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
# Developer Workflow: NonAdminBackup Phase and Conditions within NonAdminBackup Status

## Overview

This document outlines the design around updating NonAdminBackup objects Phase and Conditions within Status.

### Phase

A NonAdminBackup's `status` field has a `phase` field, which is updated by NAC controller.

The `phase` is a simple one high-level summary of the lifecycle of an NonAdminBackup.

It is always a one well defined value, that is intended to be a comprehensive state of a NonAdminBackup.

Those are are the possible values for phase:

| **Value** | **Description** |
|-----------|--------------------------------|
| New | *NonAdminBackup resource was accepted by the OpenShift cluster, but it has not yet been processed by the NonAdminController* |
| BackingOff | Velero *Backup* object was not created due to NonAdminBackup error (configuration or similar) |
| Created | Velero *Backup* was created. The Phase will not have additional informations about the |
mateusoliveira43 marked this conversation as resolved.
Show resolved Hide resolved

### Conditions

The `conditions` is also a part of the NonAdminBackup's `status` field. One NAB object may have multiple conditions. It is more granular knowledge of the NonAdminBackup object and represents the array of the conditions through which the NonAdminBackup has or has not passed. Each `NonAdminCondition` has one of the following `type`:

| **Condition** | **Description** |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpryc I think we also need a Error condition here. So that whenever the backup is in BackingOff phase the Error condition is applied to the NAB object

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the BackingOff we set the BackupAccepted condition to False see the diagram in another comment. Do we need to have Error condition as well? I think it's a bit too many states.

The BackupAccepted set to False have also the string reason/message where we can say why it was wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example:

status:
  conditions:
    - lastTransitionTime: '2024-04-17T20:56:57Z'
      message: NonAdminBackup CR does not contain valid BackupSpec
      reason: invalid_backupspec
      status: 'False'
      type: BackupAccepted
  phase: BackingOff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the user sees an Error condition they definitely think that something is not working as intended/expected.
The word Accepted does not imply the same. I agree that the Accepted condition can display the error message but having an Error condition would improve the user experience.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have phase: BackingOff for that, so this is high level information to tell that something went wrong.

User has the Accepted: False status, of course we could add Accepted: False and Error: True or just Error: True, but that in my opinion is duplication of what phase is for.

|-----------|--------------------------------|
| BackupAccepted | The Backup object was accepted by the reconcile loop, but the Velero Backup may have not yet been created |
| BackupQueued | The Velero Backup was created succesfully. At this stage errors may still occur either from the Velero not accepting backup or during backup procedure. |

The `condition` data is also accomapied with the following:

| **Field name** | **Description** |
|-----------|--------------------------------|
| type | The `Type` of the condition |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is just Accepted or Queued, right ❓ I think is can be more clear

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently yes, but I don't want to limit that in the design, because we may say later we want to have other conditions.

| status | represents the state of individual condition. The resulting `phase` value should report `Success` only when all of the conditions are met and the backup succeded. One of `True`, `False` or `Unknown`. |
| lastProbeTime | Timestamp of when the NonAdminBackup condition was last probed. |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsure if we need this, seems like this ts would be managed by NAB controller and we will have to embed logic in it ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I took it from the condition structure, but looks like I need to adjust to what's there. We need lastTransitionTime instead (below is from the NAB object itself):

    - lastTransitionTime: '2024-04-15T20:27:45Z'
      message: Created Velero Backup object
      reason: backup_scheduled
      status: 'True'
      type: BackupQueued

| lastTransitionTime | Timestamp for when the NonAdminBackup last transitioned from one status to another. |
| reason | Machine-readable, UpperCamelCase text indicating the reason for the condition's last transition. |
| message | Human-readable message indicating details about the last status transition. |

### BackupStatus

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a sample of how the NAB status in yaml form would look like ? Also the overall golang struct to give a rough idea ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok here is entire (failing) one, anything else from it?

apiVersion: nac.oadp.openshift.io/v1alpha1
kind: NonAdminBackup
metadata:
  creationTimestamp: '2024-04-15T20:27:35Z'
  generation: 2
  managedFields:
    - apiVersion: nac.oadp.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          .: {}
          'f:backupSpec': {}
      manager: Mozilla
      operation: Update
      time: '2024-04-15T20:27:35Z'
    - apiVersion: nac.oadp.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          'f:backupSpec':
            'f:itemOperationTimeout': {}
            'f:metadata': {}
            'f:volumeSnapshotLocations': {}
            'f:snapshotMoveData': {}
            'f:defaultVolumesToFsBackup': {}
            'f:ttl': {}
            'f:csiSnapshotTimeout': {}
            'f:storageLocation': {}
            'f:hooks': {}
      manager: main
      operation: Update
      time: '2024-04-15T20:27:45Z'
    - apiVersion: nac.oadp.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          .: {}
          'f:backupStatus':
            .: {}
            'f:expiration': {}
            'f:failureReason': {}
            'f:formatVersion': {}
            'f:phase': {}
            'f:startTimestamp': {}
            'f:version': {}
          'f:conditions': {}
          'f:oadpVeleroBackup': {}
      manager: main
      operation: Update
      subresource: status
      time: '2024-04-16T08:12:12Z'
  name: testnonadminbackup
  namespace: nacproject
  resourceVersion: '61393755'
  uid: 12712cf9-2161-4e58-931d-21528afdff89
spec:
  backupSpec:
    volumeSnapshotLocations:
      - velero-sample-1
    defaultVolumesToFsBackup: false
    csiSnapshotTimeout: 10m0s
    ttl: 720h0m0s
    itemOperationTimeout: 4h0m0s
    metadata: {}
    storageLocation: velero-sample-1
    hooks: {}
    snapshotMoveData: false
status:
  backupStatus:
    expiration: '2024-05-16T08:12:11Z'
    failureReason: >-
      unable to get credentials: unable to get key for secret: Secret
      "cloud-credentials" not found
    formatVersion: 1.1.0
    phase: Failed
    startTimestamp: '2024-04-16T08:12:11Z'
    version: 1
  conditions:
    - lastTransitionTime: '2024-04-15T20:27:35Z'
      message: backup accepted
      reason: backup_accepted
      status: 'True'
      type: BackupAccepted
    - lastTransitionTime: '2024-04-15T20:27:45Z'
      message: Created Velero Backup object
      reason: backup_scheduled
      status: 'True'
      type: BackupQueued
  oadpVeleroBackup: nab-nacproject-83fc04a2fd253d

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps, so I would say we need to restructure a bit here, just remove the backupStatus named struct, all the feilds of backupStatus can be directly under NAB.status. This would mean:
Current traversal is: NonAdminBackup.status.BackupStatus.phase
Suggested/Modified traversal will be : NonAdminBackup.status.phase
The backupStatus under status seems redundant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in favor of putting the backupStatus directly under NAB.status. It's because from implementation it's easier to compare structures that are same and NAB.status have already additional fields.

Secondly it's clear what comes from Velero Status and what is addition to that status such as veleroBackup or veleroBackupNamespace or even conditions. After your proposal we would have:

status:
  expiration: '2024-05-16T08:12:11Z'
  failureReason: >-
    unable to get credentials: unable to get key for secret: Secret
    "cloud-credentials" not found
  formatVersion: 1.1.0
  phase: Failed
  startTimestamp: '2024-04-16T08:12:11Z'
  version: 1
  conditions:
    - lastTransitionTime: '2024-04-15T20:27:35Z'
      message: backup accepted
      reason: backup_accepted
      status: 'True'
      type: BackupAccepted
    - lastTransitionTime: '2024-04-15T20:27:45Z'
      message: Created Velero Backup object
      reason: backup_scheduled
      status: 'True'
      type: BackupQueued
  veleroBackup: nab-nacproject-83fc04a2fd253d
  veleroBackupNamespace: openshift-adp

`BackupStatus` which is also part of the `NonAdminBackupStatus` object is a `BackupStatus` that is taken directly from the Velero Backup Status and copied over.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NonAdminBackupStatus will consist of:

  • its own status
  • and VeleroBackupStatus

Copy link
Collaborator Author

@mpryc mpryc Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that is not true. The BackupStatus within NonAdminBackup.Status is just VeleroBackup.Status nothing else.

So whatever is under backupStatus is a deepcopy of VeleroBackup.Status and I would be really for not mixing Velero Status with other NAB status fields.

status:
  backupStatus:
    expiration: '2024-05-16T08:12:11Z'
    failureReason: >-
      unable to get credentials: unable to get key for secret: Secret
      "cloud-credentials" not found
    formatVersion: 1.1.0
    phase: Failed
    startTimestamp: '2024-04-16T08:12:11Z'
    version: 1


Copy link
Member

@shubham-pampattiwar shubham-pampattiwar Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we might need to add Creation and Completion timestamp as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creation of NAB ? That is already part of metadata, or we are talking here about Creation of VeleroBackup?

I am unsure here what you want to represent. The Completion of backup operation? Isn't that part of the VeleroBackup status? See startTimestamp and I expect VeleroBackup will also consist of completion part

  backupStatus:
    expiration: '2024-05-16T08:12:11Z'
    failureReason: >-
      unable to get credentials: unable to get key for secret: Secret
      "cloud-credentials" not found
    formatVersion: 1.1.0
    phase: Failed
    startTimestamp: '2024-04-16T08:12:11Z'
    version: 1

Copy link
Member

@shubham-pampattiwar shubham-pampattiwar Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok we can hold off on adding completionTimestamp Velero status timestamps should be good enough. We would need seperate completionTimestamp for NAB status if did any crucial processing once Velero backup is complete but in our case we will just be updated the NAB status so we should be good.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding completionTS would just make the NAB status future proof.

### VeleroBackupName and VeleroBackupNamespace
The `VeleroBackupName` is a component of the `NonAdminBackupStatus` object. It represents the name of the `VeleroBackup` object. The `VeleroBackupNamespace` represents the namespace in which the `VeleroBackup` object was created.

This `VeleroBackupName` and `VeleroBackupNamespace` serves as a reference to the Backup responsible for executing the backup task.

The format of those fields allows to interact with that Backup using `oc` or `velero` commands as follows:

```shell
# Example:
# veleroBackupName: nab-nacproject-c3499c2729730a
# veleroBackupNamespace: openshift-adp

$ oc describe -n openshift-adp nab-nacproject-c3499c2729730a

$ velero backup describe -n openshift-adp nab-nacproject-c3499c2729730a
```


## Phase Update scenarios

shubham-pampattiwar marked this conversation as resolved.
Show resolved Hide resolved
```mermaid
%%{init: {'theme':'forest'}}%%
graph
START[Phase: New] -- NAC config OK --> ACCEPTED[Non Admin Backup Accepted]
START -- NAC config NOT OK --> ERROR[Phase: BackingOff]
ACCEPTED -- Create Velero Backup --> CREATED[Phase: Created]
ACCEPTED -.-> COND_ACCEPTED{BackupAccepted: True\n}
CREATED -.-> COND_QUEUED{BackupAccepted: True\nBackupQueued: True\n}
ERROR -.-> COND_ERROR{BackupAccepted: False}

classDef conditions fill:#ccc,stroke:#ccc,stroke-width:2px;
class COND_ACCEPTED,COND_QUEUED,COND_ERROR conditions;
classDef phases fill:#777,stroke:#ccc,stroke-width:2px;
class START,CREATED,ERROR phases;

```
Loading