Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design documentation for the NonAdminBackup Status and Conditions #23

Merged
merged 5 commits into from
May 17, 2024

Conversation

mpryc
Copy link
Collaborator

@mpryc mpryc commented Apr 3, 2024

Two proposals for updating and keeping in sync NonAdminBackup Status and Condition fields.

The Status also consists of Velero Backup Name within NAB object.

Two proposals for updating and keeping in sync NonAdminBackup Status
and Condition fields.

The Status also consists of Velero Backup Name within NAB object.

Signed-off-by: Michal Pryc <mpryc@redhat.com>

| **Condition** | **Description** |
|-----------|--------------------------------|
| BackupAccepted | The Backup object was accepted by the reconcile loop, but the Velero Backup may have not yet been created |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I would accept only after velero accepts it. My fear it that even a backup CR looks good (pass cluster validation) it may not work (because of lack of cluster validation or BSL is invalid, for example)


- BackupQueued stays true at the end,
should we remove that from conditions?
I don't think we should set it to false.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is good to set false, may be confusing if operation finished and this is still true. But have a way to say "it is not queued because it completed"

@shubham-pampattiwar
Copy link
Member

Just needs update on which approach we are going ahead with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need this for #35

Do you believe this design can cover both Backup and Restores, or best to have one for each?

FYI @shubham-pampattiwar @weshayutin

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have similar phases for Restores too, so should cover both. Maybe just add a note if there is anything that will be different for restores.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have same Phases for Restore:
New
BackingOff
Created

What will be different is conditions, they are coupled with backup e.g.:

BackupAccepted
BackupQueued

At this stage we can make those also generic and convert to Accepted and Queued which would mean Restore also could use same names?

@shubham-pampattiwar ^^ ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was gonna suggest the same, lets drop the Backup prefix and we should be good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from what I read, the text seems right and should be easy to follow it to have same logic in restore

question is if we need a design for that later, if yes, I suggest making this one for both restore and backup (having only backup examples, saying it is the same for restores)

mpryc and others added 2 commits April 17, 2024 14:04
Co-authored-by: Tiger Kaovilai <passawit.kaovilai@gmail.com>
Signed-off-by: Michal Pryc <mpryc@redhat.com>
Adjusted design to address comments.

The non-active design is in the folder archive, as we may come back
to it at some point in the future.

Signed-off-by: Michal Pryc <mpryc@redhat.com>

The `conditions` is also a part of the NonAdminBackup's `status` field. One NAB object may have multiple conditions. It is more granular knowledge of the NonAdminBackup object and represents the array of the conditions through which the NonAdminBackup has or has not passed. Each `NonAdminCondition` has one of the following `type`:

| **Condition** | **Description** |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpryc I think we also need a Error condition here. So that whenever the backup is in BackingOff phase the Error condition is applied to the NAB object

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the BackingOff we set the BackupAccepted condition to False see the diagram in another comment. Do we need to have Error condition as well? I think it's a bit too many states.

The BackupAccepted set to False have also the string reason/message where we can say why it was wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example:

status:
  conditions:
    - lastTransitionTime: '2024-04-17T20:56:57Z'
      message: NonAdminBackup CR does not contain valid BackupSpec
      reason: invalid_backupspec
      status: 'False'
      type: BackupAccepted
  phase: BackingOff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the user sees an Error condition they definitely think that something is not working as intended/expected.
The word Accepted does not imply the same. I agree that the Accepted condition can display the error message but having an Error condition would improve the user experience.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have phase: BackingOff for that, so this is high level information to tell that something went wrong.

User has the Accepted: False status, of course we could add Accepted: False and Error: True or just Error: True, but that in my opinion is duplication of what phase is for.

|-----------|--------------------------------|
| type | The `Type` of the condition |
| status | represents the state of individual condition. The resulting `phase` value should report `Success` only when all of the conditions are met and the backup succeded. One of `True`, `False` or `Unknown`. |
| lastProbeTime | Timestamp of when the NonAdminBackup condition was last probed. |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsure if we need this, seems like this ts would be managed by NAB controller and we will have to embed logic in it ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I took it from the condition structure, but looks like I need to adjust to what's there. We need lastTransitionTime instead (below is from the NAB object itself):

    - lastTransitionTime: '2024-04-15T20:27:45Z'
      message: Created Velero Backup object
      reason: backup_scheduled
      status: 'True'
      type: BackupQueued

| message | Human-readable message indicating details about the last status transition. |

### BackupStatus

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a sample of how the NAB status in yaml form would look like ? Also the overall golang struct to give a rough idea ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok here is entire (failing) one, anything else from it?

apiVersion: nac.oadp.openshift.io/v1alpha1
kind: NonAdminBackup
metadata:
  creationTimestamp: '2024-04-15T20:27:35Z'
  generation: 2
  managedFields:
    - apiVersion: nac.oadp.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          .: {}
          'f:backupSpec': {}
      manager: Mozilla
      operation: Update
      time: '2024-04-15T20:27:35Z'
    - apiVersion: nac.oadp.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          'f:backupSpec':
            'f:itemOperationTimeout': {}
            'f:metadata': {}
            'f:volumeSnapshotLocations': {}
            'f:snapshotMoveData': {}
            'f:defaultVolumesToFsBackup': {}
            'f:ttl': {}
            'f:csiSnapshotTimeout': {}
            'f:storageLocation': {}
            'f:hooks': {}
      manager: main
      operation: Update
      time: '2024-04-15T20:27:45Z'
    - apiVersion: nac.oadp.openshift.io/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          .: {}
          'f:backupStatus':
            .: {}
            'f:expiration': {}
            'f:failureReason': {}
            'f:formatVersion': {}
            'f:phase': {}
            'f:startTimestamp': {}
            'f:version': {}
          'f:conditions': {}
          'f:oadpVeleroBackup': {}
      manager: main
      operation: Update
      subresource: status
      time: '2024-04-16T08:12:12Z'
  name: testnonadminbackup
  namespace: nacproject
  resourceVersion: '61393755'
  uid: 12712cf9-2161-4e58-931d-21528afdff89
spec:
  backupSpec:
    volumeSnapshotLocations:
      - velero-sample-1
    defaultVolumesToFsBackup: false
    csiSnapshotTimeout: 10m0s
    ttl: 720h0m0s
    itemOperationTimeout: 4h0m0s
    metadata: {}
    storageLocation: velero-sample-1
    hooks: {}
    snapshotMoveData: false
status:
  backupStatus:
    expiration: '2024-05-16T08:12:11Z'
    failureReason: >-
      unable to get credentials: unable to get key for secret: Secret
      "cloud-credentials" not found
    formatVersion: 1.1.0
    phase: Failed
    startTimestamp: '2024-04-16T08:12:11Z'
    version: 1
  conditions:
    - lastTransitionTime: '2024-04-15T20:27:35Z'
      message: backup accepted
      reason: backup_accepted
      status: 'True'
      type: BackupAccepted
    - lastTransitionTime: '2024-04-15T20:27:45Z'
      message: Created Velero Backup object
      reason: backup_scheduled
      status: 'True'
      type: BackupQueued
  oadpVeleroBackup: nab-nacproject-83fc04a2fd253d

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps, so I would say we need to restructure a bit here, just remove the backupStatus named struct, all the feilds of backupStatus can be directly under NAB.status. This would mean:
Current traversal is: NonAdminBackup.status.BackupStatus.phase
Suggested/Modified traversal will be : NonAdminBackup.status.phase
The backupStatus under status seems redundant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in favor of putting the backupStatus directly under NAB.status. It's because from implementation it's easier to compare structures that are same and NAB.status have already additional fields.

Secondly it's clear what comes from Velero Status and what is addition to that status such as veleroBackup or veleroBackupNamespace or even conditions. After your proposal we would have:

status:
  expiration: '2024-05-16T08:12:11Z'
  failureReason: >-
    unable to get credentials: unable to get key for secret: Secret
    "cloud-credentials" not found
  formatVersion: 1.1.0
  phase: Failed
  startTimestamp: '2024-04-16T08:12:11Z'
  version: 1
  conditions:
    - lastTransitionTime: '2024-04-15T20:27:35Z'
      message: backup accepted
      reason: backup_accepted
      status: 'True'
      type: BackupAccepted
    - lastTransitionTime: '2024-04-15T20:27:45Z'
      message: Created Velero Backup object
      reason: backup_scheduled
      status: 'True'
      type: BackupQueued
  veleroBackup: nab-nacproject-83fc04a2fd253d
  veleroBackupNamespace: openshift-adp


### BackupStatus

`BackupStatus` which is also part of the `NonAdminBackupStatus` object is a `BackupStatus` that is taken directly from the Velero Backup Status and copied over.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NonAdminBackupStatus will consist of:

  • its own status
  • and VeleroBackupStatus

Copy link
Collaborator Author

@mpryc mpryc Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that is not true. The BackupStatus within NonAdminBackup.Status is just VeleroBackup.Status nothing else.

So whatever is under backupStatus is a deepcopy of VeleroBackup.Status and I would be really for not mixing Velero Status with other NAB status fields.

status:
  backupStatus:
    expiration: '2024-05-16T08:12:11Z'
    failureReason: >-
      unable to get credentials: unable to get key for secret: Secret
      "cloud-credentials" not found
    formatVersion: 1.1.0
    phase: Failed
    startTimestamp: '2024-04-16T08:12:11Z'
    version: 1

### BackupStatus

`BackupStatus` which is also part of the `NonAdminBackupStatus` object is a `BackupStatus` that is taken directly from the Velero Backup Status and copied over.

Copy link
Member

@shubham-pampattiwar shubham-pampattiwar Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we might need to add Creation and Completion timestamp as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creation of NAB ? That is already part of metadata, or we are talking here about Creation of VeleroBackup?

I am unsure here what you want to represent. The Completion of backup operation? Isn't that part of the VeleroBackup status? See startTimestamp and I expect VeleroBackup will also consist of completion part

  backupStatus:
    expiration: '2024-05-16T08:12:11Z'
    failureReason: >-
      unable to get credentials: unable to get key for secret: Secret
      "cloud-credentials" not found
    formatVersion: 1.1.0
    phase: Failed
    startTimestamp: '2024-04-16T08:12:11Z'
    version: 1

Copy link
Member

@shubham-pampattiwar shubham-pampattiwar Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok we can hold off on adding completionTimestamp Velero status timestamps should be good enough. We would need seperate completionTimestamp for NAB status if did any crucial processing once Velero backup is complete but in our case we will just be updated the NAB status so we should be good.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding completionTS would just make the NAB status future proof.


Those are are the possible values for phase:

| **Value** | **Description** |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a table where we map the Phase and Conditions as well as VeleroBackupPhase with each other for easier understanding

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why we have diagram which is integral part of that .md file with Phase/Condition within workflow. Do we need a table to support that diagram as well? Attaching screenshot of the diagram in case you don's see it in the review 'view'

image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the diagram :)
ok lets at least tabulate velero phases with NAB phases ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the gray squares, should not ALL of them have Queued populated ❓

@mpryc mpryc force-pushed the non-admin-status-design branch 3 times, most recently from e81a115 to b228c84 Compare April 18, 2024 15:46
Updated NAB workflow design that addresses the following:
 - Removed Backup prefix from the Condition name, so it
   can be the same for the Restore
 - Removed lastProbeTime from the Condition field
 - Added sample yaml files to easy reading how the
   object looks like
 - Updated diagram to present what is Condition

Signed-off-by: Michal Pryc <mpryc@redhat.com>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR imo.


| **Field name** | **Description** |
|-----------|--------------------------------|
| type | The `Type` of the condition |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is just Accepted or Queued, right ❓ I think is can be more clear

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently yes, but I don't want to limit that in the design, because we may say later we want to have other conditions.

 - Update reason example to be CammelCase
 - Move the design docs into docs/ folder
 - Fix one missing sentence

Signed-off-by: Michal Pryc <mpryc@redhat.com>
Copy link
Contributor

@mateusoliveira43 mateusoliveira43 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Follow up PR for adding deleted phase and NonAdminRestore to design

Copy link

openshift-ci bot commented May 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mateusoliveira43, mpryc, shubham-pampattiwar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [mateusoliveira43,mpryc,shubham-pampattiwar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 914e498 into migtools:master May 17, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Merged / Ready for build
Development

Successfully merging this pull request may close these issues.

4 participants