Introduce DCN DT #392

sbekkerm · 2024-09-09T18:03:45Z

This PR introduces DCN VA, which builds upon the HCI VA architecture and is designed for multi-site deployment.

In addition to the regular configuration files, this PR includes Jinja templates for generating the values.yaml and service-values.yaml files. These templates are essential for Zuul job execution, because they let the job generate a separate, site-specific set of configuration files for each DCN site.
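
As a rough illustration of the per-site generation described above, here is a minimal Python sketch. It uses the standard library's `string.Template` as a stand-in for Jinja so that it runs without extra dependencies. The template text, the `site` and `cephConf` keys, and the site names are hypothetical and are not taken from this PR's templates.

```python
from string import Template

# Hypothetical stand-in for a values.yaml template. The real templates in
# this PR are Jinja2 files; string.Template is used only to keep the sketch
# dependency-free. The keys below are illustrative, not the PR's schema.
VALUES_TEMPLATE = Template(
    "data:\n"
    "  site: $az\n"
    "  cephConf: /etc/ceph/$az.conf\n"
)


def render_site_values(sites):
    """Return a mapping of site name to rendered values text, one per site."""
    return {az: VALUES_TEMPLATE.substitute(az=az) for az in sites}


if __name__ == "__main__":
    # Hypothetical site names; a CI job would loop over its own site list.
    for az, text in render_site_values(["az1", "az2"]).items():
        print(f"--- values for {az} ---")
        print(text)
```

The point of the sketch is only that one template is rendered once per site, so each DCN site ends up with its own copy of the values files.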

openshift-ci · 2024-09-09T18:03:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sbekkerm
Once this PR has been reviewed and has the lgtm label, please assign fultonj for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2024-09-09T18:03:56Z

Hi @sbekkerm. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

abays · 2024-09-16T09:25:47Z

examples/va/dcn/README.md

@@ -0,0 +1,42 @@
+# Distributed Compute Node (DCN) OpenStack Architecture with HCI and Ceph
+
+**Based on OpenStack K8S operators from the "main" branch of the [OpenStack Operator repo](https://github.com/openstack-k8s-operators/openstack-operator/commit/aa63bf3931f74722dd48af8a0914233b2b384330) on Dec 19th, 2023**


Please update this to indicate which commit of OpenStack operator you used to test this.

abays · 2024-09-16T09:56:26Z

automation/vars/dcn.yaml

This seems to be missing stages for the data plane. Or was that intentional?

I removed the data plane deployment to avoid issues with the Zuul jobs. The deployment is handled by a custom Ansible playbook because, for DCN, we need to repeat Steps 3 and 4 and the Ceph installation for each DCN site. This PR has been tested downstream in job #665.

Do you want to put that playbook in a ci-framework patch and document that it is something to be run separately?

https://github.com/openstack-k8s-operators/ci-framework/tree/main/playbooks

Have a look at this:

https://github.com/openstack-k8s-operators/architecture/blob/main/automation/vars/hci.yaml#L50-L54

In theory, you could add a post_stage_run in this PR which calls your new playbook and have a depends-on.

What do you think @cjeanner ?

fultonj · 2024-09-16T12:41:25Z

@sbekkerm I see two changes needed up front.

Written Instructions

The README files are incomplete. Please see the four stage READMEs for VA HCI:

https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages

They contain English instructions that someone can read to implement the VA without ci-framework, using only the produced k8s manifests. If there are external automations for them, that's fine, but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the VA1 directions are still there and have not been updated. In my early example, someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).

https://github.com/fultonj/dcn?tab=readme-ov-file#steps

No code should be required to implement what I'm talking about for this request. Just written instructions.

VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

sbekkerm · 2024-09-16T13:21:33Z

The README already contains the instructions. All four DCN steps are almost the same as in the HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
Additionally, it mentions that Steps 3 and 4 and the Ceph installation need to be executed for each DCN site.

The main difference between DCN and the HCI VA is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18

and the post-ceph service-values.yaml contains Glance Multi Store configuration:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

Why should we use DT instead of VA?

fultonj · 2024-09-16T13:27:59Z

Yes, I understand, but I still need clear written instructions so that I (or anyone else) can reproduce this.

I want to collaborate with you on this and reproduce the results in my environment so I can find and fix bugs. I think the READMEs are missing too much, and filling them in will help other engineers and the docs team. At a very high level, the goal is reproducibility: https://en.wikipedia.org/wiki/Reproducibility

Why DT?

https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/README.md

I don't think this should be an update to an existing DT; it should be a new DT. Either way, it's a DT, not a VA.

I wouldn't want to hand the field the Jinja2 files. This is something we do for our CI, but it is not yet ready to be a full-blown VA that we could hand to someone in the field. Maybe it can evolve into a VA in the future. For now, in order to merge what you have, I think it should be a DT.

sbekkerm · 2024-09-16T14:45:44Z

The README contains all the steps to reproduce the environment. Could you please clarify what specifically is unclear?

These are the steps for deploying the control plane:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane.md
These are the steps for preparing nodes for the Ceph installation:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-pre-ceph.md
These are the steps for configuring nodes after the Ceph installation:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md

The Jinja templates are not involved in the deployment process; they are only used by CI to set the "CHANGEME" parameters in values.yaml and service-values.yaml.
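
For anyone following the README by hand, a check along these lines can confirm that no "CHANGEME" placeholder is left in a values file before applying it. This is only a sketch: the helper name and the sample fragment are hypothetical, and it assumes the placeholders contain the literal string CHANGEME as described above.

```python
def find_placeholders(text):
    """Return (line_number, stripped_line) pairs that still contain CHANGEME."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if "CHANGEME" in line
    ]


if __name__ == "__main__":
    # Hypothetical values.yaml fragment; the key names are illustrative only.
    sample = (
        "data:\n"
        "  rbd_secret_uuid: CHANGEME\n"
        "  replicas: 3\n"
    )
    for number, line in find_placeholders(sample):
        print(f"line {number}: {line}")
```

An empty result means every placeholder has been filled in, whether by hand or by the CI templates.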

fultonj · 2024-09-16T18:39:40Z

@sbekkerm

Please move this from the va directory to the dt directory.

I am putting it in my backlog to go through your README line by line and attempt to reproduce what you have deployed. I'll ask you clarifying questions along the way, which will point out what is incomplete in the READMEs.

sbekkerm · 2024-09-17T09:25:17Z

@fultonj

Moved from va to dt directory as requested.
Also added a high-level diagram to the README for more clarity. Let me know if you have any questions.

krcmarik

I am suggesting some changes based on what we had in the 17.1 OpenStack service configs and/or to make tempest tests pass. With the proposed changes applied manually, I managed to get the compute and volume tempest suites to pass all tests.

krcmarik · 2024-09-21T06:59:16Z

examples/dt/dcn/service-values.yaml

+      customServiceConfig: |
+        [DEFAULT]
+        enabled_backends = ceph
+        glance_api_servers = http://glance-az1-internal.openstack.svc:9292


TLS is enabled by default, so we should use the https endpoint:
https://glance-az1-internal.openstack.svc:9292

krcmarik · 2024-09-21T07:00:27Z

examples/dt/dcn/service-values.yaml

+      customServiceConfig: |
+        [DEFAULT]
+        enabled_backends = ceph
+        glance_api_servers = http://glance-az2-internal.openstack.svc:9292


TLS is enabled by default, so we should use the https endpoint:
https://glance-az2-internal.openstack.svc:9292

krcmarik · 2024-09-21T07:01:09Z

examples/dt/dcn/service-values.yaml.j2

+        [DEFAULT]
+        enabled_backends = ceph
+{% if 'ceph' not in _ceph.cifmw_ceph_client_cluster %}
+        glance_api_servers = http://glance-{{ _ceph.cifmw_ceph_client_cluster }}-internal.openstack.svc:9292


TLS is enabled by default, so we should use the https endpoint:
https://glance-{{ _ceph.cifmw_ceph_client_cluster }}-internal.openstack.svc:9292
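
A small check like the following could catch plain-http Glance endpoints in a rendered service config and show the suggested rewrite. It is a sketch under assumptions: the function names are hypothetical, and it only looks at `glance_api_servers` lines in the style shown in the diffs above.

```python
import re

# Matches a glance_api_servers setting that points at a plain http:// URL.
HTTP_GLANCE = re.compile(r"^([ \t]*glance_api_servers[ \t]*=[ \t]*)http://", re.MULTILINE)


def uses_plain_http(config_text):
    """Return True if any glance_api_servers line uses an http:// endpoint."""
    return HTTP_GLANCE.search(config_text) is not None


def to_https(config_text):
    """Rewrite http:// to https:// on glance_api_servers lines only."""
    return HTTP_GLANCE.sub(r"\1https://", config_text)


if __name__ == "__main__":
    before = (
        "[DEFAULT]\n"
        "enabled_backends = ceph\n"
        "glance_api_servers = http://glance-az1-internal.openstack.svc:9292\n"
    )
    print(to_https(before))
```

Other settings are left untouched, so the rewrite is safe to run on a config that already uses https.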

krcmarik · 2024-09-21T07:10:38Z

examples/dt/dcn/service-values.yaml

+  cinderAPI:
+    replicas: 3
+  cinderBackup:
+    replicas: 3


I'd disable cinder-backup by setting replicas to 0 for now because tempest does have a capability to create a backup in a different/specified AZ anyway

krcmarik · 2024-09-21T07:15:37Z

examples/dt/dcn/service-values.yaml

+data:
+  preserveJobs: false
+  cinderAPI:
+    replicas: 3


We need to define the default AZ so that Volume resources created in tempest tests, such as volume groups, have a proper AZ to be created in. Something like:

cinderAPI:
  replicas: 3
  customServiceConfig: |
    [DEFAULT]
    default_availability_zone = ceph

krcmarik · 2024-09-21T07:23:55Z

examples/dt/dcn/values.yaml.j2

+                images_rbd_glance_copy_poll_interval=15
+                images_rbd_glance_copy_timeout=600
+                rbd_user=openstack
+                rbd_secret_uuid={{ cifmw_ceph_client_fsid }}


We should set a few more parameters, namely the glance endpoint and the cross_az_attach parameter, in addition to the [libvirt] section. It could look like this:

[glance]
endpoint_override = https://glance-{{ _az }}-internal.openstack.svc:9292
valid_interfaces = internal
[cinder]
cross_az_attach = False
catalog_info = volumev3:cinderv3:internalURL
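
To make the per-AZ variation concrete, here is a minimal sketch that builds the suggested sections for a given AZ with Python's standard `configparser`. The function name is hypothetical, the option values are copied from the suggestion above, and this is not how the PR's Jinja template generates the file.

```python
import configparser
import io


def nova_az_config(az):
    """Return the suggested [glance] and [cinder] sections for one AZ as INI text."""
    # interpolation=None keeps values literal, so '%' would not be interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser["glance"] = {
        "endpoint_override": f"https://glance-{az}-internal.openstack.svc:9292",
        "valid_interfaces": "internal",
    }
    parser["cinder"] = {
        "cross_az_attach": "False",
        "catalog_info": "volumev3:cinderv3:internalURL",
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(nova_az_config("az1"))
```

Only the Glance endpoint changes between sites; the [cinder] options are the same for every AZ.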

krcmarik · 2024-09-21T07:25:21Z

examples/dt/dcn/values.yaml.j2

+                images_type=rbd
+                images_rbd_pool=vms
+                images_rbd_ceph_conf=/etc/ceph/{{ cifmw_ceph_client_cluster }}.conf
+                images_rbd_glance_store_name=default_backend


I believe it should be:
images_rbd_glance_store_name={{ cifmw_ceph_client_cluster }}

For example for az1:
images_rbd_glance_store_name=az1

krcmarik · 2024-09-21T07:30:35Z

examples/dt/dcn/service-values.yaml.j2

+  nova:
+      customServiceConfig: |
+        [DEFAULT]
+        default_schedule_zone=az0


I can't see the default AZ set anywhere in the deployed environment, and the default AZ for nova should be az0 (to be consistent with other parts of the generated CRs). In my opinion this should be:

nova:
  template:
    apiServiceTemplate:
      customServiceConfig: |
        [DEFAULT]
        default_schedule_zone = az0

krcmarik · 2024-09-21T07:32:06Z

examples/dt/dcn/service-values.yaml.j2

+  cinderAPI:
+    replicas: 3
+  cinderBackup:
+    replicas: 3


I'd disable cinder-backup by setting replicas to 0 for now because tempest does have a capability to create a backup in a different/specified AZ anyway

krcmarik · 2024-09-21T07:33:14Z

examples/dt/dcn/service-values.yaml.j2

+    config.kubernetes.io/local-config: "true"
+data:
+  preserveJobs: false
+  cinderAPI:


We need to define the default AZ so that Volume resources created in tempest tests, such as volume groups, have a proper AZ to be created in. Something like:

cinderAPI:
  replicas: 3
  customServiceConfig: |
    [DEFAULT]
    default_availability_zone = ceph

Introduce support for DCN VA

c55407c

openshift-ci bot requested review from leifmadsen and raukadah September 9, 2024 18:03

openshift-ci bot added the needs-ok-to-test label Sep 9, 2024

fultonj self-requested a review September 11, 2024 12:15

abays reviewed Sep 16, 2024

Move DCN from VA to DT

359fd17

sbekkerm changed the title ~~Introduce DCN VA~~ Introduce DCN DT Sep 17, 2024

Fix incorrect path in automation variables

d38a463

krcmarik reviewed Sep 21, 2024
