-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce DCN DT #392
base: main
Are you sure you want to change the base?
Introduce DCN DT #392
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sbekkerm The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @sbekkerm. Thanks for your PR. I'm waiting for a openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
examples/va/dcn/README.md
Outdated
@@ -0,0 +1,42 @@ | |||
# Distributed Compute Node (DCN) OpenStack Architecture with HCI and Ceph | |||
|
|||
**Based on OpenStack K8S operators from the "main" branch of the [OpenStack Operator repo](https://github.com/openstack-k8s-operators/openstack-operator/commit/aa63bf3931f74722dd48af8a0914233b2b384330) on Dec 19th, 2023** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update this to indicate which commit of OpenStack operator you used to test this.
automation/vars/dcn.yaml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be missing stages for the data plane. Or was that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the data plane deployment to avoid issues with the Zuul jobs. The deployment is handled by a custom ansible playbook since, for DCN, we need to repeat Steps 3, 4, and install Ceph for each DCN site. This PR has been tested within downstream, job #665
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to put that playbook in a ci-framwork patch and document that it's something to be run separately?
https://github.com/openstack-k8s-operators/ci-framework/tree/main/playbooks
Have a look at this:
https://github.com/openstack-k8s-operators/architecture/blob/main/automation/vars/hci.yaml#L50-L54
In theory, you could add a post_stage_run
in this PR which calls your new playbook and have a depends-on.
What do you think @cjeanner ?
@sbekkerm I see two changes needed up front.
The readme files are incomplete. Please see the the four stage readme's for VA HCI: https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the https://github.com/fultonj/dcn?tab=readme-ov-file#steps No code should be required to implement what I'm talking about for this request. Just written instructions.
Would you please change this so that it puts the added files into the |
The main difference between VA and HCI is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18 and the post-ceph service-values.yaml contains Glance Multi Store configuration:
|
Yes, I understand but I still need clear written instructions so I (or anyone else) can reproduce. I want to collaborate with you on this and reproduce the results in my environment so I can find and fix bugs. I think the READMEs are missing too much and filling them in will help other engineers and the docs team. From a very high level it's so we can have https://en.wikipedia.org/wiki/Reproducibility
https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/README.md I don't think this should be an update to an existing DT, it should be a new DT, but it's a DT, not a VA. I wouldn't want to hand the field the Jinja2 files. This is something we do for our CI but not yet ready to be a full blown VA we could hand to someone in the field. Maybe it can evolve into a VA in the future. For now, in order to merge what you have I think it should be a DT. |
The README contains all the steps to reproduce the environment. Could you please clarify what specifically is unclear?
The Jinja templates are not involved in the deployment process, as they are only used by CI to set the "CHANGEME" parameters in values.yaml and service-valiues.yaml |
Please move this from the I am putting it in my backlog to go through your README line by line and attempt to reproduce what you have deployed and I'll ask you clarifying questions along the way which will point out what is incomplete in the READMEs. |
Moved from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am suggesting some changes based on what we had in 17.1 openstack services configs and/or to make tempest tests pass (I've managed to make compute and volume tempest suites to pass all tests with the proposed changes which I applied manually)
customServiceConfig: | | ||
[DEFAULT] | ||
enabled_backends = ceph | ||
glance_api_servers = http://glance-az1-internal.openstack.svc:9292 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TLS is enabled by default so we should use https endpoint:
https://glance-az1-internal.openstack.svc:9292
customServiceConfig: | | ||
[DEFAULT] | ||
enabled_backends = ceph | ||
glance_api_servers = http://glance-az2-internal.openstack.svc:9292 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TLS is enabled by default so we should use https endpoint:
https://glance-az2-internal.openstack.svc:9292
[DEFAULT] | ||
enabled_backends = ceph | ||
{% if 'ceph' not in _ceph.cifmw_ceph_client_cluster %} | ||
glance_api_servers = http://glance-{{ _ceph.cifmw_ceph_client_cluster }}-internal.openstack.svc:9292 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TLS is enabled by default so we should use https endpoint:
https://glance-{{ _ceph.cifmw_ceph_client_cluster }}-internal.openstack.svc:9292
cinderAPI: | ||
replicas: 3 | ||
cinderBackup: | ||
replicas: 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd disable cinder-backup by setting replicas to 0 for now because tempest does have a capability to create a backup in a different/specified AZ anyway
data: | ||
preserveJobs: false | ||
cinderAPI: | ||
replicas: 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to define the default AZ so some Volume resources such as Volumes groups created in tempest tests have a proper AZ to be created at, something like:
cinderAPI:
replicas: 3
customServiceConfig: |
[DEFAULT]
default_availability_zone = ceph
images_rbd_glance_copy_poll_interval=15 | ||
images_rbd_glance_copy_timeout=600 | ||
rbd_user=openstack | ||
rbd_secret_uuid={{ cifmw_ceph_client_fsid }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add some more parameters to be set, the glance endpoint and the cross_az_attach parameter, so It could look like:
[glance]
endpoint_override = https://glance-{{ _az }}-internal.openstack.svc:9292
valid_interfaces = internal
[cinder]
cross_az_attach = False
catalog_info = volumev3:cinderv3:internalURL
Additionally to the [libvirt] section
images_type=rbd | ||
images_rbd_pool=vms | ||
images_rbd_ceph_conf=/etc/ceph/{{ cifmw_ceph_client_cluster }}.conf | ||
images_rbd_glance_store_name=default_backend |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe It should be:
images_rbd_glance_store_name={{ cifmw_ceph_client_cluster }}
For example for az1:
images_rbd_glance_store_name=az1
nova: | ||
customServiceConfig: | | ||
[DEFAULT] | ||
default_schedule_zone=az0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see the default AZ set anywhere on deployed env and default AZ for nova should be az0 If It's done the way (to be consistent with other parts of generated CRs). This should imo be
nova:
template:
apiServiceTemplate:
customServiceConfig: |
[DEFAULT]
default_schedule_zone = az0
cinderAPI: | ||
replicas: 3 | ||
cinderBackup: | ||
replicas: 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd disable cinder-backup by setting replicas to 0 for now because tempest does have a capability to create a backup in a different/specified AZ anyway
config.kubernetes.io/local-config: "true" | ||
data: | ||
preserveJobs: false | ||
cinderAPI: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to define the default AZ so some Volume resources such as Volumes groups created in tempest tests have a proper AZ to be created at, something like:
cinderAPI:
replicas: 3
customServiceConfig: |
[DEFAULT]
default_availability_zone = ceph
This PR introduces DCN VA, which builds upon the HCI VA architecture and is designed for multi-site deployment.
In addition to the regular configuration files, this PR includes Jinja templates for generating values.yaml and service-values.yaml files. These templates are essential for Zuul job execution, allowing for the creation of site-specific configuration files multiple times for each DCN site.