Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce DCN DT #392

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

sbekkerm
Copy link

@sbekkerm sbekkerm commented Sep 9, 2024

This PR introduces DCN VA, which builds upon the HCI VA architecture and is designed for multi-site deployment.

In addition to the regular configuration files, this PR includes Jinja templates for generating values.yaml and service-values.yaml files. These templates are essential for Zuul job execution, allowing for the creation of site-specific configuration files multiple times for each DCN site.

Copy link

openshift-ci bot commented Sep 9, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sbekkerm
Once this PR has been reviewed and has the lgtm label, please assign fultonj for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

openshift-ci bot commented Sep 9, 2024

Hi @sbekkerm. Thanks for your PR.

I'm waiting for a openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@@ -0,0 +1,42 @@
# Distributed Compute Node (DCN) OpenStack Architecture with HCI and Ceph

**Based on OpenStack K8S operators from the "main" branch of the [OpenStack Operator repo](https://github.com/openstack-k8s-operators/openstack-operator/commit/aa63bf3931f74722dd48af8a0914233b2b384330) on Dec 19th, 2023**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this to indicate which commit of OpenStack operator you used to test this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be missing stages for the data plane. Or was that intentional?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the data plane deployment to avoid issues with the Zuul jobs. The deployment is handled by a custom ansible playbook since, for DCN, we need to repeat Steps 3, 4, and install Ceph for each DCN site. This PR has been tested within downstream, job #665

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to put that playbook in a ci-framwork patch and document that it's something to be run separately?

https://github.com/openstack-k8s-operators/ci-framework/tree/main/playbooks

Have a look at this:

https://github.com/openstack-k8s-operators/architecture/blob/main/automation/vars/hci.yaml#L50-L54

In theory, you could add a post_stage_run in this PR which calls your new playbook and have a depends-on.

What do you think @cjeanner ?

@fultonj
Copy link
Contributor

fultonj commented Sep 16, 2024

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:

https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages

It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).

https://github.com/fultonj/dcn?tab=readme-ov-file#steps

No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

@sbekkerm
Copy link
Author

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:

https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages

It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).

https://github.com/fultonj/dcn?tab=readme-ov-file#steps

No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

  1. It contains the instructions. All four DCN steps are almost the same as HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
    Additionally, it mentions that Steps 3, 4, and the Ceph installation need to be executed for each DCN site

The main difference between VA and HCI is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18

and the post-ceph service-values.yaml contains Glance Multi Store configuration:
https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

  1. Why should we use DT instead of VA?

@fultonj
Copy link
Contributor

fultonj commented Sep 16, 2024

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:
https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages
It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).
https://github.com/fultonj/dcn?tab=readme-ov-file#steps
No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

  1. It contains the instructions. All four DCN steps are almost the same as HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
    Additionally, it mentions that Steps 3, 4, and the Ceph installation need to be executed for each DCN site

The main difference between VA and HCI is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18

and the post-ceph service-values.yaml contains Glance Multi Store configuration: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

Yes, I understand but I still need clear written instructions so I (or anyone else) can reproduce.

I want to collaborate with you on this and reproduce the results in my environment so I can find and fix bugs. I think the READMEs are missing too much and filling them in will help other engineers and the docs team. From a very high level it's so we can have https://en.wikipedia.org/wiki/Reproducibility

  1. Why DT?

https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/README.md

I don't think this should be an update to an existing DT, it should be a new DT, but it's a DT, not a VA.

I wouldn't want to hand the field the Jinja2 files. This is something we do for our CI but not yet ready to be a full blown VA we could hand to someone in the field. Maybe it can evolve into a VA in the future. For now, in order to merge what you have I think it should be a DT.

@sbekkerm
Copy link
Author

@sbekkerm I see two changes needed up front.

  1. Written Instructions

The readme files are incomplete. Please see the the four stage readme's for VA HCI:
https://github.com/openstack-k8s-operators/architecture/tree/main/examples/va/hci#stages
It contains English instructions that someone can read to implement the VA without ci-framework and using only the produced k8s manifests. If there are external automations for them, that's fine but I should be able to read the directions and reproduce your work so that we can have independent verification. Right now it looks like the the VA1 directions are still there and not updated. In my early example someone could read my directions and get a full deployment (and the extra directory with scripts can technically be ignored).
https://github.com/fultonj/dcn?tab=readme-ov-file#steps
No code should be required to implement what I'm talking about for this request. Just written instructions.

  1. VA vs DT

Would you please change this so that it puts the added files into the dt directory instead of the va directory?

  1. It contains the instructions. All four DCN steps are almost the same as HCI VA, except for the post-nova actions, which are already covered here: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md#finalize-nova-computes
    Additionally, it mentions that Steps 3, 4, and the Ceph installation need to be executed for each DCN site

The main difference between VA and HCI is in the values.yaml and service-values.yaml files. For example, the nncp values.yaml contains the configuration necessary for spine and leaf: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane/nncp/values.yaml#L18
and the post-ceph service-values.yaml contains Glance Multi Store configuration: https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/service-values.yaml#L117

Yes, I understand but I still need clear written instructions so I (or anyone else) can reproduce.

I want to collaborate with you on this and reproduce the results in my environment so I can find and fix bugs. I think the READMEs are missing too much and filling them in will help other engineers and the docs team. From a very high level it's so we can have https://en.wikipedia.org/wiki/Reproducibility

  1. Why DT?

https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/README.md

I don't think this should be an update to an existing DT, it should be a new DT, but it's a DT, not a VA.

I wouldn't want to hand the field the Jinja2 files. This is something we do for our CI but not yet ready to be a full blown VA we could hand to someone in the field. Maybe it can evolve into a VA in the future. For now, in order to merge what you have I think it should be a DT.

The README contains all the steps to reproduce the environment. Could you please clarify what specifically is unclear?

  1. These steps for deploying the control plane:
    https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/control-plane.md
  2. These steps for preparing nodes for ceph installation:
    https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-pre-ceph.md
  3. These steps for configuring nodes after ceph installation:
    https://github.com/sbekkerm/architecture/blob/dcn/examples/va/dcn/dataplane-post-ceph.md

The Jinja templates are not involved in the deployment process, as they are only used by CI to set the "CHANGEME" parameters in values.yaml and service-valiues.yaml

@fultonj
Copy link
Contributor

fultonj commented Sep 16, 2024

@sbekkerm

Please move this from the va directory to the dt directory.

I am putting it in my backlog to go through your README line by line and attempt to reproduce what you have deployed and I'll ask you clarifying questions along the way which will point out what is incomplete in the READMEs.

@sbekkerm
Copy link
Author

@fultonj

Moved from va to dt directory as requested.
Also added a high-level diagram to the README for more clarity. Let me know if you have any questions.

@sbekkerm sbekkerm changed the title Introduce DCN VA Introduce DCN DT Sep 17, 2024
Copy link

@krcmarik krcmarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am suggesting some changes based on what we had in 17.1 openstack services configs and/or to make tempest tests pass (I've managed to make compute and volume tempest suites to pass all tests with the proposed changes which I applied manually)

customServiceConfig: |
[DEFAULT]
enabled_backends = ceph
glance_api_servers = http://glance-az1-internal.openstack.svc:9292

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TLS is enabled by default so we should use https endpoint:
https://glance-az1-internal.openstack.svc:9292

customServiceConfig: |
[DEFAULT]
enabled_backends = ceph
glance_api_servers = http://glance-az2-internal.openstack.svc:9292

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TLS is enabled by default so we should use https endpoint:
https://glance-az2-internal.openstack.svc:9292

[DEFAULT]
enabled_backends = ceph
{% if 'ceph' not in _ceph.cifmw_ceph_client_cluster %}
glance_api_servers = http://glance-{{ _ceph.cifmw_ceph_client_cluster }}-internal.openstack.svc:9292

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TLS is enabled by default so we should use https endpoint:
https://glance-{{ _ceph.cifmw_ceph_client_cluster }}-internal.openstack.svc:9292

cinderAPI:
replicas: 3
cinderBackup:
replicas: 3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd disable cinder-backup by setting replicas to 0 for now because tempest does have a capability to create a backup in a different/specified AZ anyway

data:
preserveJobs: false
cinderAPI:
replicas: 3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to define the default AZ so some Volume resources such as Volumes groups created in tempest tests have a proper AZ to be created at, something like:

cinderAPI:
    replicas: 3
    customServiceConfig: |
      [DEFAULT]
        default_availability_zone = ceph

images_rbd_glance_copy_poll_interval=15
images_rbd_glance_copy_timeout=600
rbd_user=openstack
rbd_secret_uuid={{ cifmw_ceph_client_fsid }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some more parameters to be set, the glance endpoint and the cross_az_attach parameter, so It could look like:


    [glance]
    endpoint_override = https://glance-{{ _az }}-internal.openstack.svc:9292
    valid_interfaces = internal
    [cinder]
    cross_az_attach = False
    catalog_info = volumev3:cinderv3:internalURL

Additionally to the [libvirt] section

images_type=rbd
images_rbd_pool=vms
images_rbd_ceph_conf=/etc/ceph/{{ cifmw_ceph_client_cluster }}.conf
images_rbd_glance_store_name=default_backend

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe It should be:
images_rbd_glance_store_name={{ cifmw_ceph_client_cluster }}

For example for az1:
images_rbd_glance_store_name=az1

Comment on lines +134 to +137
nova:
customServiceConfig: |
[DEFAULT]
default_schedule_zone=az0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see the default AZ set anywhere on deployed env and default AZ for nova should be az0 If It's done the way (to be consistent with other parts of generated CRs). This should imo be

nova:
    template:
      apiServiceTemplate:
        customServiceConfig: |
          [DEFAULT]
          default_schedule_zone = az0

cinderAPI:
replicas: 3
cinderBackup:
replicas: 3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd disable cinder-backup by setting replicas to 0 for now because tempest does have a capability to create a backup in a different/specified AZ anyway

config.kubernetes.io/local-config: "true"
data:
preserveJobs: false
cinderAPI:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to define the default AZ so some Volume resources such as Volumes groups created in tempest tests have a proper AZ to be created at, something like:

cinderAPI:
    replicas: 3
    customServiceConfig: |
      [DEFAULT]
       default_availability_zone = ceph

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants