design doc for the externally-manage-pf support
Signed-off-by: Sebastian Sch <[email protected]>
---
title: Externally Manage PF
authors:
  - SchSeba
reviewers:
  - zeeke
  - adrianchiris
creation-date: 12-07-2023
last-updated: 12-07-2023
---

# Externally Manage PF

## Summary

Allow the SR-IOV network operator to configure and allocate a subset of the virtual functions of
a physical function that is configured outside of the SR-IOV network operator.

## Motivation

This feature lets the operator configure only a subset of a PF's virtual functions.
A third-party component such as nmstate, kubernetes-nmstate, or NetworkManager can then handle the creation
and use of the virtual functions on the system. Examples include using a virtual function as the primary
NIC for the k8s SDN network or for a storage network.

Before this change, the SR-IOV network operator was the only component expected to configure and use VFs, so the user
could not use some of the VFs for host networking.

### Use Cases

* As a user, I want to use a virtual function for the SDN network. The SDN network needs to be configured before
  k8s is deployed, so these VFs should be available at system startup, before pods start running
* As a user, I want to create the virtual functions via nmstate
* As a user, I want pods to use virtual functions from a pre-configured PF
* As a user, I want to allocate virtual functions to pods from a PF with a custom configuration/driver
* As a user, I want virtual functions to be configured for the storage subsystem before k8s is deployed and before pods start at system startup

### Goals

* Allow the SR-IOV network operator to handle the configuration and pod allocation of a subset of virtual functions
* Allow the user to choose how many virtual functions to allocate for the system and which subset to make available to pods
* Do not reset the `numOfVfs` for PFs that are externally managed

### Non-Goals

* Supporting switchdev mode (may change in the future if there is a request)

## Proposal

Create a sub-flow in the SR-IOV network operator where the user can request a configuration for a subset of virtual functions
without any changes at the PF level.

The operator will first validate that the requested PF has the requested number of virtual functions allocated. It
will also validate that the requested MTU is configured as expected on the PF.
If either check fails, the `sriovNetworkNodeState.status.SyncStatus` field will report a `Failed` status.

Then the operator will configure the subset of virtual functions with the requested driver and will update the device plugin
configmap with the information needed to create the relevant pools.

Existing flow:
1. Apply the `numOfVfs`
2. Configure the MTU on the PF
3. Copy the administrative MAC address from the VFs
4. Bind the right driver for the VF
5. Create SR-IOV device plugin pools

Externally managed flow:
1. Copy the administrative MAC address from the VFs
2. Bind the right driver for the VF
3. Create SR-IOV device plugin pools

In both flows:
* In case of the InfiniBand link type, it will generate a random node and port GUID for the interface.
* In case of RDMA (both for ETH and IB), it will perform an unbind/bind of the VF driver to set the RDMA Node/Port GUID.
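
The difference between the two flows can be sketched as a small step planner. This is only an illustration of the lists above: `planSteps` and the step labels are hypothetical names chosen for this sketch, not identifiers from the operator code base.

```go
package main

// Step labels mirror the list items above; they are illustrative names,
// not identifiers from the operator code base.
const (
	stepApplyNumVfs  = "apply numOfVfs"
	stepSetPFMTU     = "configure MTU on the PF"
	stepCopyAdminMAC = "copy administrative MAC address"
	stepBindDriver   = "bind VF driver"
	stepCreatePools  = "create SR-IOV device plugin pools"
)

// planSteps returns the ordered steps for one PF. When the PF is externally
// managed, the two PF-level steps are skipped and only the VF-level steps remain.
func planSteps(externallyManaged bool) []string {
	var steps []string
	if !externallyManaged {
		steps = append(steps, stepApplyNumVfs, stepSetPFMTU)
	}
	return append(steps, stepCopyAdminMAC, stepBindDriver, stepCreatePools)
}
```

Calling `planSteps(true)` yields the three VF-level steps, while `planSteps(false)` yields all five steps of the existing flow.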

### Workflow Description

The user will allocate the virtual functions on the system with any third-party tool such as nmstate, kubernetes-nmstate,
systemd scripts, etc.

Then the user will be able to create a policy telling the operator that the PF is externally managed by the user.

#### Policy Example:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nic-1
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames: ["ens3f0#5-9"]
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 10
  priority: 99
  resourceName: sriov_nic_1
  externallyManaged: true
```
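
The `pfNames` entry `ens3f0#5-9` in the policy above selects a subset of the PF's virtual functions, read here as VF indexes 5 through 9 of `ens3f0`. A minimal sketch of splitting such a selector is shown below. `splitPFSelector` and `vfRange` are hypothetical names for this illustration, and the error handling is an assumption rather than the operator's actual behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vfRange is a hypothetical type for this sketch: an inclusive range of VF indexes.
type vfRange struct {
	First, Last int
}

// splitPFSelector splits a selector of the form "name" or "name#first-last".
// It returns a nil range when no "#first-last" suffix is present, which this
// sketch treats as "all VFs of the PF".
func splitPFSelector(sel string) (string, *vfRange, error) {
	name, rng, hasRange := strings.Cut(sel, "#")
	if name == "" {
		return "", nil, fmt.Errorf("empty PF name in %q", sel)
	}
	if !hasRange {
		return name, nil, nil
	}
	firstStr, lastStr, ok := strings.Cut(rng, "-")
	if !ok {
		return "", nil, fmt.Errorf("range in %q is not of the form first-last", sel)
	}
	first, err := strconv.Atoi(firstStr)
	if err != nil {
		return "", nil, fmt.Errorf("invalid first VF index in %q: %w", sel, err)
	}
	last, err := strconv.Atoi(lastStr)
	if err != nil {
		return "", nil, fmt.Errorf("invalid last VF index in %q: %w", sel, err)
	}
	if first < 0 || last < first {
		return "", nil, fmt.Errorf("invalid VF range %d-%d in %q", first, last, sel)
	}
	return name, &vfRange{First: first, Last: last}, nil
}
```

With this sketch, `splitPFSelector("ens3f0#5-9")` returns the name `ens3f0` and the range 5 to 9, while `splitPFSelector("ens3f0")` returns the name and a nil range.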
#### Another Policy Example:
In this case we allocate all the virtual functions from the PF:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nic-2
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames: ["ens3f0"]
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 10
  priority: 99
  resourceName: sriov_nic_1
  externallyManaged: true
```
#### Validation
The SR-IOV network operator validation webhook will check that the requested `numVfs` is equal to the number of virtual functions the user allocated.
If not, it will reject the policy creation.

The SR-IOV network operator validation webhook will also check that the requested MTU is equal to the MTU that exists on the PF.
If not, it will reject the policy creation.

*Note:* The same validation will be done in the SR-IOV config-daemon container to cover cases where the user doesn't want to deploy
the webhook and to cover scale-up when new nodes are added. If the verification fails in the policy apply stage,
the `sriovNetworkNodeState.status.SyncStatus` field will report a `Failed` status and the error description will
be exposed in `sriovNetworkNodeState.status.LastSyncError`.

#### Configuration

The SR-IOV network operator config daemon will reconcile on the SriovNetworkNodeState update and will follow the regular
flow of virtual functions, *SKIPPING* only the virtual function allocation.

The SR-IOV network operator will update the SR-IOV Network Device Plugin with the pool information.

Another change in the operator behavior is that when we delete a policy that had `externallyManaged: true`, the SR-IOV network operator
will *NOT* reset the `numVfs`.

### API Extensions

For SriovNetworkNodePolicy

```golang
// SriovNetworkNodePolicySpec defines the desired state of SriovNetworkNodePolicy
type SriovNetworkNodePolicySpec struct {
	// SRIOV Network device plugin endpoint resource name
	ResourceName string `json:"resourceName"`
	// NodeSelector selects the nodes to be configured
	NodeSelector map[string]string `json:"nodeSelector"`
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=99
	// Priority of the policy, higher priority policies can override lower ones.
	Priority int `json:"priority,omitempty"`
	// +kubebuilder:validation:Minimum=1
	// MTU of VF
	Mtu int `json:"mtu,omitempty"`
	// +kubebuilder:validation:Minimum=0
	// Number of VFs for each PF
	NumVfs int `json:"numVfs"`
	// NicSelector selects the NICs to be configured
	NicSelector SriovNetworkNicSelector `json:"nicSelector"`
	// +kubebuilder:validation:Enum=netdevice;vfio-pci
	// The driver type for configured VFs. Allowed value "netdevice", "vfio-pci". Defaults to netdevice.
	DeviceType string `json:"deviceType,omitempty"`
	// RDMA mode. Defaults to false.
	IsRdma bool `json:"isRdma,omitempty"`
	// mount vhost-net device. Defaults to false.
	NeedVhostNet bool `json:"needVhostNet,omitempty"`
	// +kubebuilder:validation:Enum=eth;ETH;ib;IB
	// NIC Link Type. Allowed value "eth", "ETH", "ib", and "IB".
	LinkType string `json:"linkType,omitempty"`
	// +kubebuilder:validation:Enum=legacy;switchdev
	// NIC Device Mode. Allowed value "legacy","switchdev".
	EswitchMode string `json:"eSwitchMode,omitempty"`
	// +kubebuilder:validation:Enum=virtio
	// VDPA device type. Allowed value "virtio"
	VdpaType string `json:"vdpaType,omitempty"`
	// Exclude device's NUMA node when advertising this resource by SRIOV network device plugin. Default to false.
	ExcludeTopology bool `json:"excludeTopology,omitempty"`
+	// Don't create the virtual functions; only bind them to the driver and allocate them to the device plugin. Defaults to false.
+	ExternallyManaged bool `json:"externallyManaged,omitempty"`
}
```

For SriovNetworkNodeState

```golang
type Interface struct {
	PciAddress  string    `json:"pciAddress"`
	NumVfs      int       `json:"numVfs,omitempty"`
	Mtu         int       `json:"mtu,omitempty"`
	Name        string    `json:"name,omitempty"`
	LinkType    string    `json:"linkType,omitempty"`
	EswitchMode string    `json:"eSwitchMode,omitempty"`
	VfGroups    []VfGroup `json:"vfGroups,omitempty"`
+	ExternallyManaged bool `json:"externallyManaged,omitempty"`
}
```

### Implementation Details/Notes/Constraints

#### Webhook
For the webhook we add more validations when the policy contains `ExternallyManaged: true`:
* `numVfs` in the policy is equal to or lower than the number of virtual functions on the system
* `MTU` in the policy is equal to or lower than the MTU we discover on the PF
* `LinkType` in the policy equals the link type we discover on the PF
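
The three webhook checks above can be sketched as a single function. This is an illustration under stated assumptions, not the operator's implementation: `pfSettings` and `validateExternallyManaged` are hypothetical names, an empty `LinkType` in the policy is assumed to mean "not requested", and the link type comparison is assumed to be case-insensitive because the API accepts both `eth` and `ETH`.

```go
package main

import (
	"fmt"
	"strings"
)

// pfSettings is a hypothetical type for this sketch. It holds the three values
// compared by the webhook, either as requested by the policy or as discovered
// on the PF.
type pfSettings struct {
	NumVfs   int
	Mtu      int
	LinkType string
}

// validateExternallyManaged returns an error when the policy asks for more
// than the externally configured PF provides.
func validateExternallyManaged(policy, discovered pfSettings) error {
	if policy.NumVfs > discovered.NumVfs {
		return fmt.Errorf("numVfs %d exceeds the %d virtual functions found on the PF",
			policy.NumVfs, discovered.NumVfs)
	}
	if policy.Mtu > discovered.Mtu {
		return fmt.Errorf("MTU %d exceeds the MTU %d found on the PF",
			policy.Mtu, discovered.Mtu)
	}
	// Assumption: an empty LinkType means the policy did not request one,
	// and "eth"/"ETH" are treated as the same link type.
	if policy.LinkType != "" && !strings.EqualFold(policy.LinkType, discovered.LinkType) {
		return fmt.Errorf("link type %q does not match %q found on the PF",
			policy.LinkType, discovered.LinkType)
	}
	return nil
}
```

For a PF discovered with 10 VFs, MTU 1500 and link type `ETH`, a policy requesting 10 VFs passes, while a policy requesting 11 VFs or an MTU of 9000 is rejected.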

#### Controller/Manager

The changes in the manager for this feature are minimal: we only copy the `ExternallyManaged` boolean from the policy
to the generated `nodeState.Spec`.

#### Config Daemon

This is where most of the changes for this feature are implemented.

As a first step, the config-daemon performs the same validation as the webhook to check that the PF has everything needed to apply the requested
policy, by checking the `numVfs`, `MTU` and `LinkType`.
Next, the config-daemon skips all the PF configuration, such as `numVfs`, `MTU` and `LinkType`. It only performs the virtual function
driver binding, administrative MAC allocation and MTU configuration. In addition, in case of the InfiniBand link type it will
generate a random node and port GUID for the interface, and in case of RDMA (both for ETH and IB) it will perform an
unbind/bind of the VF driver to set the RDMA Node/Port GUID.
The last step, as always, is to reset the device plugin so that
kubelet will be able to discover the SR-IOV devices.

The config-daemon will also save a cache of the last applied policy on the node. This is needed to determine
whether we need to reset the PF configuration (`ExternallyManaged` was false) or not when a policy is removed.
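
The role of the cache can be sketched as a lookup made when a policy is removed. The names `cachedInterface` and `shouldResetPF` are hypothetical, and the behavior for a PF with no cache entry (reset, as in the existing flow) is an assumption of this sketch rather than something this document specifies.

```go
package main

// cachedInterface is a hypothetical type for this sketch: the part of the
// last applied policy that the config-daemon would need to remember per PF.
type cachedInterface struct {
	PciAddress        string
	ExternallyManaged bool
}

// shouldResetPF reports whether the PF configuration should be reset when the
// policy that covered it is removed. The cache is keyed by PCI address.
func shouldResetPF(cache map[string]cachedInterface, pciAddress string) bool {
	last, ok := cache[pciAddress]
	if !ok {
		// Assumption: without a cache entry the PF is treated as operator
		// managed, so it is reset as in the existing flow.
		return true
	}
	// Externally managed PFs keep their configuration, including numVfs.
	return !last.ExternallyManaged
}
```

A PF whose cached entry has `ExternallyManaged: true` is left untouched on policy removal, while a PF cached with `ExternallyManaged: false` is reset.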

### Upgrade & Downgrade considerations

The feature supports both upgrade and downgrade, as we are introducing a new optional field in the API.

### Test Plan

* Should not allow creating a policy with `externallyManaged: true` if there are no VFs configured
* Should create a policy if the number of requested VFs is equal to the number configured on the PF
* Should create a policy if the number of requested VFs is equal to the number configured on the PF, and not delete them when the policy is removed
* Should reset the virtual functions if `externallyManaged` is false