Add 'Parallel SR-IOV configuration' design document #479
Conversation
doc/design/parallel-node-config.md
Outdated
### API Extensions

#### Option 1: extend existing CR SriovNetworkPoolConfig
+1 on this option from me.
@e0ne if you are OK with it, let's remove the "Option 1: " string here
and place options 2 and 3 under a new subsection: "Alternative APIs",
unless you feel that we should go with a different approach in this design doc.
Also, feedback from others is appreciated on which approach is preferable.
I think @SchSeba also prefers this option.
Agreed
Nice to see this work!
I am really looking forward to this feature.
doc/design/parallel-node-config.md
Outdated
AnnoDraining = "Draining_Complete"

Drain controller will watch for node annotation changes, `SriovNetworkPoolConfig` and `SriovNetworkNodeState` changes:
+1
I think we should stop using the node annotation; this way we can also reduce the RBAC for the config-daemon to only watch SriovNetworkNodeState.
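A minimal sketch of what the narrowed config-daemon RBAC could look like, assuming kubebuilder markers and the sriovnetwork.openshift.io API group (the exact resource names and verbs here are illustrative, not the final markers):

```go
// Hypothetical, narrowed config-daemon permissions: only the node state objects,
// with no direct access to Node resources.
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworknodestates,verbs=get;list;watch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworknodestates/status,verbs=get;update;patch
```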
doc/design/parallel-node-config.md
Outdated
// OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
// NodeSelectorTerms is a list of node selectors to apply SriovNetworkPoolConfig
NodeSelectorTerms *v1.NodeSelector `json:"nodeSelectorTerms,omitempty"`
maybe just
nodeSelector:
  matchLabels:
    node-role.kubernetes.io/worker: ""
without the Terms
I think the idea was to re-use this:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.27/#nodeselector-v1-core
updated
doc/design/parallel-node-config.md
Outdated
* introduce new API changes to support pool of nodes to be drained in a parallel:
  at this phase we introduce new API from one of the proposed options above and modify Drain controller to watch for
  the specified CRs and proceed drain in a parallel per node pool configuration
* drop `NodeDrainAnnotation` usage and move this annotation values into the `SriovNetworkNodeState` object
agree, we should move this as part of the first change.
we must check how this will work in case the operator upgrade is done in the middle of the configuration.
doc/design/parallel-node-config.md
Outdated
All phases should be implemented one-by-one in a separate PRs in the order above.

### Upgrade & Downgrade considerations
This feature introduces changes to CRDs which means CRDs should be applied during upgrade or downgrade
I think we should have support for this, as we have users that run the sriov operator.
We need to be sure that the case where the operator gets updated in the middle of a drain is something we support and are able to recover from.
One more comment: we need to see how the machine config will work with parallel node draining for OpenShift.
doc/design/parallel-node-config.md
Outdated
## Proposal

Introduce nodes pool drain configuration to meet following requirements:
I really prefer not to copy the goals into the proposal.
Maybe something like:
"Introduce nodes pool drain configuration and controller to meet the Goals targets."
@e0ne, I gave this one another look.
Once my remaining comments are addressed, consider it LGTM from my side.
Great work, I am really waiting for this feature to go in!
I left some small comments.
// +kubebuilder:rbac:groups="",resources=SriovNetworkPoolConfig,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups="",resources=SriovNetworkNodeState,verbs=get;list;watch;update;patch
+1 to have it enumerated
doc/design/parallel-node-config.md
Outdated
`SriovNetworkNodeStates` update.

Config daemon will be responsible for setting `SriovNetworkNodeState.DrainStatus=Drain_Required` and
`SriovNetworkNodeState.DrainStatus=DrainComplete` only. It will simplify its implementation.
I don't think DrainComplete will be set by the config-daemon.
How will the controller know that the drain finished? It's less complicated to do it in the config daemon.
doc/design/parallel-node-config.md
Outdated
type SriovNetworkNodeStateStatus struct {
    Interfaces InterfaceExts `json:"interfaces,omitempty"`
    // +kubebuilder:validation:Enum=Idle;Draining,Draining_MCP_Paused,Succeeded,Failed,InProgress
    DrainStatus string `json:"drainStatus,omitempty"`
can we have this as a dedicated type? it would still be a string, just for easier handling
done
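For reference, a rough sketch of what such a dedicated string type could look like (the type and constant names here are illustrative, not the final API):

```go
package v1

// DrainState is a named string type so the drain constants and the
// status field share one type.
type DrainState string

const (
    DrainStateIdle     DrainState = "Idle"
    DrainStateRequired DrainState = "Drain_Required"
    DrainStateDraining DrainState = "Draining"
    DrainStateComplete DrainState = "DrainComplete"
)

// SriovNetworkNodeStateStatus sketch: only the drain-related field is shown.
type SriovNetworkNodeStateStatus struct {
    DrainStatus DrainState `json:"drainStatus,omitempty"`
}
```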
doc/design/parallel-node-config.md
Outdated
// OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
// NodeSelectorTerms is a list of node selectors to apply SriovNetworkPoolConfig
NodeSelectorTerms *v1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
can we change this to NodeSelector only like this?
IMO, NodeSelectorTerm is more flexible.
+1 to sticking with standard k8s NodeSelectorTerms.
These can be translated to a MachineConfig selector if needed.
WDYT?
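For illustration, a hedged sketch of how a v1.NodeSelectorTerm-based field can express more than plain matchLabels (the second label key and its value below are made up for the example):

```go
package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

func main() {
    // The worker-role label from the matchLabels suggestion, expressed as an
    // Exists requirement, plus an In requirement that matchLabels cannot express.
    term := v1.NodeSelectorTerm{
        MatchExpressions: []v1.NodeSelectorRequirement{
            {Key: "node-role.kubernetes.io/worker", Operator: v1.NodeSelectorOpExists},
            {Key: "example.com/sriov-capable", Operator: v1.NodeSelectorOpIn, Values: []string{"true"}},
        },
    }
    fmt.Printf("%+v\n", term)
}
```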
OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
// NodeSelectorTerms is a list of node selectors to apply SriovNetworkPoolConfig
NodeSelectorTerms *v1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
DrainConfig DrainConfigSpec `json:"drainConfig,omitempty"`
I don't think we need another struct; can we have something like:

| Field | Type | Description |
|---|---|---|
| maxUnavailable | integer-or-string | maxUnavailable specifies the percentage or constant number of machines that can be updating at any given time. Default is 1. |
It's introduced in case we decide to add more drain-related config options (e.g. false=true) so we won't need to change the API in the future.
+1 on having this as a struct for future extensibility.
can you please change to something like what we have in the PR?
// SriovNetworkPoolConfigSpec defines the desired state of SriovNetworkPoolConfig
type SriovNetworkPoolConfigSpec struct {
    // OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
    OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
    // nodeSelector specifies a label selector for Nodes
    NodeSelector *metav1.LabelSelector `json:"nodeSelector,omitempty"`
    // maxUnavailable defines either an integer number or percentage
    // of nodes in the pool that can go Unavailable during an update.
    //
    // A value larger than 1 will mean multiple nodes going unavailable during
    // the update, which may affect your workload stress on the remaining nodes.
    // Drain will respect Pod Disruption Budgets (PDBs) such as etcd quorum guards,
    // even if maxUnavailable is greater than one.
    MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
}
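To make the intent concrete, a hedged Go sketch of filling in such a spec (the stand-in types and the 25% value are only for illustration; they are not the operator's actual package layout):

```go
package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// Stand-ins for the operator types referenced by the proposed spec.
type OvsHardwareOffloadConfig struct{}

type SriovNetworkPoolConfigSpec struct {
    OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
    NodeSelector             *metav1.LabelSelector    `json:"nodeSelector,omitempty"`
    MaxUnavailable           *intstr.IntOrString      `json:"maxUnavailable,omitempty"`
}

func main() {
    // Allow up to 25% of the matched worker nodes to be drained at the same time.
    maxUnavailable := intstr.FromString("25%")
    spec := SriovNetworkPoolConfigSpec{
        NodeSelector: &metav1.LabelSelector{
            MatchLabels: map[string]string{"node-role.kubernetes.io/worker": ""},
        },
        MaxUnavailable: &maxUnavailable,
    }
    fmt.Printf("%+v\n", spec)
}
```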
doc/design/parallel-node-config.md
Outdated
Operator. To not introduce breaking changes we have to split this effort to several phases:
* implement new Drain controller without user-facing API changes:
  it will proceed only one node configuration at the same time and doesn't require API changes
* drop `SriovNetworkNodeState` usage and move this annotation values into the `SriovNetworkNodeState` object
I think the point here is to drop the annotation from the node object.
fixed
doc/design/parallel-node-config.md
Outdated
operator: "Exists"
drainConfig:
maxUnavailable: 5
ovsHardwareOffloadConfig: {}
the indentation here is not right I think
Also, drainConfig is already defined for this resource. Right?
Yeah, I believe this needs to be removed.
fixed
doc/design/parallel-node-config.md
Outdated
name: default
namespace: network-operator
spec:
priority: 0
the indentation here is not right I think
fixed
@e0ne, @SchSeba, I was thinking of the following scenario:
If my considerations are right, please add this scenario to an "Examples" section.
+1 on emphasizing that a node may belong to at most a single pool (the one with the highest priority).
yep, great example!
also, another point: if the node selector matches 2 pools and the priority is equal, we will select the first pool in alphabetical order.
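A rough sketch of the selection rule being described, assuming each pool carries a numeric priority and that a lower value means a more important pool (both the field and the ordering convention are assumptions, not the final design):

```go
package main

import (
    "fmt"
    "sort"
)

// Minimal stand-in for the pool fields that matter for selection.
type pool struct {
    Name     string
    Priority int // assumption: lower value == more important pool
}

// selectPool picks the single pool a node belongs to, out of the pools whose
// node selector already matched the node: most important priority first,
// alphabetical order on equal priority, so a node lands in at most one pool.
func selectPool(matched []pool) (pool, bool) {
    if len(matched) == 0 {
        return pool{}, false
    }
    sort.Slice(matched, func(i, j int) bool {
        if matched[i].Priority != matched[j].Priority {
            return matched[i].Priority < matched[j].Priority
        }
        return matched[i].Name < matched[j].Name
    })
    return matched[0], true
}

func main() {
    p, _ := selectPool([]pool{{"workers-b", 10}, {"workers-a", 10}, {"gpu", 0}})
    fmt.Println(p.Name) // "gpu": priority wins; an equal-priority tie would fall back to name
}
```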
doc/design/parallel-node-config.md
Outdated
## Motivation
SR-IOV Network Operator configures SR-IOV one node at a time and one nic at a same time. That means we’ll need to wait
hours or even days to configure all NICs on large cluster deployments. With multi-NICs deployments Operator configures
NICs one by one on each node which leads to a lot of unnecessary node drain calls during SR-IOV configuration. Also
are we really saving on drain calls?
even currently sriov-network-config-daemon will configure all relevant NICs on the node; it just does it serially after draining.
fixed
#### Extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is used only for OpenShift to provide configuration for
OVS Hardware Offloading. We can extend it to add configuration for the drain
can you add some info that SriovNetworkPoolConfig will now be relevant for all cluster types (Kubernetes and OpenShift)?
done
### Upgrade & Downgrade considerations
After operator upgrade we have to support `sriovnetwork.openshift.io/state` node annotation to
`SriovNetworkNodeState.Status.DrainStatus` migration. This logic will be implemented in a Drain controller and should be
shouldn't we implement the logic in the config daemon, as it is the one that triggers the drain flow?
just so we don't have a race condition where the config daemon updates the drain status AND the controller tries to migrate the annotation
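For illustration, a hedged sketch of a one-shot migration run by the config-daemon before it starts acting on its SriovNetworkNodeState (only the legacy annotation key comes from the design text; the new key, the function, and the client plumbing are assumptions):

```go
package migration

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

const (
    legacyNodeAnnotation     = "sriovnetwork.openshift.io/state"         // existing key on the Node object
    nodeStateDrainAnnotation = "sriovnetwork.openshift.io/desired-state" // assumed new key on SriovNetworkNodeState
)

// MigrateDrainAnnotation copies the legacy drain annotation from the Node onto
// the node's SriovNetworkNodeState (passed here as a generic client.Object) and
// then removes it from the Node, so the drain controller never sees two sources
// of truth while an upgrade is in flight.
func MigrateDrainAnnotation(ctx context.Context, c client.Client, node *corev1.Node, nodeState client.Object) error {
    old, ok := node.Annotations[legacyNodeAnnotation]
    if !ok {
        return nil // nothing to migrate
    }
    ann := nodeState.GetAnnotations()
    if ann == nil {
        ann = map[string]string{}
    }
    ann[nodeStateDrainAnnotation] = old
    nodeState.SetAnnotations(ann)
    if err := c.Update(ctx, nodeState); err != nil {
        return err
    }
    delete(node.Annotations, legacyNodeAnnotation)
    return c.Update(ctx, node)
}
```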
drain and use `drain lock` in config daemon anymore. The overall drain process will be covered by the following states:

```golang
DrainIdle = "Idle"
```
Hi @e0ne, can you please change this to what we have in the PR? It should be:

NodeDrainAnnotation             = "sriovnetwork.openshift.io/state"
NodeStateDrainAnnotation        = "sriovnetwork.openshift.io/desired-state"
NodeStateDrainAnnotationCurrent = "sriovnetwork.openshift.io/current-state"

DrainIdle      = "Idle"
DrainRequired  = "Drain_Required"
RebootRequired = "Reboot_Required"
Draining       = "Draining"
DrainComplete  = "DrainComplete"
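To show how the desired/current annotations above could act as a handshake between the config-daemon and the drain controller, a hedged sketch (only the keys and values come from the list above; the helper functions and client plumbing are assumptions):

```go
package drain

import (
    "context"

    "sigs.k8s.io/controller-runtime/pkg/client"
)

const (
    NodeStateDrainAnnotation        = "sriovnetwork.openshift.io/desired-state"
    NodeStateDrainAnnotationCurrent = "sriovnetwork.openshift.io/current-state"

    DrainRequired = "Drain_Required"
    DrainComplete = "DrainComplete"
)

// RequestDrain is what the config-daemon could call before reconfiguring NICs:
// it only records the desired state on its SriovNetworkNodeState (passed here
// as a generic client.Object) and leaves the actual cordon/drain to the drain
// controller, which honours the pool's maxUnavailable budget.
func RequestDrain(ctx context.Context, c client.Client, nodeState client.Object) error {
    ann := nodeState.GetAnnotations()
    if ann == nil {
        ann = map[string]string{}
    }
    ann[NodeStateDrainAnnotation] = DrainRequired
    nodeState.SetAnnotations(ann)
    return c.Update(ctx, nodeState)
}

// DrainFinished is how the config-daemon could tell that the drain controller
// has finished draining this node and NIC configuration may proceed.
func DrainFinished(nodeState client.Object) bool {
    return nodeState.GetAnnotations()[NodeStateDrainAnnotationCurrent] == DrainComplete
}
```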