This section summarizes the steps that may be needed during the entire lifecycle of PMEM in a cluster, starting with the initial preparations and ending with decommissioning the hardware. The other sections explain each step in more detail.
When setting up a cluster, the administrator must install the PMEM hardware on nodes and configure some or all of it for use by PMEM-CSI (see prerequisites). Nodes where PMEM-CSI is supposed to run must have a certain label in Kubernetes.
The administrator must install PMEM-CSI, either with the PMEM-CSI operator (recommended) or with the scripts and YAML files in the source code. A single PMEM-CSI installation uses either direct device mode or LVM device mode. It is possible to install PMEM-CSI twice on the same cluster with different modes, with these restrictions:
- The driver names must be different.
- The installations must run on different nodes by using different node labels, or the "usage" parameter of the LVM mode driver installation on a node must be set so that it leaves space available for the direct mode driver installation on that same node.
The administrator must decide which storage classes shall be available to users of the cluster. A storage class references a driver installation by name, which indirectly determines the device mode. A storage class also chooses which filesystem is used (xfs or ext4) and whether Kata Containers support is enabled.
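For illustration, a storage class along these lines could be defined; the class name is a placeholder, the provisioner must match the driver name of the installation, csi.storage.k8s.io/fstype is the standard CSI parameter for selecting the filesystem, and the kataContainers parameter is described later in this document:
$ kubectl create -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-example        # placeholder name
provisioner: pmem-csi.intel.com    # driver name of the installation
parameters:
  csi.storage.k8s.io/fstype: xfs   # or ext4
  kataContainers: "false"          # set to "true" only for Kata Containers pods
EOF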
Optionally, the administrator can enable the scheduler extensions (recommended) and monitoring of resource usage via the metrics support.
When using the operator, existing PMEM-CSI installations can be upgraded seamlessly by installing a newer version of the operator. Downgrading by installing an older version is also supported, but may need manual work which will be documented in the release notes.
When using YAML files, the only reliable way of up- or downgrading is to remove the installation and install anew.
Users can then create PMEM volumes via volume claims that reference the storage classes or via ephemeral inline volumes.
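As a sketch, a volume claim referencing such a storage class could look like this (claim name, size and class name are placeholders):
$ kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pmem-pvc                # placeholder name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
  storageClassName: pmem-csi-sc-example
EOF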
A node should only be removed from a cluster after ensuring that there is no pod running on it which uses PMEM and that there is no persistent volume (PV) on it. This can be checked via kubectl get -o yaml pv and looking for a nodeAffinity entry that references the node, or via metrics data for the node. When removing a node or even the entire PMEM-CSI driver installation too soon, attempts to remove pods or volumes via the Kubernetes API will fail. Administrators can recover from that by force-deleting PVs for which the underlying hardware has already been removed.
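For example, before removing a node, a check along these lines can confirm that no PMEM-CSI persistent volume is still bound to it (this assumes the pmem-csi.intel.com/node topology label mentioned later in this document; <node name> is a placeholder):
$ kubectl get pv -o yaml | grep -A 3 "key: pmem-csi.intel.com/node"
# Inspect the nodeAffinity entries in the output: if <node name> does not
# appear under "values", no PMEM-CSI volume is left on that node.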
By default, PMEM-CSI wipes volumes after usage (eraseAfter), so shredding PMEM hardware after decommissioning it is optional.
The recommended minimum Linux kernel version for running the PMEM-CSI driver is 4.15. See Persistent Memory Programming for more details about supported kernel versions.
Persistent memory device(s) are required for operation. However, some development and testing can be done using QEMU-emulated persistent memory devices. See the "QEMU and Kubernetes" section for the commands that create such a virtual test cluster.
The PMEM-CSI driver needs pre-provisioned regions on the NVDIMM device(s). The PMEM-CSI driver itself intentionally leaves that to the administrator who then can decide how much and how PMEM is to be used for PMEM-CSI.
Beware that the PMEM-CSI driver will run without errors on a node where PMEM was not prepared for it. It will then report zero local storage for that node, something that currently is only visible in the log files.
When running the Kubernetes cluster and PMEM-CSI on bare metal, the ipmctl utility can be used to create regions. App Direct Mode has two configuration options - interleaved or non-interleaved. In the non-interleaved configuration, one region is created per NVDIMM. In such a configuration, a PMEM-CSI volume cannot be larger than one NVDIMM.
Example of creating regions without interleaving, using all NVDIMMs:
$ ipmctl create -goal PersistentMemoryType=AppDirectNotInterleaved
Alternatively, multiple NVDIMMs can be combined to form an interleaved set. This causes the data to be striped over multiple NVDIMM devices for improved read/write performance and allows one region (and thus a PMEM-CSI volume) to be larger than a single NVDIMM.
Example of creating regions in interleaved mode, using all NVDIMMs:
$ ipmctl create -goal PersistentMemoryType=AppDirect
When running inside virtual machines, each virtual machine typically
already gets access to one region and ipmctl
is not needed inside
the virtual machine. Instead, that region must be made available for
use with PMEM-CSI because when the virtual machine comes up for the
first time, the entire region is already allocated for use as a single
block device:
$ ndctl list -RN
{
  "regions":[
    {
      "dev":"region0",
      "size":34357641216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"raw",
          "size":34357641216,
          "sector_size":512,
          "blockdev":"pmem0"
        }
      ]
    }
  ]
}
$ ls -l /dev/pmem*
brw-rw---- 1 root disk 259, 0 Jun 4 16:41 /dev/pmem0
Labels must be initialized in such a region; this has to be done once after the first boot:
$ ndctl disable-region region0
disabled 1 region
$ ndctl init-labels nmem0
initialized 1 nmem
$ ndctl enable-region region0
enabled 1 region
$ ndctl list -RN
[
  {
    "dev":"region0",
    "size":34357641216,
    "available_size":34357641216,
    "max_available_extent":34357641216,
    "type":"pmem",
    "iset_id":10248187106440278,
    "persistence_domain":"unknown"
  }
]
$ ls -l /dev/pmem*
ls: cannot access '/dev/pmem*': No such file or directory
This section assumes that a Kubernetes cluster is already available with at least one node that has persistent memory device(s). For development or testing, it is also possible to use a cluster that runs on QEMU virtual machines, see the "QEMU and Kubernetes" section.
- Make sure that the alpha feature gates CSINodeInfo and CSIDriverRegistry are enabled
The method to configure alpha feature gates may vary, depending on the Kubernetes deployment; one possible way with kubeadm is sketched after this list. It may not be necessary anymore when the feature has reached beta state, which depends on the Kubernetes version.
- Label the cluster nodes that provide persistent memory device(s)
$ kubectl label node <your node> storage=pmem
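One possible way to enable these feature gates on a cluster set up with kubeadm is sketched below; the file name and layout are illustrative, and other deployment methods have their own configuration mechanisms:
$ cat >kubeadm-feature-gates.config <<EOF
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "CSINodeInfo=true,CSIDriverRegistry=true"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CSINodeInfo: true
  CSIDriverRegistry: true
EOF
$ kubeadm init --config=kubeadm-feature-gates.config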
The PMEM-CSI driver can be deployed to a Kubernetes cluster either by using the PMEM-CSI operator or by using the reference YAML files provided in the source code.
The PMEM-CSI operator facilitates deploying and managing the PMEM-CSI driver on a Kubernetes cluster.
- Setup deployment certificates
By default, the operator creates the private keys and certificates required for running the driver as described in the driver security section. Those certificates are generated by the operator using a self-signed CA. This can be overridden with custom keys and certificates by using the appropriate fields in the deployment specification. These encoded certificates and private keys are made available to driver pods via Kubernetes secrets by the operator.
NOTE: A production deployment that should not depend on the operator's self-signed CA must instead provide certificates generated by a trusted certificate authority.
- Install the PMEM-CSI operator
$ kubectl create -f https://github.com/intel/pmem-csi/raw/devel/deploy/operator/pmem-csi-operator.yaml
- Create a custom resource
Once the operator deployed in the previous step is in the Running state, it is ready to handle PMEM-CSI Deployment objects in the pmem-csi.intel.com API group. Refer to the Deployment CRD API for a complete list of supported properties.
Here is a minimal example driver deployment created with a custom resource:
$ kubectl create -f - <<EOF
apiVersion: pmem-csi.intel.com/v1alpha1
kind: Deployment
metadata:
  name: pmem-csi.intel.com
spec:
  deviceMode: lvm
  nodeSelector:
    storage: pmem
EOF
This uses the same pmem-csi.intel.com driver name as the YAML files in deploy and the node label from the hardware installation and setup section. Once the deployment is successful, all the driver pods are in the Running state:
$ kubectl get deployments.pmem-csi.intel.com
NAME AGE
pmem-csi.intel.com 50s
$ kubectl describe deployment.pmem-csi.intel.com/pmem-csi.intel.com
Name: pmem-csi.intel.com
Namespace: default
Labels: <none>
Annotations: <none>
API Version: pmem-csi.intel.com/v1alpha1
Kind: Deployment
Metadata:
  Creation Timestamp:  2020-01-23T13:40:32Z
  Generation:          1
  Resource Version:    3596387
  Self Link:           /apis/pmem-csi.intel.com/v1alpha1/deployments/pmem-csi.intel.com
  UID:                 454b5961-5aa2-41c3-b774-29fe932ae236
Spec:
  Controller Resources:
    Requests:
      Cpu:     200m
      Memory:  100Mi
  Device Mode:  lvm
  Image:        localhost/pmem-csi-driver:canary
  Node Resources:
    Requests:
      Cpu:     200m
      Memory:  100Mi
Status:
  Phase:  Running
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NewDeployment 34s pmem-csi-operator Processing new driver deployment
Normal Running 2s (x10 over 26s) pmem-csi-operator Driver deployment successful
$ kubectl get po
NAME READY STATUS RESTARTS AGE
pmem-csi-intel-com-controller-0 2/2 Running 0 51s
pmem-csi-intel-com-node-4x7cv 2/2 Running 0 50s
pmem-csi-intel-com-node-6grt6 2/2 Running 0 50s
pmem-csi-intel-com-node-msgds 2/2 Running 0 51s
pmem-csi-operator-749c7c7c69-k5k8n 1/1 Running 0 3m
- Get source code
PMEM-CSI uses Go modules and thus can be checked out and (if that should be desired) built anywhere in the filesystem. Pre-built container images are available and thus users don't need to build from source, but they may need some additional files for the following sections. To get the source code, use:
$ git clone https://github.com/intel/pmem-csi
- Set up certificates
Certificates are required as explained in Security.
If you are not using the test cluster described in
Starting and stopping a test cluster
where certificates are created automatically, you must set up certificates manually.
This can be done by running the ./test/setup-ca-kubernetes.sh script for your cluster. This script requires the "cfssl" tools, which can be downloaded as shown below.
These are the steps for manual set-up of certificates:
- Download cfssl tools
$ curl -L https://pkg.cfssl.org/R1.2/cfssl_linux-amd64 -o _work/bin/cfssl --create-dirs
$ curl -L https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64 -o _work/bin/cfssljson --create-dirs
$ chmod a+x _work/bin/cfssl _work/bin/cfssljson
- Run certificates set-up script
$ KUBCONFIG="<<your cluster kubeconfig path>>" PATH="$PWD/_work/bin:$PATH" ./test/setup-ca-kubernetes.sh
- Deploy the driver to Kubernetes
The deploy/kubernetes-<kubernetes version>
directory contains
pmem-csi*.yaml
files which can be used to deploy the driver on that
Kubernetes version. The files in the directory with the highest
Kubernetes version might also work for more recent Kubernetes
releases. All of these deployments use images published by Intel on
Docker Hub.
For each Kubernetes version, four different deployment variants are provided:
- direct or lvm: one uses direct device mode, the other LVM device mode.
- testing: the variants with testing in the name enable debugging features and shouldn't be used in production.
For example, to deploy for production with LVM device mode onto Kubernetes 1.17, use:
$ kubectl create -f deploy/kubernetes-1.17/pmem-csi-lvm.yaml
The PMEM-CSI scheduler extender and webhook are not enabled in this basic installation. See below for instructions about that.
These variants were generated with kustomize. kubectl >= 1.14 includes some support for that. The sub-directories of deploy/kustomize-<kubernetes version> can be used as bases for kubectl kustomize. For example:
- Change namespace:
$ mkdir -p my-pmem-csi-deployment
$ cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
namespace: pmem-csi
bases:
- ../deploy/kubernetes-1.17/lvm
EOF
$ kubectl create namespace pmem-csi
$ kubectl create --kustomize my-pmem-csi-deployment
- Configure how much PMEM is used by PMEM-CSI for LVM (see Namespace modes in LVM device mode):
$ mkdir -p my-pmem-csi-deployment
$ cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
bases:
- ../deploy/kubernetes-1.17/lvm
patchesJson6902:
- target:
    group: apps
    version: v1
    kind: DaemonSet
    name: pmem-csi-node
  path: lvm-parameters-patch.yaml
EOF
$ cat >my-pmem-csi-deployment/lvm-parameters-patch.yaml <<EOF
# pmem-driver is in the container #0. Append arguments at the end.
- op: add
  path: /spec/template/spec/containers/0/args/-
  value: "-pmemPercentage=90"
EOF
$ kubectl create --kustomize my-pmem-csi-deployment
- Wait until all pods reach 'Running' status:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
pmem-csi-node-8kmxf 2/2 Running 0 3m15s
pmem-csi-node-bvx7m 2/2 Running 0 3m15s
pmem-csi-controller-0 2/2 Running 0 3m15s
pmem-csi-node-fbmpg 2/2 Running 0 3m15s
After the driver has been deployed using one of the methods mentioned above, verify that the node labels have been configured correctly:
$ kubectl get nodes --show-labels
The command output must indicate that every node with PMEM has these two labels:
pmem-csi.intel.com/node=<NODE-NAME>,storage=pmem
If storage=pmem is missing, label manually as described above. If pmem-csi.intel.com/node is missing, then double-check that the alpha feature gates are enabled, that the CSI driver is running on the node, and that the driver's log output doesn't contain errors.
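If needed, the driver's log output can be inspected along these lines, assuming the pod and container names used in the examples in this document (the exact node pod name differs per cluster):
$ kubectl get pods -o wide | grep pmem-csi-node
$ kubectl logs pmem-csi-node-8kmxf -c pmem-driver | grep -i error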
- Define two storage classes using the driver
$ kubectl create -f deploy/common/pmem-storageclass-ext4.yaml
$ kubectl create -f deploy/common/pmem-storageclass-xfs.yaml
- Provision two PMEM-CSI volumes
$ kubectl create -f deploy/common/pmem-pvc.yaml
- Verify two Persistent Volume Claims have 'Bound' status
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pmem-csi-pvc-ext4 Bound pvc-f70f7b36-6b36-11e9-bf09-deadbeef0100 4Gi RWO pmem-csi-sc-ext4 16s
pmem-csi-pvc-xfs Bound pvc-f7101fd2-6b36-11e9-bf09-deadbeef0100 4Gi RWO pmem-csi-sc-xfs 16s
- Start two applications requesting one provisioned volume each
$ kubectl create -f deploy/common/pmem-app.yaml
These applications use storage: pmem in the nodeSelector list to ensure scheduling to a node with PMEM devices. Each application requests a mount of a volume, one with an ext4-formatted and the other with an xfs-formatted file system.
- Verify two application pods reach 'Running' status
$ kubectl get po my-csi-app-1 my-csi-app-2
NAME READY STATUS RESTARTS AGE
my-csi-app-1 1/1 Running 0 6m5s
NAME READY STATUS RESTARTS AGE
my-csi-app-2 1/1 Running 0 6m1s
- Check that applications have a pmem volume mounted with added dax option
$ kubectl exec my-csi-app-1 -- df /data
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ndbus0region0fsdax/5ccaa889-551d-11e9-a584-928299ac4b17
4062912 16376 3820440 0% /data
$ kubectl exec my-csi-app-2 -- df /data
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ndbus0region0fsdax/5cc9b19e-551d-11e9-a584-928299ac4b17
4184064 37264 4146800 1% /data
$ kubectl exec my-csi-app-1 -- mount |grep /data
/dev/ndbus0region0fsdax/5ccaa889-551d-11e9-a584-928299ac4b17 on /data type ext4 (rw,relatime,dax)
$ kubectl exec my-csi-app-2 -- mount |grep /data
/dev/ndbus0region0fsdax/5cc9b19e-551d-11e9-a584-928299ac4b17 on /data type xfs (rw,relatime,attr2,dax,inode64,noquota)
Kubernetes cluster administrators can expose persistent and cache volumes to applications using StorageClass Parameters. An optional persistencyModel parameter differentiates how the provisioned volume can be used:
- No persistencyModel parameter or persistencyModel: normal in the StorageClass
A normal Kubernetes persistent volume. In this case PMEM-CSI creates the PMEM volume on a node, and the application that claims this volume is supposed to be scheduled onto this node by Kubernetes. The choice of node depends on the StorageClass volumeBindingMode. In case of volumeBindingMode: Immediate, PMEM-CSI chooses a node randomly; in case of volumeBindingMode: WaitForFirstConsumer (also known as late binding), Kubernetes first chooses a node for scheduling the application, and PMEM-CSI creates the volume on that node. Applications which claim a normal persistent volume have to use the ReadWriteOnce access mode in their accessModes list. This diagram illustrates how a normal persistent volume gets provisioned in Kubernetes using the PMEM-CSI driver.
- persistencyModel: cache
Volumes of this type shall be used in combination with volumeBindingMode: Immediate. In this case, PMEM-CSI creates a set of PMEM volumes, each on a different node. The number of PMEM volumes to create can be specified by the cacheSize StorageClass parameter. Applications which claim a cache volume can use ReadWriteMany in their accessModes list. Try it out with the provided cache storage class example. This diagram illustrates how a cache volume gets provisioned in Kubernetes using the PMEM-CSI driver.
NOTE: Cache volumes are associated with a node, not a pod. Multiple pods using the same cache volume on the same node will not get their own instance but will end up sharing the same PMEM volume instead. Application deployment has to consider this and use available Kubernetes mechanisms like node anti-affinity. Try it out with provided cache application example.
WARNING: late binding (volumeBindingMode: WaitForFirstConsumer) has some caveats:
- Pod creation may get stuck when there isn't enough capacity left for the volumes; see the next section for details.
- A node is only chosen the first time a pod starts. After that it will always restart on that node, because that is where the persistent volume was created.
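As an illustration of the cache model, a storage class could be sketched like this (the class name is a placeholder; persistencyModel and cacheSize are the parameters described above):
$ kubectl create -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-cache          # placeholder name
provisioner: pmem-csi.intel.com
volumeBindingMode: Immediate       # required for cache volumes
parameters:
  persistencyModel: cache
  cacheSize: "2"                   # create the volume on two nodes
EOF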
Kata Containers support gets enabled via the kataContainers storage class parameter. It accepts the following values:
- true/1/t/TRUE
Create the filesystem inside a partition inside a file, try to mount it on the host through a loop device with -o dax, but proceed without -o dax when the kernel does not support that. Currently Linux up to and including 5.4 does not support it. In other words, on the host such volumes are usable, but only without DAX. Inside Kata Containers, DAX works.
- false/0/f/FALSE (default)
Create the filesystem directly on the volume.
Raw block volumes are only supported with
kataContainers: false
. Attempts to create them with kataContainers: true
are rejected.
At the moment (= Kata Containers 1.11.0-rc0), only Kata Containers with QEMU enable the special support for such volumes. Without QEMU or with older releases of Kata Containers, the volume is still usable through the normal remote filesystem support (9p or virtio-fs). Support for Cloud Hypervisor is in progress.
With Kata Containers for QEMU, the VM must be configured appropriately
to allow adding the PMEM volumes to their address space. This can be
done globally by setting the memory_offset
in the
configuration-qemu.toml
file
or per-pod by setting the
io.katacontainers.config.hypervisor.memory_offset
label
in the pod meta data. In both cases, the value has to be large enough
for all PMEM volumes used by the pod, otherwise pod creation will fail
with an error similar to this:
Error: container create failed: QMP command failed: not enough space, currently 0x8000000 in use of total space for memory devices 0x3c100000
The examples for usage of Kata Containers with
ephemeral and
persistent volumes use the pod
label. They assume that the kata-qemu
runtime class is
installed.
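The following sketch shows the per-pod label variant described above in a minimal pod; the pod name, image and offset value are illustrative, and the offset must be large enough for all PMEM volumes used by the pod:
$ kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-kata-app                # placeholder name
  labels:
    io.katacontainers.config.hypervisor.memory_offset: "4294967296"   # example value: 4 GiB
spec:
  runtimeClassName: kata-qemu
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
EOF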
For the QEMU test cluster,
setup-kata-containers.sh
can be
used to install Kata Containers. However, this currently only works on
Clear Linux because on Fedora, the Docker container runtime is used
and Kata Containers does not support that one.
This is the original implementation of ephemeral inline volumes for CSI drivers in Kubernetes. It is currently available as a beta feature in Kubernetes.
Volume requests embedded in the pod spec are provisioned as ephemeral volumes. The volume request can use the following fields as volumeAttributes:
key | meaning | optional | values |
---|---|---|---|
size | Size of the requested ephemeral volume as Kubernetes memory string ("1Mi" = 1024*1024 bytes, "1e3K" = 1000000 bytes) | No | |
eraseAfter | Clear all data after use and before deleting the volume | Yes | true (default), false |
kataContainers | Prepare volume for use in Kata Containers. | Yes | false/0/f/FALSE (default), true/1/t/TRUE |
Try out ephemeral volume usage with the provided example application.
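Sketched below is what such a request could look like in a pod spec, using the volumeAttributes from the table above (pod name, image, volume name and size are placeholders):
$ kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-csi-ephemeral-app       # placeholder name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    csi:
      driver: pmem-csi.intel.com
      fsType: ext4
      volumeAttributes:
        size: "2Gi"
        eraseAfter: "true"
EOF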
This approach was introduced in Kubernetes 1.19 with the goal of using it for PMEM-CSI instead of the older approach. In contrast to CSI ephemeral inline volumes, no changes are needed in CSI drivers, so PMEM-CSI already fully supports this if the cluster has the feature enabled. See pmem-app-generic-ephemeral.yaml for an example.
When using generic ephemeral inline volumes together with storage capacity tracking, the PMEM-CSI scheduler extensions are not needed anymore.
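As a rough sketch of this newer approach (the authoritative example is pmem-app-generic-ephemeral.yaml; names and sizes here are placeholders), a generic ephemeral inline volume is declared like this:
$ kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-generic-ephemeral-app   # placeholder name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: pmem-csi-sc-ext4   # one of the storage classes created earlier
          resources:
            requests:
              storage: 2Gi
EOF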
Applications can use volumes provisioned by PMEM-CSI as raw block
devices. Such
volumes use the same "fsdax" namespace mode as filesystem volumes
and therefore are block devices. That mode only supports dax (=
mmap(MAP_SYNC)
) through a filesystem. Pages mapped on the raw block
device go through the Linux page cache. Applications have to format
and mount the raw block volume themselves if they want dax. The
advantage then is that they have full control over that part.
For provisioning a PMEM volume as raw block device, one has to create a
PersistentVolumeClaim
with volumeMode: Block
. See example PVC and
application for usage reference.
That example demonstrates how to handle some details:
- mkfs.ext4 needs -b 4096 to produce volumes that support dax; without it, the automatic block size detection may end up choosing an unsuitable value depending on the volume size.
- Kubernetes bug #85624 must be worked around to format and mount the raw block device.
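A minimal sketch of such a claim (the claim name is a placeholder, the storage class is one of those created earlier):
$ kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-raw-block-pvc           # placeholder name
spec:
  volumeMode: Block
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
  storageClassName: pmem-csi-sc-ext4
EOF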
The PMEM-CSI scheduler extender and admission webhook are provided by
the PMEM-CSI controller. They need to be enabled during deployment via
the --schedulerListen=[<listen address>]:<port>
parameter. The
listen address is optional and can be left out. The port is where an HTTPS server will run. It uses the same certificates as the internal
gRPC service. When using the CA creation script described above, they
will contain alternative names for the URLs described in this section
(service names, 127.0.0.1
IP address).
This parameter can be added to one of the existing deployment files
with kustomize
. All of the following examples assume that the
current directory contains the deploy
directory from the PMEM-CSI
repository. It is also possible to reference the base via a
URL.
$ mkdir my-pmem-csi-deployment
$ cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
bases:
- ../deploy/kubernetes-1.16/lvm
patchesJson6902:
- target:
    group: apps
    version: v1
    kind: StatefulSet
    name: pmem-csi-controller
  path: scheduler-patch.yaml
EOF
$ cat >my-pmem-csi-deployment/scheduler-patch.yaml <<EOF
- op: add
  path: /spec/template/spec/containers/0/command/-
  value: "--schedulerListen=:8000"
EOF
$ kubectl create --kustomize my-pmem-csi-deployment
To enable the PMEM-CSI scheduler extender, a configuration file and an
additional --config
parameter for kube-scheduler
must be added to
the cluster control plane, or, if there is already such a
configuration file, one new entry must be added to the extenders
array. A full example is presented below.
The kube-scheduler
must be able to connect to the PMEM-CSI
controller via the urlPrefix
in its configuration. In some clusters
it is possible to use cluster DNS and thus a symbolic service name. If
that is the case, then deploy the scheduler
service as-is
and use https://pmem-csi-scheduler.default.svc
as urlPrefix
. If
the PMEM-CSI driver is deployed in a namespace, replace default
with
the name of that namespace.
In a cluster created with kubeadm, kube-scheduler
is unable to use
cluster DNS because the pod it runs in is configured with
hostNetwork: true
and without dnsPolicy
. Therefore the cluster DNS
servers are ignored. There also is no special dialer as in other
clusters. As a workaround, the PMEM-CSI service can be exposed via a
fixed node port like 32000 on all nodes. Then
https://127.0.0.1:32000
needs to be used as urlPrefix
. Here's how
the service can be created with that node port:
$ mkdir my-scheduler
$ cat >my-scheduler/kustomization.yaml <<EOF
bases:
- ../deploy/kustomize/scheduler
patchesJson6902:
- target:
    version: v1
    kind: Service
    name: pmem-csi-scheduler
  path: node-port-patch.yaml
EOF
$ cat >my-scheduler/node-port-patch.yaml <<EOF
- op: add
  path: /spec/ports/0/nodePort
  value: 32000
EOF
$ kubectl create --kustomize my-scheduler
How to (re)configure kube-scheduler
depends on the cluster. With
kubeadm it is possible to set all necessary options in advance before
creating the master node with kubeadm init
. One additional
complication with kubeadm is that kube-scheduler
by default doesn't
trust any root CA. The following kubeadm config file solves
this together with enabling the scheduler configuration by
bind-mounting the root certificate that was used to sign the certificate used
by the scheduler extender into the location where the Go
runtime will find it. It works for Kubernetes <= 1.18:
$ sudo mkdir -p /var/lib/scheduler/
$ sudo cp _work/pmem-ca/ca.pem /var/lib/scheduler/ca.crt
# https://github.com/kubernetes/kubernetes/blob/52d7614a8ca5b8aebc45333b6dc8fbf86a5e7ddf/staging/src/k8s.io/kube-scheduler/config/v1alpha1/types.go#L38-L107
$ sudo sh -c 'cat >/var/lib/scheduler/scheduler-policy.cfg' <<EOF
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "extenders" :
    [{
      "urlPrefix": "https://<service name or IP>:<port>",
      "filterVerb": "filter",
      "prioritizeVerb": "prioritize",
      "nodeCacheCapable": true,
      "weight": 1,
      "managedResources":
        [{
          "name": "pmem-csi.intel.com/scheduler",
          "ignoredByScheduler": true
        }]
    }]
}
EOF
# https://github.com/kubernetes/kubernetes/blob/52d7614a8ca5b8aebc45333b6dc8fbf86a5e7ddf/staging/src/k8s.io/kube-scheduler/config/v1alpha1/types.go#L38-L107
$ sudo sh -c 'cat >/var/lib/scheduler/scheduler-config.yaml' <<EOF
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: default-scheduler
algorithmSource:
  policy:
    file:
      path: /var/lib/scheduler/scheduler-policy.cfg
clientConnection:
  # This is where kubeadm puts it.
  kubeconfig: /etc/kubernetes/scheduler.conf
EOF
$ cat >kubeadm.config <<EOF
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
scheduler:
  extraVolumes:
    - name: config
      hostPath: /var/lib/scheduler
      mountPath: /var/lib/scheduler
      readOnly: true
    - name: cluster-root-ca
      hostPath: /var/lib/scheduler/ca.crt
      mountPath: /etc/ssl/certs/ca.crt
      readOnly: true
  extraArgs:
    config: /var/lib/scheduler/scheduler-config.yaml
EOF
$ kubeadm init --config=kubeadm.config
In Kubernetes 1.19, the configuration API of the scheduler changed. The corresponding commands for Kubernetes >= 1.19 are:
$ sudo mkdir -p /var/lib/scheduler/
$ sudo cp _work/pmem-ca/ca.pem /var/lib/scheduler/ca.crt
# https://github.com/kubernetes/kubernetes/blob/1afc53514032a44d091ae4a9f6e092171db9fe10/staging/src/k8s.io/kube-scheduler/config/v1beta1/types.go#L44-L96
$ sudo sh -c 'cat >/var/lib/scheduler/scheduler-config.yaml' <<EOF
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
clientConnection:
  # This is where kubeadm puts it.
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
- urlPrefix: https://<service name or IP>:<port>
  filterVerb: filter
  prioritizeVerb: prioritize
  nodeCacheCapable: true
  weight: 1
  managedResources:
  - name: pmem-csi.intel.com/scheduler
    ignoredByScheduler: true
EOF
$ cat >kubeadm.config <<EOF
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
scheduler:
  extraVolumes:
    - name: config
      hostPath: /var/lib/scheduler
      mountPath: /var/lib/scheduler
      readOnly: true
    - name: cluster-root-ca
      hostPath: /var/lib/scheduler/ca.crt
      mountPath: /etc/ssl/certs/ca.crt
      readOnly: true
  extraArgs:
    config: /var/lib/scheduler/scheduler-config.yaml
EOF
$ kubeadm init --config=kubeadm.config
It is possible to stop here without enabling the pod admission webhook. To enable also that, continue as follows.
First of all, it is recommended to exclude all system pods from passing through the web hook. This ensures that they can still be created even when PMEM-CSI is down:
$ kubectl label ns kube-system pmem-csi.intel.com/webhook=ignore
This special label is configured in the provided web hook definition. On Kubernetes >= 1.15, it can also be used to let individual pods bypass the webhook by adding that label. The CA gets configured explicitly, which is supported for webhooks.
$ mkdir my-webhook
$ cat >my-webhook/kustomization.yaml <<EOF
bases:
- ../deploy/kustomize/webhook
patchesJson6902:
- target:
    group: admissionregistration.k8s.io
    version: v1beta1
    kind: MutatingWebhookConfiguration
    name: pmem-csi-hook
  path: webhook-patch.yaml
EOF
$ cat >my-webhook/webhook-patch.yaml <<EOF
- op: replace
  path: /webhooks/0/clientConfig/caBundle
  value: $(base64 -w 0 _work/pmem-ca/ca.pem)
EOF
$ kubectl create --kustomize my-webhook
Kubernetes 1.19 introduces support for publishing and using storage capacity information for pod scheduling. PMEM-CSI must be deployed differently to use this feature:
- external-provisioner must be told to publish storage capacity information via command line arguments.
- A flag in the CSI driver information must be set for the Kubernetes scheduler, otherwise it ignores that information when considering pods with unbound volumes.
Because the CSIStorageCapacity
feature is still alpha in 1.19 and
driver deployment would fail on a cluster without support for it, none
of the normal deployment files nor the operator enable that. Instead,
special kustomize variants are provided in
deploy/kubernetes-1.19-alpha*
.
They can be used for example in the QEMU test cluster with:
$ TEST_KUBERNETES_VERSION=1.19 make start
...
$ TEST_KUBERNETES_FLAVOR=-alpha test/setup-deployment.sh
INFO: deploying from /nvme/gopath/src/github.com/intel/pmem-csi/deploy/kubernetes-1.19-alpha/lvm/testing
...
Metrics support is controlled by command line options of the PMEM-CSI driver binary and of the CSI sidecars. Annotations and named container ports make it possible to discover these data scraping endpoints. The metrics kustomize base adds all of that to the pre-generated deployment files. The operator also enables the metrics support.
Access to metrics data is not restricted (no TLS, no client authorization) because the metrics data is not considered confidential and access control would just make client configuration unnecessarily complex.
PMEM-CSI exposes metrics data about the Go runtime, Prometheus, CSI method calls, and PMEM-CSI:
Name | Type | Explanation |
---|---|---|
build_info | gauge | A metric with a constant '1' value labeled by version. |
csi_[sidecar\|plugin]_operations_seconds | histogram | gRPC call duration and error code, for sidecar to driver (aka plugin) communication. |
go_* | | Go runtime information |
pmem_amount_available | gauge | Remaining amount of PMEM on the host that can be used for new volumes. |
pmem_amount_managed | gauge | Amount of PMEM on the host that is managed by PMEM-CSI. |
pmem_amount_max_volume_size | gauge | The size of the largest PMEM volume that can be created. |
pmem_amount_total | gauge | Total amount of PMEM on the host. |
pmem_csi_[controller\|node]_operations_seconds | histogram | gRPC call duration and error code, for the PMEM-CSI internal controller to node communication. The node label identifies the node. |
pmem_nodes | gauge | The number of PMEM-CSI nodes registered in the controller. |
process_* | | Process information |
promhttp_metric_handler_requests_in_flight | gauge | Current number of scrapes being served. |
promhttp_metric_handler_requests_total | counter | Total number of scrapes by HTTP status code. |
This list is tentative and may still change as long as metrics support is alpha. To see all available data, query a container. Different containers provide different data. For example, the controller provides:
$ kubectl port-forward pmem-csi-controller-0 10010
Forwarding from 127.0.0.1:10010 -> 10010
Forwarding from [::1]:10010 -> 10010
And in another shell:
$ curl --silent http://localhost:10010/metrics | grep '# '
# HELP build_info A metric with a constant '1' value labeled by version.
# TYPE build_info gauge
...
# HELP csi_plugin_operations_seconds [ALPHA] Container Storage Interface operation duration with gRPC error code status total
# TYPE csi_plugin_operations_seconds histogram
...
# HELP pmem_csi_controller_operations_seconds [ALPHA] Container Storage Interface operation duration with gRPC error code status total
# TYPE pmem_csi_controller_operations_seconds histogram
...
An extension of the scrape config is
necessary for Prometheus. When deploying
Prometheus via Helm,
that file can be added to the default configuration with the -f
parameter. The following example works for the QEMU-based
cluster and Helm v3.1.2. In a real
production deployment, some kind of persistent storage should be
provided. The URL can be used instead of
the file name, too.
$ helm install prometheus stable/prometheus \
--set alertmanager.persistentVolume.enabled=false,server.persistentVolume.enabled=false \
-f deploy/prometheus.yaml
NAME: prometheus
LAST DEPLOYED: Tue Aug 18 18:04:27 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-server.default.svc.cluster.local
Get the Prometheus server URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 9090
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Server pod is terminated. #####
#################################################################################
...
After running this kubectl port-forward
command, it is possible to
access the Prometheus web
interface
and run some queries there. Here are some examples for the QEMU test
cluster with two volumes created on node pmem-csi-pmem-govm-worker2
.
Available PMEM as percentage:
pmem_amount_available / pmem_amount_managed
Result variable | Value | Tags |
---|---|---|
none | 0.7332986065893997 | instance = 10.42.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-dfkrw, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker2, node = pmem-csi-pmem-govm-worker2 |
 | 1 | instance = 10.36.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-z5vnp, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker3, node = pmem-csi-pmem-govm-worker3 |
 | 1 | instance = 10.44.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-zzmsd, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker1, node = pmem-csi-pmem-govm-worker1 |
Number of CreateVolume
calls in nodes:
pmem_csi_node_operations_seconds_count{method_name="/csi.v1.Controller/CreateVolume"}
Result variable | Value | Tags |
---|---|---|
pmem_csi_node_operations_seconds_count | 2 | driver_name = pmem-csi.intel.com, grpc_status_code = OK, instance = 10.42.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-dfkrw, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker2, method_name = /csi.v1.Controller/CreateVolume, node = pmem-csi-pmem-govm-worker2 |
Deployment
is a cluster-scoped Kubernetes resource in the
pmem-csi.intel.com
API group. It describes how a PMEM-CSI driver
instance is to be created.
The operator will create objects in the namespace in which the operator itself runs if the object type is namespaced.
The name of the deployment object is also used as CSI driver name. This ensures that the name is unique and immutable. However, name clashes with other CSI drivers are still possible, so the name should meet the CSI requirements:
- domain name notation format, including a unique top-level domain
- 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), dots (.), and alphanumerics between.
The name is also used as prefix for the names of all objects created
for the deployment and for the local /var/lib/<name>
state directory
on each node.
The current API for PMEM-CSI Deployment
resources is:
Field | Type | Description |
---|---|---|
apiVersion | string | pmem-csi.intel.com/v1alpha1 |
kind | string | Deployment |
metadata | ObjectMeta | Object metadata, name used for CSI driver and as prefix for sub-objects |
spec | DeploymentSpec | Specification of the desired behavior of the deployment |
Field | Type | Description | Default Value |
---|---|---|---|
image | string | PMEM-CSI docker image name used for the deployment | the same image as the operator [1] |
provisionerImage | string | CSI provisioner docker image name | latest external provisioner stable release image [2] |
registrarImage | string | CSI node driver registrar docker image name | latest node driver registrar stable release image [2] |
pullPolicy | string | Docker image pull policy, one of Always, Never, IfNotPresent | IfNotPresent |
logLevel | integer | PMEM-CSI driver logging level | 3 |
deviceMode | string | Device management mode to use, one of lvm or direct | lvm |
controllerResources | ResourceRequirements | Describes the compute resource requirements for the controller pod | |
nodeResources | ResourceRequirements | Describes the compute resource requirements for the pods running on node(s) | |
registryCert | string | Encoded TLS certificate signed by a certificate authority, used for the driver's controller registry server | generated by operator self-signed CA |
nodeControllerCert | string | Encoded TLS certificate signed by a certificate authority, used for the driver's node controllers | generated by operator self-signed CA |
registryKey | string | Encoded RSA private key used for signing by registryCert | generated by the operator |
nodeControllerKey | string | Encoded RSA private key used for signing by nodeControllerCert | generated by the operator |
caCert | string | Certificate of the CA by which the registryCert and controllerCert are signed | self-signed certificate generated by the operator |
nodeSelector | string map | Labels to use for selecting Nodes on which the PMEM-CSI driver should run | { "storage": "pmem" } |
pmemPercentage | integer | Percentage of PMEM space to be used by the driver on each node. This is only valid for a driver deployed in lvm mode. This field can be modified, but by that time the old value may have been used already. Reducing the percentage is not supported. | 100 |
labels | string map | Additional labels for all objects created by the operator. Can be modified after the initial creation, but removed labels will not be removed from existing objects because the operator cannot know which labels it needs to remove and which it has to leave in place. | |
kubeletDir | string | Kubelet's root directory path | /var/lib/kubelet |
[1] To use the same container image as the default driver image, the operator pod must be set up with the following environment variables and appropriate values:
- POD_NAME: Name of the operator pod. The namespace of the pod can be determined by the operator.
- OPERATOR_NAME: Name of the operator container. If not set, defaults to "pmem-csi-operator".
[2] Image versions depend on the Kubernetes release. The operator dynamically chooses suitable image versions. Users have to take care of that themselves when overriding the values.
WARNING: although all fields can be modified and changes will be
propagated to the deployed driver, not all changes are safe. In
particular, changing the deviceMode
will not work when there are
active volumes.
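To illustrate several of these fields together, here is a sketch of a more complete Deployment resource; all values are examples only and correspond to the fields in the table above:
$ kubectl create -f - <<EOF
apiVersion: pmem-csi.intel.com/v1alpha1
kind: Deployment
metadata:
  name: pmem-csi.intel.com
spec:
  deviceMode: lvm
  logLevel: 5
  pullPolicy: IfNotPresent
  pmemPercentage: 50               # leave half of the PMEM for other users (lvm mode only)
  nodeSelector:
    storage: pmem
  kubeletDir: /var/lib/kubelet
EOF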
A PMEM-CSI Deployment's status field is a DeploymentStatus object, which has a phase field. The phase of a Deployment is a high-level summary of where the Deployment is in its lifecycle. The possible phase values and their meanings are as follows:
Value | Meaning |
---|---|
empty string | A new deployment. |
Initializing | All the direct sub-resources of the Deployment are created, but some indirect ones (like pods controlled by a daemon set) may still be missing. |
Running | The operator has determined that the driver is usable [1]. |
Failed | For some reason the state of the Deployment failed and cannot be progressed [2]. |
[1] This check has not been implemented yet. Instead, the deployment goes straight to Running after creating sub-resources.
[2] The failure reason is supposed to be carried by an additional DeploymentStatus field, but this is not implemented yet.
The PMEM-CSI operator posts events on the progress of a Deployment
. If the
deployment is in the Failed
state, then one can look into the event(s) using
kubectl describe
on that deployment for the detailed failure reason.
Note on multiple deployments
Though the operator allows running multiple PMEM-CSI driver deployments, one has to take extreme care of such deployments by ensuring that not more than one driver ends up running on the same node(s). Nodes on which a PMEM-CSI driver could run can be configured by using
nodeSelector
property ofDeploymentSpec
.
Report a bug by filing a new issue.
Before making your first contribution, be sure to read the development documentation for guidance on code quality and branches.
Contribute by opening a pull request.
Learn about pull requests.
Reporting a Potential Security Vulnerability: If you have discovered a potential security vulnerability in PMEM-CSI, please send an e-mail to [email protected]. For issues related to Intel Products, please visit Intel Security Center.
It is important to include the following details:
- The projects and versions affected
- Detailed description of the vulnerability
- Information on known exploits
Vulnerability information is extremely sensitive. Please encrypt all security vulnerability reports using our PGP key.
A member of the Intel Product Security Team will review your e-mail and contact you to collaborate on resolving the issue. For more information on how Intel works to resolve security issues, see: vulnerability handling guidelines.