# A Simple Method for Void Linux Clustering

There are several methods for constructing and managing small-scale, heterogeneous ("Beowulf") computing clusters. Warewulf is an end-to-end suite that facilitates the management of node images, provisioning of nodes (including custom, per-node configuration overlays), and control of TFTP, DHCP and NFS servers. Originally a suite of Perl scripts that provided a front end to a database, Warewulf is now written in Go and uses a simpler flat-file backing store. The server-control components rely exclusively on the systemctl command in systemd, but could be trivially patched to support runit as well.

After considering bringing Warewulf to Void, I concluded that most of its major utility can be realized with a much simpler approach. One of my requirements for a small cluster is that the master node run essentially the same image as the others, and this wouldn't be possible with Warewulf without some special accommodations. To satisfy this requirement, all nodes can mount a root filesystem image as the lower tree of an overlay filesystem that uses memory-backed tmpfs as its upper tree. The master node can mount the lower tree from a locally attached ZFS pool; all other nodes can mount the lower tree from an NFS export on the master. For home directories, the master will again mount and export local storage for the other nodes to access. The master then runs a PXE server that will provide kernels and an appropriate initramfs image to all others.
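Conceptually, the early-boot assembly on each node looks roughly like the sketch below. The ZFS dataset, NFS export and upper-layer paths are assumptions; the actual logic is implemented by the hooks in the initcpio directory.

```sh
# Illustrative only: the initcpio hooks in this repository implement the real
# logic. The ZFS dataset, NFS export and upper-layer paths are assumptions.
mkdir -p /run/rootfs/lower /run/rootfs/upper

# Lower layer: the read-only root image, from the local ZFS pool on the master...
mount -t zfs -o ro zroot/ROOT/void /run/rootfs/lower
# ...or from the master's NFS export on every other node:
# mount -t nfs4 -o ro 172.23.199.225:/ /run/rootfs/lower

# Upper layer and overlay workdir live in memory-backed tmpfs
mount -t tmpfs tmpfs /run/rootfs/upper
mkdir -p /run/rootfs/upper/data /run/rootfs/upper/work

# Assemble the root filesystem the system will actually boot into
mount -t overlay overlay \
    -o lowerdir=/run/rootfs/lower,upperdir=/run/rootfs/upper/data,workdir=/run/rootfs/upper/work \
    /new_root
```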

The most heavily customized node will be the master, because it should run services and use local filesystems that will not be used on other nodes. For other nodes, customization is generally limited to assigning unique hostnames to each, although more extensive per-node customization is possible. A simple override system allows the replacement of common files with customized variants early in the boot process. The replacement system provides all of the flexibility necessary to assign unique roles to individual nodes as needed.

## Contents

The subdirectories of this repository provide sample configurations and scripts that should be deployed on the master node. Each directory contains a dedicated README that describes how to install and use the components therein. The subsystems that must be modified for clustering are:

- `initcpio`: I prefer mkinitcpio to dracut, both because mkinitcpio is generally simpler to configure and because dracut is increasingly hostile to systems that do not use systemd or do not include it in initramfs images. This subdirectory contains the pieces necessary to configure mkinitcpio to produce initramfs images for both the master node and the client nodes.

- `overlays` provides the components necessary to implement early-boot configuration overlays on a per-node basis.

- `tftp` provides instructions and a simple PXELINUX configuration that can be used to serve the client kernel and initramfs to diskless nodes; a minimal example is sketched after this list.
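For orientation, a minimal pxelinux.cfg/default might resemble the following sketch. The TFTP root, file names and kernel parameters are assumptions; the configuration shipped in the tftp directory should be treated as authoritative.

```sh
# Sketch only: the TFTP root and file names are assumptions
cat > /srv/tftp/pxelinux.cfg/default <<EOF
DEFAULT void
LABEL void
  KERNEL vmlinuz
  INITRD initramfs-client.img
  # The real APPEND line must point the initramfs at the master's NFS export
  APPEND ip=dhcp
EOF
```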

## Base Installation

These scripts and configuration overlays are intended to be added to a stock Void Linux installation that contains the desired software and configuration for all nodes in the cluster. Because the master node in my cluster runs atop a ZFS pool, it was originally installed according to the ZFSBootMenu guide for booting Void on a UEFI system. After the base installation is configured and booting, make sure mkinitcpio is installed and configured for use on the system:

```sh
xbps-install -S mkinitcpio mkinitcpio-zfs
xbps-alternatives -s mkinitcpio
```

At this point, the initcpio configuration from this repository can replace the default mkinitcpio.conf. The initramfs can be regenerated by running

```sh
xbps-reconfigure -f linuxX.Y
```

where X.Y should be replaced with the version of the Void kernel series currently installed.
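If the installed kernel series is not known offhand, the installed kernel packages can be listed with xbps-query, for example:

```sh
# Show installed kernel packages; the series (e.g., linux6.6) follows the "ii"
xbps-query -l | grep '^ii linux[0-9]'
```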

NOTE: at this point, rebooting the system will result in a root filesystem that consists of a tmpfs overlay on top of the underlying ZFS filesystem. Subsequent configuration on top of the tmpfs overlay will be lost after system shutdown. One of three alternatives will be required to complete configuration of the master node:

  1. Finish all configuration before rebooting with the new initramfs;
  2. Temporarily disable the overlayfs hook in mkinitcpio.conf (a sketch follows this list); or
  3. After rebooting, make sure to complete subsequent configuration while chrooted into the lower layer of the overlay (/run/rootfs/lower).
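For the second alternative, the overlayfs hook can be dropped from the HOOKS array in /etc/mkinitcpio.conf and the initramfs rebuilt. A rough sketch follows; the sed expression assumes the hook appears exactly once as "overlayfs", so verify the result before rebuilding.

```sh
# Remove the overlayfs hook from the HOOKS array (sketch only; verify the edit)
sed -i 's/ overlayfs//' /etc/mkinitcpio.conf
# Rebuild the initramfs for the installed kernel series
xbps-reconfigure -f linuxX.Y
# Re-add the hook and rebuild again once master configuration is complete
```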

## Modifying the Base Installation

Eventually, it will be necessary to modify the base installation while the system has a tmpfs overlay mounted atop it. Because any changes to the upper layer will be lost after a reboot, modifications must be made in the lower layer. As configured by the default overlayfs hook, the lower layer will be mounted at /run/rootfs/lower. A straightforward way to manipulate the lower layer is with the xchroot script provided by the xtools package:

```sh
xbps-install -S xtools
xchroot /run/rootfs/lower /bin/bash
```

Within the chroot, complete any configuration necessary (for example, a system upgrade with xbps-install -Su), then exit the shell and refresh the mount on the host:

```sh
mount -o remount /
```

Modifications to the lower root filesystem on the master node may trigger messages and I/O errors on clients that hold stale NFS handles to replaced files. After such modifications, it is often easiest to simply reboot the client nodes to ensure they have up-to-date views of the filesystem.

When the master installation is first adapted for cluster use, the image should be "generalized" by removing configuration that is specific to the master. In particular, /etc/fstab and, if it exists, /etc/zfs/zpool.cache should not remain in the shared root. Enter a chroot into the lower root filesystem and move these master-specific files into the /etc/overlays tree:

```sh
xchroot /run/rootfs/lower /bin/bash

macaddr="$(cat /sys/class/net/eth0/address)"
mkdir -p "/etc/overlays/${macaddr}/zfs"

mv /etc/fstab "/etc/overlays/${macaddr}"
mv /etc/zfs/zpool.cache "/etc/overlays/${macaddr}/zfs"
```

At this point, any other master-specific configuration files should be moved from /etc (under the chroot) to /etc/overlays/${macaddr}. Make sure to move, and not copy, these files to avoid leaving them accessible to client nodes; when necessary, replace any moved files with suitable alternatives for the client.
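After these moves, the overlay tree for the master should look roughly like the listing below (the MAC address shown is only an example):

```
/etc/overlays/
└── aa:bb:cc:dd:ee:ff/
    ├── fstab
    └── zfs/
        └── zpool.cache
```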

Note that moving /etc/fstab into the overlays tree, as recommended above, will leave client nodes unbootable. Make sure to replace that file with a basic version that mounts the home directory exported by the master:

```sh
cat > /etc/fstab <<EOF
tmpfs /tmp tmpfs defaults,nosuid,nodev 0 0
172.23.199.225:/home /home nfs4 rw,defaults,retrans=10,rsize=32768,wsize=32768 0 0
EOF
```

where 172.23.199.225 should be replaced with the IP address of the master node. This default fstab in the lower root filesystem will be used by the client nodes, while the master relies on its overlay replacement.
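The matching export on the master is configured outside this file. Assuming a standard NFS server setup, a minimal /etc/exports entry for the home directories might look like the following sketch; the subnet and export options are assumptions and should be adjusted to the cluster's network.

```sh
# On the master: export /home to the cluster network (sketch only)
cat >> /etc/exports <<EOF
/home 172.23.199.0/24(rw,no_subtree_check)
EOF
```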