
garm server configuration should be possible via garm-operator #143

Open
bavarianbidi opened this issue Jul 22, 2024 · 6 comments
Comments

@bavarianbidi (Member) commented Jul 22, 2024

What is the feature you would like to have?

With the recent changes in garm (and also future ideas in the backlog) it will be possible to "configure" a garm-server instance in a declarative way.
The current version of garm-operator already detects whether a garm instance is initialized and takes care of the initialization if needed.

As we already check on every API request whether the garm-server is initialized (func EnsureAuth), we should at least decouple this implementation and use existing Kubernetes primitives to make the code a bit easier to read.

I would like to see a dedicated controller which is responsible for the generic garm-server management, plus controllers for the scoped resources (separation of concerns).

Current known resources

name               | scope                                 | depends_on
github endpoint    | shared across github credentials      |
github credentials | shared across orgs/repos/enterprises  | github endpoint
webhooks           |                                       |
repo/org           |                                       | github credentials
pool               | repo/org, enterprise                  |
enterprise         |                                       | github credentials
image              | shared across pools                   |

Anything else you would like to add?

As the configuration of the github backend has already moved out of the config file and into the API (with cloudbase/garm#243), and we already have an issue for this (#127), we should agree on the level of separation first.

How could the new CRs look? Maybe we can "design" the CRs first before we start implementing 🙏
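To have something concrete to discuss, here is a rough sketch of how two of these CRs could look. Everything in it - the kind names, the API group/version and the fields (endpointRef, secretRef, ...) - is an assumption for illustration only, not a proposed design:

# hypothetical GitHubEndpoint CR - shared across github credentials
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GitHubEndpoint
metadata:
  name: github-com
spec:
  description: public github.com
  baseUrl: https://github.com
  apiBaseUrl: https://api.github.com
  uploadBaseUrl: https://uploads.github.com
---
# hypothetical GitHubCredentials CR - scoped to orgs/repos/enterprises,
# depends on the GitHubEndpoint above
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GitHubCredentials
metadata:
  name: org-foo-pat
spec:
  description: PAT for org foo
  authType: pat
  endpointRef:
    name: github-com
  secretRef:            # the actual token lives in a plain kubernetes Secret
    name: org-foo-pat
    key: token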

bavarianbidi added this to the "support for garm > 0.1.4" milestone on Jul 22, 2024
@bavarianbidi (Member Author)

cc @gabriel-samfira - as you already have similar thoughts about managing garm itself via the operator: it would be nice if you could drop some of your thoughts about this in here 🙏

@gabriel-samfira

I've actually started playing around with this here: https://github.com/gabriel-samfira/garm-operator/tree/add-garm-server-controller

However, I've been sidetracked a bit. Hopefully I can resume soon.

The basic idea is as follows:

  • A new CR is created to describe the garm-server
  • A single CRD should be allowed (perhaps created automatically) that describes the managed garm server
  • Perhaps allow the controller to be disabled and use an external garm server, in which case things should work as they do now (not sure about this)
  • The controller is responsible for composing the GARM config file
  • GARM now returns the version it's running as part of the controller info API.
    • Using this version we can decide if the credentials/providers should be composed as part of the config or as database entries. v0.1.4 and below require a config entry. From v0.1.5 (not yet released), credentials live in the database. Providers will perhaps be moved to the DB starting with v0.1.6.
    • The lack of a version in the controller info will make the operator assume the version is <= v0.1.4
  • New CR will be added for:
    • Github Endpoints
    • Github credentials
    • Providers. This might be split up in the future into provider types and provider endpoints. The provider type will indicate which cloud that provider is meant for. The provider endpoint will be a set of credentials for that cloud. This is not yet set in stone; comments and ideas are welcome. This is slated for v0.1.6 or v0.1.7.
  • The garm server operator will list all github endpoints, all gh credentials and all providers; along with the settings specified by the user for the garm server CRD, a valid TOML config will be composed and stored as a config map (see the sketch below).
  • That config map will be mounted as the garm-server config file and the server will be started.
  • Status will be set to Ready for the garm server, at which point the garm client can be instantiated and the rest of the operators will be unblocked to reconcile to their heart's content.
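As a strawman for the "compose a valid TOML config and set it as a config map" step, the input/output could look roughly like this. The GarmServer kind, its fields, the config map name and all URLs/paths are assumptions for illustration, and the TOML is trimmed to a couple of [default] keys plus one external provider:

# hypothetical CR describing the managed garm-server
apiVersion: garm-operator.mercedes-benz.com/v1alpha1
kind: GarmServer
metadata:
  name: garm
spec:
  image: my-registry.example.com/garm:v0.1.5   # illustrative image reference
  callbackUrl: https://garm.example.com/api/v1/callbacks
  metadataUrl: https://garm.example.com/api/v1/metadata
---
# config map composed by the reconciliation loop and mounted as the garm config file
apiVersion: v1
kind: ConfigMap
metadata:
  name: garm-server-config
data:
  config.toml: |
    [default]
    callback_url = "https://garm.example.com/api/v1/callbacks"
    metadata_url = "https://garm.example.com/api/v1/metadata"

    [[provider]]
    name = "kubernetes"
    description = "kubernetes external provider"
    provider_type = "external"
    [provider.external]
    provider_executable = "/opt/garm/providers.d/garm-provider-k8s"
    config_file = "/etc/garm/provider-config/kubernetes/config.yaml"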

This is just a rough idea so far. What do you think?

@bavarianbidi (Member Author) commented Jul 22, 2024

I'm not sure if I got the entire idea 🙈

The first few bullet-points in your list are awesome - especially as you probably know best how it should be 😅

Do you want to go even one step further and manage the garm-server Deployment object (incl. Service, ...) as well, or is it more like "generating and creating/updating the configmap in k8s"?
If it's the latter, the "state" of the garm-server Deployment would be managed by two dedicated controllers/components:

  • configmap (and image, if I got your working branch right) are managed by the operator
  • everything else which belongs to the garm-server belongs to another tool (e.g. flux/argocd or a semi-automatic github action deploy workflow)

@gabriel-samfira

The controller would manage a StatefulSet which is the garm-server itself. Not sure if it's worth having separate controllers for config map management. I think we could fit that into the reconciliation loop of the garm server controller.

The reconciliation loop would basically use the CRD settings to compose the config map, save the config map, and update the stateful set. If we add CRs for credentials and providers, we just add those two types to the loop as well: list all creds/providers, get the CRD settings, compose the config map, update the stateful set (see the sketch at the end of this comment).

The controller can watch for changes to credentials, providers and the garm server CRD (which it owns) and react accordingly.

For credentials and providers we need separate controllers/webhooks.

I think the best way to go about it would be to create a WiP branch and discuss it once it's ready. Doesn't seem too difficult to write a controller using kubebuilder. An initial draft would be without tests and not necessarily correct, but it would give everyone a view of the rough idea.
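To make the ownership part a bit more tangible, here is a minimal sketch of the stateful set such a controller could create, consuming the config map from the earlier sketch. Names, labels and the owner reference to the hypothetical GarmServer CR are illustrative; probes, storage etc. are left out:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: garm-server
  ownerReferences:
    # set by the controller (e.g. via controllerutil.SetControllerReference),
    # so events on the stateful set can be mapped back to the owning GarmServer CR
    - apiVersion: garm-operator.mercedes-benz.com/v1alpha1
      kind: GarmServer
      name: garm
      uid: ""          # filled in by the controller at runtime
      controller: true
spec:
  serviceName: garm-server
  replicas: 1
  selector:
    matchLabels:
      app: garm-server
  template:
    metadata:
      labels:
        app: garm-server
    spec:
      containers:
        - name: garm
          image: my-registry.example.com/garm:v0.1.5   # illustrative
          volumeMounts:
            - name: garm-config
              mountPath: /etc/garm
              readOnly: true
      volumes:
        - name: garm-config
          configMap:
            name: garm-server-config   # composed by the reconciliation loop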

@bavarianbidi (Member Author)

Hey Gabriel, first of all - thanks for the detailed explanation and all the work you already put into garm ❤️

We can also jump into a dedicated call and take some time to talk about garm, garm-operator and other related stuff, to get a better feeling for what makes sense from different points of view.

Just some thoughts from our side which came to our mind:
Your desired approach feels very similar to the prometheus-operator implementation.
The entire configuration of the prometheus stack (different parts of the application configuration + parts of the pod.spec) is done in multiple CRs. This makes offering Prometheus-as-a-Service super handy.

Applying this similarity to your approach means it would be possible to offer Garm-as-a-Service very easily (which is awesome).

Pros:

  • offering Garm-as-a-Service
  • Getting a fully opinionated garm-stack becomes much easier

Cons:

  • reflecting the entire pod-spec (below is a compressed version of the Deployment manifest from our helm-chart) must be implemented (not that it's impossible, but it's a lot of work to do and to maintain)
  • how to deal with additional resources (e.g. netpol, pdb, ...): in our case we need networkpolicies besides the garm application, so we would still need an additional way to deploy resources which belong to the managed garm-instance (adding this to the operator would also increase complexity and probably end up re-implementing existing solutions - like helm-chart support, ...)
  • high complexity if the operator "just" manages one garm-instance (please note: the current operator doesn't support multiple garm-instances yet)
  • maintainability - as we do not have that large a community of code contributors, it might become hard to maintain this

Our preferred approach is still the following:

  • everything which must be configured on the garm side via the API (e.g. github configuration, ...) will become a CRD which the operator takes care of
  • garm-server is deployed/configured/managed outside the operator (if wanted, we can try to contribute our existing helm chart back to cloudbase/garm)
deployment-manifest

e.g. we have to switch the pod's dnsPolicy and dnsConfig if we have to interact with azure:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "garm.fullname" . }}
  labels:
    {{- include "garm.labels" . | nindent 4 }}
spec:
  replicas: 1
  strategy:
    {{- .Values.deployment.strategy | toYaml | nindent 4 }}
  selector:
    matchLabels:
      {{- include "garm.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "garm.selectorLabels" . | nindent 8 }}
      annotations:
        {{- if .Values.rx.enabled  }}
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "{{ .Values.service.port }}"
        prometheus.io/rx-scrape-every-1m: "true" #default
        {{- end }}
    spec:
      serviceAccountName: {{ .Values.serviceAccount.name }}
      securityContext:
        runAsUser: {{ .Values.securityContext.runAsUser }}
        fsGroup: {{ .Values.securityContext.fsGroup }}
      {{- if .Values.garm.provider.azure.enabled }}
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 8.8.8.8
      {{- end }}
      containers:
        - name: garm
          image: "{{ .Values.image.registry }}/{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          command: ["/opt/garm"]
          imagePullPolicy: {{ .Values.image.imagePullPolicy }}
          ports:
            - name: garm
              containerPort: 9997
              protocol: TCP
          livenessProbe:
            tcpSocket:
              port: 9997
            failureThreshold: 4
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2
          readinessProbe:
            tcpSocket:
              port: 9997
            failureThreshold: 4
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 2
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: {{ include "garm.fullname" . }}-garm-config-volume
              mountPath: /etc/garm
              readOnly: true
            {{- if .Values.garm.provider.openstack.enabled }}
            - name: {{ include "garm.fullname" . }}-openstack-provider-config-volume
              mountPath: /etc/garm/provider-config/openstack
              readOnly: true
            {{- end }}
            {{- if .Values.garm.provider.azure.enabled }}
            - name: {{ include "garm.fullname" . }}-azure-provider-config-volume
              mountPath: /etc/garm/provider-config/azure
              readOnly: true
            {{- end }}
            {{- if .Values.garm.provider.k8s.enabled }}
            - name: {{ include "garm.fullname" . }}-kubernetes-provider-config-volume
              mountPath: /etc/garm/provider-config/kubernetes
              readOnly: true
            {{- end }}
            - name: {{ include "garm.fullname" . }}-database
              mountPath: "/etc/garm/database"
            - name: serviceaccount-volume
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        {{- if .Values.garm.provider.azure.enabled }}
        - name: transparent-proxy
          image: internal-transparent-proxy:v0.1.12-nonroot
          command: ["/bin/bash", "-c"]
          args:
            - |
              cat << EOF > /tmp/transparent-proxy.yaml
              port: 3128
              disable_sudo: true
              debug: true
              proxies:
              - addresses:
                - proxy-server.company.io:3128
              EOF
              /app/cmd/transparent-proxy.binary -config-file /tmp/transparent-proxy.yaml 2>&1 | tee /tmp/logs/proxy-log.txt &
              CHILD_PID=$!
              trap "kill -SIGTERM $CHILD_PID" SIGTERM
              wait $CHILD_PID
          env:
            - name: USER
              value: "{{ .Values.securityContext.runAsUser }}"
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
          resources:
            requests:
              cpu: 10m
              memory: 10Mi
            limits:
              memory: 50Mi
        {{- end }}
        {{- if .Values.queryExporter.enabled }}
        - name: query-exporter
          image: "{{ .Values.queryExporter.image.registry }}/{{ .Values.queryExporter.image.repository }}:{{ .Values.queryExporter.image.tag }}"
          command: ["query-exporter"]
          args:
            - "/etc/query-exporter/config.yaml"
            - "-H"
            - "0.0.0.0"
          imagePullPolicy: {{ .Values.queryExporter.image.imagePullPolicy }}
          ports:
            - name: query-exporter
              containerPort: {{ .Values.queryExporter.service.port }}
              protocol: TCP
          resources:
            {{- toYaml .Values.queryExporter.resources | nindent 12 }}
          volumeMounts:
            - name: query-exporter-config-volume
              mountPath: /etc/query-exporter
            - name: {{ include "garm.fullname" . }}-database
              mountPath: "/etc/garm/database"
        {{- end }}
      volumes:
        {{- if .Values.queryExporter.enabled }}
        - name: query-exporter-config-volume
          projected:
            sources:
              - configMap:
                  name: {{ include "garm.fullname" . }}-query-exporter-config
        {{- end}}
        - name: {{ include "garm.fullname" . }}-garm-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-garm-config
        {{- if .Values.garm.provider.openstack.enabled }}
        - name: {{ include "garm.fullname" . }}-openstack-provider-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-openstack-provider-cloud-config
              - secret:
                  name: {{ include "garm.fullname" . }}-openstack-provider-config
        {{- end }}
        {{- if .Values.garm.provider.azure.enabled }}
        - name: {{ include "garm.fullname" . }}-azure-provider-config-volume
          projected:
            sources:
              - secret:
                  name: {{ include "garm.fullname" . }}-azure-provider-config
        {{- end }}
        {{- if .Values.garm.provider.k8s.enabled }}
        - name: {{ include "garm.fullname" . }}-kubernetes-provider-config-volume
          projected:
            sources:
              {{- range $runner_cluster := .Values.garm.provider.k8s.runnerClusters }}
              {{- if ne $runner_cluster.name "local"}}
              - secret:
                  name: {{ include "garm.fullname" $ }}-kubernetes-provider-kubeconfig
              {{- end }}
              - configMap:
                  name: {{ include "garm.fullname" $ }}-kubernetes-provider-config
              {{- end }}
        {{- end }}
        - name: {{ include "garm.fullname" . }}-database
          persistentVolumeClaim:
            claimName: {{ include "garm.fullname" . }}
        # use a custom token with specified expiration time - this implementation behaves like the default token mechanism
        # see: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume
        - name: serviceaccount-volume
          projected:
            sources:
            - serviceAccountToken:
                path: token
                expirationSeconds: 600 # don't set it to exactly 3607, otherwise the token will be valid for 1 year
            - configMap:
                items:
                  - key: ca.crt
                    path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                  - fieldRef:
                      apiVersion: v1
                      fieldPath: metadata.namespace
                    path: namespace

@gabriel-samfira commented Jul 23, 2024

You're right. That makes a lot of sense. A proper helm chart is definitely preferable to reinventing the wheel. In that case, we should be fine with just implementing the resources that were moved to the DB (gh endpoints and credentials - potentially providers in the future), and potentially offer a helm chart that folks can use.

It's so easy to overcomplicate things. Thanks for the reality check!

Edit: It makes sense to have any helm chart live in the operator project. GARM should just do what it does. At most it should worry about packaging (deb, rpm, snap, flatpak, plain binary releases, etc). But automation should be separate.
