Fixes #85: Write an ADR-010 about the infrastructure #86

Merged
merged 50 commits into from
May 1, 2024
Changes from 36 commits
4956011
Create 010-infrastruture.md
ThomasCardin Dec 15, 2023
ae01252
Create 010-infrastructure.fr-ca.md
ThomasCardin Jan 3, 2024
28941ec
Update and rename 010-infrastruture.md to 010-infrastruture.en-ca.md
ThomasCardin Jan 3, 2024
c423fdd
issue #85: Alternatives Considered (template)
ThomasCardin Jan 3, 2024
60fc769
resolved comments, added formating rules consistent with organisation
SonOfLope Mar 4, 2024
3b7a666
Issue #85: removed double
SonOfLope Mar 4, 2024
1fe591c
Merge branch 'main' into 85-write-an-adr-about-the-infrastructure
SonOfLope Mar 4, 2024
7f0de8f
Issue #85: Resolving merge conflicts insettings.json
SonOfLope Mar 4, 2024
ddd1c93
workflow path cleanup
SonOfLope Mar 4, 2024
3da9d35
Issue #85: Update documentation with comparison details and adopted s…
SonOfLope Mar 19, 2024
357af54
issue #122: gitops adr (fr)
ThomasCardin Apr 4, 2024
8095b18
Issue #85: Update workflow
SonOfLope Apr 8, 2024
910a1e0
Add ADR 013 completed and template for 012
SonOfLope Apr 8, 2024
652257b
adds secret management ADR
SonOfLope Apr 8, 2024
08bbfea
update secret management with reference to our architecture doc
SonOfLope Apr 8, 2024
b116468
issue #85: adr on container and container orchestration
ThomasCardin Apr 10, 2024
435257f
Issue #85: Vouch-proxy ADR
SonOfLope Apr 12, 2024
511c5a4
Issue #85: Fix markdown
SonOfLope Apr 12, 2024
c66445c
issue #85: adr containers
ThomasCardin Apr 12, 2024
4368f9f
issue #85: networking adr
ThomasCardin Apr 15, 2024
62d4d56
issue #85: security adr (wip)
ThomasCardin Apr 17, 2024
b5211bb
issue #85: markdown linting fix for security adr
ThomasCardin Apr 17, 2024
070adb3
issue #85: en-ca adr for gitops, security, container and networking
ThomasCardin Apr 18, 2024
4deb6d7
Issue #85: Finalize ADR 10
SonOfLope Apr 18, 2024
596342b
Merge remote-tracking branch 'origin/main' into 85-write-an-adr-about…
ThomasCardin Apr 18, 2024
3ddda9f
Merge branch '85-write-an-adr-about-the-infrastructure' of https://gi…
ThomasCardin Apr 18, 2024
dc40e0a
Issue #85: translate to EN remaining ADR's
SonOfLope Apr 18, 2024
10ed080
issue #85: redundancy
ThomasCardin Apr 18, 2024
d20c380
Merge branch '85-write-an-adr-about-the-infrastructure' of https://gi…
ThomasCardin Apr 18, 2024
36d6e37
issue #85: redundancy (en)
ThomasCardin Apr 18, 2024
a2e0374
Update adr/013-IaC-tool.fr-ca.md
SonOfLope Apr 22, 2024
514b8ed
Update adr/013-IaC-tool.fr-ca.md
SonOfLope Apr 22, 2024
9378ccb
Issue #85: typo containers ADR
SonOfLope Apr 22, 2024
09453de
Update adr/015-authentication-management.fr-ca.md
SonOfLope Apr 22, 2024
8b19873
Issue #85: Add editorConfig and clean up ci
SonOfLope Apr 22, 2024
854f342
issue #85: typo
ThomasCardin Apr 22, 2024
e147c79
issue #85: fixed repo-standard errors
ThomasCardin May 1, 2024
45d2376
issue #85: added TESTING.md
ThomasCardin May 1, 2024
c569cfa
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
648ecee
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
08089d4
issue #85: markdown errors, line-length
ThomasCardin May 1, 2024
5ad3d05
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
0bf68f5
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
437dbba
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
52ee2e2
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
9ed04cf
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
b49170e
issue #85: dead links
ThomasCardin May 1, 2024
9f2470c
Merge remote-tracking branch 'origin/main' into 85-write-an-adr-about…
ThomasCardin May 1, 2024
3f4ddab
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
fdc4276
issue #85: fixed some markdown linting errors
ThomasCardin May 1, 2024
13 changes: 13 additions & 0 deletions .editorconfig
@@ -0,0 +1,13 @@
# top-most EditorConfig file
root = true

# Unix-style newlines with a newline ending every file
[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true

# Exclude binary files
[*.svg]
insert_final_newline = false
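
The rules above (UTF-8, LF line endings, a final newline, no trailing whitespace) can be checked mechanically. The following is a minimal sketch, not the EditorConfig tooling itself: the function name `editorconfig_violations` and its `require_final_newline` flag are assumptions made for illustration, with the flag set to `False` mirroring the `[*.svg]` override.

```python
from typing import List


def editorconfig_violations(data: bytes, require_final_newline: bool = True) -> List[str]:
    """Return human-readable violations of the rules above for one file's bytes."""
    problems: List[str] = []
    try:
        text = data.decode("utf-8")  # charset = utf-8
    except UnicodeDecodeError:
        return ["charset: not valid utf-8"]
    if "\r" in text:  # end_of_line = lf
        problems.append("end_of_line: found CR, expected lf")
    if require_final_newline and text and not text.endswith("\n"):
        problems.append("insert_final_newline: missing final newline")
    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.rstrip("\r")  # CR is already reported above
        if stripped != stripped.rstrip(" \t"):  # trim_trailing_whitespace = true
            problems.append(f"trim_trailing_whitespace: line {number}")
    return problems
```

Called on the bytes of each changed file, an empty list means the file complies with the rules as sketched here.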
8 changes: 0 additions & 8 deletions .github/workflows/markdown-testing.yml

This file was deleted.

21 changes: 21 additions & 0 deletions .github/workflows/workflow.yml
@@ -0,0 +1,21 @@
name: Dev-Rel-Docs integration workflow

on:
  pull_request:
    types:
      - opened
      - closed
      - synchronize

jobs:
  lint-test:
    uses: ai-cfia/github-workflows/.github/workflows/workflow-lint-test-python.yml@main
    secrets: inherit

  markdown-check:
    uses: ai-cfia/github-workflows/.github/workflows/workflow-markdown-check.yml@main
    secrets: inherit

  repo-standard:
    uses: ai-cfia/github-workflows/.github/workflows/workflow-repo-standards-validation.yml@main
    secrets: inherit
2 changes: 1 addition & 1 deletion .vscode/settings.json
ibrahim-kabir marked this conversation as resolved.
@@ -3,4 +3,4 @@
"files.trimTrailingWhitespace": true,
"files.trimFinalNewlines": true,
"files.insertFinalNewline": true
}
}
132 changes: 132 additions & 0 deletions adr/010-infrastructure.en-ca.md
@@ -0,0 +1,132 @@
# ADR-010: Infrastructure

## Executive Summary

In an effort to optimize and secure our infrastructure operations, our
organization has adopted a strategy based on Infrastructure as Code (IaC) using
Terraform, complemented by the deployment of a Kubernetes cluster on Azure. This
approach allows us to overcome the limitations associated with traditional
methods such as ClickOps and manual deployments, which were both time-consuming
and error-prone. The adoption of HashiCorp Vault for centralized secret
management and ArgoCD for deployment orchestration strengthens our security and
agility posture. By integrating advanced monitoring solutions and considering
the use of technologies like OpenTelemetry for enhanced observability, we aim to
maintain high availability and performance of our services. This transformation
allows for more robust and automated infrastructure management, reduces the
risks of human error, and provides increased flexibility and portability across
different cloud environments. Our initiative aligns infrastructure management
with our operational goals while ensuring enhanced scalability and security to
meet future needs.

## Context

Our team faces challenges in deploying solutions, especially in choosing cloud
providers. Initially, we used [Google Cloud
Run](https://cloud.google.com/run/?hl=en) and [Azure App
Service](https://azure.microsoft.com/en-ca/products/app-service/). However, due
to the absence of a Google Cloud account and access restrictions on Azure, we
find ourselves switching from one account to another, resulting in significant
downtime for our applications.

Moreover, the manual creation of all services on cloud providers via ClickOps
proved tedious. To overcome this challenge, we decided to adopt Infrastructure
as Code (IaC) using Terraform. This approach allows us to manage and provision
our cloud infrastructures via codified configuration files, thus eliminating the
need for ClickOps and significantly reducing human errors.

Regarding security, we initially adopted [Azure Key
Vault](https://azure.microsoft.com/en-us/products/key-vault/) for the manual
retrieval of environment variable values. However, recognizing the need for a
more robust and versatile solution for secret management, we have evolved
towards maintaining a HashiCorp Vault instance. This transition enables
centralized management of secrets and credentials across different environments
and platforms.

Scaling our applications is not currently a priority, because the number of
users is fixed and known in advance. Consequently, we have not yet implemented
a scaling solution.

For monitoring and telemetry, we currently rely exclusively on the built-in
tools of cloud providers, such as those from Google Cloud Run. However, it is
important to consider the flexibility and portability that external services
such as [OpenTelemetry](https://opentelemetry.io/) can offer. These solutions
can not only adapt to various cloud environments but also provide extensive
customization specifically tailored to our needs. Although in-house solutions
may seem demanding in terms of maintenance, they allow us to optimize our
monitoring and telemetry in a targeted way, thus offering a more precise
alignment with our operational goals.

In short, many tasks are currently performed manually. Although we have a GitHub
workflow for deploying Docker images, the management of deployments across
different cloud providers is not automated. In the event of a production error,
no solution allows developers to quickly resolve the issue.

## Use Cases

- Manage the PostgreSQL database (and soon PostgreSQL ML) without resorting to
ClickOps.
- Increase data redundancy more effectively.
- Deploy, manage, monitor, and instrument applications within the organization.
- Improve secret management.
- Eliminate silos between the security team and the DevOps team within the
organization.
- Implement deployments across all cloud providers in case of outages. This
includes data persistence across different cloud providers.
- Manage a centralized SSO solution to authenticate users of hosted services.
- Use Infrastructure as Code to automate the creation, deployment, and
management of infrastructure, enabling faster infrastructure operations while
reducing manual errors.
- Automate scaling (HPA).
- Adopt a backup and disaster recovery strategy.
- Create documentation that is easy to read and adapt to enable a "shift-left"
transition (Early and thorough integration of testing, security, and quality
assurance at the beginning of the software development cycle, for earlier
identification and resolution of anomalies).
- Avoid single points of failure.
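
As a rough illustration of the "Automate scaling (HPA)" use case, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The function name, parameter names and default bounds are assumptions for illustration only; the real controller also applies a tolerance and stabilization behaviour that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Compute the HPA-style replica count, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the current replica count by the ratio of observed to target metric.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Respect the configured lower and upper bounds.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas averaging 90% CPU against a 60% target would be scaled to six, while the same four replicas at 30% would be scaled down to two.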

## Decision

Our solution will consist of deploying Kubernetes clusters on various cloud
providers. Here are the components that will be deployed to manage various use
cases:

- [Container management and deployment: Kubernetes](014-containers.fr-ca.md)
- [Secret management: HashiCorp Vault](012-secret-management.fr-ca.md)
- [Deployment management: ArgoCD](011-gitops.fr-ca.md)
- [Infrastructure as Code (IaC) management: Terraform](013-IaC-tool.fr-ca.md)
- Development environment management: AzureML (coming soon)
- [User authentication management:
Vouch-proxy](015-authentication-management.fr-ca.md)
- Observability management: Grafana, Prometheus, OpenTelemetry, and OneUptime
(coming soon)
- [Load balancing management: Ingress NGINX](016-networking.fr-ca.md)
- [Security management: Trivy and Falco](017-security.fr-ca.md)
- Redundancy management: Istio / Cluster mesh (coming soon)

Additional components will be added as needed.

## Consequences

The transition to Kubernetes-based infrastructure management and Terraform,
combined with the use of robust solutions for secret management (HashiCorp
Vault) and deployment (ArgoCD), marks significant progress towards full
automation and increased security of our cloud environment.

This approach minimizes manual interventions and error risks while enhancing
security at every stage of application deployment. Using open-source tools
promotes greater transparency, adaptability to multiple environments, and easier
integration with various ecosystems. Furthermore, adopting GitOps practices,
notably through Terraform and ArgoCD, improves the traceability and
reversibility of changes made to the infrastructure, essential for configuration
management and security compliance. These changes support our ability to scale
quickly and reliably while maintaining strict control over data security and
user authentication through Vouch-proxy and integrating solutions such as NGINX
Ingress for access management. However, this evolution requires ongoing skill
development of our teams and sustained attention to updates and maintenance of
these technologies to ensure their effectiveness and security over the long
term.

## References

- [Howard Repository - Contains the configuration of our infrastructure along
with documentation](https://github.com/ai-cfia/howard)
145 changes: 145 additions & 0 deletions adr/010-infrastructure.fr-ca.md
@@ -0,0 +1,145 @@
# ADR-010 : Infrastructure

## Résumé Exécutif

Dans un effort d'optimisation et de sécurisation de nos opérations
d'infrastructure, notre organisation a adopté une stratégie basée sur
l'Infrastructure as Code (IaC) en utilisant Terraform, accompagnée par le
déploiement d'un cluster Kubernetes sur Azure. Cette approche nous permet de
surmonter les limitations associées aux méthodes traditionnelles telles que
ClickOps et les déploiements manuels, qui étaient à la fois chronophages et
susceptibles d'erreur. L'adoption de HashiCorp Vault pour la gestion centralisée
des secrets et d'ArgoCD pour l'orchestration des déploiements renforce notre
posture de sécurité et d'agilité. En intégrant des solutions de monitoring
avancées et en envisageant l'utilisation de technologies comme OpenTelemetry
pour une observabilité accrue, nous visons à maintenir une haute disponibilité
et performance de nos services. Cette transformation permet une gestion plus
robuste et automatisée de l'infrastructure, réduit les risques d'erreur humaine
et offre une flexibilité et une portabilité accrues à travers différents
environnements cloud. Notre initiative aligne la gestion des infrastructures
avec nos objectifs opérationnels tout en assurant une évolutivité et une
sécurité renforcées pour répondre aux besoins futurs.

## Contexte

Notre équipe fait face à des défis en matière de déploiement de solutions,
notamment dans le choix des fournisseurs d'infonuages. Initialement, nous
utilisions [Google Cloud Run](https://cloud.google.com/run/?hl=en) et [Azure App
Service](https://azure.microsoft.com/en-ca/products/app-service/). Cependant, en
raison de l'absence de compte Google Cloud et des restrictions d'accès sur
Azure, nous nous retrouvons à basculer d'un compte à l'autre, entraînant
d'importants temps d'arrêt pour nos applications.

De plus, la création manuelle de tous les services sur les fournisseurs de cloud
via le ClickOps s'est avérée fastidieuse. Pour surmonter ce défi, nous avons
décidé d'adopter l'Infrastructure as Code (IaC) en utilisant Terraform. Cette
approche nous permet de gérer et de provisionner nos infrastructures cloud via
des fichiers de configuration codifiés, éliminant ainsi le besoin de ClickOps et
réduisant significativement les erreurs humaines.

En ce qui concerne la sécurité, nous avions initialement adopté [Azure Key
Vault](https://azure.microsoft.com/en-us/products/key-vault/) pour la
récupération manuelle des valeurs des variables d'environnement. Cependant,
reconnaissant la nécessité d'une solution plus robuste et polyvalente pour la
gestion des secrets, nous avons évolué vers le maintien d'une instance de
HashiCorp Vault. Cette transition permet une gestion centralisée des secrets et
des identifiants à travers différents environnements et plateformes.

La mise à l'échelle de nos applications n'est pas actuellement une priorité, car
nous avons une visibilité fixe sur le nombre d'utilisateurs. Par conséquent,
nous n'avons pas encore mis en œuvre de solution de mise à l'échelle.

Actuellement, pour le monitoring et la télémétrie, nous nous appuyons
exclusivement sur les outils intégrés des fournisseurs de cloud, comme ceux de
Google Cloud Run. Cependant, il est important de considérer la flexibilité et la
portabilité que peuvent offrir des services externes tels
qu'[OpenTelemetry](https://opentelemetry.io/). Ces solutions peuvent non
seulement s'adapter à divers environnements de cloud mais aussi offrir une
personnalisation poussée qui répond spécifiquement à nos besoins. Bien que les
solutions maison puissent sembler exigeantes en termes de maintenance, elles
nous permettent d'optimiser notre surveillance et notre télémétrie de manière
ciblée, offrant ainsi un potentiel d'alignement plus précis avec nos objectifs
opérationnels.

Bref, de nombreuses tâches sont actuellement effectuées manuellement. Bien que
nous disposions d'un workflow GitHub pour déployer des images Docker, la gestion
des déploiements sur différents fournisseurs d'infonuages n'est pas automatisée.
En cas d'erreur en production, aucune solution ne permet aux développeurs de
résoudre rapidement le problème.

## Cas d'utilisation

- Gérer la base de données PostgreSQL (et bientôt PostgreSQL ML) sans recourir
au ClickOps.
- Accroître la redondance des données de manière plus efficace.
- Déployer, gérer, surveiller et instrumenter les applications au sein de
l'organisation.
- Améliorer la gestion des secrets.
- Éliminer les silos entre l'équipe de sécurité et l'équipe DevOps au sein de
l'organisation.
- Mettre en place des déploiements sur tous les fournisseurs de cloud en cas de
pannes. Cela inclut la persistance des données chez les différents
fournisseurs d'infonuages.
- Gérer une solution SSO centralisée pour authentifier les utilisateurs des
services hébergés.
- Utiliser l'Infrastructure as Code pour automatiser la création, le
déploiement, et la gestion de l'infrastructure permettant la rapidité des
opérations d'infrastructure tout en réduisant les erreurs manuelles.
- Automatiser la mise à l'échelle (HPA).
- Adopter une stratégie de sauvegarde et de reprise après sinistre.
- Créer une documentation facile à lire et à adapter pour permettre une
transition "shift-left" (Intégration anticipée et approfondie des tests, de la
sécurité et de l'assurance qualité au début du cycle de développement
logiciel, pour une identification et résolution plus précoces des anomalies).
- Éviter les points de défaillance uniques.

## Décision

Notre solution consistera à déployer des clusters Kubernetes sur différents
fournisseurs de cloud. Voici les composants qui seront déployés pour gérer
divers cas d'utilisation :

- [Gestion des conteneurs et leur déploiement:
Kubernetes](014-containers.fr-ca.md)
- [Gestion des secrets: HashiCorp Vault](012-secret-management.fr-ca.md)
- [Gestion des déploiements: ArgoCD](011-gitops.fr-ca.md)
- [Gestion de l'Infrastructure as Code (IaC): Terraform](013-IaC-tool.fr-ca.md)
- Gestion des environnements de développement: AzureML (à venir)
- [Gestion d'authentification des utilisateurs:
Vouch-proxy](015-authentication-management.fr-ca.md)
- Gestion de l'observabilité: Grafana, Prometheus, OpenTelemetry et OneUptime
(à venir)
- [Gestion du load balancing: Ingress NGINX](016-networking.fr-ca.md)
- [Gestion de la sécurité: Trivy et Falco](017-security.fr-ca.md)
- Gestion de la redondance: Istio / Cluster mesh (à venir)

D'autres composants seront ajoutés au besoin.

## Conséquences

La transition vers une gestion d'infrastructure basée sur Kubernetes et
Terraform, combinée à l'utilisation de solutions robustes pour la gestion des
secrets (HashiCorp Vault) et des déploiements (ArgoCD), marque un progrès
significatif vers une automatisation complète et une sécurisation accrue de
notre environnement cloud.

Cette approche permet de minimiser les interventions manuelles et les risques
d'erreur, tout en renforçant la sécurité à chaque étape du déploiement des
applications. En utilisant des outils open source, nous favorisons une plus
grande transparence, une adaptabilité aux environnements multiples et une
intégration plus aisée avec divers écosystèmes. De plus, l'adoption de pratiques
GitOps, notamment à travers Terraform et ArgoCD, améliore la traçabilité et la
réversibilité des changements apportés à l'infrastructure, essentielles pour la
gestion des configurations et la conformité sécuritaire. Ces changements
soutiennent notre capacité à évoluer rapidement et de manière fiable, tout en
maintenant un contrôle rigoureux sur la sécurité des données et
l'authentification des utilisateurs à travers Vouch-proxy et l'intégration de
solutions telles que NGINX Ingress pour la gestion de l'accès. Cependant, cette
évolution nécessite une montée en compétence continue de nos équipes et une
attention soutenue aux mises à jour et à l'entretien de ces technologies pour
garantir leur efficacité et leur sécurité à long terme.

## Références

- [Dépôt Howard - Contient la configuration de notre infrastructure
accompagnée de documentation](https://github.com/ai-cfia/howard)
51 changes: 51 additions & 0 deletions adr/011-gitops.en-ca.md
@@ -0,0 +1,51 @@
# ADR-011: GitOps

## Introduction

This document outlines the decision to use ArgoCD as the
continuous deployment tool for our Kubernetes applications.

## Background

Before implementing ArgoCD, the process of addressing production issues was
manual and time-consuming. A developer had to report an issue to a DevSecOps
engineer, which could result in a waiting period before the issue was resolved.

## Use Cases

- Developers can deploy and test their changes without waiting for
DevSecOps intervention.

- Developers can identify and resolve production issues more quickly.

- Development and operations teams can work more closely together.

## Decision

We will use ArgoCD as the continuous deployment tool for our Kubernetes
applications. The team has already had positive experiences with it.
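
To make the GitOps model behind this decision concrete: a tool such as ArgoCD compares the desired state declared in Git with the live state of the cluster and works out what must change to converge them. The sketch below is a toy illustration of that comparison only. The function name `plan_sync`, the dictionary shape (resource name mapped to a version string) and the action labels are assumptions, not ArgoCD's actual API or data model.

```python
from typing import Dict, List, Tuple


def plan_sync(desired: Dict[str, str], live: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (action, resource) pairs needed to make `live` match `desired`."""
    actions: List[Tuple[str, str]] = []
    # Resources declared in Git: create them if missing, update them if drifted.
    for name in sorted(desired):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != desired[name]:
            actions.append(("update", name))
    # Resources running in the cluster but no longer declared in Git.
    for name in sorted(live):
        if name not in desired:
            actions.append(("prune", name))
    return actions
```

An empty plan means the cluster is in sync with Git. Because every change goes through a commit, reverting the commit yields a plan that rolls the cluster back.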

## Considered Alternatives

### Flux

Advantages:

- Easy to set up

Disadvantages:

- No user interface

## Consequences

- Developers will be able to deploy and test their changes more quickly.

- Production issues can be resolved more swiftly.

- Development and operations teams can work more closely together.

## References

- [ArgoCD ACIA/CFIA URL](https://argocd.inspection.alpha.canada.ca/)
- [Document on Secret Management](
https://github.com/ai-cfia/howard/blob/main/docs/secrets-management.md)