From 4c8731f64523e78cc1df6841c12f59fffe935597 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Guilhem=20Barth=C3=A9s?= Date: Wed, 25 Oct 2023 15:11:41 +0200 Subject: [PATCH] feat!: decoupled builder (#756) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * feat: decouple image builder from worker Signed-off-by: SdgJlbl * fix: update skaffold config Signed-off-by: Guilhem Barthes * feat: add `ServiceAccount` and modify role Signed-off-by: Guilhem Barthes * fix: improve `wait_for_image_built` Signed-off-by: Guilhem Barthes * feat: build image in new pod Signed-off-by: Guilhem Barthes * chore: rename `deployment-builder.yaml` to `stateful-builder.yaml` Signed-off-by: Guilhem Barthes * chore: rename `stateful-builder.yaml` to `statefulset-builder.yaml` Signed-off-by: Guilhem Barthes * chore: centralize params Signed-off-by: Guilhem Barthes * feat: create `BuildTask` Signed-off-by: Guilhem Barthes * feat: move some values to `builder` module Signed-off-by: Guilhem Barthes * feat: move more code to `builder` Signed-off-by: Guilhem Barthes * fix: remove TaskProfiling as Celery task + save Entrypoint in DB Signed-off-by: SdgJlbl * fix: extract entrypoint from registry Signed-off-by: SdgJlbl * fix: make doc for helm chart Signed-off-by: SdgJlbl * feat: build function at registration (#707) - [ ] [changelog](../CHANGELOG.md) was updated with notable changes - [ ] documentation was updated --------- Signed-off-by: SdgJlbl Signed-off-by: Guilhem Barthes Co-authored-by: SdgJlbl * feat: share images between backends (#708) Signed-off-by: SdgJlbl * chore: update helm workflow Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com> * chore: add .DS_Store to gitignore Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com> * chore: rm DS_Store Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com> * chore: rm .DS_Store Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com> * [sub]fix: add missing migration poc (#728) ## Description Add a migration missing in the POC. This migration alters two things: - modify `ComputeTaskFailureReport.logs` - modify `FunctionImage.file` This migration has been generated automatically with `make migrations` ## How has this been tested? ## Checklist - [ ] [changelog](../CHANGELOG.md) was updated with notable changes - [ ] documentation was updated Signed-off-by: Guilhem Barthes * [sub]feat: add function events (#714) - https://github.com/Substra/orchestrator/pull/263 Add function events, used now that the building of the function is decoupled from the execution of the compute task. For that, it adds a status field on the Function. It also includes another PR (merged here) to get function build logs working again. In a future PR, we will change the compute task execution to avoid having to call wait_for_function_built in compute_task() Fixes FL-1160 As this is going to be merged into a branch that will itself be merged into a POC branch, we use MNIST as a baseline of a working model. We will deal with failing tests on the POC before merging on main.
- [x] [changelog](../CHANGELOG.md) was updated with notable changes - [ ] documentation was updated --------- Signed-off-by: SdgJlbl Signed-off-by: Guilhem Barthes Signed-off-by: Guilhem Barthés Co-authored-by: SdgJlbl * [sub]fix(app/orchestrator/resources): FunctionStatus.FUNCTION_STATUS_CREATED -> FunctionStatus.FUNCTION_STATUS_WAITING (#742) # Issue Backend FunctionStatus values are not aligned with the [orchestrator definitions](https://github.com/Substra/orchestrator/blob/poc-decoupled-builder/lib/asset/function.proto#L29-L36). In particular, `FunctionStatus.FUNCTION_STATUS_CREATED` leads to the following error: ```txt ValueError: 'FUNCTION_STATUS_WAITING' is not a valid FunctionStatus ``` ## Description FunctionStatus.FUNCTION_STATUS_CREATED -> FunctionStatus.FUNCTION_STATUS_WAITING ## How has this been tested? Running the Camelyon benchmark on the [poc-builder-flpc](https://substra.org-1.poc-builder-flpc.cg.owkin.tech/compute_plans/a420306f-5719-412b-ab9c-688b7bed9c70/tasks?page=1&ordering=-rank) environment. ## Checklist - [ ] [changelog](../CHANGELOG.md) was updated with notable changes - [ ] documentation was updated --------- Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> * fix: builder using builder SA (#754) * fix: builder using builder SA Signed-off-by: Guilhem Barthés * docs: changelog Signed-off-by: Guilhem Barthés --------- Signed-off-by: Guilhem Barthés * fix: rebase changelog Signed-off-by: Guilhem Barthés * fix: adapt to pydantic 2.x.x (#758) Signed-off-by: Guilhem Barthés * [sub]fix(backend/image_transfer/encoder): update pydantic method (#763) * fix(backend/image_transfer/encoder): update pydantic method Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> * fix(backend/image_transfer/decoder): parse_raw -> model_validate_json Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> --------- Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> * [sub]chore: upgrade chart (#765) * chore(charts): bump chart version Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> * chore(charts/substra-backend/CHANGELOG): bring back unreleased section Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> --------- Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> * fix: post-rebase Signed-off-by: SdgJlbl * chore: rationalize migrations Signed-off-by: SdgJlbl * [sub]chore(builder): waitPostgresqlInitContainer (#764) * fix: builder using builder SA (#754) * fix: builder using builder SA Signed-off-by: Guilhem Barthés * docs: changelog Signed-off-by: Guilhem Barthés --------- Signed-off-by: Guilhem Barthés * chore(charts/substra-backend/templates/statefulset-builder): add init-container waitPostgresqlInitContainer Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> --------- Signed-off-by: Guilhem Barthés Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> Co-authored-by: Guilhem Barthés --------- Signed-off-by: SdgJlbl Signed-off-by: Guilhem Barthes Signed-off-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com> Signed-off-by: Guilhem Barthés Signed-off-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> Co-authored-by: SdgJlbl Co-authored-by: ThibaultFy <50656860+ThibaultFy@users.noreply.github.com> Co-authored-by: Thibault Camalon <135698225+thbcmlowk@users.noreply.github.com> --- .gitignore | 1 + CHANGELOG.md | 12 + Makefile | 2 +-
backend/api/events/sync.py | 30 +- .../api/migrations/0053_function_status.py | 30 ++ backend/api/models/function.py | 6 + backend/api/serializers/function.py | 1 + backend/api/tests/asset_factory.py | 37 +- .../api/tests/views/test_views_computetask.py | 1 - ...ogs.py => test_views_failed_asset_logs.py} | 28 +- .../api/tests/views/test_views_function.py | 6 + backend/api/urls.py | 2 +- backend/api/views/__init__.py | 4 +- backend/api/views/computetask_logs.py | 18 - backend/api/views/datamanager.py | 16 +- backend/api/views/failed_asset_logs.py | 41 +++ backend/api/views/function.py | 41 ++- backend/api/views/model.py | 8 +- backend/api/views/utils.py | 132 ++++--- backend/backend/celery.py | 2 +- backend/backend/settings/celery/dev.py | 1 + backend/backend/settings/celery/prod.py | 1 + backend/backend/settings/common.py | 8 +- backend/backend/settings/deps/image_build.py | 4 + backend/backend/settings/test.py | 1 + .../backend/settings/worker/events/common.py | 1 + backend/builder/__init__.py | 0 backend/builder/apps.py | 5 + backend/builder/docker.py | 11 + backend/builder/exceptions.py | 42 +++ backend/builder/image_builder/__init__.py | 0 .../builder/image_builder/image_builder.py | 319 +++++++++++++++++ backend/builder/kubernetes.py | 190 +++++++++++ backend/builder/tasks/__init__.py | 3 + backend/builder/tasks/task.py | 39 +++ backend/builder/tasks/tasks_build_image.py | 43 +++ backend/builder/tests/conftest.py | 9 + .../tests}/test_image_builder.py | 25 +- backend/builder/tests/test_kubernetes.py | 28 ++ .../builder/tests/test_task_build_image.py | 20 ++ backend/builder/volumes.py | 7 + backend/image_transfer/__init__.py | 6 + backend/image_transfer/common.py | 138 ++++++++ backend/image_transfer/decoder.py | 165 +++++++++ backend/image_transfer/encoder.py | 183 ++++++++++ backend/orchestrator/__init__.py | 2 + backend/orchestrator/client.py | 8 + backend/orchestrator/failure_report_pb2.py | 22 +- backend/orchestrator/failure_report_pb2.pyi | 55 ++- backend/orchestrator/function_pb2.py | 68 ++-- backend/orchestrator/function_pb2.pyi | 81 ++++- backend/orchestrator/function_pb2_grpc.py | 33 ++ backend/orchestrator/function_pb2_grpc.pyi | 10 + backend/orchestrator/mock.py | 2 + backend/orchestrator/resources.py | 17 +- backend/requirements.txt | 2 + backend/substrapp/clients/organization.py | 6 +- backend/substrapp/compute_tasks/errors.py | 33 +- backend/substrapp/compute_tasks/execute.py | 8 + .../substrapp/compute_tasks/image_builder.py | 322 ++---------------- backend/substrapp/compute_tasks/volumes.py | 4 - backend/substrapp/docker_registry.py | 14 +- backend/substrapp/events/reactor.py | 41 +++ backend/substrapp/exceptions.py | 8 - backend/substrapp/kubernetes_utils.py | 181 ---------- .../0006_create_compute_task_failure_model.py | 2 +- ...go_description_alter_algo_file_and_more.py | 4 +- ...go_description_alter_algo_file_and_more.py | 4 +- .../migrations/0015_add_functionimage.py | 36 ++ ...ename_computetaskfailurereport_and_more.py | 29 ++ backend/substrapp/models/__init__.py | 8 +- ...lure_report.py => asset_failure_report.py} | 20 +- backend/substrapp/models/function.py | 23 ++ backend/substrapp/task_routing.py | 5 + backend/substrapp/tasks/__init__.py | 2 + backend/substrapp/tasks/task.py | 101 ++++++ .../tasks/tasks_asset_failure_report.py | 68 ++++ backend/substrapp/tasks/tasks_compute_task.py | 104 +----- backend/substrapp/tasks/tasks_save_image.py | 97 ++++++ .../tests/compute_tasks/test_errors.py | 3 +- .../tests/tasks/test_compute_task.py | 42 +-- 
.../tasks/test_store_asset_failure_report.py | 69 ++++ .../substrapp/tests/test_kubernetes_utils.py | 27 -- backend/substrapp/utils/errors.py | 32 ++ charts/substra-backend/CHANGELOG.md | 10 +- charts/substra-backend/Chart.yaml | 2 +- charts/substra-backend/README.md | 29 +- charts/substra-backend/templates/rbac.yaml | 55 ++- .../templates/statefulset-builder.yaml | 239 +++++++++++++ charts/substra-backend/values.yaml | 82 +++++ docker/substra-backend/Dockerfile | 2 + docs/settings.md | 2 +- fixtures/.DS_Store | Bin 6148 -> 0 bytes fixtures/chunantes/.DS_Store | Bin 6148 -> 0 bytes fixtures/chunantes/functions/.DS_Store | Bin 6148 -> 0 bytes .../chunantes/functions/function3/.DS_Store | Bin 6148 -> 0 bytes pyproject.toml | 7 + skaffold.yaml | 1 + 98 files changed, 2818 insertions(+), 871 deletions(-) create mode 100644 backend/api/migrations/0053_function_status.py rename backend/api/tests/views/{test_views_computetask_logs.py => test_views_failed_asset_logs.py} (90%) delete mode 100644 backend/api/views/computetask_logs.py create mode 100644 backend/api/views/failed_asset_logs.py create mode 100644 backend/backend/settings/deps/image_build.py create mode 100644 backend/builder/__init__.py create mode 100644 backend/builder/apps.py create mode 100644 backend/builder/docker.py create mode 100644 backend/builder/exceptions.py create mode 100644 backend/builder/image_builder/__init__.py create mode 100644 backend/builder/image_builder/image_builder.py create mode 100644 backend/builder/kubernetes.py create mode 100644 backend/builder/tasks/__init__.py create mode 100644 backend/builder/tasks/task.py create mode 100644 backend/builder/tasks/tasks_build_image.py create mode 100644 backend/builder/tests/conftest.py rename backend/{substrapp/tests/compute_tasks => builder/tests}/test_image_builder.py (71%) create mode 100644 backend/builder/tests/test_kubernetes.py create mode 100644 backend/builder/tests/test_task_build_image.py create mode 100644 backend/builder/volumes.py create mode 100644 backend/image_transfer/__init__.py create mode 100644 backend/image_transfer/common.py create mode 100644 backend/image_transfer/decoder.py create mode 100644 backend/image_transfer/encoder.py create mode 100644 backend/substrapp/migrations/0015_add_functionimage.py create mode 100644 backend/substrapp/migrations/0016_rename_computetaskfailurereport_and_more.py rename backend/substrapp/models/{compute_task_failure_report.py => asset_failure_report.py} (57%) create mode 100644 backend/substrapp/tasks/task.py create mode 100644 backend/substrapp/tasks/tasks_asset_failure_report.py create mode 100644 backend/substrapp/tasks/tasks_save_image.py create mode 100644 backend/substrapp/tests/tasks/test_store_asset_failure_report.py create mode 100644 backend/substrapp/utils/errors.py create mode 100644 charts/substra-backend/templates/statefulset-builder.yaml delete mode 100644 fixtures/.DS_Store delete mode 100644 fixtures/chunantes/.DS_Store delete mode 100644 fixtures/chunantes/functions/.DS_Store delete mode 100644 fixtures/chunantes/functions/function3/.DS_Store diff --git a/.gitignore b/.gitignore index dbefb3b4a..e486b6a06 100644 --- a/.gitignore +++ b/.gitignore @@ -24,6 +24,7 @@ wheels/ .installed.cfg *.egg MANIFEST +.DS_Store # PyInstaller # Usually these files are written by a python script from a template diff --git a/CHANGELOG.md b/CHANGELOG.md index c69d38f62..92351401c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,18 @@ and this project adheres to [Semantic 
Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Added + +- Field `asset_type` on `AssetFailureReport` (based on protobuf enum `orchestrator.FailedAssetKind`) ([#727](https://github.com/Substra/substra-backend/pull/727)) +- Celery task `FailableTask` that contains the logic to store the failure report, that can be re-used in different assets. ([#727](https://github.com/Substra/substra-backend/pull/727)) +- Add `FunctionStatus` enum ([#714](https://github.com/Substra/orchestrator/pull/714)) +- BREAKING: Add `status` on `api.Function` (type `FunctionStatus`) ([#714](https://github.com/Substra/substra-backend/pull/714)) + +### Changed + +- `ComputeTaskFailureReport` renamed in `AssetFailureReport` ([#727](https://github.com/Substra/substra-backend/pull/727)) +- Field `AssetFailureReport.compute_task_key` renamed to `asset_key` ([#727](https://github.com/Substra/substra-backend/pull/727)) + ### Removed - BREAKING: remove `distributed` Skaffold profile [#768](https://github.com/Substra/substra-backend/pull/768) diff --git a/Makefile b/Makefile index 1923164ad..2694b6a63 100644 --- a/Makefile +++ b/Makefile @@ -32,7 +32,7 @@ format: ## Format code lint: ## Perform a static analysis of the code flake8 $(SRC_DIRS) bandit --ini=.bandit - mypy backend/substrapp/tasks/ + mypy .PHONY: shell shell: ## Start a Python shell for the Django project diff --git a/backend/api/events/sync.py b/backend/api/events/sync.py index f89711965..569c4e0db 100644 --- a/backend/api/events/sync.py +++ b/backend/api/events/sync.py @@ -33,6 +33,7 @@ from api.serializers import PerformanceSerializer from orchestrator import client as orc_client from orchestrator import computetask +from orchestrator import failure_report_pb2 logger = structlog.get_logger(__name__) @@ -85,13 +86,19 @@ def _create_function(channel: str, data: dict) -> None: def _on_update_function_event(event: dict) -> None: """Process update function event to update local database.""" logger.debug("Syncing function update", asset_key=event["asset_key"], event_id=event["id"]) - _update_function(key=event["asset_key"], data=event["function"]) + function = event["function"] + _update_function(key=event["asset_key"], name=function["name"], status=function["status"]) -def _update_function(key: str, data: dict) -> None: +def _update_function(key: str, *, name: Optional[str] = None, status: Optional[str] = None) -> None: """Process update function event to update local database.""" function = Function.objects.get(key=key) - function.name = data["name"] + + if name: + function.name = name + if status: + function.status = status + function.save() @@ -376,7 +383,22 @@ def _disable_model(key: str) -> None: def _on_create_failure_report(event: dict) -> None: """Process create failure report event to update local database.""" logger.debug("Syncing failure report create", asset_key=event["asset_key"], event_id=event["id"]) - _update_computetask(key=event["asset_key"], failure_report=event["failure_report"]) + + asset_key = event["asset_key"] + failure_report = event["failure_report"] + asset_type = failure_report_pb2.FailedAssetKind.Value(failure_report["asset_type"]) + + if asset_type == failure_report_pb2.FAILED_ASSET_FUNCTION: + # Needed as this field is only in ComputeTask + compute_task_keys = ComputeTask.objects.values_list("key", flat=True).filter( + function_id=asset_key, + status__in=[ComputeTask.Status.STATUS_TODO.value, ComputeTask.Status.STATUS_DOING.value], + ) + + for task_key in compute_task_keys: + _update_computetask(key=str(task_key), 
failure_report={"error_type": failure_report.get("error_type")}) + else: + _update_computetask(key=asset_key, failure_report=failure_report) EVENT_CALLBACKS = { diff --git a/backend/api/migrations/0053_function_status.py b/backend/api/migrations/0053_function_status.py new file mode 100644 index 000000000..ac554829d --- /dev/null +++ b/backend/api/migrations/0053_function_status.py @@ -0,0 +1,30 @@ +# Generated by Django 4.2.3 on 2023-08-23 13:18 + +from django.db import migrations +from django.db import models + + +class Migration(migrations.Migration): + dependencies = [ + ("api", "0052_remove_metric_from_performance"), + ] + + operations = [ + migrations.AddField( + model_name="function", + name="status", + field=models.CharField( + choices=[ + ("FUNCTION_STATUS_UNKNOWN", "Function Status Unknown"), + ("FUNCTION_STATUS_WAITING", "Function Status Waiting"), + ("FUNCTION_STATUS_BUILDING", "Function Status Building"), + ("FUNCTION_STATUS_READY", "Function Status Ready"), + ("FUNCTION_STATUS_CANCELED", "Function Status Canceled"), + ("FUNCTION_STATUS_FAILED", "Function Status Failed"), + ], + default="FUNCTION_STATUS_UNKNOWN", + max_length=64, + ), + preserve_default=False, + ), + ] diff --git a/backend/api/models/function.py b/backend/api/models/function.py index c6b094904..5c8ec8ebf 100644 --- a/backend/api/models/function.py +++ b/backend/api/models/function.py @@ -4,6 +4,7 @@ import orchestrator.common_pb2 as common_pb2 from api.models.utils import AssetPermissionMixin from api.models.utils import URLValidatorWithOptionalTLD +from orchestrator import function_pb2 class FunctionInput(models.Model): @@ -43,6 +44,10 @@ class Meta: class Function(models.Model, AssetPermissionMixin): """Function represent a function and its associated metadata""" + Status = models.TextChoices( + "Status", [(status_name, status_name) for status_name in function_pb2.FunctionStatus.keys()] + ) + key = models.UUIDField(primary_key=True) name = models.CharField(max_length=100) description_address = models.URLField(validators=[URLValidatorWithOptionalTLD()]) @@ -57,6 +62,7 @@ class Function(models.Model, AssetPermissionMixin): creation_date = models.DateTimeField() metadata = models.JSONField() channel = models.CharField(max_length=100) + status = models.CharField(max_length=64, choices=Status.choices) class Meta: ordering = ["creation_date", "key"] # default order for relations serializations diff --git a/backend/api/serializers/function.py b/backend/api/serializers/function.py index 24d3844f2..2920414ee 100644 --- a/backend/api/serializers/function.py +++ b/backend/api/serializers/function.py @@ -53,6 +53,7 @@ class Meta: "permissions", "inputs", "outputs", + "status", ] def to_representation(self, instance): diff --git a/backend/api/tests/asset_factory.py b/backend/api/tests/asset_factory.py index 0422a3c1d..4f1b479a4 100644 --- a/backend/api/tests/asset_factory.py +++ b/backend/api/tests/asset_factory.py @@ -62,6 +62,7 @@ import datetime import uuid +from typing import Optional from django.core import files from django.utils import timezone @@ -80,9 +81,10 @@ from api.models import Model from api.models import Performance from api.models import TaskProfiling -from substrapp.models import ComputeTaskFailureReport as ComputeTaskLogs +from substrapp.models import AssetFailureReport from substrapp.models import DataManager as DataManagerFiles from substrapp.models import DataSample as DataSampleFiles +from substrapp.models import FailedAssetKind from substrapp.models import Function as FunctionFiles from 
substrapp.models import Model as ModelFiles from substrapp.utils import get_hash @@ -236,6 +238,7 @@ def create_function( creation_date=timezone.now(), owner=owner, channel=channel, + status=Function.Status.FUNCTION_STATUS_WAITING, **get_permissions(owner, public), ) @@ -534,20 +537,36 @@ def create_model_files( return model_files -def create_computetask_logs( - compute_task_key: uuid.UUID, - logs: files.File = None, -) -> ComputeTaskLogs: +def create_asset_logs( + asset_key: uuid.UUID, + asset_type: FailedAssetKind, + logs: Optional[files.File] = None, +) -> AssetFailureReport: if logs is None: logs = files.base.ContentFile("dummy content") - compute_task_logs = ComputeTaskLogs.objects.create( - compute_task_key=compute_task_key, + asset_logs = AssetFailureReport.objects.create( + asset_key=asset_key, + asset_type=asset_type, logs_checksum=get_hash(logs), creation_date=timezone.now(), ) - compute_task_logs.logs.save("logs", logs) - return compute_task_logs + asset_logs.logs.save("logs", logs) + return asset_logs + + +def create_computetask_logs( + compute_task_key: uuid.UUID, + logs: Optional[files.File] = None, +) -> AssetFailureReport: + return create_asset_logs(compute_task_key, FailedAssetKind.FAILED_ASSET_COMPUTE_TASK, logs) + + +def create_function_logs( + function_key: uuid.UUID, + logs: Optional[files.File] = None, +) -> AssetFailureReport: + return create_asset_logs(function_key, FailedAssetKind.FAILED_ASSET_FUNCTION, logs) def create_computetask_profiling(compute_task: ComputeTask) -> TaskProfiling: diff --git a/backend/api/tests/views/test_views_computetask.py b/backend/api/tests/views/test_views_computetask.py index bb19e4992..a1b4bd4ed 100644 --- a/backend/api/tests/views/test_views_computetask.py +++ b/backend/api/tests/views/test_views_computetask.py @@ -259,7 +259,6 @@ class GenericTaskViewTests(ComputeTaskViewTests): def setUp(self): super().setUp() self.url = reverse("api:task-list") - self.maxDiff = None todo_task = self.compute_tasks[ComputeTask.Status.STATUS_TODO] waiting_task = self.compute_tasks[ComputeTask.Status.STATUS_WAITING] diff --git a/backend/api/tests/views/test_views_computetask_logs.py b/backend/api/tests/views/test_views_failed_asset_logs.py similarity index 90% rename from backend/api/tests/views/test_views_computetask_logs.py rename to backend/api/tests/views/test_views_failed_asset_logs.py index 5083c8a88..5fe6db3d4 100644 --- a/backend/api/tests/views/test_views_computetask_logs.py +++ b/backend/api/tests/views/test_views_failed_asset_logs.py @@ -13,11 +13,11 @@ from api.views import utils as view_utils from organization import authentication as organization_auth from organization import models as organization_models -from substrapp.models import ComputeTaskFailureReport +from substrapp.models import AssetFailureReport @pytest.fixture -def compute_task_failure_report() -> tuple[ComputeTask, ComputeTaskFailureReport]: +def asset_failure_report() -> tuple[ComputeTask, AssetFailureReport]: compute_task = factory.create_computetask( factory.create_computeplan(), factory.create_function(), @@ -41,12 +41,12 @@ def test_download_logs_failure_unauthenticated(api_client: test.APIClient): @pytest.mark.django_db def test_download_local_logs_success( - compute_task_failure_report, + asset_failure_report, authenticated_client: test.APIClient, ): """An authorized user download logs located on the organization.""" - compute_task, failure_report = compute_task_failure_report + compute_task, failure_report = asset_failure_report assert compute_task.owner == 
conf.settings.LEDGER_MSP_ID # local assert conf.settings.LEDGER_MSP_ID in compute_task.logs_permission_authorized_ids # allowed @@ -60,12 +60,12 @@ def test_download_local_logs_success( @pytest.mark.django_db def test_download_logs_failure_forbidden( - compute_task_failure_report, + asset_failure_report, authenticated_client: test.APIClient, ): """An authenticated user cannot download logs if he is not authorized.""" - compute_task, failure_report = compute_task_failure_report + compute_task, failure_report = asset_failure_report assert compute_task.owner == conf.settings.LEDGER_MSP_ID # local compute_task.logs_permission_authorized_ids = [] # not allowed compute_task.save() @@ -77,12 +77,12 @@ def test_download_logs_failure_forbidden( @pytest.mark.django_db def test_download_local_logs_failure_not_found( - compute_task_failure_report, + asset_failure_report, authenticated_client: test.APIClient, ): """An authorized user attempt to download logs that are not referenced in the database.""" - compute_task, failure_report = compute_task_failure_report + compute_task, failure_report = asset_failure_report assert compute_task.owner == conf.settings.LEDGER_MSP_ID # local assert conf.settings.LEDGER_MSP_ID in compute_task.logs_permission_authorized_ids # allowed failure_report.delete() # not found @@ -94,12 +94,12 @@ def test_download_local_logs_failure_not_found( @pytest.mark.django_db def test_download_remote_logs_success( - compute_task_failure_report, + asset_failure_report, authenticated_client: test.APIClient, ): """An authorized user download logs on a remote organization by using his organization as proxy.""" - compute_task, failure_report = compute_task_failure_report + compute_task, failure_report = asset_failure_report outgoing_organization = "outgoing-organization" compute_task.logs_owner = outgoing_organization # remote compute_task.logs_permission_authorized_ids = [conf.settings.LEDGER_MSP_ID, outgoing_organization] # allowed @@ -139,13 +139,13 @@ def get_proxy_headers(channel_name: str) -> dict[str, str]: @pytest.mark.django_db def test_organization_download_logs_success( - compute_task_failure_report, + asset_failure_report, api_client: test.APIClient, incoming_organization_user: organization_auth.OrganizationUser, ): """An authorized organization can download logs from another organization.""" - compute_task, failure_report = compute_task_failure_report + compute_task, failure_report = asset_failure_report compute_task.logs_owner = conf.settings.LEDGER_MSP_ID # local (incoming request from remote) compute_task.logs_permission_authorized_ids = [ conf.settings.LEDGER_MSP_ID, @@ -166,13 +166,13 @@ def test_organization_download_logs_success( @pytest.mark.django_db def test_organization_download_logs_forbidden( - compute_task_failure_report, + asset_failure_report, api_client: test.APIClient, incoming_organization_user: organization_auth.OrganizationUser, ): """An unauthorized organization cannot download logs from another organization.""" - compute_task, failure_report = compute_task_failure_report + compute_task, failure_report = asset_failure_report compute_task.logs_owner = conf.settings.LEDGER_MSP_ID # local (incoming request from remote) compute_task.logs_permission_authorized_ids = [conf.settings.LEDGER_MSP_ID] # incoming user not allowed compute_task.channel = incoming_organization_user.username diff --git a/backend/api/tests/views/test_views_function.py b/backend/api/tests/views/test_views_function.py index b287afd8d..ace51eedd 100644 --- 
a/backend/api/tests/views/test_views_function.py +++ b/backend/api/tests/views/test_views_function.py @@ -104,6 +104,7 @@ def setUp(self): "outputs": { "model": {"kind": "ASSET_MODEL", "multiple": False}, }, + "status": "FUNCTION_STATUS_WAITING", }, { "key": str(aggregate_function.key), @@ -135,6 +136,7 @@ def setUp(self): "outputs": { "model": {"kind": "ASSET_MODEL", "multiple": False}, }, + "status": "FUNCTION_STATUS_WAITING", }, { "key": str(composite_function.key), @@ -170,6 +172,7 @@ def setUp(self): "local": {"kind": "ASSET_MODEL", "multiple": False}, "shared": {"kind": "ASSET_MODEL", "multiple": False}, }, + "status": "FUNCTION_STATUS_WAITING", }, { "key": str(predict_function.key), @@ -204,6 +207,7 @@ def setUp(self): "outputs": { "predictions": {"kind": "ASSET_MODEL", "multiple": False}, }, + "status": "FUNCTION_STATUS_WAITING", }, { "key": str(metric_function.key), @@ -237,6 +241,7 @@ def setUp(self): "outputs": { "performance": {"kind": "ASSET_PERFORMANCE", "multiple": False}, }, + "status": "FUNCTION_STATUS_WAITING", }, ] @@ -448,6 +453,7 @@ def mock_orc_response(data): "function": data["function"], "inputs": data["inputs"], "outputs": data["outputs"], + "status": Function.Status.FUNCTION_STATUS_WAITING, } function_path = os.path.join(FIXTURE_PATH, filename) diff --git a/backend/api/urls.py b/backend/api/urls.py index 6826dff8c..6cbc27751 100644 --- a/backend/api/urls.py +++ b/backend/api/urls.py @@ -25,7 +25,7 @@ router.register(r"compute_plan_metadata", views.ComputePlanMetadataViewSet, basename="compute_plan_metadata") router.register(r"news_feed", views.NewsFeedViewSet, basename="news_feed") router.register(r"performance", views.PerformanceViewSet, basename="performance") -router.register(r"logs", views.ComputeTaskLogsViewSet, basename="logs") +router.register(r"logs", views.FailedAssetLogsViewSet, basename="logs") router.register(r"task_profiling", views.TaskProfilingViewSet, basename="task_profiling") task_profiling_router = routers.NestedDefaultRouter(router, r"task_profiling", lookup="task_profiling") diff --git a/backend/api/views/__init__.py b/backend/api/views/__init__.py index ab83a34e5..25484555e 100644 --- a/backend/api/views/__init__.py +++ b/backend/api/views/__init__.py @@ -2,10 +2,10 @@ from .computeplan import ComputePlanViewSet from .computetask import ComputeTaskViewSet from .computetask import CPTaskViewSet -from .computetask_logs import ComputeTaskLogsViewSet from .datamanager import DataManagerPermissionViewSet from .datamanager import DataManagerViewSet from .datasample import DataSampleViewSet +from .failed_asset_logs import FailedAssetLogsViewSet from .function import CPFunctionViewSet from .function import FunctionPermissionViewSet from .function import FunctionViewSet @@ -24,6 +24,7 @@ "DataManagerPermissionViewSet", "ModelViewSet", "ModelPermissionViewSet", + "FailedAssetLogsViewSet", "FunctionViewSet", "FunctionPermissionViewSet", "ComputeTaskViewSet", @@ -31,7 +32,6 @@ "CPTaskViewSet", "CPFunctionViewSet", "NewsFeedViewSet", - "ComputeTaskLogsViewSet", "CPPerformanceViewSet", "ComputePlanMetadataViewSet", "PerformanceViewSet", diff --git a/backend/api/views/computetask_logs.py b/backend/api/views/computetask_logs.py deleted file mode 100644 index 5eca090b6..000000000 --- a/backend/api/views/computetask_logs.py +++ /dev/null @@ -1,18 +0,0 @@ -from rest_framework import response as drf_response -from rest_framework import viewsets -from rest_framework.decorators import action - -from api.models import ComputeTask -from api.views import utils as 
view_utils -from substrapp.models import compute_task_failure_report - - -class ComputeTaskLogsViewSet(view_utils.PermissionMixin, viewsets.GenericViewSet): - queryset = compute_task_failure_report.ComputeTaskFailureReport.objects.all() - - @action(detail=True, url_path=compute_task_failure_report.LOGS_FILE_PATH) - def file(self, request, pk=None) -> drf_response.Response: - response = self.download_file(request, ComputeTask, "logs", "logs_address") - response.headers["Content-Type"] = "text/plain; charset=utf-8" - response.headers["Content-Disposition"] = f'attachment; filename="tuple_logs_{pk}.txt"' - return response diff --git a/backend/api/views/datamanager.py b/backend/api/views/datamanager.py index 0acc744e2..57df76802 100644 --- a/backend/api/views/datamanager.py +++ b/backend/api/views/datamanager.py @@ -212,8 +212,20 @@ class DataManagerPermissionViewSet(PermissionMixin, GenericViewSet): @action(detail=True, url_path="description", url_name="description") def description_(self, request, *args, **kwargs): - return self.download_file(request, DataManager, "description", "description_address") + return self.download_file( + request, + asset_class=DataManager, + local_file_class=DataManagerFiles, + content_field="description", + address_field="description_address", + ) @action(detail=True) def opener(self, request, *args, **kwargs): - return self.download_file(request, DataManager, "data_opener", "opener_address") + return self.download_file( + request, + asset_class=DataManager, + local_file_class=DataManagerFiles, + content_field="data_opener", + address_field="opener_address", + ) diff --git a/backend/api/views/failed_asset_logs.py b/backend/api/views/failed_asset_logs.py new file mode 100644 index 000000000..7b55f0df1 --- /dev/null +++ b/backend/api/views/failed_asset_logs.py @@ -0,0 +1,41 @@ +from rest_framework import response as drf_response +from rest_framework import status +from rest_framework import viewsets +from rest_framework.decorators import action + +from api.errors import AssetPermissionError +from api.models import ComputeTask +from api.models import Function +from api.views import utils as view_utils +from substrapp.models import asset_failure_report + + +class FailedAssetLogsViewSet(view_utils.PermissionMixin, viewsets.GenericViewSet): + queryset = asset_failure_report.AssetFailureReport.objects.all() + + @action(detail=True, url_path=asset_failure_report.LOGS_FILE_PATH) + def file(self, request, pk=None) -> drf_response.Response: + report = self.get_object() + channel_name = view_utils.get_channel_name(request) + if report.asset_type == asset_failure_report.FailedAssetKind.FAILED_ASSET_FUNCTION: + asset_class = Function + else: + asset_class = ComputeTask + + try: + asset = self.get_asset(request, report.key, channel_name, asset_class) + except AssetPermissionError as e: + return view_utils.ApiResponse({"detail": str(e)}, status=status.HTTP_403_FORBIDDEN) + + response = view_utils.get_file_response( + local_file_class=asset_failure_report.AssetFailureReport, + key=report.key, + content_field="logs", + channel_name=channel_name, + url=report.logs_address, + asset_owner=asset.get_owner(), + ) + + response.headers["Content-Type"] = "text/plain; charset=utf-8" + response.headers["Content-Disposition"] = f'attachment; filename="tuple_logs_{pk}.txt"' + return response diff --git a/backend/api/views/function.py b/backend/api/views/function.py index f14d280df..28b8a9f2a 100644 --- a/backend/api/views/function.py +++ b/backend/api/views/function.py @@ -1,6 +1,7 @@ import 
structlog from django.conf import settings from django.db import models +from django.http import Http404 from django.urls import reverse from django_filters.rest_framework import BaseInFilter from django_filters.rest_framework import DateTimeFromToRangeFilter @@ -19,16 +20,20 @@ from api.views.filters_utils import MatchFilter from api.views.filters_utils import ProcessPermissionFilter from api.views.utils import ApiResponse +from api.views.utils import CustomFileResponse from api.views.utils import PermissionMixin from api.views.utils import ValidationExceptionError from api.views.utils import get_channel_name +from api.views.utils import to_string_uuid from api.views.utils import validate_key from api.views.utils import validate_metadata from libs.pagination import DefaultPageNumberPagination from substrapp.models import Function as FunctionFiles +from substrapp.models import FunctionImage from substrapp.orchestrator import get_orchestrator_client from substrapp.serializers import FunctionSerializer as FunctionFilesSerializer from substrapp.utils import get_hash +from substrapp.utils import get_owner logger = structlog.get_logger(__name__) @@ -197,7 +202,13 @@ class FunctionPermissionViewSet(PermissionMixin, GenericViewSet): @action(detail=True) def file(self, request, *args, **kwargs): - return self.download_file(request, Function, "file", "function_address") + return self.download_file( + request, + asset_class=Function, + local_file_class=FunctionFiles, + content_field="file", + address_field="function_address", + ) # actions cannot be named "description" # https://github.com/encode/django-rest-framework/issues/6490 @@ -205,4 +216,30 @@ def file(self, request, *args, **kwargs): # https://www.django-rest-framework.org/api-guide/viewsets/#introspecting-viewset-actions @action(detail=True, url_path="description", url_name="description") def description_(self, request, *args, **kwargs): - return self.download_file(request, Function, "description", "description_address") + return self.download_file( + request, + asset_class=Function, + local_file_class=FunctionFiles, + content_field="description", + address_field="description_address", + ) + + @action(detail=True) + def image(self, request, *args, **kwargs): + # TODO refactor the code duplication with api.views.utils.PermissionMixin.download_file + channel_name = get_channel_name(request) + lookup_url_kwarg = self.lookup_url_kwarg or self.lookup_field + key = to_string_uuid(self.kwargs[lookup_url_kwarg]) + function = Function.objects.filter(channel=channel_name).get(key=key) + + if get_owner() != function.get_owner(): + return Http404("The function image is only available on the backend that owns the function.") + + try: + function_image = FunctionImage.objects.get(function__key=function.key) + except FunctionImage.DoesNotExist: + return Http404(f"The function image associated with key {key} is not found.") + + # TODO we love hard-coded size, see also api.views.utils.PermissionMixin._download_remote_file + response = CustomFileResponse(streaming_content=(chunk for chunk in function_image.file.chunks(512 * 1024))) + return response diff --git a/backend/api/views/model.py b/backend/api/views/model.py index 6a051da5a..c29e78aef 100644 --- a/backend/api/views/model.py +++ b/backend/api/views/model.py @@ -140,4 +140,10 @@ def _check_export_enabled(channel_name) @if_true(gzip.gzip_page, settings.GZIP_MODELS) @action(detail=True) def file(self, request, *args, **kwargs): - return self.download_file(request, Model, "file", "model_address") + return 
self.download_file( + request, + asset_class=Model, + local_file_class=ModelFiles, + content_field="file", + address_field="model_address", + ) diff --git a/backend/api/views/utils.py b/backend/api/views/utils.py index 912c3c1d8..5a3626daf 100644 --- a/backend/api/views/utils.py +++ b/backend/api/views/utils.py @@ -1,10 +1,13 @@ import os import uuid from typing import Callable +from typing import Type +from typing import TypeVar from wsgiref.util import is_hop_by_hop import django.http from django.conf import settings +from django.db import models from rest_framework import status from rest_framework.authentication import BasicAuthentication from rest_framework.permissions import SAFE_METHODS @@ -26,6 +29,9 @@ HTTP_HEADER_PROXY_ASSET = "Substra-Proxy-Asset" +AssetType = TypeVar("AssetType", bound=models.Model) +LocalFileType = TypeVar("LocalFileType", bound=models.Model) + class ApiResponse(Response): """The Content-Disposition header is used for downloads and web service responses @@ -80,18 +86,33 @@ def check_access(self, channel_name: str, user, asset, is_proxied_request: bool) if not asset.is_public("process") and organization_id not in asset.get_authorized_ids("process"): raise AssetPermissionError() - def download_file(self, request, asset_class, content_field, address_field): - if settings.ISOLATED: - return ApiResponse({"detail": "Asset not available in isolated mode"}, status=status.HTTP_410_GONE) + def get_key(self, request) -> str: lookup_url_kwarg = self.lookup_url_kwarg or self.lookup_field key = self.kwargs[lookup_url_kwarg] - channel_name = get_channel_name(request) - - validated_key = validate_key(key) - asset = asset_class.objects.filter(channel=channel_name).get(key=validated_key) + return validate_key(key) + + def get_asset(self, request, key: str, channel_name: str, asset_class: Type[AssetType]) -> AssetType: + asset = asset_class.objects.filter(channel=channel_name).get(key=key) + self.check_access(channel_name, request.user, asset, is_proxied_request(request)) + + return asset + + def download_file( + self, + request, + *, + asset_class: Type[AssetType], + local_file_class: Type[LocalFileType], + content_field: str, + address_field: str, + ): + if settings.ISOLATED: + return ApiResponse({"detail": "Asset not available in isolated mode"}, status=status.HTTP_410_GONE) + key = self.get_key(request) + channel_name = get_channel_name(request) try: - self.check_access(channel_name, request.user, asset, is_proxied_request(request)) + asset = self.get_asset(request, key, channel_name, asset_class) except AssetPermissionError as e: return ApiResponse({"detail": str(e)}, status=status.HTTP_403_FORBIDDEN) @@ -99,49 +120,70 @@ def download_file(self, request, asset_class, content_field, address_field): if not url: return ApiResponse({"detail": "Asset not available anymore"}, status=status.HTTP_410_GONE) - if get_owner() == asset.get_owner(): - response = self._get_local_file_response(content_field) - else: - response = self._download_remote_file(channel_name, asset.get_owner(), url) - - return response + return get_file_response( + key=key, + local_file_class=local_file_class, + asset_owner=asset.get_owner(), + content_field=content_field, + channel_name=channel_name, + url=url, + ) - def _get_local_file_response(self, content_field): - obj = self.get_object() - data = getattr(obj, content_field) - if isinstance(data.storage, MinioStorage): - filename = str(obj.key) - else: - filename = os.path.basename(data.path) - data = open(data.path, "rb") +def get_file_response( + *, + 
local_file_class: Type[LocalFileType], + content_field: str, + key: str, + asset_owner: str, + channel_name: str, + url: str, +) -> django.http.FileResponse: + if get_owner() == asset_owner: + local_file = local_file_class.objects.get(pk=key) + response = _get_local_file_response(local_file, key, content_field) + else: + response = _download_remote_file(channel_name, asset_owner, url) - response = CustomFileResponse( - data, - as_attachment=True, - filename=filename, - ) - return response + return response - def _download_remote_file(self, channel_name: str, owner: str, url: str) -> django.http.FileResponse: - proxy_response = organization_client.streamed_get( - channel=channel_name, - organization_id=owner, - url=url, - headers={HTTP_HEADER_PROXY_ASSET: "True"}, - ) - response = CustomFileResponse( - streaming_content=(chunk for chunk in proxy_response.iter_content(512 * 1024)), - status=proxy_response.status_code, - ) - for header in proxy_response.headers: - # We don't use hop_by_hop headers since they are incompatible - # with WSGI - if not is_hop_by_hop(header): - response[header] = proxy_response.headers.get(header) +def _get_local_file_response(local_file: LocalFileType, key: str, content_field: str): + data = getattr(local_file, content_field) - return response + if isinstance(data.storage, MinioStorage): + filename = key + else: + filename = os.path.basename(data.path) + data = open(data.path, "rb") + + response = CustomFileResponse( + data, + as_attachment=True, + filename=filename, + ) + return response + + +def _download_remote_file(channel_name: str, owner: str, url: str) -> django.http.FileResponse: + proxy_response = organization_client.streamed_get( + channel=channel_name, + organization_id=owner, + url=url, + headers={HTTP_HEADER_PROXY_ASSET: "True"}, + ) + response = CustomFileResponse( + streaming_content=(chunk for chunk in proxy_response.iter_content(512 * 1024)), + status=proxy_response.status_code, + ) + + for header in proxy_response.headers: + # We don't use hop_by_hop headers since they are incompatible + # with WSGI + if not is_hop_by_hop(header): + response[header] = proxy_response.headers.get(header) + + return response def validate_key(key) -> str: diff --git a/backend/backend/celery.py b/backend/backend/celery.py index 27bc69853..875677da4 100644 --- a/backend/backend/celery.py +++ b/backend/backend/celery.py @@ -23,7 +23,7 @@ app.config_from_object("django.conf:settings", namespace="CELERY") app.steps["worker"].add(DjangoStructLogInitStep) - +app.steps["builder"].add(DjangoStructLogInitStep) # Load task modules from all registered Django app configs. 
app.autodiscover_tasks() diff --git a/backend/backend/settings/celery/dev.py b/backend/backend/settings/celery/dev.py index 79c3228a4..9cee88a36 100644 --- a/backend/backend/settings/celery/dev.py +++ b/backend/backend/settings/celery/dev.py @@ -1,3 +1,4 @@ +from ..deps.image_build import * from ..deps.ledger import * from ..deps.orchestrator import * from ..dev import * diff --git a/backend/backend/settings/celery/prod.py b/backend/backend/settings/celery/prod.py index e7a29075e..5fe79f5f3 100644 --- a/backend/backend/settings/celery/prod.py +++ b/backend/backend/settings/celery/prod.py @@ -1,3 +1,4 @@ +from ..deps.image_build import * from ..deps.ledger import * from ..deps.orchestrator import * from ..prod import * diff --git a/backend/backend/settings/common.py b/backend/backend/settings/common.py index 6f6394394..04926d3bc 100644 --- a/backend/backend/settings/common.py +++ b/backend/backend/settings/common.py @@ -59,6 +59,7 @@ "api", "drf_spectacular", "django_filters", + "builder", ] AUTHENTICATION_BACKENDS = [ @@ -208,7 +209,7 @@ # Used by the Secure aggregation mechanism to retrieve chainkeys K8S_SECRET_NAMESPACE = os.getenv("K8S_SECRET_NAMESPACE", "default") -REGISTRY = os.getenv("REGISTRY") +REGISTRY = os.getenv("REGISTRY", "") REGISTRY_SCHEME = os.getenv("REGISTRY_SCHEME") REGISTRY_PULL_DOMAIN = os.getenv("REGISTRY_PULL_DOMAIN") REGISTRY_IS_LOCAL = to_bool(os.environ.get("REGISTRY_IS_LOCAL")) @@ -303,6 +304,11 @@ "handlers": ["console"], "propagate": False, }, + "builder": { + "level": LOG_LEVEL, + "handlers": ["console"], + "propagate": False, + }, # third-party libraries "celery": { "level": "INFO", diff --git a/backend/backend/settings/deps/image_build.py b/backend/backend/settings/deps/image_build.py new file mode 100644 index 000000000..ad4d24253 --- /dev/null +++ b/backend/backend/settings/deps/image_build.py @@ -0,0 +1,4 @@ +# How long we wait before throwing errors, in seconds +IMAGE_BUILD_TIMEOUT = 3 * 60 * 60 # 3 hours +# Delay between two checks +IMAGE_BUILD_CHECK_DELAY = 5 diff --git a/backend/backend/settings/test.py b/backend/backend/settings/test.py index a4952033e..e7a62fe90 100644 --- a/backend/backend/settings/test.py +++ b/backend/backend/settings/test.py @@ -2,6 +2,7 @@ import tempfile from .common import * +from .deps.image_build import * from .deps.restframework import * from .mods.cors import * from .mods.oidc import * diff --git a/backend/backend/settings/worker/events/common.py b/backend/backend/settings/worker/events/common.py index 7d6d5aa46..5e69f2054 100644 --- a/backend/backend/settings/worker/events/common.py +++ b/backend/backend/settings/worker/events/common.py @@ -1,3 +1,4 @@ +from ...deps.image_build import * from ...deps.ledger import * from ...deps.orchestrator import * from ...dev import * diff --git a/backend/builder/__init__.py b/backend/builder/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/backend/builder/apps.py b/backend/builder/apps.py new file mode 100644 index 000000000..966bd8a6e --- /dev/null +++ b/backend/builder/apps.py @@ -0,0 +1,5 @@ +from django.apps import AppConfig + + +class BuilderConfig(AppConfig): + name = "builder" diff --git a/backend/builder/docker.py b/backend/builder/docker.py new file mode 100644 index 000000000..f0911389f --- /dev/null +++ b/backend/builder/docker.py @@ -0,0 +1,11 @@ +from substrapp.docker_registry import ImageNotFoundError +from substrapp.docker_registry import get_container_image + + +def container_image_exists(image_name: str) -> bool: + try: + 
get_container_image(image_name) + except ImageNotFoundError: + return False + else: + return True diff --git a/backend/builder/exceptions.py b/backend/builder/exceptions.py new file mode 100644 index 000000000..c01e67091 --- /dev/null +++ b/backend/builder/exceptions.py @@ -0,0 +1,42 @@ +from io import BytesIO + +from substrapp.compute_tasks.errors import CeleryNoRetryError +from substrapp.compute_tasks.errors import CeleryRetryError +from substrapp.compute_tasks.errors import ComputeTaskErrorType +from substrapp.compute_tasks.errors import _ComputeTaskError + + +class PodError(Exception): + pass + + +class PodTimeoutError(Exception): + pass + + +class BuildRetryError(_ComputeTaskError, CeleryRetryError): + """An error occurred during the build of a container image. + + Args: + logs (str): the container image build logs + """ + + error_type = ComputeTaskErrorType.BUILD_ERROR + + def __init__(self, logs: str, *args: list, **kwargs: dict): + self.logs = BytesIO(str.encode(logs)) + super().__init__(logs, *args, **kwargs) + + +class BuildError(_ComputeTaskError, CeleryNoRetryError): + """An error occurred during the build of a container image. + + Args: + logs (str): the container image build logs + """ + + error_type = ComputeTaskErrorType.BUILD_ERROR + + def __init__(self, logs: str, *args: list, **kwargs: dict): + self.logs = BytesIO(str.encode(logs)) + super().__init__(logs, *args, **kwargs) diff --git a/backend/builder/image_builder/__init__.py b/backend/builder/image_builder/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/backend/builder/image_builder/image_builder.py b/backend/builder/image_builder/image_builder.py new file mode 100644 index 000000000..d51ce0a6c --- /dev/null +++ b/backend/builder/image_builder/image_builder.py @@ -0,0 +1,319 @@ +import json +import os +from tempfile import TemporaryDirectory +from typing import Union + +import kubernetes +import structlog +from django.conf import settings + +import orchestrator +from builder import docker +from builder import exceptions +from builder.exceptions import BuildError +from builder.exceptions import BuildRetryError +from builder.kubernetes import get_pod_logs +from builder.kubernetes import pod_exists +from builder.kubernetes import watch_pod +from builder.volumes import get_docker_cache_pvc_name +from substrapp.compute_tasks import datastore as ds +from substrapp.compute_tasks import utils +from substrapp.compute_tasks.compute_pod import Label +from substrapp.compute_tasks.volumes import get_worker_subtuple_pvc_name +from substrapp.docker_registry import USER_IMAGE_REPOSITORY +from substrapp.kubernetes_utils import delete_pod +from substrapp.kubernetes_utils import get_security_context +from substrapp.lock_local import lock_resource +from substrapp.utils import timeit +from substrapp.utils import uncompress_content + +logger = structlog.get_logger(__name__) + +REGISTRY = settings.REGISTRY +REGISTRY_SCHEME = settings.REGISTRY_SCHEME +NAMESPACE = settings.NAMESPACE +KANIKO_MIRROR = settings.TASK["KANIKO_MIRROR"] +KANIKO_IMAGE = settings.TASK["KANIKO_IMAGE"] +KANIKO_DOCKER_CONFIG_SECRET_NAME = settings.TASK["KANIKO_DOCKER_CONFIG_SECRET_NAME"] +KANIKO_DOCKER_CONFIG_VOLUME_NAME = "docker-config" +SUBTUPLE_TMP_DIR = settings.SUBTUPLE_TMP_DIR +IMAGE_BUILD_TIMEOUT = settings.IMAGE_BUILD_TIMEOUT +KANIKO_CONTAINER_NAME = "kaniko" +HOSTNAME = settings.HOSTNAME + + +def container_image_tag_from_function(function: orchestrator.Function) -> str: + """builds the container image tag from the function checksum + + 
Args: + function (orchestrator.Function): a function retrieved from the orchestrator + + Returns: + str: the container image tag + """ + return f"function-{function.function_address.checksum[:16]}" + + +# main entrypoint +# TODO refactor +def build_image_if_missing(channel: str, function: orchestrator.Function) -> None: + """ + Build the container image and the ImageEntryPoint entry if they don't exist already + """ + datastore = ds.Datastore(channel=channel) + container_image_tag = utils.container_image_tag_from_function(function) + with lock_resource("image-build", container_image_tag, ttl=IMAGE_BUILD_TIMEOUT, timeout=IMAGE_BUILD_TIMEOUT): + if docker.container_image_exists(container_image_tag): + logger.info("Reusing existing image", image=container_image_tag) + else: + asset_content = datastore.get_function(function) + _build_function_image(asset_content, function) + + +def _build_function_image(asset: bytes, function: orchestrator.Function) -> None: + """ + Build a function's container image. + + Perform multiple steps: + 1. Download the function using the provided asset storage_address/owner. Verify its checksum and uncompress the data + to a temporary folder. + 2. Extract the ENTRYPOINT from the Dockerfile. + 3. Build the container image using Kaniko. + 4. Save the ENTRYPOINT to the DB + """ + + os.makedirs(SUBTUPLE_TMP_DIR, exist_ok=True) + + with TemporaryDirectory(dir=SUBTUPLE_TMP_DIR) as tmp_dir: + # Download source + uncompress_content(asset, tmp_dir) + + # Build image + _build_container_image(tmp_dir, utils.container_image_tag_from_function(function)) + + +def _get_entrypoint_from_dockerfile(dockerfile_dir: str) -> list[str]: + """ + Get entrypoint from ENTRYPOINT in the Dockerfile. + + This is necessary because the user function can have arbitrary names, e.g. "myfunction.py". + + Example: + ENTRYPOINT ["python3", "myfunction.py"] + """ + dockerfile_path = f"{dockerfile_dir}/Dockerfile" + + with open(dockerfile_path, "r") as file: + for line in file: + if line.startswith("ENTRYPOINT"): + try: + res = json.loads(line[len("ENTRYPOINT") :]) + except json.JSONDecodeError: + res = None + + if not isinstance(res, list): + raise BuildError( + "Invalid ENTRYPOINT in function/metric Dockerfile. " + "You must use the exec form in your Dockerfile. " + "See https://docs.docker.com/engine/reference/builder/#entrypoint" + ) + return res + + raise BuildError("Invalid Dockerfile: Cannot find ENTRYPOINT") + + +def _delete_kaniko_pod(create_pod: bool, k8s_client: kubernetes.client.CoreV1Api, pod_name: str) -> str: + logs = "" + if create_pod: + logs = get_pod_logs(k8s_client, pod_name, KANIKO_CONTAINER_NAME, ignore_pod_not_found=True) + delete_pod(k8s_client, pod_name) + logger.info(logs or "", pod_name=pod_name) + return logs + + +@timeit +def _build_container_image(path: str, tag: str) -> None: + _assert_dockerfile_exist(path) + + kubernetes.config.load_incluster_config() + k8s_client = kubernetes.client.CoreV1Api() + + pod_name = _build_pod_name(tag) + + create_pod = not pod_exists(k8s_client, pod_name) + if create_pod: + try: + logger.info("creating pod: building image", namespace=NAMESPACE, pod=pod_name, image=tag) + pod = _build_pod(path, tag) + k8s_client.create_namespaced_pod(body=pod, namespace=NAMESPACE) + except kubernetes.client.ApiException as e: + raise BuildRetryError( + f"Error creating pod {NAMESPACE}/{pod_name}. 
Reason: {e.reason}, status: {e.status}, body: {e.body}" + ) from e + + try: + watch_pod(k8s_client, pod_name) + + except Exception as e: + # In case of concurrent builds, it may fail. Check if the image exists. + if docker.container_image_exists(tag): + logger.warning( + f"Build of container image {tag} failed, probably because it was done by a concurrent build", + exc_info=True, + ) + return + + logs = _delete_kaniko_pod(create_pod, k8s_client, pod_name) + + if isinstance(e, exceptions.PodTimeoutError): + raise BuildRetryError(logs) from e + elif "ConnectionResetError" in logs: # retry when download failed + raise BuildRetryError(logs) from e + else: # exceptions.PodError or other + raise BuildError(logs) from e + + _delete_kaniko_pod(create_pod, k8s_client, pod_name) + + +def _assert_dockerfile_exist(dockerfile_path: Union[str, os.PathLike]) -> None: + dockerfile_fullpath = os.path.join(dockerfile_path, "Dockerfile") + if not os.path.exists(dockerfile_fullpath): + raise BuildError(f"Dockerfile does not exist : {dockerfile_fullpath}") + + +def _build_pod(dockerfile_mount_path: str, image_tag: str) -> kubernetes.client.V1Pod: + pod_name = _build_pod_name(image_tag) + pod_spec = _build_pod_spec(dockerfile_mount_path, image_tag) + return kubernetes.client.V1Pod( + api_version="v1", + kind="Pod", + metadata=kubernetes.client.V1ObjectMeta( + name=pod_name, + labels={ + Label.PodName: pod_name, + Label.PodType: "image-build", + Label.Component: Label.Component_Compute, + }, + ), + spec=pod_spec, + ) + + +def _build_pod_name(image_tag: str) -> str: + dns_1123_compliant_tag = image_tag.split("/")[-1].replace("_", "-") + return f"kaniko-{dns_1123_compliant_tag}" + + +def _build_pod_spec(dockerfile_mount_path: str, image_tag: str) -> kubernetes.client.V1PodSpec: + container = _build_container(dockerfile_mount_path, image_tag) + pod_affinity = _build_pod_affinity() + + cache_pvc_name = ( + settings.WORKER_PVC_DOCKER_CACHE if settings.WORKER_PVC_IS_HOSTPATH else get_docker_cache_pvc_name() + ) + cache = kubernetes.client.V1Volume( + name="cache", + persistent_volume_claim=kubernetes.client.V1PersistentVolumeClaimVolumeSource(claim_name=cache_pvc_name), + ) + + dockerfile_pvc_name = ( + settings.WORKER_PVC_SUBTUPLE if settings.WORKER_PVC_IS_HOSTPATH else get_worker_subtuple_pvc_name() + ) + dockerfile = kubernetes.client.V1Volume( + name="dockerfile", + persistent_volume_claim=kubernetes.client.V1PersistentVolumeClaimVolumeSource(claim_name=dockerfile_pvc_name), + ) + + volumes = [cache, dockerfile] + + if KANIKO_DOCKER_CONFIG_SECRET_NAME: + docker_config = kubernetes.client.V1Volume( + name=KANIKO_DOCKER_CONFIG_VOLUME_NAME, + secret=kubernetes.client.V1SecretVolumeSource( + secret_name=KANIKO_DOCKER_CONFIG_SECRET_NAME, + items=[kubernetes.client.V1KeyToPath(key=".dockerconfigjson", path="config.json")], + ), + ) + volumes.append(docker_config) + + return kubernetes.client.V1PodSpec( + restart_policy="Never", affinity=pod_affinity, containers=[container], volumes=volumes + ) + + +def _build_pod_affinity() -> kubernetes.client.V1Affinity: + return kubernetes.client.V1Affinity( + pod_affinity=kubernetes.client.V1PodAffinity( + required_during_scheduling_ignored_during_execution=[ + kubernetes.client.V1PodAffinityTerm( + label_selector=kubernetes.client.V1LabelSelector( + match_expressions=[ + kubernetes.client.V1LabelSelectorRequirement( + key="statefulset.kubernetes.io/pod-name", operator="In", values=[HOSTNAME] + ) + ] + ), + topology_key="kubernetes.io/hostname", + ) + ] + ) + ) + + +def 
_build_container(dockerfile_mount_path: str, image_tag: str) -> kubernetes.client.V1Container: + # kaniko build can be launched without privilege but + # it needs some capabilities and to be root + # https://github.com/GoogleContainerTools/kaniko/issues/778 + # https://github.com/GoogleContainerTools/kaniko/issues/778#issuecomment-619112417 + # https://github.com/moby/moby/blob/master/oci/caps/defaults.go + # https://man7.org/linux/man-pages/man7/capabilities.7.html + capabilities = ["CHOWN", "SETUID", "SETGID", "FOWNER", "DAC_OVERRIDE", "SETFCAP"] + container_security_context = get_security_context(root=True, capabilities=capabilities) + args = _build_container_args(dockerfile_mount_path, image_tag) + dockerfile_mount_subpath = dockerfile_mount_path.split("/subtuple/")[-1] + + dockerfile = kubernetes.client.V1VolumeMount( + name="dockerfile", mount_path=dockerfile_mount_path, sub_path=dockerfile_mount_subpath, read_only=True + ) + cache = kubernetes.client.V1VolumeMount(name="cache", mount_path="/cache", read_only=True) + volume_mounts = [dockerfile, cache] + + if KANIKO_DOCKER_CONFIG_SECRET_NAME: + docker_config = kubernetes.client.V1VolumeMount( + name=KANIKO_DOCKER_CONFIG_VOLUME_NAME, mount_path="/kaniko/.docker" + ) + volume_mounts.append(docker_config) + + return kubernetes.client.V1Container( + name=KANIKO_CONTAINER_NAME, + image=KANIKO_IMAGE, + command=None, + args=args, + volume_mounts=volume_mounts, + security_context=container_security_context, + ) + + +def _build_container_args(dockerfile_mount_path: str, image_tag: str) -> list[str]: + dockerfile_fullpath = os.path.join(dockerfile_mount_path, "Dockerfile") + args = [ + f"--dockerfile={dockerfile_fullpath}", + f"--context=dir://{dockerfile_mount_path}", + f"--destination={REGISTRY}/{USER_IMAGE_REPOSITORY}:{image_tag}", + "--cache=true", + "--log-timestamp=true", + "--snapshotMode=redo", + "--push-retry=3", + "--cache-copy-layers", + "--log-format=text", + f"--verbosity={('debug' if settings.LOG_LEVEL == 'DEBUG' else 'info')}", + ] + + if REGISTRY_SCHEME == "http": + args.append("--insecure") + + if KANIKO_MIRROR: + args.append(f"--registry-mirror={REGISTRY}") + if REGISTRY_SCHEME == "http": + args.append("--insecure-pull") + return args diff --git a/backend/builder/kubernetes.py b/backend/builder/kubernetes.py new file mode 100644 index 000000000..8e8d7af20 --- /dev/null +++ b/backend/builder/kubernetes.py @@ -0,0 +1,190 @@ +import enum +import time + +import kubernetes +import structlog +from django.conf import settings + +from builder.exceptions import PodError +from builder.exceptions import PodTimeoutError +from substrapp.utils import timeit + +logger = structlog.get_logger(__name__) + +NAMESPACE = settings.NAMESPACE + + +class ObjectState(enum.Enum): + PENDING = enum.auto() + WAITING = enum.auto() + RUNNING = enum.auto() + FAILED = enum.auto() + COMPLETED = enum.auto() + UNKNOWN = enum.auto() + + +class PodState: + def __init__(self, status: ObjectState, reason: str = "", message: str = ""): + self.status = status + self.reason = reason + self.message = message + + def set_reason(self, container_status: kubernetes.client.V1ContainerState) -> None: + if self.status == ObjectState.WAITING: + self.reason = container_status.waiting.reason + self.message = container_status.waiting.message + if self.status == ObjectState.FAILED: + self.reason = container_status.terminated.reason + self.message = container_status.terminated.message + + +def pod_exists(k8s_client: kubernetes.client.CoreV1Api, name: str) -> bool: + try: + 
k8s_client.read_namespaced_pod(name=name, namespace=NAMESPACE) + except kubernetes.client.ApiException: + return False + else: + return True + + +@timeit +def get_pod_logs( + k8s_client: kubernetes.client.CoreV1Api, name: str, container: str, ignore_pod_not_found: bool = False +) -> str: + try: + return k8s_client.read_namespaced_pod_log(name=name, namespace=NAMESPACE, container=container) + except kubernetes.client.ApiException as exc: + if ignore_pod_not_found and exc.reason == "Not Found": + return f"Pod not found: {NAMESPACE}/{name} ({container})" + if exc.reason == "Bad Request": + return f"In {NAMESPACE}/{name} \n {str(exc.body)}" + return f"Unable to get logs for pod {NAMESPACE}/{name} ({container}) \n {str(exc)}" + + +def watch_pod(k8s_client: kubernetes.client.CoreV1Api, name: str) -> None: + """Watch a Kubernetes pod status + It will observe all the containers inside the pod and return when the pod will + reach the Completed state. If the pod is pending indefinitely or fail, an exception will be raised. + Args: + k8s_client (kubernetes.client.CoreV1Api): Kubernetes API client + name (str): name of the pod to watch + Raises: + PodError: this exception is raised if the pod exits with an error + PodTimeoutError: this exception is raised if the pod does not reach the running state after some time + """ + attempt = 0 + # with 60 attempts we wait max 2 min with a pending pod + max_attempts = 60 + + # This variable is used to track the current status through retries + previous_pod_status = None + + while attempt < max_attempts: + try: + api_response = retrieve_pod_status(k8s_client, name) + except kubernetes.client.ApiException as exc: + logger.warning("Could not retrieve pod status", pod_name=name, exc_info=exc) + attempt += 1 + time.sleep(0.2) + continue + + pod_state = _get_pod_state(api_response) + + if pod_state.status != previous_pod_status: + previous_pod_status = pod_state.status + logger.info( + "Pod status changed", + pod_name=name, + status=pod_state.status, + reason=pod_state.reason, + message=pod_state.message, + attempt=attempt, + max_attempts=max_attempts, + ) + + if pod_state.status == ObjectState.COMPLETED: + return + + if pod_state.status == ObjectState.FAILED: + raise PodError(f"Pod {name} terminated with error: {pod_state.reason}") + + if pod_state.status == ObjectState.PENDING: + # Here we basically consume a free retry everytime but we still need to + # increment attempt because if at some point our pod is stuck in pending state + # we need to exit this function + attempt += 1 + time.sleep(2) + + # Here PodInitializing and ContainerCreating are valid reasons to wait more time + # Other possible reasons include "CrashLoopBackOff", "CreateContainerConfigError", + # "ErrImagePull", "ImagePullBackOff", "CreateContainerError", "InvalidImageName" + if ( + pod_state.status == ObjectState.WAITING + and pod_state.reason not in ["PodInitializing", "ContainerCreating"] + or pod_state.status == ObjectState.UNKNOWN + ): + attempt += 1 + + time.sleep(0.2) + + raise PodTimeoutError(f"Pod {name} didn't complete after {max_attempts} attempts") + + +def retrieve_pod_status(k8s_client: kubernetes.client.CoreV1Api, pod_name: str) -> kubernetes.client.V1PodStatus: + pod: kubernetes.client.V1Pod = k8s_client.read_namespaced_pod_status( + name=pod_name, namespace=NAMESPACE, pretty=True + ) + return pod.status + + +def _get_pod_state(pod_status: kubernetes.client.V1PodStatus) -> PodState: + """extracts the current pod state from the PodStatus Kubernetes object + Args: + pod_status 
(kubernetes.client.models.V1PodStatus): A Kubernetes PodStatus object + """ + if pod_status.phase in ["Pending"]: + # On the first query the pod just created and often pending as it is not already scheduled to a node + return PodState(ObjectState.PENDING, pod_status.reason, pod_status.message) + + container_statuses: list[kubernetes.client.V1ContainerStatus] = ( + pod_status.init_container_statuses if pod_status.init_container_statuses else [] + ) + container_statuses += pod_status.container_statuses + + completed_containers = 0 + for container in container_statuses: + container_state: ObjectState = _get_container_state(container) + + if container_state in [ObjectState.RUNNING, ObjectState.WAITING, ObjectState.FAILED]: + pod_state = PodState(container_state) + pod_state.set_reason(container.state) + return pod_state + if container_state == ObjectState.COMPLETED: + completed_containers += 1 + + if completed_containers == len(container_statuses): + return PodState(ObjectState.COMPLETED, "", "pod successfully completed") + + logger.debug("pod status", pod_status=pod_status) + return PodState(ObjectState.UNKNOWN, "", "Could not deduce the pod state from container statuses") + + +def _get_container_state(container_status: kubernetes.client.V1ContainerStatus) -> ObjectState: + """Extracts the container state from a ContainerStatus Kubernetes object + Args: + container_status (kubernetes.client.models.V1ContainerStatus): A ContainerStatus object + Returns: + ObjectState: the state of the container + """ + # Here we need to check if we are in a failed state first since kubernetes will retry + # we can end up running after a failure + if container_status.state.terminated: + if container_status.state.terminated.exit_code != 0: + return ObjectState.FAILED + else: + return ObjectState.COMPLETED + if container_status.state.running: + return ObjectState.RUNNING + if container_status.state.waiting: + return ObjectState.WAITING + return ObjectState.UNKNOWN diff --git a/backend/builder/tasks/__init__.py b/backend/builder/tasks/__init__.py new file mode 100644 index 000000000..4f5416567 --- /dev/null +++ b/backend/builder/tasks/__init__.py @@ -0,0 +1,3 @@ +from builder.tasks.tasks_build_image import build_image + +__all__ = ["build_image"] diff --git a/backend/builder/tasks/task.py b/backend/builder/tasks/task.py new file mode 100644 index 000000000..459f00e99 --- /dev/null +++ b/backend/builder/tasks/task.py @@ -0,0 +1,39 @@ +import structlog +from django.conf import settings + +import orchestrator +from substrapp.models import FailedAssetKind +from substrapp.orchestrator import get_orchestrator_client +from substrapp.tasks.task import FailableTask + +logger = structlog.get_logger("builder") + + +class BuildTask(FailableTask): + autoretry_for = settings.CELERY_TASK_AUTORETRY_FOR + max_retries = settings.CELERY_TASK_MAX_RETRIES + retry_backoff = settings.CELERY_TASK_RETRY_BACKOFF + retry_backoff_max = settings.CELERY_TASK_RETRY_BACKOFF_MAX + retry_jitter = settings.CELERY_TASK_RETRY_JITTER + acks_late = True + reject_on_worker_lost = True + ignore_result = False + + asset_type = FailedAssetKind.FAILED_ASSET_FUNCTION + + @property + def attempt(self) -> int: + return self.request.retries + 1 # type: ignore + + # Celery does not provide unpacked arguments, we are doing it in `get_task_info` + def before_start(self, task_id: str, args: tuple, kwargs: dict) -> None: + function_key, channel_name = self.get_task_info(args, kwargs) + with get_orchestrator_client(channel_name) as client: + 
client.update_function_status( + function_key=function_key, action=orchestrator.function_pb2.FUNCTION_ACTION_BUILDING + ) + + def get_task_info(self, args: tuple, kwargs: dict) -> tuple[str, str]: + function = orchestrator.Function.parse_raw(kwargs["function_serialized"]) + channel_name = kwargs["channel_name"] + return function.key, channel_name diff --git a/backend/builder/tasks/tasks_build_image.py b/backend/builder/tasks/tasks_build_image.py new file mode 100644 index 000000000..dda71d8dc --- /dev/null +++ b/backend/builder/tasks/tasks_build_image.py @@ -0,0 +1,43 @@ +import structlog +from django.conf import settings + +import orchestrator +from backend.celery import app +from builder.exceptions import BuildRetryError +from builder.exceptions import CeleryNoRetryError +from builder.image_builder.image_builder import build_image_if_missing +from builder.tasks.task import BuildTask + +logger = structlog.get_logger(__name__) +max_retries = settings.CELERY_TASK_MAX_RETRIES + + +@app.task( + bind=True, + base=BuildTask, +) +# Ack late and reject on worker lost allows use to +# see http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost +# and https://github.com/celery/celery/issues/5106 +def build_image(task: BuildTask, function_serialized: str, channel_name: str) -> None: + function = orchestrator.Function.parse_raw(function_serialized) + + attempt = 0 + while attempt <= task.max_retries: + try: + # TODO refactor + build_image_if_missing(channel_name, function) + except BuildRetryError as e: + logger.info( + "Retrying build", + function_id=function.key, + attempt=(task.attempt + 1), + max_attempts=(task.max_retries + 1), + ) + attempt += 1 + if attempt >= task.max_retries: + logger.exception(e) + raise CeleryNoRetryError from e + else: + continue + break diff --git a/backend/builder/tests/conftest.py b/backend/builder/tests/conftest.py new file mode 100644 index 000000000..a8cce6f45 --- /dev/null +++ b/backend/builder/tests/conftest.py @@ -0,0 +1,9 @@ +import pytest + +import orchestrator +import orchestrator.mock as orc_mock + + +@pytest.fixture +def function() -> orchestrator.Function: + return orc_mock.FunctionFactory() diff --git a/backend/substrapp/tests/compute_tasks/test_image_builder.py b/backend/builder/tests/test_image_builder.py similarity index 71% rename from backend/substrapp/tests/compute_tasks/test_image_builder.py rename to backend/builder/tests/test_image_builder.py index 169e133e0..da603c2c6 100644 --- a/backend/substrapp/tests/compute_tasks/test_image_builder.py +++ b/backend/builder/tests/test_image_builder.py @@ -4,8 +4,8 @@ from pytest_mock import MockerFixture import orchestrator -from substrapp.compute_tasks import errors as compute_task_errors -from substrapp.compute_tasks import image_builder +from builder.exceptions import BuildError +from builder.image_builder import image_builder from substrapp.compute_tasks import utils _VALID_DOCKERFILE = """ @@ -23,27 +23,24 @@ def test_build_image_if_missing_image_already_exists(mocker: MockerFixture, function: orchestrator.Function): - ds = mocker.Mock() - m_container_image_exists = mocker.patch( - "substrapp.compute_tasks.image_builder.container_image_exists", return_value=True - ) + m_container_image_exists = mocker.patch("builder.docker.container_image_exists", return_value=True) function_image_tag = utils.container_image_tag_from_function(function) - image_builder.build_image_if_missing(datastore=ds, function=function) + image_builder.build_image_if_missing(channel="channel", 
function=function) m_container_image_exists.assert_called_once_with(function_image_tag) +@pytest.mark.django_db def test_build_image_if_missing_image_build_needed(mocker: MockerFixture, function: orchestrator.Function): - ds = mocker.Mock() - m_container_image_exists = mocker.patch( - "substrapp.compute_tasks.image_builder.container_image_exists", return_value=False - ) - m_build_function_image = mocker.patch("substrapp.compute_tasks.image_builder._build_function_image") + m_container_image_exists = mocker.patch("builder.docker.container_image_exists", return_value=False) + m_datastore = mocker.patch("substrapp.compute_tasks.datastore.Datastore") + m_build_function_image = mocker.patch("builder.image_builder.image_builder._build_function_image") function_image_tag = utils.container_image_tag_from_function(function) - image_builder.build_image_if_missing(datastore=ds, function=function) + image_builder.build_image_if_missing(channel="channel", function=function) + m_datastore.assert_called_once() m_container_image_exists.assert_called_once_with(function_image_tag) m_build_function_image.assert_called_once() assert m_build_function_image.call_args.args[1] == function @@ -70,7 +67,7 @@ def test_get_entrypoint_from_dockerfile_invalid_dockerfile( dockerfile_path = tmp_path / "Dockerfile" dockerfile_path.write_text(dockerfile) - with pytest.raises(compute_task_errors.BuildError) as exc: + with pytest.raises(BuildError) as exc: image_builder._get_entrypoint_from_dockerfile(str(tmp_path)) assert expected_exc_content in bytes.decode(exc.value.logs.read()) diff --git a/backend/builder/tests/test_kubernetes.py b/backend/builder/tests/test_kubernetes.py new file mode 100644 index 000000000..918876a19 --- /dev/null +++ b/backend/builder/tests/test_kubernetes.py @@ -0,0 +1,28 @@ +from unittest import mock + +import kubernetes + +from builder.kubernetes import get_pod_logs + + +def test_get_pod_logs(mocker): + mocker.patch("kubernetes.client.CoreV1Api.read_namespaced_pod_log", return_value="Super great logs") + k8s_client = kubernetes.client.CoreV1Api() + logs = get_pod_logs(k8s_client, "pod_name", "container_name", ignore_pod_not_found=True) + assert logs == "Super great logs" + + +def test_get_pod_logs_not_found(): + with mock.patch("kubernetes.client.CoreV1Api.read_namespaced_pod_log") as read_pod: + read_pod.side_effect = kubernetes.client.ApiException(404, "Not Found") + k8s_client = kubernetes.client.CoreV1Api() + logs = get_pod_logs(k8s_client, "pod_name", "container_name", ignore_pod_not_found=True) + assert "Pod not found" in logs + + +def test_get_pod_logs_bad_request(): + with mock.patch("kubernetes.client.CoreV1Api.read_namespaced_pod_log") as read_pod: + read_pod.side_effect = kubernetes.client.ApiException(400, "Bad Request") + k8s_client = kubernetes.client.CoreV1Api() + logs = get_pod_logs(k8s_client, "pod_name", "container_name", ignore_pod_not_found=True) + assert "pod_name" in logs diff --git a/backend/builder/tests/test_task_build_image.py b/backend/builder/tests/test_task_build_image.py new file mode 100644 index 000000000..151821d57 --- /dev/null +++ b/backend/builder/tests/test_task_build_image.py @@ -0,0 +1,20 @@ +import pytest + +from builder.exceptions import BuildError +from substrapp.models import FailedAssetKind +from substrapp.utils.errors import store_failure + + +@pytest.mark.django_db +def test_store_failure_build_error(): + compute_task_key = "42ff54eb-f4de-43b2-a1a0-a9f4c5f4737f" + msg = "Error building image" + exc = BuildError(msg) + + failure_report = store_failure( 
+ exc, compute_task_key, FailedAssetKind.FAILED_ASSET_FUNCTION, error_type=BuildError.error_type.value + ) + failure_report.refresh_from_db() + + assert str(failure_report.asset_key) == compute_task_key + assert failure_report.logs.read() == str.encode(msg) diff --git a/backend/builder/volumes.py b/backend/builder/volumes.py new file mode 100644 index 000000000..46ecb06b3 --- /dev/null +++ b/backend/builder/volumes.py @@ -0,0 +1,7 @@ +import os + +from django.conf import settings + + +def get_docker_cache_pvc_name() -> str: + return f"{settings.WORKER_PVC_DOCKER_CACHE}-{os.getenv('HOSTNAME')}" diff --git a/backend/image_transfer/__init__.py b/backend/image_transfer/__init__.py new file mode 100644 index 000000000..0c568a958 --- /dev/null +++ b/backend/image_transfer/__init__.py @@ -0,0 +1,6 @@ +from image_transfer.decoder import BlobNotFound +from image_transfer.decoder import ManifestNotFound +from image_transfer.decoder import push_payload +from image_transfer.encoder import make_payload + +__all__ = (BlobNotFound, ManifestNotFound, push_payload, make_payload) diff --git a/backend/image_transfer/common.py b/backend/image_transfer/common.py new file mode 100644 index 000000000..abb5ef965 --- /dev/null +++ b/backend/image_transfer/common.py @@ -0,0 +1,138 @@ +from __future__ import annotations + +import json +from enum import Enum +from pathlib import Path +from typing import IO +from typing import Dict +from typing import Iterator +from typing import Optional +from typing import Union + +import requests +from dxf import DXF +from dxf import DXFBase +from pydantic import BaseModel + + +class PayloadSide(Enum): + ENCODER = "ENCODER" + DECODER = "DECODER" + + +class Blob: + def __init__(self, dxf_base: DXFBase, digest: str, repository: str): + self.dxf_base = dxf_base + self.digest = digest + self.repository = repository + + def __repr__(self): + return f"{self.repository}/{self.digest}" + + def __eq__(self, other: Blob): + return self.digest == other.digest and self.repository == other.repository + + +class Manifest: + def __init__( + self, + dxf_base: DXFBase, + docker_image_name: str, + payload_side: PayloadSide, + content: Optional[str] = None, + ): + self.dxf_base = dxf_base + self.docker_image_name = docker_image_name + self.payload_side = payload_side + self._content = content + + @property + def repository(self) -> str: + return get_repo_and_tag(self.docker_image_name)[0] + + @property + def tag(self) -> str: + return get_repo_and_tag(self.docker_image_name)[1] + + @property + def content(self) -> str: + if self._content is None: + if self.payload_side == PayloadSide.DECODER: + raise ValueError( + "This makes no sense to fetch the manifest from " "the registry if you're decoding the zip" + ) + dxf = DXF.from_base(self.dxf_base, self.repository) + self._content = dxf.get_manifest(self.tag) + return self._content + + def get_list_of_blobs(self) -> list[Blob]: + manifest_dict = json.loads(self.content) + result: list[Blob] = [Blob(self.dxf_base, manifest_dict["config"]["digest"], self.repository)] + for layer in manifest_dict["layers"]: + result.append(Blob(self.dxf_base, layer["digest"], self.repository)) + return result + + +class BlobPathInZip(BaseModel): + zip_path: str + + +class BlobLocationInRegistry(BaseModel): + repository: str + + +class PayloadDescriptor(BaseModel): + manifests_paths: Dict[str, Optional[str]] + blobs_paths: Dict[str, Union[BlobPathInZip, BlobLocationInRegistry]] + + @classmethod + def from_images( + cls, + docker_images_to_transfer: list[str], + 
docker_images_already_transferred: list[str], + ) -> PayloadDescriptor: + manifests_paths = {} + for docker_image in docker_images_to_transfer: + if docker_image in docker_images_already_transferred: + print(f"Skipping {docker_image} as it has already been transferred") + manifests_paths[docker_image] = None + else: + manifests_paths[docker_image] = f"manifests/{normalize_name(docker_image)}" + return cls(manifests_paths=manifests_paths, blobs_paths={}) + + def get_images_not_transferred_yet(self) -> Iterator[str]: + for docker_image, manifest_path in self.manifests_paths.items(): + if manifest_path is not None: + yield docker_image + + +def normalize_name(docker_image: str) -> str: + return docker_image.replace("/", "_") + + +def progress_as_string(index: int, container: list) -> str: + return f"[{index+1}/{len(container)}]" + + +def file_to_generator(file_like: IO) -> Iterator[bytes]: + while True: + chunk = file_like.read(2**15) + if not chunk: + break + yield chunk + + +PROJECT_ROOT = Path(__file__).parents[1] + + +def get_repo_and_tag(docker_image_name: str) -> (str, str): + return docker_image_name.split(":", 1) + + +class Authenticator: + def __init__(self, username: str, password: str): + self.username = username + self.password = password + + def auth(self, dxf: DXFBase, response: requests.Response) -> None: + dxf.authenticate(self.username, self.password, response=response) diff --git a/backend/image_transfer/decoder.py b/backend/image_transfer/decoder.py new file mode 100644 index 000000000..302890860 --- /dev/null +++ b/backend/image_transfer/decoder.py @@ -0,0 +1,165 @@ +from __future__ import annotations + +import sys +import warnings +from pathlib import Path +from typing import IO +from typing import Iterator +from typing import Optional +from typing import Union +from zipfile import ZipFile + +import requests +from dxf import DXF +from dxf import DXFBase + +from image_transfer.common import Authenticator +from image_transfer.common import Blob +from image_transfer.common import BlobLocationInRegistry +from image_transfer.common import BlobPathInZip +from image_transfer.common import Manifest +from image_transfer.common import PayloadDescriptor +from image_transfer.common import PayloadSide +from image_transfer.common import file_to_generator +from image_transfer.common import get_repo_and_tag +from image_transfer.common import progress_as_string + + +class ManifestNotFound(Exception): + pass + + +class BlobNotFound(Exception): + pass + + +def push_payload( + zip_file: Union[IO, Path, str], + strict: bool = False, + registry: str = "registry-1.docker.io", + secure: bool = True, + username: Optional[str] = None, + password: Optional[str] = None, +) -> list[str]: + """Push the payload to the registry. + + It will iterate over the docker images and push the blobs and the manifests. + + # Arguments + zip_file: the zip file containing the payload. It can be a `pathlib.Path`, a `str` + or a file-like object. + strict: `False` by default. If True, it will raise an error if the + some blobs/images are missing. + That can happen if the user + set an image in `docker_images_already_transferred` + that is not in the registry. + registry: the registry to push to. It defaults to `registry-1.docker.io` (dockerhub). + secure: whether to use TLS (HTTPS) or not to connect to the registry, + default is True. + username: the username to use to connect to the registry. Optional + if the registry does not require authentication. + password: the password to use to connect to the registry. 
Optional + if the registry does not require authentication. + + # Returns + The list of docker images loaded in the registry + It also includes the list of docker images that were already present + in the registry and were not included in the payload to optimize the size. + In other words, it's the argument `docker_images_to_transfer` that you passed + to the function `docker_charon.make_payload(...)`. + """ + authenticator = Authenticator(username, password) + + with DXFBase(host=registry, auth=authenticator.auth, insecure=not secure) as dxf_base: + with ZipFile(zip_file, "r") as zip_file: + return list(load_zip_images_in_registry(dxf_base, zip_file, strict)) + + +def push_all_blobs_from_manifest( + dxf_base: DXFBase, + zip_file: ZipFile, + manifest: Manifest, + blobs_paths: dict, +) -> None: + list_of_blobs = manifest.get_list_of_blobs() + for blob_index, blob in enumerate(list_of_blobs): + print(progress_as_string(blob_index, list_of_blobs), end=" ", file=sys.stderr) + + blob_path = blobs_paths[blob.digest] + + if isinstance(blob_path, BlobPathInZip): + print(f"pushing blob {blob}", file=sys.stderr) + dxf = DXF.from_base(dxf_base, blob.repository) + # we try to open the file in the zip and push it. If the file doesn't + # exists in the zip, it means that it's already been pushed. + with zip_file.open(blob_path.zip_path, "r") as blob_in_zip: + dxf.push_blob(data=file_to_generator(blob_in_zip), digest=blob.digest) + elif isinstance(blob_path, BlobLocationInRegistry): + blob_in_registry = Blob(dxf_base, blob.digest, blob_path.repository) + dxf = DXF.from_base(dxf_base, blob.repository) + print(f"Mounting {blob_in_registry} to {blob.repository}", file=sys.stderr) + dxf.mount_blob(blob_in_registry.repository, blob_in_registry.digest) + + +def load_single_image_from_zip_in_registry( + dxf_base: DXFBase, + zip_file: ZipFile, + docker_image: str, + manifest_path_in_zip: str, + blobs_paths: dict[str, Union[BlobPathInZip, BlobLocationInRegistry]], +) -> None: + print(f"Loading image {docker_image}", file=sys.stderr) + manifest_content = zip_file.read(manifest_path_in_zip).decode() + manifest = Manifest(dxf_base, docker_image, PayloadSide.DECODER, content=manifest_content) + push_all_blobs_from_manifest(dxf_base, zip_file, manifest, blobs_paths) + dxf = DXF.from_base(dxf_base, manifest.repository) + dxf.set_manifest(manifest.tag, manifest.content) + + +def check_if_the_docker_image_is_in_the_registry(dxf_base: DXFBase, docker_image: str, strict: bool): + """we skipped this image because the user said it was in the registry. Let's + check if it's true. Raise an warning/error if not. + """ + repo, tag = get_repo_and_tag(docker_image) + dxf = DXF.from_base(dxf_base, repo) + try: + dxf.get_manifest(tag) + except requests.HTTPError as e: + if e.response.status_code != 404: + raise + error_message = ( + f"The docker image {docker_image} is not present in the " + f"registry. But when making the payload, it was specified in " + f"`docker_images_already_transferred`." + ) + if strict: + raise ManifestNotFound( + f"{error_message}\n" f"If you still want to unpack your payload, set `strict=False`." 
+ ) + else: + warnings.warn(error_message, UserWarning) + return + print(f"Skipping {docker_image} as its already in the registry", file=sys.stderr) + + +def load_zip_images_in_registry(dxf_base: DXFBase, zip_file: ZipFile, strict: bool) -> Iterator[str]: + payload_descriptor = get_payload_descriptor(zip_file) + for ( + docker_image, + manifest_path_in_zip, + ) in payload_descriptor.manifests_paths.items(): + if manifest_path_in_zip is None: + check_if_the_docker_image_is_in_the_registry(dxf_base, docker_image, strict) + else: + load_single_image_from_zip_in_registry( + dxf_base, + zip_file, + docker_image, + manifest_path_in_zip, + payload_descriptor.blobs_paths, + ) + yield docker_image + + +def get_payload_descriptor(zip_file: ZipFile) -> PayloadDescriptor: + return PayloadDescriptor.model_validate_json(zip_file.read("payload_descriptor.json").decode()) diff --git a/backend/image_transfer/encoder.py b/backend/image_transfer/encoder.py new file mode 100644 index 000000000..a78aac4f6 --- /dev/null +++ b/backend/image_transfer/encoder.py @@ -0,0 +1,183 @@ +from __future__ import annotations + +import sys +from pathlib import Path +from typing import IO +from typing import Iterator +from typing import Optional +from typing import Union +from zipfile import ZipFile + +from dxf import DXF +from dxf import DXFBase +from tqdm import tqdm + +from image_transfer.common import Authenticator +from image_transfer.common import Blob +from image_transfer.common import BlobLocationInRegistry +from image_transfer.common import BlobPathInZip +from image_transfer.common import Manifest +from image_transfer.common import PayloadDescriptor +from image_transfer.common import PayloadSide +from image_transfer.common import progress_as_string + + +def add_blobs_to_zip( + dxf_base: DXFBase, + zip_file: ZipFile, + blobs_to_pull: list[Blob], + blobs_already_transferred: list[Blob], +) -> dict[str, Union[BlobPathInZip, BlobLocationInRegistry]]: + blobs_paths = {} + for blob_index, blob in enumerate(blobs_to_pull): + print(progress_as_string(blob_index, blobs_to_pull), end=" ", file=sys.stderr) + if blob.digest in blobs_paths: + print( + f"Skipping {blob} because it's in {blobs_paths[blob.digest]}", + file=sys.stderr, + ) + continue + + if dest_blob := get_blob_with_same_digest(blobs_already_transferred, blob.digest): + print( + f"Skipping {blob} because it's already in the destination registry " + f"in the repository {dest_blob.repository}", + file=sys.stderr, + ) + blobs_paths[blob.digest] = BlobLocationInRegistry(repository=dest_blob.repository) + continue + + # nominal case + print(f"Pulling blob {blob} and storing it in the zip", file=sys.stderr) + blob_path_in_zip = download_blob_to_zip(dxf_base, blob, zip_file) + blobs_paths[blob.digest] = BlobPathInZip(zip_path=blob_path_in_zip) + return blobs_paths + + +def download_blob_to_zip(dxf_base: DXFBase, blob: Blob, zip_file: ZipFile): + repository_dxf = DXF.from_base(dxf_base, blob.repository) + bytes_iterator, total_size = repository_dxf.pull_blob(blob.digest, size=True) + + # we write the blob directly to the zip file + with tqdm(total=total_size, unit="B", unit_scale=True) as pbar: + blob_path_in_zip = f"blobs/{blob.digest}" + with zip_file.open(blob_path_in_zip, "w", force_zip64=True) as blob_in_zip: + for chunk in bytes_iterator: + blob_in_zip.write(chunk) + pbar.update(len(chunk)) + return blob_path_in_zip + + +def get_blob_with_same_digest(list_of_blobs: list[Blob], digest: str) -> Optional[Blob]: + for blob in list_of_blobs: + if blob.digest == digest: + 
return blob + + +def get_manifest_and_list_of_blobs_to_pull(dxf_base: DXFBase, docker_image: str) -> tuple[Manifest, list[Blob]]: + manifest = Manifest(dxf_base, docker_image, PayloadSide.ENCODER) + return manifest, manifest.get_list_of_blobs() + + +def get_manifests_and_list_of_all_blobs( + dxf_base: DXFBase, docker_images: Iterator[str] +) -> tuple[list[Manifest], list[Blob]]: + manifests = [] + blobs_to_pull = [] + for docker_image in docker_images: + manifest, blobs = get_manifest_and_list_of_blobs_to_pull(dxf_base, docker_image) + manifests.append(manifest) + blobs_to_pull += blobs + return manifests, blobs_to_pull + + +def uniquify_blobs(blobs: list[Blob]) -> list[Blob]: + result = [] + for blob in blobs: + if blob.digest not in [x.digest for x in result]: + result.append(blob) + return result + + +def separate_images_to_transfer_and_images_to_skip( + docker_images_to_transfer: list[str], docker_images_already_transferred: list[str] +) -> tuple[list[str], list[str]]: + docker_images_to_transfer_with_blobs = [] + docker_images_to_skip = [] + for docker_image in docker_images_to_transfer: + if docker_image not in docker_images_already_transferred: + docker_images_to_transfer_with_blobs.append(docker_image) + else: + print( + f"Skipping {docker_image} as it has already been transferred", + file=sys.stderr, + ) + docker_images_to_skip.append(docker_image) + return docker_images_to_transfer_with_blobs, docker_images_to_skip + + +def create_zip_from_docker_images( + dxf_base: DXFBase, + docker_images_to_transfer: list[str], + docker_images_already_transferred: list[str], + zip_file: ZipFile, +) -> None: + payload_descriptor = PayloadDescriptor.from_images(docker_images_to_transfer, docker_images_already_transferred) + + manifests, blobs_to_pull = get_manifests_and_list_of_all_blobs( + dxf_base, payload_descriptor.get_images_not_transferred_yet() + ) + _, blobs_already_transferred = get_manifests_and_list_of_all_blobs(dxf_base, docker_images_already_transferred) + payload_descriptor.blobs_paths = add_blobs_to_zip(dxf_base, zip_file, blobs_to_pull, blobs_already_transferred) + for manifest in manifests: + dest = payload_descriptor.manifests_paths[manifest.docker_image_name] + zip_file.writestr(dest, manifest.content) + + zip_file.writestr("payload_descriptor.json", payload_descriptor.model_dump_json(indent=4)) + + +def make_payload( + zip_file: Union[IO, Path, str], + docker_images_to_transfer: list[str], + docker_images_already_transferred: Optional[list[str]] = None, + registry: str = "registry-1.docker.io", + secure: bool = True, + username: Optional[str] = None, + password: Optional[str] = None, +) -> None: + """ + Creates a payload from a list of docker images + All the docker images must be in the same registry. + This is currently a limitation of the docker-charon package. + + If you are interested in multi-registries, please open an issue. + + # Arguments + zip_file: The path to the zip file to create. It can be a `pathlib.Path` or + a `str`. It's also possible to pass a file-like object. The payload with + all the docker images is a single zip file. + docker_images_to_transfer: The list of docker images to transfer. Do not include + the registry name in the image name. + docker_images_already_transferred: The list of docker images that have already + been transferred to the air-gapped registry. Do not include the registry + name in the image name. + registry: the registry to push to. It defaults to `registry-1.docker.io` (dockerhub). 
+ secure: Set to `False` if the registry doesn't support HTTPS (TLS). Default + is `True`. + username: The username to use for authentication to the registry. Optional if + the registry doesn't require authentication. + password: The password to use for authentication to the registry. Optional if + the registry doesn't require authentication. + """ + if docker_images_already_transferred is None: + docker_images_already_transferred = [] + authenticator = Authenticator(username, password) + + with DXFBase(host=registry, auth=authenticator.auth, insecure=not secure) as dxf_base: + with ZipFile(zip_file, "w") as zip_file_opened: + create_zip_from_docker_images( + dxf_base, + docker_images_to_transfer, + docker_images_already_transferred, + zip_file_opened, + ) diff --git a/backend/orchestrator/__init__.py b/backend/orchestrator/__init__.py index 1440dcc80..1fb0015d4 100644 --- a/backend/orchestrator/__init__.py +++ b/backend/orchestrator/__init__.py @@ -13,6 +13,7 @@ from .resources import Function from .resources import FunctionInput from .resources import FunctionOutput +from .resources import FunctionStatus from .resources import InvalidInputAsset from .resources import Model from .resources import Permission @@ -38,4 +39,5 @@ "OrcError", "FunctionInput", "FunctionOutput", + "FunctionStatus", ) diff --git a/backend/orchestrator/client.py b/backend/orchestrator/client.py index 224cd86f4..664f04013 100644 --- a/backend/orchestrator/client.py +++ b/backend/orchestrator/client.py @@ -208,6 +208,14 @@ def update_function(self, args): data = self._function_client.UpdateFunction(function_pb2.UpdateFunctionParam(**args), metadata=self._metadata) return MessageToDict(data, **CONVERT_SETTINGS) + @grpc_retry + def update_function_status(self, function_key, action): + data = self._function_client.ApplyFunctionAction( + function_pb2.ApplyFunctionActionParam(function_key=function_key, action=action), + metadata=self._metadata, + ) + return MessageToDict(data, **CONVERT_SETTINGS) + @grpc_retry def query_function(self, key) -> Function: data = self._function_client.GetFunction(function_pb2.GetFunctionParam(key=key), metadata=self._metadata) diff --git a/backend/orchestrator/failure_report_pb2.py b/backend/orchestrator/failure_report_pb2.py index 0fc9bf1ef..b08cb7081 100644 --- a/backend/orchestrator/failure_report_pb2.py +++ b/backend/orchestrator/failure_report_pb2.py @@ -15,7 +15,7 @@ from . 
import common_pb2 as common__pb2 -DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x14\x66\x61ilure_report.proto\x12\x0corchestrator\x1a\x1fgoogle/protobuf/timestamp.proto\x1a\x0c\x63ommon.proto\"\xc9\x01\n\rFailureReport\x12\x18\n\x10\x63ompute_task_key\x18\x01 \x01(\t\x12+\n\nerror_type\x18\x02 \x01(\x0e\x32\x17.orchestrator.ErrorType\x12/\n\x0clogs_address\x18\x03 \x01(\x0b\x32\x19.orchestrator.Addressable\x12\x31\n\rcreation_date\x18\x04 \x01(\x0b\x32\x1a.google.protobuf.Timestamp\x12\r\n\x05owner\x18\x05 \x01(\t\"\x8a\x01\n\x10NewFailureReport\x12\x18\n\x10\x63ompute_task_key\x18\x01 \x01(\t\x12+\n\nerror_type\x18\x02 \x01(\x0e\x32\x17.orchestrator.ErrorType\x12/\n\x0clogs_address\x18\x03 \x01(\x0b\x32\x19.orchestrator.Addressable\"1\n\x15GetFailureReportParam\x12\x18\n\x10\x63ompute_task_key\x18\x01 \x01(\t*p\n\tErrorType\x12\x1a\n\x16\x45RROR_TYPE_UNSPECIFIED\x10\x00\x12\x14\n\x10\x45RROR_TYPE_BUILD\x10\x01\x12\x18\n\x14\x45RROR_TYPE_EXECUTION\x10\x02\x12\x17\n\x13\x45RROR_TYPE_INTERNAL\x10\x03\x32\xc2\x01\n\x14\x46\x61ilureReportService\x12T\n\x15RegisterFailureReport\x12\x1e.orchestrator.NewFailureReport\x1a\x1b.orchestrator.FailureReport\x12T\n\x10GetFailureReport\x12#.orchestrator.GetFailureReportParam\x1a\x1b.orchestrator.FailureReportB+Z)github.com/substra/orchestrator/lib/assetb\x06proto3') +DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x14\x66\x61ilure_report.proto\x12\x0corchestrator\x1a\x1fgoogle/protobuf/timestamp.proto\x1a\x0c\x63ommon.proto\"\xf5\x01\n\rFailureReport\x12\x11\n\tasset_key\x18\x01 \x01(\t\x12+\n\nerror_type\x18\x02 \x01(\x0e\x32\x17.orchestrator.ErrorType\x12/\n\x0clogs_address\x18\x03 \x01(\x0b\x32\x19.orchestrator.Addressable\x12\x31\n\rcreation_date\x18\x04 \x01(\x0b\x32\x1a.google.protobuf.Timestamp\x12\r\n\x05owner\x18\x05 \x01(\t\x12\x31\n\nasset_type\x18\x06 \x01(\x0e\x32\x1d.orchestrator.FailedAssetKind\"\xb6\x01\n\x10NewFailureReport\x12\x11\n\tasset_key\x18\x01 \x01(\t\x12+\n\nerror_type\x18\x02 \x01(\x0e\x32\x17.orchestrator.ErrorType\x12/\n\x0clogs_address\x18\x03 \x01(\x0b\x32\x19.orchestrator.Addressable\x12\x31\n\nasset_type\x18\x04 \x01(\x0e\x32\x1d.orchestrator.FailedAssetKind\"*\n\x15GetFailureReportParam\x12\x11\n\tasset_key\x18\x01 \x01(\t*p\n\tErrorType\x12\x1a\n\x16\x45RROR_TYPE_UNSPECIFIED\x10\x00\x12\x14\n\x10\x45RROR_TYPE_BUILD\x10\x01\x12\x18\n\x14\x45RROR_TYPE_EXECUTION\x10\x02\x12\x17\n\x13\x45RROR_TYPE_INTERNAL\x10\x03*e\n\x0f\x46\x61iledAssetKind\x12\x18\n\x14\x46\x41ILED_ASSET_UNKNOWN\x10\x00\x12\x1d\n\x19\x46\x41ILED_ASSET_COMPUTE_TASK\x10\x01\x12\x19\n\x15\x46\x41ILED_ASSET_FUNCTION\x10\x02\x32\xc2\x01\n\x14\x46\x61ilureReportService\x12T\n\x15RegisterFailureReport\x12\x1e.orchestrator.NewFailureReport\x1a\x1b.orchestrator.FailureReport\x12T\n\x10GetFailureReport\x12#.orchestrator.GetFailureReportParam\x1a\x1b.orchestrator.FailureReportB+Z)github.com/substra/orchestrator/lib/assetb\x06proto3') _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals()) _builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'failure_report_pb2', globals()) @@ -23,14 +23,16 @@ DESCRIPTOR._options = None DESCRIPTOR._serialized_options = b'Z)github.com/substra/orchestrator/lib/asset' - _ERRORTYPE._serialized_start=481 - _ERRORTYPE._serialized_end=593 + _ERRORTYPE._serialized_start=562 + _ERRORTYPE._serialized_end=674 + _FAILEDASSETKIND._serialized_start=676 + _FAILEDASSETKIND._serialized_end=777 _FAILUREREPORT._serialized_start=86 - _FAILUREREPORT._serialized_end=287 - _NEWFAILUREREPORT._serialized_start=290 - 
_NEWFAILUREREPORT._serialized_end=428 - _GETFAILUREREPORTPARAM._serialized_start=430 - _GETFAILUREREPORTPARAM._serialized_end=479 - _FAILUREREPORTSERVICE._serialized_start=596 - _FAILUREREPORTSERVICE._serialized_end=790 + _FAILUREREPORT._serialized_end=331 + _NEWFAILUREREPORT._serialized_start=334 + _NEWFAILUREREPORT._serialized_end=516 + _GETFAILUREREPORTPARAM._serialized_start=518 + _GETFAILUREREPORTPARAM._serialized_end=560 + _FAILUREREPORTSERVICE._serialized_start=780 + _FAILUREREPORTSERVICE._serialized_end=974 # @@protoc_insertion_point(module_scope) diff --git a/backend/orchestrator/failure_report_pb2.pyi b/backend/orchestrator/failure_report_pb2.pyi index 2abced6e5..441fb970b 100644 --- a/backend/orchestrator/failure_report_pb2.pyi +++ b/backend/orchestrator/failure_report_pb2.pyi @@ -56,40 +56,62 @@ It is likely to be caused by a fault in the system. It would require the action """ global___ErrorType = ErrorType +class _FailedAssetKind: + ValueType = typing.NewType("ValueType", builtins.int) + V: typing_extensions.TypeAlias = ValueType + +class _FailedAssetKindEnumTypeWrapper(google.protobuf.internal.enum_type_wrapper._EnumTypeWrapper[_FailedAssetKind.ValueType], builtins.type): + DESCRIPTOR: google.protobuf.descriptor.EnumDescriptor + FAILED_ASSET_UNKNOWN: _FailedAssetKind.ValueType # 0 + FAILED_ASSET_COMPUTE_TASK: _FailedAssetKind.ValueType # 1 + FAILED_ASSET_FUNCTION: _FailedAssetKind.ValueType # 2 + +class FailedAssetKind(_FailedAssetKind, metaclass=_FailedAssetKindEnumTypeWrapper): ... + +FAILED_ASSET_UNKNOWN: FailedAssetKind.ValueType # 0 +FAILED_ASSET_COMPUTE_TASK: FailedAssetKind.ValueType # 1 +FAILED_ASSET_FUNCTION: FailedAssetKind.ValueType # 2 +global___FailedAssetKind = FailedAssetKind + @typing_extensions.final class FailureReport(google.protobuf.message.Message): - """FailureReport is used to store information related to a failed ComputeTask.""" + """FailureReport is used to store information related to a failed ComputeTask or Function builds.""" DESCRIPTOR: google.protobuf.descriptor.Descriptor - COMPUTE_TASK_KEY_FIELD_NUMBER: builtins.int + ASSET_KEY_FIELD_NUMBER: builtins.int ERROR_TYPE_FIELD_NUMBER: builtins.int LOGS_ADDRESS_FIELD_NUMBER: builtins.int CREATION_DATE_FIELD_NUMBER: builtins.int OWNER_FIELD_NUMBER: builtins.int - compute_task_key: builtins.str + ASSET_TYPE_FIELD_NUMBER: builtins.int + asset_key: builtins.str error_type: global___ErrorType.ValueType @property def logs_address(self) -> common_pb2.Addressable: ... @property def creation_date(self) -> google.protobuf.timestamp_pb2.Timestamp: ... owner: builtins.str - """The owner of a failure report matches the 'worker' field of the associated compute task but can differ from + """In the case of a compute task failure, the owner of a failure report matches the 'worker' field of the associated compute task but can differ from the owner of the compute task. Indeed, a task belonging to some user can be executed on an organization belonging - to another user. The failure report generated will be located on the execution organization and belong to the owner + to another user. + In the case of a function, the owner will be the owner of the function (which builds the function). + The failure report generated will be located on the execution organization and belong to the owner of this organization. 
""" + asset_type: global___FailedAssetKind.ValueType def __init__( self, *, - compute_task_key: builtins.str = ..., + asset_key: builtins.str = ..., error_type: global___ErrorType.ValueType = ..., logs_address: common_pb2.Addressable | None = ..., creation_date: google.protobuf.timestamp_pb2.Timestamp | None = ..., owner: builtins.str = ..., + asset_type: global___FailedAssetKind.ValueType = ..., ) -> None: ... def HasField(self, field_name: typing_extensions.Literal["creation_date", b"creation_date", "logs_address", b"logs_address"]) -> builtins.bool: ... - def ClearField(self, field_name: typing_extensions.Literal["compute_task_key", b"compute_task_key", "creation_date", b"creation_date", "error_type", b"error_type", "logs_address", b"logs_address", "owner", b"owner"]) -> None: ... + def ClearField(self, field_name: typing_extensions.Literal["asset_key", b"asset_key", "asset_type", b"asset_type", "creation_date", b"creation_date", "error_type", b"error_type", "logs_address", b"logs_address", "owner", b"owner"]) -> None: ... global___FailureReport = FailureReport @@ -101,22 +123,25 @@ class NewFailureReport(google.protobuf.message.Message): DESCRIPTOR: google.protobuf.descriptor.Descriptor - COMPUTE_TASK_KEY_FIELD_NUMBER: builtins.int + ASSET_KEY_FIELD_NUMBER: builtins.int ERROR_TYPE_FIELD_NUMBER: builtins.int LOGS_ADDRESS_FIELD_NUMBER: builtins.int - compute_task_key: builtins.str + ASSET_TYPE_FIELD_NUMBER: builtins.int + asset_key: builtins.str error_type: global___ErrorType.ValueType @property def logs_address(self) -> common_pb2.Addressable: ... + asset_type: global___FailedAssetKind.ValueType def __init__( self, *, - compute_task_key: builtins.str = ..., + asset_key: builtins.str = ..., error_type: global___ErrorType.ValueType = ..., logs_address: common_pb2.Addressable | None = ..., + asset_type: global___FailedAssetKind.ValueType = ..., ) -> None: ... def HasField(self, field_name: typing_extensions.Literal["logs_address", b"logs_address"]) -> builtins.bool: ... - def ClearField(self, field_name: typing_extensions.Literal["compute_task_key", b"compute_task_key", "error_type", b"error_type", "logs_address", b"logs_address"]) -> None: ... + def ClearField(self, field_name: typing_extensions.Literal["asset_key", b"asset_key", "asset_type", b"asset_type", "error_type", b"error_type", "logs_address", b"logs_address"]) -> None: ... global___NewFailureReport = NewFailureReport @@ -126,13 +151,13 @@ class GetFailureReportParam(google.protobuf.message.Message): DESCRIPTOR: google.protobuf.descriptor.Descriptor - COMPUTE_TASK_KEY_FIELD_NUMBER: builtins.int - compute_task_key: builtins.str + ASSET_KEY_FIELD_NUMBER: builtins.int + asset_key: builtins.str def __init__( self, *, - compute_task_key: builtins.str = ..., + asset_key: builtins.str = ..., ) -> None: ... - def ClearField(self, field_name: typing_extensions.Literal["compute_task_key", b"compute_task_key"]) -> None: ... + def ClearField(self, field_name: typing_extensions.Literal["asset_key", b"asset_key"]) -> None: ... global___GetFailureReportParam = GetFailureReportParam diff --git a/backend/orchestrator/function_pb2.py b/backend/orchestrator/function_pb2.py index 9074fafab..210de679a 100644 --- a/backend/orchestrator/function_pb2.py +++ b/backend/orchestrator/function_pb2.py @@ -15,7 +15,7 @@ from . 
import common_pb2 as common__pb2 -DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x0e\x66unction.proto\x12\x0corchestrator\x1a\x1fgoogle/protobuf/timestamp.proto\x1a\x0c\x63ommon.proto\"Z\n\rFunctionInput\x12%\n\x04kind\x18\x01 \x01(\x0e\x32\x17.orchestrator.AssetKind\x12\x10\n\x08multiple\x18\x02 \x01(\x08\x12\x10\n\x08optional\x18\x03 \x01(\x08\"I\n\x0e\x46unctionOutput\x12%\n\x04kind\x18\x01 \x01(\x0e\x32\x17.orchestrator.AssetKind\x12\x10\n\x08multiple\x18\x02 \x01(\x08\"\xf1\x04\n\x08\x46unction\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\x0c\n\x04name\x18\x02 \x01(\t\x12.\n\x0b\x64\x65scription\x18\x04 \x01(\x0b\x32\x19.orchestrator.Addressable\x12+\n\x08\x66unction\x18\x05 \x01(\x0b\x32\x19.orchestrator.Addressable\x12.\n\x0bpermissions\x18\x06 \x01(\x0b\x32\x19.orchestrator.Permissions\x12\r\n\x05owner\x18\x07 \x01(\t\x12\x31\n\rcreation_date\x18\x08 \x01(\x0b\x32\x1a.google.protobuf.Timestamp\x12\x36\n\x08metadata\x18\x10 \x03(\x0b\x32$.orchestrator.Function.MetadataEntry\x12\x32\n\x06inputs\x18\x11 \x03(\x0b\x32\".orchestrator.Function.InputsEntry\x12\x34\n\x07outputs\x18\x12 \x03(\x0b\x32#.orchestrator.Function.OutputsEntry\x1a/\n\rMetadataEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\r\n\x05value\x18\x02 \x01(\t:\x02\x38\x01\x1aJ\n\x0bInputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12*\n\x05value\x18\x02 \x01(\x0b\x32\x1b.orchestrator.FunctionInput:\x02\x38\x01\x1aL\n\x0cOutputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12+\n\x05value\x18\x02 \x01(\x0b\x32\x1c.orchestrator.FunctionOutput:\x02\x38\x01J\x04\x08\x03\x10\x04R\x08\x63\x61tegory\"\xc2\x04\n\x0bNewFunction\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\x0c\n\x04name\x18\x02 \x01(\t\x12.\n\x0b\x64\x65scription\x18\x04 \x01(\x0b\x32\x19.orchestrator.Addressable\x12+\n\x08\x66unction\x18\x05 \x01(\x0b\x32\x19.orchestrator.Addressable\x12\x35\n\x0fnew_permissions\x18\x06 \x01(\x0b\x32\x1c.orchestrator.NewPermissions\x12\x39\n\x08metadata\x18\x11 \x03(\x0b\x32\'.orchestrator.NewFunction.MetadataEntry\x12\x35\n\x06inputs\x18\x12 \x03(\x0b\x32%.orchestrator.NewFunction.InputsEntry\x12\x37\n\x07outputs\x18\x13 \x03(\x0b\x32&.orchestrator.NewFunction.OutputsEntry\x1a/\n\rMetadataEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\r\n\x05value\x18\x02 \x01(\t:\x02\x38\x01\x1aJ\n\x0bInputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12*\n\x05value\x18\x02 \x01(\x0b\x32\x1b.orchestrator.FunctionInput:\x02\x38\x01\x1aL\n\x0cOutputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12+\n\x05value\x18\x02 \x01(\x0b\x32\x1c.orchestrator.FunctionOutput:\x02\x38\x01J\x04\x08\x03\x10\x04R\x08\x63\x61tegory\"\x1f\n\x10GetFunctionParam\x12\x0b\n\x03key\x18\x01 \x01(\t\"\\\n\x16QueryFunctionsResponse\x12)\n\tFunctions\x18\x01 \x03(\x0b\x32\x16.orchestrator.Function\x12\x17\n\x0fnext_page_token\x18\x02 \x01(\t\"/\n\x13\x46unctionQueryFilter\x12\x18\n\x10\x63ompute_plan_key\x18\x02 \x01(\t\"o\n\x13QueryFunctionsParam\x12\x12\n\npage_token\x18\x01 \x01(\t\x12\x11\n\tpage_size\x18\x02 \x01(\r\x12\x31\n\x06\x66ilter\x18\x03 \x01(\x0b\x32!.orchestrator.FunctionQueryFilter\"0\n\x13UpdateFunctionParam\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\x0c\n\x04name\x18\x02 
\x01(\t\"\x18\n\x16UpdateFunctionResponse2\xd5\x02\n\x0f\x46unctionService\x12\x45\n\x10RegisterFunction\x12\x19.orchestrator.NewFunction\x1a\x16.orchestrator.Function\x12\x45\n\x0bGetFunction\x12\x1e.orchestrator.GetFunctionParam\x1a\x16.orchestrator.Function\x12Y\n\x0eQueryFunctions\x12!.orchestrator.QueryFunctionsParam\x1a$.orchestrator.QueryFunctionsResponse\x12Y\n\x0eUpdateFunction\x12!.orchestrator.UpdateFunctionParam\x1a$.orchestrator.UpdateFunctionResponseB+Z)github.com/substra/orchestrator/lib/assetb\x06proto3') +DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x0e\x66unction.proto\x12\x0corchestrator\x1a\x1fgoogle/protobuf/timestamp.proto\x1a\x0c\x63ommon.proto\"Z\n\rFunctionInput\x12%\n\x04kind\x18\x01 \x01(\x0e\x32\x17.orchestrator.AssetKind\x12\x10\n\x08multiple\x18\x02 \x01(\x08\x12\x10\n\x08optional\x18\x03 \x01(\x08\"I\n\x0e\x46unctionOutput\x12%\n\x04kind\x18\x01 \x01(\x0e\x32\x17.orchestrator.AssetKind\x12\x10\n\x08multiple\x18\x02 \x01(\x08\"\x9f\x05\n\x08\x46unction\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\x0c\n\x04name\x18\x02 \x01(\t\x12.\n\x0b\x64\x65scription\x18\x04 \x01(\x0b\x32\x19.orchestrator.Addressable\x12+\n\x08\x66unction\x18\x05 \x01(\x0b\x32\x19.orchestrator.Addressable\x12.\n\x0bpermissions\x18\x06 \x01(\x0b\x32\x19.orchestrator.Permissions\x12\r\n\x05owner\x18\x07 \x01(\t\x12\x31\n\rcreation_date\x18\x08 \x01(\x0b\x32\x1a.google.protobuf.Timestamp\x12\x36\n\x08metadata\x18\x10 \x03(\x0b\x32$.orchestrator.Function.MetadataEntry\x12\x32\n\x06inputs\x18\x11 \x03(\x0b\x32\".orchestrator.Function.InputsEntry\x12\x34\n\x07outputs\x18\x12 \x03(\x0b\x32#.orchestrator.Function.OutputsEntry\x12,\n\x06status\x18\x13 \x01(\x0e\x32\x1c.orchestrator.FunctionStatus\x1a/\n\rMetadataEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\r\n\x05value\x18\x02 \x01(\t:\x02\x38\x01\x1aJ\n\x0bInputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12*\n\x05value\x18\x02 \x01(\x0b\x32\x1b.orchestrator.FunctionInput:\x02\x38\x01\x1aL\n\x0cOutputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12+\n\x05value\x18\x02 \x01(\x0b\x32\x1c.orchestrator.FunctionOutput:\x02\x38\x01J\x04\x08\x03\x10\x04R\x08\x63\x61tegory\"\xc2\x04\n\x0bNewFunction\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\x0c\n\x04name\x18\x02 \x01(\t\x12.\n\x0b\x64\x65scription\x18\x04 \x01(\x0b\x32\x19.orchestrator.Addressable\x12+\n\x08\x66unction\x18\x05 \x01(\x0b\x32\x19.orchestrator.Addressable\x12\x35\n\x0fnew_permissions\x18\x06 \x01(\x0b\x32\x1c.orchestrator.NewPermissions\x12\x39\n\x08metadata\x18\x11 \x03(\x0b\x32\'.orchestrator.NewFunction.MetadataEntry\x12\x35\n\x06inputs\x18\x12 \x03(\x0b\x32%.orchestrator.NewFunction.InputsEntry\x12\x37\n\x07outputs\x18\x13 \x03(\x0b\x32&.orchestrator.NewFunction.OutputsEntry\x1a/\n\rMetadataEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\r\n\x05value\x18\x02 \x01(\t:\x02\x38\x01\x1aJ\n\x0bInputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12*\n\x05value\x18\x02 \x01(\x0b\x32\x1b.orchestrator.FunctionInput:\x02\x38\x01\x1aL\n\x0cOutputsEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12+\n\x05value\x18\x02 \x01(\x0b\x32\x1c.orchestrator.FunctionOutput:\x02\x38\x01J\x04\x08\x03\x10\x04R\x08\x63\x61tegory\"\x1f\n\x10GetFunctionParam\x12\x0b\n\x03key\x18\x01 \x01(\t\"\\\n\x16QueryFunctionsResponse\x12)\n\tFunctions\x18\x01 \x03(\x0b\x32\x16.orchestrator.Function\x12\x17\n\x0fnext_page_token\x18\x02 \x01(\t\"/\n\x13\x46unctionQueryFilter\x12\x18\n\x10\x63ompute_plan_key\x18\x02 \x01(\t\"o\n\x13QueryFunctionsParam\x12\x12\n\npage_token\x18\x01 \x01(\t\x12\x11\n\tpage_size\x18\x02 \x01(\r\x12\x31\n\x06\x66ilter\x18\x03 
\x01(\x0b\x32!.orchestrator.FunctionQueryFilter\"0\n\x13UpdateFunctionParam\x12\x0b\n\x03key\x18\x01 \x01(\t\x12\x0c\n\x04name\x18\x02 \x01(\t\"\x18\n\x16UpdateFunctionResponse\"^\n\x18\x41pplyFunctionActionParam\x12\x14\n\x0c\x66unction_key\x18\x01 \x01(\t\x12,\n\x06\x61\x63tion\x18\x02 \x01(\x0e\x32\x1c.orchestrator.FunctionAction\"\x1d\n\x1b\x41pplyFunctionActionResponse*\xa0\x01\n\x0e\x46unctionAction\x12\x1b\n\x17\x46UNCTION_ACTION_UNKNOWN\x10\x00\x12\x1c\n\x18\x46UNCTION_ACTION_BUILDING\x10\x01\x12\x1c\n\x18\x46UNCTION_ACTION_CANCELED\x10\x02\x12\x1a\n\x16\x46UNCTION_ACTION_FAILED\x10\x03\x12\x19\n\x15\x46UNCTION_ACTION_READY\x10\x04*\xbd\x01\n\x0e\x46unctionStatus\x12\x1b\n\x17\x46UNCTION_STATUS_UNKNOWN\x10\x00\x12\x1b\n\x17\x46UNCTION_STATUS_WAITING\x10\x01\x12\x1c\n\x18\x46UNCTION_STATUS_BUILDING\x10\x02\x12\x19\n\x15\x46UNCTION_STATUS_READY\x10\x03\x12\x1c\n\x18\x46UNCTION_STATUS_CANCELED\x10\x04\x12\x1a\n\x16\x46UNCTION_STATUS_FAILED\x10\x05\x32\xbf\x03\n\x0f\x46unctionService\x12\x45\n\x10RegisterFunction\x12\x19.orchestrator.NewFunction\x1a\x16.orchestrator.Function\x12\x45\n\x0bGetFunction\x12\x1e.orchestrator.GetFunctionParam\x1a\x16.orchestrator.Function\x12Y\n\x0eQueryFunctions\x12!.orchestrator.QueryFunctionsParam\x1a$.orchestrator.QueryFunctionsResponse\x12Y\n\x0eUpdateFunction\x12!.orchestrator.UpdateFunctionParam\x1a$.orchestrator.UpdateFunctionResponse\x12h\n\x13\x41pplyFunctionAction\x12&.orchestrator.ApplyFunctionActionParam\x1a).orchestrator.ApplyFunctionActionResponseB+Z)github.com/substra/orchestrator/lib/assetb\x06proto3') _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals()) _builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'function_pb2', globals()) @@ -35,38 +35,46 @@ _NEWFUNCTION_INPUTSENTRY._serialized_options = b'8\001' _NEWFUNCTION_OUTPUTSENTRY._options = None _NEWFUNCTION_OUTPUTSENTRY._serialized_options = b'8\001' + _FUNCTIONACTION._serialized_start=1994 + _FUNCTIONACTION._serialized_end=2154 + _FUNCTIONSTATUS._serialized_start=2157 + _FUNCTIONSTATUS._serialized_end=2346 _FUNCTIONINPUT._serialized_start=79 _FUNCTIONINPUT._serialized_end=169 _FUNCTIONOUTPUT._serialized_start=171 _FUNCTIONOUTPUT._serialized_end=244 _FUNCTION._serialized_start=247 - _FUNCTION._serialized_end=872 - _FUNCTION_METADATAENTRY._serialized_start=655 - _FUNCTION_METADATAENTRY._serialized_end=702 - _FUNCTION_INPUTSENTRY._serialized_start=704 - _FUNCTION_INPUTSENTRY._serialized_end=778 - _FUNCTION_OUTPUTSENTRY._serialized_start=780 - _FUNCTION_OUTPUTSENTRY._serialized_end=856 - _NEWFUNCTION._serialized_start=875 - _NEWFUNCTION._serialized_end=1453 - _NEWFUNCTION_METADATAENTRY._serialized_start=655 - _NEWFUNCTION_METADATAENTRY._serialized_end=702 - _NEWFUNCTION_INPUTSENTRY._serialized_start=704 - _NEWFUNCTION_INPUTSENTRY._serialized_end=778 - _NEWFUNCTION_OUTPUTSENTRY._serialized_start=780 - _NEWFUNCTION_OUTPUTSENTRY._serialized_end=856 - _GETFUNCTIONPARAM._serialized_start=1455 - _GETFUNCTIONPARAM._serialized_end=1486 - _QUERYFUNCTIONSRESPONSE._serialized_start=1488 - _QUERYFUNCTIONSRESPONSE._serialized_end=1580 - _FUNCTIONQUERYFILTER._serialized_start=1582 - _FUNCTIONQUERYFILTER._serialized_end=1629 - _QUERYFUNCTIONSPARAM._serialized_start=1631 - _QUERYFUNCTIONSPARAM._serialized_end=1742 - _UPDATEFUNCTIONPARAM._serialized_start=1744 - _UPDATEFUNCTIONPARAM._serialized_end=1792 - _UPDATEFUNCTIONRESPONSE._serialized_start=1794 - _UPDATEFUNCTIONRESPONSE._serialized_end=1818 - _FUNCTIONSERVICE._serialized_start=1821 - _FUNCTIONSERVICE._serialized_end=2162 + 
_FUNCTION._serialized_end=918 + _FUNCTION_METADATAENTRY._serialized_start=701 + _FUNCTION_METADATAENTRY._serialized_end=748 + _FUNCTION_INPUTSENTRY._serialized_start=750 + _FUNCTION_INPUTSENTRY._serialized_end=824 + _FUNCTION_OUTPUTSENTRY._serialized_start=826 + _FUNCTION_OUTPUTSENTRY._serialized_end=902 + _NEWFUNCTION._serialized_start=921 + _NEWFUNCTION._serialized_end=1499 + _NEWFUNCTION_METADATAENTRY._serialized_start=701 + _NEWFUNCTION_METADATAENTRY._serialized_end=748 + _NEWFUNCTION_INPUTSENTRY._serialized_start=750 + _NEWFUNCTION_INPUTSENTRY._serialized_end=824 + _NEWFUNCTION_OUTPUTSENTRY._serialized_start=826 + _NEWFUNCTION_OUTPUTSENTRY._serialized_end=902 + _GETFUNCTIONPARAM._serialized_start=1501 + _GETFUNCTIONPARAM._serialized_end=1532 + _QUERYFUNCTIONSRESPONSE._serialized_start=1534 + _QUERYFUNCTIONSRESPONSE._serialized_end=1626 + _FUNCTIONQUERYFILTER._serialized_start=1628 + _FUNCTIONQUERYFILTER._serialized_end=1675 + _QUERYFUNCTIONSPARAM._serialized_start=1677 + _QUERYFUNCTIONSPARAM._serialized_end=1788 + _UPDATEFUNCTIONPARAM._serialized_start=1790 + _UPDATEFUNCTIONPARAM._serialized_end=1838 + _UPDATEFUNCTIONRESPONSE._serialized_start=1840 + _UPDATEFUNCTIONRESPONSE._serialized_end=1864 + _APPLYFUNCTIONACTIONPARAM._serialized_start=1866 + _APPLYFUNCTIONACTIONPARAM._serialized_end=1960 + _APPLYFUNCTIONACTIONRESPONSE._serialized_start=1962 + _APPLYFUNCTIONACTIONRESPONSE._serialized_end=1991 + _FUNCTIONSERVICE._serialized_start=2349 + _FUNCTIONSERVICE._serialized_end=2796 # @@protoc_insertion_point(module_scope) diff --git a/backend/orchestrator/function_pb2.pyi b/backend/orchestrator/function_pb2.pyi index 49d3ba048..b492431e3 100644 --- a/backend/orchestrator/function_pb2.pyi +++ b/backend/orchestrator/function_pb2.pyi @@ -7,17 +7,63 @@ import collections.abc import common_pb2 import google.protobuf.descriptor import google.protobuf.internal.containers +import google.protobuf.internal.enum_type_wrapper import google.protobuf.message import google.protobuf.timestamp_pb2 import sys +import typing -if sys.version_info >= (3, 8): +if sys.version_info >= (3, 10): import typing as typing_extensions else: import typing_extensions DESCRIPTOR: google.protobuf.descriptor.FileDescriptor +class _FunctionAction: + ValueType = typing.NewType("ValueType", builtins.int) + V: typing_extensions.TypeAlias = ValueType + +class _FunctionActionEnumTypeWrapper(google.protobuf.internal.enum_type_wrapper._EnumTypeWrapper[_FunctionAction.ValueType], builtins.type): + DESCRIPTOR: google.protobuf.descriptor.EnumDescriptor + FUNCTION_ACTION_UNKNOWN: _FunctionAction.ValueType # 0 + FUNCTION_ACTION_BUILDING: _FunctionAction.ValueType # 1 + FUNCTION_ACTION_CANCELED: _FunctionAction.ValueType # 2 + FUNCTION_ACTION_FAILED: _FunctionAction.ValueType # 3 + FUNCTION_ACTION_READY: _FunctionAction.ValueType # 4 + +class FunctionAction(_FunctionAction, metaclass=_FunctionActionEnumTypeWrapper): ... 
+ +FUNCTION_ACTION_UNKNOWN: FunctionAction.ValueType # 0 +FUNCTION_ACTION_BUILDING: FunctionAction.ValueType # 1 +FUNCTION_ACTION_CANCELED: FunctionAction.ValueType # 2 +FUNCTION_ACTION_FAILED: FunctionAction.ValueType # 3 +FUNCTION_ACTION_READY: FunctionAction.ValueType # 4 +global___FunctionAction = FunctionAction + +class _FunctionStatus: + ValueType = typing.NewType("ValueType", builtins.int) + V: typing_extensions.TypeAlias = ValueType + +class _FunctionStatusEnumTypeWrapper(google.protobuf.internal.enum_type_wrapper._EnumTypeWrapper[_FunctionStatus.ValueType], builtins.type): + DESCRIPTOR: google.protobuf.descriptor.EnumDescriptor + FUNCTION_STATUS_UNKNOWN: _FunctionStatus.ValueType # 0 + FUNCTION_STATUS_WAITING: _FunctionStatus.ValueType # 1 + FUNCTION_STATUS_BUILDING: _FunctionStatus.ValueType # 2 + FUNCTION_STATUS_READY: _FunctionStatus.ValueType # 3 + FUNCTION_STATUS_CANCELED: _FunctionStatus.ValueType # 4 + FUNCTION_STATUS_FAILED: _FunctionStatus.ValueType # 5 + +class FunctionStatus(_FunctionStatus, metaclass=_FunctionStatusEnumTypeWrapper): ... + +FUNCTION_STATUS_UNKNOWN: FunctionStatus.ValueType # 0 +FUNCTION_STATUS_WAITING: FunctionStatus.ValueType # 1 +FUNCTION_STATUS_BUILDING: FunctionStatus.ValueType # 2 +FUNCTION_STATUS_READY: FunctionStatus.ValueType # 3 +FUNCTION_STATUS_CANCELED: FunctionStatus.ValueType # 4 +FUNCTION_STATUS_FAILED: FunctionStatus.ValueType # 5 +global___FunctionStatus = FunctionStatus + @typing_extensions.final class FunctionInput(google.protobuf.message.Message): DESCRIPTOR: google.protobuf.descriptor.Descriptor @@ -127,6 +173,7 @@ class Function(google.protobuf.message.Message): METADATA_FIELD_NUMBER: builtins.int INPUTS_FIELD_NUMBER: builtins.int OUTPUTS_FIELD_NUMBER: builtins.int + STATUS_FIELD_NUMBER: builtins.int key: builtins.str name: builtins.str @property @@ -144,6 +191,7 @@ class Function(google.protobuf.message.Message): def inputs(self) -> google.protobuf.internal.containers.MessageMap[builtins.str, global___FunctionInput]: ... @property def outputs(self) -> google.protobuf.internal.containers.MessageMap[builtins.str, global___FunctionOutput]: ... + status: global___FunctionStatus.ValueType def __init__( self, *, @@ -157,9 +205,10 @@ class Function(google.protobuf.message.Message): metadata: collections.abc.Mapping[builtins.str, builtins.str] | None = ..., inputs: collections.abc.Mapping[builtins.str, global___FunctionInput] | None = ..., outputs: collections.abc.Mapping[builtins.str, global___FunctionOutput] | None = ..., + status: global___FunctionStatus.ValueType = ..., ) -> None: ... def HasField(self, field_name: typing_extensions.Literal["creation_date", b"creation_date", "description", b"description", "function", b"function", "permissions", b"permissions"]) -> builtins.bool: ... - def ClearField(self, field_name: typing_extensions.Literal["creation_date", b"creation_date", "description", b"description", "function", b"function", "inputs", b"inputs", "key", b"key", "metadata", b"metadata", "name", b"name", "outputs", b"outputs", "owner", b"owner", "permissions", b"permissions"]) -> None: ... + def ClearField(self, field_name: typing_extensions.Literal["creation_date", b"creation_date", "description", b"description", "function", b"function", "inputs", b"inputs", "key", b"key", "metadata", b"metadata", "name", b"name", "outputs", b"outputs", "owner", b"owner", "permissions", b"permissions", "status", b"status"]) -> None: ... 
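# [Editor's note - illustrative sketch, not part of this patch.] Once these stubs are
# regenerated, the builder side can report a build outcome through the new
# ApplyFunctionAction RPC, and consumers can read the `status` field added to Function.
# The import path and channel target are assumptions (in this repository the generated
# modules live under backend/orchestrator); error handling is omitted.
import grpc

import function_pb2
import function_pb2_grpc


def report_build_outcome(target: str, function_key: str, success: bool) -> None:
    # Pick the terminal action matching the build result.
    action = function_pb2.FUNCTION_ACTION_READY if success else function_pb2.FUNCTION_ACTION_FAILED
    with grpc.insecure_channel(target) as channel:
        stub = function_pb2_grpc.FunctionServiceStub(channel)
        # Record the terminal build state on the orchestrator.
        stub.ApplyFunctionAction(
            function_pb2.ApplyFunctionActionParam(function_key=function_key, action=action)
        )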
global___Function = Function @@ -361,3 +410,31 @@ class UpdateFunctionResponse(google.protobuf.message.Message): ) -> None: ... global___UpdateFunctionResponse = UpdateFunctionResponse + +@typing_extensions.final +class ApplyFunctionActionParam(google.protobuf.message.Message): + DESCRIPTOR: google.protobuf.descriptor.Descriptor + + FUNCTION_KEY_FIELD_NUMBER: builtins.int + ACTION_FIELD_NUMBER: builtins.int + function_key: builtins.str + action: global___FunctionAction.ValueType + def __init__( + self, + *, + function_key: builtins.str = ..., + action: global___FunctionAction.ValueType = ..., + ) -> None: ... + def ClearField(self, field_name: typing_extensions.Literal["action", b"action", "function_key", b"function_key"]) -> None: ... + +global___ApplyFunctionActionParam = ApplyFunctionActionParam + +@typing_extensions.final +class ApplyFunctionActionResponse(google.protobuf.message.Message): + DESCRIPTOR: google.protobuf.descriptor.Descriptor + + def __init__( + self, + ) -> None: ... + +global___ApplyFunctionActionResponse = ApplyFunctionActionResponse diff --git a/backend/orchestrator/function_pb2_grpc.py b/backend/orchestrator/function_pb2_grpc.py index 17b7046f6..b557d15fc 100644 --- a/backend/orchestrator/function_pb2_grpc.py +++ b/backend/orchestrator/function_pb2_grpc.py @@ -34,6 +34,11 @@ def __init__(self, channel): request_serializer=function__pb2.UpdateFunctionParam.SerializeToString, response_deserializer=function__pb2.UpdateFunctionResponse.FromString, ) + self.ApplyFunctionAction = channel.unary_unary( + '/orchestrator.FunctionService/ApplyFunctionAction', + request_serializer=function__pb2.ApplyFunctionActionParam.SerializeToString, + response_deserializer=function__pb2.ApplyFunctionActionResponse.FromString, + ) class FunctionServiceServicer(object): @@ -63,6 +68,12 @@ def UpdateFunction(self, request, context): context.set_details('Method not implemented!') raise NotImplementedError('Method not implemented!') + def ApplyFunctionAction(self, request, context): + """Missing associated documentation comment in .proto file.""" + context.set_code(grpc.StatusCode.UNIMPLEMENTED) + context.set_details('Method not implemented!') + raise NotImplementedError('Method not implemented!') + def add_FunctionServiceServicer_to_server(servicer, server): rpc_method_handlers = { @@ -86,6 +97,11 @@ def add_FunctionServiceServicer_to_server(servicer, server): request_deserializer=function__pb2.UpdateFunctionParam.FromString, response_serializer=function__pb2.UpdateFunctionResponse.SerializeToString, ), + 'ApplyFunctionAction': grpc.unary_unary_rpc_method_handler( + servicer.ApplyFunctionAction, + request_deserializer=function__pb2.ApplyFunctionActionParam.FromString, + response_serializer=function__pb2.ApplyFunctionActionResponse.SerializeToString, + ), } generic_handler = grpc.method_handlers_generic_handler( 'orchestrator.FunctionService', rpc_method_handlers) @@ -163,3 +179,20 @@ def UpdateFunction(request, function__pb2.UpdateFunctionResponse.FromString, options, channel_credentials, insecure, call_credentials, compression, wait_for_ready, timeout, metadata) + + @staticmethod + def ApplyFunctionAction(request, + target, + options=(), + channel_credentials=None, + call_credentials=None, + insecure=False, + compression=None, + wait_for_ready=None, + timeout=None, + metadata=None): + return grpc.experimental.unary_unary(request, target, '/orchestrator.FunctionService/ApplyFunctionAction', + function__pb2.ApplyFunctionActionParam.SerializeToString, + 
function__pb2.ApplyFunctionActionResponse.FromString, + options, channel_credentials, + insecure, call_credentials, compression, wait_for_ready, timeout, metadata) diff --git a/backend/orchestrator/function_pb2_grpc.pyi b/backend/orchestrator/function_pb2_grpc.pyi index 66b7356a1..8dff7a938 100644 --- a/backend/orchestrator/function_pb2_grpc.pyi +++ b/backend/orchestrator/function_pb2_grpc.pyi @@ -24,6 +24,10 @@ class FunctionServiceStub: function_pb2.UpdateFunctionParam, function_pb2.UpdateFunctionResponse, ] + ApplyFunctionAction: grpc.UnaryUnaryMultiCallable[ + function_pb2.ApplyFunctionActionParam, + function_pb2.ApplyFunctionActionResponse, + ] class FunctionServiceServicer(metaclass=abc.ABCMeta): @abc.abstractmethod @@ -50,5 +54,11 @@ class FunctionServiceServicer(metaclass=abc.ABCMeta): request: function_pb2.UpdateFunctionParam, context: grpc.ServicerContext, ) -> function_pb2.UpdateFunctionResponse: ... + @abc.abstractmethod + def ApplyFunctionAction( + self, + request: function_pb2.ApplyFunctionActionParam, + context: grpc.ServicerContext, + ) -> function_pb2.ApplyFunctionActionResponse: ... def add_FunctionServiceServicer_to_server(servicer: FunctionServiceServicer, server: grpc.Server) -> None: ... diff --git a/backend/orchestrator/mock.py b/backend/orchestrator/mock.py index b0995298e..1cf4f2eca 100644 --- a/backend/orchestrator/mock.py +++ b/backend/orchestrator/mock.py @@ -11,6 +11,7 @@ from .resources import DataSample from .resources import Function from .resources import FunctionInput +from .resources import FunctionStatus from .resources import Model from .resources import Permission from .resources import Permissions @@ -94,6 +95,7 @@ class Meta: function_address = factory.SubFactory(AddressFactory) inputs = {} outputs = {} + status = FunctionStatus.FUNCTION_STATUS_WAITING class ComputePlanFactory(factory.Factory): diff --git a/backend/orchestrator/resources.py b/backend/orchestrator/resources.py index a2071cef4..2b1a62ea8 100644 --- a/backend/orchestrator/resources.py +++ b/backend/orchestrator/resources.py @@ -19,7 +19,7 @@ TAG_KEY = "__tag__" -class AutoNameEnum(enum.Enum): +class AutoNameEnum(str, enum.Enum): def _generate_next_value_(name, start, count, last_values): # noqa: N805 return name @@ -138,12 +138,26 @@ def from_grpc(cls, o: function_pb2.FunctionOutput) -> FunctionOutput: return cls(kind=AssetKind.from_grpc(o.kind), multiple=o.multiple) +class FunctionStatus(AutoNameEnum): + FUNCTION_STATUS_UNKNOWN = enum.auto() + FUNCTION_STATUS_WAITING = enum.auto() + FUNCTION_STATUS_BUILDING = enum.auto() + FUNCTION_STATUS_READY = enum.auto() + FUNCTION_STATUS_CANCELED = enum.auto() + FUNCTION_STATUS_FAILED = enum.auto() + + @classmethod + def from_grpc(cls, s: function_pb2.FunctionStatus.ValueType) -> FunctionStatus: + return cls(function_pb2.FunctionStatus.Name(s)) + + class Function(pydantic.BaseModel): key: str owner: str function_address: Address inputs: dict[str, FunctionInput] outputs: dict[str, FunctionOutput] + status: FunctionStatus @classmethod def from_grpc(cls, a: function_pb2.Function) -> Function: @@ -153,6 +167,7 @@ def from_grpc(cls, a: function_pb2.Function) -> Function: function_address=Address.from_grpc(a.function), inputs={k: FunctionInput.from_grpc(i) for k, i in a.inputs.items()}, outputs={k: FunctionOutput.from_grpc(o) for k, o in a.outputs.items()}, + status=FunctionStatus.from_grpc(a.status), ) diff --git a/backend/requirements.txt b/backend/requirements.txt index d0c943c59..6050d8805 100644 --- a/backend/requirements.txt +++ 
b/backend/requirements.txt @@ -27,3 +27,5 @@ mozilla-django-oidc==3.0.0 # Dependencies used in local dev mode argh==0.26.2 watchdog==2.1.9 +python-dxf +tqdm diff --git a/backend/substrapp/clients/organization.py b/backend/substrapp/clients/organization.py index 399a44746..086fc64cc 100644 --- a/backend/substrapp/clients/organization.py +++ b/backend/substrapp/clients/organization.py @@ -138,7 +138,7 @@ def _http_request( try: response.raise_for_status() except requests.exceptions.HTTPError as exc: - status_code = exc.response.status_code if exc.response else None + status_code = response.status_code if exc.response else None raise OrganizationHttpError(url=url, status_code=status_code) return response @@ -178,13 +178,13 @@ def get( channel: str, organization_id: str, url: str, - checksum: str, + checksum: typing.Optional[str], salt: typing.Optional[str] = None, ) -> bytes: """Get asset data.""" content = _http_request(_Method.GET, channel, organization_id, url).content new_checksum = compute_hash(content, key=salt) - if new_checksum != checksum: + if checksum is not None and new_checksum != checksum: raise IntegrityError(f"url {url}: checksum doesn't match {checksum} vs {new_checksum}") return content diff --git a/backend/substrapp/compute_tasks/errors.py b/backend/substrapp/compute_tasks/errors.py index a3dea56e6..71bd12823 100644 --- a/backend/substrapp/compute_tasks/errors.py +++ b/backend/substrapp/compute_tasks/errors.py @@ -1,7 +1,6 @@ """Objects to manage errors occurring in a compute task.""" import enum -from io import BytesIO from typing import BinaryIO from orchestrator import failure_report_pb2 @@ -61,34 +60,6 @@ class _ComputeTaskError(RuntimeError): error_type: ComputeTaskErrorType -class BuildRetryError(_ComputeTaskError, CeleryRetryError): - """An error occurred during the build of a container image. - - Args: - logs (str): the container image build logs - """ - - error_type = ComputeTaskErrorType.BUILD_ERROR - - def __init__(self, logs: str, *args, **kwargs): - self.logs = BytesIO(str.encode(logs)) - super().__init__(logs, *args, **kwargs) - - -class BuildError(_ComputeTaskError, CeleryNoRetryError): - """An error occurred during the build of a container image. - - Args: - logs (str): the container image build logs - """ - - error_type = ComputeTaskErrorType.BUILD_ERROR - - def __init__(self, logs: str, *args, **kwargs): - self.logs = BytesIO(str.encode(logs)) - super().__init__(logs, *args, **kwargs) - - class ExecutionError(_ComputeTaskError, CeleryNoRetryError): """An error occurred during the execution of a command in a container image. @@ -100,10 +71,10 @@ class ExecutionError(_ComputeTaskError, CeleryNoRetryError): def __init__(self, logs: BinaryIO, *args, **kwargs): self.logs = logs - super().__init__(*args, **kwargs) + super().__init__(logs, *args, **kwargs) -def get_error_type(exc: Exception) -> failure_report_pb2.ErrorType: +def get_error_type(exc: Exception) -> failure_report_pb2.ErrorType.ValueType: """From a given exception, return an error type safe to store and to advertise to the user. 
Args: diff --git a/backend/substrapp/compute_tasks/execute.py b/backend/substrapp/compute_tasks/execute.py index 7f82cd2dc..d652dab41 100644 --- a/backend/substrapp/compute_tasks/execute.py +++ b/backend/substrapp/compute_tasks/execute.py @@ -26,12 +26,14 @@ from substrapp.compute_tasks.volumes import get_volumes from substrapp.compute_tasks.volumes import get_worker_subtuple_pvc_name from substrapp.docker_registry import get_container_image_name +from substrapp.docker_registry import get_entrypoint from substrapp.exceptions import PodReadinessTimeoutError from substrapp.kubernetes_utils import delete_pod from substrapp.kubernetes_utils import execute from substrapp.kubernetes_utils import get_volume from substrapp.kubernetes_utils import pod_exists_by_label_selector from substrapp.kubernetes_utils import wait_for_pod_readiness +from substrapp.models import ImageEntrypoint from substrapp.orchestrator import get_orchestrator_client from substrapp.utils import timeit @@ -49,6 +51,12 @@ def execute_compute_task(ctx: Context) -> None: env = get_environment(ctx) image = get_container_image_name(container_image_tag) + # save entrypoint to DB + entrypoint = get_entrypoint(container_image_tag) + ImageEntrypoint.objects.get_or_create( + function_checksum=ctx.function.function_address.checksum, entrypoint_json=entrypoint + ) + k8s_client = _get_k8s_client() should_create_pod = not pod_exists_by_label_selector(k8s_client, compute_pod.label_selector) diff --git a/backend/substrapp/compute_tasks/image_builder.py b/backend/substrapp/compute_tasks/image_builder.py index 691788911..276e39887 100644 --- a/backend/substrapp/compute_tasks/image_builder.py +++ b/backend/substrapp/compute_tasks/image_builder.py @@ -1,310 +1,62 @@ -import json import os +import pathlib +import time from tempfile import TemporaryDirectory -import kubernetes import structlog from django.conf import settings import orchestrator -from substrapp import exceptions -from substrapp.compute_tasks import errors as compute_task_errors +import substrapp.clients.organization as organization_client +from api.models import Function as ApiFunction +from builder import exceptions +from image_transfer import push_payload from substrapp.compute_tasks import utils -from substrapp.compute_tasks.compute_pod import Label -from substrapp.compute_tasks.datastore import Datastore -from substrapp.compute_tasks.volumes import get_docker_cache_pvc_name -from substrapp.compute_tasks.volumes import get_worker_subtuple_pvc_name -from substrapp.docker_registry import USER_IMAGE_REPOSITORY -from substrapp.docker_registry import container_image_exists -from substrapp.kubernetes_utils import delete_pod -from substrapp.kubernetes_utils import get_pod_logs -from substrapp.kubernetes_utils import get_security_context -from substrapp.kubernetes_utils import pod_exists -from substrapp.kubernetes_utils import watch_pod -from substrapp.lock_local import lock_resource -from substrapp.models.image_entrypoint import ImageEntrypoint -from substrapp.utils import timeit -from substrapp.utils import uncompress_content logger = structlog.get_logger(__name__) +IMAGE_BUILD_TIMEOUT = settings.IMAGE_BUILD_TIMEOUT +IMAGE_BUILD_CHECK_DELAY = settings.IMAGE_BUILD_CHECK_DELAY REGISTRY = settings.REGISTRY -REGISTRY_SCHEME = settings.REGISTRY_SCHEME -NAMESPACE = settings.NAMESPACE -KANIKO_MIRROR = settings.TASK["KANIKO_MIRROR"] -KANIKO_IMAGE = settings.TASK["KANIKO_IMAGE"] -KANIKO_DOCKER_CONFIG_SECRET_NAME = settings.TASK["KANIKO_DOCKER_CONFIG_SECRET_NAME"] 
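# [Editor's note - illustrative, not part of this patch.] The rewritten image_builder
# module relies on two new Django settings that drive the polling loop in
# wait_for_image_built below; judging from the timeout message and the time.sleep call,
# both are expressed in seconds. Example values (assumptions, not taken from the
# repository settings):
#
#     IMAGE_BUILD_TIMEOUT = 3 * 60 * 60  # give up after 3 hours, like the removed MAX_IMAGE_BUILD_TIME
#     IMAGE_BUILD_CHECK_DELAY = 5        # re-read the Function status every 5 seconds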
-KANIKO_DOCKER_CONFIG_VOLUME_NAME = "docker-config" -CELERY_WORKER_CONCURRENCY = settings.CELERY_WORKER_CONCURRENCY SUBTUPLE_TMP_DIR = settings.SUBTUPLE_TMP_DIR -MAX_IMAGE_BUILD_TIME = 3 * 60 * 60 # 3 hours -KANIKO_CONTAINER_NAME = "kaniko" -HOSTNAME = settings.HOSTNAME -def build_image_if_missing(datastore: Datastore, function: orchestrator.Function) -> None: - """ - Build the container image and the ImageEntryPoint entry if they don't exist already - """ - container_image_tag = utils.container_image_tag_from_function(function) - with lock_resource("image-build", container_image_tag, ttl=MAX_IMAGE_BUILD_TIME, timeout=MAX_IMAGE_BUILD_TIME): - if container_image_exists(container_image_tag): - logger.info("Reusing existing image", image=container_image_tag) - else: - asset_content = datastore.get_function(function) - _build_function_image(asset_content, function) - - -def _build_function_image(asset: bytes, function: orchestrator.Function) -> None: - """ - Build a function's container image. - - Perform multiple steps: - 1. Download the function using the provided asset storage_address/owner. Verify its checksum and uncompress the data - to a temporary folder. - 2. Extract the ENTRYPOINT from the Dockerfile. - 3. Build the container image using Kaniko. - 4. Save the ENTRYPOINT to the DB - """ - - os.makedirs(SUBTUPLE_TMP_DIR, exist_ok=True) - - with TemporaryDirectory(dir=SUBTUPLE_TMP_DIR) as tmp_dir: - # Download source - uncompress_content(asset, tmp_dir) - - # Extract ENTRYPOINT from Dockerfile - entrypoint = _get_entrypoint_from_dockerfile(tmp_dir) - - # Build image - _build_container_image(tmp_dir, utils.container_image_tag_from_function(function)) - - # Save entrypoint to DB if the image build was successful - ImageEntrypoint.objects.get_or_create( - function_checksum=function.function_address.checksum, entrypoint_json=entrypoint - ) - - -def _get_entrypoint_from_dockerfile(dockerfile_dir: str) -> list[str]: - """ - Get entrypoint from ENTRYPOINT in the Dockerfile. - - This is necessary because the user function can have arbitrary names, ie; "myfunction.py". - - Example: - ENTRYPOINT ["python3", "myfunction.py"] - """ - dockerfile_path = f"{dockerfile_dir}/Dockerfile" +def wait_for_image_built(function: orchestrator.Function, channel: str) -> None: + api_function = ApiFunction.objects.get(key=function.key) - with open(dockerfile_path, "r") as file: - for line in file: - if line.startswith("ENTRYPOINT"): - try: - res = json.loads(line[len("ENTRYPOINT") :]) - except json.JSONDecodeError: - res = None - - if not isinstance(res, list): - raise compute_task_errors.BuildError( - "Invalid ENTRYPOINT in function/metric Dockerfile. " - "You must use the exec form in your Dockerfile. 
" - "See https://docs.docker.com/engine/reference/builder/#entrypoint" - ) - return res - - raise compute_task_errors.BuildError("Invalid Dockerfile: Cannot find ENTRYPOINT") - - -def _delete_kaniko_pod(create_pod: bool, k8s_client: kubernetes.client.CoreV1Api, pod_name: str) -> str: - logs = "" - if create_pod: - logs = get_pod_logs(k8s_client, pod_name, KANIKO_CONTAINER_NAME, ignore_pod_not_found=True) - delete_pod(k8s_client, pod_name) - logger.info(logs or "", pod_name=pod_name) - return logs - - -@timeit -def _build_container_image(path: str, tag: str) -> None: - _assert_dockerfile_exist(path) - - kubernetes.config.load_incluster_config() - k8s_client = kubernetes.client.CoreV1Api() - - pod_name = _build_pod_name(tag) - - create_pod = not pod_exists(k8s_client, pod_name) - if create_pod: - try: - logger.info("creating pod: building image", namespace=NAMESPACE, pod=pod_name, image=tag) - pod = _build_pod(path, tag) - k8s_client.create_namespaced_pod(body=pod, namespace=NAMESPACE) - except kubernetes.client.ApiException as e: - raise compute_task_errors.BuildRetryError( - f"Error creating pod {NAMESPACE}/{pod_name}. Reason: {e.reason}, status: {e.status}, body: {e.body}" - ) from e - - try: - watch_pod(k8s_client, pod_name) - - except Exception as e: - # In case of concurrent builds, it may fail. Check if the image exists. - if container_image_exists(tag): - logger.warning( - f"Build of container image {tag} failed, probably because it was done by a concurrent build", - exc_info=True, - ) + attempt = 0 + # with 60 attempts we wait max 2 min with a pending pod + max_attempts = IMAGE_BUILD_TIMEOUT / IMAGE_BUILD_CHECK_DELAY + while attempt < max_attempts: + if api_function.status == ApiFunction.Status.FUNCTION_STATUS_READY: return + attempt += 1 + time.sleep(IMAGE_BUILD_CHECK_DELAY) + api_function.refresh_from_db() - logs = _delete_kaniko_pod(create_pod, k8s_client, pod_name) - - if isinstance(e, exceptions.PodTimeoutError): - raise compute_task_errors.BuildRetryError(logs) from e - else: # exceptions.PodError or other - raise compute_task_errors.BuildError(logs) from e - - _delete_kaniko_pod(create_pod, k8s_client, pod_name) - - -def _assert_dockerfile_exist(dockerfile_path): - dockerfile_fullpath = os.path.join(dockerfile_path, "Dockerfile") - if not os.path.exists(dockerfile_fullpath): - raise compute_task_errors.BuildError(f"Dockerfile does not exist : {dockerfile_fullpath}") - - -def _build_pod(dockerfile_mount_path: str, image_tag: str) -> kubernetes.client.V1Pod: - pod_name = _build_pod_name(image_tag) - pod_spec = _build_pod_spec(dockerfile_mount_path, image_tag) - return kubernetes.client.V1Pod( - api_version="v1", - kind="Pod", - metadata=kubernetes.client.V1ObjectMeta( - name=pod_name, - labels={ - Label.PodName: pod_name, - Label.PodType: "image-build", - Label.Component: Label.Component_Compute, - }, - ), - spec=pod_spec, - ) - - -def _build_pod_name(image_tag: str) -> str: - dns_1123_compliant_tag = image_tag.split("/")[-1].replace("_", "-") - return f"kaniko-{dns_1123_compliant_tag}" - - -def _build_pod_spec(dockerfile_mount_path: str, image_tag: str) -> kubernetes.client.V1PodSpec: - container = _build_container(dockerfile_mount_path, image_tag) - pod_affinity = _build_pod_affinity() - - cache_pvc_name = ( - settings.WORKER_PVC_DOCKER_CACHE if settings.WORKER_PVC_IS_HOSTPATH else get_docker_cache_pvc_name() - ) - cache = kubernetes.client.V1Volume( - name="cache", - persistent_volume_claim=kubernetes.client.V1PersistentVolumeClaimVolumeSource(claim_name=cache_pvc_name), - ) 
- - dockerfile_pvc_name = ( - settings.WORKER_PVC_SUBTUPLE if settings.WORKER_PVC_IS_HOSTPATH else get_worker_subtuple_pvc_name() - ) - dockerfile = kubernetes.client.V1Volume( - name="dockerfile", - persistent_volume_claim=kubernetes.client.V1PersistentVolumeClaimVolumeSource(claim_name=dockerfile_pvc_name), - ) - - volumes = [cache, dockerfile] - - if KANIKO_DOCKER_CONFIG_SECRET_NAME: - docker_config = kubernetes.client.V1Volume( - name=KANIKO_DOCKER_CONFIG_VOLUME_NAME, - secret=kubernetes.client.V1SecretVolumeSource( - secret_name=KANIKO_DOCKER_CONFIG_SECRET_NAME, - items=[kubernetes.client.V1KeyToPath(key=".dockerconfigjson", path="config.json")], - ), - ) - volumes.append(docker_config) - - return kubernetes.client.V1PodSpec( - restart_policy="Never", affinity=pod_affinity, containers=[container], volumes=volumes + raise exceptions.PodTimeoutError( + f"Build for function {function.key} didn't complete after {IMAGE_BUILD_TIMEOUT} seconds" ) -def _build_pod_affinity() -> kubernetes.client.V1Affinity: - return kubernetes.client.V1Affinity( - pod_affinity=kubernetes.client.V1PodAffinity( - required_during_scheduling_ignored_during_execution=[ - kubernetes.client.V1PodAffinityTerm( - label_selector=kubernetes.client.V1LabelSelector( - match_expressions=[ - kubernetes.client.V1LabelSelectorRequirement( - key="statefulset.kubernetes.io/pod-name", operator="In", values=[HOSTNAME] - ) - ] - ), - topology_key="kubernetes.io/hostname", - ) - ] - ) - ) - - -def _build_container(dockerfile_mount_path: str, image_tag: str) -> kubernetes.client.V1Container: - # kaniko build can be launched without privilege but - # it needs some capabilities and to be root - # https://github.com/GoogleContainerTools/kaniko/issues/778 - # https://github.com/GoogleContainerTools/kaniko/issues/778#issuecomment-619112417 - # https://github.com/moby/moby/blob/master/oci/caps/defaults.go - # https://man7.org/linux/man-pages/man7/capabilities.7.html - capabilities = ["CHOWN", "SETUID", "SETGID", "FOWNER", "DAC_OVERRIDE", "SETFCAP"] - container_security_context = get_security_context(root=True, capabilities=capabilities) - args = _build_container_args(dockerfile_mount_path, image_tag) - dockerfile_mount_subpath = dockerfile_mount_path.split("/subtuple/")[-1] - - dockerfile = kubernetes.client.V1VolumeMount( - name="dockerfile", mount_path=dockerfile_mount_path, sub_path=dockerfile_mount_subpath, read_only=True +def load_remote_function_image(function: orchestrator.Function, channel: str) -> None: + container_image_tag = utils.container_image_tag_from_function(function) + # Ask the backend owner of the function if it's available + logger.info( + f"Initial function URI {function.function_address.uri}; " + f"modified URI{function.function_address.uri.replace('file', 'image')}" ) - cache = kubernetes.client.V1VolumeMount(name="cache", mount_path="/cache", read_only=True) - volume_mounts = [dockerfile, cache] - - if KANIKO_DOCKER_CONFIG_SECRET_NAME: - docker_config = kubernetes.client.V1VolumeMount( - name=KANIKO_DOCKER_CONFIG_VOLUME_NAME, mount_path="/kaniko/.docker" - ) - volume_mounts.append(docker_config) - return kubernetes.client.V1Container( - name=KANIKO_CONTAINER_NAME, - image=KANIKO_IMAGE, - command=None, - args=args, - volume_mounts=volume_mounts, - security_context=container_security_context, + function_image_content = organization_client.get( + channel=channel, + organization_id=function.owner, + # TODO create a clean Address for function image + url=function.function_address.uri.replace("file", "image"), + 
checksum=None, ) - -def _build_container_args(dockerfile_mount_path: str, image_tag: str) -> list[str]: - dockerfile_fullpath = os.path.join(dockerfile_mount_path, "Dockerfile") - args = [ - f"--dockerfile={dockerfile_fullpath}", - f"--context=dir://{dockerfile_mount_path}", - f"--destination={REGISTRY}/{USER_IMAGE_REPOSITORY}:{image_tag}", - "--cache=true", - "--log-timestamp=true", - "--snapshotMode=redo", - "--push-retry=3", - "--cache-copy-layers", - "--log-format=text", - f"--verbosity={('debug' if settings.LOG_LEVEL == 'DEBUG' else 'info')}", - ] - - if REGISTRY_SCHEME == "http": - args.append("--insecure") - - if KANIKO_MIRROR: - args.append(f"--registry-mirror={REGISTRY}") - if REGISTRY_SCHEME == "http": - args.append("--insecure-pull") - return args + os.makedirs(SUBTUPLE_TMP_DIR, exist_ok=True) + with TemporaryDirectory(dir=SUBTUPLE_TMP_DIR) as tmp_dir: + storage_path = pathlib.Path(tmp_dir) / f"{container_image_tag}.zip" + storage_path.write_bytes(function_image_content) + push_payload(storage_path, registry=REGISTRY, secure=False) diff --git a/backend/substrapp/compute_tasks/volumes.py b/backend/substrapp/compute_tasks/volumes.py index e7b71a6b7..936f12112 100644 --- a/backend/substrapp/compute_tasks/volumes.py +++ b/backend/substrapp/compute_tasks/volumes.py @@ -61,7 +61,3 @@ def _create_mount(task_dir: str, folder: str, read_only: bool = False): def get_worker_subtuple_pvc_name(): return f"{settings.WORKER_PVC_SUBTUPLE}-{os.getenv('HOSTNAME')}" - - -def get_docker_cache_pvc_name(): - return f"{settings.WORKER_PVC_DOCKER_CACHE}-{os.getenv('HOSTNAME')}" diff --git a/backend/substrapp/docker_registry.py b/backend/substrapp/docker_registry.py index 84697ea7d..4dd443956 100644 --- a/backend/substrapp/docker_registry.py +++ b/backend/substrapp/docker_registry.py @@ -44,6 +44,11 @@ def get_container_image_name(image_name: str) -> str: return f"{pull_domain}/{USER_IMAGE_REPOSITORY}:{image_name}" +def get_entrypoint(image_tag: str) -> str: + d = get_container_image(image_tag) + return json.loads(d["history"][0]["v1Compatibility"])["config"]["Entrypoint"] + + def delete_container_image_safe(image_tag: str) -> None: """deletes a container image from the docker registry but will fail silently""" try: @@ -92,15 +97,6 @@ def _retrieve_image_digest(image_tag: str) -> str: return response.headers["Docker-Content-Digest"] -def container_image_exists(image_name: str) -> bool: - try: - get_container_image(image_name) - except ImageNotFoundError: - return False - else: - return True - - def get_container_image(image_name: str) -> dict: response = requests.get( f"{REGISTRY_SCHEME}://{REGISTRY}/v2/{USER_IMAGE_REPOSITORY}/manifests/{image_name}", diff --git a/backend/substrapp/events/reactor.py b/backend/substrapp/events/reactor.py index 4df8b5f8e..5edcbad2e 100644 --- a/backend/substrapp/events/reactor.py +++ b/backend/substrapp/events/reactor.py @@ -9,13 +9,18 @@ import orchestrator.common_pb2 as common_pb2 import orchestrator.computetask_pb2 as computetask_pb2 import orchestrator.event_pb2 as event_pb2 +from builder.tasks.tasks_build_image import build_image +from orchestrator import function_pb2 from orchestrator import model_pb2 from substrapp.events import handler_compute_engine from substrapp.events import health from substrapp.models import WorkerLastEvent from substrapp.orchestrator import get_orchestrator_client +from substrapp.task_routing import WORKER_QUEUE +from substrapp.task_routing import get_builder_queue from substrapp.tasks.tasks_compute_plan import 
queue_delete_cp_pod_and_dirs_and_optionally_images from substrapp.tasks.tasks_compute_task import queue_compute_task +from substrapp.tasks.tasks_save_image import save_image_task logger = structlog.get_logger("events") _MY_ORGANIZATION: str = settings.LEDGER_MSP_ID @@ -72,6 +77,39 @@ def on_computetask_event(payload): queue_compute_task(channel_name, task=orc_task) +def on_function_event(payload): + asset_key = payload["asset_key"] + channel_name = payload["channel"] + event_kind = payload["event_kind"] + function = payload["function"] + grpc_function = function_pb2.Function() + json_format.ParseDict(function, grpc_function) + orc_function = orchestrator.Function.from_grpc(grpc_function) + logger.info("Processing function", asset_key=asset_key, kind=event_kind) + + if event_pb2.EventKind.Value(event_kind) == event_pb2.EVENT_ASSET_CREATED: + if orc_function.owner == _MY_ORGANIZATION: + function_key = orc_function.key + builder_queue = get_builder_queue() + logger.info( + "Assigned function to builder queue", + asset_key=function_key, + queue=builder_queue, + ) + + building_params = { + "channel_name": channel_name, + "function_serialized": orc_function.model_dump_json(), + } + ( + build_image.si(**building_params).set(queue=builder_queue) + | save_image_task.si(**building_params).set(queue=WORKER_QUEUE) + ).apply_async() + + else: + logger.debug("Function not belonging to this organization, skipping building", asset_key=orc_function.key) + + def on_model_event(payload): event_kind = payload["event_kind"] channel_name = payload["channel"] @@ -93,6 +131,9 @@ def on_message_compute_engine(payload): on_computetask_event(payload) elif asset_kind == common_pb2.ASSET_MODEL: on_model_event(payload) + elif asset_kind == common_pb2.ASSET_FUNCTION: + logger.info("Processing function", asset_kind=payload["asset_kind"]) + on_function_event(payload) else: logger.debug("Nothing to do", asset_kind=payload["asset_kind"]) diff --git a/backend/substrapp/exceptions.py b/backend/substrapp/exceptions.py index aa4fbebe2..9453841f8 100644 --- a/backend/substrapp/exceptions.py +++ b/backend/substrapp/exceptions.py @@ -5,10 +5,6 @@ class KubernetesError(Exception): pass -class PodError(Exception): - pass - - class PodDeletedError(Exception): pass @@ -17,10 +13,6 @@ class PodReadinessTimeoutError(Exception): pass -class PodTimeoutError(Exception): - pass - - class ImageDeletionError(Exception): def __init__(self, image_tag: str, status_code: int = None) -> None: message = f"An error happened while deleting the container image. 
image_tag={image_tag}" diff --git a/backend/substrapp/kubernetes_utils.py b/backend/substrapp/kubernetes_utils.py index ad76f7070..5bb6ac65f 100644 --- a/backend/substrapp/kubernetes_utils.py +++ b/backend/substrapp/kubernetes_utils.py @@ -1,35 +1,19 @@ -import enum -import time - import kubernetes import structlog from django.conf import settings from substrapp.exceptions import KubernetesError from substrapp.exceptions import PodDeletedError -from substrapp.exceptions import PodError from substrapp.exceptions import PodReadinessTimeoutError -from substrapp.exceptions import PodTimeoutError -from substrapp.utils import timeit logger = structlog.get_logger(__name__) NAMESPACE = settings.NAMESPACE -HTTP_CLIENT_TIMEOUT_SECONDS = settings.HTTP_CLIENT_TIMEOUT_SECONDS RUN_AS_GROUP = settings.COMPUTE_POD_RUN_AS_GROUP RUN_AS_USER = settings.COMPUTE_POD_RUN_AS_USER FS_GROUP = settings.COMPUTE_POD_FS_GROUP -class ObjectState(enum.Enum): - PENDING = enum.auto() - WAITING = enum.auto() - RUNNING = enum.auto() - FAILED = enum.auto() - COMPLETED = enum.auto() - UNKNOWN = enum.auto() - - def get_pod_security_context(): return kubernetes.client.V1PodSecurityContext( run_as_non_root=True, @@ -63,159 +47,6 @@ def get_security_context(root: bool = False, capabilities: list[str] = None) -> return security_context -class PodState: - def __init__(self, status: ObjectState, reason: str = "", message: str = ""): - self.status = status - self.reason = reason - self.message = message - - def set_reason(self, container_status: kubernetes.client.V1ContainerState) -> None: - if self.status == ObjectState.WAITING: - self.reason = container_status.waiting.reason - self.message = container_status.waiting.message - if self.status == ObjectState.FAILED: - self.reason = container_status.terminated.reason - self.message = container_status.terminated.message - - -def watch_pod(k8s_client: kubernetes.client.CoreV1Api, name: str): - """Watch a Kubernetes pod status - It will observe all the containers inside the pod and return when the pod will - reach the Completed state. If the pod is pending indefinitely or fail, an exception will be raised. 
- Args: - k8s_client (kubernetes.client.CoreV1Api): Kubernetes API client - name (str): name of the pod to watch - Raises: - PodError: this exception is raised if the pod exits with an error - PodTimeoutError: this exception is raised if the pod does not reach the running state after some time - """ - attempt = 0 - # with 60 attempts we wait max 2 min with a pending pod - max_attempts = 60 - - # This variable is used to track the current status through retries - previous_pod_status = None - - while attempt < max_attempts: - try: - api_response = retrieve_pod_status(k8s_client, name) - except kubernetes.client.ApiException as exc: - logger.warning("Could not retrieve pod status", pod_name=name, exc_info=exc) - attempt += 1 - time.sleep(0.2) - continue - - pod_state = _get_pod_state(api_response) - - if pod_state.status != previous_pod_status: - previous_pod_status = pod_state.status - logger.info( - "Pod status changed", - pod_name=name, - status=pod_state.status, - reason=pod_state.reason, - message=pod_state.message, - attempt=attempt, - max_attempts=max_attempts, - ) - - if pod_state.status == ObjectState.COMPLETED: - return - - if pod_state.status == ObjectState.FAILED: - raise PodError(f"Pod {name} terminated with error: {pod_state.reason}") - - if pod_state.status == ObjectState.PENDING: - # Here we basically consume a free retry everytime but we still need to - # increment attempt because if at some point our pod is stuck in pending state - # we need to exit this function - attempt += 1 - time.sleep(2) - - # Here PodInitializing and ContainerCreating are valid reasons to wait more time - # Other possible reasons include "CrashLoopBackOff", "CreateContainerConfigError", - # "ErrImagePull", "ImagePullBackOff", "CreateContainerError", "InvalidImageName" - if ( - pod_state.status == ObjectState.WAITING - and pod_state.reason not in ["PodInitializing", "ContainerCreating"] - or pod_state.status == ObjectState.UNKNOWN - ): - attempt += 1 - - time.sleep(0.2) - - raise PodTimeoutError(f"Pod {name} didn't complete after {max_attempts} attempts") - - -def _get_pod_state(pod_status: kubernetes.client.V1PodStatus) -> PodState: - """extracts the current pod state from the PodStatus Kubernetes object - Args: - pod_status (kubernetes.client.models.V1PodStatus): A Kubernetes PodStatus object - """ - if pod_status.phase in ["Pending"]: - # On the first query the pod just created and often pending as it is not already scheduled to a node - return PodState(ObjectState.PENDING, pod_status.reason, pod_status.message) - - container_statuses: list[kubernetes.client.V1ContainerStatus] = ( - pod_status.init_container_statuses if pod_status.init_container_statuses else [] - ) - container_statuses += pod_status.container_statuses - - completed_containers = 0 - for container in container_statuses: - container_state: ObjectState = _get_container_state(container) - - if container_state in [ObjectState.RUNNING, ObjectState.WAITING, ObjectState.FAILED]: - pod_state = PodState(container_state) - pod_state.set_reason(container.state) - return pod_state - if container_state == ObjectState.COMPLETED: - completed_containers += 1 - - if completed_containers == len(container_statuses): - return PodState(ObjectState.COMPLETED, "", "pod successfully completed") - - logger.debug("pod status", pod_status=pod_status) - return PodState(ObjectState.UNKNOWN, "", "Could not deduce the pod state from container statuses") - - -def _get_container_state(container_status: kubernetes.client.V1ContainerStatus) -> ObjectState: - 
"""Extracts the container state from a ContainerStatus Kubernetes object - Args: - container_status (kubernetes.client.models.V1ContainerStatus): A ContainerStatus object - Returns: - ObjectState: the state of the container - """ - # Here we need to check if we are in a failed state first since kubernetes will retry - # we can end up running after a failure - if container_status.state.terminated: - if container_status.state.terminated.exit_code != 0: - return ObjectState.FAILED - else: - return ObjectState.COMPLETED - if container_status.state.running: - return ObjectState.RUNNING - if container_status.state.waiting: - return ObjectState.WAITING - return ObjectState.UNKNOWN - - -def pod_exists(k8s_client, name: str) -> bool: - try: - k8s_client.read_namespaced_pod(name=name, namespace=NAMESPACE) - except kubernetes.client.ApiException: - return False - else: - return True - - -def retrieve_pod_status(k8s_client: kubernetes.client.CoreV1Api, pod_name: str) -> kubernetes.client.V1PodStatus: - pod: kubernetes.client.V1Pod = k8s_client.read_namespaced_pod_status( - name=pod_name, namespace=NAMESPACE, pretty=True - ) - return pod.status - - def pod_exists_by_label_selector(k8s_client: kubernetes.client.CoreV1Api, label_selector: str) -> bool: """Return True if the pod exists, else False. @@ -231,18 +62,6 @@ def pod_exists_by_label_selector(k8s_client: kubernetes.client.CoreV1Api, label_ return len(res.items) > 0 -@timeit -def get_pod_logs(k8s_client, name: str, container: str, ignore_pod_not_found: bool = False) -> str: - try: - return k8s_client.read_namespaced_pod_log(name=name, namespace=NAMESPACE, container=container) - except kubernetes.client.ApiException as exc: - if ignore_pod_not_found and exc.reason == "Not Found": - return f"Pod not found: {NAMESPACE}/{name} ({container})" - if exc.reason == "Bad Request": - return f"In {NAMESPACE}/{name} \n {str(exc.body)}" - return f"Unable to get logs for pod {NAMESPACE}/{name} ({container}) \n {str(exc)}" - - def delete_pod(k8s_client, name: str) -> None: # we retrieve the latest pod list version to retrieve only the latest events when watching for pod deletion pod_list_resource_version = k8s_client.list_namespaced_pod(namespace=NAMESPACE).metadata.resource_version diff --git a/backend/substrapp/migrations/0006_create_compute_task_failure_model.py b/backend/substrapp/migrations/0006_create_compute_task_failure_model.py index c6a7abe1e..1ce781d19 100644 --- a/backend/substrapp/migrations/0006_create_compute_task_failure_model.py +++ b/backend/substrapp/migrations/0006_create_compute_task_failure_model.py @@ -25,7 +25,7 @@ class Migration(migrations.Migration): models.FileField( max_length=36, storage=django.core.files.storage.FileSystemStorage(), - upload_to=substrapp.models.compute_task_failure_report._upload_to, + upload_to=substrapp.models.asset_failure_report._upload_to, ), ), ("logs_checksum", models.CharField(max_length=64)), diff --git a/backend/substrapp/migrations/0012_alter_algo_description_alter_algo_file_and_more.py b/backend/substrapp/migrations/0012_alter_algo_description_alter_algo_file_and_more.py index 0e891e227..81475f62b 100644 --- a/backend/substrapp/migrations/0012_alter_algo_description_alter_algo_file_and_more.py +++ b/backend/substrapp/migrations/0012_alter_algo_description_alter_algo_file_and_more.py @@ -3,7 +3,7 @@ from django.db import migrations from django.db import models -import substrapp.models.compute_task_failure_report +import substrapp.models.asset_failure_report import substrapp.models.datamanager import 
substrapp.models.function import substrapp.storages.minio @@ -39,7 +39,7 @@ class Migration(migrations.Migration): field=models.FileField( max_length=36, storage=substrapp.storages.minio.MinioStorage("substra-compute-task-logs"), - upload_to=substrapp.models.compute_task_failure_report._upload_to, + upload_to=substrapp.models.asset_failure_report._upload_to, ), ), migrations.AlterField( diff --git a/backend/substrapp/migrations/0013_alter_algo_description_alter_algo_file_and_more.py b/backend/substrapp/migrations/0013_alter_algo_description_alter_algo_file_and_more.py index a2f59d0eb..df8a7fe7d 100644 --- a/backend/substrapp/migrations/0013_alter_algo_description_alter_algo_file_and_more.py +++ b/backend/substrapp/migrations/0013_alter_algo_description_alter_algo_file_and_more.py @@ -4,7 +4,7 @@ from django.db import migrations from django.db import models -import substrapp.models.compute_task_failure_report +import substrapp.models.asset_failure_report import substrapp.models.datamanager import substrapp.models.function @@ -39,7 +39,7 @@ class Migration(migrations.Migration): field=models.FileField( max_length=36, storage=django.core.files.storage.FileSystemStorage(), - upload_to=substrapp.models.compute_task_failure_report._upload_to, + upload_to=substrapp.models.asset_failure_report._upload_to, ), ), migrations.AlterField( diff --git a/backend/substrapp/migrations/0015_add_functionimage.py b/backend/substrapp/migrations/0015_add_functionimage.py new file mode 100644 index 000000000..02d53ad2c --- /dev/null +++ b/backend/substrapp/migrations/0015_add_functionimage.py @@ -0,0 +1,36 @@ +# Generated by Django 4.1.7 on 2023-08-11 11:30 + +import django.db.models.deletion +from django.db import migrations +from django.db import models + +import substrapp.models.datamanager +import substrapp.models.function + + +class Migration(migrations.Migration): + dependencies = [ + ("substrapp", "0014_rename_algo_to_function"), + ] + + operations = [ + migrations.CreateModel( + name="FunctionImage", + fields=[ + ("id", models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name="ID")), + ( + "file", + models.FileField( + max_length=500, + storage=django.core.files.storage.FileSystemStorage(), + upload_to=substrapp.models.function.upload_to_function, + ), + ), + ("checksum", models.CharField(blank=True, max_length=64)), + ( + "function", + models.OneToOneField(on_delete=django.db.models.deletion.CASCADE, to="substrapp.function"), + ), + ], + ), + ] diff --git a/backend/substrapp/migrations/0016_rename_computetaskfailurereport_and_more.py b/backend/substrapp/migrations/0016_rename_computetaskfailurereport_and_more.py new file mode 100644 index 000000000..e1bccdd36 --- /dev/null +++ b/backend/substrapp/migrations/0016_rename_computetaskfailurereport_and_more.py @@ -0,0 +1,29 @@ +# Generated by Django 4.2.3 on 2023-08-30 15:07 + +from django.db import migrations +from django.db import models + + +class Migration(migrations.Migration): + dependencies = [ + ("substrapp", "0015_add_functionimage"), + ] + + operations = [ + migrations.RenameModel("ComputeTaskFailureReport", "AssetFailureReport"), + migrations.RenameField("AssetFailureReport", "compute_task_key", "asset_key"), + migrations.AddField( + model_name="assetfailurereport", + name="asset_type", + field=models.CharField( + choices=[ + ("FAILED_ASSET_UNKNOWN", "Failed Asset Unknown"), + ("FAILED_ASSET_COMPUTE_TASK", "Failed Asset Compute Task"), + ("FAILED_ASSET_FUNCTION", "Failed Asset Function"), + ], + 
default="FAILED_ASSET_UNKNOWN", + max_length=100, + ), + preserve_default=False, + ), + ] diff --git a/backend/substrapp/models/__init__.py b/backend/substrapp/models/__init__.py index 7ca4e9815..28bc2de5b 100644 --- a/backend/substrapp/models/__init__.py +++ b/backend/substrapp/models/__init__.py @@ -1,8 +1,10 @@ -from .compute_task_failure_report import ComputeTaskFailureReport +from .asset_failure_report import AssetFailureReport +from .asset_failure_report import FailedAssetKind from .computeplan_worker_mapping import ComputePlanWorkerMapping from .datamanager import DataManager from .datasample import DataSample from .function import Function +from .function import FunctionImage from .image_entrypoint import ImageEntrypoint from .model import Model from .worker_last_event import WorkerLastEvent @@ -10,10 +12,12 @@ __all__ = [ "DataSample", "DataManager", + "FailedAssetKind", "Function", + "FunctionImage", "Model", "ComputePlanWorkerMapping", "ImageEntrypoint", - "ComputeTaskFailureReport", + "AssetFailureReport", "WorkerLastEvent", ] diff --git a/backend/substrapp/models/compute_task_failure_report.py b/backend/substrapp/models/asset_failure_report.py similarity index 57% rename from backend/substrapp/models/compute_task_failure_report.py rename to backend/substrapp/models/asset_failure_report.py index ab5dcca3b..0dbd0f45e 100644 --- a/backend/substrapp/models/compute_task_failure_report.py +++ b/backend/substrapp/models/asset_failure_report.py @@ -5,6 +5,8 @@ from django.conf import settings from django.db import models +from orchestrator import failure_report_pb2 + LOGS_BASE_PATH: Final[str] = "logs" LOGS_FILE_PATH: Final[str] = "file" @@ -12,14 +14,20 @@ _SHA256_STRING_REPR_LENGTH: Final[int] = 256 // 4 -def _upload_to(instance: "ComputeTaskFailureReport", _filename: str) -> str: - return str(instance.compute_task_key) +def _upload_to(instance: "AssetFailureReport", _filename: str) -> str: + return str(instance.asset_key) + + +FailedAssetKind = models.TextChoices( + "FailedAssetKind", [(status_name, status_name) for status_name in failure_report_pb2.FailedAssetKind.keys()] +) -class ComputeTaskFailureReport(models.Model): +class AssetFailureReport(models.Model): """Store information relative to a compute task.""" - compute_task_key = models.UUIDField(primary_key=True, editable=False) + asset_key = models.UUIDField(primary_key=True, editable=False) + asset_type = models.CharField(max_length=100, choices=FailedAssetKind.choices) logs = models.FileField( storage=settings.COMPUTE_TASK_LOGS_STORAGE, max_length=_UUID_STRING_REPR_LENGTH, upload_to=_upload_to ) @@ -28,9 +36,9 @@ class ComputeTaskFailureReport(models.Model): @property def key(self) -> uuid.UUID: - return self.compute_task_key + return self.asset_key @property def logs_address(self) -> str: - logs_path = f"{LOGS_BASE_PATH}/{self.compute_task_key}/{LOGS_FILE_PATH}/" + logs_path = f"{LOGS_BASE_PATH}/{self.asset_key}/{LOGS_FILE_PATH}/" return urllib.parse.urljoin(settings.DEFAULT_DOMAIN, logs_path) diff --git a/backend/substrapp/models/function.py b/backend/substrapp/models/function.py index 3eb3eed12..d501c6a1d 100644 --- a/backend/substrapp/models/function.py +++ b/backend/substrapp/models/function.py @@ -10,6 +10,10 @@ def upload_to(instance, filename) -> str: return f"functions/{instance.key}/{filename}" +def upload_to_function(instance, filename) -> str: + return upload_to(instance.function, filename) + + class Function(models.Model): """Storage Data table""" @@ -30,3 +34,22 @@ def save(self, *args, **kwargs) -> None: def 
__str__(self) -> str: return f"Function with key {self.key}" + + +class FunctionImage(models.Model): + """Serialized Docker image""" + + function = models.OneToOneField(Function, on_delete=models.CASCADE) + file = models.FileField( + storage=settings.FUNCTION_STORAGE, max_length=500, upload_to=upload_to_function + ) # path max length to 500 instead of default 100 + checksum = models.CharField(max_length=64, blank=True) + + def save(self, *args, **kwargs) -> None: + """Use hash of file as checksum""" + if not self.checksum and self.file: + self.checksum = get_hash(self.file) + super().save(*args, **kwargs) + + def __str__(self) -> str: + return f"Function image associated function key {self.function.key}" diff --git a/backend/substrapp/task_routing.py b/backend/substrapp/task_routing.py index 20a9fa8df..28c6fa43b 100644 --- a/backend/substrapp/task_routing.py +++ b/backend/substrapp/task_routing.py @@ -29,6 +29,7 @@ WORKER_QUEUE = f"{settings.ORG_NAME}.worker" +BUILDER_QUEUE = f"{settings.ORG_NAME}.builder" def get_generic_worker_queue() -> str: @@ -46,6 +47,10 @@ def get_worker_queue(compute_plan_key: str) -> str: return _get_worker_queue(worker_index) +def get_builder_queue() -> str: + return BUILDER_QUEUE + + def get_existing_worker_queue(compute_plan_key: str) -> Optional[str]: """ Return the name of a worker queue mapped to this compute plan, if it exists. diff --git a/backend/substrapp/tasks/__init__.py b/backend/substrapp/tasks/__init__.py index 3dab2c307..58a78d22d 100644 --- a/backend/substrapp/tasks/__init__.py +++ b/backend/substrapp/tasks/__init__.py @@ -5,6 +5,7 @@ from substrapp.tasks.tasks_outputs import remove_transient_outputs_from_orc from substrapp.tasks.tasks_remove_intermediary_models import remove_intermediary_model_from_db from substrapp.tasks.tasks_remove_intermediary_models import remove_intermediary_models_from_buffer +from substrapp.tasks.tasks_save_image import save_image_task __all__ = [ "delete_cp_pod_and_dirs_and_optionally_images", @@ -14,4 +15,5 @@ "remove_intermediary_models_from_buffer", "remove_transient_outputs_from_orc", "remove_intermediary_model_from_db", + "save_image_task", ] diff --git a/backend/substrapp/tasks/task.py b/backend/substrapp/tasks/task.py new file mode 100644 index 000000000..0fab4e653 --- /dev/null +++ b/backend/substrapp/tasks/task.py @@ -0,0 +1,101 @@ +""" +This file contains the main logic for executing a compute task: + +- Create execution context +- Populate asset buffer +- Loads assets from the asset buffer +- **Execute the compute task** +- Save the models/results +- Teardown the context + +We also handle the retry logic here. 
+""" +import enum +import pickle # nosec B403 +from typing import Any + +import structlog +from billiard.einfo import ExceptionInfo +from celery import Task +from django.conf import settings + +import orchestrator +from substrapp.compute_tasks.compute_pod import delete_compute_plan_pods +from substrapp.models import FailedAssetKind +from substrapp.task_routing import WORKER_QUEUE +from substrapp.tasks.tasks_asset_failure_report import store_asset_failure_report + +logger = structlog.get_logger(__name__) + + +class FailableTask(Task): + asset_type: FailedAssetKind + + # Celery does not provide unpacked arguments, we are doing it in `get_task_info` + def on_failure( + self, exc: Exception, task_id: str, args: tuple, kwargs: dict[str, Any], einfo: ExceptionInfo + ) -> None: + asset_key, channel_name = self.get_task_info(args, kwargs) + exception_pickled = pickle.dumps(exc) + store_asset_failure_report.apply_async( + args, + { + "asset_key": asset_key, + "asset_type": self.asset_type, + "channel_name": channel_name, + "exception_pickled": exception_pickled, + }, + queue=WORKER_QUEUE, + ) + + def get_task_info(self, args: tuple, kwargs: dict) -> tuple[str, str]: + raise NotImplementedError() + + +class ComputeTaskSteps(enum.Enum): + BUILD_IMAGE = "build_image" + PREPARE_INPUTS = "prepare_inputs" + TASK_EXECUTION = "task_execution" + SAVE_OUTPUTS = "save_outputs" + + +class ComputeTask(FailableTask): + autoretry_for = settings.CELERY_TASK_AUTORETRY_FOR + max_retries = settings.CELERY_TASK_MAX_RETRIES + retry_backoff = settings.CELERY_TASK_RETRY_BACKOFF + retry_backoff_max = settings.CELERY_TASK_RETRY_BACKOFF_MAX + retry_jitter = settings.CELERY_TASK_RETRY_JITTER + + asset_type = FailedAssetKind.FAILED_ASSET_COMPUTE_TASK + + @property + def attempt(self) -> int: + return self.request.retries + 1 # type: ignore + + # Celery does not provide unpacked arguments + def on_success(self, retval: dict[str, Any], task_id: str, args: tuple, kwargs: dict[str, Any]) -> None: + from django.db import close_old_connections + + close_old_connections() + + # Celery does not provide unpacked arguments, we are doing it in `split_args` + def on_retry(self, exc: Exception, task_id: str, args: tuple, kwargs: dict[str, Any], einfo: ExceptionInfo) -> None: + _, task = self.split_args(args) + # delete compute pod to reset hardware ressources + delete_compute_plan_pods(task.compute_plan_key) + logger.info( + "Retrying task", + celery_task_id=task_id, + attempt=(self.attempt + 1), + max_attempts=(settings.CELERY_TASK_MAX_RETRIES + 1), + ) + + def split_args(self, celery_args: tuple) -> tuple[str, orchestrator.ComputeTask]: + channel_name = celery_args[0] + task = orchestrator.ComputeTask.parse_raw(celery_args[1]) + return channel_name, task + + def get_task_info(self, args: tuple, kwargs: dict) -> tuple[str, str]: + channel_name, task = self.split_args(args) + + return task.key, channel_name diff --git a/backend/substrapp/tasks/tasks_asset_failure_report.py b/backend/substrapp/tasks/tasks_asset_failure_report.py new file mode 100644 index 000000000..bcac56e50 --- /dev/null +++ b/backend/substrapp/tasks/tasks_asset_failure_report.py @@ -0,0 +1,68 @@ +import pickle # nosec B403 - internal to the worker + +import structlog +from celery import Task +from django.conf import settings + +from backend.celery import app +from substrapp.compute_tasks import errors as compute_task_errors +from substrapp.models import FailedAssetKind +from substrapp.orchestrator import get_orchestrator_client +from substrapp.utils.errors import 
store_failure + +REGISTRY = settings.REGISTRY +REGISTRY_SCHEME = settings.REGISTRY_SCHEME +SUBTUPLE_TMP_DIR = settings.SUBTUPLE_TMP_DIR + +logger = structlog.get_logger("worker") + + +class StoreAssetFailureReportTask(Task): + max_retries = 0 + reject_on_worker_lost = True + ignore_result = False + + @property + def attempt(self) -> int: + return self.request.retries + 1 # type: ignore + + def get_task_info(self, args: tuple, kwargs: dict) -> tuple[str, str, str]: + asset_key = kwargs["asset_key"] + asset_type = kwargs["asset_type"] + channel_name = kwargs["channel_name"] + return asset_key, asset_type, channel_name + + +@app.task( + bind=True, + acks_late=True, + reject_on_worker_lost=True, + ignore_result=False, + base=StoreAssetFailureReportTask, +) +def store_asset_failure_report( + task: StoreAssetFailureReportTask, *, asset_key: str, asset_type: str, channel_name: str, exception_pickled: bytes +) -> None: + exception = pickle.loads(exception_pickled) # nosec B301 + + if asset_type == FailedAssetKind.FAILED_ASSET_FUNCTION: + error_type = compute_task_errors.ComputeTaskErrorType.BUILD_ERROR.value + else: + error_type = compute_task_errors.get_error_type(exception) + + failure_report = store_failure(exception, asset_key, asset_type, error_type) + + with get_orchestrator_client(channel_name) as client: + # On the backend, only building and execution errors lead to the creation of compute task failure + # report instances to store the execution logs. + if failure_report: + logs_address = { + "checksum": failure_report.logs_checksum, + "storage_address": failure_report.logs_address, + } + else: + logs_address = None + + client.register_failure_report( + {"asset_key": asset_key, "error_type": error_type, "asset_type": asset_type, "logs_address": logs_address} + ) diff --git a/backend/substrapp/tasks/tasks_compute_task.py b/backend/substrapp/tasks/tasks_compute_task.py index 815971837..6b7ea555f 100644 --- a/backend/substrapp/tasks/tasks_compute_task.py +++ b/backend/substrapp/tasks/tasks_compute_task.py @@ -4,14 +4,12 @@ - Create execution context - Populate asset buffer - Loads assets from the asset buffer -- Build container images - **Execute the compute task** - Save the models/results - Teardown the context We also handle the retry logic here. 
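+
+(Since the builder was decoupled, the container image is no longer built here: this task
+only waits for the builder service to publish it, see `wait_for_image_built`.)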
""" - from __future__ import annotations import datetime @@ -19,30 +17,24 @@ import errno import os from typing import Any -from typing import Optional import celery.exceptions import structlog -from billiard.einfo import ExceptionInfo -from celery import Task from celery.result import AsyncResult from django.conf import settings -from django.core import files from rest_framework import status import orchestrator from backend.celery import app -from substrapp import models -from substrapp import utils from substrapp.clients import organization as organization_client from substrapp.compute_tasks import compute_task as task_utils from substrapp.compute_tasks import errors as compute_task_errors +from substrapp.compute_tasks import image_builder from substrapp.compute_tasks.asset_buffer import add_assets_to_taskdir from substrapp.compute_tasks.asset_buffer import add_task_assets_to_buffer from substrapp.compute_tasks.asset_buffer import clear_assets_buffer from substrapp.compute_tasks.asset_buffer import init_asset_buffer from substrapp.compute_tasks.chainkeys import prepare_chainkeys_dir -from substrapp.compute_tasks.compute_pod import delete_compute_plan_pods from substrapp.compute_tasks.context import Context from substrapp.compute_tasks.datastore import Datastore from substrapp.compute_tasks.datastore import get_datastore @@ -53,14 +45,15 @@ from substrapp.compute_tasks.directories import restore_dir from substrapp.compute_tasks.directories import teardown_task_dirs from substrapp.compute_tasks.execute import execute_compute_task -from substrapp.compute_tasks.image_builder import build_image_if_missing from substrapp.compute_tasks.lock import MAX_TASK_DURATION from substrapp.compute_tasks.lock import acquire_compute_plan_lock from substrapp.compute_tasks.outputs import OutputSaver from substrapp.exceptions import OrganizationHttpError from substrapp.lock_local import lock_resource from substrapp.orchestrator import get_orchestrator_client +from substrapp.tasks.task import ComputeTask from substrapp.utils import Timer +from substrapp.utils import get_owner from substrapp.utils import list_dir from substrapp.utils import retry from substrapp.utils.url import TASK_PROFILING_BASE_URL @@ -78,67 +71,6 @@ class ComputeTaskSteps(enum.Enum): SAVE_OUTPUTS = "save_outputs" -class ComputeTask(Task): - autoretry_for = settings.CELERY_TASK_AUTORETRY_FOR - max_retries = settings.CELERY_TASK_MAX_RETRIES - retry_backoff = settings.CELERY_TASK_RETRY_BACKOFF - retry_backoff_max = settings.CELERY_TASK_RETRY_BACKOFF_MAX - retry_jitter = settings.CELERY_TASK_RETRY_JITTER - - @property - def attempt(self) -> int: - return self.request.retries + 1 # type: ignore - - def on_success(self, retval: dict[str, Any], task_id: str, args: tuple, kwargs: dict[str, Any]) -> None: - from django.db import close_old_connections - - close_old_connections() - - def on_retry(self, exc: Exception, task_id: str, args: tuple, kwargs: dict[str, Any], einfo: ExceptionInfo) -> None: - _, task = self.split_args(args) - # delete compute pod to reset hardware ressources - delete_compute_plan_pods(task.compute_plan_key) - logger.info( - "Retrying task", - celery_task_id=task_id, - attempt=(self.attempt + 1), - max_attempts=(settings.CELERY_TASK_MAX_RETRIES + 1), - ) - - def on_failure( - self, exc: Exception, task_id: str, args: tuple, kwargs: dict[str, Any], einfo: ExceptionInfo - ) -> None: - from django.db import close_old_connections - - close_old_connections() - - channel_name, task = self.split_args(args) - compute_task_key = 
task.key - - failure_report = _store_failure(exc, compute_task_key) - error_type = compute_task_errors.get_error_type(exc) - - with get_orchestrator_client(channel_name) as client: - # On the backend, only execution errors lead to the creation of compute task failure report instances - # to store the execution logs. - if failure_report: - logs_address = { - "checksum": failure_report.logs_checksum, - "storage_address": failure_report.logs_address, - } - else: - logs_address = None - - client.register_failure_report( - {"compute_task_key": compute_task_key, "error_type": error_type, "logs_address": logs_address} - ) - - def split_args(self, celery_args: tuple) -> tuple[str, orchestrator.ComputeTask]: - channel_name = celery_args[0] - task = orchestrator.ComputeTask.model_validate_json(celery_args[1]) - return channel_name, task - - def queue_compute_task(channel_name: str, task: orchestrator.ComputeTask) -> None: from substrapp.task_routing import get_worker_queue @@ -164,7 +96,10 @@ def queue_compute_task(channel_name: str, task: orchestrator.ComputeTask) -> Non worker_queue=worker_queue, ) - compute_task.apply_async((channel_name, task, task.compute_plan_key), queue=worker_queue, task_id=task.key) + compute_task.apply_async( + (channel_name, task, task.compute_plan_key), + queue=worker_queue, + ) @app.task( @@ -262,7 +197,10 @@ def _run( # start build_image timer timer.start() - build_image_if_missing(datastore, ctx.function) + image_builder.wait_for_image_built(ctx.function, channel_name) + + if get_owner() != ctx.function.owner: + image_builder.load_remote_function_image(ctx.function, channel_name) # stop build_image timer _create_task_profiling_step(channel_name, task.key, ComputeTaskSteps.BUILD_IMAGE, timer.stop()) @@ -339,23 +277,3 @@ def _run( def _prepare_chainkeys(compute_plan_dir: str, compute_plan_tag: str) -> None: chainkeys_dir = os.path.join(compute_plan_dir, CPDirName.Chainkeys) prepare_chainkeys_dir(chainkeys_dir, compute_plan_tag) # does nothing if chainkeys already populated - - -def _store_failure(exc: Exception, compute_task_key: str) -> Optional[models.ComputeTaskFailureReport]: - """If the provided exception is a `BuildError` or an `ExecutionError`, store its logs in the Django storage and - in the database. Otherwise, do nothing. - - Returns: - An instance of `models.ComputeTaskFailureReport` storing the error logs or None if the provided exception is - neither a `BuildError` nor an `ExecutionError`. 
-    """
-
-    if not isinstance(exc, (compute_task_errors.ExecutionError, compute_task_errors.BuildError)):
-        return None
-
-    file = files.File(exc.logs)
-    failure_report = models.ComputeTaskFailureReport(
-        compute_task_key=compute_task_key, logs_checksum=utils.get_hash(file)
-    )
-    failure_report.logs.save(name=compute_task_key, content=file, save=True)
-    return failure_report
diff --git a/backend/substrapp/tasks/tasks_save_image.py b/backend/substrapp/tasks/tasks_save_image.py
new file mode 100644
index 000000000..23c8672a1
--- /dev/null
+++ b/backend/substrapp/tasks/tasks_save_image.py
@@ -0,0 +1,97 @@
+from __future__ import annotations
+
+import os
+import pathlib
+from tempfile import TemporaryDirectory
+from typing import Any
+
+import structlog
+from django.conf import settings
+from django.core.files import File
+
+import orchestrator
+from backend.celery import app
+from image_transfer import make_payload
+from substrapp.compute_tasks import utils
+from substrapp.docker_registry import USER_IMAGE_REPOSITORY
+from substrapp.models import FailedAssetKind
+from substrapp.models import FunctionImage
+from substrapp.orchestrator import get_orchestrator_client
+from substrapp.tasks.task import FailableTask
+
+REGISTRY = settings.REGISTRY
+REGISTRY_SCHEME = settings.REGISTRY_SCHEME
+SUBTUPLE_TMP_DIR = settings.SUBTUPLE_TMP_DIR
+
+logger = structlog.get_logger("worker")
+
+
+class SaveImageTask(FailableTask):
+    autoretry_for = settings.CELERY_TASK_AUTORETRY_FOR
+    max_retries = settings.CELERY_TASK_MAX_RETRIES
+    retry_backoff = settings.CELERY_TASK_RETRY_BACKOFF
+    retry_backoff_max = settings.CELERY_TASK_RETRY_BACKOFF_MAX
+    retry_jitter = settings.CELERY_TASK_RETRY_JITTER
+    acks_late = True
+    reject_on_worker_lost = True
+    ignore_result = False
+
+    asset_type = FailedAssetKind.FAILED_ASSET_FUNCTION
+
+    @property
+    def attempt(self) -> int:
+        return self.request.retries + 1  # type: ignore
+
+    # Returns (function key, channel)
+    def get_task_info(self, args: tuple, kwargs: dict) -> tuple[str, str]:
+        function = orchestrator.Function.parse_raw(kwargs["function_serialized"])
+        channel_name = kwargs["channel_name"]
+        return function.key, channel_name
+
+    # Celery does not provide unpacked arguments; we unpack them in `get_task_info`
+    def on_success(self, retval: dict[str, Any], task_id: str, args: tuple, kwargs: dict[str, Any]) -> None:
+        function_key, channel_name = self.get_task_info(args, kwargs)
+        with get_orchestrator_client(channel_name) as client:
+            client.update_function_status(
+                function_key=function_key, action=orchestrator.function_pb2.FUNCTION_ACTION_READY
+            )
+
+
+@app.task(
+    bind=True,
+    acks_late=True,
+    reject_on_worker_lost=True,
+    ignore_result=False,
+    base=SaveImageTask,
+)
+# Ack late and reject on worker lost allow us to recover the task if the worker crashes;
+# see http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost
+# and https://github.com/celery/celery/issues/5106
+def save_image_task(task: SaveImageTask, function_serialized: str, channel_name: str) -> tuple[str, str]:
+    logger.info("Starting save_image_task")
+    logger.info(f"Parameters: function_serialized {function_serialized}, " f"channel_name {channel_name}")
+    # create serialized image
+    function = orchestrator.Function.parse_raw(function_serialized)
+    container_image_tag = utils.container_image_tag_from_function(function)
+
+    os.makedirs(SUBTUPLE_TMP_DIR, exist_ok=True)
+
+    logger.info("Serialising the image from the registry")
+
+    with TemporaryDirectory(dir=SUBTUPLE_TMP_DIR) as tmp_dir:
+        storage_path =
pathlib.Path(tmp_dir) / f"{container_image_tag}.zip" + make_payload( + zip_file=storage_path, + docker_images_to_transfer=[f"{USER_IMAGE_REPOSITORY}:{container_image_tag}"], + registry=REGISTRY, + secure=False, + ) + + logger.info("Start saving the serialized image") + # save it + FunctionImage.objects.create( + function_id=function.key, file=File(file=storage_path.open(mode="rb"), name="image.zip") + ) + logger.info("Serialized image saved") + + return function_serialized, channel_name diff --git a/backend/substrapp/tests/compute_tasks/test_errors.py b/backend/substrapp/tests/compute_tasks/test_errors.py index 390ba6e55..254d88eb4 100644 --- a/backend/substrapp/tests/compute_tasks/test_errors.py +++ b/backend/substrapp/tests/compute_tasks/test_errors.py @@ -2,6 +2,7 @@ import pytest +from builder import exceptions as build_errors from orchestrator import failure_report_pb2 from substrapp.compute_tasks import errors @@ -19,7 +20,7 @@ def test_from_int(self, input_value: int, expected: errors.ComputeTaskErrorType) @pytest.mark.parametrize( ("exc", "expected"), [ - (errors.BuildError(logs="some build error"), failure_report_pb2.ERROR_TYPE_BUILD), + (build_errors.BuildError(logs="some build error"), failure_report_pb2.ERROR_TYPE_BUILD), (errors.ExecutionError(logs=io.BytesIO()), failure_report_pb2.ERROR_TYPE_EXECUTION), (Exception(), failure_report_pb2.ERROR_TYPE_INTERNAL), ], diff --git a/backend/substrapp/tests/tasks/test_compute_task.py b/backend/substrapp/tests/tasks/test_compute_task.py index d1bb25dbe..b136e9288 100644 --- a/backend/substrapp/tests/tasks/test_compute_task.py +++ b/backend/substrapp/tests/tasks/test_compute_task.py @@ -1,9 +1,7 @@ import datetime import errno -import io import tempfile from functools import wraps -from typing import Type from unittest.mock import MagicMock import pytest @@ -31,7 +29,8 @@ def test_compute_task_exception(mocker: MockerFixture): mock_init_task_dirs = mocker.patch("substrapp.tasks.tasks_compute_task.init_task_dirs") mock_add_asset_to_buffer = mocker.patch("substrapp.tasks.tasks_compute_task.add_task_assets_to_buffer") mock_add_asset_to_task_dir = mocker.patch("substrapp.tasks.tasks_compute_task.add_assets_to_taskdir") - mock_build_image_if_missing = mocker.patch("substrapp.tasks.tasks_compute_task.build_image_if_missing") + mock_load_remote_function_image = mocker.patch("substrapp.compute_tasks.image_builder.load_remote_function_image") + mock_wait_for_image_built = mocker.patch("substrapp.compute_tasks.image_builder.wait_for_image_built") mock_execute_compute_task = mocker.patch("substrapp.tasks.tasks_compute_task.execute_compute_task") saver = mocker.MagicMock() mock_output_saver = mocker.patch("substrapp.tasks.tasks_compute_task.OutputSaver", return_value=saver) @@ -66,7 +65,8 @@ class FakeDirectories: mock_init_task_dirs.assert_called_once() mock_add_asset_to_buffer.assert_called_once() mock_add_asset_to_task_dir.assert_called_once() - mock_build_image_if_missing.assert_called_once() + mock_load_remote_function_image.assert_called_once() + mock_wait_for_image_built.assert_called_once() mock_execute_compute_task.assert_called_once() saver.save_outputs.assert_called_once() mock_output_saver.assert_called_once() @@ -135,7 +135,8 @@ def test_celery_retry(mocker: MockerFixture): mocker.patch("substrapp.tasks.tasks_compute_task.add_task_assets_to_buffer") mocker.patch("substrapp.tasks.tasks_compute_task.add_assets_to_taskdir") mocker.patch("substrapp.tasks.tasks_compute_task.restore_dir") - 
mocker.patch("substrapp.tasks.tasks_compute_task.build_image_if_missing") + mocker.patch("substrapp.compute_tasks.image_builder.load_remote_function_image") + mocker.patch("substrapp.compute_tasks.image_builder.wait_for_image_built") mock_execute_compute_task = mocker.patch("substrapp.tasks.tasks_compute_task.execute_compute_task") mocker.patch("substrapp.tasks.tasks_compute_task.teardown_task_dirs") mock_retry = mocker.patch("substrapp.tasks.tasks_compute_task.ComputeTask.retry") @@ -181,37 +182,6 @@ def basic_retry(exc, **retry_kwargs): assert mock_retry.call_count == 2 -@pytest.mark.django_db -@pytest.mark.parametrize("logs", [b"", b"Hello, World!"]) -def test_store_failure_execution_error(logs: bytes): - compute_task_key = "42ff54eb-f4de-43b2-a1a0-a9f4c5f4737f" - exc = errors.ExecutionError(logs=io.BytesIO(logs)) - - failure_report = tasks_compute_task._store_failure(exc, compute_task_key) - failure_report.refresh_from_db() - - assert str(failure_report.compute_task_key) == compute_task_key - assert failure_report.logs.read() == logs - - -@pytest.mark.django_db -def test_store_failure_build_error(): - compute_task_key = "42ff54eb-f4de-43b2-a1a0-a9f4c5f4737f" - msg = "Error building image" - exc = errors.BuildError(msg) - - failure_report = tasks_compute_task._store_failure(exc, compute_task_key) - failure_report.refresh_from_db() - - assert str(failure_report.compute_task_key) == compute_task_key - assert failure_report.logs.read() == str.encode(msg) - - -@pytest.mark.parametrize("exc_class", [Exception]) -def test_store_failure_ignored_exception(exc_class: Type[Exception]): - assert tasks_compute_task._store_failure(exc_class(), "uuid") is None - - @pytest.mark.django_db def test_send_profiling_event(mock_retry: MagicMock, mocker: MockerFixture): mock_post = mocker.patch("substrapp.clients.organization.post") diff --git a/backend/substrapp/tests/tasks/test_store_asset_failure_report.py b/backend/substrapp/tests/tasks/test_store_asset_failure_report.py new file mode 100644 index 000000000..da897dce8 --- /dev/null +++ b/backend/substrapp/tests/tasks/test_store_asset_failure_report.py @@ -0,0 +1,69 @@ +import io +import pickle +from typing import Type + +import pytest +from pytest_mock import MockerFixture + +from substrapp.compute_tasks import errors +from substrapp.compute_tasks.errors import ComputeTaskErrorType +from substrapp.models import FailedAssetKind +from substrapp.tasks.tasks_asset_failure_report import store_asset_failure_report +from substrapp.utils.errors import store_failure + +CHANNEL = "mychannel" + + +@pytest.fixture +def mock_orchestrator_client(mocker: MockerFixture): + return mocker.patch("substrapp.tasks.tasks_asset_failure_report.get_orchestrator_client") + + +@pytest.mark.django_db +def test_store_asset_failure_report_success(mock_orchestrator_client: MockerFixture): + exc = errors.ExecutionError(io.BytesIO(b"logs")) + exception_pickled = pickle.dumps(exc) + store_asset_failure_report( + asset_key="e21f6352-75c1-4b79-9a00-1f547697ef25", + asset_type=FailedAssetKind.FAILED_ASSET_COMPUTE_TASK, + channel_name=CHANNEL, + exception_pickled=exception_pickled, + ) + + +def test_store_asset_failure_report_ignored(mock_orchestrator_client): + exception_pickled = pickle.dumps(Exception()) + store_asset_failure_report( + asset_key="750836e4-0def-465a-8397-57c49ebd38bf", + asset_type=FailedAssetKind.FAILED_ASSET_COMPUTE_TASK, + channel_name=CHANNEL, + exception_pickled=exception_pickled, + ) + + +@pytest.mark.django_db +@pytest.mark.parametrize("logs", [b"", b"Hello, 
World!"]) +def test_store_failure_execution_error(logs: bytes): + compute_task_key = "42ff54eb-f4de-43b2-a1a0-a9f4c5f4737f" + exc = errors.ExecutionError(logs=io.BytesIO(logs)) + + failure_report = store_failure( + exc, + compute_task_key, + FailedAssetKind.FAILED_ASSET_COMPUTE_TASK, + error_type=ComputeTaskErrorType.EXECUTION_ERROR.value, + ) + failure_report.refresh_from_db() + + assert str(failure_report.asset_key) == compute_task_key + assert failure_report.logs.read() == logs + + +@pytest.mark.parametrize("exc_class", [Exception]) +def test_store_failure_ignored_exception(exc_class: Type[Exception]): + assert ( + store_failure( + exc_class(), "uuid", FailedAssetKind.FAILED_ASSET_COMPUTE_TASK, ComputeTaskErrorType.INTERNAL_ERROR.value + ) + is None + ) diff --git a/backend/substrapp/tests/test_kubernetes_utils.py b/backend/substrapp/tests/test_kubernetes_utils.py index 044584b89..8821e4bc4 100644 --- a/backend/substrapp/tests/test_kubernetes_utils.py +++ b/backend/substrapp/tests/test_kubernetes_utils.py @@ -34,30 +34,3 @@ def test_get_service_node_port(): service.spec.ports[0].node_port = 9000 port = substrapp.kubernetes_utils.get_service_node_port("my_service") assert port == 9000 - - -def test_get_pod_logs(mocker): - mocker.patch("kubernetes.client.CoreV1Api.read_namespaced_pod_log", return_value="Super great logs") - k8s_client = kubernetes.client.CoreV1Api() - logs = substrapp.kubernetes_utils.get_pod_logs(k8s_client, "pod_name", "container_name", ignore_pod_not_found=True) - assert logs == "Super great logs" - - -def test_get_pod_logs_not_found(): - with mock.patch("kubernetes.client.CoreV1Api.read_namespaced_pod_log") as read_pod: - read_pod.side_effect = kubernetes.client.ApiException(404, "Not Found") - k8s_client = kubernetes.client.CoreV1Api() - logs = substrapp.kubernetes_utils.get_pod_logs( - k8s_client, "pod_name", "container_name", ignore_pod_not_found=True - ) - assert "Pod not found" in logs - - -def test_get_pod_logs_bad_request(): - with mock.patch("kubernetes.client.CoreV1Api.read_namespaced_pod_log") as read_pod: - read_pod.side_effect = kubernetes.client.ApiException(400, "Bad Request") - k8s_client = kubernetes.client.CoreV1Api() - logs = substrapp.kubernetes_utils.get_pod_logs( - k8s_client, "pod_name", "container_name", ignore_pod_not_found=True - ) - assert "pod_name" in logs diff --git a/backend/substrapp/utils/errors.py b/backend/substrapp/utils/errors.py new file mode 100644 index 000000000..013406f66 --- /dev/null +++ b/backend/substrapp/utils/errors.py @@ -0,0 +1,32 @@ +from typing import Optional + +from django.core import files + +from orchestrator import failure_report_pb2 +from substrapp import models +from substrapp import utils + + +def store_failure( + exception: Exception, + asset_key: str, + asset_type: models.FailedAssetKind, + error_type: failure_report_pb2.ErrorType.ValueType, +) -> Optional[models.AssetFailureReport]: + """If the provided exception is a `BuildError` or an `ExecutionError`, store its logs in the Django storage and + in the database. Otherwise, do nothing. + + Returns: + An instance of `models.AssetFailureReport` storing the error logs or None if the provided exception is + neither a `BuildError` nor an `ExecutionError`. 
+ """ + + if error_type not in [failure_report_pb2.ERROR_TYPE_BUILD, failure_report_pb2.ERROR_TYPE_EXECUTION]: + return None + + file = files.File(exception.logs) + failure_report = models.AssetFailureReport( + asset_key=asset_key, asset_type=asset_type, logs_checksum=utils.get_hash(file) + ) + failure_report.logs.save(name=asset_key, content=file, save=True) + return failure_report diff --git a/charts/substra-backend/CHANGELOG.md b/charts/substra-backend/CHANGELOG.md index ccdfdd871..f48692449 100644 --- a/charts/substra-backend/CHANGELOG.md +++ b/charts/substra-backend/CHANGELOG.md @@ -1,5 +1,13 @@ # Changelog +## [] - Unreleased + +## [24.0.0] - 2023-10-16 + +### Added + +- Builder service + ## [23.0.2] - 2023-10-18 ### Changed @@ -64,7 +72,7 @@ ## [22.8.0] - 2023-08-16 -## Added +### Added - New `server.allowImplicitLogin` field, controlling whether "implicit login" (`Client.login` in the Substra SDK) is enabled diff --git a/charts/substra-backend/Chart.yaml b/charts/substra-backend/Chart.yaml index 3b6fbb351..4a25510c9 100644 --- a/charts/substra-backend/Chart.yaml +++ b/charts/substra-backend/Chart.yaml @@ -1,7 +1,7 @@ apiVersion: v2 name: substra-backend home: https://github.com/Substra -version: 23.0.2 +version: 24.0.0 appVersion: 0.42.2 kubeVersion: ">= 1.19.0-0" description: Main package for Substra diff --git a/charts/substra-backend/README.md b/charts/substra-backend/README.md index f647ec72d..b31a9601a 100644 --- a/charts/substra-backend/README.md +++ b/charts/substra-backend/README.md @@ -187,6 +187,31 @@ See [UPGRADE.md](https://github.com/Substra/substra-backend/blob/main/charts/sub | `scheduler.podSecurityContext.runAsGroup` | Group ID for the pod | `1001` | | `scheduler.podSecurityContext.fsGroup` | FileSystem group ID for the pod | `1001` | +### Builder settings + +| Name | Description | Value | +| --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------- | +| `builder.replicaCount` | Number of builder replicas | `1` | +| `builder.enabled` | Enable worker service | `true` | +| `builder.replicaCount` | Replica count for the worker service | `1` | +| `builder.concurrency` | Maximum amount of tasks to process in parallel | `1` | +| `builder.image.registry` | Substra backend server image registry | `ghcr.io` | +| `builder.image.repository` | Substra backend server image repository | `substra/substra-backend` | +| `builder.image.tag` | Substra backend server image tag (defaults to AppVersion) | `nil` | +| `builder.image.pullPolicy` | Substra backend server image pull policy | `IfNotPresent` | +| `builder.image.pullSecrets` | Specify image pull secrets | `[]` | +| `builder.podSecurityContext.enabled` | Enable security context | `true` | +| `builder.podSecurityContext.runAsUser` | User ID for the pod | `1001` | +| `builder.podSecurityContext.runAsGroup` | Group ID for the pod | `1001` | +| `builder.podSecurityContext.fsGroup` | FileSystem group ID for the pod | `1001` | +| `builder.resources` | Builder container resources requests and limits | `{}` | +| `builder.nodeSelector` | Node labels for pod assignment | `{}` | +| `builder.tolerations` | Toleration labels for pod assignment | `[]` | +| `builder.affinity` | Affinity settings for pod assignment, ignored if `DataSampleStorageInServerMedia` is `true` | `{}` | +| `builder.persistence.storageClass` | Specify the _StorageClass_ used to provision the volume. 
Or the default _StorageClass_ will be used. Set it to `-` to disable dynamic provisioning | `""` | +| `builder.persistence.size` | The size of the volume. | `10Gi` | +| `builder.rbac.create` | Create a role and service account for the builder | `true` | + ### Substra container registry settings | Name | Description | Value | @@ -406,7 +431,7 @@ The backend uses a PostgreSQL database. By default it will deploy one as a subch ```yaml database: host: my.database.host - + auth: username: my-user password: aStrongPassword @@ -414,4 +439,4 @@ database: postgresql: enabled: false -``` \ No newline at end of file +``` diff --git a/charts/substra-backend/templates/rbac.yaml b/charts/substra-backend/templates/rbac.yaml index c3bed5e82..ff98028b9 100644 --- a/charts/substra-backend/templates/rbac.yaml +++ b/charts/substra-backend/templates/rbac.yaml @@ -129,4 +129,57 @@ roleRef: kind: Role name: {{ template "substra.fullname" . }}-api-event apiGroup: rbac.authorization.k8s.io -{{- end }} +{{- end -}} +{{- if .Values.builder.rbac.create }} +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: {{ template "substra.fullname" . }}-builder + labels: + {{ include "substra.labels" . | nindent 4 }} + app.kubernetes.io/name: {{ template "substra.name" . }} +--- +kind: Role +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: {{ template "substra.fullname" . }}-builder + labels: + {{ include "substra.labels" . | nindent 4 }} + app.kubernetes.io/name: {{ template "substra.name" . }} +rules: + - apiGroups: [""] + resources: ["secrets"] + verbs: ["get", "watch", "list"] + - apiGroups: [""] + resources: ["pods/log", "pods/status"] + verbs: ["get", "list", "watch"] + - apiGroups: [""] + resources: ["pods", "pods/exec"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] + - apiGroups: [""] + resources: ["services"] + verbs: ["get"] + {{- if .Values.psp.create }} + - apiGroups: [""] + resources: ["podsecuritypolicies"] + verbs: ["use"] + resourceNames: + - {{ template "substra.fullname" . }}-psp + {{- end }} +--- +kind: RoleBinding +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: {{ template "substra.fullname" . }}-builder + labels: + {{ include "substra.labels" . | nindent 4 }} + app.kubernetes.io/name: {{ template "substra.name" . }} +subjects: + - kind: ServiceAccount + name: {{ template "substra.fullname" . }}-builder +roleRef: + kind: Role + name: {{ template "substra.fullname" . }}-builder + apiGroup: rbac.authorization.k8s.io +{{- end -}} \ No newline at end of file diff --git a/charts/substra-backend/templates/statefulset-builder.yaml b/charts/substra-backend/templates/statefulset-builder.yaml new file mode 100644 index 000000000..7c077b23c --- /dev/null +++ b/charts/substra-backend/templates/statefulset-builder.yaml @@ -0,0 +1,239 @@ +{{- if .Values.builder.enabled }} +## Headless service doesn't get its own file +apiVersion: v1 +kind: Service +metadata: + name: {{ template "substra.fullname" . }}-builder + labels: + {{- include "substra.labels" . | nindent 4 }} + app.kubernetes.io/name: {{ template "substra.name" . }}-builder +spec: + clusterIP: None + selector: + app.kubernetes.io/component: substra-builder + app.kubernetes.io/instance: {{ .Release.Name }} +--- +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: {{ template "substra.fullname" . }}-builder + labels: + {{ include "substra.labels" . | nindent 4 }} + app.kubernetes.io/name: {{ template "substra.name" . 
}}-builder +spec: + replicas: {{ .Values.builder.replicaCount }} + serviceName: {{ template "substra.fullname" . }}-builder + selector: + matchLabels: + app.kubernetes.io/name: {{ template "substra.name" . }}-builder + {{ include "substra.selectorLabels" . | nindent 8}} + template: + metadata: + labels: + app.kubernetes.io/name: {{ template "substra.name" . }}-builder + {{ include "substra.labels" . | nindent 8 }} + app.kubernetes.io/component: substra-builder + annotations: + # This will cause the pod to restart if the content of the ConfigMap is updated through Helm + checksum-cm-orchestrator: {{ include (print $.Template.BasePath "/configmap-orchestrator.yaml") . | sha256sum }} + checksum-cm-settings: {{ include (print $.Template.BasePath "/configmap-settings.yaml") . | sha256sum }} + checksum-secret-objectstore : {{ include (print $.Template.BasePath "/secret-objectstore.yaml") . | sha256sum }} + checksum-secret-redis: {{ include (print $.Template.BasePath "/secret-redis.yaml") . | sha256sum }} + spec: + {{- if .Values.builder.podSecurityContext.enabled }} + securityContext: + fsGroup: {{ .Values.builder.podSecurityContext.fsGroup }} + runAsUser: {{ .Values.builder.podSecurityContext.runAsUser }} + runAsGroup: {{ .Values.builder.podSecurityContext.runAsGroup }} + {{- end }} + {{- with .Values.builder.image.pullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 8 }} + {{- end }} + serviceAccountName: {{ template "substra.fullname" . }}-builder + initContainers: + {{- include "common.waitPostgresqlInitContainer" . | nindent 6 }} + {{- if .Values.privateCa.enabled }} + - name: add-cert + image: {{ include "common.images.name" .Values.privateCa.image }} + imagePullPolicy: {{ .Values.privateCa.image.pullPolicy }} + securityContext: + runAsUser: 0 + command: ['sh', '-c'] + args: + - | + {{- if .Values.privateCa.image.apkAdd }} + apt update + apt install -y ca-certificates openssl + {{- end }} + update-ca-certificates && cp /etc/ssl/certs/* /tmp/certs/ + volumeMounts: + - mountPath: /usr/local/share/ca-certificates/{{ .Values.privateCa.configMap.fileName }} + name: private-ca + subPath: {{ .Values.privateCa.configMap.fileName }} + - mountPath: /tmp/certs/ + name: ssl-certs + {{- end }} + - name: wait-minio + image: jwilder/dockerize:0.6.1 + command: ['dockerize', '-wait', 'tcp://{{ .Release.Name }}-minio:9000'] + {{- if .Values.kaniko.cache.warmer.cachedImages }} + - name: kaniko-cache-warmer + image: {{ include "common.images.name" .Values.kaniko.cache.warmer.image }} + args: + - "--cache-dir=/cache" + {{- range .Values.kaniko.cache.warmer.cachedImages }} + - "--image={{ . 
}}" + {{- end }} + - "--verbosity=debug" + volumeMounts: + - name: docker-cache + mountPath: /cache + readOnly: False + {{- if .Values.kaniko.dockerConfigSecretName }} + - name: docker-config + mountPath: /kaniko/.docker + {{- end }} + {{- end }} + containers: + - name: builder + image: {{ include "substra-backend.images.name" (dict "img" .Values.builder.image "defaultTag" $.Chart.AppVersion) }} + imagePullPolicy: "{{ .Values.builder.image.pullPolicy }}" + command: ["/bin/bash", "-c"] + {{- if eq .Values.settings "prod" }} + args: ["celery -A backend worker -E -l info -Q {{ .Values.organizationName }}.builder,{{ .Values.organizationName }}.builder-${HOSTNAME##*-},{{ .Values.organizationName }}.broadcast --hostname {{ .Values.organizationName }}.builder-${HOSTNAME##*-}"] + {{ else }} + args: ["watchmedo auto-restart --directory=./ --pattern=*.py --recursive -- celery -A backend worker -E -l info -Q {{ .Values.organizationName }}.builder,{{ .Values.organizationName }}.builder-${HOSTNAME##*-},{{ .Values.organizationName }}.broadcast --hostname {{ .Values.organizationName }}.builder-${HOSTNAME##*-}"] + {{ end }} + envFrom: + # TODO: Remove dependency for LDEGER_MSP_ID + - configMapRef: + name: {{ include "substra.fullname" . }}-orchestrator + - configMapRef: + name: {{ include "substra.fullname" . }}-settings + - configMapRef: + name: {{ include "substra.fullname" . }}-redis + - configMapRef: + name: {{ include "substra.fullname" . }}-registry + # TODO: Remove once moved ImageResitryEntrypoint logic + - configMapRef: + name: {{ include "substra.fullname" . }}-database + - secretRef: + name: {{ include "substra.fullname" . }}-objectstore + - secretRef: + name: {{ include "substra.fullname" . }}-redis + # TODO: Remove once moved ImageResitryEntrypoint logic + - secretRef: + name: {{ include "substra-backend.database.secret-name" . 
}} + env: + - name: HOST_IP + valueFrom: + fieldRef: + fieldPath: status.hostIP + - name: POD_IP + valueFrom: + fieldRef: + fieldPath: status.podIP + - name: DJANGO_SETTINGS_MODULE + value: backend.settings.celery.{{ .Values.settings }} + - name: DEFAULT_DOMAIN + value: "{{ .Values.server.defaultDomain }}" + - name: "CELERY_WORKER_CONCURRENCY" + value: {{ .Values.builder.concurrency | quote }} + - name: WORKER_PVC_DOCKER_CACHE + value: docker-cache + - name: WORKER_PVC_SUBTUPLE + value: subtuple + {{- if .Values.privateCa.enabled }} + - name: REQUESTS_CA_BUNDLE + value: /etc/ssl/certs/ca-certificates.crt + {{- end }} + - name: NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: KANIKO_DOCKER_CONFIG_SECRET_NAME + value: {{ .Values.kaniko.dockerConfigSecretName | quote }} + - name: OBJECTSTORE_URL + value: {{ .Release.Name }}-minio:9000 + ports: + - name: http + containerPort: 8000 + protocol: TCP + volumeMounts: + - name: subtuple + mountPath: /var/substra/medias/subtuple + {{- if .Values.privateCa.enabled }} + - mountPath: /etc/ssl/certs + name: ssl-certs + {{- end }} + {{ if .Values.orchestrator.tls.enabled }} + - name: orchestrator-tls-cacert + mountPath: /var/substra/orchestrator/tls/server + {{ if .Values.orchestrator.tls.mtls.enabled }} + - name: orchestrator-tls-client-pair + mountPath: /var/substra/orchestrator/tls/client + {{ end }} + {{ end }} + resources: + {{- toYaml .Values.builder.resources | nindent 12 }} + volumes: + {{- if .Values.privateCa.enabled }} + - name: ssl-certs + emptyDir: {} + - name: private-ca + configMap: + name: {{ .Values.privateCa.configMap.name }} + {{- end }} + {{ if .Values.orchestrator.tls.enabled }} + - name: orchestrator-tls-cacert + configMap: + name: {{ .Values.orchestrator.tls.cacert }} + {{ if .Values.orchestrator.tls.mtls.enabled }} + - name: orchestrator-tls-client-pair + secret: + secretName: {{ .Values.orchestrator.tls.mtls.clientCertificate }} + {{ end }} + {{ end }} + {{- if .Values.kaniko.dockerConfigSecretName }} + - name: docker-config + secret: + secretName: {{ .Values.kaniko.dockerConfigSecretName }} + items: + - key: .dockerconfigjson + path: config.json + {{- end }} + {{- with .Values.builder.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.builder.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.builder.tolerations }} + tolerations: + {{- toYaml . 
| nindent 8 }}
+      {{- end }}
+  volumeClaimTemplates:
+    - metadata:
+        name: subtuple
+      spec:
+        accessModes: [ "ReadWriteOnce" ]
+        {{ include "common.storage.class" .Values.builder.persistence }}
+        resources:
+          requests:
+            storage: {{ .Values.builder.persistence.size }}
+    - metadata:
+        name: docker-cache
+      spec:
+        accessModes: [ "ReadWriteOnce" ]
+        {{ include "common.storage.class" .Values.kaniko.cache.persistence }}
+        resources:
+          requests:
+            storage: {{ .Values.kaniko.cache.persistence.size }}
+{{- end }}
diff --git a/charts/substra-backend/values.yaml b/charts/substra-backend/values.yaml
index e0cef4db6..0935408c0 100644
--- a/charts/substra-backend/values.yaml
+++ b/charts/substra-backend/values.yaml
@@ -471,6 +471,87 @@ scheduler:
     runAsGroup: 1001
     fsGroup: 1001
 
+
+## @section Builder settings
+##
+builder:
+  ## @param builder.enabled Enable builder service
+  ##
+  enabled: true
+  ## @param builder.replicaCount Replica count for the builder service
+  ##
+  replicaCount: 1
+
+  ## @param builder.concurrency Maximum amount of tasks to process in parallel
+  ##
+  concurrency: 1
+
+  ## Substra backend image version
+  ## @param builder.image.registry Substra backend server image registry
+  ## @param builder.image.repository Substra backend server image repository
+  ## @param builder.image.tag Substra backend server image tag (defaults to AppVersion)
+  ## @param builder.image.pullPolicy Substra backend server image pull policy
+  ## @param builder.image.pullSecrets Specify image pull secrets
+  ##
+  image:
+    registry: ghcr.io
+    repository: substra/substra-backend
+    tag: null
+    pullPolicy: IfNotPresent
+    ## Optionally specify an array of imagePullSecrets.
+    ## Secrets must be created manually in the namespace.
+    ##
+    pullSecrets: []
+
+  ## @param builder.podSecurityContext.enabled Enable security context
+  ## @param builder.podSecurityContext.runAsUser User ID for the pod
+  ## @param builder.podSecurityContext.runAsGroup Group ID for the pod
+  ## @param builder.podSecurityContext.fsGroup FileSystem group ID for the pod
+  ##
+  podSecurityContext:
+    enabled: true
+    runAsUser: 1001
+    runAsGroup: 1001
+    fsGroup: 1001
+
+
+  ## @param builder.resources Builder container resources requests and limits
+  ## e.g:
+  ## resources:
+  ##   limits:
+  ##     cpu: 100m
+  ##     memory: 128Mi
+  ##   requests:
+  ##     cpu: 100m
+  ##     memory: 128Mi
+  ##
+  resources: {}
+
+  ## @param builder.nodeSelector Node labels for pod assignment
+  ##
+  nodeSelector: { }
+  ## @param builder.tolerations Toleration labels for pod assignment
+  ##
+  tolerations: [ ]
+  ## @param builder.affinity Affinity settings for pod assignment, ignored if `DataSampleStorageInServerMedia` is `true`
+  ##
+  affinity: { }
+
+
+  persistence:
+    ## @param builder.persistence.storageClass Specify the _StorageClass_ used to provision the volume. Or the default _StorageClass_ will be used. Set it to `-` to disable dynamic provisioning
+    ## @param builder.persistence.size The size of the volume.
+ ## + storageClass: "" + size: 10Gi + + ## @param builder.rbac.create Create a role and service account for the builder + ## + rbac: + create: true + + + ## @section Substra container registry settings ## containerRegistry: diff --git a/docker/substra-backend/Dockerfile b/docker/substra-backend/Dockerfile index fb90ffc1a..1bd6f37a3 100644 --- a/docker/substra-backend/Dockerfile +++ b/docker/substra-backend/Dockerfile @@ -24,6 +24,8 @@ COPY ./backend/organization_register /usr/src/app/organization_register COPY ./backend/users /usr/src/app/users COPY ./backend/orchestrator /usr/src/app/orchestrator COPY ./backend/api /usr/src/app/api +COPY ./backend/builder /usr/src/app/builder +COPY ./backend/image_transfer /usr/src/app/image_transfer FROM build AS arm64 diff --git a/docs/settings.md b/docs/settings.md index f0a2ecc09..205c0e66e 100644 --- a/docs/settings.md +++ b/docs/settings.md @@ -51,7 +51,7 @@ Accepted true values for `bool` are: `1`, `ON`, `On`, `on`, `T`, `t`, `TRUE`, `T | string | `OBJECTSTORE_URL` | nil | | | int | `PAGINATION_MAX_PAGE_SIZE` | `10000` | | | string | `POD_IP` | nil | | -| string | `REGISTRY` | nil | | +| string | `REGISTRY` | empty string | | | bool | `REGISTRY_IS_LOCAL` | nil | | | string | `REGISTRY_PULL_DOMAIN` | nil | | | string | `REGISTRY_SCHEME` | nil | | diff --git a/fixtures/.DS_Store b/fixtures/.DS_Store deleted file mode 100644 index 5159b676d60215a00475b7c14f61ab97ded558c7..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 6148 zcmeHK%}T>S5T0$TZ74zy3OxqA7OYw;;w8lT0!H+pQWINjFlMDm&7l->))(?gd>&_Z zH)3f8Pa<{(X203_$+F*u{Q&^sP6oRG4FIU25|%7%mI%d37o=o7ghGA8hZs^ALK0*P z+3fg>4A9zHa0erp;M#pzzvL;1WPs7P;3*1|wB3FerE+C$y;`-Z_3Di`l9`wK=_GOe z@fG#XrHq2Y_JfOXnDrZ*Co)d`FdmLoK{yy<%JpR!4`k-bNjykYt)~N)RkQky=5*TW z9<=Pv{^6`;PmlK6Xm>lanN{1`-Z?(&KSYnQe9?3Y{PUD-7%bowjU_XC_Qr7}<9m!2 zl|=<3Gr$Zm1FOP-+5c+wRhcl)lNn$Je#`*v4>l^HV=&jKwhnCQ`bhB#Aqm>_mLRkZ zItFu%7(o%b6j7H7^TZIk9R0S*a}4GhbvXz%GJeO5EX)fwO3&-LI3;toE=-QjS{wBim4xzgjh`iGsG}Hj=_qcZ aDnY+Z2BKpy*N7ezz6dB9xM2oH1@V-^m;4Wg<&0T*E43hX&L&p$$qDprKhvt+--jT7}7np#A3 zem<@ulZcFPQ@L2!n>{z**++&mCkOWA81W14cNZlEfg7;MkzE(HCqgga^y>{tEnwC%0;vJ&^%eQ zLs35+`xjp>T0*t5QnPvY}9 zlcZv41y3S&1}0xJKO6F8$s_}oE+G?DHBc1CThvSr-ZlyY% zw%mG6)*G!^O->K@-I{D3G-orXw6nW+blQE4pAz+A=oI+pY1y(khga;ZSlP2bOk$PX zqj#P;&tPN*m;q*BT^O+EU#Yw<3+8n)1I)mW8KCn)q7r%*GlTl*z=p1mkO{%a@3}MpIE^VA=F*9h=LFk$BJ9cJaUnoM)j&`ZTL3jqaWd@jm zMF#R_SfToV^8NdNF^Naa05kBf7!ZYy-)ZBLY;9dy9MxKhdV@+rahbu-6zu3yjImUT ctEgJgE~$g)S}CK!G=`lTKnVamG~%3(%^c7;*$i^dlSdZj9x242K^-Rg zY9^W;zfl2NI}fg52nw#?!}>*uj&O`Vj1{t1hk5bQXwY-<&rvUl2WhqXB69h{!eX)L z7nh0`;jJEpX*(Sb8twiWT^(zkMALpdI*GfZR%vx#tF#@fZr>!toi2u)pT?@AM-4qx zoq@>>?SSuleyda-kL%T1Mb@_JlZqVgZdNODduMwx@x8V6jlHARU2?DVBfd96;CE`U z?08B~*qPDcA?&L}s~a4f=1t?7Spimn75K{vxIN8V`pc8xcV`7yfp1ZO_6LbZ95~J_ zo2>&IQvx6t&~1Qy`emdX={RtlS;i5Bu&Ky4Rk#vE*mU$u7Z*6rEZcMtuJ{n{$--4A z!aN=CFHJazz%sY204wlW0jm8FB|85PzOMgG5I3vj#?nz-Ml(Raqz-c6IJ1l%1pf#a8Mt8uepG=sQFB|j diff --git a/pyproject.toml b/pyproject.toml index bad767575..655e81baa 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -83,6 +83,13 @@ check_untyped_defs = true ignore_missing_imports = true cache_dir = "/dev/null" follow_imports = "silent" +packages = [ + "builder", + "substrapp.tasks", +] +exclude = [ + "tests/*", +] [tool.django-stubs] django_settings_module = "backend.settings.test" diff --git a/skaffold.yaml b/skaffold.yaml index bb5ade701..faf7c2cb4 100644 --- a/skaffold.yaml +++ b/skaffold.yaml @@ -37,6 +37,7 @@ deploy: server.image: *image-params worker.events.image: 
*image-params worker.image: *image-params + builder.image: *image-params createNamespace: true - name: backend-org-2