
[bitnami/mlflow] Artifacts Issue in UI - botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist. #23959

Closed
iamhritik290799 opened this issue Feb 28, 2024 · 33 comments · Fixed by #25294 or #28955
Assignees
Labels
mlflow, solved, tech-issues (The user has a technical issue about an application)

Comments

@iamhritik290799
Contributor

iamhritik290799 commented Feb 28, 2024

Name and Version

bitnami/mlflow 0.7.4

What architecture are you using?

amd64

What steps will reproduce the bug?

Simply use the bitnami/mlflow Helm chart version 0.7.4 and provide S3 details like these:

externalS3.bucket = mlflow-bucket-name
externalS3.useCredentialsInSecret = false
externalS3.host = mlflow-bucket-name.s3.eu-central-1.amazonaws.com

externalS3.existingSecretAccessKeyIDKey = root-user
externalS3.existingSecretKeySecretKey = root-password
externalS3.port = 443
externalS3.protocol = https
externalS3.serveArtifacts = true

Are you using any custom parameters or values?

Nope

What is the expected behavior?

Able to see artifacts in the mlflow UI Console

What do you see instead?

I get this error in the logs, and the UI screen just shows "Loading".

2024/02/28 09:46:35 ERROR mlflow.server: Exception on /ajax-api/2.0/mlflow/artifacts/list [GET]
Traceback (most recent call last):
  File "/opt/bitnami/python/lib/python3.10/site-packages/flask/app.py", line 1463, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/bitnami/python/lib/python3.10/site-packages/flask/app.py", line 872, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/bitnami/python/lib/python3.10/site-packages/flask/app.py", line 870, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/bitnami/python/lib/python3.10/site-packages/flask/app.py", line 855, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/opt/bitnami/python/lib/python3.10/site-packages/mlflow/server/handlers.py", line 494, in wrapper
    return func(*args, **kwargs)
  File "/opt/bitnami/python/lib/python3.10/site-packages/mlflow/server/handlers.py", line 535, in wrapper
    return func(*args, **kwargs)
  File "/opt/bitnami/python/lib/python3.10/site-packages/mlflow/server/handlers.py", line 953, in _list_artifacts
    artifact_entities = _get_artifact_repo(run).list_artifacts(path)
  File "/opt/bitnami/python/lib/python3.10/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 211, in list_artifacts
    for result in results:
  File "/opt/bitnami/python/lib/python3.10/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/opt/bitnami/python/lib/python3.10/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/opt/bitnami/python/lib/python3.10/site-packages/botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/bitnami/python/lib/python3.10/site-packages/botocore/client.py", line 1009, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.
[screenshot of the error in the UI]

Additional information

No response

@iamhritik290799 iamhritik290799 added the tech-issues The user has a technical issue about an application label Feb 28, 2024
@iamhritik290799 iamhritik290799 changed the title Artifact Issue in UI - botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist. Artifacts Issue in UI - botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist. Feb 28, 2024
@github-actions github-actions bot added the triage Triage is needed label Feb 28, 2024
@javsalgar javsalgar changed the title Artifacts Issue in UI - botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist. [bitnami/mlflow] Artifacts Issue in UI - botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist. Feb 28, 2024
@javsalgar
Contributor

Hi,

Looking at the issue, it is not clear to me whether it is related to the Bitnami packaging of MLflow or to an S3 issue inside the MLflow code itself. Did you check with the upstream developers?

@iamhritik290799
Contributor Author

iamhritik290799 commented Mar 1, 2024

I checked, but they said there may be an issue in the Bitnami mlflow image that causes this error when trying to access artifacts from the S3 bucket.

By the way, these are the args in our mlflow container:

containers:
  - args:
    - server
    - --backend-store-uri=postgresql://admin:$(MLFLOW_DATABASE_PASSWORD)@rds-instance-endpoint.rds.amazonaws.com:5432/mlflow
    - --artifacts-destination=s3://mlflow-artifacts
    - --serve-artifacts
    - --host=0.0.0.0
    - --port=5000
    - --app-name=basic-auth

@carrodher
Member

The issue may not be directly related to the Bitnami container image or Helm chart, but rather to how the application is being utilized or configured in your specific environment.

Having said that, if you think that's not the case and are interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

If you have any questions about the application itself, customizing its content, or questions about technology and infrastructure usage, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.

With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.

@iamhritik290799
Contributor Author

Hi @carrodher

I have checked, and after removing the MLFLOW_S3_ENDPOINT_URL env var from the deployment it works fine; I am able to load artifacts in my MLflow experiments.

[screenshot: artifacts visible in the MLflow UI]

I checked the MLflow documentation as well; it suggests unsetting MLFLOW_S3_ENDPOINT_URL on the client system, but in our case removing this env var from the server deployment is what fixed it.

[screenshot of the relevant MLflow documentation]

However, in the bitnami/mlflow Helm chart's tracking deployment template there is no parameter to exclude this env var when it is not required. Please find the reference below.

[screenshot of the chart's tracking deployment template]
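To illustrate why removing the variable changes behavior, here is a minimal hypothetical sketch (not mlflow's actual code) of how a server-side S3 client factory typically treats this variable: a custom endpoint is only passed through when MLFLOW_S3_ENDPOINT_URL is set, otherwise boto3 falls back to its standard AWS endpoint resolution.

```python
import os


def s3_client_kwargs(env=os.environ):
    """Build keyword arguments for an S3 client.

    Hypothetical simplification: pass endpoint_url only when
    MLFLOW_S3_ENDPOINT_URL is set (e.g. for MinIO); when it is
    unset, no endpoint is forced and boto3 resolves the proper
    regional AWS endpoint on its own.
    """
    kwargs = {}
    endpoint = env.get("MLFLOW_S3_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs
```

With the chart rendering something like MLFLOW_S3_ENDPOINT_URL=https://s3.amazonaws.com:443, the client is pinned to the global endpoint, which would explain why unsetting the variable restores the default regional resolution.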

@github-actions github-actions bot removed the triage Triage is needed label Mar 6, 2024
@github-actions github-actions bot assigned CeliaGMqrz and unassigned carrodher Mar 6, 2024
@iamhritik290799
Contributor Author

@CeliaGMqrz any update on this ?

@act-mreeves

@iamhritik290799 I am using an ugly hack to work around this for now. You can figure out what your tracking args are by exec-ing into a running tracking pod and running ps aux | cat. Thanks for troubleshooting this.

tracking:
  command: [ "/bin/sh", "-c" ]
  args:
    - >
      unset MLFLOW_S3_ENDPOINT_URL;
      mlflow server --host=0.0.0.0 --port=5000 --app-name=basic-auth
      --serve-artifacts --artifacts-destination=s3://$YOUR_BUCKET
      --backend-store-uri=postgresql://postgres:$(MLFLOW_DATABASE_PASSWORD)@$YOUR_DB_HOST:5432/mlflow_db;

@iamhritik290799
Contributor Author

Team, have you made any changes to the Helm chart to allow customizing environment variables for the mlflow deployment?

@andresbono
Contributor

Even before we make any changes in the Helm chart, I think we should clarify in which specific cases or scenarios setting the MLFLOW_S3_ENDPOINT_URL env var for the tracking component is required. @iamhritik290799, @act-mreeves, can you help clarify that?

BTW @iamhritik290799, just to confirm, the screenshot you shared that suggests to unset the env-var comes from this documentation page, right? https://mlflow.org/docs/2.10.2/tracking/artifacts-stores.html#setting-bucket-region

@act-mreeves

@andresbono My specific case that requires unsetting MLFLOW_S3_ENDPOINT_URL is using IRSA (IAM Roles for Service Accounts) with an AWS S3 bucket. You are 100% correct that I have not exhaustively tested if and when this env var IS required.

What I think I am seeing is that only the bucket name is needed in this scenario; these are the relevant arguments given to the mlflow binary: --serve-artifacts --artifacts-destination=s3://my-mlflow-bucket.

tracking:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::1234567890:role/my-mlflow-s3-role

externalS3:
  useCredentialsInSecret: false
  protocol: "https"
  host: "my-mlflow-bucket.s3.us-east-1.amazonaws.com"
  bucket: "my-mlflow-bucket"
  serveArtifacts: true

In a nutshell: per mlflow/mlflow#9523 (comment), I think that if you use external AWS S3 you should not set MLFLOW_S3_ENDPOINT_URL, whereas if you use MinIO (the default in this Helm chart) you do need it.

There is a lot of discussion here too: mlflow/mlflow#7104. I think @Gekko0114 would have more domain knowledge to explain what is going on here.

@iamhritik290799
Contributor Author

iamhritik290799 commented Apr 2, 2024

@act-mreeves correct. Additionally, in my setup I only require the --artifacts-destination argument, which I can define in the pod args section. However, I do not need the default MLFLOW_S3_ENDPOINT_URL environment variable that comes with the Bitnami Helm chart.

@andresbono My request is to either remove this env variable, which is currently set by default in the Helm chart, or make it optional rather than mandatory.

@dwolfeu

dwolfeu commented Apr 15, 2024

We have the same use case as @iamhritik290799 (artifacts saved in S3, no custom endpoint, AWS service user) and the suggested solution also worked for us.

@iamhritik290799
Contributor Author

Hi @andresbono , any update on this ?

@andresbono
Contributor

Thank you for all the additional information you provided. Based on that, I think the best option is to remove the environment variable from the deployment. When needed in some specific scenarios, users can always add it via tracking.extraEnvVars.

Would you like to send a PR addressing the change? Thank you!!

@aaj-synth

Another solution is to not set externalS3.host when using IRSA. I tried it and it worked perfectly.

@dwolfeu

dwolfeu commented Apr 18, 2024

@aaj-synth Alas, this solution doesn't work for us: if we remove externalS3.host from values.yaml, we get the message "No Artifacts Recorded. Use the log artifact APIs to store file outputs from MLflow runs" in the Artifacts tab of the web interface. So far only the solution suggested by @iamhritik290799 has solved the issue for us.

@andresbono
Contributor

FYI, #26462 may look like a regression of this issue, but it shouldn't be. Please check the comments in the PR for more information. TL;DR:

  • externalS3.host=mlflow-bucket-name.s3.eu-central-1.amazonaws.com
  • externalS3.host=s3.eu-central-1.amazonaws.com
  • externalS3.host=s3.amazonaws.com

@Jasper-Ben
Contributor

Jasper-Ben commented Aug 15, 2024

@andresbono I cannot confirm that; after upgrading to the latest release we are getting this error yet again!

As stated before, MLFLOW_S3_ENDPOINT_URL should not be set when using AWS S3; see also: mlflow/mlflow#9523 (comment)

The change in #26462 once again causes MLFLOW_S3_ENDPOINT_URL to be set when using AWS S3. We had to fall back to the initial workaround described in #23959 (comment).

Can we get this issue re-opened, please?

@carrodher carrodher reopened this Aug 17, 2024
@github-actions github-actions bot added triage Triage is needed and removed solved labels Aug 17, 2024
@github-actions github-actions bot removed the triage Triage is needed label Aug 19, 2024
@github-actions github-actions bot assigned andresbono and unassigned andresbono and carrodher Aug 19, 2024
@andresbono
Contributor

Hi @Jasper-Ben, could you share the value you are passing for externalS3.host? See #23959 (comment). You can redact it; I'm just interested in the format.

I don't know if you had a chance to check the comments of #26462, but we did some extensive testing and it worked for all the test cases, given that the proper values were passed.

@Jasper-Ben
Contributor

Jasper-Ben commented Aug 19, 2024

Hi @Jasper-Ben, could you share what is the value you are passing for externalS3.host? see #23959 (comment). You can redact it, I'm just interested in the format.

I don't know if you had a chance to check the comments of #26462, but we did some extensive testing and it worked for all the test cases, given the proper values were passed.

Yes, I have read the comments; we are using s3.amazonaws.com as the host.

What did these tests include? The initial connection test to S3 works (it did before the initial fix as well). The issue only appears when you actually try to access the artifacts of a job.

If it helps, this is our Terraform config (without the workaround):

resource "helm_release" "k8s_mlflow" {
  name       = local.release_name
  namespace  = kubernetes_namespace.mlflow.metadata[0].name
  repository = "https://charts.bitnami.com/bitnami"
  chart      = "mlflow"
  version    = "1.4.22"
  values = [
    file("${path.module}/helm_values.yaml"),
    yamlencode(var.extra_helm_configuration),
    yamlencode({
      commonLabels = local.common_labels_k8s
      tracking = {
        auth = {
          username = var.tracking_username
          password = var.tracking_password
        }
        persistence = {
          enabled = false
        }
      },
      run = {
        persistence = {
          enabled = false
        }
      },
      externalS3 = {
        host   = "s3.amazonaws.com"
        bucket = local.artifact_bucket # this is the plain bucket name

        accessKeyID     = aws_iam_access_key.s3_access.id
        accessKeySecret = aws_iam_access_key.s3_access.secret
      },
      externalDatabase = {
        host                      = kubernetes_manifest.postgres.manifest.metadata.name
        user                      = keys(kubernetes_manifest.postgres.manifest.spec.users).0
        existingSecret            = "${local.postgres_user}.${local.postgres_name}.credentials.postgresql.acid.zalan.do"
        existingSecretPasswordKey = "password"
        database                  = "${keys(kubernetes_manifest.postgres.manifest.spec.databases).0}?sslmode=require"
        authDatabase              = "${keys(kubernetes_manifest.postgres.manifest.spec.databases).1}?sslmode=require"
      }
    })
  ]
}

Also, we have the following additional helm values:

minio:
  enabled: false
tracking:
  nodeAffinityPreset:
    type: hard
    key: node.kubernetes.io/lifecycle
    values:
      - normal
  service:
    type: "ClusterIP"
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    requests:
      cpu: 2
      memory: 6Gi
    limits:
      cpu: 3
      memory: 10Gi
  auth:
    enabled: true
run:
  nodeAffinityPreset:
    type: hard
    key: node.kubernetes.io/lifecycle
    values:
      - normal
  source:
    type: "configmap"
postgresql:
  enabled: false

@Jasper-Ben
Contributor

Jasper-Ben commented Aug 19, 2024

We have been using s3.amazonaws.com from the beginning, by the way, and we have also tried the regional endpoint. Neither works.

Also, I do not see how #26462 would fix anything compared to the pre-#25294 code.

#25294 caused the MLFLOW_S3_ENDPOINT_URL variable to be set only when the internal MinIO setup is used, which fixed things for AWS S3 users but broke them for other external S3-compatible storage solutions.

#26462, from an AWS S3 perspective, basically reverted the previous change, except that it now sets the MLFLOW_S3_ENDPOINT_URL variable under the following condition ("hidden" behind the include):

{{- if or .Values.minio.enabled .Values.externalS3.host -}}

This will of course always evaluate to true for the AWS S3 use case, since externalS3.host must be set for mlflow to be configured to use S3 at all, so the MLFLOW_S3_ENDPOINT_URL variable is set again.
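The template condition can be emulated in Python to make the point explicit. This is only a mirror of the template logic over a hypothetical values dict, not chart code:

```python
def sets_endpoint_env_var(values: dict) -> bool:
    """Mirror of the Helm condition
    {{- if or .Values.minio.enabled .Values.externalS3.host -}}.
    Returns True when the chart would render MLFLOW_S3_ENDPOINT_URL.
    """
    minio_enabled = values.get("minio", {}).get("enabled", False)
    external_host = values.get("externalS3", {}).get("host", "")
    return bool(minio_enabled or external_host)


# AWS S3 users must set externalS3.host, so the variable is always rendered:
aws_values = {"minio": {"enabled": False},
              "externalS3": {"host": "s3.amazonaws.com"}}
assert sets_endpoint_env_var(aws_values) is True
```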

So basically we have come full circle on this issue: it has been "fixed" for one use case while breaking it for another (twice).

Maybe the addressing-style settings from #26462 (comment) fix things (I haven't fully understood / tested that yet), but just setting s3.amazonaws.com does not.

@andresbono
Contributor

Thank you @Jasper-Ben.

What did these tests include? The initial connection test to s3 works (has been before the initial fix as well). The issue only appears when you try to actually access the artifacts of a job.

You can find what we tested here: #26462 (comment) (unfold Scenario 2). I specifically tested access to job artifacts; see the screenshot. When I tested with the externalS3.host=s3.amazonaws.com value, it worked fine. There must be some relevant difference between my testing scenario and yours.

I share your concern about going in circles on this issue. My assumption was that setting the proper externalS3.host was enough, which is why merging #26462 made sense.

Maybe the addressing style stuff from #26462 (comment) fixes things (haven't fully understood / tested that yet)

Please, try that and let us know about any other update you may have.


@Jasper-Ben
Contributor

Jasper-Ben commented Aug 21, 2024

Just to reiterate the status quo:

I set up a second test instance using the exact same configuration as mentioned in #23959 (comment).

The important bits:

  1. externalS3.host is set to s3.amazonaws.com
  2. externalS3.bucket is set to a plain bucket name

I then used the following example project to create an experiment:

import mlflow
from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

mlflow.set_tracking_uri(uri="<MLFLOW_URI>")

# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "multi_class": "auto",
    "random_state": 8888,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)


# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the loss metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="tracking-quickstart",
    )

(basically step 3 and 4 from https://mlflow.org/docs/latest/getting-started/intro-quickstart/index.html)

This will cause the following error while trying to upload the artifacts to AWS S3:

mlflow boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/tmpl8tifcwt/input_example.json to <BUCKET_NAME>/2/d4780a6c6c50403fab62785c7a08d8db/artifacts/iris_model/input_example.json: An error occurred (PermanentRedirect) when calling the PutObject operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

So I was able to reproduce the issue on a fresh setup. I will now experiment with the addressing style.
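For reference, the two S3 addressing styles differ only in where the bucket name appears in the request URL. A plain illustration (not mlflow or boto3 code):

```python
def s3_object_url(bucket, key, region="us-east-1", style="virtual"):
    """Build an S3 object URL in either addressing style.

    virtual-hosted: https://<bucket>.s3.<region>.amazonaws.com/<key>
    path:           https://s3.<region>.amazonaws.com/<bucket>/<key>
    """
    if style == "virtual":
        return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    return f"https://s3.{region}.amazonaws.com/{bucket}/{key}"
```

This matches the PermanentRedirect seen above: neither addressing style helps if requests are sent to the global or wrong-region endpoint for the bucket.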

@Jasper-Ben
Contributor

I changed the addressing style to path as suggested in #26462 (comment). Still, I get the same error message. So that does not seem to help.

Pinging @frittentheke for visibility.

The tracking pod looks like this now:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-08-21T13:48:36Z"
  generateName: iris-devops-mlflow-test-tracking-ffb6fd9f-
  labels:
    app.kubernetes.io/component: tracking
    app.kubernetes.io/instance: iris-devops-mlflow-test
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/part-of: mlflow
    app.kubernetes.io/version: 2.15.1
    generator: Terraform
    helm.sh/chart: mlflow-1.4.22
    pod-template-hash: ffb6fd9f
  name: iris-devops-mlflow-test-tracking-ffb6fd9f-kqblg
  namespace: mlflow-test
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: iris-devops-mlflow-test-tracking-ffb6fd9f
    uid: 0a5ca0e4-48c7-4007-8d37-1edc0156792c
  resourceVersion: "755350757"
  uid: 709a958f-f5ee-4450-b172-e147e05153a3
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values:
            - normal
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: mlflow
              app.kubernetes.io/instance: iris-devops-mlflow-test
              app.kubernetes.io/name: mlflow
          topologyKey: kubernetes.io/hostname
        weight: 1
  automountServiceAccountToken: false
  containers:
  - args:
    - server
    - --backend-store-uri=postgresql://mlflow:$(MLFLOW_DATABASE_PASSWORD)@iris-devops-mlflow-test-postgres:5432/mlflow?sslmode=require
    - --artifacts-destination=s3://<BUCKET_NAME>
    - --serve-artifacts
    - --host=0.0.0.0
    - --port=5000
    - --expose-prometheus=/bitnami/mlflow/metrics
    - --app-name=basic-auth
    command:
    - mlflow
    env:
    - name: BITNAMI_DEBUG
      value: "false"
    - name: MLFLOW_DATABASE_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: mlflow.iris-devops-mlflow-test-postgres.credentials.postgresql.acid.zalan.do
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: root-user
          name: iris-devops-mlflow-test-externals3
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: root-password
          name: iris-devops-mlflow-test-externals3
    - name: MLFLOW_S3_ENDPOINT_URL
      value: https://s3.amazonaws.com:443
    - name: MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE
      value: path
    image: docker.io/bitnami/mlflow:2.15.1-debian-12-r0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - pgrep
        - -f
        - mlflow.server
      failureThreshold: 5
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: mlflow
    ports:
    - containerPort: 5000
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 5
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: http
      timeoutSeconds: 5
    resources:
      limits:
        cpu: "3"
        memory: 10Gi
      requests:
        cpu: "2"
        memory: 6Gi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /app/mlruns
      name: mlruns
    - mountPath: /app/mlartifacts
      name: mlartifacts
    - mountPath: /bitnami/mlflow-basic-auth/basic_auth.ini
      name: rendered-basic-auth
      subPath: basic_auth.ini
    - mountPath: /bitnami/mlflow
      name: data
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      retry_while() {
        local -r cmd="${1:?cmd is missing}"
        local -r retries="${2:-12}"
        local -r sleep_time="${3:-5}"
        local return_value=1

        read -r -a command <<< "$cmd"
        for ((i = 1 ; i <= retries ; i+=1 )); do
            "${command[@]}" && return_value=0 && break
            sleep "$sleep_time"
        done
        return $return_value
      }

      check_host() {
          local -r host="${1:-?missing host}"
          local -r port="${2:-?missing port}"
          if wait-for-port --timeout=5 --host=${host} --state=inuse $port ; then
             return 0
          else
             return 1
          fi
      }

      echo "Checking connection to iris-devops-mlflow-test-postgres:5432"
      if ! retry_while "check_host iris-devops-mlflow-test-postgres 5432"; then
          echo "Connection error"
          exit 1
      fi

      echo "Connection success"
      exit 0
    image: docker.io/bitnami/os-shell:12-debian-12-r27
    imagePullPolicy: IfNotPresent
    name: wait-for-database
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      cp /bitnami/mlflow-basic-auth/basic_auth.ini /bitnami/rendered-basic-auth/basic_auth.ini
    image: docker.io/bitnami/mlflow:2.15.1-debian-12-r0
    imagePullPolicy: IfNotPresent
    name: get-default-auth-conf
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /bitnami/rendered-basic-auth
      name: rendered-basic-auth
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      # First render the overrides
      render-template /bitnami/basic-auth-overrides/*.ini > /tmp/rendered-overrides.ini
      # Loop through the ini overrides and apply it to the final basic_auth.ini
      # read the file line by line
      while IFS='=' read -r key value
      do
        # remove leading and trailing spaces from key and value
        key="$(echo $key | tr -d " ")"
        value="$(echo $value | tr -d " ")"

        ini-file set -s mlflow -k "$key" -v "$value" /bitnami/rendered-basic-auth/basic_auth.ini
      done < "/tmp/rendered-overrides.ini"
      # Remove temporary files
      rm /tmp/rendered-overrides.ini
    env:
    - name: MLFLOW_DATABASE_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: mlflow.iris-devops-mlflow-test-postgres.credentials.postgresql.acid.zalan.do
    - name: MLFLOW_DATABASE_AUTH_URI
      value: postgresql://mlflow:$(MLFLOW_DATABASE_PASSWORD)@iris-devops-mlflow-test-postgres:5432/mlflow_auth?sslmode=require
    - name: MLFLOW_TRACKING_USERNAME
      valueFrom:
        secretKeyRef:
          key: admin-user
          name: iris-devops-mlflow-test-tracking
    - name: MLFLOW_TRACKING_PASSWORD
      valueFrom:
        secretKeyRef:
          key: admin-password
          name: iris-devops-mlflow-test-tracking
    - name: MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE
      value: path
    image: docker.io/bitnami/os-shell:12-debian-12-r27
    imagePullPolicy: IfNotPresent
    name: render-auth-conf
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /bitnami/basic-auth-overrides
      name: basic-auth-overrides
    - mountPath: /bitnami/rendered-basic-auth
      name: rendered-basic-auth
  - args:
    - -m
    - mlflow.server.auth
    - db
    - upgrade
    - --url
    - postgresql://mlflow:$(MLFLOW_DATABASE_PASSWORD)@iris-devops-mlflow-test-postgres:5432/mlflow_auth?sslmode=require
    command:
    - python
    env:
    - name: MLFLOW_DATABASE_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: mlflow.iris-devops-mlflow-test-postgres.credentials.postgresql.acid.zalan.do
    image: docker.io/bitnami/mlflow:2.15.1-debian-12-r0
    imagePullPolicy: IfNotPresent
    name: upgrade-db-auth
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      retry_while() {
        local -r cmd="${1:?cmd is missing}"
        local -r retries="${2:-12}"
        local -r sleep_time="${3:-5}"
        local return_value=1

        read -r -a command <<< "$cmd"
        for ((i = 1 ; i <= retries ; i+=1 )); do
            "${command[@]}" && return_value=0 && break
            sleep "$sleep_time"
        done
        return $return_value
      }

      check_host() {
          local -r host="${1:-?missing host}"
          local -r port="${2:-?missing port}"
          if wait-for-port --timeout=5 --host=${host} --state=inuse $port ; then
             return 0
          else
             return 1
          fi
      }

      echo "Checking connection to s3.amazonaws.com:443"
      if ! retry_while "check_host s3.amazonaws.com 443"; then
          echo "Connection error"
          exit 1
      fi

      echo "Connection success"
      exit 0
    image: 693612562064.dkr.ecr.eu-central-1.amazonaws.com/docker.io/bitnami/os-shell:12-debian-12-r27
    imagePullPolicy: IfNotPresent
    name: wait-for-s3
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  nodeName: ip-10-208-18-75.eu-central-1.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1001
    fsGroupChangePolicy: Always
  serviceAccount: iris-devops-mlflow-test-tracking
  serviceAccountName: iris-devops-mlflow-test-tracking
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp
  - emptyDir: {}
    name: mlruns
  - emptyDir: {}
    name: mlartifacts
  - configMap:
      defaultMode: 420
      name: iris-devops-mlflow-test-tracking-auth-overrides
    name: basic-auth-overrides
  - emptyDir: {}
    name: rendered-basic-auth
  - emptyDir: {}
    name: data

@frittentheke (Contributor)

@Jasper-Ben

mlflow boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/tmpl8tifcwt/input_example.json to <BUCKET_NAME>/2/d4780a6c6c50403fab62785c7a08d8db/artifacts/iris_model/input_example.json: An error occurred (PermanentRedirect) when calling the PutObject operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

I suppose the bucket resides in some region and AWS does not like you to continue using the global S3 hostname.
See e.g. thoughtbot/paperclip#2151 on how setting the endpoint to the correct regional endpoint fixes things.

See https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region for list of endpoints.
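For illustration, a hedged sketch of what the regional configuration looks like in the chart values — the bucket name and region here are placeholders, and the key names are the `externalS3.*` parameters from the issue report above:

```yaml
# Hypothetical values.yaml excerpt -- replace bucket and region with your own.
# Use the region-specific endpoint from the AWS endpoint list linked above,
# NOT the global s3.amazonaws.com host.
externalS3:
  host: s3.eu-central-1.amazonaws.com  # regional endpoint
  bucket: my-mlflow-bucket             # placeholder bucket name
  port: 443
  protocol: https
  serveArtifacts: true
```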

Jasper-Ben (Contributor) commented Aug 21, 2024

I figured it out (also thanks to @frittentheke).

It works when using a regional endpoint, regardless of the addressing style.

It turns out that the HTTP endpoint is always regional, in contrast to the s3 endpoint (`https://<bucket_name>.s3.eu-central-1.amazonaws.com` vs `s3://<bucket_name>`). (Yes, that is confusing.) I probably knew that at some point, but the information was purged from my brain, so I had to rediscover it. When the `MLFLOW_S3_ENDPOINT_URL` environment variable is set, MLflow uses the HTTP endpoint.

So the reason why @andresbono tested successfully with the host set to `s3.amazonaws.com` is that he just happened to test with a bucket deployed in us-east-1, which is the region AWS defaults to for HTTP endpoints when no region-specific endpoint is used (see: https://stackoverflow.com/questions/51611874/access-amazon-s3-bucket-without-region-end-point/51612461#51612461).

We use a bucket in eu-central-1, which is why just setting externalS3.host=s3.amazonaws.com breaks for us.

So basically the fix here is to always use the regional endpoint; everything else will just cause confusion. What I would do: delete/update comment #23959 (comment) to list only the regional endpoint as ✅, and update the example at https://github.com/bitnami/charts/blob/main/bitnami/mlflow/README.md?plain=1#L451C74-L451C83 to use a regional endpoint, with a note to pick the appropriate regional endpoint from https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region. For the latter I will create a PR.

Also, maybe someone else could verify/reproduce my findings, just in case?
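To make the global-vs-regional distinction concrete, here is a minimal sketch (not part of the chart or MLflow itself; the helper name is made up) that builds the region-specific HTTP endpoint an `MLFLOW_S3_ENDPOINT_URL` value should point at, following the endpoint format in the AWS docs linked above:

```python
def regional_s3_endpoint(region: str, protocol: str = "https", port: int = 443) -> str:
    """Build the region-specific S3 HTTP endpoint URL.

    The bare host s3.amazonaws.com silently resolves to us-east-1 for
    HTTP-style access, which is why tests against a us-east-1 bucket
    appear to work while buckets in other regions fail with
    PermanentRedirect or NoSuchKey errors.
    """
    return f"{protocol}://s3.{region}.amazonaws.com:{port}"


# Endpoint for a bucket in eu-central-1:
print(regional_s3_endpoint("eu-central-1"))
# -> https://s3.eu-central-1.amazonaws.com:443
```

The resulting URL (host, port, and protocol) maps onto the chart's `externalS3.host`, `externalS3.port`, and `externalS3.protocol` values.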

Jasper-Ben added a commit to Jasper-Ben/charts that referenced this issue Aug 21, 2024
AWS S3 HTTP endpoints are always regional, with `s3.amazonaws.com`
defaulting to us-east-1. To avoid issues when using buckets from
other regions, we should encourage including region code in the host.

Fixes: bitnami#23959

Signed-off-by: Jasper Orschulko <[email protected]>
Jasper-Ben (Contributor) commented Aug 21, 2024

The only thing that still baffles me is that I am 100% certain I tried including the region code before. But maybe there was something else misconfigured at that time. Which I guess shows that it is good to challenge your own assumptions.

Edit: the original issue description also included the region, so apparently it was broken at some point but maybe got fixed in MLflow itself? Especially since, from an AWS S3 use-case perspective, nothing relevant has changed in the chart (as far as I can tell), as I already mentioned in #23959 (comment). Super weird.

Edit edit: Ok, the initial bug description also contained the bucket name in the host, so maybe we really just never tested it with just the regional endpoint. Well, hopefully the updated value description in my PR will eliminate any remaining question marks that dummies like myself could have in the future, and we can finally close this issue once and for all 😅

dgomezleon pushed a commit that referenced this issue Aug 22, 2024
* [bitnami/mlflow] Update externalS3.host example

AWS S3 HTTP endpoints are always regional, with `s3.amazonaws.com`
defaulting to us-east-1. To avoid issues when using buckets from
other regions, we should encourage including region code in the host.

Fixes: #23959

Signed-off-by: Jasper Orschulko <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Bitnami Containers <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Bitnami Containers <[email protected]>

---------

Signed-off-by: Jasper Orschulko <[email protected]>
Signed-off-by: Bitnami Containers <[email protected]>
Co-authored-by: Bitnami Containers <[email protected]>