
Problem with "pytorch-dist-mnist-test:v1.0" image in example notebook "create-pytorchjob.ipynb" #2266

Open
saileshd1402 opened this issue Sep 21, 2024 · 1 comment

Comments

@saileshd1402
Contributor

saileshd1402 commented Sep 21, 2024

What happened?

When I run the examples/pytorch/image-classification/create-pytorchjob.ipynb notebook, the "pytorch-dist-mnist-test:v1.0" image tries to download the MNIST training dataset from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz, but that URL currently returns HTTP 403 (Forbidden).

Error:

 Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "/var/mnist.py", line 150, in <module>
    main()
  File "/var/mnist.py", line 123, in main
    transforms.Normalize((0.1307,), (0.3081,))
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
    epoch, batch_idx * len(data), len(train_loader.dataset),
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
    if should_distribute():
  File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

The same dataset appears to be hosted at https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz, which could be used as a replacement.

ref: https://github.com/pytorch/vision/blob/6d7851bd5e2bedc294e40e90532f0e375fcfee04/torchvision/datasets/mnist.py#L39
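
For illustration, here is a minimal sketch of the mirror-fallback idea, assuming the two base URLs mentioned in this issue. The helper name `download_with_fallback`, its parameters, and the mirror ordering are hypothetical; this is not the code in the image's mnist.py or in torchvision, only one possible way to avoid depending on a single host.

```python
import os
import urllib.error
import urllib.request

# Base URLs taken from this issue; trying the S3 mirror first is an assumption.
MNIST_MIRRORS = [
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
    "http://yann.lecun.com/exdb/mnist/",
]


def download_with_fallback(filename, dest_dir, mirrors=MNIST_MIRRORS,
                           opener=urllib.request.urlopen):
    """Hypothetical helper: try each mirror in order until one serves `filename`."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, filename)
    last_error = None
    for base in mirrors:
        url = base + filename
        try:
            with opener(url) as response:
                data = response.read()
        except urllib.error.URLError as err:
            # HTTPError (e.g. 403 Forbidden) is a subclass of URLError.
            print(f"Failed to download {url}: {err}")
            last_error = err
            continue
        with open(dest_path, "wb") as f:
            f.write(data)
        return dest_path
    raise RuntimeError(f"No mirror could serve {filename}") from last_error
```

Calling `download_with_fallback("train-images-idx3-ubyte.gz", "./data")` would try the S3 mirror first and only fall back to the original host if that request fails. The `opener` parameter exists only so the function can be exercised without network access.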

What did you expect to happen?

Ideally, the "pytorch-dist-mnist-test:v1.0" image should be updated, or a replacement image should be provided.

Environment

Kubernetes version:

$ kubectl version

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

kubeflow/training-operator:latest

Training Operator Python SDK version:

$ pip show kubeflow-training

Name: kubeflow-training
Version: 1.8.1
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /home/ubuntu/.kflowenv/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@YosiElias
Contributor

/assign
