TPU-V4 #255

wimjan123 · 2023-03-05T21:58:04Z

How can one use this project to fine-tune using a TPU-v4 instance?
I tried everything, but always get errors.
Most commonly:

UserWarning: cloud_tpu_init failed: KeyError('v4-8')
This a JAX bug; please report an issue at https://github.com/google/jax/issues
_warn(f"cloud_tpu_init failed: {repr(exc)}\n This a JAX bug; please report "
2023-03-05 21:55:43.305762: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-05 21:55:43.941977: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-05 21:55:43.942070: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-05 21:55:43.942076: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Traceback (most recent call last):
File "device_train.py", line 191, in
raise ValueError(msg)
ValueError: each shard needs a separate device, but device count (1) < shard count (4)

Wingie · 2023-03-06T19:44:16Z

Your installed JAX version is wrong for your TPU version (this repo is old).
Basically, you have to keep trying installations and images (I use image v2-alpha on a TPU v3-8).
Once this command works, JAX is installed and working on your TPU.

python3 -c "import jax; print(jax.devices())"  # should print TpuDevice
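For a slightly more explicit version of the one-line check above, here is a minimal sketch. The helper names (`summarize_platforms`, `tpu_visible`, `report`) are made up for illustration; the only JAX call used is `jax.devices()`, and the only device attribute read is `platform`.

```python
from collections import Counter


def summarize_platforms(devices):
    """Count devices per platform name, e.g. {"tpu": 4}."""
    return dict(Counter(d.platform for d in devices))


def tpu_visible(devices):
    """Return True if at least one device reports the "tpu" platform."""
    return any(d.platform == "tpu" for d in devices)


def report():
    """Print what JAX can see on this machine (hypothetical helper)."""
    import jax  # imported lazily so the helpers above work without JAX

    devices = jax.devices()
    print(summarize_platforms(devices))
    if not tpu_visible(devices):
        print("No TPU visible to JAX; it is using another backend.")
```

Calling `report()` on the TPU VM should list only `tpu` devices if the install matches the runtime image; if it lists `cpu`, JAX is not seeing the TPU.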

Also, your libcudart errors mean you need to uninstall tensorflow and install tensorflow-cpu, since there is no GPU on a TPU device.

I would recommend going through https://github.com/ayaka14732/tpu-starter; it can help with some of the errors you face.

wimjan123 · 2023-03-06T21:28:49Z

I use V2-alpha-tpu4 on TPUv4-8.
The command to check if jax is installed returns this:
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]

leejason · 2023-03-07T10:14:00Z

According to the following, the number of TPU cores has changed from 8 to 4 for TPU v4.

Display the number of TPU cores available:
jax.device_count()
The number of TPU cores is displayed. If you are using a v4 TPU, this should be 4. If you are using a v2 or v3 TPU, this should be 8.

(source: https://cloud.google.com/tpu/docs/run-calculation-jax)
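To make the link to the earlier traceback concrete, here is a rough sketch of comparing the visible device count against the configured shard count before training starts. `check_shard_fit` and `current_device_count` are hypothetical helpers, not code from this repo; the error text simply mirrors the message in the traceback at the top of the thread.

```python
def check_shard_fit(device_count, cores_per_replica):
    """Raise if there are fewer devices than model shards."""
    if device_count < cores_per_replica:
        raise ValueError(
            "each shard needs a separate device, but device count "
            f"({device_count}) < shard count ({cores_per_replica})"
        )


def current_device_count():
    """Return the number of devices JAX can see (hypothetical helper)."""
    import jax  # imported lazily so check_shard_fit works without JAX

    return jax.device_count()
```

On a v4-8 where `jax.device_count()` returns 4, `check_shard_fit(4, 8)` would raise, while `check_shard_fit(4, 4)` would pass.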

wimjan123 · 2023-03-08T10:03:08Z

Aha, I see. Is there any way to fine-tune GPT-J using 4 TPU cores?

leejason · 2023-03-08T10:14:39Z

> Aha, I see. Is there any way to fine tune gpt-j using 4 tpu cores?

I changed the following from 8 to 4 in the configuration file:

"cores_per_replica": 4
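If you would rather script that edit than change the file by hand, something like the following should work, assuming the configuration is a plain JSON file with a top-level "cores_per_replica" key. The function names and file paths are placeholders, not part of this repo.

```python
import json


def set_cores_per_replica(config, cores):
    """Return a copy of the config dict with cores_per_replica replaced."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    updated = dict(config)  # shallow copy; the input dict is left untouched
    updated["cores_per_replica"] = cores
    return updated


def rewrite_config(src_path, dst_path, cores):
    """Read a JSON config, override cores_per_replica, write a new file."""
    with open(src_path) as f:
        config = json.load(f)
    with open(dst_path, "w") as f:
        json.dump(set_cores_per_replica(config, cores), f, indent=2)
```

For example, `rewrite_config("my_config.json", "my_config_v4.json", 4)` writes a copy of a (hypothetical) `my_config.json` with the value set to 4 and leaves the original file alone.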

wimjan123 · 2023-03-08T10:21:02Z

If I do that, I get an "AssertionError: Incompatible checkpoints" error.

leejason · 2023-03-08T13:05:04Z

> If I do that, I get a "AssertionError: Incompatible checkpoints" error

I forgot to mention that my change was for pre-training from scratch. The compatibility problem above seems like a valid issue, since it's not clear whether checkpoints saved on 8 cores can work on 4 cores.

wimjan123 · 2023-03-08T13:09:35Z

Is there any way to convert the checkpoints to, let's say, 4 shards?

leejason · 2023-03-08T14:15:02Z

> Is there any way to convert the checkpoints to, let's say, 4 shards?

No idea, but I guess not, and I didn't try. I plan to move forward to TPU v4.

mosmos6 · 2024-01-03T09:50:35Z

I'm curious how this attempt turned out. Has anyone succeeded in running GPT-J on TPU v4?

sokarblue13 · 2024-01-03T10:01:51Z

PARTE 1 YARSY.txt
