
Getting Nan loss when training dlrm with Kaggle Criteo dataset #363

Open

ZhanqiuHu opened this issue Oct 26, 2023 · 7 comments

Comments

@ZhanqiuHu
Hello,

I'm running some training with the Kaggle Criteo dataset, and here is the command I ran:

torchx run -s local_cwd dist.ddp -j 1x1 --script dlrm_main.py --\
    --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
    --pin_memory \
    --mmap_mode \
    --batch_size 128 \
    --test_batch_size 16384 \
    --learning_rate 0.001 \
    --dataset_name criteo_kaggle \
    --dense_arch_layer_sizes "13,512,256,64,16" \
    --over_arch_layer_sizes "512,256,1" \
    --epochs 10 \
    --embedding_dim 16 \
    --validation_freq_within_epoch 1024 \
    --shuffle_batches

The model hyperparameters I chose follow this example script. I'm getting NaN loss for some iterations. The preprocessed dataset does not contain NaN values, and I have tried starting learning rates of 0.1, 0.01, and 0.001, but I always get NaN results. Is there something I'm doing wrong here? What might be causing this issue?

Thanks!

@ZhanqiuHu
Author

It seems that running torchrec.datasets.scripts.npy_preproc_criteo triggers "RuntimeWarning: divide by zero encountered in log". Is there a workaround for that?
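For reference, the warning can be reproduced in isolation. The preprocessing applies log(x + 3) to the dense columns (per the commit message later in this thread), so any entry equal to -3 hits log(0). The sample values below are illustrative, not taken from the dataset:

```python
import numpy as np

# Illustrative dense values; -3 is the smallest value observed in the
# Kaggle Criteo dense columns.
dense = np.array([-3.0, 0.0, 1.0, 100.0])

# Emits "RuntimeWarning: divide by zero encountered in log" and produces -inf
# for the -3 entry; any later arithmetic on it can turn into NaN loss.
transformed = np.log(dense + 3)
```

Once a single -inf enters the dense features, gradients and the loss can become NaN downstream, which matches the training symptom reported above.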

@mnaumovfb
Contributor

What happens when you run the test and bench scripts as shown in the documentation?
./test/dlrm_s_test.sh
./bench/dlrm_s_criteo_kaggle.sh --test-freq=1024

@TomekWei

Hi, I also get NaN when running DLRM with TorchRec. Did you solve it? I found that there are some -inf values in the preprocessed Kaggle Criteo dataset. I'm not sure whether the torch team handled it.

@ZhanqiuHu
Author

I think one of the preprocessing operations in the script is causing the problem. I ended up using some custom preprocessing steps instead of torchrec.datasets.scripts.npy_preproc_criteo.

@TomekWei

I'm also trying to do that. If you still have that script, would you mind sharing it with me? Thanks a lot for your response.

@ZhanqiuHu
Author

Sorry, I'm not working on this anymore, so I didn't keep a copy of the code. I remember using part of the torchrec.datasets.scripts.npy_preproc_criteo code to decode the text into values, which produced a set of numpy files, and then normalizing the dense values separately. Hope this helps!
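A hypothetical sketch of the two-stage approach described here: decode first (via the torchrec script), then normalize the resulting dense arrays. Random data stands in for the decoded .npy files, and per-column standardization is just one reasonable choice of normalization, not the exact step the author used:

```python
import numpy as np

# Stand-in for dense features loaded from the decoded .npy files
# (1000 rows, 13 dense columns as in Criteo).
rng = np.random.default_rng(0)
dense = rng.normal(loc=5.0, scale=2.0, size=(1000, 13)).astype(np.float32)

# Per-column standardization; guard against constant columns so we never
# divide by zero (the failure mode discussed in this thread).
mean = dense.mean(axis=0)
std = dense.std(axis=0)
std[std == 0] = 1.0
dense_norm = (dense - mean) / std
```

Because the shift and scale are computed from the data itself, this avoids the fixed-offset log transform that produced -inf on the raw values.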

@TomekWei

It's ok. Thank you very much.

TomekWei added a commit to TomekWei/torchrec that referenced this issue Jun 21, 2024
The original script simply added 3 to the value before taking the log, so in data preprocessing an input value of -3 produced -inf. This problem was mentioned in facebookresearch/dlrm#363 (comment). I changed the preprocessing operation to dense_np -= dense_np.min() - 2 in the tsv_to_npys function, which correctly handles the Criteo Kaggle dataset.
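The revised shift from the commit can be sketched in isolation: subtracting dense_np.min() - 2 maps the smallest entry to 2, so the subsequent log is always finite. The sample values are illustrative:

```python
import numpy as np

# With the original fixed +3 offset, the -3 entry would map to log(0) = -inf.
dense_np = np.array([-3.0, 0.0, 1.0, 100.0])

# Revised shift: the smallest entry becomes exactly 2, so log() stays finite.
dense_np -= dense_np.min() - 2
transformed = np.log(dense_np)  # all finite; smallest entry is log(2)
```

Unlike the fixed offset, this works for any minimum value, though note the shift then depends on the data split it is computed from.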