
Getting Nan loss when training dlrm with Kaggle Criteo dataset #363

Open

ZhanqiuHu opened this issue Oct 26, 2023 · 7 comments

Comments

@ZhanqiuHu
Hello,

I'm running some training with the Kaggle Criteo dataset, and here is the command I ran:

torchx run -s local_cwd dist.ddp -j 1x1 --script dlrm_main.py --\
    --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
    --pin_memory \
    --mmap_mode \
    --batch_size 128 \
    --test_batch_size 16384 \
    --learning_rate 0.001 \
    --dataset_name criteo_kaggle \
    --dense_arch_layer_sizes "13,512,256,64,16" \
    --over_arch_layer_sizes "512,256,1" \
    --epochs 10 \
    --embedding_dim 16 \
    --validation_freq_within_epoch 1024 \
    --shuffle_batches

The model hyperparameters I chose follow this example script. I'm getting NaN loss for some iterations. The preprocessed dataset does not contain NaN values, and I have tried starting learning rates of 0.1, 0.01, and 0.001, but I always get NaN results. Is there something I'm doing wrong here? What might be causing this issue?

Thanks!

@ZhanqiuHu
Author

It seems that running torchrec.datasets.scripts.npy_preproc_criteo triggers "RuntimeWarning: divide by zero encountered in log". Is there a workaround for that?
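For reference, the warning can be reproduced in isolation. The preprocessing applies log(x + 3) to the dense columns (per the commit message later in this thread), so any entry equal to -3 hits log(0). The sample values below are illustrative, not taken from the dataset:

```python
import numpy as np

# Illustrative dense values; -3 is the smallest value observed in the
# Kaggle Criteo dense columns.
dense = np.array([-3.0, 0.0, 1.0, 100.0])

# Emits "RuntimeWarning: divide by zero encountered in log" and produces -inf
# for the -3 entry; any later arithmetic on it can turn into NaN loss.
transformed = np.log(dense + 3)
```

Once a single -inf enters the dense features, gradients and the loss can become NaN downstream, which matches the training symptom reported above.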

@mnaumovfb
Contributor

What happens when you run the test and bench scripts as shown in the documentation?
./test/dlrm_s_test.sh
./bench/dlrm_s_criteo_kaggle.sh --test-freq=1024

@TomekWei

Hi, I also get NaN when running DLRM with TorchRec. Did you solve it? I found that there are some -inf values in the preprocessed Kaggle Criteo dataset. I'm not sure whether the torch team handled it.

@ZhanqiuHu
Author

I think one of the preprocessing operations in the script is causing the problem. I ended up using some custom preprocessing steps instead of torchrec.datasets.scripts.npy_preproc_criteo.

@TomekWei

I'm also trying to do that. If you still have that script, would you mind sharing it with me? Thanks a lot for your response.

@ZhanqiuHu
Author

Sorry, I'm not working on this anymore, so I didn't keep a copy of the code. I remember using part of the torchrec.datasets.scripts.npy_preproc_criteo code to decode the text into values, which produced a set of numpy files, and then normalizing the dense values separately. Hope this helps!
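A hypothetical sketch of the two-stage approach described here: decode first (via the torchrec script), then normalize the resulting dense arrays. Random data stands in for the decoded .npy files, and per-column standardization is just one reasonable choice of normalization, not the exact step the author used:

```python
import numpy as np

# Stand-in for dense features loaded from the decoded .npy files
# (1000 rows, 13 dense columns as in Criteo).
rng = np.random.default_rng(0)
dense = rng.normal(loc=5.0, scale=2.0, size=(1000, 13)).astype(np.float32)

# Per-column standardization; guard against constant columns so we never
# divide by zero (the failure mode discussed in this thread).
mean = dense.mean(axis=0)
std = dense.std(axis=0)
std[std == 0] = 1.0
dense_norm = (dense - mean) / std
```

Because the shift and scale are computed from the data itself, this avoids the fixed-offset log transform that produced -inf on the raw values.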

@TomekWei

It's ok. Thank you very much.

TomekWei added a commit to TomekWei/torchrec that referenced this issue Jun 21, 2024
The original script simply added 3 to the value before taking the log, so in data preprocessing an input value of -3 produced -inf. This problem was mentioned in facebookresearch/dlrm#363 (comment). I changed the preprocessing operation to dense_np -= dense_np.min() - 2 in the tsv_to_npys function, which correctly handles the Criteo Kaggle dataset.
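The revised shift from the commit can be sketched in isolation: subtracting dense_np.min() - 2 maps the smallest entry to 2, so the subsequent log is always finite. The sample values are illustrative:

```python
import numpy as np

# With the original fixed +3 offset, the -3 entry would map to log(0) = -inf.
dense_np = np.array([-3.0, 0.0, 1.0, 100.0])

# Revised shift: the smallest entry becomes exactly 2, so log() stays finite.
dense_np -= dense_np.min() - 2
transformed = np.log(dense_np)  # all finite; smallest entry is log(2)
```

Unlike the fixed offset, this works for any minimum value, though note the shift then depends on the data split it is computed from.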