
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2 #33

Closed
EmreOzkose opened this issue Sep 2, 2021 · 19 comments


@EmreOzkose
Contributor

Hello,

I am training a TDNN-LSTM model with the librispeech recipe on 16 kHz, 100-hour data. After training, I run decode.py and sometimes observe the CUDA issue given below. Have you ever observed something like that? I think it is related to something during training, because after some training runs decode.py works well, while after others it gives this error. I googled the RuntimeError: Specified device cuda:0 does not match device of data cuda:-2 error but found nothing. I have a Tesla P100 with 16 GB. I should also mention that 1best works well; the problem occurs during nbest and rescoring.

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python tdnn_lstm_ctc/decode.py --avg 1 --epoch 9
2021-09-02 14:24:46,677 INFO [decode.py:324] Decoding started
2021-09-02 14:24:46,678 INFO [decode.py:325] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 1, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest-rescoring', 'num_paths': 10, 'epoch': 9, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 14:24:47,880 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 14:24:48,469 INFO [decode.py:334] device: cuda:0
2021-09-02 14:25:02,211 INFO [decode.py:362] Loading pre-compiled G_4_gram.pt
2021-09-02 14:25:02,846 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-9.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 14:25:07,886 INFO [decode.py:271] batch 0, cuts processed until now is 1/171 (0.584795%)
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 432, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 415, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 250, in decode_dataset
    hyps_dict = decode_one_batch(
  File "tdnn_lstm_ctc/decode.py", line 190, in decode_one_batch
    best_path_dict = rescore_with_n_best_list(
  File "/path/to/k2/icefall/icefall/decode.py", line 405, in rescore_with_n_best_list
    am_scores, _ = compute_am_and_lm_scores(
  File "/path/to/k2/icefall/icefall/decode.py", line 297, in compute_am_and_lm_scores
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 160, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f41692162f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f416921367b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f40c8316200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f40c83fc0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f40c8372bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f40c837658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f40c838d876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f40c830bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f41c016d41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #54: __libc_start_main + 0xe7 (0x7f41f24cbb97 in /lib/x86_64-linux-gnu/libc.so.6)
@EmreOzkose
Contributor Author

Note that it could also be a memory issue, since my GPU memory is small (16 GB). However, if the problem were a memory issue, I would expect to see an error like:

RuntimeError: CUDA out of memory. Tried to allocate 420.00 MiB (GPU 0; 15.90 GiB total capacity; 3.23 GiB already allocated; 168.75 MiB free; 3.56 GiB reserved in total by PyTorch)

@danpovey
Collaborator

danpovey commented Sep 2, 2021

Perhaps it's trying to use >1 GPU somehow? (But it shouldn't.) If that's the case, setting something like CUDA_VISIBLE_DEVICES=0 (or whatever) should address it. Another possibility is that cuda:-2 is not a real device but some kind of error code; that error message likely comes from torch. I think it would be worthwhile to try to catch the error in pdb and print out the devices of all inputs to the function that failed. Once we know which object has the bad device, we can more easily debug.
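
A minimal sketch of that debugging flow (the script path and flags mirror the commands already shown in this thread; runpy and pdb.post_mortem are standard Python facilities):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set before torch initializes CUDA

import pdb
import runpy
import sys

# Run the recipe's decode script in-process so pdb can see its frames.
sys.argv = ["decode.py", "--avg", "1", "--epoch", "9"]
try:
    runpy.run_path("tdnn_lstm_ctc/decode.py", run_name="__main__")
except RuntimeError:
    pdb.post_mortem()  # inspect src.device, index.device in the failing frame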

@csukuangfj
Collaborator

 File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)

Could you modify /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py, line 66, to add a print before the call:

print(src.device, index.device)
return _k2.index_select(src, index, default_value)

It may show something that is useful.

@EmreOzkose
Contributor Author

@csukuangfj I already printed the devices before, but all of them were cuda:0.

@EmreOzkose
Contributor Author

EmreOzkose commented Sep 2, 2021

@danpovey I have 4 devices, but before training I set CUDA_VISIBLE_DEVICES=0. I will also try to debug with pdb.

@EmreOzkose
Contributor Author

I added a try-except block to the function decode_one_batch() in decode.py:

try:
    best_path = nbest_decoding(
        lattice=lattice,
        num_paths=params.num_paths,
        use_double_scores=params.use_double_scores,
    )
except:
    breakpoint()

When I run python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8:

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(3)<module>()
-> import os
(Pdb) c
2021-09-02 15:43:01,990 INFO [decode.py:330] Decoding started
2021-09-02 15:43:01,990 INFO [decode.py:331] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 15:43:02,604 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 15:43:02,963 INFO [decode.py:340] device: cuda:0
2021-09-02 15:43:09,784 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 15:43:11,389 INFO [decode.py:277] batch 0, cuts processed until now is 1/171 (0.584795%)
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(185)decode_one_batch()
-> key = f"no_rescore-{params.num_paths}"
(Pdb) lattice.device
device(type='cuda', index=0)
(Pdb) 

The problem occurs in nbest_decoding(). Only the lattice is passed to that function, and its device is cuda:0.

@danpovey
Collaborator

danpovey commented Sep 2, 2021

I think you are not quite at the place where it failed; you may need to do "c" (continue)?

@EmreOzkose
Contributor Author

When I don't add the try-except block, the log is:

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(3)<module>()
-> import os
(Pdb) c
2021-09-02 16:33:33,700 INFO [decode.py:327] Decoding started
2021-09-02 16:33:33,701 INFO [decode.py:328] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 16:33:34,178 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 16:33:34,494 INFO [decode.py:337] device: cuda:0
2021-09-02 16:33:45,349 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 16:33:47,481 INFO [decode.py:274] batch 0, cuts processed until now is 1/171 (0.584795%)
Traceback (most recent call last):
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1705, in main
    pdb._runscript(mainpyfile)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1573, in _runscript
    self.run(statement)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/bdb.py", line 580, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 3, in <module>
    import os
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 418, in main
    results_dict = decode_dataset(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
    hyps_dict = decode_one_batch(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
    best_path = nbest_decoding(
  File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 160, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f359b5e32f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f359b5e067b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f34fa699200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f34fa77f0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f34fa6f5bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f34fa6f958f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f34fa710876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f34fa68efcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f35f253a41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(66)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) lattice.device
*** NameError: name 'lattice' is not defined
(Pdb) 

I can't reach lattice after the error, hence I added the try-except block.

@EmreOzkose
Contributor Author

EmreOzkose commented Sep 2, 2021

I added a breakpoint at the place @csukuangfj suggested. The log is here:

(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
Traceback (most recent call last):
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1705, in main
    pdb._runscript(mainpyfile)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1573, in _runscript
    self.run(statement)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/bdb.py", line 580, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 435, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 418, in main
    results_dict = decode_dataset(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
    hyps_dict = decode_one_batch(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
    best_path = nbest_decoding(
  File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 161, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 67, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe9a54c82f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fe9a54c567b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7fe904576200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7fe90465c0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7fe9045d2bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7fe9045d658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7fe9045ed876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7fe90456bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7fe9fc41f41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) 

The relevant place in miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py:

65: breakpoint()
66: return _k2.index_select(src, index, default_value)

@danpovey
Collaborator

danpovey commented Sep 2, 2021 via email

@danpovey
Collaborator

danpovey commented Sep 2, 2021 via email

@csukuangfj
Collaborator

https://k2.readthedocs.io/en/latest/installation/for_developers.html

The above link contains instructions to build a debug version of k2.

@csukuangfj
Collaborator

I added a breakpoint at the place @csukuangfj suggested. The log is here:

Could you also print the shapes of src and index?

print(src.shape)
print(index.shape)

to verify that neither of them is empty?

@EmreOzkose
Contributor Author

I checked whether index or src is empty, and noticed that index is empty when the problem occurs; a minimal sketch of this trigger follows the log below.

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
2021-09-03 08:14:46,220 INFO [decode.py:327] Decoding started
2021-09-03 08:14:46,220 INFO [decode.py:328] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-03 08:14:46,837 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-03 08:14:47,150 INFO [decode.py:337] device: cuda:0
2021-09-03 08:14:55,636 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([562]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
2021-09-03 08:14:57,654 INFO [decode.py:274] batch 0, cuts processed until now is 1/171 (0.584795%)
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([2322]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([1308]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([0])
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 435, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 418, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
    hyps_dict = decode_one_batch(
  File "tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
    best_path = nbest_decoding(
  File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 163, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 69, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f42803f82f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f42803f567b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f41df4f8200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f41df5de0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f41df554bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f41df55858f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f41df56f876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f41df4edfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f42d734f41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #52: __libc_start_main + 0xe7 (0x7f43096adb97 in /lib/x86_64-linux-gnu/libc.so.6)
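
A minimal sketch of the suspected trigger, based on the torch.Size([0]) index printed above. That an empty index alone reproduces the error on k2 v1.3 is an assumption; k2.index_select is the Python wrapper around the _k2.index_select call in the traceback.

import torch
import k2

# Hypothetical minimal trigger (assumption): an empty int32 index tensor
# passed to k2.index_select on CUDA, mirroring the shapes printed above.
src = torch.randn(183094, device="cuda:0")
index = torch.zeros(0, dtype=torch.int32, device="cuda:0")  # empty, as observed

# On k2 v1.3 this reportedly raised
#   RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
# On fixed versions it should simply return an empty tensor.
ans = k2.index_select(src, index)
print(ans.shape)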

@csukuangfj
Collaborator

@EmreOzkose
Could you show us the version of k2 you are using?

$ python3 -m k2.version

should give you such information.

@EmreOzkose
Contributor Author

@csukuangfj
My version info is:

Collecting environment information...

k2 version: 1.3
Build type: Release
Git SHA1: 6b8a10fa95213da285b8fce6525b2c5ed42198a6
Git date: Tue Aug 3 05:36:48 2021
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.5
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 16.04.7 LTS
CMake version: 3.18.4
GCC version: 5.5.0
CMAKE_CUDA_FLAGS:  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.8.1
PyTorch is using Cuda: 11.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

I think I understand the issue. I am trying different architectures and features. Since my GPU memory is small, when I increase the number of layers in the model, I have to decrease max_frames. When I use a small number of frames (like 5000), index comes out empty for some batches (see the guard sketch below).
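
A decode-side guard along these lines would likely sidestep the crash until the underlying bug is fixed. This is a hedged sketch, not icefall's fix: the helper names come from icefall.decode in the tracebacks above, but the except-based fallback is an assumption.

from icefall.decode import nbest_decoding, one_best_decoding

def nbest_with_fallback(lattice, num_paths, use_double_scores=True):
    try:
        return nbest_decoding(
            lattice=lattice,
            num_paths=num_paths,
            use_double_scores=use_double_scores,
        )
    except RuntimeError:
        # On k2 v1.3, an empty arc map / index triggers the cuda:-2 error here;
        # fall back to 1best decoding, which works per the report above.
        return one_best_decoding(
            lattice=lattice, use_double_scores=use_double_scores
        )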

@csukuangfj
Collaborator

I would recommend updating your k2.

k2 v1.6 contains several bug fixes, including, I think, the one you are facing.
As you are using conda, the steps to update k2 are fairly simple. Please see
https://k2.readthedocs.io/en/latest/installation/conda.html

@EmreOzkose
Contributor Author

Thank you so much! I am updating at once.

@EmreOzkose
Contributor Author

I want to report back here. I updated k2 and ran decode.py again. The problem no longer occurs, thank you. However, the hyps are coming out empty :). From now on, it is my design's problem :).
