No response in training process #69

dexter2406 · 2020-11-24T14:13:20Z

Hi, I found that the program stops responding when I start training. The output is shown below, and no error is reported.

 np_resource = np.dtype([("resource", np.ubyte, 1)])
{'add_dispnet': True,
 'add_flownet': False,
 'add_posenet': True,
 'alpha_recon_image': 0.85,
 'batch_size': 4,
 'checkpoint_dir': 'models\\geonet_posenet\\results',
 'dataset_dir': 'data\\kitti\\formatted_data',
 'depth_test_split': 'eigen',
 'disp_smooth_weight': 0.5,
 'dispnet_encoder': 'resnet50',
...
 'output_dir': None,
 'pose_test_seq': 9,
 'rigid_warp_weight': 1.0,
 'save_ckpt_freq': 5000,
 'scale_normalize': False,
 'seq_length': 5}
2020-11-24 15:04:21.853792: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE
instructions, but these are available on your machine and could speed up CPU computations.
...
2020-11-24 15:04:21.933181: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA
instructions, but these are available on your machine and could speed up CPU computations.
Trainable variables:
depth_net/Conv/weights:0
depth_net/Conv/BatchNorm/beta:0
depth_net/Conv_1/weights:0
depth_net/Conv_1/BatchNorm/beta:0
depth_net/Conv_2/weights:0
...
pose_net/Conv_3/BatchNorm/beta:0
pose_net/Conv_4/weights:0
pose_net/Conv_4/BatchNorm/beta:0
pose_net/Conv_5/weights:0
pose_net/Conv_5/BatchNorm/beta:0
pose_net/Conv_6/weights:0
pose_net/Conv_6/BatchNorm/beta:0
pose_net/Conv_7/weights:0
pose_net/Conv_7/biases:0
parameter_count = 60047292


dexter2406 · 2020-11-24T14:33:56Z

I waited for about 20 minutes and noticed that the following files were generated:

graph.pbtxt
events.out.tfevents.1606226671.DESKTOP-AVNMGK4

There is still no progress shown, though. Maybe that is because your code has no visualization of the training process? And what are these two files for?

Thanks for your time!

yzcjtr · 2020-11-24T19:30:20Z

Hi, can you confirm the library versions you are using? Judging from the output above, training hasn't started at all; otherwise, the loss value for each iteration would be printed.
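One way to collect the versions in a single pass is sketched below. It is only an illustration: the `report_versions` helper and the `MODULES` list are hypothetical names, not part of this repo, and the list simply assumes the usual import names for the packages involved (`cv2` for opencv-python, `PIL` for pillow).

```python
import importlib
import sys

# Assumed import names (not pip names): opencv-python -> cv2, pillow -> PIL.
MODULES = ["tensorflow", "numpy", "scipy", "matplotlib", "cv2", "PIL"]


def report_versions(module_names):
    """Return {module name: version string} for each requested module."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:
            # A missing or broken install should not abort the whole report.
            versions[name] = "import failed ({})".format(type(exc).__name__)
            continue
        # Not every module defines __version__, so fall back to "unknown".
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, version in report_versions(MODULES).items():
        print(name, version)
```

Running this inside the same environment used for training avoids reporting versions from a different interpreter.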

dexter2406 · 2020-11-25T08:32:49Z

Thanks for the reply. I'm using (mainly):

python=3.6.12
tensorflow==1.2.0
scipy==1.1.0
numpy==1.19.4
matplotlib==3.3.3
opencv-python==4.4.0
pillow==8.0.1

I know it's stated that this code has only been tested with python==2.7 and tf==1.1, but those versions are no longer supported, so I tried newer ones. I modified the code slightly according to the error reports, but then I ran into this hang and couldn't tell what went wrong.

yzcjtr · 2020-11-26T16:53:09Z

TF 1.2 should be alright, but I'm not sure whether Python 3 is okay for this repo. I would suggest adding some checkpoints in the code to locate where it's stuck.
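A minimal sketch of that idea, assuming "checkpoints" here means progress markers in the code rather than model checkpoints. The `checkpoint` and `watch_for_hang` helpers and the labels are hypothetical and would need to be placed around the actual steps of the training script. It uses only the Python 3 standard library: flushed prints show the last step reached, and `faulthandler.dump_traceback_later` periodically dumps every thread's stack, which shows the exact line the process is blocked on.

```python
import faulthandler
import sys
import time

_START = time.time()


def checkpoint(label, stream=sys.stderr):
    """Write a timestamped marker and flush it, so it is visible even if the next call hangs."""
    message = "[checkpoint +{:.1f}s] {}".format(time.time() - _START, label)
    print(message, file=stream, flush=True)
    return message


def watch_for_hang(timeout_s=300):
    """Dump the stack of every thread to stderr each timeout_s seconds until cancelled."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True)


if __name__ == "__main__":
    watch_for_hang(timeout_s=300)
    checkpoint("before building the graph")
    # (hypothetical) model construction would go here
    checkpoint("before creating the session")
    # (hypothetical) session setup and the training loop would go here
    checkpoint("done")
    faulthandler.cancel_dump_traceback_later()
```

If the last marker printed is the one before session creation, for example, the periodic stack dump should narrow the hang down to a specific call, such as data loading that never produces a batch.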
