
Training speed #6

Open · random649 opened this issue Sep 9, 2021 · 8 comments

@random649

Hi, I'm training a 3D model (engine) with your code and I followed the steps in the README exactly, but training runs too slowly (it would take more than 1,000 hours to finish). Where is the problem?
(I used a GeForce RTX 2080 Ti)

@davelindell
Collaborator

The maximum number of iterations in the training script is probably much more than is necessary for the model to converge. How many iterations are you running it for? Does the loss decrease and begin to converge?

The training script will save out model checkpoints at intermediate points, and you can try saving out meshes from these models to see how they look.
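For reference, pulling a mesh out of an intermediate checkpoint can look roughly like the sketch below: evaluate occupancy on a dense grid and run marching cubes on it. This is a generic sketch rather than the repo's actual export code; the checkpoint path, model construction, grid resolution, and the 0.5 threshold are placeholders to adapt to the training script.

```python
# Hypothetical sketch: export a mesh from an intermediate checkpoint by
# sampling occupancy on a dense grid and running marching cubes.
import numpy as np
import torch
import trimesh
from skimage import measure

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# In practice, rebuild the trained model and load an intermediate checkpoint:
#   model = ...                                       # same architecture as training
#   model.load_state_dict(torch.load('logs/.../model_current.pth'))  # placeholder path
#   model.to(device).eval()
# Stand-in (unit-sphere occupancy) so this sketch runs on its own:
model = lambda pts: (1.0 - pts.norm(dim=-1, keepdim=True)).clamp(0, 1)

res = 128
lin = np.linspace(-1.0, 1.0, res, dtype=np.float32)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing='ij'), axis=-1).reshape(-1, 3)

with torch.no_grad():
    coords = torch.from_numpy(grid).to(device)
    occ = torch.cat([model(c) for c in coords.split(2 ** 17)]).cpu().numpy()

verts, faces, _, _ = measure.marching_cubes(occ.reshape(res, res, res), level=0.5)
trimesh.Trimesh(verts, faces).export('checkpoint_mesh.ply')
```

If the early checkpoints already look reasonable, you can stop training long before the maximum iteration count in the config.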

@davelindell
Collaborator

I haven't been able to replicate this issue, so closing for now. Please follow up if this is still a problem.

@xindonglin99

xindonglin99 commented Apr 14, 2022

I have the same issue training the 3D models. Currently I'm using the default setting in the Thai Statue config file, which is 10,000 epochs. Training ran on a V100 for 24 hours but only completed 60,000 iterations, which is not many epochs. I exported the DAE mesh and it doesn't look converged. Below is a snapshot of that DAE mesh in MeshLab:

[Screenshot: exported DAE mesh viewed in MeshLab]

@davelindell
Collaborator

Training to 60,000 iterations should yield a better result than what you're showing above, so something seems off. We optimize to 48k iters in the paper for the Thai Statue and it looks much more detailed than the above.

Also, it seems strange that it takes 24 hours to get to 60k iterations. I can run around 10 it/s on my laptop GPU (GTX 1650), and at this rate it should only take a couple hours to reach 60k. I guess a V100 should be even faster.

Are you sure you are using the default config file without any changes from the repo? How many workers are you using for the dataloader? Can you also post the tensorboard summaries for the occupancy loss?
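One way to narrow this down is to time the data loading separately from the GPU step. The sketch below is a generic template rather than this repo's training loop; the dataset, model, and batch size are dummy stand-ins, so swap in the actual dataloader and training step to see where the time goes.

```python
# Hypothetical bottleneck check: the dataset and model here are dummy
# placeholders standing in for the real training objects.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    dataset = TensorDataset(torch.randn(100_000, 3), torch.rand(100_000, 1))
    loader = DataLoader(dataset, batch_size=4096, num_workers=4, pin_memory=True)
    model = torch.nn.Sequential(torch.nn.Linear(3, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 1)).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    data_time, step_time, t0 = 0.0, 0.0, time.time()
    for i, (x, y) in enumerate(loader):
        data_time += time.time() - t0              # time spent waiting on the loader
        t1 = time.time()
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if device.type == 'cuda':
            torch.cuda.synchronize()               # make the GPU timing honest
        step_time += time.time() - t1
        t0 = time.time()

    print(f'data: {data_time:.2f}s, compute: {step_time:.2f}s over {i + 1} iterations')


if __name__ == '__main__':
    main()
```

If most of the time shows up under `data`, the dataloader (number of workers, disk I/O) is the place to look; if it shows up under `compute`, the GPU is genuinely the bottleneck.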

@davelindell reopened this Apr 14, 2022
@xindonglin99

xindonglin99 commented Apr 14, 2022

I'm sure I used the config cloned from the repo. Below are the loss curves for the Thai Statue and the config I used:

[Screenshots: TensorBoard loss curves for the Thai Statue and the config file used]

I remember the speed was around 3-4 it/s or even lower. I initially suspected the GPU was not being used during training, but torch.cuda.is_available() returns True. It's weird to see this behavior. Thanks in advance for the help.
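For reference, a fuller check than is_available() looks roughly like the sketch below (the Linear module and random batch are just placeholders, not the actual training objects), since is_available() only reports that CUDA could be used, not that the model and data were actually moved onto the GPU:

```python
# Quick GPU sanity check; the Linear module and random batch are placeholders.
import torch

print(torch.cuda.is_available())                      # True on this machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))              # should report the V100

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(3, 1).to(device)              # placeholder module
batch = torch.randn(8, 3).to(device)                  # placeholder batch
print(next(model.parameters()).device, batch.device)  # expect cuda:0 twice
```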

PS: Does the training mesh need to be watertight? Does this matter? We trained on a watertight mesh.

@davelindell
Collaborator

Hmm, unfortunately I'm still having a hard time reproducing this.

One observation is that my occupancy loss curves look very different compared to yours. It's almost as if the block optimization is not happening at all in your case. You should see the error spike a bit at intervals where the block optimization is done. The loss also doesn't go down monotonically because as the blocks subdivide, there are more blocks and hence more fitting error.

[Screenshot: occupancy loss curve with spikes at the block-optimization intervals]

I followed the below steps:

  • re-downloaded the repo
  • created a new conda environment using the instructions in the README.md
  • downloaded the Thai Statue PLY file from the Stanford 3D Scanning Repository (ideally the models should be watertight, but in practice it seems to work if the meshes are close to watertight; a quick check is sketched right after this list)
  • re-ran the training using the config_thai_acorn.ini file
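
On the watertightness point in the list above, trimesh can report it directly; the path below is just an assumption about where the PLY file lives in your setup.

```python
# Quick watertightness check for the training mesh; the file path is assumed.
import trimesh

mesh = trimesh.load('data/thai_statue.ply', force='mesh')
print(mesh.is_watertight)                  # True for a closed, manifold surface
print(len(mesh.vertices), len(mesh.faces))
```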

I get the result below after 20K iterations (I exported the mesh and visualized it in MeshLab). This took an hour or so on an old Titan X GPU.

[Screenshot: exported Thai Statue mesh after 20K iterations, viewed in MeshLab]

@xindonglin99

Thank you for your feedback. The issue still happens on our side even though I tried to replicate the process you described. Either:

  • The exported mesh still lacks detail if I use a thai_statue mesh from other sources.
  • Training hits errors (number of octants > maximum octants, problem infeasible) if I use the mesh downloaded from the Stanford webpage. I raised the max octants to 8192, but it still doesn't help.

I will dig a little bit more into this and will update if I find the issue. Thank you for your help again!

@davelindell
Collaborator

Hmm, this is strange, and it's hard to diagnose since I can't reproduce it. A few other thoughts:

Maybe there is some difference in the hardware or python packages?

Otherwise, did you rename the downloaded model file to thai_statue.ply in the data directory?

Does it work if you try on a different machine?
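
On the packages question above, pasting the output of PyTorch's stock environment report would make the comparison concrete; nothing in it is specific to this repo.

```python
# Print torch/CUDA/driver/OS details so we can diff our environments.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```

(Equivalently, run `python -m torch.utils.collect_env` from the command line.)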
