
Training speed #6

Open · random649 opened this issue Sep 9, 2021 · 8 comments

@random649

Hi, I'm training a 3D model (engine) with your code and I followed the steps in the README exactly, but training runs too slowly (it would take more than 1,000 hours to finish). Where is the problem?
(I used a GeForce RTX 2080 Ti)

@davelindell
Collaborator

The maximum number of iterations in the training script is probably much more than is necessary for the model to converge. How many iterations are you running it for? Does the loss decrease and begin to converge?

The training script will save out model checkpoints at intermediate points, and you can try saving out meshes from these models to see how they look.
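For reference, pulling a mesh out of an intermediate checkpoint can look roughly like the sketch below: evaluate occupancy on a dense grid and run marching cubes on it. This is a generic sketch rather than the repo's actual export code; the checkpoint path, model construction, grid resolution, and the 0.5 threshold are placeholders to adapt to the training script.

```python
# Hypothetical sketch: export a mesh from an intermediate checkpoint by
# sampling occupancy on a dense grid and running marching cubes.
import numpy as np
import torch
import trimesh
from skimage import measure

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# In practice, rebuild the trained model and load an intermediate checkpoint:
#   model = ...                                       # same architecture as training
#   model.load_state_dict(torch.load('logs/.../model_current.pth'))  # placeholder path
#   model.to(device).eval()
# Stand-in (unit-sphere occupancy) so this sketch runs on its own:
model = lambda pts: (1.0 - pts.norm(dim=-1, keepdim=True)).clamp(0, 1)

res = 128
lin = np.linspace(-1.0, 1.0, res, dtype=np.float32)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing='ij'), axis=-1).reshape(-1, 3)

with torch.no_grad():
    coords = torch.from_numpy(grid).to(device)
    occ = torch.cat([model(c) for c in coords.split(2 ** 17)]).cpu().numpy()

verts, faces, _, _ = measure.marching_cubes(occ.reshape(res, res, res), level=0.5)
trimesh.Trimesh(verts, faces).export('checkpoint_mesh.ply')
```

If the early checkpoints already look reasonable, you can stop training long before the maximum iteration count in the config.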

@davelindell
Collaborator

I haven't been able to replicate this issue, so closing for now. Please follow up if this is still a problem.

@xindonglin99

xindonglin99 commented Apr 14, 2022

I have the same issue training the 3D models. Currently I'm using the default setting in the Thai Statue config file, which is 10,000 epochs. Training ran on a V100 for 24 hours but only completed 60,000 iterations, which is not many epochs. I exported the DAE mesh and it doesn't look converged. Below is a snapshot of that DAE mesh in MeshLab:

[Screenshot: exported DAE mesh viewed in MeshLab]

@davelindell
Collaborator

Training to 60,000 iterations should yield a better result than what you're showing above, so something seems off. We optimize to 48k iters in the paper for the Thai Statue and it looks much more detailed than the above.

Also, it seems strange that it takes 24 hours to get to 60k iterations. I can run around 10 it/s on my laptop GPU (GTX 1650), and at this rate it should only take a couple hours to reach 60k. I guess a V100 should be even faster.

Are you sure you are using the default config file without any changes from the repo? How many workers are you using for the dataloader? Can you also post the tensorboard summaries for the occupancy loss?
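One way to narrow this down is to time the data loading separately from the GPU step. The sketch below is a generic template rather than this repo's training loop; the dataset, model, and batch size are dummy stand-ins, so swap in the actual dataloader and training step to see where the time goes.

```python
# Hypothetical bottleneck check: the dataset and model here are dummy
# placeholders standing in for the real training objects.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    dataset = TensorDataset(torch.randn(100_000, 3), torch.rand(100_000, 1))
    loader = DataLoader(dataset, batch_size=4096, num_workers=4, pin_memory=True)
    model = torch.nn.Sequential(torch.nn.Linear(3, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 1)).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    data_time, step_time, t0 = 0.0, 0.0, time.time()
    for i, (x, y) in enumerate(loader):
        data_time += time.time() - t0              # time spent waiting on the loader
        t1 = time.time()
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if device.type == 'cuda':
            torch.cuda.synchronize()               # make the GPU timing honest
        step_time += time.time() - t1
        t0 = time.time()

    print(f'data: {data_time:.2f}s, compute: {step_time:.2f}s over {i + 1} iterations')


if __name__ == '__main__':
    main()
```

If most of the time shows up under `data`, the dataloader (number of workers, disk I/O) is the place to look; if it shows up under `compute`, the GPU is genuinely the bottleneck.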

@davelindell reopened this Apr 14, 2022
@xindonglin99

xindonglin99 commented Apr 14, 2022

I'm sure I used the config cloned from the repo. Below are the loss curves for the Thai Statue and the config I used:

[Screenshots: TensorBoard loss curves for the Thai Statue and the config file used]

I remember the speed was around 3-4 it/s or even lower. I initially suspected the GPU was not being used during training, but torch.cuda.is_available() returns True. It's weird to see this behavior. Thanks in advance for the help.
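For reference, a fuller check than is_available() looks roughly like the sketch below (the Linear module and random batch are just placeholders, not the actual training objects), since is_available() only reports that CUDA could be used, not that the model and data were actually moved onto the GPU:

```python
# Quick GPU sanity check; the Linear module and random batch are placeholders.
import torch

print(torch.cuda.is_available())                      # True on this machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))              # should report the V100

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(3, 1).to(device)              # placeholder module
batch = torch.randn(8, 3).to(device)                  # placeholder batch
print(next(model.parameters()).device, batch.device)  # expect cuda:0 twice
```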

PS: Does the training mesh need to be watertight? Does this matter? We trained on a watertight mesh.

@davelindell
Collaborator

Hmm, unfortunately I'm still having a hard time reproducing this.

One observation is that my occupancy loss curves look very different compared to yours. It's almost as if the block optimization is not happening at all in your case. You should see the error spike a bit at intervals where the block optimization is done. The loss also doesn't go down monotonically because as the blocks subdivide, there are more blocks and hence more fitting error.

[Screenshot: occupancy loss curve with spikes at the block-optimization intervals]

I followed the below steps:

  • re-downloaded the repo
  • created a new conda environment using the instructions in the README.md
  • downloaded the Thai Statue PLY file from the Stanford 3D Scanning Repository (ideally the models should be watertight, but in practice it seems to work if the meshes are close to watertight; a quick check is sketched right after this list)
  • re-ran the training using the config_thai_acorn.ini file
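
On the watertightness point in the list above, trimesh can report it directly; the path below is just an assumption about where the PLY file lives in your setup.

```python
# Quick watertightness check for the training mesh; the file path is assumed.
import trimesh

mesh = trimesh.load('data/thai_statue.ply', force='mesh')
print(mesh.is_watertight)                  # True for a closed, manifold surface
print(len(mesh.vertices), len(mesh.faces))
```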

I get the result below after 20K iterations (I exported the mesh and visualized it in MeshLab). This took an hour or so on an old Titan X GPU.

[Screenshot: exported Thai Statue mesh after 20K iterations, viewed in MeshLab]

@xindonglin99

Thank you for your feedback. The issue still happens on our side even though I tried to replicate the process you described. Either:

  • The exported mesh still lacks detail if I use a thai_statue mesh from other sources.
  • Training hits errors (number of octants > maximum octants, problem infeasible) if I use the mesh downloaded from the Stanford webpage. I raised the max octants to 8192, but it still doesn't help.

I will dig a little bit more into this and will update if I find the issue. Thank you for your help again!

@davelindell
Collaborator

Hmm, this is strange, and it's hard to diagnose since I can't reproduce it. A few other thoughts:

Maybe there is some difference in the hardware or python packages?

Otherwise, did you rename the downloaded model file to thai_statue.ply in the data directory?

Does it work if you try on a different machine?
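
On the packages question above, pasting the output of PyTorch's stock environment report would make the comparison concrete; nothing in it is specific to this repo.

```python
# Print torch/CUDA/driver/OS details so we can diff our environments.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```

(Equivalently, run `python -m torch.utils.collect_env` from the command line.)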
