I found some bugs and questions. #30
The ReferenceNet is a mess at the moment. I collapsed its training part so it just extracts feature maps, but now I'm having doubts. I'm not clear on https://github.com/johndpope/Emote-hack/blob/main/train_stage_1_0.py#L194 (this needs updating); I had this function to extract them. Related: #27 says to throw out the Reference Attention and use EMSA, Efficient Multi-Head Self-Attention, instead (rough sketch below).
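For context, here is a minimal PyTorch sketch of what EMSA does, per the ResT paper: keys and values are computed on a spatially downsampled copy of the token grid, so self-attention over a feature map gets cheaper. All dimensions and the `sr_ratio` default are illustrative, not taken from this repo.

```python
import torch
import torch.nn as nn

class EMSA(nn.Module):
    """Efficient Multi-head Self-Attention (ResT-style sketch).
    K/V come from a strided depthwise conv over the token grid,
    cutting attention cost from O(N^2) to O(N * N / sr_ratio^2)."""

    def __init__(self, dim: int, num_heads: int = 8, sr_ratio: int = 2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # depthwise strided conv shrinks the K/V token grid
            self.sr = nn.Conv2d(dim, dim, sr_ratio, stride=sr_ratio, groups=dim)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape  # x is a flattened (h*w)-token feature map
        q = self.q(x).reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x = x.transpose(1, 2).reshape(b, c, h, w)
            x = self.sr(x).flatten(2).transpose(1, 2)
            x = self.norm(x)
        kv = self.kv(x).reshape(b, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

# e.g. a 64x64 map with 320 channels, as in an SD down block
y = EMSA(dim=320)(torch.randn(1, 64 * 64, 320), 64, 64)
```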
Crazy day. Some open-ended questions I don't have answers to, and some observations. Is the diagram wrong? Does ReferenceNet actually use self-attention? There's a way to inject self-attention into the UNet, which I pushed as a new model, or there's a simpler way to upgrade ReferenceNet to use self-attention layers. When I looked at the VideoNet by @jimmyl02, it introduces reference attention when the model loads. The other option from earlier is to just extract the features from the UNet; if ReferenceNet doesn't need updating, then it should be frozen too (see https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/). So I introduce that as a preprocessing stage for stage 1, though it doesn't match up to the diagram. A sketch of that frozen-extraction option follows below.
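On the "extract features from a frozen UNet" option: a minimal sketch, assuming a diffusers SD-1.5 UNet stands in for ReferenceNet and forward hooks collect the per-resolution maps (the model ID and hook naming are my assumptions):

```python
import torch
from diffusers import UNet2DConditionModel

ref_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
ref_unet.requires_grad_(False).eval()  # frozen: pure feature extractor

features = {}
def save_output(name):
    def hook(module, args, output):
        # down blocks return (hidden_states, res_samples) tuples
        features[name] = output[0] if isinstance(output, tuple) else output
    return hook

handles = [blk.register_forward_hook(save_output(f"down_{i}"))
           for i, blk in enumerate(ref_unet.down_blocks)]

latents = torch.randn(1, 4, 64, 64)      # reference image latents
t = torch.zeros(1, dtype=torch.long)     # t = 0: no noise added
dummy_text = torch.zeros(1, 77, 768)     # unused text conditioning
with torch.no_grad():
    ref_unet(latents, t, encoder_hidden_states=dummy_text)

for h in handles:
    h.remove()
# `features` now holds one reference map per resolution for the backbone.
```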
Hi, one thing I am not sure about: the code you pushed passes latent_representations through the whole reference UNet, but the diagram from the paper shows that the input of the backbone network should come from the first down-block layer of the reference net, not from the end.
@sarperkilic that's this option: https://github.com/johndpope/Emote-hack/blob/main/Net.py#L123
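If that reading of the diagram is right, one hedged way to take only the first down block's output, again assuming a diffusers-style UNet2DConditionModel (how these features get projected into the backbone is out of scope here):

```python
import torch
from diffusers import UNet2DConditionModel

ref_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet").eval()

@torch.no_grad()
def first_down_block_features(latents, text_embeds):
    """Run conv_in + down_blocks[0] only, skipping the rest of the UNet."""
    t = torch.zeros(latents.shape[0], dtype=torch.long)  # reference: t = 0
    temb = ref_unet.time_embedding(ref_unet.time_proj(t).to(latents.dtype))
    h = ref_unet.conv_in(latents)
    h, _ = ref_unet.down_blocks[0](
        hidden_states=h, temb=temb, encoder_hidden_states=text_embeds)
    return h  # 320-channel map from the first down block (SD-1.5)

feats = first_down_block_features(
    torch.randn(1, 4, 64, 64), torch.zeros(1, 77, 768))
```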
@wangshiwen-ai / @sarperkilic N.B. similar code is actually slated for release in a couple of months. I could use some extra eyes; the code is mostly complete, but I'm having trouble with the warpgenerator.
I may have another go at this using the latest ChatGPT: https://chatgpt.com/share/66ece68a-606c-800e-8078-0e6081a82b5c
The key point about the dimensions is that cross_attention_dim is 768 by default, but we do not need the text embedding. So I think we could modify the cross-attention layers and ignore the cross-attention in the reference net; one possible approach is sketched below.
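One hedged way to do that without rewriting the layers: keep the 768-dim cross-attention weights intact but feed a learned null token in place of CLIP text embeddings, so the text path is effectively inert (`null_token` and the model ID are illustrative, not from this repo):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

# one learned token matching the UNet's cross_attention_dim (768 for SD-1.5)
null_token = torch.nn.Parameter(
    torch.zeros(1, 1, unet.config.cross_attention_dim))

latents = torch.randn(2, 4, 64, 64)
t = torch.randint(0, 1000, (2,))
out = unet(latents, t,
           encoder_hidden_states=null_token.expand(latents.shape[0], -1, -1)).sample
```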
In the denoising net, I think it should take the noisy image latents as input and cross-attend with the reference features to reconstruct the reference images. And the UNet in stage 1 may be all 2D.
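Putting those two points together, a rough sketch of a fully 2D stage-1 training step, with reference features projected into the existing cross-attention slots; `ref_features`, `proj`, and the model/scheduler IDs are my assumptions, not the paper's recipe:

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel, DDPMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

# stand-in for ReferenceNet output: (B, n_tokens, C_ref)
ref_features = torch.randn(2, 64, 320)
proj = torch.nn.Linear(320, unet.config.cross_attention_dim)  # 320 -> 768

clean_latents = torch.randn(2, 4, 64, 64)   # VAE latents of the target frame
noise = torch.randn_like(clean_latents)
t = torch.randint(0, scheduler.config.num_train_timesteps, (2,))
noisy = scheduler.add_noise(clean_latents, noise, t)

# noisy latents in, reference tokens injected through cross-attention
pred = unet(noisy, t, encoder_hidden_states=proj(ref_features)).sample
loss = F.mse_loss(pred, noise)  # standard epsilon-prediction objective
loss.backward()
```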