I found some bugs and questions. #30
The ReferenceNet is a mess at the moment. I collapsed its training part so it just extracts feature maps, but now I'm having doubts. I'm not clear on https://github.com/johndpope/Emote-hack/blob/main/train_stage_1_0.py#L194 (this needs updating); I had this function to extract them. Related: #27 says to throw out the Reference Attention and use EMSA, Efficient Multi-Head Self-Attention, instead (rough sketch below).
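For context, here is a minimal PyTorch sketch of what EMSA does, per the ResT paper: keys and values are computed on a spatially downsampled copy of the token grid, so self-attention over a feature map gets cheaper. All dimensions and the `sr_ratio` default are illustrative, not taken from this repo.

```python
import torch
import torch.nn as nn

class EMSA(nn.Module):
    """Efficient Multi-head Self-Attention (ResT-style sketch).
    K/V come from a strided depthwise conv over the token grid,
    cutting attention cost from O(N^2) to O(N * N / sr_ratio^2)."""

    def __init__(self, dim: int, num_heads: int = 8, sr_ratio: int = 2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # depthwise strided conv shrinks the K/V token grid
            self.sr = nn.Conv2d(dim, dim, sr_ratio, stride=sr_ratio, groups=dim)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape  # x is a flattened (h*w)-token feature map
        q = self.q(x).reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x = x.transpose(1, 2).reshape(b, c, h, w)
            x = self.sr(x).flatten(2).transpose(1, 2)
            x = self.norm(x)
        kv = self.kv(x).reshape(b, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

# e.g. a 64x64 map with 320 channels, as in an SD down block
y = EMSA(dim=320)(torch.randn(1, 64 * 64, 320), 64, 64)
```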
Crazy day. Some open-ended questions I don't have answers to, and some observations. Is the diagram wrong? Does ReferenceNet actually use self-attention? There's a way to inject self-attention into the UNet, which I pushed as a new model, or there's a simpler way to upgrade ReferenceNet to use self-attention layers. When I looked at the VideoNet by @jimmyl02, it introduces reference attention when the model loads. The other option from earlier is to just extract the features from the UNet; if ReferenceNet doesn't need updating, then it should be frozen too (see https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/). So I introduce that as a preprocessing stage for stage 1, though it doesn't match up to the diagram. A sketch of that frozen-extraction option follows below.
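On the "extract features from a frozen UNet" option: a minimal sketch, assuming a diffusers SD-1.5 UNet stands in for ReferenceNet and forward hooks collect the per-resolution maps (the model ID and hook naming are my assumptions):

```python
import torch
from diffusers import UNet2DConditionModel

ref_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
ref_unet.requires_grad_(False).eval()  # frozen: pure feature extractor

features = {}
def save_output(name):
    def hook(module, args, output):
        # down blocks return (hidden_states, res_samples) tuples
        features[name] = output[0] if isinstance(output, tuple) else output
    return hook

handles = [blk.register_forward_hook(save_output(f"down_{i}"))
           for i, blk in enumerate(ref_unet.down_blocks)]

latents = torch.randn(1, 4, 64, 64)      # reference image latents
t = torch.zeros(1, dtype=torch.long)     # t = 0: no noise added
dummy_text = torch.zeros(1, 77, 768)     # unused text conditioning
with torch.no_grad():
    ref_unet(latents, t, encoder_hidden_states=dummy_text)

for h in handles:
    h.remove()
# `features` now holds one reference map per resolution for the backbone.
```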
Hi, one thing I am not sure about: the code you pushed passes latent_representations through the whole reference UNet, but the diagram from the paper shows that the input of the backbone network should come from the first down-block layer of the reference net, not from the end.
@sarperkilic that's this option: https://github.com/johndpope/Emote-hack/blob/main/Net.py#L123
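If that reading of the diagram is right, one hedged way to take only the first down block's output, again assuming a diffusers-style UNet2DConditionModel (how these features get projected into the backbone is out of scope here):

```python
import torch
from diffusers import UNet2DConditionModel

ref_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet").eval()

@torch.no_grad()
def first_down_block_features(latents, text_embeds):
    """Run conv_in + down_blocks[0] only, skipping the rest of the UNet."""
    t = torch.zeros(latents.shape[0], dtype=torch.long)  # reference: t = 0
    temb = ref_unet.time_embedding(ref_unet.time_proj(t).to(latents.dtype))
    h = ref_unet.conv_in(latents)
    h, _ = ref_unet.down_blocks[0](
        hidden_states=h, temb=temb, encoder_hidden_states=text_embeds)
    return h  # 320-channel map from the first down block (SD-1.5)

feats = first_down_block_features(
    torch.randn(1, 4, 64, 64), torch.zeros(1, 77, 768))
```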
@wangshiwen-ai / @sarperkilic N.B. similar code is actually slated for release in a couple of months. I could use some extra eyes; the code is mostly complete, but I'm having trouble with the warpgenerator.
I may have another go at this using the latest ChatGPT: https://chatgpt.com/share/66ece68a-606c-800e-8078-0e6081a82b5c
The key point about the dimensions is that cross_attention_dim is 768 by default, but we do not need the text embedding. So I think we could modify the cross-attention layers and ignore the cross-attention in the reference net; one possible approach is sketched below.
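One hedged way to do that without rewriting the layers: keep the 768-dim cross-attention weights intact but feed a learned null token in place of CLIP text embeddings, so the text path is effectively inert (`null_token` and the model ID are illustrative, not from this repo):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

# one learned token matching the UNet's cross_attention_dim (768 for SD-1.5)
null_token = torch.nn.Parameter(
    torch.zeros(1, 1, unet.config.cross_attention_dim))

latents = torch.randn(2, 4, 64, 64)
t = torch.randint(0, 1000, (2,))
out = unet(latents, t,
           encoder_hidden_states=null_token.expand(latents.shape[0], -1, -1)).sample
```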
In the denoising net, I think it should take the noisy image latents as input and cross-attend with the reference features to reconstruct the reference images. And the UNet in stage 1 may be all 2D.
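Putting those two points together, a rough sketch of a fully 2D stage-1 training step, with reference features projected into the existing cross-attention slots; `ref_features`, `proj`, and the model/scheduler IDs are my assumptions, not the paper's recipe:

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel, DDPMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

# stand-in for ReferenceNet output: (B, n_tokens, C_ref)
ref_features = torch.randn(2, 64, 320)
proj = torch.nn.Linear(320, unet.config.cross_attention_dim)  # 320 -> 768

clean_latents = torch.randn(2, 4, 64, 64)   # VAE latents of the target frame
noise = torch.randn_like(clean_latents)
t = torch.randint(0, scheduler.config.num_train_timesteps, (2,))
noisy = scheduler.add_noise(clean_latents, noise, t)

# noisy latents in, reference tokens injected through cross-attention
pred = unet(noisy, t, encoder_hidden_states=proj(ref_features)).sample
loss = F.mse_loss(pred, noise)  # standard epsilon-prediction objective
loss.backward()
```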