
VASA-1 hack #34

Open
johndpope opened this issue Apr 22, 2024 · 1 comment
johndpope (Owner) commented:
Working to align the code with the VASA white paper:
https://github.com/johndpope/VASA-1-hack/blob/main/Net.py

I cherry-picked some code from here, which I believe builds on the MegaPortraits work:
https://github.com/yerfor/Real3DPortrait/

I have yet to wire up the training code; I won't get to it for a few days.


Here's how the provided code aligns with the VASA paper:

1. Face Latent Space Construction:
   - The code defines encoders for extracting the various latent variables from face images, mirroring the approach in the VASA paper.
   - The `Canonical3DVolumeEncoder`, `IdentityEncoder`, `HeadPoseEncoder`, and `FacialDynamicsEncoder` classes correspond to the encoders for the canonical 3D appearance volume, identity code, 3D head pose, and facial dynamics code, respectively (see the sketch after this list).
   - The `ExpressiveDisentangledFaceLatentSpace` class combines these encoders with a decoder to form the overall framework for learning the disentangled face latent space.
   - The loss functions used in `ExpressiveDisentangledFaceLatentSpace`, such as the reconstruction loss, pairwise transfer loss, and identity similarity loss, match the losses the paper uses to achieve disentanglement and expressiveness.
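
For illustration, here is a minimal sketch of how such a set of encoders could be wired into a single module. This is hypothetical PyTorch written for this comment, not the actual code in Net.py; the names (`DisentangledFaceLatents`, `conv_encoder`) and all dimensions (128-d identity code, 6-d pose, a 16-channel x 16-plane appearance volume) are assumptions.

```python
import torch
import torch.nn as nn

def conv_encoder(out_dim: int) -> nn.Module:
    # Small CNN trunk shared by the illustrative encoders below.
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class DisentangledFaceLatents(nn.Module):
    """Hypothetical stand-in for ExpressiveDisentangledFaceLatentSpace:
    one forward pass factors a frame into an appearance volume plus
    identity, head-pose, and facial-dynamics codes."""
    def __init__(self, id_dim: int = 128, dyn_dim: int = 64):
        super().__init__()
        self.identity_enc = conv_encoder(id_dim)   # ~ IdentityEncoder
        self.pose_enc = conv_encoder(6)            # ~ HeadPoseEncoder: 3 rotation + 3 translation
        self.dynamics_enc = conv_encoder(dyn_dim)  # ~ FacialDynamicsEncoder
        # ~ Canonical3DVolumeEncoder: image -> (C=16, D=16, H/2, W/2) feature volume
        self.volume_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 16 * 16, 3, 1, 1),
        )

    def forward(self, img: torch.Tensor):
        B = img.shape[0]
        vol = self.volume_enc(img)                  # (B, 256, H/2, W/2)
        vol = vol.view(B, 16, 16, *vol.shape[-2:])  # split channels into (C, D)
        return vol, self.identity_enc(img), self.pose_enc(img), self.dynamics_enc(img)

# usage: vol, id_code, pose, dyn = DisentangledFaceLatents()(torch.randn(2, 3, 256, 256))
```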

2. Holistic Facial Dynamics Generation with Diffusion Transformer:
   - The `DiffusionTransformer` class represents the diffusion transformer used to generate holistic facial dynamics and head motion.
   - Its architecture, with stacked transformer layers and input feature concatenation, matches the description in the VASA paper.
   - Its forward method takes the latent codes, audio features, gaze direction, head distance, and emotion offset as conditioning signals, as the paper describes (a sketch of this conditioning scheme follows the list).
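
As a concrete illustration of that conditioning scheme, here is a minimal, hypothetical PyTorch sketch of a denoiser that concatenates the per-frame conditions onto the noisy motion latents before the transformer layers. Nothing here is taken from Net.py; `MotionDenoiser`, the dimensions, and the crude scalar timestep channel are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Hypothetical sketch of a VASA-style diffusion transformer: predicts
    the noise added to a window of motion latents, conditioned per frame on
    audio features and control signals (gaze, head distance, emotion offset)."""
    def __init__(self, motion_dim=70, audio_dim=128, ctrl_dim=5,
                 d_model=256, n_layers=8, n_heads=8):
        super().__init__()
        # +1 for a crude scalar timestep channel (a real model would use
        # sinusoidal or learned timestep embeddings)
        self.in_proj = nn.Linear(motion_dim + audio_dim + ctrl_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, t, audio_feat, ctrl):
        # noisy_motion: (B, T, motion_dim)  window of noised motion latents
        # t:            (B,)                diffusion timestep per sample
        # audio_feat:   (B, T, audio_dim)   per-frame audio features
        # ctrl:         (B, T, ctrl_dim)    gaze / distance / emotion conditions
        B, T, _ = noisy_motion.shape
        t_emb = (t.float() / 1000.0).view(B, 1, 1).expand(B, T, 1)
        x = torch.cat([noisy_motion, audio_feat, ctrl, t_emb], dim=-1)
        return self.out_proj(self.encoder(self.in_proj(x)))  # predicted noise
```

A training step would then sample `t`, noise a clean motion window, and regress the model output against the added noise, i.e. the standard epsilon-prediction diffusion objective.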

3. Talking Face Video Generation:
   - The `Decoder` class corresponds to the decoder in the VASA paper for rendering talking face videos.
   - The decoder takes the canonical 3D appearance volume, identity code, head pose, and facial dynamics latent codes as input, and applies 3D warping driven by the head pose and facial dynamics to produce the final face image.
   - The warping step, using the `transform_kp` function and grid sampling, matches the paper's description of applying the generated motion latents to the appearance volume (see the warping sketch after this list).
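
And a minimal sketch of the grid-sampling step in that warping. It assumes the pose and dynamics codes have already been converted into a dense 3D flow field (the role `transform_kp` plays upstream in the repo); `warp_volume` is a hypothetical helper written for this example, not a function from Net.py.

```python
import torch
import torch.nn.functional as F

def warp_volume(volume: torch.Tensor, flow_field: torch.Tensor) -> torch.Tensor:
    """Warp a canonical 3D appearance volume with a dense 3D flow field.

    volume:     (B, C, D, H, W) canonical appearance features
    flow_field: (B, D, H, W, 3) sampling offsets in normalized [-1, 1] coords
    """
    B, C, D, H, W = volume.shape
    # Build the identity sampling grid in normalized coordinates.
    zs = torch.linspace(-1, 1, D)
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    gz, gy, gx = torch.meshgrid(zs, ys, xs, indexing="ij")
    base = torch.stack((gx, gy, gz), dim=-1)  # (D, H, W, 3), grid_sample wants x-y-z order
    # Offset the identity grid by the flow and trilinearly resample the volume.
    grid = base.unsqueeze(0).to(volume) + flow_field
    return F.grid_sample(volume, grid, align_corners=True)

# usage: warped = warp_volume(torch.randn(2, 16, 16, 64, 64),
#                             0.1 * torch.randn(2, 16, 64, 64, 3))
```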

Overall, the code follows the high-level architecture described in the VASA paper: face latent space construction, holistic facial dynamics generation with a diffusion transformer, and a decoder for rendering talking face videos. Specific implementation details and function names may differ, but the overall structure and flow match the concepts presented in the paper.
johndpope (Owner, Author) commented:
I refactored a new branch based on the MegaPortraits code.
