
VASA-1 hack #34

Open
johndpope opened this issue Apr 22, 2024 · 1 comment
johndpope (Owner) commented:
Working to align the code with the VASA white paper:
https://github.com/johndpope/VASA-1-hack/blob/main/Net.py

I cherry-picked some code from here, which I believe builds on the MegaPortraits work:
https://github.com/yerfor/Real3DPortrait/

I have yet to wire up the training code; I won't get to it for a few days.


Here's how the provided code aligns with the VASA paper:

1. Face Latent Space Construction:
   - The code defines encoders for extracting the various latent variables from face images, mirroring the approach in the VASA paper.
   - The `Canonical3DVolumeEncoder`, `IdentityEncoder`, `HeadPoseEncoder`, and `FacialDynamicsEncoder` classes correspond to the encoders for the canonical 3D appearance volume, identity code, 3D head pose, and facial dynamics code, respectively (see the sketch after this list).
   - The `ExpressiveDisentangledFaceLatentSpace` class combines these encoders with a decoder to form the overall framework for learning the disentangled face latent space.
   - The loss functions used in `ExpressiveDisentangledFaceLatentSpace`, such as the reconstruction loss, pairwise transfer loss, and identity similarity loss, match the losses the paper uses to achieve disentanglement and expressiveness.
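
For illustration, here is a minimal sketch of how such a set of encoders could be wired into a single module. This is hypothetical PyTorch written for this comment, not the actual code in Net.py; the names (`DisentangledFaceLatents`, `conv_encoder`) and all dimensions (128-d identity code, 6-d pose, a 16-channel x 16-plane appearance volume) are assumptions.

```python
import torch
import torch.nn as nn

def conv_encoder(out_dim: int) -> nn.Module:
    # Small CNN trunk shared by the illustrative encoders below.
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class DisentangledFaceLatents(nn.Module):
    """Hypothetical stand-in for ExpressiveDisentangledFaceLatentSpace:
    one forward pass factors a frame into an appearance volume plus
    identity, head-pose, and facial-dynamics codes."""
    def __init__(self, id_dim: int = 128, dyn_dim: int = 64):
        super().__init__()
        self.identity_enc = conv_encoder(id_dim)   # ~ IdentityEncoder
        self.pose_enc = conv_encoder(6)            # ~ HeadPoseEncoder: 3 rotation + 3 translation
        self.dynamics_enc = conv_encoder(dyn_dim)  # ~ FacialDynamicsEncoder
        # ~ Canonical3DVolumeEncoder: image -> (C=16, D=16, H/2, W/2) feature volume
        self.volume_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 16 * 16, 3, 1, 1),
        )

    def forward(self, img: torch.Tensor):
        B = img.shape[0]
        vol = self.volume_enc(img)                  # (B, 256, H/2, W/2)
        vol = vol.view(B, 16, 16, *vol.shape[-2:])  # split channels into (C, D)
        return vol, self.identity_enc(img), self.pose_enc(img), self.dynamics_enc(img)

# usage: vol, id_code, pose, dyn = DisentangledFaceLatents()(torch.randn(2, 3, 256, 256))
```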

2. Holistic Facial Dynamics Generation with Diffusion Transformer:
   - The `DiffusionTransformer` class represents the diffusion transformer used to generate holistic facial dynamics and head motion.
   - Its architecture, with stacked transformer layers and input feature concatenation, matches the description in the VASA paper.
   - Its forward method takes the latent codes, audio features, gaze direction, head distance, and emotion offset as conditioning signals, as the paper describes (a sketch of this conditioning scheme follows the list).
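
As a concrete illustration of that conditioning scheme, here is a minimal, hypothetical PyTorch sketch of a denoiser that concatenates the per-frame conditions onto the noisy motion latents before the transformer layers. Nothing here is taken from Net.py; `MotionDenoiser`, the dimensions, and the crude scalar timestep channel are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Hypothetical sketch of a VASA-style diffusion transformer: predicts
    the noise added to a window of motion latents, conditioned per frame on
    audio features and control signals (gaze, head distance, emotion offset)."""
    def __init__(self, motion_dim=70, audio_dim=128, ctrl_dim=5,
                 d_model=256, n_layers=8, n_heads=8):
        super().__init__()
        # +1 for a crude scalar timestep channel (a real model would use
        # sinusoidal or learned timestep embeddings)
        self.in_proj = nn.Linear(motion_dim + audio_dim + ctrl_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, t, audio_feat, ctrl):
        # noisy_motion: (B, T, motion_dim)  window of noised motion latents
        # t:            (B,)                diffusion timestep per sample
        # audio_feat:   (B, T, audio_dim)   per-frame audio features
        # ctrl:         (B, T, ctrl_dim)    gaze / distance / emotion conditions
        B, T, _ = noisy_motion.shape
        t_emb = (t.float() / 1000.0).view(B, 1, 1).expand(B, T, 1)
        x = torch.cat([noisy_motion, audio_feat, ctrl, t_emb], dim=-1)
        return self.out_proj(self.encoder(self.in_proj(x)))  # predicted noise
```

A training step would then sample `t`, noise a clean motion window, and regress the model output against the added noise, i.e. the standard epsilon-prediction diffusion objective.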

3. Talking Face Video Generation:
   - The `Decoder` class corresponds to the decoder in the VASA paper for rendering talking face videos.
   - The decoder takes the canonical 3D appearance volume, identity code, head pose, and facial dynamics latent codes as input, and applies 3D warping driven by the head pose and facial dynamics to produce the final face image.
   - The warping step, using the `transform_kp` function and grid sampling, matches the paper's description of applying the generated motion latents to the appearance volume (see the warping sketch after this list).
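
And a minimal sketch of the grid-sampling step in that warping. It assumes the pose and dynamics codes have already been converted into a dense 3D flow field (the role `transform_kp` plays upstream in the repo); `warp_volume` is a hypothetical helper written for this example, not a function from Net.py.

```python
import torch
import torch.nn.functional as F

def warp_volume(volume: torch.Tensor, flow_field: torch.Tensor) -> torch.Tensor:
    """Warp a canonical 3D appearance volume with a dense 3D flow field.

    volume:     (B, C, D, H, W) canonical appearance features
    flow_field: (B, D, H, W, 3) sampling offsets in normalized [-1, 1] coords
    """
    B, C, D, H, W = volume.shape
    # Build the identity sampling grid in normalized coordinates.
    zs = torch.linspace(-1, 1, D)
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    gz, gy, gx = torch.meshgrid(zs, ys, xs, indexing="ij")
    base = torch.stack((gx, gy, gz), dim=-1)  # (D, H, W, 3), grid_sample wants x-y-z order
    # Offset the identity grid by the flow and trilinearly resample the volume.
    grid = base.unsqueeze(0).to(volume) + flow_field
    return F.grid_sample(volume, grid, align_corners=True)

# usage: warped = warp_volume(torch.randn(2, 16, 16, 64, 64),
#                             0.1 * torch.randn(2, 16, 64, 64, 3))
```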

Overall, the code follows the high-level architecture described in the VASA paper: face latent space construction, holistic facial dynamics generation with a diffusion transformer, and a decoder for rendering talking face videos. Specific implementation details and function names may differ, but the overall structure and flow match the concepts presented in the paper.
johndpope (Owner, Author) commented:
I refactored a new branch based on the MegaPortraits code.
