I'm yet to wire up training code - won't get to it for a few days.
Here's how the provided code aligns with the VASA paper:
1. Face Latent Space Construction:
- The code defines encoders for extracting various latent variables from face images, similar to the approach mentioned in the VASA paper.
- The `Canonical3DVolumeEncoder`, `IdentityEncoder`, `HeadPoseEncoder`, and `FacialDynamicsEncoder` classes correspond to the encoders for extracting the canonical 3D appearance volume, identity code, 3D head pose, and facial dynamics code, respectively.
- The `ExpressiveDisentangledFaceLatentSpace` class combines these encoders and a decoder to form the overall framework for learning the disentangled face latent space.
- The loss functions used in the `ExpressiveDisentangledFaceLatentSpace` class, such as reconstruction loss, pairwise transfer loss, and identity similarity loss, align with the losses mentioned in the paper for achieving disentanglement and expressiveness.
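The encoder/loss setup above can be sketched roughly as follows. This is a minimal illustrative sketch, not the repo's actual API: `TinyEncoder`, `FaceLatentSpace`, the latent dimensions, and `identity_similarity_loss` are all hypothetical stand-ins for the `IdentityEncoder`/`HeadPoseEncoder`/`FacialDynamicsEncoder` classes and losses described in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Hypothetical minimal image encoder standing in for the real ones."""
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class FaceLatentSpace(nn.Module):
    """Separate encoders per latent factor, as in the disentangled design."""
    def __init__(self):
        super().__init__()
        self.identity = TinyEncoder(64)   # identity code (dim is illustrative)
        self.head_pose = TinyEncoder(6)   # 3D head pose: rotation + translation
        self.dynamics = TinyEncoder(32)   # facial dynamics code

    def forward(self, img):
        return self.identity(img), self.head_pose(img), self.dynamics(img)

def identity_similarity_loss(z_id_a, z_id_b):
    # Pull identity codes from two frames of the same person together;
    # one of several losses used to encourage disentanglement.
    return 1.0 - F.cosine_similarity(z_id_a, z_id_b).mean()

img1 = torch.randn(2, 3, 32, 32)  # two frames of the same subject (toy data)
img2 = torch.randn(2, 3, 32, 32)
model = FaceLatentSpace()
zid1, pose1, dyn1 = model(img1)
zid2, _, _ = model(img2)
loss = identity_similarity_loss(zid1, zid2)
```

The key design point is that each factor gets its own encoder, so losses like the pairwise transfer loss can swap one factor between two images while holding the others fixed.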
2. Holistic Facial Dynamics Generation with Diffusion Transformer:
- The `DiffusionTransformer` class in the code represents the diffusion transformer model used for generating holistic facial dynamics and head motion.
- The architecture of the `DiffusionTransformer` class, with its transformer layers and input feature concatenation, aligns with the description in the VASA paper.
- The forward method of the `DiffusionTransformer` class takes in the latent codes, audio features, gaze direction, head distance, and emotion offset as conditioning signals, similar to the approach mentioned in the paper.
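The conditioning scheme described above can be sketched like this. All names, dimensions, and the exact concatenation strategy here are assumptions for illustration; the repo's `DiffusionTransformer` may wire its conditioning differently.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Hypothetical sketch of a transformer denoiser: the noisy motion-latent
    sequence is concatenated feature-wise with per-frame audio features and
    broadcast control signals (gaze, head distance, emotion offset)."""
    def __init__(self, motion_dim=32, audio_dim=16, ctrl_dim=6, d_model=64):
        super().__init__()
        self.proj = nn.Linear(motion_dim + audio_dim + ctrl_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio, ctrl):
        # noisy_motion: (B, T, motion_dim); audio: (B, T, audio_dim)
        # ctrl: (B, ctrl_dim), repeated across the T frames before concat
        ctrl_seq = ctrl.unsqueeze(1).expand(-1, noisy_motion.size(1), -1)
        x = torch.cat([noisy_motion, audio, ctrl_seq], dim=-1)
        return self.out(self.encoder(self.proj(x)))  # predicted noise

model = MotionDenoiser()
eps_hat = model(
    torch.randn(2, 8, 32),  # noisy motion latents for 8 frames
    torch.randn(2, 8, 16),  # audio features aligned to those frames
    torch.randn(2, 6),      # gaze / distance / emotion offset signals
)
```

Concatenating conditions into the input sequence (rather than cross-attending to them) keeps the model a plain transformer encoder, which matches the "input feature concatenation" described above.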
3. Talking Face Video Generation:
- The `Decoder` class in the code corresponds to the decoder mentioned in the VASA paper for generating talking face videos.
- The decoder takes the canonical 3D appearance volume, identity code, head pose, and facial dynamics latent codes as input and applies 3D warping based on the head pose and facial dynamics to generate the final face image.
- The warping process in the decoder, using the `transform_kp` function and grid sampling, aligns with the description in the paper for applying the generated motion latent codes to the appearance volume.
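The grid-sampling warp can be sketched as below. This is a simplified stand-in, assuming a single affine warp built from a head rotation and translation; the repo's `transform_kp` additionally involves keypoints and facial dynamics, so `warp_volume` and its shapes here are illustrative only.

```python
import torch
import torch.nn.functional as F

def warp_volume(volume, rotation, translation):
    """Hypothetical sketch: build a 3D affine flow field from a head
    rotation/translation and resample the appearance volume with it.
    volume: (B, C, D, H, W); rotation: (B, 3, 3); translation: (B, 3)."""
    theta = torch.cat([rotation, translation.unsqueeze(-1)], dim=-1)  # (B, 3, 4)
    grid = F.affine_grid(theta, list(volume.shape), align_corners=False)
    return F.grid_sample(volume, grid, align_corners=False)

vol = torch.randn(1, 4, 8, 16, 16)   # toy canonical 3D appearance volume
rot = torch.eye(3).unsqueeze(0)      # identity rotation
trans = torch.zeros(1, 3)            # zero translation
warped = warp_volume(vol, rot, trans)  # identity warp returns the volume
```

With the identity rotation and zero translation the sampling grid lands exactly on the voxel centers, so the warp is a no-op; a real head pose would produce a rotated/translated resampling of the canonical volume before decoding to the final face image.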
Overall, the code follows the high-level architecture described in the VASA paper: face latent space construction, holistic facial dynamics generation with a diffusion transformer, and a decoder for generating talking face videos. Specific implementation details and function names may differ, but the structure and flow of the code align with the concepts presented in the paper.
Working to align code to VASA white paper
https://github.com/johndpope/VASA-1-hack/blob/main/Net.py
I cherry-picked some code from here, which I believe builds on the MegaPortraits work:
https://github.com/yerfor/Real3DPortrait/