concatenate by channel

#27
johndpope · Mar 25, 2024 · bbbad33
1 parent 00206f0
commit bbbad33

Showing 8 changed files with 1,887 additions and 328 deletions.
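
The commit title refers to concatenating by channel. As a rough illustration only (not the code from this commit), the sketch below shows what that operation means for batches laid out as `[batch][channel][height][width]`. It uses plain nested lists so it runs without dependencies; with PyTorch tensors in the same layout the equivalent call would be `torch.cat([a, b], dim=1)`. The helper names `concat_by_channel` and `shape`, and the 4-channel latent plus 1-channel mask shapes, are hypothetical and chosen for the example.

```python
def concat_by_channel(a, b):
    """Concatenate two batches of feature maps along the channel axis.

    Both inputs are nested lists shaped [batch][channel][height][width].
    Batch size and spatial size must match; channel counts may differ.
    """
    if len(a) != len(b):
        raise ValueError("batch sizes differ")
    out = []
    for sample_a, sample_b in zip(a, b):
        # Spatial size (H, W) of each sample, taken from its first channel.
        hw_a = (len(sample_a[0]), len(sample_a[0][0]))
        hw_b = (len(sample_b[0]), len(sample_b[0][0]))
        if hw_a != hw_b:
            raise ValueError("spatial sizes differ")
        # Channels of b are appended after the channels of a.
        out.append(list(sample_a) + list(sample_b))
    return out


def shape(t):
    """Return (batch, channels, height, width) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]), len(t[0][0][0]))


# Hypothetical shapes: 4 latent channels plus a 1-channel mask, 2x2 spatial.
latents = [[[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]]
mask = [[[[1.0, 1.0], [1.0, 1.0]]]]
combined = concat_by_channel(latents, mask)
print(shape(combined))  # (1, 5, 2, 2)
```

The point of the sketch is that only the channel count grows: any layer consuming the combined tensor needs its input channel count set to the sum of the two inputs.
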
diff --git a/Net.py b/Net.py
@@ -687,7 +687,7 @@ def forward(self, x):
 
 
 # given an image - spit out the mask
-
+# I don't think we need this; see https://github.com/johndpope/Emote-hack/issues/28
 # Instantiate the model
 # model = FaceLocator()
 

diff --git a/README.md b/README.md
@@ -25,6 +25,8 @@ The heavy lifting now is implementing the denoise of unet/ integrating attention
 - **AnimateAnyone** - https://github.com/jimmyl02/animate/tree/main/animate-anyone
   3 training stages here
   https://github.com/jimmyl02/animate/tree/main/animate-anyone
+ - **DiffusedHeads** - (no training code)  https://github.com/MStypulkowski/diffused-heads
+
 While this is using poseguider - it's not hard to see a dwpose / facial driving the animation. https://www.reddit.com/r/StableDiffusion/comments/1281iva/new_controlnet_face_model/?rdt=50313&onetap_auto=true
 
 
@@ -51,6 +53,8 @@ ideally the network would take a sound (wav2vec stuff) - and show an facial expr
 
 ## Face Locator:
 The face locator is a separate module that learns to detect and localize the face region in a single input image. It takes a reference image as input and outputs the corresponding face region mask. (DRAFTED - train_stage_0.py)
+UPDATE - I think we can drop in Alibaba's existing trained model (6.8 GB) in place of this module to provide mask conditioning: https://github.com/johndpope/Emote-hack/issues/28
+
 
 ## Speed Encoder:
 The speed encoder takes the audio waveform as input and extracts speed embeddings.
@@ -130,29 +134,14 @@ Note: The sample includes rich tagging. For more details, see `./data/test.json`
 
 
 ### Models / architecture
-
+(flux)
 
 
 
 ```javascript
-
--✅ FramesEncodingVAE
-  - __init__(input_channels, latent_dim, img_size, reference_net)
-  - reparameterize(mu, logvar)
-  - forward(reference_image, motion_frames, speed_value)
-  - vae_loss(recon_frames, reference_image, motion_frames, reference_mu, reference_logvar, motion_mu, motion_logvar)
-
-- DownsampleBlock
-  - __init__(in_channels, out_channels)
-  - forward(x)
-
-- UpsampleBlock
-  - __init__(in_channels, out_channels)
-  - forward(x1, x2)
-
 - ✅ ReferenceNet
-  - __init__(vae_model, speed_encoder, config)
-  - forward(reference_image, motion_frames, head_rotation_speed)
+  - __init__(self, config, reference_unet, denoising_unet, vae, dtype)
+  - forward(self, reference_image, motion_features, timesteps)
 
 - ✅ SpeedEncoder
   - __init__(num_speed_buckets, speed_embedding_dim)
@@ -216,5 +205,9 @@ Note: The sample includes rich tagging. For more details, see `./data/test.json`
   - has some training code
 ```
 
-
+The magicanimate code has custom blocks for the UNet, which may be very useful when wiring up attention in the UNet.
+```javascript
+- EMOAnimationPipeline (copied from magicanimate)
+  - has some training code / this should not need a text encoder / CLIP, to align with the EMO paper.
+```
 
diff --git a/configs/training/stage0.yaml b/configs/training/stage0.yaml
@@ -11,6 +11,7 @@ training:
   learning_rate: 1.0e-5
   num_epochs: 2
   use_gpu_video_tensor: True
+  video_data_dir: '/home/oem/Downloads/CelebV-HQ/celebvhq/35666'
 solver:
   gradient_accumulation_steps: 1
   mixed_precision: 'fp16'

diff --git a/configs/training/stage1.yaml b/configs/training/stage1.yaml
@@ -13,6 +13,7 @@ training:
   num_epochs: 2
   use_gpu_video_tensor: True
   prev_frames: 2  # Add this line to specify the number of previous frames to consider
+  video_data_dir: '/home/oem/Downloads/CelebV-HQ/celebvhq/35666'
 
 solver:
   gradient_accumulation_steps: 1