In the realm of object manipulation, human engagement typically manifests through a variety of interaction modes.

We introduce **ActAIM2**, which, given an RGBD image of an articulated object and a robot, identifies meaningful interaction modes such as opening or closing a drawer. ActAIM2 represents the interaction modes as discrete clusters of embeddings, then trains a policy that takes a cluster embedding as input and produces control actions for the corresponding interaction.

<img width="1238" alt="teaser_3" src="https://github.com/pairlab/actaim2-eccv24/assets/30140814/687daaa0-3cb3-4697-b3f8-b33d5351b7dd">

![Alt text for the image](Pictures/teaser.png)

## Problem Formulation

Our model training aims to learn the policy distribution $\mathbb{P}(a|o)$, with $o$ representing the observation and $a=(\mathbf{p}, \mathbf{R}, \mathbf{q})$ the action, by decomposing the action distribution as:

![Problem formulation](Pictures/problem_formulation.png)
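
Reading the decomposition in the figure in terms of the interaction-mode embedding $\epsilon$ used throughout the following sections, one way to write it (our paraphrase, not a verbatim transcription of the figure) is:

$$
\mathbb{P}(a \mid o) = \int \mathbb{P}(a \mid o, \epsilon)\, \mathbb{P}(\epsilon \mid o)\, d\epsilon,
$$

where $\mathbb{P}(\epsilon \mid o)$ is modeled by the mode selector and $\mathbb{P}(a \mid o, \epsilon)$ by the action predictor described below.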

## Data Generation

Our dataset was constructed through a combination of random sampling, heuristic grasp sampling, and Gaussian Mixture Model (GMM)-based adaptive sampling, featuring the Franka Emika robot engaging with various articulated objects across multiple interaction modes.

The figure below shows how we achieve diverse interaction mode sampling using GMM-based adaptive sampling.

![GMM-based adaptive sampling for data collection](Pictures/data_generation.png)

We also present our data collection algorithm below.

![Data collection algorithm](Pictures/data_generation_algo.png)
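
The sketch below illustrates the GMM-based adaptive sampling loop in simplified Python; the environment interface (`sample_random_action`, `execute_interaction`, `state_difference`) and the use of `sklearn.mixture.GaussianMixture` are illustrative assumptions rather than the exact implementation behind the figure and algorithm above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_sampling(env, n_rounds=5, n_samples_per_round=256, n_modes=4, seed=0):
    """Sketch of GMM-based adaptive sampling: actions that change the object's state
    are kept, a GMM is fit over them, and later rounds sample near the discovered
    modes instead of uniformly at random."""
    rng = np.random.default_rng(seed)
    successful_actions = []            # actions (p, R, q) flattened to vectors
    gmm = None

    for _ in range(n_rounds):
        for _ in range(n_samples_per_round):
            if gmm is None:
                action = env.sample_random_action()           # random / heuristic grasp proposal
            else:
                sampled, _ = gmm.sample(1)                    # sample near a discovered interaction mode
                action = sampled[0] + rng.normal(0.0, 0.05, size=sampled.shape[1])  # exploration noise

            final_state = env.execute_interaction(action)     # roll out the interaction in simulation
            if env.state_difference(final_state) > 1e-3:      # object state changed -> meaningful interaction
                successful_actions.append(np.asarray(action).ravel())

        if len(successful_actions) >= n_modes:                # refit the GMM on everything collected so far
            gmm = GaussianMixture(n_components=n_modes, covariance_type="full", random_state=seed)
            gmm.fit(np.stack(successful_actions))

    return np.stack(successful_actions), gmm
```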

## Unsupervised Mode Selector Learning

In this part, we show how we train the mode selector and infer from it to extract the discrete task embedding used for action predictor training. Our mode selector is a VAE-style generative model that replaces the simple Gaussian prior with a Mixture of Gaussians.

### Mode Selector Training Process

<img width="1275" alt="mode_selector_training" src="https://github.com/pairlab/actaim2-eccv24/assets/30140814/4b0b1a79-ca3f-498d-8673-9a18b62a54cf">
![Alt text for the image](Pictures/mode_train.png)

This figure illustrates the training procedure of the mode selector, which mirrors a conditional generative model. The task embedding is derived by contrasting the initial and final observations: the final observation serves as the ground truth for the task embedding, while the encoded initial image acts as the conditional variable. Both the task embedding and the conditional variable are fed into a 4-layer residual-network mode encoder, which predicts the categorical variable $c$. Following the Gaussian Mixture Variational Autoencoder (GMVAE) methodology, the Gaussian Mixture Model (GMM) variable $x$ is computed and passed, together with the conditional variable, to the task embedding transformer decoder. The decoder predicts the reconstructed task embedding, sampled from the Gaussian distribution defined by the mode selector decoder architecture, and the reconstruction loss is computed against the ground-truth task embedding.
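
As a rough companion to the description above, here is a heavily simplified PyTorch sketch of one training step of a conditional GMVAE-style mode selector; the MLP stand-ins for the residual mode encoder and the transformer decoder, the tensor shapes, and the loss weighting are assumptions for illustration only.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondGMVAEModeSelector(nn.Module):
    """Toy conditional GMVAE: encode (task embedding, condition) -> categorical c,
    compute the GMM latent x, decode (x, condition) -> reconstructed task embedding."""

    def __init__(self, embed_dim=64, cond_dim=128, latent_dim=32, n_clusters=4):
        super().__init__()
        self.n_clusters = n_clusters
        self.mode_encoder = nn.Sequential(              # stand-in for the 4-layer residual mode encoder
            nn.Linear(embed_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, n_clusters))
        self.cluster_mu = nn.Parameter(torch.randn(n_clusters, latent_dim) * 0.1)
        self.cluster_logvar = nn.Parameter(torch.zeros(n_clusters, latent_dim))
        self.decoder = nn.Sequential(                   # stand-in for the transformer decoder
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim))

    def forward(self, task_embed, cond):
        logits = self.mode_encoder(torch.cat([task_embed, cond], dim=-1))
        q_c = F.gumbel_softmax(logits, tau=1.0, hard=False)      # soft cluster assignment q(c | eps, o)
        mu = q_c @ self.cluster_mu                                # mixture-weighted Gaussian parameters
        logvar = q_c @ self.cluster_logvar
        x = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized GMM latent x
        recon = self.decoder(torch.cat([x, cond], dim=-1))
        return recon, logits, mu, logvar

def training_step(model, task_embed, cond, optimizer, kl_weight=1e-3):
    recon, logits, mu, logvar = model(task_embed, cond)
    recon_loss = F.mse_loss(recon, task_embed)                           # reconstruction vs. ground-truth embedding
    kl_gauss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # Gaussian KL toward a unit prior
    log_qc = F.log_softmax(logits, dim=-1)
    kl_cat = torch.mean(torch.sum(log_qc.exp() * (log_qc + math.log(model.n_clusters)), dim=-1))  # KL to uniform
    loss = recon_loss + kl_weight * (kl_gauss + kl_cat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```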


### Mode Selector Inference Process

<img width="1077" alt="mode_selector_inference" src="https://github.com/pairlab/actaim2-eccv24/assets/30140814/82160436-f418-48b4-a428-75d1962d7554">
![Alt text for the image](Pictures/mode_infer.png)

In the inference phase, the agent discretely samples a cluster from the trained Gaussian Mixture Variational Autoencoder (GMVAE) model and computes the Mixture of Gaussians variable $x$. This variable $x$, together with the conditional variable (the initial image observation), is then fed into the mode selector transformer decoder, which reconstructs the task embedding for inference, effectively translating the conditional information and the sampled cluster into actionable embeddings.
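
Continuing the toy sketch from the training section (same hypothetical `CondGMVAEModeSelector`, not the actual implementation), inference reduces to choosing a cluster, sampling $x$ from that cluster's Gaussian, and decoding it together with the encoded initial observation:

```python
import torch

@torch.no_grad()
def infer_task_embedding(model, cond, cluster_id):
    """Sample the GMM latent x for one chosen cluster and decode it, reusing the
    toy CondGMVAEModeSelector sketched above (hypothetical, for illustration)."""
    mu = model.cluster_mu[cluster_id]                        # mean of the chosen mixture component
    std = torch.exp(0.5 * model.cluster_logvar[cluster_id])
    x = mu + torch.randn_like(mu) * std                      # sample x ~ N(mu_c, sigma_c^2)
    x = x.expand(cond.shape[0], -1)                          # broadcast over the batch of conditions
    return model.decoder(torch.cat([x, cond], dim=-1))       # reconstructed task embedding epsilon

# Example usage (all names are placeholders from the sketch above):
# model = CondGMVAEModeSelector()
# cond = image_encoder(initial_rgbd)        # hypothetical conditional encoder output, shape (1, 128)
# eps = infer_task_embedding(model, cond, cluster_id=2)
```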

### Mode Selector Qualitative Results

<img width="532" alt="vis_mode_selector" src="https://github.com/pairlab/actaim2-eccv24/assets/30140814/00eaab35-c3b1-4c80-95e0-a6826e6ef365">

![Alt text for the image](Pictures/mode_qual.png)

This visualization illustrates the efficacy of the Conditional Gaussian Mixture Variational Autoencoder (CGMVAE) in disentangling interaction modes for the "single drawer" object (ID: 20411), using a t-SNE plot. Task embeddings $\epsilon_j$, defined by the difference between initial and final object states, are shown in distinct colors to denote interaction modes and clusters. The sequence of figures demonstrates the CGMVAE's precision in clustering and aligning data points with their respective interaction modes: (1) generated clusters from the CGMVAE mode selector reveal distinct groupings; (2) ground-truth task embeddings confirm the model's capacity for accurate interaction mode classification; (3) a combined visualization underscores the alignment between generated clusters and ground truth, showcasing the model's ability to consistently categorize tasks within identical interaction modes.
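
A figure of this kind can be reproduced with a short script; the sketch below assumes the task embeddings and their cluster labels are already available as arrays (all names are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_mode_clusters(task_embeddings, cluster_labels, title="CGMVAE clusters (t-SNE)"):
    """Project task embeddings to 2D with t-SNE and color them by cluster label."""
    xy = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(task_embeddings)
    for label in np.unique(cluster_labels):
        mask = cluster_labels == label
        plt.scatter(xy[mask, 0], xy[mask, 1], s=8, label=f"mode {label}")
    plt.legend()
    plt.title(title)
    plt.show()
```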

## Supervised Action Predictor Learning

Our objective is to infer a sequence of low-level actions $a=(\mathbf{p}, \mathbf{R}, \mathbf{q})$ from the current observation $o$ and task representation $\epsilon$, ensuring the action sequence effectively accomplishes the articulated object manipulation task while aligning with the constraints imposed by $\epsilon$.

![Action predictor architecture](Pictures/action.png)

The interaction mode $\epsilon$ is sampled from the latent embedding space of the mode selector. Multiview RGBD observations are back-projected and fused into a colored point cloud. Novel views are rendered by projecting the point cloud onto orthogonal image planes. Rendered image tokens and interaction mode tokens are concatenated and fed through the multiview transformer. The output consists of a global feature for rotation $\mathbf{R}$ and gripper state $\mathbf{q}$ estimation, and 2D per-view heatmaps for position $\mathbf{p}$ prediction.
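
As a rough illustration of the final step, the sketch below recovers the translation $\mathbf{p}$ by taking the argmax pixel of each per-view heatmap and averaging the back-projected coordinates; the orthographic view conventions (axis ordering, workspace bounds) are assumptions rather than the exact rendering setup.

```python
import numpy as np

def pixel_to_world(uv, img_size, bounds_2d):
    """Map a pixel (u, v) in an orthographic rendering to the two world coordinates
    spanned by that image plane, given per-axis workspace bounds [(lo, hi), (lo, hi)]."""
    return np.array([lo + (p + 0.5) / img_size * (hi - lo)
                     for p, (lo, hi) in zip(uv, bounds_2d)])

def position_from_heatmaps(heatmaps, view_axes, bounds, img_size=128):
    """Combine per-view heatmap argmaxes into one 3D position estimate.

    heatmaps:  dict view_name -> (img_size, img_size) array
    view_axes: dict view_name -> which world axes the image plane spans, e.g. (0, 1)
    bounds:    (3, 2) array of workspace [lo, hi] per world axis
    """
    estimates = [[] for _ in range(3)]
    for name, hm in heatmaps.items():
        v, u = np.unravel_index(np.argmax(hm), hm.shape)       # argmax pixel (row, col)
        axes = view_axes[name]
        world_2d = pixel_to_world((u, v), img_size, [bounds[a] for a in axes])
        for axis, value in zip(axes, world_2d):
            estimates[axis].append(value)
    return np.array([np.mean(e) for e in estimates])           # average per world axis

# Example usage with three orthogonal views (top: x-y, front: x-z, side: y-z):
# bounds = np.array([[-0.5, 0.5], [-0.5, 0.5], [0.0, 1.0]])
# p = position_from_heatmaps(heatmaps, {"top": (0, 1), "front": (0, 2), "side": (1, 2)}, bounds)
```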

Here, we provide more qualitative results about how our agent interacts with articulated objects.
We also show the corresponding videos of how our robot interacts with the object by extracting the gripper pose from the predicted heatmap.

### Interacting with a Faucet in 3 different interaction modes
<img width="1061" alt="qual_2" src="https://github.com/pairlab/actaim2-eccv24/assets/30140814/c01a6735-7c26-49d7-98fb-5d7267534c0f">
![Alt text for the image](Pictures/qual1.png)

Click the videos below to see how the robot interacts with the faucet above in 3 different interaction modes.

<video width="640" height="480" controls>
<source src="Videos/video_154_0.mp4" type="video/mp4">
</video>


https://github.com/pairlab/actaim2-eccv24/assets/30140814/e895a768-315f-4141-a16c-89fbaf7c9911

https://github.com/pairlab/actaim2-eccv24/assets/30140814/adfe14b1-cc89-45b1-8b38-7bba6dd9001d

https://github.com/pairlab/actaim2-eccv24/assets/30140814/581f9083-2200-4322-a441-e3a885ab96e7

### Interacting with a Table with multiple drawers in 3 different interaction modes
<img width="1066" alt="qual_3" src="https://github.com/pairlab/actaim2-eccv24/assets/30140814/acbe5d9b-4dbe-4f71-b5ee-134c85c197a6">
![Alt text for the image](Pictures/qual2.png)

Click the videos below to see how the robot interacts with the table with multiple drawers above in 3 different interaction modes.

https://github.com/pairlab/actaim2-eccv24/assets/30140814/27109fa4-56df-45f4-b27


### Interacting with a Table with multiple drawers in 3 different interaction modes
<img width="1027" alt="qual_4" src="https://github.com/pairlab/actaim2-eccv24/assets/30140814/ac36d934-849a-41a6-9cca-c2ae9412cb07">
![Alt text for the image](Pictures/qual3.png)

Click the videos below to see how the robot interacts with the table with multiple drawers above in 3 different interaction modes.

Interacting with a single-drawer table and performing opening and closing of the drawer
https://github.com/pairlab/actaim2-eccv24/assets/30140814/92e03793-5e60-4e2f-9af3-f729ed02977a



https://github.com/pairlab/actaim2-eccv24/assets/30140814/b2b5bf73-5b92-4682-8d71-220bc5806f62

Interacting with a door and performing opening and closing on either side of the door