ICCV-2023-Papers

ICCV 2023 Papers: Explore a comprehensive collection of cutting-edge research papers presented at ICCV 2023, the premier computer vision conference. Keep up to date with the latest advances in computer vision and deep learning. Code implementations included. ⭐ Star the repository to support the development of visual intelligence!

The online version of the ICCV 2023 Conference Programme comprises a list of all accepted full papers, their presentation order, and the designated presentation times.

Other collections of the best AI conferences

Conference	Year
Computer Vision (CV)
CVPR	2023
Speech (SP)
ICASSP	2023
INTERSPEECH	2023

Contributors

Contributions to improve the completeness of this list are greatly appreciated. If you come across any overlooked papers, please feel free to create pull requests, open issues or contact me via email. Your participation is crucial to making this repository even better.

Papers

List of sections

3D from Multi-View and Sensors
Adversarial Attack and Defense
Vision and Robotics
Vision and Graphics
Segmentation, Grouping and Shape Analysis
Recognition: Categorization
Explainable AI for CV
Neural Generative Models
Vision and Language
Vision, Graphics, and Robotics
Privacy, Security, Fairness, and Explainability
Fairness, Privacy, Ethics, Social-good, Transparency, Accountability in Vision
First Person (Egocentric) Vision
Representation Learning
Deep Learning Architectures
Recognition: Detection
Image and Video Synthesis
Vision and Audio
Recognition, Segmentation, and Shape Analysis
Generative AI
Humans, 3D Modeling, and Driving
Low-Level Vision and Theory
Navigation and Autonomous Driving
3D from a Single Image and Shape-from-X
Motion Estimation, Matching and Tracking
Action and Event Understanding
Computational Imaging
Embodied Vision: Active Agents; Simulation
Recognition: Retrieval
Transfer, Low-Shot, Continual, Long-Tail Learning
Low-Level and Physics-based Vision
Computer Vision Theory
Video Analysis and Understanding
Object Pose Estimation and Tracking
3D Shape Modeling and Processing
Human Pose/Shape Estimation
Transfer, Low-Shot, and Continual Learning
Self-, Semi-, and Unsupervised Learning
Self-, Semi-, Meta-, Unsupervised Learning
Photogrammetry and Remote Sensing
Efficient and Scalable Vision
Machine Learning (other than Deep Learning)
Document Analysis and Understanding
Biometrics
Datasets and Evaluation
Faces and Gestures
Medical and Biological Vision; Cell Microscopy
Scene Analysis and Understanding
Multimodal Learning
Human-in-the-Loop Computer Vision
Image and Video Forensics
Geometric Deep Learning
Vision Applications and Systems
Machine Learning and Dataset

3D from Multi-View and Sensors

Title	Repo	Paper	Video
Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor	➖		➖
ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes
Translating Images to Road Network: A Non-Autoregressive Sequence-to-Sequence Approach	➖	➖	➖
Doppelgangers: Learning to Disambiguate Images of Similar Structures		➖	➖
EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries			➖
ClothPose: A Real-world Benchmark for Visual Analysis of Garment Pose via an Indirect Recording Solution	➖	➖	➖
EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity	➖	➖	➖
ENVIDR: Implicit Differentiable Renderer with Neural Environment Lighting
Learning a more Continuous Zero Level Set in Unsigned Distance Fields through Level Set Projection			➖
GNT-MOVE: Generalizable NeRF Transformer with Mixture-of-View-Experts			➖
MatrixCity: A Large-Scale City Dataset for City-Scale Neural Rendering and Beyond			➖
R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras
ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Field			➖
Rendering Humans from Object-Occluded Monocular Videos
AssetField: Assets Mining and Reconfiguration in Ground Feature Plane Representation			➖
PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images			➖
MIMO-NeRF: Fast Neural Rendering with Multi-Input Multi-Output Neural Radiance Fields	➖	➖	➖
Adaptive Positional Encoding for Bundle-Adjusting Neural Radiance Fields	➖	➖	➖
NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-View Reconstruction			➖
Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition			➖
Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching	➖		➖
Compatibility of Fundamental Matrices for Complete Viewing Graphs			➖
ProtoTransfer: Cross-Modal Prototype Transfer for Point Cloud Segmentation	➖	➖	➖
SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-View 3D Object Detection			➖
GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal 3D Object Detection	➖	➖	➖
Tangent Sampson Error: Fast Approximate Two-View Reprojection Error for Central Camera Models	➖	➖	➖
Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation			➖
Fast Globally Optimal Surface Normal Estimation from an Affine Correspondence	➖	➖	➖
HeadsUp: A Data-Driven Volumetric Prior for Few-Shot Synthesis of Ultra High-Resolution Human Heads	➖	➖	➖
TILTED: Robust Neural Fields via Latent Registration	➖	➖	➖
Center-based Decoupled Point-Cloud Registration for 6D Object Pose Estimation		➖	➖
Deep Geometry-Aware Camera Self-Calibration from Video	➖	➖	➖
V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints			➖
Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Camera	➖	➖	➖
FaceCLIPNeRF: Text-Driven 3D Face Manipulation using Deformable Neural Radiance Fields			➖
HollowNeRF: Pruning Hashgrid-based NeRFs with Trainable Collision Mitigation	➖		➖
ICE-NeRF: Interactive Color Editing of NeRFs via Decomposition-Aware Weight Optimization	➖	➖	➖
FULLER: Unified Multi-Modality Multi-Task 3D Perception via Multi-Level Gradient Calibration			➖
Neural Fields for Structured Lighting	➖	➖	➖
CO-Net: Learning Multiple Point Cloud Tasks at Once with a Cohesive Network	➖	➖	➖
Pose-Free Neural Radiance Fields via Implicit Pose Regularization	➖		➖
TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering			➖
S-VolSDF: Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces
DPS-Net: Deep Polarimetric Stereo Depth Estimation		➖	➖
3DPPE: 3D Point Positional Encoding for Transformer-based Multi-Camera 3D Object Detection			➖
Deformable Neural Radiance Fields using RGB and Event Cameras	➖	➖	➖
Inter-Reflectable Light Fields for Geometry and Material Estimation			➖
Hierarchical Prior Mining for Non-Local Multi-View Stereo			➖
Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection			➖
Re-ReND: Real-Time Rendering of NeRFs Across Devices			➖
Learning Shape Primitives via Implicit Convexity Regularization		➖	➖
Geometry-Guided Feature Learning and Fusion for Indoor Scene Reconstruction	➖	➖	➖
LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment			➖
PivotNet: End-to-End Learning for Vectorized HD Map Construction	➖		➖
Sat2Density: Faithful Density Learning from Satellite-Ground Image Pairs
Mask-Attention-Free Transformer for 3D Instance Segmentation			➖
Scene-Aware Feature Matching	➖		➖
Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-Balanced Pseudo-Labeling			➖
GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction
BANSAC: A Dynamic BAyesian Network for SAmple Consensus		➖	➖
Theoretical and Numerical Analysis of 3D Reconstruction using Point and Line Incidences			➖
RealGraph: A Multiview Dataset for 4D Real-World Context Graph Generation			➖
CL-MVSNet: Unsupervised Multi-View Stereo with Dual-Level Contrastive Learning			➖
Temporal Enhanced Training of Multi-View 3D Object Detector via Historical Object Prediction			➖
Object as Query: Lifting any 2D Object Detector to 3D Detection			➖
PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection	➖		➖
Not Every Side is Equal: Localization Uncertainty Estimation for Semi-Supervised 3D Object Detection	➖	➖	➖

Adversarial Attack and Defense

Title	Repo	Paper	Video
Robust Mixture-of-Expert Training for Convolutional Neural Networks			➖
Set-Level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-Training Models			➖
CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning			➖
CGBA: Curvature-Aware Geometric Black-Box Attack			➖
Robust Evaluation of Diffusion-based Adversarial Purification	➖		➖
Advancing Example Exploitation can Alleviate Critical Challenges in Adversarial Training	➖	➖	➖
The Victim and the Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data	➖	➖	➖
TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models			➖
SAGA: Spectral Adversarial Geometric Attack on 3D Meshes			➖
Benchmarking and Analyzing Robust Point Cloud Recognition: Bag of Tricks for Defending Adversarial Examples			➖
ACTIVE: Towards Highly Transferable 3D Physical Camouflage for Universal and Robust Vehicle Evasion
Frequency-Aware GAN for Adversarial Manipulation Generation	➖	➖	➖
Breaking Temporal Consistency: Generating Video Universal Adversarial Perturbations using Image Models	➖	➖	➖
Tracing the Origin of Adversarial Attack for Forensic Investigation and Deterrence	➖		➖
Downstream-Agnostic Adversarial Examples			➖
Hiding Visual Information via Obfuscating Adversarial Perturbations			➖
An Embarrassingly Simple Self-Supervised Trojan Attack	➖	➖	➖
Efficient Decision-based Black-Box Patch Attacks on Video Recognition	➖		➖
Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff	➖		➖
Towards Building more Robust Models with Frequency Bias			➖
Does Physical Adversarial Example Really Matter to Autonomous Driving? Towards System-Level Effect of Adversarial Object Evasion Attack			➖
Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning			➖
Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation			➖
Unified Adversarial Patch for Cross-Modal Attacks in the Physical World	➖		➖
RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World			➖
Enhancing Fine-Tuning based Backdoor Defense with Sharpness-Aware Minimization	➖		➖
Conditional 360-Degree Image Synthesis for Immersive Indoor Scene Decoration			➖
An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability			➖
Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning			➖
LEA2: A Lightweight Ensemble Adversarial Attack via Non-Overlapping Vulnerable Frequency Regions	➖	➖	➖
Explaining Adversarial Robustness of Neural Networks from Clustering Effect Perspective	➖	➖	➖
VertexSerum: Poisoning Graph Neural Networks for Link Inference	➖		➖
How to Choose Your Best Allies for a Transferable Attack?			➖
Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization and Knowledge Distillation			➖
AdvDiffuser: Natural Adversarial Example Synthesis with Diffusion Models	➖	➖	➖
FnF Attack: Adversarial Attack against Multiple Object Trackers by Inducing False Negatives and False Positives	➖	➖	➖
Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis			➖
Hard No-Box Adversarial Attack on Skeleton-based Human Action Recognition with Skeleton-Motion-Informed Gradient
Structure Invariant Transformation for Better Adversarial Transferability		➖	➖
Beating Backdoor Attack at its Own Game			➖
Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients	➖	➖	➖
REAP: A Large-Scale Realistic Adversarial Patch Benchmark			➖
Multi-Metrics Adaptively Identifies Backdoors in Federated Learning			➖
Backpropagation Path Search on Adversarial Transferability	➖		➖
Fast Adaptation of Neural Networks using Test-Time Feedback	➖	➖	➖
One-Bit Flip is All You Need: When Bit-Flip Attack Meets Model Training			➖
PolicyCleanse: Backdoor Detection and Mitigation for Competitive Reinforcement Learning	➖		➖
Towards Viewpoint-Invariant Visual Recognition via Adversarial Training	➖		➖
Fast Adversarial Training with Smooth Convergence			➖
The Perils of Learning from Unlabeled Data: Backdoor Attacks on Semi-Supervised Learning	➖		➖
Boosting Adversarial Transferability via Gradient Relevance Attack	➖	➖	➖
Towards Robust Model Watermark via Reducing Parametric Vulnerability			➖
TRM-UAP: Enhancing the Transferability of Data-Free Universal Adversarial Perturbation via Truncated Ratio Maximization	➖	➖	➖

Vision and Robotics

Title	Repo	Paper	Video
Simoun: Synergizing Interactive Motion-Appearance Understanding for Vision-based Reinforcement Learning	➖	➖	➖
Among Us: Adversarially Robust Collaborative Perception by Consensus			➖
Walking Your LiDOG: A Journey Through Multiple Domains for LiDAR Semantic Segmentation			➖
Stabilizing Visual Reinforcement Learning via Asymmetric Interactive Cooperation	➖	➖	➖
MAAL: Multimodality-Aware Autoencoder-based Affordance Learning for 3D Articulated Objects	➖	➖	➖
Rethinking Range View Representation for LiDAR Segmentation	➖		➖
PourIt!: Weakly-Supervised Liquid Perception from a Single Image for Visual Closed-Loop Robotic Pouring
CROSSFIRE: Camera Relocalization On Self-Supervised Features from an Implicit Representation	➖		➖
Environment Agnostic Representation for Visual Reinforcement Learning	➖	➖	➖
Test-Time Personalizable Forecasting of 3D Human Poses	➖	➖	➖
HM-ViT: Hetero-Modal Vehicle-to-Vehicle Cooperative Perception with Vision Transformer	➖		➖

Vision and Graphics

Title	Repo	Paper	Video
Efficient Neural Supersampling on a Novel Gaming Dataset	➖		➖
Locally Stylized Neural Radiance Fields	➖	➖	➖
NEMTO: Neural Environment Matting for Novel View and Relighting Synthesis of Transparent Objects	➖		➖
DDColor: Towards Photo-Realistic and Semantic-Aware Image Colorization via Dual Decoders			➖
IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis			➖
PARIS: Part-Level Reconstruction and Motion Analysis for Articulated Objects
ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model
DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion			➖
Dynamic Mesh-Aware Radiance Fields			➖
Neural Reconstruction of Relightable Human Model from Monocular Video	➖	➖	➖
Neural Microfacet Fields for Inverse Rendering			➖
A Theory of Topological Derivatives for Inverse Rendering of Geometry			➖
Vox-E: Text-Guided Voxel Editing of 3D Objects			➖
StegaNeRF: Embedding Invisible Information within Neural Radiance Fields			➖
GlobalMapper: Arbitrary-Shaped Urban Layout Generation	➖		➖
Urban Radiance Field Representation with Deformable Neural Mesh Primitives
End2End Multi-View Feature Matching with Differentiable Pose Optimization
Tree-Structured Shading Decomposition
Lens Parameter Estimation for Realistic Depth of Field Synthesis		➖	➖
AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism	➖	➖	➖
Cross-Modal Latent Space Alignment for Image to Avatar Translation	➖	➖	➖
Computationally Efficient Neural Image Compression with Shallow Decoders			➖

Segmentation, Grouping and Shape Analysis

Title	Repo	Paper	Video
Enhancing Spatial and Semantic Supervision for Hybrid-based 3D Instance Segmentation	➖	➖	➖
Learning Neural Eigenfunctions for Unsupervised Semantic Segmentation			➖
Divide and Conquer: 3D Point Cloud Instance Segmentation with Point-Wise Binarization			➖
Point2Mask: Point-Supervised Panoptic Segmentation via Optimal Transport			➖
Handwritten and Printed Text Segmentation: A Signature Case Study			➖
Semantic-Aware Template Learning via Part Deformation Consistency	➖		➖
LeaF: Learning Frames for 4D Point Cloud Sequence Understanding	➖	➖	➖
MARS: Model-Agnostic Biased Object Removal without Additional Supervision for Weakly-Supervised Semantic Segmentation			➖
USAGE: A Unified Seed Area Generation Paradigm for Weakly Supervised Semantic Segmentation	➖		➖
Production-Level Video Segmentation from Few Annotated Frames
ΣIGMA: Scale-Invariant Global Sparse Shape Matching	➖		➖
Self-Calibrated Cross Attention Network for Few-Shot Segmentation			➖
Multi-Granularity Interaction Simulation for Unsupervised Interactive Segmentation	➖		➖
Texture Learning Domain Randomization for Domain Generalized Segmentation			➖
Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning	➖	➖	➖
Exploring Open-Vocabulary Semantic Segmentation without Human Labels	➖		➖
RbA: Segmenting Unknown Regions Rejected by All			➖
SEMPART: Self-Supervised Multi-Resolution Partitioning of Image Semantics	➖	➖	➖
Multi-Object Discovery by Low-Dimensional Object Motion			➖
MemorySeg: Online LiDAR Semantic Segmentation with a Latent Memory	➖	➖	➖
Treating Pseudo-Labels Generation as Image Matting for Weakly Supervised Semantic Segmentation	➖	➖	➖
BoxSnake: Polygonal Instance Segmentation with Box Supervision			➖
Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation	➖		➖
Instance Neural Radiance Field
Global Knowledge Calibration for Fast Open-Vocabulary Segmentation	➖		➖
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation	➖		➖
Boosting Semantic Segmentation from an Explicit Class Embedding's Perspective			➖
The Making and Breaking of Camouflage	➖	➖	➖
CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation	➖	➖	➖
Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation
HAL3D: Hierarchical Active Learning for Fine-Grained 3D Part Labeling	➖		➖
FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation			➖
MasQCLIP for Open-Vocabulary Universal Image Segmentation	➖	➖	➖
CTVIS: Consistent Training for Online Video Instance Segmentation			➖
A Simple Framework for Panoptic Segmentation	➖	➖	➖
Spectrum-Guided Multi-Granularity Referring Video Object Segmentation			➖
Space Engage: Collaborative Space Supervision for Contrastive-based Semi-Supervised Semantic Segmentation			➖
Adaptive Superpixel for Active Learning in Semantic Segmentation	➖		➖
Multimodal Variational Auto-Encoder based Audio-Visual Segmentation	➖	➖	➖
Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation			➖
2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision			➖
Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models			➖
SegPrompt: Boosting Open-World Segmentation via Category-Level Prompt Learning			➖
Monte Carlo Linear Clustering with Single-Point Supervision is Enough for Infrared Small Target Detection			➖
A Simple Framework for Open-Vocabulary Segmentation and Detection
Source-Free Depth for Object Pop-Out			➖
DynaMITe: Dynamic Query Bootstrapping for Multi-Object Interactive Segmentation Transformer			➖
Atmospheric Transmission and Thermal Inertia Induced Blind Road Segmentation with a Large-Scale Dataset TBRSD		➖	➖
Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation	➖	➖	➖
Homography Guided Temporal Fusion for Road Line and Marking Segmentation		➖	➖
Zero-Shot Semantic Segmentation with Decoupled One-Shot Network	➖	➖	➖
TCOVIS: Temporally Consistent Online Video Instance Segmentation		➖	➖
FPR: False Positive Rectification for Weakly Supervised Semantic Segmentation			➖
Stochastic Segmentation with Conditional Categorical Diffusion Models			➖
SegGPT: Segmenting Everything in Context
Open-Vocabulary Panoptic Segmentation with Embedding Modulation	➖		➖
Residual Pattern Learning for Pixel-Wise Out-of-Distribution Detection in Semantic Segmentation			➖
Zero-Guidance Segmentation using Zero Segment Labels			➖
Model Calibration in Dense Classification with Adaptive Label Perturbation			➖
Enhanced Soft Label for Semi-Supervised Semantic Segmentation	➖	➖	➖
MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation	➖		➖
DiffuMask: Synthesizing Images with Pixel-Level Annotations for Semantic Segmentation using Diffusion Models			➖
Alignment Before Aggregation: Trajectory Memory Retrieval Network for Video Object Segmentation	➖	➖	➖
Semi-Supervised Semantic Segmentation under Label Noise via Diverse Learning Groups	➖	➖	➖
SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets
Class-Incremental Continual Learning for Instance Segmentation with Image-Level Weak Supervision	➖	➖	➖
Coarse-to-Fine Amodal Segmentation with Shape Prior			➖
Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-Centric Representation		➖	➖
DVIS: Decoupled Video Instance Segmentation Framework			➖
3D Segmentation of Humans in Point Clouds with Synthetic Data			➖
WaterMask: Instance Segmentation for Underwater Imagery	➖	➖	➖
Decoupled or End-to-End Trained Video Segmentation if Target Data is Scarce?	➖	➖	➖

Recognition: Categorization

Title	Repo	Paper	Video
Cross Contrasting Feature Perturbation for Domain Generalization			➖
Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance	➖		➖
CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification	➖		➖
RankMixup: Ranking-based Mixup Training for Network Calibration			➖
Label-Noise Learning with Intrinsically Long-Tailed Data			➖
Parallel Attention Interaction Network for Few-Shot Skeleton-based Action Recognition		➖	➖
Rethinking Mobile Block for Efficient Attention-based Models			➖
Read-Only Prompt Optimization for Vision-Language Few-Shot Learning			➖
Understanding Self-Attention Mechanism via Dynamical System Perspective	➖		➖
Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels	➖		➖
What do Neural Networks Learn in Image Classification? A Frequency Shortcut Perspective			➖
Inducing Neural Collapse to a Fixed Hierarchy-Aware Frame for Reducing Mistake Severity			➖
Unified Out-of-Distribution Detection: A Model-Specific Perspective	➖		➖
A Unified Framework for Robustness on Diverse Sampling Errors	➖	➖	➖
Scene-Aware Label Graph Learning for Multi-Label Image Classification	➖	➖	➖
Holistic Label Correction for Noisy Multi-Label Classification	➖	➖	➖
Strip-MLP: Efficient Token Interaction for Vision MLP			➖
EQ-Net: Elastic Quantization Neural Networks			➖
Data-Free Knowledge Distillation for Fine-Grained Vision Categorization	➖	➖	➖
Shift from Texture-Bias to Shape-Bias: Edge Deformation-based Augmentation for Robust Object Recognition		➖	➖
Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition			➖
DR-Tune: Improving Fine-Tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration			➖
Understanding the Feature Norm for Out-of-Distribution Detection	➖	➖	➖
Multi-View Active Fine-Grained Visual Recognition			➖
DiffGuard: Semantic Mismatch-Guided Out-of-Distribution Detection using Pre-Trained Diffusion Models			➖
Task-Aware Adaptive Learning for Cross-Domain Few-Shot Learning	➖	➖	➖
Improving Adversarial Robustness of Masked Autoencoders via Test-Time Frequency-Domain Prompting			➖
Saliency Regularization for Self-Training with Partial Annotations	➖	➖	➖
Learning Gabor Texture Features for Fine-Grained Recognition	➖		➖
UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding			➖
RankMatch: Fostering Confidence and Consistency in Learning with Noisy Labels	➖	➖	➖
MetaGCD: Learning to Continually Learn in Generalized Category Discovery	➖		➖
FerKD: Surgical Label Adaptation for Efficient Distillation	➖	➖	➖
Point-Query Quadtree for Crowd Counting, Localization, and more			➖
Nearest Neighbor Guidance for Out-of-Distribution Detection	➖	➖	➖
Bayesian Optimization Meets Self-Distillation			➖
When Prompt-based Incremental Learning does not Meet Strong Pretraining			➖
When to Learn what: Model-Adaptive Data Augmentation Curriculum	➖		➖
Parametric Information Maximization for Generalized Category Discovery			➖
Boosting Few-Shot Action Recognition with Graph-Guided Hybrid Matching			➖
Domain Generalization via Rationale Invariance			➖
Masked Spiking Transformer			➖
Prototype Reminiscence and Augmented Asymmetric Knowledge Aggregation for Non-Exemplar Class-Incremental Learning	➖	➖	➖
Distilled Reverse Attention Network for Open-World Compositional Zero-Shot Learning	➖		➖
Candidate-Aware Selective Disambiguation based on Normalized Entropy for Instance-Dependent Partial-Label Learning	➖	➖	➖
CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No			➖
Self-Similarity Driven Scale-Invariant Learning for Weakly Supervised Person Search	➖		➖
Sample-Wise Label Confidence Incorporation for Learning with Noisy Labels	➖	➖	➖
Combating Noisy Labels with Sample Selection by Mining High-Discrepancy Examples	➖	➖	➖
Spatial-Aware Token for Weakly Supervised Object Localization			➖

Explainable AI for CV

Title	Repo	Paper	Video
Towards Improved Input Masking for Convolutional Neural Networks			➖
PDiscoNet: Semantically Consistent Part Discovery for Fine-Grained Recognition			➖
Corrupting Neuron Explanations of Deep Visual Features	➖	➖	➖
ICICLE: Interpretable Class Incremental Continual Learning			➖
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models			➖
Out-of-Distribution Detection for Monocular Depth Estimation			➖
Using Explanations to Guide Models	➖		➖
Rosetta Neurons: Mining the Common Units in a Model Zoo			➖
Prototype-based Dataset Comparison			➖
Learning to Identify Critical States for Reinforcement Learning from Videos			➖
Leaping Into Memories: Space-Time Deep Feature Synthesis			➖
MAGI: Multi-Annotated Explanation-Guided Learning	➖	➖	➖
SAFARI: Versatile and Efficient Evaluations for Robustness of Interpretability			➖
Do BLIP and Stable Diffusion Understand Each Other?			➖
Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks			➖
MoreauGrad: Sparse and Robust Interpretation of Neural Networks via Moreau Envelope			➖
Towards Understanding the Generalization of Deepfake Detectors from a Game-Theoretical View	➖	➖	➖
Counterfactual-based Saliency Map: Towards Visual Contrastive Explanations for Neural Networks	➖	➖	➖
Beyond Single Path Integrated Gradients for Reliable Input Attribution via Randomized Path Sampling	➖	➖	➖
Learning Support and Trivial Prototypes for Interpretable Image Classification	➖		➖
Visual Explanations via Iterated Integrated Gradients	➖	➖	➖

Neural Generative Models

Title	Repo	Paper	Video
Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models			➖
Better Aligning Text-to-Image Models with Human Preference			➖
DLT: Conditioned Layout Generation with Joint Discrete-Continuous Diffusion Layout Transformer			➖
Anti-DreamBooth: Protecting users from Personalized Text-to-Image Synthesis			➖
GECCO: Geometrically-Conditioned Point Diffusion Models			➖
DiffDreamer: Towards Consistent Unsupervised Single-View Scene Extrapolation with Conditional Diffusion Models
Controllable Human Motion Synthesis via Guided Diffusion Models
COOP: Decoupling and Coupling of Whole-Body Grasping Pose Generation	➖	➖	➖
Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models	➖		➖
StyleDomain: Efficient and Lightweight Parameterizations of StyleGAN for One-Shot and Few-Shot Domain Adaptation	➖		➖
GRAM-HD: 3D-Consistent Image Generation at High Resolution with Generative Radiance Manifolds
Your Diffusion Model is Secretly a Zero-Shot Classifier			➖
Learning Hierarchical Features with Joint Latent Space Energy-based Prior	➖	➖	➖
ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation	➖		➖
Landscape Learning for Neural Network Inversion	➖		➖
Diffusion in Style	➖	➖	➖
Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions			➖
GETAvatar: Generative Textured Meshes for Animatable Human Avatars	➖	➖	➖
A-STAR: Test-Time Attention Segregation and Retention for Text-to-Image Synthesis	➖		➖
TF-ICON: Diffusion-based Training-Free Cross-Domain Image Composition			➖
Breaking The Limits of Text-Conditioned 3D Motion Synthesis with Elaborative Descriptions	➖	➖	➖
BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction			➖
Delta Denoising Score			➖
Mimic3D: Thriving 3D-Aware GANs via 3D-to-2D Imitation			➖
DreamBooth3D: Subject-Driven Text-to-3D Generation
Feature Proliferation the Cancer in StyleGAN and its Treatments	➖	➖	➖
Unsupervised Facial Performance Editing via Vector-Quantized StyleGAN Representations	➖	➖	➖
3D-Aware Image Generation using 2D Diffusion Models			➖
Neural Collage Transfer: Artistic Reconstruction via Material Manipulation	➖	➖	➖
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption			➖
Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction			➖
Erasing Concepts from Diffusion Models			➖
Make Encoder Great Again in 3D GAN Inversion through Geometry and Occlusion-Aware Encoding
HairNeRF: Geometry-Aware Hair Swapped Image Synthesis	➖	➖	➖

Vision and Language

Title	Repo	Paper	Video
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training	➖		➖
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model			➖
Explore and Tell: Embodied Visual Captioning in 3D Environments			➖
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability			➖
Learning Trajectory-Word Alignments for Video-Language Tasks	➖		➖
Variational Causal Inference Network for Explanatory Visual Question Answering	➖	➖	➖
TextManiA: Enriching Visual Feature by Text-Driven Manifold Augmentation			➖
UniRef: A Unified Model for Reference-based Object Segmentation Tasks	➖	➖	➖
Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models	➖		➖
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pre-Training	➖	➖	➖
Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge	➖	➖	➖
VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching	➖	➖	➖
Moment Detection in Long Tutorial Videos		➖	➖
Not All Features Matter: Enhancing Few-Shot CLIP with Adaptive Prior Refinement			➖
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images			➖
Advancing Referring Expression Segmentation Beyond Single Image			➖
CLIPoint: Adapting CLIP for Powerful 3D Open-World Learning	➖	➖	➖
Unsupervised Prompt Tuning for Text-Driven Object Detection	➖	➖	➖
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding	➖		➖
I can't Believe there's no Images! Learning Visual Tasks using Only Language Data			➖
Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples			➖
MeViS: A Large-Scale Benchmark for Video Segmentation with Motion Expressions			➖
Diverse Data Augmentation with Diffusions for Effective Test-Time Prompt Tuning			➖
ShapeScaffolder: Structure-Aware 3D Shape Generation from Text	➖		➖
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models			➖
BEVBert: Multimodal Map Pre-Training for Language-Guided Navigation			➖
X-Mesh: Towards Fast and Accurate Text-Driven 3D Stylization via Dynamic Textual Guidance			➖
OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation			➖
Attentive Mask CLIP	➖		➖
Knowledge Proxy Intervention for Deconfounded Video Question Answering	➖	➖	➖
UniVTG: Towards Unified Video-Language Temporal Grounding			➖
Self-Supervised Cross-View Representation Reconstruction for Change Captioning		➖	➖
Unified Coarse-to-Fine Alignment for Video-Text Retrieval			➖
Confidence-Aware Pseudo-Label Learning for Weakly Supervised Visual Grounding	➖	➖	➖
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions	➖	➖	➖
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge			➖
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation	➖		➖
Transferring Visual Knowledge with Pre-Trained Models for Multimodal Machine Translation			➖
Learning Human-Human Interactions in Images from Weak Textual Supervision			➖
BUS: Efficient and Effective Vision-Language Pretraining with Bottom-Up Patch Summarization	➖		➖
3D-VisTA: Pre-Trained Transformer for 3D Vision and Text Alignment
ALIP: Adaptive Language-Image Pre-Training with Synthetic Caption			➖
LoGoPrompt: Synthetic Text Images can be Good Visual Prompts for Vision-Language Models			➖
Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning			➖
Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering	➖	➖	➖
Prompt-Guided Image Captioning for VQA with GPT-3			➖
Grounded Image Text Matching with Mismatched Relation Reasoning	➖		➖
GePSAn: Generative Procedure Step Anticipation in Cooking Videos	➖	➖	➖
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models			➖
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control			➖
With a Little Help from Your own Past: Prototypical Memory Networks for Image Captioning			➖
Improving Zero-Shot Generalization for CLIP with Synthesized Prompts			➖
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models			➖
Learning Navigational Visual Representations with Semantic Map Supervision			➖
CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection			➖
Open Set Video HOI Detection from Action-Centric Chain-of-Look Prompting	➖	➖	➖
Learning Concise and Descriptive Attributes for Visual Recognition	➖		➖
Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models			➖
Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories			➖
Story Visualization by Online Text Augmentation with Context Memory			➖
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning			➖
Too Large; Data Reduction for Vision-Language Pre-Training			➖
ViLTA: Enhancing Vision-Language Pre-Training through Textual Augmentation	➖		➖

Vision, Graphics, and Robotics

Title	Repo	Paper	Video
Learning Conditional Control for Pretrained Text-to-Image Diffusion Models	➖	➖	➖
Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation
Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations			➖
3D Implicit Transporter for Temporally Consistent Keypoint Discovery			➖
Chordal Averaging on Flag Manifolds and its Applications			➖
UniDexGrasp++: Improving Universal Dexterous Grasping via Geometry-Aware Curriculum Learning and Iterative Generalist-Specialist Learning	➖		➖
GameFormer: Game-Theoretic Modeling and Learning of Transformer-based Interactive Prediction and Planning for Autonomous Driving			➖
PPR: Physically Plausible Reconstruction from Monocular Videos			➖

Privacy, Security, Fairness, and Explainability

Title	Repo	Video
Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh Reconstruction		➖
ACLS: Adaptive and Conditional Label Smoothing for Network Calibration		➖
PGFed: Personalize Each Client's Global Objective for Federated Learning		➖
Overcoming Bias in Pretrained Models by Manipulating the Finetuning Dataset		➖
ITI-GEN: Inclusive Text-to-Image Generation		➖
FunnyBirds: A Synthetic Vision Dataset for a Part-based Analysis of Explainable AI Methods		➖
X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events		➖
Adaptive Testing of Computer Vision Models	➖	➖

Fairness, Privacy, Ethics, Social-good, Transparency, Accountability in Vision

Title	Repo	Paper	Video
Enhancing Privacy Preservation in Federated Learning via Learning Rate Perturbation	➖	➖	➖
TARGET: Federated Class-Continual Learning via Exemplar-Free Distillation			➖
FACTS: First Amplify Correlations and then Slice to Discover Bias	➖	➖	➖
Computation and Data Efficient Backdoor Attacks	➖	➖	➖
Global Balanced Experts for Federated Long-Tailed Learning	➖	➖	➖
Source-Free Domain Adaptive Human Pose Estimation			➖
Gender Artifacts in Visual Datasets			➖
FRAug: Tackling Federated Learning with Non-IID Features via Representation Augmentation	➖		➖
zPROBE: Zero Peek Robustness Checks for Federated Learning	➖		➖
Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models: A Pilot Study	➖	➖	➖
FedPD: Federated Open Set Recognition with Parameter Disentanglement	➖	➖	➖
MUter: Machine Unlearning for Adversarial Training Models	➖	➖	➖
Beyond Skin Tone: A Multidimensional Measure of Apparent Skin Color	➖		➖
A Multidimensional Analysis of Social Biases in Vision Transformers			➖
Partition-and-Debias: Agnostic Biases Mitigation via a Mixture of Biases-Specific Experts			➖
Rethinking Data Distillation: Do not Overlook Calibration	➖		➖
Mining Bias-Target Alignment from Voronoi Cells			➖
Better May not be Fairer: A Study on Subgroup Discrepancy in Image Classification		➖	➖
GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization			➖
Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach using Synthetic Faces and Human Evaluation	➖		➖
FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning			➖
Towards Attack-Tolerant Federated Learning via Critical Parameter Analysis			➖
What can Discriminator do? Towards Box-Free Ownership Verification of Generative Adversarial Networks			➖
Robust Heterogeneous Federated Learning under Data Corruption		➖	➖
Communication-Efficient Federated Learning with Single-Step Synthetic Features Compressor for Faster Convergence			➖
GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning			➖
MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention	➖		➖
Identification of Systematic Errors of Image Classifiers on Rare Subgroups	➖		➖
Adaptive Image Anonymization in the Context of Image Classification with Neural Networks	➖	➖	➖
When do Curricula Work in Federated Learning?	➖		➖
Domain Specified Optimization for Deployment Authorization	➖	➖	➖
STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition			➖
SAL-ViT: Towards Latency Efficient Private Inference on ViT using Selective Attention Search with a Learnable Softmax Approximation	➖	➖	➖
Generative Gradient Inversion without Prior	➖	➖	➖
Inspecting the Geographical Representativeness of Images from Text-to-Image Models	➖		➖
Divide and Conquer: A Two-Step Method for High Quality Face De-Identification with Model Explainability	➖	➖	➖
Exploring the Benefits of Visual Prompting in Differential Privacy			➖
Towards Fairness-Aware Adversarial Network Pruning	➖	➖	➖
AutoReP: Automatic ReLU Replacement for Fast Private Network Inference			➖
Flatness-Aware Minimization for Domain Generalization	➖		➖
Communication-Efficient Vertical Federated Learning with Limited Overlapping Samples			➖

First Person (Egocentric) Vision

Title	Repo	Paper	Video
Multimodal Distillation for Egocentric Action Recognition			➖
Self-Supervised Object Detection from Egocentric Videos	➖	➖	➖
Multi-Label Affordance Mapping from Egocentric Vision	➖		➖
Ego-Only: Egocentric Action Detection without Exocentric Transferring	➖		➖
COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos
EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding			➖
EgoVLPv2: Egocentric Video-Language Pre-Training with Fusion in the Backbone			➖

Representation Learning

Title	Repo	Paper	Video
WDiscOOD: Out-of-Distribution Detection via Whitened Linear Discriminant Analysis			➖
Pairwise Similarity Learning is SimPLE	➖	➖	➖
No Fear of Classifier Biases: Neural Collapse Inspired Federated Learning with Synthetic and Fixed Classifier			➖
Generalizable Neural Fields as Partially Observed Neural Processes	➖		➖
M2T: Masking Transformers Twice for Faster Decoding	➖		➖
Keep it SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?			➖
Improving Pixel-based MIM by Reducing Wasted Modeling Capability			➖
Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration	➖		➖
Quality Diversity for Visual Pre-Training		➖	➖
Subclass-Balancing Contrastive Learning for Long-Tailed Recognition	➖		➖
Mastering Spatial Graph Prediction of Road Networks	➖		➖
Poincaré ResNet			➖
Exploring Model Transferability through the Lens of Potential Energy			➖
Improving CLIP Fine-Tuning Performance	➖	➖	➖
Unsupervised Manifold Linearizing and Clustering			➖
Generalized Sum Pooling for Metric Learning			➖
Partition Speeds Up Learning Implicit Neural Representations based on Exponential-Increase Hypothesis	➖	➖	➖
The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining	➖		➖
Token-Label Alignment for Vision Transformers			➖
Efficiently Robustify Pre-Trained Models	➖		➖
OFVL-MS: Once for Visual Localization Across Multiple Indoor Scenes			➖
Feature Prediction Diffusion Model for Video Anomaly Detection	➖	➖	➖
Joint Implicit Neural Representation for High-Fidelity and Compact Vector Fonts	➖	➖	➖
How Far Pre-Trained Models are from Neural Collapse on the Target Dataset Informs their Transferability	➖	➖	➖
OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions			➖
Perceptual Grouping in Contrastive Vision-Language Models			➖
Fully Attentional Networks with Self-Emerging Token Labeling	➖	➖	➖
Instance and Category Supervision are Alternate Learners for Continual Learning	➖	➖	➖
SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-Training			➖
Motion-Guided Masking for Spatiotemporal Representation Learning	➖		➖
Data Augmented Flatness-Aware Gradient Projection for Continual Learning	➖	➖	➖
Take-a-Photo: 3D-to-2D Generative Pre-Training of Point Cloud Models			➖
BiViT: Extremely Compressed Binary Vision Transformers			➖
Spatio-Temporal Crop Aggregation for Video Representation Learning			➖
Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning			➖
Semantic Information in Contrastive Learning	➖	➖	➖
Cross-Domain Product Representation Learning for Rich-Content E-Commerce			➖
Contrastive Continuity on Augmentation Stability Rehearsal for Continual Self-Supervised Learning	➖	➖	➖
HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness			➖
Unleashing Text-to-Image Diffusion Models for Visual Perception			➖

Deep Learning Architectures

Title	Repo	Paper	Video
Efficient Controllable Multi-Task Architectures	➖		➖
ParCNetV2: Oversized Kernel with Enhanced Attention			➖
Unleashing the Power of Gradient Signal-to-Noise Ratio for Zero-Shot NAS	➖	➖	➖
MMST-ViT: Climate Change-Aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer			➖
FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization			➖
IIEU: Rethinking Neural Feature Activation from Decision-Making	➖	➖	➖
Scratching Visual Transformer's Back with Uniform Attention	➖		➖
SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference	➖		➖
ElasticViT: Conflict-Aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices	➖		➖
Gramian Attention Heads are Strong yet Efficient Vision Learners	➖	➖	➖
EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones			➖
Ord2Seq: Regarding Ordinal Regression as Label Sequence Prediction			➖
Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning	➖		➖
LaPE: Layer-Adaptive Position Embedding for Vision Transformers with Independent Layer Normalization			➖
Exemplar-Free Continual Transformer with Convolutions			➖
Building Vision Transformers with Hierarchy Aware Feature Aggregation	➖	➖	➖
ShiftNAS: Improving One-Shot NAS via Probability Shift			➖
DarSwin: Distortion Aware Radial Swin Transformer			➖
ROME: Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation	➖		➖
FDViT: Improve the Hierarchical Architecture of Vision Transformer	➖	➖	➖
FLatten Transformer: Vision Transformer using Focused Linear Attention			➖
MixPath: A Unified Approach for One-Shot Neural Architecture Search	➖		➖
SSF: Accelerating Training of Spiking Neural Networks with Stabilized Spiking Flow	➖	➖	➖
Dynamic Perceiver for Efficient Visual Recognition			➖
SG-Former: Self-Guided Transformer with Evolving Token Reallocation			➖
Scale-Aware Modulation Meet Transformer			➖
Learning to Upsample by Learning to Sample			➖
GET: Group Event Transformer for Event-based Vision	➖	➖	➖
Adaptive Frequency Filters as Efficient Global Token Mixers			➖
Fcaformer: Forward Cross Attention in Hybrid Vision Transformer			➖
Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation			➖
Sentence Attention Blocks for Answer Grounding	➖	➖	➖
MST-Compression: Compressing and Accelerating Binary Neural Networks with Minimum Spanning Tree	➖		➖
EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation	➖		➖
SPANet: Frequency-Balancing Token Mixer using Spectral Pooling Aggregation Modulation
ModelGiF: Gradient Fields for Model Functional Distance		➖	➖
ClusT3: Information Invariant Test-Time Training	➖	➖	➖
Cumulative Spatial Knowledge Distillation for Vision Transformers	➖		➖
Luminance-Aware Color Transform for Multiple Exposure Correction	➖	➖	➖
Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks			➖
Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters	➖	➖	➖
DOT: A Distillation-Oriented Trainer	➖		➖
Extensible and Efficient Proxy for Neural Architecture Search	➖	➖	➖
Learning to Transform for Generalizable Instance-Wise Invariance		➖	➖
Convolutional Networks with Oriented 1D Kernels	➖	➖	➖

Recognition: Detection

Title	Repo	Paper	Video
Random Boxes are Open-World Object Detectors			➖
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection			➖
CoIn: Contrastive Instance Feature Mining for Outdoor 3D Object Detection with Very Limited Annotations		➖	➖
A Dynamic Dual-Processing Object Detection Framework Inspired by the Brain's Recognition Mechanism	➖	➖	➖
Anchor-Intermediate Detector: Decoupling and Coupling Bounding Boxes for Accurate Object Detection	➖	➖	➖
Inter-Realization Channels: Unsupervised Anomaly Detection Beyond One-Class Classification	➖	➖	➖
Deep Equilibrium Object Detection			➖
RecursiveDet: End-to-End Region-based Recursive Object Detection			➖
Small Object Detection via Coarse-to-Fine Proposal Generation and Imitation Learning
ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation			➖
COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts			➖
Generative Prompt Model for Weakly Supervised Object Localization			➖
UniKD: Universal Knowledge Distillation for Mimicking Homogeneous or Heterogeneous Object Detectors	➖	➖	➖
PNI: Industrial Anomaly Detection using Position and Neighborhood Information			➖
Masked Autoencoders are Stronger Knowledge Distillers	➖	➖	➖
GPA-3D: Geometry-Aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds			➖
ADNet: Lane Shape Prediction via Anchor Decomposition			➖
Periodically Exchange Teacher-Student for Source-Free Object Detection	➖	➖	➖
Towards Fair and Comprehensive Comparisons for Image-based 3D Object Detection	➖	➖	➖
Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver	➖		➖
Template-Guided Hierarchical Feature Restoration for Anomaly Detection	➖	➖	➖
ALWOD: Active Learning for Weakly-Supervised Object Detection	➖		➖
ProtoFL: Unsupervised Federated Learning via Prototypical Distillation	➖		➖
Efficient Adaptive Human-Object Interaction Detection with Concept-Guided Memory			➖
Detection Transformer with Stable Matching			➖
Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection	➖	➖	➖
Anomaly Detection Under Distribution Shift			➖
Detecting Objects with Context-Likelihood Graphs and Graph Refinement	➖		➖
Unsupervised Object Localization with Representer Point Selection			➖
DETR does not Need Multi-Scale or Locality Design			➖
Deep Directly-Trained Spiking Neural Networks for Object Detection			➖
GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data	➖	➖	➖
StageInteractor: Query-based Object Detector with Cross-Stage Interaction	➖		➖
Adaptive Rotated Convolution for Rotated Object Detection	➖		➖
Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection	➖	➖	➖
Exploring Transformers for Open-World Instance Segmentation	➖		➖
DDG-Net: Discriminability-Driven Graph Network for Weakly-Supervised Temporal Action Localization			➖
Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment			➖
Category-Aware Allocation Transformer for Weakly Supervised Object Localization		➖	➖
The Devil is in the Crack Orientation: A New Perspective for Crack Detection	➖	➖	➖
Clusterformer: Cluster-based Transformer for 3D Object Detection in Point Clouds	➖	➖	➖
Less is More: Focus Attention for Efficient DETR			➖
DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting			➖
Multi-Label Self-Supervised Learning with Scene Images	➖		➖
Cascade-DETR: Delving into High-Quality Universal Object Detection			➖
Representation Disparity-Aware Distillation for 3D Object Detection	➖		➖
FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision	➖		➖
DetZero: Rethinking Offboard 3D Object Detection with Long-Term Sequential Point Clouds			➖
DETRs with Collaborative Hybrid Assignments Training			➖
Open-Vocabulary Object Detection with an Open Corpus	➖	➖	➖
SparseDet: Improving Sparsely Annotated Object Detection with Pseudo-Positive Mining			➖
Unsupervised Surface Anomaly Detection with Diffusion Probabilistic Model	➖	➖	➖
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation			➖
Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection			➖
MonoNeRD: NeRF-Like Representations for Monocular 3D Object Detection			➖
Integrally Migrating Pre-Trained Transformer Encoder-Decoders for Visual Object Detection			➖
Generating Dynamic Kernels via Transformers for Lane Detection	➖	➖	➖
Meta-ZSDETR: Zero-Shot DETR with Meta-Learning	➖		➖
Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes			➖
AlignDet: Aligning Pre-Training and Fine-Tuning in Object Detection			➖
MULLER: Multilayer Laplacian Resizer for Vision	➖		➖
Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection	➖		➖
DETRDistill: A Universal Knowledge Distillation Framework for DETR-Families	➖		➖
Delving into Motion-Aware Matching for Monocular 3D Object Tracking			➖
FB-BEV: BEV Representation from Forward-Backward View Transformations			➖
Learning from Noisy Data for Semi-Supervised 3D Object Detection	➖	➖	➖
Boosting Long-Tailed Object Detection via Step-Wise Learning on Smooth-Tail Data	➖		➖
Objects do not Disappear: Video Object Detection by Single-Frame Object Location Anticipation			➖
Unified Visual Relationship Detection with Vision and Language Models			➖
Universal Domain Adaptation via Compressive Attention Matching	➖		➖
Unsupervised Domain Adaptive Detection with Network Stability Analysis			➖
ImGeoNet: Image-Induced Geometry-Aware Voxel Representation for Multi-View 3D Object Detection			➖
Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection			➖

Image and Video Synthesis

Title	Repo	Paper	Video
Text-Driven Generative Domain Adaptation with Spectral Consistency Regularization	➖	➖	➖
MosaiQ: Quantum Generative Adversarial Networks for Image Generation on NISQ Computers	➖		➖
Controllable Visual-Tactile Synthesis
Editing Implicit Assumptions in Text-to-Image Diffusion Models			➖
DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars			➖
Smoothness Similarity Regularization for Few-Shot GAN Adaptation	➖		➖
HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models	➖		➖
Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models			➖
AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration	➖		➖
GaFET: Learning Geometry-Aware Facial Expression Translation from in-the-Wild Images	➖		➖
Collecting the Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures			➖
Multi-Directional Subspace Editing in Style-Space			➖
HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces			➖
Generating Realistic Images from in-the-Wild Sounds	➖		➖
CC3D: Layout-Conditioned Generation of Compositional 3D Scenes			➖
UMFuse: Unified Multi View Fusion for Human Editing Applications	➖		➖
Evaluating Data Attribution for Text-to-Image Models
Neural Characteristic Function Learning for Conditional Image Generation	➖	➖	➖
WaveIPT: Joint Attention and Flow Alignment in the Wavelet Domain for Pose Transfer	➖	➖	➖
LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models			➖
Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation			➖
Conceptual and Hierarchical Latent Space Decomposition for Face Editing	➖	➖	➖
Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations	➖		➖
BallGAN: 3D-Aware Image Synthesis with a Spherical Background
End-to-End Diffusion Latent Optimization Improves Classifier Guidance			➖
Deep Geometrized Cartoon Line Inbetweening		➖	➖
UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation		➖
Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond	➖		➖
SVDiff: Compact Parameter Space for Diffusion Fine-Tuning			➖
MI-GAN: A Simple Baseline for Image Inpainting on Mobile Devices	➖	➖	➖
Structure and Content-Guided Video Synthesis with Diffusion Models
Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation			➖
Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers	➖	➖	➖
A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance	➖	➖	➖
Generative Multiplane Neural Radiance for 3D-Aware Image Generation			➖
Parallax-Tolerant Unsupervised Deep Image Stitching			➖
GAIT: Generating Aesthetic Indoor Tours with Deep Reinforcement Learning		➖	➖
EverLight: Indoor-Outdoor Editable HDR Lighting Estimation			➖
Prompt Tuning Inversion for Text-Driven Image Editing using Diffusion Models	➖		➖
Efficient Diffusion Training via Min-SNR Weighting Strategy			➖
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion			➖
Improving Sample Quality of Diffusion Models using Self-Attention Guidance
Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation
Deep Image Harmonization with Learnable Augmentation
Out-of-Domain GAN Inversion via Invertibility Decomposition for Photo-Realistic Human Face Manipulation
Bidirectionally Deformable Motion Modulation for Video-based Human Pose Transfer
Size does Matter: Size-Aware Virtual Try-On via Clothing-Oriented Transformation Try-On Network
VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs
Learning Global-Aware Kernel for Image Harmonization
Expressive Text-to-Image Generation with Rich Text
A Large-Scale Outdoor Multi-Modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction
Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
Perceptual Artifacts Localization for Image Synthesis Tasks
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis
StylerDALLE: Language-Guided Style Transfer using a Vector-Quantized Tokenizer of a Large-Scale Generative Model
Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction
Tune-a-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
BlendFace: Re-Designing Identity Encoders for Face-Swapping
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis
Open-Vocabulary Object Segmentation with Diffusion Models
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models
ToonTalker: Cross-Domain Face Reenactment
Dense Text-to-Image Generation with Attention Modulation
Householder Projector for Unsupervised Latent Semantics Discovery
Deep Image Harmonization with Globally Guided Feature Transformation and Relation Distillation
One-Shot Generative Domain Adaptation
Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis
FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model			➖

Vision and Audio

Recognition, Segmentation, and Shape Analysis

Generative AI

Title	Repo	Paper	Video
Simulating Fluids in Real-World Still Images			➖
FateZero: Fusing Attentions for Zero-Shot Text-based Video Editing			➖

Humans, 3D Modeling, and Driving

Low-Level Vision and Theory

Title	Repo	Paper	Video
DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion			➖

Navigation and Autonomous Driving

3D from a Single Image and Shape-from-X

Motion Estimation, Matching and Tracking

Action and Event Understanding

Computational Imaging

Embodied Vision: Active Agents; Simulation

Recognition: Retrieval

Transfer, Low-Shot, Continual, Long-Tail Learning

Low-Level and Physics-based Vision

Title	Repo	Paper	Video
High-Resolution Document Shadow Removal via a Large-Scale Real-World Dataset and a Frequency-Aware Shadow Erasing Net			➖
Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution			➖

Computer Vision Theory

Title	Repo	Paper	Video
FemtoDet: An Object Detection Baseline for Energy Versus Performance Tradeoffs			➖

Video Analysis and Understanding

Object Pose Estimation and Tracking

3D Shape Modeling and Processing

Human Pose/Shape Estimation

Transfer, Low-Shot, and Continual Learning

Self-, Semi-, and Unsupervised Learning

Self-, Semi-, Meta-, Unsupervised Learning

Photogrammetry and Remote Sensing

Efficient and Scalable Vision

Title	Repo	Paper	Video
AdaNIC: Towards Practical Neural Image Compression via Dynamic Transform Routing	➖	➖	➖
Rethinking Vision Transformers for MobileNet Size and Speed			➖
DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds			➖
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers	➖		➖
Inherent Redundancy in Spiking Neural Networks	➖		➖
Achievement-based Training Progress Balancing for Multi-Task Learning	➖	➖	➖
Prune Spatio-Temporal Tokens by Semantic-Aware Temporal Accumulation	➖		➖
Differentiable Transportation Pruning	➖		➖
XiNet: Efficient Neural Networks for tinyML	➖	➖	➖
Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers			➖
A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance	➖		➖
Workie-Talkie: Accelerating Federated Learning by Overlapping Computing and Communications via Contrastive Regularization	➖	➖	➖
DenseShift: Towards Accurate and Transferable Low-Bit Shift Network	➖		➖
PRANC: Pseudo RAndom Networks for Compacting Deep Models			➖
Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement			➖
A Fast Unified System for 3D Object Detection and Tracking	➖	➖	➖
Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training	➖		➖
I-ViT: Integer-Only Quantization for Efficient Vision Transformer Inference			➖
EMQ: Evolving Training-Free Proxies for Automated Mixed Precision Quantization	➖		➖
Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels	➖		➖
DataDAM: Efficient Dataset Distillation with Attention Matching	➖	➖	➖
SAFE: Machine Unlearning with Shard Graphs	➖		➖
ResQ: Residual Quantization for Video Perception	➖		➖
Efficient Computation Sharing for Multi-Task Visual Scene Understanding			➖
Essential Matrix Estimation using Convex Relaxations in Orthogonal Space	➖	➖	➖
TripLe: Revisiting Pretrained Model Reuse and Progressive Learning for Efficient Vision Transformer Scaling and Searching	➖	➖	➖
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers			➖
Bridging Cross-Task Protocol Inconsistency for Distillation in Dense Object Detection			➖
From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels			➖
Efficient 3D Semantic Segmentation with Superpoint Transformer			➖
Dataset Quantization	➖		➖
Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy			➖
RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers			➖
Semantically Structured Image Compression via Irregular Group-Based Decoupling			➖
SeiT: Storage-Efficient Vision Training with Tokens using 1% of Pixel Storage			➖
SMMix: Self-Motivated Image Mixing for Vision Transformers			➖
Multi-Label Knowledge Distillation			➖
UGC: Unified GAN Compression for Efficient Image-to-Image Translation	➖	➖	➖
MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions			➖
EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction			➖
DREAM: Efficient Dataset Distillation by Representative Matching			➖
INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold	➖		➖
Deep Incubation: Training Large Models by Divide-and-Conquering			➖
AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts	➖	➖	➖
Overcoming Forgetting Catastrophe in Quantization-Aware Training	➖	➖	➖
Window-Based Early-Exit Cascades for Uncertainty Estimation: When Deep Ensembles are More Efficient than Single Models			➖
ORC: Network Group-based Knowledge Distillation using Online Role Change			➖
RMP-Loss: Regularizing Membrane Potential Distribution for Spiking Neural Networks	➖		➖
Structural Alignment for Network Pruning through Partial Regularization	➖	➖	➖
Automated Knowledge Distillation via Monte Carlo Tree Search	➖	➖	➖
SwiftFormer: Efficient Additive Attention for Transformer-based Real-Time Mobile Vision Applications			➖
Causal-DFQ: Causality Guided Data-Free Network Quantization	➖	➖	➖
Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural Networks	➖	➖	➖
Automatic Network Pruning via Hilbert-Schmidt Independence Criterion Lasso under Information Bottleneck Principle	➖	➖	➖
Distribution Shift Matters for Knowledge Distillation with Webly Collected Images	➖		➖
FastRecon: Few-Shot Industrial Anomaly Detection via Fast Feature Reconstruction	➖	➖	➖
E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning			➖
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation			➖
SHACIRA: Scalable HAsh-grid Compression for Implicit Neural Representations	➖	➖	➖
Efficient Deep Space Filling Curve	➖	➖	➖
Q-Diffusion: Quantizing Diffusion Models			➖
Lossy and Lossless (L2) Post-Training Model Size Compression			➖
Robustifying Token Attention for Vision Transformers			➖

Machine Learning (other than Deep Learning)

Document Analysis and Understanding

Biometrics

Datasets and Evaluation

Faces and Gestures

Medical and Biological Vision; Cell Microscopy

Title	Repo	Paper	Video
BoMD: Bag of Multi-Label Local Descriptors for Noisy Chest X-Ray Classification			➖
CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection			➖

Scene Analysis and Understanding

Multimodal Learning

Human-in-the-Loop Computer Vision

Image and Video Forensics

Geometric Deep Learning

Vision Applications and Systems

Title	Repo	Paper	Video
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing			➖

Machine Learning and Dataset

Title	Repo	Paper	Video
Unmasked Teacher: Towards Training-Efficient Video Foundation Models			➖


License

anonlim/ICCV-2023-Papers
